ResNet

Abstract

ResNet (He, Zhang, Ren, and Sun, Microsoft Research Asia, 2015) made it possible to train networks of $50$, $101$, even $152$ layers, where plain networks had stalled at around $20$. It won ILSVRC 2015 with a top-5 error of $3.57\%$, well below the roughly $5\%$ usually quoted as the estimated human error on the benchmark. Its single idea, the residual connection, is now in nearly every deep architecture, vision or not. This note covers the problem it solved, the block that solved it, and how the blocks assemble into the network.

The degradation problem

The obvious way to make a network more powerful is to add layers. By 2015 this had stopped working: beyond a point, adding layers made the network worse, and worse in a very specific way.

It is not overfitting

The deeper plain network had higher training error, not just higher test error. Overfitting would show as low training error and high test error; this was the opposite. The deeper model was failing to fit the data it was shown. The problem is therefore one of optimization, not of capacity or generalization.

The failure is counterintuitive. A deeper network contains a shallower one as a special case: take the shallow network and set the extra layers to the identity function, and the deep network computes exactly what the shallow one does. So a deep network should never be worse than its shallow prefix. Yet plain stacks of convolutions, trained by gradient descent, could not find that identity-preserving solution. Learning to leave a signal unchanged, through several nonlinear layers, is hard.

Residual learning: learn the deviation from identity

ResNet changes what each block is asked to learn. Instead of having a stack of layers compute a target mapping $H (x)$ directly, it has them compute only the residual

$$ F(x) = H(x) - x, $$

and then adds the input back via a shortcut, so the block outputs

$$ H(x) = F(x) + x. $$

The benefit is clearest in one case. If the best thing a block can do is nothing (pass its input through unchanged), a plain stack would have to learn the identity, which is hard; a residual block only has to drive its weights toward zero so that $F (x) \to 0$, which is easy. The question each block answers shifts from “what should I compute” to “how far should I deviate from leaving the input alone”, and small deviations are exactly what gradient descent finds easily.
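The zero-weight case can be checked directly. The sketch below is a toy, not ResNet itself: it assumes two small dense layers as a stand-in for the convolutions, and the names `plain_block` and `residual_block` are illustrative only.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, v):
    # weights is a list of rows; returns the matrix-vector product.
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def learned_part(w1, w2, v):
    """F(x): two linear layers with a ReLU between them (toy stand-in for convs)."""
    return linear(w2, relu(linear(w1, v)))


def plain_block(w1, w2, x):
    # A plain stack outputs only what its layers compute: H(x) = F(x).
    return learned_part(w1, w2, x)


def residual_block(w1, w2, x):
    # A residual block adds the input back: H(x) = F(x) + x.
    return [f + a for f, a in zip(learned_part(w1, w2, x), x)]


zeros = [[0.0] * 3 for _ in range(3)]   # all weights driven to zero
x = [1.5, -2.0, 0.25]

plain_out = plain_block(zeros, zeros, x)        # the signal is lost
residual_out = residual_block(zeros, zeros, x)  # the signal passes through unchanged
```

With every weight at zero the plain block returns all zeros, while the residual block returns `x` exactly, which is the sense in which "do nothing" is the easy default for a residual block.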

The shortcut is also a gradient highway

The added $x$ does double duty. On the forward pass it makes the identity easy to represent; on the backward pass it gives the gradient an unattenuated path straight through the block, because the derivative of $F (x) + x$ with respect to $x$ carries a $+ 1$ term that no amount of depth can shrink. This is the direct remedy for the vanishing gradient, and the full argument, shared with Highway networks and the Transformer residual stream, is developed in the note on skip connections. The same additive-identity idea appears across time, rather than across depth, as the LSTM cell state.
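The $+ 1$ term can be written out for a single block. This is a short derivation under one simplifying assumption: the block output is exactly $F (x) + x$, with no nonlinearity applied after the addition.

$$
\frac{\partial H}{\partial x} = \frac{\partial}{\partial x}\bigl(F(x) + x\bigr) = \frac{\partial F}{\partial x} + I
$$

However small $\partial F / \partial x$ becomes, the identity term $I$ (the $+ 1$ in the scalar case) hands the upstream gradient to the block's input unchanged.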

The residual block: basic and bottleneck

Skip connection versus residual block

The two terms are easy to confuse:

a skip connection (also called a shortcut or identity connection) is just the arrow that carries $x$ forward and adds it back: the $+ x$ in $F (x) + x$;

a residual block is the whole unit: the small stack of convolutions that computes $F (x)$ together with the skip connection that adds the input. So the skip connection is one component, the shortcut; the residual block is the module built around it.

Put differently, the skip connection is the wire, the residual block is the circuit that includes it. The general idea and the gradient analysis of skip connections are in the note on skip connections; this note is about the block ResNet builds from one.

The block comes in two forms.

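The basic block is two $3 \times 3$ convolutions, each followed by batch normalization, with a ReLU after the first and an identity shortcut from input to output. The second ReLU is applied only after the addition:

$$ x \to [\text{conv} \to \text{BN} \to \text{ReLU} \to \text{conv} \to \text{BN}] \to (+\, x) \to \text{ReLU} \to \text{out}. $$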
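The basic block translates almost line for line into code. The following is a minimal sketch, assuming PyTorch as the framework (the note itself names none); the class name `BasicBlock`, the `padding=1` setting that preserves spatial size, and `bias=False` before batch normalization are conventional choices rather than details stated above. It covers only the identity-shortcut case, where input and output share a shape.

```python
import torch
from torch import nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Basic residual block with an identity shortcut: out = ReLU(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        # Two 3x3 convolutions; padding=1 keeps height and width unchanged.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = F.relu(self.bn1(self.conv1(x)))   # conv -> BN -> ReLU
        residual = self.bn2(self.conv2(residual))    # conv -> BN
        return F.relu(residual + x)                  # add the shortcut, then ReLU


block = BasicBlock(64)
x = torch.randn(2, 64, 56, 56)          # batch of 2, 64 channels, 56x56 maps
out_shape = tuple(block(x).shape)       # same shape as the input
```

Because the shortcut is a literal `+ x`, the block cannot change the number of channels or the resolution; that constraint is what the output shape check makes visible.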

The bottleneck block, used in the deepest variants, keeps the cost down with a many-few-many shape: a $1 \times 1$ convolution reduces the channel depth (many to few), a $3 \times 3$ convolution does the spatial work on the thinner map, and a second $1 \times 1$ convolution restores the depth (few to many). The expensive $3 \times 3$ therefore runs on a small fraction of the channels.

The figure states the same argument as a short chain of problems and fixes:

Problem

In a plain deep stack, identity mappings $x \to I (x) \equiv x$ are hard to learn, and the gradient struggles to flow back along long sequences of convolutional layers.

Solution

Add a skip connection so the layers learn only the residual $F (x) = H (x) - I (x)$, and the block outputs $F (x) + x$.

Problem

Reaching $50$ to $152$ layers with the basic block of two $3 \times 3$ convolutions becomes too expensive.

Solution

Switch to the bottleneck: a $1 \times 1$ convolution cuts the feature maps before the $3 \times 3$ and another restores them after (the many-few-many approach), so the costly $3 \times 3$ runs on far fewer channels.

The bottleneck is the Inception trick again

The $1 \times 1$ reduce-then-restore pattern is the same channel bottleneck introduced by the Inception module and analyzed in the note on replacing the dense head: a $1 \times 1$ convolution mixes channels cheaply, so wrapping the expensive $3 \times 3$ between two of them makes a $50$-to-$152$-layer network affordable.
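How much the many-few-many shape saves can be estimated with a weight count. The sketch below assumes widths of $256$ and $64$ channels, the $4\times$ reduction used in the first bottleneck stage of the published ResNet-50 configuration; those numbers are not given in this note. It counts convolution weights only, ignoring biases and batch-norm parameters, and the helper `conv_weights` is illustrative.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (no bias, no batch norm)."""
    return kernel * kernel * c_in * c_out


wide, narrow = 256, 64   # assumed channel widths: many -> few -> many

# Two 3x3 convolutions running at the full width, as in a basic block.
basic_at_wide = 2 * conv_weights(3, wide, wide)

# Bottleneck: 1x1 reduce, 3x3 on the thin map, 1x1 restore.
bottleneck = (
    conv_weights(1, wide, narrow)
    + conv_weights(3, narrow, narrow)
    + conv_weights(1, narrow, wide)
)

ratio = round(basic_at_wide / bottleneck, 1)
print(basic_at_wide, bottleneck, ratio)   # prints: 1179648 69632 16.9
```

Under these assumptions the bottleneck uses about one seventeenth of the weights of a basic block at the same outer width, even though it has three convolutions instead of two.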

The architecture, step by step

The figure is ResNet-34, and its colours are not decoration: each colour is one stage, a run of residual blocks at a fixed channel count, and the channel count doubles from stage to stage.

Reading the diagram

🟠 Stem: a single $7 \times 7$ convolution (stride $2$, $64$ channels) and a max pool, which together take the $224 \times 224$ input down to $56 \times 56$ before any residual block runs.

🟪 Stage 1 ($64$ ch) → 🟢 Stage 2 ($128$) → 🟥 Stage 3 ($256$) → 🔵 Stage 4 ($512$): four runs of $3 \times 3$ residual blocks. At each stage after the first the channels double and the spatial size halves (the /2 labels), so the resolution falls $56 \to 28 \to 14 \to 7$.

⬜ Head: a global average pool and a single fc 1000.

The arrows are the shortcuts: a solid arrow is a pure identity (input and output share a shape); a dotted arrow is a projection shortcut, needed where a stage boundary changes the shape.
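The resolutions quoted above follow from the standard output-size formula for a convolution or pooling window. The sketch below traces them, assuming padding of $3$ for the $7 \times 7$ stem convolution, a $3 \times 3$ stride-$2$ max pool with padding $1$, and padding $1$ for the stride-$2$ $3 \times 3$ convolution at each stage boundary. These padding and pooling values are the usual choices but are not stated in this note.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224
size = conv_out(size, kernel=7, stride=2, padding=3)   # stem convolution: 224 -> 112
size = conv_out(size, kernel=3, stride=2, padding=1)   # max pool: 112 -> 56
trace = [("stem", 64, size)]

channels = 64
for stage in range(1, 5):
    if stage > 1:
        # Stage boundary: channels double, a stride-2 conv halves the map.
        channels *= 2
        size = conv_out(size, kernel=3, stride=2, padding=1)
    trace.append((f"stage {stage}", channels, size))

for name, ch, side in trace:
    print(f"{name}: {ch} channels, {side}x{side}")
```

The trace ends at $512$ channels on a $7 \times 7$ map, which is what the global average pool in the head collapses to a single $512$-vector.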

Not every shortcut is an identity

The dotted arrows mark where the shortcut is not a plain identity. At a stage boundary the channels double and the spatial size halves, so the input $x$ and the block’s output $F (x)$ no longer have the same shape, and a literal $F (x) + x$ is impossible. There ResNet replaces the identity with a projection shortcut: a $1 \times 1$ convolution with stride $2$ that matches both the channel count and the resolution. So the “identity” connection is a true identity most of the time and a small learned projection at the few places the shape changes. The clean $H (x) = F (x) + x$ holds exactly only when $x$ and $F (x)$ are the same size.
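How rare the projection case is can be counted. The sketch below assumes the published ResNet-34 layout of $3$, $4$, $6$ and $3$ blocks in the four stages (the note gives the stage widths but not the block counts) and labels a shortcut as a projection whenever the channel count changes across the block.

```python
blocks_per_stage = [3, 4, 6, 3]          # assumed ResNet-34 layout
stage_channels = [64, 128, 256, 512]

in_channels = 64                         # channels leaving the stem
shortcuts = []
for n_blocks, out_channels in zip(blocks_per_stage, stage_channels):
    for _ in range(n_blocks):
        # Only the first block of stages 2-4 sees a change of shape.
        needs_projection = in_channels != out_channels
        shortcuts.append("projection" if needs_projection else "identity")
        in_channels = out_channels

n_projection = shortcuts.count("projection")
n_identity = shortcuts.count("identity")
print(n_identity, n_projection)          # prints: 13 3
```

Under that layout, $13$ of the $16$ shortcuts are true identities and only $3$ are learned projections, one per stage boundary.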

The same stage pattern, scaled to different depths, gives the standard family:

Variant	Block	Layers	Parameters
ResNet-18	basic	$18$	$\approx 11$ M
ResNet-34	basic	$34$	$\approx 22$ M
ResNet-50	bottleneck	$50$	$\approx 24$ M
ResNet-101	bottleneck	$101$	$\approx 43$ M
ResNet-152	bottleneck	$152$	$\approx 59$ M
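The layer counts in the table can be reproduced from the number of blocks per stage. The sketch below assumes the per-stage block counts of the published configurations, which this note does not list, and the usual counting convention: one stem convolution, the convolutions inside each block, and the final fully connected layer, with projection shortcuts left out.

```python
# (block type, blocks in stages 1-4); block counts are assumed from the paper.
configs = {
    "ResNet-18": ("basic", [2, 2, 2, 2]),
    "ResNet-34": ("basic", [3, 4, 6, 3]),
    "ResNet-50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet-101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet-152": ("bottleneck", [3, 8, 36, 3]),
}
convs_per_block = {"basic": 2, "bottleneck": 3}


def depth(block, blocks_per_stage):
    """Weighted layers: stem conv + convs inside the blocks + final fc."""
    return 1 + convs_per_block[block] * sum(blocks_per_stage) + 1


depths = {name: depth(block, stages) for name, (block, stages) in configs.items()}
for name, d in depths.items():
    print(name, d)
```

ResNet-34 and ResNet-50 share the same $16$ blocks; the extra sixteen layers come entirely from swapping two-convolution basic blocks for three-convolution bottlenecks.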

The bottleneck buys depth almost for free

Notice that ResNet-50 is sixteen layers deeper than ResNet-34 yet carries barely more parameters ($24$M against $22$M). That is the many-few-many block at work: squeezing the channels before the $3 \times 3$ keeps each block cheap, so the budget buys depth rather than width. It is the reason even ResNet-152, almost ten times deeper than VGG16, still has far fewer parameters, helped further by ending, like GoogLeNet, in global average pooling rather than a heavy dense head.

A clean identity matters: pre-activation

In the original block a ReLU sits after the addition, on the main path, so what flows down the shortcut is an identity-then-ReLU, not a pure identity. A year later the same authors (He et al., Identity Mappings in Deep Residual Networks, 2016) showed that moving the batch-norm and ReLU before the convolutions (“pre-activation”), so that nothing at all touches the shortcut, opens the gradient highway completely, from the last layer to the first. With that single change they trained a 1001-layer network. Adding a shortcut is not enough on its own: the shortcut has to be kept free of nonlinearities for the highway to stay open.
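Unrolling the recursion shows why the placement matters. This is a sketch following the argument of the 2016 paper, with notation introduced here for the purpose: $x_l$ is the input to block $l$, $L$ indexes any later block, and $\mathcal{E}$ is the loss.

$$
x_{l+1} = x_l + F(x_l) \;\Longrightarrow\; x_L = x_l + \sum_{i=l}^{L-1} F(x_i), \qquad \frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)
$$

The first term delivers $\partial \mathcal{E} / \partial x_L$ to block $l$ without passing through any weight layer. With a ReLU after each addition the update is instead $x_{l+1} = \mathrm{ReLU}(x_l + F(x_l))$, and the clean sum no longer holds.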

Why it mattered beyond the leaderboard

	Before ResNet	With residual connections
Trainable depth	roughly $20$ layers	$100+$ layers
Adding layers	eventually hurts training	keeps helping
The unit of design	a layer computing a mapping	a block deviating from identity

A ResNet is an ensemble of many shallow paths

Because each block can be either used or bypassed through its shortcut, unrolling a network of $n$ residual blocks reveals $2^{n}$ paths of different lengths from input to output (Veit et al., 2016), and most of the gradient flows through the short ones. This is why a trained ResNet is robust to deleting individual layers at test time, where a plain network collapses: the remaining paths still carry the signal. A residual network is less a single very deep model than an ensemble of many shallower ones that share weights.
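The $2^{n}$ count and the spread of path lengths are simple combinatorics. The sketch below treats each block as a binary choice (residual branch or shortcut) and tallies how many paths pass through exactly $k$ residual branches; the value $n = 16$ is an arbitrary illustrative choice. It only counts paths and says nothing about how much gradient each one carries, which is an empirical finding of the cited paper.

```python
from math import comb


def paths_by_length(n_blocks):
    """Number of input-to-output paths using exactly k residual branches, for k = 0..n."""
    return [comb(n_blocks, k) for k in range(n_blocks + 1)]


n = 16                                   # illustrative number of residual blocks
counts = paths_by_length(n)
total = sum(counts)                      # equals 2 ** n

# Paths that use every branch (full depth) or none (pure shortcut) are unique.
full_depth_paths = counts[n]
all_shortcut_paths = counts[0]
most_common_length = counts.index(max(counts))   # peak of the distribution
```

The distribution is binomial and peaks at $n/2$: exactly one path runs through every block, so nearly all paths are substantially shallower than the nominal depth of the network.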

The most reused idea in the chapter

The residual connection outgrew image classification entirely. The residual stream, a running sum that each block reads from and writes a small update back into, is the backbone of the Transformer and of essentially every large model trained since. ResNet’s lasting lesson is architectural: make the identity the default, and let each block learn only how much to change it.
