Abstract
ResNet (He, Zhang, Ren, and Sun, Microsoft Research Asia, 2015) made it possible to train networks of $50$, $101$, even $152$ layers, where plain networks had stalled at around $20$. It won the ILSVRC of 2015 with a top-5 error of $3.57\%$, the first time a machine crossed the estimated human level on the benchmark. Its single idea, the residual connection, is now in nearly every deep architecture, vision or not. This note covers the problem it solved, the block that solved it, and how the blocks assemble into the network.
The degradation problem
The obvious way to make a network more powerful is to add layers. By 2015 this had stopped working: beyond a point, adding layers made the network worse, and worse in a very specific way.
It is not overfitting
The deeper plain network had higher training error, not just higher test error. Overfitting would show as low training error and high test error; this was the opposite. The deeper model was failing to fit the data it was shown. The problem is therefore one of optimization, not of capacity or generalization.
The failure is counterintuitive. A deeper network contains a shallower one as a special case: take the shallow network and set the extra layers to the identity function, and the deep network computes exactly what the shallow one does. So a deep network should never be worse than its shallow prefix. Yet plain stacks of convolutions, trained by gradient descent, could not find that identity-preserving solution. Learning to leave a signal unchanged, through several nonlinear layers, is hard.
Residual learning: learn the deviation from identity
ResNet changes what each block is asked to learn. Instead of having a stack of layers compute a target mapping $H(x)$ directly, it has them compute only the residual
$$F(x) = H(x) - x$$
and then adds the input back via a shortcut, so the block outputs
$$y = F(x) + x.$$
The benefit is clearest in one case. If the best thing a block can do is nothing (pass its input through unchanged), a plain stack would have to learn the identity, which is hard; a residual block only has to drive its weights toward zero so that $F(x) = 0$, which is easy. The question each block answers shifts from “what should I compute” to “how far should I deviate from leaving the input alone”, and small deviations are exactly what gradient descent finds easily.
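The zero-weight case can be checked numerically. The sketch below is only an illustration: it assumes a toy two-layer fully connected residual branch in place of convolutions, and the names `plain_block` and `residual_block` are hypothetical, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_block(x, w1, w2):
    # A plain two-layer stack: the output is whatever the layers compute.
    return w2 @ relu(w1 @ x)

def residual_block(x, w1, w2):
    # The same layers now compute F(x); the shortcut adds the input back.
    return plain_block(x, w1, w2) + x

rng = np.random.default_rng(0)
x = rng.normal(size=8)
w_zero = np.zeros((8, 8))  # all weights driven to zero

y_plain = plain_block(x, w_zero, w_zero)
y_res = residual_block(x, w_zero, w_zero)

print(np.allclose(y_res, x))      # residual block passes the input through
print(np.allclose(y_plain, 0.0))  # plain block loses the signal entirely
```

With zero weights the residual block is the identity, while the plain stack outputs zeros; the plain stack would need a specific non-trivial set of weights to reproduce its input.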
The shortcut is also a gradient highway
The added $x$ does double duty. On the forward pass it makes the identity easy to represent; on the backward pass it gives the gradient an unattenuated path straight through the block, because the derivative of $F(x) + x$ with respect to $x$ carries a term $1$ (the identity) that no amount of depth can shrink. This is the direct remedy for the vanishing gradient, and the full argument, shared with Highway networks and the Transformer residual stream, is developed in skip connections. The same additive-identity idea appears across time, rather than across depth, as the LSTM cell state.
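A short derivation makes the backward-pass claim concrete. It is a sketch under one assumption: every shortcut is a pure identity with nothing applied after the addition. The indices $l$ and $L$, the per-block residuals $F_i$, and the loss $\mathcal{L}$ are notation introduced here for illustration.

```latex
\begin{aligned}
x_{l+1} &= x_l + F_l(x_l), \\
x_L &= x_l + \sum_{i=l}^{L-1} F_i(x_i), \\
\frac{\partial \mathcal{L}}{\partial x_l}
  &= \frac{\partial \mathcal{L}}{\partial x_L}
     \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F_i(x_i) \right).
\end{aligned}
```

The $1$ in the last line passes the gradient at layer $L$ back to layer $l$ unchanged, whatever the residual branches contribute in the second term.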
The residual block: basic and bottleneck
Skip connection versus residual block
The two terms are easy to confuse:
- a skip connection (also called a shortcut or identity connection) is just the arrow that carries $x$ forward and adds it back: the $+\,x$ in $y = F(x) + x$;
- a residual block is the whole unit: the small stack of convolutions that computes $F(x)$ together with the skip connection that adds the input. So the skip connection is one component, the shortcut; the residual block is the module built around it.
Put differently, the skip connection is the wire, the residual block is the circuit that includes it. The general idea and the gradient analysis of skip connections are in skip connections; this note is about the block ResNet builds from one.
The block comes in two forms.
The basic block is two $3\times3$ convolutions, each followed by batch normalization and a ReLU (the second ReLU is applied after the addition), with an identity shortcut from input to output.
The bottleneck block, used in the deepest variants, keeps the cost down with a many-few-many shape: a $1\times1$ convolution reduces the channel depth (many to few), a $3\times3$ convolution does the spatial work on the thinner map, and a second $1\times1$ convolution restores the depth (few to many). The expensive $3\times3$ therefore runs on a small fraction of the channels.
The figure states the same argument as a short chain of problems and fixes:
Problem
In a plain deep stack, identity mappings are hard to learn, and the gradient struggles to flow back along long sequences of convolutional layers.
Solution
Add a skip connection so the layers learn only the residual $F(x)$, and the block outputs $F(x) + x$.
Problem
Reaching $50$ to $152$ layers with the basic two-$3\times3$ block becomes too expensive.
Solution
Switch to the bottleneck: a $1\times1$ convolution cuts the feature maps before the $3\times3$ and another restores them after (the many-few-many approach), so the costly $3\times3$ runs on far fewer channels.
The bottleneck is the Inception trick again
The reduce-then-restore pattern is the same channel bottleneck introduced by the Inception module and analyzed in replacing the dense head: a $1\times1$ convolution mixes channels cheaply, so wrapping the expensive $3\times3$ between two of them makes a $50$-to-$152$-layer network affordable.
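The saving is easy to put in numbers. The following back-of-the-envelope count is a sketch that assumes a block working on 256 channels and squeezing to 64 inside the bottleneck (illustrative values, not stated in the text above); biases and batch-norm parameters are ignored, and `conv_weights` is a helper introduced here.

```python
def conv_weights(kernel, c_in, c_out):
    # Weight count of a kernel x kernel convolution, biases ignored.
    return kernel * kernel * c_in * c_out

wide, thin = 256, 64  # assumed "many" and "few" channel counts

# Basic block: two 3x3 convolutions at full width.
basic = 2 * conv_weights(3, wide, wide)

# Bottleneck: 1x1 reduce, 3x3 on the thin map, 1x1 restore.
bottleneck = (
    conv_weights(1, wide, thin)
    + conv_weights(3, thin, thin)
    + conv_weights(1, thin, wide)
)

print(basic, bottleneck, round(basic / bottleneck, 1))
# 1179648 69632 16.9
```

Under these assumptions the bottleneck needs roughly one seventeenth of the weights of a two-$3\times3$ block at the same input and output width.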
The architecture, step by step

The figure is ResNet-34, and its colours are not decoration: each colour is one stage, a run of residual blocks at a fixed channel count, and the channel count doubles from stage to stage.
Reading the diagram
- 🟠 Stem: a single $7\times7$ convolution (stride $2$, $64$ channels) and a max pool, which together take the $224\times224$ input down to $56\times56$ before any residual block runs.
- 🟪 Stage 1 ($64$ ch) → 🟢 Stage 2 ($128$) → 🟥 Stage 3 ($256$) → 🔵 Stage 4 ($512$): four runs of residual blocks. At each new stage the channels double and the spatial size halves (the `/2` labels), so the resolution falls $56 \to 28 \to 14 \to 7$.
- ⬜ Head: a global average pool and a single `fc 1000`.
- The arrows are the shortcuts: a solid arrow is a pure identity (input and output share a shape); a dotted arrow is a projection shortcut, needed where a stage boundary changes the shape.
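The shape bookkeeping in the list can be traced in a few lines. This sketch assumes the usual $224\times224$ ImageNet crop, a stride-2 stem convolution followed by a stride-2 max pool, and 64 channels in the first stage; `halve` is a helper introduced here.

```python
def halve(size):
    # A stride-2 layer with matching padding halves the spatial size.
    return size // 2

size = 224          # assumed input resolution
size = halve(size)  # stem convolution, stride 2
size = halve(size)  # max pool, stride 2

stages = []
channels = 64       # assumed width of the first stage
for stage in range(1, 5):
    if stage > 1:
        channels *= 2      # channels double at each stage boundary
        size = halve(size) # and the resolution halves
    stages.append((stage, channels, size))

print(stages)
# [(1, 64, 56), (2, 128, 28), (3, 256, 14), (4, 512, 7)]
```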
Not every shortcut is an identity
The dotted arrows mark where the shortcut is not a plain identity. At a stage boundary the channels double and the spatial size halves, so the input and the block’s output no longer have the same shape, and a literal $F(x) + x$ is impossible. There ResNet replaces the identity with a projection shortcut: a $1\times1$ convolution with stride $2$ that matches both the channel count and the resolution. So the “identity” connection is a true identity most of the time and a small learned projection at the few places the shape changes. The clean $y = F(x) + x$ holds exactly only when $F(x)$ and $x$ are the same size.
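A projection shortcut can be sketched in NumPy. The example assumes a boundary from 64 channels at $56\times56$ to 128 channels at $28\times28$, uses random numbers as a stand-in for the block output, and omits the batch normalization that normally follows the projection; `projection_shortcut` and `w_proj` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 56, 56))     # block input: (channels, height, width)
f_x = rng.normal(size=(128, 28, 28))  # stand-in for the block output F(x)

# Weights of a 1x1 convolution: one (c_out, c_in) matrix shared by all pixels.
w_proj = 0.1 * rng.normal(size=(128, 64))

def projection_shortcut(x, w, stride=2):
    # A 1x1 convolution with stride 2 reads every second pixel ...
    x_sub = x[:, ::stride, ::stride]
    # ... and mixes the channels at each of those pixels.
    return np.einsum("oc,chw->ohw", w, x_sub)

shortcut = projection_shortcut(x, w_proj)
y = f_x + shortcut  # shapes now agree, so the addition is defined

print(x.shape, shortcut.shape, y.shape)
```

Adding `x` to `f_x` directly would fail because the shapes differ; the projection is the smallest learned map that makes the sum well defined.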
The same stage pattern, scaled to different depths, gives the standard family:
| Variant | Block | Layers | Parameters |
|---|---|---|---|
| ResNet-18 | basic | 18 | 11.7M |
| ResNet-34 | basic | 34 | 21.8M |
| ResNet-50 | bottleneck | 50 | 25.6M |
| ResNet-101 | bottleneck | 101 | 44.5M |
| ResNet-152 | bottleneck | 152 | 60.2M |
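The depth in each variant's name follows from its block counts. The sketch below assumes the standard per-stage block counts (not listed in the table above) and the usual counting convention: the stem convolution, the convolutions inside the blocks, and the final fully connected layer, with projection shortcuts not counted. `VARIANTS` and `depth` are names introduced here.

```python
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

# Assumed blocks per stage for each variant.
VARIANTS = {
    "ResNet-18": ("basic", [2, 2, 2, 2]),
    "ResNet-34": ("basic", [3, 4, 6, 3]),
    "ResNet-50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet-101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet-152": ("bottleneck", [3, 8, 36, 3]),
}

def depth(block, blocks_per_stage):
    # stem conv + convs inside the residual blocks + final fc layer
    return 1 + CONVS_PER_BLOCK[block] * sum(blocks_per_stage) + 1

depths = {name: depth(*config) for name, config in VARIANTS.items()}
print(depths)
```

ResNet-34 and ResNet-50 share the same block counts; the extra sixteen layers come entirely from swapping the two-convolution basic block for the three-convolution bottleneck.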
The bottleneck buys depth almost for free
Notice that ResNet-50 is sixteen layers deeper than ResNet-34 yet carries barely more parameters ($25.6$M against $21.8$M). That is the many-few-many block at work: squeezing the channels before the $3\times3$ keeps each block cheap, so the budget buys depth rather than width. It is the reason even ResNet-152, almost ten times deeper than VGG16, still has far fewer parameters, helped further by ending, like GoogLeNet, in global average pooling rather than a heavy dense head.
A clean identity matters: pre-activation
In the original block a ReLU sits after the addition, on the main path, so what flows down the shortcut is an identity-then-ReLU, not a pure identity. A year later the same authors (He et al., Identity Mappings in Deep Residual Networks, 2016) showed that moving the batch-norm and ReLU before the convolutions (“pre-activation”), so that nothing at all touches the shortcut, opens the gradient highway completely, from the last layer to the first. With that single change they trained a 1001-layer network. Adding a shortcut is not enough on its own: the shortcut has to be kept free of nonlinearities for the highway to stay open.
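The difference between the two orderings shows up in a toy stack. This is a sketch with fully connected layers standing in for convolutions and batch normalization left out; `post_act_block` and `pre_act_update` are hypothetical names, and the input vector is chosen by hand so that it contains negative entries.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def post_act_block(x, w1, w2):
    # Original ordering: the ReLU after the addition also acts on the shortcut.
    return relu(x + w2 @ relu(w1 @ x))

def pre_act_update(x, w1, w2):
    # Pre-activation ordering: nonlinearities sit inside the residual branch only.
    return w2 @ relu(w1 @ relu(x))

x0 = np.array([1.0, -2.0, 0.5, -0.1])
rng = np.random.default_rng(0)
weights = [(0.1 * rng.normal(size=(4, 4)), 0.1 * rng.normal(size=(4, 4)))
           for _ in range(5)]

x_pre, x_post, updates = x0, x0, []
for w1, w2 in weights:
    update = pre_act_update(x_pre, w1, w2)
    updates.append(update)
    x_pre = x_pre + update                   # nothing touches the shortcut
    x_post = post_act_block(x_post, w1, w2)  # shortcut passes through a ReLU

# Pre-activation: the output is exactly the input plus the sum of all updates.
print(np.allclose(x_pre, x0 + sum(updates)))
# Post-activation: negative entries of the input cannot survive the ReLUs.
print(bool((x_post >= 0).all()))
```

In the pre-activation stack the input reaches the output untouched, with each block only adding to it; in the original ordering the signal on the shortcut is clipped at every block.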
Why it mattered beyond the leaderboard
| | Before ResNet | With residual connections |
|---|---|---|
| Trainable depth | roughly $20$ layers | $100$+ layers |
| Adding layers | eventually hurts training | keeps helping |
| The unit of design | a layer computing a mapping | a block deviating from identity |
A ResNet is an ensemble of many shallow paths
Because each block can be either used or bypassed through its shortcut, unrolling a network of residual blocks reveals paths of different lengths from input to output (Veit et al., 2016), and most of the gradient flows through the short ones. This is why a trained ResNet is robust to deleting individual layers at test time, where a plain network collapses: the remaining paths still carry the signal. A residual network is less a single very deep model than an ensemble of many shallower ones that share weights.
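The path count behind the unrolled view is simple combinatorics. The sketch below assumes 16 residual blocks (the block count of ResNet-34, used here only as an example) and treats every shortcut as a clean bypass; it counts paths and says nothing about how much gradient each one carries.

```python
from math import comb

n_blocks = 16  # assumed number of residual blocks

# Each block is either traversed or bypassed, so there are 2**n paths;
# comb(n, k) of them pass through exactly k residual branches.
paths_by_length = [comb(n_blocks, k) for k in range(n_blocks + 1)]
total_paths = sum(paths_by_length)
most_common_length = paths_by_length.index(max(paths_by_length))

print(total_paths)         # 65536
print(most_common_length)  # 8
```

Only one of the 65536 paths runs through all 16 blocks; the typical path uses about half of them, which is why removing a single block leaves most paths intact.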
The most reused idea in the chapter
The residual connection outgrew image classification entirely. The residual stream, a running sum that each block reads from and writes a small update back into, is the backbone of the Transformer and of essentially every large model trained since. ResNet’s lasting lesson is architectural: make the identity the default, and let each block learn only how much to change it.