VGGNet

Abstract

VGGNet (Simonyan and Zisserman, Visual Geometry Group, Oxford, 2014), also called OxfordNet, made one decisive point: a large convolution can be replaced by a stack of small $3 \times 3$ convolutions, and depth built entirely from $3 \times 3$ kernels outperforms shallower networks with large ones. But the paper’s real method was a clean controlled experiment in depth, and its legacy is as much the uniform “VGG block” template as the $3 \times 3$ idea itself. This note covers three things: the idea, the experiment, and the cautionary parameter cost. The full receptive-field proof lives elsewhere on the site and is only pointed to here.

The one idea: small kernels, stacked

VGG’s lasting discovery, still used today, is that there is no need for large convolution kernels. A convolution with a big kernel can be factored into a cascade of smaller ones:

a $5 \times 5$ convolution is equivalent, in receptive field, to two stacked $3 \times 3$ convolutions;
a $7 \times 7$ convolution is equivalent to three stacked $3 \times 3$ convolutions.

Equivalence of receptive field, not of weights

The equivalence is not an algebraic identity between the kernels; it is an equivalence of receptive field, the region of the input that influences one output unit.

The full derivation (the $R_{N} = 1 + \sum_{i} (k_{i} - 1)$ formula, the worked $5 \times 5$ and $7 \times 7$ cases, and the two reasons the factorization is worth doing) is developed in stacking small kernels. It is not repeated here; only the conclusions VGG drew from it are.
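As a quick check of the formula quoted above, the following minimal sketch evaluates $R_{N} = 1 + \sum_{i} (k_{i} - 1)$ for a stride-1 stack. The function name `stacked_receptive_field` is illustrative only and does not come from the VGG paper.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    return 1 + sum(k - 1 for k in kernel_sizes)

print(stacked_receptive_field([3, 3]))     # 5, same as one 5x5 kernel
print(stacked_receptive_field([3, 3, 3]))  # 7, same as one 7x7 kernel
```

The formula only covers stride-1 layers; pooling or strided layers need the more general recurrence.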

The two advantages that follow, both established in that note, are what made the design win:

Fewer parameters. With $C$ input channels, three $3 \times 3$ kernels cost $3 \times 9 C = 27 C$ weights per output channel against a single $7 \times 7$ kernel’s $49 C$ , a reduction of about $45\%$ at the same receptive field (and $28\%$ for two $3 \times 3$ kernels versus one $5 \times 5$ ).
More nonlinearity. Each $3 \times 3$ layer is followed by its own ReLU, so a stack interposes several nonlinearities where a single large kernel interposes one, which makes the decision function over the same region more discriminative.
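The parameter arithmetic can be checked with a short sketch. It assumes every layer in the stack has the same number of input and output channels and ignores biases; the width `C = 256` and the helper `conv_weights` are illustrative choices, not values taken from the paper.

```python
def conv_weights(kernel, c_in, c_out, layers=1):
    """Weights (biases ignored) in `layers` stacked kernel x kernel convolutions."""
    return layers * kernel * kernel * c_in * c_out

C = 256  # illustrative channel width; the ratios do not depend on it

stack7 = conv_weights(3, C, C, layers=3)   # three 3x3 layers
single7 = conv_weights(7, C, C)            # one 7x7 layer
stack5 = conv_weights(3, C, C, layers=2)   # two 3x3 layers
single5 = conv_weights(5, C, C)            # one 5x5 layer

saving7 = 1 - stack7 / single7
saving5 = 1 - stack5 / single5
print(f"7x7 replaced by three 3x3: {saving7:.1%} fewer weights")
print(f"5x5 replaced by two 3x3:   {saving5:.1%} fewer weights")
```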

Why $3 \times 3$ , and not smaller

A $3 \times 3$ kernel is the smallest one that still has a sense of direction. A $1 \times 1$ kernel has no spatial extent at all (it sees one pixel, so it cannot tell left from right or detect an edge); a $2 \times 2$ kernel has no central pixel, so it has no symmetric notion of “a center and its neighbors.” A $3 \times 3$ kernel is the first to place a center pixel among all eight of its neighbors, the minimum needed to represent an oriented edge or a corner. This is why, once VGG showed that depth could supply the receptive field, $3 \times 3$ became the near-universal kernel size: it is the smallest unit of spatial pattern worth stacking.

The receptive field is nominal, not effective

By its last convolutional layer a deep VGG has a nominal receptive field covering most of the image. The effective receptive field, the region that actually carries weight in an output unit’s gradient, is far smaller and roughly Gaussian, concentrated near the centre (Luo et al., 2016). Stacking small kernels grows the nominal field linearly with depth but the effective field only as its square root, so a network can in principle see the whole image long before it meaningfully does. This gap is part of why even very deep plain CNNs handle truly global, long-range relationships poorly, the weakness that attention was later built to fix.
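To put a number on the nominal field, the sketch below applies the standard receptive-field recurrence for layers with strides. The recurrence is an assumption here, since this note defers the derivation to another page. It is applied to the VGG16 layout of $3 \times 3$ stride-1 convolutions and $2 \times 2$ stride-2 pools. It says nothing about the effective field, which has to be measured from gradients.

```python
def nominal_receptive_field(layers):
    """layers: sequence of (kernel, stride) pairs, input side first."""
    field, jump = 1, 1          # jump = input pixels between adjacent outputs
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field

# VGG16 body: 2, 2, 3, 3, 3 convolutions per stage, each stage ending in a pool.
VGG16_LAYERS = []
for convs in (2, 2, 3, 3, 3):
    VGG16_LAYERS += [(3, 1)] * convs + [(2, 2)]

print(nominal_receptive_field(VGG16_LAYERS[:-1]))  # 196, at the last convolution
print(nominal_receptive_field(VGG16_LAYERS))       # 212, after the final pool
```

On a $224 \times 224$ input, a nominal field of $196$ pixels already spans most of the image.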

Depth was the experiment

The popular memory of VGG is “stack $3 \times 3$ kernels,” but the paper’s actual contribution was an experiment. Fixing the kernel at $3 \times 3$ removes nearly every other variable, which let the authors vary one thing in isolation, depth, and measure its effect alone. They trained a family of configurations of growing depth:

Configuration	Weight layers
VGG11 (A)	$11$
VGG13 (B)	$13$
VGG16 (D)	$16$
VGG19 (E)	$19$

Accuracy improved monotonically with depth up to $16$ to $19$ layers, then saturated. At the time this was among the cleanest pieces of evidence that, given the right building block, deeper is better, which is why “VGG16” and “VGG19” are named for their depth rather than any other feature. The saturation at $19$ layers was itself informative: it foreshadowed the degradation problem that ResNet would diagnose and fix a year later, namely that plain stacks cannot be pushed arbitrarily deep.

The architecture: uniformity as a principle

VGG turns that single idea into an entire design philosophy: use nothing but $3 \times 3$ convolutions (stride $1$ , padding $1$ , which preserves spatial size), interleaved with $2 \times 2$ max pooling, and simply make the network deep. The result is two standard configurations, named for their number of weight layers:

VGG16: 13 convolutional layers and 3 fully connected, 16 in total;
VGG19: 16 convolutional layers and 3 fully connected, 19 in total.

The regularity is the point. There are almost no architectural decisions to make per layer, only how many $3 \times 3$ blocks to stack before each pooling step, which made VGG easy to understand, reproduce, and transfer.
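Because the only per-stage decision is how many convolutions to stack, each configuration can be written as five integers. The sketch below uses the per-stage counts from the paper's configurations A, B, D and E; the dictionary layout and the helper name are illustrative.

```python
# Number of 3x3 convolutions in each of the five stages (a pool follows each stage).
CONVS_PER_STAGE = {
    "VGG11": [1, 1, 2, 2, 2],
    "VGG13": [2, 2, 2, 2, 2],
    "VGG16": [2, 2, 3, 3, 3],
    "VGG19": [2, 2, 4, 4, 4],
}
FC_LAYERS = 3  # every configuration ends in the same three dense layers

def weight_layers(name):
    """Convolutional plus fully connected layers; pooling has no weights."""
    return sum(CONVS_PER_STAGE[name]) + FC_LAYERS

for name in CONVS_PER_STAGE:
    print(name, sum(CONVS_PER_STAGE[name]), "conv +", FC_LAYERS, "fc =", weight_layers(name))
```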

The channel-doubling rule, and the "VGG block"

VGG fixed two patterns the field still uses. First, the block: a short run of $3 \times 3$ convolutions followed by a $2 \times 2$ pool, repeated. Second, the channel-doubling rule: every time pooling halves the height and width, the next block doubles the channel count ( $64 \to 128 \to 256 \to 512$ ). The two move in opposite directions on purpose. Halving each spatial dimension cuts the number of positions by four, while doubling the channels only multiplies by two, so the activation volume shrinks by a factor of two per stage even as the representation grows richer per position. This “stage” structure, narrowing space while widening channels in step, became the template that almost every later backbone, ResNet included, inherited.
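The factor-of-two shrinkage is easy to verify numerically. This sketch assumes a $224 \times 224$ input and the four doubling stages listed above; `stage_volumes` is an illustrative helper name.

```python
def stage_volumes(size=224, channels=(64, 128, 256, 512)):
    """Return (side, channels, activations) per stage, halving the side each time."""
    rows = []
    for c in channels:
        rows.append((size, c, size * size * c))
        size //= 2  # effect of the 2x2 pool that closes the stage
    return rows

volumes = stage_volumes()
for side, c, n in volumes:
    print(f"{side:>3} x {side:<3} x {c:<3} = {n:,} activations")
# 224 x 224 x 64  = 3,211,264
# 112 x 112 x 128 = 1,605,632
#  56 x 56  x 256 =   802,816
#  28 x 28  x 512 =   401,408
```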

Result

VGG took first place in localization and second in classification at ILSVRC 2014, behind GoogLeNet. Its top-5 classification error of about $7.3\%$ was a large step down from the previous year, achieved with a strikingly plain design.

Checking the dimensions against the formulas

The diagram of VGG16 reports a spatial chain $224 \to 112 \to 56 \to 28 \to 14 \to 7$ . Every number in it follows directly from the output-size formulas derived in padding and stride and pooling, using only the two operations VGG is built from.

The $3 \times 3$ convolutions preserve the spatial size. They run with stride $S = 1$ and padding $P = 1$ , so the stride- $1$ output formula gives

$$W_{out} = W_{in} - K + 2P + 1 = W_{in} - 3 + 2(1) + 1 = W_{in}.$$

That $P = 1$ is exactly the same-padding value $P = \frac{K - 1}{2} = \frac{3 - 1}{2} = 1$ , which is why a $3 \times 3$ convolution changes only the channel count and leaves height and width untouched. It is the reason an entire block of convolutions sits at one resolution in the diagram (for instance the two $224 \times 224 \times 64$ layers, or the three $28 \times 28 \times 512$ layers).

The $2 \times 2$ max pooling halves it. Each downsampling step is a $2 \times 2$ pool with stride $S = 2$ and no padding, whose output size is

$$W_{out} = \left\lfloor \frac{W_{in} - K}{S} \right\rfloor + 1 = \left\lfloor \frac{W_{in} - 2}{2} \right\rfloor + 1.$$

Applying it five times from the $224 \times 224$ input reproduces the chain exactly:

Pool	$W_{in}$	$\lfloor (W_{in} - 2)/2 \rfloor + 1$	$W_{out}$
1	$224$	$111 + 1$	$112$
2	$112$	$55 + 1$	$56$
3	$56$	$27 + 1$	$28$
4	$28$	$13 + 1$	$14$
5	$14$	$6 + 1$	$7$

The chain lands on $7 \times 7$ , confirming the final feature map of $7 \times 7 \times 512$ drawn just before the dense layers. Each convolutional block leaves the resolution fixed; each pool halves it; the channel count, meanwhile, doubles at every stage ( $64 \to 128 \to 256 \to 512$ ) until it saturates at $512$ . Everything in the figure is accounted for.
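The same bookkeeping can be replayed in a few lines. The sketch uses the general output-size formula $\lfloor (W + 2P - K)/S \rfloor + 1$, which reduces to the stride-1 expression above when $S = 1$; the function names are illustrative.

```python
def conv_out(w, k=3, p=1, s=1):
    """Output width of a convolution with kernel k, padding p, stride s."""
    return (w + 2 * p - k) // s + 1

def pool_out(w, k=2, s=2):
    """Output width of an unpadded pooling window k with stride s."""
    return (w - k) // s + 1

chain = [224]
for _ in range(5):
    w = conv_out(chain[-1])      # 3x3, stride 1, padding 1: size unchanged
    assert w == chain[-1]
    chain.append(pool_out(w))    # 2x2, stride 2: size halved

print(chain)  # [224, 112, 56, 28, 14, 7]
```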

Why the halving stays exact

Every pooling input ( $224, 112, 56, 28, 14$ ) is even, so $(W_{in} - 2) /2$ is already a whole number and the floor discards nothing. This is not luck: the $224 \times 224$ input is a multiple of $2^{5} = 32$ (indeed $224 = 7 \times 32$ ), and five clean halvings of a multiple of $32$ land exactly on $7$ . An input that was not a multiple of $32$ would lose a pixel to the floor at some stage, and the chain would not close so neatly. This is one reason input sizes that are multiples of $32$ , such as the conventional $224 \times 224$ ImageNet crop, are convenient for five-stage networks.
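A small sketch makes the floor visible. It runs the pooling formula five times and records the stages whose input is odd, which is where a pixel is dropped. The width $200$ is an arbitrary counterexample chosen for illustration, and `pooling_chain` is an illustrative name.

```python
def pooling_chain(w, stages=5):
    """Apply 2x2 stride-2 pooling `stages` times; report stages that drop a pixel."""
    chain, lossy = [w], []
    for stage in range(1, stages + 1):
        if chain[-1] % 2:                  # odd input: the floor discards one pixel
            lossy.append(stage)
        chain.append((chain[-1] - 2) // 2 + 1)
    return chain, lossy

print(pooling_chain(224))  # ([224, 112, 56, 28, 14, 7], [])
print(pooling_chain(200))  # ([200, 100, 50, 25, 12, 6], [4])
```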

That final $7 \times 7 \times 512$ map is where VGG’s cost is paid, which is the subject of the next section.

Training a deep net in 2014

VGG arrived before batch normalization and residual connections, which is easy to forget and central to appreciating it. A 19-layer network trained from scratch by plain SGD was, at the time, genuinely hard to optimize: the vanishing gradient made the early layers learn slowly, and a poor initialization could stall training entirely. The authors worked around this by training the shallowest configuration (A) first and using its weights to initialize the first four convolutional layers and the three fully connected layers of the deeper ones, leaving the intermediate layers randomly initialized and so growing the network in stages.

That this awkward staged procedure was necessary is exactly the difficulty that batch normalization and residual connections would soon remove: only a year later, ResNet trained a $152$ -layer network from scratch in a single run. VGG sits precisely on the boundary between the two eras, which is why it is at once the high-water mark of plain deep networks and the clearest motivation for what replaced them.

Multi-scale training and testing, the forgotten half of VGG's accuracy

VGG’s published numbers were not obtained at a single resolution. Images were rescaled so that the shorter side $S$ varied: in training, $S$ was jittered in $[256, 512]$ ; at test time, predictions were averaged over several scales. This exposes the filters to objects at many sizes and acts as a powerful, nearly free augmentation. This “scale jittering” contributed a sizable share of VGG’s accuracy, and it is routinely dropped when the network is summarized as merely “deep stacks of $3 \times 3$ ”.
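The training-time half of this procedure can be sketched as pure geometry: pick $S$, rescale so the shorter side equals $S$, then choose a random crop. The $224$-pixel crop is assumed from the input size used throughout this note. The function name, the rounding rule and the example image size are illustrative rather than the authors' code, and no pixels are actually resampled here.

```python
import random

def scale_jitter_crop(height, width, s_min=256, s_max=512, crop=224, rng=random):
    """Sample a training scale S and the top-left corner of a random crop."""
    s = rng.randint(s_min, s_max)            # shorter side after rescaling
    scale = s / min(height, width)           # isotropic rescale factor
    new_h, new_w = round(height * scale), round(width * scale)
    top = rng.randint(0, new_h - crop)       # the crop always fits since S >= crop
    left = rng.randint(0, new_w - crop)
    return s, (new_h, new_w), (top, left)

rng = random.Random(0)                       # seeded only to make the demo repeatable
s, size, corner = scale_jitter_crop(375, 500, rng=rng)
print(s, size, corner)
```

An object that fills the crop at $S = 256$ occupies only part of it at $S = 512$, which is how a single image yields training examples at several apparent scales.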

The cost: a parameter-heavy head

VGG’s weakness is the mirror image of its strength. The uniform convolutional stack is parameter-efficient, but the network ends in three large fully connected layers, and those dominate the count: VGG16 has about 138 million parameters, of which roughly 123 million live in the dense head.

The single most expensive layer is the first fully connected one, which flattens the final $7 \times 7 \times 512$ feature map into a $4096$ -unit layer:

$$7 \times 7 \times 512 \times 4096 \approx 102.8 \text{ million parameters},$$

in one layer, more than the entire convolutional stack beneath it.
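These totals can be reproduced by counting weights and biases layer by layer. The sketch assumes the standard VGG16 channel widths and a 1000-class ImageNet output layer; the list names are illustrative.

```python
# (in_channels, out_channels) of the 13 convolutions in VGG16, all 3x3.
CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (inputs, outputs) of the three dense layers; 1000 classes assumed (ImageNet).
FC_SIZES = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(9 * c_in * c_out + c_out for c_in, c_out in CONV_CHANNELS)
fc_params = sum(n_in * n_out + n_out for n_in, n_out in FC_SIZES)

print(f"convolutional body: {conv_params:,}")              # 14,714,688
print(f"dense head:         {fc_params:,}")                # 123,642,856
print(f"total:              {conv_params + fc_params:,}")  # 138,357,544
```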

VGG is expensive at both ends, in two different currencies

The parameters and the compute do not live in the same place. The $138$ million parameters are dominated by the fully connected head (about $123$ M), so VGG is heavy to store. The floating-point operations, by contrast, are dominated by the convolutional body, which runs $3 \times 3$ filters over large feature maps starting at $224 \times 224$ : VGG16 costs roughly $15.5$ billion multiply-adds per image, of which the dense head accounts for under $1\%$ . Because each pool halves the resolution while the channels double, a typical convolution costs about the same in each of the first four stages, so the compute is spread through the body rather than concentrated in any one layer. That total is roughly twenty times AlexNet’s and about ten times GoogLeNet’s. So VGG is painful in two distinct ways, large on disk because of the head and slow at inference because of the convolutions. Parameter count and compute are decoupled, and VGG is the clearest example of the gap between them.
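The compute side can be tallied the same way as the parameters. The sketch counts one multiply-add per weight per output position, ignores biases, ReLU and pooling, and assumes the standard VGG16 layout on a $224 \times 224$ input; all names are illustrative.

```python
# (feature-map side, [(in_channels, out_channels), ...]) for each stage of VGG16.
STAGES = [
    (224, [(3, 64), (64, 64)]),
    (112, [(64, 128), (128, 128)]),
    (56,  [(128, 256), (256, 256), (256, 256)]),
    (28,  [(256, 512), (512, 512), (512, 512)]),
    (14,  [(512, 512), (512, 512), (512, 512)]),
]
FC_SIZES = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# Each output position of a 3x3 convolution costs 9 * c_in multiply-adds per output channel.
stage_macs = [sum(side * side * 9 * c_in * c_out for c_in, c_out in convs)
              for side, convs in STAGES]
conv_macs = sum(stage_macs)
fc_macs = sum(n_in * n_out for n_in, n_out in FC_SIZES)
total_macs = conv_macs + fc_macs

print(f"total multiply-adds: {total_macs:,}")          # 15,470,264,320
print(f"dense head share:    {fc_macs / total_macs:.2%}")
for (side, _), macs in zip(STAGES, stage_macs):
    print(f"stage at {side:>3} x {side:<3}: {macs / 1e9:.2f} G")
```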

VGG is the worked example of the dense-head problem

These exact numbers are the running example in replacing the dense head with convolutions. VGG is where the conv-to-dense parameter explosion is most visible, and the reason later backbones replaced the flattened dense head with global average pooling. VGG simplified the convolutional body to perfection and left the head untouched; the next architectures fixed the head.

Why VGG outlived its leaderboard ranking

VGG was overtaken on accuracy within a year, yet it remained one of the most widely reused backbones for far longer, and not merely out of habit. Its uniform stack grows the receptive field gradually and produces a smooth, well-behaved hierarchy of features, which turned out to be ideal for measuring perceptual similarity between images rather than classifying them. The style transfer of Gatys et al. (2015) reads texture and style off the correlations (Gram matrices) of VGG’s convolutional features; the perceptual losses of Johnson et al. (2016) train image generators by matching VGG activations. A network can stop being the best classifier and still be the most useful feature extractor, which is exactly the backbone idea in action.
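For readers unfamiliar with the term, a Gram matrix is just the table of inner products between channels of a feature map. The sketch below computes one in plain Python for a toy three-channel map. Dividing by the number of spatial positions is one common normalization and is an assumption here, as is the toy data.

```python
def gram_matrix(features):
    """features: list of C channels, each a flat list of H*W activations."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(f_i, f_j)) / n for f_j in features]
            for f_i in features]

toy = [
    [1, 0, 1, 0],   # channel 0
    [0, 1, 0, 1],   # channel 1, never active together with channel 0
    [1, 1, 1, 1],   # channel 2, always active
]
G = gram_matrix(toy)
for row in G:
    print(row)
# [0.5, 0.0, 0.5]
# [0.0, 0.5, 0.5]
# [0.5, 0.5, 1.0]
```

The spatial layout is summed away, which is why the matrix captures which features co-occur (texture) rather than where they occur.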
