AlexNet

Abstract

AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) is the network that ended the debate about deep learning for vision. It won ILSVRC 2012 by a margin no previous method had approached, and many of its design choices (ReLU activations, dropout, GPU training) are still in use. This note covers what was new in it, its layer-by-layer structure, and the parameter pattern that the rest of the chapter exists to fix.

The name comes from the first author, Alex Krizhevsky.

What AlexNet introduced

AlexNet did not invent its key ingredients so much as combine them at a scale that finally worked:

it made ReLU popular again, showing that a non-saturating activation trains a deep network far faster than the saturating sigmoid or tanh;
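it applied dropout, at the time a very recent technique from the same group, as a regularizer in the dense layers;
it used overlapping max pooling ($3 \times 3$ windows with stride $2$) for downsampling.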
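
The "non-saturating" point can be made concrete with a minimal sketch that compares the derivatives of the three activations as the input grows. The function names are illustrative, not taken from any library.

```python
import math

def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes for large |x|."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# As x grows, the saturating activations pass back almost no gradient,
# while ReLU keeps passing a gradient of exactly 1.
for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  relu'={relu_grad(x):.1f}  "
          f"tanh'={tanh_grad(x):.2e}  sigmoid'={sigmoid_grad(x):.2e}")
```

A unit driven far into the positive range still learns under ReLU, whereas under tanh or sigmoid its gradient is multiplied by a number close to zero at every such layer.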

A dissent on max pooling

LeCun never endorsed max pooling: average pooling is often considered to preserve more information, and max pooling discards everything in a window except its strongest response (the trade-off is examined in the note on downsampling). Max pooling won out in practice mostly because it is slightly cheaper and latches onto the dominant activation. The disagreement is a useful reminder that not every choice in a landmark network is the provably optimal one.
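
The difference between the two pooling rules is easy to see on a toy feature map. The sketch below is a plain-Python illustration with made-up values; it uses non-overlapping $2 \times 2$ windows for readability, not AlexNet's overlapping $3 \times 3$ windows.

```python
def pool2d(x, size, stride, reduce):
    """Slide a size x size window over a 2-D list and reduce each window."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [x[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

# Hypothetical activations: two windows contain a single strong response.
feature_map = [
    [0, 0, 1, 1],
    [0, 9, 1, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 8],
]

max_pooled = pool2d(feature_map, 2, 2, max)   # [[9, 1], [2, 8]]
avg_pooled = pool2d(feature_map, 2, 2, mean)  # [[2.25, 1.0], [2.0, 2.0]]
print(max_pooled)
print(avg_pooled)
```

Max pooling reports the peak of each window and nothing else; average pooling blends the whole window, so the flat bottom-left window and the peaked bottom-right one both come out as 2.0.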

The win was a recipe, not a single trick

It is tempting to credit AlexNet’s leap to one idea, but the result came from several reinforcing one another. Alongside ReLU and dropout, the network leaned heavily on data augmentation: random $224 \times 224$ crops taken from $256 \times 256$ images, horizontal flips, and a PCA-based colour perturbation (“fancy PCA”) that jitters the lighting, with predictions averaged over ten crops at test time. Augmentation multiplied the effective dataset by orders of magnitude and did as much for generalization as dropout. The lesson, often lost, is that 2012’s breakthrough was architecture and data and regularization and GPUs together, not any one of them in isolation.
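
The crop-and-flip part of that recipe can be sketched in a few lines. This is a minimal illustration on a single-channel image stored as nested lists, with hypothetical helper names; the PCA colour perturbation is left out, and the ten test-time views are assumed to be the four corner crops, the centre crop, and their mirror images.

```python
import random

def crop(img, top, left, size):
    """Return the size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]

def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def random_train_view(img, size=224, rng=random):
    """Training-time view: a random crop, flipped with probability 0.5."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    view = crop(img, top, left, size)
    return hflip(view) if rng.random() < 0.5 else view

def ten_crop(img, size=224):
    """Test-time views: four corners and the centre, plus their mirrors."""
    h, w = len(img), len(img[0])
    corners = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [crop(img, t, l, size) for t, l in corners]
    return crops + [hflip(c) for c in crops]

# A dummy 256 x 256 image whose pixel value encodes its position.
image = [[r * 256 + c for c in range(256)] for r in range(256)]
view = random_train_view(image, rng=random.Random(0))
views = ten_crop(image)
print(len(view), len(view[0]), len(views))
```

At test time the network's predictions on the ten views would be averaged; that step is omitted here because it needs a trained model.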

Two things that stand out

The first kernels are huge: $11 \times 11$. The filters of the first convolutional layer are large enough that the learned weights can be viewed directly as little images, and when they are, they show oriented edges and colour blobs, the bottom tier of the feature hierarchy made visible. Modern networks have since abandoned such large kernels in favor of stacks of $3 \times 3$ (the lesson of VGG).

The network is deep and heavy for its time. AlexNet has 8 weight layers (5 convolutional, 3 fully connected) and about 62 million parameters. For comparison, LeNet-5, an early CNN trained end to end, had roughly 60 thousand: a thousandfold increase in a decade and a half.

Layer by layer

The full structure, for a $3 \times 227 \times 227$ input, is below. The parameter count of a convolutional layer is $(K_{W} \times K_{H} \times C_{in} + 1) \times C_{out}$: the kernel volume plus one bias, once per output channel. The fully connected rows count weights only and omit the biases.

Layer	Output	Filter	Stride	Pad	Parameters
Input	$3 \times 227 \times 227$	-	-	-	-
Conv1 + ReLU	$96 \times 55 \times 55$	$11 \times 11$	$4$	$0$	$(11 \cdot 11 \cdot 3 + 1) \cdot 96 = 34{,}944$
Max Pool	$96 \times 27 \times 27$	$3 \times 3$	$2$	$0$	$0$
LRN	$96 \times 27 \times 27$	-	-	-	$0$
Conv2 + ReLU	$256 \times 27 \times 27$	$5 \times 5$	$1$	$2$	$(5 \cdot 5 \cdot 96 + 1) \cdot 256 = 614{,}656$
Max Pool	$256 \times 13 \times 13$	$3 \times 3$	$2$	$0$	$0$
LRN	$256 \times 13 \times 13$	-	-	-	$0$
Conv3 + ReLU	$384 \times 13 \times 13$	$3 \times 3$	$1$	$1$	$(3 \cdot 3 \cdot 256 + 1) \cdot 384 = 885{,}120$
Conv4 + ReLU	$384 \times 13 \times 13$	$3 \times 3$	$1$	$1$	$(3 \cdot 3 \cdot 384 + 1) \cdot 384 = 1{,}327{,}488$
Conv5 + ReLU	$256 \times 13 \times 13$	$3 \times 3$	$1$	$1$	$(3 \cdot 3 \cdot 384 + 1) \cdot 256 = 884{,}992$
Max Pool	$256 \times 6 \times 6$	$3 \times 3$	$2$	$0$	$0$
Dropout (0.5)	$256 \times 6 \times 6$	-	-	-	$0$
FC6 + ReLU	$4096$	-	-	-	$256 \cdot 6 \cdot 6 \cdot 4096 = 37{,}748{,}736$
Dropout (0.5)	$4096$	-	-	-	$0$
FC7 + ReLU	$4096$	-	-	-	$4096 \cdot 4096 = 16{,}777{,}216$
FC8 (output)	$1000$	-	-	-	$4096 \cdot 1000 = 4{,}096{,}000$
Total	-	-	-	-	$\approx 62.4$ million
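
The parameter column can be reproduced with a short script. This is a minimal sketch that applies the convolution formula above to the shapes in the table and, like the table, counts only the weights of the fully connected layers; the helper names are illustrative.

```python
def conv_params(k_w, k_h, c_in, c_out):
    """(K_W * K_H * C_in + 1) * C_out: kernel volume plus one bias per output channel."""
    return (k_w * k_h * c_in + 1) * c_out

def fc_weights(n_in, n_out):
    """Weights of a fully connected layer, biases omitted as in the table."""
    return n_in * n_out

conv_layers = {
    "Conv1": conv_params(11, 11, 3, 96),
    "Conv2": conv_params(5, 5, 96, 256),
    "Conv3": conv_params(3, 3, 256, 384),
    "Conv4": conv_params(3, 3, 384, 384),
    "Conv5": conv_params(3, 3, 384, 256),
}
fc_layers = {
    "FC6": fc_weights(256 * 6 * 6, 4096),
    "FC7": fc_weights(4096, 4096),
    "FC8": fc_weights(4096, 1000),
}

conv_total = sum(conv_layers.values())  # 3,747,200
fc_total = sum(fc_layers.values())      # 58,621,952
total = conv_total + fc_total           # 62,369,152

for name, count in {**conv_layers, **fc_layers}.items():
    print(f"{name:6s} {count:>12,}")
print(f"{'Total':6s} {total:>12,}")
```

Adding the $4096 + 4096 + 1000$ fully connected biases would raise the total by only 9,192.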

The output layer has 1000 neurons, one per ILSVRC category, each producing the probability that the image belongs to that class.
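
Turning the 1000 raw outputs into probabilities is the job of a softmax. The sketch below assumes that reading (the note itself does not name the function) and uses a three-class toy vector in place of the 1000 ILSVRC scores.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

Subtracting the maximum leaves the result unchanged and keeps `math.exp` from overflowing when the scores are large.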

Why $227$ and not $224$?

The paper states a $224 \times 224$ input (the size of the training crops), yet the arithmetic of the first layer only works at $227$: an $11 \times 11$ filter with stride $4$ and no padding turns a $227$-wide input into $(227 - 11)/4 + 1 = 55$, the correct Conv1 width, whereas $224$ gives $(224 - 11)/4 + 1 = 54.25$, which is not an integer. The paper's figure says $224$; the implementation effectively used $227$ (padding a $224$ input by $3$ pixels in total gives the same result). It is a small, famous inconsistency worth knowing when reproducing the network.
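
The same arithmetic can be checked for every layer. The sketch below uses the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ and also reports whether the division was exact; `out_size` is an illustrative helper, and the layer list is copied from the table above.

```python
def out_size(n, k, stride, pad=0):
    """Output width of a conv or pool layer, and whether the stride divides evenly."""
    span = n + 2 * pad - k
    return span // stride + 1, span % stride == 0

print(out_size(227, 11, 4))      # (55, True)
print(out_size(224, 11, 4))      # (54, False): the stride does not fit
print(out_size(224 + 3, 11, 4))  # (55, True): three extra pixels in total

# (kernel, stride, pad) for every layer that changes or preserves the width.
layers = [
    (11, 4, 0),  # Conv1
    (3, 2, 0),   # Max Pool
    (5, 1, 2),   # Conv2
    (3, 2, 0),   # Max Pool
    (3, 1, 1),   # Conv3
    (3, 1, 1),   # Conv4
    (3, 1, 1),   # Conv5
    (3, 2, 0),   # Max Pool
]
widths = []
n = 227
for k, s, p in layers:
    n, _ = out_size(n, k, s, p)
    widths.append(n)
print(widths)  # [55, 27, 27, 13, 13, 13, 13, 6]
```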

Where the parameters and the compute live

Splitting the totals between the convolutional and the dense part exposes a striking asymmetry:

	Convolutional layers	Fully connected layers
Share of parameters	$\approx 6\%$	$\approx 94\%$
Share of compute (FLOPs)	$\approx 95\%$	$\approx 5\%$

The dense head holds almost all of the memory (the weights), while the convolutional backbone does almost all of the work (the arithmetic). The first fully connected layer alone, mapping the flattened $256 \times 6 \times 6$ activation to $4096$ units, carries $37.7$ million weights, more than ten times the entire convolutional stack.
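
Both shares can be estimated from the layer shapes alone. The sketch below makes several simplifying assumptions: compute is counted in multiply-accumulates, pooling, ReLU, and LRN are ignored, fully connected biases are omitted, and the convolutions are treated as ungrouped, as in the table.

```python
# (name, kernel size, input channels, output channels, output width)
CONVS = [
    ("Conv1", 11, 3, 96, 55),
    ("Conv2", 5, 96, 256, 27),
    ("Conv3", 3, 256, 384, 13),
    ("Conv4", 3, 384, 384, 13),
    ("Conv5", 3, 384, 256, 13),
]
# (name, inputs, outputs)
FCS = [("FC6", 256 * 6 * 6, 4096), ("FC7", 4096, 4096), ("FC8", 4096, 1000)]

# Parameters: kernel volume plus bias per output channel; FC weights only.
conv_params = sum((k * k * c_in + 1) * c_out for _, k, c_in, c_out, _ in CONVS)
fc_params = sum(n_in * n_out for _, n_in, n_out in FCS)

# Multiply-accumulates: each conv weight is reused at every output position,
# each FC weight is used exactly once.
conv_macs = sum(k * k * c_in * c_out * w * w for _, k, c_in, c_out, w in CONVS)
fc_macs = sum(n_in * n_out for _, n_in, n_out in FCS)

conv_param_share = conv_params / (conv_params + fc_params)
conv_mac_share = conv_macs / (conv_macs + fc_macs)
print(f"conv share of parameters: {conv_param_share:.1%}")
print(f"conv share of compute:    {conv_mac_share:.1%}")
```

The asymmetry comes from weight reuse: a convolutional weight is applied at every output position ($55 \times 55$ times in Conv1), while a fully connected weight is applied once per image.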

This is the problem the chapter goes on to solve

That $94\%$ of the parameters sit in the head is not a quirk of AlexNet; it is the parameter explosion quantified in the note on replacing the dense head with convolutions, and the reason later architectures (GoogLeNet, ResNet) replace the wide dense layers with global average pooling ahead of a single classifier layer. AlexNet is the concrete case that makes the cost visible.

Two details now obsolete

Two pieces of AlexNet did not survive in their original form:

Local Response Normalization (LRN), a normalization across nearby channels, later superseded by batch normalization.

The two-GPU split, used only to fit the network in memory, which forced its convolutions into two parallel halves. The split itself is gone, but it is the origin of grouped convolution, a hardware workaround that later became a deliberate design tool.
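
For reference, the normalization can be sketched at a single spatial position. The formula and the defaults ($n = 5$, $k = 2$, $\alpha = 10^{-4}$, $\beta = 0.75$) are the ones reported in the AlexNet paper rather than in this note, and the function name is illustrative.

```python
def lrn(channels, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local response normalization across channels at one spatial position.

    Each activation a_i is divided by (k + alpha * sum of a_j^2) ** beta,
    where j runs over the n channels centred on i (clipped at the ends).
    """
    last = len(channels) - 1
    half = n // 2
    out = []
    for i, a in enumerate(channels):
        lo, hi = max(0, i - half), min(last, i + half)
        energy = sum(channels[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * energy) ** beta)
    return out

# Hypothetical activations of eight neighbouring channels at one pixel.
activations = [0.0, 1.0, 50.0, 1.0, 0.0, 0.0, 2.0, 0.0]
print([round(v, 4) for v in lrn(activations)])
```

A strong response in one channel enlarges the denominator for its neighbours, which is the lateral-inhibition effect the layer was meant to provide.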

The two GPUs split the work, and the network noticed

That two-GPU split was a memory workaround, but it left a fingerprint on what the network learned. Because the two halves exchanged information only at a few layers, each was free to specialize, and they did: visualized, one GPU’s first-layer filters come out largely colour-agnostic (oriented edges and gratings), the other’s colour-specific (blobs of opponent colour). A division forced by hardware became a spontaneous division of labor, an early and accidental demonstration that grouped pathways tend to specialize, the property later put to deliberate use in grouped convolution and ResNeXt.
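
The parameter effect of such a split is simple to compute: with $g$ groups, each filter sees only $C_{in}/g$ input channels. The sketch below assumes, following the original paper rather than this note, that Conv2, Conv4, and Conv5 were the layers restricted to same-GPU inputs; the function name is illustrative.

```python
def grouped_conv_params(k, c_in, c_out, groups=1):
    """Parameters of a k x k convolution whose channels are split into groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (k * k * (c_in // groups) + 1) * c_out

# Ungrouped counts match the layer table; groups=2 models the two-GPU split.
for name, k, c_in, c_out in [("Conv2", 5, 96, 256),
                             ("Conv4", 3, 384, 384),
                             ("Conv5", 3, 384, 256)]:
    full = grouped_conv_params(k, c_in, c_out)
    split = grouped_conv_params(k, c_in, c_out, groups=2)
    print(f"{name}: {full:>9,} ungrouped, {split:>9,} with two groups")
```

Halving the input channels per filter roughly halves the weights of each affected layer, which is the economy that grouped convolution later exploited on purpose.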
