Abstract

AlexNet (Krizhevsky, Sutskever, and Hinton, 2012) is the network that ended the debate about deep learning for vision. It won ILSVRC 2012 with a top-5 test error of 15.3%, against 26.2% for the runner-up, a margin no previous method had approached, and several of its design choices (ReLU activations, dropout, GPU training) are still in use. This note covers what was new in it, its layer-by-layer structure, and the parameter pattern that the rest of the chapter exists to fix.

The name comes from the first author, Alex Krizhevsky.

What AlexNet introduced

AlexNet did not invent its key ingredients so much as combine them at a scale that finally worked:

  • it made ReLU popular again, showing that a non-saturating activation trains a deep network far faster than the saturating sigmoid or tanh;
  • it applied the then-new dropout technique as a regularizer in the dense layers;
  • it used overlapping max pooling ($3 \times 3$ windows with stride $2$) for downsampling.
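
The saturation argument in the first bullet can be read directly off the derivatives. The sketch below is illustrative only: the function names are our own, not taken from the paper, and the sample inputs are arbitrary. It evaluates the gradient of each activation at a few pre-activation values.

```python
import math

def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

# Arbitrary sample points; larger x means deeper into the saturated regime.
for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  "
          f"tanh'={tanh_grad(x):.2e}  relu'={relu_grad(x):.0f}")
```

For large positive inputs the sigmoid and tanh gradients shrink towards zero while the ReLU gradient stays at 1, which is the sense in which ReLU is non-saturating on its active side.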

A dissent on max pooling

LeCun never endorsed max pooling: average pooling is often considered to preserve more information, and max pooling discards everything in a window except its strongest response (the trade-off is examined in downsampling). Max pooling won out in practice mostly because it is slightly cheaper and latches onto the dominant activation. The disagreement is a useful reminder that not every choice in a landmark network is the provably optimal one.
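
A toy example makes the information loss concrete. This is a minimal sketch under stated assumptions: the activation map is hypothetical, and the window is a non-overlapping $2 \times 2$ for brevity (AlexNet itself pooled with larger, overlapping windows).

```python
def pool2x2(grid, reduce_fn):
    """Apply a non-overlapping 2x2 pooling window to a 2D list."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            reduce_fn([grid[r][c], grid[r][c + 1],
                       grid[r + 1][c], grid[r + 1][c + 1]])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Hypothetical 4x4 activation map: the left-hand windows share a maximum of 9,
# but one is a lone spike and the other is strong everywhere.
activation = [
    [9, 0, 1, 2],
    [0, 0, 3, 4],
    [9, 8, 5, 5],
    [7, 8, 5, 5],
]

max_pooled = pool2x2(activation, max)   # keeps only the strongest response
avg_pooled = pool2x2(activation, mean)  # blends all four responses
print(max_pooled)
print(avg_pooled)
```

Max pooling maps both left-hand windows to 9, so a lone spike and a uniformly strong region become indistinguishable; average pooling separates them (2.25 against 8.0) at the price of diluting the spike.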

The win was a recipe, not a single trick

It is tempting to credit AlexNet’s leap to one idea, but the result came from several reinforcing one another. Alongside ReLU and dropout, the network leaned heavily on data augmentation: random $224 \times 224$ crops taken from $256 \times 256$ images, horizontal flips, and a PCA-based colour perturbation (“fancy PCA”) that jitters the lighting, with predictions averaged over ten crops at test time. By the paper’s own count, crops and flips alone multiplied the effective dataset by a factor of 2048, and augmentation did as much for generalization as dropout. The lesson, often lost, is that 2012’s breakthrough was architecture and data and regularization and GPUs together, not any one of them in isolation.
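
The crop-and-flip part of that recipe can be sketched in a few lines. This is an illustration, not the authors' code: the helper names are hypothetical, images are plain nested lists with a single channel, the image is assumed square, and the colour perturbation is left out.

```python
import random

def crop(image, top, left, size):
    """Return a size x size window of a 2D (row-major) image."""
    return [row[left:left + size] for row in image[top:top + size]]

def hflip(image):
    """Mirror an image left to right."""
    return [row[::-1] for row in image]

def random_train_crop(image, size, rng=random):
    """Training-time view: a random crop, mirrored with probability 0.5."""
    max_offset = len(image) - size
    top = rng.randint(0, max_offset)
    left = rng.randint(0, max_offset)
    patch = crop(image, top, left, size)
    return hflip(patch) if rng.random() < 0.5 else patch

def ten_crop(image, size):
    """Test-time views: four corners and the centre, plus their mirrors."""
    far = len(image) - size
    mid = far // 2
    offsets = [(0, 0), (0, far), (far, 0), (far, far), (mid, mid)]
    crops = [crop(image, t, l, size) for t, l in offsets]
    return crops + [hflip(c) for c in crops]

# Toy 8x8 "image" with 6x6 crops, standing in for the real image and crop sizes.
toy = [[r * 8 + c for c in range(8)] for r in range(8)]
views = ten_crop(toy, 6)
print(len(views), len(views[0]), len(views[0][0]))
```

At test time the network would be run on each of the ten views and the resulting class probabilities averaged; that step is omitted here because it depends on the model.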

Two things that stand out

The first kernels are huge: $11 \times 11$. The very first convolutional layer uses $11 \times 11$ filters, large enough that the learned weights can be viewed directly as little images, and when they are, they show oriented edges and colour blobs, the bottom tier of the feature hierarchy made visible. Modern networks have since abandoned such large kernels in favor of stacks of $3 \times 3$ (the lesson of VGG).
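
Why small stacked kernels won can be sketched with two formulas: $n$ stacked stride-1 convolutions of size $k$ see a window of $1 + n(k - 1)$ pixels, and each contributes $k^2 C^2$ weights when it maps $C$ channels to $C$ channels. The comparison below is idealized and the constant width is our assumption: AlexNet's Conv1 maps 3 channels to 96 with stride 4, so it is not a drop-in replacement.

```python
def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stride-1 convolutions of size `kernel`."""
    return 1 + layers * (kernel - 1)

def stacked_weights(kernel: int, layers: int, channels: int) -> int:
    """Weights (no biases) when every layer maps `channels` to `channels`."""
    return layers * kernel * kernel * channels * channels

channels = 96  # assumed constant width, purely for comparison
single_11 = stacked_weights(11, 1, channels)  # one 11x11 layer
five_3 = stacked_weights(3, 5, channels)      # five stacked 3x3 layers
print(stacked_receptive_field(3, 5), single_11, five_3)
```

Five $3 \times 3$ layers cover the same $11 \times 11$ window with well under half the weights, and they interleave four extra non-linearities along the way.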

The network is deep and heavy for its time. AlexNet has 8 weight layers (5 convolutional, 3 fully connected) and about 62 million parameters. For comparison, LeNet-5, the classic early trainable CNN, had roughly 60 thousand: a thousandfold increase in a decade and a half.

Layer by layer

The full structure, for a $227 \times 227 \times 3$ input, is below. The parameter count of a convolutional layer is $(k \cdot k \cdot C_{in} + 1) \cdot C_{out}$, the kernel volume plus a bias, once per output channel.

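| Layer | Output | Filter | Stride | Pad | Parameters |
|---|---|---|---|---|---|
| Input | $227 \times 227 \times 3$ | - | - | - | - |
| Conv1 + ReLU | $55 \times 55 \times 96$ | $11 \times 11$ | 4 | 0 | 34,944 |
| Max Pool | $27 \times 27 \times 96$ | $3 \times 3$ | 2 | 0 | 0 |
| LRN | $27 \times 27 \times 96$ | - | - | - | 0 |
| Conv2 + ReLU | $27 \times 27 \times 256$ | $5 \times 5$ | 1 | 2 | 614,656 |
| Max Pool | $13 \times 13 \times 256$ | $3 \times 3$ | 2 | 0 | 0 |
| LRN | $13 \times 13 \times 256$ | - | - | - | 0 |
| Conv3 + ReLU | $13 \times 13 \times 384$ | $3 \times 3$ | 1 | 1 | 885,120 |
| Conv4 + ReLU | $13 \times 13 \times 384$ | $3 \times 3$ | 1 | 1 | 1,327,488 |
| Conv5 + ReLU | $13 \times 13 \times 256$ | $3 \times 3$ | 1 | 1 | 884,992 |
| Max Pool | $6 \times 6 \times 256$ | $3 \times 3$ | 2 | 0 | 0 |
| Dropout (0.5) | $9216$ | - | - | - | 0 |
| FC6 + ReLU | $4096$ | - | - | - | 37,752,832 |
| Dropout (0.5) | $4096$ | - | - | - | 0 |
| FC7 + ReLU | $4096$ | - | - | - | 16,781,312 |
| FC8 (output) | $1000$ | - | - | - | 4,097,000 |
| Total | - | - | - | - | ≈ 62.4 million |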

The output layer has 1000 neurons, one per ILSVRC category; a softmax over their outputs gives the probability that the image belongs to each class.
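
The shapes and parameter counts of the layer table follow mechanically from the kernel size, stride, and padding of each layer. The sketch below recomputes them under the same assumptions as the text: a $227$-wide input and ordinary (ungrouped) convolutions. The helper names and layer labels are our own; LRN and dropout are omitted because they change neither the shape nor the count.

```python
def out_size(width: int, kernel: int, stride: int, pad: int) -> int:
    """Spatial output width of a convolution or pooling window."""
    return (width + 2 * pad - kernel) // stride + 1

def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """(k * k * C_in + 1) * C_out: kernel volume plus a bias per filter."""
    return (kernel * kernel * c_in + 1) * c_out

def fc_params(n_in: int, n_out: int) -> int:
    """One weight per input-output pair plus a bias per output."""
    return (n_in + 1) * n_out

# (name, kind, kernel, stride, pad, output channels)
CONV_STACK = [
    ("conv1", "conv", 11, 4, 0, 96),
    ("pool1", "pool", 3, 2, 0, None),
    ("conv2", "conv", 5, 1, 2, 256),
    ("pool2", "pool", 3, 2, 0, None),
    ("conv3", "conv", 3, 1, 1, 384),
    ("conv4", "conv", 3, 1, 1, 384),
    ("conv5", "conv", 3, 1, 1, 256),
    ("pool5", "pool", 3, 2, 0, None),
]
FC_STACK = [("fc6", 4096), ("fc7", 4096), ("fc8", 1000)]

def summarize(input_width: int = 227, input_channels: int = 3):
    """Return (rows, total); each row is (name, width, channels, params)."""
    width, channels, rows = input_width, input_channels, []
    for name, kind, kernel, stride, pad, c_out in CONV_STACK:
        width = out_size(width, kernel, stride, pad)
        params = 0
        if kind == "conv":
            params = conv_params(kernel, channels, c_out)
            channels = c_out
        rows.append((name, width, channels, params))
    features = width * width * channels  # flattened input to FC6
    for name, n_out in FC_STACK:
        rows.append((name, 1, n_out, fc_params(features, n_out)))
        features = n_out
    return rows, sum(row[3] for row in rows)

rows, total = summarize()
for name, width, channels, params in rows:
    print(f"{name:6s} {width:3d} x {width:3d} x {channels:4d}  {params:>12,}")
print(f"total  {total:,}")
```

Under these assumptions the total comes to 62,378,344.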

Why $227$ and not $224$?

The ImageNet images are nominally cropped to $224 \times 224$, yet the arithmetic of the first layer only works at $227 \times 227$: an $11 \times 11$ filter with stride $4$ and no padding turns a $227$-wide input into $(227 - 11)/4 + 1 = 55$, the correct Conv1 width, whereas $(224 - 11)/4 = 53.25$ does not divide cleanly. The paper’s figure says $224$; the implementation effectively used $227$ (a padding of $2$ on a $224$ input gives the same result once the division is rounded down). It is a small, famous inconsistency worth knowing when reproducing the network.
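
The three cases can be checked with the usual output-width formula, $\lfloor (W + 2P - K)/S \rfloor + 1$. This is a minimal sketch; the function name and the "leftover" diagnostic are our own additions.

```python
def conv_width(width: int, kernel: int = 11, stride: int = 4, pad: int = 0):
    """Return (output width, leftover pixels) for one convolution.

    A non-zero leftover means the last stride does not land exactly on the
    edge of the (padded) input, i.e. the sizes do not divide cleanly.
    """
    span = width + 2 * pad - kernel
    return span // stride + 1, span % stride

for width, pad in [(227, 0), (224, 0), (224, 2)]:
    out, leftover = conv_width(width, pad=pad)
    print(f"input {width}, pad {pad}: output {out}, leftover {leftover}")
```

Only the $227$-wide input gives 55 with nothing left over. A bare $224$ input gives 54, and $224$ with a padding of $2$ reaches 55 only because the floor discards one leftover pixel.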

Where the parameters and the compute live

Splitting the totals between the convolutional and the dense part exposes a striking asymmetry:

| | Convolutional layers | Fully connected layers |
|---|---|---|
| Share of parameters | ≈ 6% | ≈ 94% |
| Share of compute (FLOPs) | ≈ 95% | ≈ 5% |

The dense head holds almost all of the memory (the weights), while the convolutional backbone does almost all of the work (the arithmetic). The first fully connected layer alone, mapping the flattened $6 \times 6 \times 256 = 9216$ activation to $4096$ units, carries $37.7$ million weights, more than ten times the entire convolutional stack.
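
The asymmetry can be recomputed from the layer shapes. The sketch below is an approximation under stated assumptions: ungrouped convolutions, compute measured in multiply-accumulates, and pooling, LRN, and activations ignored. The variable names are our own.

```python
# (kernel, C_in, C_out, output width) for the five convolutions
CONVS = [(11, 3, 96, 55), (5, 96, 256, 27), (3, 256, 384, 13),
         (3, 384, 384, 13), (3, 384, 256, 13)]
# (inputs, outputs) for the three fully connected layers
DENSE = [(9216, 4096), (4096, 4096), (4096, 1000)]

# Parameters: weights plus one bias per output channel or unit.
conv_params = sum((k * k * ci + 1) * co for k, ci, co, _ in CONVS)
dense_params = sum((ni + 1) * no for ni, no in DENSE)

# Multiply-accumulates: a convolution reuses its kernel at every one of the
# w * w output positions, whereas a dense weight is used exactly once.
conv_macs = sum(k * k * ci * co * w * w for k, ci, co, w in CONVS)
dense_macs = sum(ni * no for ni, no in DENSE)

def share(part: int, whole: int) -> float:
    """Percentage of `whole` taken up by `part`."""
    return 100.0 * part / whole

total_params = conv_params + dense_params
total_macs = conv_macs + dense_macs
print(f"parameters: conv {share(conv_params, total_params):.1f}%, "
      f"dense {share(dense_params, total_params):.1f}%")
print(f"compute:    conv {share(conv_macs, total_macs):.1f}%, "
      f"dense {share(dense_macs, total_macs):.1f}%")
```

The weight reuse in the second comment is the whole story: a convolutional weight is applied at every spatial position, so the backbone is cheap to store and expensive to run, and the dense head is the reverse.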

This is the problem the chapter goes on to solve

That about $94\%$ of the parameters sit in the head is not a quirk of AlexNet; it is the parameter explosion quantified in replacing the dense head with convolutions, and the reason later architectures (GoogLeNet, ResNet) drop the dense head in favor of global average pooling. AlexNet is the concrete case that makes the cost visible.

Two details now obsolete

Two pieces of AlexNet did not survive in their original form:

  • Local Response Normalization (LRN), a normalization across nearby channels, later superseded by batch normalization.
  • The two-GPU split, used only to fit the network in memory, which forced its convolutions into two parallel halves. The split itself is gone, but it is the origin of grouped convolution, a hardware workaround that later became a deliberate design tool.
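
The effect of the split on the weight count is easy to quantify. In the sketch below the `groups` argument is our own naming for the number of parallel pathways; the layer shapes are the ungrouped ones, and the three layers listed are those whose filters, in the original paper, saw only the feature maps held on their own GPU.

```python
def conv_params(kernel: int, c_in: int, c_out: int, groups: int = 1) -> int:
    """Weights plus biases of a convolution split into `groups` pathways.

    Each filter sees only c_in / groups input channels, so the weight count
    shrinks by a factor of `groups`; the bias count does not change.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * c_out + c_out

# (kernel, C_in, C_out) of the layers that were split across the two GPUs.
GROUPED = {"conv2": (5, 96, 256), "conv4": (3, 384, 384), "conv5": (3, 384, 256)}

saved = 0
for name, (k, c_in, c_out) in GROUPED.items():
    full = conv_params(k, c_in, c_out)
    split = conv_params(k, c_in, c_out, groups=2)
    saved += full - split
    print(f"{name}: {full:,} ungrouped vs {split:,} with groups=2")
print(f"saved: {saved:,}")
```

Splitting these three layers removes about 1.4 million weights, which accounts for most of the gap between the ungrouped count and the roughly 60 million parameters quoted in the original paper.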

The two GPUs split the work, and the network noticed

That two-GPU split was a memory workaround, but it left a fingerprint on what the network learned. Because the two halves exchanged information only at a few layers, each was free to specialize, and they did: visualized, one GPU’s first-layer filters come out largely colour-agnostic (oriented edges and gratings), the other’s colour-specific (blobs of opponent colour). A division forced by hardware became a spontaneous division of labor, an early and accidental demonstration that grouped pathways tend to specialize, the property later put to deliberate use in grouped convolution and ResNeXt.