GoogLeNet

Abstract

GoogLeNet (Szegedy et al., 2014) won the ILSVRC 2014 classification task with a top-5 error of $6.7\%$, using twelve times fewer parameters than AlexNet and a fraction of VGG’s compute. It is built by stacking the Inception module, which is developed in its own note. This note covers the network those modules form: how it is assembled, the two choices that made a 22-layer network trainable in 2014 (global average pooling and auxiliary classifiers), and why it is so efficient.

GoogLeNet’s name is a deliberate homage to LeNet, Yann LeCun’s pioneering convolutional network. Its building block, the multi-branch Inception module with its $1 \times 1$ bottlenecks, is the subject of a separate note.

Two more decisions that made it work

Beyond the module itself, two further choices were decisive.

Global average pooling instead of a dense head. GoogLeNet does not flatten its final feature map into large fully connected layers. It averages each feature map to a single number (global average pooling), producing a short vector that feeds almost directly into the softmax. This removes the parameter explosion that dominated AlexNet and VGG (see replacing the dense head), and is most of why a 22-layer network has only about 5 million parameters.
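The size of that saving can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a final feature map of $7 \times 7$ with 1024 channels and 1000 output classes (figures from the published architecture, not stated in this note), and `dense_params` is a hypothetical helper name.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Assumed shapes: 7x7 final map, 1024 channels, 1000 classes.
height, width, channels, classes = 7, 7, 1024, 1000

# Option A: flatten the whole map and classify it directly.
flatten_head = dense_params(height * width * channels, classes)

# Option B: global average pooling first, so only `channels` values remain.
gap_head = dense_params(channels, classes)

print(f"flatten head: {flatten_head:,} parameters")
print(f"GAP head:     {gap_head:,} parameters")
print(f"ratio:        {flatten_head / gap_head:.1f}x")
```

Under these assumptions the pooled head is roughly 49 times smaller, and the gap widens further against heads that stack several wide dense layers, as AlexNet and VGG do.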

Global average pooling does more than save parameters

Averaging each feature map to one number also encourages a correspondence between feature maps and output concepts: each channel is pushed to behave like a spatial “confidence map” for a category, which is more interpretable and acts as a structural regularizer. The idea comes from Network in Network, where the last convolution has one channel per class; GoogLeNet keeps one linear layer after the pooling, so the correspondence is looser there. As a bonus, the average is well defined over any spatial extent, so a global-pooled network is no longer tied to a single input resolution.
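The resolution-independence claim is easy to see in code. This is a minimal pure-Python sketch, not how a framework implements it; `global_average_pool` and the toy inputs are made up for illustration.

```python
def global_average_pool(feature_maps):
    """Average each channel (a 2-D list of numbers) down to a single value."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Two channels on a 2x2 grid ...
small = [[[1.0, 3.0], [5.0, 7.0]],
         [[0.0, 0.0], [2.0, 2.0]]]

# ... and two channels on a 3x3 grid.
large = [[[4.0] * 3 for _ in range(3)],
         [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 10.0]]]

# Both calls return one number per channel, whatever the grid size.
print(global_average_pool(small))
print(global_average_pool(large))
```

The output length depends only on the channel count, so whatever follows the pooling layer sees the same input shape for any image size.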

Auxiliary classifiers. Training a 22-layer network in 2014, before residual connections and widespread batch normalization, ran into the vanishing-gradient problem: the gradient reaching the earliest layers was too weak. GoogLeNet attached two extra softmax classifiers to intermediate layers, added their losses (down-weighted) to the main loss during training, and removed them at inference. The stated intent was to inject a stronger gradient into the lower layers.
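The training objective described above can be sketched as follows. The weight of 0.3 is the value reported in the GoogLeNet paper and is treated here as an assumption; `googlenet_loss` and the toy logits are hypothetical, and the `training=False` branch stands in for a network whose auxiliary heads have been removed.

```python
import math

AUX_WEIGHT = 0.3  # assumed: the down-weighting reported in the GoogLeNet paper

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

def googlenet_loss(main_logits, aux_logits, target, training=True):
    """Main loss, plus the down-weighted auxiliary losses while training."""
    loss = cross_entropy(main_logits, target)
    if training:
        loss += AUX_WEIGHT * sum(cross_entropy(a, target) for a in aux_logits)
    return loss

# Toy logits for a 3-class problem: one main head, two auxiliary heads.
main = [2.0, 0.5, -1.0]
aux = [[1.0, 1.0, 0.0], [0.5, 0.2, 0.1]]

train_loss = googlenet_loss(main, aux, target=0, training=True)
eval_loss = googlenet_loss(main, aux, target=0, training=False)
print(train_loss, eval_loss)
```

Because the auxiliary terms are simply added to the scalar loss, backpropagation sends their gradients into whichever intermediate layers the heads are attached to, with no other change to the training loop.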

What became of these two ideas

Global average pooling stayed and is now standard. The auxiliary classifiers did not, and their story is a small lesson in how the field’s understanding shifts. Once batch normalization and the residual connections of ResNet solved the gradient problem directly, the authors themselves (in the Inception v3 paper) reported that the auxiliary heads did not help early in training as the gradient story predicted; they helped only late, and mainly when they carried batch normalization. The honest verdict is that they acted as regularizers, not as a necessary gradient path, and later networks dropped them. The original justification was largely right about the symptom and wrong about the mechanism.

Efficiency, measured

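The headline “twelve times fewer parameters” (relative to AlexNet) is only half the story, because parameters and compute are different costs. Against VGG, GoogLeNet cut both at once; against AlexNet, it cut parameters sharply while keeping compute in the same range.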

Network	Parameters	Compute (approx. multiply-adds)
AlexNet	$\approx 62$ M	$\approx 1.1$ B
VGG16	$\approx 138$ M	$\approx 15.5$ B
GoogLeNet	$\approx 5$ M	$\approx 1.5$ B

VGG is heavy in both columns; GoogLeNet is light in both. The parameter saving comes from the global-average-pooling head; the compute saving comes from the $1 \times 1$ bottlenecks inside the Inception module. Holding the two costs apart is essential when comparing architectures, because a network can be small to store yet slow to run, or the reverse: VGG is the cautionary case of being expensive in both currencies.
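To make the two costs concrete, the sketch below counts weights and multiply-adds for a $5 \times 5$ branch with and without a $1 \times 1$ bottleneck. The shapes (a $28 \times 28$ map, 192 input channels, a reduction to 16 channels, 32 output channels) are assumed from the first Inception module of the published network; `conv_cost` is a hypothetical helper that ignores biases.

```python
def conv_cost(size, kernel, c_in, c_out):
    """(multiply-adds, weights) of a stride-1, same-padded convolution
    applied to a size x size map. Biases are ignored."""
    weights = kernel * kernel * c_in * c_out
    # Each weight is used once per output position.
    return size * size * weights, weights

# Direct 5x5 convolution: 192 -> 32 channels.
direct_macs, direct_weights = conv_cost(28, 5, 192, 32)

# Bottlenecked: 1x1 reduces 192 -> 16, then 5x5 maps 16 -> 32.
reduce_macs, reduce_weights = conv_cost(28, 1, 192, 16)
expand_macs, expand_weights = conv_cost(28, 5, 16, 32)
bottleneck_macs = reduce_macs + expand_macs
bottleneck_weights = reduce_weights + expand_weights

print(f"direct:     {direct_macs:,} multiply-adds, {direct_weights:,} weights")
print(f"bottleneck: {bottleneck_macs:,} multiply-adds, {bottleneck_weights:,} weights")
print(f"compute saving: {direct_macs / bottleneck_macs:.1f}x")
```

Note that the weight count does not depend on the map size while the multiply-add count scales with it, which is exactly why the two columns of the table can move independently.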

The full network, read off the diagram

The diagram uses only four kinds of block, set apart by colour. Each block carries its operation and, where it applies, the kernel size, stride, and padding: `Conv 3x3+1(S)` is a $3 \times 3$ convolution at stride $1$ with same padding, and `(V)` marks valid padding instead.
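The label notation is regular enough to parse mechanically. The sketch below is a hypothetical reader for such labels, not part of any library; it assumes square inputs and kernels, and it assumes the common convention that same padding gives an output of $\lceil n / s \rceil$ while valid padding gives $\lfloor (n - k) / s \rfloor + 1$.

```python
import math
import re

# Operation, kernel height x width, "+stride", then (S)ame or (V)alid.
LABEL = re.compile(r"(\w+) (\d+)x(\d+)\+(\d+)\(([SV])\)")

def parse_block(label):
    """Split a label such as 'Conv 3x3+1(S)' into its parts."""
    match = LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognised block label: {label!r}")
    op, kh, kw, stride, pad = match.groups()
    return {
        "op": op,
        "kernel": (int(kh), int(kw)),
        "stride": int(stride),
        "padding": "same" if pad == "S" else "valid",
    }

def output_size(size, block):
    """Spatial size after the block, for a square input and square kernel."""
    kernel, stride = block["kernel"][0], block["stride"]
    if block["padding"] == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1

stem_conv = parse_block("Conv 7x7+2(S)")
aux_pool = parse_block("AveragePool 5x5+3(V)")
print(stem_conv, output_size(224, stem_conv))
print(aux_pool, output_size(14, aux_pool))
```

The two example labels follow the notation of the published diagram; their strides and padding modes are assumptions here, since this note quotes only the operation and kernel size for those blocks.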

The colour legend

🔵 Blue, convolutions and fully connected layers (Conv, FC): the only blocks with learnable weights, doing the feature extraction and the final scoring.

🔴 Red, pooling (MaxPool, AveragePool): parameter-free blocks that downsample or aggregate.

🟢 Green, concatenation and normalization (DepthConcat, LocalRespNorm): DepthConcat stacks an Inception module’s branches along the channel axis; LocalRespNorm is the now-obsolete local response normalization, used only in the stem.

🟡 Yellow, softmax (SoftmaxActivation): the classifier heads that turn the final features into class probabilities, including the two auxiliary outputs.

Read left to right, the network falls into four parts:

Part	Blocks, in order	What it does
Stem	🔵 `Conv 7x7+2` → 🔴 `MaxPool` → 🟢 `LocalRespNorm` → 🔵 `Conv 1x1` → 🔵 `Conv 3x3` → 🟢 `LocalRespNorm` → 🔴 `MaxPool`	a plain front end that brings the $224 \times 224$ input down to a $28 \times 28$ grid, so no module runs on a large map
Nine Inception modules	four parallel branches: 🔵 `1x1` ; 🔵 `1x1`→`3x3` ; 🔵 `1x1`→`5x5` ; 🔴 `MaxPool`→🔵 `1x1` ; all merged by 🟢 `DepthConcat`	multi-scale feature extraction, in three groups, with a 🔴 `MaxPool` halving the resolution between groups as the channels grow
Two auxiliary heads (training only)	🔴 `AveragePool 5x5` → 🔵 `Conv 1x1` → 🔵 `FC` → 🔵 `FC` → 🟡 `SoftmaxActivation`	feed `softmax0` and `softmax1`; removed at inference
Main head	🔴 `AveragePool 7x7` (global average pooling) → 🔵 `FC` → 🟡 `softmax2`	replaces a flattened dense head and produces the prediction
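The `DepthConcat` step in the table has a simple bookkeeping consequence: a module's output channel count is the sum of its four branch outputs, and the $1 \times 1$ reductions inside the branches never appear in it. The sketch below illustrates this with a hypothetical `InceptionSpec` record; the branch widths are assumed from the first two modules of the published network.

```python
from dataclasses import dataclass

@dataclass
class InceptionSpec:
    b1x1: int        # 1x1 branch output
    reduce3x3: int   # 1x1 bottleneck before the 3x3
    b3x3: int        # 3x3 branch output
    reduce5x5: int   # 1x1 bottleneck before the 5x5
    b5x5: int        # 5x5 branch output
    pool_proj: int   # 1x1 after the max-pool branch

    def out_channels(self) -> int:
        """Channels after DepthConcat: branch outputs only, no reductions."""
        return self.b1x1 + self.b3x3 + self.b5x5 + self.pool_proj

# Assumed widths for the first two modules.
module_a = InceptionSpec(64, 96, 128, 16, 32, 32)
module_b = InceptionSpec(128, 128, 192, 32, 96, 64)

print(module_a.out_channels())  # input width of module_b
print(module_b.out_channels())
```

Chaining modules is then a matter of feeding one module's `out_channels()` in as the next module's input width, which is how the channel count grows from group to group.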

Where the compute goes: reduce early, then go deep

A subtle consequence of the stem: by the time the first Inception module runs, the map is already down to $28 \times 28$, and every later module runs on a grid of that size or smaller. The expensive multi-branch modules therefore never touch a large feature map, which is the other half of why GoogLeNet is so cheap (the $1 \times 1$ bottlenecks are the first half). The rule it set (do the aggressive spatial reduction early and cheaply, then spend the depth and parameters on small maps) is one almost every architecture since has followed.
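The shrinking grid can be traced step by step. In the sketch below the stage list follows the table of parts, but the strides of the pooling layers (2 each) and the same-padding rule $\lceil n / s \rceil$ are assumptions taken from the published architecture, and the group names are made up.

```python
import math

# (stage name, stride); only strided stages change the resolution.
STAGES = [
    ("Conv 7x7+2", 2),
    ("MaxPool", 2),
    ("Conv 1x1", 1),
    ("Conv 3x3", 1),
    ("MaxPool", 2),
    ("Inception group 1", 1),
    ("MaxPool", 2),
    ("Inception group 2", 1),
    ("MaxPool", 2),
    ("Inception group 3", 1),
]

def trace(size, stages):
    """Return (name, spatial size after the stage) for each stage in order."""
    sizes = []
    for name, stride in stages:
        size = math.ceil(size / stride)  # same-padding convention (assumed)
        sizes.append((name, size))
    return sizes

sizes = trace(224, STAGES)
for name, size in sizes:
    print(f"{name:<18} {size} x {size}")

# Fraction of the input's spatial positions seen by the first module.
first_module = sizes[5][1]
print(f"positions vs. input: 1/{(224 // first_module) ** 2}")
```

Under these assumptions the first Inception group works on 1/64 of the input's spatial positions, and the last group on a $7 \times 7$ grid, which matches the `AveragePool 7x7` in the main head.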

Twenty-two weight layers deep, GoogLeNet is far deeper than AlexNet, yet the diagram is mostly the same module repeated: once the Inception module is right, the network is largely its repetition at shrinking resolutions.

Designed to a budget

Unlike its predecessors, GoogLeNet was explicitly designed under a fixed computational budget, with deployment on modest hardware in mind. Its efficiency was not a happy accident but the stated objective, which is why so much of the design is about doing more per parameter and per FLOP rather than simply adding depth and width. In spirit it is the ancestor of the mobile-first efficient backbones that came later.
