Inception v3

Abstract

Inception v3 (Szegedy et al., Rethinking the Inception Architecture for Computer Vision, 2015) refines GoogLeNet into a set of reusable design principles rather than a single clever module. Its headline techniques, factorizing convolutions (including the asymmetric $n \times 1$ and $1 \times n$ split), efficient grid reduction, and label smoothing, are now standard tools. This note covers each and the principle behind them.

Inception v3 is the third iteration of the Inception family. The Inception module itself (the parallel multi-branch block with $1 \times 1$ bottlenecks) and its name both date back to GoogLeNet (Inception v1); this note is about what the later versions changed inside it.

The refinements paid off. Inception v3 reached about 3.6% top-5 error on ImageNet-1k (reported for an ensemble of four models with multi-crop evaluation), below the roughly $5\%$ human estimate. It has 48 learnable layers (more than $200$ operations in all) yet only about 22 million parameters, roughly a sixth of VGG’s $138$ million.

The guiding principle: factorize, and never bottleneck the representation

Two ideas run through the whole design.

The first is factorization: replace one expensive operation with a sequence of cheaper ones that compute something comparable.
The second is a warning to avoid representational bottlenecks: the spatial size and channel depth of the feature maps should shrink gently and gradually, never collapse abruptly, since information thrown away early cannot be recovered by any later layer.

Factorizing convolutions

Inception v3 pushes the factorization idea of VGG further, in two stages.

Into smaller square kernels. A $5 \times 5$ convolution is replaced by two stacked $3 \times 3$ convolutions, and a $7 \times 7$ by three, exactly the receptive-field equivalence proved in stacking small kernels, saving parameters and adding a nonlinearity at each split.
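The saving can be checked by counting spatial weights per input-output channel pair. The sketch below is illustrative only: the helper names are made up for this note, biases are ignored, and every layer in a stack is assumed to have the same channel width.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return num_layers * (kernel - 1) + 1


def weights_per_channel_pair(kernel_h, kernel_w, num_layers=1):
    """Spatial weights per (input channel, output channel) pair, biases ignored."""
    return num_layers * kernel_h * kernel_w


# (large kernel size, number of stacked 3x3 layers that replace it)
for big, layers in [(5, 2), (7, 3)]:
    # The stack must see the same window as the kernel it replaces.
    assert stacked_receptive_field(layers) == big
    single = weights_per_channel_pair(big, big)
    stacked = weights_per_channel_pair(3, 3, num_layers=layers)
    saving = 1 - stacked / single
    print(f"{big}x{big}: {single} weights, {layers} stacked 3x3: {stacked} ({saving:.0%} fewer)")
```

The stack is not only cheaper: each extra layer brings its own nonlinearity, which a single large kernel does not have.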

Into asymmetric kernels. A $3 \times 3$ convolution is further split into a $1 \times 3$ convolution followed by a $3 \times 1$ convolution. The two one-dimensional passes cover the same $3 \times 3$ region but cost less:

$$\underbrace{3 \times 3 = 9}_{\text{square kernel}} \;\longrightarrow\; \underbrace{1 \times 3 + 3 \times 1 = 6}_{\text{asymmetric split}},$$

a reduction of a third in the weights per input-output channel pair, with the saving growing for larger $n \times n$ kernels.
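For a general $n \times n$ kernel the same count can be written out as a worked step (assuming, for simplicity, the same channel width in both one-dimensional passes):

$$\frac{n^2 - 2n}{n^2} = 1 - \frac{2}{n}, \qquad n = 3:\ \tfrac{1}{3} \approx 33\%, \qquad n = 7:\ \tfrac{5}{7} \approx 71\%.$$

The relative saving approaches $100\%$ as $n$ grows, which is why the split pays off most for the $7 \times 7$ case.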

Asymmetric factorization is the separable-filter trick from classical vision

Splitting an $n \times n$ convolution into $1 \times n$ then $n \times 1$ is exactly the separable filter idea from classical image processing: a 2D filter that is the outer product of two 1D filters (a rank-1 kernel) can be applied as two cheap 1D passes. The Gaussian blur and the Sobel edge operator are separable for precisely this reason. Inception v3 bets that many useful learned filters are approximately separable, so paying for a full 2D kernel is largely wasteful. This is the same low-rank intuition that the $1 \times 1$ bottleneck applies across channels rather than across space.
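The separability claim is easy to verify by hand on the Sobel operator. The following is a minimal single-channel sketch in plain Python; the `correlate2d_valid` helper and the test image are invented for illustration and use "valid" cross-correlation with no padding.

```python
def correlate2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding ('valid' output)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


smooth = [1, 2, 1]   # 1D smoothing filter, applied down the columns
diff = [1, 0, -1]    # 1D difference filter, applied along the rows

# The 3x3 Sobel kernel is the outer product of the two 1D filters (rank 1).
sobel_x = [[s * d for d in diff] for s in smooth]

# A small deterministic integer image, so the comparison below is exact.
image = [[(3 * i * i + 5 * j + i * j) % 11 for j in range(6)] for i in range(6)]

full = correlate2d_valid(image, sobel_x)                        # one 3x3 pass: 9 weights
row_pass = correlate2d_valid(image, [diff])                     # 1x3 pass: 3 weights
two_pass = correlate2d_valid(row_pass, [[s] for s in smooth])   # 3x1 pass: 3 weights

assert full == two_pass
```

A learned pair of $1 \times n$ and $n \times 1$ layers with several intermediate channels is more expressive than a single rank-1 filter, so the identity above is the idealized case rather than a description of what the network learns.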

Where asymmetric factorization helps, and where it does not

The $n \times 1$ and $1 \times n$ split is most effective on medium-sized feature maps (grids of roughly $12$ to $20$ on a side). On very early, high-resolution layers it works poorly, so Inception v3 applies it only in the middle of the network. This is a recurring theme in architecture design: a technique is rarely universally good; it is good in a regime, and naming the regime is part of understanding it.

Efficient grid reduction

Downsampling poses a dilemma. Pooling first and then widening the channels wastes information (a representational bottleneck); widening first and then pooling is correct but expensive. Inception v3 sidesteps both by reducing the grid with a parallel block, one branch a stride-$2$ convolution and one branch a stride-$2$ pooling, concatenated. The map is downsampled and deepened at the same time, without the sharp bottleneck either order alone would create.
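The shape bookkeeping of such a parallel block can be sketched as follows. This is only shape arithmetic, not a layer implementation: the $3 \times 3$ window with no padding is an assumption (the text above fixes only the stride), and the channel counts in the example are illustrative.

```python
def out_size(size, kernel=3, stride=2, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def reduction_block_shape(height, width, c_in, conv_channels):
    """Output (height, width, channels) of a two-branch grid-reduction block."""
    h_out, w_out = out_size(height), out_size(width)
    conv_branch = (h_out, w_out, conv_channels)  # stride-2 convolution, free to widen
    pool_branch = (h_out, w_out, c_in)           # stride-2 pooling keeps the channel count
    # Both branches land on the same grid, so they can be concatenated on channels.
    assert conv_branch[:2] == pool_branch[:2]
    return (h_out, w_out, conv_branch[2] + pool_branch[2])


# Illustrative numbers: a 35x35 map with 288 channels and a 480-channel conv branch.
shape = reduction_block_shape(35, 35, c_in=288, conv_channels=480)
print(shape)
```

The grid is halved and the channel count grows in the same step, so there is no intermediate tensor that is both small and narrow.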

Label smoothing

Inception v3 introduced label smoothing, a regularizer that has since become standard. Instead of training against a hard one-hot target (probability $1$ on the true class, $0$ on all others), the target is softened: a large fraction of the probability mass stays on the true class and a small fraction is spread uniformly across the classes.

Why softening the labels helps

A hard one-hot target asks the network to drive the true logit to $+ \infty$ relative to the others, which encourages overconfidence and large weights, and never lets the loss reach zero on correct examples. Spreading a little mass onto the other classes removes that pressure: the network is rewarded for being confident but not for being certain, which improves calibration and generalization. It is a cheap, output-side complement to the input-side regularizers like dropout.
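The pressure toward unbounded logits can be made concrete with a toy setting: the true logit sits `gap` above all the others, which are equal. The sketch below assumes the smoothing mass is spread over all classes, and the values of `K` and `EPS` are illustrative.

```python
import math


def loss_at_gap(gap, num_classes, eps):
    """Cross-entropy when the true logit exceeds every other logit by `gap`."""
    # Log-probabilities for the logits (gap, 0, 0, ..., 0).
    log_z = math.log(math.exp(gap) + (num_classes - 1))
    log_p_true = gap - log_z
    log_p_other = -log_z
    # Smoothed target: eps spread over all classes, the rest on the true class.
    q_true = 1 - eps + eps / num_classes
    q_other = eps / num_classes
    return -(q_true * log_p_true + (num_classes - 1) * q_other * log_p_other)


K, EPS = 10, 0.1
gaps = [2.0, 5.0, 10.0, 20.0]

hard = [loss_at_gap(g, K, 0.0) for g in gaps]    # keeps falling as the gap grows
smooth = [loss_at_gap(g, K, EPS) for g in gaps]  # rises again past a finite gap

# With smoothing the loss is minimized where softmax matches the target exactly.
best_gap = math.log((1 - EPS + EPS / K) / (EPS / K))

for g, h, s in zip(gaps, hard, smooth):
    print(f"gap={g:5.1f}  hard={h:.6f}  smoothed={s:.4f}")
print(f"smoothed loss is minimized at gap={best_gap:.2f}")
```

With a hard target every increase in the gap is rewarded, however large it already is; with a smoothed target, pushing past `best_gap` makes the loss worse.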

Label smoothing, in one line of algebra

Training against a softened target is just cross-entropy against a mixture. In the paper's formulation the target is $(1 - \varepsilon)$ times the one-hot label plus $\varepsilon$ times the uniform distribution over all $K$ classes, so the loss splits into the ordinary cross-entropy, scaled by $(1 - \varepsilon)$, plus an $\varepsilon$-weighted term pulling every output toward the uniform distribution. That second term is a penalty against over-confidence, the output-side analogue of weight decay: weight decay keeps the weights from growing without bound, label smoothing keeps the logit gaps from growing without bound. It is why a label-smoothed network tends to be better calibrated, with predicted probabilities closer to the frequencies they claim.
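The split of the loss into two terms can be checked numerically. The sketch below assumes the smoothing mass is spread uniformly over all classes, which is the form under which the split is exact; the function names and the example logits are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy H(target, probs) for two distributions given as lists."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


def smooth_labels(true_class, num_classes, eps):
    """Mix a one-hot target with the uniform distribution over all classes."""
    uniform_share = eps / num_classes
    return [
        (1.0 - eps if k == true_class else 0.0) + uniform_share
        for k in range(num_classes)
    ]


logits = [2.0, 0.5, -1.0, 0.0]   # illustrative model outputs for 4 classes
true_class, eps = 0, 0.1
probs = softmax(logits)

one_hot = [1.0 if k == true_class else 0.0 for k in range(len(logits))]
uniform = [1.0 / len(logits)] * len(logits)
smoothed = smooth_labels(true_class, len(logits), eps)

direct = cross_entropy(smoothed, probs)
split = (1 - eps) * cross_entropy(one_hot, probs) + eps * cross_entropy(uniform, probs)

assert math.isclose(direct, split)
```

The uniform term is smallest when all outputs are equal, so it acts against the ordinary cross-entropy term whenever the network becomes very confident.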

Reading the architecture

The blocks are unlabelled. Convolutions, in orange, make up almost the whole network; the handful of other colours mark the pooling, the reductions, and the head.

The colour legend

🟠 Orange, convolution (Conv + batch normalization + ReLU): the workhorse, and very nearly every block in the figure.

🔵 Blue, pooling: the average-pool branch inside each Inception module, and the global average pool that opens the head.

🟢 Green, max pooling: the downsampling pools in the stem, at the far left.

🔴 Red, join and reduce: where a module’s parallel branches are concatenated and, between stages, where the spatial grid is shrunk.

🟣 Purple and 🟥 dark red, the head: dropout and the fully connected layer, then the softmax output.

Read left to right, the network is three families of Inception module at shrinking grid sizes, with a reduction between each. (Inception v3 takes a $299 \times 299$ input, slightly larger than the usual $224$.)

Stage	Grid	What it contains
Stem	$299 \to 35$	plain 🟠 convolutions and 🟢 max pools that downsample before any module runs
Inception-A ($\times 3$)	$35 \times 35$	the $3 \times 3$-factorized modules (a $5 \times 5$ becomes two $3 \times 3$)
🔴 reduction	$35 \to 17$	shrink the grid without a representational bottleneck
Inception-B (repeated)	$17 \times 17$	the asymmetric $1 \times 7$, $7 \times 1$ modules
🔴 reduction	$17 \to 8$	shrink again
Inception-C ($\times 2$)	$8 \times 8$	the wide modules, factorized branches placed side by side
Head	$8 \to 1$	🔵 global average pooling → 🟣 dropout and fully connected → 🟥 softmax

The short branch that drops downward to its own 🟥 softmax is the single auxiliary classifier, attached to the $17 \times 17$ stage and used only during training. Everything else is the three module types, repeated. This is why the figure is a long, mostly uniform chain rather than the individually hand-laid layers of AlexNet: once the module types are fixed, the architecture is largely their repetition at shrinking resolutions, ending like GoogLeNet and ResNet in global average pooling rather than a heavy dense head.
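The grid sizes in the table can be reproduced with window arithmetic. The stem layout below is an assumption: it follows the layer sequence used in common reference implementations (for example torchvision's), which may differ in detail from the figure, and the reductions are modelled as $3 \times 3$ windows with stride $2$ and no padding.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (description, kernel, stride, padding); an assumed stem, see the note above.
STEM = [
    ("conv 3x3, stride 2", 3, 2, 0),
    ("conv 3x3", 3, 1, 0),
    ("conv 3x3, padded", 3, 1, 1),
    ("max pool 3x3, stride 2", 3, 2, 0),
    ("conv 1x1", 1, 1, 0),
    ("conv 3x3", 3, 1, 0),
    ("max pool 3x3, stride 2", 3, 2, 0),
]

# The two grid reductions between the module families.
REDUCTIONS = [
    ("reduction 35 -> 17", 3, 2, 0),
    ("reduction 17 -> 8", 3, 2, 0),
]


def trace(size, layers):
    """Return the grid size before the first layer and after every layer."""
    sizes = [size]
    for _description, kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


stem_sizes = trace(299, STEM)
stage_sizes = trace(stem_sizes[-1], REDUCTIONS)
print(" -> ".join(str(s) for s in stem_sizes))
print(" -> ".join(str(s) for s in stage_sizes))
```

The Inception modules themselves keep the grid size fixed, so only the stem and the two reductions appear in the trace.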

The smaller pieces

Batch normalization entered the Inception line (from v2 onward) to counter internal covariate shift, the drift in each layer’s input distribution as the layers below it keep updating; v3 uses it throughout, including inside the auxiliary classifier, where it doubles as a regularizer. See batch normalization.
The network is trained with RMSProp rather than plain SGD.

The lasting contribution

Inception v3’s value is less any one module than its catalogue of transferable principles: factor large operations into small ones, keep the representation from collapsing, regularize the targets as well as the weights. These outlived the specific Inception family and are visible in nearly every architecture since.
