Abstract
Inception v3 (Szegedy et al., Rethinking the Inception Architecture for Computer Vision, 2015) refines GoogLeNet into a set of reusable design principles rather than a single clever module. Its headline techniques, factorizing convolutions (including the asymmetric $1 \times n$ and $n \times 1$ split), efficient grid reduction, and label smoothing, are now standard tools. This note covers each and the principle behind them.
Inception v3 is the third iteration of the Inception family. The Inception module itself (the parallel multi-branch block with $1 \times 1$ bottlenecks) and its name both originate with GoogLeNet (Inception v1); this note is about what the later versions changed inside it.
The refinements paid off. Inception v3 reached about 3.6% top-5 error on ImageNet-1k (as a multi-crop ensemble), below the roughly 5% human estimate, with 48 learnable layers (and many more individual operations in all) yet only about 22 million parameters, several times fewer than VGG-16's roughly 138 million.
The guiding principle: factorize, and never bottleneck the representation
Two ideas run through the whole design.
- the first is factorization: replace one expensive operation with a sequence of cheaper ones that compute something comparable;
- the second is a warning, avoid representational bottlenecks: the spatial size and channel depth of the feature maps should shrink gently and gradually, never collapse abruptly, since information thrown away early cannot be recovered by any later layer.
Factorizing convolutions
Inception v3 pushes the factorization idea of VGG further, in two stages.
Into smaller square kernels. A $5 \times 5$ convolution is replaced by two stacked $3 \times 3$ convolutions, and a $7 \times 7$ by three, exactly the receptive-field equivalence proved in stacking small kernels, saving parameters and adding a nonlinearity at each split.
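The receptive-field equivalence and the weight saving can be checked with a few lines of arithmetic. This is a minimal sketch, assuming stride-1 convolutions, a constant channel width, and no biases; the function names are illustrative and not taken from any library.

```python
def stacked_receptive_field(kernel: int, depth: int) -> int:
    """Side length of the input region seen by `depth` stacked stride-1
    kernel x kernel convolutions."""
    return depth * (kernel - 1) + 1


def weights_per_channel_pair(kernel: int, depth: int = 1) -> int:
    """Weights per (input channel, output channel) pair for a stack of
    `depth` square convolutions. Assumes constant channel width, no biases."""
    return depth * kernel * kernel


for big, depth in [(5, 2), (7, 3)]:
    # The stack of 3x3 kernels must cover the same region as the big kernel.
    assert stacked_receptive_field(3, depth) == big
    full = weights_per_channel_pair(big)
    stacked = weights_per_channel_pair(3, depth)
    print(f"{big}x{big}: {full} weights vs {stacked} for {depth} stacked 3x3 "
          f"({1 - stacked / full:.0%} saved)")
```

The saving is per channel pair; with different input and output widths the intermediate width of the stack also enters the count.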
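Into asymmetric kernels. An $n \times n$ convolution is further split into a $1 \times n$ convolution followed by an $n \times 1$ convolution. The two one-dimensional passes cover the same region but cost less:
$$n^2 \;\longrightarrow\; 2n, \qquad \text{e.g. } 9 \to 6 \text{ for } n = 3,$$
a reduction of about a third per channel for $n = 3$, with the saving growing for larger $n$.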
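How the saving from the asymmetric split grows with kernel size can be tabulated directly. A small sketch, assuming the split replaces one $n \times n$ kernel by a $1 \times n$ kernel followed by an $n \times 1$ kernel at the same channel width, with biases ignored; `asymmetric_saving` is a hypothetical helper name.

```python
def asymmetric_saving(n: int) -> float:
    """Fraction of weights saved by replacing one n x n kernel with a
    1 x n kernel followed by an n x 1 kernel (same channel width, no biases)."""
    full = n * n      # weights in the square kernel
    split = 2 * n     # weights in the two one-dimensional kernels
    return 1 - split / full


for n in (3, 5, 7):
    print(f"n = {n}: {n * n} -> {2 * n} weights, "
          f"{asymmetric_saving(n):.0%} saved")
```

The saved fraction is $1 - 2/n$, so the split pays off more the larger the kernel it replaces.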
Asymmetric factorization is the separable-filter trick from classical vision
Splitting an $n \times n$ convolution into $1 \times n$ then $n \times 1$ is exactly the separable filter idea from classical image processing: a 2D filter that is the outer product of two 1D filters (a rank-1 kernel) can be applied as two cheap 1D passes. The Gaussian blur and the Sobel edge operator are separable for precisely this reason. Inception v3 bets that many useful learned filters are approximately separable, so paying for a full 2D kernel is largely wasteful. This is the same low-rank intuition that the $1 \times 1$ bottleneck applies across channels rather than across space.
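The separability claim is easy to verify numerically for the Sobel operator. The sketch below is pure Python on nested lists, with a hypothetical helper `correlate2d_valid` and a made-up test image; it builds the 2D kernel as an outer product and checks that two 1D passes give the same result as the single 2D pass.

```python
def correlate2d_valid(image, kernel):
    """'Valid' 2D cross-correlation on nested lists (the operation that
    deep-learning frameworks call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


smooth = [1, 2, 1]    # vertical 1D factor (applied down a column)
diff = [-1, 0, 1]     # horizontal 1D factor (applied along a row)

# Rank-1 kernel: outer product of the two 1D factors (the Sobel x operator).
sobel_x = [[s * d for d in diff] for s in smooth]

# Arbitrary integer test image, so the comparison below is exact.
image = [[(3 * r + 5 * c * c) % 11 for c in range(6)] for r in range(6)]

full = correlate2d_valid(image, sobel_x)                        # one 3x3 pass
row_pass = correlate2d_valid(image, [diff])                     # 1x3 pass
two_pass = correlate2d_valid(row_pass, [[s] for s in smooth])   # then 3x1 pass

assert two_pass == full
```

A learned $n \times n$ filter is generally not exactly rank-1, so in a network the factorized pair is a restricted (and, with a nonlinearity in between, different) function class rather than an exact rewrite.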
Where asymmetric factorization helps, and where it does not
The $1 \times n$ and $n \times 1$ split is most effective on medium-sized feature maps (grids of roughly $12$ to $20$ on a side). On very early, high-resolution layers it works poorly, so Inception v3 applies it only in the middle of the network. This is a recurring theme in architecture design: a technique is rarely universally good. It is good in a regime, and naming the regime is part of understanding it.
Efficient grid reduction
Downsampling poses a dilemma. Pooling first and then widening the channels wastes information (a representational bottleneck); widening first and then pooling is correct but expensive. Inception v3 sidesteps both by reducing the grid with a parallel block, one branch a stride-$2$ convolution and one branch a stride-$2$ pooling, concatenated. The map is downsampled and deepened at the same time, without the sharp bottleneck either order alone would create.
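The trade-off between the three orderings can be made concrete with a rough cost model. This is a simplified sketch, not the actual Inception v3 reduction block (which has more branches): it assumes a $d \times d$ grid with $k$ channels reduced to a $d/2 \times d/2$ grid with $2k$ channels, counts only multiply-accumulates in the convolutions, and uses illustrative values for `d`, `k`, and `kernel`.

```python
def conv_macs(grid: int, c_in: int, c_out: int, kernel: int = 3) -> int:
    """Multiply-accumulates for a convolution producing a grid x grid output."""
    return grid * grid * c_in * c_out * kernel * kernel


def compare_reductions(d: int, k: int, kernel: int = 3) -> dict:
    """Cost and narrowest tensor for three ways of going from
    d x d x k to (d/2) x (d/2) x 2k. Assumes stride-2 ops halve the grid."""
    half = d // 2
    return {
        # Pool to half grid (k channels), then widen to 2k: cheap but narrow.
        "pool_then_conv": {"macs": conv_macs(half, k, 2 * k, kernel),
                           "narrowest": half * half * k},
        # Widen to 2k on the full grid, then pool: no bottleneck but costly.
        "conv_then_pool": {"macs": conv_macs(d, k, 2 * k, kernel),
                           "narrowest": half * half * 2 * k},
        # Stride-2 conv (k -> k) and stride-2 pool (k) in parallel, concatenated.
        "parallel": {"macs": conv_macs(half, k, k, kernel),
                     "narrowest": half * half * 2 * k},
    }


for name, stats in compare_reductions(d=32, k=64).items():
    print(f"{name:15s} macs={stats['macs']:>12,d} narrowest={stats['narrowest']:,d}")
```

Under these assumptions the parallel block is the cheapest of the three while never squeezing the signal below the size of the final output, which is the point of the design.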
Label smoothing
Inception v3 introduced label smoothing, a regularizer that has since become standard. Instead of training against a hard one-hot target (probability $1$ on the true class, $0$ on all others), the target is softened: most of the probability mass stays on the true class and a small fraction $\varepsilon$ is spread uniformly across the classes.
Why softening the labels helps
A hard one-hot target asks the network to drive the true logit to $+\infty$ relative to the others: the loss never reaches zero on correctly classified examples, so the pressure toward overconfidence and large weights never lets up. Spreading a little mass onto the other classes removes that pressure: the network is rewarded for being confident but not for being certain, which improves calibration and generalization. It is a cheap, output-side complement to input-side regularizers like dropout.
Label smoothing, in one line of algebra
Training against a softened target is just cross-entropy against a mixture: the target becomes $1 - \varepsilon$ times the one-hot label plus $\varepsilon$ times the uniform distribution over all $K$ classes, so the loss splits into $(1 - \varepsilon)$ times the ordinary cross-entropy plus an $\varepsilon$-weighted term pulling every output toward the uniform distribution. That second term is a penalty against over-confidence, the output-side analogue of weight decay: weight decay keeps the weights from growing without bound, and label smoothing keeps the logit gaps from growing without bound. It is why a label-smoothed network is better calibrated, with predicted probabilities closer to the frequencies they claim.
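The split of the loss into two terms can be confirmed numerically. The sketch below assumes the formulation in which $\varepsilon$ is spread uniformly over all $K$ classes, the true class included; the logits, $K = 4$, and $\varepsilon = 0.1$ are illustrative values, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs):
    """Cross-entropy H(target, probs) for two discrete distributions."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def smooth_labels(true_class, num_classes, eps):
    """(1 - eps) * one-hot + eps * uniform over all classes."""
    return [(1 - eps) * (1.0 if k == true_class else 0.0) + eps / num_classes
            for k in range(num_classes)]


K, y, eps = 4, 0, 0.1                      # illustrative values
probs = softmax([2.0, 0.5, -1.0, 0.1])     # made-up logits
one_hot = [1.0 if k == y else 0.0 for k in range(K)]
uniform = [1.0 / K] * K

smoothed_loss = cross_entropy(smooth_labels(y, K, eps), probs)
split_loss = ((1 - eps) * cross_entropy(one_hot, probs)
              + eps * cross_entropy(uniform, probs))

# Cross-entropy is linear in the target, so the two must agree.
assert abs(smoothed_loss - split_loss) < 1e-12
```

The uniform term grows whenever any predicted probability approaches zero, which is what discourages unbounded logit gaps.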
Reading the architecture

The blocks are unlabelled. Convolutions, in orange, make up almost the whole network; the handful of other colours mark the pooling, the reductions, and the head.
The colour legend
- 🟠 Orange, convolution (Conv + batch normalization + ReLU): the workhorse, and very nearly every block in the figure.
- 🔵 Blue, pooling: the average-pool branch inside each Inception module, and the global average pool that opens the head.
- 🟢 Green, max pooling: the downsampling pools in the stem, at the far left.
- 🔴 Red, join and reduce: where a module’s parallel branches are concatenated and, between stages, where the spatial grid is shrunk.
- 🟣 Purple and 🟥 dark red, the head: dropout and the fully connected layer, then the softmax output.
Read left to right, the network is three families of Inception module at shrinking grid sizes, with a reduction between each. (Inception v3 takes a $299 \times 299$ input, slightly larger than the usual $224 \times 224$.)
| Stage | Grid | What it contains |
|---|---|---|
| Stem | $299 \to 35$ | plain 🟠 convolutions and 🟢 max pools that downsample before any module runs |
| Inception-A ($\times 3$) | $35 \times 35$ | the $3 \times 3$-factorized modules (a $5 \times 5$ becomes two $3 \times 3$) |
| 🔴 reduction | $35 \to 17$ | shrink the grid without a representational bottleneck |
| Inception-B (repeated) | $17 \times 17$ | the asymmetric $1 \times 7$, $7 \times 1$ modules |
| 🔴 reduction | $17 \to 8$ | shrink again |
| Inception-C ($\times 2$) | $8 \times 8$ | the wide modules, factorized branches placed side by side |
| Head | $8 \to 1$ | 🔵 global average pooling → 🟣 dropout and fully connected → 🟥 softmax |
The short branch that drops downward to its own 🟥 softmax is the single auxiliary classifier, attached to the $17 \times 17$ stage and used only during training. Everything else is the three module types, repeated. This is why the figure is a long, mostly uniform chain rather than the individually hand-laid layers of AlexNet: once the module types are fixed, the architecture is largely their repetition at shrinking resolutions, ending like GoogLeNet and ResNet in global average pooling rather than a heavy dense head.
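The shrinking grid sizes follow from ordinary output-size arithmetic. The sketch below traces a $299 \times 299$ input through a stem and the two reductions. The stem layer list is an assumption, modelled on the kind of layout found in common open-source implementations rather than taken from this note, and the reductions are assumed to be unpadded $3 \times 3$ stride-2 operations.

```python
def out_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output side length of a convolution or pooling layer on an n x n grid."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding). Assumed stem layout, for illustration only.
STEM = [
    ("conv 3x3 / 2", 3, 2, 0),
    ("conv 3x3", 3, 1, 0),
    ("conv 3x3, padded", 3, 1, 1),
    ("max pool 3x3 / 2", 3, 2, 0),
    ("conv 1x1", 1, 1, 0),
    ("conv 3x3", 3, 1, 0),
    ("max pool 3x3 / 2", 3, 2, 0),
]


def trace(n: int, layers) -> list:
    """Grid side length after each layer, starting with the input size."""
    sizes = [n]
    for _name, kernel, stride, padding in layers:
        n = out_size(n, kernel, stride, padding)
        sizes.append(n)
    return sizes


stem_sizes = trace(299, STEM)
grid_a = stem_sizes[-1]                           # grid for the first module family
grid_b = out_size(grid_a, kernel=3, stride=2)     # after the first reduction
grid_c = out_size(grid_b, kernel=3, stride=2)     # after the second reduction

print("stem:", " -> ".join(map(str, stem_sizes)))
print("module grids:", grid_a, grid_b, grid_c)
```

Global average pooling then collapses the final grid to a single vector per image, whatever its side length.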
The smaller pieces
- Batch normalization entered the Inception line (from v2 onward) to counter internal covariate shift, the drift in each layer’s input distribution as the layers below it keep updating; v3 uses it throughout, including inside the auxiliary classifier, where it doubles as a regularizer. See batch normalization.
- The network is trained with RMSProp rather than plain SGD.
The lasting contribution
Inception v3’s value is less any one module than its catalogue of transferable principles: factor large operations into small ones, keep the representation from collapsing, regularize the targets as well as the weights. These outlived the specific Inception family and are visible in nearly every architecture since.