Abstract

By 2017 the ILSVRC had been won and effectively solved: models had pushed top-5 error below the estimated human rate of about 5%, and adding layers bought ever smaller gains. The frontier shifted from accuracy at any cost to the same accuracy at a fraction of the cost, pushed by the need to run vision models on phones and embedded hardware. This note frames that shift and names the reusable strategies that define the efficient backbones of this section.

The networks that won the ILSVRC, from AlexNet to ResNet, chased accuracy with little regard for size or speed; the parameter explosion was tolerated as the price of winning. Once the benchmark saturated, a wave of architectures inverted the question:

Question

How few parameters, and how little computation, can still reach a target accuracy?

Smaller models are cheaper to store, usually faster at inference, and less costly to train; they are what makes deep vision deployable away from the data center.

The plan: strategies, not architectures

This section does not walk through each of these networks in detail. What matters, and what carries forward, is the compact set of reusable strategies each one contributed. Those strategies combine into a single highly efficient family, EfficientNet, and into most of the convolutional backbones that are state of the art today. A modern efficient CNN is, in effect, a classical backbone plus the techniques collected in this section.

The roll-call: efforts focused on reducing the parameter count

Six lines of work, each contributing one transferable idea.

| Work | Year | What it introduced |
|---|---|---|
| SqueezeNet (Iandola et al.) | 2016 | AlexNet-level accuracy with roughly 50× fewer parameters |
| Xception (Chollet) | 2017 | depthwise separable convolution, dropped into an Inception-v3-style network |
| MobileNetV2 (Sandler et al.) | 2018 | the inverted residual block, tailored to mobile-scale models |
| MNASNet (Tan et al.) | 2019 | MobileNet and SENet blocks assembled by an automated Neural Architecture Search (NAS) driven by reinforcement learning, optimised for on-device latency |
| EfficientNet (Tan and Le) | 2019 | state-of-the-art accuracy through compound scaling of a MNASNet-like baseline |
| Knowledge Distillation (Hinton et al.) | 2015 | a teacher-student recipe that transfers a large model’s knowledge into a small one |

The last entry is the odd one out, by date and by kind: knowledge distillation is not an architecture but a training technique, and it predates the rest. It belongs here because it chases the same goal, a small model that behaves like a large one, from the training side rather than the design side.
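The teacher-student recipe can be made concrete as a loss function. What follows is a minimal sketch in plain Python, assuming the common formulation in which the student matches the teacher's temperature-softened outputs while still fitting the true label. The names `distillation_loss`, `temperature`, and `alpha`, along with their default values and the example logits, are illustrative assumptions and do not come from this note.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; a higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    `temperature` and `alpha` are hypothetical hyperparameters chosen
    only for illustration.
    """
    # Soft-target term: cross-entropy between the softened teacher and
    # the softened student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # The temperature**2 factor keeps the soft term's gradients on a scale
    # comparable to the hard term's as the temperature changes.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


# Made-up logits for a 3-class problem whose true class is 0.
teacher = [5.0, 1.0, -2.0]
close_student = [4.5, 1.2, -1.5]
far_student = [-2.0, 1.0, 5.0]

print(distillation_loss(close_student, teacher, label=0))
print(distillation_loss(far_student, teacher, label=0))
```

A student whose logits track the teacher's gets the smaller loss. In actual training the teacher's logits are held fixed and only the student's weights are updated.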

The notes that follow develop the architectural strategies from this list, above all the depthwise separable convolution and the inverted residual block, and then show how EfficientNet composes them with a principled rule for scaling depth, width, and resolution together. Knowledge distillation is treated on its own, as the training-time route to the same end.
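As a preview of why the depthwise separable convolution matters, the sketch below compares the weight count of a standard convolution with that of its separable replacement. It assumes square k × k kernels and ignores bias terms. The function names and the layer sizes (3 × 3 kernel, 128 input channels, 256 output channels) are hypothetical choices for illustration, not figures from any of the networks above.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    # Every output channel has its own k x k filter over every input channel.
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 128, 256
standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)

print(standard, separable, round(standard / separable, 1))
# prints: 294912 33920 8.7
```

For a fixed output feature-map size, the multiply-accumulate count shrinks by the same factor, since both layer types apply their weights once per output position.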