Abstract
By 2017 the ILSVRC had been won and effectively solved: a machine had crossed the estimated human error rate, and adding layers bought ever smaller gains. The frontier shifted from accuracy at any cost to the same accuracy at a fraction of the cost, pushed by the need to run vision models on phones and embedded hardware. This note frames that shift and names the reusable strategies that define the efficient backbones of this section.
The networks that won the ILSVRC, from AlexNet to ResNet, chased accuracy with little regard for size or speed; the parameter explosion was tolerated as the price of winning. Once the benchmark saturated, a wave of architectures inverted the question:
Question
How few parameters, and how little computation, can still reach a target accuracy?
Smaller models are cheaper to store, faster at inference, and generally cheaper to train, and they are what makes deep vision deployable away from the data center.
The plan: strategies, not architectures
This section does not walk through each of these networks in detail. What matters, and what carries forward, is the compact set of reusable strategies each one contributed. Those strategies combine into a single highly efficient family, EfficientNet, and into most of the convolutional backbones that are state of the art today. A modern efficient CNN is, in effect, the classical backbones plus the techniques collected in this section.
The roll-call: efforts focused on reducing the parameter count
Six lines of work, each contributing one transferable idea.
Work | Year | What it introduced
SqueezeNet (Iandola et al.) | 2016 | AlexNet-level accuracy with roughly 50× fewer parameters
Xception (Chollet) | 2017 | depthwise separable convolution, dropped into an Inception-v3-style network
MobileNetV2 (Sandler et al.) | 2018 | the inverted residual block, tailored to mobile-scale models
MNASNet (Tan et al.) | 2019 | MobileNet and SENet blocks assembled by an automated Neural Architecture Search (NAS) driven by reinforcement learning, optimised for on-device latency
EfficientNet (Tan and Le) | 2019 | state-of-the-art accuracy through compound scaling of a MNASNet-like baseline
Knowledge Distillation (Hinton et al.) | 2015 | a teacher-student recipe that transfers a large model’s knowledge into a small one

The last entry is the odd one out, by date and by kind: knowledge distillation is not an architecture but a training technique, and it predates the rest. It belongs here because it chases the same goal, a small model that behaves like a large one, from the training side rather than the design side.
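To make the parameter savings concrete, the depthwise separable convolution listed above can be compared with a standard convolution by counting weights. The sketch below is illustrative only: the function names and the layer sizes (128 input channels, 256 output channels, a 3×3 kernel) are assumptions chosen for the example, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 128, 256, 3

std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = std / sep

print(f"standard:  {std} weights")
print(f"separable: {sep} weights")
print(f"reduction: {ratio:.2f}x")
```

With these sizes the separable layer needs roughly 8.7 times fewer weights. The saving is 1 / (1/c_out + 1/k²), so it approaches k² as the number of output channels grows.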
The notes that follow develop the architectural strategies from this list, above all the depthwise separable convolution and the inverted residual block, and then show how EfficientNet composes them with a principled rule for scaling depth, width, and resolution together. Knowledge distillation is treated on its own, as the training-time route to the same end.
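As a preview of that scaling rule, here is a minimal sketch of compound scaling. A single coefficient φ sets the depth, width, and resolution multipliers as α^φ, β^φ, and γ^φ. The constants used are the ones Tan and Le report for their baseline (α = 1.2, β = 1.1, γ = 1.15); the function names are hypothetical, and the cost estimate assumes the usual approximation that FLOPs grow with depth · width² · resolution².

```python
# Base multipliers reported by Tan and Le for the EfficientNet baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    return depth, width, resolution


def flops_multiplier(phi):
    """Approximate FLOPs growth, assuming FLOPs ~ depth * width^2 * resolution^2."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

Because α · β² · γ² is about 1.92, each unit step in φ roughly doubles the computational cost while growing all three dimensions together, not one at a time.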