Abstract
Most modern computer-vision systems, whatever the task, are built from three components in series: a backbone that extracts features from the image, an optional neck that aggregates those features across scales, and a head that turns them into the task’s output. This note defines the three, explains why the split is useful, and sets the frame for the classical backbones that follow (AlexNet, VGG, GoogLeNet, ResNet).
In the image domain, where deep learning first matured, a recognizer is rarely a single undifferentiated network. It is assembled from parts with distinct jobs, and naming them makes the rest of the chapter easier to read.

The backbone: a feature extractor
The backbone is the convolutional network responsible for feature extraction. Whatever the downstream task (classification, detection, segmentation), the backbone is the component that understands the content of the image, by turning raw pixels into a stack of feature maps that make that content explicit.
This is precisely the hierarchy of features developed earlier in the chapter: edges and textures low in the network, object parts and whole objects high up.
One backbone, one dataset
The classical backbones below are all trained on a single large dataset, ImageNet, whose images are resized to a standard resolution (typically 224 × 224), a multiple of 32 chosen so that the five downsampling steps of a typical backbone, each halving the resolution, land on a clean 7 × 7 map (the same arithmetic worked out for VGG). A backbone trained once on ImageNet is then reused on other tasks by swapping the head, which is the basis of transfer learning. The reusable, task-agnostic nature of the feature extractor is exactly why it earns a name of its own.
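The downsampling arithmetic can be checked in a few lines. This is a minimal sketch that assumes a 224 × 224 input and five stages that each halve the resolution; `feature_map_sizes` is a hypothetical helper written for this note, not a library function.

```python
from typing import List


def feature_map_sizes(input_size: int, num_stages: int = 5, stride: int = 2) -> List[int]:
    """Return the spatial side length after each downsampling stage."""
    total_stride = stride ** num_stages  # 2**5 = 32 for the assumed setup
    if input_size % total_stride != 0:
        raise ValueError(
            f"input size {input_size} is not a multiple of {total_stride}"
        )
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size //= stride  # each stage divides the side length by the stride
        sizes.append(size)
    return sizes


print(feature_map_sizes(224))  # [112, 56, 28, 14, 7]
```

An input whose side is not a multiple of 32 (225, say) is rejected here, which is the sense in which 224 lands on a "clean" final map.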
The backbone is reusable: transfer learning
Because the backbone learns a general-purpose feature space and the head makes a task-specific decision, the head can be detached and the backbone’s features reused directly. The feature vector it produces can be fed to a different classifier entirely (an SVM, a random forest, a gradient-boosted tree), or a fresh head can be trained on a new dataset or task while the backbone is kept, either frozen or only lightly fine-tuned. Reusing a backbone trained on one problem to solve another is the heart of transfer learning, and it works for exactly the reason the components are split this way: general features below, a task-specific decision above.
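The frozen-backbone, fresh-head pattern can be sketched with NumPy alone. Everything here is an illustrative assumption: a fixed random projection stands in for a pretrained backbone, the data are synthetic, and `backbone` and `train_head` are hypothetical names. The point is only that the head is trained while the backbone's weights are left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a fixed random projection plus ReLU.
# ASSUMPTION: a real backbone would be a CNN with weights from ImageNet pretraining.
W_backbone = rng.normal(size=(64, 16)) / np.sqrt(64)


def backbone(x):
    """Map raw inputs (n, 64) to feature vectors (n, 16). Never updated below."""
    return np.maximum(x @ W_backbone, 0.0)


def train_head(features, labels, steps=200, lr=0.1):
    """Fit a logistic-regression head on fixed features by gradient descent."""
    w = np.zeros(features.shape[1])
    b = 0.0
    losses = []
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        loss = -np.mean(labels * np.log(probs + 1e-12)
                        + (1.0 - labels) * np.log(1.0 - probs + 1e-12))
        losses.append(loss)
        error = probs - labels  # per-sample gradient of the loss w.r.t. the logit
        w -= lr * features.T @ error / len(labels)
        b -= lr * error.mean()
    return w, b, losses


# Toy "new task": binary labels on synthetic inputs (hypothetical data).
x_new = rng.normal(size=(200, 64))
y_new = (x_new[:, 0] > 0).astype(float)

W_before = W_backbone.copy()
features = backbone(x_new)  # extract once; the backbone stays frozen
w_head, b_head, losses = train_head(features, y_new)
print(f"head loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the features are computed once and only `w_head` and `b_head` change, the same `features` array could equally be handed to an SVM or a tree-based classifier.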
The neck: multi-scale aggregation
The neck is the bridge between the backbone and the head, and its job is to aggregate features across scales.
A plain convolutional pipeline ends its conv-and-pool chain with a single feature representation: the highest, coarsest level of abstraction. Yet every downsampling step along the way produced a feature map at a different scale, and those intermediate maps are discarded. The neck recovers them:
- it collects feature maps from several downsampling stages, each a representation of the image at a different scale;
- it combines them into a multi-resolution, multi-scale representation;
- the result is a model far less sensitive to the size of the object it must find, since a small object is sharp at a fine scale and a large one at a coarse scale.
The neck is what classification backbones omit
The classical backbones in this chapter are pure classification networks, and they have no neck: a single global feature vector is enough to name the dominant object. The neck becomes essential for detection and segmentation, where objects of many sizes must be located precisely. Its canonical form is the Feature Pyramid Network (FPN), which fuses the backbone’s coarse, semantically rich maps with its fine, spatially precise ones. The reason it helps is a division of knowledge: a deep, coarse map knows what is in the image but has lost where (large receptive field, low resolution), while a shallow, fine map knows where precisely but little about what. FPN carries the deep “what” back down and adds it to the shallow “where”, so every scale ends up both semantic and well-located.
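The top-down "carry the what back down and add it to the where" step can be sketched as follows. This is a simplified illustration, not the full FPN: it assumes all maps already share the same channel count (a real FPN uses 1×1 lateral convolutions for that) and it omits the smoothing convolution applied after each merge. `upsample2x` and `top_down_merge` are hypothetical helper names.

```python
import numpy as np


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)


def top_down_merge(pyramid):
    """Merge maps ordered fine -> coarse; returns merged maps in the same order.

    Each finer map receives the upsampled, already-merged coarser map by
    element-wise addition.
    """
    merged = [pyramid[-1]]  # the coarsest map is kept as is
    for lateral in reversed(pyramid[:-1]):
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]


# Toy pyramid: the fine map marks one precise location ("where"),
# the coarse map carries a uniform semantic signal ("what").
fine = np.zeros((4, 28, 28))
fine[0, 3, 5] = 10.0
middle = np.zeros((4, 14, 14))
coarse = np.ones((4, 7, 7))

merged = top_down_merge([fine, middle, coarse])
print([m.shape for m in merged])  # [(4, 28, 28), (4, 14, 14), (4, 7, 7)]
print(merged[0][0, 3, 5])         # 11.0: fine detail plus coarse signal
```

After merging, the finest map still holds the precise location marked in `fine`, and every position has also received the coarse map's signal.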
The head: task specialization
The head encodes the downstream task. It consumes the refined (and, where a neck is present, multi-scale) features and produces the final output. Its architecture follows the goal:
- a classification head returns class scores;
- a detection head returns bounding boxes and labels;
- a segmentation head returns a per-pixel label map;
- and so on, one head per family of computer-vision tasks.
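The difference between heads shows up most clearly in the shape of what they return. The sketch below feeds one feature tensor to three toy heads. All of it is an assumption made for illustration: the weights are random, the segmentation map stays at feature-map resolution (a real head would upsample to image size), and the detection head simply predicts one box and one label per location.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 7, 7
NUM_CLASSES = 3
features = rng.normal(size=(C, H, W))  # stand-in for backbone (or neck) output


def classification_head(fmap, weights):
    """Global average pool, then a linear layer: one score per class."""
    pooled = fmap.mean(axis=(1, 2))  # (C,)
    return weights @ pooled          # (NUM_CLASSES,)


def segmentation_head(fmap, weights):
    """A 1x1 convolution scores every location; argmax gives a label map."""
    scores = np.einsum("kc,chw->khw", weights, fmap)  # (NUM_CLASSES, H, W)
    return scores.argmax(axis=0)                      # (H, W)


def detection_head(fmap, box_weights, cls_weights):
    """One candidate box (4 numbers) and one label per feature-map location."""
    boxes = np.einsum("kc,chw->hwk", box_weights, fmap).reshape(-1, 4)
    labels = np.einsum("kc,chw->khw", cls_weights, fmap).argmax(axis=0).reshape(-1)
    return boxes, labels


cls_w = rng.normal(size=(NUM_CLASSES, C))  # hypothetical, untrained weights
box_w = rng.normal(size=(4, C))

scores = classification_head(features, cls_w)
label_map = segmentation_head(features, cls_w)
boxes, labels = detection_head(features, box_w, cls_w)
print(scores.shape, label_map.shape, boxes.shape, labels.shape)
# (3,) (7, 7) (49, 4) (49,)
```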
The three components at a glance
| Component | Role | Reused across tasks? | Examples |
|---|---|---|---|
| Backbone | extract features from the image | yes, this is its purpose | AlexNet, VGG, ResNet |
| Neck | aggregate features across scales | partly | FPN, PANet |
| Head | produce the task-specific output | no, it is task-specific | classification, detection, segmentation heads |
Recap: what a backbone is
A backbone is a CNN that takes a 2D image and returns a spatial representation of it: a stack of 2D feature maps. That representation is turned into a 1D vector either by flattening the final feature maps or by global average pooling; this is the step a classification head performs before its dense layers.
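The two routes to a 1D vector differ sharply in size. The sketch below assumes a final stack of 512 feature maps of 7 × 7 (the shape a VGG-style backbone produces for a 224 × 224 input) and fills it with random values purely for illustration.

```python
import numpy as np

# ASSUMPTION: 512 channels of 7x7, random values standing in for real activations.
feature_maps = np.random.default_rng(0).normal(size=(512, 7, 7))

# Flattening keeps every spatial position: 512 * 7 * 7 values.
flattened = feature_maps.reshape(-1)

# Global average pooling keeps one value per channel: the mean of each 7x7 map.
pooled = feature_maps.mean(axis=(1, 2))

print(flattened.shape)  # (25088,)
print(pooled.shape)     # (512,)
```

Global average pooling produces a vector 49 times shorter, so the first dense layer that follows needs correspondingly fewer weights.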
Why the chapter is about backbones, not heads
Of the three components, the backbone is the one almost all the research went into, for a structural reason: it is the reusable part. A better head improves one task; a better backbone, pretrained once and transferred everywhere, lifts every downstream task at once. That leverage is why the history of computer-vision architecture is told as a history of backbones, AlexNet to VGG to ResNet, with the neck and head comparatively stable.
The rest of the chapter walks through the classical backbones in the order they appeared at the ILSVRC, each introducing one decisive idea that the modern architectures still rely on.