ResNeXt

Abstract

ResNeXt (Xie, Girshick, Dollár, Tu, and He, 2017) adds one idea to ResNet: a third design axis it calls cardinality, the number of parallel transformation paths inside a block. Each ResNet bottleneck is replaced by a set of identical parallel bottlenecks whose outputs are summed, which is exactly a grouped convolution. At the same parameter and compute budget as ResNet, the extra paths buy more accuracy than making the network deeper or wider. The name is a pun: ResNeXt is ResNet plus the next dimension.

The next dimension: cardinality

A convolutional network has long had two axes to scale: depth (how many layers) and width (how many channels per layer). ResNeXt names and exploits a third, cardinality: the number of parallel paths a block splits its computation into.

A ResNet bottleneck squeezes a $256$-channel input to $64$, transforms it with one $3 \times 3$ convolution, and expands it back to $256$. ResNeXt replaces that single path with $C$ identically shaped parallel bottlenecks (the slide’s $C = 32$), each reducing the input to only a few channels before its own $3 \times 3$, and sums their outputs before adding the residual shortcut. $C$ is the cardinality.

Aggregated transformations

The block computes an aggregated transformation, a sum of $C$ small networks plus the shortcut:

$$y = x + \sum_{i=1}^{C} T_{i}(x),$$

where each $T_{i}$ is a small bottleneck (a $1 \times 1$ reduce, a $3 \times 3$ , a $1 \times 1$ expand) of the same shape, and the $C$ paths differ only in their learned weights. This is the split-transform-merge pattern: split the input across paths, transform each, merge by summation.
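The split-transform-merge pattern can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: `aggregated_block` and `make_scaling_path` are hypothetical names, and each path is a simple scaling function standing in for a real bottleneck.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Transform = Callable[[Sequence[float]], Vector]


def aggregated_block(x: Sequence[float], paths: Sequence[Transform]) -> Vector:
    """Return y = x + sum_i T_i(x) for a list of same-shaped paths."""
    merged = [0.0] * len(x)
    for transform in paths:
        out = transform(x)                                  # split + transform
        merged = [m + o for m, o in zip(merged, out)]       # merge by summation
    return [xi + mi for xi, mi in zip(x, merged)]           # residual shortcut


def make_scaling_path(weight: float) -> Transform:
    """Hypothetical stand-in for a bottleneck: same shape, different weight."""
    return lambda x: [weight * xi for xi in x]


# Cardinality C = 3: three paths that differ only in their (toy) weights.
paths = [make_scaling_path(w) for w in (0.5, -0.25, 1.0)]
y = aggregated_block([1.0, 2.0], paths)
print(y)  # [2.25, 4.5]
```

The cardinality is simply `len(paths)`; adding a path adds one more term to the sum without changing the block's input or output shape.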

Inception's idea, made uniform

Split-transform-merge is exactly what the Inception module does, but Inception’s branches are hand-designed and heterogeneous: a $1 \times 1$ , a $3 \times 3$ , a $5 \times 5$ , a pooling, each chosen by the architect. ResNeXt keeps the multi-branch power but makes every branch identical, so there is nothing to tune per branch, only one number to set: the cardinality. It recovers Inception’s strength with VGG-and-ResNet simplicity, stacking one repeated block and turning a single knob.

A neuron is the simplest aggregated transformation

The paper frames the block through the neuron itself. An ordinary neuron computes $\sum_{i} w_{i} x_{i}$ , a sum of the simplest possible transformations, each input scaled by one weight. ResNeXt replaces each scalar transform $w_{i} x_{i}$ with a small network $T_{i} (x)$ and keeps the sum. Cardinality is then just the number of terms in that sum, exactly as a neuron’s fan-in is the number of terms in $\sum_{i} w_{i} x_{i}$ . So “aggregated transformations” is just a neuron with a small network in place of each weight.
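Written side by side, the analogy swaps the scalar term for a small network and the neuron's fan-in for the cardinality. The symbol $D$ for the number of neuron inputs is introduced here only for illustration:

$$\underbrace{\sum_{i=1}^{D} w_{i} x_{i}}_{\text{neuron}} \quad \longrightarrow \quad \underbrace{\sum_{i=1}^{C} T_{i}(x)}_{\text{aggregated transformation}}$$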

Which is just a grouped convolution

The $C$ parallel bottlenecks sound expensive to build, but they collapse into something the hardware already does cheaply. Because the paths share one shape and each path’s $3 \times 3$ reads only its own disjoint slice of the channels, the whole set of parallel $3 \times 3$ convolutions is exactly one grouped convolution with $C$ groups. The paper draws the block three equivalent ways; the grouped-convolution form is the one actually run.
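The equivalence can be checked numerically along the channel dimension. The sketch below is a simplification under stated assumptions: it drops the spatial extent (every layer acts on one channel vector, so the $3 \times 3$ becomes a small matrix per path) and omits batch normalization and ReLU. All function names are hypothetical. The grouped middle layer is block-diagonal: group `g` only ever reads its own slice.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def parallel_paths(x: List[float], reduces: List[Matrix],
                   mids: List[Matrix], expands: List[Matrix]) -> List[float]:
    """Form 1: run C separate bottlenecks and sum their outputs."""
    total = [0.0] * len(expands[0])
    for r, m, e in zip(reduces, mids, expands):
        out = matvec(e, matvec(m, matvec(r, x)))
        total = [t + o for t, o in zip(total, out)]
    return total


def grouped_form(x: List[float], reduces: List[Matrix],
                 mids: List[Matrix], expands: List[Matrix]) -> List[float]:
    """Form 2: one wide reduce, one grouped middle layer, one wide expand."""
    d = len(mids[0])  # per-path width
    # Wide reduce: stack every path's reduce rows into one matrix.
    hidden = matvec([row for r in reduces for row in r], x)
    # Grouped middle: group g transforms only its own slice of `hidden`.
    mixed: List[float] = []
    for g, m in enumerate(mids):
        mixed += matvec(m, hidden[g * d:(g + 1) * d])
    # Wide expand: concatenate the per-path expand matrices column-wise,
    # so the matrix product performs the sum over paths.
    wide_expand = [sum((e[i] for e in expands), []) for i in range(len(expands[0]))]
    return matvec(wide_expand, mixed)


rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> Matrix:
    # Small integer weights keep the float arithmetic exact.
    return [[float(rng.randint(-3, 3)) for _ in range(cols)] for _ in range(rows)]


D, C, d = 8, 4, 2  # toy sizes: input channels, cardinality, per-path width
reduces = [rand_matrix(d, D) for _ in range(C)]
mids = [rand_matrix(d, d) for _ in range(C)]
expands = [rand_matrix(D, d) for _ in range(C)]
x = [float(rng.randint(-3, 3)) for _ in range(D)]

a = parallel_paths(x, reduces, mids, expands)
b = grouped_form(x, reduces, mids, expands)
print(a == b)  # True
```

The two forms hold the same weights, only arranged differently, which is why a framework can run the whole block as three ordinary layers with a group count on the middle one.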

The two blocks side by side make the change concrete:

the ResNet bottleneck (left): $256 \xrightarrow{1 \times 1} 64 \xrightarrow{3 \times 3} 64 \xrightarrow{1 \times 1} 256$;
the ResNeXt bottleneck (right, $32 \times 4d$): $256 \xrightarrow{1 \times 1} 128 \xrightarrow{3 \times 3,\ \text{group} = 32} 128 \xrightarrow{1 \times 1} 256$.

Same parameters, more feature maps

The grouped $3 \times 3$ is far cheaper than a dense one ( $32$ groups of $4$ channels cost a thirty-second of a dense $128$ -wide convolution), and ResNeXt spends the freed budget on wider $1 \times 1$ layers: its bottleneck carries $128$ channels where ResNet’s carries $64$ . The two changes nearly cancel in parameter count, so a ResNeXt block costs about the same as a ResNet block but moves twice as many feature maps through its middle, split into $32$ independent groups. More capacity, same budget, which is exactly what the slide means by “same number of parameters, but with more feature maps”.
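The arithmetic behind "nearly cancel" is short enough to write out. This sketch counts convolution weights only, assuming no biases and ignoring batch-normalization parameters; `conv_params` is a hypothetical helper, not a library function.

```python
def conv_params(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Weight count of a bias-free convolution with the given group count."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel sees only c_in / groups input channels.
    return (c_in // groups) * c_out * kernel * kernel


# ResNet bottleneck: 256 -> 64 -> 64 -> 256
resnet = (conv_params(256, 64, 1)
          + conv_params(64, 64, 3)
          + conv_params(64, 256, 1))

# ResNeXt bottleneck (32x4d): 256 -> 128 -> 128 (32 groups) -> 256
resnext = (conv_params(256, 128, 1)
           + conv_params(128, 128, 3, groups=32)
           + conv_params(128, 256, 1))

dense_mid = conv_params(128, 128, 3)
grouped_mid = conv_params(128, 128, 3, groups=32)

print(resnet)                    # 69632
print(resnext)                   # 70144
print(dense_mid // grouped_mid)  # 32
```

Both blocks land at roughly 70 thousand weights: the grouped middle layer shrinks from 36,864 weights in ResNet's dense $3 \times 3$ to 4,608, and the two $1 \times 1$ layers double from 16,384 to 32,768 each.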

Why cardinality matters

Cardinality beats width and depth

ResNeXt’s central result is empirical: at a fixed parameter and compute budget, raising cardinality improves accuracy more than raising width or depth. Many small transformations aggregated outperform one large transformation of the same total size. A ResNeXt clearly beats the ResNet of equal cost and reaches the accuracy of substantially deeper ones, by spending its budget on parallel paths rather than extra layers. Cardinality joined depth and width as a knob worth turning, and the same split-into-groups idea runs through the efficient backbones that followed.

Reading " $32 \times 4 d$ "

The two numbers are the cardinality and the per-path width. ResNeXt-50 ($32 \times 4d$) has $C = 32$ paths, each $4$ channels wide (the $4d$), so the grouped $3 \times 3$ carries $32 \times 4 = 128$ channels in $32$ groups. ResNeXt-101 ($64 \times 4d$) raises the cardinality to $64$ paths. The "$50$" and "$101$" are the layer counts inherited from the matching ResNet: ResNeXt is ResNet’s skeleton with every bottleneck’s $3 \times 3$ made grouped.
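Reading the label mechanically looks like this. The sketch assumes the label is written with a plain `x`, as in `32x4d`; `Template` and `parse_template` are hypothetical names. The width it reports is for the $256$-channel block described here; as far as I recall, the paper doubles the width in each later stage.

```python
import re
from typing import NamedTuple


class Template(NamedTuple):
    cardinality: int   # C, the number of parallel paths (groups)
    path_width: int    # d, channels per path

    @property
    def grouped_width(self) -> int:
        """Total channels carried by the grouped 3x3 convolution."""
        return self.cardinality * self.path_width


def parse_template(label: str) -> Template:
    """Parse a label such as '32x4d' into cardinality and per-path width."""
    match = re.fullmatch(r"(\d+)x(\d+)d", label.strip())
    if match is None:
        raise ValueError(f"not a CxWd template: {label!r}")
    return Template(int(match.group(1)), int(match.group(2)))


spec = parse_template("32x4d")
print(spec.cardinality, spec.path_width, spec.grouped_width)  # 32 4 128
```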

The lasting idea

ResNeXt’s contribution is less a network than a principle: capacity is not only depth and width, but also the number of parallel, low-dimensional paths a layer is split into. Its mechanism, a grouped convolution wrapped in a residual block, is simple; its lesson, that aggregating many small transformations is an efficient way to grow a network, is the same intuition that the Inception module reached for and that the mobile architectures after it took to the extreme.
