Abstract
A grouped convolution splits the input channels into independent groups and convolves each group on its own, instead of letting every filter see every input channel. It cuts the cost of a convolution by a factor of $G$, the number of groups, and it does something subtler too: it pushes the network to learn clusters of specialized features. Born as a hardware workaround in AlexNet, it is now a deliberate design axis behind ResNeXt, MobileNet, and ShuffleNet. This note defines it, counts its saving, shows the feature clustering it induces, and traces where it leads.
What it is: standard versus grouped
In a standard convolution, every output filter spans the entire input depth: a $k \times k$ filter on a $C_{in}$-channel input is itself $k \times k \times C_{in}$, so each output channel is a function of all $C_{in}$ input channels.
A grouped convolution with $G$ groups partitions the $C_{in}$ input channels into $G$ blocks of $C_{in}/G$ channels. Each filter sees only its own group’s channels, the groups are convolved independently, and their outputs are concatenated. One dense convolution becomes $G$ smaller convolutions running in parallel that never exchange information.
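The partition can be sketched in a few lines of plain Python. This is a simplified illustration, not a full convolution: it assumes a 1×1 kernel evaluated at a single spatial position, so only the channel bookkeeping remains. The function name `grouped_pointwise` and the toy weights are made up for this example.

```python
def grouped_pointwise(x, weights, groups):
    """Grouped 1x1 convolution at a single spatial position (illustrative sketch).

    x       : list of c_in input-channel values
    weights : weights[o] holds the c_in // groups weights of output channel o
    groups  : number of groups; must divide both c_in and c_out
    """
    c_in, c_out = len(x), len(weights)
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    in_per_group = c_in // groups
    out_per_group = c_out // groups

    y = []
    for o in range(c_out):
        g = o // out_per_group                 # group that output channel o belongs to
        start = g * in_per_group               # first input channel of that group
        block = x[start:start + in_per_group]  # the only inputs this filter sees
        y.append(sum(w * v for w, v in zip(weights[o], block)))
    return y


# 4 input channels, 4 output channels, 2 groups: each filter has 2 weights, not 4.
weights = [[1, 1], [1, -1], [2, 0], [0, 2]]
base = grouped_pointwise([1, 2, 3, 4], weights, groups=2)

# Change only input channel 0, which belongs to group 0.
bumped = grouped_pointwise([10, 2, 3, 4], weights, groups=2)

print(base)    # outputs 0-1 come from group 0, outputs 2-3 from group 1
print(bumped)  # only the group-0 outputs differ from base
```

Changing a channel in one group leaves the other group's outputs untouched, which is the "never exchange information" property in miniature.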

The animations below show the same contrast in motion: with one group a single kernel sweeps the full-depth input; with two groups the input depth is split into two stacks, each convolved by its own kernels, and the two outputs are stacked back together.
| | 1 group (standard) | $G$ groups |
|---|---|---|
| Channels each filter sees | all $C_{in}$ | $C_{in}/G$ |
| Cross-channel mixing | full | only within a group |
| Independent convolutions | 1 | $G$ |
[Animation panels: 1 group (standard); 2 groups]
The saving
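A standard $k \times k$ convolution from $C_{in}$ to $C_{out}$ channels costs
$$k^2 \, C_{in} \, C_{out}$$
parameters. With $G$ groups, each group maps $C_{in}/G$ inputs to $C_{out}/G$ outputs, and there are $G$ of them:
$$G \cdot k^2 \cdot \frac{C_{in}}{G} \cdot \frac{C_{out}}{G} = \frac{k^2 \, C_{in} \, C_{out}}{G}$$
Grouping by $G$ divides both the parameters and the compute by $G$. The cost is paid in expressivity: an output channel can no longer mix information from channels outside its group.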
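The count is easy to check numerically. The sketch below assumes a square kernel of side `k`, ignores bias terms, and uses a hypothetical helper name `conv_params`; the 3×3, 256-channel layer is an arbitrary example, not a figure from any particular network.

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weight count of a k x k convolution with the given number of groups (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    # Each group is a small dense convolution from c_in/groups to c_out/groups channels.
    per_group = k * k * (c_in // groups) * (c_out // groups)
    return groups * per_group


dense = conv_params(3, 256, 256)               # standard convolution, one group
grouped = conv_params(3, 256, 256, groups=8)   # same layer split into 8 groups

print(dense, grouped, dense // grouped)
```

The ratio between the two counts equals the number of groups, as the formula predicts.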
Grouping splits the channels into independent lanes
A grouped convolution also changes how the network organizes its features, for a mechanical reason: group $g$ of a layer receives input only from group $g$ of the layer before it. Information flows in parallel lanes, lane 1 into lane 1, lane 2 into lane 2, that never cross. Each lane is a small network on its own, and because the lanes cannot share, each learns its own kind of feature. Two pictures make this concrete.
The lanes are visible in the filter correlations. Take two adjacent layers; for every pair of filters, one from each layer, measure how correlated they are, and arrange the numbers in a square (bright for correlated, dark for not). With a standard convolution the square has no structure, because every filter is connected to every filter. With a grouped convolution ($G = 8$), eight bright blocks line up along the diagonal: each block is one lane, whose filters are tied to the matching lane in the next layer and unrelated to the other seven. The block-diagonal picture is exactly the eight lanes, drawn.
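The block-diagonal shape follows from the wiring alone. The sketch below draws the connectivity of a grouped layer, that is, which input channels each output channel is allowed to read. It shows the structural cause of the pattern, not measured filter correlations, and the helper name `connectivity_mask` and the small 8-channel, 4-group sizes are chosen only for illustration.

```python
def connectivity_mask(c_in, c_out, groups):
    """mask[o][i] is 1 when output channel o can read input channel i, else 0."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    return [
        [1 if o // out_per_group == i // in_per_group else 0 for i in range(c_in)]
        for o in range(c_out)
    ]


mask = connectivity_mask(8, 8, groups=4)
for row in mask:
    # '#' marks a connection, '.' marks a pair that can never interact.
    print("".join("#" if v else "." for v in row))
```

With one group the mask is entirely filled; with four groups it prints four 2×2 blocks along the diagonal and nothing elsewhere.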

AlexNet’s two lanes became edges and colour. AlexNet ran with $G = 2$, one lane per GPU. Its first-layer filters, below, divide cleanly: one lane learned grayscale edge detectors (the oriented, Gabor-like stripes), the other learned colour detectors (the low-frequency colour blobs). Two isolated lanes settled on two different jobs by themselves. This is the specialization the AlexNet note describes, and the correlation squares above are its general form.

Grouping is also a regularizer
A grouped convolution is two things at once: it is cheaper by the factor counted above, and it is a constraint that encourages specialization, because cutting the channels into isolated lanes pushes the network toward a few clusters of distinct features instead of one tangled mass. This second effect is part of why ResNeXt gets more from many small grouped paths than from one wide dense layer of the same size.
From a hardware hack to a design axis
An accident that became a tool
Grouped convolution was not invented for any of this. AlexNet was split across two GPUs because the whole network would not fit in the memory of one, and that split forced its convolutions into two groups that communicated only at a few layers. The idea was then recognized as useful in its own right and turned from a constraint into a deliberate knob.
Three later architectures made it central:
- ResNeXt treats the number of groups, which it calls cardinality, as a first-class design axis alongside depth and width, and shows that raising cardinality improves accuracy more cheaply than widening or deepening.
- MobileNet uses the extreme case $G = C_{in}$ (one group per channel), a depthwise convolution in which each channel has its own single-channel filter, paired with a $1 \times 1$ pointwise convolution to mix the channels back together. The pair, a depthwise separable convolution, is the workhorse of the efficient backbones.
- ShuffleNet adds a channel shuffle between grouped convolutions, permuting channels so that information eventually crosses group boundaries.
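A channel shuffle is usually described as a reshape, a transpose, and a flatten. The sketch below applies that permutation to a plain list of channel labels. It is an illustration of the idea under that assumption, not ShuffleNet's own code, and the name `channel_shuffle` and the labels are invented for the example.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    Conceptually: view the list as a (groups, per_group) grid, transpose it to
    (per_group, groups), and read it back out row by row.
    """
    n = len(channels)
    if n % groups:
        raise ValueError("groups must divide the number of channels")
    per_group = n // groups
    return [channels[g * per_group + j] for j in range(per_group) for g in range(groups)]


# Two groups of three channels each: group a and group b.
labels = ["a0", "a1", "a2", "b0", "b1", "b2"]
shuffled = channel_shuffle(labels, groups=2)
print(shuffled)

# The next 2-group convolution splits the shuffled list in half,
# so each half now contains channels from both original groups.
first_half, second_half = shuffled[:3], shuffled[3:]
print(first_half, second_half)
```

After the shuffle, each group of the following grouped convolution reads channels that came from every group of the previous one, which is how information crosses the group boundary.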
The trade-off, stated once
Grouping cuts cost by a factor of $G$ but silos the channels: with no further mixing, the groups become disconnected sub-networks. Every architecture that uses grouping therefore restores cross-channel communication somewhere, with a pointwise convolution (MobileNet) or an explicit channel shuffle (ShuffleNet). The depthwise-plus-pointwise pattern is the cleanest resolution: do the spatial work per channel, cheaply, then mix channels with a $1 \times 1$ convolution ([[replacing-the-dense-head|the same mixing seen elsewhere]]).
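To put rough numbers on the depthwise-plus-pointwise pattern, the sketch below compares its weight count with that of a standard convolution. It assumes a square kernel of side `k`, no bias terms, and a depthwise stage with exactly one filter per input channel. The function names and the 3×3, 256-channel layer are illustrative choices, not taken from any specific architecture.

```python
def standard_params(k, c_in, c_out):
    """Weights of a dense k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k stage followed by a 1x1 pointwise stage (no bias)."""
    depthwise = k * k * c_in   # one single-channel k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes all channels
    return depthwise + pointwise


dense = standard_params(3, 256, 256)
separable = depthwise_separable_params(3, 256, 256)

print(dense, separable, round(dense / separable, 2))
```

In this example the spatial stage is a small fraction of the total; almost all of the remaining weights sit in the 1×1 mixing step.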