Grouped Convolution

Abstract

A grouped convolution splits the input channels into $G$ independent groups and convolves each group on its own, instead of letting every filter see every input channel. It cuts the parameters and compute of a convolution by a factor of $G$, and it does something subtler too: it pushes the network to learn clusters of specialized features. Born as a hardware workaround in AlexNet, it is now a deliberate design axis behind ResNeXt, MobileNet, and ShuffleNet. This note defines it, counts the saving, shows the feature clustering it induces, and traces where it leads.

What it is: standard versus grouped

In a standard convolution, every output filter spans the entire input depth: a filter on a $C_{in}$-channel input is itself $K \times K \times C_{in}$, so each output channel is a function of all input channels.

A grouped convolution with $G$ groups partitions the $C_{in}$ input channels into $G$ blocks of $C_{in} / G$ channels. Each filter sees only its own group’s channels, the $G$ groups are convolved independently, and their outputs are concatenated. One dense convolution becomes $G$ smaller convolutions running in parallel that never exchange information.
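The definition above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper names `conv2d`, `grouped_conv2d`, and `as_dense_weights` are hypothetical, the convolution is a naive stride-1 cross-correlation without padding or bias, and the weight layout `(C_out, C_in / G, K, K)` is an assumption chosen for convenience. It shows that a grouped convolution gives the same output as a dense convolution whose cross-group weights are all zero.

```python
import numpy as np

def conv2d(x, w):
    """Naive stride-1, no-padding cross-correlation.
    x: (C_in, H, W), w: (C_out, C_in, K, K) -> (C_out, H-K+1, W-K+1)."""
    c_out, c_in, k, _ = w.shape
    assert x.shape[0] == c_in
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + k, j:j + k]               # (C_in, K, K)
            y[:, i, j] = np.tensordot(w, patch, axes=3)  # sum over C_in, K, K
    return y

def grouped_conv2d(x, w, groups):
    """Grouped convolution as `groups` independent convolutions.
    x: (C_in, H, W), w: (C_out, C_in // groups, K, K)."""
    c_in, c_out = x.shape[0], w.shape[0]
    assert c_in % groups == 0 and c_out % groups == 0
    in_per, out_per = c_in // groups, c_out // groups
    outs = []
    for g in range(groups):
        x_g = x[g * in_per:(g + 1) * in_per]      # this group's input channels
        w_g = w[g * out_per:(g + 1) * out_per]    # this group's filters
        outs.append(conv2d(x_g, w_g))
    return np.concatenate(outs, axis=0)           # stack the group outputs

def as_dense_weights(w, groups):
    """Embed grouped weights in a dense tensor that is zero across groups."""
    c_out, in_per, k, _ = w.shape
    out_per = c_out // groups
    dense = np.zeros((c_out, in_per * groups, k, k))
    for g in range(groups):
        rows = slice(g * out_per, (g + 1) * out_per)
        cols = slice(g * in_per, (g + 1) * in_per)
        dense[rows, cols] = w[rows]
    return dense

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))       # C_in = 4
w = rng.standard_normal((6, 2, 3, 3))    # C_out = 6, C_in / G = 2, K = 3
y_grouped = grouped_conv2d(x, w, groups=2)
y_dense = conv2d(x, as_dense_weights(w, groups=2))
print(y_grouped.shape)                   # (6, 4, 4)
print(np.allclose(y_grouped, y_dense))   # True
```

Read this way, grouping is a dense convolution with a fixed sparsity pattern: only the weights inside each group's block are allowed to be non-zero.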

The animations below show the same contrast in motion: with one group a single kernel sweeps the full-depth input; with two groups the input depth is split into two stacks, each convolved by its own kernels, and the two outputs are stacked back together.

| Property | 1 group (standard) | $G$ groups |
| --- | --- | --- |
| Channels each filter sees | all $C_{in}$ | $C_{in} / G$ |
| Cross-channel mixing | full | only within a group |
| Independent convolutions | $1$ | $G$ |

[Animations: 1 group (standard) versus 2 groups]

The saving

A standard $K \times K$ convolution from $C_{in}$ to $C_{out}$ channels costs

$$K^{2} C_{in} C_{out}$$

parameters. With $G$ groups, each group maps $C_{in} / G$ inputs to $C_{out} / G$ outputs, and there are $G$ of them:

$$G \times K^{2} \cdot \frac{C_{in}}{G} \cdot \frac{C_{out}}{G} = \frac{K^{2} C_{in} C_{out}}{G}.$$

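Grouping by $G$ divides both the parameters and the compute by $G$: every weight is applied once per output position, so at a fixed output size the multiply-accumulate count shrinks by the same factor as the parameter count. The cost is paid in expressivity: an output channel can no longer mix information from channels outside its group.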
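As a worked instance of the count, here is a small Python sketch. The function names `conv_params` and `conv_macs` and the layer sizes are illustrative assumptions; biases are ignored, as in the formulas above.

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weight count of a K x K convolution with `groups` groups (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")
    return groups * k * k * (c_in // groups) * (c_out // groups)

def conv_macs(k, c_in, c_out, h_out, w_out, groups=1):
    """Multiply-accumulates: every weight is used once per output position."""
    return conv_params(k, c_in, c_out, groups) * h_out * w_out

dense = conv_params(3, 256, 256)               # K^2 * C_in * C_out
grouped = conv_params(3, 256, 256, groups=32)  # the same count divided by G
print(dense, grouped, dense // grouped)        # 589824 18432 32
```

The compute ratio is the same as the parameter ratio, because the output size multiplies both counts equally.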

Grouping splits the channels into independent lanes

A grouped convolution also changes how the network organizes its features, for a mechanical reason: group $k$ of a layer receives input only from group $k$ of the layer before it. Information flows in $G$ parallel lanes, lane 1 into lane 1, lane 2 into lane 2, that never cross. Each lane is a small network on its own, and because the lanes cannot share, each learns its own kind of feature. Two pictures make this concrete.

The lanes are visible in the filter correlations. Take two adjacent layers; for every pair of filters, one from each layer, measure how correlated they are, and arrange the numbers in a square (bright for correlated, dark for not). With a standard convolution the square has no structure, because every filter is connected to every filter. With a grouped convolution ($G = 8$), eight bright blocks line up along the diagonal: each block is one lane, whose filters are tied to the matching lane in the next layer and unrelated to the other seven. The block-diagonal picture is exactly the eight lanes, drawn.
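The block-diagonal structure follows from which connections exist at all. The sketch below is a NumPy illustration with a hypothetical helper `group_connectivity` and assumed channel counts; it draws the connectivity pattern only. It is not the correlation measurement described above, which is taken from trained filters.

```python
import numpy as np

def group_connectivity(c_in, c_out, groups):
    """mask[o, i] = 1 if output channel o can see input channel i."""
    in_group = np.arange(c_in) // (c_in // groups)     # group index per input
    out_group = np.arange(c_out) // (c_out // groups)  # group index per output
    return (out_group[:, None] == in_group[None, :]).astype(int)

mask = group_connectivity(16, 16, groups=8)
print(mask[:4, :4])
# [[1 1 0 0]
#  [1 1 0 0]
#  [0 0 1 1]
#  [0 0 1 1]]
print(mask.mean())  # 0.125, i.e. 1/G of all channel pairs are connected
```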

AlexNet’s two lanes became edges and colour. AlexNet ran with $G = 2$, one lane per GPU. Its first-layer filters, below, divide cleanly: one lane learned grayscale edge detectors (the oriented, Gabor-like stripes), the other learned colour detectors (the low-frequency colour blobs). Two isolated lanes settled on two different jobs on their own. This is the specialization the AlexNet note describes, and the correlation squares above are its general form.

Grouping is also a regularizer

A grouped convolution is two things at once: it is cheaper by the factor $G$ counted above, and it is a constraint that encourages specialization, because cutting the channels into isolated lanes pushes the network toward a few clusters of distinct features instead of one tangled mass. This second effect is part of why ResNeXt gets more from many small grouped paths than from one wide dense layer of the same size.

From a hardware hack to a design axis

An accident that became a tool

Grouped convolution was not invented for any of this. AlexNet was split across two GPUs because the whole network would not fit in the memory of one, and that split forced its convolutions into two groups that communicated only at a few layers. The idea was then recognized as useful in its own right and turned from a constraint into a deliberate knob.

Three later architectures made it central:

ResNeXt treats the number of groups, which it calls cardinality, as a first-class design axis alongside depth and width, and shows that raising cardinality improves accuracy more cheaply than widening or deepening.
MobileNet uses the extreme case $G = C_{in}$, a depthwise convolution in which each channel has its own single-channel filter, paired with a $1 \times 1$ pointwise convolution to mix the channels back together. The pair, a depthwise separable convolution, is a workhorse of efficient backbones.
ShuffleNet adds a channel shuffle between grouped convolutions, permuting channels so that information eventually crosses group boundaries.
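The channel shuffle in the last item can be sketched as a reshape, a transpose, and a flatten. This is a minimal NumPy illustration under assumed shapes (a single `(C, H, W)` feature map, with `channel_shuffle` as a hypothetical helper name), not ShuffleNet's own code.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels so each group's output feeds every later group.
    x: (C, H, W) with C divisible by `groups`."""
    c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 0, 2, 3)               # swap group and within-group axes
    return x.reshape(c, h, w)                 # flatten back to C channels

x = np.arange(6).reshape(6, 1, 1)             # channels labelled 0..5
print(channel_shuffle(x, groups=2).ravel())   # [0 3 1 4 2 5]
```

With two groups, channels 0 to 2 come from the first group and 3 to 5 from the second; after the shuffle each consecutive block of channels contains members of both, so the next grouped convolution sees both lanes.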

The trade-off, stated once

Grouping cuts cost by a factor of $G$ but silos the channels: with no further mixing, the $G$ groups become $G$ disconnected sub-networks. Every architecture that uses grouping therefore restores cross-channel communication somewhere, with a $1 \times 1$ pointwise convolution (MobileNet) or an explicit channel shuffle (ShuffleNet). The depthwise-plus-pointwise pattern is the cleanest resolution: do the spatial work per channel, cheaply, then mix channels with a $1 \times 1$ ([[replacing-the-dense-head|the same $1 \times 1$ mixing seen elsewhere]]).
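The cost of the depthwise-plus-pointwise pattern follows from the grouped formula with $G = C_{in}$, plus one $1 \times 1$ convolution. Below is a small Python sketch with illustrative function names and assumed layer sizes, ignoring biases.

```python
def standard_params(k, c_in, c_out):
    """Dense K x K convolution: K^2 * C_in * C_out weights."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise K x K (G = C_in, one filter per channel) plus 1 x 1 pointwise."""
    depthwise = k * k * c_in   # K^2 * C_in * C_in / G with G = C_in
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise

dense = standard_params(3, 256, 256)
separable = depthwise_separable_params(3, 256, 256)
print(dense, separable)              # 589824 67840
print(round(dense / separable, 2))   # 8.69
```

For these sizes the saving is roughly 8.7 times rather than the full factor of $G$, because the pointwise convolution that restores the mixing carries most of the remaining cost.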

Deep Learning: Zero to Hero
