Abstract

A standard convolution mixes information across space and across channels in a single dense operation, and pays for it with parameters. The depthwise separable convolution, popularised by the Xception architecture (Chollet, 2017), factorises that one operation into two cheaper ones:

  • a depthwise step that filters each channel in space on its own;
  • a pointwise ($1 \times 1$) step that recombines the channels into the desired number.

It reaches a comparable mapping at roughly $1/k^2$ of the parameters and multiplications of a standard $k \times k$ convolution, about an order of magnitude for a $3 \times 3$ kernel, by committing to one assumption: that spatial and cross-channel correlations can be learned separately.

A standard convolution is parameter-heavy

A standard 2D convolutional layer learns one $k \times k$ kernel for every pair of input and output channels. A filter is a stack of $C_{\text{in}}$ kernels, and there are $C_{\text{out}}$ filters, one learned per output channel, so, ignoring biases, the layer holds

$$k^2 \, C_{\text{in}} \, C_{\text{out}}$$

parameters, and the same factor sets the number of multiplications per output position. Every filter has to fit a spatial pattern and a recipe for combining channels at once, which is both expensive to store and harder to train.
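The count above can be checked with a few lines of Python. This is a minimal sketch: the function names and the sizes in the example (a 3 by 3 kernel, 64 input channels, 128 output channels, a 56 by 56 output) are illustrative assumptions, not values from the text.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution, biases ignored."""
    # One k x k kernel for every (input channel, output channel) pair.
    return k * k * c_in * c_out


def standard_conv_mults(k: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    """Multiplications needed to produce an h_out x w_out x c_out output."""
    # Every weight is used exactly once at each output position.
    return h_out * w_out * standard_conv_params(k, c_in, c_out)


# Illustrative sizes only (assumed for this sketch).
print(standard_conv_params(3, 64, 128))         # 73728
print(standard_conv_mults(3, 64, 128, 56, 56))  # 231211008
```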

The idea: separate space from channels

A standard convolution does two jobs in one stroke: it looks for a pattern in space and it mixes the channels. The depthwise separable convolution rests on the observation that these two jobs can be pulled apart and done in turn. Splitting them produces the two steps the operation is named after:

  • a depthwise (or channel-wise) convolution, a 2D convolution applied to each input channel on its own (one $k \times k$ kernel per channel), which handles the spatial part;
  • a pointwise convolution, a $1 \times 1$ convolution that maps the $C_{\text{in}}$ input channels to the target $C_{\text{out}}$ output channels, which handles the channel part.

The word separable names this split. A standard 2D convolution is almost a 3D operation, since its filter spans the full channel depth even though it slides only in two dimensions; the factorisation breaks that joint object into a purely spatial 2D part followed by a purely cross-channel part.

Where the names "depthwise" and "pointwise" come from

The depth of an activation volume is its number of channels: an $H \times W \times C$ tensor is a stack of $C$ maps of size $H \times W$, and that stack is what makes it “deep” (this is the channel axis, not the spatial depth of a 3D convolution). The suffix -wise means “separately for each”, as in element-wise or row-wise. So depthwise means “separately for each slice along the depth axis”, that is per channel, which is why it is also called channel-wise. The name is mildly counterintuitive: a depthwise convolution does not mix across depth, it processes each channel on its own and leaves the cross-channel mixing to the pointwise step.

The two names are even taken from different axes. Depthwise is named after the channel axis: one spatial kernel per channel, with no channel mixing. Pointwise is named after the spatial point: a $1 \times 1$ window that mixes channels at a single location, with no spatial extent. In short, depthwise does space without channels, and pointwise does channels without space.

What the factorisation really assumes

The depthwise separable convolution is not just a cheaper convolution, it is a convolution with a built-in hypothesis: that the spatial correlations and the cross-channel correlations in a feature map can be mapped independently, with no loss that matters. This is the thesis of Xception, read as an extreme Inception module: where an Inception module splits a convolution into a few partly independent spatial-and-channel paths, Xception pushes the split to its limit, one independent spatial filter per channel followed by a single channel mixer. The saving below is what that hypothesis buys; any accuracy it costs is what the hypothesis gives up when space and channels are not truly separable.

Depthwise convolution

Given a 3D input tensor of shape $H \times W \times C_{\text{in}}$, the first step convolves it one channel at a time. Each input channel is assigned its own dedicated $k \times k$ kernel, and that kernel sees only that channel. This is the exact opposite of the standard layer, where every kernel reads across all channels at once: here there is no cross-channel mixing at all.

The channel count is therefore left unchanged. The $C_{\text{in}}$ input channels produce $C_{\text{in}}$ output maps, each one the spatial filtering of a single input channel, and stacking them gives a volume that has been filtered in space but not yet recombined across channels. Whatever the target width of the layer, the depthwise output still has $C_{\text{in}}$ channels, never $C_{\text{out}}$.
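The per-channel filtering can be written out directly. The following is a minimal pure-Python sketch, assuming no padding and stride 1; the helper names (`conv2d_valid`, `depthwise_conv`) and the toy values are hypothetical and chosen only to make the result easy to check by hand.

```python
def conv2d_valid(channel, kernel):
    """Cross-correlate one 2D map with one 2D kernel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    h_out = len(channel) - kh + 1
    w_out = len(channel[0]) - kw + 1
    return [
        [
            sum(channel[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(w_out)
        ]
        for i in range(h_out)
    ]


def depthwise_conv(x, kernels):
    """x: list of C_in maps; kernels: list of C_in kernels, one per channel."""
    assert len(x) == len(kernels), "depthwise needs exactly one kernel per channel"
    # Channel c is filtered by kernel c only: there is no cross-channel mixing.
    return [conv2d_valid(ch, ker) for ch, ker in zip(x, kernels)]


# Toy input (assumed values): 2 channels of size 3x3, one 2x2 kernel each.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernels = [
    [[1, 0], [0, 1]],  # sums the main diagonal of each 2x2 window
    [[1, 1], [1, 1]],  # sums the whole 2x2 window
]
y = depthwise_conv(x, kernels)
print(y)  # [[[6, 8], [12, 14]], [[2, 2], [2, 2]]]
```

The output has as many maps as the input: two channels in, two channels out.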

[Figure: Depthwise convolution]

Depthwise convolution is grouped convolution taken to its extreme

A grouped convolution splits the channels into $G$ groups and convolves each group on its own. The depthwise convolution is the limiting case $G = C_{\text{in}}$: every group holds exactly one channel, so the partition is as fine as it can be. The standard convolution sits at the other end, $G = 1$, where a single group contains all channels. On this axis the depthwise step removes cross-channel mixing entirely, and the pointwise step that follows is precisely what restores it, in the cheapest possible form.
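The two ends of that axis show up directly in the weight count of a grouped convolution. This sketch assumes the usual convention that the number of groups divides both channel counts; the function name and the sizes (a 3 by 3 kernel, 64 channels in and out) are illustrative assumptions.

```python
def grouped_conv_params(k: int, c_in: int, c_out: int, groups: int) -> int:
    """Weights in a k x k grouped convolution, biases ignored."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")
    # Each output channel only reads the c_in // groups channels of its own group.
    return k * k * (c_in // groups) * c_out


k, c = 3, 64  # assumed sizes, with c_in == c_out == c
for g in (1, 2, 8, 64):
    print(g, grouped_conv_params(k, c, c, g))
# groups=1 gives the standard count k*k*c*c = 36864;
# groups=c gives the depthwise count k*k*c = 576.
```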

Pointwise (1 by 1) convolution

The depthwise step has filtered each channel in space but has left the channels uncombined, and it cannot change their number. The second step repairs both at once. A $1 \times 1$ convolution, also called a pointwise convolution, places one weight per input channel at a single spatial location and sums across channels, which is exactly a per-pixel linear remap of the channel vector; the mechanism is detailed in the note on the 1×1 convolution. Using $C_{\text{out}}$ such filters maps the $C_{\text{in}}$ depthwise maps to any target number of output channels, mixing the spatially filtered channels into new combinations.
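The per-pixel linear remap can be made concrete with a short pure-Python sketch. The function name `pointwise_conv`, the data layout (a list of channel maps), and the toy values are assumptions made for illustration.

```python
def pointwise_conv(x, weights):
    """x: list of C_in maps (H x W); weights: C_out rows of C_in weights each."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert all(len(row) == c_in for row in weights)
    # At every pixel, each output channel is a weighted sum of the C_in inputs.
    return [
        [
            [sum(row[c] * x[c][i][j] for c in range(c_in)) for j in range(w)]
            for i in range(h)
        ]
        for row in weights
    ]


# Toy input (assumed values): 2 channels of size 2x2, remapped to 3 channels.
x = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
]
weights = [
    [1, 1],   # sum of the two channels
    [1, -1],  # difference of the two channels
    [0, 2],   # twice the second channel
]
out = pointwise_conv(x, weights)
print(out[0])  # [[11, 22], [33, 44]]
print(out[1])  # [[-9, -18], [-27, -36]]
print(out[2])  # [[20, 40], [60, 80]]
```

The spatial size is untouched and only the channel count changes, from 2 to 3 here.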

[Figure: Depthwise separable convolution]

Same output shape as a standard convolution, different interior

The two steps in sequence reproduce what a standard convolution offers at its interface. With the same kernel size, stride, and padding, the output has the same shape a standard convolution would produce: the same spatial size $H_{\text{out}} \times W_{\text{out}}$, fixed by the depthwise step, and the same number of output channels $C_{\text{out}}$, fixed by the pointwise step. What changes is the interior: instead of one dense operation that fits spatial and channel structure jointly, there are two specialised ones, spatial first and channel second, each far smaller than the operation they replace.

A worked example

The figure uses no padding. An $H \times W$ input with $C_{\text{in}}$ channels passes through a $k \times k$ depthwise convolution, which at stride $1$ removes $k - 1$ positions per axis and yields an $(H - k + 1) \times (W - k + 1) \times C_{\text{in}}$ volume. The pointwise $1 \times 1$ convolution then remaps the $C_{\text{in}}$ channels to the target count $C_{\text{out}}$, leaving the spatial size untouched: the output is $(H - k + 1) \times (W - k + 1) \times C_{\text{out}}$. The spatial size is set entirely by the depthwise step and the channel count entirely by the pointwise step, which is the separation made visible.
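The shape bookkeeping can be traced with a small helper. This is a sketch using the standard output-size formula for a convolution; the function names are hypothetical, and the numbers in the example (a 12 by 12 input with 3 channels, a 5 by 5 kernel, 256 output channels) are illustrative assumptions, not the values of the figure.

```python
def conv_out_size(size: int, k: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis of a convolution."""
    return (size + 2 * padding - k) // stride + 1


def separable_shapes(h, w, c_in, c_out, k, stride=1, padding=0):
    """Shapes (H, W, C) after the depthwise step and after the pointwise step."""
    h_out = conv_out_size(h, k, stride, padding)
    w_out = conv_out_size(w, k, stride, padding)
    after_depthwise = (h_out, w_out, c_in)   # spatial size changes, channels do not
    after_pointwise = (h_out, w_out, c_out)  # channels change, spatial size does not
    return after_depthwise, after_pointwise


# Assumed example: no padding, stride 1.
print(separable_shapes(12, 12, 3, 256, k=5))  # ((8, 8, 3), (8, 8, 256))
```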

Why it is cheaper: parameters and compute

The parameter count

Counting the two steps separately makes the saving exact.

  • Depthwise. One $k \times k$ kernel per input channel ($k_h \times k_w$ if the kernel is not square), and there are $C_{\text{in}}$ channels, with no second channel index because each kernel touches one channel only. This costs $k^2 \, C_{\text{in}}$ weights.
  • Pointwise. One $1 \times 1$ kernel per pair of input and output channels, that is $C_{\text{in}} \, C_{\text{out}}$ weights.

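Adding the two gives the total,

$$k^2 \, C_{\text{in}} + C_{\text{in}} \, C_{\text{out}} = C_{\text{in}} \left( k^2 + C_{\text{out}} \right),$$

against the $k^2 \, C_{\text{in}} \, C_{\text{out}}$ of the standard layer. The vanilla count is the one derived in the convolutional layer note; the decisive difference is that its single product of three factors has here become a sum of two much smaller terms.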
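The two counts can be compared numerically. A minimal sketch, with biases ignored; the function names and the example sizes (a 3 by 3 kernel, 64 input channels, 128 output channels) are assumptions made for illustration.

```python
def depthwise_params(k: int, c_in: int) -> int:
    # One k x k kernel per input channel.
    return k * k * c_in


def pointwise_params(c_in: int, c_out: int) -> int:
    # One 1 x 1 weight per (input channel, output channel) pair.
    return c_in * c_out


def separable_params(k: int, c_in: int, c_out: int) -> int:
    return depthwise_params(k, c_in) + pointwise_params(c_in, c_out)


def standard_params(k: int, c_in: int, c_out: int) -> int:
    return k * k * c_in * c_out


k, c_in, c_out = 3, 64, 128  # assumed sizes
print(depthwise_params(k, c_in))         # 576
print(pointwise_params(c_in, c_out))     # 8192
print(separable_params(k, c_in, c_out))  # 8768
print(standard_params(k, c_in, c_out))   # 73728
```

With these sizes the separable layer holds 8768 weights against 73728, a sum of two small terms replacing one product of three factors.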

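| | Standard 2D convolution | Depthwise separable |
|---|---|---|
| Spatial mixing | joint with channels | depthwise step |
| Channel mixing | joint with spatial | pointwise step |
| Output channels | $C_{\text{out}}$ | $C_{\text{out}}$ |
| Parameters | $k^2 \, C_{\text{in}} \, C_{\text{out}}$ | $k^2 \, C_{\text{in}} + C_{\text{in}} \, C_{\text{out}}$ |
| Multiplications (output size $H_{\text{out}} \times W_{\text{out}}$) | $H_{\text{out}} W_{\text{out}} \, k^2 \, C_{\text{in}} \, C_{\text{out}}$ | $H_{\text{out}} W_{\text{out}} \left( k^2 \, C_{\text{in}} + C_{\text{in}} \, C_{\text{out}} \right)$ |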

The reduction factor

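Taking the ratio of the separable count to the standard one, the common factor $C_{\text{in}}$ cancels and the rest splits cleanly:

$$\frac{k^2 \, C_{\text{in}} + C_{\text{in}} \, C_{\text{out}}}{k^2 \, C_{\text{in}} \, C_{\text{out}}} = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}.$$

The identical algebra holds for the multiplication counts, because every term in the table carries the same spatial factor $H_{\text{out}} W_{\text{out}}$, so parameters and compute shrink by the same factor. Two readings of this result matter: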

  • The number of output channels $C_{\text{out}}$ is usually large (often in the hundreds), so $1/C_{\text{out}}$ is negligible and the reduction is dominated by $1/k^2$. For the common $k = 3$ this is $1/9$: close to an order of magnitude fewer parameters and multiplications, the figure usually quoted.
  • The $1/C_{\text{out}}$ term is not negligible everywhere. In the first layers, where $C_{\text{out}}$ is small, it lifts the ratio, so the early layers save proportionally less than the deep ones.
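Both readings can be seen by evaluating the reduction factor, one over the number of output channels plus one over the squared kernel size, for a few layer widths. This sketch uses exact fractions; the function names and the channel counts in the loop are illustrative assumptions.

```python
from fractions import Fraction


def reduction_factor(k: int, c_out: int) -> Fraction:
    """Separable cost divided by standard cost: 1/C_out + 1/k^2."""
    return Fraction(1, c_out) + Fraction(1, k * k)


def ratio_from_counts(k: int, c_in: int, c_out: int) -> Fraction:
    """Same ratio computed from the raw weight counts (C_in cancels)."""
    separable = k * k * c_in + c_in * c_out
    standard = k * k * c_in * c_out
    return Fraction(separable, standard)


# Assumed widths: a narrow early layer up to a wide deep layer, 3x3 kernel.
for c_out in (8, 32, 128, 512):
    r = reduction_factor(3, c_out)
    print(c_out, r, round(float(r), 4))
# The ratio falls from 17/72 (about 0.24) at 8 output channels
# towards the 1/9 floor (about 0.11) as the layer gets wider.
```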

Square kernels here, in general

The counts above use $k \times k$ for a square kernel, the standard isotropic choice motivated in the local receptive field note (and, for odd sizes, in padding and stride). The factorisation itself requires nothing square: for a general $k_h \times k_w$ kernel, replace $k^2$ by $k_h k_w$ throughout, so the depthwise step costs $k_h k_w \, C_{\text{in}}$ weights and the reduction factor becomes $1/C_{\text{out}} + 1/(k_h k_w)$.

Fewer FLOPs is not the same as faster

The saving counts arithmetic, not wall-clock time. The depthwise step does very little computation per value it reads from memory (its arithmetic intensity is low), so on parallel hardware such as GPUs it is typically memory-bandwidth bound rather than compute bound, and the block rarely runs $k^2$ times faster despite doing roughly $k^2$ times less arithmetic. A related point: inside a depthwise separable block the cheap-looking pointwise step usually dominates both parameters and multiply-adds, because its $C_{\text{in}} \, C_{\text{out}}$ term outweighs the depthwise $k^2 \, C_{\text{in}}$ whenever $C_{\text{out}} > k^2$. This is why later efficient architectures, such as ShuffleNet, go on to attack the pointwise cost as well.
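How strongly the pointwise step dominates can be estimated from the weight counts alone, since the number of input channels cancels. This sketch covers only the parameter and multiply-add share, not latency or memory traffic; the function name and the widths in the loop are assumptions.

```python
from fractions import Fraction


def pointwise_share(k: int, c_out: int) -> Fraction:
    """Fraction of a separable block's weights (and multiply-adds) in the 1x1 step."""
    # C_in * C_out / (k^2 * C_in + C_in * C_out); C_in cancels.
    return Fraction(c_out, k * k + c_out)


# Assumed widths, 3x3 kernel.
for c_out in (4, 9, 64, 256):
    share = pointwise_share(3, c_out)
    print(c_out, share, round(float(share), 3))
# The share is exactly 1/2 when c_out == k*k == 9 and grows towards 1 beyond it.
```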

Net of that caveat, the layer is genuinely smaller, faster, and easier to train: fewer parameters mean a smaller hypothesis space to fit and less memory to move, which also helps convergence.

Depthwise separable is not the same as spatially separable

Two different factorisations share the word separable. The depthwise separable convolution here separates space from channels: a per-channel spatial filter, then a channel mix. A spatially separable convolution instead separates the two spatial axes, replacing a $k \times k$ kernel by a $k \times 1$ column filter followed by a $1 \times k$ row filter (the classical trick that turns a 2D Gaussian into two 1D Gaussians). Spatial separability requires the kernel itself to be rank one and is rarely learned in modern CNNs; depthwise separability is the one that defines the efficient backbones of this section.
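Spatial separability, the other factorisation, is easy to verify on a rank-one kernel. The sketch below uses the integer binomial filter [1, 2, 1] as a stand-in for a 1D Gaussian, which is an assumption made to keep the arithmetic exact; the helper name and the test image are likewise hypothetical.

```python
def correlate2d(img, ker):
    """Cross-correlate a 2D map with a 2D kernel (no padding, stride 1)."""
    kh, kw = len(ker), len(ker[0])
    return [
        [
            sum(img[i + a][j + b] * ker[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(len(img[0]) - kw + 1)
        ]
        for i in range(len(img) - kh + 1)
    ]


col = [1, 2, 1]  # vertical 1D filter (binomial approximation of a Gaussian)
row = [1, 2, 1]  # horizontal 1D filter
kernel_2d = [[c * r for r in row] for c in col]  # rank-one 3x3 kernel (outer product)

img = [
    [1, 2, 3, 4, 5],
    [0, 1, 0, 1, 0],
    [2, 2, 2, 2, 2],
    [5, 4, 3, 2, 1],
]

one_pass = correlate2d(img, kernel_2d)
# Column filter (3x1) first, then row filter (1x3).
two_pass = correlate2d(correlate2d(img, [[c] for c in col]), [row])

print(one_pass == two_pass)  # True
print(len(col) * len(row), "weights versus", len(col) + len(row))  # 9 weights versus 6
```

The equality holds only because the 3 by 3 kernel is an outer product of two 1D filters; a generic learned kernel has no such decomposition.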

Multidomain feature extraction

The saving is the headline, but the factorisation also changes what a layer learns, and this is its second, subtler property. A convolutional network is a learnable feature extractor, and a depthwise separable layer learns those features in two stages: first within each channel in the spatial domain, then across channels. This is the same kind of structured decomposition that a grouped convolution induces, carried down to single channels, with spatial features extracted first and recombined along the channel dimension in a separate step.

Whether this two-stage extraction helps depends on the data. On ordinary RGB images the spatial and channel axes are fairly homogeneous, and depthwise separability is mostly an efficiency trade: a large saving for a small accuracy cost that a deeper or wider model, now affordable within the same budget, tends to recover. Where it genuinely improves the features is in domains whose two axes carry different kinds of structure, so that forcing the network to treat them separately matches the data rather than constraining it.

A concrete instance is electroencephalography (EEG), the recording of electrical activity from electrodes placed at fixed positions on the scalp. Each electrode produces a 1D waveform over time. A depthwise separable design lets the network treat the two axes in separate stages: temporal structure within each electrode’s signal, then spatial structure across electrodes, that is across regions of the cortex. Time and head location are physically different kinds of thing, and handling them separately mirrors that difference rather than fighting it; this is the principle behind EEGNet (Lawhern et al., 2018). A standard convolution would entangle the two from the first layer, learning filters that are less cleanly separated along either axis.
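The two-stage treatment of an EEG recording can be sketched in a few lines. This is a deliberately simplified illustration of the idea in the paragraph above, not EEGNet's actual architecture: it treats each electrode as a channel, applies one temporal filter per electrode, then mixes across electrodes at every time step. All names and values are assumptions.

```python
def temporal_filter(signal, taps):
    """1D cross-correlation of one electrode's signal with its own filter taps."""
    n = len(signal) - len(taps) + 1
    return [sum(signal[t + i] * taps[i] for i in range(len(taps))) for t in range(n)]


def eeg_separable(recording, temporal_taps, spatial_weights):
    """recording: E signals of length T; temporal_taps: E filters of equal length;
    spatial_weights: F rows of E weights (one row per output feature)."""
    # Stage 1 (depthwise): filter each electrode in time, independently.
    filtered = [temporal_filter(sig, taps)
                for sig, taps in zip(recording, temporal_taps)]
    n = len(filtered[0])
    # Stage 2 (pointwise): mix the electrodes at each time step.
    return [
        [sum(w[e] * filtered[e][t] for e in range(len(filtered))) for t in range(n)]
        for w in spatial_weights
    ]


# Toy recording (assumed values): 2 electrodes, 4 time samples.
recording = [[1, 2, 3, 4], [4, 3, 2, 1]]
temporal_taps = [[1, -1], [1, 1]]   # a difference filter and a sum filter
spatial_weights = [[1, 1], [1, 0]]  # two output features
features = eeg_separable(recording, temporal_taps, spatial_weights)
print(features)  # [[6, 4, 2], [-1, -1, -1]]
```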

Recap

The depthwise separable convolution factorises a standard convolution into a per-channel depthwise spatial step and a pointwise channel-mixing step. Two consequences follow:

  • Efficiency. Parameters and multiplications both fall by the factor $1/C_{\text{out}} + 1/k^2$, about an order of magnitude for $k = 3$, giving smaller, faster, easier-to-train models (with the caveat that fewer FLOPs do not translate one-to-one into lower latency).
  • Structured feature extraction. Features are learned in the spatial domain first and recombined across channels second. On homogeneous image data this is mainly a favourable efficiency trade; on data whose axes are heterogeneous, such as the time and electrode-location axes of EEG, the forced separation can match the structure of the problem and improve the features themselves.