Dropout

Dropout is a stochastic regularization technique that differs from L2 regularization in where it acts: L2 adds an explicit penalty to the loss to shrink the weights, while dropout injects noise directly into the activations, by randomly switching off a fraction of hidden units at every training step. The mechanism is architectural rather than analytical: it does not modify the loss at all, but modifies the network whose loss is being evaluated.


1. Training a neural network with dropout

Consider a supervised training pair $(x, y)$, where $x$ denotes the input and $y$ denotes the target.

Before dropout. In the standard setting, one performs a forward pass of the input through the full network, followed by backpropagation to compute the gradient contribution of that example or mini-batch.

After dropout. With dropout, one first samples a random binary mask and temporarily deactivates part of the hidden units. The input and output layers are typically left unchanged, while the hidden computation graph is thinned for that training step.

Reading the diagram

In the diagram, the dropped neurons are still shown in transparency (ghosted). This highlights an important point: the architecture itself has not been destroyed; only the active subnetwork used in the current optimization step has changed.

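In its simplest form, dropout acts on a hidden activation vector

$$a^{(l)} \in \mathbb{R}^{n_l}$$

by sampling independent Bernoulli variables

$$m_i^{(l)} \sim \mathrm{Bernoulli}(p), \qquad i = 1, \dots, n_l,$$

and collecting them into a mask vector

$$m^{(l)} = \big(m_1^{(l)}, \dots, m_{n_l}^{(l)}\big) \in \{0, 1\}^{n_l}.$$

The masked activations are then

$$\tilde a^{(l)} = m^{(l)} \odot a^{(l)},$$

where $\odot$ denotes elementwise multiplication. Equivalently, componentwise,

$$\tilde a_i^{(l)} = m_i^{(l)} \, a_i^{(l)}.$$

Here $p \in (0, 1]$ is the keep probability.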
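The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the names `a`, `m`, `a_tilde` and the value `p = 0.8` are assumptions chosen for the example, and the random seed is fixed only to make the output reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only for reproducibility

p = 0.8                          # keep probability (illustrative value)
a = rng.normal(size=10)          # stand-in for a hidden activation vector

# One independent Bernoulli(p) draw per unit: 1 = keep, 0 = drop.
m = (rng.random(a.shape) < p).astype(a.dtype)

a_tilde = m * a                  # elementwise product of mask and activations

print("mask:      ", m)
print("kept units:", int(m.sum()), "of", a.size)
```

Dropped entries of `a_tilde` are exactly zero, while kept entries are unchanged; no rescaling is applied in this plain form.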

What is the dropout mask?

The vector $m^{(l)}$ is called the dropout mask. It is a binary selector that decides, for the current training step, which activations survive and which are set to zero.

  • if $m_i^{(l)} = 1$, the unit is kept;
  • if $m_i^{(l)} = 0$, the unit is dropped for that step.

So the mask does not change the architecture permanently. It only specifies which part of the layer is active in the current sampled subnetwork.

Keep probability vs. dropout rate

Many libraries expose the dropout rate

$$p_{\text{drop}} = 1 - p$$

instead of the keep probability $p$. For example, dropout=0.5 usually means that each selected activation is zeroed with probability $p_{\text{drop}} = 0.5$, hence kept with probability $p = 1 - 0.5 = 0.5$.

If $p = 0.5$, then on average half of the selected hidden units remain active at each step.
This is a common choice, but it is not the definition of dropout. Different layers may use different keep probabilities, and in classical implementations the input layer, when regularized, is often assigned a milder dropout rate than hidden layers.

Where dropout acts, and one mask per layer

Two points are easy to miss and matter throughout the rest of the note:

  • Dropout is applied per layer, with an independently sampled mask at every layer where it is enabled. The superscript $(l)$ in $m^{(l)}$ is not decorative: dropping a unit at layer $l$ says nothing about whether the corresponding unit at layer $l+1$ is dropped or kept.
  • Dropout is applied to hidden activations, not to the output layer. The output of the network is what the loss reads, and zeroing out output units would corrupt the supervision signal directly. The input layer is sometimes dropped with a milder rate as a form of input noise, but this is optional. The canonical placement is on the activations of one or more hidden layers.

What "selected hidden units" means

A unit is selected for dropout simply by being in a layer that the architect has decided to regularize. Within such a layer, every unit is then independently dropped or kept with the configured Bernoulli rate. Dropout is therefore a per-layer choice at design time (which layers get dropout, with which rate), followed by a per-unit random decision at each training step.
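The two-stage choice described above (which layers, then which units) can be sketched as a forward pass with one independent mask per regularized hidden layer and no mask on the output. The layer sizes, the per-layer keep probabilities in `keep_prob`, and the helper name `forward_train` are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

sizes = [6, 8, 8, 3]          # input, two hidden layers, output (illustrative)
keep_prob = [0.8, 0.5]        # design-time choice: one keep probability per hidden layer
weights = [rng.normal(size=(sizes[i], sizes[i + 1])) for i in range(len(sizes) - 1)]

def forward_train(x):
    """Training-mode forward pass; returns the output and the sampled masks."""
    a, masks = x, []
    for l, W in enumerate(weights):
        z = a @ W
        if l < len(weights) - 1:                 # hidden layer: ReLU, then dropout
            a = np.maximum(z, 0.0)
            m = (rng.random(a.shape) < keep_prob[l]).astype(float)
            masks.append(m)                      # a fresh, independent mask per layer
            a = m * a
        else:                                    # output layer: never masked
            a = z
    return a, masks

x = rng.normal(size=sizes[0])
logits, masks = forward_train(x)
print([m.astype(int).tolist() for m in masks])
```

Each call draws new masks, and the mask of the first hidden layer carries no information about the mask of the second.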


2. Training procedure with dropout

Once the mask has been sampled, the forward and backward passes are executed on the resulting thinned network.

  1. Sample a fresh binary mask for the chosen layer activations.
  2. Zero the units selected by the mask.
  3. Run forward propagation and backpropagation through this thinned network.
  4. Update the shared parameters using the resulting gradient estimate.

Then the process is repeated with a new random mask at the next optimization step.
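The four steps can be sketched as a hand-written training loop for a one-hidden-layer regression network in NumPy. Everything concrete here is an assumption made for the example: the toy data, the sizes, the learning rate `lr`, the `params` dictionary and the `train_step` helper. The mask is drawn per example within the mini-batch, which is how most libraries behave, and the plain (unscaled) form of dropout is used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem (made up for illustration): 64 examples, 5 features.
X = rng.normal(size=(64, 5))
y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(64, 1))

n_hidden = 32
p = 0.5      # keep probability
lr = 0.02    # learning rate (illustrative)
params = {
    "W1": rng.normal(scale=0.5, size=(5, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(scale=0.5, size=(n_hidden, 1)),
    "b2": np.zeros(1),
}

def train_step(X, y):
    # 1. Sample a fresh binary mask: one entry per example and hidden unit.
    m = (rng.random((X.shape[0], n_hidden)) < p).astype(float)

    # 2. Forward pass through the thinned network.
    z1 = X @ params["W1"] + params["b1"]
    a1 = np.maximum(z1, 0.0)            # ReLU
    a1_tilde = m * a1                   # dropped units output exactly zero
    y_hat = a1_tilde @ params["W2"] + params["b2"]
    loss = np.mean((y_hat - y) ** 2)

    # 3. Backward pass: the same mask blocks gradients through dropped units.
    g_out = 2.0 * (y_hat - y) / y.size
    grads = {"W2": a1_tilde.T @ g_out, "b2": g_out.sum(axis=0)}
    g_z1 = (g_out @ params["W2"].T) * m * (z1 > 0)
    grads["W1"] = X.T @ g_z1
    grads["b1"] = g_z1.sum(axis=0)

    # 4. Update the shared parameters in place.
    for name in params:
        params[name] -= lr * grads[name]
    return loss

losses = [train_step(X, y) for _ in range(300)]
print(f"mean loss, first 10 steps: {np.mean(losses[:10]):.3f}")
print(f"mean loss, last 50 steps:  {np.mean(losses[-50:]):.3f}")
```

Note that there is a single `params` dictionary: every sampled subnetwork reads from and writes to the same weights.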

Conceptually, the network is therefore trained under a continuously changing computational graph.
The learned parameters must work well not for one single fixed hidden configuration, but for many randomly perturbed configurations.

Shared parameters

A subtle but crucial point is that dropout does not train a completely different set of parameters for every mask. All these thinned networks share weights and biases. That weight sharing is what makes dropout computationally feasible.

Mental model

A useful mental model is: training with dropout means repeatedly training one shared network under many random partial failures. The network is therefore pushed to distribute information more robustly across its hidden representation.


3. Test-time inference

Test-time rule

At inference time, one does not sample dropout masks. The model is evaluated as a deterministic network.

This creates an immediate question: during training, each unit is active only with probability $p$, while at test time all units are present.
Without correction, the activations would be systematically larger at inference than during training.

The mismatch, made concrete

Consider a hidden unit with activation $a_i$ that, during training under dropout with keep probability $p = 0.5$, is kept on average half the time. The next layer receives the contribution $m_i a_i$, which is zero half the time and $a_i$ the other half, with average $a_i / 2$. The downstream weights are tuned, by gradient descent, to operate on this halved average input.

At test time, with dropout disabled, the unit is always on: the downstream layer suddenly receives $a_i$, twice the average value the weights were tuned for. The pre-activation of the next layer is therefore systematically about $2\times$ larger than it ever was during training, and the network operates in a regime it has never seen, leading to degraded or unpredictable predictions.

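For the classical formulation

$$\tilde a^{(l)} = m^{(l)} \odot a^{(l)}, \qquad m_i^{(l)} \sim \mathrm{Bernoulli}(p),$$

one compensates by scaling activations by $p$ at test time, or equivalently by scaling the outgoing weights by $p$.

Why does this make sense? Because

$$\mathbb{E}\big[\tilde a^{(l)}\big] = \mathbb{E}\big[m^{(l)}\big] \odot a^{(l)} = p \, a^{(l)}.$$

So the deterministic test-time scaling aligns the expected activation level seen during training with the one used during inference.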
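A quick numerical check of this expectation argument, as a sketch under assumed values: the activation vector `a`, the keep probability `p = 0.5` and the number of sampled masks are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                                   # keep probability
a = np.array([1.0, 2.0, 3.0, 4.0])        # fixed activations (illustrative)

# Many independent training-time masks, one row per sampled mask.
masks = rng.random((200_000, a.size)) < p
train_mean = (masks * a).mean(axis=0)     # Monte Carlo estimate of E[m * a]

test_unscaled = a                         # dropout simply switched off
test_scaled = p * a                       # classical test-time correction

print("average seen in training:", np.round(train_mean, 3))
print("test time, unscaled:     ", test_unscaled)
print("test time, scaled by p:  ", test_scaled)
```

The training-time average lands on `p * a`, not on `a`, which is the mismatch the scaling removes.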

Inverted dropout

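Modern libraries usually implement inverted dropout instead:

$$\tilde a^{(l)} = \frac{m^{(l)} \odot a^{(l)}}{p}, \qquad m_i^{(l)} \sim \mathrm{Bernoulli}(p).$$

In that convention,

$$\mathbb{E}\big[\tilde a^{(l)}\big] = a^{(l)},$$

so no additional scaling is needed at test time.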

This is the most common implementation in practice.
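A minimal sketch of inverted dropout as a standalone function, assuming a hypothetical signature `inverted_dropout(a, p_keep, training, rng)`; real libraries differ in naming and typically take the drop probability instead.

```python
import numpy as np

def inverted_dropout(a, p_keep, training, rng):
    """Inverted dropout: rescale survivors by 1 / p_keep during training only."""
    if not training or p_keep == 1.0:
        return a                          # inference: identity, no scaling needed
    keep = rng.random(a.shape) < p_keep   # Bernoulli(p_keep) mask
    return np.where(keep, a / p_keep, 0.0)

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0])

eval_out = inverted_dropout(a, 0.8, training=False, rng=rng)
train_mean = np.mean(
    [inverted_dropout(a, 0.8, training=True, rng=rng) for _ in range(20_000)],
    axis=0,
)

print("eval output:        ", eval_out)
print("mean training output:", np.round(train_mean, 3))
```

Because the correction is folded into the training pass, the same weights can be used unchanged at inference.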

Why expectation matching is not the whole story

The expectation argument is exact only up to the next affine transformation.

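Suppose the masked activations are fed into the next layer through

$$z^{(l+1)} = W^{(l+1)} \tilde a^{(l)} + b^{(l+1)}.$$

Then expectation passes through this affine map:

$$\mathbb{E}\big[z^{(l+1)}\big] = W^{(l+1)} \, \mathbb{E}\big[\tilde a^{(l)}\big] + b^{(l+1)}.$$

So if $\mathbb{E}\big[\tilde a^{(l)}\big] = p \, a^{(l)}$, replacing the random mask by its mean really does give the correct expected pre-activation at the next layer.

The difficulty starts after the nonlinearity:

$$a^{(l+1)} = \phi\big(z^{(l+1)}\big).$$

For a generic nonlinear activation $\phi$, one has in general

$$\mathbb{E}\big[\phi\big(z^{(l+1)}\big)\big] \neq \phi\big(\mathbb{E}\big[z^{(l+1)}\big]\big).$$

So the deterministic scaled network need not equal the exact average prediction of all dropout subnetworks once further nonlinear layers are involved.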

This is the key point: dropout scaling matches the average behavior exactly at the affine level, but in a deep nonlinear network the full test-time model is usually only an approximation to exact ensemble averaging.
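The gap can be made concrete by enumerating every mask of a tiny layer. This is an illustrative sketch: the three activations `a`, the weights `w`, the zero bias and the choice of ReLU are assumptions picked so that the numbers are easy to follow by hand.

```python
import itertools
import numpy as np

p = 0.5                                   # keep probability
a = np.array([1.0, 2.0, 3.0])             # activations entering the dropout layer
w = np.array([1.0, 1.0, -1.0])            # weights of one downstream unit
b = 0.0

def relu(z):
    return np.maximum(z, 0.0)

exp_z = 0.0       # exact E[z] over all masks
exp_relu = 0.0    # exact E[relu(z)] over all masks
for bits in itertools.product([0, 1], repeat=a.size):
    m = np.array(bits, dtype=float)
    prob = np.prod(np.where(m == 1, p, 1 - p))   # probability of this mask
    z = w @ (m * a) + b
    exp_z += prob * z
    exp_relu += prob * relu(z)

mean_net = relu(w @ (p * a) + b)          # deterministic scaled network

print("E[z]        =", exp_z)             # 0.0: matches w @ (p * a) + b exactly
print("E[relu(z)]  =", exp_relu)          # 0.75
print("relu(E[z])  =", mean_net)          # 0.0
```

The affine level agrees exactly, while the post-nonlinearity values differ, which is the approximation the scaled network makes.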

In practice the approximation is excellent

The theoretical gap above does not undermine the practical effectiveness of the deterministic mean network: empirically, it is a very good test-time predictor across a wide range of deep architectures. In the experiments of Srivastava et al. (2014), explicit Monte Carlo averaging needed on the order of tens of sampled masks to match it, and improved on it only marginally beyond that point. In trained networks, the downstream nonlinearities rarely push the approximation noticeably off. Inverted dropout has therefore been the default implementation for over a decade across MLPs, CNNs, RNNs and Transformers.


4. The magic behind dropout

Why should dropout help?

Why should randomly deleting part of the network during training improve generalization instead of simply making optimization harder?

The answer is best understood from two complementary viewpoints:

  • dropout behaves like a computationally cheap form of ensemble averaging over a huge family of subnetworks;
  • dropout discourages fragile co-adaptations between units and forces the network to learn features that remain useful under random perturbations.

Both viewpoints matter, but the ensemble perspective is the deeper one.

4.1. Dropout as approximate ensemble averaging

Each sampled mask defines a thinned network, denoted by

$$f(x; \theta, m),$$

where $\theta$ is the collection of shared parameters.

In this high-level discussion, the layer superscript is suppressed and $m$ denotes the sampled dropout pattern that defines the current thinned subnetwork.

If a network has $n$ droppable units, then, in principle, there are exponentially many ($2^n$) possible masks and therefore exponentially many possible thinned subnetworks.
This is the source of the famous statement that dropout is related to an ensemble of exponentially many networks.

However, this statement needs to be interpreted carefully.

An explicit ensemble would require:

  • training each network separately,
  • storing a separate parameter set for each network,
  • averaging their predictions at inference.

Dropout does not do that.

Instead, dropout trains a single parameterized system whose many subnetworks all share weights.
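At every optimization step, SGD samples one mask and therefore one thinned network, producing a stochastic estimate of the objective

$$J(\theta) = \mathbb{E}_{(x, y)} \, \mathbb{E}_{m} \Big[ \mathcal{L}\big(f(x; \theta, m), \, y\big) \Big],$$

where $f(x; \theta, m)$ is the output of the thinned network selected by the mask $m$ and $\mathcal{L}$ is the per-example loss.

In plain words, the same shared parameter set is trained so that it performs well on average across many randomly sampled subnetworks, rather than only for one fixed network realization.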

What is ensemble-like, exactly?

The ensemble flavor comes from the fact that many different subnetworks contribute to learning and that the final predictor aims to reflect their collective behavior.

The mechanism is not classical bagging:

  • the subnetworks are not independent;
  • they are trained on the same data distribution;
  • and, most importantly, they share parameters.

Why this matters

This is the real reason dropout is powerful: it borrows part of the statistical benefit of ensemble methods without paying the prohibitive cost of training and storing a large collection of separate models.

Weight sharing is the decisive idea: it compresses an exponential family of subnetworks into a single trainable model.
What that shared model actually computes at test time is the question the next part makes precise.

4.2. What happens at inference: exact statement vs. approximation

Each dropout mask specifies which units are kept and which are dropped.
Once the mask is fixed, the original network turns into one specific thinned subnetwork, which produces its own predictive distribution

$$P(y \mid x; \theta, m).$$

What does "the average prediction of all dropout subnetworks" mean?

This expression refers to the following ideal thought experiment: for the same input $x$, evaluate the prediction of every subnetwork generated by every possible mask $m$, and then combine all those predictions according to the probability of sampling each mask.

If the ensemble interpretation were taken literally, the procedure would be:

  1. enumerate all possible masks $m$;
  2. run the corresponding subnetwork for each mask;
  3. combine all the resulting predictive distributions.

That would be a genuine ensemble over all subnetworks induced by dropout.

Here the word average is being used informally to mean the collective prediction obtained from all subnetworks. It does not automatically mean a simple arithmetic average of class probabilities. Indeed, in the exact special case discussed below, the correct combination turns out to be a normalized geometric mean.

Why the literal ensemble is not used at test time

The literal ensemble view is computationally hopeless in realistic networks. If there are many droppable units, the number of possible masks grows exponentially. So test-time dropout does not explicitly evaluate and combine all subnetworks.

Instead, it uses a single deterministic mean network:

  • in the classical presentation, one keeps all units and scales activations or outgoing weights by the keep probability;
  • in inverted dropout, the scaling is already done during training, so the test-time network is simply the full deterministic network.

What is the test-time network actually computing?

This is the central mathematical question behind the ensemble interpretation of dropout.

  • Exact special case. If, after dropout, the masked representation is sent directly to one final linear classifier followed by a softmax, then the deterministic mean network is not merely heuristic: it computes exactly the normalized geometric mean of the predictive distributions of all thinned subnetworks.
  • General deep nonlinear case. If additional nonlinear transformations appear after the dropped units, this exact identity no longer holds. In that setting, the deterministic network should be understood as a tractable approximation to full model averaging, not as an exact equality.

Why does a normalized geometric mean appear here?

For positive numbers $q_1, \dots, q_N$, the arithmetic mean is

$$\frac{1}{N} \sum_{k=1}^{N} q_k,$$

whereas the geometric mean is

$$\Big( \prod_{k=1}^{N} q_k \Big)^{1/N}.$$

In the dropout setting, each thinned subnetwork assigns a probability $P(y = c \mid x; \theta, m)$ to class $c$. In the exact special case above, these probabilities are combined multiplicatively, not additively. That is why the relevant object is a geometric mean rather than an arithmetic mean.

This is why the paper speaks of a normalized geometric mean: one first combines the subnetwork probabilities multiplicatively, then renormalizes across classes so that the final outputs still form a proper probability distribution.

Intuitively, this favors classes that are supported consistently across many subnetworks, not classes that receive high probability from only a small minority of them.
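The exact special case can be verified by brute force on a small example: dropout on a representation followed directly by a linear layer and a softmax. The sketch below enumerates all masks, weights each subnetwork by its mask probability, and compares the normalized (probability-weighted) geometric mean with the deterministic network whose inputs are scaled by the keep probability. The sizes, the random weights and the value `p = 0.7` are assumptions made for illustration.

```python
import itertools
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n, C, p = 4, 3, 0.7                       # droppable units, classes, keep probability
a = rng.normal(size=n)                    # representation fed to the dropout layer
W = rng.normal(size=(C, n))               # final linear classifier
b = rng.normal(size=C)

log_gm = np.zeros(C)                      # sum over masks of P(m) * log P(y | x, m)
arith = np.zeros(C)                       # sum over masks of P(m) * P(y | x, m)
for bits in itertools.product([0, 1], repeat=n):
    m = np.array(bits, dtype=float)
    prob = np.prod(np.where(m == 1, p, 1 - p))
    probs = softmax(W @ (m * a) + b)
    log_gm += prob * np.log(probs)
    arith += prob * probs

ngm = np.exp(log_gm)
ngm /= ngm.sum()                          # renormalize across classes
mean_net = softmax(W @ (p * a) + b)       # deterministic scaled network

print("normalized geometric mean:", np.round(ngm, 6))
print("deterministic mean network:", np.round(mean_net, 6))
print("arithmetic mean:           ", np.round(arith, 6))
```

The first two lines agree to numerical precision; the arithmetic mean is a valid distribution as well, but it is generally a different one.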

Key takeaway

This point is easy to misunderstand: dropout is ensemble-like in general, but the clean exact formula belongs only to a restricted architectural setting. Outside that setting, the ensemble interpretation remains conceptually useful, but mathematically approximate.

This is why the claim “dropout is just averaging many networks” is directionally correct but mathematically incomplete.

It is more precise to say:

Precise ensemble interpretation

Dropout trains a weight-sharing family of thinned networks and, at test time, replaces the intractable full model average with a deterministic approximation obtained by scaling activations or weights appropriately.

4.3. What the geometric mean buys

The geometric mean interpretation is not a cosmetic detail. It reveals that the deterministic network is not simply “voting” like a naive arithmetic average.

If many subnetworks assign very low probability to a class, the geometric mean suppresses that class strongly.
So the final predictor favors outputs that are consistently supported across many subnetworks, not merely rescued by a few highly confident ones.
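A small numerical illustration of this suppression effect, using three hypothetical subnetworks and three classes. The probabilities in `preds` are invented for the example: class 0 receives moderate support from every subnetwork, while class 1 is strongly supported by two subnetworks and almost ruled out by the third.

```python
import numpy as np

# Rows: subnetworks; columns: classes 0, 1, 2 (each row sums to 1).
preds = np.array([
    [0.30, 0.69, 0.01],
    [0.30, 0.69, 0.01],
    [0.30, 0.01, 0.69],
])

arith = preds.mean(axis=0)                       # arithmetic mean per class
geo = np.exp(np.log(preds).mean(axis=0))         # geometric mean per class
ngm = geo / geo.sum()                            # normalized geometric mean

print("arithmetic mean:          ", np.round(arith, 3), "-> class", arith.argmax())
print("normalized geometric mean:", np.round(ngm, 3), "-> class", ngm.argmax())
```

The arithmetic mean picks class 1, whereas the normalized geometric mean picks class 0, the only class that no subnetwork rejects.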

This helps explain why dropout often yields predictions that are less tied to brittle, overly specialized internal feature combinations.

4.4. Dropout vs. bagging

It is useful to compare dropout with classical ensemble methods such as bagging.

  • Bagging trains many independent models, often on bootstrap-resampled datasets, and averages their predictions.
  • Dropout trains many implicitly defined subnetworks inside one shared model and averages them only approximately.
  • Bagging spends memory and computation on model independence.
  • Dropout gives up independence, but gains enormous efficiency through parameter sharing.

So dropout should be viewed as an ensemble-inspired regularizer, not as a literal replacement for a fully trained ensemble.


5. A second viewpoint: preventing co-adaptation

The original motivation behind dropout was also expressed in terms of preventing complex co-adaptations among feature detectors.

Without dropout, a hidden unit may become useful only because several specific partner units are also present and tuned in a highly specialized way.
This can produce brittle internal representations: the network performs well on the training set, but relies on delicate feature interactions that do not generalize.

With dropout, a unit cannot safely rely on the presence of particular companions, because those companions may be absent at the next step.
Therefore, the unit is pressured to learn features that remain useful across many random contexts.

Core regularizing effect

This is one of the core regularizing effects of dropout: it discourages a representation in which correctness depends on a narrow, fragile coalition of hidden units.

This viewpoint also clarifies the conceptual difference from L2 regularization.

  • L2 regularization directly discourages large weights.
  • Dropout directly discourages brittle dependence on specific hidden pathways.

Both can improve generalization, but they do so through different mechanisms.

The two viewpoints reconciled

The ensemble view (Section 4) and the co-adaptation view above are not competing explanations: they are two descriptions of the same phenomenon at different levels of abstraction.

  • At the statistical level, dropout trains a weight-sharing family of subnetworks and aggregates them at test time. This is the ensemble story.
  • At the representational level, the constraint that the shared parameters must work well across many random subnetworks forces each unit to be individually useful, because no individual partner can be relied upon. This is the anti-co-adaptation story.

The ensemble framing is more mathematically rigorous; the co-adaptation framing is more directly tied to how features end up being learned. Both arrive at the same conclusion: dropout pushes the network toward robust, distributed, individually-useful representations, not brittle conjunctions of specific units.


6. Dropout in PyTorch

In PyTorch, dropout is a layer that can be inserted anywhere in the forward pass, exactly like a linear or activation layer. The layer implements inverted dropout by default, and its behaviour is controlled by the model’s training/evaluation mode rather than by an explicit argument.

import torch
import torch.nn as nn
 
class MLPWithDropout(nn.Module):
    def __init__(self, n_in=784, n_hidden=512, n_out=10, p=0.5):
        super().__init__()
        self.fc1   = nn.Linear(n_in, n_hidden)
        self.drop1 = nn.Dropout(p=p)           # applied to fc1's activation
        self.fc2   = nn.Linear(n_hidden, n_hidden)
        self.drop2 = nn.Dropout(p=p)           # applied to fc2's activation
        self.fc3   = nn.Linear(n_hidden, n_out)  # output: NO dropout here
 
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.drop1(x)                      # drops with probability p
        x = torch.relu(self.fc2(x))
        x = self.drop2(x)
        return self.fc3(x)                     # logits, no dropout
 
model = MLPWithDropout(p=0.5)
 
# Training mode: dropout is active, each forward pass uses a fresh random mask
model.train()
x = torch.randn(32, 784)
logits_train = model(x)
 
# Evaluation mode: dropout is disabled, the network is deterministic
model.eval()
with torch.no_grad():
    logits_eval = model(x)

Subtle bugs to look out for

Two points deserve emphasis:

  • The p argument is the drop probability, not the keep probability. nn.Dropout(p=0.5) zeros each input element with probability p (and keeps each with probability 1 - p). This is the opposite convention from the variable $p$ used in the theory above; remembering this avoids subtle bugs.
  • model.train() and model.eval() toggle dropout on and off across the entire network. Forgetting to call model.eval() before validation or inference is one of the most common deep-learning bugs: the model produces stochastic predictions that vary across runs, and the reported metrics fluctuate accordingly.

Typical dropout rates

The rates below are starting points that work in most settings; they can be tuned by validation.

| layer location | typical drop probability | rationale |
|---|---|---|
| input layer | to (optional) | input noise; too aggressive a rate corrupts the input itself |
| hidden layers (MLP) | to | the canonical Hinton/Srivastava regime |
| hidden layers (CNN, fully connected head) | to | applied only to the dense head, rarely to convolutional layers |
| convolutional feature maps | usually 0, or use Spatial Dropout | element-wise dropout on a convolutional map drops a few pixels per channel rather than meaningful features |
| Transformer / attention layers | 0.1 on attention weights and residual paths | the value used in most published Transformer architectures |
| recurrent layers | use variational dropout (Gal & Ghahramani 2016) | applying naive dropout at every time step destroys temporal continuity; the right construction shares the mask across time |
| output layer | always 0 | dropping output logits corrupts the supervision; never done in practice |

Interaction with BatchNorm

BatchNorm and Dropout do not play well together

Combining BatchNorm with dropout in the same network requires care. Two related issues arise:

  • Variance shift. Dropout multiplies activations by a Bernoulli mask, changing their variance during training. BatchNorm’s running statistics are computed under this perturbed distribution; at test time, with dropout off and BatchNorm using its accumulated statistics, the moments no longer match what BatchNorm expects.
  • Order matters. Putting dropout before BatchNorm causes BatchNorm to normalize the noisy (dropped) activations; putting dropout after BatchNorm injects noise into the normalized output. Neither order is clearly “correct”, and naive combinations have been reported to degrade performance compared with using either technique alone.

The empirical resolution in modern architectures: use one or the other, not both, in the same block. CNNs that use BatchNorm (ResNet, EfficientNet) typically drop the use of dropout in convolutional layers entirely, sometimes retaining a single dropout layer right before the final classifier.
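The variance shift can be observed directly on synthetic data. The sketch below compares the variance of a downstream pre-activation under inverted dropout (training) with the variance when dropout is switched off (inference). The Gaussian activations, the weight vector `w` and the keep probability are assumptions for illustration; a normalization layer that accumulated its statistics under the first regime would be applied to the second.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                                            # keep probability
a = rng.normal(loc=1.0, scale=1.0, size=(20_000, 64))   # activations (illustrative)
w = rng.normal(size=64) / 8.0                      # weights of one downstream unit

keep = rng.random(a.shape) < p
z_train = np.where(keep, a / p, 0.0) @ w           # inverted dropout active
z_test = a @ w                                     # dropout disabled

var_train, var_test = z_train.var(), z_test.var()
print(f"variance with dropout (training): {var_train:.3f}")
print(f"variance without dropout (test):  {var_test:.3f}")
print(f"ratio: {var_train / var_test:.2f}")
```

Inverted dropout preserves the mean of the pre-activation but not its variance, so the two regimes present different second moments to whatever follows.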

Important

Transformers, which use LayerNorm (not BatchNorm), pair freely with dropout: LayerNorm normalizes per-sample, so it is not affected by the variance shift across the batch.

Variants of dropout

Several variants address specific architectural settings:

  • DropConnect (Wan et al., 2013) drops individual weights rather than activations. The effect is similar but the noise is structured differently; rarely used in modern practice.
  • Spatial Dropout (Tompson et al., 2014) drops entire feature maps in convolutional layers, rather than individual pixels. Better suited to CNNs because spatially adjacent pixels of the same feature map are highly correlated.
  • Variational Dropout (Gal & Ghahramani, 2016) reuses the same mask across all time steps of a recurrent network, preserving temporal continuity. Standard practice for dropout in RNNs.
  • DropPath / Stochastic Depth (Huang et al., 2016) drops entire residual blocks. Used in very deep ResNets and Transformers as a regularizer that interacts well with skip connections.
  • Attention dropout (specific to Transformers) applies dropout to the attention weights after softmax, before they multiply the values. A standard component of most published Transformer implementations.
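Two of these variants differ from element-wise dropout only in the shape of the mask, which broadcasting makes easy to sketch in NumPy. The function names `spatial_dropout` and `time_shared_dropout` and the tensor layouts are assumptions for this example; PyTorch provides channel-wise dropout as nn.Dropout2d, whereas the time-shared mask usually has to be written by hand or taken from a recurrent-layer option.

```python
import numpy as np

def spatial_dropout(x, p_keep, rng):
    """Drop whole feature maps. x has shape (batch, channels, height, width)."""
    keep = rng.random((x.shape[0], x.shape[1], 1, 1)) < p_keep   # one draw per map
    return np.where(keep, x / p_keep, 0.0)

def time_shared_dropout(x, p_keep, rng):
    """Reuse one mask at every time step. x has shape (batch, time, features)."""
    keep = rng.random((x.shape[0], 1, x.shape[2])) < p_keep      # no time dimension
    return np.where(keep, x / p_keep, 0.0)

rng = np.random.default_rng(0)

maps = spatial_dropout(np.ones((2, 5, 4, 4)), 0.5, rng)
print("per-map values:", maps[:, :, 0, 0])        # each map is all 0.0 or all 2.0

seq = time_shared_dropout(np.ones((2, 7, 3)), 0.5, rng)
print("same mask at t=0 and t=6:", np.array_equal(seq[:, 0], seq[:, 6]))
```

The size-1 axes in the mask shape are what turn independent per-element noise into structured noise over a whole map or a whole sequence.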

Monte Carlo Dropout: dropout at test time, on purpose

A second life for dropout: uncertainty estimation

The test-time rule “disable dropout” is the right default for point predictions, but it discards a useful source of information. Monte Carlo Dropout (Gal & Ghahramani, 2016) flips the rule: at inference time, keep dropout on, run the same input through the network $K$ times, and use the distribution of the $K$ predictions as a measure of the model’s uncertainty.

The mean of the predictions is the point estimate; the variance is an approximation to epistemic uncertainty, the part of the prediction uncertainty that comes from not knowing the right model rather than from inherent noise in the data. The theoretical justification rests on a Bayesian interpretation of dropout as approximate variational inference over a posterior on the weights.

Concretely, in PyTorch one keeps the model in training mode at inference time to retain dropout, then samples:

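K = 30                                # number of stochastic passes (illustrative value)
model.train()                         # keep dropout active
with torch.no_grad():                 # gradients are not needed at inference
    preds = torch.stack([model(x) for _ in range(K)])  # shape (K, batch, classes)
mean = preds.mean(dim=0)              # ensemble mean
std  = preds.std(dim=0)               # per-example uncertainty estimate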

This is widely used in medical imaging, autonomous driving and active learning, where knowing when the model does not know is as important as the prediction itself.


7. Practical impact

The historical importance of dropout is difficult to overstate. The original JMLR paper reported strong improvements across a wide range of tasks, including:

  • handwritten digit recognition,
  • speech recognition,
  • image classification,
  • and other high-capacity neural-network settings where overfitting was severe.

Even when the absolute numerical gain on a benchmark appears modest, that gain can be extremely meaningful when it is achieved by a regularizer that is simple, architecture-agnostic, and broadly applicable.

When dropout is especially natural

Dropout is especially natural when a model has enough capacity to memorize the training set and one wants a simple way to inject stochastic robustness directly into the hidden representation.

Practical takeaway

Dropout is most compelling when model capacity is high, overfitting is a real risk, and one wants a regularizer that acts directly on the learned representation rather than only through an explicit penalty on the weights.

For these reasons, dropout became one of the canonical regularization techniques in deep learning.

8. Primary sources

Dropout was introduced by Hinton et al. (2012) and developed by Srivastava et al. (2014). Both are collected in Optimization and Regularization.