Intro
Goal: Achieving Small Weights
In supervised learning, if one wants the model to prefer small weights, the cleanest way to express that preference is to modify the loss function itself.
The loss is the mathematical object that defines what training is trying to optimize.
Solution: Regularization
A regularization term can be added to the loss function to encourage the learning of small weights. While several regularization strategies exist (for example L1 regularization), the present note focuses on L2 regularization. In plain gradient descent and SGD, the resulting update rule can be rewritten as weight decay; later sections explain precisely when that equivalence holds and when it breaks.
Why prefer small weights at all
The intuition that small weights correspond to simpler, more generalizable models is examined critically in the note “Should small weights be preferred?”. The short answer: small weights make the network less sensitive to local noise in the inputs, so it is harder for the network to memorize the idiosyncrasies of the training set instead of the underlying regularity. This note takes that intuition as given and focuses on how to push the weights small in practice. The orthogonal mechanism of Dropout achieves a related effect by injecting noise into the activations rather than penalizing the weights.
Per-Sample L2-Regularized Loss
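| Per-sample loss function | L2-regularized expression | Description |
|---|---|---|
| Cross-entropy | $\mathcal{L}_x = \mathrm{CE}(y, a) + \frac{\lambda}{2} \sum_w w^2$ | The first term is the standard cross-entropy between the target $y$ and the network output $a$. The second term is the sum of the squares of all network weights, scaled by $\frac{\lambda}{2}$. |
| MSE | $\mathcal{L}_x = \frac{1}{2} \lVert y - a \rVert^2 + \frac{\lambda}{2} \sum_w w^2$ | The quadratic (MSE) loss can also be regularized in the same way. |
| L2 regularized form | $\mathcal{L}_x = \mathcal{L}_{0,x} + \frac{\lambda}{2} \sum_w w^2$ | $\mathcal{L}_{0,x}$ is the unregularized loss function. $\frac{\lambda}{2} \sum_w w^2$ is called the regularization term. |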
Regularization parameter
$\lambda > 0$ is the regularization parameter that controls the trade-off between fitting the training data and keeping the weights small.
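The per-sample expression can be sketched in a few lines of plain Python. The function names and the numbers are illustrative assumptions, not part of the note; the unregularized loss is passed in as a plain number rather than computed from a network.

```python
def l2_penalty(weights, lam):
    """Return (lam / 2) * sum of squared weights (biases are not included)."""
    return 0.5 * lam * sum(w * w for w in weights)


def per_sample_regularized_loss(unreg_loss, weights, lam):
    """Unregularized per-sample loss plus the L2 penalty."""
    return unreg_loss + l2_penalty(weights, lam)


# Illustrative numbers only: three weights and a made-up unregularized loss.
weights = [0.5, -1.0, 2.0]
for lam in (0.0, 0.1, 1.0):
    total = per_sample_regularized_loss(0.3, weights, lam)
    print(f"lambda={lam}: regularized loss = {total:.4f}")
```

With `lam = 0` the regularized loss reduces to the unregularized one; increasing `lam` raises the cost of keeping the same weights.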
Assumptions Behind the above formulas
The previous formulas treat the training example as fixed and view the loss as a function of the network parameters. This is only a notational simplification: the purpose is to isolate how the regularization term changes the objective with respect to the weights, not to make any claim about the stochastic nature of the dataset.
Three loss objects appear in this note
The notation distinguishes three closely related quantities throughout the discussion. Keeping them straight from the start avoids confusion later.
| Symbol | Meaning | Regularization coefficient |
|---|---|---|
| $\mathcal{L}_{0,x}$ | unregularized loss for a single training example $x$ | none |
| $\mathcal{L}_0$ | unregularized loss averaged over the full dataset of $n$ examples | none |
| $\mathcal{L}_x$ | per-sample loss with regularization, used in this section only as a pedagogical device | $\frac{\lambda}{2}$ |
| $\mathcal{L}$ | full-dataset regularized loss, the actual training objective | $\frac{\lambda}{2n}$ |
| $\mathcal{L}_B$ | mini-batch stochastic approximation of $\mathcal{L}$, used at every SGD step | $\frac{\lambda}{2n}$ |

The factor $\frac{\lambda}{2}$ in the per-sample form becomes $\frac{\lambda}{2n}$ in the full-dataset form: the same regularization term is added once to the total loss, so dividing the total loss by $n$ to get the average also divides the regularization term by $n$. The next subsection makes this transition explicit.
Why the Square in L2 Regularization?
The squared term is used because:
- It is never negative: both positive and negative weights increase the cost, and only a zero weight contributes nothing.
- It avoids cancellations: without squaring, weights of opposite sign could offset one another in the penalty.
- It penalizes large weights more strongly: squaring amplifies large values, discouraging the network from letting any single weight grow too much.
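The three properties can be checked on toy weight vectors. This is a minimal sketch with made-up numbers; `signed_sum` is a hypothetical "penalty without squaring" introduced only for comparison.

```python
def signed_sum(weights):
    """Naive penalty without squaring: opposite signs cancel."""
    return sum(weights)


def squared_sum(weights):
    """L2-style penalty: every nonzero weight adds a positive amount."""
    return sum(w * w for w in weights)


opposite = [3.0, -3.0]            # large weights of opposite sign
spread = [1.0, 1.0, 1.0, 1.0]     # total magnitude 4, spread evenly
peaked = [4.0, 0.0, 0.0, 0.0]     # total magnitude 4, in a single weight

print(signed_sum(opposite), squared_sum(opposite))
print(squared_sum(spread), squared_sum(peaked))
```

The signed sum of `opposite` vanishes although both weights are large, while the squared sum does not. The `peaked` vector is penalized more than `spread` even though the total magnitude is the same.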
Why Does It Help Against Overfitting?
Since training aims to minimize the cost function, the term $\frac{\lambda}{2} \sum_w w^2$ is also minimized.
This leads the network to prefer solutions with overall smaller weights, which are often more stable and less prone to overfitting (see the previous note, in particular the discussion of the Lipschitz bound and how small weights limit the network’s sensitivity to input perturbations).
General L2-Regularized Loss
So far, the L2-regularized loss has been expressed with respect to a single training sample $x$.
In practice, however, the loss is computed over the entire training set of $n$ samples by averaging across them.
This leads to the following general form:
$$\mathcal{L} = \frac{1}{n} \sum_x \mathcal{L}_{0,x} + \frac{\lambda}{2n} \sum_w w^2$$
- The first term corresponds to the average loss over the entire dataset.
- The second term is the sum of the squares of all the network’s weights. It penalizes large weights and is scaled by $\frac{\lambda}{2n}$ so that the regularization strength is comparable in magnitude to the averaged loss term.
Why does $\frac{\lambda}{2}$ become $\frac{\lambda}{2n}$?
The per-sample formula adds the whole regularization term to a single sample’s loss. To recover the full-dataset objective, both pieces should be added across the dataset and then averaged:
$$\frac{1}{n} \sum_x \mathcal{L}_x = \frac{1}{n} \sum_x \left( \mathcal{L}_{0,x} + \frac{\lambda}{2} \sum_w w^2 \right)$$
The regularization term does not depend on the sample $x$, so it factors out of the sum:
$$\frac{1}{n} \sum_x \mathcal{L}_x = \frac{1}{n} \sum_x \mathcal{L}_{0,x} + \frac{\lambda}{2} \sum_w w^2$$
Strictly speaking, this would give back $\frac{\lambda}{2}$, not $\frac{\lambda}{2n}$. The conventional further division by $n$ in the regularization term, $\frac{\lambda}{2n}$, comes from the desire to keep $\lambda$ on a scale independent of the dataset size: with $\frac{\lambda}{2n}$, the per-step gradient of the regularization term, $\frac{\lambda}{n} w$, has the same order of magnitude as a per-step gradient of the averaged data loss. Without this further division, a larger dataset would effectively make the regularization weaker relative to the data term, and $\lambda$ would have to be re-tuned. The factor $\frac{1}{n}$ is the price paid for making $\lambda$ a transferable hyperparameter.
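The difference between the two coefficients can be checked numerically. The sketch below uses made-up per-sample losses and weights; both function names are illustrative. The first function averages the per-sample regularized losses literally, the second implements the conventional full-dataset form with the extra division by the dataset size.

```python
def averaged_per_sample_objective(unreg_losses, weights, lam):
    """Average of per-sample regularized losses: the penalty keeps the factor lam / 2."""
    n = len(unreg_losses)
    penalty = 0.5 * lam * sum(w * w for w in weights)
    return sum(loss + penalty for loss in unreg_losses) / n


def conventional_objective(unreg_losses, weights, lam):
    """Conventional full-dataset form: the penalty is scaled by lam / (2 n)."""
    n = len(unreg_losses)
    data_term = sum(unreg_losses) / n
    return data_term + lam / (2 * n) * sum(w * w for w in weights)


unreg_losses = [0.2, 0.4, 0.6, 0.8]   # n = 4, mean 0.5
weights = [1.0, -2.0]                 # sum of squares = 5
lam = 0.4

print(averaged_per_sample_objective(unreg_losses, weights, lam))
print(conventional_objective(unreg_losses, weights, lam))
```

The data term is the same in both cases (0.5 here); only the penalty differs, by exactly the factor `n`.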
Two Conventions for the Regularized Loss
Sum convention
The per-sample losses are summed over the dataset, so the loss grows linearly with $n$.
The regularization always weighs $\frac{\lambda}{2}$, but the balance shifts as the dataset size changes.
Mean convention
The loss is normalized per sample, making it independent of $n$.
The relative strength of the regularization remains consistent even if the dataset size varies.
Which convention to use?
In modern Deep Learning, the mean convention is often the more convenient one, especially with mini-batch SGD.
This is because mini-batches approximate the dataset average: using the mean keeps the data term and the regularization term on comparable scales, regardless of the batch size or the total number of samples. This makes hyperparameter tuning (for example the choice of $\lambda$) more interpretable across different training regimes.
Mini-batch L2-Regularized Loss
In practice, training is not performed on the entire dataset at once, but on mini-batches of $m$ samples.
The mini-batch L2-regularized loss is written as:
$$\mathcal{L}_B = \frac{1}{m} \sum_{x \in B} \mathcal{L}_{0,x} + \frac{\lambda}{2n} \sum_w w^2$$
where:
- Data term: it uses the factor $\frac{1}{m}$ since the loss is averaged over the $m$ training samples in the current mini-batch $B$. This is a stochastic approximation of the full-dataset loss: $\frac{1}{m} \sum_{x \in B} \mathcal{L}_{0,x} \approx \frac{1}{n} \sum_x \mathcal{L}_{0,x}$
- Regularization term: this term is scaled by $\frac{1}{n}$ (with $n$ the total dataset size), not by $\frac{1}{m}$.
  This ensures that the strength of the penalty is independent of the chosen mini-batch size.
  Dividing by $m$ would make the effect of regularization artificially stronger or weaker depending on the batch size.
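The two terms of the mini-batch loss can be sketched as follows. The function name, the batch contents and the dataset size are illustrative assumptions; the per-example losses are given as plain numbers instead of being computed by a network.

```python
def minibatch_regularized_loss(batch_losses, weights, lam, n):
    """Return (data_term, reg_term) of the mini-batch L2-regularized loss.

    batch_losses: unregularized losses of the examples in the mini-batch
    n: total dataset size (not the mini-batch size)
    """
    m = len(batch_losses)
    data_term = sum(batch_losses) / m                       # averaged over the batch
    reg_term = lam / (2 * n) * sum(w * w for w in weights)  # scaled by the dataset size
    return data_term, reg_term


weights = [1.0, 2.0]
n, lam = 1000, 5.0

small_batch = [0.4, 0.6]
large_batch = [0.4, 0.6, 0.4, 0.6]

print(minibatch_regularized_loss(small_batch, weights, lam, n))
print(minibatch_regularized_loss(large_batch, weights, lam, n))
```

The regularization term is identical for both batches because it depends only on `n`, while the data term is an average and therefore stays on the same scale when the batch size changes.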
Note
It is also important to note that the regularization term does not include the biases, as will be explained further on.
Intuitively, the effect of regularization is to make the network tend to prefer small weights, all else being equal.
Large weights will only be allowed if they significantly improve the first part of the loss function.
Important
In other words, regularization can be seen as a trade-off between the minimization of the original loss function and the reduction of weight magnitudes.
The relative importance of these two elements depends on the value of $\lambda$:
- when $\lambda$ is small, the loss is driven mainly by the original objective;
- when $\lambda$ is large, the loss is dominated by the weight penalty term.
The procedure for selecting $\lambda$ will be discussed later.
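The trade-off can be made concrete on a one-dimensional toy problem that is not taken from the note: minimize $\frac{1}{2}(w - a)^2 + \frac{\lambda}{2} w^2$, where $a$ is the weight value the data term alone would prefer. Setting the derivative $(w - a) + \lambda w$ to zero gives $w^* = \frac{a}{1 + \lambda}$. The sketch below evaluates this closed form; the function names are illustrative.

```python
def toy_objective(w, a, lam):
    """Data term pulling w toward a, plus an L2 penalty pulling w toward 0."""
    return 0.5 * (w - a) ** 2 + 0.5 * lam * w ** 2


def toy_minimizer(a, lam):
    """Closed-form minimizer of toy_objective: w* = a / (1 + lam)."""
    return a / (1.0 + lam)


a = 2.0
for lam in (0.0, 1.0, 99.0):
    w_star = toy_minimizer(a, lam)
    print(f"lambda={lam}: w* = {w_star:.4f}")
```

For small `lam` the minimizer stays close to `a`, so the data term dominates. For large `lam` it is pulled toward zero, so the penalty dominates.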
Warning
At first glance, it is not at all obvious that such a trade-off could help reduce overfitting.
However, it turns out that it does. The reason why this happens was discussed in the previous note.
GD for an L2-regularized network
To apply the gradient descent learning algorithm in a regularized neural network, it is necessary to compute the partial derivatives $\frac{\partial \mathcal{L}}{\partial w}$ and $\frac{\partial \mathcal{L}}{\partial b}$ for all the weights and biases in the network.
Tip
Strictly speaking, the backpropagation equations should be reformulated to account for the change in the loss function that now includes the regularization term.
In practice, however, one can take a shortcut and directly modify the weight update rule, without altering the backpropagation equations.
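Given the form of the regularized loss function $\mathcal{L} = \mathcal{L}_0 + \frac{\lambda}{2n} \sum_w w^2$, the focus is placed on a single weight $w$, and the partial derivative of the regularization term with respect to it is computed.
| Derivative of the regularized loss w.r.t. $w$ | Derivative of the regularized loss w.r.t. $b$ |
|---|---|
| Since: $\frac{\partial}{\partial w} \left( \frac{\lambda}{2n} \sum_w w^2 \right) = \frac{\lambda}{n} w$ then: $\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}_0}{\partial w} + \frac{\lambda}{n} w$ | The regularization term does not depend on $b$, so: $\frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}_0}{\partial b}$ |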
Important
The terms $\frac{\partial \mathcal{L}_0}{\partial w}$ and $\frac{\partial \mathcal{L}_0}{\partial b}$ can be computed using standard backpropagation.
Consequently, the gradient of the regularized loss function is obtained in a straightforward way: apply standard backpropagation and add $\frac{\lambda}{n} w_k$ to the partial derivative of the $k$-th weight.
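The "backpropagation plus $\frac{\lambda}{n} w$" shortcut can be verified with a finite-difference check. The sketch below is an assumption-laden toy: a two-weight linear model with a quadratic loss stands in for the network, and `unreg_grad` plays the role of standard backpropagation. All names and numbers are illustrative.

```python
def unreg_loss(w, data):
    """Averaged unregularized loss of a toy linear model w[0]*x0 + w[1]*x1."""
    return sum(0.5 * (w[0] * x0 + w[1] * x1 - y) ** 2 for x0, x1, y in data) / len(data)


def unreg_grad(w, data):
    """Analytic gradient of unreg_loss (stand-in for standard backpropagation)."""
    n = len(data)
    grad = [0.0, 0.0]
    for x0, x1, y in data:
        err = w[0] * x0 + w[1] * x1 - y
        grad[0] += err * x0 / n
        grad[1] += err * x1 / n
    return grad


def reg_loss(w, data, lam):
    """Unregularized loss plus the penalty lam / (2 n) * sum of squared weights."""
    n = len(data)
    return unreg_loss(w, data) + lam / (2 * n) * sum(wk * wk for wk in w)


def reg_grad(w, data, lam):
    """Shortcut: unregularized gradient plus (lam / n) * w_k for each weight."""
    n = len(data)
    return [g + lam / n * wk for g, wk in zip(unreg_grad(w, data), w)]


def numeric_grad(f, w, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += h
        w_minus[k] -= h
        grads.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grads


data = [(1.0, 2.0, 1.0), (0.5, -1.0, 0.0), (-1.5, 0.3, 2.0)]
w, lam = [0.4, -0.7], 0.6

analytic = reg_grad(w, data, lam)
numeric = numeric_grad(lambda v: reg_loss(v, data, lam), w)
print(analytic, numeric)
```

The two gradients agree up to finite-difference error, which is what the shortcut predicts.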
| GD weight update rule | GD bias update rule |
|---|---|
| $w \to w - \eta \frac{\partial \mathcal{L}_0}{\partial w} - \frac{\eta \lambda}{n} w = \left( 1 - \frac{\eta \lambda}{n} \right) w - \eta \frac{\partial \mathcal{L}_0}{\partial w}$, where $\eta$ is the learning rate. This is exactly the same rule as in standard gradient descent, except that the weight is first rescaled by a factor of $1 - \frac{\eta \lambda}{n}$. This rescaling is known as weight decay, since it tends to make the weights smaller. At first glance, it might seem that the weights are relentlessly pushed toward zero, but this is not the case: the other term in the update (the derivative of the unregularized loss function) can still push the weights to increase if doing so reduces the unregularized loss. | $b \to b - \eta \frac{\partial \mathcal{L}_0}{\partial b}$ The partial derivatives with respect to the biases remain unchanged, so the learning rule for the biases does not differ from the standard one. |
Note
This approach makes it possible to incorporate regularization in a simple and modular way, without the need to reformulate the entire backpropagation.
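The modularity can be seen by writing the weight update in both forms: once by adding the penalty gradient to the backpropagated gradient, and once by rescaling the weight first. The sketch below uses made-up gradients in place of backpropagation, and the function names are illustrative.

```python
def gd_step_explicit(w, grad0, eta, lam, n):
    """Gradient step on the regularized loss: gradient = grad0 + (lam / n) * w."""
    return [wk - eta * (g + lam / n * wk) for wk, g in zip(w, grad0)]


def gd_step_weight_decay(w, grad0, eta, lam, n):
    """Same update written as weight decay followed by the unregularized step."""
    decay = 1.0 - eta * lam / n
    return [decay * wk - eta * g for wk, g in zip(w, grad0)]


w = [1.0, -2.0]
grad0 = [0.5, -0.25]          # stand-in for the backpropagated gradient of the unregularized loss
eta, lam, n = 0.1, 5.0, 100

print(gd_step_explicit(w, grad0, eta, lam, n))
print(gd_step_weight_decay(w, grad0, eta, lam, n))
```

Both functions return the same weights up to floating-point rounding, so the backpropagation code itself does not need to know about the penalty.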
Mini-batch SGD for an L2-regularized network
Question
What changes with mini-batch stochastic gradient descent (SGD)?
| SGD weight update rule | SGD bias update rule |
|---|---|
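| $w \to \left( 1 - \frac{\eta \lambda}{n} \right) w - \frac{\eta}{m} \sum_x \frac{\partial \mathcal{L}_{0,x}}{\partial w}$ Just as in the unregularized case, $\frac{\partial \mathcal{L}_0}{\partial w}$ can be estimated by averaging over a mini-batch of $m$ training examples. The sum is taken over the examples $x$ in the mini-batch. $\mathcal{L}_{0,x}$ is the unregularized loss function for each example. This is exactly the same rule as in unregularized stochastic gradient descent, except for the weight decay factor $1 - \frac{\eta \lambda}{n}$, which reduces the weights at each step. | $b \to b - \frac{\eta}{m} \sum_x \frac{\partial \mathcal{L}_{0,x}}{\partial b}$ The sum is taken over the examples $x$ in the mini-batch. The regularized update rule for the biases is identical to the unregularized one. |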
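A single mini-batch step following the two rules in the table could look like the sketch below. The per-example gradients are passed in as plain lists (in a real network they would come from backpropagation), and `sgd_step` is a hypothetical helper, not a library function.

```python
def sgd_step(w, b, grads_w, grads_b, eta, lam, n):
    """One mini-batch SGD step with L2 regularization on the weights only.

    grads_w, grads_b: one gradient list per example in the mini-batch
    n: total dataset size; the mini-batch size m is len(grads_w)
    """
    m = len(grads_w)
    decay = 1.0 - eta * lam / n
    mean_gw = [sum(g[k] for g in grads_w) / m for k in range(len(w))]
    mean_gb = [sum(g[k] for g in grads_b) / m for k in range(len(b))]
    new_w = [decay * wk - eta * gk for wk, gk in zip(w, mean_gw)]   # decayed weights
    new_b = [bk - eta * gk for bk, gk in zip(b, mean_gb)]           # biases: no decay
    return new_w, new_b


w, b = [1.0], [0.5]
grads_w = [[0.2], [0.4]]      # made-up per-example gradients, m = 2
grads_b = [[0.1], [0.3]]

new_w, new_b = sgd_step(w, b, grads_w, grads_b, eta=0.5, lam=2.0, n=10)
print(new_w, new_b)
```

Setting `lam=0.0` recovers the unregularized SGD step, and the bias update is the same for every value of `lam`.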
L2 regularization meets adaptive optimizers: AdamW
The derivation above shows that, for plain SGD, adding the L2 penalty to the loss is mathematically equivalent to a multiplicative shrinkage of the weights at every step. This is the classical identification of L2 regularization with weight decay.
The identification breaks for adaptive optimizers (Adam, RMSProp, Adagrad). In these methods, the per-coordinate effective step size is rescaled by a running estimate of the gradient magnitude; if the L2 gradient is added to the data gradient before the rescaling, then the regularization itself becomes a function of the gradient history, which it was never meant to depend on. The cleanest fix is AdamW (Loshchilov and Hutter, 2017), which applies the weight-decay shrinkage as a separate step outside Adam’s adaptive transformation.
In one line
- SGD: L2 regularization $\equiv$ weight decay. Either implementation gives the same update.
- Adam + L2 (coupled): distorts the regularization through adaptive moments. Not equivalent to weight decay.
- AdamW: applies weight decay decoupled from Adam’s adaptive step. The correct choice whenever Adam-style optimization is combined with nonzero weight decay.
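The difference between the coupled and decoupled variants can be illustrated with a deliberately simplified adaptive step. This is a sketch under strong assumptions, not the full Adam or AdamW algorithm: momentum and bias correction are omitted, and the second-moment estimate `v_hat` is passed in as a fixed number standing for the gradient history of one coordinate. The function names are illustrative.

```python
import math


def coupled_step(w, g, v_hat, lr, wd, eps=1e-8):
    """Adam-style L2: the decay gradient wd * w is rescaled together with g."""
    return w - lr * (g + wd * w) / (math.sqrt(v_hat) + eps)


def decoupled_step(w, g, v_hat, lr, wd, eps=1e-8):
    """AdamW-style: shrink w directly, and rescale only the data gradient g."""
    return w - lr * wd * w - lr * g / (math.sqrt(v_hat) + eps)


lr, wd = 0.1, 0.1
w, g = 1.0, 0.0                 # zero data gradient isolates the effect of the decay
histories = (100.0, 0.01)       # large vs. small past gradients for two coordinates

coupled_shrink = [w - coupled_step(w, g, v, lr, wd) for v in histories]
decoupled_shrink = [w - decoupled_step(w, g, v, lr, wd) for v in histories]
print(coupled_shrink)
print(decoupled_shrink)
```

In the coupled variant, two coordinates holding the same weight value shrink by very different amounts depending on their gradient history. In the decoupled variant both shrink by `lr * wd * w`.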
Full treatment in a dedicated note
The mathematical reason coupled Adam + L2 produces an anisotropic, history-dependent shrinkage (and the exact form of the AdamW update rule that fixes it) is derived in AdamW. The present note focuses on L2 in the SGD setting and on the practical PyTorch parameters needed to use it correctly with each optimizer family.
Regularization and biases
Note
L2 regularization is usually not applied to bias terms.
While it is technically possible to include them, their practical effect is often small, which is why omitting them has become a common convention.
Why Large Biases Do Not Pose a Problem
A large bias does not amplify input sensitivity in the same way large weights do.
Large weights directly control how strongly changes in the input affect the neuron, whereas a bias mainly shifts the activation threshold. For that reason, large biases are usually less problematic from the point of view of overfitting.
Large Biases Can Enhance Network Flexibility
Allowing biases to remain unconstrained can increase the flexibility of the network’s behavior.
In particular, biases can shift activation thresholds without creating the same kind of input-dependent amplification that large weights produce.
For these reasons, in practice biases are almost never included in the regularization terms.
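The asymmetry between weights and biases can be checked on a single neuron's pre-activation $z = w \cdot x + b$. The sketch below uses made-up inputs and perturbations, and the helper names are illustrative.

```python
def pre_activation(w, x, b):
    """Pre-activation z = w . x + b of a single neuron."""
    return sum(wk * xk for wk, xk in zip(w, x)) + b


def input_sensitivity(w, b, x, delta):
    """Change in z when the input x is perturbed by delta."""
    x_shifted = [xk + dk for xk, dk in zip(x, delta)]
    return pre_activation(w, x_shifted, b) - pre_activation(w, x, b)


x = [1.0, 2.0]
delta = [0.1, 0.1]              # small input perturbation

small_w, large_w = [0.5, 0.5], [50.0, 50.0]

print(input_sensitivity(small_w, 0.0, x, delta))     # small weights, zero bias
print(input_sensitivity(small_w, 100.0, x, delta))   # small weights, large bias
print(input_sensitivity(large_w, 0.0, x, delta))     # large weights, zero bias
```

The change in `z` equals `w . delta`: it scales with the weights and does not depend on the bias, up to floating-point rounding.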
L2 regularization in PyTorch
PyTorch exposes L2 regularization through the `weight_decay` argument of most optimizers in `torch.optim`, including `SGD`, `Adam`, and `AdamW`. Two things about this implementation are easy to get wrong and worth stating explicitly.
How PyTorch's `weight_decay` differs from the theoretical $\lambda$
The theoretical L2-regularized loss used throughout this note is
$$\mathcal{L} = \frac{1}{n} \sum_x \mathcal{L}_{0,x} + \frac{\lambda}{2n} \sum_w w^2$$
with regularization coefficient $\frac{\lambda}{2n}$. PyTorch optimizers do not perform the division by $n$ internally: they apply `weight_decay` directly to the per-step update. The PyTorch parameter corresponds to the coefficient that multiplies $w$ in the gradient, i.e., $\frac{\lambda}{n}$ under the averaged-loss convention. The conversion is therefore
$$\texttt{weight\_decay} = \frac{\lambda}{n}$$
Forgetting this conversion is an easy way to get L2 wrong in PyTorch: the same theoretical $\lambda$ translates into very different `weight_decay` values for datasets of very different sizes.
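The conversion is a one-liner. The helper below is hypothetical (it is not part of PyTorch), and the dataset sizes are illustrative.

```python
def weight_decay_from_lambda(lam, n):
    """Convert the theoretical lambda into the optimizer's weight_decay = lambda / n."""
    if n <= 0:
        raise ValueError("dataset size n must be positive")
    return lam / n


lam = 5.0
for n in (50_000, 1_000_000):
    print(f"n={n}: weight_decay = {weight_decay_from_lambda(lam, n):.1e}")
```

The same `lam` yields a `weight_decay` twenty times smaller on the larger dataset, which is why values copied between projects of different sizes do not transfer directly.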
Which optimizer applies `weight_decay` as the weight decay derived in this note
Whether PyTorch’s `weight_decay` actually reproduces the update rule derived in this note depends on the optimizer:
| Optimizer | What `weight_decay` does | Matches the weight decay derived in this note? |
|---|---|---|
| `optim.SGD` | adds `weight_decay * w` to the gradient before the step | yes (L2 penalty and weight decay coincide) |
| `optim.Adam` | adds `weight_decay * w` to the gradient, then passes the sum through Adam’s adaptive transform | no (coupled: the L2 gradient is rescaled per coordinate, which distorts the shrinkage) |
| `optim.AdamW` | shrinks each weight by the factor `1 - lr * weight_decay` as a separate step, outside Adam’s adaptive transform | yes (decoupled weight decay; mathematically clean for the Adam family) |

The reason the Adam row reads “no” and the AdamW row reads “yes” is derived in detail in AdamW. The practical recipe is short: use `SGD` with `weight_decay` or `AdamW` with `weight_decay`; avoid `Adam` with `weight_decay` when the uniform shrinkage derived in this note is the goal.
Code example: enabling weight decay in PyTorch

    import torch
    import torch.nn as nn
    import torch.optim as optim

    model = nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(),
        nn.Linear(256, 10),
    )

    # SGD with classical L2 regularization (== weight decay for SGD)
    opt_sgd = optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

    # AdamW with decoupled weight decay (the correct adaptive-optimizer recipe)
    opt_adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

Excluding biases (and norm parameters) from weight decay
A standard refinement: apply `weight_decay` only to weight matrices, not to biases or BatchNorm/LayerNorm scale and shift parameters. Some pipelines also exclude embedding tables, which the dimension-based filter used here does not do. This is achieved with parameter groups:

    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Exclude 1-D parameters (biases, norm gammas/betas) from weight decay
        if p.dim() == 1 or name.endswith('.bias'):
            no_decay.append(p)
        else:
            decay.append(p)

    opt = optim.AdamW(
        [{'params': decay, 'weight_decay': 1e-2},
         {'params': no_decay, 'weight_decay': 0.0}],
        lr=1e-3,
    )

This pattern is common in modern Transformer training pipelines (widely used BERT, GPT, and ViT training code follows it).
BatchNorm + weight decay: a subtle interaction
In a network with Batch Normalization right after a linear or convolutional layer, the scale of the layer’s weights is fundamentally absorbed by BatchNorm’s normalization. Multiplying the weights of the layer by any positive constant produces the same output, because BatchNorm divides by the standard deviation of the pre-activations.
Applying `weight_decay` to such weights does not, strictly speaking, change the function the network computes; it only shrinks the scale that BatchNorm will subsequently undo. The effect of weight decay in BN-equipped networks is therefore indirect: it changes the effective learning rate along certain directions, rather than shrinking the actual function the network represents. This explains why optimal `weight_decay` values often look surprisingly large in modern image-classification training recipes compared with the values used in pre-BN architectures.
The interaction is analyzed in depth by van Laarhoven (2017, “L2 Regularization versus Batch and Weight Normalization”) and is part of the reason modern training pipelines tune `weight_decay` and learning rate jointly.
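The scale invariance behind this interaction can be checked without any deep-learning library. The sketch below is a simplified stand-in for BatchNorm: it normalizes the pre-activations of one neuron over a toy batch, with no learned scale or shift and with `eps = 0` (real BatchNorm uses a small positive `eps`, so the invariance is only approximate there). Names and numbers are illustrative.

```python
import math


def pre_activations(w, inputs):
    """Pre-activations w . x of one neuron for every example in the batch."""
    return [sum(wk * xk for wk, xk in zip(w, x)) for x in inputs]


def batch_normalize(values, eps=0.0):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


inputs = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0]]
w = [0.5, -1.5]
w_scaled = [10.0 * wk for wk in w]      # same direction, ten times larger

z = pre_activations(w, inputs)
z_scaled = pre_activations(w_scaled, inputs)

print(batch_normalize(z))
print(batch_normalize(z_scaled))
```

The raw pre-activations differ by a factor of ten, but the normalized outputs coincide up to rounding, so shrinking `w` alone leaves the normalized output unchanged.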
For a deeper mathematical analysis of what L2 regularization does to the optimal weights of a model (not just the per-step update), see L2 regularization in depth: the eigendecomposition of the loss Hessian reveals that L2 shrinks much more strongly along low-curvature directions of the loss than along high-curvature ones. The other regularizers in this section, Dropout and Data augmentation, are complementary and operate by different mechanisms.