Intro
Goal: Achieving Small Weights
When pursuing a specific objective in Deep Learning (in this case, preferring small weights), the most common strategy is to modify the loss function.
Indeed, it is through the loss function that the training objective of the model is formally defined.
Solution: Regularization
A regularization term can be added to the loss function to encourage the network to learn small weights. While different approaches exist (e.g., L1 regularization and others), in the following discussion the focus will be on L2 regularization (also known as weight decay), which penalizes large weights by adding their squared values to the loss.
Per-Sample L2-Regularized Loss
| Per-sample loss function | L2-regularized expression | Description |
|---|---|---|
| Cross-entropy | $C = -\sum_j \left[ y_j \ln a_j^L + (1 - y_j) \ln (1 - a_j^L) \right] + \frac{\lambda}{2} \sum_w w^2$ | The first term is the standard cross-entropy. The second term is the sum of the squares of all network weights, scaled by $\frac{\lambda}{2}$. |
| MSE | $C = \frac{1}{2} \lVert y - a^L \rVert^2 + \frac{\lambda}{2} \sum_w w^2$ | The quadratic (MSE) loss can also be regularized in the same way. |
| L2-regularized form | $C = C_0 + \frac{\lambda}{2} \sum_w w^2$ | $C_0$ is the unregularized loss function. $\frac{\lambda}{2} \sum_w w^2$ is called the regularization term. |
Regularization parameter
$\lambda > 0$ is the regularization parameter that controls the trade-off between fitting the training data and keeping the weights small.
Assumptions Behind the above formulas
The previous formulas refer to the case where the training sample is considered fixed.
It is therefore assumed that the input data do not vary, a typical condition in theoretical analyses, where the focus is placed on the network’s behavior with respect to the weights rather than on the variability of the dataset.
Why the Square in L2 Regularization?
The squared term is used because:
- It is always positive: both positive and negative weights increase the cost.
- It avoids cancellations: without squaring, positive and negative weights of the same magnitude could cancel each other.
- It penalizes large weights more strongly: squaring amplifies large values, discouraging the network from letting any single weight grow too much.
Why Does It Help Against Overfitting?
Since training aims to minimize the cost function, the regularization term $\sum_w w^2$ is also minimized.
👉 This leads the network to prefer solutions with overall smaller weights, which tend to be more stable and less prone to overfitting (see previous note)
General L2-Regularized Loss
So far, the L2-regularized loss has been expressed with respect to a single training sample $x$.
In practice, however, the loss is computed over the entire training set of $n$ samples by averaging across them.
This leads to the following general form:

$$C = \frac{1}{n} \sum_x C_{0,x} + \frac{\lambda}{2n} \sum_w w^2$$

- The first term corresponds to the average loss over the entire dataset.
- The second term is the sum of the squares of all the network’s weights. It penalizes large weights and is scaled by $\frac{\lambda}{2n}$ so that the regularization strength is comparable in magnitude to the averaged loss term.
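To make the averaged form concrete, here is a minimal PyTorch sketch (the model, the data, and the names `lambda_` and `n` are illustrative assumptions, not taken from this note) that adds the $\frac{\lambda}{2n} \sum_w w^2$ penalty to an averaged cross-entropy loss:

```python
import torch
import torch.nn as nn

# Hedged sketch: mean-convention L2-regularized loss computed by hand.
# C = (1/n) * sum_x C_{0,x} + (lambda / (2n)) * sum_w w^2
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
inputs = torch.randn(128, 20)                  # the whole (toy) training set
targets = torch.randint(0, 10, (128,))
lambda_, n = 0.1, inputs.shape[0]              # regularization parameter and dataset size n

data_term = nn.functional.cross_entropy(model(inputs), targets)  # (1/n) * sum_x C_{0,x}

# Sum of squared weights only; biases are excluded (see the later section on biases)
l2_term = sum((p ** 2).sum()
              for name, p in model.named_parameters() if name.endswith("weight"))

loss = data_term + (lambda_ / (2 * n)) * l2_term
loss.backward()   # gradients now contain the extra (lambda / n) * w contribution
```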
Two Conventions for the Regularized Loss
Sum convention

$$C = \sum_x C_{0,x} + \frac{\lambda}{2} \sum_w w^2$$

➡️ The loss grows linearly with $n$.
➡️ The regularization term always weighs $\frac{\lambda}{2} \sum_w w^2$, but the balance between data term and penalty shifts as the dataset size changes.

Mean convention

$$C = \frac{1}{n} \sum_x C_{0,x} + \frac{\lambda}{2n} \sum_w w^2$$

➡️ The loss is normalized per sample, making it independent of $n$.
➡️ The relative strength of the regularization remains consistent even if the dataset size varies.

Which convention to use?
In modern Deep Learning, the mean convention is regarded as more robust, particularly with mini-batch SGD.
This is because mini-batches approximate the dataset average: using the mean ensures that both the data term and the regularization term are on the same scale, regardless of the batch size or the total number of samples. This stability makes hyperparameter tuning (e.g., the choice of $\lambda$) more reliable and consistent across different training regimes.
Mini-batch L2-Regularized Loss
In practice, training is not performed on the entire dataset at once, but on mini-batches of $m$ samples.
The mini-batch L2-regularized loss is written as:

$$C \approx \frac{1}{m} \sum_{x \in \mathcal{B}} C_{0,x} + \frac{\lambda}{2n} \sum_w w^2$$

where:

- Data term: it uses the factor $\frac{1}{m}$ since the loss is averaged over the $m$ training samples in the current mini-batch $\mathcal{B}$. This is a stochastic approximation of the full-dataset loss: $\frac{1}{m} \sum_{x \in \mathcal{B}} C_{0,x} \approx \frac{1}{n} \sum_x C_{0,x}$.
- Regularization term: this term is scaled by $\frac{\lambda}{2n}$, where $n$ is the total dataset size, not by $m$.
  This ensures that the strength of the penalty is independent of the chosen mini-batch size.
  Dividing by $m$ would make the effect of regularization artificially stronger or weaker depending on the batch size.
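As a hedged sketch of how this looks inside a training loop (the DataLoader, model, and hyperparameter names below are assumptions for illustration), the data term is averaged over each mini-batch while the penalty keeps its $\frac{\lambda}{2n}$ scaling:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Sketch: mini-batch L2-regularized loss. The data term uses 1/m (the batch size);
# the penalty is scaled by lambda / (2n), with n the full dataset size.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
dataset = TensorDataset(torch.randn(1024, 20), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
lambda_, n = 0.1, len(dataset)

for xb, yb in loader:
    data_term = nn.functional.cross_entropy(model(xb), yb)   # (1/m) * sum over the batch
    l2_term = sum((p ** 2).sum()
                  for name, p in model.named_parameters() if name.endswith("weight"))
    loss = data_term + (lambda_ / (2 * n)) * l2_term          # scaled by n, not by m
    model.zero_grad()
    loss.backward()
    # ... an optimizer step would follow here ...
    break   # one step shown for illustration
```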
Note
It is also important to note that the regularization term does not include the biases, as will be explained further on.
Intuitively, the effect of regularization is to make the network tend to prefer small weights, all else being equal.
Large weights will only be allowed if they significantly improve the first part of the loss function.
Important
In other words, regularization can be seen as a trade-off between the minimization of the original loss function and the reduction of weight magnitudes.
The relative importance of these two elements depends on the value of $\lambda$:
- when $\lambda$ is small, the loss is driven mainly by the original objective;
- when $\lambda$ is large, the loss is dominated by the weight penalty term.
The procedure for selecting $\lambda$ will be discussed later.
Warning
At first glance, it is not at all obvious that such a trade-off could help reduce overfitting.
However, it turns out that it does. The reason why this happens was discussed in the previous note.
GD for a L2-regularized network
To apply the gradient descent learning algorithm in a regularized neural network, it is necessary to compute the partial derivatives $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ for all the weights and biases in the network.
Tip
Strictly speaking, the backpropagation equations should be reformulated to account for the change in the loss function that now includes the regularization term.
In practice, however, one can take a shortcut and directly modify the weight update rule, without altering the backpropagation equations.
Given the form of the regularized loss function $C = C_0 + \frac{\lambda}{2n} \sum_w w^2$, the focus is placed on a generic weight $w$, and the partial derivative of the regularization term with respect to it is computed.
| Derivative of the regularized loss w.r.t. $w$ | Derivative of the regularized loss w.r.t. $b$ |
|---|---|
| Since $C = C_0 + \frac{\lambda}{2n} \sum_w w^2$, then $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$. | $\frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$, since the regularization term does not involve the biases. |
Important
The terms $\frac{\partial C_0}{\partial w}$ and $\frac{\partial C_0}{\partial b}$ can be computed using standard backpropagation.
Consequently, the gradient of the regularized loss function is obtained in a straightforward way: apply standard backpropagation and add $\frac{\lambda}{n} w_k$ to the partial derivative of the $k$-th weight $w_k$.
| GD weight update rule | GD bias update rule |
|---|---|
| $w \to w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$. This is exactly the same rule as in standard gradient descent, except that the weight is first rescaled by a factor of $\left(1 - \frac{\eta \lambda}{n}\right)$. This rescaling is known as weight decay, since it tends to make the weights smaller. At first glance, it might seem that the weights are relentlessly pushed toward zero, but this is not the case: the other term in the update (the derivative of the unregularized loss function) can still push the weights to increase if doing so reduces the unregularized loss. | $b \to b - \eta \frac{\partial C_0}{\partial b}$. The partial derivatives with respect to the biases remain unchanged, so the learning rule for the biases does not differ from the standard one. |
Note
This approach makes it possible to incorporate regularization in a simple and modular way, without the need to reformulate the entire backpropagation.
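A minimal sketch of this shortcut on toy tensors (the names `eta`, `lambda_`, and `n` and all shapes are placeholder assumptions): the gradient comes from ordinary backpropagation, and the decay is applied as a rescaling of the weight before the usual step.

```python
import torch

# Hedged sketch of the regularized GD update: backprop gives dC0/dw; the weight is
# rescaled by (1 - eta * lambda / n) before the gradient step; the bias is not decayed.
eta, lambda_, n = 0.5, 0.1, 1000              # learning rate, reg. parameter, dataset size

w = torch.randn(64, 20, requires_grad=True)   # a weight matrix
b = torch.zeros(64, requires_grad=True)       # its bias vector
x, y = torch.randn(16, 20), torch.randn(16, 64)

loss = torch.nn.functional.mse_loss(x @ w.t() + b, y)  # unregularized loss C0
loss.backward()                                        # w.grad = dC0/dw, b.grad = dC0/db

with torch.no_grad():
    w.mul_(1 - eta * lambda_ / n)   # weight decay: rescale by (1 - eta * lambda / n)
    w.sub_(eta * w.grad)            # standard gradient step on the data term
    b.sub_(eta * b.grad)            # bias: ordinary, unregularized update
```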
Mini-batch SGD for a L2-regularized network
Question
What changes with mini-batch stochastic gradient descent (SGD)?
| SGD weight update rule | SGD bias update rule |
|---|---|
| $w \to \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$. Just as in the unregularized case, $\frac{\partial C_0}{\partial w}$ can be estimated by averaging over a mini-batch of training examples. The sum is taken over the $m$ examples $x$ in the mini-batch, and $C_x$ is the unregularized loss for each example. 👉 This is exactly the same rule as in unregularized stochastic gradient descent, except for the weight decay factor $\left(1 - \frac{\eta \lambda}{n}\right)$, which reduces the weights at each step. | $b \to b - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial b}$. The sum is taken over the $m$ examples $x$ in the mini-batch. 👉 The regularized update rule for the biases is identical to the unregularized one. |
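As a rough sketch of one such step applied to a whole model (the model and hyperparameters are illustrative assumptions), the decay factor touches only the weights while every parameter receives the averaged mini-batch gradient:

```python
import torch
import torch.nn as nn

# Hedged sketch of a single regularized mini-batch SGD step.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
eta, lambda_, n = 0.1, 0.1, 1000                            # lr, reg. parameter, dataset size
xb, yb = torch.randn(32, 20), torch.randint(0, 10, (32,))   # one mini-batch, m = 32

loss = nn.functional.cross_entropy(model(xb), yb)   # (1/m) * sum_x C_x over the batch
model.zero_grad()
loss.backward()                                      # p.grad = (1/m) * sum_x dC_x/dp

with torch.no_grad():
    for name, p in model.named_parameters():
        if name.endswith("weight"):
            p.mul_(1 - eta * lambda_ / n)            # weight decay on the weights only
        p.sub_(eta * p.grad)                         # standard SGD step for all parameters
```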
AdamW (Adam + L2 regularization)
Do not use Adam + L2 regularization
In Adam, the learning rate is rescaled for each parameter based on the history of the sum of gradients and the sum of squared gradients.
Strictly speaking, however, this is not a simple arithmetic sum, but rather an exponential moving average (EMA).

The problem arises when trying to apply Adam’s rescaling strategy to the entire term:

$$\frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$$
Doing so would be a serious conceptual mistake.
The Pitfall of Adam + L2 Regularization
This is because Adam is a learning-rate rescaling strategy designed for terms of the form $\frac{\partial C_0}{\partial w}$, not for the $\frac{\lambda}{n} w$ terms that arise from regularization. These two terms serve distinct purposes.
Adam is designed to adjust the learning rate based on the behavior of the loss gradient, not the regularization term.
Rescaling $\frac{\lambda}{n} w$ as well would lead to a paradox:
while the goal of regularization is to encourage small weights, the rescaling would end up encouraging some weights more than others to become small, whereas the objective should be to enforce uniformly small weights.

Therefore, when using Adam together with a weight decay term, directly coupling Adam and L2 is not correct: it is a mistake!!!
AdamW: Decoupled Weight Decay
The correct way to combine Adam with L2 regularization is to use AdamW, a variant specifically designed to apply weight decay properly by decoupling it from the adaptive gradient update.

AdamW operates in two distinct phases:
- First phase: apply the standard Adam update rule to the loss gradient $\frac{\partial C_0}{\partial w}$, with adaptive learning-rate rescaling.
- Second phase: apply weight decay as a separate update, subtracting a term proportional to $w$ and scaled only by the learning rate $\eta$, after the adaptive step.
✅ In summary, if L2 regularization is required with Adam, AdamW must be used.
Unlike Adam + L2, which incorrectly mixes objectives, AdamW decouples weight decay from gradient adaptation, ensuring stable and mathematically consistent regularization.
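A simplified sketch of the two phases for a single parameter tensor (an illustration under simplifying assumptions, not PyTorch's actual `torch.optim.AdamW` code: Adam's bias correction is omitted and all names are placeholders):

```python
import torch

def adamw_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    # Phase 1: Adam-style adaptive update driven only by the loss gradient g = dC0/dw
    m.mul_(beta1).add_(g, alpha=1 - beta1)          # EMA of gradients
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)   # EMA of squared gradients
    w.sub_(lr * m / (v.sqrt() + eps))               # adaptive step (bias correction omitted)
    # Phase 2: decoupled weight decay, independent of the adaptive rescaling
    w.sub_(lr * weight_decay * w)
    return w, m, v

w = torch.randn(10)
m, v = torch.zeros_like(w), torch.zeros_like(w)
g = torch.randn(10)   # stand-in for the gradient of the unregularized loss
w, m, v = adamw_step(w, g, m, v)
```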

Recap
While the terms L2 regularization and weight decay are often used interchangeably, they are technically distinct and their equivalence depends on the optimizer being used.
With SGD: The two are mathematically equivalent. Adding the L2 penalty term to the loss function results in a gradient update rule that is identical to applying a proportional decay to the weights at each step.
With Adaptive Optimizers (e.g., Adam): The equivalence breaks. Adam adapts the learning rate for each parameter based on the history of its gradients. If L2 regularization is implemented by simply adding its derivative ($\lambda w$) to the loss gradient, Adam will incorrectly rescale this regularization term as well. This can lead to ineffective or unpredictable regularization.
✅ This is precisely why AdamW was introduced. It decouples weight decay from the gradient update, applying it directly to the weights after the Adam optimization step. This ensures that the decay is applied consistently, as intended by the original concept of weight decay.
Regularization and biases
Note
L2 regularization is usually not applied to bias terms.
While it is technically possible to include them, empirical results show that the effect is often negligible, which is why this is largely a conventional choice.
Why Large Biases Do Not Pose a Problem
A large bias does not make a neuron sensitive to inputs in the same way large weights do.
Therefore, there is no need to worry that large biases will cause the network to learn the noise in the data.
Large Biases Can Enhance Network Flexibility
Allowing large biases can increase the flexibility of the network’s behavior.
For example, large biases facilitate neuron saturation, an effect that can sometimes be functional or even desirable.
For these reasons, in practice biases are almost never included in the regularization terms.
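In PyTorch, this convention is commonly implemented with parameter groups, so that `weight_decay` applies only to the weight tensors (a sketch with assumed names; splitting on the `weight`/`bias` name suffixes is one common heuristic, not the only option):

```python
import torch
import torch.nn as nn

# Sketch: exclude biases from weight decay via a separate parameter group.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))

decay_params = [p for name, p in model.named_parameters() if name.endswith("weight")]
no_decay_params = [p for name, p in model.named_parameters() if name.endswith("bias")]

optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 1e-2},     # weights: regularized
        {"params": no_decay_params, "weight_decay": 0.0},   # biases: not regularized
    ],
    lr=1e-3,
)
```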
L2 regularization in PyTorch
When implementing L2 regularization in theory, one often encounters formulas that include an explicit division by the dataset size $n$.
However, PyTorch optimizers handle regularization in a slightly different way: the `weight_decay` parameter is applied directly during the update step, without any internal scaling by $n$ or by the mini-batch size $m$.
Weight Decay in PyTorch
In PyTorch, the argument `weight_decay` of optimizers (`SGD`, `Adam`, `AdamW`, etc.) is applied directly to the gradients or parameters and is not scaled by the dataset size $n$ or the mini-batch size $m$.

- Theoretical (mean-loss convention): $C = \frac{1}{n} \sum_x C_{0,x} + \frac{\lambda}{2n} \sum_w w^2$.
  Here, the effective coefficient in front of $w$ in the gradient is $\frac{\lambda}{n}$.
- PyTorch implementation: optimizers update gradients as $g \leftarrow g + \text{weight\_decay} \cdot w$. This means that in PyTorch you must set `weight_decay` $= \frac{\lambda}{n}$ if your theoretical reference uses the averaged-loss form above.
- `SGD`: `weight_decay` adds this term directly to the gradient (coupled L2 regularization).
- `Adam`: behaves like `SGD` if `weight_decay` is set, but this is not the correct way to combine Adam and L2.
- `AdamW`: applies decoupled weight decay by scaling the parameters as $w \leftarrow w - \eta \cdot \text{weight\_decay} \cdot w$ at each step, which is the recommended approach.

✅ Summary: PyTorch does not perform the $1/n$ scaling internally. If your theoretical formula has $\frac{\lambda}{2n} \sum_w w^2$, you should pass `weight_decay = λ/n`.
Weight Decay vs. L2 Regularization in PyTorch
In PyTorch, the argument `weight_decay` in optimizers (`SGD`, `Adam`, `AdamW`, etc.) controls how weight decay is applied during training:
With `SGD`, `weight_decay` behaves exactly like L2 regularization.
After autograd computes the gradient $g = \frac{\partial C_0}{\partial w}$, the optimizer internally forms a modified gradient $\tilde{g} = g + \lambda\, w$ (where $\lambda$ here denotes the `weight_decay` value) and uses this in the update rule $w \leftarrow w - \eta\, \tilde{g}$.

With `Adam`, PyTorch does the same: $\tilde{g} = g + \lambda\, w$, but then $\tilde{g}$ is passed into Adam’s adaptive rescaling machinery.
This incorrectly couples the regularization term with Adam’s learning-rate adaptation, so `Adam` + `weight_decay` in PyTorch does not correspond to true L2 regularization.

To address this, PyTorch provides AdamW, where weight decay is applied in a separate step (decoupled).
Here, Adam works only on $g$ (the true gradient of the loss), and after that step the weights are decayed as $w \leftarrow w - \eta\, \lambda\, w$, ensuring consistent and theoretically correct regularization.
Practical takeaway:
- For `SGD`, `weight_decay` ≈ L2 regularization.
- For Adam-style adaptive optimization, always use `AdamW` with `weight_decay`, rather than adding L2 through `Adam`.
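A closing sketch of the recommended setup (model, data, and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Sketch: the recommended pairing of Adam-style optimization with weight decay.
# AdamW applies the decay directly to the parameters, outside the adaptive update.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))

# Preferred: decoupled weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Discouraged: coupled L2 through Adam's gradient (shown only for contrast)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

xb, yb = torch.randn(32, 20), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(xb), yb)
optimizer.zero_grad()
loss.backward()
optimizer.step()   # Adam update on the loss gradient, then decoupled decay on the weights
```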