L2 regularization

Intro

Goal: Achieving Small Weights

When pursuing a specific objective in Deep Learning (in this case, preferring small weights), the most common strategy is to modify the loss function.
Indeed, it is through the loss function that the training objective of the model is formally defined.

Solution: Regularization

A regularization term can be added to the loss function to encourage learning small weights. While different approaches exist (e.g., L1 regularization), the following discussion focuses on L2 regularization (also known as weight decay), which penalizes large weights by adding the sum of their squared values to the loss.

Per-Sample L2-Regularized Loss

Per-sample Loss function	L2 Regularized Expression	Description
Cross-entropy	$- \sum_{j} [y_{j} \ln a_{j}^{L} + (1 - y_{j}) \ln (1 - a_{j}^{L})] + \frac{λ}{2} \sum_{w} w^{2}$	The first term is the standard cross-entropy. The second term is the sum of the squares of all network weights, scaled by $λ /2$ .
MSE	$\frac{1}{2} (y - a^{L})^{2} + \frac{λ}{2} \sum_{w} w^{2}$	The quadratic (MSE) loss can also be regularized in the same way.
L2 Regularized Form	$\tilde{L}_{x} = L_{x} + \frac{λ}{2} \sum_{w} w^{2}$	$\tilde{L}_{x}$ is the regularized loss and $L_{x}$ is the unregularized loss function. $\frac{λ}{2} \sum_{w} w^{2}$ is called the regularization term.

Regularization parameter

$λ \geq 0$ is the regularization parameter that controls the trade-off between fitting the training data and keeping the weights small.
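
As a minimal sketch of the per-sample formulas above, the snippet below computes the cross-entropy loss and adds the L2 term. The function names and the numeric values are hypothetical, chosen only for illustration, and the weights are represented as a flat list.

```python
import math

def cross_entropy(y, a):
    """Per-sample cross-entropy between targets y and output activations a."""
    return -sum(
        yj * math.log(aj) + (1.0 - yj) * math.log(1.0 - aj)
        for yj, aj in zip(y, a)
    )

def l2_penalty(weights, lam):
    """(lam / 2) times the sum of the squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)

def regularized_loss(y, a, weights, lam):
    """Unregularized per-sample loss plus the L2 regularization term."""
    return cross_entropy(y, a) + l2_penalty(weights, lam)

# Hypothetical values, chosen only for illustration.
y = [1.0, 0.0]
a = [0.8, 0.1]
weights = [0.5, -1.0, 2.0]

# With lam = 0 the loss is the plain cross-entropy; it grows as lam increases.
for lam in (0.0, 0.1, 1.0):
    print(lam, regularized_loss(y, a, weights, lam))
```

With $λ = 0$ the penalty vanishes and the plain cross-entropy is recovered, which is a quick sanity check for any implementation of this kind.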

Assumptions Behind the above formulas

The previous formulas refer to the case where the training sample $x$ is considered fixed.
It is therefore assumed that the input data do not vary, a typical condition in theoretical analyses, where the focus is placed on the network’s behavior with respect to the weights rather than on the variability of the dataset.

Why the Square in L2 Regularization?

The squared term $w_{j}^{2}$ is used because:

It is never negative: both positive and negative weights increase the cost.

It avoids cancellations: without squaring, positive and negative weights of the same magnitude could cancel each other.

It penalizes large weights more strongly: squaring amplifies large values, discouraging the network from letting any single weight grow too much.
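
The three properties can be checked on a tiny example. This is only an illustrative sketch with made-up weights; `penalty` is a hypothetical helper name.

```python
# Made-up weights: two large values of opposite sign and one small value.
weights = [3.0, -3.0, 0.1]

plain_sum = sum(weights)                   # the two large weights cancel out
squared_sum = sum(w * w for w in weights)  # every weight adds a non-negative amount

def penalty(w):
    """Contribution of a single weight to the squared penalty."""
    return w * w

print(plain_sum, squared_sum)
# Doubling a weight quadruples its contribution to the penalty.
print(penalty(1.0), penalty(2.0), penalty(4.0))
```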

Why Does It Help Against Overfitting?

Since training aims to minimize the cost function, the optimizer is also pushed to keep the term $\sum_{j} w_{j}^{2}$ small.
👉 This leads the network to prefer solutions with overall smaller weights, which tend to be more stable and less prone to overfitting (see previous note).

General L2-Regularized Loss

So far, the L2-regularized loss has been expressed with respect to a single training sample $x$ .
In practice, however, the loss is computed over the entire training set of $n$ samples by averaging across them.
This leads to the following general form:

$\tilde{L} = \underbrace{\frac{1}{n} \sum_{x \in train} L_{x}}_{\text{data term (averaged over } n \text{)}} + \underbrace{\frac{λ}{2 n} \sum_{w} w^{2}}_{\text{regularization term (scaled by } n \text{)}}$

Here $\tilde{L}$ denotes the regularized loss. The first term, $L = \frac{1}{n} \sum_{x \in train} L_{x}$ , is the average unregularized loss over the entire dataset.
The second term is the sum of the squares of all the network’s weights, multiplied by $\frac{λ}{2 n}$ . It penalizes large weights and carries the same $\frac{1}{n}$ factor as the averaged data term, so that the two terms stay on a comparable scale.

Two Conventions for the Regularized Loss

Sum convention
$\tilde{L} = \sum_{i=1}^{n} L_{i} + \frac{λ}{2} \sum_{w} w^{2}$
➡️ The data term grows linearly with $n$ .
➡️ The regularization term always has coefficient $\frac{λ}{2}$ , so its weight relative to the data term shrinks as the dataset grows.

Mean convention
$\tilde{L} = \frac{1}{n} \sum_{i=1}^{n} L_{i} + \frac{λ}{2 n} \sum_{w} w^{2}$
➡️ The data term is normalized per sample, so its magnitude does not depend on $n$ .
➡️ This is the sum-convention loss divided by $n$ : for a given $λ$ the minimizer is the same, but the loss and its gradients stay on a per-sample scale as the dataset size varies.
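
The two conventions can be compared numerically. The sketch below assumes the per-sample losses are already available as a list. The function names and values are hypothetical. With the formulas as written, the mean-convention value is the sum-convention value divided by $n$.

```python
def sum_of_squares(weights):
    return sum(w * w for w in weights)

def loss_sum_convention(per_sample_losses, weights, lam):
    """Sum of the per-sample losses plus (lam / 2) * sum of squared weights."""
    return sum(per_sample_losses) + 0.5 * lam * sum_of_squares(weights)

def loss_mean_convention(per_sample_losses, weights, lam):
    """Mean of the per-sample losses plus (lam / (2 n)) * sum of squared weights."""
    n = len(per_sample_losses)
    return sum(per_sample_losses) / n + lam / (2.0 * n) * sum_of_squares(weights)

# Made-up per-sample losses and weights.
losses = [0.2, 0.4, 0.9, 0.5]
weights = [1.0, -2.0]
lam = 0.5

total = loss_sum_convention(losses, weights, lam)
mean = loss_mean_convention(losses, weights, lam)
print(total, mean, total / len(losses))
```

Since the two values differ only by the constant factor $\frac{1}{n}$ , the practical difference lies in the scale of the loss and of its gradients, which affects the choice of learning rate.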

Which convention to use?
In modern Deep Learning, the mean convention is regarded as more robust, particularly with mini-batch SGD.
This is because mini-batches approximate the dataset average: using the mean ensures that both the data term and the regularization term are on the same scale, regardless of the batch size or the total number of samples. This stability makes hyperparameter tuning (e.g., the choice of $λ$ ) more reliable and consistent across different training regimes.

Mini-batch L2-Regularized Loss

In practice, training is not performed on the entire dataset at once, but on mini-batches of $m$ samples.

The mini-batch L2-regularized loss is written as:

$L_{mb} = \underbrace{\frac{1}{m} \sum_{x \in \text{mini-batch}} L_{x}}_{\text{data term (averaged over } m \text{)}} + \underbrace{\frac{λ}{2 n} \sum_{w} w^{2}}_{\text{regularization term (scaled by } n \text{)}}$

where:

Data term: it uses the factor $\frac{1}{m}$ since the loss is averaged over the $m$ training samples in the current mini-batch. This is a stochastic approximation of the full-dataset loss: $L = \frac{1}{n} \sum_{x \in train} L_{x}$
Regularization term: this term is scaled by $\frac{1}{n}$ (the total dataset size), not by $m$ .
This ensures that the strength of the penalty is independent of the chosen mini-batch size.
Dividing by $m$ would make the effect of regularization artificially stronger or weaker depending on the batch size.
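
A minimal sketch of the mini-batch loss, assuming the per-sample losses of the current batch are given as a list (`minibatch_loss` and the numbers are hypothetical). It returns the two terms separately to show that the regularization term depends on $n$ but not on the batch size $m$.

```python
def minibatch_loss(batch_losses, weights, lam, n):
    """Return (data term, regularization term) for one mini-batch.

    batch_losses: unregularized per-sample losses in the mini-batch
    n: total number of training samples (not the batch size)
    """
    m = len(batch_losses)
    data_term = sum(batch_losses) / m
    reg_term = lam / (2.0 * n) * sum(w * w for w in weights)
    return data_term, reg_term

weights = [1.0, 2.0]
n, lam = 1000, 5.0

small = minibatch_loss([0.2, 0.4], weights, lam, n)            # m = 2
large = minibatch_loss([0.3, 0.3, 0.3, 0.3], weights, lam, n)  # m = 4

# The regularization term is identical for both batch sizes.
print(small, large)
```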

Note

It is also important to note that the regularization term does not include the biases, as will be explained further on.

Intuitively, the effect of regularization is to make the network tend to prefer small weights, all else being equal.
Large weights will only be allowed if they significantly improve the first part of the loss function.

Important

In other words, regularization can be seen as a trade-off between the minimization of the original loss function and the reduction of weight magnitudes.
The relative importance of these two elements depends on the value of $λ$ :

when $λ$ is small, the loss is driven mainly by the original objective;

when $λ$ is large, the loss is dominated by the weight penalty term.

The procedure for selecting $λ$ will be discussed later.
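
The trade-off can be made concrete with a toy one-parameter problem that is not part of the original notes: minimize $\frac{1}{2} (w - t)^{2} + \frac{λ}{2} w^{2}$ , where $t$ is a hypothetical target value. Setting the derivative $(w - t) + λ w$ to zero gives $w^{*} = \frac{t}{1 + λ}$ .

```python
def best_weight(target, lam):
    """Minimizer of 0.5 * (w - target)**2 + 0.5 * lam * w**2."""
    return target / (1.0 + lam)

# Small lam: the optimum stays close to the unregularized solution (w = target).
# Large lam: the optimum is pulled toward zero.
for lam in (0.0, 1.0, 9.0, 99.0):
    print(lam, best_weight(1.0, lam))
```

In this toy setting the weight never reaches exactly zero for finite $λ$ ; it only shrinks as the penalty gains importance.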

Warning

At first glance, it is not at all obvious that such a trade-off could help reduce overfitting.
However, it turns out that it does. The reason why this happens was discussed in the previous note.

GD for an L2-regularized network

To apply the gradient descent learning algorithm in a regularized neural network, it is necessary to compute the partial derivatives $\partial L / \partial w$ and $\partial L / \partial b$ for all the weights and biases in the network.

Tip

Strictly speaking, the backpropagation equations should be reformulated to account for the change in the loss function that now includes the regularization term.
In practice, however, one can take a shortcut and directly modify the weight update rule, without altering the backpropagation equations.

Given the form of the regularized loss function $\tilde{L} = L + \frac{λ}{2 n} \sum_{w} w^{2}$ , where $L$ is the unregularized loss, consider a single weight $w_{j}$ and compute the partial derivative of the regularization term with respect to it.

Derivative of the regularized loss w.r.t. $w_{j}$	Derivative of the regularized loss w.r.t. $b_{j}$
Since: $\frac{\partial}{\partial w_{j}} (\frac{λ}{2 n} \sum_{k} w_{k}^{2}) = \frac{λ}{2 n} \sum_{k} \frac{\partial}{\partial w_{j}} (w_{k}^{2}) = \frac{λ}{n} w_{j}$ then: $\frac{\partial \tilde{L}}{\partial w_{j}} = \frac{\partial L}{\partial w_{j}} + \frac{λ}{n} w_{j}$	Since the regularization term does not depend on the biases: $\frac{\partial \tilde{L}}{\partial b_{j}} = \frac{\partial L}{\partial b_{j}}$

Important

The terms $\frac{\partial L}{\partial w _{j}}$ and $\frac{\partial L}{\partial b _{j}}$ , which refer to the unregularized loss $L$ , can be computed using standard backpropagation.
Consequently, the gradient of the regularized loss function is obtained in a straightforward way: apply standard backpropagation and add $\frac{λ}{n} w_{j}$ to the partial derivative of the $j$ -th weight.
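
The rule "unregularized gradient plus $\frac{λ}{n} w_{j}$ " can be checked against finite differences. The sketch below is an assumption-laden toy: instead of a real network with backpropagation, it uses a single linear neuron with a quadratic loss, whose gradient is written by hand. All names and values are hypothetical.

```python
def unreg_loss(w, x, y):
    """Quadratic loss of a single linear neuron (toy stand-in for a network)."""
    pred = sum(wj * xj for wj, xj in zip(w, x))
    return 0.5 * (pred - y) ** 2

def unreg_grad(w, x, y):
    """Hand-derived gradient of unreg_loss (plays the role of backpropagation)."""
    pred = sum(wj * xj for wj, xj in zip(w, x))
    return [(pred - y) * xj for xj in x]

def reg_loss(w, x, y, lam, n):
    return unreg_loss(w, x, y) + lam / (2.0 * n) * sum(wj * wj for wj in w)

def reg_grad(w, x, y, lam, n):
    """Unregularized gradient plus (lam / n) * w_j for each weight."""
    return [g + lam / n * wj for g, wj in zip(unreg_grad(w, x, y), w)]

def numeric_grad(f, w, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        grads.append((f(up) - f(down)) / (2.0 * h))
    return grads

w, x, y = [0.5, -1.5], [2.0, 1.0], 1.0
lam, n = 3.0, 10

analytic = reg_grad(w, x, y, lam, n)
numeric = numeric_grad(lambda v: reg_loss(v, x, y, lam, n), w)
print(analytic, numeric)
```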

GD weight update rule	GD bias update rule
$w_{j}^{(t + 1)} = w_{j}^{(t)} - η (\frac{\partial L}{\partial w _{j}} + \frac{λ}{n} w_{j}^{(t)}) = (1 - η \frac{λ}{n}) w_{j}^{(t)} - η \frac{\partial L}{\partial w _{j}}$	$b_{j}^{(t + 1)} = b_{j}^{(t)} - η \frac{\partial L}{\partial b _{j}}$
This is exactly the same rule as in standard gradient descent, except that the weight $w_{j}$ is first rescaled by a factor of $(1 - η \frac{λ}{n})$ . This rescaling is known as weight decay, since it tends to make the weights smaller. At first glance, it might seem that the weights are relentlessly pushed toward zero, but this is not the case: the other term in the update (the derivative of the unregularized loss function) can still push the weights to increase if doing so reduces the unregularized loss.	The partial derivatives with respect to the biases remain unchanged, so the learning rule for the biases does not differ from the standard one.

Note

This approach makes it possible to incorporate regularization in a simple and modular way, without the need to reformulate the entire backpropagation.
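
A minimal sketch of the modular update, assuming the unregularized gradients `grad_w` and `grad_b` have already been computed elsewhere (for example by backpropagation). The function name and the numbers are hypothetical.

```python
def gd_step(weights, biases, grad_w, grad_b, eta, lam, n):
    """One gradient descent step with L2 regularization on the weights only.

    grad_w, grad_b: gradients of the UNREGULARIZED loss.
    """
    decay = 1.0 - eta * lam / n  # weight decay factor
    new_weights = [decay * w - eta * g for w, g in zip(weights, grad_w)]
    new_biases = [b - eta * g for b, g in zip(biases, grad_b)]  # no decay
    return new_weights, new_biases

# With zero loss gradient, only the decay acts: weights shrink, biases do not.
weights, biases = [1.0, -2.0], [0.5]
for _ in range(3):
    weights, biases = gd_step(weights, biases, [0.0, 0.0], [0.0],
                              eta=0.1, lam=5.0, n=10)
print(weights, biases)
```

Because the decay factor multiplies the current weight, each weight shrinks in proportion to its own size, while a nonzero loss gradient can still push it in either direction.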

Mini-batch SGD for an L2-regularized network

Question

What changes with mini-batch stochastic gradient descent (SGD)?

SGD weight update rule	SGD bias update rule
$w_{j}^{(t + 1)} = (1 - η \frac{λ}{n}) w_{j}^{(t)} - \frac{η}{m} \sum_{x} \frac{\partial L _{x}}{\partial w _{j}}$	$b_{j}^{(t + 1)} = b_{j}^{(t)} - \frac{η}{m} \sum_{x} \frac{\partial L _{x}}{\partial b _{j}}$
Just as in the unregularized case, $\frac{\partial L}{\partial w}$ can be estimated by averaging over a mini-batch of $m$ training examples. The sum is taken over the examples $x$ in the mini-batch. $L_{x}$ is the unregularized loss function for each example. 👉 This is exactly the same rule as in unregularized stochastic gradient descent, except for the weight decay factor $1 - η \frac{λ}{n}$ , which reduces the weights at each step.	The sum is taken over the examples $x$ in the mini-batch. 👉 The regularized update rule for the biases is identical to the unregularized one.
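
The mini-batch weight update can be sketched as follows, assuming the per-example gradients of the unregularized loss are available as a list of lists (one inner list per example in the mini-batch). `sgd_step` and the values are hypothetical.

```python
def sgd_step(weights, per_example_grads, eta, lam, n):
    """One mini-batch SGD step on the weights with L2 regularization.

    per_example_grads[i][j]: gradient of the unregularized loss L_x of
    example i with respect to weight j.
    n: total number of training samples (the batch size m is inferred).
    """
    m = len(per_example_grads)
    decay = 1.0 - eta * lam / n
    avg_grad = [
        sum(g[j] for g in per_example_grads) / m for j in range(len(weights))
    ]
    return [decay * w - eta * a for w, a in zip(weights, avg_grad)]

weights = [1.0, -2.0]
grads = [[0.2, 0.0], [0.4, -0.2]]  # mini-batch of m = 2 examples
new_weights = sgd_step(weights, grads, eta=0.5, lam=10.0, n=100)
print(new_weights)
```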

AdamW (Adam + L2 regularization)

Do not use Adam + L2 regularization

In Adam, the step size is rescaled for each parameter based on the history of its gradients and of its squared gradients.
Strictly speaking, this history is not a plain arithmetic sum but an exponential moving average (EMA) of each quantity.

The problem arises when trying to apply Adam’s rescaling strategy to the entire term:
$\frac{\partial L}{\partial w _{j}} + \frac{λ}{n} w_{j}$
Doing so would be a serious conceptual mistake.

The Pitfall of Adam + L2 Regularization

This is because Adam is a learning rate rescaling strategy designed for terms of the form $\frac{\partial L}{\partial w _{j}}$ , not for the terms $\frac{λ}{n} w_{j}$ that arise from regularization. These two terms serve distinct purposes.

Adam is designed to adjust the learning rate based on the behavior of the loss gradient, not the regularization term.

Rescaling $\frac{λ}{n} w_{j}$ as well would defeat the purpose of the penalty:
the goal of regularization is to shrink all weights at a rate proportional to their size, but after Adam's per-parameter rescaling, weights with a history of large gradients are shrunk less than weights with a history of small gradients.

Therefore, when using Adam together with a weight decay term, directly coupling Adam and L2 is not correct.

AdamW: Decoupled Weight Decay

The correct way to combine Adam with L2 regularization is to use AdamW,
a variant specifically designed to apply weight decay properly by
decoupling it from the adaptive gradient update.

AdamW operates in two distinct phases:

First phase: apply the standard Adam update rule to the loss gradient, with adaptive learning rate rescaling.

Second phase: apply weight decay as a separate update, subtracting a term proportional to $λ \cdot w_{j}$ that is scaled only by the learning rate $η$ and bypasses the adaptive rescaling.

✅ In summary, if L2 regularization is required with Adam, AdamW must be used.
Unlike Adam + L2, which incorrectly mixes objectives, AdamW decouples weight decay from gradient adaptation, ensuring stable and mathematically consistent regularization.
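
The difference can be seen on a single scalar weight. The sketch below is a simplified, single-parameter version of the two update rules, not a library implementation: `adam_l2_step` folds the penalty gradient into the gradient seen by Adam, while `adamw_step` applies the decay separately. The function names are hypothetical, `decay` plays the role of $\frac{λ}{n}$ in the notation of these notes, and the defaults for `beta1`, `beta2` and `eps` are the commonly used Adam values.

```python
import math

def adam_l2_step(w, grad, m, v, t, lr, decay, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with the L2 penalty gradient added to the loss gradient (coupled)."""
    g = grad + decay * w                   # penalty goes through Adam's rescaling
    m = beta1 * m + (1.0 - beta1) * g      # EMA of gradients
    v = beta2 * v + (1.0 - beta2) * g * g  # EMA of squared gradients
    m_hat = m / (1.0 - beta1 ** t)         # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr, decay, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on the loss gradient only, plus a decoupled weight decay term."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    return w - adaptive - lr * decay * w, m, v  # decay scaled only by lr

# First step from w = 1 with zero loss gradient, for two decay strengths.
for decay in (0.01, 0.1):
    w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, decay=decay)
    w_aw, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, decay=decay)
    print(decay, w_l2, w_aw)
```

In this toy setting the coupled version moves the weight by roughly the learning rate regardless of the decay strength, because the penalty gradient is normalized by its own magnitude, whereas the decoupled version shrinks the weight in proportion to `lr * decay`.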

Recap

While the terms L2 regularization and weight decay are often used interchangeably, they are technically distinct and their equivalence depends on the optimizer being used.

With SGD: The two are mathematically equivalent. Adding the L2 penalty term to the loss function results in a gradient update rule that is identical to multiplying the weights by the decay factor $(1 - η \frac{λ}{n})$ at each step.

With Adaptive Optimizers (e.g., Adam): The equivalence breaks. Adam adapts the learning rate for each parameter based on the history of its gradients. If L2 regularization is implemented by simply adding its derivative ( $\frac{λ}{n} w$ ) to the loss gradient, Adam will incorrectly rescale this regularization term as well. This can lead to ineffective or unpredictable regularization.

✅ This is precisely why AdamW was introduced. It decouples weight decay from the gradient update, applying it directly to the weights after the Adam optimization step. This ensures that the decay is applied consistently, as intended by the original concept of weight decay.

Regularization and biases

Note

L2 regularization is usually not applied to bias terms.
While it is technically possible to include them, empirical results show that the effect is often negligible, which is why this is largely a conventional choice.

Why Large Biases Do Not Pose a Problem

A large bias does not make a neuron sensitive to inputs in the same way large weights do.
Therefore, there is no need to worry that large biases will cause the network to learn the noise in the data.

Large Biases Can Enhance Network Flexibility

Allowing large biases can increase the flexibility of the network’s behavior.
For example, large biases facilitate neuron saturation, an effect that can sometimes be functional or even desirable.

For these reasons, in practice biases are almost never included in the regularization terms.
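
A minimal sketch of a penalty that skips the biases, assuming parameters are stored in a dictionary whose keys end in `"bias"` for bias terms. This naming convention, the function name and the values are hypothetical.

```python
def l2_penalty_weights_only(params, lam, n):
    """(lam / (2 n)) * sum of squared weights, ignoring bias parameters."""
    total = 0.0
    for name, values in params.items():
        if name.endswith("bias"):  # assumed naming convention for biases
            continue
        total += sum(v * v for v in values)
    return lam / (2.0 * n) * total

params = {
    "layer1.weight": [1.0, -2.0],
    "layer1.bias": [10.0],
    "layer2.weight": [0.5],
    "layer2.bias": [-7.0],
}

# The large biases do not contribute to the penalty.
print(l2_penalty_weights_only(params, lam=2.0, n=10))
```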