L2 regularization in depth

The previous note introduced L2 regularization at the level of the update rule: at every gradient step, the weight $w_{j}$ is rescaled by a factor $(1 - η λ / n)$ before the usual gradient update is applied. That view is operationally sufficient, but it leaves a deeper question unanswered:

The deeper question

Suppose training converges. What does the regularized solution $\tilde{w}$ look like, as a function of the unregularized optimum $w^{⋆}$? Which components of $w^{⋆}$ does L2 shrink, and which does it leave essentially untouched?

The answer is one of the most elegant results in classical regularization theory: L2 regularization performs an anisotropic shrinkage of $w^{⋆}$ along the eigenvectors of the loss Hessian. The components of $w^{⋆}$ that lie along flat directions of the loss (small eigenvalues of the Hessian) are shrunk almost to zero, while components along steep directions (large eigenvalues) survive nearly intact. The derivation is short, the geometry is striking, and the conclusion gives the deepest intuition for why L2 regularization improves generalization.

This note follows the analysis of Goodfellow, Bengio and Courville (Deep Learning, Section 7.1.1), adapted to the notation used throughout this site.

Setup

The regularized objective, as defined in L2 regularization, is

\tilde{L} (w) = L (w) + \frac{λ}{2} w^{⊤} w = L (w) + \frac{λ}{2} ∥ w ∥_{2}^{2} .

Notation reconciled with the previous note

The previous note writes the regularization term as $\frac{λ}{2 n} \sum_{w} w^{2}$ , with the $1/ n$ factor included so that $λ$ stays dataset-size-independent. The analysis in this note works in scaled form: $L$ here is the (already-averaged) data loss and $λ$ here is the rescaled coefficient that absorbs the $1/ n$ . Either convention gives the same algebraic result; the simpler form $\frac{λ}{2} ∥ w ∥_{2}^{2}$ is used throughout to keep the formulas readable.

For the same reason, biases are omitted from $w$ throughout this note. The previous note explains why L2 is conventionally not applied to biases; including them would not change the structure of the analysis but would clutter the notation.

Why this form is also called "weight decay" and "ridge"

The penalty $\frac{λ}{2} ∥ w ∥_{2}^{2}$ goes by three names across communities:

L2 regularization (machine learning, deep learning);

weight decay (deep learning, when emphasizing the update-rule view);

ridge regression or Tikhonov regularization (statistics, classical inverse problems).

The names refer to the same mathematical object. The latter two reflect the historical origins of the construction in linear regression and operator theory, respectively. The analysis below applies to any twice-differentiable loss; the linear-regression special case appears at the end.

The gradient and the update rule, written compactly

Differentiating the regularized loss gives the gradient

\nabla \tilde{L} (w) = \nabla L (w) + λ w .

A single step of gradient descent therefore reads

w \leftarrow w - η (\nabla L (w) + λ w) = (1 - η λ) w - η \nabla L (w) .

This is the weight-decay form: shrink the weight by the multiplicative factor $(1 - η λ)$ , then apply the usual unregularized update. The rule does not push the weights inexorably to zero: the second term, the unregularized gradient, can still drive any individual weight to grow if doing so reduces $L$ enough.
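The equivalence of the two forms of the update can be checked numerically. The sketch below is illustrative only: the toy least-squares data loss, the random data, and the names `grad_data_loss`, `eta` and `lam` are assumptions made for the example, not part of the note.

```python
import numpy as np

def grad_data_loss(w, X, y):
    """Gradient of a toy averaged data loss L(w) = ||Xw - y||^2 / (2n)."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
eta, lam = 0.1, 0.5  # step size and regularization strength (arbitrary)

# Form 1: one gradient step on the regularized loss L(w) + lam/2 * ||w||^2.
step_regularized = w - eta * (grad_data_loss(w, X, y) + lam * w)

# Form 2: shrink the weights by (1 - eta * lam), then apply the
# unregularized gradient evaluated at the original w.
step_decay = (1 - eta * lam) * w - eta * grad_data_loss(w, X, y)

same_update = np.allclose(step_regularized, step_decay)
print("both forms agree:", same_update)
```

The gradient in the second form is evaluated at the weights before the decay, which is what makes the two expressions algebraically identical.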

Per-step vs at-convergence

The update rule above describes what happens during a single step. It says nothing about where the trajectory ultimately lands. The question this note addresses is the fixed point: where does the regularized iterate converge, and how does that limit $\tilde{w}$ relate to the unregularized optimum $w^{⋆}$? The answer requires going past the update rule and looking at the shape of $L$ around $w^{⋆}$ .

Quadratic approximation around the unregularized optimum

To make the analysis tractable, the unregularized loss is approximated by a second-order Taylor expansion $\hat{L}$ around the unregularized optimum $w^{⋆} = \arg\min_{w} L (w)$ :

\hat{L} (w) = L (w^{⋆}) + \frac{1}{2} (w - w^{⋆})^{⊤} H (w - w^{⋆}),

where $H = \nabla^{2} L (w^{⋆})$ is the Hessian of $L$ evaluated at $w^{⋆}$ .

Why no first-order term

A second-order Taylor expansion at a point $w^{⋆}$ is
$L (w) \approx L (w^{⋆}) + \nabla L (w^{⋆})^{⊤} (w - w^{⋆}) + \frac{1}{2} (w - w^{⋆})^{⊤} H (w - w^{⋆}) .$
At a minimum, the gradient vanishes: $\nabla L (w^{⋆}) = 0$ . The first-order term therefore drops out, leaving the pure quadratic above.

A second consequence of $w^{⋆}$ being a minimum: the Hessian $H$ is positive semi-definite. All its eigenvalues $λ_{i}$ satisfy $λ_{i} \geq 0$ .

When is the quadratic approximation exact?

For loss functions that are already quadratic in $w$ , no approximation is involved: the second-order Taylor expansion equals the true loss. The most important example is linear regression with squared error, where $L (w) = \frac{1}{2} ∥ Xw - y ∥^{2}$ , and the regularized version becomes exactly ridge regression. The result derived below is therefore an exact statement for ridge regression and a local approximation for any other twice-differentiable loss near its minimum.
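The exactness claim for squared-error linear regression can be verified directly. The following is a minimal sketch on synthetic data; it assumes $X^{⊤} X$ is invertible so that the least-squares minimizer exists, and the helper names `loss` and `quadratic_expansion` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

def loss(w):
    """Squared-error loss L(w) = 0.5 * ||Xw - y||^2."""
    r = X @ w - y
    return 0.5 * r @ r

H = X.T @ X                            # Hessian of the loss (constant in w)
w_star = np.linalg.solve(H, X.T @ y)   # least-squares minimizer

def quadratic_expansion(w):
    """Second-order Taylor expansion around w_star (gradient term is zero)."""
    d = w - w_star
    return loss(w_star) + 0.5 * d @ H @ d

# Evaluate far from the optimum: the expansion should still match the loss.
w_far = w_star + 5.0 * rng.normal(size=4)
matches = np.isclose(loss(w_far), quadratic_expansion(w_far))
print("expansion equals true loss:", matches)
```

For a non-quadratic loss the same comparison would agree only near $w^{⋆}$ , which is the sense in which the general result is local.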

The gradient of the quadratic approximation $\hat{L}$ is

\nabla \hat{L} (w) = H (w - w^{⋆}) .

It vanishes precisely at $w = w^{⋆}$ , recovering the unregularized minimum.

Solving for the regularized optimum

Adding the gradient of the regularization term $\frac{λ}{2} ∥ w ∥_{2}^{2}$ , which is $λ w$ , and setting the total gradient to zero, the regularized optimum $\tilde{w}$ satisfies

λ \tilde{w} + H (\tilde{w} - w^{⋆}) = 0 .

Collecting terms,

(H + λ I) \tilde{w} = H w^{⋆},

and assuming $H + λ I$ is invertible (it always is for $λ > 0$ , because the eigenvalues of $H + λ I$ are $λ_{i} + λ > 0$ ),

\tilde{w} = (H + λ I)^{- 1} H w^{⋆} .

This is the closed-form relationship between the regularized and unregularized optima for a quadratic loss. Two sanity checks:

As $λ \to 0$ , the regularization vanishes: for invertible $H$ , $(H + 0 \cdot I)^{- 1} H = H^{- 1} H = I$ , so $\tilde{w} \to w^{⋆}$ .
As $λ \to \infty$ , the regularization dominates: $(H + λ I)^{- 1} H \approx (λ I)^{- 1} H = H / λ \to 0$ , so $\tilde{w} \to 0$ .
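The closed form and its two limits can be sketched in a few lines of NumPy. The Hessian here is a random symmetric positive definite matrix chosen for illustration, and `regularized_optimum` is a hypothetical helper name; the linear system is solved directly rather than by forming the inverse.

```python
import numpy as np

def regularized_optimum(H, w_star, lam):
    """Solve (H + lam * I) w = H w_star for the regularized optimum."""
    d = H.shape[0]
    return np.linalg.solve(H + lam * np.eye(d), H @ w_star)

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
H = A @ A.T + 0.1 * np.eye(5)   # toy symmetric positive definite Hessian
w_star = rng.normal(size=5)

w_small = regularized_optimum(H, w_star, 1e-10)  # lam -> 0
w_large = regularized_optimum(H, w_star, 1e10)   # lam -> infinity
w_mid = regularized_optimum(H, w_star, 1.0)

# Stationarity condition at lam = 1: lam * w + H (w - w_star) should vanish.
residual = 1.0 * w_mid + H @ (w_mid - w_star)

print("lam -> 0 recovers w_star:", np.allclose(w_small, w_star, atol=1e-6))
print("lam -> inf gives ~0, norm:", np.linalg.norm(w_large))
print("stationarity residual norm:", np.linalg.norm(residual))
```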

The interesting regime is in between: what does $(H + λ I)^{- 1} H$ actually do to $w^{⋆}$ ? The answer becomes transparent in the basis of eigenvectors of $H$ .

The geometry: eigendecomposition of the Hessian

Since $H$ is real and symmetric (it is a Hessian of a real-valued function), the spectral theorem guarantees an orthogonal eigendecomposition

H = Q Λ Q^{⊤},

where $Λ = diag (λ_{1}, \dots, λ_{d})$ collects the (non-negative) eigenvalues of $H$ on its diagonal, and the columns of $Q$ are the corresponding orthonormal eigenvectors $q_{i}$ . The matrix $Q$ is therefore orthogonal: $Q^{⊤} Q = Q Q^{⊤} = I$ . The eigenvalues carry a subscript, $λ_{i}$ , while the single unsubscripted $λ$ is always the regularization strength; the entire result will turn on how these two quantities compare.

A geometric reading of the eigenvalues

Each eigenvalue $λ_{i}$ of $H$ measures the curvature of the loss along the corresponding eigenvector direction $q_{i}$ . Substituting $w - w^{⋆} = q_{i}$ (a unit step along eigenvector $i$ ) into the quadratic approximation,
$L (w^{⋆} + q_{i}) - L (w^{⋆}) = \frac{1}{2} q_{i}^{⊤} H q_{i} = \frac{1}{2} λ_{i} .$
Large $λ_{i}$ means a steep direction: a small step in direction $q_{i}$ increases the loss a lot. Small $λ_{i}$ means a flat direction: a step in direction $q_{i}$ barely changes the loss.

Loss landscapes of trained neural networks typically have a strongly anisotropic spectrum: a few large eigenvalues (a few directions along which the loss curves sharply) and many small eigenvalues (most directions are nearly flat). The flat directions are the ones where the data essentially does not constrain the solution: the loss has many equally good answers.
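The curvature reading of the eigenvalues can be checked on a small example. This sketch assumes a random positive semi-definite matrix as a stand-in Hessian and uses `numpy.linalg.eigh`, which returns eigenvalues in ascending order with orthonormal eigenvectors as columns.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
H = A @ A.T                      # toy symmetric positive semi-definite Hessian

eigvals, Q = np.linalg.eigh(H)   # ascending eigenvalues, eigenvectors in columns

def quadratic_increase(step):
    """Loss increase 0.5 * step^T H step under the quadratic approximation."""
    return 0.5 * step @ H @ step

# A unit step along eigenvector q_i should raise the loss by lambda_i / 2.
increases = np.array([quadratic_increase(Q[:, i]) for i in range(4)])

for lam_i, inc in zip(eigvals, increases):
    print(f"eigenvalue {lam_i:9.4f}  ->  loss increase {inc:9.4f}")
```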

Substituting the eigendecomposition into the closed-form solution $\tilde{w} = (H + λ I)^{- 1} H w^{⋆}$ ,

\tilde{w} = (Q Λ Q^{⊤} + λ I)^{- 1} Q Λ Q^{⊤} w^{⋆} .

Since $λ I = Q (λ I) Q^{⊤}$ (the identity is unchanged under any orthogonal change of basis),

Q Λ Q^{⊤} + λ I = Q (Λ + λ I) Q^{⊤} .

The inverse of an orthogonal-conjugated diagonal matrix is the orthogonal-conjugated inverse:

(Q (Λ + λ I) Q^{⊤})^{- 1} = Q (Λ + λ I)^{- 1} Q^{⊤} .

Substituting back:

\tilde{w} = Q (Λ + λ I)^{- 1} Q^{⊤} Q Λ Q^{⊤} w^{⋆} .

The middle $Q^{⊤} Q$ collapses to $I$ by orthogonality, leaving the clean expression

\tilde{w} = Q (Λ + λ I)^{- 1} Λ Q^{⊤} w^{⋆} .

This is the central result of the section. It says exactly what L2 regularization does to the optimum.
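As a numerical cross-check, the eigenbasis expression should agree with the direct solve of $(H + λ I) w = H w^{⋆}$ . The sketch below assumes a random positive semi-definite Hessian and an arbitrary regularization strength `lam`; the diagonal matrix is applied as an elementwise product rather than built explicitly.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 6))
H = A @ A.T                      # toy symmetric positive semi-definite Hessian
w_star = rng.normal(size=6)
lam = 0.7

# Direct route: solve (H + lam * I) w = H w_star.
direct = np.linalg.solve(H + lam * np.eye(6), H @ w_star)

# Eigenbasis route: rotate, rescale each coordinate, rotate back.
eigvals, Q = np.linalg.eigh(H)
shrink = eigvals / (eigvals + lam)        # diagonal of (Lambda + lam I)^-1 Lambda
via_eigenbasis = Q @ (shrink * (Q.T @ w_star))

print("routes agree:", np.allclose(direct, via_eigenbasis))
```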

Reading the central result coordinate by coordinate

The matrix $(Λ + λ I)^{- 1} Λ$ is diagonal, since it is a product of two diagonal matrices. Its $i$ -th diagonal entry is

\frac{λ _{i}}{λ _{i} + λ} .

The vector $Q^{⊤} w^{⋆}$ expresses $w^{⋆}$ in the eigenbasis of $H$ : the $i$ -th component, $(Q^{⊤} w^{⋆})_{i}$ , is the projection of $w^{⋆}$ onto the $i$ -th eigenvector $q_{i}$ .

Putting everything together, the central result says:

The central insight, in one line

In the eigenbasis of the loss Hessian, L2 regularization rescales the unregularized optimum $w^{⋆}$ coordinate by coordinate, with the $i$ -th coordinate multiplied by $\frac{λ _{i}}{λ _{i} + λ}$ .

The shrinkage factor depends only on the ratio $λ_{i} / λ$ between the local curvature in that direction and the regularization strength:

$λ_{i} ≫ λ$ (steep direction): $\frac{λ _{i}}{λ _{i} + λ} \approx 1$ . The coordinate is essentially preserved. The loss strongly opposes any shrinkage, so the regularization has little effect.

$λ_{i} ≪ λ$ (flat direction): $\frac{λ _{i}}{λ _{i} + λ} \approx 0$ . The coordinate is shrunk almost to zero. The loss does not care about this direction, so the regularization wins.

$λ_{i} \sim λ$ : intermediate shrinkage by a factor near $1/2$ .

This is anisotropic shrinkage, and it is the key to understanding why L2 regularization works.
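The three regimes are easy to tabulate. The sketch below uses a hypothetical helper `shrinkage_factor` and arbitrary curvature values chosen to sit well above, at, and well below the regularization strength.

```python
def shrinkage_factor(curvature, lam):
    """Per-direction factor lambda_i / (lambda_i + lambda)."""
    return curvature / (curvature + lam)

lam = 1.0
for label, curvature in [("steep", 100.0), ("comparable", 1.0), ("flat", 0.01)]:
    factor = shrinkage_factor(curvature, lam)
    print(f"{label:>10}: curvature={curvature:7.2f}  factor={factor:.4f}")
```

Scaling both the curvature and `lam` by the same constant leaves the factor unchanged, which is the sense in which only the ratio matters.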

Why this is the deep insight

A naive view of L2 regularization holds that it just “makes the weights smaller”. The central result refines this view considerably: L2 makes some weights smaller, but not uniformly, and the directions in which it shrinks aggressively are exactly the directions in which the data does not care about the answer.

A direction along which the loss is flat is a direction in which the training data leaves the model parameters underdetermined. The choice along such a direction is essentially arbitrary, and an unregularized optimizer may pick a large value for purely numerical reasons. L2 regularization breaks this indeterminacy in favour of zero, the simplest and most generalizable choice. This is the structural reason L2 improves generalization: it removes degrees of freedom the data did not constrain.

The shrinkage factors add up to something meaningful

The per-direction factors $λ_{i} / (λ_{i} + λ)$ do more than describe the solution one coordinate at a time; their sum is itself an interpretable number,
$γ = \sum_{i = 1}^{d} \frac{λ_{i}}{λ_{i} + λ} = \operatorname{tr} [(H + λ I)^{- 1} H] .$
Each direction contributes a value between $0$ and $1$ , near $1$ where the data fixes the coefficient and near $0$ where it does not, so $γ$ counts the effective number of parameters the regularized model actually uses. As $λ \to 0$ it approaches the number of strictly positive eigenvalues of $H$ , which is the full parameter count $d$ when $H$ is positive definite; as $λ$ grows it falls steadily toward zero. In ridge regression $γ$ is exactly the trace of the smoother (hat) matrix, the quantity classical statistics calls the effective degrees of freedom of the fit: a single dial that reads how much model the data is really paying for.
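The three ways of writing the effective number of parameters can be compared on synthetic data. This is a sketch under assumptions: a random design matrix `X`, the ridge Hessian $H = X^{⊤} X$ , and an arbitrary `lam`; the hat matrix is taken to be $X (X^{⊤} X + λ I)^{- 1} X^{⊤}$ .

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 50, 4, 3.0
X = rng.normal(size=(n, d))
H = X.T @ X
reg = H + lam * np.eye(d)

# 1. Sum of per-direction shrinkage factors.
eigvals = np.linalg.eigvalsh(H)
gamma_sum = np.sum(eigvals / (eigvals + lam))

# 2. Trace of (H + lam I)^-1 H.
gamma_trace = np.trace(np.linalg.solve(reg, H))

# 3. Trace of the ridge hat matrix X (X^T X + lam I)^-1 X^T.
hat = X @ np.linalg.solve(reg, X.T)
gamma_hat = np.trace(hat)

print(f"sum of factors: {gamma_sum:.6f}")
print(f"trace form:     {gamma_trace:.6f}")
print(f"hat matrix:     {gamma_hat:.6f}  (out of d = {d})")
```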

Worked example: linear regression and ridge regression

The cleanest place to see the result in closed form is linear regression with squared error, which is exactly quadratic. The unregularized loss is

L (w) = \frac{1}{2} ∥ Xw - y ∥^{2},

with $X \in R^{n \times d}$ the design matrix (rows are training examples) and $y \in R^{n}$ the targets. Direct differentiation gives the gradient $\nabla L (w) = X^{⊤} (Xw - y)$ and the Hessian

H = X^{⊤} X .

The unregularized optimum, when $X^{⊤} X$ is invertible, is the familiar least-squares solution

w^{⋆} = (X^{⊤} X)^{- 1} X^{⊤} y .

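Adding L2 and applying the closed-form solution $\tilde{w} = (H + λ I)^{- 1} H w^{⋆}$ with $H = X^{⊤} X$ gives ridge regression:

\tilde{w} = (X^{⊤} X + λ I)^{- 1} X^{⊤} X (X^{⊤} X)^{- 1} X^{⊤} y = (X^{⊤} X + λ I)^{- 1} X^{⊤} y .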

The Hessian’s eigenvalues are the squared singular values of $X$ . The eigenvectors $q_{i}$ are the right-singular vectors of $X$ , which, when the columns of $X$ are centred, are exactly the principal-component directions of the input data: with the eigenvalues taken in decreasing order, $q_{1}$ is the direction of largest variance in the inputs, $q_{2}$ the second-largest, and so on.
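Both statements can be checked on synthetic data: the ridge solution computed from the data agrees with the general formula applied to the least-squares optimum, and the eigenvalues of $X^{⊤} X$ equal the squared singular values of $X$ . The data and the value of `lam` below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, lam = 40, 3, 2.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

H = X.T @ X
w_star = np.linalg.solve(H, X.T @ y)             # least-squares optimum

# Ridge solution straight from the data.
ridge_direct = np.linalg.solve(H + lam * np.eye(d), X.T @ y)

# Same solution via the general formula (H + lam I)^-1 H w_star.
ridge_from_w_star = np.linalg.solve(H + lam * np.eye(d), H @ w_star)

# Eigenvalues of H versus squared singular values of X.
singular_values = np.linalg.svd(X, compute_uv=False)   # descending order
eigvals = np.linalg.eigvalsh(H)                         # ascending order

print("ridge routes agree:", np.allclose(ridge_direct, ridge_from_w_star))
print("eigenvalues match sigma^2:",
      np.allclose(np.sort(singular_values**2), eigvals))
```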

L2 = shrinkage toward zero along the least informative principal components

In the linear-regression case, the central insight reads:

The first principal components of $X$ (large singular values, large $λ_{i} = σ_{i}^{2}$ ) carry most of the variance in the inputs. The coefficients of $w^{⋆}$ along these directions are well-determined by the data and are barely shrunk by L2.

The last principal components of $X$ (small singular values, small $λ_{i}$ ) correspond to directions in input space along which the data has almost no spread. The coefficients of $w^{⋆}$ along these directions are poorly identified: small changes in the training data could flip them dramatically. These are exactly the coefficients that L2 shrinks aggressively toward zero.

Ridge regression is, in this sense, a soft principal-component regression: it does not hard-truncate the small-variance directions, it smoothly downweights them. The same logic governs L2 regularization in the more general nonlinear setting, where “principal components of the inputs” is replaced by “eigenvectors of the local loss Hessian”.

The feature-level reading: L2 as inflated input variance

The same result has a more elementary reading, one coordinate at a time. When the inputs are centred, $\frac{1}{n} X^{⊤} X$ is the input covariance matrix, and its diagonal holds the variance of each feature, that is, of each column of $X$ (recall that the rows of $X$ are the $n$ training examples and its columns are the $d$ input features). Passing from $(X^{⊤} X)^{- 1}$ to $(X^{⊤} X + λ I)^{- 1}$ adds $λ$ to every diagonal entry of $X^{⊤} X$ , so the estimator behaves as if each feature carried $λ / n$ more variance than it actually does. When the features are roughly uncorrelated (so $X^{⊤} X$ is nearly diagonal) the effect is transparent: the weight on feature $j$ is scaled by $σ_{j}^{2} / (σ_{j}^{2} + λ)$ , with $σ_{j}^{2}$ the $j$ -th diagonal entry of $X^{⊤} X$ , that is, $n$ times that feature’s variance. A feature with large spread across the data keeps its weight; a feature whose values barely move has its weight pulled toward zero, because the added $λ$ now dominates its real variance. This is the per-feature shadow of the eigenvector story above: “eigenvectors of the Hessian” collapse to “individual input features” exactly when those features stop being correlated.
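The per-feature reading can be illustrated with a design matrix whose columns are exactly orthogonal, so that $X^{⊤} X$ is diagonal. In this sketch the per-feature factor uses the diagonal entries of $X^{⊤} X$ (the squared column norms); the column scales, the data, and `lam` are assumptions chosen to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, lam = 100, 3, 5.0

# Orthonormal columns from a reduced QR, then rescaled so that
# feature 0 has a large spread and feature 2 barely moves.
Q_cols, _ = np.linalg.qr(rng.normal(size=(n, d)))
scales = np.array([10.0, 1.0, 0.1])
X = Q_cols * scales
y = rng.normal(size=n)

H = X.T @ X                                   # diagonal up to rounding
w_star = np.linalg.solve(H, X.T @ y)
w_ridge = np.linalg.solve(H + lam * np.eye(d), X.T @ y)

diag = np.diag(H)                             # squared column norms
per_feature_factor = diag / (diag + lam)

print("per-feature factors:", np.round(per_feature_factor, 4))
print("ridge = factor * least squares:",
      np.allclose(w_ridge, per_feature_factor * w_star))
```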

A visual picture

The same conclusion has a clean two-dimensional picture, shown below (the geometry behind Goodfellow, Bengio and Courville’s Figure 7.1). Take a weight space with only two coordinates, $w_{1}$ and $w_{2}$ .

The contours of the unregularized loss $L$ form concentric ellipses centred at $w^{⋆}$ , with axes aligned with the eigenvectors of $H$ . The eccentricity of the ellipses measures the anisotropy: a long axis (small $λ_{i}$ ) is a flat direction; a short axis (large $λ_{i}$ ) is a steep direction.
The contours of the regularization penalty $\frac{λ}{2} ∥ w ∥_{2}^{2}$ form circles centred at the origin.
The regularized optimum $\tilde{w}$ sits at the point of tangency of these two families, where the gradients of data loss and penalty balance.

Reading the figure

Follow $\tilde{w}$ relative to $w^{⋆}$ . Along the short axis of the ellipse (the steep, large- $λ_{i}$ direction), pulling the solution toward the origin means climbing the loss quickly, so the tangency point stays much closer to $w^{⋆}$ and that coordinate is largely preserved. Along the long axis (the flat, small- $λ_{i}$ direction), moving toward the origin costs almost no loss, so the tangency point slides far inward toward the origin and that coordinate is heavily shrunk. The regularized optimum $\tilde{w}$ is dragged toward the origin, but mostly along the direction the loss does not defend.

The picture is the geometric content of the algebraic formula: L2 prefers to move along whichever direction the loss does not strongly defend.
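The two-dimensional picture can be reproduced numerically. The sketch below assumes an illustrative Hessian with one steep and one flat eigendirection (curvatures 10 and 0.1, rotated by 30 degrees) and an unregularized optimum with unit coordinate along each eigenvector; none of these values come from the figure itself.

```python
import numpy as np

theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # eigenvectors in columns
curvatures = np.array([10.0, 0.1])                # steep axis, flat axis
H = Q @ np.diag(curvatures) @ Q.T

w_star = Q @ np.array([1.0, 1.0])   # coordinate 1 along each eigenvector
lam = 1.0
w_reg = np.linalg.solve(H + lam * np.eye(2), H @ w_star)

# Coordinates of the regularized optimum in the eigenbasis.
coords = Q.T @ w_reg

# Tangency: the data-loss gradient and the penalty gradient cancel.
grad_loss = H @ (w_reg - w_star)
grad_penalty = lam * w_reg

print("steep coordinate kept:", round(coords[0], 4))
print("flat coordinate kept: ", round(coords[1], 4))
print("gradients balance:", np.allclose(grad_loss, -grad_penalty))
```

With these values the steep coordinate keeps the fraction 10/11 of its original size, while the flat coordinate keeps only 0.1/1.1.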

The same shrinkage hides inside early stopping

Anisotropic shrinkage does not require an explicit penalty at all: plain gradient descent, halted early, produces almost the same effect for free. Starting from $w = 0$ and running $τ$ gradient-descent steps with step size $η$ on the quadratic loss (with $η$ small enough that $| 1 - η λ_{i} | < 1$ for every $i$ ), the $i$ -th eigendirection of $w^{⋆}$ is reached only up to a factor
$1 - (1 - η λ_{i})^{τ},$
close to $1$ along steep directions (large $λ_{i}$ , learned within a few steps) and close to $0$ along flat ones (small $λ_{i}$ , learned only very slowly). That is the same profile as L2’s $λ_{i} / (λ_{i} + λ)$ : keep the steep directions, suppress the flat ones. Matching the two factors in the regime $η λ_{i} ≪ 1$ and $λ_{i} ≪ λ$ identifies $λ \approx 1/ (η τ)$ , so fewer training steps act like a stronger penalty. Early stopping is, under the quadratic approximation, implicit weight decay, the optimization-dynamics counterpart of the static result derived here, and both improve generalization for the same reason: they decline to fit the directions the data does not pin down.
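The early-stopping factor can be confirmed by running gradient descent on a quadratic loss and projecting the iterate onto the Hessian eigenvectors. The sketch assumes a random positive semi-definite Hessian, a step size set to half the inverse of the largest eigenvalue, and 25 steps; the final comparison with the L2 profile at $λ = 1/ (η τ)$ is printed only for inspection, since the match is approximate.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(4, 4))
H = A @ A.T                         # toy symmetric positive semi-definite Hessian
w_star = rng.normal(size=4)

eigvals, Q = np.linalg.eigh(H)
eta = 0.5 / eigvals.max()           # keeps |1 - eta * lambda_i| < 1
tau = 25

# Plain gradient descent on 0.5 * (w - w_star)^T H (w - w_star), from zero.
w = np.zeros(4)
for _ in range(tau):
    w = w - eta * H @ (w - w_star)

# Predicted fraction of each eigen-coordinate reached after tau steps.
early_stop_factor = 1.0 - (1.0 - eta * eigvals) ** tau
coords_gd = Q.T @ w
coords_predicted = early_stop_factor * (Q.T @ w_star)

# L2 profile at the matched strength, for side-by-side inspection.
l2_factor = eigvals / (eigvals + 1.0 / (eta * tau))

print("GD matches prediction:", np.allclose(coords_gd, coords_predicted))
print("early stopping factors:", np.round(early_stop_factor, 3))
print("L2 factors at 1/(eta*tau):", np.round(l2_factor, 3))
```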

Connection to the rest of this site

The eigendecomposition view of L2 ties into several other notes:

Random-matrix and initialization: the Xavier and He initialization note discusses the spectral radius of the recurrent / weight Jacobians under random initialization. The same eigenvalue language used there governs the analysis in this note, with $H$ replacing the recurrent weight matrix.
Gradient flow and skip connections: the Skip connections note showed that ResNets create an identity-plus-perturbation Jacobian $I + \cdot$ during backpropagation. The matrix $H + λ I$ in this note has exactly the same shape: a base operator plus a scaled identity that shifts the spectrum away from zero, regularizing whatever pathology the base operator has.
Decoupled weight decay and adaptive optimizers: the central result of this note is the geometric content of “what weight decay does to the optimum”. The AdamW note shows that this geometric content is preserved when Adam is paired with decoupled weight decay, and distorted when L2 is naively added to Adam’s gradient. The eigendecomposition view explains exactly what is being preserved or distorted: in AdamW the anisotropic shrinkage $λ_{i} / (λ_{i} + λ)$ along each eigendirection of the loss Hessian is intact; in coupled Adam + L2, Adam’s adaptive denominator multiplies that factor by an arbitrary, history-dependent rescaling that has no statistical justification.
The cost of inferring directions the data does not constrain: the closely related view of L2 as MAP estimation under a Gaussian prior on $w$ pinpoints the same insight from a Bayesian angle. Adding $\frac{λ}{2} ∥ w ∥_{2}^{2}$ to the negative log-likelihood is equivalent to multiplying the likelihood by a Gaussian prior $p (w) \propto exp (- \frac{λ}{2} ∥ w ∥_{2}^{2})$ ; the regularized optimum is the maximum a posteriori (MAP) estimate. The prior pulls the estimate toward zero by exactly the factor $λ_{i} / (λ_{i} + λ)$ in each eigendirection, with the same interpretation: directions the likelihood barely constrains are dominated by the prior.

Take-aways

What this analysis adds to the previous note

The previous note showed that L2 produces weight decay during training. This note shows what weight decay accomplishes when training converges:

L2 does not shrink all weights uniformly. It shrinks them anisotropically, with the shrinkage strength along each direction set by the ratio of the regularization coefficient $λ$ to the local curvature $λ_{i}$ of the loss.

Steep directions are preserved. The coefficients of $w^{⋆}$ along directions in which the data strongly constrains the model survive nearly intact under L2.

Flat directions are suppressed. The coefficients of $w^{⋆}$ along directions in which the data is essentially silent are shrunk toward zero, removing degrees of freedom the data did not pin down.

In the linear-regression special case, this is ridge regression, and the eigenvectors of $H$ are the principal components of the inputs: L2 is a soft principal-component shrinkage.

The same intuition transfers to deep nonlinear networks locally around any minimum: L2 removes the indeterminacy in the parameter directions the data does not constrain, producing solutions that depend less on the idiosyncrasies of the training set.

The complementary mechanism of Dropout achieves a similar generalization benefit by a completely different route (injecting noise into activations rather than penalizing weights). The two are usually combined in practice when both are useful, exactly because they regularize different aspects of the model.
