The previous note introduced L2 regularization at the level of the update rule: at every gradient step, the weight is rescaled by a factor slightly below one before the usual gradient update is applied. That view is operationally sufficient, but it leaves a deeper question unanswered:
The deeper question
Suppose training converges. What does the regularized solution $\tilde{w}$ look like, as a function of the unregularized optimum $w^*$? Which components of $w^*$ does L2 shrink, and which does it leave essentially untouched?
The answer is one of the most elegant results in classical regularization theory: L2 regularization performs an anisotropic shrinkage of $w^*$ along the eigenvectors of the loss Hessian. The components of $w^*$ that lie along flat directions of the loss (small eigenvalues of the Hessian) are shrunk almost to zero, while components along steep directions (large eigenvalues) survive nearly intact. The derivation is short, the geometry is striking, and the conclusion gives the deepest intuition for why L2 regularization improves generalization.
This note follows the analysis of Goodfellow, Bengio and Courville (Deep Learning, Section 7.1.1), adapted to the notation used throughout this site.
Setup
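The regularized objective, as defined in L2 regularization, is
$$\tilde{J}(w) = J(w) + \frac{\lambda}{2}\,\|w\|_2^2 ,$$
where $J(w)$ is the unregularized data loss and $\lambda \ge 0$ is the regularization strength.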
Notation reconciled with the previous note
The previous note writes the regularization term with an additional factor of one over the number of training examples, included so that $\lambda$ stays dataset-size-independent. The analysis in this note works in scaled form: $J(w)$ here is the (already-averaged) data loss and $\lambda$ here is the rescaled coefficient that absorbs that factor. Either convention gives the same algebraic result; the simpler form is used throughout to keep the formulas readable.
For the same reason, biases are omitted from $w$ throughout this note. The previous note explains why L2 is conventionally not applied to biases; including them would not change the structure of the analysis but would clutter the notation.
Why this form is also called "weight decay" and "ridge"
The penalty $\frac{\lambda}{2}\,\|w\|_2^2$ goes by three names across communities:
- L2 regularization (machine learning, deep learning);
- weight decay (deep learning, when emphasizing the update-rule view);
- ridge regression or Tikhonov regularization (statistics, classical inverse problems).
The names refer to the same mathematical object. The names ridge and Tikhonov reflect the historical origins of the construction in linear regression and operator theory, respectively. The analysis below applies to any twice-differentiable loss; the linear-regression special case appears at the end.
The gradient and the update rule, written compactly
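Differentiating the regularized loss gives the gradient
$$\nabla_w \tilde{J}(w) = \nabla_w J(w) + \lambda w .$$
A single step of gradient descent with learning rate $\eta$ therefore reads
$$w \leftarrow w - \eta\left(\nabla_w J(w) + \lambda w\right) = (1 - \eta\lambda)\,w - \eta\,\nabla_w J(w) .$$
This is the weight-decay form: shrink the weight by the multiplicative factor $(1 - \eta\lambda)$, then apply the usual unregularized update. The rule does not push the weights inexorably to zero: the second term, the unregularized gradient, can still drive any individual weight to grow if doing so reduces $J$ enough.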
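The equivalence between the two forms of the update can be checked numerically. The following is a minimal sketch, assuming a toy quadratic data loss whose matrix `A`, vector `b`, and the values of `eta` and `lam` are illustrative choices rather than anything taken from the previous note.

```python
import numpy as np

# Toy quadratic data loss J(w) = 0.5 * w^T A w - b^T w (A and b are illustrative).
A = np.array([[4.0, 0.0], [0.0, 0.1]])
b = np.array([1.0, 1.0])

def grad_J(w):
    """Gradient of the unregularized toy loss."""
    return A @ w - b

eta, lam = 0.1, 0.5            # learning rate and regularization strength
w = np.array([2.0, -3.0])      # arbitrary current iterate

# Form 1: one gradient step on the regularized loss J(w) + (lam / 2) * ||w||^2.
w_regularized = w - eta * (grad_J(w) + lam * w)

# Form 2: weight decay. Shrink by (1 - eta * lam), then apply the plain update.
w_decayed = (1.0 - eta * lam) * w - eta * grad_J(w)

print(w_regularized, w_decayed)
```

Both forms produce the same iterate up to floating-point rounding, which is all the weight-decay reading claims.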
Per-step vs at-convergence
The update rule above describes what happens during a single step. It says nothing about where the trajectory ultimately lands. The question this note addresses is the fixed point: where does the regularized iterate converge to, and how does it relate to the unregularized optimum $w^*$? The answer requires going past the update rule and looking at the shape of $J$ around $w^*$.
Quadratic approximation around the unregularized optimum
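To make the analysis tractable, the unregularized loss is approximated by a second-order Taylor expansion around the unregularized optimum $w^* = \arg\min_w J(w)$:
$$\hat{J}(w) = J(w^*) + \tfrac{1}{2}\,(w - w^*)^\top H\,(w - w^*) ,$$
where $H$ is the Hessian of $J$ evaluated at $w^*$.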
Why no first-order term
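A second-order Taylor expansion at a point $w_0$ is
$$J(w) \approx J(w_0) + \nabla_w J(w_0)^\top (w - w_0) + \tfrac{1}{2}\,(w - w_0)^\top H(w_0)\,(w - w_0) .$$
At a minimum, the gradient vanishes: $\nabla_w J(w^*) = 0$. The first-order term therefore drops out, leaving the pure quadratic above.
A second consequence of $w^*$ being a minimum: the Hessian $H$ is positive semi-definite. All its eigenvalues satisfy $\lambda_i \ge 0$.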
When is the quadratic approximation exact?
For loss functions that are already quadratic in $w$, no approximation is involved: the second-order Taylor expansion equals the true loss. The most important example is linear regression with squared error, where $J(w) = \tfrac{1}{2}\,\|Xw - y\|_2^2$, and the regularized version becomes exactly ridge regression. The result derived below is therefore an exact statement for ridge regression and a local approximation for any other twice-differentiable loss near its minimum.
The gradient of the quadratic approximation is
$$\nabla_w \hat{J}(w) = H\,(w - w^*) .$$
It vanishes at $w = w^*$ (and nowhere else when $H$ is positive definite), recovering the unregularized minimum.
Solving for the regularized optimum
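Adding the gradient of the regularization term $\frac{\lambda}{2}\,\|w\|_2^2$, which is $\lambda w$, and setting the total gradient to zero, the regularized optimum $\tilde{w}$ satisfies
$$H\,(\tilde{w} - w^*) + \lambda\,\tilde{w} = 0 .$$
Collecting terms,
$$(H + \lambda I)\,\tilde{w} = H\,w^* ,$$
and assuming $H + \lambda I$ is invertible (it always is for $\lambda > 0$, because the eigenvalues of $H + \lambda I$ are $\lambda_i + \lambda \ge \lambda > 0$),
$$\boxed{\;\tilde{w} = (H + \lambda I)^{-1} H\,w^*\;}$$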
This is the closed-form relationship between the regularized and unregularized optima for a quadratic loss. Two sanity checks:
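- As $\lambda \to 0$, the regularization vanishes: $(H + \lambda I)^{-1} H \to H^{-1} H = I$ (for invertible $H$), so $\tilde{w} \to w^*$.
- As $\lambda \to \infty$, the regularization dominates: $(H + \lambda I)^{-1} \to 0$, so $\tilde{w} \to 0$.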
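The closed form and its two limits can be checked on a small example. This is a sketch under stated assumptions: the helper `regularized_optimum`, the eigenvalues of the toy Hessian, and the vector `w_star` are all hypothetical choices made for illustration.

```python
import numpy as np

def regularized_optimum(H, w_star, lam):
    """Return the solution of (H + lam * I) w = H w_star."""
    return np.linalg.solve(H + lam * np.eye(H.shape[0]), H @ w_star)

# Build a symmetric positive-definite H with known eigenvalues (toy values).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
H = Q @ np.diag([5.0, 1.0, 0.2]) @ Q.T
w_star = np.array([1.0, -2.0, 0.5])

w_tiny_lam = regularized_optimum(H, w_star, 1e-10)   # lam -> 0
w_huge_lam = regularized_optimum(H, w_star, 1e10)    # lam -> infinity

# Stationarity check at a moderate lam: H (w - w_star) + lam * w should vanish.
lam = 1.0
w_tilde = regularized_optimum(H, w_star, lam)
residual = H @ (w_tilde - w_star) + lam * w_tilde

print(w_tiny_lam, w_huge_lam, residual)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is the usual numerically safer choice.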
The interesting regime is in between: what does $(H + \lambda I)^{-1} H$ actually do to $w^*$? The answer becomes transparent in the basis of eigenvectors of $H$.
The geometry: eigendecomposition of the Hessian
Since $H$ is real and symmetric (it is a Hessian of a real-valued function), the spectral theorem guarantees an orthogonal eigendecomposition
$$H = Q \Lambda Q^\top ,$$
where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ collects the (non-negative) eigenvalues of $H$ on its diagonal, and the columns of $Q$ are the corresponding orthonormal eigenvectors $q_1, \dots, q_n$. The matrix $Q$ is therefore orthogonal: $Q^\top Q = Q Q^\top = I$. The eigenvalues carry a subscript, $\lambda_i$, while the single unsubscripted $\lambda$ is always the regularization strength; the entire result will turn on how these two quantities compare.
A geometric reading of the eigenvalues
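Each eigenvalue $\lambda_i$ of $H$ measures the curvature of the loss along the corresponding eigenvector direction $q_i$. Substituting $w = w^* + q_i$ (a unit step along eigenvector $q_i$) into the quadratic approximation,
$$\hat{J}(w^* + q_i) = J(w^*) + \tfrac{1}{2}\,q_i^\top H\,q_i = J(w^*) + \tfrac{1}{2}\,\lambda_i .$$
Large $\lambda_i$ means a steep direction: a small step in direction $q_i$ increases the loss a lot. Small $\lambda_i$ means a flat direction: a step in direction $q_i$ barely changes the loss.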
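The curvature reading can be verified directly: a unit step along each eigenvector should raise the quadratic model by half the corresponding eigenvalue. The sketch below assumes a random positive semi-definite matrix standing in for the Hessian; `J_hat` is a hypothetical helper with the constant term set to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
H = M @ M.T                       # symmetric positive semi-definite toy Hessian
w_star = rng.normal(size=4)       # stands in for the unregularized optimum

def J_hat(w):
    """Quadratic model around w_star, with the constant J(w_star) set to 0."""
    d = w - w_star
    return 0.5 * d @ H @ d

eigvals, Q = np.linalg.eigh(H)    # columns of Q are orthonormal eigenvectors

# Loss increase after a unit step along each eigenvector q_i.
increase = np.array([J_hat(w_star + Q[:, i]) for i in range(4)])
print(increase)
print(0.5 * eigvals)
```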
Loss landscapes of trained neural networks typically have a strongly anisotropic spectrum: a few large eigenvalues (a few directions along which the loss curves sharply) and many small eigenvalues (most directions are nearly flat). The flat directions are the ones where the data essentially does not constrain the solution: the loss has many equally good answers.
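Substituting the eigendecomposition into the boxed formula,
$$\tilde{w} = \left(Q \Lambda Q^\top + \lambda I\right)^{-1} Q \Lambda Q^\top w^* .$$
Since $I = Q Q^\top$ (the identity is unchanged under any orthogonal change of basis),
$$\tilde{w} = \left(Q\,(\Lambda + \lambda I)\,Q^\top\right)^{-1} Q \Lambda Q^\top w^* .$$
The inverse of an orthogonal-conjugated diagonal matrix is the orthogonal-conjugated inverse:
$$\left(Q\,(\Lambda + \lambda I)\,Q^\top\right)^{-1} = Q\,(\Lambda + \lambda I)^{-1} Q^\top .$$
Substituting back:
$$\tilde{w} = Q\,(\Lambda + \lambda I)^{-1} Q^\top Q\,\Lambda\,Q^\top w^* .$$
The middle $Q^\top Q$ collapses to $I$ by orthogonality, leaving the clean expression
$$\boxed{\;\tilde{w} = Q\,(\Lambda + \lambda I)^{-1} \Lambda\,Q^\top w^*\;}$$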
This is the central result of the section. It says exactly what L2 regularization does to the optimum.
Reading the central result coordinate by coordinate
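The matrix $(\Lambda + \lambda I)^{-1} \Lambda$ is diagonal, since it is a product of two diagonal matrices. Its $i$-th diagonal entry is
$$\frac{\lambda_i}{\lambda_i + \lambda} .$$
The matrix $Q^\top$ expresses $w^*$ in the eigenbasis of $H$: the $i$-th component, $q_i^\top w^*$, is the projection of $w^*$ onto the $i$-th eigenvector $q_i$.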
Putting everything together, the boxed formula says:
The central insight, in one line
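In the eigenbasis of the loss Hessian, L2 regularization rescales the unregularized optimum coordinate by coordinate, with the $i$-th coordinate multiplied by $\frac{\lambda_i}{\lambda_i + \lambda}$: that is, $q_i^\top \tilde{w} = \frac{\lambda_i}{\lambda_i + \lambda}\, q_i^\top w^*$.
The shrinkage factor depends only on the ratio between the local curvature in that direction and the regularization strength:
- $\lambda_i \gg \lambda$ (steep direction): $\frac{\lambda_i}{\lambda_i + \lambda} \approx 1$. The coordinate is essentially preserved. The loss strongly opposes any shrinkage, so the regularization has little effect.
- $\lambda_i \ll \lambda$ (flat direction): $\frac{\lambda_i}{\lambda_i + \lambda} \approx 0$. The coordinate is shrunk almost to zero. The loss does not care about this direction, so the regularization wins.
- $\lambda_i \approx \lambda$: intermediate shrinkage by a factor near $\tfrac{1}{2}$.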
This is anisotropic shrinkage, and it is the key to understanding why L2 regularization works.
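The three regimes can be seen side by side in a small numerical sketch. The eigenvalues `100`, `1` and `0.01`, the value `lam = 1`, and the vector `w_star` are illustrative assumptions chosen to give one steep, one intermediate and one flat direction.

```python
import numpy as np

rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # random orthogonal basis
H = Q0 @ np.diag([100.0, 1.0, 0.01]) @ Q0.T        # steep, medium, flat
w_star = Q0 @ np.array([1.0, -2.0, 3.0])           # nonzero in every direction
lam = 1.0

# Closed form: solve (H + lam * I) w = H w_star.
w_tilde = np.linalg.solve(H + lam * np.eye(3), H @ w_star)

# Eigenbasis form: rescale each coordinate of Q^T w_star by lam_i / (lam_i + lam).
eigvals, Q = np.linalg.eigh(H)                     # eigenvalues in ascending order
factors = eigvals / (eigvals + lam)
w_tilde_eig = Q @ (factors * (Q.T @ w_star))

# Observed per-coordinate shrinkage in the eigenbasis.
observed = (Q.T @ w_tilde) / (Q.T @ w_star)
print(factors)     # ascending order: flat direction first
print(observed)
```

Because `np.linalg.eigh` returns eigenvalues in ascending order, the printed factors run from the flat direction (about 0.01) through the intermediate one (0.5) to the steep one (about 0.99).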
Why this is the deep insight
A naive view of L2 regularization holds that it just “makes the weights smaller”. The boxed formula refines this view dramatically: L2 makes some weights smaller, but not uniformly, and the directions in which it shrinks aggressively are exactly the directions in which the data does not care about the answer.
A direction along which the loss is flat is a direction in which the training data leaves the model parameters underdetermined. The choice along such a direction is essentially arbitrary, and an unregularized optimizer may pick a large value for purely numerical reasons. L2 regularization breaks this indeterminacy in favour of zero, the simplest and most generalizable choice. This is the structural reason L2 improves generalization: it removes degrees of freedom the data did not constrain.
The shrinkage factors add up to something meaningful
The per-direction factors do more than describe the solution one coordinate at a time; their sum is itself an interpretable number,
$$\mathrm{df}(\lambda) = \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_i + \lambda} .$$
Each direction contributes a value between $0$ and $1$, near $1$ where the data fixes the coefficient and near $0$ where it does not, so $\mathrm{df}(\lambda)$ counts the effective number of parameters the regularized model actually uses. At $\lambda = 0$ it equals $n$ (when every $\lambda_i > 0$), the full parameter count; as $\lambda$ grows it falls steadily toward zero. In ridge regression $\mathrm{df}(\lambda)$ is exactly the trace of the smoother (hat) matrix, the quantity classical statistics calls the effective degrees of freedom of the fit: a single dial that reads how much model the data is really paying for.
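The identity between the sum of shrinkage factors and the trace of the ridge smoother matrix can be checked on random data. This is a minimal sketch; the design matrix `X` is random toy data and the helper names `effective_dof` and `hat_trace` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))            # toy design matrix: 20 examples, 4 features
eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues lam_i of the Hessian X^T X

def effective_dof(lam):
    """Sum of the per-direction shrinkage factors lam_i / (lam_i + lam)."""
    return np.sum(eigvals / (eigvals + lam))

def hat_trace(lam):
    """Trace of the ridge smoother matrix X (X^T X + lam * I)^{-1} X^T."""
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(4), X.T)
    return np.trace(S)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, effective_dof(lam), hat_trace(lam))
```

The two columns agree for every `lam`, start at the full parameter count of 4, and decrease as the penalty grows.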
Worked example: linear regression and ridge regression
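The cleanest place to see the result in closed form is linear regression with squared error, which is exactly quadratic. The unregularized loss is
$$J(w) = \tfrac{1}{2}\,\|Xw - y\|_2^2 ,$$
with $X$ the design matrix (rows are training examples) and $y$ the targets. Direct differentiation gives the gradient and the Hessian
$$\nabla_w J(w) = X^\top (Xw - y), \qquad H = X^\top X .$$
The unregularized optimum, when $X^\top X$ is invertible, is the familiar least-squares solution
$$w^* = (X^\top X)^{-1} X^\top y .$$
Adding L2 and applying the boxed formula gives ridge regression:
$$\tilde{w} = (X^\top X + \lambda I)^{-1} X^\top X\, w^* = (X^\top X + \lambda I)^{-1} X^\top y .$$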
The Hessian’s eigenvalues $\lambda_i$ are the squared singular values of $X$. The eigenvectors $q_i$ are the right-singular vectors of $X$, which, when the inputs are centred, are exactly the principal-component directions of the input data: $q_1$ is the direction of largest variance in the inputs, $q_2$ the second-largest, and so on.
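The ridge formulas above can be cross-checked numerically. The sketch below assumes random toy data and an arbitrary `lam`; it compares the direct ridge solution with the closed form applied to the least-squares solution, and the eigenvalues of the Hessian with the squared singular values of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # toy design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 2.0
I = np.eye(3)

H = X.T @ X                                    # Hessian of 0.5 * ||Xw - y||^2
w_star = np.linalg.solve(H, X.T @ y)           # least-squares solution

# Ridge regression directly, and via the closed form (H + lam I)^{-1} H w_star.
w_ridge = np.linalg.solve(H + lam * I, X.T @ y)
w_from_formula = np.linalg.solve(H + lam * I, H @ w_star)

# Eigenvalues of H versus squared singular values of X.
eigvals = np.linalg.eigvalsh(H)                          # ascending order
singular_values = np.linalg.svd(X, compute_uv=False)     # descending order
print(w_ridge, w_from_formula)
print(eigvals, np.sort(singular_values**2))
```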
L2 = shrinkage toward zero along the least informative principal components
In the linear-regression case, the central insight reads:
- The first principal components of $X$ (large singular values, large $\lambda_i$) carry most of the variance in the inputs. The coefficients of $w^*$ along these directions are well-determined by the data and are barely shrunk by L2.
- The last principal components of $X$ (small singular values, small $\lambda_i$) correspond to directions in input space along which the data has almost no spread. The coefficients of $w^*$ along these directions are poorly identified: small changes in the training data could flip them dramatically. These are exactly the coefficients that L2 shrinks aggressively toward zero.
Ridge regression is, in this sense, a soft principal-component regression: it does not hard-truncate the small-variance directions, it smoothly downweights them. The same logic governs L2 regularization in the more general nonlinear setting, where “principal components of the inputs” is replaced by “eigenvectors of the local loss Hessian”.
The feature-level reading: L2 as inflated input variance
The same result has a more elementary reading, one coordinate at a time. When the inputs are centred, $X^\top X$ is the (unnormalized) input covariance matrix, and its diagonal holds the variance of each feature, that is, of each column of $X$ (recall that the rows of $X$ are the training examples and its columns are the input features). Passing from $X^\top X$ to $X^\top X + \lambda I$ adds $\lambda$ to every diagonal entry, so the estimator behaves as if each feature carried more variance than it actually does. When the features are roughly uncorrelated (so $X^\top X$ is nearly diagonal) the effect is transparent: the weight on feature $j$ is scaled by $\frac{\sigma_j^2}{\sigma_j^2 + \lambda}$, with $\sigma_j^2$ that feature’s variance in the same unnormalized scaling. A feature with large spread across the data keeps its weight; a feature whose values barely move has its weight pulled toward zero, because the added $\lambda$ now dominates its real variance. This is the per-feature shadow of the eigenvector story above: “eigenvectors of the Hessian” collapse to “individual input features” exactly when those features stop being correlated.
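The per-feature reading can be illustrated with features that are exactly uncorrelated. This is a sketch under stated assumptions: the columns are made orthogonal by construction via a QR factorization of centred random data, and the `scales`, targets and `lam` are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(100, 3))
M -= M.mean(axis=0)                         # centre the columns
Qc, _ = np.linalg.qr(M)                     # orthonormal columns, still centred
scales = np.array([10.0, 1.0, 0.1])         # per-feature spread (toy values)
X = Qc * scales                             # exactly uncorrelated features
y = X @ np.array([1.0, 1.0, 1.0]) + 0.05 * rng.normal(size=100)
lam = 1.0

s2 = np.diag(X.T @ X)                       # unnormalized feature variances
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Prediction of the per-feature rule: scale each weight by s2_j / (s2_j + lam).
w_predicted = w_ols * s2 / (s2 + lam)
print(w_ridge)
print(w_predicted)
```

The high-variance feature keeps almost all of its weight, while the feature with the smallest spread has its weight cut by roughly two orders of magnitude.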
A visual picture
The same conclusion has a clean two-dimensional picture, shown below (the geometry behind Goodfellow, Bengio and Courville’s Figure 7.1). Take a weight space with only two coordinates, $w_1$ and $w_2$.
- The contours of the unregularized loss $J$ form an ellipse centred at $w^*$, with axes aligned with the eigenvectors of $H$. The eccentricity of the ellipse measures the anisotropy: a long axis (small $\lambda_i$) is a flat direction; a short axis (large $\lambda_i$) is a steep direction.
- The contours of the regularization penalty $\frac{\lambda}{2}\,\|w\|_2^2$ form circles centred at the origin.
- The regularized optimum $\tilde{w}$ sits at the point of tangency of these two families, where the gradients of data loss and penalty balance.
Reading the figure
Follow $\tilde{w}$ relative to $w^*$. Along the short axis of the ellipse (the steep, large-$\lambda_i$ direction), pulling the solution toward the origin means climbing the loss quickly, so the tangency point stays much closer to $w^*$ and that coordinate is largely preserved. Along the long axis (the flat, small-$\lambda_i$ direction), moving toward the origin costs almost no loss, so the tangency point slides far inward toward the origin and that coordinate is heavily shrunk. The regularized optimum is dragged toward the origin, but mostly along the direction the loss does not defend.
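A concrete two-dimensional instance makes the figure quantitative. The sketch assumes an axis-aligned Hessian with curvatures `10` and `0.1`, an unregularized optimum at `(1, 1)`, and `lam = 1`; all three are illustrative choices, not values read off the figure.

```python
import numpy as np

H = np.diag([10.0, 0.1])          # steep along w1, flat along w2 (toy values)
w_star = np.array([1.0, 1.0])     # unregularized optimum
lam = 1.0

w_tilde = np.linalg.solve(H + lam * np.eye(2), H @ w_star)
print(w_tilde)                    # w1 barely moves, w2 collapses toward 0

# At the tangency point the two gradients cancel: H (w - w_star) = -lam * w.
grad_data = H @ (w_tilde - w_star)
grad_penalty = lam * w_tilde
print(grad_data + grad_penalty)
```

The steep coordinate keeps the fraction 10/11 of its value while the flat coordinate keeps only 1/11, matching the per-direction factor $\lambda_i / (\lambda_i + \lambda)$.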
The picture is the geometric content of the algebraic formula: L2 prefers to move along whichever direction the loss does not strongly defend.
The same shrinkage hides inside early stopping
Anisotropic shrinkage does not require an explicit penalty at all: plain gradient descent, halted early, produces almost the same effect for free. Starting from $w = 0$ and running $t$ steps with learning rate $\eta$ on the quadratic loss, the $i$-th eigendirection of $w^*$ is reached only up to a factor
$$1 - (1 - \eta\lambda_i)^t ,$$
close to $1$ along steep directions (large $\lambda_i$, learned within a few steps) and close to $0$ along flat ones (small $\lambda_i$, learned only very slowly). That is the same profile as L2’s $\frac{\lambda_i}{\lambda_i + \lambda}$: keep the steep directions, suppress the flat ones. Matching the two factors identifies $\lambda \approx \frac{1}{\eta t}$, so fewer training steps act like a stronger penalty. Early stopping is, to second order, implicit weight decay, the optimization-dynamics counterpart of the static result derived here, and both improve generalization for the same reason: they decline to fit the directions the data does not pin down.
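The early-stopping factor can be reproduced by actually running gradient descent on a quadratic loss. This is a minimal sketch; the eigenvalues, `eta`, `t` and the choice of `w_star` with unit coordinates in the eigenbasis are assumptions made so that the reached fractions can be read off directly.

```python
import numpy as np

rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.normal(size=(3, 3)))
eigvals = np.array([50.0, 1.0, 0.01])            # steep, medium, flat
H = Q0 @ np.diag(eigvals) @ Q0.T
w_star = Q0 @ np.array([1.0, 1.0, 1.0])          # unit coordinate in each direction
eta, t = 0.01, 100                               # learning rate and step count

# Gradient descent on 0.5 * (w - w_star)^T H (w - w_star), starting from zero.
w = np.zeros(3)
for _ in range(t):
    w = w - eta * H @ (w - w_star)

reached = Q0.T @ w                               # fraction of each coordinate reached
predicted = 1.0 - (1.0 - eta * eigvals) ** t
lam = 1.0 / (eta * t)                            # matched penalty strength
l2_factors = eigvals / (eigvals + lam)

print(reached)
print(predicted)
print(l2_factors)
```

The early-stopping and L2 factors are not identical, but they follow the same ordering: near one in the steep direction, intermediate in the middle, near zero in the flat direction.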
Connection to the rest of this site
The eigendecomposition view of L2 ties into several other notes:
- Random-matrix and initialization: the Xavier and He initialization note discusses the spectral radius of the recurrent / weight Jacobians under random initialization. The same eigenvalue language used there governs the analysis in this note, with $H$ replacing the recurrent weight matrix.
- Gradient flow and skip connections: the Skip connections note showed that ResNets create an identity-plus-perturbation Jacobian during backpropagation. The matrix $H + \lambda I$ in this note has exactly the same shape: a base operator plus a scaled identity that shifts the spectrum away from zero, regularizing whatever pathology the base operator has.
- Decoupled weight decay and adaptive optimizers: the central result of this note is the geometric content of “what weight decay does to the optimum”. The AdamW note shows that this geometric content is preserved when Adam is paired with decoupled weight decay, and distorted when L2 is naively added to Adam’s gradient. The eigendecomposition view explains exactly what is being preserved or distorted: in AdamW the anisotropic shrinkage $\frac{\lambda_i}{\lambda_i + \lambda}$ along each eigendirection of the loss Hessian is intact; in coupled Adam + L2, Adam’s adaptive denominator multiplies that factor by an arbitrary, history-dependent rescaling that has no statistical justification.
- The cost of inferring directions the data does not constrain: the closely related view of L2 as MAP estimation under a Gaussian prior on $w$ pinpoints the same insight from a Bayesian angle. Adding $\frac{\lambda}{2}\,\|w\|_2^2$ to the negative log-likelihood is equivalent to multiplying the likelihood by a Gaussian prior $\mathcal{N}(0, \lambda^{-1} I)$; the regularized optimum is the maximum a posteriori (MAP) estimate. The prior pulls the estimate toward zero by exactly the factor $\frac{\lambda_i}{\lambda_i + \lambda}$ in each eigendirection, with the same interpretation: directions the likelihood barely constrains are dominated by the prior.
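The Bayesian reading in the last item can be spelled out in two lines. The derivation below is a sketch that assumes the data loss is the negative log-likelihood of the training set, written here with the hypothetical symbol $\mathcal{D}$ for the data.

```latex
% Assumption: J(w) is the negative log-likelihood, J(w) = -\log p(\mathcal{D} \mid w).
\begin{align*}
p(w) &= \mathcal{N}\!\left(w \mid 0,\ \lambda^{-1} I\right)
      \propto \exp\!\left(-\tfrac{\lambda}{2}\,\|w\|_2^2\right), \\
-\log p(w \mid \mathcal{D})
     &= -\log p(\mathcal{D} \mid w) - \log p(w) + \text{const} \\
     &= J(w) + \tfrac{\lambda}{2}\,\|w\|_2^2 + \text{const}.
\end{align*}
```

Minimizing the last line over $w$ is the regularized problem studied in this note, so the MAP estimate and the L2-regularized optimum coincide under that assumption.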
Take-aways
What this analysis adds to the previous note
The previous note showed that L2 produces weight decay during training. This note shows what weight decay accomplishes when training converges:
- L2 does not shrink all weights uniformly. It shrinks them anisotropically, with the shrinkage strength along each direction set by the ratio of the regularization coefficient to the local curvature of the loss.
- Steep directions are preserved. The coefficients of $w^*$ along directions in which the data strongly constrains the model survive nearly intact under L2.
- Flat directions are suppressed. The coefficients of $w^*$ along directions in which the data is essentially silent are shrunk toward zero, removing degrees of freedom the data did not pin down.
- In the linear-regression special case, this is ridge regression, and the eigenvectors of $X^\top X$ are the principal components of the inputs: L2 is a soft principal-component shrinkage.
- The same intuition transfers to deep nonlinear networks locally around any minimum: L2 removes the indeterminacy in the parameter directions the data does not constrain, producing solutions that depend less on the idiosyncrasies of the training set.
The complementary mechanism of Dropout achieves a similar generalization benefit by a completely different route (injecting noise into activations rather than penalizing weights). The two are usually combined in practice when both are useful, exactly because they regularize different aspects of the model.