1. Intro
AdamW is the standard modern way to combine the Adam optimizer with weight decay.
It was introduced by Ilya Loshchilov and Frank Hutter in the paper Decoupled Weight Decay Regularization (arXiv 2017, published at ICLR 2019).
The central insight of that work is extremely important:
- in SGD, L2 regularization and weight decay are equivalent,
- in Adam, they are not equivalent,
- therefore, applying L2 regularization to Adam as if it were ordinary weight decay is conceptually incorrect.
AdamW was introduced precisely to fix this problem.
What problem does AdamW solve?
AdamW solves the following issue: when the regularization term $\lambda_{L2}\,\theta$ is injected directly into Adam’s gradient pipeline, it is processed by Adam’s adaptive moment machinery. As a result, the shrinkage of the weights is no longer a clean, uniform weight decay. Instead, it becomes entangled with the history of the gradients.
Notation
In this note, $\theta$ denotes the subset of trainable parameters to which weight decay is applied. In practice, this usually means weight tensors; biases and normalization parameters are often excluded.
The coefficient used directly in the optimizer update is denoted by $\lambda$. This choice keeps the AdamW formulas conceptually clean, because AdamW is defined as a decoupled update rule.
To connect this note with L2 regularization, suppose the regularized objective is written using the averaged-loss convention

$$L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda_{L2}}{2}\,\|\theta\|_2^2.$$

Then the corresponding decay coefficient in the optimizer update is

$$\lambda = \lambda_{L2}.$$

So the present note is fully compatible with the notation used in the L2-regularization note; it simply names the direct decay coefficient explicitly.
2. Motivation
When L2 regularization is first learned in the context of gradient descent or SGD, the following equivalence is usually established:
- optimizing an L2-regularized loss
- applying multiplicative weight decay at each step
For SGD, this equivalence is exact. Many implementations and explanations carried this intuition over to Adam. That transfer is the mistake AdamW was designed to correct.
Historical motivation
AdamW was not introduced as a minor variant of Adam. It was introduced because a large part of the community was informally treating “Adam + L2 regularization” as if it were the same thing as “Adam + weight decay”. The paper by Loshchilov and Hutter showed that this identification is false for adaptive optimizers.
3. L2 regularization and weight decay equivalence in SGD
Let

$$L_t(\theta)$$

denote the data loss at iteration $t$, and let

$$g_t = \nabla_\theta L_t(\theta_t)$$

be its gradient evaluated at the current parameters $\theta_t$.
Now consider the L2-regularized objective

$$\tilde L_t(\theta) = L_t(\theta) + \frac{\lambda_{L2}}{2}\,\|\theta\|_2^2.$$

Its gradient is

$$\nabla_\theta \tilde L_t(\theta_t) = g_t + \lambda_{L2}\,\theta_t.$$

Therefore, the SGD update becomes

$$\theta_{t+1} = \theta_t - \eta\,\bigl(g_t + \lambda_{L2}\,\theta_t\bigr).$$

Rearranging,

$$\theta_{t+1} = (1 - \eta\lambda_{L2})\,\theta_t - \eta\,g_t.$$
This is the key identity.
It shows that, for SGD:
- the term $-\eta\,g_t$ is the usual optimization step on the data loss,
- the factor $(1 - \eta\lambda_{L2})$ is a pure multiplicative shrinkage of the parameters.
This multiplicative shrinkage is exactly what is meant by weight decay.
Key conclusion for SGD
In SGD, adding the L2 penalty

$$\frac{\lambda_{L2}}{2}\,\|\theta\|_2^2$$

to the loss is mathematically equivalent to shrinking the parameters by the factor $(1 - \eta\lambda_{L2})$ at every step.
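The equivalence can also be checked numerically. The sketch below (all values are illustrative) takes one SGD step both ways from the same starting point:

```python
# One SGD step on an L2-regularized loss vs. one decoupled
# weight-decay step, on a toy parameter vector.
eta = 0.1                   # learning rate
lam_l2 = 0.01               # L2 penalty coefficient
theta = [1.5, -2.0, 0.5]    # current parameters
g = [0.3, -0.1, 0.7]        # data-loss gradient at theta

# (a) SGD on the regularized loss: theta - eta * (g + lam_l2 * theta)
a = [t - eta * (gi + lam_l2 * t) for t, gi in zip(theta, g)]

# (b) Decay-then-step form: (1 - eta*lam_l2) * theta - eta * g
b = [(1 - eta * lam_l2) * t - eta * gi for t, gi in zip(theta, g)]

print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))  # True
```

The two update rules are algebraically identical, so they agree up to floating-point rounding.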
4. Why this equivalence breaks in Adam
The problem appears as soon as the update is no longer a plain scalar multiple of the gradient.
Adam does not apply a single global scaling to the gradient. Instead, it builds:
- a first-moment estimate,
- a second-moment estimate,
- a coordinate-wise adaptive denominator.
If L2 regularization is naively inserted into Adam, the algorithm uses the augmented gradient

$$\tilde g_t = g_t + \lambda_{L2}\,\theta_t,$$

and then computes

$$m_t = \beta_1\,m_{t-1} + (1 - \beta_1)\,\tilde g_t, \qquad v_t = \beta_2\,v_{t-1} + (1 - \beta_2)\,\tilde g_t^{\,2},$$

followed by

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$

where all vector operations are elementwise.
Two problems immediately appear.
1 The regularization term contaminates the moment estimates
The moments $m_t$ and $v_t$ are no longer statistics of the data gradient alone. They now track the mixed quantity

$$\tilde g_t = g_t + \lambda_{L2}\,\theta_t.$$
This means that the regularization term is not merely added at the end of the update. It influences:
- the EMA of gradients,
- the EMA of squared gradients,
- the bias-corrected quantities,
- the adaptive denominator itself.
In other words, the regularizer is passed through machinery that was designed to analyze the geometry of the loss gradient, not the geometry of the regularization term.
2 The decay ceases to be a clean multiplicative shrinkage
To isolate the core issue, it is useful to momentarily ignore the first-moment smoothing and view Adam as a coordinate-wise preconditioner.
Define

$$P_t = \operatorname{diag}\!\left(\frac{1}{\sqrt{\hat v_t} + \epsilon}\right).$$

Then the adaptive step has the schematic form

$$\theta_{t+1} = \theta_t - \eta\,P_t\,\bigl(g_t + \lambda_{L2}\,\theta_t\bigr).$$

Rearranging,

$$\theta_{t+1} = \bigl(I - \eta\lambda_{L2}\,P_t\bigr)\,\theta_t - \eta\,P_t\,g_t.$$
This is no longer ordinary weight decay.
Indeed:
- if $P_t$ were a scalar multiple of the identity, the decay would be uniform;
- in Adam, $P_t$ is coordinate-wise and time-dependent;
- therefore, each parameter is shrunk by a different effective factor. The decay has become anisotropic and history-dependent.
The conceptual failure of "Adam + L2"
The quantity $\lambda_{L2}\,\theta$ is supposed to express a clean preference for small weights. In coupled Adam, however, this term is rescaled according to the past gradient statistics of each coordinate. As a result, some parameters are regularized more strongly than others for reasons that come from Adam’s adaptive geometry, not from the intended meaning of weight decay.
Note
This is the precise reason why the sentence “L2 regularization is the same as weight decay” is true for SGD but false for Adam.
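The anisotropy can be made concrete with a small simulation (all coefficients are made-up). Two coordinates start at the same value but see very different gradient magnitudes; the first moment is ignored ($\beta_1 = 0$) to expose the preconditioner, as in the derivation above:

```python
import math

# Naive "Adam + L2" on two coordinates with the SAME value
# but very different data-gradient magnitudes.
eta, lam_l2, beta2, eps = 0.001, 0.1, 0.999, 1e-8
theta = [1.0, 1.0]
grads = [10.0, 0.01]   # coord 0: large gradients; coord 1: tiny ones
v = [0.0, 0.0]

for t in range(1, 101):          # 100 steps with constant data gradients
    for i in range(2):
        g_aug = grads[i] + lam_l2 * theta[i]       # coupled L2 term
        v[i] = beta2 * v[i] + (1 - beta2) * g_aug ** 2
        v_hat = v[i] / (1 - beta2 ** t)
        p = 1.0 / (math.sqrt(v_hat) + eps)         # one coordinate of P_t
        theta[i] -= eta * p * g_aug

# Effective per-step shrink factor on each coordinate: 1 - eta*lam_l2*p.
p_final = [1.0 / (math.sqrt(v[i] / (1 - beta2 ** 100)) + eps) for i in range(2)]
shrink = [1 - eta * lam_l2 * p for p in p_final]
print(shrink)  # the small-gradient coordinate is decayed far more strongly
```

The coordinate with tiny gradients ends up with a much smaller shrink factor (stronger decay), purely because of its gradient history.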
5. The fix: AdamW
AdamW solves the above problem by applying the decay outside Adam’s adaptive gradient transformation.
The moments are computed from the true data gradient only:

$$m_t = \beta_1\,m_{t-1} + (1 - \beta_1)\,g_t, \qquad v_t = \beta_2\,v_{t-1} + (1 - \beta_2)\,g_t^{\,2}.$$

Only after the Adam step is defined is weight decay applied:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\,\theta_t\right).$$

Equivalently,

$$\theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.$$
This is AdamW.
The essential distinction
In AdamW:
- the adaptive moments see only the gradient of the data loss,
- the shrinkage term is applied separately,
- the meaning of weight decay is restored.
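The decoupled rule above can be sketched in a few lines of plain Python (the function name and default coefficients are illustrative, not a reference implementation):

```python
import math

def adamw_step(theta, g, m, v, t, eta=1e-3, lam=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One decoupled AdamW step on a list of scalar parameters.

    The moments m, v are updated from the data gradient g only;
    the decay term lam * theta[i] is added outside the adaptive part.
    """
    for i in range(len(theta)):
        m[i] = beta1 * m[i] + (1 - beta1) * g[i]
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
        m_hat = m[i] / (1 - beta1 ** t)           # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        adam_step = m_hat / (math.sqrt(v_hat) + eps)
        theta[i] = theta[i] - eta * (adam_step + lam * theta[i])
    return theta

theta = [1.0, -0.5]
m, v = [0.0, 0.0], [0.0, 0.0]
adamw_step(theta, [0.2, -0.3], m, v, t=1)
```

Note that the moments never see the decay term; the regularizer enters only in the final parameter assignment.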
5.1. Why this fix makes sense mathematically
Adam and weight decay serve two different purposes:
- Adam adapts the update to the local gradient statistics of the loss,
- weight decay expresses an external preference for smaller parameter norms.
These roles should not be mixed.
If the regularization term is passed through Adam’s adaptive denominator, then the penalty on the parameters becomes distorted by quantities that were never meant to regulate the penalty itself.
AdamW enforces the correct separation:
- the loss gradient is treated adaptively,
- the weight norm is shrunk directly.
This separation is exactly what the word decoupled refers to.
Decoupling
In AdamW, the optimization step with respect to the data loss and the shrinkage step with respect to weight decay are kept as two distinct operations. That is the entire conceptual heart of AdamW.
6. AdamW is not “Adam with an L2 term”
This distinction is subtle but fundamental.
For SGD:
- L2 regularization
- weight decay
produce the same update rule.
For Adam:
- Adam on an L2-regularized objective
- Adam plus explicit decoupled weight decay
produce different update rules.
Therefore, AdamW should be understood as:
- Adam applied to the data loss, plus
- a separate multiplicative shrinkage of the parameters.
It is not a mere rewriting of the L2-regularized Adam objective.
Common terminological trap
In many codebases and tutorials, the words “L2 regularization” and “weight decay” are used interchangeably. That shortcut is acceptable for SGD, but it becomes misleading for Adam-like optimizers. AdamW exists precisely because the shortcut fails.
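The divergence of the two rules is easy to witness numerically. The sketch below (illustrative values; $\beta_1 = 0$ and a single bias-corrected step, so the preconditioner reduces to $1/(|g|+\epsilon)$) takes one step with each rule from the same starting point:

```python
import math

eta, lam, eps = 0.1, 0.1, 1e-8
theta0, g = 1.0, 0.2        # same starting point and data gradient

# (a) Adam on the L2-regularized loss (coupled): the L2 term
# passes through the adaptive denominator.
g_aug = g + lam * theta0
theta_coupled = theta0 - eta * g_aug / (math.sqrt(g_aug ** 2) + eps)

# (b) AdamW (decoupled): moments from g alone, decay applied separately.
theta_adamw = theta0 - eta * (g / (math.sqrt(g ** 2) + eps) + lam * theta0)

print(theta_coupled, theta_adamw)  # the two rules disagree
```

Even after one step, the coupled rule has normalized away most of the regularizer, while AdamW applies it at full strength.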
7. Coordinate-wise interpretation
For a single decayed weight parameter $\theta_i$, AdamW performs

$$\theta_{t+1,i} = (1 - \eta\lambda)\,\theta_{t,i} - \eta\,\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}} + \epsilon}.$$
This formula is very revealing:
- the factor $(1 - \eta\lambda)$ applies the same decay coefficient to every parameter,
- the adaptive denominator $\sqrt{\hat v_{t,i}} + \epsilon$ acts only on the loss-gradient step,
- the regularization term is no longer distorted by coordinate-wise gradient statistics.
This is exactly the behavior that the original notion of weight decay is supposed to represent.
8. Why AdamW usually works better in practice
The paper introducing AdamW emphasized two practical consequences.
Better-behaved regularization
With coupled Adam, the strength of regularization is entangled with Adam’s adaptive rescaling. With AdamW, the shrinkage mechanism is explicit and interpretable.
Cleaner hyperparameter tuning
In coupled Adam, the effect of the regularization term is distorted by the optimizer’s internal state. In AdamW, the roles of $\eta$ and $\lambda$ are much cleaner:
- $\eta$ controls the scale of the Adam step,
- $\lambda$ controls the strength of multiplicative shrinkage.
This does not mean that $\eta$ and $\lambda$ become mathematically unrelated. The update still contains the product $\eta\lambda$. However, the regularization is no longer additionally warped by the adaptive denominator and moment estimates. This makes tuning substantially more interpretable.
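A small illustrative comparison (made-up values; $\beta_1 = 0$, one bias-corrected step) shows how much of a single update is due to the regularizer at two different gradient scales:

```python
import math

eta, lam, eps = 0.01, 0.1, 1e-8
theta = 1.0

def coupled_decay_part(g):
    # Portion of the coupled Adam+L2 step attributable to the L2 term:
    # the preconditioner rescales it along with the gradient.
    g_aug = g + lam * theta
    p = 1.0 / (abs(g_aug) + eps)   # adaptive preconditioner
    return eta * p * lam * theta

def adamw_decay_part(g):
    # In AdamW the decay contribution is eta*lam*theta, independent of g.
    return eta * lam * theta

for g in (0.1, 10.0):
    print(g, coupled_decay_part(g), adamw_decay_part(g))
```

In the coupled rule the effective regularization collapses as gradients grow, whereas in AdamW it stays fixed at $\eta\lambda\theta$, which is what makes $\lambda$ easy to reason about when tuning.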
Note
This is one of the most important practical reasons why AdamW replaced plain Adam in many modern training pipelines.
9. Adam + L2 vs AdamW
| Method | Moments computed from | Does adaptive scaling affect the decay term? | True decoupled weight decay? |
|---|---|---|---|
| Adam + L2 | $g_t + \lambda_{L2}\,\theta_t$ | Yes | No |
| AdamW | $g_t$ only | No | Yes |
10. Implementation notes
In modern frameworks
In modern deep-learning libraries, `AdamW` is typically implemented directly as the decoupled rule above. This is the optimizer that should generally be preferred whenever:
- Adam-style adaptivity is desired,
- nonzero weight decay is required,
- a modern training pipeline is being designed.
Parameters often excluded from decay
In practice, weight decay is often applied to:
- weight matrices of linear layers,
- convolution kernels,
- embedding matrices,
but often excluded from:
- biases,
- BatchNorm / LayerNorm scale and shift parameters.
The underlying practical reason is that these parameters typically do not benefit from norm shrinkage in the same way as ordinary weights.
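This convention can be sketched framework-agnostically. The helper below (a hypothetical name, with a heuristic keyword list that is a common convention rather than a fixed standard) splits named parameters into decay and no-decay groups:

```python
def split_decay_groups(named_params, no_decay_keys=("bias", "norm")):
    """Split (name, param) pairs into decay / no-decay groups by name.

    The keyword list is a heuristic: biases and normalization
    parameters are routed to the no-decay group.
    """
    decay, no_decay = [], []
    for name, _param in named_params:
        key = name.lower()
        if any(k in key for k in no_decay_keys):
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay

params = [
    ("encoder.linear.weight", None),
    ("encoder.linear.bias", None),
    ("encoder.layernorm.weight", None),
    ("embedding.weight", None),
]
decay, no_decay = split_decay_groups(params)
print(decay)     # ['encoder.linear.weight', 'embedding.weight']
print(no_decay)  # ['encoder.linear.bias', 'encoder.layernorm.weight']
```

The resulting groups would then be handed to the optimizer with different `weight_decay` settings (nonzero for the first group, zero for the second).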
Practical rule
If an Adam-family optimizer is used together with a nonzero `weight_decay`, the safe default is almost always:
- use `AdamW`, not plain `Adam`,
- apply decay primarily to true weight tensors,
- exclude biases and normalization parameters unless there is a specific reason to include them.
11. Summary
AdamW should be remembered through one central idea:
Important
Weight decay must be decoupled from Adam’s adaptive gradient transformation.
The logic of the method is:
- Adam computes an adaptive step from the data loss gradient.
- Weight decay shrinks the parameters separately.
This separation is necessary because:
- in SGD, L2 regularization and weight decay are equivalent,
- in Adam, they are not,
- the naive transfer of the SGD intuition to Adam is mathematically incorrect.
For this reason, AdamW is not merely a convenient variant. It is the mathematically clean way to use weight decay with Adam.
Final takeaway
Whenever Adam-style optimization and weight decay are both desired, AdamW is the principled choice.