1. Intro
AdamW is the standard modern way to combine the Adam optimizer with weight decay.
It was introduced by Ilya Loshchilov and Frank Hutter in the paper Decoupled Weight Decay Regularization (arXiv 2017, published at ICLR 2019).
The central insight of that work is extremely important:
- in SGD, L2 regularization and weight decay are equivalent,
- in Adam, they are not equivalent,
- therefore, applying L2 regularization to Adam as if it were ordinary weight decay is conceptually incorrect.
AdamW was introduced precisely to fix this problem.
What problem does AdamW solve?
AdamW solves the following issue: when the regularization term $\lambda_{L2}\,\theta$ is injected directly into Adam’s gradient pipeline, it is processed by Adam’s adaptive moment machinery. As a result, the shrinkage of the weights is no longer a clean, uniform weight decay. Instead, it becomes entangled with the history of the gradients.
Notation
In this note, $\theta$ denotes the subset of trainable parameters to which weight decay is applied. In practice, this usually means weight tensors; biases and normalization parameters are often excluded.
The coefficient used directly in the optimizer update is denoted by $\lambda$. This choice keeps the AdamW formulas conceptually clean, because AdamW is defined as a decoupled update rule.
To connect this note with L2 regularization, suppose the regularized objective is written using the averaged-loss convention

$$L_{\text{reg}}(\theta) = L(\theta) + \frac{\lambda_{L2}}{2}\,\|\theta\|_2^2.$$

Then the corresponding decay coefficient in the optimizer update is

$$\lambda = \lambda_{L2}.$$

So the present note is fully compatible with the notation used in the L2-regularization note; it simply names the direct decay coefficient explicitly.
2. Motivation
When L2 regularization is first learned in the context of gradient descent or SGD, the following equivalence is usually established:
- optimizing an L2-regularized loss
- applying multiplicative weight decay at each step
For SGD, this equivalence is exact. Many implementations and explanations carried this intuition over to Adam. That transfer is the mistake AdamW was designed to correct.
Historical motivation
AdamW was not introduced as a minor variant of Adam. It was introduced because a large part of the community was informally treating “Adam + L2 regularization” as if it were the same thing as “Adam + weight decay”. The paper by Loshchilov and Hutter showed that this identification is false for adaptive optimizers.
3. L2 regularization and weight decay equivalence in SGD
Let

$$L_t(\theta)$$

denote the data loss at iteration $t$, and let

$$g_t = \nabla_\theta L_t(\theta_t)$$

be its gradient evaluated at the current parameters $\theta_t$.
Now consider the L2-regularized objective

$$\tilde L_t(\theta) = L_t(\theta) + \frac{\lambda_{L2}}{2}\,\|\theta\|_2^2.$$

Its gradient is

$$\nabla_\theta \tilde L_t(\theta_t) = g_t + \lambda_{L2}\,\theta_t.$$

Therefore, the SGD update becomes

$$\theta_{t+1} = \theta_t - \eta\,\bigl(g_t + \lambda_{L2}\,\theta_t\bigr).$$

Rearranging,

$$\theta_{t+1} = (1 - \eta\lambda_{L2})\,\theta_t - \eta\,g_t.$$
This is the key identity.
It shows that, for SGD:
- the term $-\eta\,g_t$ is the usual optimization step on the data loss,
- the factor $(1 - \eta\lambda_{L2})$ is a pure multiplicative shrinkage of the parameters.
This multiplicative shrinkage is exactly what is meant by weight decay.
Key conclusion for SGD
In SGD, adding the L2 penalty

$$\frac{\lambda_{L2}}{2}\,\|\theta\|_2^2$$

to the loss is mathematically equivalent to shrinking the parameters by the factor $(1 - \eta\lambda_{L2})$ at every step.
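The equivalence can also be checked numerically. The sketch below (all values are illustrative) takes one SGD step both ways from the same starting point:

```python
# One SGD step on an L2-regularized loss vs. one decoupled
# weight-decay step, on a toy parameter vector.
eta = 0.1                   # learning rate
lam_l2 = 0.01               # L2 penalty coefficient
theta = [1.5, -2.0, 0.5]    # current parameters
g = [0.3, -0.1, 0.7]        # data-loss gradient at theta

# (a) SGD on the regularized loss: theta - eta * (g + lam_l2 * theta)
a = [t - eta * (gi + lam_l2 * t) for t, gi in zip(theta, g)]

# (b) Decay-then-step form: (1 - eta*lam_l2) * theta - eta * g
b = [(1 - eta * lam_l2) * t - eta * gi for t, gi in zip(theta, g)]

print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))  # True
```

The two update rules are algebraically identical, so they agree up to floating-point rounding.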
4. Why this equivalence breaks in Adam
The problem appears as soon as the update is no longer a plain scalar multiple of the gradient.
Adam does not apply a single global scaling to the gradient. Instead, it builds:
- a first-moment estimate,
- a second-moment estimate,
- a coordinate-wise adaptive denominator.
If L2 regularization is naively inserted into Adam, the algorithm uses the augmented gradient

$$\tilde g_t = g_t + \lambda_{L2}\,\theta_t,$$

and then computes

$$m_t = \beta_1\,m_{t-1} + (1 - \beta_1)\,\tilde g_t, \qquad v_t = \beta_2\,v_{t-1} + (1 - \beta_2)\,\tilde g_t^{\,2},$$

followed by

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$

where all vector operations are elementwise.
Two problems immediately appear.
1 The regularization term contaminates the moment estimates
The moments $m_t$ and $v_t$ are no longer statistics of the data gradient alone. They now track the mixed quantity

$$\tilde g_t = g_t + \lambda_{L2}\,\theta_t.$$
This means that the regularization term is not merely added at the end of the update. It influences:
- the EMA of gradients,
- the EMA of squared gradients,
- the bias-corrected quantities,
- the adaptive denominator itself.
In other words, the regularizer is passed through machinery that was designed to analyze the geometry of the loss gradient, not the geometry of the regularization term.
2 The decay ceases to be a clean multiplicative shrinkage
To isolate the core issue, it is useful to momentarily ignore the first-moment smoothing and view Adam as a coordinate-wise preconditioner.
Define

$$P_t = \operatorname{diag}\!\left(\frac{1}{\sqrt{\hat v_t} + \epsilon}\right).$$

Then the adaptive step has the schematic form

$$\theta_{t+1} = \theta_t - \eta\,P_t\,\bigl(g_t + \lambda_{L2}\,\theta_t\bigr).$$

Rearranging,

$$\theta_{t+1} = \bigl(I - \eta\lambda_{L2}\,P_t\bigr)\,\theta_t - \eta\,P_t\,g_t.$$
This is no longer ordinary weight decay.
Indeed:
- if $P_t$ were a scalar multiple of the identity, the decay would be uniform;
- in Adam, $P_t$ is coordinate-wise and time-dependent;
- therefore, each parameter is shrunk by a different effective factor. The decay has become anisotropic and history-dependent.
The conceptual failure of "Adam + L2"
The quantity $\lambda_{L2}\,\theta$ is supposed to express a clean preference for small weights. In coupled Adam, however, this term is rescaled according to the past gradient statistics of each coordinate. As a result, some parameters are regularized more strongly than others for reasons that come from Adam’s adaptive geometry, not from the intended meaning of weight decay.
Note
This is the precise reason why the sentence “L2 regularization is the same as weight decay” is true for SGD but false for Adam.
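The anisotropy can be made concrete with a small simulation (all coefficients are made-up). Two coordinates start at the same value but see very different gradient magnitudes; the first moment is ignored ($\beta_1 = 0$) to expose the preconditioner, as in the derivation above:

```python
import math

# Naive "Adam + L2" on two coordinates with the SAME value
# but very different data-gradient magnitudes.
eta, lam_l2, beta2, eps = 0.001, 0.1, 0.999, 1e-8
theta = [1.0, 1.0]
grads = [10.0, 0.01]   # coord 0: large gradients; coord 1: tiny ones
v = [0.0, 0.0]

for t in range(1, 101):          # 100 steps with constant data gradients
    for i in range(2):
        g_aug = grads[i] + lam_l2 * theta[i]       # coupled L2 term
        v[i] = beta2 * v[i] + (1 - beta2) * g_aug ** 2
        v_hat = v[i] / (1 - beta2 ** t)
        p = 1.0 / (math.sqrt(v_hat) + eps)         # one coordinate of P_t
        theta[i] -= eta * p * g_aug

# Effective per-step shrink factor on each coordinate: 1 - eta*lam_l2*p.
p_final = [1.0 / (math.sqrt(v[i] / (1 - beta2 ** 100)) + eps) for i in range(2)]
shrink = [1 - eta * lam_l2 * p for p in p_final]
print(shrink)  # the small-gradient coordinate is decayed far more strongly
```

The coordinate with tiny gradients ends up with a much smaller shrink factor (stronger decay), purely because of its gradient history.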
5. The fix: AdamW
AdamW solves the above problem by applying the decay outside Adam’s adaptive gradient transformation.
The moments are computed from the true data gradient only:

$$m_t = \beta_1\,m_{t-1} + (1 - \beta_1)\,g_t, \qquad v_t = \beta_2\,v_{t-1} + (1 - \beta_2)\,g_t^{\,2}.$$

Only after the Adam step is defined is weight decay applied:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\,\theta_t\right).$$

Equivalently,

$$\theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.$$
This is AdamW.
The essential distinction
In AdamW:
- the adaptive moments see only the gradient of the data loss,
- the shrinkage term is applied separately,
- the meaning of weight decay is restored.
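The decoupled rule above can be sketched in a few lines of plain Python (the function name and default coefficients are illustrative, not a reference implementation):

```python
import math

def adamw_step(theta, g, m, v, t, eta=1e-3, lam=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One decoupled AdamW step on a list of scalar parameters.

    The moments m, v are updated from the data gradient g only;
    the decay term lam * theta[i] is added outside the adaptive part.
    """
    for i in range(len(theta)):
        m[i] = beta1 * m[i] + (1 - beta1) * g[i]
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
        m_hat = m[i] / (1 - beta1 ** t)           # bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        adam_step = m_hat / (math.sqrt(v_hat) + eps)
        theta[i] = theta[i] - eta * (adam_step + lam * theta[i])
    return theta

theta = [1.0, -0.5]
m, v = [0.0, 0.0], [0.0, 0.0]
adamw_step(theta, [0.2, -0.3], m, v, t=1)
```

Note that the moments never see the decay term; the regularizer enters only in the final parameter assignment.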
5.1. Why this fix makes sense mathematically
Adam and weight decay serve two different purposes:
- Adam adapts the update to the local gradient statistics of the loss,
- weight decay expresses an external preference for smaller parameter norms.
These roles should not be mixed.
If the regularization term is passed through Adam’s adaptive denominator, then the penalty on the parameters becomes distorted by quantities that were never meant to regulate the penalty itself.
AdamW enforces the correct separation:
- the loss gradient is treated adaptively,
- the weight norm is shrunk directly.
This separation is exactly what the word decoupled refers to.
Decoupling
In AdamW, the optimization step with respect to the data loss and the shrinkage step with respect to weight decay are kept as two distinct operations. That is the entire conceptual heart of AdamW.
6. AdamW is not “Adam with an L2 term”
This distinction is subtle but fundamental.
For SGD:
- L2 regularization
- weight decay
produce the same update rule.
For Adam:
- Adam on an L2-regularized objective
- Adam plus explicit decoupled weight decay
produce different update rules.
Therefore, AdamW should be understood as:
- Adam applied to the data loss, plus
- a separate multiplicative shrinkage of the parameters.
It is not a mere rewriting of the L2-regularized Adam objective.
Common terminological trap
In many codebases and tutorials, the words “L2 regularization” and “weight decay” are used interchangeably. That shortcut is acceptable for SGD, but it becomes misleading for Adam-like optimizers. AdamW exists precisely because the shortcut fails.
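The divergence of the two rules is easy to witness numerically. The sketch below (illustrative values; $\beta_1 = 0$ and a single bias-corrected step, so the preconditioner reduces to $1/(|g|+\epsilon)$) takes one step with each rule from the same starting point:

```python
import math

eta, lam, eps = 0.1, 0.1, 1e-8
theta0, g = 1.0, 0.2        # same starting point and data gradient

# (a) Adam on the L2-regularized loss (coupled): the L2 term
# passes through the adaptive denominator.
g_aug = g + lam * theta0
theta_coupled = theta0 - eta * g_aug / (math.sqrt(g_aug ** 2) + eps)

# (b) AdamW (decoupled): moments from g alone, decay applied separately.
theta_adamw = theta0 - eta * (g / (math.sqrt(g ** 2) + eps) + lam * theta0)

print(theta_coupled, theta_adamw)  # the two rules disagree
```

Even after one step, the coupled rule has normalized away most of the regularizer, while AdamW applies it at full strength.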
7. Coordinate-wise interpretation
For a single decayed weight parameter $\theta_i$, AdamW performs

$$\theta_{t+1,i} = (1 - \eta\lambda)\,\theta_{t,i} - \eta\,\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}} + \epsilon}.$$
This formula is very revealing:
- the factor $(1 - \eta\lambda)$ applies the same decay coefficient to every parameter,
- the adaptive denominator $\sqrt{\hat v_{t,i}} + \epsilon$ acts only on the loss-gradient step,
- the regularization term is no longer distorted by coordinate-wise gradient statistics.
This is exactly the behavior that the original notion of weight decay is supposed to represent.
8. Why AdamW usually works better in practice
The paper introducing AdamW emphasized two practical consequences.
Better-behaved regularization
With coupled Adam, the strength of regularization is entangled with Adam’s adaptive rescaling. With AdamW, the shrinkage mechanism is explicit and interpretable.
Cleaner hyperparameter tuning
In coupled Adam, the effect of the regularization term is distorted by the optimizer’s internal state. In AdamW, the roles of $\eta$ and $\lambda$ are much cleaner:
- $\eta$ controls the scale of the Adam step,
- $\lambda$ controls the strength of multiplicative shrinkage.
This does not mean that $\eta$ and $\lambda$ become mathematically unrelated. The update still contains the product $\eta\lambda$. However, the regularization is no longer additionally warped by the adaptive denominator and moment estimates. This makes tuning substantially more interpretable.
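A small illustrative comparison (made-up values; $\beta_1 = 0$, one bias-corrected step) shows how much of a single update is due to the regularizer at two different gradient scales:

```python
import math

eta, lam, eps = 0.01, 0.1, 1e-8
theta = 1.0

def coupled_decay_part(g):
    # Portion of the coupled Adam+L2 step attributable to the L2 term:
    # the preconditioner rescales it along with the gradient.
    g_aug = g + lam * theta
    p = 1.0 / (abs(g_aug) + eps)   # adaptive preconditioner
    return eta * p * lam * theta

def adamw_decay_part(g):
    # In AdamW the decay contribution is eta*lam*theta, independent of g.
    return eta * lam * theta

for g in (0.1, 10.0):
    print(g, coupled_decay_part(g), adamw_decay_part(g))
```

In the coupled rule the effective regularization collapses as gradients grow, whereas in AdamW it stays fixed at $\eta\lambda\theta$, which is what makes $\lambda$ easy to reason about when tuning.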
Note
This is one of the most important practical reasons why AdamW replaced plain Adam in many modern training pipelines.
9. Adam + L2 vs AdamW
| Method | Moments computed from | Does adaptive scaling affect the decay term? | True decoupled weight decay? |
|---|---|---|---|
| Adam + L2 | $g_t + \lambda_{L2}\,\theta_t$ | Yes | No |
| AdamW | $g_t$ only | No | Yes |
10. Implementation notes
In modern frameworks
In modern deep-learning libraries, `AdamW` is typically implemented directly as the decoupled rule above. This is the optimizer that should generally be preferred whenever:
- Adam-style adaptivity is desired,
- nonzero weight decay is required,
- a modern training pipeline is being designed.
Parameters often excluded from decay
In practice, weight decay is often applied to:
- weight matrices of linear layers,
- convolution kernels,
- embedding matrices,
but often excluded from:
- biases,
- BatchNorm / LayerNorm scale and shift parameters.
The underlying practical reason is that these parameters typically do not benefit from norm shrinkage in the same way as ordinary weights.
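This convention can be sketched framework-agnostically. The helper below (a hypothetical name, with a heuristic keyword list that is a common convention rather than a fixed standard) splits named parameters into decay and no-decay groups:

```python
def split_decay_groups(named_params, no_decay_keys=("bias", "norm")):
    """Split (name, param) pairs into decay / no-decay groups by name.

    The keyword list is a heuristic: biases and normalization
    parameters are routed to the no-decay group.
    """
    decay, no_decay = [], []
    for name, _param in named_params:
        key = name.lower()
        if any(k in key for k in no_decay_keys):
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay

params = [
    ("encoder.linear.weight", None),
    ("encoder.linear.bias", None),
    ("encoder.layernorm.weight", None),
    ("embedding.weight", None),
]
decay, no_decay = split_decay_groups(params)
print(decay)     # ['encoder.linear.weight', 'embedding.weight']
print(no_decay)  # ['encoder.linear.bias', 'encoder.layernorm.weight']
```

The resulting groups would then be handed to the optimizer with different `weight_decay` settings (nonzero for the first group, zero for the second).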
Practical rule
If an Adam-family optimizer is used together with a nonzero `weight_decay`, the safe default is almost always:
- use `AdamW`, not plain `Adam`,
- apply decay primarily to true weight tensors,
- exclude biases and normalization parameters unless there is a specific reason to include them.
11. Summary
AdamW should be remembered through one central idea:
Important
Weight decay must be decoupled from Adam’s adaptive gradient transformation.
The logic of the method is:
- Adam computes an adaptive step from the data loss gradient.
- Weight decay shrinks the parameters separately.
This separation is necessary because:
- in SGD, L2 regularization and weight decay are equivalent,
- in Adam, they are not,
- the naive transfer of the SGD intuition to Adam is mathematically incorrect.
For this reason, AdamW is not merely a convenient variant. It is the mathematically clean way to use weight decay with Adam.
Final takeaway
Whenever Adam-style optimization and weight decay are both desired, AdamW is the principled choice.