1. Intro

AdamW is the standard modern way to combine the Adam optimizer with weight decay.

It was introduced by Ilya Loshchilov and Frank Hutter in the paper Decoupled Weight Decay Regularization (arXiv 2017, published at ICLR 2019).

The central insight of that work is extremely important:

  • in SGD, L2 regularization and weight decay are equivalent,
  • in Adam, they are not equivalent,
  • therefore, applying L2 regularization to Adam as if it were ordinary weight decay is conceptually incorrect.

AdamW was introduced precisely to fix this problem.

What problem does AdamW solve?

AdamW solves the following issue: when the regularization term $\lambda\theta$ is injected directly into Adam’s gradient pipeline, it is processed by Adam’s adaptive moment machinery. As a result, the shrinkage of the weights is no longer a clean, uniform weight decay. Instead, it becomes entangled with the history of the gradients.

Notation

In this note, $\theta$ denotes the subset of trainable parameters to which weight decay is applied. In practice, this usually means weight tensors; biases and normalization parameters are often excluded.

The coefficient used directly in the optimizer update is denoted by $\lambda$. This choice keeps the AdamW formulas conceptually clean, because AdamW is defined as a decoupled update rule.

To connect this note with L2 regularization, suppose the regularized objective is written using the averaged-loss convention

$$L_{\text{reg}}(\theta) = L(\theta) + \frac{\mu}{2}\,\lVert \theta \rVert_2^2 .$$

Then the corresponding decay coefficient in the optimizer update is

$$\lambda = \mu .$$

So the present note is fully compatible with the notation used in the L2-regularization note; it simply names the direct decay coefficient $\lambda$ explicitly.


2. Motivation

When L2 regularization is first learned in the context of gradient descent or SGD, the following equivalence is usually established:

  • optimizing an L2-regularized loss
  • applying multiplicative weight decay at each step

For SGD, this equivalence is exact. Many implementations and explanations carried this intuition over to Adam. That transfer is the mistake AdamW was designed to correct.

Historical motivation

AdamW was not introduced as a minor variant of Adam. It was introduced because a large part of the community was informally treating “Adam + L2 regularization” as if it were the same thing as “Adam + weight decay”. The paper by Loshchilov and Hutter showed that this identification is false for adaptive optimizers.


3. L2 regularization and weight decay equivalence in SGD

Let

$$L_t(\theta)$$

denote the data loss at iteration $t$, and let

$$g_t = \nabla_\theta L_t(\theta_t)$$

be its gradient evaluated at the current parameters $\theta_t$.

Now consider the L2-regularized objective

$$L_t^{\text{reg}}(\theta) = L_t(\theta) + \frac{\lambda}{2}\,\lVert \theta \rVert_2^2 .$$

Its gradient is

$$\nabla_\theta L_t^{\text{reg}}(\theta_t) = g_t + \lambda\,\theta_t .$$

Therefore, the SGD update becomes

$$\theta_{t+1} = \theta_t - \eta\,(g_t + \lambda\,\theta_t).$$

Rearranging,

$$\theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta\,g_t .$$

This is the key identity.

It shows that, for SGD:

  • the term $-\eta\,g_t$ is the usual optimization step on the data loss,
  • the factor $(1-\eta\lambda)$ is a pure multiplicative shrinkage of the parameters.

This multiplicative shrinkage is exactly what is meant by weight decay.

Key conclusion for SGD

In SGD, adding the L2 penalty

$$\frac{\lambda}{2}\,\lVert \theta \rVert_2^2$$

to the loss is mathematically equivalent to shrinking the parameters by the factor $(1-\eta\lambda)$ at every step.
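The equivalence can be checked numerically. A minimal sketch (the function names `sgd_step_l2` and `sgd_step_decay` are illustrative, not from any library):

```python
import numpy as np

def sgd_step_l2(theta, grad, lr, lam):
    # SGD on the L2-regularized objective: the gradient picks up lam * theta.
    return theta - lr * (grad + lam * theta)

def sgd_step_decay(theta, grad, lr, lam):
    # Plain SGD step on the data loss, plus multiplicative weight decay.
    return (1.0 - lr * lam) * theta - lr * grad

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
grad = rng.normal(size=5)

a = sgd_step_l2(theta, grad, lr=0.1, lam=0.01)
b = sgd_step_decay(theta, grad, lr=0.1, lam=0.01)
print(np.allclose(a, b))  # True: the two updates coincide exactly for SGD
```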


4. Why this equivalence breaks in Adam

The problem appears as soon as the update is no longer a plain scalar multiple of the gradient.

Adam does not apply a single global scaling to the gradient. Instead, it builds:

  • a first-moment estimate,
  • a second-moment estimate,
  • a coordinate-wise adaptive denominator.

If L2 regularization is naively inserted into Adam, the algorithm uses the augmented gradient

$$g_t^{\text{reg}} = g_t + \lambda\,\theta_t$$

and then computes

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t^{\text{reg}}, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\bigl(g_t^{\text{reg}}\bigr)^2,$$

followed by

$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$

where all vector operations are elementwise.

Two problems immediately appear.

1. The regularization term contaminates the moment estimates

The moments $m_t$ and $v_t$ are no longer statistics of the data gradient alone. They now track the mixed quantity

$$g_t + \lambda\,\theta_t .$$

This means that the regularization term is not merely added at the end of the update. It influences:

  • the EMA of gradients,
  • the EMA of squared gradients,
  • the bias-corrected quantities,
  • the adaptive denominator itself.

In other words, the regularizer is passed through machinery that was designed to analyze the geometry of the loss gradient, not the geometry of the regularization term.

2. The decay ceases to be a clean multiplicative shrinkage

To isolate the core issue, it is useful to momentarily ignore the first-moment smoothing and view Adam as a coordinate-wise preconditioner.

Define

$$P_t = \operatorname{diag}\!\left(\frac{1}{\sqrt{\hat v_t} + \epsilon}\right).$$

Then the adaptive step has the schematic form

$$\theta_{t+1} = \theta_t - \eta\,P_t\,(g_t + \lambda\,\theta_t).$$

Rearranging,

$$\theta_{t+1} = (I - \eta\lambda\,P_t)\,\theta_t - \eta\,P_t\,g_t .$$
This is no longer ordinary weight decay.

Indeed:

  • if $P_t$ were a scalar multiple of the identity, the decay would be uniform;
  • in Adam, $P_t$ is coordinate-wise and time-dependent;
  • therefore, each parameter is shrunk by a different effective factor. The decay has become anisotropic and history-dependent.
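This anisotropy is easy to observe numerically. The sketch below again ignores the first-moment smoothing ($\beta_1 = 0$) and tracks the per-coordinate factor that coupled Adam + L2 applies to $\theta$ in its last step (`adam_l2_effective_decay` is a hypothetical name introduced here for illustration):

```python
import numpy as np

def adam_l2_effective_decay(grads, theta0, lam=0.1, lr=0.01,
                            beta2=0.999, eps=1e-8):
    # Coupled Adam + L2 with beta1 = 0: the L2 term enters the gradient and
    # therefore the second-moment estimate as well.
    v = np.zeros_like(theta0)
    theta = theta0.copy()
    for t, g in enumerate(grads, start=1):
        g_reg = g + lam * theta                  # L2 term inside the gradient
        v = beta2 * v + (1 - beta2) * g_reg**2
        v_hat = v / (1 - beta2**t)               # bias correction
        denom = np.sqrt(v_hat) + eps
        theta = theta - lr * g_reg / denom
    # One step rearranges to theta_new = (1 - lr*lam/denom)*theta - lr*g/denom,
    # so the effective per-coordinate decay factor at the last step is:
    return 1.0 - lr * lam / denom

theta0 = np.ones(3)
# the three coordinates see gradients of very different magnitude
grads = [np.array([0.001, 1.0, 100.0]) for _ in range(10)]
factors = adam_l2_effective_decay(grads, theta0)
print(factors)  # the shrinkage factor differs strongly across coordinates
```

Coordinates with small gradient history get a small denominator and hence a much stronger effective decay, even though the intended decay coefficient is the same for all of them.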

The conceptual failure of "Adam + L2"

The quantity $\lambda\,\theta$ is supposed to express a clean preference for small weights. In coupled Adam, however, this term is rescaled according to the past gradient statistics of each coordinate. As a result, some parameters are regularized more strongly than others for reasons that come from Adam’s adaptive geometry, not from the intended meaning of weight decay.

Note

This is the precise reason why the sentence “L2 regularization is the same as weight decay” is true for SGD but false for Adam.


5. The fix: AdamW

AdamW solves the above problem by applying the decay outside Adam’s adaptive gradient transformation.

The moments are computed from the true data gradient only:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 .$$

Only after the Adam step is defined is weight decay applied:

$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\,\theta_t\right).$$

Equivalently,

$$\theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} .$$
This is AdamW.
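As a sketch, the decoupled rule can be written as a single NumPy update step (`adamw_step` is an illustrative name, not a framework API):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # Moments are built from the data gradient only -- no L2 term enters them.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)   # bias correction
    v_hat = v / (1 - beta2**t)
    # Adaptive step on the loss gradient, plus a separate uniform shrinkage.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
theta, m, v = adamw_step(theta, np.array([0.5, -0.5]), m, v, t=1)
```

A useful sanity check on the decoupling: with a zero gradient, the step reduces to pure uniform decay, multiplying every parameter by exactly $1 - \eta\lambda$.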

The essential distinction

In AdamW:

  • the adaptive moments see only the gradient of the data loss,
  • the shrinkage term is applied separately,
  • the meaning of weight decay is restored.

5.1. Why this fix makes sense mathematically

Adam and weight decay serve two different purposes:

  • Adam adapts the update to the local gradient statistics of the loss,
  • weight decay expresses an external preference for smaller parameter norms.

These roles should not be mixed.

If the regularization term is passed through Adam’s adaptive denominator, then the penalty on the parameters becomes distorted by quantities that were never meant to regulate the penalty itself.

AdamW enforces the correct separation:

  • the loss gradient is treated adaptively,
  • the weight norm is shrunk directly.

This separation is exactly what the word decoupled refers to.

Decoupling

In AdamW, the optimization step with respect to the data loss and the shrinkage step with respect to weight decay are kept as two distinct operations. That is the entire conceptual heart of AdamW.


6. AdamW is not “Adam with an L2 term”

This distinction is subtle but fundamental.

For SGD:

  • L2 regularization
  • weight decay

produce the same update rule.

For Adam:

  • Adam on an L2-regularized objective
  • Adam plus explicit decoupled weight decay

produce different update rules.

Therefore, AdamW should be understood as:

  • Adam applied to the data loss, plus
  • a separate multiplicative shrinkage of the parameters.

It is not a mere rewriting of the L2-regularized Adam objective.

Common terminological trap

In many codebases and tutorials, the words “L2 regularization” and “weight decay” are used interchangeably. That shortcut is acceptable for SGD, but it becomes misleading for Adam-like optimizers. AdamW exists precisely because the shortcut fails.


7. Coordinate-wise interpretation

For a single decayed weight parameter $\theta_i$, AdamW performs

$$\theta_{t+1,i} = (1 - \eta\lambda)\,\theta_{t,i} - \eta\,\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}} + \epsilon} .$$

This formula is very revealing:

  • the factor $(1-\eta\lambda)$ applies the same decay coefficient to every parameter,
  • the adaptive factor $1/(\sqrt{\hat v_{t,i}} + \epsilon)$ acts only on the loss-gradient step,
  • the regularization term is no longer distorted by coordinate-wise gradient statistics.

This is exactly the behavior that the original notion of weight decay is supposed to represent.
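A quick numerical check of this behavior, as a sketch with the first-moment smoothing ignored ($\beta_1 = 0$) and constant gradients: even when coordinate gradients differ by orders of magnitude, every coordinate is shrunk by the same factor $1 - \eta\lambda$ per step, and the adaptive part normalizes the scale differences away.

```python
import numpy as np

lr, lam, beta2, eps = 0.01, 0.1, 0.999, 1e-8
theta = np.ones(3)
v = np.zeros(3)
g = np.array([1e-3, 1.0, 1e2])   # constant gradients, three different scales
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = v / (1 - beta2**t)               # for constant g, v_hat == g**2
    adaptive = g / (np.sqrt(v_hat) + eps)    # ~1 for every coordinate
    theta = (1 - lr * lam) * theta - lr * adaptive

print(theta)  # all three coordinates followed the same trajectory
```

Because the decay is applied outside the adaptive machinery, the three coordinates end up at (essentially) the same value despite their five-orders-of-magnitude gradient differences.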


8. Why AdamW usually works better in practice

The paper introducing AdamW emphasized two practical consequences.

Better-behaved regularization

With coupled Adam, the strength of regularization is entangled with Adam’s adaptive rescaling. With AdamW, the shrinkage mechanism is explicit and interpretable.

Cleaner hyperparameter tuning

In coupled Adam, the effect of the regularization term is distorted by the optimizer’s internal state. In AdamW, the role of $\eta$ and $\lambda$ is much cleaner:

  • $\eta$ controls the scale of the Adam step,
  • $\lambda$ controls the strength of multiplicative shrinkage.

This does not mean that $\eta$ and $\lambda$ become mathematically unrelated. The update still contains the product $\eta\lambda$. However, the regularization is no longer additionally warped by the adaptive denominator and moment estimates. This makes tuning substantially more interpretable.

Note

This is one of the most important practical reasons why AdamW replaced plain Adam in many modern training pipelines.


9. Adam + L2 vs AdamW

| Method | Moments computed from | Does adaptive scaling affect the decay term? | True decoupled weight decay? |
| --- | --- | --- | --- |
| Adam + L2 | $g_t + \lambda\,\theta_t$ | Yes | No |
| AdamW | $g_t$ only | No | Yes |

10. Implementation notes

In modern frameworks

In modern deep-learning libraries, AdamW is typically implemented directly as the decoupled rule above.

This is the optimizer that should generally be preferred whenever:

  • Adam-style adaptivity is desired,
  • nonzero weight decay is required,
  • a modern training pipeline is being designed.

Parameters often excluded from decay

In practice, weight decay is often applied to:

  • weight matrices of linear layers,
  • convolution kernels,
  • embedding matrices,

but often excluded from:

  • biases,
  • BatchNorm / LayerNorm scale and shift parameters.

The underlying practical reason is that these parameters typically do not benefit from norm shrinkage in the same way as ordinary weights.

Practical rule

If an Adam-family optimizer is used together with weight_decay, the safe default is almost always:

  • use AdamW, not plain Adam,
  • apply decay primarily to true weight tensors,
  • exclude biases and normalization parameters unless a specific reason exists not to.
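The decay/no-decay split is usually done by parameter name. A minimal sketch of the common convention (`split_decay_groups` is a hypothetical helper, not a framework API; the name-based rules are a convention, not a guarantee):

```python
def split_decay_groups(named_params):
    # Decay true weight tensors; skip biases and normalization parameters.
    decay, no_decay = [], []
    for name, param in named_params:
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay

named = [("linear.weight", ...), ("linear.bias", ...),
         ("layernorm.weight", ...), ("embed.weight", ...)]
decay, no_decay = split_decay_groups(named)
print(decay)     # ['linear.weight', 'embed.weight']
print(no_decay)  # ['linear.bias', 'layernorm.weight']
```

In PyTorch-style libraries, the two lists would typically become two optimizer parameter groups, one with the desired `weight_decay` and one with `weight_decay=0`.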

11. Summary

AdamW should be remembered through one central idea:

Important

Weight decay must be decoupled from Adam’s adaptive gradient transformation.

The logic of the method is:

  1. Adam computes an adaptive step from the data loss gradient.
  2. Weight decay shrinks the parameters separately.

This separation is necessary because:

  • in SGD, L2 regularization and weight decay are equivalent,
  • in Adam, they are not,
  • the naive transfer of the SGD intuition to Adam is mathematically incorrect.

For this reason, AdamW is not merely a convenient variant. It is the mathematically clean way to use weight decay with Adam.

Final takeaway

Whenever Adam-style optimization and weight decay are both desired, AdamW is the principled choice.