1. Intro

Adam (ptive oment estimation) is the natural completion of the adaptive-optimizer line that began with AdaGrad and continued with RMSProp. It combines two complementary ideas:

  • a first-moment EMA of the gradients, which plays the role of Momentum;
  • a second-moment EMA of the squared gradients, inherited from RMSProp.

Instead of updating each parameter from the raw gradient alone, Adam builds:

  • a smoothed estimate of the update direction,
  • a smoothed estimate of the gradient scale.

For this reason, Adam is often summarized as a combination of Momentum + RMSProp, with an additional bias correction that fixes the startup transient of plain RMSProp.

Core idea

Adam may be interpreted as follows:

  • Momentum in the numerator: the EMA of gradients stabilizes the direction of motion.
  • RMSProp in the denominator: the EMA of squared gradients rescales each parameter adaptively.
  • Bias correction on top: the zero-initialization bias of both EMAs is removed during the first iterations.

Why bias correction matters

Exponential moving averages initialized at zero are systematically biased toward zero at the beginning of training. In Adam this matters twice:

  • the first-moment estimate underestimates the true average gradient,
  • the second-moment estimate underestimates the gradient magnitude that appears in the denominator.

Without correction, the early updates are not properly scaled.

Adam's practical strength

Strong behavior is usually observed from the very first iterations:

  • rapid initial progress is often obtained,
  • the step size is adapted parameter-by-parameter,
  • much of the unstable startup behavior that affects adaptive methods without bias correction is avoided.

2. Adam in depth

Info

This note describes the core Adam algorithm itself, not the full space of implementation variants and framework-specific options.

2.1 Adam update rule

Let

be the gradient of the loss with respect to parameter at iteration .

Adam maintains two exponential moving averages:

with initialization

Because both moving averages start at zero, Adam corrects them as follows:

The parameter update is then:

where:

  • is the base learning rate,
  • controls the memory of the first moment,
  • controls the memory of the second moment,
  • is a small numerical-stability constant.

The default values proposed in the original Adam paper are:

Note

The square root in is essential. Since tracks a second moment, taking the square root brings the denominator back to the same scale as the gradient. The same dimensional argument used in RMSProp applies here unchanged.

Effective memory length of the two EMAs

Each EMA in Adam has an effective memory horizon of approximately steps (the same heuristic introduced for RMSProp’s ):

  • the first moment with remembers roughly the last steps of gradients;
  • the second moment with remembers roughly the last steps of squared gradients.

The asymmetry is deliberate. The direction signal is allowed to react quickly to changes in the gradient (short memory), while the per-coordinate scale signal is allowed to vary only slowly (long memory). The second-moment estimator’s much longer horizon is exactly why its bias correction is so important: with , the uncorrected does not approach a fully-formed estimate until hundreds of iterations, and the early denominator can be systematically too small without correction.


2.2 EMA in Adam

Adam uses two distinct EMAs, each with a different role in the update.

QuantityRecursive formulaRole
First moment Tracks a smoothed version of the gradient; this stabilizes direction and plays the role of momentum
Second moment Tracks a smoothed version of squared gradient magnitudes; this rescales each parameter adaptively

The qualitative effect is the following:

  • if a parameter has received large recent gradients, then becomes large and the effective step on that parameter is reduced;
  • if a parameter has received small or infrequent gradients, then stays smaller and that parameter is updated more aggressively.

Interpretation

In Adam, two roles that are not separated in plain SGD are handled explicitly:

  • the numerator determines the update direction,
  • the denominator determines the update scale for each parameter.

2.3 Correcting EMA bias at initialization

2.3.1 EMA is biased toward zero at the beginning

Because and , the first few values of the two EMAs are pulled toward zero by construction.

If the gradient statistics are roughly stationary over the initial iterations, then:

  • the average gradient does not change dramatically from one step to the next,
  • the average squared gradient does not change dramatically either,
  • in other words, during the very early phase of training the gradient process may be treated as having approximately stable first and second moments.

Concretely, this means that for the first few steps the following approximation is used:

This assumption does not assert that the gradients are exactly constant. It is only a local approximation used to isolate the effect of zero initialization of the EMAs. Under this approximation, the bias formulas below show clearly that the factors and arise from initialization, not from some special property of the loss.

This is the origin of the correction factors and .

Note

Bias correction addresses exactly one defect: the downward bias that zero-initialized EMAs carry during the first iterations. It removes that, and nothing beyond it. Where the gradient statistics are themselves non-stationary and drift over training, the corrected moments remain approximations of moving targets.

2.3.2 First-step behavior

At the first iteration (), before bias correction:

Therefore, without bias correction, the first Adam update would be:

If is negligible compared with , this becomes approximately:

With the default values:

The first step is therefore not “almost negligible”. If anything, it can be too large, because the denominator is underestimated even more strongly than the numerator.

Common misconception

A small and a small do not imply a tiny Adam update. Since appears under a square root in the denominator, the early step can actually be amplified.

2.3.3 With bias correction

Applying the bias-correction terms gives:

Hence the first corrected update is:

A much better-behaved update is therefore obtained:

  • the update direction is aligned with the true gradient,
  • the initial scale distortion caused by zero initialization is removed,
  • the optimizer starts from a controlled regime rather than from a badly mis-scaled one.

Why the correction is especially important for

Since is very close to , the second-moment estimate is heavily biased toward zero at the start if left uncorrected. That is precisely the term sitting in the denominator. This is why bias correction is a central part of Adam rather than a minor technical addition.

2.3.4 Why the correction matters mostly at the beginning

As grows,

because both and are chosen in the interval . Whenever a number satisfies , repeated multiplication by makes it exponentially smaller:

For example:

and

Thus, even if the decay is slow when is very close to , the power still eventually vanishes.

Hence

Therefore, the bias-correction terms matter mainly during the startup phase. After enough iterations, Adam behaves like an adaptive optimizer based on well-formed moving averages.


2.4 Dimensional consideration

In much of Deep Learning, parameters and losses are treated as dimensionless quantities. Even under that convention, it remains useful to check that Adam is internally consistent:

  • and have the same units as the gradient,
  • and have the units of the gradient squared,
  • has the same units as the gradient,
  • therefore is dimensionless.

Thus the update

is dimensionally consistent.

Info

The dimensional check also clarifies why the denominator must be rather than . Without the square root, the numerator and denominator would not live on compatible scales.


3. Summary

Adam is built from three ingredients:

  • momentum-like smoothing through the EMA of gradients in the numerator;
  • RMSProp-like adaptivity through the EMA of squared gradients in the denominator;
  • bias correction to neutralize the zero-initialization distortion of both EMAs during the startup phase.

The combination delivers stable, per-parameter, well-scaled updates from the very first iteration, which is the main reason Adam remains one of the strongest default optimizers in deep learning.

Memory cost: the parameters (one first-moment buffer and one second-moment buffer per parameter), versus for RMSProp and SGD with Momentum, and for plain SGD. This cost matters at very large scale and motivates memory-efficient variants like Adafactor and 8-bit Adam in foundation-model training; see Choosing an optimizer for the comparison table.

Adam vs AdamW when weight decay is involved

Whenever weight decay is needed, the correct choice is AdamW, not “Adam + L2 mixed into the gradient”. L2 added to Adam’s gradient is passed through Adam’s adaptive denominator, which distorts the regularization in a way that has no statistical justification. AdamW decouples the weight-decay step from the adaptive rescaling, restoring the geometric content of L2 regularization. This is treated in detail in AdamW.

Sources

Adam comes from Kingma and Ba (2014). It and the wider optimizer and regularizer literature are collected in Optimization and Regularization.