1. Intro

Adam (Adaptive Moment Estimation) combines two complementary ideas:

  • a first-moment EMA of the gradients, which plays the role of momentum,
  • a second-moment EMA of the squared gradients, inherited from RMSProp.

Instead of updating each parameter from the raw gradient alone, Adam builds:

  • a smoothed estimate of the update direction,
  • a smoothed estimate of the gradient scale.

For this reason, Adam is often summarized as a combination of Momentum + RMSProp.

Core idea

Adam may be interpreted as follows:

  • Momentum in the numerator: the EMA of gradients stabilizes the direction of motion.
  • RMSProp in the denominator: the EMA of squared gradients rescales each parameter adaptively.
  • Bias correction on top: the zero-initialization bias of both EMAs is removed during the first iterations.

Why bias correction matters

Exponential moving averages initialized at zero are systematically biased toward zero at the beginning of training. In Adam this matters twice:

  • the first-moment estimate underestimates the true average gradient,
  • the second-moment estimate underestimates the gradient magnitude that appears in the denominator.

Without correction, the early updates are not properly scaled.

Adam's practical strength

Strong behavior is usually observed from the very first iterations:

  • rapid initial progress is often obtained,
  • the step size is adapted parameter-by-parameter,
  • much of the unstable startup behavior that affects adaptive methods without bias correction is avoided.

2. Adam in depth

Info

This note describes the core Adam algorithm itself, not the full space of implementation variants and framework-specific options.

2.1 Adam update rule

Let

$$g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$$

be the gradient of the loss with respect to the parameter $\theta$ at iteration $t$.

Adam maintains two exponential moving averages:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

with initialization

$$m_0 = 0, \qquad v_0 = 0.$$

Because both moving averages start at zero, Adam corrects them as follows:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

The parameter update is then:

$$\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where:

  • $\eta$ is the base learning rate,
  • $\beta_1$ controls the memory of the first moment,
  • $\beta_2$ controls the memory of the second moment,
  • $\epsilon$ is a small numerical-stability constant.

The default values proposed in the original Adam paper are:

$$\eta = 0.001, \qquad \beta_1 = 0.9, \qquad \beta_2 = 0.999, \qquad \epsilon = 10^{-8}$$
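The update rule above can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `adam_step` and the toy quadratic objective are ours, not taken from any particular framework.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    # First-moment EMA (momentum-like direction estimate)
    m = beta1 * m + (1 - beta1) * grad
    # Second-moment EMA (RMSProp-like scale estimate)
    v = beta2 * v + (1 - beta2) * grad**2
    # Remove the zero-initialization bias of both EMAs
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Per-parameter rescaled step
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = theta^2, whose gradient is 2 * theta
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

After a few thousand iterations the parameter has moved from 1.0 into a small neighborhood of the minimum at 0, each step having magnitude on the order of the learning rate.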

Note

The square root in $\sqrt{\hat{v}_t}$ is essential. Since $v_t$ tracks a second moment, taking the square root brings the denominator back to the same scale as the gradient.


2.2 EMA in Adam

Adam uses two distinct EMAs, each with a different role in the update.

| Quantity | Recursive formula | Role |
| --- | --- | --- |
| First moment $m_t$ | $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$ | Tracks a smoothed version of the gradient; this stabilizes direction and plays the role of momentum |
| Second moment $v_t$ | $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$ | Tracks a smoothed version of squared gradient magnitudes; this rescales each parameter adaptively |

The qualitative effect is the following:

  • if a parameter has received large recent gradients, then $v_t$ becomes large and the effective step on that parameter is reduced;
  • if a parameter has received small or infrequent gradients, then $v_t$ stays smaller and that parameter is updated more aggressively.
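A tiny numerical check of this per-parameter behavior (the gradient values are illustrative, not from the source): feed one parameter gradients 1000 times larger than another and compare the resulting effective steps.

```python
import numpy as np

beta1, beta2, lr, eps = 0.9, 0.999, 0.001, 1e-8
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 101):
    g = np.array([10.0, 0.01])   # parameter 0: large gradients; parameter 1: small
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
# Despite the 1000x gradient gap, both effective steps land near lr,
# because each parameter is rescaled by its own gradient magnitude.
```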

Interpretation

In Adam, two roles that are not separated in plain SGD are handled explicitly:

  • the numerator $\hat{m}_t$ determines the update direction,
  • the denominator $\sqrt{\hat{v}_t} + \epsilon$ determines the update scale for each parameter.

2.3 Correcting EMA bias at initialization

2.3.1 EMA is biased toward zero at the beginning

Because $m_0 = 0$ and $v_0 = 0$, the first few values of the two EMAs are pulled toward zero by construction.

If the gradient statistics are roughly stationary over the initial iterations, then:

  • the average gradient does not change dramatically from one step to the next,
  • the average squared gradient does not change dramatically either,
  • in other words, during the very early phase of training the gradient process may be treated as having approximately stable first and second moments.

Concretely, this means that for the first few steps the following approximation is used:

$$g_i \approx g, \qquad g_i^2 \approx g^2 \qquad \text{for } i \le t$$

This assumption does not assert that the gradients are exactly constant. It is only a local approximation used to isolate the effect of zero initialization of the EMAs. Under this approximation, the bias formulas show clearly that the factors $1 - \beta_1^t$ and $1 - \beta_2^t$ arise from initialization, not from some special property of the loss.

This is the origin of the correction factors $1 - \beta_1^t$ and $1 - \beta_2^t$.
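Where these factors come from can be made explicit by unrolling the recursion; the following is a short derivation sketch under the stationarity approximation, in the notation defined earlier.

```latex
% Unrolling m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t with m_0 = 0:
m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i

% Under the approximation g_i \approx g, the geometric sum collapses:
m_t \approx (1 - \beta_1)\, g \sum_{i=1}^{t} \beta_1^{\,t-i}
    = \bigl(1 - \beta_1^{\,t}\bigr)\, g

% Dividing by (1 - \beta_1^t) recovers g; the same argument with
% \beta_2 and g^2 yields the (1 - \beta_2^t) factor for v_t.
```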

Note

Bias correction does not mean that Adam becomes magically perfect in every non-stationary setting. It means that Adam removes the specific distortion induced by initializing the EMAs at zero.

2.3.2 First-step behavior

At the first iteration ($t = 1$), before bias correction:

$$m_1 = (1 - \beta_1)\, g_1, \qquad v_1 = (1 - \beta_2)\, g_1^2$$

Therefore, without bias correction, the first Adam update would be:

$$\theta_1 = \theta_0 - \eta\, \frac{(1 - \beta_1)\, g_1}{\sqrt{(1 - \beta_2)\, g_1^2} + \epsilon}$$

If $\epsilon$ is negligible compared with $\sqrt{1 - \beta_2}\,|g_1|$, this becomes approximately:

$$\theta_1 \approx \theta_0 - \eta\, \frac{1 - \beta_1}{\sqrt{1 - \beta_2}}\, \operatorname{sign}(g_1)$$

With the default values:

$$\frac{1 - \beta_1}{\sqrt{1 - \beta_2}} = \frac{0.1}{\sqrt{0.001}} \approx 3.16$$

The first step is therefore not “almost negligible”. If anything, it can be too large, because the denominator is underestimated even more strongly than the numerator.
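The amplification factor can be checked directly; this is a throwaway computation with an arbitrary choice of first gradient.

```python
import math

beta1, beta2, lr = 0.9, 0.999, 0.001
g1 = 0.5                          # any nonzero first gradient
m1 = (1 - beta1) * g1             # uncorrected first moment
v1 = (1 - beta2) * g1**2          # uncorrected second moment
step = lr * m1 / math.sqrt(v1)    # epsilon dropped, assumed negligible
amplification = abs(step) / lr    # how much larger than a plain lr-sized step
```

Because the gradient magnitude cancels between the numerator and the square-rooted denominator, the result does not depend on the value chosen for `g1`.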

Common misconception

A small $m_1$ and a small $v_1$ do not imply a tiny Adam update. Since $v_1$ appears under a square root in the denominator, the early step can actually be amplified.

2.3.3 With bias correction

Applying the bias-correction terms gives:

$$\hat{m}_1 = \frac{m_1}{1 - \beta_1} = g_1, \qquad \hat{v}_1 = \frac{v_1}{1 - \beta_2} = g_1^2$$

Hence the first corrected update is:

$$\theta_1 = \theta_0 - \eta\, \frac{g_1}{\sqrt{g_1^2} + \epsilon} \approx \theta_0 - \eta\, \operatorname{sign}(g_1)$$
A much better-behaved update is therefore obtained:

  • the update direction is aligned with the true gradient,
  • the initial scale distortion caused by zero initialization is removed,
  • the optimizer starts from a controlled regime rather than from a badly mis-scaled one.
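The corrected first step can likewise be verified numerically; again an illustrative one-step computation with an arbitrary first gradient.

```python
import math

beta1, beta2, lr, eps = 0.9, 0.999, 0.001, 1e-8
g1 = 0.5
m_hat = ((1 - beta1) * g1) / (1 - beta1)       # bias-corrected: equals g1
v_hat = ((1 - beta2) * g1**2) / (1 - beta2)    # bias-corrected: equals g1**2
step = -lr * m_hat / (math.sqrt(v_hat) + eps)  # close to -lr * sign(g1)
```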

Why the correction is especially important for $v_t$

Since $\beta_2$ is very close to $1$, the second-moment estimate $v_t$ is heavily biased toward zero at the start if left uncorrected. That is precisely the term $\sqrt{\hat{v}_t}$ sitting in the denominator. This is why bias correction is a central part of Adam rather than a minor technical addition.

2.3.4 Why the correction matters mostly at the beginning

As $t$ grows,

$$1 - \beta_1^t \to 1, \qquad 1 - \beta_2^t \to 1$$

because both $\beta_1$ and $\beta_2$ are chosen in the interval $[0, 1)$. Whenever a number $\beta$ satisfies $|\beta| < 1$, repeated multiplication by $\beta$ makes it exponentially smaller:

$$\beta^t \to 0 \quad \text{as } t \to \infty$$

For example:

$$0.9^{10} \approx 0.35, \qquad 0.9^{100} \approx 2.7 \times 10^{-5}$$

and

$$0.999^{1000} \approx 0.37, \qquad 0.999^{10000} \approx 4.5 \times 10^{-5}$$

Thus, even if the decay is slow when $\beta$ is very close to $1$, the power $\beta^t$ still eventually vanishes.

Hence

$$\hat{m}_t \to m_t, \qquad \hat{v}_t \to v_t$$

Therefore, the bias-correction terms matter mainly during the startup phase. After enough iterations, Adam behaves like an adaptive optimizer based on well-formed moving averages.
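The vanishing of the correction denominators is easy to tabulate; the iteration counts below are illustrative choices.

```python
beta1, beta2 = 0.9, 0.999
# The correction denominators 1 - beta**t start tiny and approach 1
d1 = [1 - beta1**t for t in (1, 10, 100)]        # 0.1, ~0.65, ~1.0
d2 = [1 - beta2**t for t in (1, 1000, 10000)]    # 0.001, ~0.63, ~1.0
```

Note how much longer the $\beta_2$ denominator stays far from 1: with $\beta_2 = 0.999$, roughly a thousand iterations pass before the second-moment bias has mostly decayed.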


2.4 Dimensional consideration

In much of Deep Learning, parameters and losses are treated as dimensionless quantities. Even under that convention, it remains useful to check that Adam is internally consistent:

  • $m_t$ and $\hat{m}_t$ have the same units as the gradient,
  • $v_t$ and $\hat{v}_t$ have the units of the gradient squared,
  • $\sqrt{\hat{v}_t}$ has the same units as the gradient,
  • therefore $\hat{m}_t / \sqrt{\hat{v}_t}$ is dimensionless.

Thus the update

$$\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

is dimensionally consistent.

Info

The dimensional check also clarifies why the denominator must be $\sqrt{\hat{v}_t}$ rather than $\hat{v}_t$. Without the square root, the numerator and denominator would not live on compatible scales.
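A quick numeric illustration of this scale argument, using our own helper under the constant-gradient simplification in which $\hat{m}_t = g$ and $\hat{v}_t = g^2$: with the square root, the step is invariant to the overall gradient scale; without it, the step shrinks as gradients grow.

```python
import math

def step_size(g, use_sqrt=True, lr=0.001, eps=1e-12):
    # Constant-gradient simplification: m_hat = g, v_hat = g**2
    denom = math.sqrt(g * g) if use_sqrt else g * g
    return lr * g / (denom + eps)

a = step_size(0.1)                    # small gradients
b = step_size(10.0)                   # 100x larger gradients, same step
c = step_size(10.0, use_sqrt=False)   # no sqrt: step now depends on scale
```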


3. Summary

Adam is best understood as an optimizer built from three ingredients:

  • momentum-like smoothing through the EMA of gradients,
  • RMSProp-like adaptivity through the EMA of squared gradients,
  • bias correction to neutralize the zero-initialization distortion of both EMAs.

Its main strengths are:

  • stable and effective behavior from the very first iterations,
  • per-parameter adaptive step sizes,
  • fast practical convergence on a wide range of deep-learning problems.

Its key conceptual advantage over a plain RMSProp-style method is that the startup phase is handled much more carefully:

  • the first-moment estimate is corrected,
  • the second-moment estimate is corrected,
  • the early effective learning rate is far better controlled.

Practical note

Adam remains one of the strongest default optimizers in Deep Learning. However, when weight decay is needed, the modern practical choice is usually AdamW, not “Adam + L2 mixed into the gradient”, because the regularization term should be decoupled from Adam’s adaptive rescaling.
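The decoupling can be sketched as follows. This is an illustrative implementation (our own `update` function, assuming NumPy), not the exact code of any framework; the `decoupled` flag switches between the two placements of weight decay.

```python
import numpy as np

def update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
           eps=1e-8, wd=0.01, decoupled=True):
    # "Adam + L2": decay folded into the gradient, so it is rescaled
    # by the adaptive denominator like any other gradient term
    if not decoupled:
        grad = grad + wd * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    # AdamW: decay applied directly to the weights, outside the rescaling
    if decoupled:
        theta = theta - lr * wd * theta
    return theta, m, v
```

With a zero gradient, the L2 version still takes a nearly full-size adaptive step driven purely by the decay term, while the decoupled version shrinks the weight by exactly `lr * wd`; that amplification of the decay by the adaptive denominator is the distortion AdamW removes.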