1. Intro
Adam (Adaptive Moment Estimation) combines two complementary ideas:
- a first-moment EMA of the gradients, which plays the role of momentum,
- a second-moment EMA of the squared gradients, inherited from RMSProp.
Instead of updating each parameter from the raw gradient alone, Adam builds:
- a smoothed estimate of the update direction,
- a smoothed estimate of the gradient scale.
For this reason, Adam is often summarized as a combination of Momentum + RMSProp.
Core idea
Adam may be interpreted as follows:
- Momentum in the numerator: the EMA of gradients stabilizes the direction of motion.
- RMSProp in the denominator: the EMA of squared gradients rescales each parameter adaptively.
- Bias correction on top: the zero-initialization bias of both EMAs is removed during the first iterations.
Why bias correction matters
Exponential moving averages initialized at zero are systematically biased toward zero at the beginning of training. In Adam this matters twice:
- the first-moment estimate underestimates the true average gradient,
- the second-moment estimate underestimates the gradient magnitude that appears in the denominator.
Without correction, the early updates are not properly scaled.
Adam's practical strength
Strong behavior is usually observed from the very first iterations:
- rapid initial progress is often obtained,
- the step size is adapted parameter-by-parameter,
- much of the unstable startup behavior that affects adaptive methods without bias correction is avoided.
2. Adam in depth
Info
This note describes the core Adam algorithm itself, not the full space of implementation variants and framework-specific options.
2.1 Adam update rule
Let $g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$ be the gradient of the loss with respect to the parameters $\theta$ at iteration $t$.
Adam maintains two exponential moving averages:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
with initialization $m_0 = 0$ and $v_0 = 0$.
Because both moving averages start at zero, Adam corrects them as follows:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
The parameter update is then:
$$\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$
where:
- $\alpha$ is the base learning rate,
- $\beta_1$ controls the memory of the first moment,
- $\beta_2$ controls the memory of the second moment,
- $\epsilon$ is a small numerical-stability constant.
The default values proposed in the original Adam paper are:
$$\alpha = 0.001, \qquad \beta_1 = 0.9, \qquad \beta_2 = 0.999, \qquad \epsilon = 10^{-8}.$$
Note
The square root in $\sqrt{\hat{v}_t}$ is essential. Since $v_t$ tracks a second moment, taking the square root brings the denominator back to the same scale as the gradient.
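The update rule above can be sketched as a single NumPy function. This is a minimal illustration of the equations in this section, not a production optimizer (the function name `adam_step` and the toy quadratic objective are chosen here for the example):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for iteration t (1-based)."""
    m = beta1 * m + (1 - beta1) * g         # first-moment EMA (momentum role)
    v = beta2 * v + (1 - beta2) * g**2      # second-moment EMA (RMSProp role)
    m_hat = m / (1 - beta1**t)              # bias correction of the numerator
    v_hat = v / (1 - beta2**t)              # bias correction of the denominator
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2 for a few steps.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    g = 2 * theta                           # gradient of x^2
    theta, m, v = adam_step(theta, g, m, v, t)
```

Note that `t` must start at 1, otherwise the bias-correction denominators $1 - \beta^t$ would be zero.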
2.2 EMA in Adam
Adam uses two distinct EMAs, each with a different role in the update.
| Quantity | Recursive formula | Role |
|---|---|---|
| First moment $m_t$ | $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$ | Tracks a smoothed version of the gradient; this stabilizes direction and plays the role of momentum |
| Second moment $v_t$ | $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$ | Tracks a smoothed version of squared gradient magnitudes; this rescales each parameter adaptively |
The qualitative effect is the following:
- if a parameter has received large recent gradients, then $v_t$ becomes large and the effective step on that parameter is reduced;
- if a parameter has received small or infrequent gradients, then $v_t$ stays smaller and that parameter is updated more aggressively.
Interpretation
In Adam, two roles that are not separated in plain SGD are handled explicitly:
- the numerator determines the update direction,
- the denominator determines the update scale for each parameter.
2.3 Correcting EMA bias at initialization
2.3.1 EMA is biased toward zero at the beginning
Because $m_0 = 0$ and $v_0 = 0$, the first few values of the two EMAs are pulled toward zero by construction.
If the gradient statistics are roughly stationary over the initial iterations, then:
- the average gradient does not change dramatically from one step to the next,
- the average squared gradient does not change dramatically either,
- in other words, during the very early phase of training the gradient process may be treated as having approximately stable first and second moments.
Concretely, this means that for the first few steps the following approximation is used:
$$\mathbb{E}[g_i] \approx \mathbb{E}[g_t] \quad \text{and} \quad \mathbb{E}[g_i^2] \approx \mathbb{E}[g_t^2] \quad \text{for } i \le t.$$
This assumption does not assert that the gradients are exactly constant. It is only a local approximation used to isolate the effect of zero initialization of the EMAs. Under this approximation, the bias formulas below show clearly that the factors $1 - \beta_1^t$ and $1 - \beta_2^t$ arise from initialization, not from some special property of the loss.
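Under this stationarity approximation, the bias factor can be derived by unrolling the EMA recursion for the first moment (the second moment is analogous):

$$\mathbb{E}[m_t] = \mathbb{E}\Big[(1 - \beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i\Big] \approx (1 - \beta_1)\, \mathbb{E}[g_t] \sum_{i=1}^{t} \beta_1^{\,t-i} = (1 - \beta_1^t)\, \mathbb{E}[g_t],$$

since the geometric sum equals $(1 - \beta_1^t)/(1 - \beta_1)$. Dividing $m_t$ by $1 - \beta_1^t$ therefore yields an approximately unbiased estimate of the average gradient.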
On the use of expectations such as $\mathbb{E}[g_t]$
For readers accustomed to taking expectations only of explicit random variables, the following statistical viewpoint is useful.
In full-batch gradient descent, once the dataset and the parameter vector are fixed, the gradient is deterministic. In mini-batch SGD, however, the gradient computed at step $t$ depends on the randomly selected mini-batch. Therefore, for a fixed parameter vector $\theta$, the gradient may be modeled as a random variable:
$$g_t = \nabla_\theta \mathcal{L}(\theta; B_t),$$
where $B_t$ denotes the random mini-batch sampled at iteration $t$. The actually observed gradient is then one realization of that random variable.
Under this viewpoint:
- $\mathbb{E}[g_t]$ is the average gradient over all possible mini-batches,
- $\mathbb{E}[g_t^2]$ is the average squared gradient over all possible mini-batches.
This is exactly the same kind of expectation used in probability theory for any random variable. In informal optimizer notes, the same symbol $g_t$ is often used both for the random quantity and for one realized value. That notation is slightly abusive, but standard.
The expectation is introduced here because the bias of an EMA is fundamentally a statistical statement about averages over repeated draws, not only a statement about one single realized training trajectory.
This is the origin of the correction factors $1/(1 - \beta_1^t)$ and $1/(1 - \beta_2^t)$.
Note
Bias correction does not mean that Adam becomes magically perfect in every non-stationary setting. It means that Adam removes the specific distortion induced by initializing the EMAs at zero.
2.3.2 First-step behavior
At the first iteration ($t = 1$), before bias correction:
$$m_1 = (1 - \beta_1)\, g_1, \qquad v_1 = (1 - \beta_2)\, g_1^2.$$
Therefore, without bias correction, the first Adam update would be:
$$\theta_1 = \theta_0 - \alpha \, \frac{(1 - \beta_1)\, g_1}{\sqrt{(1 - \beta_2)\, g_1^2} + \epsilon}.$$
If $\epsilon$ is negligible compared with $\sqrt{1 - \beta_2}\, |g_1|$, this becomes approximately:
$$\theta_1 \approx \theta_0 - \alpha \, \frac{1 - \beta_1}{\sqrt{1 - \beta_2}} \, \operatorname{sign}(g_1).$$
With the default values:
$$\frac{1 - \beta_1}{\sqrt{1 - \beta_2}} = \frac{0.1}{\sqrt{0.001}} \approx 3.16.$$
The first step is therefore not “almost negligible”. If anything, it can be too large, because the denominator is underestimated even more strongly than the numerator.
Common misconception
A small $m_1$ and a small $v_1$ do not imply a tiny Adam update. Since $v_1$ appears under a square root in the denominator, the early step can actually be amplified.
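A quick numeric check confirms the amplification (the first gradient $g_1 = 0.5$ is an arbitrary illustrative choice; the ratio does not depend on it):

```python
import math

alpha, beta1, beta2 = 1e-3, 0.9, 0.999
g1 = 0.5                                   # arbitrary nonzero first gradient
m1 = (1 - beta1) * g1                      # uncorrected first moment at t=1
v1 = (1 - beta2) * g1**2                   # uncorrected second moment at t=1
step = alpha * m1 / math.sqrt(v1)          # epsilon neglected for clarity
print(step / alpha)                        # ≈ 3.162, independent of g1
```

The uncorrected first step is roughly $3.16$ times larger than the base learning rate, exactly the ratio $(1 - \beta_1)/\sqrt{1 - \beta_2}$ derived above.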
2.3.3 With bias correction
Applying the bias-correction terms gives:
$$\hat{m}_1 = \frac{m_1}{1 - \beta_1} = g_1, \qquad \hat{v}_1 = \frac{v_1}{1 - \beta_2} = g_1^2.$$
Hence the first corrected update is:
$$\theta_1 = \theta_0 - \alpha \, \frac{g_1}{|g_1| + \epsilon} \approx \theta_0 - \alpha \, \operatorname{sign}(g_1).$$
A much better-behaved update is therefore obtained:
- the update direction is aligned with the true gradient,
- the initial scale distortion caused by zero initialization is removed,
- the optimizer starts from a controlled regime rather than from a badly mis-scaled one.
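A quick numeric check (again with an arbitrary first gradient $g_1 = 0.5$) shows the corrected first-step magnitude returning to roughly $\alpha$:

```python
import math

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
g1 = 0.5                                        # arbitrary nonzero first gradient
m_hat = ((1 - beta1) * g1) / (1 - beta1)        # = g1 after correction
v_hat = ((1 - beta2) * g1**2) / (1 - beta2)     # = g1**2 after correction
step = alpha * m_hat / (math.sqrt(v_hat) + eps)
print(step)                                     # ≈ alpha, i.e. ≈ 0.001
```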
Why the correction is especially important for $\hat{v}_t$
Since $\beta_2 = 0.999$ is very close to $1$, the second-moment estimate $v_t$ is heavily biased toward zero at the start if left uncorrected. That is precisely the term sitting in the denominator. This is why bias correction is a central part of Adam rather than a minor technical addition.
2.3.4 Why the correction matters mostly at the beginning
As $t$ grows,
$$\beta_1^t \to 0 \quad \text{and} \quad \beta_2^t \to 0,$$
because both $\beta_1$ and $\beta_2$ are chosen in the interval $(0, 1)$. Whenever a number $q$ satisfies $0 < q < 1$, repeated multiplication by $q$ makes it exponentially smaller:
$$q^t = e^{t \ln q} \to 0 \quad \text{as } t \to \infty.$$
For example:
$$0.9^{10} \approx 0.35, \qquad 0.9^{100} \approx 2.7 \times 10^{-5},$$
and
$$0.999^{1000} \approx 0.37, \qquad 0.999^{10000} \approx 4.5 \times 10^{-5}.$$
Thus, even if the decay is slow when $\beta_2$ is very close to $1$, the power $\beta_2^t$ still eventually vanishes.
Hence
$$1 - \beta_1^t \to 1 \quad \text{and} \quad 1 - \beta_2^t \to 1, \qquad \text{so} \quad \hat{m}_t \to m_t \quad \text{and} \quad \hat{v}_t \to v_t.$$
Therefore, the bias-correction terms matter mainly during the startup phase. After enough iterations, Adam behaves like an adaptive optimizer based on well-formed moving averages.
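The decay of the correction factors can be tabulated directly, a minimal sketch using the default $\beta$ values:

```python
beta1, beta2 = 0.9, 0.999
for t in (1, 10, 100, 1000, 10000):
    c1 = 1 / (1 - beta1**t)   # first-moment correction factor
    c2 = 1 / (1 - beta2**t)   # second-moment correction factor
    print(f"t={t:>5}  c1={c1:.4f}  c2={c2:.4f}")
# c1 starts at 10 and c2 at 1000; both decay toward 1 as t grows.
```

The table makes the asymmetry visible: the second-moment factor starts at $1000$, which is why the denominator is the term most distorted at startup.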
2.4 Dimensional consideration
In much of Deep Learning, parameters and losses are treated as dimensionless quantities. Even under that convention, it remains useful to check that Adam is internally consistent:
- $m_t$ and $\hat{m}_t$ have the same units as the gradient,
- $v_t$ and $\hat{v}_t$ have the units of the gradient squared,
- $\sqrt{\hat{v}_t}$ has the same units as the gradient,
- therefore $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ is dimensionless.
Thus the update
$$\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
is dimensionally consistent.
Info
The dimensional check also clarifies why the denominator must be $\sqrt{\hat{v}_t}$ rather than $\hat{v}_t$. Without the square root, the numerator and denominator would not live on compatible scales.
3. Summary
Adam is best understood as an optimizer built from three ingredients:
- momentum-like smoothing through the EMA of gradients,
- RMSProp-like adaptivity through the EMA of squared gradients,
- bias correction to neutralize the zero-initialization distortion of both EMAs.
Its main strengths are:
- stable and effective behavior from the very first iterations,
- per-parameter adaptive step sizes,
- fast practical convergence on a wide range of deep-learning problems.
Its key conceptual advantage over a plain RMSProp-style method is that the startup phase is handled much more carefully:
- the first-moment estimate is corrected,
- the second-moment estimate is corrected,
- the early effective learning rate is far better controlled.
Practical note
Adam remains one of the strongest default optimizers in Deep Learning. However, when weight decay is needed, the modern practical choice is usually AdamW, not “Adam + L2 mixed into the gradient”, because the regularization term should be decoupled from Adam’s adaptive rescaling.