Intro
Adam (Adaptive Moment Estimation) combines two key ideas:
- Momentum: accumulation of past gradients.
- RMSProp: accumulation of squared gradients.
In Adam, an EMA (Exponential Moving Average) is applied to both quantities: to the first moment (the gradients, as in Momentum) and to the second moment (the squared gradients, as in RMSProp).
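To make the two ingredients concrete, here is a minimal NumPy sketch of the two update rules in their EMA-style form (function names, the learning rate, and the decay coefficients are illustrative choices, not values prescribed by any specific library):

```python
import numpy as np

def momentum_step(theta, grad, m, lr=0.01, beta=0.9):
    """Momentum (EMA form): accumulate past gradients and step along the average."""
    m = beta * m + (1 - beta) * grad          # EMA of the gradient history
    return theta - lr * m, m

def rmsprop_step(theta, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: accumulate squared gradients and use them to scale the step."""
    v = beta * v + (1 - beta) * grad ** 2     # EMA of the squared-gradient history
    return theta - lr * grad / (np.sqrt(v) + eps), v
```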
How Adam addresses EMA being biased toward zero
Since an EMA initialized at zero is biased toward zero, the second-moment estimate underestimates the true magnitude of the gradients early on, which can translate into extremely large effective learning rates during the very first training iterations. Adam therefore introduces a bias-correction mechanism that stabilizes updates in the early stages of training.
Adam’s strength
Thanks to the bias correction and combined use of momentum + RMSProp, Adam achieves a more balanced and robust learning rate behavior from the very start of training: fast exploration early on without wild divergence, followed by controlled convergence.
Adam in depth
Adam update rule
The update rule prescribed by Adam is:

$$
\theta_{t+1} = \theta_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

where:
- $\hat{m}_t$: bias-corrected EMA of gradients
- $\hat{v}_t$: bias-corrected EMA of squared gradients
- $\alpha$: base learning rate
- $\epsilon$: small term for numerical stability
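As a reference point for the rest of the section, here is a minimal NumPy sketch of one Adam step, written directly from the rule above (names such as `adam_step` are illustrative; the defaults are the ones suggested in the paper):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); m and v are the running EMAs."""
    m = beta1 * m + (1 - beta1) * grad            # EMA of gradients (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2       # EMA of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```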
EMA in Adam
Adam uses two Exponential Moving Averages (EMAs):
- one on the sequence of gradients (equivalent to Momentum),
- one on the squared gradients (same principle behind RMSProp).
The table below summarizes their respective formulas and roles.
| Recursive formula | Role |
|---|---|
| $m_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t$ | Exponential Moving Average (EMA) applied to the history of gradients (borrowed from the Momentum technique). Instead of using only the current gradient, the update relies on an EMA of past gradients: the current step depends not only on the present gradient but also on gradients from previous steps, weighted by the coefficient $\beta_1$. |
| $v_t = \beta_2 \, v_{t-1} + (1 - \beta_2) \, g_t^2$ | Exponential Moving Average (EMA) applied to the history of squared gradients (same principle behind RMSProp). |
Default parameters (from the original paper): $\beta_1 = 0.9$ and $\beta_2 = 0.999$.
These two coefficients are used in the EMAs above, but more importantly, they are essential to correct the zero-bias inherent in such exponential moving averages.
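A quick way to see the zero-bias is to feed the two EMAs a constant gradient of 1 and watch how slowly they approach it (a toy scalar example; the constant gradient is purely illustrative):

```python
beta1, beta2 = 0.9, 0.999      # defaults from the paper
m, v = 0.0, 0.0                # both EMAs start at zero
for t in range(1, 6):
    g = 1.0                    # pretend the gradient is constantly 1
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    print(t, round(m, 4), round(v, 6))
# m creeps toward 1 (0.1, 0.19, 0.271, ...) and v even more slowly
# (0.001, 0.002, 0.003, ...): both estimates are dragged toward their
# zero initialization in the early iterations.
```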
Correcting EMA bias at initialization
EMA is biased toward zero
At the beginning of training (small $t$), the two EMAs exhibit a bias toward zero, since they are initialized at zero ($m_0 = v_0 = 0$) and computed over only a very small number of gradients.
No bias correction scenario
If we consider the update rule without bias correction:

$$
\theta_{t+1} = \theta_t - \alpha \, \frac{m_t}{\sqrt{v_t} + \epsilon}
$$

then in the very first iterations, since $m_0 = 0$ and $v_0 = 0$, the estimates $m_t$ and $v_t$ turn out to be very small, even if the gradient is not.

In fact, at the first step ($t = 1$), we have:

$$
m_1 = (1 - \beta_1) \, g_1, \qquad v_1 = (1 - \beta_2) \, g_1^2
$$

With the values suggested in the original paper ($\beta_1 = 0.9$, $\beta_2 = 0.999$), the initial coefficients are:

$$
1 - \beta_1 = 0.1, \qquad 1 - \beta_2 = 0.001
$$
Thus, both $m_1$ and $v_1$ are heavily attenuated with respect to the raw gradient statistics:

$$
m_1 = 0.1 \, g_1, \qquad v_1 = 0.001 \, g_1^2
$$

Note that the denominator $\sqrt{v_1} \approx 0.032\,|g_1|$ shrinks even more than the numerator $m_1 = 0.1\,g_1$, so the uncorrected ratio $m_1 / (\sqrt{v_1} + \epsilon) \approx 3.16\,\mathrm{sign}(g_1)$ is distorted: the size of the very first update no longer reflects the gradient at all.
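As a quick numeric check of the distortion (a scalar toy example; the gradient value 0.5 is arbitrary):

```python
g1 = 0.5                               # arbitrary example gradient at t = 1
beta1, beta2, eps = 0.9, 0.999, 1e-8

m1 = (1 - beta1) * g1                  # 0.05    -> far smaller than g1
v1 = (1 - beta2) * g1 ** 2             # 0.00025 -> far smaller than g1**2

# Without bias correction the ratio is ~3.16 * sign(g1), regardless of |g1|:
print(m1, v1, m1 / (v1 ** 0.5 + eps))  # 0.05 0.00025 ~3.162
```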
Bias correction for EMAs at initialization
To avoid this issue, Adam introduces a bias correction through the following terms:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$
With bias correction
Applying bias correction, the update rule becomes:

$$
\theta_{t+1} = \theta_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

At the first step ($t = 1$), this gives:

$$
\hat{m}_1 = \frac{m_1}{1 - \beta_1} = g_1, \qquad \hat{v}_1 = \frac{v_1}{1 - \beta_2} = g_1^2
$$
Thus, thanks to the correction, the moment estimates match the true gradient statistics from the very first iteration, and the initial step stays on the scale set by the learning rate:

$$
\theta_2 = \theta_1 - \alpha \, \frac{g_1}{\sqrt{g_1^2} + \epsilon} \approx \theta_1 - \alpha \, \mathrm{sign}(g_1)
$$
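Repeating the same toy calculation with the correction applied shows the first step landing back on the scale of the learning rate (again, the gradient value is arbitrary):

```python
g1 = 0.5
beta1, beta2, eps = 0.9, 0.999, 1e-8

m1 = (1 - beta1) * g1
v1 = (1 - beta2) * g1 ** 2

m1_hat = m1 / (1 - beta1 ** 1)         # = g1      = 0.5
v1_hat = v1 / (1 - beta2 ** 1)         # = g1**2   = 0.25

# Corrected ratio ~ sign(g1): the first update is ~ lr in magnitude.
print(m1_hat, v1_hat, m1_hat / (v1_hat ** 0.5 + eps))  # 0.5 0.25 ~1.0
```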
Dimensional consideration
In Deep Learning, the parameters $\theta$ (and typically the loss $L$) are treated as dimensionless quantities. Keeping this convention in mind, in Adam’s update rule:
- Numerator ($\hat{m}_t$): the bias-corrected EMA of the gradients has the same units as the gradient. This ensures that the update direction is consistent with gradient descent: updates are proportional to the gradient, not to its square.
- Denominator ($\sqrt{\hat{v}_t}$): the square root of the bias-corrected EMA of squared gradients also has the same units as the gradient. Dividing $\hat{m}_t$ by $\sqrt{\hat{v}_t}$ therefore yields a dimensionless ratio.
As a result, the update step

$$
\Delta\theta_t = -\,\alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$
remains dimensionally consistent (i.e., dimensionless) and interpretable as a scaled gradient descent step.
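The same argument can be written as a compact unit check, using $[x]$ to denote the units of a quantity $x$ and $[g]$ for the units of the gradient:

$$
[\hat{m}_t] = [g], \qquad \left[\sqrt{\hat{v}_t}\right] = \sqrt{[g]^2} = [g], \qquad \left[\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\right] = \frac{[g]}{[g]} = 1
$$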
Summary
Adam sets itself apart from other optimizers because it works well right from the very first training steps, thanks to the bias-correction applied to its EMAs.
- It applies bias-correction to both the first moment ($m_t$) and second moment ($v_t$) estimates, counteracting their initialization at zero.
- This prevents the denominators in the update rule from being too close to zero, ensuring stable updates even in the earliest training iterations.
- After the initial transient phase, Adam behaves similarly to other adaptive optimizers, but its key advantage lies in handling the start of training reliably, where other methods may diverge or stall.
- It also retains per-parameter adaptivity: parameters with large or frequent gradients receive smaller effective steps, while parameters with small or infrequent gradients are not overshadowed (see the toy example below).
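To make the last point concrete, here is a hypothetical two-parameter experiment using the `adam_step` sketch from earlier (gradient magnitudes and step counts are arbitrary):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Same minimal Adam step as sketched above."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 101):
    grad = np.array([10.0, 0.01])   # one large-gradient and one small-gradient parameter
    theta, m, v = adam_step(theta, grad, m, v, t)

# Despite gradients differing by three orders of magnitude, both parameters
# have moved by roughly lr per step (~0.1 in total): per-parameter scaling
# keeps neither coordinate overshadowed.
print(theta)
```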
Info
Adam is a robust and self-adaptive optimizer, and has become the de facto standard for training deep networks.