1. Intro
Adam (Adaptive Moment Estimation) combines two complementary ideas:
- a first-moment EMA of the gradients, which plays the role of momentum,
- a second-moment EMA of the squared gradients, inherited from RMSProp.
Instead of updating each parameter from the raw gradient alone, Adam builds:
- a smoothed estimate of the update direction,
- a smoothed estimate of the gradient scale.
For this reason, Adam is often summarized as a combination of Momentum + RMSProp.
Core idea
Adam may be interpreted as follows:
- Momentum in the numerator: the EMA of gradients stabilizes the direction of motion.
- RMSProp in the denominator: the EMA of squared gradients rescales each parameter adaptively.
- Bias correction on top: the zero-initialization bias of both EMAs is removed during the first iterations.
Why bias correction matters
Exponential moving averages initialized at zero are systematically biased toward zero at the beginning of training. In Adam this matters twice:
- the first-moment estimate underestimates the true average gradient,
- the second-moment estimate underestimates the gradient magnitude that appears in the denominator.
Without correction, the early updates are not properly scaled.
Adam's practical strength
Strong behavior is usually observed from the very first iterations:
- rapid initial progress is often obtained,
- the step size is adapted parameter-by-parameter,
- much of the unstable startup behavior that affects adaptive methods without bias correction is avoided.
2. Adam in depth
Info
This note describes the core Adam algorithm itself, not the full space of implementation variants and framework-specific options.
2.1 Adam update rule
Let $g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$ be the gradient of the loss with respect to the parameters $\theta$ at iteration $t$.
Adam maintains two exponential moving averages:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
with initialization $m_0 = 0$ and $v_0 = 0$.
Because both moving averages start at zero, Adam corrects them as follows:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
The parameter update is then:
$$\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$
where:
- $\alpha$ is the base learning rate,
- $\beta_1$ controls the memory of the first moment,
- $\beta_2$ controls the memory of the second moment,
- $\epsilon$ is a small numerical-stability constant.
The default values proposed in the original Adam paper are:
$$\alpha = 0.001, \qquad \beta_1 = 0.9, \qquad \beta_2 = 0.999, \qquad \epsilon = 10^{-8}.$$
Note
The square root in $\sqrt{\hat{v}_t}$ is essential. Since $v_t$ tracks a second moment, taking the square root brings the denominator back to the same scale as the gradient.
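The update rule above can be sketched as a single NumPy function. This is a minimal illustration of the equations in this section, not a production optimizer (the function name `adam_step` and the toy quadratic objective are chosen here for the example):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for iteration t (1-based)."""
    m = beta1 * m + (1 - beta1) * g         # first-moment EMA (momentum role)
    v = beta2 * v + (1 - beta2) * g**2      # second-moment EMA (RMSProp role)
    m_hat = m / (1 - beta1**t)              # bias correction of the numerator
    v_hat = v / (1 - beta2**t)              # bias correction of the denominator
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2 for a few steps.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    g = 2 * theta                           # gradient of x^2
    theta, m, v = adam_step(theta, g, m, v, t)
```

Note that `t` must start at 1, otherwise the bias-correction denominators $1 - \beta^t$ would be zero.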
2.2 EMA in Adam
Adam uses two distinct EMAs, each with a different role in the update.
| Quantity | Recursive formula | Role |
|---|---|---|
| First moment $m_t$ | $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$ | Tracks a smoothed version of the gradient; this stabilizes direction and plays the role of momentum |
| Second moment $v_t$ | $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$ | Tracks a smoothed version of squared gradient magnitudes; this rescales each parameter adaptively |
The qualitative effect is the following:
- if a parameter has received large recent gradients, then $v_t$ becomes large and the effective step on that parameter is reduced;
- if a parameter has received small or infrequent gradients, then $v_t$ stays smaller and that parameter is updated more aggressively.
Interpretation
In Adam, two roles that are not separated in plain SGD are handled explicitly:
- the numerator determines the update direction,
- the denominator determines the update scale for each parameter.
2.3 Correcting EMA bias at initialization
2.3.1 EMA is biased toward zero at the beginning
Because $m_0 = 0$ and $v_0 = 0$, the first few values of the two EMAs are pulled toward zero by construction.
If the gradient statistics are roughly stationary over the initial iterations, then:
- the average gradient does not change dramatically from one step to the next,
- the average squared gradient does not change dramatically either,
- in other words, during the very early phase of training the gradient process may be treated as having approximately stable first and second moments.
Concretely, this means that for the first few steps the following approximation is used:
$$\mathbb{E}[g_i] \approx \mathbb{E}[g_t] \quad \text{and} \quad \mathbb{E}[g_i^2] \approx \mathbb{E}[g_t^2] \quad \text{for } i \le t.$$
This assumption does not assert that the gradients are exactly constant. It is only a local approximation used to isolate the effect of zero initialization of the EMAs. Under this approximation, the bias formulas below show clearly that the factors $1 - \beta_1^t$ and $1 - \beta_2^t$ arise from initialization, not from some special property of the loss.
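Under this stationarity approximation, the bias factor can be derived by unrolling the EMA recursion for the first moment (the second moment is analogous):

$$\mathbb{E}[m_t] = \mathbb{E}\Big[(1 - \beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i\Big] \approx (1 - \beta_1)\, \mathbb{E}[g_t] \sum_{i=1}^{t} \beta_1^{\,t-i} = (1 - \beta_1^t)\, \mathbb{E}[g_t],$$

since the geometric sum equals $(1 - \beta_1^t)/(1 - \beta_1)$. Dividing $m_t$ by $1 - \beta_1^t$ therefore yields an approximately unbiased estimate of the average gradient.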
On the use of expectations such as $\mathbb{E}[g_t]$
For readers accustomed to taking expectations only of explicit random variables, the following statistical viewpoint is useful.
In full-batch gradient descent, once the dataset and the parameter vector are fixed, the gradient is deterministic. In mini-batch SGD, however, the gradient computed at step $t$ depends on the randomly selected mini-batch. Therefore, for a fixed parameter vector $\theta$, the gradient may be modeled as a random variable:
$$g_t = \nabla_\theta \mathcal{L}(\theta; B_t),$$
where $B_t$ denotes the random mini-batch sampled at iteration $t$. The actually observed gradient is then one realization of that random variable.
Under this viewpoint:
- $\mathbb{E}[g_t]$ is the average gradient over all possible mini-batches,
- $\mathbb{E}[g_t^2]$ is the average squared gradient over all possible mini-batches.
This is exactly the same kind of expectation used in probability theory for any random variable. In informal optimizer notes, the same symbol $g_t$ is often used both for the random quantity and for one realized value. That notation is slightly abusive, but standard.
The expectation is introduced here because the bias of an EMA is fundamentally a statistical statement about averages over repeated draws, not only a statement about one single realized training trajectory.
This is the origin of the correction factors $1/(1 - \beta_1^t)$ and $1/(1 - \beta_2^t)$.
Note
Bias correction does not mean that Adam becomes magically perfect in every non-stationary setting. It means that Adam removes the specific distortion induced by initializing the EMAs at zero.
2.3.2 First-step behavior
At the first iteration ($t = 1$), before bias correction:
$$m_1 = (1 - \beta_1)\, g_1, \qquad v_1 = (1 - \beta_2)\, g_1^2.$$
Therefore, without bias correction, the first Adam update would be:
$$\theta_1 = \theta_0 - \alpha \, \frac{(1 - \beta_1)\, g_1}{\sqrt{(1 - \beta_2)\, g_1^2} + \epsilon}.$$
If $\epsilon$ is negligible compared with $\sqrt{1 - \beta_2}\, |g_1|$, this becomes approximately:
$$\theta_1 \approx \theta_0 - \alpha \, \frac{1 - \beta_1}{\sqrt{1 - \beta_2}} \, \operatorname{sign}(g_1).$$
With the default values:
$$\frac{1 - \beta_1}{\sqrt{1 - \beta_2}} = \frac{0.1}{\sqrt{0.001}} \approx 3.16.$$
The first step is therefore not “almost negligible”. If anything, it can be too large, because the denominator is underestimated even more strongly than the numerator.
Common misconception
A small $m_1$ and a small $v_1$ do not imply a tiny Adam update. Since $v_1$ appears under a square root in the denominator, the early step can actually be amplified.
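A quick numeric check confirms the amplification (the first gradient $g_1 = 0.5$ is an arbitrary illustrative choice; the ratio does not depend on it):

```python
import math

alpha, beta1, beta2 = 1e-3, 0.9, 0.999
g1 = 0.5                                   # arbitrary nonzero first gradient
m1 = (1 - beta1) * g1                      # uncorrected first moment at t=1
v1 = (1 - beta2) * g1**2                   # uncorrected second moment at t=1
step = alpha * m1 / math.sqrt(v1)          # epsilon neglected for clarity
print(step / alpha)                        # ≈ 3.162, independent of g1
```

The uncorrected first step is roughly $3.16$ times larger than the base learning rate, exactly the ratio $(1 - \beta_1)/\sqrt{1 - \beta_2}$ derived above.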
2.3.3 With bias correction
Applying the bias-correction terms gives:
$$\hat{m}_1 = \frac{m_1}{1 - \beta_1} = g_1, \qquad \hat{v}_1 = \frac{v_1}{1 - \beta_2} = g_1^2.$$
Hence the first corrected update is:
$$\theta_1 = \theta_0 - \alpha \, \frac{g_1}{|g_1| + \epsilon} \approx \theta_0 - \alpha \, \operatorname{sign}(g_1).$$
A much better-behaved update is therefore obtained:
- the update direction is aligned with the true gradient,
- the initial scale distortion caused by zero initialization is removed,
- the optimizer starts from a controlled regime rather than from a badly mis-scaled one.
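A quick numeric check (again with an arbitrary first gradient $g_1 = 0.5$) shows the corrected first-step magnitude returning to roughly $\alpha$:

```python
import math

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
g1 = 0.5                                        # arbitrary nonzero first gradient
m_hat = ((1 - beta1) * g1) / (1 - beta1)        # = g1 after correction
v_hat = ((1 - beta2) * g1**2) / (1 - beta2)     # = g1**2 after correction
step = alpha * m_hat / (math.sqrt(v_hat) + eps)
print(step)                                     # ≈ alpha, i.e. ≈ 0.001
```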
Why the correction is especially important for $\hat{v}_t$
Since $\beta_2 = 0.999$ is very close to $1$, the second-moment estimate $v_t$ is heavily biased toward zero at the start if left uncorrected. That is precisely the term sitting in the denominator. This is why bias correction is a central part of Adam rather than a minor technical addition.
2.3.4 Why the correction matters mostly at the beginning
As $t$ grows,
$$\beta_1^t \to 0 \quad \text{and} \quad \beta_2^t \to 0,$$
because both $\beta_1$ and $\beta_2$ are chosen in the interval $(0, 1)$. Whenever a number $q$ satisfies $0 < q < 1$, repeated multiplication by $q$ makes it exponentially smaller:
$$q^t = e^{t \ln q} \to 0 \quad \text{as } t \to \infty.$$
For example:
$$0.9^{10} \approx 0.35, \qquad 0.9^{100} \approx 2.7 \times 10^{-5},$$
and
$$0.999^{1000} \approx 0.37, \qquad 0.999^{10000} \approx 4.5 \times 10^{-5}.$$
Thus, even if the decay is slow when $\beta_2$ is very close to $1$, the power $\beta_2^t$ still eventually vanishes.
Hence
$$1 - \beta_1^t \to 1 \quad \text{and} \quad 1 - \beta_2^t \to 1, \qquad \text{so} \quad \hat{m}_t \to m_t \quad \text{and} \quad \hat{v}_t \to v_t.$$
Therefore, the bias-correction terms matter mainly during the startup phase. After enough iterations, Adam behaves like an adaptive optimizer based on well-formed moving averages.
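The decay of the correction factors can be tabulated directly, a minimal sketch using the default $\beta$ values:

```python
beta1, beta2 = 0.9, 0.999
for t in (1, 10, 100, 1000, 10000):
    c1 = 1 / (1 - beta1**t)   # first-moment correction factor
    c2 = 1 / (1 - beta2**t)   # second-moment correction factor
    print(f"t={t:>5}  c1={c1:.4f}  c2={c2:.4f}")
# c1 starts at 10 and c2 at 1000; both decay toward 1 as t grows.
```

The table makes the asymmetry visible: the second-moment factor starts at $1000$, which is why the denominator is the term most distorted at startup.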
2.4 Dimensional consideration
In much of Deep Learning, parameters and losses are treated as dimensionless quantities. Even under that convention, it remains useful to check that Adam is internally consistent:
- $m_t$ and $\hat{m}_t$ have the same units as the gradient,
- $v_t$ and $\hat{v}_t$ have the units of the gradient squared,
- $\sqrt{\hat{v}_t}$ has the same units as the gradient,
- therefore $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ is dimensionless.
Thus the update
$$\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
is dimensionally consistent.
Info
The dimensional check also clarifies why the denominator must be $\sqrt{\hat{v}_t}$ rather than $\hat{v}_t$. Without the square root, the numerator and denominator would not live on compatible scales.
3. Summary
Adam is best understood as an optimizer built from three ingredients:
- momentum-like smoothing through the EMA of gradients,
- RMSProp-like adaptivity through the EMA of squared gradients,
- bias correction to neutralize the zero-initialization distortion of both EMAs.
Its main strengths are:
- stable and effective behavior from the very first iterations,
- per-parameter adaptive step sizes,
- fast practical convergence on a wide range of deep-learning problems.
Its key conceptual advantage over a plain RMSProp-style method is that the startup phase is handled much more carefully:
- the first-moment estimate is corrected,
- the second-moment estimate is corrected,
- the early effective learning rate is far better controlled.
Practical note
Adam remains one of the strongest default optimizers in Deep Learning. However, when weight decay is needed, the modern practical choice is usually AdamW, not “Adam + L2 mixed into the gradient”, because the regularization term should be decoupled from Adam’s adaptive rescaling.