Hinton’s intuition: forget the distant past
Limitation of AdaGrad
AdaGrad's weakness is that it accumulates the entire history of squared gradients.
As training progresses, the denominator in the update rule grows unbounded, causing the effective learning rate for each parameter to shrink continuously.
💡 Idea (G. Hinton)
If the problem comes from summing the entire history, then stop summing it all.
Instead, assign decreasing weights to older gradients, so that recent information matters more than distant past.
Solution: use EMA
The solution builds on a widely applicable concept that goes beyond Deep Learning itself: the Exponential Moving Average (EMA).
Exponential Moving Average (EMA)
Definition
EMA definition
The Exponential Moving Average (EMA) is a type of infinite impulse response filter that applies exponentially decaying weights to past observations.
Unlike a simple moving average, which assigns equal importance to all past values within a fixed window, the EMA emphasizes recent data points while gradually “forgetting” the distant past.
This makes it particularly useful for applications such as time-series normalization and, in Deep Learning, for controlling the accumulation of gradient information.

Given a sequence of data points $x_1, x_2, \dots$ and a decay factor $\beta \in [0, 1)$, the EMA is defined recursively as:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, x_t, \qquad v_0 = 0$$

- The most recent value $x_t$ contributes with weight $(1 - \beta)$.
- Each past contribution is attenuated by a factor of $\beta$ per time step → the further back in time, the smaller its influence on the current EMA.
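To make the recursion concrete, here is a minimal Python sketch of the EMA (a runnable illustration; the names `ema`, `beta`, and the step signal are ours, not from any library):

```python
import numpy as np

def ema(xs, beta=0.9):
    """Exponential moving average: v_t = beta * v_{t-1} + (1 - beta) * x_t, with v_0 = 0."""
    v, out = 0.0, []
    for x in xs:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

# On a step signal, the EMA "forgets" the old level and tracks the new one:
signal = np.concatenate([np.zeros(50), np.ones(50)])
print(ema(signal)[[49, 54, 59, 99]])  # ≈ [0.00, 0.41, 0.65, 0.99]
```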
Why EMA matters for optimization
When applied to the squared gradients, the EMA provides a “limited memory” mechanism:
- It preserves per-parameter adaptivity, just as in AdaGrad.
- It avoids the unbounded accumulation that causes AdaGrad’s learning rate to shrink towards zero.
- It forms the foundation of modern adaptive optimizers such as RMSProp, Adam, and AdaDelta.
Key intuition
The EMA acts as a soft forgetting mechanism.
Instead of treating all past gradients as equally important, it ensures that recent gradients dominate the dynamics of learning, while older ones gradually fade in influence.
RMSProp in depth
Update rule
The parameter update in RMSProp follows the rule:

$$v_{t,i} = \beta\, v_{t-1,i} + (1 - \beta)\, g_{t,i}^2$$

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i}$$

where:

| General EMA formula | Specific case: RMSProp for $v_{t,i}$ | Meaning in RMSProp |
|---|---|---|
| $v_t$ | $v_{t,i}$ | Current value of the exponential moving average of the squared $i$-th gradient component |
| $v_{t-1}$ | $v_{t-1,i}$ | Exponential moving average, at step $t-1$, of the squared $i$-th gradient component |
| $x_t$ | $g_{t,i}^2$ | Current input: squared value of the $i$-th gradient component at step $t$ |
| $\beta$ | $\beta$ | Decay rate: controls how much past information is remembered |
| $(1 - \beta)\, x_t$ | $(1 - \beta)\, g_{t,i}^2$ | Contribution of the current $i$-th gradient component |

Here $g_{t,i}$ denotes the $i$-th component of the gradient at step $t$, $\eta$ the base learning rate, and $\epsilon$ a small constant for numerical stability.
Note
The denominator $\sqrt{v_{t,i}} + \epsilon$ averages the squared past gradient components for parameter $\theta_i$, but does so by assigning greater weight to the more recent ones.
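As a sketch of how the update rule above translates into code (the hyperparameter values `lr=1e-3`, `beta=0.9`, `eps=1e-8` are common illustrative defaults, not mandated by the text):

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp update: EMA of squared gradients, then a per-parameter normalized step."""
    v = beta * v + (1 - beta) * grad**2             # v_{t,i} = beta * v_{t-1,i} + (1-beta) * g_{t,i}^2
    theta = theta - lr * grad / (np.sqrt(v) + eps)  # theta_{t+1,i} = theta_{t,i} - eta * g_{t,i} / (sqrt(v_{t,i}) + eps)
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.1, 10.0])  # gradient components 100x apart in scale
theta, v = rmsprop_step(theta, grad, v)
print(theta)  # both coordinates moved by the same magnitude (~0.003): per-parameter adaptivity
```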
Why is it called RMSProp
| Term | Explanation |
|---|---|
| **R**oot | The square root of the EMA of squared gradients, $\sqrt{v_{t,i}}$, is computed. This brings the adaptive term back to the same scale as the gradient, preventing numerical imbalance in the update. Without the square root, the denominator would grow too quickly, making the effective learning rate excessively small. The root stabilizes and normalizes the effective learning rate used in the parameter update. |
| **M**ean | An average (moving average) is used instead of a single gradient observation, producing a more stable estimate. |
| **S**quare | The gradient component is squared in the EMA formula to measure only its magnitude (the sign is irrelevant). This provides an indicator of the gradient’s energy, useful for modulating the effective learning rate applied to parameter $\theta_i$. |
| **Prop**agation | The estimate is not reset at each step but updated recursively. This way, the effect of past $i$-th gradient components propagates across iterations, making the effective learning rate for $\theta_i$ continuously adaptive. |
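A quick numeric illustration of the “Root” and “Mean/Square” steps (the sample values are ours): the root of the mean of squares returns to the gradient’s own scale, while the mean of squares alone does not.

```python
import numpy as np

rng = np.random.default_rng(0)
g = 0.05 * rng.standard_normal(10_000)  # gradient samples with typical magnitude ~0.05

mean_sq = np.mean(g**2)  # "Mean" of "Square": ~0.0025, the wrong scale for a denominator
rms = np.sqrt(mean_sq)   # "Root": ~0.05, back on the same scale as the gradient itself
print(mean_sq, rms)
```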
Unrolling of the EMA formula
To fully understand the behavior of RMSProp, it is useful to recursively expand the definition of the exponential moving average (EMA) of the squared gradients step by step.
This makes it clear how past $i$-th gradient components gradually lose influence over time, illustrating the mechanism of exponential memory decay.
Let’s unroll the recursion for the first three steps, assuming $v_{0,i} = 0$:
Step 1
$$v_{1,i} = (1 - \beta)\, g_{1,i}^2$$
Step 2
$$v_{2,i} = \beta\, v_{1,i} + (1 - \beta)\, g_{2,i}^2 = \beta (1 - \beta)\, g_{1,i}^2 + (1 - \beta)\, g_{2,i}^2$$
Step 3
$$v_{3,i} = \beta\, v_{2,i} + (1 - \beta)\, g_{3,i}^2 = \beta^2 (1 - \beta)\, g_{1,i}^2 + \beta (1 - \beta)\, g_{2,i}^2 + (1 - \beta)\, g_{3,i}^2$$
By induction, this leads to the explicit formula:
$$v_{t,i} = (1 - \beta) \sum_{k=1}^{t} \beta^{\,t-k}\, g_{k,i}^2$$
From this expression it can be observed that:
- Each squared gradient term $g_{k,i}^2$ is multiplied by a coefficient $(1 - \beta)\, \beta^{\,t-k}$,
- The further back in time a gradient component is, the smaller its impact on $v_{t,i}$.
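This equivalence between the recursive and explicit forms can be checked numerically (a small sketch, assuming $v_{0,i} = 0$ as above):

```python
import numpy as np

beta, t = 0.9, 20
g2 = np.random.default_rng(1).random(t)  # stand-ins for the squared components g_{k,i}^2

# Recursive form: v_t = beta * v_{t-1} + (1 - beta) * g_t^2
v = 0.0
for x in g2:
    v = beta * v + (1 - beta) * x

# Explicit unrolled form: v_t = (1 - beta) * sum_k beta^(t-k) * g_k^2
k = np.arange(1, t + 1)
v_explicit = (1 - beta) * np.sum(beta ** (t - k) * g2)

print(np.isclose(v, v_explicit))  # True
```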
Effect of exponential decay
Since $0 \le \beta < 1$, the coefficient $(1 - \beta)\, \beta^{\,t-k}$ applies an exponentially decreasing weight to past $i$-th gradient components.
Here, $t$ denotes the current iteration, while $k$ indexes a past iteration ($1 \le k \le t$).
Thus, the further back in time a squared gradient term is (i.e., the larger $t - k$), the less influence it has on $v_{t,i}$.
Recent gradient components dominate the value of the moving average.

For example, with $\beta = 0.99$:
- $\beta^{1} = 0.99$,
- $\beta^{100} \approx 0.37$,
- $\beta^{500} \approx 0.0066$,
- $\beta^{1000} \approx 0.00004$.
This shows that gradient components older than just a few hundred iterations carry virtually no weight: the optimizer effectively “forgets” the distant past, favoring an up-to-date and responsive estimate of gradient variability.
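These magnitudes are straightforward to reproduce (same $\beta$ as in the example above):

```python
beta = 0.99
for age in (1, 100, 500, 1000):  # iterations elapsed since the gradient was observed
    print(age, beta ** age)      # 0.99, ~0.366, ~0.0066, ~0.000043
```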
Important
This exponential decay ensures that recent gradients exert the most influence on the current value of $v_{t,i}$, while past information is progressively “forgotten.”
This mechanism is the very essence of RMSProp’s adaptive behavior.
RMSProp weakness
EMA is biased towards zero
With $v_{0,i} = 0$ and $\beta = 0.99$, the exponential moving average

$$v_{1,i} = \beta \cdot 0 + (1 - \beta)\, g_{1,i}^2 = 0.01\, g_{1,i}^2$$

initially incorporates only 1% of the current squared gradient.

| Iteration | $v_{t,i}$ computed | Weight given to current gradient |
|---|---|---|
| $t = 1$ | $0.01\, g_{1,i}^2$ | 1% |
| $t = 2$ | $0.0099\, g_{1,i}^2 + 0.01\, g_{2,i}^2$ | ≈ 1% |

During these very first iterations, the denominator $\sqrt{v_{t,i}} + \epsilon$ in the RMSProp update rule is still extremely small (close to $\epsilon$).
As a result, the effective learning rate $\frac{\eta}{\sqrt{v_{t,i}} + \epsilon}$ becomes extremely large.
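A worked example makes the scale of the problem concrete (the values $\eta = 10^{-3}$ and $g_{1,i} = 0.1$ are illustrative assumptions): $v_{1,i} = 0.01 \cdot (0.1)^2 = 10^{-4}$, so the first step has magnitude

$$\frac{\eta}{\sqrt{v_{1,i}} + \epsilon}\, g_{1,i} \approx \frac{10^{-3}}{10^{-2}} \cdot 0.1 = 10^{-2},$$

100 times larger than the plain gradient step $\eta\, g_{1,i} = 10^{-4}$.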
The risk of “wild jumps” in the loss landscape at the start of training
As a consequence, at the start of training, the RMSProp optimizer takes giant uncontrolled steps across the loss surface.
Instead of gradually descending into a promising valley, the model may:
- overshoot minima entirely,
- bounce chaotically across different regions,
- or even diverge.
This behavior can be described as the model “wandering around” the loss landscape, wasting early epochs without meaningful convergence.
Only after this short but turbulent transient phase does the EMA begin to accumulate enough recent gradient information.
Once $v_{t,i}$ has accumulated enough recent gradients, the denominator in the update rule grows accordingly.
This activates RMSProp’s self-normalization mechanism, which gradually reduces the effective learning rate and brings the optimizer out of its turbulent startup phase.
From that point on, RMSProp transitions into a stable regime of more controlled descent (exploitation). At that point, if the model has not already been pushed out of promising valleys of the loss surface, the stabilized learning rate allows it to exploit one basin and converge more steadily toward a minimum.
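A small simulation of this transient (constants are illustrative; a constant unit gradient is assumed so that only the warm-up effect is visible):

```python
import numpy as np

lr, beta, eps = 1e-3, 0.99, 1e-8
v = 0.0
for t in range(1, 1001):
    g = 1.0                                # constant gradient magnitude, for illustration
    v = beta * v + (1 - beta) * g**2       # v_t = 1 - beta^t when g is constant at 1
    if t in (1, 10, 100, 1000):
        print(t, lr / (np.sqrt(v) + eps))  # effective learning rate
# t=1: ~1e-2 (10x the base lr) -> t=1000: ~1e-3 (stabilized at lr)
```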