1. Intro

The limitation RMSProp fixes

AdaGrad rescales each parameter by the cumulative sum of all past squared gradients. This gives useful per-parameter adaptivity, but the denominator grows monotonically throughout training, and the effective learning rate keeps shrinking even when very old gradients should no longer matter.

RMSProp's core move

If the problem comes from accumulating the entire history, the fix is to stop accumulating it uniformly: older squared gradients should not have the same influence as recent ones. RMSProp does this by replacing AdaGrad’s cumulative sum with an exponential moving average (EMA) of squared gradients. The result is still an adaptive per-parameter scaling rule, but one with exponential forgetting rather than infinite memory.

RMSProp is therefore best understood as AdaGrad with forgetting: the adaptivity is preserved, the unbounded accumulation is replaced by exponential memory decay, and recent gradient magnitudes matter much more than old ones.


2. RMSProp in depth

2.1 Update rule

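Let

$$g_{t,i} = \frac{\partial L(\theta_{t-1})}{\partial \theta_i}$$

be the gradient of the loss $L$ with respect to parameter $\theta_i$ at iteration $t$.

RMSProp maintains the exponential moving average

$$v_{t,i} = \beta\, v_{t-1,i} + (1-\beta)\, g_{t,i}^2$$

with initialization

$$v_{0,i} = 0.$$

The parameter update is then

$$\theta_{t,i} = \theta_{t-1,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i}$$

where:

  • $\eta$ is the base learning rate,
  • $\beta$ is the decay factor of the EMA,
  • $\epsilon$ is a small numerical-stability constant.

Typical choices take $\beta$ close to $1$, for example $\beta = 0.9$ or $\beta = 0.99$, so that the running estimate changes smoothly without becoming completely unresponsive.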
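The update rule can be sketched in a few lines of pure Python. This is a minimal illustration, not a reference implementation: the function name `rmsprop_step`, the toy quadratic objective, and the default values are all assumptions chosen for the example.

```python
import math

def rmsprop_step(theta, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """One plain RMSProp step on lists of floats; returns (new_theta, new_v)."""
    new_theta, new_v = [], []
    for th, g, v_prev in zip(theta, grad, v):
        v_i = beta * v_prev + (1.0 - beta) * g * g               # EMA of squared gradients
        new_v.append(v_i)
        new_theta.append(th - lr * g / (math.sqrt(v_i) + eps))   # per-coordinate scaled step
    return new_theta, new_v

# Hypothetical toy objective: L(theta) = 0.5 * (100 * theta_0**2 + theta_1**2)
def toy_grad(theta):
    return [100.0 * theta[0], theta[1]]

theta = [1.0, 1.0]
v = [0.0, 0.0]                     # zero initialization of the running average
for _ in range(500):
    theta, v = rmsprop_step(theta, toy_grad(theta), v)
print(theta)
```

In this toy run both coordinates end close to the minimum at the origin despite the 100x difference in curvature, because each coordinate is divided by its own running gradient scale.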

Note

The square root in $\sqrt{v_{t,i}}$ is essential. Since $v_{t,i}$ tracks a second moment, taking the square root brings the denominator back to the same scale as the gradient.

The recursion above is simply the general EMA template from §1 applied to the squared $i$-th gradient component. The denominator is therefore the root-mean-square of recent gradients along coordinate $i$, as estimated by the EMA: hence the name RMSProp.

How the adaptive denominator rescales each coordinate

RMSProp separates two roles that are mixed together in plain SGD:

  • the gradient $g_{t,i}$ determines the direction of the update,
  • the factor $\eta / (\sqrt{v_{t,i}} + \epsilon)$ determines the scale of the update.

The qualitative behaviour is immediate:

  • if a parameter has experienced large recent gradients, then $v_{t,i}$ becomes large and the effective step on that parameter is reduced;
  • if a parameter has experienced small or infrequent gradients, then $v_{t,i}$ remains smaller and that parameter is updated more aggressively.

In this sense, RMSProp is a per-parameter normalization mechanism based on recent gradient magnitudes, just like AdaGrad but with bounded memory.
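A small numerical sketch of this normalization, assuming two coordinates whose gradients are constant but differ in magnitude by four orders (the helper name `settled_scale` and the gradient values are hypothetical):

```python
import math

def settled_scale(g, lr=0.01, beta=0.9, eps=1e-8, steps=200):
    """Effective per-coordinate learning rate after the EMA has settled on gradient g."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * g * g
    return lr / (math.sqrt(v) + eps)

for g in (100.0, 0.01):
    scale = settled_scale(g)
    # The scale differs widely between coordinates, the resulting step does not.
    print(f"g={g}: scale={scale:.3e}, step={scale * g:.3e}")
```

The coordinate with large gradients gets a small scale and the coordinate with small gradients gets a large one, so the two update magnitudes end up close to the base learning rate in both cases.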


2.2 Why EMA fixes AdaGrad’s main limitation

The essential difference between AdaGrad and RMSProp is the memory model.

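AdaGrad uses

$$G_{t,i} = \sum_{k=1}^{t} g_{k,i}^2,$$

so every past squared gradient contributes forever. RMSProp replaces this cumulative sum with the recursion

$$v_{t,i} = \beta\, v_{t-1,i} + (1-\beta)\, g_{t,i}^2 = (1-\beta)\sum_{k=1}^{t} \beta^{t-k}\, g_{k,i}^2.$$

Thus each past squared gradient $g_{k,i}^2$ is weighted by

$$(1-\beta)\,\beta^{t-k}.$$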

This has two immediate consequences:

  • the contribution of a past gradient decays exponentially with its age;
  • the optimizer has finite effective memory, even though the formula is recursive and never explicitly drops terms.
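The contrast between the two memory models can be checked numerically. The sketch below assumes a constant gradient of 1.0 purely for illustration; `denominators` is a hypothetical helper name.

```python
import math

def denominators(grads, beta=0.9):
    """Return (sqrt of AdaGrad's cumulative sum, sqrt of RMSProp's EMA) after consuming grads."""
    G, v = 0.0, 0.0
    for g in grads:
        G += g * g                                # AdaGrad: every term kept forever
        v = beta * v + (1.0 - beta) * g * g       # RMSProp: old terms decay geometrically
    return math.sqrt(G), math.sqrt(v)

for t in (10, 100, 1000):
    ada, rms = denominators([1.0] * t)
    print(t, round(ada, 3), round(rms, 3))
```

With a constant gradient, the AdaGrad denominator grows like the square root of the iteration count, while the EMA-based denominator levels off near the gradient magnitude.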

Effective memory length

A useful rule of thumb is that the EMA remembers roughly

$$\frac{1}{1-\beta}$$

iterations. Thus:

  • if $\beta = 0.9$, the memory horizon is on the order of $10$ steps;
  • if $\beta = 0.99$, it is on the order of $100$ steps.

This is not an exact cutoff, but it gives an excellent intuition for how far into the past the optimizer is effectively looking.
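To see how soft this horizon is, one can add up the EMA weights that fall inside it. The helper below is a hypothetical illustration; it uses the fact that a squared gradient that is $j$ steps old carries weight $(1-\beta)\beta^{j}$.

```python
def weight_within_horizon(beta):
    """Return (rule-of-thumb horizon, total EMA weight on gradients inside it)."""
    horizon = round(1.0 / (1.0 - beta))
    recent = sum((1.0 - beta) * beta ** j for j in range(horizon))
    return horizon, recent

for beta in (0.9, 0.99):
    horizon, recent = weight_within_horizon(beta)
    print(beta, horizon, round(recent, 3))
```

For both values roughly two thirds of the total weight sits inside the horizon, and the remainder is spread over older gradients, which is why the rule of thumb is a scale rather than a cutoff.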

Note

When $t$ is finite, the weights on the observed squared gradients sum to

$$(1-\beta)\sum_{k=1}^{t} \beta^{t-k} = 1 - \beta^{t},$$

not yet to $1$. So during the startup phase, $v_{t,i}$ is not a fully normalized weighted average. This is exactly the source of the initialization bias discussed below.
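Both statements (the recursion equals an explicitly weighted sum, and the weights add up to $1-\beta^t$) can be verified on a short sequence. The squared-gradient values below are arbitrary illustrative numbers, and the function names are hypothetical.

```python
def ema_recursive(sq_grads, beta):
    """Run the EMA recursion from a zero initial state."""
    v = 0.0
    for s in sq_grads:
        v = beta * v + (1.0 - beta) * s
    return v

def ema_weights(t, beta):
    """Weights (1 - beta) * beta**(t - k) on the k-th squared gradient, k = 1..t."""
    return [(1.0 - beta) * beta ** (t - k) for k in range(1, t + 1)]

beta = 0.9
sq_grads = [4.0, 1.0, 0.25, 9.0, 2.25]            # arbitrary illustrative values
w = ema_weights(len(sq_grads), beta)
unrolled = sum(wk * s for wk, s in zip(w, sq_grads))

print(unrolled, ema_recursive(sq_grads, beta))    # the two forms agree
print(sum(w), 1.0 - beta ** len(sq_grads))        # total weight is 1 - beta**t
```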


2.3 A mnemonic reading of “RMSProp”

The name can be used as a compact memory aid for the mechanism. This section should be read as a mnemonic, not as a claim about official historical etymology.

| Term | Mechanistic meaning in RMSProp |
|---|---|
| Root | The update uses $\sqrt{v_{t,i}}$, not $v_{t,i}$ itself |
| Mean | $v_{t,i}$ is a moving average of squared gradients |
| Square | The gradients enter through $g_{t,i}^2$ |
| Prop | The running estimate is propagated recursively from one iteration to the next |

Note

This mnemonic is useful because it fixes the algorithm in memory in the correct order:

  1. square the gradient,
  2. average it over time with an EMA,
  3. take the root,
  4. use the result to propagate an adaptive scaling across iterations.

2.4 Initialization bias

2.4.1 The EMA is biased toward zero at the beginning

Because $v_{0,i} = 0$, the early values of the running second-moment estimate are systematically biased downward.

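If the squared-gradient statistics are approximately stable during the first iterations, then one may write

$$\mathbb{E}[g_{k,i}^2] \approx \mathbb{E}[g_{t,i}^2] \quad \text{for } 1 \le k \le t.$$

Under this approximation,

$$\mathbb{E}[v_{t,i}] = (1-\beta)\sum_{k=1}^{t} \beta^{t-k}\, \mathbb{E}[g_{k,i}^2] \approx (1-\beta^{t})\, \mathbb{E}[g_{t,i}^2].$$

Hence the estimate $v_{t,i}$ underestimates the true recent second moment by a factor of $1-\beta^{t}$ during the startup phase. This matters because $\sqrt{v_{t,i}}$ appears in the denominator of the update rule.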

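For the common choice $\beta = 0.9$, the startup behavior can be made very concrete:

| Iteration | Computed EMA state | Weight given to current squared gradient |
|---|---|---|
| $t = 1$ | $v_{1,i} = 0.1\, g_{1,i}^2$ | $0.1$ |
| $t = 2$ | $v_{2,i} = 0.09\, g_{1,i}^2 + 0.1\, g_{2,i}^2$ | still only $0.1$ on the newest term |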

A common shorthand says the denominator is “essentially $0$” at the start; that reading is too crude. The accurate statement is narrower: during the first iterations the EMA has not yet accumulated enough recent squared-gradient information, so the denominator stays systematically smaller than the long-run scale it is meant to represent.
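How long the underestimate lasts can be estimated by feeding the EMA a constant squared gradient and counting the iterations until it reaches a chosen fraction of the true value. The 90% threshold and the helper name `steps_to_fraction` are assumptions made for this sketch.

```python
def steps_to_fraction(beta, frac=0.9):
    """Iterations until a zero-initialized EMA of a constant signal reaches frac of it."""
    v, t = 0.0, 0
    while v < frac:
        v = beta * v + (1.0 - beta) * 1.0         # constant squared gradient of 1.0
        t += 1
    return t

for beta in (0.9, 0.99):
    print(beta, steps_to_fraction(beta))
```

Under this idealized constant-signal assumption, the startup phase lasts on the order of a couple of memory horizons, so a decay factor closer to 1 means a longer transient.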

2.4.2 First-step behavior

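At the first iteration,

$$v_{1,i} = \beta\, v_{0,i} + (1-\beta)\, g_{1,i}^2 = (1-\beta)\, g_{1,i}^2.$$

Therefore the first RMSProp update is

$$\theta_{1,i} = \theta_{0,i} - \frac{\eta\, g_{1,i}}{\sqrt{1-\beta}\,|g_{1,i}| + \epsilon}.$$

If $\epsilon$ is negligible compared with $\sqrt{1-\beta}\,|g_{1,i}|$, then

$$\theta_{1,i} - \theta_{0,i} \approx -\frac{\eta}{\sqrt{1-\beta}}\, \operatorname{sign}(g_{1,i}).$$

For the common choice $\beta = 0.9$,

$$\frac{1}{\sqrt{1-\beta}} = \frac{1}{\sqrt{0.1}} \approx 3.16.$$

So the first RMSProp step is approximately

$$\theta_{1,i} - \theta_{0,i} \approx -3.16\, \eta\, \operatorname{sign}(g_{1,i}).$$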

Warning

This is the precise mathematical statement behind the usual qualitative warning: plain RMSProp can make the initial effective step scale larger than the base learning rate. This does not imply that RMSProp must diverge at the start. The accurate, narrower claim is:

  • the denominator is initially underestimated,
  • the early effective step size can therefore be amplified,
  • whether that amplification is harmless or harmful depends on $\eta$, $\beta$, the gradient scale, and the local geometry of the objective.
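The size of the first step can be computed directly for a few decay factors. This is a sketch under the zero-initialization assumption used throughout this section; `first_step` is a hypothetical helper and the gradient values are arbitrary.

```python
import math

def first_step(g, lr=0.001, beta=0.9, eps=1e-8):
    """Parameter change produced by the very first plain RMSProp update."""
    v1 = (1.0 - beta) * g * g                     # v_0 = 0, so only the new term survives
    return -lr * g / (math.sqrt(v1) + eps)

for beta in (0.9, 0.99, 0.999):
    ratio = abs(first_step(5.0, beta=beta)) / 0.001
    print(beta, round(ratio, 2))                  # amplification relative to lr

# The first-step magnitude does not depend on the gradient magnitude.
print(first_step(5.0), first_step(500.0))
```

The amplification grows as the decay factor approaches 1, and the first step behaves like a sign update scaled by the learning rate, which is why its effect depends so strongly on the chosen learning rate.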

Loss-landscape intuition

At the start of training, plain RMSProp can take very large effective steps across the loss surface. Instead of descending gradually into a promising valley, the model may:

  • overshoot minima entirely,
  • bounce across different regions,
  • or, in unstable regimes, even diverge.

In that sense, the trajectory can spend its early iterations “wandering” across the loss landscape before the EMA has accumulated enough recent gradient information.

Once $v_{t,i}$ has accumulated enough recent squared gradients, the denominator grows accordingly. This activates RMSProp’s self-normalization mechanism: the effective learning rate is gradually reduced, the turbulent startup phase fades, and the optimizer moves into a more controlled exploitation regime.

If the model has not already been pushed out of a promising basin during that transient, the stabilized regime then allows it to exploit one region of the loss surface and converge more steadily.

2.4.3 Why the transient eventually disappears

As $t$ grows,

$$\beta^{t} \to 0,$$

so

$$1 - \beta^{t} \to 1.$$

Therefore the initialization bias fades away with time. After enough iterations, $v_{t,i}$ behaves like a well-formed EMA of recent squared gradients rather than like an estimate still dominated by its zero initialization.

Relation to Adam

This startup transient is one of the conceptual reasons Adam is more refined than plain RMSProp. Adam keeps the same second-moment idea but adds an explicit bias-correction factor $1/(1-\beta_2^{t})$ (where $\beta_2$ is Adam's name for the second-moment decay factor), so the early denominator is not systematically underestimated in the same way. It also adds a first-moment EMA (momentum) in the numerator, which RMSProp lacks.
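The effect of the correction on the denominator alone can be sketched as follows. This is not an implementation of Adam (there is no first-moment term); it only applies the division by $1-\beta^{t}$ to RMSProp's running average, with a constant gradient assumed for clarity and `step_sizes` as a hypothetical helper name.

```python
import math

def step_sizes(g, steps, lr=0.001, beta=0.9, eps=1e-8):
    """Per-iteration |update| with the plain and the bias-corrected denominator."""
    v, plain, corrected = 0.0, [], []
    for t in range(1, steps + 1):
        v = beta * v + (1.0 - beta) * g * g
        v_hat = v / (1.0 - beta ** t)             # bias-corrected second moment
        plain.append(lr * abs(g) / (math.sqrt(v) + eps))
        corrected.append(lr * abs(g) / (math.sqrt(v_hat) + eps))
    return plain, corrected

plain, corrected = step_sizes(g=2.0, steps=100)
for t in (1, 2, 5, 20, 100):
    print(t, round(plain[t - 1] / 0.001, 3), round(corrected[t - 1] / 0.001, 3))
```

With the correction, the step stays at the base learning rate from the first iteration; without it, the step starts amplified and relaxes toward the same value as the bias fades.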


2.5 Limitations of RMSProp

Main limitations

Plain RMSProp is an important improvement over AdaGrad, but it still leaves three structural issues unresolved:

  • No bias correction: the second-moment estimate starts from zero and is therefore biased downward during the first iterations. This can make the early effective step size larger than the nominal learning rate suggests.
  • No first-moment smoothing: RMSProp rescales the gradient magnitude, but it does not smooth the update direction through a momentum-like EMA of gradients. As a result, the direction of motion can remain noisy when mini-batch gradients are highly variable.
  • Strong sensitivity to hyperparameters: the practical behaviour of RMSProp depends noticeably on the interaction between $\eta$, $\beta$, and $\epsilon$. A choice that is stable in one regime can become too aggressive or too damped in another.

In short, RMSProp fixes AdaGrad’s infinite-memory problem, but it does not yet provide the startup control and directional stabilization that Adam adds explicitly.


2.6 Dimensional consistency

Even when parameters and losses are treated as dimensionless, it is useful to verify that the update is internally coherent:

  • $g_{t,i}$ has the units of a gradient,
  • $v_{t,i}$ has the units of a squared gradient,
  • $\sqrt{v_{t,i}}$ has the same units as the gradient,
  • therefore $g_{t,i} / \sqrt{v_{t,i}}$ is dimensionless.

This is another way to see why the square root is necessary. Without it, the numerator and denominator would not live on compatible scales.
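One practical reading of this argument is scale invariance: multiplying every gradient by a constant should leave the update unchanged when the stability constant is negligible. The sketch below sets that constant to zero to isolate the effect, and includes a variant without the square root for comparison. The function name, the `use_sqrt` flag, and the gradient values are assumptions made for illustration.

```python
import math

def updates(grads, lr=0.01, beta=0.9, use_sqrt=True):
    """Sequence of updates for one coordinate, with eps = 0 to isolate the scaling."""
    v, out = 0.0, []
    for g in grads:
        v = beta * v + (1.0 - beta) * g * g
        denom = math.sqrt(v) if use_sqrt else v   # without the root the scales mismatch
        out.append(-lr * g / denom)
    return out

base = [0.5, -1.2, 0.3, 2.0]                      # arbitrary nonzero illustrative gradients
scaled = [1000.0 * g for g in base]

print(updates(base)[0], updates(scaled)[0])                                   # same update
print(updates(base, use_sqrt=False)[0], updates(scaled, use_sqrt=False)[0])   # differs by 1000x
```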


3. Summary

RMSProp repairs AdaGrad’s infinite-memory flaw by replacing cumulative gradient history with an exponentially decaying memory. Its essential ingredients (EMA of squared gradients, per-parameter adaptive denominator, exponential forgetting) preserve AdaGrad’s adaptivity while avoiding the irreversible shrinkage of the effective learning rate.

The conceptual limitation that remains is equally clear: plain RMSProp has no bias correction, so the early second-moment estimate is biased toward zero, and the startup phase can be more aggressive than the nominal learning rate suggests. Plain RMSProp also has no first-moment smoothing (no momentum-like EMA of the gradients themselves), so the direction of motion remains noisy under high-variance mini-batch gradients.

Memory cost: $1\times$ the parameters in extra optimizer state (one running squared-gradient buffer per parameter), the same as SGD with Momentum.

Where to read next

Adam keeps RMSProp’s adaptive denominator and adds the two ingredients that plain RMSProp lacks: a first-moment EMA (momentum) in the numerator, and explicit bias correction for both moments to neutralize the startup transient. For modern deep-learning workloads, Adam (or AdamW when weight decay is involved) is usually the safer default; RMSProp remains conceptually important and is still occasionally preferred in RNN-style training where Adam’s momentum can over-smooth direction signals.

Sources

AdaGrad, RMSProp, Adam, and AdamW form one lineage of adaptive optimizers; their papers are collected in Optimization and Regularization.