1. Intro

Limitation of AdaGrad

AdaGrad rescales each parameter by the cumulative sum of all past squared gradients. This gives useful per-parameter adaptivity, but it also means that the denominator grows monotonically throughout training. As a consequence, the effective learning rate keeps shrinking, even when very old gradients should no longer matter.

💡 Hinton's intuition

If the problem comes from accumulating the entire history, then stop accumulating it uniformly. Older squared gradients should not have the same influence as recent ones. What is needed is a memory mechanism that keeps the adaptive idea but gives progressively smaller weight to the distant past.

RMSProp's core move

RMSProp does exactly this by replacing AdaGrad’s cumulative memory with an exponential moving average (EMA) of squared gradients. The result is still an adaptive per-parameter scaling rule, but one with exponential forgetting rather than infinite memory.

Important

RMSProp is therefore best understood as AdaGrad with forgetting:

  • AdaGrad’s adaptivity is preserved,
  • infinite memory is replaced by exponential memory decay,
  • recent gradient magnitudes matter much more than old ones.

2. RMSProp in depth

2.1 Update rule

Let

$$g_{t,i} = \nabla_{\theta_i} L(\theta_t)$$

be the gradient of the loss with respect to parameter $\theta_i$ at iteration $t$.

RMSProp maintains the exponential moving average

$$v_{t,i} = \gamma\, v_{t-1,i} + (1-\gamma)\, g_{t,i}^2,$$

with initialization

$$v_{0,i} = 0.$$

The parameter update is then

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i},$$

where:

  • $\eta$ is the base learning rate,
  • $\gamma$ is the decay factor of the EMA,
  • $\epsilon$ is a small numerical-stability constant.

Typical choices take $\gamma$ close to $1$, for example $\gamma = 0.9$ or $\gamma = 0.99$, so that the running estimate changes smoothly without becoming completely unresponsive.

Note

The square root in $\sqrt{v_{t,i}}$ is essential. Since $v_{t,i}$ tracks a second moment, taking the square root brings the denominator back to the same scale as the gradient.
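As a minimal sketch (not a reference implementation), the update rule above can be written directly in NumPy; the hyperparameter values below are only illustrative defaults:

```python
import numpy as np

def rmsprop_step(theta, g, v, eta=0.001, gamma=0.9, eps=1e-8):
    # EMA of squared gradients: v_t = gamma * v_{t-1} + (1 - gamma) * g_t^2
    v = gamma * v + (1.0 - gamma) * g**2
    # Per-parameter rescaled step: theta - eta * g / (sqrt(v) + eps)
    theta = theta - eta * g / (np.sqrt(v) + eps)
    return theta, v

theta = np.zeros(3)
v = np.zeros(3)                      # v_0 = 0 initialization
g = np.array([1.0, 0.1, 0.01])       # illustrative first gradient
theta, v = rmsprop_step(theta, g, v)
```

Because $v$ starts at zero, even this single step already exhibits the startup amplification analyzed later in the section.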

Viewed through the general EMA template

$$s_t = \gamma\, s_{t-1} + (1-\gamma)\, x_t,$$

RMSProp is simply the special case

$$s_t = v_{t,i}, \qquad x_t = g_{t,i}^2.$$

This identification is worth making explicit because it isolates the real role of the denominator: $v_{t,i}$ is not an arbitrary auxiliary variable, but the EMA state that stores recent information about the squared $i$-th gradient component.

| General EMA object | RMSProp quantity | Meaning in RMSProp |
| --- | --- | --- |
| $s_t$ | $v_{t,i}$ | Current EMA of the squared $i$-th gradient component |
| $s_{t-1}$ | $v_{t-1,i}$ | Previous EMA state |
| $x_t$ | $g_{t,i}^2$ | Current squared gradient input |
| $\gamma$ | $\gamma$ | Decay factor controlling how much past information is retained |
| $(1-\gamma)\,x_t$ | $(1-\gamma)\,g_{t,i}^2$ | Contribution of the current squared gradient |
| $\sqrt{s_t}$ | $\sqrt{v_{t,i}}$ | RMS scale used to normalize the update of coordinate $i$ |

How the adaptive denominator rescales each coordinate

RMSProp separates two roles that are mixed together in plain SGD:

  • the gradient $g_{t,i}$ determines the direction of the update,
  • the factor $\eta / (\sqrt{v_{t,i}} + \epsilon)$ determines the scale of the update.

The qualitative behavior is immediate:

  • if a parameter has experienced large recent gradients, then $v_{t,i}$ becomes large and the effective step on that parameter is reduced;
  • if a parameter has experienced small or infrequent gradients, then $v_{t,i}$ remains smaller and that parameter is updated more aggressively.

In this sense, RMSProp is a per-parameter normalization mechanism based on recent gradient magnitudes.
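This normalization effect can be checked numerically. In a hypothetical setup (chosen purely for illustration) where two coordinates see constant gradients of wildly different scales, the effective steps end up with roughly the same magnitude once the EMA has warmed up:

```python
import numpy as np

eta, gamma, eps = 0.01, 0.9, 1e-8
v = np.zeros(2)
g = np.array([100.0, 0.001])         # very different recent gradient scales

for _ in range(50):                  # warm up the EMA on constant gradients
    v = gamma * v + (1 - gamma) * g**2

# Effective per-coordinate step magnitude
step = eta * g / (np.sqrt(v) + eps)
# After warm-up v_i is close to g_i^2, so both steps are close to eta
# despite the 10^5 ratio between the raw gradient magnitudes.
```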


2.2 Why EMA fixes AdaGrad’s main limitation

The essential difference between AdaGrad and RMSProp is the memory model.

AdaGrad uses

$$G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2,$$

so every past squared gradient contributes forever. RMSProp replaces this cumulative sum with the recursion

$$v_{t,i} = \gamma\, v_{t-1,i} + (1-\gamma)\, g_{t,i}^2.$$

Unrolling the recursion with $v_{0,i} = 0$ gives

$$v_{t,i} = (1-\gamma) \sum_{\tau=1}^{t} \gamma^{\,t-\tau}\, g_{\tau,i}^2.$$

Thus each past squared gradient is weighted by

$$(1-\gamma)\, \gamma^{\,t-\tau}.$$

This has two immediate consequences:

  • the contribution of a past gradient decays exponentially with its age;
  • the optimizer has finite effective memory, even though the formula is recursive and never explicitly drops terms.
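A quick numeric check (with arbitrary stand-in values) confirms that the recursion and the explicit exponentially weighted sum agree:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
g2 = rng.random(20)                  # stand-in squared gradients g_tau^2

# Recursive EMA with v_0 = 0
v = 0.0
for x in g2:
    v = gamma * v + (1 - gamma) * x

# Explicit form: (1 - gamma) * sum over tau of gamma^(t - tau) * g_tau^2
t = len(g2)
explicit = (1 - gamma) * sum(gamma**(t - tau) * g2[tau - 1]
                             for tau in range(1, t + 1))
# v and explicit coincide up to floating-point error
```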

Effective memory length

A useful rule of thumb is that the EMA remembers roughly

$$\frac{1}{1-\gamma}$$

iterations. Thus:

  • if $\gamma = 0.9$, the memory horizon is on the order of $10$ steps;
  • if $\gamma = 0.99$, it is on the order of $100$ steps.

This is not an exact cutoff, but it gives an excellent intuition for how far into the past the optimizer is effectively looking.
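One way to see why the heuristic works: after $1/(1-\gamma)$ steps, the relative weight on a gradient of that age has decayed to roughly $1/e \approx 0.37$. The snippet below only illustrates this heuristic, it is not part of the algorithm:

```python
import math

for gamma in (0.9, 0.99):
    horizon = 1.0 / (1.0 - gamma)    # rule-of-thumb memory length
    # Relative weight of a gradient that is `horizon` steps old,
    # compared to the newest one: gamma ** horizon is close to 1/e
    remaining = gamma ** horizon
    print(f"gamma={gamma}: horizon={horizon:.0f}, "
          f"remaining weight={remaining:.3f}")
```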

Note

When $v_{0,i} = 0$, the weights on the observed squared gradients sum to

$$(1-\gamma) \sum_{\tau=1}^{t} \gamma^{\,t-\tau} = 1 - \gamma^t,$$

not yet to $1$. So during the startup phase, $v_{t,i}$ is not a fully normalized weighted average. This is exactly the source of the initialization bias discussed below.
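The partial normalization is easy to verify directly; for any $t$, summing the weights gives exactly $1 - \gamma^t$:

```python
gamma = 0.9
for t in (1, 5, 50):
    # Weights (1 - gamma) * gamma^(t - tau) on g_tau^2, tau = 1..t
    weights = [(1 - gamma) * gamma**(t - tau) for tau in range(1, t + 1)]
    total = sum(weights)             # equals 1 - gamma**t, approaches 1
    print(t, total, 1 - gamma**t)
```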


2.3 A mnemonic reading of “RMSProp”

The name can be used as a compact memory aid for the mechanism. This section should be read as a mnemonic, not as a claim about official historical etymology.

| Term | Mechanistic meaning in RMSProp |
| --- | --- |
| **R**oot | The update uses $\sqrt{v_{t,i}}$, not $v_{t,i}$ itself |
| **M**ean | $v_{t,i}$ is a moving average of squared gradients |
| **S**quare | The gradients enter through $g_{t,i}^2$ |
| **Prop** | The running estimate is propagated recursively from one iteration to the next |

Note

This mnemonic is useful because it fixes the algorithm in memory in the correct order:

  1. square the gradient,
  2. average it over time with an EMA,
  3. take the root,
  4. use the result to propagate an adaptive scaling across iterations.

2.4 Initialization bias

2.4.1 The EMA is biased toward zero at the beginning

Because $v_{0,i} = 0$, the early values of the running second-moment estimate are systematically biased downward.

If the squared-gradient statistics are approximately stable during the first iterations, then one may write

$$g_{\tau,i}^2 \approx g_i^2 \quad \text{for } \tau \le t.$$

Under this approximation,

$$v_{t,i} \approx (1-\gamma) \sum_{\tau=1}^{t} \gamma^{\,t-\tau}\, g_i^2 = (1-\gamma^t)\, g_i^2.$$

Hence the estimate underestimates the true recent second moment $g_i^2$ during the startup phase. This matters because $\sqrt{v_{t,i}}$ appears in the denominator of the update rule.

For the common choice $\gamma = 0.9$, the startup behavior can be made very concrete:

| Iteration | Computed EMA state | Weight given to current squared gradient |
| --- | --- | --- |
| $t = 1$ | $v_{1,i} = 0.1\, g_{1,i}^2$ | $0.1$ |
| $t = 2$ | $v_{2,i} = 0.09\, g_{1,i}^2 + 0.1\, g_{2,i}^2$ | $0.1$ |
| $t = 3$ | $v_{3,i} = 0.081\, g_{1,i}^2 + 0.09\, g_{2,i}^2 + 0.1\, g_{3,i}^2$ | $0.1$, still only on the newest term |

This does not mean that the denominator is generically "equal to $0.1\, g_{t,i}^2$". It means something more precise: during the first iterations, the EMA has not yet accumulated enough recent squared-gradient information, so the denominator can remain systematically smaller than the long-run scale it is meant to represent.

2.4.2 First-step behavior

At the first iteration,

$$v_{1,i} = \gamma \cdot 0 + (1-\gamma)\, g_{1,i}^2 = (1-\gamma)\, g_{1,i}^2.$$

Therefore the first RMSProp update is

$$\theta_{2,i} = \theta_{1,i} - \frac{\eta}{\sqrt{(1-\gamma)\, g_{1,i}^2} + \epsilon}\, g_{1,i}.$$

If $\epsilon$ is negligible compared with $\sqrt{1-\gamma}\, |g_{1,i}|$, then

$$\theta_{2,i} \approx \theta_{1,i} - \frac{\eta}{\sqrt{1-\gamma}}\, \operatorname{sign}(g_{1,i}).$$

For the common choice $\gamma = 0.9$,

$$\frac{1}{\sqrt{1-\gamma}} = \frac{1}{\sqrt{0.1}} \approx 3.16.$$

So the first RMSProp step is approximately $3.16\,\eta$ in magnitude, regardless of the size of the first gradient.
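This first-step amplification is easy to verify numerically (the gradient values below are arbitrary; only the conclusion matters):

```python
import numpy as np

eta, gamma, eps = 0.001, 0.9, 1e-8
g1 = np.array([0.5, -2.0, 10.0])     # arbitrary first gradients
v1 = (1 - gamma) * g1**2             # first EMA state, from v_0 = 0
step = -eta * g1 / (np.sqrt(v1) + eps)
# |step| is approximately eta / sqrt(1 - gamma), i.e. about 3.16 * eta,
# for every coordinate; only the sign of the step depends on g1.
```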

Warning

This is the precise mathematical statement behind the usual qualitative warning: plain RMSProp can make the initial effective step larger than the base learning rate. This does not mean that RMSProp must diverge at the start. It means something narrower and more defensible:

  • the denominator is initially underestimated,
  • the early effective step size can therefore be amplified,
  • whether that amplification is harmless or harmful depends on $\eta$, $\gamma$, the gradient scale, and the local geometry of the objective.

Loss-landscape intuition

At the start of training, plain RMSProp can take very large effective steps across the loss surface. Instead of descending gradually into a promising valley, the model may:

  • overshoot minima entirely,
  • bounce across different regions,
  • or, in unstable regimes, even diverge.

In that sense, the trajectory can spend its early iterations “wandering” across the loss landscape before the EMA has accumulated enough recent gradient information.

Once $v_{t,i}$ has accumulated enough recent squared gradients, the denominator grows accordingly. This activates RMSProp’s self-normalization mechanism: the effective learning rate is gradually reduced, the turbulent startup phase fades, and the optimizer moves into a more controlled exploitation regime.

If the model has not already been pushed out of a promising basin during that transient, the stabilized regime then allows it to exploit one region of the loss surface and converge more steadily.

2.4.3 Why the transient eventually disappears

As $t$ grows,

$$\gamma^t \to 0,$$

so

$$1 - \gamma^t \to 1.$$

Therefore the initialization bias fades away with time. After enough iterations, $v_{t,i}$ behaves like a well-formed EMA of recent squared gradients rather than like an estimate still dominated by its zero initialization.

Relation to Adam

This startup transient is one of the conceptual reasons Adam is more refined than plain RMSProp. Adam keeps the same second-moment idea but adds an explicit bias-correction factor, so the early denominator is not systematically underestimated in the same way.
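The effect of Adam-style bias correction on this transient can be sketched as follows. Under the simplifying assumption of a constant squared gradient (used only for illustration), dividing by $1 - \gamma^t$ exactly undoes the startup underestimate:

```python
gamma = 0.9
g2 = 4.0                              # constant squared gradient (illustrative)
v = 0.0
for t in range(1, 6):
    v = gamma * v + (1 - gamma) * g2  # plain RMSProp second-moment EMA
    v_hat = v / (1 - gamma**t)        # Adam-style bias correction
    # v underestimates g2 during startup; v_hat recovers g2 exactly here
    print(t, round(v, 4), round(v_hat, 4))
```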


2.5 Limitations of RMSProp

Main limitations

Plain RMSProp is an important improvement over AdaGrad, but it still leaves three structural issues unresolved:

  • No bias correction: the second-moment estimate starts from zero and is therefore biased downward during the first iterations. This can make the early effective step size larger than the nominal learning rate suggests.
  • No first-moment smoothing: RMSProp rescales the gradient magnitude, but it does not smooth the update direction through a momentum-like EMA of gradients. As a result, the direction of motion can remain noisy when mini-batch gradients are highly variable.
  • Strong sensitivity to hyperparameters: the practical behavior of RMSProp depends noticeably on the interaction between $\eta$, $\gamma$, and $\epsilon$. A choice that is stable in one regime can become too aggressive or too damped in another.

In short, RMSProp fixes AdaGrad’s infinite-memory problem, but it does not yet provide the startup control and directional stabilization that later optimizers such as Adam add explicitly.


2.6 Dimensional consistency

Even when parameters and losses are treated as dimensionless, it is useful to verify that the update is internally coherent:

  • $g_{t,i}$ has the units of a gradient,
  • $v_{t,i}$ has the units of a squared gradient,
  • $\sqrt{v_{t,i}}$ has the same units as the gradient,
  • therefore $g_{t,i} / (\sqrt{v_{t,i}} + \epsilon)$ is dimensionless.

This is another way to see why the square root is necessary. Without it, the numerator and denominator would not live on compatible scales.


3. Summary

RMSProp is best viewed as the optimizer that repairs AdaGrad’s infinite-memory flaw by replacing cumulative gradient history with an exponentially decaying memory.

Its essential ingredients are:

  • an EMA of squared gradients,
  • a per-parameter adaptive denominator,
  • exponential forgetting of distant past information.

Its main strengths are:

  • it preserves per-parameter adaptivity,
  • it avoids AdaGrad’s irreversible shrinkage of the effective learning rate,
  • it reacts to recent gradient scales rather than to the full training history.

Its main conceptual limitation is equally clear:

  • plain RMSProp has no bias correction,
  • so the early second-moment estimate is biased toward zero,
  • and the startup phase can therefore be more aggressive than the nominal learning rate suggests.

Practical note

RMSProp remains an important optimizer historically and conceptually. In modern deep-learning practice, however, Adam is often preferred because it augments the same adaptive denominator with momentum in the numerator and with explicit bias correction during the startup phase.