1. Intro
The limitation RMSProp fixes
AdaGrad rescales each parameter by the cumulative sum of all past squared gradients. This gives useful per-parameter adaptivity, but the denominator grows monotonically throughout training, and the effective learning rate keeps shrinking even when very old gradients should no longer matter.
RMSProp's core move
If the problem comes from accumulating the entire history, the fix is to stop accumulating it uniformly: older squared gradients should not have the same influence as recent ones. RMSProp does this by replacing AdaGrad’s cumulative sum with an exponential moving average (EMA) of squared gradients. The result is still an adaptive per-parameter scaling rule, but one with exponential forgetting rather than infinite memory.
RMSProp is therefore best understood as AdaGrad with forgetting: the adaptivity is preserved, the unbounded accumulation is replaced by exponential memory decay, and recent gradient magnitudes matter much more than old ones.
Historical note
RMSProp is traditionally attributed to Geoffrey Hinton’s lecture notes rather than to a single canonical archival paper. For that reason, minor variants coexist across libraries and textbooks. In this note, plain RMSProp is described:
- no momentum term,
- no centered variant,
- only the EMA of squared gradients in the denominator.
EMA recap
An Exponential Moving Average of a sequence $x_1, x_2, \dots$ with decay factor $\beta \in [0, 1)$ is the recursion
$$m_t = \beta\, m_{t-1} + (1 - \beta)\, x_t .$$
The current observation enters with weight $1 - \beta$; past observations are attenuated by successive powers of $\beta$. The whole sequence $(m_t)$ is therefore a soft, finite, exponentially decaying memory of $(x_t)$. RMSProp applies this construction to the sequence of squared gradients, which is exactly what gives the optimizer a bounded effective memory while preserving the adaptive coordinate-wise rescaling inherited from AdaGrad.
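The recursion can be sketched in a few lines of Python. The function name `ema`, the argument name `beta` for the decay factor, and the zero initial state are assumptions made for illustration.

```python
def ema(xs, beta):
    """Return the successive EMA states for the inputs xs, starting from a zero state."""
    m = 0.0  # assumed zero initialization
    states = []
    for x in xs:
        # the current observation enters with weight (1 - beta);
        # everything already stored in m is attenuated by beta
        m = beta * m + (1.0 - beta) * x
        states.append(m)
    return states


print(ema([1.0, 1.0, 1.0], beta=0.9))
```

On a constant input of ones the states climb toward 1 without reaching it in finitely many steps, which is the behaviour the rest of this note relies on.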
2. RMSProp in depth
2.1 Update rule
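Let
$$g_{t,i} = \frac{\partial \mathcal{L}(\theta_t)}{\partial \theta_i}$$
be the gradient of the loss with respect to parameter $\theta_i$ at iteration $t$.
RMSProp maintains the exponential moving average
$$v_{t,i} = \beta\, v_{t-1,i} + (1 - \beta)\, g_{t,i}^2$$
with initialization
$$v_{0,i} = 0 .$$
The parameter update is then
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i},$$
where:
- $\eta$ is the base learning rate,
- $\beta$ is the decay factor of the EMA,
- $\epsilon$ is a small numerical-stability constant.
Typical choices take $\beta$ close to $1$, for example $0.9$ or $0.99$, so that the running estimate changes smoothly without becoming completely unresponsive.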
Note
The square root in $\sqrt{v_{t,i}}$ is essential. Since $v_{t,i}$ tracks a second moment, taking the square root brings the denominator back to the same scale as the gradient.
The recursion above is simply the general EMA template from §1 applied to the squared $i$-th gradient component. The denominator $\sqrt{v_{t,i}}$ is therefore the root-mean-square of recent gradients along coordinate $i$, evaluated by the EMA: hence the name RMSProp.
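A minimal sketch of one plain RMSProp step on lists of floats, assuming the stability constant is added outside the square root. The function name `rmsprop_step`, the default values, and the quadratic test objective are illustrative assumptions, not taken from any library.

```python
import math


def rmsprop_step(theta, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """Apply one plain RMSProp update; returns (new_theta, new_v)."""
    new_theta, new_v = [], []
    for th, g, v_i in zip(theta, grad, v):
        v_i = beta * v_i + (1.0 - beta) * g * g    # EMA of squared gradients
        th = th - lr * g / (math.sqrt(v_i) + eps)  # per-coordinate normalized step
        new_theta.append(th)
        new_v.append(v_i)
    return new_theta, new_v


# Hypothetical objective: f(theta) = theta_0**2 + 10 * theta_1**2
theta, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(500):
    grad = [2.0 * theta[0], 20.0 * theta[1]]
    theta, v = rmsprop_step(theta, grad, v)
print(theta)
```

Both coordinates follow nearly the same trajectory even though their curvatures differ by a factor of ten, because each gradient is divided by its own running RMS. With a constant learning rate the iterates end up hovering near the minimum rather than settling exactly on it.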
The general EMA template, mapped to RMSProp
Tracking the correspondence between the EMA recap of §1 and the RMSProp recursion explicitly:
| General EMA object | RMSProp quantity | Meaning in RMSProp |
|---|---|---|
| $m_t$ | $v_{t,i}$ | Current EMA of the squared $i$-th gradient component |
| $m_{t-1}$ | $v_{t-1,i}$ | Previous EMA state |
| $x_t$ | $g_{t,i}^2$ | Current squared-gradient input |
| $\beta$ | $\beta$ | Decay factor controlling how much past information is retained |
| $1 - \beta$ | $1 - \beta$ | Contribution of the current squared gradient |
| $\sqrt{m_t}$ | $\sqrt{v_{t,i}}$ | RMS scale used to normalize the update of coordinate $i$ |
The identification makes the structure of RMSProp transparent: $v_{t,i}$ is not an arbitrary auxiliary variable, but the EMA state of one specific signal (the squared $i$-th gradient component), and the rest of the algorithm follows mechanically from the EMA construction.
How the adaptive denominator rescales each coordinate
RMSProp separates two roles that are mixed together in plain SGD:
- the gradient determines the direction of the update,
- the factor $\eta / (\sqrt{v_{t,i}} + \epsilon)$ determines the scale of the update.
The qualitative behaviour is immediate:
- if a parameter has experienced large recent gradients, then $v_{t,i}$ becomes large and the effective step on that parameter is reduced;
- if a parameter has experienced small or infrequent gradients, then $v_{t,i}$ remains smaller and that parameter is updated more aggressively.
In this sense, RMSProp is a per-parameter normalization mechanism based on recent gradient magnitudes, just like AdaGrad but with bounded memory.
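The rescaling can be made concrete with a small sketch that computes the per-coordinate scale factor (learning rate divided by the running RMS plus the stability constant) for two hypothetical gradient histories. The helper name `effective_scale` and the constant gradient values are assumptions chosen for illustration.

```python
import math


def effective_scale(grads, lr=0.01, beta=0.9, eps=1e-8):
    """Return lr / (sqrt(v) + eps) after feeding a gradient history into the EMA."""
    v = 0.0
    for g in grads:
        v = beta * v + (1.0 - beta) * g * g
    return lr / (math.sqrt(v) + eps)


steep = effective_scale([100.0] * 200)  # coordinate with large recent gradients
flat = effective_scale([0.01] * 200)    # coordinate with small recent gradients
print(steep, flat)
```

The steep coordinate receives a much smaller scale factor than the flat one, so the product of scale factor and gradient comes out close to the base learning rate in both cases.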
2.2 Why EMA fixes AdaGrad’s main limitation
The essential difference between AdaGrad and RMSProp is the memory model.
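AdaGrad uses
$$G_{t,i} = \sum_{k=1}^{t} g_{k,i}^2 ,$$
so every past squared gradient contributes forever. RMSProp replaces this cumulative sum with the recursion
$$v_{t,i} = \beta\, v_{t-1,i} + (1 - \beta)\, g_{t,i}^2 .$$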
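The difference between the two memory models shows up directly when both accumulators are fed the same gradient history. The sketch below assumes a constant gradient of 1 and uses the hypothetical helper name `accumulators`.

```python
def accumulators(grads, beta=0.9):
    """Return (AdaGrad cumulative sum, RMSProp EMA) after the full gradient history."""
    g_sum, v = 0.0, 0.0
    for g in grads:
        g_sum += g * g                       # AdaGrad: every term is kept forever
        v = beta * v + (1.0 - beta) * g * g  # RMSProp: old terms decay by beta each step
    return g_sum, v


for steps in (10, 100, 1000):
    print(steps, accumulators([1.0] * steps))
```

The cumulative sum grows linearly with the number of steps, while the EMA saturates near the squared gradient magnitude, so the RMSProp denominator stops growing once the gradient scale is stable.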
EMA unrolling
To see the memory mechanism directly, it is useful to expand the recursion step by step.
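Step 1
$$v_{1,i} = \beta\, v_{0,i} + (1-\beta)\, g_{1,i}^2 = (1-\beta)\, g_{1,i}^2$$
Step 2
$$v_{2,i} = \beta\, v_{1,i} + (1-\beta)\, g_{2,i}^2 = (1-\beta)\left(\beta\, g_{1,i}^2 + g_{2,i}^2\right)$$
Step 3
$$v_{3,i} = \beta\, v_{2,i} + (1-\beta)\, g_{3,i}^2 = (1-\beta)\left(\beta^2 g_{1,i}^2 + \beta\, g_{2,i}^2 + g_{3,i}^2\right)$$
By induction, this leads to the explicit formula (with $v_{0,i} = 0$):
$$v_{t,i} = (1-\beta) \sum_{k=1}^{t} \beta^{\,t-k}\, g_{k,i}^2 .$$
Thus each past squared gradient $g_{k,i}^2$ is weighted by
$$(1-\beta)\, \beta^{\,t-k} .$$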
This has two immediate consequences:
- the contribution of a past gradient decays exponentially with its age;
- the optimizer has finite effective memory, even though the formula is recursive and never explicitly drops terms.
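The unrolled form can be checked numerically against the recursion. This is a minimal sketch assuming a zero initial state; the function names and the sample values are illustrative.

```python
import math


def ema_recursive(sq_grads, beta):
    """EMA of squared gradients computed step by step from a zero state."""
    v = 0.0
    for s in sq_grads:
        v = beta * v + (1.0 - beta) * s
    return v


def ema_unrolled(sq_grads, beta):
    """Same quantity as an explicit weighted sum over the whole history."""
    t = len(sq_grads)
    # the k-th squared gradient (k = 1..t) carries weight (1 - beta) * beta**(t - k)
    return sum((1.0 - beta) * beta ** (t - k) * s
               for k, s in enumerate(sq_grads, start=1))


history = [4.0, 1.0, 9.0, 0.25]
print(math.isclose(ema_recursive(history, 0.9), ema_unrolled(history, 0.9)))
```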
Effective memory length
A useful rule of thumb is that the EMA remembers roughly
$$\frac{1}{1 - \beta}$$
iterations. Thus:
- if $\beta = 0.9$, the memory horizon is on the order of $10$ steps;
- if $\beta = 0.99$, it is on the order of $100$ steps.
This is not an exact cutoff, but it gives an excellent intuition for how far into the past the optimizer is effectively looking.
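One way to see what the rule of thumb means is to add up the EMA weights carried by the most recent inputs inside that horizon. The helper name `weight_in_last` is an assumption for this sketch.

```python
def weight_in_last(n, beta):
    """Total EMA weight carried by the n most recent inputs."""
    # the input that is j steps old carries weight (1 - beta) * beta**j
    return sum((1.0 - beta) * beta ** j for j in range(n))


for beta in (0.9, 0.99):
    horizon = round(1.0 / (1.0 - beta))
    print(beta, horizon, weight_in_last(horizon, beta))
```

For both decay factors roughly 63 to 65 percent of the total weight sits inside the horizon, with the remainder spread over older inputs. This is why the horizon is a soft scale rather than a hard cutoff.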
Note
When $v_{0,i} = 0$, the weights on the observed squared gradients sum to
$$(1-\beta) \sum_{k=1}^{t} \beta^{\,t-k} = 1 - \beta^t ,$$
not yet to $1$. So during the startup phase, $v_{t,i}$ is not a fully normalized weighted average. This is exactly the source of the initialization bias discussed below.
2.3 A mnemonic reading of “RMSProp”
The name can be used as a compact memory aid for the mechanism. This section should be read as a mnemonic, not as a claim about official historical etymology.
| Term | Mechanistic meaning in RMSProp |
|---|---|
| Root | The update uses $\sqrt{v_{t,i}}$, not $v_{t,i}$ itself |
| Mean | $v_{t,i}$ is a moving average of squared gradients |
| Square | The gradients enter through $g_{t,i}^2$ |
| Prop | The running estimate is propagated recursively from one iteration to the next |
Note
This mnemonic is useful because it fixes the algorithm in memory in the correct order:
- square the gradient,
- average it over time with an EMA,
- take the root,
- use the result to propagate an adaptive scaling across iterations.
2.4 Initialization bias
2.4.1 The EMA is biased toward zero at the beginning
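Because $v_{0,i} = 0$, the early values of the running second-moment estimate are systematically biased downward.
If the squared-gradient statistics are approximately stable during the first iterations, then one may write
$$\mathbb{E}\big[g_{k,i}^2\big] \approx \sigma_i^2 \quad \text{for } k \le t .$$
Under this approximation,
$$\mathbb{E}\big[v_{t,i}\big] \approx (1-\beta) \sum_{k=1}^{t} \beta^{\,t-k}\, \sigma_i^2 = \left(1 - \beta^t\right) \sigma_i^2 .$$
Hence the estimate $v_{t,i}$ underestimates the true recent second moment during the startup phase. This matters because $\sqrt{v_{t,i}}$ appears in the denominator of the update rule.
For the common choice $\beta = 0.9$, the startup behavior can be made very concrete: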
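| Iteration | Computed EMA state | Weight given to current squared gradient |
|---|---|---|
| $t = 1$ | $v_{1,i} = 0.1\, g_{1,i}^2$ | $0.1$ |
| $t = 2$ | $v_{2,i} = 0.09\, g_{1,i}^2 + 0.1\, g_{2,i}^2$ | still only $0.1$ on the newest term |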
A common shorthand says the denominator is “essentially zero” at the start; that reading is too crude. The accurate statement is narrower: during the first iterations the EMA has not yet accumulated enough recent squared-gradient information, so the denominator stays systematically smaller than the long-run scale it is meant to represent.
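The size of the underestimate can be tabulated with a short sketch, assuming every squared gradient equals the same constant (taken as 1 here, so the true RMS is 1). The helper name `startup_ratio` is hypothetical.

```python
import math


def startup_ratio(t, beta=0.9):
    """sqrt of the EMA state after t steps, relative to a true RMS of 1."""
    v = 0.0
    for _ in range(t):
        v = beta * v + (1.0 - beta) * 1.0  # constant squared gradient of 1
    return math.sqrt(v)


for t in (1, 2, 5, 10, 50):
    # second column: recursion; third column: closed form sqrt(1 - beta**t)
    print(t, round(startup_ratio(t), 3), round(math.sqrt(1 - 0.9 ** t), 3))
```

The two columns agree, and the ratio starts well below 1 before approaching it: the denominator is too small at first, but it is not zero.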
2.4.2 First-step behavior
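At the first iteration,
$$v_{1,i} = (1-\beta)\, g_{1,i}^2 .$$
Therefore the first RMSProp update is
$$\theta_{2,i} = \theta_{1,i} - \frac{\eta\, g_{1,i}}{\sqrt{1-\beta}\, |g_{1,i}| + \epsilon} .$$
If $\epsilon$ is negligible compared with $\sqrt{1-\beta}\, |g_{1,i}|$, then
$$\theta_{2,i} \approx \theta_{1,i} - \frac{\eta}{\sqrt{1-\beta}}\, \operatorname{sign}(g_{1,i}) .$$
For the common choice $\beta = 0.9$,
$$\frac{1}{\sqrt{1-\beta}} = \frac{1}{\sqrt{0.1}} \approx 3.16 .$$
So the first RMSProp step is approximately
$$\theta_{2,i} \approx \theta_{1,i} - 3.16\, \eta\, \operatorname{sign}(g_{1,i}) .$$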
Warning
This is the precise mathematical statement behind the usual qualitative warning: plain RMSProp can make the initial effective step scale larger than the base learning rate. This does not imply that RMSProp must diverge at the start. The accurate, narrower claim is:
- the denominator is initially underestimated,
- the early effective step size can therefore be amplified,
- whether that amplification is harmless or harmful depends on $\eta$, $\beta$, the gradient scale, and the local geometry of the objective.
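The first-step amplification can be reproduced numerically. The sketch below assumes a zero initial EMA state and the stability constant outside the square root; the function name `first_step` and the sample gradient values are illustrative.

```python
import math


def first_step(g1, lr=0.01, beta=0.9, eps=1e-8):
    """Signed displacement removed from the parameter on the first plain RMSProp step."""
    v1 = (1.0 - beta) * g1 * g1  # EMA state after one update from zero
    return lr * g1 / (math.sqrt(v1) + eps)


for g1 in (1e-3, 1.0, 1e3):
    print(g1, first_step(g1) / 0.01)  # ratio to the base learning rate
```

The ratio is close to 3.16 for all three gradient magnitudes: the first step has nearly the same size regardless of how large the gradient is, and only its sign depends on the gradient.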
Loss-landscape intuition
At the start of training, plain RMSProp can take very large effective steps across the loss surface. Instead of descending gradually into a promising valley, the model may:
- overshoot minima entirely,
- bounce across different regions,
- or, in unstable regimes, even diverge.
In that sense, the trajectory can spend its early iterations “wandering” across the loss landscape before the EMA has accumulated enough recent gradient information.
Once $v_{t,i}$ has accumulated enough recent squared gradients, the denominator grows accordingly. This activates RMSProp’s self-normalization mechanism: the effective learning rate is gradually reduced, the turbulent startup phase fades, and the optimizer moves into a more controlled exploitation regime.
If the model has not already been pushed out of a promising basin during that transient, the stabilized regime then allows it to exploit one region of the loss surface and converge more steadily.
2.4.3 Why the transient eventually disappears
As $t$ grows,
$$\beta^t \to 0 ,$$
so
$$1 - \beta^t \to 1 .$$
Therefore the initialization bias fades away with time. After enough iterations, $v_{t,i}$ behaves like a well-formed EMA of recent squared gradients rather than like an estimate still dominated by its zero initialization.
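How long "enough iterations" is can be estimated by asking when the missing weight drops below a chosen tolerance. The 1% tolerance and the helper name `steps_until_normalized` are assumptions for this sketch.

```python
import math


def steps_until_normalized(beta, tol=0.01):
    """Smallest t with beta**t <= tol, so that the total EMA weight is at least 1 - tol."""
    return math.ceil(math.log(tol) / math.log(beta))


print(steps_until_normalized(0.9))   # 44
print(steps_until_normalized(0.99))  # 459
```

The transient therefore lasts a few multiples of the effective memory length, not just one.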
Relation to Adam
This startup transient is one of the conceptual reasons Adam is more refined than plain RMSProp. Adam keeps the same second-moment idea but adds an explicit bias-correction factor $1/(1-\beta^t)$ (written $1/(1-\beta_2^t)$ in Adam’s own notation), so the early denominator is not systematically underestimated in the same way. It also adds a first-moment EMA (momentum) in the numerator, which RMSProp lacks.
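The effect of the correction on the second-moment estimate alone can be sketched as follows. This illustrates only the bias-correction factor, not the full Adam update, and the generator name `corrected_second_moment` is hypothetical.

```python
def corrected_second_moment(sq_grads, beta=0.9):
    """Yield (raw EMA, bias-corrected EMA) after each squared-gradient input."""
    v = 0.0
    for t, s in enumerate(sq_grads, start=1):
        v = beta * v + (1.0 - beta) * s
        # dividing by the total weight (1 - beta**t) renormalizes the average
        yield v, v / (1.0 - beta ** t)


for raw, corrected in corrected_second_moment([4.0] * 5):
    print(round(raw, 4), round(corrected, 4))
```

With a constant squared gradient the corrected value matches that constant from the first step (up to floating-point rounding), while the raw EMA is still climbing toward it.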
2.5 Limitations of RMSProp
Main limitations
Plain RMSProp is an important improvement over AdaGrad, but it still leaves three structural issues unresolved:
- No bias correction: the second-moment estimate starts from zero and is therefore biased downward during the first iterations. This can make the early effective step size larger than the nominal learning rate suggests.
- No first-moment smoothing: RMSProp rescales the gradient magnitude, but it does not smooth the update direction through a momentum-like EMA of gradients. As a result, the direction of motion can remain noisy when mini-batch gradients are highly variable.
- Strong sensitivity to hyperparameters: the practical behaviour of RMSProp depends noticeably on the interaction between $\eta$, $\beta$, and $\epsilon$. A choice that is stable in one regime can become too aggressive or too damped in another.
In short, RMSProp fixes AdaGrad’s infinite-memory problem, but it does not yet provide the startup control and directional stabilization that Adam adds explicitly.
2.6 Dimensional consistency
Even when parameters and losses are treated as dimensionless, it is useful to verify that the update is internally coherent:
- $g_{t,i}$ has the units of a gradient,
- $v_{t,i}$ has the units of a squared gradient,
- $\sqrt{v_{t,i}}$ has the same units as the gradient,
- therefore $g_{t,i} / \sqrt{v_{t,i}}$ is dimensionless.
This is another way to see why the square root is necessary. Without it, the numerator and denominator would not live on compatible scales.
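A practical consequence of the ratio being dimensionless is that rescaling the loss, and hence every gradient, by a constant leaves the normalized step unchanged. The sketch below checks this with the stability constant omitted, which is an assumption made so the invariance is exact up to rounding; the function name and sample gradients are illustrative.

```python
import math


def normalized_steps(grads, beta=0.9):
    """Return gradient / running RMS for each step (no eps, so gradients must be nonzero)."""
    v, out = 0.0, []
    for g in grads:
        v = beta * v + (1.0 - beta) * g * g
        out.append(g / math.sqrt(v))
    return out


grads = [0.5, -2.0, 1.5, 0.1]
scaled = [1000.0 * g for g in grads]  # same loss expressed in different units
print(normalized_steps(grads))
print(normalized_steps(scaled))
```

Both lines print the same values up to rounding. Dividing by the EMA state without the square root would instead change the result by the inverse of the rescaling constant.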
3. Summary
RMSProp repairs AdaGrad’s infinite-memory flaw by replacing cumulative gradient history with an exponentially decaying memory. Its essential ingredients (EMA of squared gradients, per-parameter adaptive denominator, exponential forgetting) preserve AdaGrad’s adaptivity while avoiding the irreversible shrinkage of the effective learning rate.
The conceptual limitation that remains is equally clear: plain RMSProp has no bias correction, so the early second-moment estimate is biased toward zero, and the startup phase can be more aggressive than the nominal learning rate suggests. Plain RMSProp also has no first-moment smoothing (no momentum-like EMA of the gradients themselves), so the direction of motion remains noisy under high-variance mini-batch gradients.
Memory cost: one extra buffer the size of the parameters (one running squared-gradient value per parameter), the same as SGD with Momentum.
Where to read next
Adam keeps RMSProp’s adaptive denominator and adds the two ingredients that plain RMSProp lacks: a first-moment EMA (momentum) in the numerator, and explicit bias correction for both moments to neutralize the startup transient. For modern deep-learning workloads, Adam (or AdamW when weight decay is involved) is usually the safer default; RMSProp remains conceptually important and is still occasionally preferred in RNN-style training where Adam’s momentum can over-smooth direction signals.
Sources
AdaGrad, RMSProp, Adam, and AdamW form one lineage of adaptive optimizers; their papers are collected in Optimization and Regularization.