1. Intro
Limitation of AdaGrad
AdaGrad rescales each parameter by the cumulative sum of all past squared gradients. This gives useful per-parameter adaptivity, but it also means that the denominator grows monotonically throughout training. As a consequence, the effective learning rate keeps shrinking, even when very old gradients should no longer matter.
💡Hinton's intuition
If the problem comes from accumulating the entire history, then stop accumulating it uniformly. Older squared gradients should not have the same influence as recent ones. What is needed is a memory mechanism that keeps the adaptive idea but gives progressively smaller weight to the distant past.
RMSProp's core move
RMSProp does exactly this by replacing AdaGrad’s cumulative memory with an exponential moving average (EMA) of squared gradients. The result is still an adaptive per-parameter scaling rule, but one with exponential forgetting rather than infinite memory.
Important
RMSProp is therefore best understood as AdaGrad with forgetting:
- AdaGrad’s adaptivity is preserved,
- infinite memory is replaced by exponential memory decay,
- recent gradient magnitudes matter much more than old ones.
Historical note
RMSProp is traditionally attributed to Geoffrey Hinton’s lecture notes rather than to a single canonical archival paper. For that reason, minor variants coexist across libraries and textbooks. In this note, plain RMSProp is described:
- no momentum term,
- no centered variant,
- only the EMA of squared gradients in the denominator.
EMA recap
An Exponential Moving Average (EMA) is a recursive weighted average that assigns more weight to recent observations and exponentially decaying weight to older ones.
Given a sequence $x_1, x_2, \dots$ and a decay factor $\beta \in [0, 1)$, it is defined as

$$
s_t = \beta\, s_{t-1} + (1 - \beta)\, x_t, \qquad s_0 = 0.
$$

Two facts are worth keeping in mind:
- the most recent observation enters with weight $1 - \beta$;
- past observations are attenuated by successive powers of $\beta$, so older information gradually loses influence.
RMSProp applies this mechanism to the sequence of squared gradients, which is precisely what gives it a finite, exponentially decaying memory.
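The recursion above can be sketched in a few lines of Python (the input sequence and decay value here are arbitrary, chosen only to illustrate the mechanism):

```python
def ema(xs, beta):
    """Exponential moving average with zero initialization."""
    s = 0.0
    out = []
    for x in xs:
        s = beta * s + (1 - beta) * x  # newest x enters with weight (1 - beta)
        out.append(s)
    return out

# With a constant input, the EMA approaches that constant from below,
# because the zero initialization is only forgotten exponentially.
values = ema([1.0] * 50, beta=0.9)
```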
When applied to squared gradients, the EMA provides a limited-memory mechanism:
- it preserves per-parameter adaptivity, just as in AdaGrad;
- it avoids the unbounded accumulation that causes AdaGrad’s learning rate to shrink toward zero;
- it forms part of the conceptual foundation of later adaptive optimizers such as RMSProp, Adam, and AdaDelta.
Key intuition
The EMA acts as a soft forgetting mechanism. Instead of treating all past gradients as equally important, it ensures that recent gradients dominate the learning dynamics, while older ones gradually fade in influence.
2. RMSProp in depth
2.1 Update rule
Let

$$
g_{t,i} = \frac{\partial \mathcal{L}}{\partial \theta_i}\bigg|_{\theta = \theta_t}
$$

be the gradient of the loss with respect to parameter $\theta_i$ at iteration $t$.
RMSProp maintains the exponential moving average

$$
v_{t,i} = \beta\, v_{t-1,i} + (1 - \beta)\, g_{t,i}^2,
$$

with initialization $v_{0,i} = 0$.
The parameter update is then

$$
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i},
$$

where:
- $\eta$ is the base learning rate,
- $\beta$ is the decay factor of the EMA,
- $\epsilon$ is a small numerical-stability constant.

Typical choices take $\beta$ close to $1$, for example $\beta = 0.9$ or $\beta = 0.99$, so that the running estimate changes smoothly without becoming completely unresponsive.
Note
The square root in $\sqrt{v_{t,i}}$ is essential. Since $v_{t,i}$ tracks a second moment, taking the square root brings the denominator back to the same scale as the gradient.
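The update rule can be sketched as a minimal NumPy implementation (function and variable names are illustrative, not taken from any particular library):

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """One plain RMSProp step: update the EMA of squared gradients,
    then rescale the gradient per parameter."""
    v = beta * v + (1 - beta) * grad**2             # second-moment EMA
    theta = theta - lr * grad / (np.sqrt(v) + eps)  # per-parameter rescaling
    return theta, v

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)      # v_0 = 0, as in the definition above
grad = np.array([0.5, -0.5])
theta, v = rmsprop_step(theta, grad, v)
```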
Viewed through the general EMA template
RMSProp is simply the special case

$$
x_t = g_{t,i}^2, \qquad s_t = v_{t,i}.
$$

This identification is worth making explicit because it isolates the real role of the denominator: $v_{t,i}$ is not an arbitrary auxiliary variable, but the EMA state that stores recent information about the squared $i$-th gradient component.
| General EMA object | RMSProp quantity | Meaning in RMSProp |
|---|---|---|
| $s_t$ | $v_{t,i}$ | Current EMA of the squared $i$-th gradient component |
| $s_{t-1}$ | $v_{t-1,i}$ | Previous EMA state |
| $x_t$ | $g_{t,i}^2$ | Current squared gradient input |
| $\beta$ | $\beta$ | Decay factor controlling how much past information is retained |
| $(1 - \beta)\, x_t$ | $(1 - \beta)\, g_{t,i}^2$ | Contribution of the current squared gradient |
| $\sqrt{s_t}$ | $\sqrt{v_{t,i}} + \epsilon$ | RMS scale used to normalize the update of coordinate $i$ |
How the adaptive denominator rescales each coordinate
RMSProp separates two roles that are mixed together in plain SGD:
- the gradient $g_{t,i}$ determines the direction of the update,
- the factor $\frac{\eta}{\sqrt{v_{t,i}} + \epsilon}$ determines the scale of the update.
The qualitative behavior is immediate:
- if a parameter has experienced large recent gradients, then $v_{t,i}$ becomes large and the effective step on that parameter is reduced;
- if a parameter has experienced small or infrequent gradients, then $v_{t,i}$ remains smaller and that parameter is updated more aggressively.
In this sense, RMSProp is a per-parameter normalization mechanism based on recent gradient magnitudes.
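The normalization effect can be checked numerically with a toy example (the gradient magnitudes are arbitrary): after the EMA has settled, a coordinate with consistently large gradients and a coordinate with consistently small gradients end up taking effective steps of roughly the same size.

```python
import numpy as np

beta, eps = 0.9, 1e-8
v = np.zeros(2)
g = np.array([10.0, 0.1])  # coordinate 0: large gradients; coordinate 1: small

# Feed the same gradients repeatedly so the EMA converges to g**2.
for _ in range(100):
    v = beta * v + (1 - beta) * g**2

# The normalized step g / (sqrt(v) + eps) is close to 1 for BOTH coordinates,
# even though the raw gradients differ by a factor of 100.
step = g / (np.sqrt(v) + eps)
```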
2.2 Why EMA fixes AdaGrad’s main limitation
The essential difference between AdaGrad and RMSProp is the memory model.
AdaGrad uses

$$
G_{t,i} = \sum_{k=1}^{t} g_{k,i}^2,
$$

so every past squared gradient contributes forever. RMSProp replaces this cumulative sum with the recursion

$$
v_{t,i} = \beta\, v_{t-1,i} + (1 - \beta)\, g_{t,i}^2.
$$
EMA unrolling
To see the memory mechanism directly, it is useful to expand the recursion step by step.
Step 1

$$
v_{1,i} = (1 - \beta)\, g_{1,i}^2
$$

Step 2

$$
v_{2,i} = \beta (1 - \beta)\, g_{1,i}^2 + (1 - \beta)\, g_{2,i}^2
$$

Step 3

$$
v_{3,i} = \beta^2 (1 - \beta)\, g_{1,i}^2 + \beta (1 - \beta)\, g_{2,i}^2 + (1 - \beta)\, g_{3,i}^2
$$

By induction, this leads to the explicit formula:

$$
v_{t,i} = (1 - \beta) \sum_{k=1}^{t} \beta^{t-k}\, g_{k,i}^2.
$$

Thus each past squared gradient $g_{k,i}^2$ is weighted by

$$
(1 - \beta)\, \beta^{t-k}.
$$
This has two immediate consequences:
- the contribution of a past gradient decays exponentially with its age;
- the optimizer has finite effective memory, even though the formula is recursive and never explicitly drops terms.
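The equivalence between the recursive and the unrolled form can be verified numerically (random scalar gradients, arbitrary $\beta$):

```python
import random

beta = 0.9
gs = [random.gauss(0.0, 1.0) for _ in range(20)]

# Recursive form, as implemented in the optimizer.
v = 0.0
for g in gs:
    v = beta * v + (1 - beta) * g**2

# Explicit unrolled form: weight (1 - beta) * beta**(t - k) on g_k^2.
t = len(gs)
v_explicit = sum((1 - beta) * beta**(t - k) * g**2
                 for k, g in enumerate(gs, start=1))

# The two agree up to floating-point error.
```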
Effective memory length
A useful rule of thumb is that the EMA remembers roughly

$$
\frac{1}{1 - \beta}
$$

iterations. Thus:
- if $\beta = 0.9$, the memory horizon is on the order of $10$ steps;
- if $\beta = 0.99$, it is on the order of $100$ steps.
This is not an exact cutoff, but it gives an excellent intuition for how far into the past the optimizer is effectively looking.
Note
When $v_{0,i} = 0$, the weights on the observed squared gradients sum to

$$
(1 - \beta) \sum_{k=1}^{t} \beta^{t-k} = 1 - \beta^t,
$$

not yet to $1$. So during the startup phase, $v_{t,i}$ is not a fully normalized weighted average. This is exactly the source of the initialization bias discussed below.
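The partial weight sum $1 - \beta^t$ is easy to confirm directly. With $\beta = 0.9$, one step covers only 10% of a fully normalized average, and even after ten steps (one effective-memory horizon) the weights cover only about 65%:

```python
beta = 0.9

# Sum of the EMA weights (1 - beta) * beta**(t - k) for k = 1..t.
totals = {t: sum((1 - beta) * beta**(t - k) for k in range(1, t + 1))
          for t in (1, 10, 100)}

# Each total equals 1 - beta**t: far below 1 early on,
# essentially 1 once t is large compared with 1 / (1 - beta).
```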
2.3 A mnemonic reading of “RMSProp”
The name can be used as a compact memory aid for the mechanism. This section should be read as a mnemonic, not as a claim about official historical etymology.
| Term | Mechanistic meaning in RMSProp |
|---|---|
| **R**oot | The update uses $\sqrt{v_{t,i}}$, not $v_{t,i}$ itself |
| **M**ean | $v_{t,i}$ is a moving average of squared gradients |
| **S**quare | The gradients enter through $g_{t,i}^2$ |
| **Prop** | The running estimate $v_{t,i}$ is propagated recursively from one iteration to the next |
Note
This mnemonic is useful because it fixes the algorithm in memory in the correct order:
- square the gradient,
- average it over time with an EMA,
- take the root,
- use the result to propagate an adaptive scaling across iterations.
2.4 Initialization bias
2.4.1 The EMA is biased toward zero at the beginning
Because $v_{0,i} = 0$, the early values of the running second-moment estimate are systematically biased downward.
If the squared-gradient statistics are approximately stable during the first iterations, then one may write $g_{k,i}^2 \approx g^2$ for all $k \le t$.
Under this approximation,

$$
v_{t,i} = (1 - \beta) \sum_{k=1}^{t} \beta^{t-k}\, g_{k,i}^2 \approx (1 - \beta^t)\, g^2.
$$

Hence the estimate underestimates the true recent second moment during the startup phase. This matters because $\sqrt{v_{t,i}}$ appears in the denominator of the update rule.
For the common choice $\beta = 0.9$, the startup behavior can be made very concrete:

| Iteration | Computed EMA state | Weight given to current squared gradient |
|---|---|---|
| $t = 1$ | $v_{1,i} = 0.1\, g_{1,i}^2$ | still only $0.1$ on the newest term |
This does not mean that the denominator is generically “equal to $\epsilon$”. It means something more precise: during the first iterations, the EMA has not yet accumulated enough recent squared-gradient information, so the denominator can remain systematically smaller than the long-run scale it is meant to represent.
2.4.2 First-step behavior
At the first iteration,

$$
v_{1,i} = (1 - \beta)\, g_{1,i}^2.
$$

Therefore the first RMSProp update is

$$
\theta_{2,i} = \theta_{1,i} - \frac{\eta}{\sqrt{(1 - \beta)\, g_{1,i}^2} + \epsilon}\, g_{1,i}.
$$

If $\epsilon$ is negligible compared with $\sqrt{1 - \beta}\, |g_{1,i}|$, then

$$
\theta_{2,i} \approx \theta_{1,i} - \frac{\eta}{\sqrt{1 - \beta}}\, \operatorname{sign}(g_{1,i}).
$$

For the common choice $\beta = 0.9$,

$$
\frac{1}{\sqrt{1 - \beta}} = \frac{1}{\sqrt{0.1}} \approx 3.16.
$$

So the first RMSProp step is approximately $3.16\,\eta$ in magnitude, regardless of the size of the first gradient.
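The $\approx 3.16\,\eta$ amplification can be reproduced directly (the learning rate and first gradient are arbitrary; $\epsilon$ is small enough to be negligible here):

```python
import math

lr, beta, eps = 0.01, 0.9, 1e-8
g1 = 0.5                       # first gradient (arbitrary)

v1 = (1 - beta) * g1**2        # EMA state after one step
step = lr * g1 / (math.sqrt(v1) + eps)

# Ratio of the actual first step to the base learning rate:
# approximately 1 / sqrt(1 - beta) = 1 / sqrt(0.1) ~= 3.16.
amplification = step / lr
```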
Warning
This is the precise mathematical statement behind the usual qualitative warning: plain RMSProp can make the initial effective step scale larger than the base learning rate. This does not mean that RMSProp must diverge at the start. It means something narrower and more defensible:
- the denominator is initially underestimated,
- the early effective step size can therefore be amplified,
- whether that amplification is harmless or harmful depends on $\eta$, $\beta$, the gradient scale, and the local geometry of the objective.
Loss-landscape intuition
At the start of training, plain RMSProp can take very large effective steps across the loss surface. Instead of descending gradually into a promising valley, the model may:
- overshoot minima entirely,
- bounce across different regions,
- or, in unstable regimes, even diverge.
In that sense, the trajectory can spend its early iterations “wandering” across the loss landscape before the EMA has accumulated enough recent gradient information.
Once $v_{t,i}$ has accumulated enough recent squared gradients, the denominator grows accordingly. This activates RMSProp’s self-normalization mechanism: the effective learning rate is gradually reduced, the turbulent startup phase fades, and the optimizer moves into a more controlled exploitation regime.
If the model has not already been pushed out of a promising basin during that transient, the stabilized regime then allows it to exploit one region of the loss surface and converge more steadily.
2.4.3 Why the transient eventually disappears
As $t$ grows,

$$
\beta^t \to 0,
$$

so

$$
1 - \beta^t \to 1.
$$

Therefore the initialization bias fades away with time. After enough iterations, $v_{t,i}$ behaves like a well-formed EMA of recent squared gradients rather than like an estimate still dominated by its zero initialization.
Relation to Adam
This startup transient is one of the conceptual reasons Adam is more refined than plain RMSProp. Adam keeps the same second-moment idea but adds an explicit bias-correction factor, so the early denominator is not systematically underestimated in the same way.
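The bias-correction idea can be sketched in a toy stationary case (the constant squared-gradient value is arbitrary): dividing the raw EMA by $1 - \beta^t$ exactly undoes the startup underestimate.

```python
beta = 0.9
g2 = 4.0          # constant squared gradient (toy stationary case)

v = 0.0
for t in range(1, 6):
    v = beta * v + (1 - beta) * g2
    v_hat = v / (1 - beta**t)   # Adam-style bias correction

# After 5 steps the raw v equals (1 - beta**5) * g2 and still
# underestimates g2, while the corrected v_hat recovers g2 exactly.
```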
2.5 Limitations of RMSProp
Main limitations
Plain RMSProp is an important improvement over AdaGrad, but it still leaves three structural issues unresolved:
- No bias correction: the second-moment estimate starts from zero and is therefore biased downward during the first iterations. This can make the early effective step size larger than the nominal learning rate suggests.
- No first-moment smoothing: RMSProp rescales the gradient magnitude, but it does not smooth the update direction through a momentum-like EMA of gradients. As a result, the direction of motion can remain noisy when mini-batch gradients are highly variable.
- Strong sensitivity to hyperparameters: the practical behavior of RMSProp depends noticeably on the interaction between $\eta$, $\beta$, and $\epsilon$. A choice that is stable in one regime can become too aggressive or too damped in another.
In short, RMSProp fixes AdaGrad’s infinite-memory problem, but it does not yet provide the startup control and directional stabilization that later optimizers such as Adam add explicitly.
2.6 Dimensional consistency
Even when parameters and losses are treated as dimensionless, it is useful to verify that the update is internally coherent:
- $g_{t,i}$ has the units of a gradient,
- $v_{t,i}$ has the units of a squared gradient,
- $\sqrt{v_{t,i}}$ has the same units as the gradient,
- therefore $\dfrac{g_{t,i}}{\sqrt{v_{t,i}} + \epsilon}$ is dimensionless.
This is another way to see why the square root is necessary. Without it, the numerator and denominator would not live on compatible scales.
3. Summary
RMSProp is best viewed as the optimizer that repairs AdaGrad’s infinite-memory flaw by replacing cumulative gradient history with an exponentially decaying memory.
Its essential ingredients are:
- an EMA of squared gradients,
- a per-parameter adaptive denominator,
- exponential forgetting of distant past information.
Its main strengths are:
- it preserves per-parameter adaptivity,
- it avoids AdaGrad’s irreversible shrinkage of the effective learning rate,
- it reacts to recent gradient scales rather than to the full training history.
Its main conceptual limitation is equally clear:
- plain RMSProp has no bias correction,
- so the early second-moment estimate is biased toward zero,
- and the startup phase can therefore be more aggressive than the nominal learning rate suggests.
Practical note
RMSProp remains an important optimizer historically and conceptually. In modern deep-learning practice, however, Adam is often preferred because it augments the same adaptive denominator with momentum in the numerator and with explicit bias correction during the startup phase.