Hinton’s intuition: forget the distant past
Limitation of AdaGrad
AdaGrad's weakness is that it accumulates the entire history of squared gradients.
As training progresses, the denominator in the update rule grows unbounded, causing the effective learning rate for each parameter to shrink continuously.
💡 Idea (G. Hinton)
If the problem comes from summing the entire history, then stop summing it all.
Instead, assign decreasing weights to older gradients, so that recent information matters more than distant past.
Solution: use EMA
The solution builds on a widely applicable concept that goes beyond Deep Learning itself: the Exponential Moving Average (EMA).
Exponential Moving Average (EMA)
Definition
EMA definition
The Exponential Moving Average (EMA) is a type of infinite impulse response filter that applies exponentially decaying weights to past observations.
Unlike a simple moving average, which assigns equal importance to all past values within a fixed window, the EMA emphasizes recent data points while gradually “forgetting” the distant past.
This makes it particularly useful for applications such as time-series normalization and, in Deep Learning, for controlling the accumulation of gradient information.

Given a sequence of data points $x_1, x_2, \dots$ and a decay factor $\beta \in [0, 1)$, the EMA is defined recursively as:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, x_t, \qquad v_0 = 0$$

- The most recent value $x_t$ contributes with weight $(1 - \beta)$.
- Each past contribution is attenuated by a factor of $\beta$ per time step → the further back in time, the smaller its influence on the current EMA.
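To make the recursion concrete, here is a minimal Python sketch of the EMA (a runnable illustration; the names `ema`, `beta`, and the step signal are ours, not from any library):

```python
import numpy as np

def ema(xs, beta=0.9):
    """Exponential moving average: v_t = beta * v_{t-1} + (1 - beta) * x_t, with v_0 = 0."""
    v, out = 0.0, []
    for x in xs:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

# On a step signal, the EMA "forgets" the old level and tracks the new one:
signal = np.concatenate([np.zeros(50), np.ones(50)])
print(ema(signal)[[49, 54, 59, 99]])  # ≈ [0.00, 0.41, 0.65, 0.99]
```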
Why EMA matters for optimization
When applied to the squared gradients, the EMA provides a “limited memory” mechanism:
- It preserves per-parameter adaptivity, just as in AdaGrad.
- It avoids the unbounded accumulation that causes AdaGrad’s learning rate to shrink towards zero.
- It forms the foundation of modern adaptive optimizers such as RMSProp, Adam, and AdaDelta.
Key intuition
The EMA acts as a soft forgetting mechanism.
Instead of treating all past gradients as equally important, it ensures that recent gradients dominate the dynamics of learning, while older ones gradually fade in influence.
RMSProp in depth
Update rule
The parameter update in RMSProp follows the rule:

$$v_{t,i} = \beta\, v_{t-1,i} + (1 - \beta)\, g_{t,i}^2$$

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon}\, g_{t,i}$$

where:

| General EMA formula | Specific case: RMSProp for $v_{t,i}$ | Meaning in RMSProp |
|---|---|---|
| $v_t$ | $v_{t,i}$ | Current value of the exponential moving average of the squared $i$-th gradient component |
| $v_{t-1}$ | $v_{t-1,i}$ | Exponential moving average, at step $t-1$, of the squared $i$-th gradient component |
| $x_t$ | $g_{t,i}^2$ | Current input: squared value of the $i$-th gradient component at step $t$ |
| $\beta$ | $\beta$ | Decay rate: controls how much past information is remembered |
| $(1 - \beta)\, x_t$ | $(1 - \beta)\, g_{t,i}^2$ | Contribution of the current $i$-th gradient component |

Here $g_{t,i}$ denotes the $i$-th component of the gradient at step $t$, $\eta$ the base learning rate, and $\epsilon$ a small constant for numerical stability.
Note
The denominator $\sqrt{v_{t,i}} + \epsilon$ averages the squared past gradient components for parameter $\theta_i$, but does so by assigning greater weight to the more recent ones.
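As a sketch of how the update rule above translates into code (the hyperparameter values `lr=1e-3`, `beta=0.9`, `eps=1e-8` are common illustrative defaults, not mandated by the text):

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp update: EMA of squared gradients, then a per-parameter normalized step."""
    v = beta * v + (1 - beta) * grad**2             # v_{t,i} = beta * v_{t-1,i} + (1-beta) * g_{t,i}^2
    theta = theta - lr * grad / (np.sqrt(v) + eps)  # theta_{t+1,i} = theta_{t,i} - eta * g_{t,i} / (sqrt(v_{t,i}) + eps)
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.1, 10.0])  # gradient components 100x apart in scale
theta, v = rmsprop_step(theta, grad, v)
print(theta)  # both coordinates moved by the same magnitude (~0.003): per-parameter adaptivity
```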
Why is it called RMSProp
| Term | Explanation |
|---|---|
| **R**oot | The square root of the EMA of squared gradients, $\sqrt{v_{t,i}}$, is computed. This brings the adaptive term back to the same scale as the gradient, preventing numerical imbalance in the update. Without the square root, the denominator would grow too quickly, making the effective learning rate excessively small. The root stabilizes and normalizes the effective learning rate used in the parameter update. |
| **M**ean | An average (moving average) is used instead of a single gradient observation, producing a more stable estimate. |
| **S**quare | The gradient component is squared in the EMA formula to measure only its magnitude (the sign is irrelevant). This provides an indicator of the gradient’s energy, useful for modulating the effective learning rate applied to parameter $\theta_i$. |
| **Prop**agation | The estimate is not reset at each step but updated recursively. This way, the effect of past $i$-th gradient components propagates across iterations, making the effective learning rate for $\theta_i$ continuously adaptive. |
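A quick numeric illustration of the “Root” and “Mean/Square” steps (the sample values are ours): the root of the mean of squares returns to the gradient’s own scale, while the mean of squares alone does not.

```python
import numpy as np

rng = np.random.default_rng(0)
g = 0.05 * rng.standard_normal(10_000)  # gradient samples with typical magnitude ~0.05

mean_sq = np.mean(g**2)  # "Mean" of "Square": ~0.0025, the wrong scale for a denominator
rms = np.sqrt(mean_sq)   # "Root": ~0.05, back on the same scale as the gradient itself
print(mean_sq, rms)
```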
Unrolling of the EMA formula
To fully understand the behavior of RMSProp, it is useful to recursively expand the definition of the exponential moving average (EMA) of the squared gradients step by step.
This makes it clear how past $i$-th gradient components gradually lose influence over time, illustrating the mechanism of exponential memory decay.
Let’s unroll the recursion for the first three steps, assuming $v_{0,i} = 0$:
Step 1
$$v_{1,i} = (1 - \beta)\, g_{1,i}^2$$
Step 2
$$v_{2,i} = \beta\, v_{1,i} + (1 - \beta)\, g_{2,i}^2 = \beta (1 - \beta)\, g_{1,i}^2 + (1 - \beta)\, g_{2,i}^2$$
Step 3
$$v_{3,i} = \beta\, v_{2,i} + (1 - \beta)\, g_{3,i}^2 = \beta^2 (1 - \beta)\, g_{1,i}^2 + \beta (1 - \beta)\, g_{2,i}^2 + (1 - \beta)\, g_{3,i}^2$$
By induction, this leads to the explicit formula:
$$v_{t,i} = (1 - \beta) \sum_{k=1}^{t} \beta^{\,t-k}\, g_{k,i}^2$$
From this expression it can be observed that:
- Each squared gradient term $g_{k,i}^2$ is multiplied by a coefficient $(1 - \beta)\, \beta^{\,t-k}$,
- The further back in time a gradient component is, the smaller its impact on $v_{t,i}$.
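This equivalence between the recursive and explicit forms can be checked numerically (a small sketch, assuming $v_{0,i} = 0$ as above):

```python
import numpy as np

beta, t = 0.9, 20
g2 = np.random.default_rng(1).random(t)  # stand-ins for the squared components g_{k,i}^2

# Recursive form: v_t = beta * v_{t-1} + (1 - beta) * g_t^2
v = 0.0
for x in g2:
    v = beta * v + (1 - beta) * x

# Explicit unrolled form: v_t = (1 - beta) * sum_k beta^(t-k) * g_k^2
k = np.arange(1, t + 1)
v_explicit = (1 - beta) * np.sum(beta ** (t - k) * g2)

print(np.isclose(v, v_explicit))  # True
```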
Effect of exponential decay
Since $0 \le \beta < 1$, the coefficient $(1 - \beta)\, \beta^{\,t-k}$ applies an exponentially decreasing weight to past $i$-th gradient components.
Here, $t$ denotes the current iteration, while $k$ indexes a past iteration ($1 \le k \le t$).
Thus, the further back in time a squared gradient term is (i.e., the larger $t - k$), the less influence it has on $v_{t,i}$.
Recent gradient components dominate the value of the moving average.

For example, with $\beta = 0.99$:
- $\beta^{1} = 0.99$,
- $\beta^{100} \approx 0.37$,
- $\beta^{500} \approx 0.0066$,
- $\beta^{1000} \approx 0.00004$.
This shows that gradient components older than just a few hundred iterations carry virtually no weight: the optimizer effectively “forgets” the distant past, favoring an up-to-date and responsive estimate of gradient variability.
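These magnitudes are straightforward to reproduce (same $\beta$ as in the example above):

```python
beta = 0.99
for age in (1, 100, 500, 1000):  # iterations elapsed since the gradient was observed
    print(age, beta ** age)      # 0.99, ~0.366, ~0.0066, ~0.000043
```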
Important
This exponential decay ensures that recent gradients exert the most influence on the current value of $v_{t,i}$, while past information is progressively “forgotten.”
This mechanism is the very essence of RMSProp’s adaptive behavior.
RMSProp weakness
EMA is biased towards zero
With $v_{0,i} = 0$ and $\beta = 0.99$, the exponential moving average

$$v_{1,i} = \beta \cdot 0 + (1 - \beta)\, g_{1,i}^2 = 0.01\, g_{1,i}^2$$

initially incorporates only 1% of the current squared gradient.

| Iteration | $v_{t,i}$ computed | Weight given to current gradient |
|---|---|---|
| $t = 1$ | $0.01\, g_{1,i}^2$ | 1% |
| $t = 2$ | $0.0099\, g_{1,i}^2 + 0.01\, g_{2,i}^2$ | ≈ 1% |

During these very first iterations, the denominator $\sqrt{v_{t,i}} + \epsilon$ in the RMSProp update rule is still extremely small (close to $\epsilon$).
As a result, the effective learning rate $\frac{\eta}{\sqrt{v_{t,i}} + \epsilon}$ becomes extremely large.
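A worked example makes the scale of the problem concrete (the values $\eta = 10^{-3}$ and $g_{1,i} = 0.1$ are illustrative assumptions): $v_{1,i} = 0.01 \cdot (0.1)^2 = 10^{-4}$, so the first step has magnitude

$$\frac{\eta}{\sqrt{v_{1,i}} + \epsilon}\, g_{1,i} \approx \frac{10^{-3}}{10^{-2}} \cdot 0.1 = 10^{-2},$$

100 times larger than the plain gradient step $\eta\, g_{1,i} = 10^{-4}$.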
The risk of “wild jumps” in the loss landscape at the start of training
As a consequence, at the start of training, the RMSProp optimizer takes giant uncontrolled steps across the loss surface.
Instead of gradually descending into a promising valley, the model may:
- overshoot minima entirely,
- bounce chaotically across different regions,
- or even diverge.
This behavior can be described as the model “wandering around” the loss landscape, wasting early epochs without meaningful convergence.
Only after this short but turbulent transient phase does the EMA begin to accumulate enough recent gradient information.
Once $v_{t,i}$ has accumulated enough recent gradients, the denominator in the update rule grows accordingly.
This activates RMSProp’s self-normalization mechanism, which gradually reduces the effective learning rate and brings the optimizer out of its turbulent startup phase.
From that point on, RMSProp transitions into a stable regime of more controlled descent (exploitation). At that point, if the model has not already been pushed out of promising valleys of the loss surface, the stabilized learning rate allows it to exploit one basin and converge more steadily toward a minimum.
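A small simulation of this transient (constants are illustrative; a constant unit gradient is assumed so that only the warm-up effect is visible):

```python
import numpy as np

lr, beta, eps = 1e-3, 0.99, 1e-8
v = 0.0
for t in range(1, 1001):
    g = 1.0                                # constant gradient magnitude, for illustration
    v = beta * v + (1 - beta) * g**2       # v_t = 1 - beta^t when g is constant at 1
    if t in (1, 10, 100, 1000):
        print(t, lr / (np.sqrt(v) + eps))  # effective learning rate
# t=1: ~1e-2 (10x the base lr) -> t=1000: ~1e-3 (stabilized at lr)
```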