1. Introduction
Before introducing adaptive optimizers such as AdaGrad, RMSProp, and Adam, one must first understand the optimization pathology they were designed to address.
That pathology is the presence of sparse gradients.
Question
Why can classical SGD become inefficient when only a small subset of parameters receives informative updates?
2. What sparse gradients mean
Sparse gradients
A gradient is called sparse when, at a given iteration or over a long temporal window, only a small subset of its coordinates carries most of the useful learning signal. The remaining coordinates are:
- exactly zero,
- close to zero,
- or dominated by weak and uninformative contributions.
This notion should be interpreted broadly.
Note
In deep learning, sparsity does not necessarily mean that most coordinates are identically zero in a strict algebraic sense. It often means something slightly more general and more relevant in practice:
- only a few parameters receive updates frequently;
- only a few coordinates receive large gradients;
- or only a few parts of the network are meaningfully involved in learning for a given example.
Three important sources of this phenomenon are:
- label sparsity,
- activation sparsity,
- data sparsity.
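Before looking at these sources in turn, the broad definition above can be made concrete. The following sketch (all magnitudes are invented for illustration) builds a gradient vector in which a handful of coordinates carry the signal, then measures what fraction of coordinates are negligible relative to the largest entry:

```python
import numpy as np

# Hypothetical gradient: 20 informative coordinates out of 1000,
# plus weak background noise on every coordinate.
rng = np.random.default_rng(0)
grad = np.zeros(1000)
grad[rng.choice(1000, size=20, replace=False)] = rng.normal(0.0, 1.0, size=20)
grad += rng.normal(0.0, 1e-4, size=1000)

# A coordinate counts as "near zero" if its magnitude is tiny
# relative to the largest coordinate.
threshold = 1e-2 * np.abs(grad).max()
sparsity = np.mean(np.abs(grad) < threshold)
print(f"fraction of near-zero coordinates: {sparsity:.2f}")
```

Note that almost every coordinate is near zero even though none is exactly zero, which is precisely the looser, practical sense of sparsity described above.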
3. Sources of sparse gradients
3.1 Sparse labels
In many real-world tasks, the patterns of interest occur only rarely, while the majority of training examples correspond to background, absence, or negative cases.
This means that the parameters specialized for detecting the rare signal are activated only occasionally, whereas parameters associated with common background structure are updated far more often.
Sparse labels in image recognition
Consider training a model to detect remote controls in images. Most images do not contain the target object. As a result:
- the parameters relevant to remote-control detection receive informative gradients only on a small subset of examples;
- the parameters associated with background statistics are updated almost everywhere.
Sparse labels in medical imaging
The imbalance is often even stronger in medical imaging. A small lesion inside a large scan may occupy only a tiny fraction of the image, so the gradient signal attached to the pathological region is vastly outnumbered by gradients associated with healthy tissue.
Consequence
SGD then spends most of its time fitting what is common rather than what is important. Rare but semantically crucial structures can be updated too infrequently to compete with the cumulative effect of background-driven gradients.
3.2 Sparse activations
Neural networks can generate gradient sparsity even when the data distribution itself is not sparse.
Activation functions such as ReLU introduce large regions where the output is exactly zero. Techniques such as Dropout reinforce the same effect by intentionally silencing subsets of units during training.
Whenever a neuron is inactive, its local contribution to backpropagation is reduced or eliminated. Consequently, many parameters may receive no update, or only a negligible one, at that iteration.
Consequence
Gradient flow becomes highly uneven across the network. Some parameters are trained repeatedly, while others remain dormant for long stretches of optimization, creating a strong asymmetry in how learning signal is distributed over time.
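This mechanism is easy to verify directly. The sketch below (a toy single layer, with made-up sizes) backpropagates through a ReLU and checks that every weight row feeding an inactive unit receives an exactly-zero gradient:

```python
import numpy as np

# Toy layer: 16 units, 8 inputs.
rng = np.random.default_rng(1)
x = rng.normal(size=8)
W = rng.normal(size=(16, 8))

pre = W @ x                     # pre-activations
h = np.maximum(pre, 0.0)        # ReLU forward

# Backprop with an upstream gradient of all ones (for illustration).
dh = np.ones_like(h)
dpre = dh * (pre > 0)           # ReLU derivative: 1 where active, 0 elsewhere
dW = np.outer(dpre, x)          # gradient w.r.t. the weights

inactive = pre <= 0
print("inactive units:", inactive.sum())
print("their weight rows got zero gradient:", bool(np.all(dW[inactive] == 0.0)))
```

Every weight attached to a silenced unit sits out the update entirely, which is exactly the unevenness in gradient flow described above.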
3.3 Sparse data
Sparsity may also be inherent in the representation of the input itself.
This is especially common when examples are encoded through high-dimensional features in which only a few entries are active at a time.
Typical situations include:
- one-hot or multi-hot representations in language and recommendation systems;
- bag-of-words style feature vectors, where each example activates only a tiny subset of the vocabulary;
- engineered or selective feature maps in which most coordinates are zero and only a few capture task-relevant structure.
In such settings, only the parameters connected to the active features are significantly updated on a given step.
Consequence
The optimization problem becomes intrinsically sparse at the coordinate level. The issue is then not merely produced by the network architecture, but already present in the data representation that drives the gradients.
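A minimal sketch makes the coordinate-level sparsity visible. Assuming a hypothetical linear softmax classifier over one-hot inputs (all sizes are illustrative), only the weight column attached to the single active feature receives a nonzero gradient:

```python
import numpy as np

vocab_size, n_classes = 10, 3
rng = np.random.default_rng(2)
W = rng.normal(size=(n_classes, vocab_size))

x = np.zeros(vocab_size)
x[4] = 1.0                                      # one-hot input: feature 4 active

logits = W @ x
probs = np.exp(logits) / np.exp(logits).sum()   # softmax
y = 1                                           # target class
dlogits = probs.copy()
dlogits[y] -= 1.0                               # softmax cross-entropy gradient
dW = np.outer(dlogits, x)                       # gradient w.r.t. W

nonzero_cols = np.where(np.any(dW != 0.0, axis=0))[0]
print("columns with nonzero gradient:", nonzero_cols)
```

With multi-hot or bag-of-words inputs the same structure holds: the gradient touches only the columns of the active features, and every other parameter is frozen for that step.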
4. Why sparse gradients are difficult for classical SGD
Limitation of classical SGD
Classical SGD applies the same global learning rate to every coordinate of the gradient vector. It therefore treats all parameters as if they should move on comparable scales, even when the frequency and informativeness of their updates differ dramatically.
Suppose the update is

$$\theta_{t+1,\,i} = \theta_{t,\,i} - \eta \, g_{t,\,i},$$

where $\eta$ is the global learning rate and $g_{t,i}$ is the gradient of the loss with respect to parameter $i$ at iteration $t$.
This rule is uniform across coordinates. If a parameter receives:
- frequent gradients, it accumulates many updates;
- rare gradients, it moves only occasionally;
- small gradients, it moves only weakly even when updated.
The problem is therefore cumulative. A rare coordinate is penalized twice:
- it is updated less often;
- when it is updated, its gradient may already be small compared with dominant coordinates.
Important
Sparse gradients do not mean that rare coordinates are unimportant. On the contrary, in many tasks the rare coordinates are precisely the ones carrying the most discriminative information.
Update imbalance
Under a single global learning rate, dominant coordinates keep accumulating progress, while rare but potentially crucial coordinates lag behind. Classical SGD therefore tends to under-invest in the parameters that are hardest to train precisely because they appear infrequently.
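The double penalty can be seen numerically. In the toy run below (gradient frequencies and magnitudes are invented for illustration), one coordinate receives a gradient of 1.0 at every step while another receives a gradient of 0.1 only once every ten steps, and both share the same global learning rate:

```python
import numpy as np

eta = 0.1
theta = np.zeros(2)
for t in range(100):
    g = np.array([1.0, 0.0])    # coordinate 0: frequent, large gradient
    if t % 10 == 0:
        g[1] = 0.1              # coordinate 1: rare, small gradient
    theta -= eta * g            # identical step size for both coordinates

print("total movement per coordinate:", np.abs(theta))
```

The frequent coordinate moves 100 times farther than the rare one: a factor of 10 from update frequency multiplied by a factor of 10 from gradient magnitude, which is the cumulative imbalance described above.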
5. The natural remedy: adaptive coordinatewise scaling
The basic response to this problem is conceptually simple:
- keep a global base learning rate $\eta$,
- but rescale it separately for each parameter.
This leads to the generic adaptive update

$$\theta_{t+1,\,i} = \theta_{t,\,i} - \eta \, s_{t,\,i} \, g_{t,\,i},$$

where $s_{t,i}$ is a parameter-specific scaling factor.
The role of $s_{t,i}$ is to encode some notion of how aggressively coordinate $i$ should currently be updated.
| Method | Update rule |
|---|---|
| Classical SGD | $\theta_{t+1,\,i} = \theta_{t,\,i} - \eta \, g_{t,\,i}$ |
| Adaptive SGD | $\theta_{t+1,\,i} = \theta_{t,\,i} - \eta \, s_{t,\,i} \, g_{t,\,i}$ |
Effective learning rate
The quantity

$$\eta_{t,\,i} = \eta \, s_{t,\,i}$$

is the effective learning rate of parameter $i$ at iteration $t$. Adaptive optimizers differ precisely in how they define and update the scaling factor $s_{t,i}$.
Important
This establishes a form of coordinatewise balance:
- parameters that have already received strong or frequent updates can be damped;
- parameters that are rarely updated can be granted a relatively larger effective step.
In this sense, adaptive optimization introduces a form of democracy among updates: learning effort is no longer distributed uniformly by rule, but reallocated according to how much each parameter actually needs to move.
Example: rare object recognition
If a parameter is associated with a rare object class, its gradient may appear infrequently. A coordinatewise adaptive rule can compensate for that rarity by assigning it a larger effective learning rate than the one given to parameters that are already being updated constantly.
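This compensation can be demonstrated with one concrete choice of scaling, an AdaGrad-style $s_{t,i} = 1/(\sqrt{G_{t,i}} + \epsilon)$ where $G_{t,i}$ accumulates squared gradients (the gradient pattern below is the same invented one as before, used purely for illustration):

```python
import numpy as np

eta, eps = 0.1, 1e-8
theta = np.zeros(2)
G = np.zeros(2)                  # running sum of squared gradients per coordinate
for t in range(100):
    g = np.array([1.0, 0.0])     # coordinate 0: frequent, large gradient
    if t % 10 == 0:
        g[1] = 0.1               # coordinate 1: rare, small gradient
    G += g ** 2
    theta -= eta * g / (np.sqrt(G) + eps)   # coordinatewise rescaled step

effective = eta / (np.sqrt(G) + eps)
print("final effective learning rates:", effective)
```

The rarely-updated coordinate ends up with a much larger effective learning rate than the constantly-updated one, which is exactly the compensation described above.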
Summary
Adaptive optimization begins with a simple principle: do not force all gradient coordinates to live under the same learning-rate geometry. Once gradients are sparse or highly imbalanced, parameterwise scaling becomes a natural extension of SGD rather than a minor adjustment.
6. Adaptive optimizers overview
This general idea gives rise to a family of adaptive methods:
- AdaGrad: accumulates squared gradients and scales each coordinate accordingly;
- RMSProp: replaces cumulative history with an exponential moving average;
- Adam: combines adaptive scaling with momentum and bias correction;
- AdamW: decouples weight decay from Adam-style adaptive updates.
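The first two accumulators can be contrasted in a few lines. The sketch below (a simplified single-coordinate view, ignoring Adam's momentum and bias-correction terms) shows that AdaGrad's cumulative sum makes step sizes shrink forever, while RMSProp's exponential moving average lets them level off:

```python
import numpy as np

def adagrad_scale(grads, eta=0.01, eps=1e-8):
    """Per-step effective learning rates under AdaGrad's cumulative sum."""
    G = np.cumsum(np.asarray(grads) ** 2)
    return eta / (np.sqrt(G) + eps)

def rmsprop_scale(grads, eta=0.01, beta=0.9, eps=1e-8):
    """Per-step effective learning rates under RMSProp's moving average."""
    v, out = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g ** 2
        out.append(eta / (np.sqrt(v) + eps))
    return np.array(out)

grads = [1.0] * 50                       # constant gradient stream
a, r = adagrad_scale(grads), rmsprop_scale(grads)
print("AdaGrad keeps shrinking:", a[-1] < a[0])
print("RMSProp levels off near:", r[-1])
```

Adam keeps RMSProp's moving-average denominator but additionally smooths the gradient itself with momentum and corrects the bias of both averages early in training.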
Info
The next notes can be read as progressively more refined answers to the same question: how should the optimizer redistribute learning effort when informative gradients are sparse, imbalanced, or unevenly distributed across parameters?