Intro

The discussion now turns to alternatives to Stochastic Gradient Descent (SGD).

Question

What limitations of classical SGD motivated the introduction of these alternatives?


The sparse gradient problem in SGD

Sparse gradients

Sparse gradients occur when only a small subset of the gradient vector carries significant information, while the majority of components are close to zero or correspond to less relevant parameters.

Sources of gradient sparsity:

  • Label sparsity
  • Activation sparsity
  • Data sparsity

1. Sparse labels

In many real-world datasets, the objects of interest appear only rarely, while most of the data corresponds to irrelevant background.
This sparsity at the label level creates a significant imbalance during training.

Sparse labels in image recognition

Consider training a neural network to recognize remote controls in images.

During training, the neurons responsible for detecting the remote control activate only rarely, since the object appears in just a few images.

This results in temporal sparsity of activations: only a small fraction of the network “wakes up” when the target object is actually present.

Meanwhile, most parameter updates occur when the network processes images without remote controls, i.e., images dominated by background or irrelevant objects.

Real images typically contain mostly background pixels, with only a few corresponding to the target object.
As a consequence:

  • most neurons in the network specialize in recognizing background,
  • neurons associated with rare objects receive very few updates,
  • the learning dynamics become heavily biased toward irrelevant information.

Sparse labels in medical imaging

This issue is even more pronounced in medical imaging.
A small lesion within a very large image contributes nonzero gradients to only a handful of components, while the vast majority corresponds to healthy tissue (background).

Consequence

This imbalance implicitly treats the background as equally important as the target object, allowing background-related gradients to dominate the optimization process.


2. Sparse activations

Activation functions such as ReLU and its variants, together with techniques like Dropout, make neural networks intrinsically sparse.

In many situations, the activation output is identically zero across entire regions of the input domain. As a result, large portions of the network remain inactive or “silent” for certain inputs, since many neurons do not fire at all.

Consequence

This sparsity in activations directly affects the flow and distribution of gradients during backpropagation: when a neuron does not activate, it also does not propagate gradients backward, reducing the number of parameters that receive updates in that training step.
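
As a rough illustration, the NumPy sketch below (a toy single ReLU layer with made-up data, not taken from the original material) backpropagates by hand: wherever the pre-activation is negative, the ReLU output is exactly zero and the corresponding gradient entries are zeroed out as well, so the attached weights receive no update in that step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single layer: z = x @ W, a = ReLU(z)
x = rng.normal(size=(4, 6))    # batch of 4 inputs with 6 features
W = rng.normal(size=(6, 5))    # 5 hidden units

z = x @ W                      # pre-activations
a = np.maximum(z, 0.0)         # ReLU output: many entries are exactly zero

# Assume the upstream gradient dL/da is all ones, just to trace the flow.
upstream = np.ones_like(a)

# ReLU backward pass: the gradient passes only where the neuron fired (z > 0).
dz = upstream * (z > 0)

# Gradient w.r.t. the weights; columns tied to "silent" units get little or no signal.
dW = x.T @ dz

print("fraction of zero activations:  ", np.mean(a == 0))
print("fraction of zero entries in dz:", np.mean(dz == 0))
```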


3. Sparse data

Sparsity can also originate directly from the data:

  • In computer vision, certain types of extracted features, such as Haar-like features, are deliberately designed to be compact and selective, focusing only on specific aspects of the visual content. This design makes them inherently sparse: most feature values are zero or uninformative, while only a small subset captures meaningful information.
  • In many applications, the majority of raw data is uninformative or redundant, while only a small subset of components carries the meaningful information that drives learning.

Consequence

This means that sparsity is not just a property of the network or activations, but can be inherent in the data representation itself.
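
To make this concrete, consider the short sketch below (a plain linear model with squared loss and an invented sparse input, used only for illustration): the gradient with respect to the weights is proportional to the input vector, so every zero feature produces a zero gradient component.

```python
import numpy as np

# A sparse input: only 3 of 20 features are non-zero (values chosen arbitrarily)
x = np.zeros(20)
x[[2, 7, 15]] = [0.8, -1.2, 0.5]

w = np.random.default_rng(1).normal(size=20)
y = 1.0

# Squared loss L = 0.5 * (w @ x - y)**2, so dL/dw = (w @ x - y) * x
error = w @ x - y
grad = error * x

# The gradient inherits the sparsity pattern of the input.
print("non-zero gradient components:", np.flatnonzero(grad))
```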


Why SGD fails with sparse gradients

Limitation of classical SGD

One of the main limitations of SGD is its inability to properly handle sparse gradients.
The gradient vector, which contains the partial derivatives of the loss with respect to each network parameter, is often treated as if all components contributed equally, following the implicit assumption that “one counts as one”.

Important

In practice, not all gradient components contribute equally to learning.

Uniform treatment of all components, the hallmark of classical SGD, is no longer adequate.
Some parameters correspond to strong signals and should be emphasized, while others correspond to weak or irrelevant signals.
Ignoring this disparity can significantly undermine the effectiveness of the optimization process.

Despite sparsity being present at all levels of a neural network, in classical SGD all components of the gradient vector are scaled by the same learning rate $\eta$.
Some components (e.g., those associated with rare objects) are updated much less frequently than others (such as those related to the background).
Moreover, whenever components produce weak gradients, they naturally lead to weak updates, further diminishing their overall contribution to learning.

Update imbalance in classical SGD

When all components are treated in the same way, the rare or weak-gradient components are overshadowed by the dominant ones.

This imbalance in the dynamics of parameter updates directly contributes to the phenomenon of learning slowdown.
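
A minimal numerical sketch of this slowdown (with invented gradients) is shown below: two weights share the same learning rate, but one receives a gradient at every step while the other receives one only about 5% of the time, so its total movement stays small.

```python
import numpy as np

eta = 0.1                  # single learning rate shared by all parameters
rng = np.random.default_rng(2)

w_frequent, w_rare = 0.0, 0.0

for t in range(100):
    g_frequent = 1.0                                 # gradient arrives every step
    g_rare = 1.0 if rng.random() < 0.05 else 0.0     # gradient arrives rarely

    # Classical SGD: the same eta scales every component.
    w_frequent -= eta * g_frequent
    w_rare -= eta * g_rare

print("total movement of the frequent component:", abs(w_frequent))
print("total movement of the rare component:    ", abs(w_rare))
```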


Solution: Adaptive SGD

Solution

The problem of sparse gradients has a natural remedy: scaling the learning rate individually for each network parameter. Each parameter $w_i$ has its own scaling factor $\nu_{t,i}$, which modulates the base learning rate $\eta$, amplifying the learning rate for parameters that received fewer updates in the past iterations.

Method          Update rule
Classical SGD   $w_{t+1,i} = w_{t,i} - \eta \, g_{t,i}$
Adaptive SGD    $w_{t+1,i} = w_{t,i} - \frac{\eta}{\nu_{t,i}} \, g_{t,i}$

where $g_{t,i}$ denotes the $i$-th component of the gradient at step $t$ and $\nu_{t,i}$ is the scaling factor associated with parameter $w_i$.
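
A small worked comparison (illustrative numbers only) shows the effect: with base learning rate $\eta = 0.1$, a frequently updated component with a large accumulated factor $\nu_{t,i} = 10$ takes an effective step of $\eta / \nu_{t,i} = 0.1 / 10 = 0.01$, while a rarely updated component with $\nu_{t,j} = 0.5$ takes an effective step of $0.1 / 0.5 = 0.2$, twenty times larger.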

Note

From now on, the effective learning rate is always expressed as a combination of the base learning rate $\eta$ and the adaptive factor $\nu_{t,i}$ specific to each network parameter, i.e., $\eta_{t,i} = \eta / \nu_{t,i}$.

Important

This approach establishes a form of “democracy among gradient components”: weaker components are strengthened, while dominant ones are moderated.

Example: Object recognition

In the context of object recognition in images, introducing a scaling factor for the learning rate has the following effects:

  • Gradient components associated with rarer objects become more significant (since they may have a smaller scaling factor $\nu_{t,i}$ in the denominator).
  • Less important components have a larger scaling factor $\nu_{t,i}$ in the denominator, so their updates are dampened.

Summary

Adaptive SGD is a strategy that rescales the learning rate $\eta$ in the update step in a customized way for each network parameter $w_i$, through the introduction of an appropriate scaling factor $\nu_{t,i}$.
The term adaptive reflects this ability to adjust to the importance of each parameter, making the optimization process more balanced and effective, especially in the presence of sparse or imbalanced data.
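
As a minimal sketch of how such a scaling factor can be computed in practice, the NumPy code below uses an AdaGrad-style choice for $\nu_{t,i}$ (the square root of the accumulated sum of squared gradients); the gradients are simulated, and the exact form of $\nu_{t,i}$ varies between the optimizers discussed next.

```python
import numpy as np

def adaptive_sgd_step(w, grad, accum, eta=0.1, eps=1e-8):
    """One adaptive SGD update with a per-parameter scaling factor.

    accum holds the running sum of squared gradients (AdaGrad-style), so
    components with a long history of large gradients get a smaller
    effective learning rate, while rarely updated ones keep a larger one.
    """
    accum += grad ** 2
    nu = np.sqrt(accum) + eps     # per-parameter scaling factor nu_{t,i}
    w -= (eta / nu) * grad        # effective learning rate eta / nu differs per component
    return w, accum

# Toy usage: one frequently updated component and one rarely updated one.
w = np.zeros(2)
accum = np.zeros(2)
rng = np.random.default_rng(3)

for t in range(100):
    g = np.array([1.0, 1.0 if rng.random() < 0.05 else 0.0])
    w, accum = adaptive_sgd_step(w, g, accum)

print("effective learning rates at the end:", 0.1 / (np.sqrt(accum) + 1e-8))
print("weights after 100 steps:            ", w)
```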


Adaptive optimizers overview

Building on this idea, a number of adaptive optimizers were proposed between 2011 and 2019:

  • AdaGrad (2011)
  • RMSprop (2012), proposed by Geoffrey Hinton
  • Adam (2014), the de facto standard in modern Deep Learning
  • AdamW (2019), Adam with decoupled weight decay