Intro

AdaGrad (Adaptive Gradient) is an optimization algorithm, introduced by Duchi, Hazan, and Singer in 2011, that dynamically adjusts the learning rate for each parameter of the network based on the history of past gradients.

SGD vs AdaGrad

Unlike classical SGD, which uses a single global learning rate $\eta$, AdaGrad:

  • adapts the base learning rate individually for each parameter $\theta_i$,
  • decreases the learning rate for parameters that are updated frequently,
  • keeps the learning rate higher for rarely updated parameters, which are often the most informative.

Note

This strategy improves convergence in scenarios with sparse data or rarely activated parameters, such as in Natural Language Processing (NLP) and image recognition.


AdaGrad update rule

The parameter update in AdaGrad follows the formula:

$$\theta_{t+1,\,i} = \theta_{t,\,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}\; g_{t,i}$$

where:

  • $\theta_{t,i}$: value of the $i$-th parameter at step $t$,

  • $\eta$: initial global learning rate,

  • $g_{t,i}$: gradient component associated with parameter $\theta_i$ at step $t$,

  • $\epsilon$: small constant (on the order of $10^{-8}$) to avoid division by zero,

  • $G_{t,ii}$: cumulative sum of the squared gradients of parameter $\theta_i$ up to time $t$.
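
To make the rule concrete, here is a minimal NumPy sketch of a per-parameter AdaGrad step. The function name adagrad_step and the values eta = 0.1 and eps = 1e-8 are illustrative choices, not prescribed by the formula above:

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.1, eps=1e-8):
    """One AdaGrad update.

    theta : current parameter vector
    grad  : gradient of the loss at theta (g_t)
    G     : running sum of squared gradients (the diagonal G_{t,ii})
    """
    G += grad ** 2                          # accumulate squared gradients
    theta -= eta / np.sqrt(G + eps) * grad  # per-parameter scaled step
    return theta, G

# toy usage: theta_1 keeps receiving large gradients, theta_2 small ones
theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)
for _ in range(5):
    theta, G = adagrad_step(theta, grad=np.array([1.0, 0.01]), G=G)
print(G)      # G_11 >> G_22: the first coordinate accumulated much more
print(theta)
```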


Cumulative sum of squared gradients term

The growth of the term

$$G_{t,ii} = \sum_{k=1}^{t} g_{k,i}^{2}$$

is driven by two distinct factors:

  • Frequency (consistency of updates): this factor reflects how consistently a parameter receives a non-zero gradient. Since the sum involves squared terms (always non-negative), $G_{t,ii}$ increases at every step in which $g_{k,i} \neq 0$ and never decreases. Even a long sequence of small gradients results in a steady and significant accumulation.
  • Intensity: this factor captures the impact of the gradient magnitude. Squaring introduces a nonlinear effect: large gradients contribute disproportionately to the sum. If each $g_{k,i}$ has a large magnitude, squaring it produces an even larger contribution to $G_{t,ii}$, accelerating its growth. For instance, a single gradient of magnitude 10 (adding 100 to the sum) has ten times the impact of ten gradients of magnitude 1 (which together add only 10).

Note

Thus, $G_{t,ii}$ grows when parameter $\theta_i$ receives gradient contributions frequently and/or strongly.

This has a direct impact on the effective learning rate:

  • If $G_{t,ii}$ is large → the denominator in the update rule is large → smaller effective learning rate → the parameter update is dampened.
  • If $G_{t,ii}$ is small → the denominator in the update rule is small → larger effective learning rate → the parameter update is amplified.

Adaptive balance in AdaGrad

AdaGrad acts as an adaptive optimizer that takes into account the history of each parameter.
It enforces a form of “democracy among parameters”:

  • parameters with frequent or strong gradients are down-scaled,
  • parameters with rare or weaker gradients are emphasized.

This prevents rare gradients from being dominated by those arising more frequently, ensuring a more balanced learning process.


Numerical example

Consider two parameters, $\theta_1$ and $\theta_2$, with gradients observed over three time steps: $\theta_1$ receives a large gradient at every step, while $\theta_2$ receives only weak, sparse gradients.

Gradient accumulation

The diagonal terms are obtained as cumulative sums of squared gradients:

$$G_{t,11} = \sum_{k=1}^{t} g_{k,1}^{2}, \qquad G_{t,22} = \sum_{k=1}^{t} g_{k,2}^{2} \qquad (t = 3)$$

  • $\theta_1$ received strong and frequent gradients, leading to a large accumulated value $G_{t,11}$.
  • $\theta_2$ received weak and sparse gradients, so $G_{t,22}$ remains very small.

Effective learning rates

AdaGrad scales the base learning rate by the inverse square root of $G_{t,ii}$:

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$$

  • For $\theta_1$, the update is dampened to avoid overshooting, since this parameter has already been heavily influenced by the gradients.
  • For $\theta_2$, the update is amplified to give more weight to a parameter that would otherwise lag behind.
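
The following sketch reproduces this scenario numerically. The gradient values (1.0 at every step for $\theta_1$; 0.1, 0, 0.1 for $\theta_2$) are hypothetical, chosen only to illustrate the contrast between a frequently and a rarely updated parameter:

```python
import numpy as np

eta, eps = 0.1, 1e-8

# hypothetical gradients over three steps: column 0 -> theta_1, column 1 -> theta_2
grads = np.array([
    [1.0, 0.1],
    [1.0, 0.0],
    [1.0, 0.1],
])

G = (grads ** 2).sum(axis=0)      # cumulative sum of squared gradients
eff_lr = eta / np.sqrt(G + eps)   # effective per-parameter learning rate

print(G)       # [3.0, 0.02]      -> theta_1 accumulated far more
print(eff_lr)  # ~[0.058, 0.707]  -> theta_2 gets a much larger effective step size
```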

Interpretation

This example illustrates AdaGrad’s balancing effect:

  • Extreme parameter updates (frequent, strong gradients) get dampened to prevent them from overwhelming the optimization.
  • Rare or underrepresented parameters that get few or small updates receive higher learning rates, ensuring they still contribute to learning.

Takeaway

AdaGrad prevents parameters linked to rare signals from being overshadowed by those linked to frequent signals.
This per-parameter rescaling is the core reason why AdaGrad is especially effective in settings with sparse or imbalanced data.


AdaGrad in depth

The matrix G

In AdaGrad, a gradient accumulation matrix $G_t$ is first defined; it forms the basis of AdaGrad's adaptive mechanism.

The matrix is given by the sum of the outer products of the loss gradients across different time steps:

$$G_t = \sum_{k=1}^{t} g_k\, g_k^{\top}$$

where $g_k$ denotes the gradient of the loss at step $k$. It represents an accumulation matrix that contains, for each pair of parameters $(i, j)$, the sum of the products of their respective partial derivatives over time:

$$(G_t)_{ij} = \sum_{k=1}^{t} g_{k,i}\, g_{k,j}$$

where:

  • The diagonal elements $(G_t)_{ii}$ contain the sum of squared gradients for each parameter, i.e., the (uncentered) second moments of the gradients of individual parameters.
  • The off-diagonal elements $(G_t)_{ij}$, $i \neq j$, capture how the gradients of different parameters co-vary over time.

AdaGrad does not use the full matrix $G_t$, but only its diagonal elements:

$$G_{t,ii} = \sum_{k=1}^{t} g_{k,i}^{2}$$

where:

  • $g_{k,i}$ is the partial derivative of the loss function at step $k$ with respect to parameter $\theta_i$.
  • $G_{t,ii}$ represents the cumulative sum of squared gradients over the entire training history (i.e., from $k = 1$ to the current iteration $t$) for parameter $\theta_i$ only. $G_{t,ii}$ is then used to scale the learning rate over time, independently for each parameter.
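
As a small illustration (with made-up gradient vectors), the full outer-product accumulation and its diagonal can be computed as follows; only the diagonal is what AdaGrad actually keeps:

```python
import numpy as np

# made-up gradient vectors g_1, g_2, g_3 for a model with d = 3 parameters
grads = [
    np.array([0.5, 0.0, 0.1]),
    np.array([0.4, 0.0, 0.0]),
    np.array([0.6, 0.2, 0.1]),
]

# full accumulation matrix: sum of outer products g_k g_k^T  (d x d)
G_full = sum(np.outer(g, g) for g in grads)

# what AdaGrad stores: only the diagonal, i.e. per-parameter sums of squared gradients
G_diag = sum(g ** 2 for g in grads)

print(G_full)
print(np.diag(G_full))  # identical to G_diag
print(G_diag)
```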

Why use $\operatorname{diag}(G_t)$ instead of $G_t$

In AdaGrad, only the diagonal elements of the accumulation matrix $G_t$ are used.

  • Computational efficiency: The full $G_t$ is a $d \times d$ matrix (with $d$ being the number of parameters). For modern neural networks, $d$ can be in the millions, making storage and updates of $G_t$ completely infeasible. Using only the diagonal reduces the cost to $O(d)$ (see the back-of-the-envelope estimate after this list).

  • Sufficiency for learning rate adaptation: Each diagonal element $G_{t,ii}$ tracks the cumulative squared gradients of a single parameter $\theta_i$. This is precisely the information needed to adapt the learning rate independently per parameter, which is the central goal of AdaGrad.

  • Limited usefulness of covariances: Off-diagonal terms capture how the gradients of different parameters co-vary. While potentially informative, they are not essential for the AdaGrad update rule. In practice, the added complexity does not translate into significant optimization benefits, whereas the per-parameter scaling provided by the diagonal already captures the key adaptive behavior.

Thus, focusing on $\operatorname{diag}(G_t)$ strikes the right balance: it is computationally tractable while still providing effective per-parameter learning rate adaptation.
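
For a back-of-the-envelope sense of scale (the parameter count and 32-bit storage are illustrative assumptions), take $d = 10^7$ parameters:

$$\underbrace{d^2 \times 4\ \text{bytes} = 4 \times 10^{14}\ \text{bytes} \approx 400\ \text{TB}}_{\text{full } G_t}
\qquad \text{vs.} \qquad
\underbrace{d \times 4\ \text{bytes} = 4 \times 10^{7}\ \text{bytes} \approx 40\ \text{MB}}_{\operatorname{diag}(G_t)}$$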


The Motivation Behind AdaGrad: Preconditioning

Important

To understand why AdaGrad is so effective, especially in scenarios with sparse features, it’s useful to first explore the concept of preconditioning in optimization. The core idea is to address a common challenge in optimization: ill-conditioned problems.

The Problem of Ill-Conditioned Functions

Consider the simple convex optimization problem of minimizing a quadratic function:

$$f(x) = \tfrac{1}{2}\, x^{\top} Q x + c^{\top} x + b$$

The shape of this function’s loss surface is determined by the matrix $Q$.

  • If the eigenvalues of $Q$ are all similar, the surface resembles a round bowl, and finding the minimum is straightforward.
  • If the eigenvalues are very different, the surface looks like a steep, narrow canyon or an elongated ellipse.

This “stretched” nature is quantified by the condition number ($\kappa$), defined as the ratio of the largest to the smallest eigenvalue of $Q$:

$$\kappa = \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}$$

A large condition number ($\kappa \gg 1$) means the problem is ill-conditioned.

SGD and ill-conditioned problems

Standard gradient descent algorithms struggle in this scenario, as they tend to oscillate across the steep walls of the canyon while making very slow progress along its flat floor, which is the true path to the minimum.
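
A tiny experiment (illustrative values: $Q = \operatorname{diag}(10, 0.1)$, so $\kappa = 100$, learning rate 0.18) shows this zig-zag behavior of plain gradient descent on an ill-conditioned quadratic:

```python
import numpy as np

# ill-conditioned quadratic: f(x) = 0.5 * x^T Q x, minimum at the origin
Q = np.diag([10.0, 0.1])
print("condition number:", 10.0 / 0.1)   # 100.0

x = np.array([1.0, 1.0])
lr = 0.18                                # close to the stability limit 2/10 for the steep direction
for t in range(10):
    grad = Q @ x                         # gradient of the quadratic
    x = x - lr * grad
    print(t, x)
# the first coordinate (steep wall) flips sign each step while shrinking slowly,
# the second coordinate (flat floor) barely moves at all
```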


The Ideal (but Impractical) Solution: Full Preconditioning

Theoretically, this problem could be “fixed” by transforming the coordinate space to turn the canyon back into a perfectly round bowl. This can be achieved using the eigendecomposition of the Hessian matrix, $Q = U \Lambda U^{\top}$. By changing variables to $z = \Lambda^{1/2} U^{\top} x$, the problem becomes:

$$\min_z \; \tfrac{1}{2}\, \|z\|^2 + \bar{c}^{\top} z + b, \qquad \bar{c} = \Lambda^{-1/2} U^{\top} c$$

In this new coordinate system, each coordinate can be optimized independently, which is trivial.
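
Here is a minimal sketch of this change of variables, under simplifying assumptions (a diagonal 2×2 example with $c = 0$, so the transformed problem is just $\tfrac{1}{2}\|z\|^2$ and its condition number is exactly 1):

```python
import numpy as np

# same ill-conditioned quadratic as before: f(x) = 0.5 * x^T Q x
Q = np.array([[10.0, 0.0],
              [0.0,  0.1]])

# eigendecomposition Q = U diag(lam) U^T
lam, U = np.linalg.eigh(Q)
print("condition number before:", lam.max() / lam.min())  # 100.0

# precondition: z = Lambda^{1/2} U^T x  =>  f becomes 0.5 * ||z||^2
# the Hessian in z-coordinates is the identity, so its condition number is 1
x = np.array([1.0, 1.0])
z = np.diag(np.sqrt(lam)) @ U.T @ x

# a single gradient step with lr = 1 lands exactly on the minimum of 0.5 * ||z||^2
z_next = z - 1.0 * z
print("z after one step:", z_next)  # [0. 0.]
```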

Why is this impractical?

Computing the full set of eigenvalues and eigenvectors of $Q$ for a high-dimensional problem (like a deep neural network) is often more computationally expensive than solving the optimization problem itself.


The AdaGrad Insight: Using Gradients as a Proxy for Curvature

AdaGrad

Since directly using the Hessian is infeasible in Deep Learning, AdaGrad introduces an ingenious idea: use the history of the gradients as a proxy for the diagonal of the Hessian.

The gradient of the previous quadratic function is:

$$\nabla f(x) = Qx + c = Q\,(x - \bar{x})$$

where $\bar{x} = -Q^{-1}c$ is the location of the minimum.

This equation tells us that the magnitude of the gradient in a particular direction depends on the curvature in that direction (encoded in $Q$).

  • High Curvature Directions (steep walls of the canyon) will consistently produce large gradients.
  • Low Curvature Directions (flat floor of the canyon) will consistently produce small gradients.

Therefore, by accumulating the square of the gradients over time for each parameter, we get a running estimate of the local curvature. This is precisely what AdaGrad does. It uses this accumulated value, $G_{t,ii}$, to normalize the learning rate for each parameter $\theta_i$:

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$$

This allows AdaGrad to:

  • Decrease the learning rate in directions with large, frequent gradients (the steep walls).
  • Increase the learning rate in directions with small, infrequent gradients (the flat floor).
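
The sketch below (illustrative setup: the same diagonal $Q$ as before, plain gradient descent with an AdaGrad-style accumulator running alongside) shows that the accumulated squared gradients end up much larger along the high-curvature direction, mirroring the diagonal of $Q$:

```python
import numpy as np

Q = np.diag([10.0, 0.1])  # high curvature along x1, low along x2
x = np.array([1.0, 1.0])
G = np.zeros(2)           # AdaGrad-style accumulator
lr = 0.05

for _ in range(50):
    grad = Q @ x          # gradient of 0.5 * x^T Q x
    G += grad ** 2        # accumulate squared gradients per coordinate
    x = x - lr * grad     # plain gradient descent step

print(G)  # G[0] >> G[1]: the steep direction produced much larger gradients,
          # so AdaGrad would shrink its effective learning rate accordingly
```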

Summary

In essence, AdaGrad implements an adaptive, diagonal preconditioning scheme without ever needing to compute the Hessian, making it a powerful and efficient optimizer.


AdaGrad weakness

Intrinsic limitation of AdaGrad

In AdaGrad, each diagonal term of the accumulation matrix $G_t$:

$$G_{t,ii} = \sum_{k=1}^{t} g_{k,i}^{2}$$

grows monotonically (sums of squares cannot decrease), even for parameters updated only rarely.
As a result, the effective learning rate

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$$

decays irreversibly over time for every parameter.
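
A small simulation (illustrative: constant gradient of magnitude 1, base learning rate 0.1) makes this shrinkage visible:

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0

# even with a constant gradient of magnitude 1, the effective step keeps shrinking
for t in range(1, 10001):
    g = 1.0
    G += g ** 2
    if t in (1, 10, 100, 1000, 10000):
        print(t, eta / np.sqrt(G + eps))  # ~ eta / sqrt(t): 0.1, 0.032, 0.01, 0.0032, 0.001
```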

Side effects

  • After many iterations, steps in the loss landscape become so small that learning may slow down or even stall.
  • This internal decay adds on top of external LR schedulers (e.g., warm-up, cosine decay, etc.), causing the learning rate to collapse too quickly.
  • In non-sparse problems, this aggressive reduction of the learning rate may prevent the optimizer from reaching a satisfactory minimum.

Why this happens

AdaGrad’s mechanism mixes two different goals in a single formula:

  1. Equalizing the learning rate parameter by parameter makes each parameter’s step proportionally sensitive to its own “history” of gradients (a desirable feature of an adaptive optimizer).

  2. Making the learning rate decay (globally) over time → since $G_{t,ii}$ grows with training iterations, the factor $1/\sqrt{G_{t,ii} + \epsilon}$ causes all steps to decrease as training progresses (this should be managed by an explicit LR scheduler, not by the optimizer itself).

Consequence

AdaGrad intertwines these two mechanisms, preventing separate control of:

  • the learning rate adaptation for each parameter,
  • the global decay of the learning rate over time.

Desired behavior: LR decay is useful when kept in check

Ideally, in many real-world applications, the learning rate should start high (exploration) and then decrease gradually (exploitation).
With AdaGrad, however, this decay is endogenous: it depends entirely on gradient accumulation and cannot be directly controlled by the user.


Beyond AdaGrad

Optimizers developed as evolutions of AdaGrad (RMSProp, AdaDelta, Adam) preserve its desirable feature of per-parameter adaptation, while delegating the decay of the learning rate to an explicit LR scheduler.

| Optimizer | Key idea | How it avoids the problem |
| --- | --- | --- |
| RMSProp | Exponential moving average (EMA) of squared gradients | “Forgets” distant past gradients, preventing $G_{t,ii}$ from diverging |
| AdaDelta | EMA of squared gradients and of parameter updates | Normalizes the learning rate by balancing gradient magnitudes with past update magnitudes, ensuring stable and meaningful steps |
| Adam | RMSProp + momentum | Combines an EMA of squared gradients (denominator) and an EMA of gradients (numerator), ensuring adaptive steps that never become too small |
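
As an example of the first row of the table, a minimal sketch (the decay rate beta = 0.9 and the constant unit gradient are illustrative choices) contrasts AdaGrad’s ever-growing accumulator with an RMSProp-style exponential moving average:

```python
import numpy as np

beta, eps, eta = 0.9, 1e-8, 0.1
G_adagrad, v_rmsprop = 0.0, 0.0

for t in range(1, 1001):
    g = 1.0                                              # constant unit gradient for illustration
    G_adagrad += g ** 2                                  # AdaGrad: unbounded sum
    v_rmsprop = beta * v_rmsprop + (1 - beta) * g ** 2   # RMSProp: EMA of squared gradients

print(eta / np.sqrt(G_adagrad + eps))  # ~0.003: effective step keeps shrinking
print(eta / np.sqrt(v_rmsprop + eps))  # ~0.1:   effective step stays roughly constant
```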

Summary

AdaGrad is well-suited for very sparse problems and short training runs. For prolonged training or less sparse data, its evolutions (RMSProp, AdaDelta, Adam) allow for explicit control of the learning rate decay.