Intro
AdaGrad (Adaptive Gradient) is an optimization algorithm, introduced by Duchi, Hazan, and Singer (2011), that dynamically adjusts the learning rate for each parameter of the network based on the history of its past gradients.
SGD vs AdaGrad
Unlike classical SGD, which uses a single global learning rate $\eta$, AdaGrad:
- adapts the base learning rate individually for each parameter $\theta_i$,
- decreases the learning rate for parameters that are updated frequently,
- keeps the learning rate higher for rarely updated parameters, which are often the most informative.
Note
This strategy improves convergence performance in scenarios with sparse data or rarely activated parameters, such as in Natural Language Processing (NLP) and Image Recognition.
AdaGrad update rule
The parameter update in AdaGrad follows the formula:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

where:
- $\theta_{t,i}$: value of the $i$-th parameter at step $t$,
- $\eta$: initial global learning rate,
- $g_{t,i}$: gradient component associated with parameter $i$ at step $t$,
- $\epsilon$: small constant (usually $10^{-8}$) to avoid division by zero,
- $G_{t,ii}$: cumulative sum of the squared gradients of parameter $i$ up to time $t$.
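For concreteness, here is a minimal NumPy sketch of this update. The gradient function, learning rate, and target vector are hypothetical choices for illustration, not part of the original text.

```python
import numpy as np

def adagrad_update(theta, G, grad, lr=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale each parameter's step."""
    G += grad ** 2                           # G_{t,ii} = running sum of squared gradients
    theta -= lr / np.sqrt(G + eps) * grad    # per-parameter effective learning rate
    return theta, G

# Illustrative usage with a stand-in gradient of a simple quadratic loss
theta = np.zeros(3)          # parameters
G = np.zeros(3)              # accumulator, one entry per parameter

def grad_fn(theta):
    return 2.0 * (theta - np.array([1.0, -2.0, 0.5]))

for t in range(100):
    theta, G = adagrad_update(theta, G, grad_fn(theta))
```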
Cumulative sum of squared gradients term
The growth of the term $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ is driven by two distinct factors:
- Frequency (Consistency of updates): this factor reflects how consistently a parameter receives a non-zero gradient. Since the sum involves squared terms (always non-negative), $G_{t,ii}$ increases inexorably at every step in which $g_{\tau,i} \neq 0$. Even a long sequence of small gradients results in a steady and significant accumulation.
- Intensity: this factor captures the impact of the gradient magnitude. Squaring introduces a nonlinear effect: large gradients contribute disproportionately to the sum. If each $g_{\tau,i}$ has a large magnitude, squaring it produces an even larger contribution to $G_{t,ii}$, accelerating its growth. For instance, a single gradient of magnitude 10 (adding 100 to the sum) has ten times the impact of ten gradients of magnitude 1 (which together add only 10).
Note
Thus, $G_{t,ii}$ grows when parameter $i$ receives gradient contributions frequently and/or strongly.
This has a direct impact on the effective learning rate:
- If $G_{t,ii}$ is large → the denominator in the update rule is large → smaller effective learning rate → the parameter update is dampened.
- If $G_{t,ii}$ is small → the denominator in the update rule is small → larger effective learning rate → the parameter update is amplified.
Adaptive balance in AdaGrad
AdaGrad acts as an adaptive optimizer that takes into account the history of each parameter.
It enforces a form of “democracy among parameters”:
- parameters with frequent or strong gradients are down-scaled,
- parameters with rare or weaker gradients are emphasized.
This prevents rare gradients from being dominated by those arising more frequently, ensuring a more balanced learning process.
Numerical example
Consider two parameters, $\theta_1$ and $\theta_2$, with the following gradients observed over three time steps:
| step | $g_{t,1}$ | $g_{t,2}$ |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |
Gradient accumulation
The diagonal terms are obtained as cumulative sums of squared gradients, $G_{3,ii} = \sum_{t=1}^{3} g_{t,i}^2$:
- $\theta_1$ received strong and frequent gradients, leading to a large accumulated value $G_{3,11}$.
- $\theta_2$ received weak and sparse gradients, so $G_{3,22}$ remains very small.
Effective learning rates
AdaGrad scales the base learning rate $\eta$ by the inverse square root of $G_{t,ii}$:

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$$

- For $\theta_1$, the update is dampened to avoid overshooting, since this parameter has already been heavily influenced by the gradients.
- For $\theta_2$, the update is amplified to give more weight to a parameter that would otherwise lag behind.
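Below is a purely illustrative sketch of this computation. The gradient values `g1` and `g2` are hypothetical, chosen only to match the qualitative description above (strong and frequent for $\theta_1$, weak and sparse for $\theta_2$); they are not the values from the original table.

```python
import numpy as np

eta, eps = 0.1, 1e-8

# Hypothetical gradients over three steps
g1 = np.array([5.0, 4.0, 6.0])   # theta_1: strong, frequent
g2 = np.array([0.0, 0.1, 0.0])   # theta_2: weak, sparse

G11 = np.sum(g1 ** 2)            # large accumulator for theta_1 (77.0)
G22 = np.sum(g2 ** 2)            # tiny accumulator for theta_2 (0.01)

lr1 = eta / np.sqrt(G11 + eps)   # ~0.0114: strongly dampened step for theta_1
lr2 = eta / np.sqrt(G22 + eps)   # ~1.0: much larger effective step for theta_2

print(G11, G22)
print(lr1, lr2)
```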
Interpretation
This example illustrates AdaGrad’s balancing effect:
- Extreme parameter updates (frequent, strong gradients) get dampened to prevent them from overwhelming the optimization.
- Rare or underrepresented parameters that get few or small updates receive higher learning rates, ensuring they still contribute to learning.
Takeaway
AdaGrad prevents parameters linked to rare signals from being overshadowed by those linked to frequent signals.
This per-parameter rescaling is the core reason why AdaGrad is especially effective in settings with sparse or imbalanced data.
AdaGrad in depth
The matrix G
In AdaGrad, a gradient accumulation matrix $G_t$ is first defined; it forms the basis of AdaGrad’s adaptive mechanism.
The matrix is given by the sum of the outer products of the loss gradients across different time steps:

$$G_t = \sum_{\tau=1}^{t} g_\tau \, g_\tau^\top$$

It represents an accumulation matrix that contains, for each pair of parameters $(i, j)$, the sum of the products of their respective partial derivatives over time:

$$[G_t]_{ij} = \sum_{\tau=1}^{t} g_{\tau,i} \, g_{\tau,j}$$
where:
- The diagonal elements $[G_t]_{ii}$ contain the sum of squared gradients for each parameter, i.e., the (unnormalized, uncentered) second moments of the individual parameters' gradients.
- The off-diagonal elements $[G_t]_{ij}$ (with $i \neq j$) represent the (unnormalized, uncentered) covariances between gradients of different parameters.
Expansion of the outer product in the matrix
The outer product $g_\tau g_\tau^\top$ can be expressed as the product of a column vector and its transpose.
In general, the outer product of two column vectors $u, v \in \mathbb{R}^d$ yields a $d \times d$ matrix whose entries are all possible products of the vector components:

$$u v^\top = \begin{bmatrix} u_1 v_1 & \cdots & u_1 v_d \\ \vdots & \ddots & \vdots \\ u_d v_1 & \cdots & u_d v_d \end{bmatrix}$$

In our case, the gradient can be written as:

$$g_\tau = \big( g_{\tau,1}, \, g_{\tau,2}, \, \dots, \, g_{\tau,d} \big)^\top$$

Thus, the outer product becomes:

$$g_\tau g_\tau^\top = \begin{bmatrix} g_{\tau,1}^2 & g_{\tau,1} g_{\tau,2} & \cdots & g_{\tau,1} g_{\tau,d} \\ g_{\tau,2} g_{\tau,1} & g_{\tau,2}^2 & \cdots & g_{\tau,2} g_{\tau,d} \\ \vdots & \vdots & \ddots & \vdots \\ g_{\tau,d} g_{\tau,1} & g_{\tau,d} g_{\tau,2} & \cdots & g_{\tau,d}^2 \end{bmatrix}$$
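As a quick sanity check of this structure, the sketch below builds the outer product for one hypothetical gradient vector (the values are arbitrary, chosen only for illustration):

```python
import numpy as np

g = np.array([1.0, -2.0, 3.0])   # a hypothetical gradient at one step
outer = np.outer(g, g)           # d x d matrix of all pairwise products
print(outer)
# [[ 1. -2.  3.]
#  [-2.  4. -6.]
#  [ 3. -6.  9.]]
print(np.diag(outer))            # the squared gradients: [1. 4. 9.]
```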
AdaGrad does not use the full matrix $G_t$, but only its diagonal elements:

$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

where:
- $g_{\tau,i}$ is the partial derivative of the loss function at step $\tau$, with respect to parameter $\theta_i$.
- $G_{t,ii}$ represents the cumulative sum of squared gradients over the entire history of training (i.e., from $\tau = 1$ to the current iteration $t$) for parameter $i$ only. $G_{t,ii}$ is then used to scale the learning rate over time, independently for each parameter.
Why use $\operatorname{diag}(G_t)$ instead of the full $G_t$
In AdaGrad, only the diagonal elements of the accumulation matrix $G_t$ are used, for three main reasons:
- Computational efficiency: the full $G_t$ is a $d \times d$ matrix (with $d$ being the number of parameters). For modern neural networks, $d$ can be in the millions, making storage and updates of $G_t$ completely infeasible. Using only the diagonal reduces the cost to $O(d)$.
- Sufficiency for learning rate adaptation: each diagonal element $G_{t,ii}$ tracks the cumulative squared gradients of a single parameter $\theta_i$. This is precisely the information needed to adapt the learning rate independently per parameter, which is the central goal of AdaGrad.
- Limited usefulness of covariances: off-diagonal terms represent covariances between gradients of different parameters. While potentially informative, they are not essential for the AdaGrad update rule. In practice, the added complexity does not translate into significant optimization benefits, whereas the per-parameter scaling provided by the diagonal already captures the key adaptive behavior.
Thus, focusing on $\operatorname{diag}(G_t)$ strikes the right balance: it is computationally tractable while still providing effective per-parameter learning rate adaptation.
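The following sketch contrasts the full accumulation with the diagonal-only accumulation that AdaGrad actually maintains; the dimension and the random gradients are made up purely for illustration:

```python
import numpy as np

d = 4                                   # tiny dimension; real networks have millions
rng = np.random.default_rng(0)

G_full = np.zeros((d, d))               # full accumulation matrix: O(d^2) memory
G_diag = np.zeros(d)                    # AdaGrad's accumulator: O(d) memory

for t in range(10):
    g = rng.normal(size=d)              # stand-in for the gradient at step t
    G_full += np.outer(g, g)            # sum of outer products
    G_diag += g ** 2                    # only the squared components

# The diagonal of the full matrix and the cheap accumulator coincide
assert np.allclose(np.diag(G_full), G_diag)
```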
The Motivation Behind AdaGrad: Preconditioning
Important
To understand why AdaGrad is so effective, especially in scenarios with sparse features, it’s useful to first explore the concept of preconditioning in optimization. The core idea is to address a common challenge in optimization: ill-conditioned problems.
The Problem of Ill-Conditioned Functions
Consider the simple convex optimization problem of minimizing a quadratic function:

$$f(x) = \frac{1}{2} x^\top A x - b^\top x$$

The shape of this function’s loss surface is determined by the matrix $A$.
- If the eigenvalues of $A$ are all similar, the surface resembles a round bowl, and finding the minimum is straightforward.
- If the eigenvalues are very different, the surface looks like a steep, narrow canyon or an elongated ellipse.
This “stretched” nature is quantified by the condition number ($\kappa$), defined as the ratio of the largest to the smallest eigenvalue of $A$:

$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$

A large condition number ($\kappa(A) \gg 1$) means the problem is ill-conditioned.
SGD and ill-conditioned problems
Standard gradient descent algorithms struggle in this scenario, as they tend to oscillate across the steep walls of the canyon while making very slow progress along its flat floor, which is the true path to the minimum.
The Ideal (but Impractical) Solution: Full Preconditioning
Theoretically, this problem could be “fixed” by transforming the coordinate space to turn the canyon back into a perfectly round bowl. This can be achieved using the eigendecomposition of the Hessian matrix, $A = Q \Lambda Q^\top$. By changing variables to $z = \Lambda^{1/2} Q^\top x$, the problem becomes:

$$f(z) = \frac{1}{2} \|z\|^2 - \big(\Lambda^{-1/2} Q^\top b\big)^\top z$$
In this new coordinate system, each coordinate can be optimized independently, which is trivial.
Why is this impractical?
Computing the full set of eigenvalues and eigenvectors of $A$ for a high-dimensional problem (like a deep neural network) is often more computationally expensive than solving the optimization problem itself.
The Adagrad Insight: Using Gradients as a Proxy for Curvature
AdaGrad
Since directly using the Hessian is infeasible in Deep Learning, AdaGrad introduces an ingenious idea: use the history of the gradients as a proxy for the diagonal of the Hessian.
The gradient of the previous quadratic function is:

$$\nabla f(x) = A x - b = A \, (x - x^\star)$$

where $x^\star$ is the location of the minimum.
Derivation of the Quadratic Gradient
To establish the analytical relationship between the gradient and the Hessian matrix in a quadratic context, consider the standard objective function:

$$f(x) = \frac{1}{2} x^\top A x - b^\top x$$

where $A$ is a symmetric positive-definite matrix representing the Hessian, and $b$ is a constant vector.
1. First-Order Derivative
The gradient of $f$ with respect to $x$ is obtained through the application of matrix calculus identities (using $\nabla_x \tfrac{1}{2} x^\top A x = A x$ for symmetric $A$, and $\nabla_x \, b^\top x = b$):

$$\nabla_x f(x) = A x - b$$
2. Optimality Condition
Let $x^\star$ denote the global minimizer of the function. At the point of optimality, the first-order condition requires the gradient to vanish:

$$\nabla_x f(x^\star) = A x^\star - b = 0$$

This equality permits the expression of the constant vector $b$ in terms of the Hessian and the optimal coordinate:

$$b = A x^\star$$
3. Synthesis of the Final Form
Substituting the resulting expression for $b$ back into the general gradient formula yields:

$$\nabla_x f(x) = A x - A x^\star$$

Factoring the matrix $A$ leads to the definitive representation:

$$\nabla_x f(x) = A \, (x - x^\star)$$
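A quick numerical check of this identity, with an arbitrary illustrative matrix $A$ and vector $b$:

```python
import numpy as np

A = np.array([[10.0, 0.0],
              [0.0, 1.0]])            # symmetric positive-definite "Hessian"
b = np.array([2.0, 3.0])

x_star = np.linalg.solve(A, b)        # minimizer: A x* = b
x = np.array([1.5, -0.5])             # an arbitrary point

grad_direct = A @ x - b               # gradient from the definition
grad_factored = A @ (x - x_star)      # gradient in factored form
assert np.allclose(grad_direct, grad_factored)
```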
This equation tells us that the magnitude of the gradient in a particular direction depends on the curvature in that direction (encoded in $A$).
- High Curvature Directions (steep walls of the canyon) will consistently produce large gradients.
- Low Curvature Directions (flat floor of the canyon) will consistently produce small gradients.
Therefore, by accumulating the square of the gradients over time for each parameter, we get a running estimate of the local curvature. This is precisely what AdaGrad does. It uses this accumulated value, $G_{t,ii}$, to normalize the learning rate for each parameter $i$:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

This allows AdaGrad to:
- Decrease the learning rate in directions with large, frequent gradients (the steep walls).
- Increase the learning rate in directions with small, infrequent gradients (the flat floor).
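To make this concrete, here is a small self-contained comparison on an ill-conditioned 2-D quadratic; the matrix, learning rates, and iteration count are arbitrary illustrative choices, not values from the text above.

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5 * x^T A x, minimum at the origin
A = np.diag([100.0, 1.0])                 # condition number kappa = 100

def grad(x):
    return A @ x

steps, eps = 200, 1e-8

# Plain gradient descent: a single global learning rate for both coordinates.
# It must stay below 2/lambda_max = 0.02 to avoid diverging along the steep axis.
x_gd = np.array([1.0, 1.0])
for _ in range(steps):
    x_gd = x_gd - 0.009 * grad(x_gd)

# AdaGrad: the per-coordinate accumulator shrinks the step along the steep
# direction and keeps it larger along the flat one.
x_ada, G, eta = np.array([1.0, 1.0]), np.zeros(2), 0.5
for _ in range(steps):
    g = grad(x_ada)
    G += g ** 2
    x_ada = x_ada - eta / np.sqrt(G + eps) * g

print("GD:     ", x_gd)    # still far from 0 along the flat (second) coordinate
print("AdaGrad:", x_ada)   # progress is balanced across both coordinates
```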
Summary
In essence, AdaGrad implements an adaptive, diagonal preconditioning scheme without ever needing to compute the Hessian, making it a powerful and efficient optimizer.
AdaGrad weakness
Intrinsic limitation of AdaGrad
In AdaGrad, each diagonal term of the accumulation matrix $G_t$:

$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

grows monotonically (sums of squares cannot decrease), even for parameters updated only rarely.
As a result, the effective learning rate $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$ decays irreversibly over time for every parameter.
Side effects
- After many iterations, steps in the loss landscape become so small that learning may slow down or even stall.
- This internal decay adds on top of external LR schedulers (e.g., warm-up, cosine decay, etc.), causing the learning rate to collapse too quickly.
- In non-sparse problems, this aggressive reduction of the learning rate may prevent the optimizer from reaching a satisfactory minimum.
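A small sketch of this endogenous decay, assuming (purely for illustration) a stream of constant unit gradients:

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0

# With a constant gradient of 1, G grows linearly with t,
# so the effective learning rate shrinks like eta / sqrt(t).
for t in range(1, 10001):
    G += 1.0 ** 2
    if t in (1, 100, 10000):
        print(t, eta / np.sqrt(G + eps))
# t = 1 -> 0.1, t = 100 -> 0.01, t = 10000 -> 0.001: the step size collapses
# even though the gradient signal itself has not changed.
```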
Why this happens
AdaGrad’s mechanism mixes two different goals in a single formula:
1. Equalizing the learning rate parameter by parameter → makes each update proportionally sensitive to that parameter's own “history” of gradients (a desirable feature of an adaptive optimizer).
2. Making the learning rate decay globally over time → since $G_{t,ii}$ grows with the training iterations, the factor $1/\sqrt{G_{t,ii}}$ causes all steps to shrink as training progresses (something that should be managed by an explicit LR scheduler, not by the optimizer itself).
Consequence
AdaGrad intertwines these two mechanisms, preventing separate control of:
- the learning rate adaptation for each parameter,
- the global decay of the learning rate over time.
Desired behavior: LR Decay is useful, when kept in check
Ideally, in many real-world applications, the learning rate should start high (exploration) and then decrease gradually (exploitation).
With AdaGrad, however, this decay is endogenous: it depends entirely on gradient accumulation and cannot be directly controlled by the user.
Beyond AdaGrad
Optimizers developed as evolutions of AdaGrad (RMSProp, AdaDelta, Adam) preserve its desirable feature of per-parameter adaptation, while delegating the decay of the learning rate to an explicit LR scheduler.
| Optimizer | Key idea | How it avoids the problem |
|---|---|---|
| RMSProp | Exponential moving average (EMA) of squared gradients | “Forgets” distant past gradients, preventing the accumulated term from growing without bound |
| AdaDelta | EMA of squared gradients and parameter updates | Normalizes the learning rate by balancing gradient magnitudes with past update magnitudes, ensuring stable and meaningful steps |
| Adam | RMSProp + momentum | Combines EMA of squared gradients (denominator) and EMA of gradients (numerator), ensuring adaptive steps that never become too small |
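A minimal sketch contrasting the two accumulators (the constant gradient and the decay rate 0.9 are illustrative choices):

```python
import numpy as np

g, beta = 1.0, 0.9          # constant gradient and a typical EMA decay rate
G_adagrad, v_rmsprop = 0.0, 0.0

for t in range(1, 1001):
    G_adagrad += g ** 2                                  # grows without bound
    v_rmsprop = beta * v_rmsprop + (1 - beta) * g ** 2   # saturates near g**2

print(G_adagrad)   # 1000.0 -> the effective learning rate keeps shrinking
print(v_rmsprop)   # ~1.0   -> the effective learning rate stabilizes
```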
Summary
AdaGrad is well-suited for very sparse problems and short training runs. For prolonged training or less sparse data, its evolutions (RMSProp, AdaDelta, Adam) allow for explicit control of the learning rate decay.