1. Intro
AdaGrad (Adaptive Gradient) is an optimizer introduced by Duchi, Hazan, and Singer in 2011 that assigns a different effective learning rate to each parameter by using the history of past gradients.
Its central idea is simple:
- parameters that have already received many large gradients should be updated more cautiously;
- parameters that receive small or infrequent gradients should remain more reactive.
This makes AdaGrad especially appealing when the signal is sparse, imbalanced, or concentrated on only a subset of parameters.
Theoretical context
AdaGrad was originally developed in the setting of online convex optimization, not in the language of modern deep-learning loss-landscape intuition. In the original 2011 paper, the central theoretical question is therefore not “does the optimizer move well inside a nonconvex neural-loss surface?”, but rather “how well does the sequence of decisions produced by the algorithm perform over time?”
This is why the analysis is phrased in terms of a regret bound. Informally, the regret after $T$ iterations measures how much cumulative loss the algorithm has suffered, compared with the loss that would have been obtained by the best single fixed parameter vector chosen in hindsight after seeing the entire sequence of data. A small regret means that, even though the algorithm did not know the future, its running sequence of updates performed nearly as well as that best fixed comparator.
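In symbols, writing $f_t$ for the loss incurred at round $t$ (a standard formulation of the setting, not quoted verbatim from the paper), the regret after $T$ rounds is:

```latex
R(T) \;=\; \sum_{t=1}^{T} f_t(\theta_t) \;-\; \min_{\theta \in \Theta} \sum_{t=1}^{T} f_t(\theta)
```

Sublinear growth of $R(T)$, i.e. $R(T)/T \to 0$, means the algorithm's average per-round loss approaches that of the best fixed comparator.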
SGD vs AdaGrad
Unlike plain SGD, which uses one global learning rate for all coordinates, AdaGrad:
- adapts the effective step size individually for each parameter,
- reduces the step size along coordinates that have accumulated large gradients,
- preserves relatively larger steps along coordinates that have seen little activity.
Note
This note explains AdaGrad itself through its standard coordinatewise update rule. A full-matrix viewpoint is introduced later only to clarify the geometric intuition behind that rule.
2. AdaGrad update rule
Let

$$g_{t,i} = \nabla_{\theta_i} \mathcal{L}(\theta_t)$$

be the gradient of the loss with respect to parameter $\theta_i$ at iteration $t$.

AdaGrad defines the accumulated squared-gradient term

$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

and updates the parameter according to

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}} + \epsilon}\, g_{t,i}$$

where:
- $\eta$ is the base learning rate,
- $\epsilon$ is a small numerical-stability constant,
- $G_{t,ii}$ is the cumulative sum of squared gradients associated with parameter $i$ up to iteration $t$.

The quantity

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,ii}} + \epsilon}$$

is the effective learning rate for coordinate $i$ at step $t$.
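A minimal NumPy sketch of this rule (function and variable names are illustrative, not from any library):

```python
import numpy as np

def adagrad_step(theta, g, G, eta=0.01, eps=1e-8):
    """One AdaGrad update. G accumulates per-parameter squared gradients."""
    G = G + g * g                                  # G_{t,ii}: running sum of g^2
    theta = theta - eta / (np.sqrt(G) + eps) * g   # coordinatewise update
    return theta, G

theta = np.array([1.0, 1.0])
G = np.zeros(2)
g = np.array([10.0, 0.1])   # one steep coordinate, one shallow coordinate
theta, G = adagrad_step(theta, g, G)
print(G)       # G is now [100.0, 0.01]
print(theta)
```

Note that on the very first step $\sqrt{G_{t,ii}}$ equals the gradient's own magnitude, so both coordinates move by roughly $\eta$ despite their 100x difference in gradient size.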
2.1 Why the accumulated term matters
The growth of $G_{t,ii}$ is controlled by two distinct effects:
- frequency: if a parameter repeatedly receives nonzero gradients, the sum keeps growing;
- magnitude: large gradients contribute disproportionately because they are squared.
Hence AdaGrad does not merely count how often a parameter moves. It also measures how strongly that parameter has been pushed over time.
This immediately determines the adaptive behavior:
- if $G_{t,ii}$ is large, the denominator $\sqrt{G_{t,ii}} + \epsilon$ is large and the update along coordinate $i$ is damped;
- if $G_{t,ii}$ is small, the denominator is smaller and the update along coordinate $i$ remains more aggressive.
Adaptive balance
AdaGrad implements a simple but powerful balancing principle. It enforces a form of democracy among parameters:
- coordinates that have already absorbed substantial gradient signal are down-scaled,
- coordinates that have seen little signal are comparatively emphasized.
This prevents rare gradients from being dominated by coordinates that fire more often. It is also the main reason AdaGrad is often effective on problems with sparse features or rare but informative events.
3. Numerical example
Consider two parameters, $\theta_1$ and $\theta_2$, with the following (illustrative) gradients over three iterations:

| step | $g_{t,1}$ | $g_{t,2}$ |
|---|---|---|
| 1 | 1.0 | 0.1 |
| 2 | 2.0 | 0.0 |
| 3 | 2.0 | 0.1 |
3.1 Gradient accumulation
The diagonal accumulators after step 3 are

$$G_{3,11} = 1.0^2 + 2.0^2 + 2.0^2 = 9.0, \qquad G_{3,22} = 0.1^2 + 0.0^2 + 0.1^2 = 0.02.$$

Thus:
- $\theta_1$ has accumulated strong and frequent gradients;
- $\theta_2$ has accumulated weak and sparse gradients.
3.2 Effective learning rates
AdaGrad therefore assigns

$$\eta_{3,1} = \frac{\eta}{\sqrt{9.0} + \epsilon} \approx \frac{\eta}{3}, \qquad \eta_{3,2} = \frac{\eta}{\sqrt{0.02} + \epsilon} \approx \frac{\eta}{0.141} \approx 7.1\,\eta.$$

So the second parameter receives an effective step scale roughly 21 times larger than the first.
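The whole example can be checked in a few lines of NumPy (the gradient values are illustrative, as above):

```python
import numpy as np

# rows = iterations, columns = parameters theta_1, theta_2 (illustrative values)
grads = np.array([[1.0, 0.1],
                  [2.0, 0.0],
                  [2.0, 0.1]])

G = (grads ** 2).sum(axis=0)       # diagonal accumulators: [9.0, 0.02]
eta, eps = 0.1, 1e-8
eff = eta / (np.sqrt(G) + eps)     # effective learning rates per coordinate

print(eff)
print(eff[1] / eff[0])             # ratio of the two step scales, about 21
```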
3.3 Interpretation
This example illustrates AdaGrad’s central mechanism:
- coordinates with persistent large gradients are automatically damped;
- coordinates with weaker or rarer gradients are allowed to keep moving.
Takeaway
AdaGrad does not treat all coordinates equally. It treats them proportionally to their optimization history. That per-coordinate rescaling is the source of both its strength and, later, its main limitation.
4. AdaGrad in depth
4.1 The matrix G
The diagonal formula above is easier to understand if one starts from a more general matrix viewpoint.
The matrix $G_t$ is given by the sum of the outer products of the loss gradients across different time steps:

$$G_t = \sum_{\tau=1}^{t} g_\tau\, g_\tau^\top.$$

Equivalent matrix notation

Equivalently,

$$G_t = g_1 g_1^\top + g_2 g_2^\top + \cdots + g_t g_t^\top.$$

These two expressions are the same object: the outer product $g_\tau g_\tau^\top$ is simply the matrix whose $(i,j)$ entry is $g_{\tau,i}\, g_{\tau,j}$.

Written entrywise,

$$(G_t)_{ij} = \sum_{\tau=1}^{t} g_{\tau,i}\, g_{\tau,j}.$$
So:
- the diagonal entries collect sums of squared gradients;
- the off-diagonal entries collect sums of gradient cross-products between different coordinates.
Note
These entries are not variances and covariances in the strict statistical sense, because no centering by the mean is being performed. They are better interpreted as accumulated second-order and cross-coordinate gradient interactions.
The fully expanded matrix is:

$$G_t = \sum_{\tau=1}^{t}
\begin{pmatrix}
g_{\tau,1}^2 & g_{\tau,1}\, g_{\tau,2} & \cdots & g_{\tau,1}\, g_{\tau,d} \\
g_{\tau,2}\, g_{\tau,1} & g_{\tau,2}^2 & \cdots & g_{\tau,2}\, g_{\tau,d} \\
\vdots & \vdots & \ddots & \vdots \\
g_{\tau,d}\, g_{\tau,1} & g_{\tau,d}\, g_{\tau,2} & \cdots & g_{\tau,d}^2
\end{pmatrix}.$$

Expansion of the outer product

The outer product $g_\tau g_\tau^\top$ may be written as a column vector times a row vector. If

$$g_\tau = \begin{pmatrix} g_{\tau,1} \\ g_{\tau,2} \\ \vdots \\ g_{\tau,d} \end{pmatrix},$$

then

$$g_\tau\, g_\tau^\top =
\begin{pmatrix} g_{\tau,1} \\ \vdots \\ g_{\tau,d} \end{pmatrix}
\begin{pmatrix} g_{\tau,1} & \cdots & g_{\tau,d} \end{pmatrix}
=
\begin{pmatrix}
g_{\tau,1}^2 & \cdots & g_{\tau,1}\, g_{\tau,d} \\
\vdots & \ddots & \vdots \\
g_{\tau,d}\, g_{\tau,1} & \cdots & g_{\tau,d}^2
\end{pmatrix}.$$

Therefore the outer product stores all pairwise products between gradient coordinates, and summing these matrices over time produces $G_t$.
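A quick NumPy check of the two constructions (the two gradient vectors here are arbitrary illustrations):

```python
import numpy as np

grads = [np.array([1.0, 2.0]), np.array([3.0, -1.0])]

G = sum(np.outer(g, g) for g in grads)   # full matrix: sum of outer products
diag = sum(g * g for g in grads)         # coordinatewise sums of squares

print(G)                                 # off-diagonals hold cross-products
print(np.allclose(np.diag(G), diag))     # True: diagonal matches
```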
4.2 Why deep learning uses only the diagonal
The practical AdaGrad update used in deep learning keeps only the diagonal entries

$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

instead of the full matrix $G_t$.

Why use only the diagonal
The diagonal approximation is used for three reasons:
- computational tractability: storing a full $d \times d$ matrix is infeasible when the parameter count $d$ is in the millions;
- per-parameter adaptation: the diagonal already provides exactly the coordinatewise scaling needed for adaptive learning rates;
- good practical tradeoff: the off-diagonal cross-coordinate interactions are expensive to maintain and usually not worth their cost in large-scale deep learning.
In short, diagonal AdaGrad sacrifices second-order richness in exchange for scalability.
4.3 The motivation behind AdaGrad: preconditioning
Important
To understand why AdaGrad is so effective, especially on sparse or ill-conditioned problems, it is useful to interpret it as a form of diagonal preconditioning.
4.3.1 The problem of ill-conditioned objectives
Consider the quadratic objective

$$f(\theta) = \tfrac{1}{2}\,\theta^\top A\,\theta - b^\top \theta,$$

where $A$ is symmetric positive definite.
The eigenvalues of $A$ determine the geometry of the level sets:
- if the eigenvalues are similar, the landscape is close to isotropic and optimization is relatively easy;
- if they differ strongly, the landscape becomes elongated and optimization becomes much harder.
The severity of this anisotropy is measured by the condition number

$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}.$$

Large $\kappa(A)$ means the problem is ill-conditioned.
SGD on ill-conditioned problems
In an elongated valley, plain gradient descent tends to oscillate in steep directions while making slow progress in flat directions. The algorithm is then using one global learning rate to solve a problem that naturally calls for direction-dependent scaling.
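This failure mode is easy to reproduce on a toy problem, assuming curvatures 100 and 1 along the two axes:

```python
import numpy as np

curv = np.array([100.0, 1.0])   # assumed curvatures: one steep axis, one flat

def grad(x):
    return curv * x             # gradient of f(x) = 0.5 * sum(curv * x^2)

x = np.array([1.0, 1.0])
lr = 0.018                      # must stay below 2/100 to avoid divergence
for _ in range(100):
    x = x - lr * grad(x)

print(x)  # steep coordinate oscillates down to ~0; flat one has barely moved
```

The single global learning rate is capped by the steepest curvature, which starves progress along the flat direction.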
4.3.2 The ideal but impractical solution: full preconditioning
If one had access to the right geometry, one would like to transform the coordinates so that the objective becomes much better conditioned.
Using the eigendecomposition

$$A = Q \Lambda Q^\top,$$

the change of variables

$$z = Q^\top \theta$$

diagonalizes the quadratic form:

$$\theta^\top A\,\theta = z^\top \Lambda z = \sum_{i=1}^{d} \lambda_i z_i^2.$$
In that transformed coordinate system, each direction can in principle be handled according to its own curvature.
Why this is impractical
For modern neural networks, computing and maintaining the full second-order geometry is typically far too expensive. This is the central reason adaptive first-order methods rely on cheap approximations instead.
4.3.3 AdaGrad’s insight: a diagonal geometric surrogate
AdaGrad does not estimate the Hessian exactly. What it does is more modest and more defensible:
- it uses accumulated squared gradients as a diagonal surrogate for local geometry;
- directions that repeatedly exhibit large gradients are progressively down-scaled;
- directions that repeatedly exhibit small gradients remain less penalized.
For the quadratic objective above,

$$\nabla f(\theta) = A\,(\theta - \theta^*),$$

where $\theta^*$ is the minimizer.
Derivation of the quadratic gradient
Starting from

$$f(\theta) = \tfrac{1}{2}\,\theta^\top A\,\theta - b^\top \theta$$

with $A$ symmetric, one has

$$\nabla f(\theta) = A\theta - b.$$

If $\theta^*$ is the minimizer, then

$$\nabla f(\theta^*) = A\theta^* - b = 0,$$

so

$$b = A\theta^*.$$

Substituting back gives

$$\nabla f(\theta) = A\theta - A\theta^* = A\,(\theta - \theta^*).$$
This expression clarifies why gradient magnitude is related to geometry: directions associated with large curvature tend to generate large gradient components when the iterate is displaced from the optimum.
That relationship is not exact enough to identify curvature perfectly, but it is strong enough to motivate AdaGrad's scaling rule:

$$\theta_{t+1} = \theta_t - \eta \left(\operatorname{diag}(G_t)^{1/2} + \epsilon I\right)^{-1} g_t.$$
The effect is precisely what one wants from a diagonal preconditioner:
- steep directions get progressively damped;
- shallow or rarely active directions keep relatively larger steps.
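On the same toy quadratic used for plain gradient descent (curvatures 100 and 1 assumed, $\eta = 0.5$ chosen arbitrarily), a diagonal-AdaGrad loop behaves like a preconditioner:

```python
import numpy as np

curv = np.array([100.0, 1.0])   # same assumed curvatures as before

def grad(x):
    return curv * x

x = np.array([1.0, 1.0])
G = np.zeros(2)
eta, eps = 0.5, 1e-8
for _ in range(100):
    g = grad(x)
    G += g * g
    x = x - eta / (np.sqrt(G) + eps) * g

print(x)  # both coordinates follow (almost) exactly the same trajectory
```

On this axis-aligned quadratic the curvature factor cancels in $g_i/\sqrt{G_{ii}}$, so the 100x curvature gap disappears from the dynamics; the only residual asymmetry comes from $\epsilon$.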
Summary
AdaGrad can be interpreted as a scalable diagonal preconditioning scheme built from gradient history rather than from explicit Hessian computations.
5. AdaGrad weakness
5.1 The intrinsic limitation
The same mechanism that gives AdaGrad its strength is also the source of its main weakness.
Since

$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2,$$

each coordinate accumulator is monotonically nondecreasing in $t$. Therefore the coordinatewise factor

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,ii}} + \epsilon}$$

is monotonically nonincreasing.
In other words, AdaGrad never forgets. Once a coordinate has accumulated substantial gradient mass, its effective learning rate can only stay where it is or decrease further.
Irreversible shrinkage
AdaGrad does not merely adapt learning rates across coordinates. It also builds in an irreversible time-dependent decay through the growth of $G_{t,ii}$. That decay is endogenous: it is produced by the optimizer itself rather than by an explicit scheduler.
5.2 Asymptotic consequence
To understand the long-run behavior of AdaGrad, one must ask how fast the cumulative quantity

$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

grows with $t$.

Since $G_{t,ii}$ is a sum over iterations, the natural object to examine is the corresponding average contribution per iteration, namely

$$\frac{1}{t} \sum_{\tau=1}^{t} g_{\tau,i}^2.$$

Suppose that along coordinate $i$ this average stabilizes for large $t$, so that

$$\frac{1}{t} \sum_{\tau=1}^{t} g_{\tau,i}^2 \;\approx\; \sigma_i^2.$$
Notation
Here, $\sigma_i^2$ should be understood simply as the long-run average squared-gradient scale along coordinate $i$. It is not introduced as an exact variance parameter in a probabilistic model. It is just a compact way to say that, over many iterations, the squared gradients of coordinate $i$ contribute on average about $\sigma_i^2$ per step.
Then

$$G_{t,ii} \approx t\,\sigma_i^2,$$

and the effective learning rate behaves like

$$\eta_{t,i} \approx \frac{\eta}{\sigma_i \sqrt{t}}.$$

So even when the problem would benefit from sustained movement, AdaGrad keeps reducing the step scale approximately like $1/\sqrt{t}$.
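The $1/\sqrt{t}$ decay is easy to observe under the idealized assumption of a constant unit gradient:

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0
eff = []
for t in range(1, 10001):
    g = 1.0                     # idealized: constant gradient, sigma_i = 1
    G += g * g                  # G grows linearly: G_t = t
    eff.append(eta / (np.sqrt(G) + eps))

# eff behaves like eta / sqrt(t): from t=100 to t=10000 it shrinks 10x
print(eff[99] / eff[9999])
```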
Practical consequences
This has three important effects:
- after enough iterations, progress may become extremely slow or stall;
- the optimizer’s internal decay stacks on top of any external learning-rate schedule;
- on non-sparse or long-horizon problems, AdaGrad may become too conservative long before optimization is truly finished.
Why later optimizers replace this mechanism
AdaGrad mixes two ideas in one quantity:
- coordinatewise adaptation, which is desirable;
- permanent accumulation of history, which is often too aggressive.
RMSProp, AdaDelta, and Adam keep the first idea but weaken or replace the second.
6. Beyond AdaGrad
Later optimizers preserve AdaGrad’s central insight, namely per-parameter adaptation, while addressing its infinite-memory weakness in different ways.
| Optimizer | Key idea | How it avoids AdaGrad’s main problem |
|---|---|---|
| RMSProp | EMA of squared gradients | Replaces cumulative history with exponentially decaying memory |
| AdaDelta | EMA of squared gradients and updates | Removes the need for unbounded accumulation and stabilizes step scaling |
| Adam | RMSProp + momentum + bias correction | Keeps adaptive scaling but controls both direction and startup behavior more carefully |
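The difference in memory behavior can be seen by running both accumulators on the same constant gradient stream (a deliberately simple illustration; the decay rate $\beta = 0.9$ is assumed):

```python
G_ada, G_rms = 0.0, 0.0
beta, g = 0.9, 1.0
for _ in range(1000):
    G_ada += g * g                             # AdaGrad: unbounded sum
    G_rms = beta * G_rms + (1 - beta) * g * g  # RMSProp: exponential moving avg

print(G_ada)  # 1000.0 and still growing linearly
print(G_rms)  # ~1.0, bounded by the mean-square gradient
```

Because the EMA forgets old gradients at rate $\beta$, the RMSProp-style denominator tracks the recent gradient scale instead of the entire history, so the effective learning rate no longer decays toward zero by construction.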
Summary
AdaGrad is historically foundational and still highly effective on genuinely sparse problems. Its core limitation is equally fundamental: by remembering the entire past, it eventually becomes too conservative for many modern deep-learning workloads.