1. Intro

AdaGrad (Adaptive Gradient) is an optimizer introduced in 2011 that assigns a different effective learning rate to each parameter by using the history of past gradients.

Its central idea is simple:

  • parameters that have already received many large gradients should be updated more cautiously;
  • parameters that receive small or infrequent gradients should remain more reactive.

This makes AdaGrad especially appealing when the signal is sparse, imbalanced, or concentrated on only a subset of parameters.

Theoretical context

AdaGrad was originally developed in the setting of online convex optimization, not in the language of modern deep-learning loss-landscape intuition. In the original 2011 paper, the central theoretical question is therefore not “does the optimizer move well inside a nonconvex neural-loss surface?”, but rather “how well does the sequence of decisions produced by the algorithm perform over time?”

This is why the analysis is phrased in terms of a regret bound. Informally, the regret after $T$ iterations measures how much cumulative loss the algorithm has suffered, compared with the loss that would have been obtained by the best single fixed parameter vector chosen in hindsight after seeing the entire sequence of data. A small regret means that, even though the algorithm did not know the future, its running sequence of updates performed nearly as well as that best fixed comparator.

SGD vs AdaGrad

Unlike plain SGD, which uses one global learning rate for all coordinates, AdaGrad:

  • adapts the effective step size individually for each parameter,
  • reduces the step size along coordinates that have accumulated large gradients,
  • preserves relatively larger steps along coordinates that have seen little activity.

Note

This note explains AdaGrad itself through its standard coordinatewise update rule. A full-matrix viewpoint is introduced later only to clarify the geometric intuition behind that rule.


2. AdaGrad update rule

Let

$$g_{t,i} = \frac{\partial L(\theta_t)}{\partial \theta_i}$$

be the gradient of the loss with respect to parameter $\theta_i$ at iteration $t$.

AdaGrad defines the accumulated squared-gradient term

$$G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

and updates the parameter according to

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i},$$

where:

  • $\eta$ is the base learning rate,
  • $\epsilon$ is a small numerical-stability constant,
  • $G_{t,i}$ is the cumulative sum of squared gradients associated with parameter $\theta_i$ up to iteration $t$.

The quantity

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}$$

is the effective learning rate for coordinate $i$ at step $t$.
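As a concrete sketch, the coordinatewise rule can be written out directly in a few lines of pure Python (the defaults `eta = 0.1` and `eps = 1e-8` are illustrative choices, not prescribed values):

```python
import math

def adagrad_step(theta, grad, accum, eta=0.1, eps=1e-8):
    """One diagonal-AdaGrad update; accum[i] plays the role of G_{t,i}."""
    for i, g in enumerate(grad):
        accum[i] += g * g                                  # G_{t,i} += g_{t,i}^2
        theta[i] -= eta / (math.sqrt(accum[i]) + eps) * g  # damped step
    return theta, accum

# Two parameters, one step, starting from zero accumulators
theta, accum = adagrad_step([0.0, 0.0], [1.0, 0.1], [0.0, 0.0])
```

Note that on the very first step every coordinate moves by roughly `eta * sign(g)`, because the denominator is then just `|g|`; differentiation between coordinates only emerges as history accumulates.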


2.1 Why the accumulated term matters

The growth of $G_{t,i}$ is controlled by two distinct effects:

  • frequency: if a parameter repeatedly receives nonzero gradients, the sum keeps growing;
  • magnitude: large gradients contribute disproportionately because they are squared.

Hence AdaGrad does not merely count how often a parameter moves. It also measures how strongly that parameter has been pushed over time.

This immediately determines the adaptive behavior:

  • if $G_{t,i}$ is large, the denominator $\sqrt{G_{t,i}} + \epsilon$ is large and the update along coordinate $i$ is damped;
  • if $G_{t,i}$ is small, the denominator is smaller and the update along coordinate $i$ remains more aggressive.

Adaptive balance

AdaGrad implements a simple but powerful balancing principle. It enforces a form of democracy among parameters:

  • coordinates that have already absorbed substantial gradient signal are down-scaled,
  • coordinates that have seen little signal are comparatively emphasized.

This prevents signal arriving on rarely active coordinates from being drowned out by coordinates that fire more often. It is also the main reason AdaGrad is often effective on problems with sparse features or rare but informative events.


3. Numerical example

Consider two parameters, $\theta_1$ and $\theta_2$, with the following gradients (the values here are illustrative) over three iterations:

| step | $g_{t,1}$ | $g_{t,2}$ |
|------|-----------|-----------|
| 1    | 1.0       | 0.1       |
| 2    | 0.8       | 0.0       |
| 3    | 1.2       | 0.2       |

3.1 Gradient accumulation

The diagonal accumulators after step 3 are

$$G_{3,1} = 1.0^2 + 0.8^2 + 1.2^2 = 3.08, \qquad G_{3,2} = 0.1^2 + 0.0^2 + 0.2^2 = 0.05.$$

Thus:

  • $\theta_1$ has accumulated strong and frequent gradients;
  • $\theta_2$ has accumulated weak and sparse gradients.

3.2 Effective learning rates

With a base rate $\eta = 0.1$ (and $\epsilon$ negligible), AdaGrad therefore assigns

$$\eta_{3,1} = \frac{0.1}{\sqrt{3.08}} \approx 0.057, \qquad \eta_{3,2} = \frac{0.1}{\sqrt{0.05}} \approx 0.447.$$

So the second parameter receives a much larger effective step scale than the first (here roughly eight times larger).
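The accumulation arithmetic is easy to check in code (a sketch; the gradient values and $\eta = 0.1$ are the illustrative numbers used here):

```python
import math

# Illustrative per-step gradients for the two parameters
grads = {
    "theta1": [1.0, 0.8, 1.2],  # strong and frequent
    "theta2": [0.1, 0.0, 0.2],  # weak and sparse
}
eta, eps = 0.1, 1e-8

# Diagonal accumulators G_{3,i} and the resulting effective learning rates
G = {k: sum(g * g for g in gs) for k, gs in grads.items()}
eff_lr = {k: eta / (math.sqrt(Gk) + eps) for k, Gk in G.items()}
```

Here `eff_lr["theta2"]` comes out roughly eight times larger than `eff_lr["theta1"]`, which is exactly the per-coordinate rebalancing described above.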

3.3 Interpretation

This example illustrates AdaGrad’s central mechanism:

  • coordinates with persistent large gradients are automatically damped;
  • coordinates with weaker or rarer gradients are allowed to keep moving.

Takeaway

AdaGrad does not treat all coordinates equally. It treats them proportionally to their optimization history. That per-coordinate rescaling is the source of both its strength and, later, its main limitation.


4. AdaGrad in depth

4.1 The matrix G

The diagonal formula above is easier to understand if one starts from a more general matrix viewpoint.

The matrix $G_t$ is given by the sum of the outer products of the loss gradients across different time steps:

$$G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top.$$

Written entrywise,

$$[G_t]_{ij} = \sum_{\tau=1}^{t} g_{\tau,i}\, g_{\tau,j}.$$

So:

  • the diagonal entries $[G_t]_{ii} = \sum_{\tau} g_{\tau,i}^2$ collect sums of squared gradients;
  • the off-diagonal entries collect sums of gradient cross-products between different coordinates.

Note

These entries are not variances and covariances in the strict statistical sense, because no centering by the mean is being performed. They are better interpreted as accumulated second-order and cross-coordinate gradient interactions.

The fully expanded matrix is:

$$G_t =
\begin{pmatrix}
\sum_{\tau} g_{\tau,1}^2 & \sum_{\tau} g_{\tau,1} g_{\tau,2} & \cdots & \sum_{\tau} g_{\tau,1} g_{\tau,d} \\
\sum_{\tau} g_{\tau,2} g_{\tau,1} & \sum_{\tau} g_{\tau,2}^2 & \cdots & \sum_{\tau} g_{\tau,2} g_{\tau,d} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{\tau} g_{\tau,d} g_{\tau,1} & \sum_{\tau} g_{\tau,d} g_{\tau,2} & \cdots & \sum_{\tau} g_{\tau,d}^2
\end{pmatrix}$$

4.2 Why deep learning uses only the diagonal

The practical AdaGrad update used in deep learning keeps only the diagonal,

$$\mathrm{diag}(G_t) = \left( [G_t]_{11}, [G_t]_{22}, \ldots, [G_t]_{dd} \right),$$

instead of the full matrix $G_t$.

Why use only $\mathrm{diag}(G_t)$

The diagonal approximation is used for three reasons:

  • computational tractability: storing a full $d \times d$ matrix is infeasible when $d$ is in the millions;
  • per-parameter adaptation: the diagonal already provides exactly the coordinatewise scaling needed for adaptive learning rates;
  • good practical tradeoff: the off-diagonal cross-coordinate interactions are expensive to maintain and usually not worth their cost in large-scale deep learning.

In short, diagonal AdaGrad sacrifices second-order richness in exchange for scalability.
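The tradeoff can be made concrete by building both objects for a toy 2-D case (pure Python, with illustrative gradient vectors): the full accumulator holds $d^2$ numbers, while diagonal AdaGrad keeps only its diagonal.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Illustrative gradient vectors g_1, g_2, g_3 in d = 2 dimensions
grads = [[1.0, 0.1], [0.8, 0.0], [1.2, 0.2]]

G = [[0.0, 0.0], [0.0, 0.0]]
for g in grads:
    G = mat_add(G, outer(g, g))          # G_t = sum of g g^T

diag = [G[i][i] for i in range(2)]       # what diagonal AdaGrad actually stores
```

The off-diagonal entry `G[0][1]` records the accumulated cross-products between the two coordinates; diagonal AdaGrad simply discards it and keeps the two squared-gradient sums.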


4.3 The motivation behind AdaGrad: preconditioning

Important

To understand why AdaGrad is so effective, especially on sparse or ill-conditioned problems, it is useful to interpret it as a form of diagonal preconditioning.

4.3.1 The problem of ill-conditioned objectives

Consider the quadratic objective

$$f(\theta) = \frac{1}{2}\, \theta^\top A\, \theta,$$

where $A$ is symmetric positive definite.

The eigenvalues of $A$ determine the geometry of the level sets:

  • if the eigenvalues are similar, the landscape is close to isotropic and optimization is relatively easy;
  • if they differ strongly, the landscape becomes elongated and optimization becomes much harder.

The severity of this anisotropy is measured by the condition number

$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}.$$

Large $\kappa(A)$ means the problem is ill-conditioned.

SGD on ill-conditioned problems

In an elongated valley, plain gradient descent tends to oscillate in steep directions while making slow progress in flat directions. The algorithm is then using one global learning rate to solve a problem that naturally calls for direction-dependent scaling.
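A tiny simulation illustrates the problem (a sketch on the separable quadratic $f(x, y) = \tfrac{1}{2}(a x^2 + b y^2)$; the constants $a = 100$, $b = 1$ and the step size are illustrative):

```python
a, b = 100.0, 1.0          # curvatures: steep x-direction, flat y-direction
eta = 0.015                # must stay below 2/a = 0.02 for stability in x
x, y = 1.0, 1.0

for _ in range(50):
    x -= eta * a * x       # gradient of 0.5*a*x^2 is a*x
    y -= eta * b * y       # gradient of 0.5*b*y^2 is b*y
```

The steep coordinate bounces with factor $1 - \eta a = -0.5$ per step and decays quickly, but the flat coordinate shrinks only by factor $0.985$ per step and is still near $0.47$ after 50 iterations; any larger $\eta$ would make the steep direction diverge. One global rate cannot serve both directions well.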


4.3.2 The ideal but impractical solution: full preconditioning

If one had access to the right geometry, one would like to transform the coordinates so that the objective becomes much better conditioned.

Using the eigendecomposition

$$A = Q \Lambda Q^\top,$$

the change of variables

$$z = Q^\top \theta$$

diagonalizes the quadratic form:

$$f(\theta) = \frac{1}{2}\, z^\top \Lambda\, z = \frac{1}{2} \sum_{i} \lambda_i z_i^2.$$
In that transformed coordinate system, each direction can in principle be handled according to its own curvature.
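This change of variables can be verified numerically (a sketch with a hand-built $2 \times 2$ example: $Q$ is a 45-degree rotation and the eigenvalues $9$ and $1$ are illustrative):

```python
import math

c = s = 1.0 / math.sqrt(2.0)
Q = [[c, -s], [s, c]]                 # orthogonal: 45-degree rotation
lam = [9.0, 1.0]                      # eigenvalues (illustrative)

# A = Q diag(lam) Q^T, written out entrywise
A = [[sum(Q[i][k] * lam[k] * Q[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

theta = [0.3, -0.7]
f_theta = 0.5 * sum(theta[i] * A[i][j] * theta[j]
                    for i in range(2) for j in range(2))

z = [sum(Q[i][k] * theta[i] for i in range(2)) for k in range(2)]  # z = Q^T theta
f_z = 0.5 * sum(lam[k] * z[k] ** 2 for k in range(2))
# f_theta and f_z agree: the quadratic is diagonal in z-coordinates
```

In the $z$-coordinates each direction carries a single curvature $\lambda_i$, which is exactly the regime in which a per-direction step size would be easy to choose.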

Why this is impractical

For modern neural networks, computing and maintaining the full second-order geometry is typically far too expensive. This is the central reason adaptive first-order methods rely on cheap approximations instead.


4.3.3 AdaGrad’s insight: a diagonal geometric surrogate

AdaGrad does not estimate the Hessian exactly. What it does is more modest and more defensible:

  • it uses accumulated squared gradients as a diagonal surrogate for local geometry;
  • directions that repeatedly exhibit large gradients are progressively down-scaled;
  • directions that repeatedly exhibit small gradients remain less penalized.

For the quadratic objective above, the gradient is

$$\nabla f(\theta) = A\,(\theta - \theta^*),$$

where $\theta^*$ is the minimizer (here $\theta^* = 0$).

This expression clarifies why gradient magnitude is related to geometry: directions associated with large curvature tend to generate large gradient components when the iterate is displaced from the optimum.

That relationship is not exact enough to identify curvature perfectly, but it is strong enough to motivate AdaGrad’s scaling rule:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i}.$$
The effect is precisely what one wants from a diagonal preconditioner:

  • steep directions get progressively damped;
  • shallow or rarely active directions keep relatively larger steps.
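On the separable quadratic from the previous subsection this effect is easy to see in code (a sketch; $a = 100$, $b = 1$, and $\eta$ are illustrative). Because each coordinate's accumulator is built from that coordinate's own gradients, the curvature constant largely cancels out of the step, and both directions make essentially identical progress despite the 100-fold curvature gap:

```python
import math

a, b = 100.0, 1.0            # steep and flat curvatures
eta, eps = 0.5, 1e-8
x, y = 1.0, 1.0
Gx = Gy = 0.0

for _ in range(200):
    gx, gy = a * x, b * y                    # gradients of 0.5*a*x^2, 0.5*b*y^2
    Gx += gx * gx
    Gy += gy * gy
    x -= eta / (math.sqrt(Gx) + eps) * gx    # per-coordinate damped steps
    y -= eta / (math.sqrt(Gy) + eps) * gy
```

Compare this with the plain-gradient-descent sketch earlier, where the flat direction was still far from the optimum after the same budget of iterations.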

Summary

AdaGrad can be interpreted as a scalable diagonal preconditioning scheme built from gradient history rather than from explicit Hessian computations.


5. AdaGrad weakness

5.1 The intrinsic limitation

The same mechanism that gives AdaGrad its strength is also the source of its main weakness.

Since

each coordinate accumulator is monotonically nondecreasing. Therefore the coordinatewise factor

is monotonically nonincreasing.

In other words, AdaGrad never forgets. Once a coordinate has accumulated substantial gradient mass, its effective learning rate can only stay where it is or decrease further.

Irreversible shrinkage

AdaGrad does not merely adapt learning rates across coordinates. It also builds in an irreversible time-dependent decay through the growth of $G_{t,i}$. That decay is endogenous: it is produced by the optimizer itself rather than by an explicit scheduler.
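The following sketch makes the irreversibility explicit (the gradient sequence is illustrative): after a burst of large gradients, the effective rate drops and then stays at the lowered level even though the gradients have vanished.

```python
import math

eta, eps = 0.1, 1e-8
grads = [2.0] * 10 + [0.0] * 10   # a burst of large gradients, then silence
G = 0.0
rates = []
for g in grads:
    G += g * g                              # the accumulator never decreases
    rates.append(eta / (math.sqrt(G) + eps))
# rates is nonincreasing and never recovers after the burst ends
```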

5.2 Asymptotic consequence

To understand the long-run behavior of AdaGrad, one must ask how fast the cumulative quantity

$$G_{T,i} = \sum_{t=1}^{T} g_{t,i}^2$$

grows with $T$.

Since $G_{T,i}$ is a sum over iterations, the natural object to examine is the corresponding average contribution per iteration, namely

$$\frac{G_{T,i}}{T} = \frac{1}{T} \sum_{t=1}^{T} g_{t,i}^2.$$

Suppose that along coordinate $i$ this average stabilizes for large $T$, so that

$$\frac{G_{T,i}}{T} \approx \sigma_i^2.$$

Notation

Here, $\sigma_i^2$ should be understood simply as the long-run average squared-gradient scale along coordinate $i$. It is not introduced as an exact variance parameter in a probabilistic model. It is just a compact way to say that, over many iterations, the squared gradients of coordinate $i$ contribute on average about $\sigma_i^2$ per step.

Then

$$G_{T,i} \approx T\, \sigma_i^2,$$

and the effective learning rate behaves like

$$\eta_{T,i} = \frac{\eta}{\sqrt{G_{T,i}} + \epsilon} \approx \frac{\eta}{\sigma_i \sqrt{T}}.$$

So even when the problem would benefit from sustained movement, AdaGrad keeps reducing the step scale approximately like $1/\sqrt{T}$.
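The $1/\sqrt{T}$ behavior can be simulated directly (a sketch; $\eta = 0.1$ and $\sigma = 0.5$ are illustrative). When each step contributes about $\sigma^2$ of squared gradient, quadrupling the horizon halves the effective rate:

```python
import math

eta, sigma, eps = 0.1, 0.5, 1e-8
G = 0.0
rate_at = {}
for t in range(1, 401):
    G += sigma ** 2                       # average squared-gradient per step
    if t in (100, 400):
        rate_at[t] = eta / (math.sqrt(G) + eps)
# rate_at[100] ~ eta/(sigma*10) and rate_at[400] ~ eta/(sigma*20)
```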

Practical consequences

This has three important effects:

  • after enough iterations, progress may become extremely slow or stall;
  • the optimizer’s internal decay stacks on top of any external learning-rate schedule;
  • on non-sparse or long-horizon problems, AdaGrad may become too conservative long before optimization is truly finished.

Why later optimizers replace this mechanism

AdaGrad mixes two ideas in one quantity:

  • coordinatewise adaptation, which is desirable;
  • permanent accumulation of history, which is often too aggressive.

RMSProp, AdaDelta, and Adam keep the first idea but weaken or replace the second.
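The difference shows up directly in the accumulators (a sketch; the constant gradient and decay `rho = 0.9` are illustrative): under a steady gradient, AdaGrad's sum grows without bound while RMSProp's exponential moving average plateaus, so RMSProp's effective rate does not vanish.

```python
g, rho = 1.0, 0.9
G_ada = 0.0                  # AdaGrad: cumulative sum of squared gradients
v_rms = 0.0                  # RMSProp: exponential moving average

for _ in range(500):
    G_ada += g * g
    v_rms = rho * v_rms + (1 - rho) * g * g
# G_ada grows linearly with the step count, while v_rms converges near 1.0
```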


6. Beyond AdaGrad

Later optimizers preserve AdaGrad’s central insight, namely per-parameter adaptation, while addressing its infinite-memory weakness in different ways.

| Optimizer | Key idea | How it avoids AdaGrad’s main problem |
|-----------|----------|--------------------------------------|
| RMSProp  | EMA of squared gradients | Replaces cumulative history with exponentially decaying memory |
| AdaDelta | EMA of squared gradients and updates | Removes the need for unbounded accumulation and stabilizes step scaling |
| Adam     | RMSProp + momentum + bias correction | Keeps adaptive scaling but controls both direction and startup behavior more carefully |

Summary

AdaGrad is historically foundational and still highly effective on genuinely sparse problems. Its core limitation is equally fundamental: by remembering the entire past, it eventually becomes too conservative for many modern deep-learning workloads.