Intro
AdaGrad (Adaptive Gradient) is an optimization algorithm, introduced by Duchi, Hazan, and Singer (2011), that dynamically adjusts the learning rate for each parameter of the network based on the history of its past gradients.
SGD vs AdaGrad
Unlike classical SGD, which uses a single global learning rate $\eta$, AdaGrad:
- adapts the base learning rate individually for each parameter $\theta_i$,
- decreases the learning rate for parameters that are updated frequently,
- keeps the learning rate higher for rarely updated parameters, which are often the most informative.
Note
This strategy improves convergence performance in scenarios with sparse data or rarely activated parameters, such as in Natural Language Processing (NLP) and Image Recognition.
AdaGrad update rule
The parameter update in AdaGrad follows the formula:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

where:
- $\theta_{t,i}$: value of the $i$-th parameter at step $t$,
- $\eta$: initial global learning rate,
- $g_{t,i}$: gradient component associated with parameter $i$ at step $t$,
- $\epsilon$: small constant (usually $10^{-8}$) to avoid division by zero,
- $G_{t,ii}$: cumulative sum of the squared gradients of parameter $i$ up to time $t$.
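For concreteness, here is a minimal NumPy sketch of this update. The gradient function, learning rate, and target vector are hypothetical choices for illustration, not part of the original text.

```python
import numpy as np

def adagrad_update(theta, G, grad, lr=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale each parameter's step."""
    G += grad ** 2                           # G_{t,ii} = running sum of squared gradients
    theta -= lr / np.sqrt(G + eps) * grad    # per-parameter effective learning rate
    return theta, G

# Illustrative usage with a stand-in gradient of a simple quadratic loss
theta = np.zeros(3)          # parameters
G = np.zeros(3)              # accumulator, one entry per parameter

def grad_fn(theta):
    return 2.0 * (theta - np.array([1.0, -2.0, 0.5]))

for t in range(100):
    theta, G = adagrad_update(theta, G, grad_fn(theta))
```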
Cumulative sum of squared gradients term
The growth of the term $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ is driven by two distinct factors:
- Frequency (Consistency of updates): this factor reflects how consistently a parameter receives a non-zero gradient. Since the sum involves squared terms (always non-negative), $G_{t,ii}$ increases inexorably at every step in which $g_{\tau,i} \neq 0$. Even a long sequence of small gradients results in a steady and significant accumulation.
- Intensity: this factor captures the impact of the gradient magnitude. Squaring introduces a nonlinear effect: large gradients contribute disproportionately to the sum. If each $g_{\tau,i}$ has a large magnitude, squaring it produces an even larger contribution to $G_{t,ii}$, accelerating its growth. For instance, a single gradient of magnitude 10 (adding 100 to the sum) has ten times the impact of ten gradients of magnitude 1 (which together add only 10).
Note
Thus, $G_{t,ii}$ grows when parameter $i$ receives gradient contributions frequently and/or strongly.
This has a direct impact on the effective learning rate:
- If $G_{t,ii}$ is large → the denominator in the update rule is large → smaller effective learning rate → the parameter update is dampened.
- If $G_{t,ii}$ is small → the denominator in the update rule is small → larger effective learning rate → the parameter update is amplified.
Adaptive balance in AdaGrad
AdaGrad acts as an adaptive optimizer that takes into account the history of each parameter.
It enforces a form of “democracy among parameters”:
- parameters with frequent or strong gradients are down-scaled,
- parameters with rare or weaker gradients are emphasized.
This prevents rare gradients from being dominated by those arising more frequently, ensuring a more balanced learning process.
Numerical example
Consider two parameters, $\theta_1$ and $\theta_2$, with the following gradients observed over three time steps:
| step | $g_{t,1}$ | $g_{t,2}$ |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |
Gradient accumulation
The diagonal terms are obtained as cumulative sums of squared gradients, $G_{3,ii} = \sum_{t=1}^{3} g_{t,i}^2$:
- $\theta_1$ received strong and frequent gradients, leading to a large accumulated value $G_{3,11}$.
- $\theta_2$ received weak and sparse gradients, so $G_{3,22}$ remains very small.
Effective learning rates
AdaGrad scales the base learning rate $\eta$ by the inverse square root of $G_{t,ii}$:

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$$

- For $\theta_1$, the update is dampened to avoid overshooting, since this parameter has already been heavily influenced by the gradients.
- For $\theta_2$, the update is amplified to give more weight to a parameter that would otherwise lag behind.
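Below is a purely illustrative sketch of this computation. The gradient values `g1` and `g2` are hypothetical, chosen only to match the qualitative description above (strong and frequent for $\theta_1$, weak and sparse for $\theta_2$); they are not the values from the original table.

```python
import numpy as np

eta, eps = 0.1, 1e-8

# Hypothetical gradients over three steps
g1 = np.array([5.0, 4.0, 6.0])   # theta_1: strong, frequent
g2 = np.array([0.0, 0.1, 0.0])   # theta_2: weak, sparse

G11 = np.sum(g1 ** 2)            # large accumulator for theta_1 (77.0)
G22 = np.sum(g2 ** 2)            # tiny accumulator for theta_2 (0.01)

lr1 = eta / np.sqrt(G11 + eps)   # ~0.0114: strongly dampened step for theta_1
lr2 = eta / np.sqrt(G22 + eps)   # ~1.0: much larger effective step for theta_2

print(G11, G22)
print(lr1, lr2)
```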
Interpretation
This example illustrates AdaGrad’s balancing effect:
- Extreme parameter updates (frequent, strong gradients) get dampened to prevent them from overwhelming the optimization.
- Rare or underrepresented parameters that get few or small updates receive higher learning rates, ensuring they still contribute to learning.
Takeaway
AdaGrad prevents parameters linked to rare signals from being overshadowed by those linked to frequent signals.
This per-parameter rescaling is the core reason why AdaGrad is especially effective in settings with sparse or imbalanced data.
AdaGrad in depth
The matrix G
In AdaGrad, a gradient accumulation matrix $G_t$ is first defined; it forms the basis of AdaGrad’s adaptive mechanism.
The matrix is given by the sum of the outer products of the loss gradients across different time steps:

$$G_t = \sum_{\tau=1}^{t} g_\tau \, g_\tau^\top$$

It represents an accumulation matrix that contains, for each pair of parameters $(i, j)$, the sum of the products of their respective partial derivatives over time:

$$[G_t]_{ij} = \sum_{\tau=1}^{t} g_{\tau,i} \, g_{\tau,j}$$
where:
- The diagonal elements $[G_t]_{ii}$ contain the sum of squared gradients for each parameter, i.e., the (unnormalized, uncentered) second moments of the individual parameters' gradients.
- The off-diagonal elements $[G_t]_{ij}$ (with $i \neq j$) represent the (unnormalized, uncentered) covariances between gradients of different parameters.
Expansion of the outer product in the matrix
The outer product $g_\tau g_\tau^\top$ can be expressed as the product of a column vector and its transpose.
In general, the outer product of two column vectors $u, v \in \mathbb{R}^d$ yields a $d \times d$ matrix whose entries are all possible products of the vector components:

$$u v^\top = \begin{bmatrix} u_1 v_1 & \cdots & u_1 v_d \\ \vdots & \ddots & \vdots \\ u_d v_1 & \cdots & u_d v_d \end{bmatrix}$$

In our case, the gradient can be written as:

$$g_\tau = \big( g_{\tau,1}, \, g_{\tau,2}, \, \dots, \, g_{\tau,d} \big)^\top$$

Thus, the outer product becomes:

$$g_\tau g_\tau^\top = \begin{bmatrix} g_{\tau,1}^2 & g_{\tau,1} g_{\tau,2} & \cdots & g_{\tau,1} g_{\tau,d} \\ g_{\tau,2} g_{\tau,1} & g_{\tau,2}^2 & \cdots & g_{\tau,2} g_{\tau,d} \\ \vdots & \vdots & \ddots & \vdots \\ g_{\tau,d} g_{\tau,1} & g_{\tau,d} g_{\tau,2} & \cdots & g_{\tau,d}^2 \end{bmatrix}$$
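As a quick sanity check of this structure, the sketch below builds the outer product for one hypothetical gradient vector (the values are arbitrary, chosen only for illustration):

```python
import numpy as np

g = np.array([1.0, -2.0, 3.0])   # a hypothetical gradient at one step
outer = np.outer(g, g)           # d x d matrix of all pairwise products
print(outer)
# [[ 1. -2.  3.]
#  [-2.  4. -6.]
#  [ 3. -6.  9.]]
print(np.diag(outer))            # the squared gradients: [1. 4. 9.]
```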
AdaGrad does not use the full matrix $G_t$, but only its diagonal elements:

$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

where:
- $g_{\tau,i}$ is the partial derivative of the loss function at step $\tau$, with respect to parameter $\theta_i$.
- $G_{t,ii}$ represents the cumulative sum of squared gradients over the entire history of training (i.e., from $\tau = 1$ to the current iteration $t$) for parameter $i$ only. $G_{t,ii}$ is then used to scale the learning rate over time, independently for each parameter.
Why use $\operatorname{diag}(G_t)$ instead of the full $G_t$
In AdaGrad, only the diagonal elements of the accumulation matrix $G_t$ are used, for three main reasons:
- Computational efficiency: the full $G_t$ is a $d \times d$ matrix (with $d$ being the number of parameters). For modern neural networks, $d$ can be in the millions, making storage and updates of $G_t$ completely infeasible. Using only the diagonal reduces the cost to $O(d)$.
- Sufficiency for learning rate adaptation: each diagonal element $G_{t,ii}$ tracks the cumulative squared gradients of a single parameter $\theta_i$. This is precisely the information needed to adapt the learning rate independently per parameter, which is the central goal of AdaGrad.
- Limited usefulness of covariances: off-diagonal terms represent covariances between gradients of different parameters. While potentially informative, they are not essential for the AdaGrad update rule. In practice, the added complexity does not translate into significant optimization benefits, whereas the per-parameter scaling provided by the diagonal already captures the key adaptive behavior.
Thus, focusing on $\operatorname{diag}(G_t)$ strikes the right balance: it is computationally tractable while still providing effective per-parameter learning rate adaptation.
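The following sketch contrasts the full accumulation with the diagonal-only accumulation that AdaGrad actually maintains; the dimension and the random gradients are made up purely for illustration:

```python
import numpy as np

d = 4                                   # tiny dimension; real networks have millions
rng = np.random.default_rng(0)

G_full = np.zeros((d, d))               # full accumulation matrix: O(d^2) memory
G_diag = np.zeros(d)                    # AdaGrad's accumulator: O(d) memory

for t in range(10):
    g = rng.normal(size=d)              # stand-in for the gradient at step t
    G_full += np.outer(g, g)            # sum of outer products
    G_diag += g ** 2                    # only the squared components

# The diagonal of the full matrix and the cheap accumulator coincide
assert np.allclose(np.diag(G_full), G_diag)
```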
The Motivation Behind AdaGrad: Preconditioning
Important
To understand why AdaGrad is so effective, especially in scenarios with sparse features, it’s useful to first explore the concept of preconditioning in optimization. The core idea is to address a common challenge in optimization: ill-conditioned problems.
The Problem of Ill-Conditioned Functions
Consider the simple convex optimization problem of minimizing a quadratic function:

$$f(x) = \frac{1}{2} x^\top A x - b^\top x$$

The shape of this function’s loss surface is determined by the matrix $A$.
- If the eigenvalues of $A$ are all similar, the surface resembles a round bowl, and finding the minimum is straightforward.
- If the eigenvalues are very different, the surface looks like a steep, narrow canyon or an elongated ellipse.
This “stretched” nature is quantified by the condition number ($\kappa$), defined as the ratio of the largest to the smallest eigenvalue of $A$:

$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$

A large condition number ($\kappa(A) \gg 1$) means the problem is ill-conditioned.
SGD and ill-conditioned problems
Standard gradient descent algorithms struggle in this scenario, as they tend to oscillate across the steep walls of the canyon while making very slow progress along its flat floor, which is the true path to the minimum.
The Ideal (but Impractical) Solution: Full Preconditioning
Theoretically, this problem could be “fixed” by transforming the coordinate space to turn the canyon back into a perfectly round bowl. This can be achieved using the eigendecomposition of the Hessian matrix, $A = Q \Lambda Q^\top$. By changing variables to $z = \Lambda^{1/2} Q^\top x$, the problem becomes:

$$f(z) = \frac{1}{2} \|z\|^2 - \big(\Lambda^{-1/2} Q^\top b\big)^\top z$$
In this new coordinate system, each coordinate can be optimized independently, which is trivial.
Why is this impractical?
Computing the full set of eigenvalues and eigenvectors of $A$ for a high-dimensional problem (like a deep neural network) is often more computationally expensive than solving the optimization problem itself.
The Adagrad Insight: Using Gradients as a Proxy for Curvature
AdaGrad
Since directly using the Hessian is infeasible in Deep Learning, AdaGrad introduces an ingenious idea: use the history of the gradients as a proxy for the diagonal of the Hessian.
The gradient of the previous quadratic function is:

$$\nabla f(x) = A x - b = A \, (x - x^\star)$$

where $x^\star$ is the location of the minimum.
Derivation of the Quadratic Gradient
To establish the analytical relationship between the gradient and the Hessian matrix in a quadratic context, consider the standard objective function:

$$f(x) = \frac{1}{2} x^\top A x - b^\top x$$

where $A$ is a symmetric positive-definite matrix representing the Hessian, and $b$ is a constant vector.
1. First-Order Derivative
The gradient of $f$ with respect to $x$ is obtained through the application of matrix calculus identities (using $\nabla_x \tfrac{1}{2} x^\top A x = A x$ for symmetric $A$, and $\nabla_x \, b^\top x = b$):

$$\nabla_x f(x) = A x - b$$
2. Optimality Condition
Let $x^\star$ denote the global minimizer of the function. At the point of optimality, the first-order condition requires the gradient to vanish:

$$\nabla_x f(x^\star) = A x^\star - b = 0$$

This equality permits the expression of the constant vector $b$ in terms of the Hessian and the optimal coordinate:

$$b = A x^\star$$
3. Synthesis of the Final Form
Substituting the resulting expression for $b$ back into the general gradient formula yields:

$$\nabla_x f(x) = A x - A x^\star$$

Factoring the matrix $A$ leads to the definitive representation:

$$\nabla_x f(x) = A \, (x - x^\star)$$
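A quick numerical check of this identity, with an arbitrary illustrative matrix $A$ and vector $b$:

```python
import numpy as np

A = np.array([[10.0, 0.0],
              [0.0, 1.0]])            # symmetric positive-definite "Hessian"
b = np.array([2.0, 3.0])

x_star = np.linalg.solve(A, b)        # minimizer: A x* = b
x = np.array([1.5, -0.5])             # an arbitrary point

grad_direct = A @ x - b               # gradient from the definition
grad_factored = A @ (x - x_star)      # gradient in factored form
assert np.allclose(grad_direct, grad_factored)
```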
This equation tells us that the magnitude of the gradient in a particular direction depends on the curvature in that direction (encoded in $A$).
- High Curvature Directions (steep walls of the canyon) will consistently produce large gradients.
- Low Curvature Directions (flat floor of the canyon) will consistently produce small gradients.
Therefore, by accumulating the square of the gradients over time for each parameter, we get a running estimate of the local curvature. This is precisely what AdaGrad does. It uses this accumulated value, $G_{t,ii}$, to normalize the learning rate for each parameter $i$:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$$

This allows AdaGrad to:
- Decrease the learning rate in directions with large, frequent gradients (the steep walls).
- Increase the learning rate in directions with small, infrequent gradients (the flat floor).
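To make this concrete, here is a small self-contained comparison on an ill-conditioned 2-D quadratic; the matrix, learning rates, and iteration count are arbitrary illustrative choices, not values from the text above.

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5 * x^T A x, minimum at the origin
A = np.diag([100.0, 1.0])                 # condition number kappa = 100

def grad(x):
    return A @ x

steps, eps = 200, 1e-8

# Plain gradient descent: a single global learning rate for both coordinates.
# It must stay below 2/lambda_max = 0.02 to avoid diverging along the steep axis.
x_gd = np.array([1.0, 1.0])
for _ in range(steps):
    x_gd = x_gd - 0.009 * grad(x_gd)

# AdaGrad: the per-coordinate accumulator shrinks the step along the steep
# direction and keeps it larger along the flat one.
x_ada, G, eta = np.array([1.0, 1.0]), np.zeros(2), 0.5
for _ in range(steps):
    g = grad(x_ada)
    G += g ** 2
    x_ada = x_ada - eta / np.sqrt(G + eps) * g

print("GD:     ", x_gd)    # still far from 0 along the flat (second) coordinate
print("AdaGrad:", x_ada)   # progress is balanced across both coordinates
```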
Summary
In essence, AdaGrad implements an adaptive, diagonal preconditioning scheme without ever needing to compute the Hessian, making it a powerful and efficient optimizer.
AdaGrad weakness
Intrinsic limitation of AdaGrad
In AdaGrad, each diagonal term of the accumulation matrix $G_t$:

$$G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$$

grows monotonically (sums of squares cannot decrease), even for parameters updated only rarely.
As a result, the effective learning rate $\frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}$ decays irreversibly over time for every parameter.
Side effects
- After many iterations, steps in the loss landscape become so small that learning may slow down or even stall.
- This internal decay adds on top of external LR schedulers (e.g., warm-up, cosine decay, etc.), causing the learning rate to collapse too quickly.
- In non-sparse problems, this aggressive reduction of the learning rate may prevent the optimizer from reaching a satisfactory minimum.
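A small sketch of this endogenous decay, assuming (purely for illustration) a stream of constant unit gradients:

```python
import numpy as np

eta, eps = 0.1, 1e-8
G = 0.0

# With a constant gradient of 1, G grows linearly with t,
# so the effective learning rate shrinks like eta / sqrt(t).
for t in range(1, 10001):
    G += 1.0 ** 2
    if t in (1, 100, 10000):
        print(t, eta / np.sqrt(G + eps))
# t = 1 -> 0.1, t = 100 -> 0.01, t = 10000 -> 0.001: the step size collapses
# even though the gradient signal itself has not changed.
```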
Why this happens
AdaGrad’s mechanism mixes two different goals in a single formula:
1. Equalizing the learning rate parameter by parameter → makes each update proportionally sensitive to that parameter's own “history” of gradients (a desirable feature of an adaptive optimizer).
2. Making the learning rate decay globally over time → since $G_{t,ii}$ grows with the training iterations, the factor $1/\sqrt{G_{t,ii}}$ causes all steps to shrink as training progresses (something that should be managed by an explicit LR scheduler, not by the optimizer itself).
Consequence
AdaGrad intertwines these two mechanisms, preventing separate control of:
- the learning rate adaptation for each parameter,
- the global decay of the learning rate over time.
Desired behavior: LR Decay is useful, when kept in check
Ideally, in many real-world applications, the learning rate should start high (exploration) and then decrease gradually (exploitation).
With AdaGrad, however, this decay is endogenous: it depends entirely on gradient accumulation and cannot be directly controlled by the user.
Beyond AdaGrad
Optimizers developed as evolutions of AdaGrad (RMSProp, AdaDelta, Adam) preserve its desirable feature of per-parameter adaptation, while delegating the decay of the learning rate to an explicit LR scheduler.
| Optimizer | Key idea | How it avoids the problem |
|---|---|---|
| RMSProp | Exponential moving average (EMA) of squared gradients | “Forgets” distant past gradients, preventing the accumulated term from growing without bound |
| AdaDelta | EMA of squared gradients and parameter updates | Normalizes the learning rate by balancing gradient magnitudes with past update magnitudes, ensuring stable and meaningful steps |
| Adam | RMSProp + momentum | Combines EMA of squared gradients (denominator) and EMA of gradients (numerator), ensuring adaptive steps that never become too small |
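A minimal sketch contrasting the two accumulators (the constant gradient and the decay rate 0.9 are illustrative choices):

```python
import numpy as np

g, beta = 1.0, 0.9          # constant gradient and a typical EMA decay rate
G_adagrad, v_rmsprop = 0.0, 0.0

for t in range(1, 1001):
    G_adagrad += g ** 2                                  # grows without bound
    v_rmsprop = beta * v_rmsprop + (1 - beta) * g ** 2   # saturates near g**2

print(G_adagrad)   # 1000.0 -> the effective learning rate keeps shrinking
print(v_rmsprop)   # ~1.0   -> the effective learning rate stabilizes
```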
Summary
AdaGrad is well-suited for very sparse problems and short training runs. For prolonged training or less sparse data, its evolutions (RMSProp, AdaDelta, Adam) allow for explicit control of the learning rate decay.