Gradient Clipping

The analysis in BPTT Problems showed that backpropagating through a deep recurrent computation can amplify gradients exponentially. When the effective scale $γ = γ_{h} σ_{m a x} (W_{hh})$ exceeds $1$ , the norm of the recurrent Jacobian product grows like $γ^{t - k + 1}$ , and the gradient of the loss with respect to the recurrent parameters can become numerically enormous. The optimizer then takes a step proportional to that gradient and overshoots wildly.

Gradient clipping is the standard defense against this failure mode. It does not change the model or the architecture; it intervenes only at the optimizer interface, rescaling the gradient whenever its norm exceeds a fixed threshold. The technique is due to Pascanu, Mikolov and Bengio (2013) and is used almost universally in RNN training.

Cliffs in the Loss Surface

The loss surfaces of recurrent networks have a characteristic geometric feature: cliffs. Most of the parameter space is fairly flat or gently curved, but a narrow region exists where a small change in $W_{hh}$ causes a very large change in the loss. The reason is structural, and it can be built in four steps.

Step 1: the cell becomes linear in the right regime. The forward pass of an RNN applies the same recurrent matrix once per time step. When the hidden activations operate in the near-linear region of $tanh$ , where $tanh (z) \approx z$ for small $z$ , the cell reduces to an approximately linear map:

h_{t + 1} \approx W_{hh} h_{t} + W_{x h} x_{t + 1} + b_{h} .

Step 2: iterating the linear map produces a matrix power. Substituting this approximation $T$ times along the sequence, the hidden state at the end of the sequence is

h_{T} \approx W_{hh}^{T} h_{0} + (accumulated input-driven terms) .

The forward pass has effectively raised $W_{hh}$ to the $T$ -th power.

Notation

The superscript $T$ in $W_{hh}^{T}$ is the sequence length, not a transpose. The transpose of a matrix $A$ is written $A^{⊤}$ throughout these notes.

Step 3: a small change in $W_{hh}$ is amplified by that power. Perturb the recurrent matrix by a tiny $Δ W_{hh}$ . Linearising the $T$ -th power around $W_{hh}$ , the induced change in $h_{T}$ scales as

∥Δ h_{T} ∥ \sim ∥ W_{hh} ∥^{T - 1} ∥Δ W_{hh} ∥.

The exponent $T - 1$ is the same one that appears in the recurrent Jacobian product analysed in BPTT Problems. It is the geometric signature of the recurrent structure: an effect that happens once per step is multiplied $T$ times.

Optional: derivation of the $∥ W_{hh} ∥^{T - 1}$ scaling

Write $W$ for $W_{hh}$ for brevity. Expand $(W + Δ W)^{T}$ as the product of $T$ identical factors $(W + Δ W)$ and keep only first-order contributions in $Δ W$ . The surviving terms are exactly those in which $Δ W$ replaces $W$ in one of the $T$ positions:
$(W + Δ W)^{T} = W^{T} + k = 0 \sum T - 1 W^{T - 1 - k} Δ W W^{k} + O (∥Δ W ∥^{2}) .$
If positions are numbered $0, \dots, T - 1$ from right to left, the term with $Δ W$ in position $k$ has $T - 1 - k$ copies of $W$ on its left and $k$ on its right, hence the form $W^{T - 1 - k} Δ W W^{k}$ .

Each of the $T$ terms is a product of $T - 1$ copies of $W$ and one copy of $Δ W$ , so the submultiplicativity of the operator norm gives the bound
$∥ W^{T - 1 - k} Δ W W^{k} ∥ \leq ∥ W ∥^{T - 1 - k} ∥Δ W ∥ ∥ W ∥^{k} = ∥ W ∥^{T - 1} ∥Δ W ∥,$
independent of $k$ . Summing the $T$ contributions yields the worst-case bound $T ∥ W ∥^{T - 1} ∥Δ W ∥$ . In the exploding regime $∥ W ∥ > 1$ , the exponential $∥ W ∥^{T - 1}$ dominates the polynomial prefactor $T$ asymptotically in $T$ , which is why Step 3 quotes only the exponential factor. Multiplying through by $h_{0}$ gives the scaling $∥Δ h_{T} ∥ \sim ∥ W_{hh} ∥^{T - 1} ∥Δ W_{hh} ∥$ stated above.

Step 4: the loss inherits the amplification. The loss sees the recurrent matrix only through the hidden states it produces, so the sensitivity of $h_{T}$ to $W_{hh}$ from Step 3 carries straight through to the loss. To first order, a change $Δ W_{hh}$ moves the loss by

∣Δ L ∣ \sim ∥ W_{hh} ∥^{T - 1} ∥Δ W_{hh} ∥,

up to a constant that does not depend on $T$ (it only measures how strongly the loss reacts to the final hidden state through the output head). The chain is the point: $Δ W_{hh}$ changes $h_{T}$ , and $h_{T}$ changes the loss, and the first link carries the exponential factor $∥ W_{hh} ∥^{T - 1}$ .

When $∥ W_{hh} ∥ > 1$ , that factor is enormous for a long sequence: a microscopic change in $W_{hh}$ produces a huge change in the loss. A huge change in the loss per unit change in the parameter is, by definition, a huge gradient. This amplification is not uniform across directions. It is strongest along the direction that $W_{hh}$ stretches the most, its dominant eigenvector, because repeated multiplication grows fastest there; perturbations along the other directions are stretched less and stay tame. The steepness is therefore concentrated in a narrow band of parameter space. That band is the cliff: a thin region where the loss surface rises almost vertically, surrounded by terrain that looks flat by comparison.

A cliff is dangerous for gradient descent because of one hidden assumption. A step $θ \leftarrow θ - η \nabla L$ implicitly trusts that the loss stays approximately linear over the whole length of the step. The gradient is exact at the current point, but near a cliff it is so large that a step of the usual size $η$ travels far beyond the tiny neighborhood where the linear approximation actually holds. The update therefore overshoots the cliff and lands in an unrelated part of the loss surface, typically at a much higher loss, destroying the progress made so far.

The figure makes the consequence concrete. The top panel shows a descent trajectory without clipping: the path approaches the cliff, the gradient becomes enormous along the cliff face, and the next step ejects the parameters far away from the optimum. The bottom panel shows the same trajectory with clipping: the gradient direction is preserved, but its magnitude is capped, so the step length stays bounded and the trajectory follows the cliff edge instead of being thrown off.

The Clipping Rule

Let $g$ denote the gradient of the loss with respect to the parameters, accumulated over the mini-batch. Norm clipping replaces $g$ with a rescaled version whenever its norm exceeds a fixed threshold $M$ :

g \leftarrow ⎩ ⎨ ⎧ \frac{M g}{∥ g ∥}, g, if ∥ g ∥ > M, otherwise.

Two properties are worth making explicit.

Direction is preserved. The rescaling is a positive scalar multiplication, so the clipped vector still points along $g$ . The optimizer is therefore still descending the loss in the same direction it would have chosen otherwise.
Step length is capped. After clipping, $∥ g ∥ \leq M$ holds by construction. The optimizer step $η g$ is bounded by $η M$ regardless of how steep the cliff was.

The norm used in the formula is the Euclidean norm of the concatenation of all parameter gradients, not the norm of each layer’s gradient computed in isolation. This matters in deep networks: per-layer thresholds would not bound the total update.

Norm clipping vs value clipping

An alternative variant, value clipping, simply truncates each component of $g$ into a fixed interval $[- M, M]$ . It is simpler and faster to compute, but it does not preserve the gradient direction: large components are squashed disproportionately while small ones pass through unchanged. Norm clipping is the principled default and is the form implemented by torch.nn.utils.clip_grad_norm_. Value clipping is available as torch.nn.utils.clip_grad_value_ and is rarely the right choice in practice.

Choosing the Threshold $M$

The threshold is a hyperparameter. Typical practice in RNN training uses

M \in {0.25, 0.5, 1.0, 5.0},

with $M = 0.5$ a common default for vanilla and gated recurrent cells. The right value depends on three factors:

the learning rate $η$ , because the worst-case optimizer step is $η M$ ;
the typical norm of unclipped gradients, which itself depends on depth, initialization, and batch size;
the loss function and its scale (cross-entropy versus MSE, for instance).

In practice the choice is rarely critical: any $M$ in the order of $0.1$ to $10$ is usually sufficient to keep training stable, and the loss curve is fairly insensitive to the exact value as long as clipping actually triggers on the largest gradients.

Monitoring whether clipping fires

A useful diagnostic is to log the pre-clip gradient norm at every step. Three regimes are informative:

clipping never triggers: $M$ is set too high and the technique is doing nothing;

clipping triggers at almost every step: $M$ is set too low and the optimizer is permanently rate-limited;

clipping triggers on the occasional spike, especially in early training: this is the healthy regime.

What Clipping Does Not Fix

Clipping is the canonical remedy against exploding gradients. It is not a remedy against vanishing gradients. The two failure modes have the same source (the product of recurrent Jacobians analyzed in BPTT Problems) but require entirely different interventions:

exploding gradients are an optimizer-level problem and are fixed by a post-hoc rescaling like clipping;
vanishing gradients are an architectural problem and are fixed by changing the cell so that the gradient has a high-bandwidth path through time. This is the role of the gates in LSTM and GRU, motivated in the limitations of vanilla RNNs.

Clipping also does nothing about the cost of unrolling a long sequence (the memory and compute that grow with $T$ ): that is bounded separately by the truncated and randomized schedules of BPTT Variants.

Clipping is therefore necessary but not sufficient. Modern RNN training stacks it on top of a gated cell, a gradient-friendly initialization of $W_{hh}$ , a tuned learning-rate schedule, and, for long sequences, the truncated BPTT schedule of BPTT Variants.

Beyond RNNs

The figure and the historical motivation are recurrent, but gradient clipping is also a standard component of Transformer training, where it controls the rare large-norm steps produced by the attention softmax during early epochs. Default values in modern Transformer recipes are typically $M = 1.0$ . The technique is, in this broader sense, less a remedy for a specific architecture than a generic safeguard against the rare optimizer step that would destroy the parameters accumulated so far.

Deep Learning: Zero to Hero

Explorer

Gradient Clipping

Cliffs in the Loss Surface

The Clipping Rule

Choosing the Threshold $M$

What Clipping Does Not Fix

Graph View

Table of Contents

Backlinks

Deep Learning: Zero to Hero

Explorer

Gradient Clipping

Cliffs in the Loss Surface

The Clipping Rule

Choosing the Threshold M

What Clipping Does Not Fix

Graph View

Table of Contents

Backlinks

Choosing the Threshold $M$