The analysis in BPTT Problems showed that backpropagating through a deep recurrent computation can amplify gradients exponentially. When the effective scale exceeds , the norm of the recurrent Jacobian product grows like , and the gradient of the loss with respect to the recurrent parameters can become numerically enormous. The optimizer then takes a step proportional to that gradient and overshoots wildly.
Gradient clipping is the standard defense against this failure mode. It does not change the model or the architecture; it intervenes only at the optimizer interface, rescaling the gradient whenever its norm exceeds a fixed threshold. The technique is due to Pascanu, Mikolov and Bengio (2013) and is used almost universally in RNN training.
Cliffs in the Loss Surface
The loss surfaces of recurrent networks have a characteristic geometric feature: cliffs. Most of the parameter space is fairly flat or gently curved, but a narrow region exists where a small change in causes a very large change in the loss. The reason is structural, and it can be built in four steps.
Step 1: the cell becomes linear in the right regime. The forward pass of an RNN applies the same recurrent matrix once per time step. When the hidden activations operate in the near-linear region of , where for small , the cell reduces to an approximately linear map:
Step 2: iterating the linear map produces a matrix power. Substituting this approximation times along the sequence, the hidden state at the end of the sequence is
The forward pass has effectively raised to the -th power.
Notation
The superscript in is the sequence length, not a transpose. The transpose of a matrix is written throughout these notes.
Step 3: a small change in is amplified by that power. Perturb the recurrent matrix by a tiny . Linearising the -th power around , the induced change in scales as
The exponent is the same one that appears in the recurrent Jacobian product analysed in BPTT Problems. It is the geometric signature of the recurrent structure: an effect that happens once per step is multiplied times.
Optional: derivation of the scaling
Write for for brevity. Expand as the product of identical factors and keep only first-order contributions in . The surviving terms are exactly those in which replaces in one of the positions:
If positions are numbered from right to left, the term with in position has copies of on its left and on its right, hence the form .
Each of the terms is a product of copies of and one copy of , so the submultiplicativity of the operator norm gives the bound
independent of . Summing the contributions yields the worst-case bound . In the exploding regime , the exponential dominates the polynomial prefactor asymptotically in , which is why Step 3 quotes only the exponential factor. Multiplying through by gives the scaling stated above.
Step 4: the loss inherits the amplification. The loss sees the recurrent matrix only through the hidden states it produces, so the sensitivity of to from Step 3 carries straight through to the loss. To first order, a change moves the loss by
up to a constant that does not depend on (it only measures how strongly the loss reacts to the final hidden state through the output head). The chain is the point: changes , and changes the loss, and the first link carries the exponential factor .
When , that factor is enormous for a long sequence: a microscopic change in produces a huge change in the loss. A huge change in the loss per unit change in the parameter is, by definition, a huge gradient. This amplification is not uniform across directions. It is strongest along the direction that stretches the most, its dominant eigenvector, because repeated multiplication grows fastest there; perturbations along the other directions are stretched less and stay tame. The steepness is therefore concentrated in a narrow band of parameter space. That band is the cliff: a thin region where the loss surface rises almost vertically, surrounded by terrain that looks flat by comparison.
A cliff is dangerous for gradient descent because of one hidden assumption. A step implicitly trusts that the loss stays approximately linear over the whole length of the step. The gradient is exact at the current point, but near a cliff it is so large that a step of the usual size travels far beyond the tiny neighborhood where the linear approximation actually holds. The update therefore overshoots the cliff and lands in an unrelated part of the loss surface, typically at a much higher loss, destroying the progress made so far.

The figure makes the consequence concrete. The top panel shows a descent trajectory without clipping: the path approaches the cliff, the gradient becomes enormous along the cliff face, and the next step ejects the parameters far away from the optimum. The bottom panel shows the same trajectory with clipping: the gradient direction is preserved, but its magnitude is capped, so the step length stays bounded and the trajectory follows the cliff edge instead of being thrown off.
The Clipping Rule
Let denote the gradient of the loss with respect to the parameters, accumulated over the mini-batch. Norm clipping replaces with a rescaled version whenever its norm exceeds a fixed threshold :
Two properties are worth making explicit.
- Direction is preserved. The rescaling is a positive scalar multiplication, so the clipped vector still points along . The optimizer is therefore still descending the loss in the same direction it would have chosen otherwise.
- Step length is capped. After clipping, holds by construction. The optimizer step is bounded by regardless of how steep the cliff was.
The norm used in the formula is the Euclidean norm of the concatenation of all parameter gradients, not the norm of each layer’s gradient computed in isolation. This matters in deep networks: per-layer thresholds would not bound the total update.
Norm clipping vs value clipping
An alternative variant, value clipping, simply truncates each component of into a fixed interval . It is simpler and faster to compute, but it does not preserve the gradient direction: large components are squashed disproportionately while small ones pass through unchanged. Norm clipping is the principled default and is the form implemented by
torch.nn.utils.clip_grad_norm_. Value clipping is available astorch.nn.utils.clip_grad_value_and is rarely the right choice in practice.
Choosing the Threshold
The threshold is a hyperparameter. Typical practice in RNN training uses
with a common default for vanilla and gated recurrent cells. The right value depends on three factors:
- the learning rate , because the worst-case optimizer step is ;
- the typical norm of unclipped gradients, which itself depends on depth, initialization, and batch size;
- the loss function and its scale (cross-entropy versus MSE, for instance).
In practice the choice is rarely critical: any in the order of to is usually sufficient to keep training stable, and the loss curve is fairly insensitive to the exact value as long as clipping actually triggers on the largest gradients.
Monitoring whether clipping fires
A useful diagnostic is to log the pre-clip gradient norm at every step. Three regimes are informative:
- clipping never triggers: is set too high and the technique is doing nothing;
- clipping triggers at almost every step: is set too low and the optimizer is permanently rate-limited;
- clipping triggers on the occasional spike, especially in early training: this is the healthy regime.
What Clipping Does Not Fix
Clipping is the canonical remedy against exploding gradients. It is not a remedy against vanishing gradients. The two failure modes have the same source (the product of recurrent Jacobians analyzed in BPTT Problems) but require entirely different interventions:
- exploding gradients are an optimizer-level problem and are fixed by a post-hoc rescaling like clipping;
- vanishing gradients are an architectural problem and are fixed by changing the cell so that the gradient has a high-bandwidth path through time. This is the role of the gates in LSTM and GRU, motivated in the limitations of vanilla RNNs.
Clipping also does nothing about the cost of unrolling a long sequence (the memory and compute that grow with ): that is bounded separately by the truncated and randomized schedules of BPTT Variants.
Clipping is therefore necessary but not sufficient. Modern RNN training stacks it on top of a gated cell, a gradient-friendly initialization of , a tuned learning-rate schedule, and, for long sequences, the truncated BPTT schedule of BPTT Variants.
Beyond RNNs
The figure and the historical motivation are recurrent, but gradient clipping is also a standard component of Transformer training, where it controls the rare large-norm steps produced by the attention softmax during early epochs. Default values in modern Transformer recipes are typically . The technique is, in this broader sense, less a remedy for a specific architecture than a generic safeguard against the rare optimizer step that would destroy the parameters accumulated so far.