BPTT Problems

The derivation of Backpropagation Through Time shows that the gradients of an unfolded RNN contain products of recurrent Jacobians. Training a vanilla RNN runs into three distinct problems, and it helps to separate them at the outset:

The cost of unrolling the full sequence. A full forward and backward pass over $T$ time steps costs $O (T)$ in both time and memory. This problem is computational, and it is addressed by the chunking strategy of BPTT Variants (truncated BPTT, in its stateful and stateless modes).
Exploding gradients. Repeated recurrent multiplication can amplify the backward signal exponentially with the temporal distance.
Vanishing gradients. The same repeated multiplication can instead suppress the backward signal exponentially.

This note studies the second and third problems: the stability of the gradient as it flows backward through time. The first problem, scaling, is treated separately in BPTT Variants.

Historical perspective

The vanishing gradient problem was a major obstacle for recurrent models throughout the 1990s. For most of that period, the architectural mechanisms that eventually mitigated it (gated cells such as the LSTM, introduced in 1997, and later residual or skip connections) had not yet been developed or widely adopted. This is part of why training deep recurrent models was long considered so difficult.

The instability becomes especially visible when the unrolled computation spans many time steps: even $T = 50$ or $T = 100$ can already make the vanishing gradient problem severe.

The Recurrent Jacobian Product

Two factors determine how the magnitude of a backpropagated gradient changes as it moves backward:

the weight matrices it is multiplied by;
the activation functions, or more precisely their derivatives.

In a vanilla RNN both factors are applied repeatedly, because the same recurrent transition is traversed once per time step. The critical term, already isolated in the BPTT derivation, is the derivative that connects a later hidden state to an earlier one,

$$\frac{\partial h_{t + 1}}{\partial h_{k}} = \prod_{j = k}^{t} \frac{\partial h_{j + 1}}{\partial h_{j}} = \frac{\partial h_{t + 1}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{t - 1}} \cdots \frac{\partial h_{k + 1}}{\partial h_{k}} .$$

The product from $j = k$ to $j = t$ contains $t - k + 1$ factors: the longer the temporal gap, the more Jacobians the gradient must pass through.

The recurrent term, and its MLP analogue

The repeated multiplication by $W_{hh}$ that appears in this product is also what drives the error-signal recursion derived in BPTT. Combining its two equations
$δ_{t}^{h} = W_{y h}^{⊤} δ_{t}^{o} + W_{hh}^{⊤} δ_{t + 1}^{a}, \qquad δ_{t}^{a} = ϕ_{h}^{'} (a_{t}) ⊙ δ_{t}^{h},$
gives the compact recursion
$δ_{t}^{a} = ϕ_{h}^{'} (a_{t}) ⊙ (W_{y h}^{⊤} δ_{t}^{o} + W_{hh}^{⊤} δ_{t + 1}^{a}) .$
The term $W_{hh}^{⊤} δ_{t + 1}^{a}$ is the recurrent contribution: the same matrix $W_{hh}^{⊤}$ is encountered again and again as the error signal moves backward through time. Isolating this temporal path, and dropping the local output contribution for intuition, gives
$δ_{t}^{a} \approx ϕ_{h}^{'} (a_{t}) ⊙ (W_{hh}^{⊤} δ_{t + 1}^{a}),$
which is exactly the MLP error-backpropagation equation $δ^{l} = ((W^{l + 1})^{⊤} δ^{l + 1}) ⊙ σ^{'} (z^{l})$ , with one crucial difference. In an MLP each layer has its own weight matrix; here the same recurrent matrix is multiplied at every step. Repeated multiplication by one fixed matrix is precisely what makes RNN gradients prone to explode or vanish.
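The compact recursion can be sketched in a few lines of NumPy. This is a minimal illustration, assuming $ϕ_{h} = \tanh$ and a terminal condition $δ_{T + 1}^{a} = 0$; the function name `backward_deltas` and the array layout are assumptions made for the example, not part of the derivation.

```python
import numpy as np

def backward_deltas(a, delta_o, W_hh, W_yh):
    """Run delta_t^a = phi'(a_t) * (W_yh^T delta_t^o + W_hh^T delta_{t+1}^a) for phi = tanh.

    a:       (T, n) pre-activations a_t
    delta_o: (T, m) output error signals delta_t^o
    returns: (T, n) array holding delta_t^a for every time step
    """
    T, n = a.shape
    delta_a = np.zeros((T, n))
    delta_next = np.zeros(n)              # assumed: no error flows in from beyond the last step
    for t in reversed(range(T)):          # walk backward through time
        phi_prime = 1.0 - np.tanh(a[t]) ** 2
        delta_h = W_yh.T @ delta_o[t] + W_hh.T @ delta_next   # local term + recurrent term
        delta_a[t] = phi_prime * delta_h
        delta_next = delta_a[t]
    return delta_a
```

The same `W_hh.T` appears in every iteration of the loop, which is the code-level counterpart of the repeated multiplication discussed above.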

The Local Recurrent Jacobian

For a vanilla RNN, the pre-activation and hidden state at time $j + 1$ are

$$a_{j + 1} = W_{x h} x_{j + 1} + W_{hh} h_{j} + b_{h}, \qquad h_{j + 1} = ϕ_{h} (a_{j + 1}) .$$

Since $ϕ_{h}$ is applied element-wise, the Jacobian $\partial h_{j + 1} / \partial a_{j + 1}$ is diagonal (derived component by component in BPTT). Composing it with the linear step $\partial a_{j + 1} / \partial h_{j} = W_{hh}$ via the chain rule gives the local recurrent Jacobian

$$J_{j} := \frac{\partial h_{j + 1}}{\partial h_{j}} = \operatorname{diag} (ϕ_{h}^{'} (a_{j + 1})) W_{hh} .$$

Every step therefore contributes the same two factors: a diagonal matrix carrying the activation derivatives, and the recurrent weight matrix $W_{hh}$ . The full product becomes

$$\frac{\partial h_{t + 1}}{\partial h_{k}} = \prod_{j = k}^{t} J_{j} = \prod_{j = k}^{t} \operatorname{diag} (ϕ_{h}^{'} (a_{j + 1})) W_{hh} .$$
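The product of local Jacobians can be accumulated numerically as follows. This is a minimal sketch, assuming $ϕ_{h} = \tanh$; the helper names `rnn_step` and `jacobian_product` are hypothetical and chosen only for this example.

```python
import numpy as np

def rnn_step(h, x, W_xh, W_hh, b_h):
    """One vanilla RNN transition; returns the new hidden state and its pre-activation."""
    a = W_xh @ x + W_hh @ h + b_h
    return np.tanh(a), a

def jacobian_product(h_k, xs, W_xh, W_hh, b_h):
    """Accumulate the product of local Jacobians along the inputs in xs, starting from h_k."""
    total = np.eye(h_k.shape[0])
    h = h_k
    for x in xs:
        h, a = rnn_step(h, x, W_xh, W_hh, b_h)
        J = np.diag(1.0 - np.tanh(a) ** 2) @ W_hh   # diag(phi'(a_{j+1})) W_hh
        total = J @ total                           # later Jacobians multiply on the left
    return total
```

Multiplying on the left matches the ordering of the expanded product, in which the Jacobian of the latest step is the leftmost factor.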

The magnitude of this product is what decides between exploding and vanishing gradients. Two complementary lenses describe it: an eigenvalue intuition and a rigorous norm bound.

How the Product Grows or Shrinks

The single quantity that measures how much $W_{hh}$ can stretch a vector is its spectral norm, the largest singular value

$$σ_{\max} (W_{hh}) = \max_{∥ v ∥_{2} = 1} ∥ W_{hh} v ∥_{2} = \sqrt{λ_{\max} (W_{hh}^{⊤} W_{hh})} .$$

The first expression reads directly: among all unit-length vectors $v$ , $σ_{\max}$ is the largest length that $W_{hh} v$ can reach, i.e., the largest factor by which $W_{hh}$ can stretch any vector. The second expression is the standard identity relating the largest singular value to the eigenvalues of $W_{hh}^{⊤} W_{hh}$ : it is the square root of the largest such eigenvalue. This is the natural quantity to track, because during BPTT the error signal is multiplied by $W_{hh}^{⊤}$ , which has the same spectral norm as $W_{hh}$ , once per step.
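The equivalence of these characterizations can be checked numerically. The sketch below uses a randomly drawn matrix as a stand-in for $W_{hh}$ (an assumption for illustration) and computes the spectral norm in three ways, then confirms that randomly sampled unit vectors are never stretched by more than that amount.

```python
import numpy as np

rng = np.random.default_rng(0)
W_hh = rng.normal(size=(4, 4))      # hypothetical recurrent matrix

# Three routes to the same number.
sigma_svd = np.linalg.svd(W_hh, compute_uv=False)[0]          # singular values come sorted descending
sigma_norm = np.linalg.norm(W_hh, 2)                          # matrix 2-norm
sigma_eig = np.sqrt(np.linalg.eigvalsh(W_hh.T @ W_hh)[-1])    # eigvalsh returns ascending eigenvalues

# Sample unit vectors (columns of v) and measure how far W_hh stretches them.
v = rng.normal(size=(4, 1000))
v /= np.linalg.norm(v, axis=0)
max_stretch = np.linalg.norm(W_hh @ v, axis=0).max()

print(sigma_svd, sigma_norm, sigma_eig, max_stretch)
```

The sampled maximum only approaches the spectral norm from below, since random vectors rarely align exactly with the top singular direction.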

Eigenvalue intuition

Suppose, for intuition, that every local Jacobian equals the same diagonalizable matrix $J$ with eigenvalues $λ_{1}, \dots, λ_{n}$ and eigenvectors $v_{1}, \dots, v_{n}$ . A perturbation along $v_{i}$ is then scaled by $λ_{i}$ at each step, so across $r$ recurrent steps it is scaled by $λ_{i}^{r}$ . Consequently:

if the dominant eigenvalue magnitude is below $1$ , repeated multiplication suppresses the component and the gradient tends to vanish;
if it is above $1$ , repeated multiplication amplifies the component and the gradient can explode.

This is only intuition: in a real RNN the diagonal factor $\operatorname{diag} (ϕ_{h}^{'} (a_{j + 1}))$ depends on the activations, so the product is one of time-varying matrices, not a power of a single fixed matrix.
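The fixed-matrix idealization is easy to simulate. The sketch below builds a matrix with prescribed eigenvalues (the values $1.1$, $1.0$ and $0.9$ and the orthonormal eigenvectors are assumptions chosen for illustration) and applies it $50$ times to each eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # orthonormal columns play the role of eigenvectors v_i
lam = np.array([1.1, 1.0, 0.9])                # illustrative eigenvalues
J = V @ np.diag(lam) @ V.T                     # fixed matrix with exactly these eigenpairs

r = 50
scales = []
for i in range(3):
    v = V[:, i].copy()
    for _ in range(r):
        v = J @ v                              # one step through the same fixed Jacobian
    scales.append(np.linalg.norm(v) / np.linalg.norm(V[:, i]))

for l, s in zip(lam, scales):
    print(f"lambda = {l}: measured scale {s:.3e}, lambda^r = {l ** r:.3e}")
```

The measured scale along each eigenvector matches $λ_{i}^{r}$: growth for the eigenvalue above $1$, decay for the one below.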

Norm-based bound

A rigorous bound, valid without diagonalizability or a fixed matrix, comes from matrix norms. The key property is submultiplicativity: for any two matrices, the norm of a product is at most the product of the norms, $∥ AB ∥ \leq ∥ A ∥ ∥ B ∥$ . Applied to the local Jacobian $J_{j} = \operatorname{diag} (ϕ_{h}^{'} (a_{j + 1})) W_{hh}$ ,

$$∥ J_{j} ∥ \leq ∥ \operatorname{diag} (ϕ_{h}^{'} (a_{j + 1})) ∥ \, ∥ W_{hh} ∥ \leq γ_{h} σ_{\max} (W_{hh}) .$$

The factor $∥ W_{hh} ∥ = σ_{\max} (W_{hh})$ bounds the recurrent contribution. The other factor is the norm of a diagonal matrix, which equals the largest magnitude among its diagonal entries, because a diagonal matrix simply scales each coordinate independently and the largest stretch is the largest scaling factor. Here those entries are the activation derivatives $ϕ_{h}^{'} (a_{j + 1})$ , so the norm is at most $γ_{h}$ , the maximum of $| ϕ_{h}^{'} |$ over all inputs: $γ_{h} = 1$ for $\tanh$ , and $γ_{h} = 0.25$ for the logistic sigmoid.

The product over $t - k + 1$ steps is bounded by applying submultiplicativity once per factor: the norm of the whole product is at most the product of the per-step norms,

$$\left\lVert \frac{\partial h_{t + 1}}{\partial h_{k}} \right\rVert = \left\lVert \prod_{j = k}^{t} J_{j} \right\rVert \leq \prod_{j = k}^{t} ∥ J_{j} ∥ \leq (γ_{h} σ_{\max} (W_{hh}))^{t - k + 1} .$$

Define the effective per-step scale

$$γ = γ_{h} σ_{\max} (W_{hh}) .$$

If $γ < 1$ , the bound decays exponentially with the temporal gap and the gradient vanishes. If $γ > 1$ , the bound grows exponentially and the gradient can explode. The exponent $t - k + 1$ is what turns a small per-step deviation from $1$ into an exponential effect across time.
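The bound can be compared against the actual norm of the Jacobian product in a small simulation. This is a sketch under stated assumptions: $ϕ_{h} = \tanh$ (so $γ_{h} = 1$), a random recurrent matrix rescaled to spectral norm $0.9$, and Gaussian noise standing in for the input term $W_{x h} x + b_{h}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 8, 30
W_hh = rng.normal(size=(n, n))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)       # rescale so that sigma_max(W_hh) = 0.9
gamma = 1.0 * np.linalg.norm(W_hh, 2)       # gamma = gamma_h * sigma_max, with gamma_h = 1 for tanh

h = rng.normal(size=n)
product = np.eye(n)
norms, bounds = [], []
for r in range(1, steps + 1):
    a = W_hh @ h + rng.normal(size=n)       # random stand-in for W_xh x + b_h
    h = np.tanh(a)
    J = np.diag(1.0 - h ** 2) @ W_hh        # tanh'(a) = 1 - tanh(a)^2
    product = J @ product
    norms.append(np.linalg.norm(product, 2))   # actual spectral norm of the product
    bounds.append(gamma ** r)                  # norm-based upper bound

for r in (1, 10, 20, 30):
    print(f"gap {r:2d}: norm = {norms[r - 1]:.3e}, bound = {bounds[r - 1]:.3e}")
```

The measured norm always sits at or below the bound, and is usually well below it, because the activation derivatives are typically smaller than their maximum.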

The Two Failure Modes

The same recurrent structure produces opposite pathologies depending on the value of the effective scale $γ = γ_{h} σ_{\max} (W_{hh})$ .

Exploding Gradients

When the recurrent matrix has a large spectral norm, so that the effective scale exceeds $1$ ,
$γ = γ_{h} σ_{\max} (W_{hh}) > 1 ⟹ \text{exploding gradients become possible},$
the repeated recurrent multiplication can amplify the backward signal exponentially. Because $γ$ only bounds the product from above, $γ > 1$ is a necessary condition for explosion rather than a guarantee. When explosion does occur, the gradient can become extremely large, making the first-order approximation used by gradient descent unreliable and the parameter updates numerically unstable.

Solution: Gradient Clipping

The standard optimizer-level fix is gradient clipping: after backpropagation but before the update, the gradient vector $g$ is rescaled if its norm exceeds a threshold $M$ ,
$g \leftarrow \frac{M}{∥ g ∥_{2}} g \quad \text{if } ∥ g ∥_{2} > M .$
Clipping does not remove the recurrent cause of the explosion; it caps the resulting step so it cannot become numerically destructive. The geometric justification (cliffs in the loss surface) and the choice of the threshold $M$ are developed in Gradient Clipping.
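The clipping rule translates directly into code. This is a minimal sketch for a single flat gradient vector; the function name `clip_by_norm` is an assumption for this example and does not refer to any particular library.

```python
import numpy as np

def clip_by_norm(g, M):
    """Rescale g to L2 norm M if its norm exceeds M; otherwise return it unchanged."""
    norm = np.linalg.norm(g)
    if norm > M:
        return g * (M / norm)    # direction preserved, length capped at M
    return g

g = np.array([3.0, 4.0])         # norm 5
clipped = clip_by_norm(g, 1.0)   # approximately [0.6, 0.8], norm 1
```

Only the length of the step changes; its direction is the same as that of the original gradient.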

Vanishing Gradients

When the effective scale falls below $1$ ,
$γ = γ_{h} σ_{\max} (W_{hh}) < 1 ⟹ \text{vanishing gradients},$
the repeated multiplication suppresses the backward signal. Because $γ_{h} = 1$ for $\tanh$ , a spectral norm $σ_{\max} (W_{hh}) < 1$ already guarantees vanishing, and the effect is even stronger when $\tanh$ saturates and $ϕ_{h}^{'} (a_{t})$ drops toward $0$ . Early hidden states then receive almost no learning signal from later losses.

Solution: Gated Memory and Attention Mechanism

Vanishing gradients are harder to fix with a post-processing trick, because the signal has already been attenuated before the optimizer sees it. The effective solution is architectural.

LSTMs and GRUs modify the recurrent cell with gates and a memory path that let information and gradients persist across long temporal distances; the structural motivation is developed in the limitations of vanilla RNNs. Later, attention mechanisms provide an even shorter route: instead of forcing all long-range information through every recurrent step, the model combines distant hidden states directly when needed.

Two further remedies are complementary to the two solutions above:

Careful initialization of $W_{hh}$ keeps $σ_{\max} (W_{hh})$ near $1$ from the start. The variance-preserving schemes of Xavier and He initialization aim to keep the signal and gradient variance roughly constant from layer to layer, which places the network near the edge between the vanishing and exploding regimes; regularization plays a similar role by indirectly bounding the spectral norm during training.
Non-saturating activations such as ReLU and its variants reduce the activation-derivative attenuation. They do not by themselves solve the problem, however, since the recurrent matrix $W_{hh}$ can still amplify or suppress the gradient regardless of the activation.
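The first of these remedies can be illustrated with a minimal sketch. It assumes a hypothetical helper, `init_with_spectral_norm`, that draws a Gaussian matrix and rescales it to a chosen spectral norm. This is a simple stand-in for illustration and is not Xavier or He initialization.

```python
import numpy as np

def init_with_spectral_norm(n, target=1.0, seed=0):
    """Draw a random n x n matrix and rescale it so that its largest singular value equals target."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, n)) / np.sqrt(n)       # illustrative Gaussian draw
    return W * (target / np.linalg.norm(W, 2))     # rescale to the requested spectral norm

W_hh = init_with_spectral_norm(64, target=1.0)
print(np.linalg.norm(W_hh, 2))                     # close to 1.0 by construction
```

Such a rescaling only fixes the starting point: nothing prevents the spectral norm from drifting away from $1$ once training updates $W_{hh}$ .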

Consequences for Long-Term Dependencies

Vanishing gradients are not exclusive to RNNs; they also occur in deep feedforward networks. RNNs are especially vulnerable because unrolling them through time creates a very deep computational graph with repeated use of the same recurrent transformation, so a per-step deviation of $γ$ from $1$ compounds over the whole sequence.
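As a rough worked illustration, take the per-step scale to be constant (the values $0.9$ and $1.1$ are assumptions chosen for illustration, not measurements from a trained model). The bound $γ^{T}$ then behaves as follows:

$$0.9^{50} \approx 5.2 \times 10^{-3}, \qquad 0.9^{100} \approx 2.7 \times 10^{-5}, \qquad 1.1^{50} \approx 1.2 \times 10^{2}, \qquad 1.1^{100} \approx 1.4 \times 10^{4} .$$

A deviation of only ten percent from $1$ therefore shifts the bound by between two and five orders of magnitude over $50$ to $100$ steps.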

The practical consequence is a difficulty in learning long-range dependencies. Information from the distant past may be present in the hidden state in principle, but the learning signal needed to adjust the parameters that would exploit it becomes too weak to be useful. This single limitation is the central reason vanilla RNNs were eventually displaced by gated cells and, later, by attention.
