The architectural promise made in Vanilla RNNs limitations and structurally realized in Cell state is the following: the LSTM allows gradient to travel back across many time steps without geometric attenuation, because the carry path no longer crosses a weight matrix at every step. This note works out that promise explicitly, by computing the recurrent Jacobian of the LSTM and comparing it to the vanilla RNN Jacobian derived in BPTT Problems.

The conclusion, in one line: the recurrent Jacobian along the cell state is the diagonal forget-gate matrix $\operatorname{diag}(f_t)$, not a product of dense matrices and saturating slopes. That single fact is the whole reason LSTMs train on long sequences.

The recurrent Jacobian of the cell state

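Recall the cell-state update from Putting all together:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

To compute $\partial c_t / \partial c_{t-1}$, two contributions must be considered. The direct dependence on $c_{t-1}$ is through the first term, $f_t \odot c_{t-1}$. The indirect dependence would be through the gates $f_t$, $i_t$ and the candidate $\tilde{c}_t$, but these are functions of $h_{t-1}$ and $x_t$, not of $c_{t-1}$. The cell state of the previous step does not feed back into the gates of the current step (it only feeds back through $h_{t-1}$ via the read-out $h_{t-1} = o_{t-1} \odot \tanh(c_{t-1})$, which is handled separately below).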

Restricting attention to the direct path through the carry line, the Jacobian is

$$\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)$$

This is the constant error carousel in modern language: the recurrent Jacobian of the memory line is diagonal, with entries in $(0, 1)$ that the learned forget gate sets afresh at every step.
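The diagonal structure can be checked numerically. The sketch below is a minimal illustration in plain Python, not the notes' reference implementation: the parameter names (`Wf`, `Wi`, `Wc`, `bf`, `bi`, `bc`), the sizes `d` and `m`, and the random values are all assumptions made for the example, and the cell is assumed to have no peephole connections. It estimates the Jacobian of the new cell state with respect to the previous one by central differences, holding the previous hidden state and the input fixed, and compares it with the forget-gate diagonal.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W @ v + b for a list-of-rows matrix W and list vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def cell_update(c_prev, h_prev, x, params):
    """Cell-state update c_t = f_t * c_prev + i_t * g_t (element-wise).

    The gates and the candidate g_t read only [h_prev; x], never c_prev.
    Returns (c_t, f_t).
    """
    hx = h_prev + x  # list concatenation: [h_{t-1}; x_t]
    f = [sigmoid(z) for z in affine(params["Wf"], hx, params["bf"])]
    i = [sigmoid(z) for z in affine(params["Wi"], hx, params["bi"])]
    g = [math.tanh(z) for z in affine(params["Wc"], hx, params["bc"])]
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    return c, f


def carry_jacobian(c_prev, h_prev, x, params, eps=1e-6):
    """Central-difference estimate of d c_t / d c_{t-1}, with h_{t-1}, x_t fixed."""
    n = len(c_prev)
    J = [[0.0] * n for _ in range(n)]
    for k in range(n):
        up = list(c_prev)
        dn = list(c_prev)
        up[k] += eps
        dn[k] -= eps
        c_up, _ = cell_update(up, h_prev, x, params)
        c_dn, _ = cell_update(dn, h_prev, x, params)
        for j in range(n):
            J[j][k] = (c_up[j] - c_dn[j]) / (2.0 * eps)
    return J


rng = random.Random(0)
d, m = 3, 2  # hypothetical hidden size and input size


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Hypothetical random parameters: one (d x (d + m)) matrix and one bias per gate.
params = {name: [rand_vec(d + m) for _ in range(d)] for name in ("Wf", "Wi", "Wc")}
params.update({name: rand_vec(d) for name in ("bf", "bi", "bc")})

c_prev, h_prev, x = rand_vec(d), rand_vec(d), rand_vec(m)
_, f = cell_update(c_prev, h_prev, x, params)
J = carry_jacobian(c_prev, h_prev, x, params)

# Largest deviation between the numerical Jacobian and diag(f_t).
max_err = max(
    abs(J[j][k] - (f[j] if j == k else 0.0)) for j in range(d) for k in range(d)
)
print("forget gate f_t:", [round(v, 4) for v in f])
print("max |J - diag(f_t)|:", max_err)
```

Because the update is linear in the previous cell state once the gates are fixed, the finite-difference estimate should agree with the forget-gate diagonal up to floating-point rounding, and every off-diagonal entry should be zero.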

Dimensions used in this note

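| symbol | role | shape |
| --- | --- | --- |
| $c_t$ | cell state | $\mathbb{R}^{d_h}$ |
| $h_t$ | hidden state | $\mathbb{R}^{d_h}$ |
| $f_t, i_t, o_t$ | gates | $\mathbb{R}^{d_h}$ |
| $\tilde{c}_t$ | candidate update | $\mathbb{R}^{d_h}$ |
| $\partial c_t / \partial c_{t-1}$ | recurrent Jacobian | $d_h \times d_h$ (diagonal) |
| $\partial h_t / \partial h_{t-1}$ | vanilla RNN Jacobian (for comparison) | $d_h \times d_h$ (dense) |

Here $d_h$ denotes the hidden size.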

Compare with the vanilla RNN

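The vanilla recurrent Jacobian derived in BPTT Problems is

$$\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\big(\tanh'(a_t)\big)\, W_{hh}$$

where $a_t$ is the pre-activation at step $t$: a dense matrix whose spectrum is controlled by $W_{hh}$ and the saturating slope $\tanh'(a_t)$. The LSTM Jacobian is diagonal and does not involve any recurrent weight matrix at all. The change in qualitative behaviour comes from this structural fact, not from any new tuning.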
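To make the structural contrast concrete, the following sketch builds both Jacobians for a small example and counts their off-diagonal entries. It assumes a tanh vanilla RNN with hypothetical weight names `W_hh` and `W_xh`, random values, and made-up forget-gate values; none of these come from the notes.

```python
import math
import random


def vanilla_jacobian(W_hh, W_xh, h_prev, x):
    """d h_t / d h_{t-1} for h_t = tanh(W_hh h_prev + W_xh x).

    Equals diag(1 - h_t**2) @ W_hh: row j of W_hh scaled by the tanh slope.
    """
    a = [
        sum(w * h for w, h in zip(row_h, h_prev))
        + sum(w * xi for w, xi in zip(row_x, x))
        for row_h, row_x in zip(W_hh, W_xh)
    ]
    slopes = [1.0 - math.tanh(z) ** 2 for z in a]
    return [[s * w for w in row] for s, row in zip(slopes, W_hh)]


def off_diagonal_nonzeros(J):
    """Number of nonzero entries outside the main diagonal."""
    return sum(
        1 for j, row in enumerate(J) for k, v in enumerate(row) if j != k and v != 0.0
    )


rng = random.Random(1)
d, m = 4, 2  # hypothetical hidden size and input size
W_hh = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(d)]
W_xh = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(d)]
h_prev = [rng.uniform(-1.0, 1.0) for _ in range(d)]
x = [rng.uniform(-1.0, 1.0) for _ in range(m)]

J_rnn = vanilla_jacobian(W_hh, W_xh, h_prev, x)

f_t = [0.9, 0.5, 0.99, 0.1]  # hypothetical forget-gate values
J_lstm = [[f_t[j] if j == k else 0.0 for k in range(d)] for j in range(d)]

print("vanilla off-diagonal nonzeros:", off_diagonal_nonzeros(J_rnn))
print("LSTM carry off-diagonal nonzeros:", off_diagonal_nonzeros(J_lstm))
```

The vanilla Jacobian mixes every coordinate with every other through the recurrent weights, while the carry-path Jacobian only rescales each coordinate on its own.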

The gradient product across many steps

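Multiplying the per-step Jacobians across $t - k$ consecutive time steps, from step $k+1$ to step $t$, gives

$$\frac{\partial c_t}{\partial c_k} = \prod_{s=k+1}^{t} \operatorname{diag}(f_s) = \operatorname{diag}\!\Big(\prod_{s=k+1}^{t} f_s\Big)$$

The product of diagonal matrices is diagonal, with entries equal to the element-wise products of the diagonals. The gradient that connects $c_t$ to $c_k$ is therefore controlled, coordinate by coordinate, by the product

$$\prod_{s=k+1}^{t} f_s^{(j)}$$

for each coordinate $j$. Two qualitative regimes follow: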

  • If the forget gates along a coordinate stay close to $1$, the product stays close to $1$ across many steps. The gradient survives. The vanishing-gradient pathology of the vanilla RNN, where the product is bounded by $\gamma^{\,t-k}$ with a per-step norm bound $\gamma < 1$ generically, is gone.
  • If the forget gates close along a coordinate at some intermediate step (one is enough), the product collapses to essentially zero from that step onward. The gradient is intentionally cut.

Both regimes are learned, not imposed. The network decides, gate by gate and coordinate by coordinate, which long-range dependencies to preserve and which to sever. This is exactly the operational definition of “preservation as a learned decision” anticipated in Vanilla RNNs limitations.
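A small worked example shows the two regimes side by side. The gate values below are invented for illustration (they are not taken from a trained network); the function simply multiplies the forget-gate values of each coordinate over a window of steps, which gives the diagonal of the multi-step carry Jacobian.

```python
def carry_gain(forget_gates):
    """Per-coordinate product of forget-gate values over a window of steps.

    forget_gates: list of per-step gate vectors, each a list of floats in (0, 1).
    Returns the diagonal of the multi-step carry Jacobian for that window.
    """
    gain = [1.0] * len(forget_gates[0])
    for f in forget_gates:
        gain = [g * fj for g, fj in zip(gain, f)]
    return gain


T = 100  # number of steps in the window

# Hypothetical gate trajectories:
#   coordinate 0: gate stays almost fully open (0.999 at every step)
#   coordinate 1: mildly leaky gate (0.9 at every step)
#   coordinate 2: open like coordinate 0, except for one step where it closes
gates = [[0.999, 0.9, 0.999] for _ in range(T)]
gates[50][2] = 1e-6

gain = carry_gain(gates)
print("open gate      :", gain[0])  # about 0.90, the gradient survives
print("leaky gate     :", gain[1])  # about 2.7e-05, geometric decay
print("one closed step:", gain[2])  # below 1e-06, the gradient is cut
```

The second coordinate is a reminder that the carry path is not immune to decay: a gate that sits at 0.9 behaves like a vanilla RNN with contraction factor 0.9. What changes is that the factor is chosen per coordinate by the network rather than fixed by the recurrent spectrum.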

The vanishing-gradient barrier dissolved, not bypassed

The vanilla recurrence had no stable way to satisfy the knife-edge condition $\lVert \partial h_t / \partial h_{t-1} \rVert \approx 1$ along every direction and every step. The LSTM does not solve that condition; it removes it. By replacing the matrix-product Jacobian with a diagonal one whose entries are individually confined to $(0, 1)$ and individually learned, the LSTM eliminates the geometric mechanism that produced vanishing or exploding gradients in the first place.

Exploding gradients are not architecturally eliminated: the gradient with respect to the LSTM’s parameters can still grow large if the loss landscape has cliffs (see Gradient Clipping). But the product across time along the carry path is bounded coordinate-wise by $1$, which is what mattered.

The full backward pass: a sketch

A full BPTT through the LSTM is more involved than the carry-path analysis above, because the gradient must also flow through the hidden-state read-out $h_t = o_t \odot \tanh(c_t)$, which closes a loop into the gates of the next step. Three observations make the full picture tractable.

  1. The gradient flowing backward through the cell state is governed by $\operatorname{diag}(f_t)$ at each step, as derived above. This is the long-range backbone.
  2. The gradient flowing backward through the hidden state at step $t$ comes from two sources: the downstream prediction head at step $t$, and the gates at step $t+1$ that consumed $h_t$. Each of these is a short-range path: it enters the cell at step $t+1$, passes through one affine map and one nonlinearity, and joins the cell-state backbone via the gate update equations.
  3. The hidden-state backward path does cross the recurrent (hidden-to-gate) weight matrices, exactly once per step, and therefore in principle suffers from the same vanilla-RNN pathology over long horizons. The reason this does not actually break long-range learning is that the vast majority of the long-range gradient signal travels through the cell-state line, where the Jacobian is diagonal and well-behaved; the hidden-state line is short-range and contributes the local correction, not the long-range backbone.

The take-away is that the LSTM provides a two-track gradient flow: a long-range, well-behaved highway through the cell state, and a short-range, locally-corrective channel through the hidden state. The first is what makes the architecture trainable on long sequences.
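The cell-state track of the backward pass can be sketched in a few lines. This is a partial illustration under stated assumptions, not a full BPTT: it assumes the read-out is the output gate times tanh of the cell state, uses hypothetical argument names, and deliberately omits the short-range gradients that flow into the gates and on to the previous hidden state.

```python
import math


def backward_carry(dh_t, dc_t, c_t, o_t, f_t):
    """Backpropagate one step along the cell-state track only.

    dh_t: gradient w.r.t. h_t (prediction head at step t plus gates at step t+1)
    dc_t: gradient arriving at c_t from step t+1 along the carry line
    c_t, o_t, f_t: cell state, output gate and forget gate saved from the forward pass
    Returns the gradient w.r.t. c_{t-1} through the carry path.
    """
    # Read-out h_t = o_t * tanh(c_t): the hidden-state gradient joins the backbone.
    dc_total = [
        dc + dh * o * (1.0 - math.tanh(c) ** 2)
        for dc, dh, o, c in zip(dc_t, dh_t, o_t, c_t)
    ]
    # Carry path c_t = f_t * c_{t-1} + ...: multiply by the forget gate.
    return [d * f for d, f in zip(dc_total, f_t)]


# With no hidden-state gradient, the step is a pure rescaling by f_t.
pure_carry = backward_carry(
    dh_t=[0.0, 0.0], dc_t=[1.0, 2.0], c_t=[0.3, -0.7], o_t=[0.5, 0.5], f_t=[0.5, 0.25]
)
print(pure_carry)  # [0.5, 0.5]

# With c_t = 0 the tanh slope is 1, so the hidden-state gradient is added in full.
with_readout = backward_carry(
    dh_t=[1.0], dc_t=[0.0], c_t=[0.0], o_t=[1.0], f_t=[0.5]
)
print(with_readout)  # [0.5]
```

Applying this function repeatedly from the last step to the first traces the long-range highway: at each step the local hidden-state contribution is added, and the running gradient is rescaled by that step's forget gate.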

Why this design predated residual networks by 18 years

The same two-track logic later turned out to be the recipe for very deep feedforward networks: a residual stream that carries the signal unchanged, with learned modules that perturb but do not replace it. ResNets (He et al., 2015) applied the idea across depth; Transformers carried it further still. The LSTM applied it across time in 1997, with the additional twist that the carry has a learned per-coordinate decay rather than a fixed identity.

Reading the LSTM as the temporal special case of “learned modules around an identity highway” makes its design feel less idiosyncratic and more like the first instance of a now-pervasive pattern.

Historical footnote

The original LSTM (Hochreiter and Schmidhuber, 1997) did not contain a forget gate. Its cell-state update was strictly additive,

$$c_t = c_{t-1} + i_t \odot \tilde{c}_t$$

with recurrent Jacobian exactly equal to the identity. The constant error carousel was, in this original formulation, truly constant: nothing decayed, nothing could be erased.

This was sufficient to prove the architectural point (gradients no longer vanish), but unsatisfactory in practice: the cell state could grow without bound, and the network had no way to clear obsolete memories. The forget gate was introduced by Gers, Schmidhuber and Cummins in 2000 to repair this. The resulting Jacobian is $\operatorname{diag}(f_t)$ rather than $I$, which is a strict generalization: setting $f_t = \mathbf{1}$ recovers the original CEC, and any non-trivial $f_t$ gives the network the ability to forget. The modern LSTM with forget gate, as derived in Putting all together, is the version actually used in practice and the version analysed throughout these notes.

What this section accomplished, and where it falls short

The LSTM dissolves the vanishing-gradient barrier of BPTT by routing the long-range gradient through a diagonal Jacobian, $\operatorname{diag}(f_t)$, instead of a product of dense matrices. Long-range training is no longer a knife-edge of the recurrent spectrum but a learned, per-coordinate decision. The same architectural pattern (an identity highway with learned modules perturbing it) later powered very deep feedforward networks and Transformers.

Two limitations of the recurrence remain even with the LSTM:

  • The forward computation is strictly sequential: $h_t$ cannot be computed before $h_{t-1}$, so the cell does not benefit from parallelism along the time axis. Training throughput is bounded by sequence length.
  • Long-range information must still be routed through a finite-width bottleneck, namely the fixed-size state $(c_t, h_t)$, no matter how distant the source. The cell can preserve information across long gaps, but it cannot grow more capacity on demand.

These two limitations are exactly what attention addresses: parallelism along the time axis, and a direct learned pointer from any position to any other, bypassing the recurrent bottleneck. The attention mechanism and the architectures built on top of it are the subject of the next section.