Gradient in LSTM

The architectural promise made in Vanilla RNNs limitations and structurally realized in Cell state is the following: the LSTM lets the gradient travel back across many time steps without geometric attenuation, because the carry path no longer crosses a weight matrix at every step. This note works out that promise explicitly by computing the recurrent Jacobian of the LSTM and comparing it to the vanilla RNN Jacobian derived in BPTT Problems.

The conclusion, in one line: the recurrent Jacobian along the cell state is the diagonal forget-gate matrix $diag (f_{t})$ , not a product of dense matrices and saturating slopes. That single fact is the whole reason LSTMs train on long sequences.

The recurrent Jacobian of the cell state

Recall the cell-state update from Putting all together:

$c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ \tilde{c}_{t} .$

To compute $\partial c_{t} / \partial c_{t - 1}$, two contributions must be considered. The direct dependence on $c_{t - 1}$ is through the first term, $f_{t} ⊙ c_{t - 1}$. An indirect dependence would run through the gates $f_{t}, i_{t}$ and the candidate $\tilde{c}_{t}$, but these are functions of $(x_{t}, h_{t - 1})$, not of $c_{t - 1}$ directly. The previous cell state reaches the current gates only through $h_{t - 1}$, via the read-out $h_{t - 1} = o_{t - 1} ⊙ tanh (c_{t - 1})$, and that route is handled separately below.

Restricting attention to the direct path through the carry line, the Jacobian is

$\frac{\partial c_{t}}{\partial c_{t - 1}} = \operatorname{diag}(f_{t}) .$

Multivariable chain rule, written out

The cell state at step $t$ is a function of three inputs, $x_{t}$, $h_{t - 1}$ and $c_{t - 1}$:
$c_{t} = f_{t} (x_{t}, h_{t - 1}) ⊙ c_{t - 1} + i_{t} (x_{t}, h_{t - 1}) ⊙ \tilde{c}_{t} (x_{t}, h_{t - 1}) .$
The arguments of the gates are written out explicitly to make the dependences visible: each gate is a function of $x_{t}$ and $h_{t - 1}$, never of $c_{t - 1}$ directly. The derivative of $c_{t}$ with respect to $c_{t - 1}$ at fixed $(x_{t}, h_{t - 1})$ therefore comes from the first term alone:
$\left. \frac{\partial c_{t}}{\partial c_{t - 1}} \right|_{x_{t}, h_{t - 1}\ \text{fixed}} = \frac{\partial}{\partial c_{t - 1}} (f_{t} ⊙ c_{t - 1}) = \operatorname{diag}(f_{t}),$
using coordinate independence of $⊙$ (the $j$ -th output coordinate depends only on the $j$ -th input coordinate), so the Jacobian is diagonal.
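The diagonal structure can be checked numerically. The following is a minimal sketch, assuming three units and arbitrary illustrative values for the gates and the candidate; `cell_update` and `numerical_jacobian` are hypothetical helper names, not part of any library. The gates are held fixed, matching the condition of fixed $(x_{t}, h_{t - 1})$ used in the derivation.

```python
# Finite-difference check that d c_t / d c_{t-1} = diag(f_t) when the gates are held fixed.
# All numeric values below are arbitrary illustrative choices.

def cell_update(c_prev, f, i, g):
    """c_t = f * c_{t-1} + i * g (element-wise), with g standing in for the candidate."""
    return [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]

def numerical_jacobian(fn, x, eps=1e-6):
    """Central-difference Jacobian: jac[r][col] = d fn(x)[r] / d x[col]."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for col in range(n):
        plus = list(x)
        plus[col] += eps
        minus = list(x)
        minus[col] -= eps
        out_plus, out_minus = fn(plus), fn(minus)
        for r in range(n):
            jac[r][col] = (out_plus[r] - out_minus[r]) / (2 * eps)
    return jac

f = [0.9, 0.5, 0.1]        # forget gate
i = [0.3, 0.7, 0.2]        # input gate
g = [0.4, -0.6, 0.8]       # candidate update
c_prev = [1.0, -2.0, 0.5]  # previous cell state

jac = numerical_jacobian(lambda c: cell_update(c, f, i, g), c_prev)
for row in jac:
    print([round(v, 6) for v in row])
```

The printed matrix should carry the entries of `f` on its diagonal and zeros elsewhere, up to finite-difference error.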

The full BPTT through the unrolled network also tracks the path through $h_{t - 1}$: at step $t - 1$ the read-out $h_{t - 1} = o_{t - 1} ⊙ tanh (c_{t - 1})$ closes a second route from $c_{t - 1}$ into $c_{t}$, namely $c_{t - 1} \to h_{t - 1} \to \text{gates at step } t \to c_{t}$. By the multivariable chain rule, the total derivative is
$\frac{d c_{t}}{d c_{t - 1}} = \underbrace{\operatorname{diag}(f_{t})}_{\text{direct carry path}} + \underbrace{\frac{\partial c_{t}}{\partial h_{t - 1}} \frac{d h_{t - 1}}{d c_{t - 1}}}_{\text{indirect path through the read-out}} .$
The first term is the long-range backbone; the second is short-range and dense, and is the subject of the “Two-track gradient flow” discussion below.
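The split into a direct and an indirect term can be verified on a scalar cell. This is a sketch under stated assumptions: a single unit, hypothetical weights in `PARAMS`, a fixed input `X_T`, and an output gate `O_PREV` at step $t - 1$ that is held constant because it does not depend on $c_{t - 1}$. None of these values come from the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar weights per gate: (input weight, recurrent weight, bias).
PARAMS = {
    "f": (0.5, -0.4, 1.0),
    "i": (-0.3, 0.8, 0.0),
    "g": (0.7, 0.6, -0.2),
}
X_T = 0.5      # input at step t (illustrative)
O_PREV = 0.6   # output gate at step t-1, held fixed (illustrative)

def step(c_prev):
    """Return (c_t, f_t, i_t, g_t) for a scalar LSTM cell fed by h_{t-1} = o_{t-1} * tanh(c_{t-1})."""
    h_prev = O_PREV * math.tanh(c_prev)
    pre = {k: w * X_T + u * h_prev + b for k, (w, u, b) in PARAMS.items()}
    f_t, i_t, g_t = sigmoid(pre["f"]), sigmoid(pre["i"]), math.tanh(pre["g"])
    return f_t * c_prev + i_t * g_t, f_t, i_t, g_t

c_prev = 0.8
c_t, f_t, i_t, g_t = step(c_prev)

# Direct carry path: the forget gate itself.
direct = f_t

# Indirect path: (d c_t / d h_{t-1}) * (d h_{t-1} / d c_{t-1}).
u_f, u_i, u_g = PARAMS["f"][1], PARAMS["i"][1], PARAMS["g"][1]
dc_dh = (c_prev * f_t * (1 - f_t) * u_f      # through the forget gate
         + g_t * i_t * (1 - i_t) * u_i       # through the input gate
         + i_t * (1 - g_t ** 2) * u_g)       # through the candidate
dh_dc = O_PREV * (1 - math.tanh(c_prev) ** 2)
indirect = dc_dh * dh_dc

# Central-difference estimate of the total derivative.
eps = 1e-6
numeric = (step(c_prev + eps)[0] - step(c_prev - eps)[0]) / (2 * eps)
print(direct, indirect, direct + indirect, numeric)
```

The sum `direct + indirect` should agree with `numeric`; only the `direct` term is free of recurrent weights.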

This is the constant error carousel in modern language: the recurrent Jacobian of the memory line is diagonal, with entries in $(0, 1)$ learned at every step.

Dimensions used in this note

symbol	role	shape
$c_{t - 1}, c_{t}$	cell state	$n_{neurons}$
$h_{t - 1}, h_{t}$	hidden state	$n_{neurons}$
$f_{t}, i_{t}, o_{t}$	gates	$n_{neurons}$
$\tilde{c}_{t}$	candidate update	$n_{neurons}$
$\partial c_{t} / \partial c_{t - 1}$	recurrent Jacobian	$n_{neurons} \times n_{neurons}$ (diagonal)
$\partial h_{t} / \partial h_{t - 1}$	vanilla RNN Jacobian (for comparison)	$n_{neurons} \times n_{neurons}$ (dense)

Compare with the vanilla RNN

The vanilla recurrent Jacobian derived in BPTT Problems is
$\frac{\partial h _{t}}{\partial h _{t - 1}} = diag (ϕ_{h}^{'} (a_{t})) W_{hh},$
a dense matrix whose spectrum is controlled by $W_{hh}$ and the saturating slope $ϕ_{h}^{'}$ . The LSTM Jacobian is diagonal and does not involve any recurrent weight matrix at all. The change in qualitative behaviour comes from this structural fact, not from any new tuning.
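For contrast with the diagonal LSTM Jacobian, the dense vanilla Jacobian can be computed on a toy example. This is a minimal sketch assuming two units, $ϕ_{h} = tanh$, and hypothetical values for `W_HH` and `BIAS` (the input contribution is folded into the bias).

```python
import math

# Hypothetical 2-unit vanilla RNN; all values are illustrative.
W_HH = [[0.5, -0.3],
        [0.8, 0.2]]
BIAS = [0.1, -0.2]

def rnn_step(h_prev):
    """h_t = tanh(W_hh h_{t-1} + bias)."""
    return [math.tanh(sum(w * h for w, h in zip(row, h_prev)) + b)
            for row, b in zip(W_HH, BIAS)]

h_prev = [0.3, -0.5]
h_t = rnn_step(h_prev)

# Analytic Jacobian diag(phi'(a_t)) W_hh, using tanh'(a) = 1 - tanh(a)^2.
analytic = [[(1 - h_t[r] ** 2) * W_HH[r][c] for c in range(2)] for r in range(2)]

# Central-difference check of the same Jacobian.
eps = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for c in range(2):
    plus = list(h_prev)
    plus[c] += eps
    minus = list(h_prev)
    minus[c] -= eps
    out_plus, out_minus = rnn_step(plus), rnn_step(minus)
    for r in range(2):
        numeric[r][c] = (out_plus[r] - out_minus[r]) / (2 * eps)

print(analytic)
print(numeric)
```

Every entry of the result is non-zero here, because each row of `W_HH` is scaled by a slope but keeps its off-diagonal weights.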

The gradient product across many steps

Multiplying the per-step Jacobians across $k$ consecutive time steps gives

$\prod_{s = t - k + 1}^{t} \frac{\partial c_{s}}{\partial c_{s - 1}} = \prod_{s = t - k + 1}^{t} \operatorname{diag}(f_{s}) = \operatorname{diag}\left( \prod_{s = t - k + 1}^{t} f_{s} \right) .$

The product of diagonal matrices is diagonal, with entries equal to the element-wise products of the diagonals. The gradient that connects $c_{t - k}$ to $c_{t}$ is therefore controlled, coordinate by coordinate, by the product

$\prod_{s = t - k + 1}^{t} f_{s}^{(j)} \in (0, 1),$

for each coordinate $j = 1, \dots, n_{neurons}$ . Two qualitative regimes follow:

If the forget gates along a coordinate stay close to $1$ , the product stays close to $1$ across many steps. The gradient survives. The vanishing-gradient pathology of the vanilla RNN, where the product is bounded by $γ^{k}$ with $γ < 1$ generically, is gone.
If the forget gates close along a coordinate at some intermediate step (one $f_{s}^{(j)} \approx 0$ is enough), the product collapses to zero from that step onward. The gradient is intentionally cut.

Both regimes are learned, not imposed. The network decides, gate by gate and coordinate by coordinate, which long-range dependencies to preserve and which to sever. This is exactly the operational definition of “preservation as a learned decision” anticipated in Vanilla RNNs limitations.
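The two regimes can be illustrated with plain arithmetic on a single coordinate. The gate values, the horizon `k = 100`, and the comparison value `gamma = 0.9` below are arbitrary assumptions chosen for illustration only.

```python
import math

k = 100  # number of time steps spanned by the product

# Regime 1: the forget gate stays near 1 along this coordinate.
open_gates = [0.999] * k
carry_open = math.prod(open_gates)

# Regime 2: same gates, except one intermediate step closes almost completely.
cut_gates = list(open_gates)
cut_gates[40] = 1e-4
carry_cut = math.prod(cut_gates)

# Vanilla-RNN style bound gamma^k, with an illustrative gamma < 1.
gamma = 0.9
vanilla_bound = gamma ** k

print(f"gates near 1   : {carry_open:.3e}")
print(f"one gate closed: {carry_cut:.3e}")
print(f"gamma^k        : {vanilla_bound:.3e}")
```

With these numbers the open-gate product stays around 0.9, while a single closed gate drops it by about four orders of magnitude. How close to $1$ the gates must stay depends on the horizon, since the product still decays whenever the gates are strictly below $1$.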

The vanishing-gradient barrier dissolved, not bypassed

The vanilla recurrence had no stable way to satisfy the knife-edge condition $γ = 1$ along every direction and every step. The LSTM does not solve that condition; it removes it. By replacing the matrix-product Jacobian with a diagonal one whose entries are individually clamped to $(0, 1)$ and individually learned, the LSTM removes, along the carry path, the repeated matrix multiplication that produced vanishing or exploding gradients in the first place.

Exploding gradients are not architecturally eliminated: the gradient with respect to the LSTM’s parameters can still grow large if the loss landscape has cliffs (see Gradient Clipping). But the product across time along the carry path is bounded coordinate-wise by $1$ , which is what mattered.

The full backward pass: a sketch

A full BPTT through the LSTM is more involved than the carry-path analysis above, because the gradient must also flow through the hidden-state read-out $h_{t} = o_{t} ⊙ tanh (c_{t})$ , which closes a loop into the gates of the next step. Three observations make the full picture tractable.

The gradient flowing backward through the cell state is governed by $diag (f_{t})$ at each step, as derived above. This is the long-range backbone.
The gradient flowing backward through the hidden state at step $t$ comes from two sources: the downstream prediction head at step $t$ , and the gates at step $t + 1$ that consumed $h_{t}$ . Each of these is a short-range path: it enters the cell at step $t + 1$ , passes through one affine map and one nonlinearity, and joins the cell-state backbone via the gate update equations.
The hidden-state backward path does cross the recurrent weight matrices $W_{h ∙}$ , exactly once per step, and therefore in principle suffers from the same vanilla-RNN pathology over long horizons. The reason this does not actually break long-range learning is that the vast majority of the long-range gradient signal travels through the cell-state line, where the Jacobian is diagonal and well-behaved; the hidden-state line is short-range and contributes the local correction, not the long-range backbone.

The take-away is that the LSTM provides a two-track gradient flow: a long-range, well-behaved highway through the cell state, and a short-range, locally-corrective channel through the hidden state. The first is what makes the architecture trainable on long sequences.
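The carry track of the backward pass can be sketched for a scalar cell. This is a simplification, not the full BPTT: the gate activations are treated as recorded constants, which drops the hidden-state track through the recurrent weights, and the loss is assumed to be the sum of the read-outs $h_{t}$, so the head gradient is $1$ at every step. All values and names (`F_GATES`, `forward`, `backward`, ...) are hypothetical.

```python
import math

# Recorded activations for T = 4 steps (arbitrary illustrative values).
F_GATES = [0.95, 0.90, 0.99, 0.80]   # forget gates
I_GATES = [0.20, 0.50, 0.10, 0.60]   # input gates
G_CANDS = [0.30, -0.40, 0.70, 0.10]  # candidate updates
O_GATES = [0.60, 0.70, 0.50, 0.90]   # output gates

def forward(c0):
    """Return the cell states c_1..c_T and the loss sum_t h_t."""
    c, cells, loss = c0, [], 0.0
    for f, i, g, o in zip(F_GATES, I_GATES, G_CANDS, O_GATES):
        c = f * c + i * g
        cells.append(c)
        loss += o * math.tanh(c)  # h_t = o_t * tanh(c_t), head gradient dL/dh_t = 1
    return cells, loss

def backward(cells):
    """Return dL/dc_0 by walking the carry track backward."""
    dc = 0.0
    for t in reversed(range(len(cells))):
        # Local injection from the read-out h_t into the cell state.
        dc += O_GATES[t] * (1 - math.tanh(cells[t]) ** 2)
        # One step along the carry track: multiply by the forget gate.
        dc *= F_GATES[t]
    return dc

c0 = 0.5
cells, loss = forward(c0)
grad = backward(cells)

# Central-difference check of dL/dc_0.
eps = 1e-6
numeric = (forward(c0 + eps)[1] - forward(c0 - eps)[1]) / (2 * eps)
print(grad, numeric)
```

In this simplified setting the only multiplications along time are by forget-gate values; a full implementation would add the gradient that returns through the gates via the recurrent weights.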

Why this design predated residual networks by 18 years

The same two-track logic later turned out to be the recipe for very deep feedforward networks: a residual stream that carries the signal unchanged, with learned modules that perturb but do not replace it. ResNets (He et al., 2015) applied the idea across depth; Transformers carried it further still. The LSTM applied it across time in 1997, with the additional twist that the carry has a learned per-coordinate decay rather than a fixed identity.

Reading the LSTM as the temporal special case of “learned modules around an identity highway” makes its design feel less idiosyncratic and more like the first instance of a now-pervasive pattern.

Historical footnote

The original LSTM (Hochreiter and Schmidhuber, 1997) did not contain a forget gate. Its cell-state update was strictly additive,

$c_{t} = c_{t - 1} + i_{t} ⊙ \tilde{c}_{t},$

with recurrent Jacobian exactly equal to the identity. The constant error carousel was, in this original formulation, truly constant: nothing decayed, nothing could be erased.

This was sufficient to prove the architectural point (gradients no longer vanish), but unsatisfactory in practice: the cell state grew without bound, and the network had no way to clear obsolete memories. The forget gate was introduced by Gers, Schmidhuber and Cummins in 2000 to repair this. The resulting Jacobian is $diag (f_{t})$ rather than $I$ , which is a strict generalization: setting $f_{t} = 1$ recovers the original CEC, and any non-trivial $f_{t}$ gives the network the ability to forget. The modern LSTM with forget gate, as derived in Putting all together, is the version actually used in practice and the version analysed throughout these notes.
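The difference between the purely additive update and the forget-gate update can be seen on a toy recurrence. The sketch assumes a constant write $i_{t} ⊙ \tilde{c}_{t} = 0.5$ and a constant forget gate of $0.9$ on one coordinate; both are hypothetical values, and real gates vary from step to step.

```python
steps = 100
write = 0.5    # hypothetical constant write i_t * candidate at every step
forget = 0.9   # hypothetical constant forget gate

c_additive = 0.0  # original CEC: c_t = c_{t-1} + write
c_forget = 0.0    # with forget gate: c_t = forget * c_{t-1} + write
for _ in range(steps):
    c_additive = c_additive + write
    c_forget = forget * c_forget + write

# Fixed point of the forget-gate recurrence: c = forget * c + write.
limit = write / (1.0 - forget)
print(c_additive, c_forget, limit)
```

Under these assumptions the additive state grows linearly with the number of steps, whereas the gated state settles near `write / (1 - forget)`.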

What this section accomplished, and where it falls short

The LSTM dissolves the vanishing-gradient barrier of BPTT by routing the long-range gradient through a diagonal Jacobian, $diag (f_{t})$ , instead of a product of dense matrices. Long-range training is no longer a knife-edge of the recurrent spectrum but a learned, per-coordinate decision. The same architectural pattern (an identity highway with learned modules perturbing it) later powered very deep feedforward networks and Transformers.

Two limitations of the recurrence remain even with the LSTM:

The forward computation is strictly sequential: $c_{t}$ cannot be computed before $c_{t - 1}$, so the cell does not benefit from parallelism along the time axis. The number of sequential steps grows linearly with sequence length, which limits training throughput.
Long-range information must still be routed through a finite-width bottleneck, namely $c_{t} \in \mathbb{R}^{n_{neurons}}$, no matter how distant the source. The cell can preserve information across long gaps, but it cannot grow more capacity on demand.

These two limitations are exactly what attention addresses: parallelism along the time axis, and a direct learned pointer from any position to any other, bypassing the recurrent bottleneck. The attention mechanism and the architectures built on top of it are the subject of the next section.

Deep Learning: Zero to Hero
