BackPropagation Through Time (BPTT)

Backpropagation Through Time (BPTT) is the application of the canonical backpropagation algorithm to a Recurrent Neural Network (RNN) after the recurrent computation has been explicitly unrolled over a finite number of time steps.

Prerequisite: backpropagation as automatic differentiation

For a deeper understanding of how backpropagation works at the level of computational graphs, local derivatives, gradient accumulation, and topological ordering, see the Micrograd series. BPTT uses the same mechanism, with the additional structure introduced by the temporal unrolling of a recurrent computation.

In essence, the recurrent layer is unfolded along the temporal axis and treated as a deep feedforward computational graph whose temporal copies share the same parameters. Backpropagation is then applied to this unfolded graph. The number of time steps over which gradients are propagated is therefore a relevant modeling choice: in full BPTT it coincides with the sequence length, whereas in truncated BPTT it becomes an explicit truncation horizon.

Question

How should the hidden state before the beginning of the sequence be initialized, namely the initial state $h_{0}$ ?

Note

At the first time step ( $t = 1$ ), no previous hidden state is available. It is therefore customary to set
$h_{0} = 0$
that is, a zero vector. This choice is analogous to zero-padding in convolutional neural networks: in CNNs, zeros are added at the spatial boundary; here, mutatis mutandis, the same idea is applied at the temporal boundary of the sequence.

Conceptually, BPTT is equivalent to backpropagating the loss through a deep feedforward graph made of $T$ temporal copies of the same recurrent cell, from $t = 1$ to $t = T$ . Unlike a standard MLP with independent layers, however, these temporal copies all share the same parameters.

In depth derivation

For an explicit derivation of Backpropagation Through Time (BPTT), the RNN is represented in its unfolded form along the temporal axis, as shown in the figure above. The diagram highlights both the forward computation, from inputs to local losses, and the backward propagation of gradients through time.

The compact RNN block shown on the left of the figure should be read as a schematic view of a single recurrent cell. The hidden-to-output matrix $W_{y h}$ maps the hidden state to the logit vector $o_{t}$ ; the prediction $\hat{y}_{t}$ is obtained only after applying the output activation, typically a softmax in multiclass classification.

Component / Symbol	Description
$C$	Number of classes in the classification task.
$W_{x h} \in R^{n_{neurons} \times n_{inputs}}$ , $W_{hh} \in R^{n_{neurons} \times n_{neurons}}$ , $W_{y h} \in R^{C \times n_{neurons}}$	Weight matrices. Although they are drawn at each time step, they are shared across time. This parameter sharing reduces the total number of parameters and expresses the recurrent nature of the model.
$x_{t} \in R^{n_{inputs} \times 1}$	Input column vector at time step $t$ .
$h_{t} \in R^{n_{neurons} \times 1}$	Hidden state at time step $t$ .
$o_{t} \in R^{C \times 1}$	Raw output vector, or logit vector, before the softmax.
$\hat{y}_{t} \in R^{C \times 1}$	Predicted probability vector. It is obtained by applying the softmax function to the logits $o_{t}$ . For each component $i$ , $\hat{y}_{t}^{(i)} = \frac{e^{o_{t}^{(i)}}}{\sum_{j = 1}^{C} e^{o_{t}^{(j)}}}$ . The softmax ensures that $\hat{y}_{t}$ is a probability vector, which makes it suitable for multiclass classification and compatible with cross-entropy loss.
$L_{t}$	Local loss at time step $t$ .
Propagation Phases	The computation has two phases. In the forward pass, shown by gray arrows, the input $x_{t}$ and the previous hidden state $h_{t - 1}$ are combined to produce the new hidden state $h_{t}$ and the output. In the backward pass, shown by red arrows, the gradient of the total loss is propagated backward through the unfolded time steps and accumulated for the shared parameters.

RNN as a Dynamical System

A standard RNN can be viewed as a dynamical system defined by a state transition function, which updates the internal state of the network, and an output function, which maps the hidden state to the model output.

Single-Sequence Formulation

For a single sequence, the system can be written in explicit form as

h_{t} = f_{h} (x_{t}, h_{t - 1}), \qquad \hat{y}_{t} = f_{o} (h_{t}) .

For a conventional RNN, these functions are usually instantiated as

a_{t} = W_{x h} x_{t} + W_{hh} h_{t - 1} + b_{h}, \qquad h_{t} = f_{h} (x_{t}, h_{t - 1}) = ϕ_{h} (a_{t}), \qquad o_{t} = W_{y h} h_{t} + b_{y}, \qquad \hat{y}_{t} = f_{o} (h_{t}) = ϕ_{o} (o_{t}) .

The vector $a_{t} = W_{x h} x_{t} + W_{hh} h_{t - 1} + b_{h}$ is the hidden pre-activation: the recurrent counterpart of the pre-activation $z$ in an MLP, with the hidden state obtained as its element-wise nonlinearity, $h_{t} = ϕ_{h} (a_{t})$ . It is named explicitly because the backward pass differentiates the loss with respect to it.

The functions $ϕ_{h} (\cdot)$ and $ϕ_{o} (\cdot)$ are nonlinear functions. In particular:

$ϕ_{h} (\cdot)$ is an element-wise hidden activation function, commonly $tanh (\cdot)$ .
$ϕ_{o} (\cdot)$ is the output activation. In multiclass classification, it is typically the softmax, because its output can be interpreted as a probability distribution over classes.

Symbol	Dimension	Description
$x_{t}$	$R^{n_{inputs} \times 1}$	Input vector at time $t$ .
$h_{t - 1}$ , $h_{t}$	$R^{n_{neurons} \times 1}$	Previous and current hidden states.
$a_{t}$	$R^{n_{neurons} \times 1}$	Hidden pre-activation, $a_{t} = W_{x h} x_{t} + W_{hh} h_{t - 1} + b_{h}$ , with $h_{t} = ϕ_{h} (a_{t})$ .
$o_{t}$	$R^{C \times 1}$	Logit vector before the softmax.
$\hat{y}_{t}$	$R^{C \times 1}$	Predicted probability vector after the softmax.
$W_{x h}$	$R^{n_{neurons} \times n_{inputs}}$	Input-to-hidden weight matrix.
$W_{hh}$	$R^{n_{neurons} \times n_{neurons}}$	Hidden-to-hidden recurrent weight matrix.
$W_{y h}$	$R^{C \times n_{neurons}}$	Hidden-to-output/logit weight matrix.
$b_{h}$	$R^{n_{neurons} \times 1}$	Hidden-state bias vector.
$b_{y}$	$R^{C \times 1}$	Output/logit bias vector.
$ϕ_{h} (\cdot)$	—	Hidden activation function, for instance $tanh$ .
$ϕ_{o} (\cdot)$	—	Output activation function, typically softmax in classification.
$L_{t}$	$R$	Local scalar loss at time $t$ .
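
The single-sequence equations above can be traced in a few lines of NumPy. This is a minimal sketch, not a reference implementation: it assumes $ϕ_{h} = tanh$ and a softmax output, and the toy sizes, the random initialization, and the helper names softmax and rnn_forward are illustrative choices that do not come from the text.

```python
import numpy as np

def softmax(o):
    """Numerically stable softmax of a logit column vector."""
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_forward(xs, W_xh, W_hh, W_yh, b_h, b_y):
    """Run the recurrence over a list of input column vectors xs."""
    h = np.zeros_like(b_h)                      # h_0 = 0
    hs, y_hats = [], []
    for x_t in xs:
        a_t = W_xh @ x_t + W_hh @ h + b_h       # hidden pre-activation a_t
        h = np.tanh(a_t)                        # h_t = phi_h(a_t), tanh assumed
        o_t = W_yh @ h + b_y                    # logits o_t
        hs.append(h)
        y_hats.append(softmax(o_t))             # y_hat_t = phi_o(o_t)
    return hs, y_hats

# Toy sizes and random parameters, chosen only for illustration.
n_inputs, n_neurons, C, T = 3, 4, 2, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(n_neurons, n_inputs))
W_hh = rng.normal(scale=0.5, size=(n_neurons, n_neurons))
W_yh = rng.normal(scale=0.5, size=(C, n_neurons))
b_h = np.zeros((n_neurons, 1))
b_y = np.zeros((C, 1))
xs = [rng.normal(size=(n_inputs, 1)) for _ in range(T)]

hs, y_hats = rnn_forward(xs, W_xh, W_hh, W_yh, b_h, b_y)
print(hs[-1].shape, y_hats[-1].shape)  # (4, 1) (2, 1)
```

The same five parameter arrays are reused at every iteration of the loop, which is the parameter sharing across temporal copies described earlier.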

Mini-Batch Formulation

Important

For computational efficiency, training is usually performed on mini-batches of sequences. In this setting, the input vector of a single time step, $x_{t} \in R^{n_{inputs}}$ , is generalized to an input matrix $X_{t} \in R^{B \times n_{inputs}}$ , where $B$ is the mini-batch size.

Similarly, the previous hidden states are collected into a matrix $H_{t - 1} \in R^{B \times n_{neurons}}$ , where each row contains the hidden state associated with one sequence in the batch.

The recurrent layer can then be evaluated for the whole mini-batch as

H_{t} = ϕ_{h} (X_{t} W_{x h}^{⊤} + H_{t - 1} W_{hh}^{⊤} + b_{h}), O_{t} = H_{t} W_{y h}^{⊤} + b_{y}, \hat{Y}_{t} = softmax (O_{t}) .

Here:

$X_{t} \in R^{B \times n_{inputs}}$ is the mini-batch input matrix at time $t$ .
$H_{t - 1} \in R^{B \times n_{neurons}}$ is the matrix of previous hidden states.
$H_{t} \in R^{B \times n_{neurons}}$ contains the hidden states computed at time $t$ .
$O_{t} \in R^{B \times C}$ contains the logits for each example in the batch.
$\hat{Y}_{t} \in R^{B \times C}$ contains the predicted probabilities after the softmax.

The main tensor dimensions are summarized below.

Symbol	Dimension	Description
$X_{t}$	$R^{B \times n_{inputs}}$	Mini-batch input at time $t$ .
$H_{t - 1}$	$R^{B \times n_{neurons}}$	Mini-batch hidden state at time $t - 1$ .
$W_{x h}$	$R^{n_{neurons} \times n_{inputs}}$	Input-to-hidden weight matrix.
$W_{hh}$	$R^{n_{neurons} \times n_{neurons}}$	Hidden-to-hidden recurrent weight matrix.
$W_{y h}$	$R^{C \times n_{neurons}}$	Hidden-to-output/logit weight matrix.
$b_{h}$	$R^{n_{neurons} \times 1}$ , applied as the row $b_{h}^{⊤}$ and broadcast to $R^{B \times n_{neurons}}$	Hidden-state bias.
$b_{y}$	$R^{C \times 1}$ , applied as the row $b_{y}^{⊤}$ and broadcast to $R^{B \times C}$	Output/logit bias.
$\hat{Y}_{t}$	$R^{B \times C}$	Mini-batch predicted probabilities at time $t$ .
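
The shapes in the table can be checked with one batched step. The sketch below assumes tanh and softmax as before; the helper names softmax_rows and rnn_step_batch and the toy sizes are illustrative. The biases are stored as columns, as in the single-sequence notation, and transposed so that they broadcast across the $B$ rows.

```python
import numpy as np

def softmax_rows(O):
    """Row-wise softmax: each row of O holds the logits of one sequence."""
    E = np.exp(O - O.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def rnn_step_batch(X_t, H_prev, W_xh, W_hh, W_yh, b_h, b_y):
    """One recurrent step for B sequences stored one per row."""
    # b_h.T and b_y.T are row vectors, so NumPy broadcasts them over the batch.
    H_t = np.tanh(X_t @ W_xh.T + H_prev @ W_hh.T + b_h.T)
    O_t = H_t @ W_yh.T + b_y.T
    return H_t, O_t, softmax_rows(O_t)

B, n_inputs, n_neurons, C = 5, 3, 4, 2      # toy sizes (assumed)
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(n_neurons, n_inputs))
W_hh = rng.normal(size=(n_neurons, n_neurons))
W_yh = rng.normal(size=(C, n_neurons))
b_h = rng.normal(size=(n_neurons, 1))
b_y = rng.normal(size=(C, 1))

X_1 = rng.normal(size=(B, n_inputs))
H_0 = np.zeros((B, n_neurons))              # zero initial state
H_1, O_1, Y_hat_1 = rnn_step_batch(X_1, H_0, W_xh, W_hh, W_yh, b_h, b_y)
print(H_1.shape, O_1.shape, Y_hat_1.shape)  # (5, 4) (5, 2) (5, 2)
```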

Why the Transposed Weight Matrices?

The transposes come from using two compatible conventions at the same time: column vectors for the mathematical derivation, and row-wise mini-batches for implementation.

Single-sequence view. The input $x_{t}$ is a column vector of shape $[n_{inputs} \times 1]$ . The matrix $W_{x h}$ has shape $[n_{neurons} \times n_{inputs}]$ , so the product $W_{x h} x_{t}$ is dimensionally valid.

Mini-batch view. The input $X_{t}$ stores one example per row, so it has shape $[B \times n_{inputs}]$ . To apply the same linear map to every row, the weight matrix must appear on the right with shape $[n_{inputs} \times n_{neurons}]$ .

Since $W_{x h}$ has been defined in the column-vector convention as $[n_{neurons} \times n_{inputs}]$ , the mini-batch computation uses its transpose:
$X_{t} W_{x h}^{⊤} \in R^{B \times n_{neurons}} .$
The same reasoning applies to the recurrent term:
$H_{t - 1} W_{hh}^{⊤} \in R^{B \times n_{neurons}} .$
This is also the convention used by PyTorch. A layer such as nn.Linear(in_features, out_features) stores weight with shape [out_features, in_features], and the forward computation is performed as input @ weight.T + bias. The same pattern appears in nn.RNN, where the recurrent update is written in batched form using terms such as x_t @ W_ih.T and h_{t-1} @ W_hh.T.
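
The equivalence of the two conventions can be verified numerically. The sketch below uses plain NumPy rather than PyTorch, so it only mirrors the storage convention (weight of shape [out, in], applied as input @ weight.T); the sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_inputs, n_neurons = 4, 3, 5             # toy sizes (assumed)
W_xh = rng.normal(size=(n_neurons, n_inputs))  # column-vector convention: [out, in]
X_t = rng.normal(size=(B, n_inputs))           # one example per row

# Row-wise mini-batch form: the weight appears on the right, transposed.
batched = X_t @ W_xh.T                         # shape (B, n_neurons)

# Column-vector form applied to each example separately, then stacked as rows.
per_example = np.stack(
    [(W_xh @ X_t[i].reshape(-1, 1)).ravel() for i in range(B)]
)

print(np.allclose(batched, per_example))       # True
```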

Compact Notation for Input and Previous Hidden State

In optimized implementations, it is common to concatenate the input at time $t$ and the previous hidden state into a single matrix:
$[X_{t} H_{t - 1}] \in R^{B \times (n_{inputs} + n_{neurons})} .$
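This matrix is paired with an extended weight matrix, obtained by stacking the two transposed weight matrices vertically:
$W_{cat} = \begin{bmatrix} W_{x h}^{⊤} \\ W_{hh}^{⊤} \end{bmatrix} \in R^{(n_{inputs} + n_{neurons}) \times n_{neurons}} .$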
The recurrent transition can then be written compactly as
$H_{t} = ϕ_{h} ([X_{t} H_{t - 1}] W_{cat} + b_{h}) .$
This notation simplifies the implementation and reduces the recurrent transition to a single matrix multiplication.
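
A quick check that the concatenated form reproduces the two-term form, under the assumption that the two transposed weight matrices are stacked vertically so that the stated shape of $W_{cat}$ holds. Sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_inputs, n_neurons = 4, 3, 5            # toy sizes (assumed)
W_xh = rng.normal(size=(n_neurons, n_inputs))
W_hh = rng.normal(size=(n_neurons, n_neurons))
b_h = rng.normal(size=(n_neurons, 1))
X_t = rng.normal(size=(B, n_inputs))
H_prev = rng.normal(size=(B, n_neurons))

XH = np.hstack([X_t, H_prev])               # (B, n_inputs + n_neurons)
W_cat = np.vstack([W_xh.T, W_hh.T])         # (n_inputs + n_neurons, n_neurons)

H_cat = np.tanh(XH @ W_cat + b_h.T)                          # single matmul
H_sep = np.tanh(X_t @ W_xh.T + H_prev @ W_hh.T + b_h.T)      # two matmuls

print(W_cat.shape, np.allclose(H_cat, H_sep))  # (8, 5) True
```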

Note

At the first time step ( $t = 1$ ), the previous hidden state is initialized as
$H_{0} = 0,$
where $0 \in R^{B \times n_{neurons}}$ is a matrix of zeros.

Important

This structure makes it possible to derive gradients systematically with respect to the shared parameters $W_{x h}$ , $W_{hh}$ , $W_{y h}$ , $b_{h}$ , and $b_{y}$ . This is the core mechanism of BPTT.

Gradient Derivation: Backpropagation Through Time

BPTT Derivation: Returning to a Single Sequence

The forward computation has been described both for a single sequence and for a mini-batch. In the mini-batch formulation, matrices such as $X_{t} \in R^{B \times n_{inputs}}$ and $H_{t} \in R^{B \times n_{neurons}}$ are used, with one sequence per row.

For the derivation of Backpropagation Through Time, the notation returns to a single sequence. Thus, $\hat{y}_{t}$ , $y_{t}$ , $o_{t}$ , and $h_{t}$ are vectors rather than mini-batch matrices. This is purely a notational choice; the mini-batch case follows from the single-sequence case by the linearity of differentiation.

If the mini-batch loss is defined as the average of the per-sequence losses,
$L_{batch} = \frac{1}{B} \sum_{i = 1}^{B} L^{(i)},$
with $L^{(i)}$ the loss computed on the $i$ -th sequence of the mini-batch, then for any parameter $θ$ the linearity of $\partial / \partial θ$ gives
$\frac{\partial L_{batch}}{\partial θ} = \frac{1}{B} \sum_{i = 1}^{B} \frac{\partial L^{(i)}}{\partial θ} .$
The single-sequence derivation that follows therefore applies unchanged at the mini-batch level: each $\partial L^{(i)} / \partial θ$ is computed by the formulas derived below, applied to sequence $i$ in isolation, and the resulting per-sequence gradients are averaged. The same identity holds with sum reduction (i.e., without the $1/ B$ factor), which is the convention used by some implementations.
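
The averaging identity can be illustrated on a deliberately tiny model. The sketch below is not an RNN: it assumes a hypothetical scalar parameter θ with per-example loss $\frac{1}{2} (tanh (θ x) - y)^{2}$ and made-up data, and compares the mean of the per-example analytic gradients with a finite-difference derivative of the batch loss.

```python
import math

def loss_i(theta, x, y):
    """Per-example loss of a toy scalar model (assumed for illustration)."""
    return 0.5 * (math.tanh(theta * x) - y) ** 2

def grad_i(theta, x, y):
    """Analytic derivative of loss_i with respect to theta."""
    h = math.tanh(theta * x)
    return (h - y) * (1.0 - h * h) * x

xs = [0.5, -1.0, 2.0, 0.3]      # made-up inputs
ys = [0.2, -0.4, 0.9, 0.0]      # made-up targets
theta = 0.7
B = len(xs)

def batch_loss(th):
    """Mean reduction over the mini-batch."""
    return sum(loss_i(th, x, y) for x, y in zip(xs, ys)) / B

# Average of the per-example gradients.
mean_of_grads = sum(grad_i(theta, x, y) for x, y in zip(xs, ys)) / B

# Central finite difference of the batch loss itself.
eps = 1e-6
numeric = (batch_loss(theta + eps) - batch_loss(theta - eps)) / (2 * eps)

print(abs(mean_of_grads - numeric) < 1e-8)  # True
```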

The total loss over the unfolded sequence is defined as

L (\hat{y}, y) = \sum_{t = 1}^{T} L_{t} (\hat{y}_{t}, y_{t}) = - \sum_{t = 1}^{T} y_{t}^{⊤} \log \hat{y}_{t} = - \sum_{t = 1}^{T} y_{t}^{⊤} \log [softmax (o_{t})] .

Here, $y_{t}$ is the one-hot target vector at time $t$ , and $\hat{y}_{t}$ is the predicted probability distribution.

The sequence loss sums the local losses over the $T$ time steps. Averaging instead, $\frac{1}{T} \sum_{t} L_{t}$ , is an equally common convention: it rescales every gradient derived below by $1/ T$ and changes nothing structural, exactly as the mini-batch reduction can be taken as a sum or a mean. The summed form is kept here to match the standard BPTT presentation.

Cross-Entropy and One-Hot Targets

For a single time step, the multiclass cross-entropy can be written as
$L_{t} = - y_{t}^{⊤} \log \hat{y}_{t},$
where $y_{t} \in \{0, 1\}^{C}$ is a one-hot target vector and $\hat{y}_{t} \in R^{C}$ is the predicted probability vector after softmax. The dot product hides the sum over classes:
$y_{t}^{⊤} \log \hat{y}_{t} = \sum_{c = 1}^{C} y_{t}^{(c)} \log \hat{y}_{t}^{(c)} .$
Since $y_{t}$ is one-hot, only the true class contributes. Writing $c^{⋆}$ for the index of the true class,
$L_{t} = - \log \hat{y}_{t}^{(c^{⋆})} .$
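
A small numeric illustration of the reduction from the dot product to a single log-probability. The logits and the target are made-up values for a three-class problem.

```python
import numpy as np

o_t = np.array([2.0, 0.5, -1.0])     # logits for C = 3 classes (made-up values)
y_t = np.array([0.0, 1.0, 0.0])      # one-hot target, true class at index 1

# Softmax, shifted by the maximum logit for numerical stability.
y_hat = np.exp(o_t - o_t.max())
y_hat /= y_hat.sum()

loss_dot = -(y_t @ np.log(y_hat))    # full dot product over the C classes
c_star = int(np.argmax(y_t))         # index of the true class
loss_pick = -np.log(y_hat[c_star])   # only the true-class term

print(np.isclose(loss_dot, loss_pick))  # True
```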

Error Signals and the Backward Recursion

This section derives the gradients of $L$ with respect to the five shared parameters. The derivation follows the same pattern as backpropagation for MLPs: introduce error signals at intermediate points of the computational graph, derive a backward recursion that computes them in a single sweep, and express each parameter gradient as an outer product of an error signal and a forward input. The whole procedure runs in $O (T)$ time on a sequence of length $T$ .

What distinguishes BPTT from MLP backpropagation is the temporal axis. The same parameters are reused at every time step, so each parameter contributes to every local loss $L_{1}, \dots, L_{T}$ . The recursion that propagates the error runs from time $T$ down to time $1$ , and the parameter gradients are sums over the $T$ temporal slices. Apart from this extra summation, every step has a direct counterpart in the MLP case.

Relation to the MLP error signal

The error signal $δ_{j}^{l} = \partial L / \partial z_{j}^{l}$ introduced in MLP backpropagation is the gradient of the loss with respect to the pre-activation of a unit. The same role is played here by $δ_{t}^{a}$ .

The letter $a$ changes role between the two sets of notes, so the symbols for activations and pre-activations do not line up:

the MLP notes use $z$ for the pre-activation and $a$ for the activation

the RNN notes use $a_{t}$ for the pre-activation and $h_{t}$ for the activation.

The structural role of the error signal is identical in both: a gradient against the pre-activation, computed by a backward recursion and combined with forward inputs to produce parameter gradients.

Three error signals

For a single sequence, define the error signal at three points of the computational graph:

δ_{t}^{o} := \frac{\partial L}{\partial o _{t}}, δ_{t}^{h} := \frac{\partial L}{\partial h _{t}}, δ_{t}^{a} := \frac{\partial L}{\partial a _{t}} .

These are the gradients of the total loss with respect to the logits, the hidden state, and the pre-activation at time $t$ . All three are vectors of the appropriate dimension.

The first one is immediate. Under the standard softmax-plus-cross-entropy combination, the derivative simplifies to

δ_{t}^{o} = \hat{y}_{t} - y_{t} .

This is the result of a clean cancellation between the softmax derivative and the cross-entropy derivative. The takeaway is that the gradient at the output is the prediction error, signed and component-wise: a vector of mismatches between the predicted probabilities $\hat{y}_{t}$ and the one-hot targets $y_{t}$ , with no residual Jacobian to track. A broader treatment of the same identity is in Softmax and Cross-Entropy.

Optional: derivation of $δ_{t}^{o} = \hat{y}_{t} - y_{t}$

Since the total loss $L = \sum_{s} L_{s}$ depends on $o_{t}$ only through the local loss $L_{t}$ at the same time step,
$δ_{t}^{o} = \frac{\partial L}{\partial o _{t}} = \frac{\partial L _{t}}{\partial o _{t}} .$
The local loss is $L_{t} = - \sum_{c} y_{t}^{(c)} \log \hat{y}_{t}^{(c)}$ , with $\hat{y}_{t}^{(c)} = e^{o_{t}^{(c)}} / \sum_{j} e^{o_{t}^{(j)}}$ .

Softmax derivative. Differentiating the softmax with respect to $o_{t}^{(i)}$ ,
$\frac{\partial \hat{y}_{t}^{(c)}}{\partial o_{t}^{(i)}} = \hat{y}_{t}^{(c)} (δ_{c i} - \hat{y}_{t}^{(i)}),$
where $δ_{c i}$ is the Kronecker delta (which equals $1$ when $c = i$ and $0$ otherwise; unrelated to the error signal $δ$ ).

Cross-entropy derivative. Differentiating $L_{t}$ with respect to $\hat{y}_{t}^{(c)}$ ,
$\frac{\partial L_{t}}{\partial \hat{y}_{t}^{(c)}} = - \frac{y_{t}^{(c)}}{\hat{y}_{t}^{(c)}} .$

Chain rule. Combining the two via the multivariate chain rule, summed over the $C$ softmax outputs,
$\frac{\partial L_{t}}{\partial o_{t}^{(i)}} = \sum_{c} \frac{\partial L_{t}}{\partial \hat{y}_{t}^{(c)}} \frac{\partial \hat{y}_{t}^{(c)}}{\partial o_{t}^{(i)}} = \sum_{c} (- \frac{y_{t}^{(c)}}{\hat{y}_{t}^{(c)}}) \hat{y}_{t}^{(c)} (δ_{c i} - \hat{y}_{t}^{(i)}) = - \sum_{c} y_{t}^{(c)} (δ_{c i} - \hat{y}_{t}^{(i)}) = - y_{t}^{(i)} + \hat{y}_{t}^{(i)} \sum_{c} y_{t}^{(c)} = \hat{y}_{t}^{(i)} - y_{t}^{(i)},$
where the last step uses $\sum_{c} y_{t}^{(c)} = 1$ because $y_{t}$ is one-hot. Stacking the scalar derivatives into a vector gives the identity $δ_{t}^{o} = \hat{y}_{t} - y_{t}$ .

Where the simplification comes from. The factor $\hat{y}_{t}^{(c)}$ at the front of the softmax derivative cancels the $1/ \hat{y}_{t}^{(c)}$ that the cross-entropy derivative leaves behind. Without this cancellation, the gradient at the output would still depend on the softmax outputs in a non-trivial way; with it, the gradient is the prediction error in its rawest form, which is exactly what makes the downstream backward recursion clean.
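
The identity $δ_{t}^{o} = \hat{y}_{t} - y_{t}$ can be sanity-checked against a finite-difference gradient of the cross-entropy with respect to the logits. The logits and target below are made-up values, and the helper names softmax and cross_entropy are illustrative.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def cross_entropy(o, y):
    """L_t = -y^T log softmax(o) for a one-hot y."""
    return -float(y @ np.log(softmax(o)))

o = np.array([1.0, -0.5, 0.3, 2.0])   # made-up logits, C = 4
y = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot target

analytic = softmax(o) - y             # the claimed gradient delta^o

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(o)
for i in range(o.size):
    d = np.zeros_like(o)
    d[i] = eps
    numeric[i] = (cross_entropy(o + d, y) - cross_entropy(o - d, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-7))  # True
```

Because the softmax outputs and the one-hot target both sum to one, the components of the analytic gradient sum to zero.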

Backward recursion for $δ_{t}^{h}$

The hidden state $h_{t}$ affects the total loss through two paths:

Locally, through the output at time $t$ : $h_{t} \to o_{t} \to \hat{y}_{t} \to L_{t}$ .
Recursively, through the future hidden states: $h_{t} \to a_{t + 1} \to h_{t + 1} \to \dots$ , which in turn affect $L_{t + 1}, L_{t + 2}, \dots, L_{T}$ .

Each path contributes to $δ_{t}^{h}$ via the multivariate chain rule. The general identity, for any vector-valued forward step $u = f (v)$ and any scalar loss $L$ , is

\frac{\partial L}{\partial v} = (\frac{\partial u}{\partial v})^{⊤} \frac{\partial L}{\partial u} .

Here $\partial u / \partial v$ is the Jacobian matrix of $f$ , and the transpose makes the dimensions match: a column gradient with respect to $u$ on the right side produces, after multiplication by the transposed Jacobian, a column gradient with respect to $v$ on the left side.

Applied to the two paths through which $h_{t}$ affects $L$ :

Local path $(h_{t} \to o_{t})$ . From the output equation $o_{t} = W_{y h} h_{t} + b_{y}$ , the Jacobian is $\partial o_{t} / \partial h_{t} = W_{y h}$ . Its contribution to $δ_{t}^{h}$ is therefore $W_{y h}^{⊤} δ_{t}^{o}$ .
Recursive path $(h_{t} \to a_{t + 1})$ . From the recurrent transition $a_{t + 1} = W_{x h} x_{t + 1} + W_{hh} h_{t} + b_{h}$ , the Jacobian is $\partial a_{t + 1} / \partial h_{t} = W_{hh}$ . Its contribution is $W_{hh}^{⊤} δ_{t + 1}^{a}$ , where $δ_{t + 1}^{a}$ is the future error signal that has already been computed by the backward sweep.

Summing the two contributions yields the recursion for $δ_{t}^{h}$ :

δ_{t}^{h} = W_{y h}^{⊤} δ_{t}^{o} + W_{hh}^{⊤} δ_{t + 1}^{a} .

The recursion is initialized at the right end of the sequence. No future hidden state exists beyond $t = T$ , so the recursive term vanishes by setting $δ_{T + 1}^{a} := 0$ , and

δ_{T}^{h} = W_{y h}^{⊤} δ_{T}^{o} .

From $δ_{t}^{h}$ to $δ_{t}^{a}$

The pre-activation error is one element-wise nonlinearity away from the hidden-state error, and the corresponding Jacobian has a very particular structure that is worth deriving in detail.

The map $h_{t} = ϕ_{h} (a_{t})$ acts component by component: the $i$ -th coordinate of $h_{t}$ depends only on the $i$ -th coordinate of $a_{t}$ ,

h_{t}^{(i)} = ϕ_{h} (a_{t}^{(i)}), i = 1, \dots, n_{neurons} .

Changing $a_{t}^{(j)}$ with $j \neq i$ has no effect on $h_{t}^{(i)}$ , because the activation function never mixes coordinates. The partial derivatives of $h_{t}$ with respect to $a_{t}$ are therefore

\frac{\partial h_{t}^{(i)}}{\partial a_{t}^{(j)}} = \begin{cases} ϕ_{h}^{'} (a_{t}^{(i)}) & \text{if } i = j, \\ 0 & \text{if } i \neq j . \end{cases}

Assembling these scalar derivatives into the Jacobian matrix gives a square matrix of shape $n_{neurons} \times n_{neurons}$ whose only non-zero entries lie on the diagonal:

\frac{\partial h_{t}}{\partial a_{t}} = diag (ϕ_{h}^{'} (a_{t})) = \begin{bmatrix} ϕ_{h}^{'} (a_{t}^{(1)}) & 0 & \dots & 0 \\ 0 & ϕ_{h}^{'} (a_{t}^{(2)}) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & ϕ_{h}^{'} (a_{t}^{(n_{neurons})}) \end{bmatrix} .

The diagonal structure is a direct consequence of the activation acting independently on each coordinate; there is no coupling between coordinates to fill in the off-diagonal entries. The same construction, an element-wise nonlinearity contributing a diagonal $diag (f^{'})$ factor to a Jacobian product, recurs throughout deep learning: it governs the exploding gradient across time and the gradient highway through feedforward residual blocks.

A useful general identity follows: multiplying any vector $v$ by a diagonal matrix $diag (d)$ is the same as taking the element-wise product,

diag (d) v = d ⊙ v .

Applying the chain rule with the Jacobian above, and using that the transpose of a diagonal matrix is the diagonal matrix itself, the pre-activation error becomes

δ_{t}^{a} = (\frac{\partial h _{t}}{\partial a _{t}})^{⊤} δ_{t}^{h} = diag (ϕ_{h}^{'} (a_{t})) δ_{t}^{h} = ϕ_{h}^{'} (a_{t}) ⊙ δ_{t}^{h} .

Boxing the final form:

δ_{t}^{a} = ϕ_{h}^{'} (a_{t}) ⊙ δ_{t}^{h},

where $⊙$ denotes the element-wise (Hadamard) product. For the canonical choice $ϕ_{h} = tanh$ , the derivative is $ϕ_{h}^{'} (a_{t}) = 1 - h_{t} ⊙ h_{t}$ , which can be evaluated for free from the hidden state $h_{t}$ saved during the forward pass.
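
Both facts used in this step, the diagonal-times-vector identity and the tanh derivative computed from the saved hidden state, can be checked directly. The vectors below are random stand-ins for $a_{t}$ and $δ_{t}^{h}$ , not values from an actual forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
a_t = rng.normal(size=4)          # stand-in pre-activation
delta_h = rng.normal(size=4)      # stand-in hidden-state error signal

h_t = np.tanh(a_t)                # what the forward pass would have saved
phi_prime = 1.0 - h_t * h_t       # tanh'(a_t), computed from h_t alone

# Explicit transposed-Jacobian product versus the element-wise shortcut.
via_jacobian = np.diag(phi_prime).T @ delta_h
via_hadamard = phi_prime * delta_h

# Finite-difference check of the tanh derivative itself.
eps = 1e-6
fd_prime = (np.tanh(a_t + eps) - np.tanh(a_t - eps)) / (2 * eps)

print(np.allclose(via_jacobian, via_hadamard), np.allclose(phi_prime, fd_prime))  # True True
```

The element-wise form avoids building the $n_{neurons} \times n_{neurons}$ diagonal matrix at all, which is why implementations use it.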

The backward sweep in algorithmic form

Combining the two updates yields a single loop that walks the unrolled graph from right to left:

δ_{T + 1}^{a} := 0, \qquad \text{for } t = T, T - 1, \dots, 1 : \quad \begin{cases} δ_{t}^{h} = W_{y h}^{⊤} δ_{t}^{o} + W_{hh}^{⊤} δ_{t + 1}^{a}, \\ δ_{t}^{a} = ϕ_{h}^{'} (a_{t}) ⊙ δ_{t}^{h} . \end{cases}

Each iteration performs a constant number of matrix-vector products. The full backward pass therefore visits each of the $T$ time steps exactly once, for a total cost of $O (T \cdot n_{neurons}^{2})$ , linear in the sequence length.
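
The loop translates almost line by line into NumPy. The sketch below assumes $ϕ_{h} = tanh$ , so that $ϕ_{h}^{'} (a_{t}) = 1 - h_{t} ⊙ h_{t}$ ; the function name backward_sweep is illustrative, and the hidden states and output errors are random stand-ins rather than the result of a real forward pass.

```python
import numpy as np

def backward_sweep(delta_os, hs, W_yh, W_hh):
    """Return delta^h_t and delta^a_t for t = 1..T (list index 0 is t = 1)."""
    T = len(hs)
    delta_hs = [None] * T
    delta_as = [None] * T
    delta_a_next = np.zeros_like(hs[0])                   # delta^a_{T+1} := 0
    for t in reversed(range(T)):                          # t = T, T-1, ..., 1
        delta_hs[t] = W_yh.T @ delta_os[t] + W_hh.T @ delta_a_next
        delta_as[t] = (1.0 - hs[t] ** 2) * delta_hs[t]    # tanh'(a_t) = 1 - h_t^2
        delta_a_next = delta_as[t]
    return delta_hs, delta_as

rng = np.random.default_rng(0)
n_neurons, C, T = 4, 3, 5                                 # toy sizes (assumed)
W_yh = rng.normal(size=(C, n_neurons))
W_hh = rng.normal(size=(n_neurons, n_neurons))
hs = [np.tanh(rng.normal(size=(n_neurons, 1))) for _ in range(T)]  # stand-ins for saved h_t
delta_os = [rng.normal(size=(C, 1)) for _ in range(T)]             # stand-ins for y_hat_t - y_t

delta_hs, delta_as = backward_sweep(delta_os, hs, W_yh, W_hh)
print(len(delta_hs), delta_hs[0].shape, delta_as[0].shape)  # 5 (4, 1) (4, 1)
```

Each pass through the loop body touches $W_{y h}^{⊤}$ and $W_{hh}^{⊤}$ exactly once, which is where the linear dependence on $T$ comes from.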

Dynamic programming along the temporal axis

Standard backpropagation in an MLP is already a dynamic-programming algorithm: each layer’s error signal $δ^{l}$ is computed once from $δ^{l + 1}$ and reused in every parameter gradient at layer $l$ . Backprop is linear in the depth precisely because of this reuse. The novelty of BPTT is not dynamic programming itself, but the observation that the same DP structure survives the temporal axis: across time steps, just as across depth, error signals are computed once and reused.

This deserves emphasis because the naive chain-rule expansion of $\partial L / \partial W_{hh}$ produces a double sum over temporal paths that looks quadratic in $T$ . Sharing $W_{hh}$ across all time steps could in principle have spoiled the DP structure, forcing each path to be computed independently. The recursion on $δ_{t}^{h}$ and $δ_{t}^{a}$ above is the proof that this does not happen: the temporal dependency stays linear in $T$ , exactly as the depth-wise dependency stays linear in the number of layers for an MLP. Same principle, extended to a new axis.

This is also exactly what PyTorch’s autograd computes when loss.backward() is called on the unrolled cell of the minimal PyTorch implementation: the recursion above runs inside the autograd engine, one matrix-vector product per time step. The hand-written derivation in this note and the production-grade autograd path implement the same algorithm.

Parameter Gradients as Outer-Product Sums

Once $δ_{t}^{o}$ and $δ_{t}^{a}$ are available for every $t$ , the gradients with respect to the five shared parameters are immediate, because the same template applies to every weight matrix in the cell. It is worth deriving the template once in general before listing the five formulas.

Why the gradient with respect to a weight matrix is an outer product. Consider a generic linear step $u = Wv + c$ inside the forward pass. The gradient of $L$ with respect to the scalar entry $W_{ij}$ uses the chain rule:

\frac{\partial L}{\partial W _{ij}} = \frac{\partial L}{\partial u _{i}} \frac{\partial u _{i}}{\partial W _{ij}} .

The first factor is the $i$ -th component of the error signal at the output of the linear step, $δ_{u}^{(i)} := \partial L / \partial u_{i}$ . The second factor follows from $u_{i} = \sum_{k} W_{ik} v_{k} + c_{i}$ : only the term $W_{ij} v_{j}$ involves the specific entry $W_{ij}$ , so

\frac{\partial u _{i}}{\partial W _{ij}} = v_{j} .

Multiplying the two factors gives the scalar derivative

\frac{\partial L}{\partial W _{ij}} = δ_{u}^{(i)} v_{j},

and assembling these scalars over the index pair $(i, j)$ recovers the outer product

\frac{\partial L}{\partial W} = δ_{u} v^{⊤} .

The same template, applied to a bias term $c$ , gives $\partial L / \partial c = δ_{u}$ , since $\partial u_{i} / \partial c_{i} = 1$ .
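
The outer-product template can be checked numerically on an isolated linear step. The sketch assumes a toy loss $L = \frac{1}{2} \| u \|^{2}$ , chosen only because its error signal is simply $δ_{u} = u$ ; the sizes and random values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
v = rng.normal(size=(4, 1))
c = rng.normal(size=(3, 1))

def loss(W, c):
    """Toy loss on the output of the linear step u = W v + c (assumed)."""
    u = W @ v + c
    return 0.5 * float((u ** 2).sum())

delta_u = W @ v + c            # dL/du for this particular loss
grad_W = delta_u @ v.T         # outer product: shape (3, 4), same as W
grad_c = delta_u               # bias gradient is the error signal itself

# Entry-by-entry central finite differences for W.
eps = 1e-6
num_W = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        num_W[i, j] = (loss(W + E, c) - loss(W - E, c)) / (2 * eps)

print(grad_W.shape, np.allclose(grad_W, num_W, atol=1e-6))  # (3, 4) True
```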

Applying the template to the RNN cell. Each of the five parameters of the cell appears in exactly one linear step of the forward pass, multiplying exactly one input vector at each time step $t$ :

$W_{y h}$ multiplies $h_{t}$ in the output equation, and its error signal at the output is $δ_{t}^{o}$ ;
$W_{x h}$ multiplies $x_{t}$ in the recurrent transition, and its error signal at the output is $δ_{t}^{a}$ ;
$W_{hh}$ multiplies $h_{t - 1}$ in the recurrent transition, and its error signal at the output is $δ_{t}^{a}$ ;
$b_{y}$ and $b_{h}$ inherit the corresponding output error signals directly.

Summing the per-step outer products over $t$ gives the five gradients:

\frac{\partial L}{\partial W_{y h}} = \sum_{t = 1}^{T} δ_{t}^{o} h_{t}^{⊤}, \qquad \frac{\partial L}{\partial b_{y}} = \sum_{t = 1}^{T} δ_{t}^{o}, \qquad \frac{\partial L}{\partial W_{x h}} = \sum_{t = 1}^{T} δ_{t}^{a} x_{t}^{⊤}, \qquad \frac{\partial L}{\partial W_{hh}} = \sum_{t = 1}^{T} δ_{t}^{a} h_{t - 1}^{⊤}, \qquad \frac{\partial L}{\partial b_{h}} = \sum_{t = 1}^{T} δ_{t}^{a} .

The structure is the same as ordinary backpropagation in an MLP; the only addition is the temporal sum. The two input-side weight matrices have a parallel role: $W_{x h}$ multiplies $x_{t}$ in the forward pass, $W_{hh}$ multiplies $h_{t - 1}$ , and the corresponding gradient is the outer product of $δ_{t}^{a}$ with whichever input the matrix saw at time $t$ . No separate derivation is needed for the two cases: they are the same operation applied to two different inputs.
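
Putting the forward pass, the backward recursion, and the five sums together gives a complete hand-written BPTT that can be compared with finite differences. This is a minimal sketch for a single sequence under the assumptions used throughout (tanh hidden activation, softmax output, summed cross-entropy, zero initial state); the function names, toy sizes, and random data are illustrative.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(params, xs, ys):
    W_xh, W_hh, W_yh, b_h, b_y = params
    h = np.zeros_like(b_h)
    hs, y_hats, loss = [h], [], 0.0                 # hs[0] is h_0 = 0
    for x_t, y_t in zip(xs, ys):
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        y_hat = softmax(W_yh @ h + b_y)
        loss -= float((y_t * np.log(y_hat)).sum())  # summed cross-entropy
        hs.append(h)
        y_hats.append(y_hat)
    return loss, hs, y_hats

def bptt(params, xs, ys):
    W_xh, W_hh, W_yh, b_h, b_y = params
    _, hs, y_hats = forward(params, xs, ys)
    grads = [np.zeros_like(p) for p in params]
    delta_a_next = np.zeros_like(b_h)               # delta^a_{T+1} := 0
    for t in reversed(range(len(xs))):              # list index t is time t + 1
        delta_o = y_hats[t] - ys[t]
        delta_h = W_yh.T @ delta_o + W_hh.T @ delta_a_next
        delta_a = (1.0 - hs[t + 1] ** 2) * delta_h
        grads[0] += delta_a @ xs[t].T               # dL/dW_xh
        grads[1] += delta_a @ hs[t].T               # dL/dW_hh, hs[t] is the previous state
        grads[2] += delta_o @ hs[t + 1].T           # dL/dW_yh
        grads[3] += delta_a                         # dL/db_h
        grads[4] += delta_o                         # dL/db_y
        delta_a_next = delta_a
    return grads

def numeric_grads(params, xs, ys, eps=1e-5):
    """Central finite differences, perturbing one entry at a time in place."""
    out = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(*p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus = forward(params, xs, ys)[0]
            p[idx] = old - eps
            loss_minus = forward(params, xs, ys)[0]
            p[idx] = old
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        out.append(g)
    return out

rng = np.random.default_rng(0)
n_inputs, n_neurons, C, T = 3, 4, 3, 6              # toy sizes (assumed)
shapes = [(n_neurons, n_inputs), (n_neurons, n_neurons), (C, n_neurons),
          (n_neurons, 1), (C, 1)]
params = [rng.normal(scale=0.5, size=s) for s in shapes]
xs = [rng.normal(size=(n_inputs, 1)) for _ in range(T)]
ys = [np.eye(C)[:, [rng.integers(C)]] for _ in range(T)]   # random one-hot columns

analytic = bptt(params, xs, ys)
numeric = numeric_grads(params, xs, ys)
max_err = max(float(np.abs(a - n).max()) for a, n in zip(analytic, numeric))
print(max_err < 1e-6)  # True
```

The finite-difference check costs one forward pass per parameter entry and is only practical at toy sizes; the backward sweep produces all five gradients in a single pass.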

Equivalence with the long-form chain rule

The compact formulas above are not an alternative to the chain rule; they are its natural form once the backward recursion has been applied. Expanding $δ_{t}^{a}$ via the recursion reproduces the long-form expression
$\frac{\partial L}{\partial W_{hh}} = \sum_{t = 1}^{T} \sum_{k = 1}^{t} \frac{\partial L_{t}}{\partial \hat{y}_{t}} \frac{\partial \hat{y}_{t}}{\partial o_{t}} \frac{\partial o_{t}}{\partial h_{t}} ( \prod_{j = k}^{t - 1} \frac{\partial h_{j + 1}}{\partial h_{j}} ) \frac{\partial h_{k}}{\partial W_{hh}},$
which sums over all causal paths from $W_{hh}$ to each local loss. The inner index $k$ corresponds to the time step at which $W_{hh}$ is “tagged” along the path: the parameter participates in the computation of every hidden state $h_{k}$ , and each such participation defines one path that ends at the loss $L_{t}$ . In this expression $\partial h_{k} / \partial W_{hh}$ denotes the immediate derivative, taken with $h_{k - 1}$ held fixed; the dependence of $h_{k - 1}$ on $W_{hh}$ is accounted for by the terms with smaller $k$ . The same sum applies to $W_{x h}$ with $\partial h_{k} / \partial W_{x h}$ in place of $\partial h_{k} / \partial W_{hh}$ .

This expansion is correct but should not be implemented as written. Computing each path independently leads to $O (T^{2})$ operations. The backward recursion in the previous section is the same sum evaluated in $O (T)$ by reusing each $δ_{t}^{a}$ across all the paths that pass through time $t$ .
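
The agreement between the two evaluations can be seen on a scalar toy recurrence. The sketch below makes simplifying assumptions that are not part of the derivation above: a hypothetical one-unit RNN $h_{t} = tanh (w h_{t - 1} + u x_{t})$ with squared-error local losses $L_{t} = \frac{1}{2} (h_{t} - y_{t})^{2}$ placed directly on the hidden state, and made-up data. It computes $\partial L / \partial w$ once by enumerating every $(t, k)$ path and once by the backward recursion.

```python
import math

w, u = 0.8, 0.5                       # recurrent and input weights (made-up)
xs = [0.3, -0.6, 1.0, 0.2, -0.4]      # made-up inputs
ys = [0.1, 0.0, 0.5, -0.2, 0.3]       # made-up targets
T = len(xs)

h = [0.0]                             # h[0] is h_0 = 0
for x in xs:
    h.append(math.tanh(w * h[-1] + u * x))

# Long form: one term per causal path (t, k), each with its own Jacobian chain.
long_form, n_terms = 0.0, 0
for t in range(1, T + 1):
    for k in range(1, t + 1):
        chain = 1.0
        for j in range(k, t):                        # product of dh_{j+1}/dh_j
            chain *= (1.0 - h[j + 1] ** 2) * w
        immediate = (1.0 - h[k] ** 2) * h[k - 1]     # dh_k/dw with h_{k-1} held fixed
        long_form += (h[t] - ys[t - 1]) * chain * immediate
        n_terms += 1

# Backward recursion: one pass from t = T down to t = 1.
recursive, delta_a_next = 0.0, 0.0
for t in range(T, 0, -1):
    delta_h = (h[t] - ys[t - 1]) + w * delta_a_next
    delta_a = (1.0 - h[t] ** 2) * delta_h
    recursive += delta_a * h[t - 1]
    delta_a_next = delta_a

print(n_terms, abs(long_form - recursive) < 1e-12)  # 15 True
```

The long form visits $T (T + 1) / 2$ paths, each with its own chain of factors, while the recursion performs $T$ updates.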

The Product of Recurrent Jacobians

The unrolled form contains one term that is worth isolating, because it governs the stability of the entire algorithm:

\frac{\partial h_{t}}{\partial h_{k}} = \prod_{j = k}^{t - 1} \frac{\partial h_{j + 1}}{\partial h_{j}} .

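This is the chain of recurrent Jacobians multiplied together along the temporal gap between time $k$ and time $t$ , ordered with later time steps on the left. When the upper index is smaller than the lower one (that is, $t = k$ ), the product is the identity by convention.

The same product appears, in disguise, inside the backward recursion. Each recurrent Jacobian factors as $\partial h_{j + 1} / \partial h_{j} = diag (ϕ_{h}^{'} (a_{j + 1})) W_{hh}$ , and every step of the backward sweep applies its transpose: the element-wise factor $ϕ_{h}^{'} (a_{j + 1})$ turns $δ_{j + 1}^{h}$ into $δ_{j + 1}^{a}$ , and $W_{hh}^{⊤}$ carries the result to time $j$ . Propagating an error from time $t$ back to time $k$ therefore involves $t - k$ applications of $W_{hh}^{⊤}$ . For example, for $t = 3$ and $k = 1$ ,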

\frac{\partial h _{3}}{\partial h _{1}} = \frac{\partial h _{3}}{\partial h _{2}} \cdot \frac{\partial h _{2}}{\partial h _{1}},

a product of two recurrent Jacobians, each of shape $n_{neurons} \times n_{neurons}$ .

Stability is dictated by this product

The chain $\prod_{j = k}^{t - 1} \partial h_{j + 1} / \partial h_{j}$ is the term through which BPTT carries gradient information across time. When its norm decays geometrically along the temporal gap, the gradient vanishes; when it grows geometrically, the gradient explodes. The full stability analysis, including the vanishing and exploding gradient regimes in vanilla RNNs, is the subject of BPTT Problems, and the optimizer-level remedy for the explosion is the subject of Gradient Clipping.
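
The geometric behaviour of the chain can be observed numerically. The sketch below assumes a tanh cell, so each factor is $diag (1 - h_{j + 1} ⊙ h_{j + 1}) W_{hh}$ , and builds $W_{hh}$ as a scaled orthogonal matrix so that its spectral norm equals the scale. The random drive stands in for the input term $W_{x h} x_{t} + b_{h}$ ; the sizes and function names are illustrative.

```python
import numpy as np

def jacobian_chain_norms(W_hh, hs):
    """Spectral norms of d h_T / d h_k for k = T-1, ..., 1, with hs = [h_1, ..., h_T]."""
    prod = np.eye(W_hh.shape[0])
    norms = []
    for h_next in reversed(hs[1:]):                  # h_T, h_{T-1}, ..., h_2
        J = np.diag(1.0 - h_next ** 2) @ W_hh        # d h_{j+1} / d h_j for tanh
        prod = prod @ J                              # later time steps stay on the left
        norms.append(np.linalg.norm(prod, 2))
    return norms

def run(scale, Q, drive):
    """Simulate the recurrence with W_hh = scale * Q and return the chain norms."""
    W_hh = scale * Q
    h = np.zeros(Q.shape[0])
    hs = []
    for u_t in drive:                                # u_t stands in for W_xh x_t + b_h
        h = np.tanh(W_hh @ h + u_t)
        hs.append(h)
    return jacobian_chain_norms(W_hh, hs)

rng = np.random.default_rng(0)
n, T = 8, 30                                         # toy sizes (assumed)
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))         # orthogonal: all singular values are 1
drive = rng.normal(size=(T, n))

small = run(0.5, Q, drive)   # each factor has norm at most 0.5
large = run(2.0, Q, drive)   # each factor has norm at most 2.0

print(small[0], small[9], small[-1])
print(large[0], large[9], large[-1])
```

With scale 0.5 the norm over a gap of $g$ steps is bounded by $0.5^{g}$ , because the tanh derivative never exceeds $1$ , so the chain necessarily decays geometrically. With scale 2.0 the bound $2^{g}$ permits growth, but whether the norms actually grow depends on how saturated the units are, since saturation shrinks the diagonal factors.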

Conclusion

Every gradient computed by BPTT has the same shape:

\frac{\partial L}{\partial θ} = \sum_{t = 1}^{T} (\text{local error signal at time } t) \otimes (\text{local forward input at time } t),

where the error signal is provided by the backward recursion on $δ_{t}^{h}$ and $δ_{t}^{a}$ , and the forward input is whatever quantity the parameter $θ$ multiplied during the forward pass ( $x_{t}$ , $h_{t - 1}$ , $h_{t}$ , or the constant vector $1$ for biases). A single backward sweep through the unrolled graph in $O (T)$ time computes every parameter gradient.

The same recurrent-Jacobian product that makes this computation possible also exposes vanilla RNNs to the stability issues analyzed in BPTT Problems: the algorithm’s efficiency and its instability share the same mathematical source.

For sequences where even $O (T)$ becomes prohibitive in memory (because every $h_{t}$ and $a_{t}$ must be retained for the backward pass), the practical training algorithm is Truncated BPTT, treated in BPTT Variants.

Deep Learning: Zero to Hero
