Backpropagation Through Time (BPTT) is the application of the canonical backpropagation algorithm to a Recurrent Neural Network (RNN) after the recurrent computation has been explicitly unrolled over a finite number of time steps.
Prerequisite: backpropagation as automatic differentiation
For a deeper understanding of how backpropagation works at the level of computational graphs, local derivatives, gradient accumulation, and topological ordering, see the Micrograd series. BPTT uses the same mechanism, with the additional structure introduced by the temporal unrolling of a recurrent computation.
In essence, the recurrent layer is unfolded along the temporal axis and treated as a deep feedforward computational graph whose temporal copies share the same parameters. Backpropagation is then applied to this unfolded graph. The number of time steps over which gradients are propagated is therefore a relevant modeling choice: in full BPTT it coincides with the sequence length, whereas in truncated BPTT it becomes an explicit truncation horizon.
Question
How should the hidden state before the beginning of the sequence be initialized, namely the initial state ?
Note
At the first time step (), no previous hidden state is available. It is therefore customary to set
that is, a zero vector. This choice is analogous to zero-padding in convolutional neural networks: in CNNs, zeros are added at the spatial boundary; here, mutatis mutandis, the same idea is applied at the temporal boundary of the sequence.
Conceptually, BPTT is equivalent to backpropagating the loss through a deep feedforward graph made of temporal copies of the same recurrent cell, from to . Unlike a standard MLP with independent layers, however, these temporal copies all share the same parameters.
In depth derivation
For an explicit derivation of Backpropagation Through Time (BPTT), the RNN is represented in its unfolded form along the temporal axis, as shown in the figure above. The diagram highlights both the forward computation, from inputs to local losses, and the backward propagation of gradients through time.
The compact RNN block shown on the left of the figure should be read as a schematic view of a single recurrent cell. The hidden-to-output matrix maps the hidden state to the logit vector ; the prediction is obtained only after applying the output activation, typically a softmax in multiclass classification.
| Component / Symbol | Description |
|---|---|
| Number of classes in the classification task. | |
| , , | Weight matrices. Although they are drawn at each time step, they are shared across time. This parameter sharing reduces the total number of parameters and expresses the recurrent nature of the model. |
| Input column vector at time step . | |
| Hidden state at time step . | |
| Raw output vector, or logit vector, before the softmax. | |
| Predicted probability vector. It is obtained by applying the softmax function to the logits . For each component , The softmax ensures that is a probability vector, which makes it suitable for multiclass classification and compatible with cross-entropy loss. | |
| Local loss at time step . | |
| Propagation Phases | The computation has two phases. In the forward pass, shown by gray arrows, the input and the previous hidden state are combined to produce the new hidden state and the output. In the backward pass, shown by red arrows, the gradient of the total loss is propagated backward through the unfolded time steps and accumulated for the shared parameters. |
RNN as a Dynamical System
A standard RNN can be viewed as a dynamical system defined by a state transition function, which updates the internal state of the network, and an output function, which maps the hidden state to the model output.
Single-Sequence Formulation
For a single sequence, the system can be written in explicit form as
For a conventional RNN, these functions are usually instantiated as
The vector is the hidden pre-activation: the recurrent counterpart of the pre-activation in an MLP, with the hidden state obtained as its element-wise nonlinearity, . It is named explicitly because the backward pass differentiates the loss with respect to it.
The functions and are nonlinear functions. In particular:
- is an element-wise hidden activation function, commonly .
- is the output activation. In multiclass classification, it is typically the softmax, because its output can be interpreted as a probability distribution over classes.
| Symbol | Dimension | Description |
|---|---|---|
| Input vector at time . | ||
| , | Previous and current hidden states. | |
| Hidden pre-activation, , with . | ||
| Logit vector before the softmax. | ||
| Predicted probability vector after the softmax. | ||
| Input-to-hidden weight matrix. | ||
| Hidden-to-hidden recurrent weight matrix. | ||
| Hidden-to-output/logit weight matrix. | ||
| Hidden-state bias vector. | ||
| Output/logit bias vector. | ||
| — | Hidden activation function, for instance . | |
| — | Output activation function, typically softmax in classification. | |
| Local scalar loss at time . |
Mini-Batch Formulation
Important
For computational efficiency, training is usually performed on mini-batches of sequences. In this setting, the input vector of a single time step, , is generalized to an input matrix , where is the mini-batch size.
Similarly, the previous hidden states are collected into a matrix , where each row contains the hidden state associated with one sequence in the batch.
The recurrent layer can then be evaluated for the whole mini-batch as
Here:
- is the mini-batch input matrix at time .
- is the matrix of previous hidden states.
- contains the hidden states computed at time .
- contains the logits for each example in the batch.
- contains the predicted probabilities after the softmax.
The main tensor dimensions are summarized below.
| Symbol | Dimension | Description |
|---|---|---|
| Mini-batch input at time . | ||
| Mini-batch hidden state at time . | ||
| Input-to-hidden weight matrix. | ||
| Hidden-to-hidden recurrent weight matrix. | ||
| Hidden-to-output/logit weight matrix. | ||
| , broadcast to | Hidden-state bias. | |
| , broadcast to | Output/logit bias. | |
| Mini-batch predicted probabilities at time . |
Why the Transposed Weight Matrices?
The transposes come from using two compatible conventions at the same time: column vectors for the mathematical derivation, and row-wise mini-batches for implementation.
Single-sequence view. The input is a column vector of shape . The matrix has shape , so the product is dimensionally valid.
Mini-batch view. The input stores one example per row, so it has shape . To apply the same linear map to every row, the weight matrix must appear on the right with shape .
Since has been defined in the column-vector convention as , the mini-batch computation uses its transpose:
The same reasoning applies to the recurrent term:
This is also the convention used by PyTorch. A layer such as
nn.Linear(in_features, out_features)storesweightwith shape[out_features, in_features], and the forward computation is performed asinput @ weight.T + bias. The same pattern appears innn.RNN, where the recurrent update is written in batched form using terms such asx_t @ W_ih.Tandh_{t-1} @ W_hh.T.
Compact Notation for Input and Previous Hidden State
In optimized implementations, it is common to concatenate the input at time and the previous hidden state into a single matrix:
This matrix is paired with an extended weight matrix
The recurrent transition can then be written compactly as
This notation simplifies the implementation and reduces the recurrent transition to a single matrix multiplication.
Note
At the first time step (), the previous hidden state is initialized as
where is a matrix of zeros.
Important
This structure makes it possible to derive gradients systematically with respect to the shared parameters , , , , and . This is the core mechanism of BPTT.
Gradient Derivation: Backpropagation Through Time
BPTT Derivation: Returning to a Single Sequence
The forward computation has been described both for a single sequence and for a mini-batch. In the mini-batch formulation, matrices such as and are used, with one sequence per row.
For the derivation of Backpropagation Through Time, the notation returns to a single sequence. Thus, , , , and are vectors rather than mini-batch matrices. This is purely a notational choice; the mini-batch case follows from the single-sequence case by the linearity of differentiation.
If the mini-batch loss is defined as the average of the per-sequence losses,
with the loss computed on the -th sequence of the mini-batch, then for any parameter the linearity of gives
The single-sequence derivation that follows therefore applies unchanged at the mini-batch level: each is computed by the formulas derived below, applied to sequence in isolation, and the resulting per-sequence gradients are averaged. The same identity holds with
sumreduction (i.e., without the factor), which is the convention used by some implementations.
The total loss over the unfolded sequence is defined as
Here, is the one-hot target vector at time , and is the predicted probability distribution.
The sequence loss sums the local losses over the time steps. Averaging instead, , is an equally common convention: it rescales every gradient derived below by and changes nothing structural, exactly as the mini-batch reduction can be taken as a sum or a mean. The summed form is kept here to match the standard BPTT presentation.
Cross-Entropy and One-Hot Targets
For a single time step, the multiclass cross-entropy can be written as
where is a one-hot target vector and is the predicted probability vector after softmax. The dot product hides the sum over classes:
Since is one-hot, only the true class contributes. Therefore,
Error Signals and the Backward Recursion
This section derives the gradients of with respect to the five shared parameters. The derivation follows the same pattern as backpropagation for MLPs: introduce error signals at intermediate points of the computational graph, derive a backward recursion that computes them in a single sweep, and express each parameter gradient as an outer product of an error signal and a forward input. The whole procedure runs in time on a sequence of length .
What distinguishes BPTT from MLP backpropagation is the temporal axis. The same parameters are reused at every time step, so each parameter contributes to every local loss . The recursion that propagates the error runs from time down to time , and the parameter gradients are sums over the temporal slices. Apart from this extra summation, every step has a direct counterpart in the MLP case.
Relation to the MLP error signal
The error signal introduced in MLP backpropagation is the gradient of the loss with respect to the pre-activation of a unit. The same role is played here by .
The letters used for activations and pre-activations are swapped between the two sections:
- the MLP notes use for the pre-activation and for the activation
- the RNN notes use for the pre-activation and for the activation.
The structural role of the error signal is identical in both: a gradient against the pre-activation, computed by a backward recursion and combined with forward inputs to produce parameter gradients.
Three error signals
For a single sequence, define the error signal at three points of the computational graph:
These are the gradients of the total loss with respect to the logits, the hidden state, and the pre-activation at time . All three are vectors of the appropriate dimension.
The first one is immediate. Under the standard softmax-plus-cross-entropy combination, the derivative simplifies to
This is the result of a clean cancellation between the softmax derivative and the cross-entropy derivative. The takeaway is that the gradient at the output is the prediction error, signed and component-wise: a vector of mismatches between the predicted probabilities and the one-hot targets , with no residual Jacobian to track. A broader treatment of the same identity is in Softmax and Cross-Entropy.
Optional: derivation of
Since the total loss depends on only through the local loss at the same time step,
The local loss is , with .
Softmax derivative. Differentiating the softmax with respect to ,
where is the Kronecker delta (which equals when and otherwise; unrelated to the error signal ).
Cross-entropy derivative. Differentiating with respect to ,
Chain rule. Combining the two via the multivariate chain rule across the logits,
where the last step uses because is one-hot. Stacking the scalar derivatives into a vector gives the boxed identity .
Where the simplification comes from. The factor at the front of the softmax derivative cancels the that the cross-entropy derivative leaves behind. Without this cancellation, the gradient at the output would still depend on the softmax outputs in a non-trivial way; with it, the gradient is the prediction error in its rawest form, which is exactly what makes the downstream backward recursion clean.
Backward recursion for
The hidden state affects the total loss through two paths:
- Locally, through the output at time : .
- Recursively, through the future hidden states: , which in turn affect .
Each path contributes to via the multivariate chain rule. The general identity, for any vector-valued forward step and any scalar loss , is
Here is the Jacobian matrix of , and the transpose makes the dimensions match: a column gradient with respect to on the right side produces, after multiplication by the transposed Jacobian, a column gradient with respect to on the left side.
Applied to the two paths through which affects :
- Local path . From the output equation , the Jacobian is . Its contribution to is therefore .
- Recursive path . From the recurrent transition , the Jacobian is . Its contribution is , where is the future error signal that has already been computed by the backward sweep.
Summing the two contributions yields the recursion for :
The recursion is initialized at the right end of the sequence. No future hidden state exists beyond , so the recursive term vanishes by setting , and
From to
The pre-activation error is one element-wise nonlinearity away from the hidden-state error, and the corresponding Jacobian has a very particular structure that is worth deriving in detail.
The map acts component by component: the -th coordinate of depends only on the -th coordinate of ,
Changing with has no effect on , because the activation function never mixes coordinates. The partial derivatives of with respect to are therefore
Assembling these scalar derivatives into the Jacobian matrix gives a square matrix of shape whose only non-zero entries lie on the diagonal:
The diagonal structure is a direct consequence of the activation acting independently on each coordinate; there is no coupling between coordinates to fill in the off-diagonal entries. The same construction, an element-wise nonlinearity contributing a diagonal factor to a Jacobian product, recurs throughout deep learning: it governs the exploding gradient across time and the gradient highway through feedforward residual blocks.
A useful general identity follows: multiplying any vector by a diagonal matrix is the same as taking the element-wise product,
Applying the chain rule with the Jacobian above, and using that the transpose of a diagonal matrix is the diagonal matrix itself, the pre-activation error becomes
Boxing the final form:
where denotes the element-wise (Hadamard) product. For the canonical choice , the derivative is , which can be evaluated for free from the hidden state saved during the forward pass.
The backward sweep in algorithmic form
Combining the two updates yields a single loop that walks the unrolled graph from right to left:
Each iteration performs a constant number of matrix-vector products. The full backward pass therefore visits each of the time steps exactly once, for a total cost of , linear in the sequence length.
Dynamic programming along the temporal axis
Standard backpropagation in an MLP is already a dynamic-programming algorithm: each layer’s error signal is computed once from and reused in every parameter gradient at layer . Backprop is linear in the depth precisely because of this reuse. The novelty of BPTT is not dynamic programming itself, but the observation that the same DP structure survives the temporal axis: across time steps, just as across depth, error signals are computed once and reused.
This deserves emphasis because the naive chain-rule expansion of produces a double sum over temporal paths that looks quadratic in . Sharing across all time steps could in principle have spoiled the DP structure, forcing each path to be computed independently. The recursion on and above is the proof that this does not happen: the temporal dependency stays linear in , exactly as the depth-wise dependency stays linear in for an MLP. Same principle, extended to a new axis.
This is also exactly what PyTorch’s autograd computes when
loss.backward()is called on the unrolled cell of the minimal PyTorch implementation: the recursion above runs inside the autograd engine, one matrix-vector product per time step. The hand-written derivation in this note and the production-grade autograd path implement the same algorithm.
Parameter Gradients as Outer-Product Sums
Once and are available for every , the gradients with respect to the five shared parameters are immediate, because the same template applies to every weight matrix in the cell. It is worth deriving the template once in general before listing the five formulas.
Why the gradient with respect to a weight matrix is an outer product. Consider a generic linear step inside the forward pass. The gradient of with respect to the scalar entry uses the chain rule:
The first factor is the -th component of the error signal at the output of the linear step, . The second factor follows from : only the term involves the specific entry , so
Multiplying the two factors gives the scalar derivative
and assembling these scalars over the index pair recovers the outer product
The same template, applied to a bias term , gives , since .
Applying the template to the RNN cell. Each of the five parameters of the cell appears in exactly one linear step of the forward pass, multiplying exactly one input vector at each time step :
- multiplies in the output equation, and its error signal at the output is ;
- multiplies in the recurrent transition, and its error signal at the output is ;
- multiplies in the recurrent transition, and its error signal at the output is ;
- and inherit the corresponding output error signals directly.
Summing the per-step outer products over gives the five gradients:
The structure is the same as ordinary backpropagation in an MLP; the only addition is the temporal sum. The two input-side weight matrices have a parallel role: multiplies in the forward pass, multiplies , and the corresponding gradient is the outer product of with whichever input the matrix saw at time . No separate derivation is needed for the two cases: they are the same operation applied to two different inputs.
Equivalence with the long-form chain rule
The compact formulas above are not an alternative to the chain rule; they are its natural form once the backward recursion has been applied. Expanding via the recursion reproduces the long-form expression
which sums over all causal paths from to each local loss. The inner index corresponds to the time step at which is “tagged” along the path: the parameter participates in the computation of every hidden state , and each such participation defines one path that ends at the loss . The same sum applies to with in place of .
This expansion is correct but should not be implemented as written. Computing each path independently leads to operations. The backward recursion in the previous section is the same sum evaluated in by reusing each across all the paths that pass through time .
The Product of Recurrent Jacobians
The unrolled form contains one term that is worth isolating, because it governs the stability of the entire algorithm:
This is the chain of recurrent Jacobians multiplied together along the temporal gap between time and time . When the upper index is smaller than the lower one, the product is the identity by convention.
The same product appears, in disguise, inside the backward recursion: every step of the backward sweep applies to once, so propagating an error from time back to time involves applications of a Jacobian of . For example, for and ,
a product of two recurrent Jacobians, each of shape .
Stability is dictated by this product
The chain is the term through which BPTT carries gradient information across time. When its norm decays geometrically along the temporal gap, the gradient vanishes; when it grows geometrically, the gradient explodes. The full stability analysis, including the vanishing and exploding gradient regimes in vanilla RNNs, is the subject of BPTT Problems, and the optimizer-level remedy for the explosion is the subject of Gradient Clipping.
Conclusion
Every gradient computed by BPTT has the same shape:
where the error signal is provided by the backward recursion on and , and the forward input is whatever quantity the parameter multiplied during the forward pass (, , , or the constant vector for biases). A single backward sweep through the unrolled graph in time computes every parameter gradient.
The same recurrent-Jacobian product that makes this computation possible also exposes vanilla RNNs to the stability issues analyzed in BPTT Problems: the algorithm’s efficiency and its instability share the same mathematical source.
For sequences where even becomes prohibitive in memory (because every and must be retained for the backward pass), the practical training algorithm is Truncated BPTT, treated in BPTT Variants.