BPTT Variants

Full BPTT does not scale

The derivation of Backpropagation Through Time assumes that the entire sequence is unrolled and that gradients are computed across all $T$ time steps. This full BPTT is conceptually clean but rarely practical, because two costs grow linearly with the sequence length:

Memory. Every hidden state $h_{1}, \dots, h_{T}$ and pre-activation $a_{1}, \dots, a_{T}$ produced during the forward pass must be retained, because the backward pass needs them to evaluate the local Jacobians.

Compute. A single parameter update requires a full forward and a full backward pass over the sequence, so the time per update scales as $O(T)$.
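To make the linear growth of the memory term concrete, here is a back-of-the-envelope sketch. The function name, the batch size, the hidden size and the 4-byte float width are all illustrative assumptions, and the estimate counts only the stored $h_t$ and $a_t$ of a single recurrent layer, ignoring inputs, outputs, parameters and optimizer state.

```python
def stored_activation_bytes(T, batch_size, hidden_size, bytes_per_value=4):
    """Bytes needed to keep h_t and a_t for every step of one unrolled pass."""
    tensors_per_step = 2  # the hidden state h_t and the pre-activation a_t
    return tensors_per_step * T * batch_size * hidden_size * bytes_per_value


# Hypothetical sizes, chosen only to make the linear growth visible.
short = stored_activation_bytes(T=100, batch_size=32, hidden_size=512)
long = stored_activation_bytes(T=1000, batch_size=32, hidden_size=512)

print(f"T=100:  {short / 2**20:.1f} MiB")   # 12.5 MiB
print(f"T=1000: {long / 2**20:.1f} MiB")    # 125.0 MiB
```

Ten times more steps means ten times more retained activations, before any other cost is counted.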

For sequences of a few hundred steps this is already problematic; for thousands of steps it becomes infeasible. The stability problems analyzed in BPTT Problems compound the issue: even when the computation fits in memory, gradients propagated through hundreds of recurrent Jacobians vanish or explode. The chunking strategy in this note bounds the memory and compute cost; the exploding case is handled at the optimizer level by Gradient Clipping, a complementary remedy.

The practical answer is truncated BPTT (TBPTT): partition the sequence into shorter chunks and bound the backward pass to one chunk at a time. Within that single idea one design decision is left, and it matters more than any hyperparameter: whether the hidden state survives a chunk boundary. Carrying it forward gives the stateful mode; resetting it gives the stateless mode. That one choice also fixes how the chunks may be ordered and which temporal dependencies the model can ever learn.

Truncated BPTT

The idea is to bound the BPTT horizon to a fixed window of length $K \ll T$. The sequence is sliced into $T / K$ contiguous chunks; forward and backward passes are performed inside each chunk, and one parameter update is applied per chunk. Typical values are $K \in \{10, 20, 50\}$.

The cost is now $O(K)$ per update instead of $O(T)$, and the recurrent Jacobian product that drives the stability analysis is also limited to $K$ factors. Each pass through the dataset produces $T / K$ updates per sequence, against the single update of full BPTT.
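The slicing itself is simple bookkeeping. The sketch below computes the chunk boundaries and the resulting number of updates; `chunk_bounds` is a hypothetical helper, and letting the last chunk be shorter when $K$ does not divide $T$ is an assumption of this sketch, not something the note prescribes.

```python
def chunk_bounds(T, K):
    """Return (start, end) index pairs for contiguous chunks of length at most K."""
    return [(start, min(start + K, T)) for start in range(0, T, K)]


T, K = 1000, 20  # illustrative values
bounds = chunk_bounds(T, K)

updates_per_sequence = len(bounds)                            # one update per chunk
longest_backward = max(end - start for start, end in bounds)  # backward horizon

print(updates_per_sequence, longest_backward)  # 50 20
```

With these values a single pass over the sequence yields 50 updates, each backpropagating through at most 20 steps.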

What remains to be decided is what happens to the hidden state at each chunk boundary. The two answers are the two modes of truncated BPTT.

Stateful: carry the hidden state forward

A standard implementation carries the hidden state forward across chunk boundaries, but stops the gradient at the boundary. The forward computation therefore remains a single continuous recurrence over the full sequence; only the backward pass is truncated. This preserves the evolution of the hidden state across chunks while keeping each backward window bounded, and it is the reason this mode can, in practice, track patterns somewhat longer than $K$.

Because the state must flow from one chunk to the next, the chunks have to be processed in their natural temporal order.

"Carry the state" versus "stop the gradient"
Concretely, stopping the gradient is a single line in an autodiff framework. In PyTorch, the hidden state passed from one chunk to the next is detached from the computational graph:
h = h.detach()   # carry the value forward, cut the gradient path
After detach(), h still holds the same numbers (so the forward recurrence continues seamlessly), but it is treated as a constant by loss.backward(), so no gradient flows back beyond the start of the current chunk. This is exactly the difference between “carry the state” (the value crosses the boundary) and “stop the gradient” (the backward pass does not).
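For context, here is a minimal sketch of where that line sits in a stateful training loop. The toy sine signal, the sizes, the `nn.RNN` plus linear head, the SGD optimizer and the learning rate are all assumptions made for illustration; only the `detach()` step comes from the note.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

T, K, hidden = 200, 20, 16  # illustrative sizes
t = torch.arange(T + 1, dtype=torch.float32)
series = torch.sin(2 * math.pi * t / 50.0)  # toy signal
x = series[:-1].view(1, T, 1)               # (batch, time, features)
y = series[1:].view(1, T, 1)                # next-step targets

rnn = nn.RNN(input_size=1, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()

h = torch.zeros(1, 1, hidden)  # (num_layers, batch, hidden)
num_updates = 0
for start in range(0, T, K):   # chunks visited in temporal order
    h = h.detach()             # carry the value forward, cut the gradient path
    out, h = rnn(x[:, start:start + K], h)
    loss = loss_fn(head(out), y[:, start:start + K])
    opt.zero_grad()
    loss.backward()            # reaches back to the start of this chunk only
    opt.step()                 # one parameter update per chunk
    num_updates += 1
```

The state `h` returned by one chunk becomes the initial state of the next, so the forward recurrence is continuous, while each `loss.backward()` call traverses at most `K` steps.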

Stateless: reset the hidden state

The alternative resets the hidden state to a fresh initial value at the start of every chunk. Each chunk is then a self-contained recurrence of length $K$ that owes nothing to its neighbours. Once the chunks are independent in this way, the batch loader no longer has to respect their temporal order and can instead sample them at random from the dataset. Each mini-batch then contains independent windows drawn from arbitrary positions of arbitrary sequences.

The figure shows one mini-batch made of three such chunks, each labelled Stateless TBPTT. The two words name the two things happening inside a window: TBPTT is the truncation to $K$ steps that every window performs internally, with forward and backward confined to its own $K$ cells; stateless is the reset that starts each window from a fresh $h_{\text{init}} \leftarrow 0$ instead of inheriting the previous chunk’s state. Each window is therefore a self-contained recurrence of length $K$, and the gaps between them are explicit: consecutive chunks in the mini-batch are no longer adjacent in time, and no hidden state is transported between them.
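The data-loading side of the stateless mode can be sketched in a few lines of plain Python. `sample_stateless_batch` is a hypothetical helper and the toy sequences are made up; the point is only that each window is drawn from an arbitrary sequence at an arbitrary offset and is paired with a fresh initial state.

```python
import random


def sample_stateless_batch(sequences, K, batch_size, rng):
    """Draw batch_size windows of length K from random sequences and offsets."""
    eligible = [i for i, seq in enumerate(sequences) if len(seq) >= K]
    batch = []
    for _ in range(batch_size):
        seq_idx = rng.choice(eligible)
        start = rng.randrange(len(sequences[seq_idx]) - K + 1)
        window = sequences[seq_idx][start:start + K]
        h_init = 0.0  # fresh state: nothing is inherited from another window
        batch.append((seq_idx, start, window, h_init))
    return batch


rng = random.Random(0)
sequences = [list(range(100)), list(range(1000, 1060))]  # toy data
batch = sample_stateless_batch(sequences, K=10, batch_size=3, rng=rng)

for seq_idx, start, window, h_init in batch:
    print(seq_idx, start, window[0], window[-1], h_init)
```

Because no window depends on any other, the three windows can be stacked along the batch dimension and processed in parallel.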

Pros

Decorrelation of updates. Contiguous chunks share lexical, statistical and contextual structure; random sampling breaks those correlations and brings the optimization closer to the i.i.d. regime that mini-batch SGD assumes.

Strong short-context generalization. When the task depends only on a short window, for instance next-step time-series forecasting, random sampling forces the model to extract features that work from arbitrary positions of the sequence. This improves generalization on the short timescale.

Cons

No long-range dependencies. Random sampling severs temporal continuity entirely. No hidden state is transported between chunks, and there is no temporal scale beyond $K$ that the model can ever observe. Any phenomenon with structure longer than $K$ is unreachable by construction.

A note on the name "implicit BPTT"

Some courses and slides call this stateless, random-chunk variant “implicit BPTT”. The term is best avoided, because in the wider literature “implicit BPTT”, more precisely implicit differentiation, means something different and almost opposite: computing gradients at the fixed point $z^{\star} = f_{\theta}(z^{\star}, x)$ of a recurrence through the implicit-function theorem, without unrolling it at all, at $O(1)$ memory. That is the mechanism behind Deep Equilibrium Models (DEQ) and recurrent backpropagation. The variant on this page does the opposite, unrolling explicitly for $K$ steps. To avoid the clash, this note names the two modes by what they do to the state: stateful and stateless.
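To see how different implicit differentiation is from truncation, consider a scalar toy recurrence $z = \tanh(w z + x)$. The choice of $\tanh$, the values of $w$ and $x$, and all function names below are assumptions for illustration only. The gradient of the fixed point with respect to $w$ follows from the implicit-function theorem, $dz^{\star}/dw = \partial_w f / (1 - \partial_z f)$, and uses only the converged value; no intermediate iterate is stored.

```python
import math


def f(z, w, x):
    return math.tanh(w * z + x)


def fixed_point(w, x, iters=200):
    """Iterate z <- f(z, w, x); a contraction for |w| < 1, so it converges."""
    z = 0.0
    for _ in range(iters):
        z = f(z, w, x)
    return z


def implicit_grad_w(w, x):
    """dz*/dw from the implicit-function theorem, using only z* itself."""
    z = fixed_point(w, x)
    sech2 = 1.0 - z * z  # tanh'(w z + x), since tanh(w z + x) = z at the fixed point
    df_dz = w * sech2
    df_dw = z * sech2
    return df_dw / (1.0 - df_dz)


w, x = 0.5, 0.3
g_implicit = implicit_grad_w(w, x)

# Central finite difference on the fixed point, as an independent check.
eps = 1e-6
g_numeric = (fixed_point(w + eps, x) - fixed_point(w - eps, x)) / (2 * eps)

print(abs(g_implicit - g_numeric) < 1e-6)  # True
```

Truncated BPTT, by contrast, stores and differentiates through $K$ explicit steps.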

The shared limit: $K$ versus the longest temporal pattern

Both modes truncate the backward horizon at $K$, with a direct learning consequence: any temporal dependency that spans more than $K$ steps is invisible to the gradient. Let $T^{*}$ denote the longest period present in the data. If $K < T^{*}$, no gradient ever connects two states separated by more than $K$ time steps, so the model receives no signal about that scale. The stateful mode softens this slightly by carrying the state forward; the stateless mode hits the limit as a hard wall.
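The cut can be written out explicitly. As a sketch under assumed notation (a chunk covering steps $c+1, \dots, c+K$, a per-step loss $L_t$, and an incoming state $h_c$ that is either detached or reset), the sensitivity of a loss inside the chunk to an earlier hidden state is

$$
\frac{\partial L_t}{\partial h_s} =
\begin{cases}
\dfrac{\partial L_t}{\partial h_t} \displaystyle\prod_{j=s+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}, & c < s \le t \le c + K, \\[1ex]
0, & s \le c.
\end{cases}
$$

The product contains $t - s \le K - 1$ recurrent Jacobians, which stays within the bound of $K$ factors, and every state at or before the boundary contributes exactly zero. With $K = 10$ and $T^{*} = 50$, for example, no loss term ever receives gradient from a state one full period earlier.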

The figure illustrates the failure case. The target signal is a sine wave with period $T^{*} = 50$. With $K = 10$, the model is exposed only to ten-step windows and converges to a near-constant prediction around the mean of the signal, because no portion of the gradient ever traverses a full period. Increasing $K$ beyond $T^{*}$ recovers learnability, at proportionally higher compute and memory cost.

This is the structural reason why even well-tuned truncated BPTT scales poorly with the timescale of the phenomenon. The remedy that eventually replaced the bigger-$K$ recipe is attention: instead of forcing all long-range information through a recurrent state, an attention mechanism gives the model a direct, learned pointer to distant positions in the sequence. The development of attentional interfaces is treated in later notes.

Choosing between the two

The trade-off is summarized below:

| Aspect | Stateful (sequential chunks) | Stateless (random chunks) |
| --- | --- | --- |
| Hidden state at a boundary | Carried forward, gradient stopped | Reset: each chunk restarts from a fresh state |
| Chunk ordering | Natural temporal order (forced) | Arbitrary: chunks sampled at random |
| Longest pattern learnable | $K$ (longer in practice via state carry-over) | $K$, hard limit |
| Optimization regime | Within-sequence correlated updates | i.i.d.-like batched updates |
| Typical use case | Language modeling, long time series | Short-horizon forecasting, when long context is not informative |
| Compute per update | $O(K)$ | $O(K)$ |
| Gradient stability across $T$ | Bounded to $K$ factors | Bounded to $K$ factors |

Both modes sit below modern attention-based architectures in expressive power: they make the recurrent core tractable, but neither closes the gap between $K$ and the actual temporal scale of the data. That gap is precisely what motivates the gated cells discussed in the limitations of vanilla RNNs and, beyond them, the attention mechanisms developed in later notes.
