Stacked RNNs

What this note builds

A single recurrent layer applies one nonlinear transformation per step, so its expressivity per step is capped by that one cell. Stacking lifts the cap by composing $L$ layers vertically: the per-step computation becomes a depth-$L$ feedforward network with stateful inputs. This note develops the construction as a 2-D grid over time and depth, shows why the gradient must now survive both axes (and why vanilla cells fail at it), and distils the practical recipe that made stacked LSTMs the workhorse of pre-Transformer speech and NLP: gated cells, two to four layers, between-layer dropout, and residual connections on depth.

Reading the figure

The figure unrolls the stack over time into the $(t, ℓ)$ grid: each row is a layer, each column a time step. Two kinds of arrow move information through it: solid horizontal arrows are the recurrence within a layer (the time axis), and dashed vertical arrows lift a layer’s output up to the next layer (the depth axis). The bottom row reads the raw input $x_{t}$ at each step; every row above reads features of features from the row below, and the top row emits the sequence $h_{t}^{(L)}$ that downstream modules consume. This stacked abstraction is what the slide calls “hierarchical sequential features”, the recurrent analogue of adding hidden layers to an MLP, and the reason stacks are standard in speech and NLP. (Colours: states at the current step $t$ are green, those carried from earlier steps blue.)

A single recurrent layer, whether vanilla RNN, LSTM or GRU, applies the same nonlinear transformation to its input at every time step. The expressive power available per step is fixed by the width and shape of that one cell. Stacking is the natural answer to a simple question: what if more expressivity per step is needed than one cell can deliver?

A stacked RNN answers by composing $L$ recurrent layers vertically. At every time step, the output of layer $ℓ$ becomes the input of layer $ℓ + 1$, while each layer continues to carry its own hidden state forward in time. The result is a 2-D computation grid with two independent axes: time (horizontal) and depth (vertical).

Definition

Let $h_{t}^{(ℓ)} \in \mathbb{R}^{n_{ℓ}}$ denote the hidden state of layer $ℓ$ at time $t$, with $n_{ℓ}$ the width of layer $ℓ$. The input to layer $ℓ$ at time $t$ is the hidden state of the previous layer at the same time:

$$h_{t}^{(0)} = x_{t}, \qquad h_{t}^{(ℓ)} = f^{(ℓ)}\left(h_{t}^{(ℓ - 1)}, h_{t - 1}^{(ℓ)}\right) \quad \text{for } ℓ = 1, \dots, L.$$

Here $f^{(ℓ)}$ is the recurrent cell of layer $ℓ$, with its own parameters $θ^{(ℓ)}$. The cell can be any recurrent unit: vanilla, LSTM, GRU, or a mix. The final output sequence used by any downstream module (loss, attention, prediction head) is conventionally $h_{t}^{(L)}$, the top-layer hidden state.
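The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes vanilla $\tanh$ cells, zero initial states and random weights; the helper names `init_stack` and `stacked_rnn_forward` are hypothetical and are not taken from any library.

```python
import numpy as np

def init_stack(n_inputs, widths, seed=0):
    """Create one (W_x, W_h, b) triple per layer; widths[l] plays the role of n_l."""
    rng = np.random.default_rng(seed)
    params, n_below = [], n_inputs
    for n in widths:
        W_x = rng.normal(0.0, 1.0 / np.sqrt(n_below), size=(n, n_below))
        W_h = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
        params.append((W_x, W_h, np.zeros(n)))
        n_below = n  # the next layer reads this layer's output
    return params

def stacked_rnn_forward(xs, params):
    """Walk the (t, l) grid. xs has shape (T, n_inputs); returns top-layer states (T, n_L)."""
    states = [np.zeros(W_h.shape[0]) for _, W_h, _ in params]  # assumed h_0^{(l)} = 0
    top = []
    for x_t in xs:                      # time axis
        below = x_t                     # h_t^{(0)} = x_t
        for l, (W_x, W_h, b) in enumerate(params):  # depth axis
            # states[l] still holds h_{t-1}^{(l)} when it is read here
            states[l] = np.tanh(W_x @ below + W_h @ states[l] + b)
            below = states[l]           # feeds layer l + 1 at the same step
        top.append(below)               # h_t^{(L)}
    return np.stack(top)

xs = np.random.default_rng(1).normal(size=(5, 3))
params = init_stack(n_inputs=3, widths=[8, 8, 4])
top_states = stacked_rnn_forward(xs, params)
print(top_states.shape)  # (5, 4)
```

The inner loop runs over depth and the outer loop over time, so each cell sees exactly its bottom neighbour at the current step and its own state from the previous step.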

Dimensions used in this note

Colours follow the figure: states at the current step $t$ in green, states carried from the previous step $t - 1$ in blue.

symbol	role	shape
$x_{t}$	input	$n_{inputs}$
$h_{t}^{(ℓ)}$	hidden state of layer $ℓ$ at time $t$	$n_{ℓ}$
$c_{t}^{(ℓ)}$	cell state of layer $ℓ$ at time $t$ (LSTM only)	$n_{ℓ}$
$θ^{(ℓ)}$	parameters of layer $ℓ$’s cell	depends on cell type

Layer widths $n_{ℓ}$ may differ across layers, though in practice they are usually kept equal except possibly for the first layer (which adapts to $n_{inputs}$).

For an LSTM stack, each layer carries both a hidden state $h_{t}^{(ℓ)}$ and a cell state $c_{t}^{(ℓ)}$. Only the hidden state of layer $ℓ$ feeds layer $ℓ + 1$; the cell state stays inside the layer, exposed only via the read-out $h_{t}^{(ℓ)} = o_{t}^{(ℓ)} ⊙ \tanh(c_{t}^{(ℓ)})$.
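A rough sketch of one time step through an LSTM stack makes the separation concrete: the cell state is returned to the same layer for the next step, while only the hidden state is passed upward. The gate ordering (input, forget, output, candidate) and the function names are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4n, n_in), U: (4n, n), b: (4n,). Assumed gate order: i, f, o, g."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2 * n]), sigmoid(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:])
    c = f * c_prev + i * g          # gated additive update of the cell state
    h = o * np.tanh(c)              # read-out: the only view of c exposed outside
    return h, c

def stacked_lstm_step(x_t, hs, cs, params):
    """Advance every layer by one time step; returns the new per-layer (h, c) lists."""
    below, new_hs, new_cs = x_t, [], []
    for (W, U, b), h_prev, c_prev in zip(params, hs, cs):
        h, c = lstm_cell(below, h_prev, c_prev, W, U, b)
        new_hs.append(h)
        new_cs.append(c)            # c stays inside its own layer
        below = h                   # only h is lifted to the next layer
    return new_hs, new_cs

rng = np.random.default_rng(0)
n_inputs, n, L = 3, 4, 2
params, n_below = [], n_inputs
for _ in range(L):
    params.append((rng.normal(0, 0.1, (4 * n, n_below)),
                   rng.normal(0, 0.1, (4 * n, n)),
                   np.zeros(4 * n)))
    n_below = n
hs = [np.zeros(n) for _ in range(L)]
cs = [np.zeros(n) for _ in range(L)]
for x_t in rng.normal(size=(5, n_inputs)):
    hs, cs = stacked_lstm_step(x_t, hs, cs, params)
print(hs[-1].shape)  # (4,)
```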

Why stack at all

A single recurrent layer with $n_{neurons}$ units has the expressive power of one nonlinear function of $(x_{t}, h_{t - 1})$. No matter how large $n_{neurons}$ is made, there is exactly one $\tanh$ (or one sigmoid-tanh-gated combination) between input and output at each step.

Stacking $L$ layers turns the per-step computation into a depth-$L$ feedforward network whose inputs are stateful. The same intuition that motivates deep MLPs over wide single-layer networks transfers directly: features at higher layers can be features of features, and the whole hierarchy is evaluated at every time step, with each level carrying its own state forward. Empirically, this is what makes stacked LSTMs and GRUs work well on tasks like acoustic modelling and machine translation, where the relevant structure at each time step is itself hierarchical (phonemes → words → phrases).

Higher layers are deeper, not slower

A natural misreading of “hierarchy” is that the upper layers run on a coarser timescale, updating less often and capturing longer patterns by construction. They do not. Every layer updates at every time step, consuming the layer below’s output each step, so stacking adds feature abstraction (depth), not temporal resolution. Any tendency of higher layers to track longer-range structure is emergent, not built in. Genuinely multi-timescale recurrence needs explicit machinery (clockwork RNNs, hierarchical multiscale RNNs, or strided/dilated stacking) that updates different layers at different rates on purpose.

Depth and time are orthogonal axes of expressivity

A useful mental model: a stacked RNN’s computation graph is a rectangular grid of cells indexed by $(t, ℓ)$. Each cell takes input from its left neighbour (the layer’s own past, $h_{t - 1}^{(ℓ)}$) and from its bottom neighbour (the previous layer’s current output, $h_{t}^{(ℓ - 1)}$). Width adds parallel features at a given $(t, ℓ)$; depth adds compositional features across $ℓ$; time adds context across $t$.

The three axes do different work, and trading them against each other has different consequences. Adding width helps with representational capacity at a fixed depth; adding depth helps with feature compositionality at a fixed width; adding context window (longer sequences) helps with long-range dependency, but only if the gradient survives the BPTT product.

Gradient flow along two axes

In a single-layer RNN the gradient must survive time (the $t$-direction). In a stacked RNN it must additionally survive depth (the $ℓ$-direction). At every position $(t, ℓ)$ in the grid the local Jacobians are roughly

$$\frac{\partial h_{t}^{(ℓ)}}{\partial h_{t - 1}^{(ℓ)}} \ \text{(temporal)}, \qquad \frac{\partial h_{t}^{(ℓ)}}{\partial h_{t}^{(ℓ - 1)}} \ \text{(vertical)}.$$

A gradient signal from the loss at time $T$ in the top layer to the bottom-layer state at time $1$ must traverse a path of length $T + L - 2$ in this grid ($T - 1$ temporal steps and $L - 1$ vertical steps), and each step of the path is a multiplication by one of the two Jacobians. The full gradient is the sum over all such paths. Both axes are subject to the vanishing/exploding pathology of BPTT.
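To make the path picture concrete, the sketch below builds a small stack of vanilla $\tanh$ cells, writes down the two local Jacobians analytically, and multiplies them along one particular path from $(T, L)$ to $(1, 1)$. The weights, sizes and the choice of path (down the depth axis first, then back in time) are arbitrary assumptions for illustration; other paths contribute further terms to the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, n = 6, 3, 5   # hypothetical sizes; input and all layers share width n for simplicity

W_x = [rng.normal(0, 0.5 / np.sqrt(n), (n, n)) for _ in range(L)]  # vertical weights
W_h = [rng.normal(0, 0.5 / np.sqrt(n), (n, n)) for _ in range(L)]  # recurrent weights
xs = rng.normal(size=(T, n))

# Forward pass, storing h[t, l] for t = 0..T and l = 0..L (h[0, l] = 0, h[t, 0] = x_t).
h = np.zeros((T + 1, L + 1, n))
for t in range(1, T + 1):
    h[t, 0] = xs[t - 1]
    for l in range(1, L + 1):
        h[t, l] = np.tanh(W_x[l - 1] @ h[t, l - 1] + W_h[l - 1] @ h[t - 1, l])

def J_time(t, l):
    """Temporal Jacobian d h_t^(l) / d h_{t-1}^(l) of a tanh cell."""
    return np.diag(1.0 - h[t, l] ** 2) @ W_h[l - 1]

def J_depth(t, l):
    """Vertical Jacobian d h_t^(l) / d h_t^(l-1) of a tanh cell."""
    return np.diag(1.0 - h[t, l] ** 2) @ W_x[l - 1]

# One path: L - 1 vertical steps at time T, then T - 1 temporal steps in layer 1.
factors = [J_depth(T, l) for l in range(L, 1, -1)]
factors += [J_time(t, 1) for t in range(T, 1, -1)]

path = np.eye(n)
for J in factors:
    path = path @ J

print(len(factors))                 # T + L - 2 factors on the path
print(np.linalg.norm(path, 2))      # spectral norm of this path's contribution
```

With weights this small every factor is contractive, so the norm of the product shrinks with each additional step on either axis.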

Stacking with vanilla RNNs amplifies the problem

A stack of vanilla RNNs has gradients that vanish or explode along two axes at once. In the contractive case, if $γ_{time} < 1$ bounds the norm of one temporal Jacobian and $γ_{depth} < 1$ that of one vertical Jacobian, the gradient carried along a path of $T$ temporal steps and $L$ vertical steps is bounded above by $γ_{time}^{T} \cdot γ_{depth}^{L}$. Depth enters the bound exactly as time does: every extra layer costs a factor $γ_{depth}$, just as every extra step costs a factor $γ_{time}$, so with comparable contraction factors each added layer hurts the gradient as much as each added time step.
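A quick numerical reading of the bound, using made-up contraction factors of 0.9 on both axes (these values are illustrative assumptions, not measurements):

```python
def path_bound(gamma_time, gamma_depth, T, L):
    """Upper bound gamma_time^T * gamma_depth^L on the gradient along one path."""
    return gamma_time ** T * gamma_depth ** L

base = path_bound(0.9, 0.9, T=50, L=2)
one_more_layer = path_bound(0.9, 0.9, T=50, L=3)
one_more_step = path_bound(0.9, 0.9, T=51, L=2)

# With equal factors, an extra layer and an extra time step shrink the bound identically.
print(one_more_layer / base)
print(one_more_step / base)
```

Both ratios come out at the shared contraction factor, which is the sense in which depth and time are interchangeable in this bound.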

This is why stacked vanilla RNNs are essentially never used in practice. They train poorly even at modest depths. The standard recipe is to use LSTM or GRU cells at every layer, which fixes the temporal axis with the gated additive update, and (optionally) to add residual or skip connections between layers, which fixes the vertical axis with the same trick applied to depth.

Practical guidance

A few empirical regularities that have held up across many sequence-modelling benchmarks:

Use gated cells (LSTM or GRU) at every layer. Vanilla cells stack badly; the same vanishing-gradient logic that rules them out at long horizons rules them out at depth.
Two to four layers is the sweet spot for most tasks. A single layer is often a strong baseline; three or four layers typically improve on it; beyond four, returns diminish quickly and training becomes brittle. The deepest stacked LSTMs in production reached eight layers, in speech-recognition systems and in the encoder of Google’s neural machine translation system (an encoder-decoder model), but only with careful initialization and residual connections.
Layer widths usually match. Choosing $n_{ℓ} = n_{neurons}$ for all $ℓ$ simplifies the architecture without measurable cost in most settings.
Dropout goes between layers, not inside the recurrence. Dropping units across the time axis breaks the very dependencies the recurrence is meant to learn; dropping units between layer $ℓ$ and layer $ℓ + 1$ regularizes without disturbing temporal information flow. This is the recipe of Zaremba, Sutskever and Vinyals (2014).
Residual or skip connections help past three layers. Adding $h_{t}^{(ℓ)} \leftarrow h_{t}^{(ℓ)} + h_{t}^{(ℓ - 1)}$ (assuming matching widths) turns the depth axis into a residual stream, with the same benefits the cell state line provides on the time axis.
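The last two items can be sketched together in one time step of a GRU stack. This is an illustrative sketch under several assumptions: the GRU is one common variant (reset gate applied to the recurrent pre-activation), dropout is inverted dropout applied only to the vertical connection, the residual is skipped at the bottom layer because the input width may differ, and the stored recurrent state is the cell's own output rather than the residual sum. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h_prev, W, U, b):
    """One GRU step (one common variant). W: (3n, n_in), U: (3n, n), b: (3n,)."""
    n = h_prev.shape[0]
    zx, zh = W @ x + b, U @ h_prev
    z = sigmoid(zx[:n] + zh[:n])                   # update gate
    r = sigmoid(zx[n:2 * n] + zh[n:2 * n])         # reset gate
    h_tilde = np.tanh(zx[2 * n:] + r * zh[2 * n:])
    return (1.0 - z) * h_prev + z * h_tilde

def stack_step(x_t, hs, params, rng, p_drop=0.0, residual=True):
    """One time step with between-layer dropout and residual connections on depth."""
    below, new_hs = x_t, []
    for l, ((W, U, b), h_prev) in enumerate(zip(params, hs)):
        h = gru_cell(below, h_prev, W, U, b)
        new_hs.append(h)                           # recurrent path: never dropped
        out = h + below if (residual and l > 0 and below.shape == h.shape) else h
        if p_drop > 0.0 and l < len(params) - 1:   # dropout only between layers
            mask = rng.random(out.shape) >= p_drop
            out = out * mask / (1.0 - p_drop)
        below = out
    return new_hs, below                           # per-layer states, top output

rng = np.random.default_rng(0)
n_inputs, n, L = 3, 4, 3
params, n_below = [], n_inputs
for _ in range(L):
    params.append((rng.normal(0, 0.1, (3 * n, n_below)),
                   rng.normal(0, 0.1, (3 * n, n)),
                   np.zeros(3 * n)))
    n_below = n
hs = [np.zeros(n) for _ in range(L)]
for x_t in rng.normal(size=(5, n_inputs)):
    hs, top = stack_step(x_t, hs, params, rng, p_drop=0.5)
print(top.shape)  # (4,)
```

Because the dropout mask touches only the value handed to the layer above, the state each layer carries forward in time is left intact.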

Parameter count

Each layer is an independent recurrent cell with its own parameters. For a stack of $L$ identical LSTM layers of width $n_{neurons}$, with input width $n_{inputs}$ at the bottom and $n_{neurons}$ everywhere above,

$$P_{\text{stacked LSTM}} = 4\left(n_{neurons} n_{inputs} + n_{neurons}^{2} + n_{neurons}\right) + (L - 1) \cdot 4\left(n_{neurons}^{2} + n_{neurons}^{2} + n_{neurons}\right),$$

where:

the first term is the bottom layer, which sees $n_{inputs}$-dimensional input;
the second is the sum across the remaining $L - 1$ layers, each of which sees an $n_{neurons}$-dimensional input from the layer below.

For a stack of GRU layers, multiply the same expression by $3/4$ (see GRU for the parameter accounting per layer).
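The count is easy to check numerically. The helper below is a hypothetical function that evaluates the expression above, with a `gates` argument set to 4 for LSTM and 3 for GRU; it assumes one bias vector per gate, as in the formula.

```python
def stacked_param_count(n_inputs, n_neurons, L, gates=4):
    """Parameters of L stacked recurrent layers; gates=4 for LSTM, gates=3 for GRU."""
    bottom = gates * (n_neurons * n_inputs + n_neurons ** 2 + n_neurons)
    upper = gates * (n_neurons ** 2 + n_neurons ** 2 + n_neurons)
    return bottom + (L - 1) * upper

lstm_count = stacked_param_count(n_inputs=10, n_neurons=20, L=2)
gru_count = stacked_param_count(n_inputs=10, n_neurons=20, L=2, gates=3)
print(lstm_count)  # 5760
print(gru_count)   # 4320
```

Implementations that keep separate input and recurrent bias vectors (PyTorch's `nn.LSTM` is one example) report a count larger by $4 n_{neurons}$ per layer.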

A note on terminology

The literature uses a few interchangeable terms for the same idea: stacked RNN, multilayer RNN, deep RNN. They all mean the construction described above, with one set of independent parameters per layer. The term “deep RNN” is sometimes also used in a narrower sense for architectures that add depth inside the per-step transition function (e.g. Pascanu, Gulcehre, Cho and Bengio, 2014), but this is uncommon today. In modern usage, “depth” in a recurrent model means vertical stacking.

The trick of stacking layers is orthogonal to the choice of recurrent cell: vanilla, LSTM, GRU and any other recurrent unit can be stacked in exactly the same way, with the same equations and the same caveats about gradient flow along depth. The same is true of bidirectionality, the topic of the next note: any cell, any stack, can be made bidirectional.
