Putting it all together

The four layers of the previous notes (forget gate, input gate, candidate update, and output gate) act on a single shared input pair $(x_{t}, h_{t - 1})$ and on a single shared memory $c_{t - 1}$. This note collects the complete forward pass of one LSTM cell in one place, then states the mini-batch form, the parameter count, and the initialization conventions used throughout the rest of the section.

One cell, one time step

Given $x_{t} \in R^{n_{inputs}}$ , $h_{t - 1} \in R^{n_{neurons}}$ and $c_{t - 1} \in R^{n_{neurons}}$ , the LSTM cell computes

$$
\begin{aligned}
f_{t} &= \sigma (W_{x f} x_{t} + W_{h f} h_{t - 1} + b_{f}) && \text{forget gate} \\
i_{t} &= \sigma (W_{x i} x_{t} + W_{h i} h_{t - 1} + b_{i}) && \text{input gate} \\
\tilde{c}_{t} &= \tanh (W_{x c} x_{t} + W_{h c} h_{t - 1} + b_{c}) && \text{candidate update} \\
o_{t} &= \sigma (W_{x o} x_{t} + W_{h o} h_{t - 1} + b_{o}) && \text{output gate} \\
c_{t} &= f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c}_{t} && \text{cell-state update} \\
h_{t} &= o_{t} \odot \tanh (c_{t}) && \text{hidden-state read-out}
\end{aligned}
$$

and returns $(h_{t}, c_{t})$ .

The first four lines are all the same affine-then-pointwise pattern, with separate parameters per gate and the same input pair going into each. The last two lines are the entire dynamics of the cell: a gated additive update on the memory line, followed by a gated, bounded read-out of the hidden state.
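The six equations translate almost line for line into NumPy. The following is a minimal sketch, not a reference implementation: the names `lstm_step` and `init_params`, the dictionary layout of the parameters, and the small Gaussian initialization are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a single example (all arguments are 1-D arrays)."""
    def affine(gate):
        # Same input pair (x_t, h_prev) for every gate, separate parameters per gate.
        return (params["W_x" + gate] @ x_t
                + params["W_h" + gate] @ h_prev
                + params["b_" + gate])

    f_t = sigmoid(affine("f"))        # forget gate
    i_t = sigmoid(affine("i"))        # input gate
    c_tilde = np.tanh(affine("c"))    # candidate update
    o_t = sigmoid(affine("o"))        # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # cell-state update
    h_t = o_t * np.tanh(c_t)             # hidden-state read-out
    return h_t, c_t

def init_params(n_inputs, n_neurons, rng):
    # Hypothetical initialization: small Gaussian weights, zero biases.
    params = {}
    for gate in "fico":
        params["W_x" + gate] = rng.normal(scale=0.1, size=(n_neurons, n_inputs))
        params["W_h" + gate] = rng.normal(scale=0.1, size=(n_neurons, n_neurons))
        params["b_" + gate] = np.zeros(n_neurons)
    return params

rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 5
params = init_params(n_inputs, n_neurons, rng)
x_t = rng.normal(size=n_inputs)
h_t, c_t = lstm_step(x_t, np.zeros(n_neurons), np.zeros(n_neurons), params)
print(h_t.shape, c_t.shape)
```

Both returned vectors have length $n_{neurons}$, and every entry of `h_t` lies strictly between $-1$ and $1$ because it is a sigmoid output multiplied by a tanh output.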

The whole architecture in two equations

Strip the gate definitions away and the LSTM reduces to
$$
c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c}_{t}, \qquad h_{t} = o_{t} \odot \tanh (c_{t}) .
$$
Everything else (the four sigmoid/tanh layers, the shared input, the fourfold parameter count relative to a vanilla RNN) exists to make these two element-wise equations behave well. The first lets the gradient survive across long gaps: along the direct memory path, its Jacobian with respect to $c_{t - 1}$ is $\operatorname{diag} (f_{t})$, which stays close to the identity wherever the forget gate is near $1$. The second guarantees that the hidden state is bounded, since every entry of $h_{t}$ lies in $(-1, 1)$, and selectively informative. Reading the LSTM as “two equations with learned coefficients” makes the design intent unambiguous.
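To see why the diagonal Jacobian matters, the one-step result can be chained over a gap of $k$ steps. The following is a sketch along the direct memory path only, treating the gates as constants and ignoring their indirect dependence on earlier cell states through the hidden state:

$$
\frac{\partial c_{t}}{\partial c_{t - k}} \approx \prod_{j = 0}^{k - 1} \operatorname{diag} (f_{t - j}) = \operatorname{diag} (f_{t} \odot f_{t - 1} \odot \cdots \odot f_{t - k + 1}) .
$$

Each memory slot therefore has its own scalar retention factor: a slot whose forget gate stays near $1$ passes gradient almost unchanged, while one whose gate stays near $0.5$ attenuates it by about $2^{-k}$.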

Mini-batch form

For a mini-batch of $B$ sequences processed in parallel, every per-example vector becomes a matrix with one column per example. The parameters are shared across the batch (and across the unrolled time axis), so their shapes are unchanged. The cell becomes

$$
\begin{aligned}
F_{t} &= \sigma (W_{x f} X_{t} + W_{h f} H_{t - 1} + b_{f}), \\
I_{t} &= \sigma (W_{x i} X_{t} + W_{h i} H_{t - 1} + b_{i}), \\
\tilde{C}_{t} &= \tanh (W_{x c} X_{t} + W_{h c} H_{t - 1} + b_{c}), \\
O_{t} &= \sigma (W_{x o} X_{t} + W_{h o} H_{t - 1} + b_{o}), \\
C_{t} &= F_{t} \odot C_{t - 1} + I_{t} \odot \tilde{C}_{t}, \\
H_{t} &= O_{t} \odot \tanh (C_{t}) .
\end{aligned}
$$

All element-wise operations apply column-wise, so the entire batch is processed in parallel as a small number of dense matrix multiplications. This is the form that actually runs on hardware.
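As a quick check that the batched equations really are the per-example equations applied column by column, the sketch below runs one batched step and compares each column against a single-example step. The function name `lstm_step_batch`, the per-gate dictionaries, and the random test data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_batch(X_t, H_prev, C_prev, W_x, W_h, b):
    """One LSTM step in column layout: every data matrix has one column per example."""
    def affine(g):
        # b[g][:, None] has shape (n_neurons, 1) and is broadcast across the columns.
        return W_x[g] @ X_t + W_h[g] @ H_prev + b[g][:, None]

    F_t = sigmoid(affine("f"))
    I_t = sigmoid(affine("i"))
    C_tilde = np.tanh(affine("c"))
    O_t = sigmoid(affine("o"))
    C_t = F_t * C_prev + I_t * C_tilde
    H_t = O_t * np.tanh(C_t)
    return H_t, C_t

rng = np.random.default_rng(0)
n_inputs, n_neurons, B = 3, 4, 6
W_x = {g: rng.normal(scale=0.1, size=(n_neurons, n_inputs)) for g in "fico"}
W_h = {g: rng.normal(scale=0.1, size=(n_neurons, n_neurons)) for g in "fico"}
b = {g: rng.normal(scale=0.1, size=n_neurons) for g in "fico"}

X_t = rng.normal(size=(n_inputs, B))
H_prev = np.tanh(rng.normal(size=(n_neurons, B)))
C_prev = rng.normal(size=(n_neurons, B))

H_t, C_t = lstm_step_batch(X_t, H_prev, C_prev, W_x, W_h, b)

# Re-run each example on its own (a batch of one column) and compare.
max_diff = 0.0
for k in range(B):
    cols = slice(k, k + 1)
    H_k, C_k = lstm_step_batch(X_t[:, cols], H_prev[:, cols], C_prev[:, cols], W_x, W_h, b)
    max_diff = max(max_diff,
                   np.abs(H_t[:, cols] - H_k).max(),
                   np.abs(C_t[:, cols] - C_k).max())

print(H_t.shape, C_t.shape, max_diff)
```

The parameters are passed unchanged to both calls; only the number of columns in the data differs.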

Dimensions used in this note (master table for the section)

| symbol | role | single example | mini-batch ($B$) |
| --- | --- | --- | --- |
| $x_{t}, X_{t}$ | input | $n_{inputs}$ | $n_{inputs} \times B$ |
| $h_{t - 1}, h_{t}, H_{t - 1}, H_{t}$ | hidden state | $n_{neurons}$ | $n_{neurons} \times B$ |
| $c_{t - 1}, c_{t}, C_{t - 1}, C_{t}$ | cell state | $n_{neurons}$ | $n_{neurons} \times B$ |
| $f_{t}, i_{t}, o_{t}, F_{t}, I_{t}, O_{t}$ | gates | $n_{neurons}$ | $n_{neurons} \times B$ |
| $\tilde{c}_{t}, \tilde{C}_{t}$ | candidate update | $n_{neurons}$ | $n_{neurons} \times B$ |
| $W_{x ∙}$ | input-to-hidden weights, $∙ \in \{f, i, c, o\}$ | $n_{neurons} \times n_{inputs}$ | unchanged |
| $W_{h ∙}$ | hidden-to-hidden weights | $n_{neurons} \times n_{neurons}$ | unchanged |
| $b_{∙}$ | bias | $n_{neurons}$ | $n_{neurons}$ |

Broadcasting: only in the bias addition

Across the entire LSTM cell, the only operation that requires broadcasting is the bias term in each affine map. In mini-batch form, $b_{∙} \in R^{n_{neurons}}$ is added to a matrix of shape $n_{neurons} \times B$ ; the bias is broadcast across the $B$ columns.

Every element-wise product $⊙$ in the cell acts on two operands of identical shape:

| product | both operands are | shape (single / batch) |
| --- | --- | --- |
| $f_{t} ⊙ c_{t - 1}$ | gate $⊙$ cell state | $n_{neurons}$ / $n_{neurons} \times B$ |
| $i_{t} ⊙ \tilde{c}_{t}$ | gate $⊙$ candidate | $n_{neurons}$ / $n_{neurons} \times B$ |
| $o_{t} ⊙ \tanh (c_{t})$ | gate $⊙$ bounded cell state | $n_{neurons}$ / $n_{neurons} \times B$ |

This is not a coincidence: the weight matrices $W_{h ∙} \in R^{n_{neurons} \times n_{neurons}}$ are sized precisely so that every gate output matches the width of the cell state it modulates. The LSTM is designed so that each “switch” is a scalar attached to one slot of memory, and the $⊙$ operations therefore stay shape-aligned by construction.
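The shape bookkeeping can be made concrete in NumPy. This is an illustrative sketch in the column layout used in this note (one column per example); the variable names and toy values are assumptions.

```python
import numpy as np

n_neurons, B = 4, 3
Z = np.zeros((n_neurons, B))          # stand-in for W_x X_t + W_h H_{t-1}
b = np.array([1.0, 2.0, 3.0, 4.0])    # one bias per neuron

# Reshape the bias to (n_neurons, 1) so it is broadcast across the B columns.
Y = Z + b[:, None]

# Adding the 1-D bias directly aligns it with the LAST axis (length B), which
# fails here because n_neurons != B. If the two sizes happened to be equal,
# it would silently add the bias along the wrong axis instead.
try:
    Z + b
    naive_add_failed = False
except ValueError:
    naive_add_failed = True

# The element-wise products need no broadcasting: both operands share one shape.
F_t = np.full((n_neurons, B), 0.5)
C_prev = np.ones((n_neurons, B))
gated = F_t * C_prev

print(Y.shape, naive_add_failed, gated.shape)
```

Frameworks that store the batch along the first axis instead (shape $B \times n_{neurons}$) get the bias broadcast for free, because the bias then lines up with the last axis.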

The fused weight matrix used in practice

Modern frameworks and libraries (PyTorch, TensorFlow, cuDNN) stack the eight weight matrices into two fused tensors, one per source, and perform all four affine maps with one matrix multiplication per source:
$$
\begin{bmatrix} Z_{t}^{f} \\ Z_{t}^{i} \\ Z_{t}^{c} \\ Z_{t}^{o} \end{bmatrix}
=
\begin{bmatrix} W_{x f} \\ W_{x i} \\ W_{x c} \\ W_{x o} \end{bmatrix} X_{t}
+
\begin{bmatrix} W_{h f} \\ W_{h i} \\ W_{h c} \\ W_{h o} \end{bmatrix} H_{t - 1}
+
\begin{bmatrix} b_{f} \\ b_{i} \\ b_{c} \\ b_{o} \end{bmatrix} .
$$
The four blocks of the result are then sliced apart and passed through their respective activations. This is purely an implementation optimization: it reduces kernel-launch overhead and hands a single large GEMM per source to the BLAS library. Mathematically it is identical to the four separate affine maps written above. PyTorch’s nn.LSTMCell exposes this fused form through two weight tensors, weight_ih (input-to-hidden) and weight_hh (hidden-to-hidden), of shapes $(4 n_{neurons}) \times n_{inputs}$ and $(4 n_{neurons}) \times n_{neurons}$. The order in which a library stacks the four blocks is a convention and need not match the $f, i, c, o$ order used here.
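The equivalence between the fused and the separate affine maps is easy to verify numerically. The sketch below stacks the blocks in the $f, i, c, o$ order of this note; the variable names and random data are assumptions, and it does not reproduce the internal layout of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons, B = 3, 4, 5
gates = "fico"

W_x = {g: rng.normal(size=(n_neurons, n_inputs)) for g in gates}
W_h = {g: rng.normal(size=(n_neurons, n_neurons)) for g in gates}
b = {g: rng.normal(size=n_neurons) for g in gates}
X_t = rng.normal(size=(n_inputs, B))
H_prev = rng.normal(size=(n_neurons, B))

# Stack the four blocks row-wise, giving one fused tensor per source.
W_x_fused = np.concatenate([W_x[g] for g in gates], axis=0)   # (4 n_neurons, n_inputs)
W_h_fused = np.concatenate([W_h[g] for g in gates], axis=0)   # (4 n_neurons, n_neurons)
b_fused = np.concatenate([b[g] for g in gates])               # (4 n_neurons,)

# One matrix multiplication per source, then slice the result into four blocks.
Z = W_x_fused @ X_t + W_h_fused @ H_prev + b_fused[:, None]   # (4 n_neurons, B)
Z_blocks = np.split(Z, 4, axis=0)

# Reference: the four separate affine maps.
Z_separate = [W_x[g] @ X_t + W_h[g] @ H_prev + b[g][:, None] for g in gates]

fused_matches = all(np.allclose(zf, zs) for zf, zs in zip(Z_blocks, Z_separate))
print(W_x_fused.shape, W_h_fused.shape, fused_matches)
```

The activations are applied only after slicing: sigmoid to the $f$, $i$ and $o$ blocks, tanh to the $c$ block.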

The fused matmul parallelizes the gates, not time

The fused weight matrix turns the four affine maps into a single GEMM, and the mini-batch turns the $B$ sequences into extra columns of that GEMM. Both of those axes are parallelized. The one axis that stays strictly sequential is time: $c_{t}$ and $h_{t}$ cannot be computed until $c_{t - 1}$ and $h_{t - 1}$ exist, so a length-$T$ sequence forces $T$ dependent steps that no amount of hardware can collapse into fewer.

This is the structural reason LSTMs are slow on long sequences, and the reason the architectures that followed were designed around it. Self-attention removes the recurrence entirely and processes all positions at once; linear state-space models keep a recurrence but choose one whose decay admits a parallel scan, trading the LSTM’s per-step nonlinearity for parallelism across the sequence. The LSTM’s inductive bias is excellent; its sequential dependency along time is the price it pays for it.

Parameter count and initialization

Each of the four gates has weight matrices $W_{x ∙} \in R^{n_{neurons} \times n_{inputs}}$ , $W_{h ∙} \in R^{n_{neurons} \times n_{neurons}}$ and a bias $b_{∙} \in R^{n_{neurons}}$ . The total parameter count of one LSTM cell is therefore

$$
P_{LSTM} = 4 (n_{neurons} n_{inputs} + n_{neurons}^{2} + n_{neurons}) .
$$

Where the count comes from

One affine map of the form $W_{x} x_{t} + W_{h} h_{t - 1} + b$ contains:

$W_{x}$ : a matrix of shape $n_{neurons} \times n_{inputs}$ , i.e. $n_{neurons} \cdot n_{inputs}$ scalars;

$W_{h}$ : a matrix of shape $n_{neurons} \times n_{neurons}$ , i.e. $n_{neurons}^{2}$ scalars;

$b$ : a vector of length $n_{neurons}$ .

Summing, one affine map has $n_{neurons} n_{inputs} + n_{neurons}^{2} + n_{neurons}$ parameters. The LSTM cell has four such independent maps (one per gate, plus the candidate update), with no shared parameters between them, giving the factor of $4$ .

For comparison, the vanilla RNN has a single such affine map and therefore $n_{neurons} n_{inputs} + n_{neurons}^{2} + n_{neurons}$ parameters, exactly one quarter of the LSTM count at the same hidden width.

The factor of $4$ is the price of preservation discussed in the overview: three of the four maps (the forget, input, and output gates) exist solely to make the memory line behave as a stable additive register.
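The count is simple enough to check by hand, but a small helper makes it easy to try different sizes. The function names and the example sizes ($n_{inputs} = 10$, $n_{neurons} = 20$) are illustrative assumptions.

```python
def affine_map_param_count(n_inputs, n_neurons):
    """Parameters of one map W_x x + W_h h + b."""
    return n_neurons * n_inputs + n_neurons ** 2 + n_neurons

def rnn_param_count(n_inputs, n_neurons):
    # A vanilla RNN cell has a single affine map.
    return affine_map_param_count(n_inputs, n_neurons)

def lstm_param_count(n_inputs, n_neurons):
    # Four independent maps: forget, input, candidate, output.
    return 4 * affine_map_param_count(n_inputs, n_neurons)

n_inputs, n_neurons = 10, 20
print(rnn_param_count(n_inputs, n_neurons))   # 200 + 400 + 20 = 620
print(lstm_param_count(n_inputs, n_neurons))  # 4 * 620 = 2480
```

Counts reported by a framework can differ from this formula. PyTorch’s nn.LSTMCell, for example, keeps two bias vectors (bias_ih and bias_hh), which adds a further $4 n_{neurons}$ parameters.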

The initial states are conventionally fixed:

$$
h_{0} = 0, \qquad c_{0} = 0 .
$$

A simple but consequential initialization detail (Jozefowicz, Zaremba and Sutskever, 2015) deserves restating: the forget-gate bias $b_{f}$ is initialized close to $1$ rather than $0$, so that $\sigma (b_{f}) \approx 0.73$ at the start of training. With a zero bias, $\sigma (0) = 0.5$: a freshly initialized LSTM forgets roughly half of its memory at every step, and the cell-state path is effectively dead long before gradient descent can teach the gate to keep things. The justification is given at the end of the Forget gate note.
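The effect of the bias compounds over time. As a rough illustration, assume the weights contribute nothing to the forget gate's pre-activation, so that the gate equals $\sigma (b_{f})$ at every step (an idealization of a freshly initialized cell, not a claim about trained networks). The fraction of a memory slot that survives $k$ steps is then $\sigma (b_{f})^{k}$. The helper name `retention` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def retention(b_f, k):
    """Fraction of the cell state kept after k steps if the forget gate sits at sigmoid(b_f)."""
    return sigmoid(b_f) ** k

for k in (1, 5, 10):
    print(k, retention(0.0, k), retention(1.0, k))
```

With $b_{f} = 0$ the retained fraction after ten steps is $0.5^{10}$, just under one thousandth; with $b_{f} = 1$ it is a few percent, which is still a decay but leaves far more signal for gradient descent to work with.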

Unrolling and training

An LSTM layer processes a sequence by unrolling the cell across $T$ time steps, with the same parameters at every step. Forward, this is just the iterated application of the equations above, starting from $h_{0} = c_{0} = 0$ . Backward, this is Backpropagation Through Time on the resulting computational graph, in any of its practical variants (full BPTT, truncated BPTT with detach() at chunk boundaries, or random-chunk training when long context is not informative).
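The forward half of this description can be sketched as a plain loop over time. The example below covers only the forward unrolling for a single sequence (no backward pass, no truncation); the function name `lstm_forward`, the parameter dictionary, and the initialization scale are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(X, params, n_neurons):
    """Unroll one LSTM cell over a sequence X of shape (T, n_inputs)."""
    h = np.zeros(n_neurons)   # h_0 = 0
    c = np.zeros(n_neurons)   # c_0 = 0
    hidden_states = []
    for x_t in X:             # strictly sequential: step t needs h and c from step t - 1
        z = {g: params["W_x" + g] @ x_t + params["W_h" + g] @ h + params["b_" + g]
             for g in "fico"}
        f_t, i_t, o_t = sigmoid(z["f"]), sigmoid(z["i"]), sigmoid(z["o"])
        c = f_t * c + i_t * np.tanh(z["c"])
        h = o_t * np.tanh(c)
        hidden_states.append(h)
    return np.stack(hidden_states), c

rng = np.random.default_rng(0)
T, n_inputs, n_neurons = 7, 3, 5
params = {}
for g in "fico":
    params["W_x" + g] = rng.normal(scale=0.1, size=(n_neurons, n_inputs))
    params["W_h" + g] = rng.normal(scale=0.1, size=(n_neurons, n_neurons))
    params["b_" + g] = np.zeros(n_neurons)
params["b_f"] = np.ones(n_neurons)   # forget-gate bias initialized to 1

X = rng.normal(size=(T, n_inputs))
H, c_T = lstm_forward(X, params, n_neurons)
print(H.shape, c_T.shape)
```

The same `params` dictionary is used at every iteration, which is exactly the parameter sharing across time that Backpropagation Through Time has to account for when it sums gradient contributions over the $T$ steps.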

The next note, Gradient in LSTM, works out the backward pass on this graph and shows explicitly why the cell-state line carries gradient across long gaps, while the hidden-state line, despite passing through all four gates, behaves benignly thanks to the gated additive structure underneath.
