
The four gates of the previous notes (forget, input, candidate update, and output) act on a single shared input pair $(h_{t-1}, x_t)$ and on a single shared memory $c_{t-1}$. This note collects the complete forward pass of one LSTM cell in one place, then states the mini-batch form, the parameter count, and the initialization conventions used throughout the rest of the section.
One cell, one time step
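Given the current input $x_t \in \mathbb{R}^{d}$, the previous hidden state $h_{t-1} \in \mathbb{R}^{n}$ and the previous cell state $c_{t-1} \in \mathbb{R}^{n}$, the LSTM cell computes

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

and returns $(h_t, c_t)$.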
The first four lines are all the same affine-then-pointwise pattern, with separate parameters per gate and the same input pair going into each. The last two lines are the entire dynamics of the cell: a gated additive update on the memory line, followed by a gated, bounded read-out of the hidden state.
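The forward pass described above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function name `lstm_step`, the dictionary layout of the parameters, and the letters `W`, `U`, `b` for input-to-hidden weights, hidden-to-hidden weights and bias are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single example.

    x: (d,), h_prev and c_prev: (n,).
    params[g] = (W, U, b) with W: (n, d), U: (n, n), b: (n,)
    for g in "f", "i", "c", "o" (hypothetical layout).
    """
    def affine(g):
        W, U, b = params[g]
        return W @ x + U @ h_prev + b   # same input pair for every map

    f = sigmoid(affine("f"))            # forget gate
    i = sigmoid(affine("i"))            # input gate
    c_tilde = np.tanh(affine("c"))      # candidate update
    o = sigmoid(affine("o"))            # output gate
    c = f * c_prev + i * c_tilde        # gated additive update of the memory
    h = o * np.tanh(c)                  # gated, bounded read-out
    return h, c

rng = np.random.default_rng(0)
d, n = 3, 4
params = {g: (rng.normal(size=(n, d)), rng.normal(size=(n, n)), np.zeros(n))
          for g in "fico"}
x = rng.normal(size=d)
h, c = lstm_step(x, np.zeros(n), np.zeros(n), params)
```

Every entry of `h` lies strictly inside $(-1, 1)$, because it is the product of a sigmoid output and a tanh output.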
The whole architecture in two equations
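Strip the gate definitions away and the LSTM reduces to

$$
\begin{aligned}
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$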
Everything else (the four sigmoid/tanh layers, the shared input, the fourfold parameter count relative to a vanilla RNN) exists to make these two element-wise equations behave well. The first lets the gradient survive across long gaps, because its Jacobian with respect to $c_{t-1}$, with the gates held fixed, is $\operatorname{diag}(f_t)$, which stays close to the identity whenever the forget gate is close to $1$. The second guarantees that the hidden state is bounded and selectively informative. Reading the LSTM as “two equations with learned coefficients” makes the design intent unambiguous.
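The Jacobian claim can be checked numerically. The sketch below holds the gate values fixed (so it ignores the indirect dependence of the gates on the previous state through the hidden state) and compares a central finite-difference Jacobian of the memory update with the diagonal matrix built from the forget gate. The variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
f = rng.uniform(0.0, 1.0, size=n)         # forget gate, held fixed
i = rng.uniform(0.0, 1.0, size=n)         # input gate, held fixed
c_tilde = rng.uniform(-1.0, 1.0, size=n)  # candidate update, held fixed

def memory_update(c_prev):
    return f * c_prev + i * c_tilde

c_prev = rng.normal(size=n)
eps = 1e-6
J = np.zeros((n, n))
for k in range(n):
    e = np.zeros(n)
    e[k] = eps
    # Column k of the Jacobian by central differences.
    J[:, k] = (memory_update(c_prev + e) - memory_update(c_prev - e)) / (2 * eps)

max_error = np.abs(J - np.diag(f)).max()
```

Because the update is linear in the previous cell state, the finite-difference Jacobian agrees with the diagonal forget-gate matrix up to rounding error.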
Mini-batch form
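For a mini-batch of $m$ sequences processed in parallel, every per-example vector becomes a matrix with one column per example: $X_t \in \mathbb{R}^{d \times m}$ and $H_t, C_t \in \mathbb{R}^{n \times m}$. The parameters are shared across the batch (and across the unrolled time axis), so their shapes are unchanged. The cell becomes

$$
\begin{aligned}
F_t &= \sigma(W_f X_t + U_f H_{t-1} + b_f) \\
I_t &= \sigma(W_i X_t + U_i H_{t-1} + b_i) \\
\tilde{C}_t &= \tanh(W_c X_t + U_c H_{t-1} + b_c) \\
O_t &= \sigma(W_o X_t + U_o H_{t-1} + b_o) \\
C_t &= F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \\
H_t &= O_t \odot \tanh(C_t)
\end{aligned}
$$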
All element-wise operations apply column-wise, so the entire batch is processed in parallel as a small number of dense matrix multiplications. This is the form that actually runs on hardware.
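A small NumPy sketch of the mini-batch form, assuming the one-column-per-example layout used in this note. The function name `lstm_step_batch` and the parameter dictionary are hypothetical. The final lines check that a column of the batched result equals the cell applied to that example alone.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_batch(X, H_prev, C_prev, params):
    """X: (d, m); H_prev and C_prev: (n, m); one column per example."""
    def affine(g):
        W, U, b = params[g]              # W: (n, d), U: (n, n), b: (n, 1)
        return W @ X + U @ H_prev + b    # b is broadcast across the m columns

    F = sigmoid(affine("f"))
    I = sigmoid(affine("i"))
    C_tilde = np.tanh(affine("c"))
    O = sigmoid(affine("o"))
    C = F * C_prev + I * C_tilde
    H = O * np.tanh(C)
    return H, C

rng = np.random.default_rng(2)
d, n, m = 3, 4, 6
params = {g: (rng.normal(size=(n, d)), rng.normal(size=(n, n)),
              rng.normal(size=(n, 1))) for g in "fico"}
X = rng.normal(size=(d, m))
H0 = rng.normal(size=(n, m))
C0 = rng.normal(size=(n, m))
H, C = lstm_step_batch(X, H0, C0, params)

# Run example j on its own (a batch of one column) for comparison.
j = 2
Hj, Cj = lstm_step_batch(X[:, [j]], H0[:, [j]], C0[:, [j]], params)
```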
Dimensions used in this note (master table for the section)
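Here $d$ is the input dimension, $n$ the hidden width, and $m$ the batch size.

| symbol | role | single example | mini-batch ($m$) |
|---|---|---|---|
| $x_t$ | input | $d$ | $d \times m$ |
| $h_t$ | hidden state | $n$ | $n \times m$ |
| $c_t$ | cell state | $n$ | $n \times m$ |
| $f_t, i_t, o_t$ | gates | $n$ | $n \times m$ |
| $\tilde{c}_t$ | candidate update | $n$ | $n \times m$ |
| $W_f, W_i, W_c, W_o$ | input-to-hidden weights | $n \times d$ | idem |
| $U_f, U_i, U_c, U_o$ | hidden-to-hidden weights | $n \times n$ | idem |
| $b_f, b_i, b_c, b_o$ | bias | $n$ | $n$, broadcast to $n \times m$ |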
Broadcasting: only in the bias addition
Across the entire LSTM cell, the only operation that requires broadcasting is the bias term in each affine map. In mini-batch form, a bias $b \in \mathbb{R}^{n}$ is added to a matrix of shape $n \times m$; the bias is broadcast across the $m$ columns.
Every element-wise product in the cell acts on two operands of identical shape:
| product | both operands are | shape (single / batch) |
|---|---|---|
| $f_t \odot c_{t-1}$ | gate, cell state | $n$ / $n \times m$ |
| $i_t \odot \tilde{c}_t$ | gate, candidate | $n$ / $n \times m$ |
| $o_t \odot \tanh(c_t)$ | gate, bounded cell state | $n$ / $n \times m$ |

This is not a coincidence: the weight matrices are sized precisely so that every gate output matches the width of the cell state it modulates. The LSTM is designed so that each “switch” is a scalar attached to one slot of memory, and the operations therefore stay shape-aligned by construction.
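The two shape rules can be seen directly in NumPy. The sketch below assumes the bias is stored as a column of shape `(n, 1)`, which is one common way to make the broadcast across columns explicit. It also shows that a flat bias of shape `(n,)` does not line up with an `(n, m)` matrix when the two sizes differ.

```python
import numpy as np

n, m = 4, 3
Z = np.zeros((n, m))                         # stands in for W X + U H, shape (n, m)
b = np.arange(n, dtype=float).reshape(n, 1)  # bias as a column, shape (n, 1)

A = Z + b                                    # b is copied across the m columns

gate = np.full((n, m), 0.5)
state = np.ones((n, m))
gated = gate * state                         # identical shapes, no broadcasting

# A flat bias of shape (n,) is matched against the last axis (length m),
# so it fails here because n != m.
try:
    Z + b.ravel()
    flat_bias_ok = True
except ValueError:
    flat_bias_ok = False
```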
The fused weight matrix used in practice
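Modern frameworks (PyTorch, TensorFlow, cuDNN) store the eight weight matrices as two fused tensors, one per source, and perform all four affine maps with one matrix multiplication per source:

$$
\begin{bmatrix} z_f \\ z_i \\ z_c \\ z_o \end{bmatrix}
=
\underbrace{\begin{bmatrix} W_f \\ W_i \\ W_c \\ W_o \end{bmatrix}}_{W_{ih} \,\in\, \mathbb{R}^{4n \times d}} x_t
+
\underbrace{\begin{bmatrix} U_f \\ U_i \\ U_c \\ U_o \end{bmatrix}}_{W_{hh} \,\in\, \mathbb{R}^{4n \times n}} h_{t-1}
+
\begin{bmatrix} b_f \\ b_i \\ b_c \\ b_o \end{bmatrix}
$$

The four blocks of the result are then sliced apart and passed through their respective activations; the order in which the blocks are stacked is a framework convention. This is purely an implementation optimization: it reduces kernel-launch overhead and exposes a single large GEMM to BLAS. Mathematically it is identical to the four separate affine maps written above. PyTorch’s `nn.LSTMCell` exposes this fused form through two weight tensors, `weight_ih` (input-to-hidden, shape $4n \times d$) and `weight_hh` (hidden-to-hidden, shape $4n \times n$).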
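The equivalence between the fused and the separate form can be sketched in NumPy without any framework. The block order `i, f, g, o` used here (with `g` standing for the candidate update) follows the order documented for PyTorch's LSTM weights; the dictionary names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 4
order = ["i", "f", "g", "o"]            # "g" is the candidate update

W = {k: rng.normal(size=(n, d)) for k in order}   # input-to-hidden blocks
U = {k: rng.normal(size=(n, n)) for k in order}   # hidden-to-hidden blocks
b = {k: rng.normal(size=n) for k in order}

# Stack the four blocks along the row axis to build the fused tensors.
W_ih = np.concatenate([W[k] for k in order], axis=0)   # (4n, d)
W_hh = np.concatenate([U[k] for k in order], axis=0)   # (4n, n)
b_fused = np.concatenate([b[k] for k in order])        # (4n,)

x = rng.normal(size=d)
h_prev = rng.normal(size=n)

# One matrix multiplication per source, then slice into four blocks.
pre = W_ih @ x + W_hh @ h_prev + b_fused
a_i, a_f, a_g, a_o = np.split(pre, 4)

# The four separate affine maps, for comparison.
separate = {k: W[k] @ x + U[k] @ h_prev + b[k] for k in order}
```

Note that PyTorch keeps two bias vectors per cell (`bias_ih` and `bias_hh`), so its stored parameter count is slightly larger than the single-bias formulation used in this note.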
The fused matmul parallelizes the gates, not time
The fused weight matrix turns the four affine maps into a single GEMM, and the mini-batch turns the $m$ sequences into extra columns of that GEMM. Both of those axes are parallelized. The one axis that stays strictly sequential is time: $h_t$ and $c_t$ cannot be computed until $h_{t-1}$ and $c_{t-1}$ exist, so a length-$T$ sequence forces $T$ dependent steps that no amount of hardware can collapse into fewer.
This is the structural reason LSTMs are slow on long sequences, and the reason the architectures that followed were designed around it. Self-attention removes the recurrence entirely and processes all positions at once; linear state-space models keep a recurrence but choose one whose decay admits a parallel scan, trading the LSTM’s per-step nonlinearity for parallelism across the sequence. The LSTM’s inductive bias is excellent; its sequential dependency along time is the price it pays for it.
Parameter count and initialization
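Each of the four affine maps (three gates and the candidate update) has weight matrices $W \in \mathbb{R}^{n \times d}$, $U \in \mathbb{R}^{n \times n}$ and a bias $b \in \mathbb{R}^{n}$. The total parameter count of one LSTM cell is therefore

$$
4\,(nd + n^2 + n) = 4n\,(d + n + 1).
$$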
Where the count comes from
One affine map of the form $W x_t + U h_{t-1} + b$ contains:
- $W$: a matrix of shape $n \times d$, i.e. $nd$ scalars;
- $U$: a matrix of shape $n \times n$, i.e. $n^2$ scalars;
- $b$: a vector of length $n$.
Summing, one affine map has $nd + n^2 + n = n(d + n + 1)$ parameters. The LSTM cell has four such independent maps (three gates plus the candidate update), with no shared parameters between them, giving the factor of $4$.
For comparison, the vanilla RNN has a single such affine map and therefore $n(d + n + 1)$ parameters, exactly one quarter of the LSTM count at the same hidden width.
The factor of $4$ is the price of preservation discussed in the overview: three of the four maps are gates that exist solely to make the memory line behave as a stable additive register.
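The count is easy to verify with a few lines of Python. The function names below are illustrative, and the formula assumes one bias vector per affine map, as in this note.

```python
def affine_param_count(d, n):
    """Parameters of one map W x + U h + b with W: (n, d), U: (n, n), b: (n,)."""
    return n * d + n * n + n

def rnn_param_count(d, n):
    return affine_param_count(d, n)        # a single affine map

def lstm_param_count(d, n):
    return 4 * affine_param_count(d, n)    # three gates plus the candidate

d, n = 128, 256
print(lstm_param_count(d, n))   # 394240
print(rnn_param_count(d, n))    # 98560
```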
The initial states are conventionally fixed:

$$
h_0 = 0, \qquad c_0 = 0.
$$
A simple but consequential initialization detail (Jozefowicz, Zaremba and Sutskever, 2015) deserves restating: the forget-gate bias $b_f$ is initialized close to $1$ rather than $0$, so that $f_t$ sits well above $\tfrac{1}{2}$ (around $\sigma(1) \approx 0.73$ when the weights are small) at the start of training. Without this, a freshly initialized LSTM has $f_t \approx \sigma(0) = 0.5$ and forgets roughly half of its memory at every step, and the cell-state path is effectively dead long before gradient descent can teach the gate to keep things. The justification is given at the end of the Forget gate note.
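The effect of the bias can be made concrete with a back-of-the-envelope calculation. The sketch assumes an idealized cell in which the forget gate equals the sigmoid of its bias at every step (small weights) and nothing new is written to memory; `retained_fraction` is a name made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def retained_fraction(forget_bias, steps):
    """Fraction of the initial cell state left after `steps` updates,
    assuming f_t = sigmoid(forget_bias) at every step and no new writes."""
    return sigmoid(forget_bias) ** steps

for bias in (0.0, 1.0):
    kept = retained_fraction(bias, 10)
    print(f"bias={bias}: f={sigmoid(bias):.3f}, kept after 10 steps={kept:.4f}")
```

Under these assumptions a zero bias keeps about 0.1% of the memory after ten steps, while a bias of one keeps roughly 4%, a difference of more than a factor of forty.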
Unrolling and training
An LSTM layer processes a sequence by unrolling the cell across $T$ time steps, with the same parameters at every step. Forward, this is just the iterated application of the equations above, starting from $(h_0, c_0)$. Backward, this is Backpropagation Through Time on the resulting computational graph, in any of its practical variants (full BPTT, truncated BPTT with `detach()` at chunk boundaries, or random-chunk training when long context is not informative).
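The forward unrolling can be sketched as a plain loop over time. The example uses the fused weight layout with an assumed block order `i, f, g, o` and zero initial states; `lstm_forward` is a hypothetical name, and only the forward pass is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(X_seq, W_ih, W_hh, b):
    """Unroll one LSTM cell over a sequence.

    X_seq: (T, d); W_ih: (4n, d); W_hh: (4n, n); b: (4n,).
    Returns all hidden states, shape (T, n), and the final (h, c).
    """
    n = W_hh.shape[1]
    h = np.zeros(n)                       # h_0
    c = np.zeros(n)                       # c_0
    hidden_states = []
    for x in X_seq:                       # step t needs h and c from step t-1
        a_i, a_f, a_g, a_o = np.split(W_ih @ x + W_hh @ h + b, 4)
        c = sigmoid(a_f) * c + sigmoid(a_i) * np.tanh(a_g)
        h = sigmoid(a_o) * np.tanh(c)
        hidden_states.append(h)
    return np.stack(hidden_states), (h, c)

rng = np.random.default_rng(4)
T, d, n = 7, 3, 4
W_ih = rng.normal(size=(4 * n, d))
W_hh = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
hs, (h_T, c_T) = lstm_forward(rng.normal(size=(T, d)), W_ih, W_hh, b)
```

The same weights are reused at every iteration, which is what "unrolling with shared parameters" means in practice.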
The next note, Gradient in LSTM, works out the backward pass on this graph and shows explicitly why the cell-state line carries gradient across long gaps, while the hidden-state line, despite passing through all four gates, behaves benignly thanks to the gated additive structure underneath.