The four gates of the previous notes (forget, input, candidate update, and output) act on a single shared input pair $(x_t, h_{t-1})$ and on a single shared memory $c_{t-1}$. This note collects the complete forward pass of one LSTM cell in one place, then states the mini-batch form, the parameter count, and the initialization conventions used throughout the rest of the section.

One cell, one time step

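Given $x_t \in \mathbb{R}^{d}$, $h_{t-1} \in \mathbb{R}^{n}$ and $c_{t-1} \in \mathbb{R}^{n}$, the LSTM cell computes

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

and returns $(h_t, c_t)$.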

The first four lines are all the same affine-then-pointwise pattern, with separate parameters per gate and the same input pair going into each. The last two lines are the entire dynamics of the cell: a gated additive update on the memory line, followed by a gated, bounded read-out of the hidden state.
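The six lines above can be sketched directly in NumPy. This is a minimal illustration for a single example, not a reference implementation: the function names `lstm_cell` and `init_params`, the dictionary layout of the parameters, and the small random initialization are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, params):
    """One LSTM step for a single example.

    x: (d,), h_prev and c_prev: (n,).
    params: dict mapping gate name ("f", "i", "c", "o") to (W, U, b)
            with W: (n, d), U: (n, n), b: (n,).  (Hypothetical layout.)
    """
    def affine(gate):
        W, U, b = params[gate]
        return W @ x + U @ h_prev + b        # same input pair for every gate

    f = sigmoid(affine("f"))                 # forget gate
    i = sigmoid(affine("i"))                 # input gate
    c_tilde = np.tanh(affine("c"))           # candidate update
    o = sigmoid(affine("o"))                 # output gate

    c = f * c_prev + i * c_tilde             # gated additive memory update
    h = o * np.tanh(c)                       # gated, bounded read-out
    return h, c

def init_params(d, n, rng):
    # Small random weights and zero biases, chosen only for illustration.
    return {g: (0.1 * rng.standard_normal((n, d)),
                0.1 * rng.standard_normal((n, n)),
                np.zeros(n))
            for g in "fico"}

rng = np.random.default_rng(0)
d, n = 3, 5
params = init_params(d, n, rng)
h, c = lstm_cell(rng.standard_normal(d), np.zeros(n), np.zeros(n), params)
```

Both returned vectors have the width of the memory, and every entry of `h` lies strictly inside (-1, 1) because it is a sigmoid output times a tanh output.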

The whole architecture in two equations

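Strip the gate definitions away and the LSTM reduces to

$$
\begin{aligned}
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$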

Everything else (the four sigmoid/tanh layers, the shared input, the four times larger parameter count compared with a vanilla RNN) exists to make these two element-wise equations behave well. The first lets the gradient survive across long gaps, because with the gate values held fixed its Jacobian with respect to $c_{t-1}$ is $\operatorname{diag}(f_t)$: as long as the forget gate stays close to 1, the gradient along the memory line is neither squashed by an activation nor mixed by a weight matrix. The second guarantees that the hidden state is bounded and selectively informative. Reading the LSTM as “two equations with learned coefficients” makes the design intent unambiguous.
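The Jacobian claim can be checked numerically. The sketch below iterates only the memory update over a few steps with gate values held fixed, and compares a finite-difference Jacobian of the final memory with respect to the initial memory against the diagonal matrix of multiplied forget gates. The gate values are random numbers invented for the example; in a real LSTM the gates also depend on the previous hidden state, which adds further terms that this sketch deliberately leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 6

# Hypothetical, fixed gate values and candidates for T steps.
f = rng.uniform(0.8, 1.0, size=(T, n))        # forget gates
i = rng.uniform(0.0, 1.0, size=(T, n))        # input gates
c_tilde = rng.uniform(-1.0, 1.0, size=(T, n)) # candidate updates

def run_memory(c0):
    """Apply c <- f * c + i * c_tilde for T steps with fixed gates."""
    c = c0
    for t in range(T):
        c = f[t] * c + i[t] * c_tilde[t]
    return c

c0 = rng.standard_normal(n)
eps = 1e-6

# Central finite differences: column j holds d c_T / d c_0[j].
J = np.stack(
    [(run_memory(c0 + eps * e) - run_memory(c0 - eps * e)) / (2 * eps)
     for e in np.eye(n)],
    axis=1,
)

# With fixed gates the Jacobian is the diagonal product of forget gates.
expected = np.diag(np.prod(f, axis=0))
```

The off-diagonal entries of `J` are zero up to rounding: with fixed gates, each memory slot only ever talks to itself.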

Mini-batch form

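For a mini-batch of $B$ sequences processed in parallel, every per-example vector becomes a matrix with one column per example: $X_t \in \mathbb{R}^{d \times B}$ and $H_t, C_t \in \mathbb{R}^{n \times B}$. The parameters are shared across the batch (and across the unrolled time axis), so their shapes are unchanged. The cell becomes

$$
\begin{aligned}
F_t &= \sigma(W_f X_t + U_f H_{t-1} + b_f) \\
I_t &= \sigma(W_i X_t + U_i H_{t-1} + b_i) \\
\tilde{C}_t &= \tanh(W_c X_t + U_c H_{t-1} + b_c) \\
O_t &= \sigma(W_o X_t + U_o H_{t-1} + b_o) \\
C_t &= F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \\
H_t &= O_t \odot \tanh(C_t)
\end{aligned}
$$

where each bias is broadcast across the $B$ columns.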

All element-wise operations apply column-wise, so the entire batch is processed in parallel as a small number of dense matrix multiplications. This is the form that actually runs on hardware.
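A short NumPy sketch of the batched cell, using the one-column-per-example convention of this note. The function name `lstm_cell_batch` and the parameter dictionary are assumptions for illustration. The final loop checks that each column of the batched result matches what the same cell produces for that example alone.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_batch(X, H_prev, C_prev, params):
    """One LSTM step for a mini-batch stored column-wise.

    X: (d, B), H_prev and C_prev: (n, B).
    params: dict gate -> (W (n, d), U (n, n), b (n,)).  (Hypothetical layout.)
    """
    def affine(gate):
        W, U, b = params[gate]
        # b[:, None] has shape (n, 1) and is broadcast across the B columns.
        return W @ X + U @ H_prev + b[:, None]

    F = sigmoid(affine("f"))
    I = sigmoid(affine("i"))
    C_tilde = np.tanh(affine("c"))
    O = sigmoid(affine("o"))

    C = F * C_prev + I * C_tilde   # both operands are (n, B)
    H = O * np.tanh(C)
    return H, C

rng = np.random.default_rng(0)
d, n, B = 3, 5, 4
params = {g: (rng.standard_normal((n, d)),
              rng.standard_normal((n, n)),
              rng.standard_normal(n))
          for g in "fico"}
X = rng.standard_normal((d, B))
H_prev = rng.standard_normal((n, B))
C_prev = rng.standard_normal((n, B))

H, C = lstm_cell_batch(X, H_prev, C_prev, params)

# Each column is processed independently of the others.
columns_match = all(
    np.allclose(H[:, [j]], lstm_cell_batch(X[:, [j]], H_prev[:, [j]],
                                           C_prev[:, [j]], params)[0])
    for j in range(B)
)
```

Note that many libraries store batches row-wise (one row per example); the column-wise layout here simply follows the convention of this note.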

Dimensions used in this note (master table for the section)

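| symbol | role | single example | mini-batch ($B$) |
|---|---|---|---|
| $x_t$ / $X_t$ | input | $\mathbb{R}^{d}$ | $\mathbb{R}^{d \times B}$ |
| $h_t$ / $H_t$ | hidden state | $\mathbb{R}^{n}$ | $\mathbb{R}^{n \times B}$ |
| $c_t$ / $C_t$ | cell state | $\mathbb{R}^{n}$ | $\mathbb{R}^{n \times B}$ |
| $f_t, i_t, o_t$ / $F_t, I_t, O_t$ | gates | $\mathbb{R}^{n}$ | $\mathbb{R}^{n \times B}$ |
| $\tilde{c}_t$ / $\tilde{C}_t$ | candidate update | $\mathbb{R}^{n}$ | $\mathbb{R}^{n \times B}$ |
| $W_f, W_i, W_c, W_o$ | input-to-hidden weights | $\mathbb{R}^{n \times d}$ | idem |
| $U_f, U_i, U_c, U_o$ | hidden-to-hidden weights | $\mathbb{R}^{n \times n}$ | idem |
| $b_f, b_i, b_c, b_o$ | bias | $\mathbb{R}^{n}$ | $\mathbb{R}^{n}$, broadcast across the $B$ columns |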

Broadcasting: only in the bias addition

Across the entire LSTM cell, the only operation that requires broadcasting is the bias term in each affine map. In mini-batch form, $b \in \mathbb{R}^{n}$ is added to a matrix of shape $n \times B$; the bias is broadcast across the $B$ columns.

Every element-wise product in the cell acts on two operands of identical shape:

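| product | both operands are | shape (single / batch) |
|---|---|---|
| $f_t \odot c_{t-1}$ | gate and cell state | $\mathbb{R}^{n}$ / $\mathbb{R}^{n \times B}$ |
| $i_t \odot \tilde{c}_t$ | gate and candidate | $\mathbb{R}^{n}$ / $\mathbb{R}^{n \times B}$ |
| $o_t \odot \tanh(c_t)$ | gate and bounded cell state | $\mathbb{R}^{n}$ / $\mathbb{R}^{n \times B}$ |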

This is not a coincidence: the weight matrices are sized precisely so that every gate output matches the width of the cell state it modulates. The LSTM is designed so that each “switch” is a scalar attached to one slot of memory, and the operations therefore stay shape-aligned by construction.

The fused weight matrix used in practice

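Modern frameworks (PyTorch, TensorFlow, cuDNN) store the eight weight matrices in fused form, stacking the four gates of each source into one tensor, and perform all four affine maps with one matrix multiplication per source:

$$
\begin{bmatrix} a_f \\ a_i \\ a_c \\ a_o \end{bmatrix}
= W x_t + U h_{t-1} + b,
\qquad
W = \begin{bmatrix} W_f \\ W_i \\ W_c \\ W_o \end{bmatrix} \in \mathbb{R}^{4n \times d},
\quad
U = \begin{bmatrix} U_f \\ U_i \\ U_c \\ U_o \end{bmatrix} \in \mathbb{R}^{4n \times n},
\quad
b = \begin{bmatrix} b_f \\ b_i \\ b_c \\ b_o \end{bmatrix} \in \mathbb{R}^{4n}.
$$

The four blocks of the result are then sliced apart and passed through their respective activations. This is purely an implementation optimization: it reduces kernel-launch overhead and exposes a single large GEMM to BLAS. Mathematically it is identical to the four separate affine maps written above. PyTorch’s nn.LSTMCell exposes this fused form through two weight tensors, weight_ih (input-to-hidden) of shape $(4n, d)$ and weight_hh (hidden-to-hidden) of shape $(4n, n)$.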
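The equivalence between the fused and the separate form is easy to verify. The following NumPy sketch stacks per-gate matrices along the row axis, performs one matrix multiplication per source, and slices the result back into four pre-activations. The block order (f, i, c, o) is a choice made for this example; frameworks fix their own order (PyTorch documents input, forget, cell candidate, output), so the order must be checked before copying weights between implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4
gates = ["f", "i", "c", "o"]   # block order assumed for this sketch

# Separate per-gate parameters (random values, for illustration only).
W = {g: rng.standard_normal((n, d)) for g in gates}
U = {g: rng.standard_normal((n, n)) for g in gates}
b = {g: rng.standard_normal(n) for g in gates}

# Fused parameters: stack the four gates along the row axis.
W_fused = np.concatenate([W[g] for g in gates], axis=0)   # (4n, d)
U_fused = np.concatenate([U[g] for g in gates], axis=0)   # (4n, n)
b_fused = np.concatenate([b[g] for g in gates], axis=0)   # (4n,)

x = rng.standard_normal(d)
h_prev = rng.standard_normal(n)

# One matmul per source, then slice the result into four blocks of width n.
a = W_fused @ x + U_fused @ h_prev + b_fused               # (4n,)
fused = dict(zip(gates, np.split(a, 4)))

# Reference: four separate affine maps.
separate = {g: W[g] @ x + U[g] @ h_prev + b[g] for g in gates}

same = all(np.allclose(fused[g], separate[g]) for g in gates)
```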

The fused matmul parallelizes the gates, not time

The fused weight matrix turns the four affine maps into a single GEMM, and the mini-batch turns the $B$ sequences into extra columns of that GEMM. Both of those axes are parallelized. The one axis that stays strictly sequential is time: $h_t$ and $c_t$ cannot be computed until $h_{t-1}$ and $c_{t-1}$ exist, so a length-$T$ sequence forces $T$ dependent steps that no amount of hardware can collapse into fewer.

This is the structural reason LSTMs are slow on long sequences, and the reason the architectures that followed were designed around it. Self-attention removes the recurrence entirely and processes all positions at once; linear state-space models keep a recurrence but choose one whose decay admits a parallel scan, trading the LSTM’s per-step nonlinearity for parallelism across the sequence. The LSTM’s inductive bias is excellent; its sequential dependency along time is the price it pays for it.

Parameter count and initialization

Each of the four gates has weight matrices $W \in \mathbb{R}^{n \times d}$, $U \in \mathbb{R}^{n \times n}$ and a bias $b \in \mathbb{R}^{n}$. The total parameter count of one LSTM cell is therefore

$$
4\,(nd + n^2 + n) = 4n\,(d + n + 1).
$$

The corresponding vanilla RNN has one quarter of this count. The factor of $4$ is the price of preservation discussed in the overview: three of the four gates exist solely to make the memory line behave as a stable additive register.
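The count is simple enough to write as a helper. The function names and the example sizes below are assumptions for illustration; the formula is the one stated above, with one bias vector per gate.

```python
def rnn_param_count(d, n):
    """Vanilla RNN: one input matrix, one recurrent matrix, one bias."""
    return n * d + n * n + n

def lstm_param_count(d, n):
    """LSTM cell: the same three blocks, once per gate (four gates)."""
    return 4 * (n * d + n * n + n)

d, n = 128, 256   # hypothetical input and hidden sizes
lstm_total = lstm_param_count(d, n)   # 4 * (32768 + 65536 + 256) = 394240
rnn_total = rnn_param_count(d, n)     # 98560
```

Framework-reported counts can differ slightly from this formula. PyTorch, for example, keeps two bias vectors per cell (bias_ih and bias_hh), which gives $4n(d + n + 2)$ instead.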

The initial states are conventionally fixed:

$$
h_0 = \mathbf{0}, \qquad c_0 = \mathbf{0}.
$$

A simple but consequential initialization detail (Jozefowicz, Zaremba and Sutskever, 2015) deserves restating: the forget-gate bias is initialized close to $1$ rather than $0$, so that $f_t \approx \sigma(1) \approx 0.73$ at the start of training instead of $\sigma(0) = 0.5$. Without this, a freshly initialized LSTM forgets roughly half of its memory at every step, and the cell-state path is effectively dead long before gradient descent can teach the gate to keep things. The justification is given at the end of the Forget gate note.
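A back-of-the-envelope calculation shows how much the bias matters. The sketch below assumes, purely for illustration, that the weighted part of the forget-gate pre-activation is zero at initialization, so the gate value is just the sigmoid of the bias, and then compares how much of a memory entry is retained after a number of steps. The step count is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

steps = 10   # hypothetical gap length

retained = {}
for bias in (0.0, 1.0):
    f = sigmoid(bias)               # forget gate at init, weights ignored
    retained[bias] = f ** steps     # fraction of memory left after `steps`

# sigmoid(0) = 0.5, so the zero-bias cell keeps 0.5**10 (under 0.1%);
# sigmoid(1) is about 0.73, so the bias-1 cell keeps noticeably more.
ratio = retained[1.0] / retained[0.0]
```

The retention still decays geometrically with a bias of 1, only more slowly; the initialization buys time for training to push the gate closer to 1 where that is useful.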

Unrolling and training

An LSTM layer processes a sequence by unrolling the cell across time steps, with the same parameters at every step. Forward, this is just the iterated application of the equations above, starting from $(h_0, c_0)$. Backward, this is Backpropagation Through Time on the resulting computational graph, in any of its practical variants (full BPTT, truncated BPTT with detach() at chunk boundaries, or random-chunk training when long context is not informative).
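The forward side of unrolling can be sketched in a few lines of NumPy. The example below uses fused weights with an assumed block order (f, i, c, o), starts from zero states, and runs the same sequence twice: once in a single pass and once in chunks that hand the final state of one chunk to the next. NumPy has no autograd, so the detach() step of truncated BPTT is only indicated in a comment; the point here is that chunking leaves the forward values unchanged. All names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step with fused weights W: (4n, d), U: (4n, n), b: (4n,)."""
    n = h.shape[0]
    a = W @ x + U @ h + b
    f = sigmoid(a[:n])
    i = sigmoid(a[n:2 * n])
    g = np.tanh(a[2 * n:3 * n])      # candidate update
    o = sigmoid(a[3 * n:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def unroll(xs, h, c, W, U, b):
    """Apply the same cell to every row of xs (shape (T, d)), in order."""
    hs = []
    for x in xs:                     # strictly sequential along time
        h, c = lstm_step(x, h, c, W, U, b)
        hs.append(h)
    return np.stack(hs), h, c

rng = np.random.default_rng(0)
d, n, T, chunk = 3, 4, 12, 5
W = 0.5 * rng.standard_normal((4 * n, d))
U = 0.5 * rng.standard_normal((4 * n, n))
b = np.zeros(4 * n)
xs = rng.standard_normal((T, d))

# Full unroll from zero initial states.
hs_full, _, _ = unroll(xs, np.zeros(n), np.zeros(n), W, U, b)

# Chunked unroll: carry (h, c) across chunk boundaries.
h, c = np.zeros(n), np.zeros(n)
pieces = []
for start in range(0, T, chunk):
    hs_chunk, h, c = unroll(xs[start:start + chunk], h, c, W, U, b)
    # In an autograd framework, h and c would be detached here
    # so that gradients do not flow past the chunk boundary.
    pieces.append(hs_chunk)
hs_chunked = np.concatenate(pieces, axis=0)
```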

The next note, Gradient in LSTM, works out the backward pass on this graph and shows explicitly why the cell-state line carries gradient across long gaps, while the hidden-state line, despite passing through all four gates, behaves benignly thanks to the gated additive structure underneath.