LSTM Layer overview

The architectural prescription from Vanilla RNNs limitations (additive memory plus a learned forget gate) is built here in full, one piece at a time, in the order in which information flows through the LSTM cell.

LSTM Layer overview: the two states ( $c_{t}$ , $h_{t}$ ), the four inner FC layers, and the notation used throughout the subsection.
Cell state: the carry line $c_{t - 1} \to c_{t}$ as a temporal skip connection, and the canonical definition of coordinate independence that makes its Jacobian diagonal.
Forget gate: the gating of $c_{t - 1}$ by $f_{t}$ , the choice of sigmoid, and the consequential forget-bias initialization.
Input gate: the candidate update $\tilde{c}_{t}$ (with tanh) and the input mask $i_{t}$ (with sigmoid), and why the what and how much of writing are split into two separate networks.
Output gate: the read-out $h_{t} = o_{t} ⊙ tanh (c_{t})$ and the separation between what the cell stores ( $c_{t}$ ) and what it says ( $h_{t}$ ).
Putting all together: the complete forward pass, the mini-batch vectorization, the master dimensions table, the broadcasting audit, the parameter count, and the fused-weight matrix used by modern frameworks.

The gradient analysis of the assembled cell is treated separately in Gradient in LSTM, one level up.

The diagnosis of Vanilla RNNs limitations ends with a single architectural prescription: take the recurrent matrix off the carry path and replace overwriting with additive update through a learned gate. The LSTM cell is what that prescription looks like when written out in full. This note fixes the notation and the high-level reading of the diagram above; the gate-by-gate derivation begins with the cell state.

Notation used in this section

Throughout the LSTM layer notes:

lowercase bold denotes a vector for a single example ( $x_{t}, h_{t}, c_{t}, f_{t}, i_{t}, o_{t}, \tilde{c}_{t}$ );

uppercase bold denotes the corresponding mini-batch matrix, with one column per example ( $X_{t}, H_{t}, C_{t}, F_{t}, I_{t}, O_{t}, \tilde{C}_{t}$ ), or a learned weight matrix ( $W_{x ∙}, W_{h ∙}$ ).

The diagrams in this section label the cell state with an uppercase symbol because they depict a single time step in isolation; the equations always follow the convention above.
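
The column-per-example convention can be checked numerically. The sketch below is an illustration added here, not part of the original notes: the sizes, the random seed, and the variable names `W_x`, `W_h`, `b` are arbitrary assumptions. It evaluates one affine map on a mini-batch stored as columns and confirms that each column equals the single-example result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons, batch = 3, 4, 5   # illustrative sizes (assumed)

W_x = rng.normal(size=(n_neurons, n_inputs))
W_h = rng.normal(size=(n_neurons, n_neurons))
b = rng.normal(size=(n_neurons,))

X_t = rng.normal(size=(n_inputs, batch))      # one column per example
H_prev = rng.normal(size=(n_neurons, batch))  # one column per example

# Mini-batch pre-activation: the bias is broadcast across the columns.
Z = W_x @ X_t + W_h @ H_prev + b[:, None]

# Single-example pre-activation for column k.
k = 2
z_k = W_x @ X_t[:, k] + W_h @ H_prev[:, k] + b

columns_match = np.allclose(Z[:, k], z_k)
```

With this layout the same weight matrices serve the single-example and the mini-batch case; only the bias needs an explicit broadcast.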

The two sizes: $n_{inputs}$ and $n_{neurons}$

Two integers fix the shape of everything in this section.

$n_{inputs}$ is the number of features in each input vector $x_{t}$ : the dimensionality of one element of the sequence, such as the embedding size of a token or the channel count of a frame. It is set by the data.

$n_{neurons}$ is the number of units in one LSTM layer, and it is a design choice. A unit carries one coordinate of the cell state and one coordinate of the hidden state; it is the recurrent analogue of a single neuron in a fully connected layer. There are $n_{neurons}$ of them, so every vector the cell computes at a step, $c_{t}, h_{t}, f_{t}, i_{t}, o_{t}, \tilde{c}_{t}$ , has one entry per unit and therefore lives in $R^{n_{neurons}}$ , while only the input $x_{t}$ lives in $R^{n_{inputs}}$ .

The name $n_{neurons}$ is therefore deliberate. The $n_{neurons}$ units sit side by side and all read the same input pair $(x_{t}, h_{t - 1})$ ; the original Hochreiter and Schmidhuber paper calls one such unit a memory cell, and later literature often calls it an LSTM block. To process a sequence the layer is unrolled over the $T$ time steps, reusing the same units at each one. In PyTorch this number is the constructor argument hidden_size.

Two states, two roles

At every step $t = 1, \dots, T$ an LSTM cell receives $(x_{t}, h_{t - 1}, c_{t - 1})$ and produces $(h_{t}, c_{t})$ . Unlike a vanilla RNN, which carries a single hidden vector forward, an LSTM carries two:

the cell state $c_{t} \in R^{n_{neurons}}$ , the orange label, which is the long-term memory of the cell;
the hidden state $h_{t} \in R^{n_{neurons}}$ , the blue label, which is the cell’s externally visible output.

Both are passed to the next time step; only $h_{t}$ is exposed to the rest of the network (and to any prediction head built on top of the layer).

Why two states and not one

The two paths are not redundant: they obey different constraints, and the LSTM works because those constraints are incompatible inside a single vector.

$c_{t}$ must live in an unconstrained, additive space so that the gradient can flow back through it across long gaps, scaled only by the forget gates rather than by a recurrent matrix. This is the highway that mitigates the vanishing gradient. Its components are not bounded.

$h_{t}$ must be bounded and selective: it is fed back into the gate computations at the next step (through sigmoids and a tanh), and it is the signal that downstream layers consume. A bounded, task-focused output is what makes both the gating logic and the prediction head well-behaved.

Squeezing both jobs into a single hidden vector is exactly what a vanilla RNN attempts, and exactly why it cannot preserve information: the same coordinate would have to be at once a long-lived unbounded register and a short-lived bounded summary. Separating the two is the architectural move.
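
The contrast between an unbounded $c_{t}$ and a bounded $h_{t}$ can be seen in a one-unit toy run. This is a minimal sketch under stated assumptions: the gate values are set by hand rather than learned, and the values 1.0 are the limiting values of the sigmoid, chosen only to make the effect obvious.

```python
import numpy as np

T = 50
c = np.zeros(1)             # c_0 = 0
c_history, h_history = [], []

# Hand-set gate values (assumed for illustration, not learned):
f_t = np.array([1.0])       # keep everything
i_t = np.array([1.0])       # write at full strength
c_tilde = np.array([0.9])   # constant positive candidate
o_t = np.array([1.0])       # expose everything

for _ in range(T):
    c = f_t * c + i_t * c_tilde   # additive update, no squashing
    h = o_t * np.tanh(c)          # bounded read-out
    c_history.append(c.item())
    h_history.append(h.item())

max_c = max(c_history)   # grows linearly with the number of steps
max_h = max(h_history)   # never exceeds 1
```

The cell state accumulates to roughly $0.9 \, T$ while the hidden state saturates near 1, which is the division of labour described above.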

The four inner FC layers

The diagram contains four small fully connected layers, each one a familiar affine map followed by a pointwise nonlinearity. They are not four independent modules: each consumes the same input pair $(x_{t}, h_{t - 1})$ and produces a vector in $R^{n_{neurons}}$ .

Forget gate $f_{t} = σ (\cdot) \in (0, 1)^{n_{neurons}}$ . Decides, component by component, what fraction of $c_{t - 1}$ to keep. Detailed in Forget gate.
Input gate $i_{t} = σ (\cdot) \in (0, 1)^{n_{neurons}}$ . Decides, component by component, how much of the new candidate to write. Detailed in Input gate.
Candidate update $\tilde{c}_{t} = tanh (\cdot) \in (- 1, 1)^{n_{neurons}}$ . Proposes a signed content vector to add to memory. Also covered in Input gate.
Output gate $o_{t} = σ (\cdot) \in (0, 1)^{n_{neurons}}$ . Decides, component by component, what part of the freshly updated memory to expose as $h_{t}$ . Detailed in Output gate.

The cell state update, anticipated in the limitations note and derived properly in Cell state, is

$$
c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c}_{t},
$$

and the hidden state is read out from it as

$$
h_{t} = o_{t} \odot \tanh (c_{t}) .
$$

Dimensions used in this note

symbol	role	shape
$x_{t}$	current input	$n_{inputs}$
$h_{t - 1}, h_{t}$	hidden state	$n_{neurons}$
$c_{t - 1}, c_{t}$	cell state	$n_{neurons}$
$f_{t}, i_{t}, o_{t}$	gates	$n_{neurons}$
$\tilde{c}_{t}$	candidate update	$n_{neurons}$
$W_{x ∙}$	input-to-hidden weights, $∙ \in \{f, i, c, o\}$	$n_{neurons} \times n_{inputs}$
$W_{h ∙}$	hidden-to-hidden weights	$n_{neurons} \times n_{neurons}$
$b_{∙}$	bias	$n_{neurons}$
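
The shapes in the table can be made concrete by allocating the parameters and checking one pre-activation. The following is a sketch, not the notes' own code: the dictionary `params`, its key names, the sizes, and the random initialization are all assumptions made for illustration.

```python
import numpy as np

n_inputs, n_neurons = 3, 4   # illustrative sizes (assumed)
rng = np.random.default_rng(0)

# One (W_x, W_h, b) triple per map: forget, input, candidate, output.
params = {}
for gate in ("f", "i", "c", "o"):
    params[f"W_x{gate}"] = rng.normal(size=(n_neurons, n_inputs))
    params[f"W_h{gate}"] = rng.normal(size=(n_neurons, n_neurons))
    params[f"b_{gate}"] = np.zeros(n_neurons)

x_t = rng.normal(size=(n_inputs,))
h_prev = np.zeros(n_neurons)

# Pre-activation of the forget gate: every term has shape (n_neurons,).
z_f = params["W_xf"] @ x_t + params["W_hf"] @ h_prev + params["b_f"]

shapes = {name: value.shape for name, value in params.items()}
```

Every quantity except $x_{t}$ and the columns of $W_{x ∙}$ is indexed by unit, which is why $n_{neurons}$ appears in every shape.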

Three heads on a tape

Read the four maps as the three heads of a learned tape recorder operating on the memory $c_{t}$ :

$f_{t}$ is an erase head, selecting which entries of the tape to clear before writing.

$(i_{t}, \tilde{c}_{t})$ together act as a write head: $\tilde{c}_{t}$ is the content to write, $i_{t}$ is the per-position write mask.

$o_{t}$ is a read head, selecting which entries of the tape to expose as $h_{t}$ .

This is not just a metaphor: it is exactly the decomposition that makes preservation a learned decision rather than a knife-edge condition on the spectrum of the recurrent matrix. Erasure, writing, and reading are controlled by separate gates, each acting at the per-coordinate level.
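
The three heads can be exercised on a three-entry tape. In this sketch the gate values are written by hand instead of being computed by the FC layers, and the hard 0 and 1 entries are limiting values of the sigmoid; both are assumptions made to keep the arithmetic readable.

```python
import numpy as np

c_prev = np.array([2.0, -1.0, 0.5])    # current tape contents

f_t = np.array([0.0, 1.0, 1.0])        # erase head: clear entry 0
i_t = np.array([1.0, 0.0, 0.0])        # write mask: write only entry 0
c_tilde = np.array([0.7, 0.7, 0.7])    # content proposed for every entry
o_t = np.array([0.0, 1.0, 0.0])        # read head: expose only entry 1

c_t = f_t * c_prev + i_t * c_tilde     # entry 0 replaced, entries 1 and 2 kept
h_t = o_t * np.tanh(c_t)               # only entry 1 is visible outside
```

Entry 0 is overwritten with the new content, entries 1 and 2 are carried over untouched, and only entry 1 reaches $h_{t}$; entry 2 stays stored but hidden.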

Shared input, separate transforms

All four maps share the same input $(x_{t}, h_{t - 1})$ . The diagram makes this explicit: the blue line carrying $h_{t - 1}$ and the line carrying $x_{t}$ both fan out into all four FC blocks. Each block, however, owns its own weight matrices and bias:

$$
\begin{aligned}
f_{t} &= \sigma (W_{x f} x_{t} + W_{h f} h_{t - 1} + b_{f}), \\
i_{t} &= \sigma (W_{x i} x_{t} + W_{h i} h_{t - 1} + b_{i}), \\
\tilde{c}_{t} &= \tanh (W_{x c} x_{t} + W_{h c} h_{t - 1} + b_{c}), \\
o_{t} &= \sigma (W_{x o} x_{t} + W_{h o} h_{t - 1} + b_{o}) .
\end{aligned}
$$

The shared input is what gives the LSTM its coherence: at every step, all four decisions are conditioned on the same view of the world, namely the current input and the previous hidden summary. The cell does not separately gather “context for forgetting” and “context for writing”; it gathers context once, and routes it through four learned linear readouts.
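
A single step with shared input and separate transforms might be sketched as follows. The function name `lstm_step`, the parameter dictionary `p`, its key names, and the small random initialization are assumptions introduced here; the arithmetic follows the four gate equations together with the cell update and read-out given earlier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a single example; p maps names to arrays."""
    def affine(gate):
        # Same input pair (x_t, h_prev), gate-specific weights and bias.
        return p[f"W_x{gate}"] @ x_t + p[f"W_h{gate}"] @ h_prev + p[f"b_{gate}"]

    f_t = sigmoid(affine("f"))        # forget gate
    i_t = sigmoid(affine("i"))        # input gate
    c_tilde = np.tanh(affine("c"))    # candidate update
    o_t = sigmoid(affine("o"))        # output gate

    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 4            # illustrative sizes (assumed)
p = {}
for gate in "fico":
    p[f"W_x{gate}"] = 0.1 * rng.normal(size=(n_neurons, n_inputs))
    p[f"W_h{gate}"] = 0.1 * rng.normal(size=(n_neurons, n_neurons))
    p[f"b_{gate}"] = np.zeros(n_neurons)

x_t = rng.normal(size=n_inputs)
h_t, c_t = lstm_step(x_t, np.zeros(n_neurons), np.zeros(n_neurons), p)
```

The helper `affine` is called four times on the same `(x_t, h_prev)` pair, which is the code-level counterpart of the fan-out in the diagram.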

The 4× parameter cost is not a bug

A vanilla RNN of width $n_{neurons}$ has one affine map, with an $n_{neurons} \times (n_{inputs} + n_{neurons})$ weight block and $n_{neurons}$ biases. An LSTM has four such maps. The parameter count is therefore $4 \times$ larger at the same width (exactly so under the one-bias-per-map convention used here).

This factor of four is the price of preservation. Three of the four maps exist exclusively to make the cell state behave as an unbiased additive register: one to erase, one to gate writes, one to gate reads. Only the candidate $\tilde{c}_{t}$ plays the role of “what the vanilla RNN was already trying to compute”. Trading parameters for stable long-range gradient flow is the explicit bargain of the architecture, and it is what GRUs later try to renegotiate.
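
The factor of four can be verified by direct counting. The helper names below are hypothetical, the sizes are arbitrary, and the count assumes one bias vector per map as in the dimensions table; framework implementations that store additional bias vectors will report slightly different totals.

```python
def vanilla_rnn_params(n_inputs, n_neurons):
    # One affine map: input weights, recurrent weights, one bias vector.
    return n_neurons * (n_inputs + n_neurons) + n_neurons

def lstm_params(n_inputs, n_neurons):
    # Four affine maps of the same shape: f, i, c, o.
    return 4 * vanilla_rnn_params(n_inputs, n_neurons)

n_inputs, n_neurons = 128, 256        # illustrative sizes (assumed)
rnn_count = vanilla_rnn_params(n_inputs, n_neurons)
lstm_count = lstm_params(n_inputs, n_neurons)
ratio = lstm_count / rnn_count
```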

Unrolling and parameter sharing

The figure shows a single time step. As in the vanilla RNN, the cell is unrolled in time into as many copies as the sequence requires, and every copy shares the exact same parameters $\{W_{x ∙}, W_{h ∙}, b_{∙}\}_{∙ \in \{f, i, c, o\}}$ . The initial states are conventionally fixed:

$$
h_{0} = 0, \qquad c_{0} = 0 .
$$

Training proceeds via Backpropagation Through Time on this unrolled graph, in any of its practical variants. The crucial property that motivated the whole construction, namely that gradient flows back through $c_{t}$ with Jacobian $diag (f_{t})$ rather than through a product of recurrent matrices, is analyzed in Gradient in LSTM.
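
Unrolling with shared parameters and zero initial states can be sketched as a plain loop. The names `lstm_step`, `lstm_unroll`, and the dictionary `p` are assumptions made for this illustration, as are the sizes and the random inputs; only the forward pass is shown, not Backpropagation Through Time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # Four affine maps of the same input pair, one per gate.
    z = {g: p["W_x" + g] @ x_t + p["W_h" + g] @ h_prev + p["b_" + g]
         for g in "fico"}
    c_t = sigmoid(z["f"]) * c_prev + sigmoid(z["i"]) * np.tanh(z["c"])
    h_t = sigmoid(z["o"]) * np.tanh(c_t)
    return h_t, c_t

def lstm_unroll(xs, p, n_neurons):
    h = np.zeros(n_neurons)           # h_0 = 0
    c = np.zeros(n_neurons)           # c_0 = 0
    hs = []
    for x_t in xs:                    # the same p is reused at every step
        h, c = lstm_step(x_t, h, c, p)
        hs.append(h)
    return np.stack(hs), c

rng = np.random.default_rng(0)
T, n_inputs, n_neurons = 6, 3, 4      # illustrative sizes (assumed)
p = {}
for g in "fico":
    p["W_x" + g] = 0.1 * rng.normal(size=(n_neurons, n_inputs))
    p["W_h" + g] = 0.1 * rng.normal(size=(n_neurons, n_neurons))
    p["b_" + g] = np.zeros(n_neurons)

xs = rng.normal(size=(T, n_inputs))   # one row per time step
hs, c_T = lstm_unroll(xs, p, n_neurons)
```

The loop body never creates new parameters: the sequence length changes the depth of the unrolled graph but not the parameter count.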

How to read the rest of this section

The remaining notes build the cell one piece at a time, in the order in which information flows through the diagram:

Cell state: the additive memory line $c_{t - 1} \to c_{t}$ and its interpretation as a temporal skip connection.
Forget gate: the gating of $c_{t - 1}$ by $f_{t}$ .
Input gate: the candidate $\tilde{c}_{t}$ and the input mask $i_{t}$ .
Output gate: the read-out $h_{t} = o_{t} ⊙ tanh (c_{t})$ .
Putting all together: the complete forward pass and the mini-batch vectorization.

Deep Learning: Zero to Hero
