Input Gate

The forget gate decides what to erase from the previous cell state. The input gate decides what to write into the new one. Writing is split into two questions that the LSTM keeps strictly separate:

What could be written? A candidate update $\tilde{c}_{t} \in (- 1, 1)^{n_{neurons}}$ , proposed by a tanh layer.
How much of it should actually be written, coordinate by coordinate? A gate $i_{t} \in (0, 1)^{n_{neurons}}$ , produced by a sigmoid layer.

The two are combined element-wise into the additive contribution $i_{t} ⊙ \tilde{c}_{t}$ , and the cell state is updated as

c_{t} = \underbrace{f_{t} ⊙ c_{t - 1}}_{\text{kept from the past}} + \underbrace{i_{t} ⊙ \tilde{c}_{t}}_{\text{added now}} .

This separation of what and how much is the architectural device that makes the cell’s memory both bidirectional in sign and selectively addressable. Both properties are unusual, and both are essential.

The candidate update $\tilde{c}_{t}$

The candidate is computed by a single fully connected layer of width $n_{neurons}$ with tanh activation:

\tilde{c}_{t} = tanh (W_{x c} x_{t} + W_{h c} h_{t - 1} + b_{c}) \in (- 1, 1)^{n_{neurons}} .

It depends on the same input pair $(x_{t}, h_{t - 1})$ as every gate in the cell, but its parameters $W_{x c}, W_{h c}, b_{c}$ are separate from those of the forget and input gates.
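The candidate layer above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes, the random initialisation, and the variable names (`W_xc`, `W_hc`, `b_c`, `h_prev`) are assumptions chosen to mirror the symbols in the equation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 4  # illustrative sizes, not taken from the note

# Hypothetical, randomly initialised parameters of the candidate layer.
W_xc = rng.normal(size=(n_neurons, n_inputs))
W_hc = rng.normal(size=(n_neurons, n_neurons))
b_c = np.zeros(n_neurons)

x_t = rng.normal(size=n_inputs)      # current input
h_prev = rng.normal(size=n_neurons)  # previous hidden state h_{t-1}

# Candidate update: affine map of (x_t, h_{t-1}) followed by tanh.
c_tilde = np.tanh(W_xc @ x_t + W_hc @ h_prev + b_c)

print(c_tilde.shape)  # one candidate value per neuron
```

Every entry of `c_tilde` lies between $-1$ and $1$, whatever the scale of the pre-activation.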

Dimensions used in this note

symbol	role	single example	mini-batch ( $B$ )
$x_{t}, X_{t}$	current input	$n_{inputs}$	$n_{inputs} \times B$
$h_{t - 1}, H_{t - 1}$	previous hidden state	$n_{neurons}$	$n_{neurons} \times B$
$c_{t - 1}, C_{t - 1}$	previous cell state	$n_{neurons}$	$n_{neurons} \times B$
$c_{t}, C_{t}$	new cell state	$n_{neurons}$	$n_{neurons} \times B$
$f_{t}, F_{t}$	forget gate (from Forget gate)	$n_{neurons}$	$n_{neurons} \times B$
$i_{t}, I_{t}$	input gate	$n_{neurons}$	$n_{neurons} \times B$
$\tilde{c}_{t}, \tilde{C}_{t}$	candidate update	$n_{neurons}$	$n_{neurons} \times B$
$W_{x i}, W_{x c}$	input-to-hidden weights	$n_{neurons} \times n_{inputs}$	idem
$W_{hi}, W_{h c}$	hidden-to-hidden weights	$n_{neurons} \times n_{neurons}$	idem
$b_{i}, b_{c}$	biases	$n_{neurons}$	$n_{neurons}$ (broadcast across columns)

The element-wise products $i_{t} ⊙ \tilde{c}_{t}$ and $f_{t} ⊙ c_{t - 1}$ act on operands of identical shape: no broadcasting is involved in the $⊙$ .


Why tanh and not sigmoid for the candidate

The forget and input gates use sigmoid because they play the role of multiplicative masks: values in $(0, 1)$ act as continuous on/off switches, and there is no situation in which “negative keeping” or “negative writing” would make sense.

The candidate plays a different role. It is added to the cell state, so its sign matters: each coordinate of $\tilde{c}_{t}$ can either push up or push down the corresponding coordinate of $c_{t}$ . A bounded, zero-centred range $(- 1, 1)$ is the natural choice. Sigmoid would constrain the candidate to positive values, so every write would push its slot upward and only the decay from the forget gate could bring it back down. With tanh, writes can have either sign, so they can cancel as well as accumulate, and the cell state has no built-in drift in one direction.

So the role determines the activation: masks use sigmoid, contents use tanh. The same logic governs the choice of tanh in the vanilla RNN recurrence (see the “Why tanh?” callout there).
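The drift argument can be illustrated with a toy scalar simulation. The sketch below makes strong simplifying assumptions that are not part of the LSTM itself: the gates are frozen at constants (`f = 0.9`, `i = 0.5`) and the candidate pre-activation is drawn from a zero-mean Gaussian. It only compares what happens to a single slot when the candidate goes through tanh versus sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
T = 2000
f, i = 0.9, 0.5          # gates frozen at constants (a simplifying assumption)

c_tanh, c_sig = 0.0, 0.0
hist_tanh, hist_sig = [], []
for _ in range(T):
    z = rng.normal()     # zero-mean candidate pre-activation (assumption)
    c_tanh = f * c_tanh + i * np.tanh(z)   # signed writes
    c_sig = f * c_sig + i * sigmoid(z)     # writes are always positive
    hist_tanh.append(c_tanh)
    hist_sig.append(c_sig)

mean_tanh = float(np.mean(hist_tanh))
mean_sig = float(np.mean(hist_sig))
print(f"tanh candidate:    mean cell value {mean_tanh:+.2f}")
print(f"sigmoid candidate: mean cell value {mean_sig:+.2f}")
```

Under these assumptions the sigmoid variant settles around $i \cdot 0.5 / (1 - f) = 2.5$, while the tanh variant fluctuates around zero.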

The input gate $i_{t}$

The input gate is again a sigmoid layer with the standard structure:

i_{t} = σ (W_{x i} x_{t} + W_{hi} h_{t - 1} + b_{i}) \in (0, 1)^{n_{neurons}} .

Coordinate by coordinate, $i_{t}^{(k)}$ controls how much of the candidate $\tilde{c}_{t}^{(k)}$ is admitted into the corresponding slot of the cell state. A value near $1$ accepts the candidate in full; a value near $0$ rejects it; intermediate values blend it in partially.
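A small numeric sketch of the three regimes (accept, reject, blend). The pre-activation values are hand-picked for illustration and stand in for $W_{x i} x_{t} + W_{hi} h_{t - 1} + b_{i}$; they do not come from a trained network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked gate pre-activations (hypothetical values).
z_i = np.array([10.0, -10.0, 0.0])
i_t = sigmoid(z_i)                    # roughly [1, 0, 0.5]

c_tilde = np.array([0.8, 0.8, 0.8])   # same candidate offered to every slot
write = i_t * c_tilde                 # what actually reaches the cell state

print(np.round(i_t, 3))
print(np.round(write, 3))
```

The first slot receives almost the whole candidate, the second almost nothing, and the third exactly half.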

Two questions, two networks, one input

A single network producing values in $(- 1, 1)$ could in principle express both “what to write” and “how much” in one shot: a small magnitude would mean “write little”, a large one would mean “write a lot”. The LSTM deliberately refuses this conflation.

The reason is that gradient flow distinguishes the two. Both enter the cell state through the product $i_{t} ⊙ \tilde{c}_{t}$ , so the gradient reaching the candidate is the upstream signal $\partial L / \partial c_{t}$ (with $L$ the loss) scaled by $i_{t}$ , while the gradient reaching the gate is the same signal scaled by $\tilde{c}_{t}$ . Splitting the two networks gives the optimizer two separate levers for the same write operation: the gate can shut a write off without distorting its content, and the content can change sign without touching the gate. In practice this tends to be easier to train than a single fused layer. Attention-based architectures rely on a related idea, computing content and routing through separate projections.
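The two gradient expressions can be checked numerically. The sketch below uses a linear surrogate loss $L = \sum_k g^{(k)} c_{t}^{(k)}$, so that the upstream gradient is exactly the (randomly drawn, hypothetical) vector `g`; it then compares the analytic gradients `g * i_t` and `g * c_tilde` against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
f_t = rng.uniform(0, 1, n)
i_t = rng.uniform(0, 1, n)
c_prev = rng.normal(size=n)
c_tilde = rng.uniform(-1, 1, n)
g = rng.normal(size=n)   # stand-in for the upstream gradient dL/dc_t

def loss(i, c_tl):
    # Linear surrogate loss, so that dL/dc_t equals g exactly.
    return np.sum(g * (f_t * c_prev + i * c_tl))

# Analytic gradients from the product rule on i_t * c_tilde.
grad_c_tilde = g * i_t      # candidate sees the upstream signal scaled by the gate
grad_i = g * c_tilde        # gate sees the upstream signal scaled by the candidate

def numeric_grad(fn, v, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    out = np.zeros_like(v)
    for k in range(v.size):
        d = np.zeros_like(v)
        d[k] = eps
        out[k] = (fn(v + d) - fn(v - d)) / (2 * eps)
    return out

num_c_tilde = numeric_grad(lambda v: loss(i_t, v), c_tilde)
num_i = numeric_grad(lambda v: loss(v, c_tilde), i_t)

print(np.allclose(num_c_tilde, grad_c_tilde, atol=1e-6))
print(np.allclose(num_i, grad_i, atol=1e-6))
```

A closed gate ($i_{t}^{(k)} \approx 0$) therefore blocks the gradient to the candidate in that slot, and a near-zero candidate blocks the gradient to the gate.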

How the two combine

The new contribution to the cell state is the element-wise product

i_{t} ⊙ \tilde{c}_{t} \in (- 1, 1)^{n_{neurons}} .

It is bounded coordinate-wise: even if every gate fires fully and every candidate saturates, no single update can add more than $1$ in magnitude to any slot of the cell state. The full cell state update,

c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ \tilde{c}_{t},

then has a clean interpretation: each slot of memory is independently decayed by the forget gate and incremented by the gated candidate. Both operations act coordinate by coordinate, so the LSTM’s memory is, internally, $n_{neurons}$ scalar registers operating in parallel, each with its own learned decay and its own learned write enable.
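As a sketch, the update and its per-step bound look as follows in NumPy. The gate and candidate values are drawn at random within their admissible ranges instead of being produced by trained layers, which is an assumption made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 5

# Random stand-ins for the outputs of the three layers (assumption).
f_t = rng.uniform(0, 1, n_neurons)        # forget gate, in (0, 1)
i_t = rng.uniform(0, 1, n_neurons)        # input gate, in (0, 1)
c_tilde = rng.uniform(-1, 1, n_neurons)   # candidate, in (-1, 1)
c_prev = rng.normal(scale=3.0, size=n_neurons)

write = i_t * c_tilde            # additive contribution of this step
c_t = f_t * c_prev + write       # full cell state update

# No slot can move by 1 or more through the write term alone.
print(np.max(np.abs(write)))
```

The cell state itself is not confined to $(- 1, 1)$: with $f_{t}^{(k)}$ close to $1$, repeated writes of the same sign can accumulate over many steps.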

Each coordinate is an independent learned register

The cell state is not a single distributed representation in the way a vanilla hidden state is. Because the recurrence factorizes across coordinates,
$c_{t}^{(k)} = f_{t}^{(k)} c_{t - 1}^{(k)} + i_{t}^{(k)} \tilde{c}_{t}^{(k)},$
the network can in principle dedicate different slots to different memories with different lifetimes: a slot with $f_{t}^{(k)} \approx 1$ everywhere becomes a near-permanent register; a slot with $f_{t}^{(k)} \approx 0$ at every boundary becomes a working-memory scratchpad. The gates that affect slot $k$ at time $t$ still depend on all coordinates of $x_{t}$ and $h_{t - 1}$ (the $k$ -th row of the weight matrices is dense), but the update rule applied to each slot is independent of the others.

This is why probing studies on trained LSTMs occasionally find single coordinates of $c_{t}$ that track interpretable variables: line-position counters, quote-depth, sentence sentiment. The architecture explicitly affords such coordinate-level specialization.
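The difference in lifetimes can be made concrete with two slots and fixed gate values. The numbers below are illustrative assumptions: slot 0 has a forget gate pinned near $1$, slot 1 near $0$, and the input gate is closed so that only the decay is visible.

```python
import numpy as np

T = 50
# Slot 0: long-lived register. Slot 1: short-lived scratchpad.
f = np.array([0.999, 0.05])   # forget gate values, held constant (assumption)
c = np.array([1.0, 1.0])      # both slots start out storing the value 1

for _ in range(T):
    i_t = np.zeros(2)         # input gate closed: nothing new is written
    c_tilde = np.zeros(2)
    c = f * c + i_t * c_tilde # same scalar recurrence in every slot

print(c)
```

After 50 steps slot 0 still holds about $0.999^{50} \approx 0.95$ of its content, while slot 1 has decayed to numerically nothing.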

Why the recurrence factorizes coordinate-wise

By the coordinate independence of $⊙$ , $(a ⊙ b)^{(k)} = a^{(k)} b^{(k)}$ , so the $k$ -th coordinate of $a ⊙ b$ depends only on the $k$ -th coordinates of $a$ and $b$ . Applying this to both terms of the cell state update,
$c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ \tilde{c}_{t},$
gives, component by component,
$c_{t}^{(k)} = f_{t}^{(k)} c_{t - 1}^{(k)} + i_{t}^{(k)} \tilde{c}_{t}^{(k)} .$
Slot $k$ of the new cell state depends on slot $k$ of the previous cell state and on slot $k$ of the candidate; coordinates do not mix. The coupling across coordinates lives entirely upstream, inside the four affine maps that produce $f_{t}, i_{t}, \tilde{c}_{t}, o_{t}$ from $(x_{t}, h_{t - 1})$ . Replace the recurrence’s $⊙$ with a general matrix multiplication and the property is lost: this is exactly what a vanilla RNN does, and it is why the vanilla RNN cannot dedicate isolated slots to isolated memories.
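A quick way to see the contrast is to perturb one slot of the previous state and observe which slots of the new state move. In this sketch the gates are held fixed for the step, `write` stands in for $i_{t} ⊙ \tilde{c}_{t}$ , and `W` is a hypothetical dense matrix representing the linear part of a vanilla RNN recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
f_t = rng.uniform(0, 1, n)
write = rng.uniform(-1, 1, n)   # stands in for i_t * c_tilde
W = rng.normal(size=(n, n))     # dense recurrence matrix (hypothetical)

c_prev = rng.normal(size=n)
bumped = c_prev.copy()
bumped[2] += 1.0                # perturb slot 2 only

# Element-wise recurrence: the change stays in slot 2.
delta_lstm = (f_t * bumped + write) - (f_t * c_prev + write)
# Matrix recurrence: the change spreads along column 2 of W.
delta_dense = (W @ bumped + write) - (W @ c_prev + write)

print(np.round(delta_lstm, 3))
print(np.round(delta_dense, 3))
```

With the element-wise product only slot 2 changes, by $f_{t}^{(2)}$ ; with the dense matrix every slot changes.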

A language-model example, continued

Returning to the example from the forget gate note: while reading “Alice was walking”, the input gate writes a new value into the “subject gender” slot of the cell state. Concretely, the candidate coordinate $\tilde{c}_{t}^{(k)}$ might take a value near $+ 1$ to encode “female subject”, and the input gate coordinate $i_{t}^{(k)}$ fires near $1$ to admit it. The forget gate has, at the same step, already cleared whatever was previously stored in slot $k$ , so the new gender simply replaces the old one. From that point until the gates clear or overwrite slot $k$ again, the value rides the cell-state highway untouched.

The crucial property here is selectivity. At the same time step, the input gate may write into slot $k$ while leaving slot $k'$ (say, “current verb tense”) completely alone, because $i_{t}^{(k')} \approx 0$ . A vanilla RNN cannot do this: every step overwrites every coordinate of its hidden state in lockstep.
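The selective write can be reproduced with two slots and idealised gate values. Everything in this sketch is hypothetical: the slot labels, the stored numbers, and gates set to exactly $0$ and $1$ (a sigmoid only approaches these values).

```python
import numpy as np

# Slot 0: "subject gender", slot 1: "verb tense" (hypothetical labels).
c_prev = np.array([-0.9, 0.6])    # old contents of the two slots
f_t = np.array([0.0, 1.0])        # clear slot 0, keep slot 1
i_t = np.array([1.0, 0.0])        # write into slot 0 only
c_tilde = np.array([0.95, -0.7])  # candidate proposes a value for both slots

c_t = f_t * c_prev + i_t * c_tilde
print(c_t)
```

Slot 0 now holds the new candidate value, while slot 1 keeps its old content even though the candidate proposed something for it.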

Mini-batch form

For a mini-batch of $B$ examples, vectors become matrices with one column per example. The shared weights are unchanged; the gate equations read

\tilde{C}_{t} = tanh (W_{x c} X_{t} + W_{h c} H_{t - 1} + b_{c}),
I_{t} = σ (W_{x i} X_{t} + W_{hi} H_{t - 1} + b_{i}),

with $\tilde{C}_{t}, I_{t} \in \mathbb{R}^{n_{neurons} \times B}$ , and $b_{c}, b_{i} \in \mathbb{R}^{n_{neurons}}$ broadcasting across columns. As always in the LSTM, the parameters are shared across both the batch dimension and the unrolled time dimension.
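A shape-level sketch of the mini-batch form, using the column-per-example layout of this note. Sizes and random parameters are illustrative assumptions. The biases are stored as $n_{neurons} \times 1$ columns so that NumPy broadcasts them across the $B$ columns; a flat vector of length $n_{neurons}$ would be aligned with the last axis instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_inputs, n_neurons, B = 3, 4, 5   # illustrative sizes

# Hypothetical shared parameters (same for every column of the batch).
W_xc, W_xi = rng.normal(size=(2, n_neurons, n_inputs))
W_hc, W_hi = rng.normal(size=(2, n_neurons, n_neurons))
b_c = np.zeros((n_neurons, 1))     # column vector: broadcasts across the B columns
b_i = np.zeros((n_neurons, 1))

X_t = rng.normal(size=(n_inputs, B))      # one column per example
H_prev = rng.normal(size=(n_neurons, B))

C_tilde = np.tanh(W_xc @ X_t + W_hc @ H_prev + b_c)
I_t = sigmoid(W_xi @ X_t + W_hi @ H_prev + b_i)

# Column j of the batched result equals the single-example computation.
j = 2
single = np.tanh(W_xc @ X_t[:, j] + W_hc @ H_prev[:, j] + b_c[:, 0])

print(C_tilde.shape, I_t.shape)
print(np.allclose(C_tilde[:, j], single))
```

Many libraries store batches with one row per example instead; the equations are then transposed, but the parameter sharing is the same.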

The last gate, the output gate, decides what fraction of the newly updated cell state $c_{t}$ becomes the externally visible hidden state $h_{t}$ .

Deep Learning: Zero to Hero
