Output Gate

After the forget gate has scaled the previous memory and the input gate has added new content, the cell state $c_{t}$ is complete: it holds everything the network has chosen to remember at time $t$ . The output gate answers the last remaining question: which parts of that memory should be exposed as the hidden state $h_{t}$ ?

This is the only gate whose target is not the cell state itself. Its job is to decide what the cell says to the rest of the world, while the cell state stays free to hold information that may not be needed right now but will be later.

Definition

The output gate is again a sigmoid layer of width $n_{neurons}$:

$o_{t} = \sigma(W_{x o} x_{t} + W_{h o} h_{t - 1} + b_{o}) \in (0, 1)^{n_{neurons}}.$

The hidden state is then read out from the cell state through a tanh and a pointwise product:

$h_{t} = o_{t} \odot \tanh(c_{t}) \in (-1, 1)^{n_{neurons}}.$
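The two equations above can be sketched in NumPy for a single example. This is a minimal illustration rather than a reference implementation: the function name `output_gate_step`, the sizes, and the random values are assumptions introduced here, not part of the note.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied entry by entry."""
    return 1.0 / (1.0 + np.exp(-z))

def output_gate_step(x_t, h_prev, c_t, W_xo, W_ho, b_o):
    """Return (o_t, h_t) for one example (hypothetical helper name)."""
    o_t = sigmoid(W_xo @ x_t + W_ho @ h_prev + b_o)  # gate values
    h_t = o_t * np.tanh(c_t)                          # pointwise read-out of the cell state
    return o_t, h_t

# Illustrative sizes and random values (assumptions, not taken from the note).
rng = np.random.default_rng(0)
n_inputs, n_neurons = 3, 4
W_xo = rng.normal(size=(n_neurons, n_inputs))
W_ho = rng.normal(size=(n_neurons, n_neurons))
b_o = np.zeros(n_neurons)
x_t = rng.normal(size=n_inputs)
h_prev = np.tanh(rng.normal(size=n_neurons))
c_t = 5.0 * rng.normal(size=n_neurons)  # the cell state is allowed to be large

o_t, h_t = output_gate_step(x_t, h_prev, c_t, W_xo, W_ho, b_o)
print("o_t =", o_t)
print("h_t =", h_t)
```

Note that `c_t` is an input here: the output gate reads the cell state but never writes to it.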

Dimensions used in this note

symbol	role	single example	mini-batch ( $B$ )
$x_{t}, X_{t}$	current input	$n_{inputs}$	$n_{inputs} \times B$
$h_{t - 1}, H_{t - 1}$	previous hidden state	$n_{neurons}$	$n_{neurons} \times B$
$c_{t}, C_{t}$	current cell state	$n_{neurons}$	$n_{neurons} \times B$
$h_{t}, H_{t}$	new hidden state	$n_{neurons}$	$n_{neurons} \times B$
$o_{t}, O_{t}$	output gate	$n_{neurons}$	$n_{neurons} \times B$
$W_{x o}$	input-to-hidden weights	$n_{neurons} \times n_{inputs}$	idem
$W_{h o}$	hidden-to-hidden weights	$n_{neurons} \times n_{neurons}$	idem
$b_{o}$	bias	$n_{neurons}$	$n_{neurons}$ (broadcast across columns)

The product $o_{t} ⊙ tanh (c_{t})$ acts on operands of identical shape: no broadcasting is involved in the $⊙$ .

Two things deserve attention.

A tanh is applied to $c_{t}$ before gating. The cell state lives in an unbounded space (its components are sums of bounded increments, but they can accumulate over many steps); the hidden state must be bounded, because it is fed back into the gates at the next step and into any downstream prediction head. The tanh is the bounding step. Importantly, it acts only on the read-out, not on the cell state itself, so the long-term memory remains in its native unbounded range.
The output gate $o_{t}$ then selects, coordinate by coordinate, how much of each bounded slot to expose. A value of $1$ exposes the slot in full; a value of $0$ hides it from view.
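To make the bounding step concrete, the following sketch lets one cell-state slot accumulate over many steps and reads it out through the tanh. The constants are hypothetical choices made for illustration: a forget gate fixed at 1, a constant increment of 0.9 per step, and an output gate fully open (the worst case for the magnitude of the hidden state).

```python
import numpy as np

c = 0.0   # one slot of the cell state
o = 1.0   # output gate fully open (hypothetical worst case)
cell_values, hidden_values = [], []

for _ in range(50):
    c = 1.0 * c + 0.9            # f_t = 1, i_t * g_t = 0.9 (illustrative constants)
    cell_values.append(c)
    hidden_values.append(o * np.tanh(c))

print("largest |c_t| :", max(abs(v) for v in cell_values))
print("largest |h_t| :", max(abs(v) for v in hidden_values))
```

The cell value grows roughly linearly with the number of steps, while the read-out never leaves the unit interval. Mathematically the read-out stays strictly inside $(-1, 1)$; in floating point it may round to exactly 1 for large $c_{t}$.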

The Jacobian of the read-out

Differentiating $h_{t} = o_{t} ⊙ tanh (c_{t})$ with respect to $c_{t}$ (with $o_{t}$ held fixed, since it depends on $x_{t}$ and $h_{t - 1}$ only) gives, component by component,
$\frac{\partial h _{t}^{(k)}}{\partial c _{t}^{(j)}} = o_{t}^{(k)} \cdot (1 - tanh^{2} c_{t}^{(k)}) δ_{jk},$
using $\frac{d}{d z} tanh (z) = 1 - tanh^{2} (z)$ and the coordinate independence of $⊙$ . Stacked into a matrix,
$\frac{\partial h _{t}}{\partial c _{t}} = diag (o_{t} ⊙ (1 - tanh^{2} c_{t})) .$
This Jacobian is again diagonal: the read-out couples each slot of $h_{t}$ to the same slot of $c_{t}$ , never to a different one. The coordinate independence that the cell state enjoys (see Cell state and Input gate) is therefore preserved by the read-out as well. Cross-coordinate mixing reappears only in the next step, when $h_{t}$ is consumed by the four affine maps to produce the next round of gates.
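The diagonal form can be checked numerically. The sketch below compares the analytic Jacobian with a central finite-difference estimate, holding $o_{t}$ fixed as in the derivation; the test values and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def readout(c_t, o_t):
    """h_t = o_t * tanh(c_t), with o_t treated as a constant."""
    return o_t * np.tanh(c_t)

def analytic_jacobian(c_t, o_t):
    """diag(o_t * (1 - tanh^2(c_t)))."""
    return np.diag(o_t * (1.0 - np.tanh(c_t) ** 2))

def numeric_jacobian(c_t, o_t, eps=1e-6):
    """Central finite differences, one column per perturbed coordinate of c_t."""
    n = c_t.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (readout(c_t + step, o_t) - readout(c_t - step, o_t)) / (2.0 * eps)
    return J

# Arbitrary test point (illustrative values).
c_t = np.array([-2.0, 0.0, 0.5, 3.0])
o_t = np.array([0.9, 0.1, 0.5, 0.99])

J_analytic = analytic_jacobian(c_t, o_t)
J_numeric = numeric_jacobian(c_t, o_t)
max_abs_diff = np.max(np.abs(J_analytic - J_numeric))
off_diagonal = J_numeric - np.diag(np.diag(J_numeric))

print("max |analytic - numeric| :", max_abs_diff)
print("max |off-diagonal entry| :", np.max(np.abs(off_diagonal)))
```

The off-diagonal entries of the numerical estimate vanish because perturbing slot $j$ of $c_{t}$ leaves every other slot of $h_{t}$ untouched.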

Cell state stores, hidden state speaks

The pair $(c_{t}, h_{t})$ implements a clean separation of concerns that no single-state recurrence can express:

$c_{t}$ is the archive: large, unbounded, persistent, and not directly observed by downstream layers. The network can use it to hold information that will only become relevant many steps later.

$h_{t}$ is the dispatch: a bounded, selected projection of the archive, customized at every step to the current task. It is what the next-step gates see and what any prediction head consumes.

The output gate is the interface between the two. It is what lets the LSTM remember privately and report selectively.

Why this separation matters

A vanilla RNN, and any single-state recurrence, must pack everything into one vector: features it currently needs to output, features it will need later but not now, and features it must condition future gating decisions on. These three roles place incompatible demands on the same representation:

Outputs benefit from being bounded (to keep prediction heads well-behaved).
Long-term memory benefits from being unbounded (to let gradient flow without saturation, as derived in Cell state).
Future gating decisions benefit from being selectively visible, not from seeing everything at once.

The output gate resolves all three. The cell state stays unbounded, so its gradient flows freely; the hidden state is bounded by tanh and selected by $o_{t}$ , so prediction heads and downstream gates see a clean, focused signal; and the network can choose, at every step, to keep some information dark by setting the corresponding component of $o_{t}$ near zero.

$h_{t}$ is a compressed view of $c_{t}$

In information-theoretic terms, $h_{t} = o_{t} ⊙ tanh (c_{t})$ is a lossy projection of the cell state. Setting some components of $o_{t}$ to zero discards that information from the externally visible signal, even though it remains intact inside $c_{t}$ . The next step still has access to it through the cell-state line, but no other component of the network does.

This anticipates the design philosophy of modern attention-based architectures: maintain a rich internal state, and emit at each step only the selected summary that downstream computation needs.

A language-model example, continued

Continuing the example from the input gate note: in the sentence beginning “Alice was walking and”, once “Alice” has been read the cell state contains, among other things, the gender of the subject in slot $k$. While processing the next, uninformative tokens (“was”, “walking”), that gender information is not yet needed for any prediction; it must be kept alive in $c_{t}$ but does not need to be exposed in $h_{t}$.

The output gate accomplishes this by setting $o_{t}^{(k)} \approx 0$ at those steps: the gender slot stays in the archive, invisible to the next-step gates and to the prediction head. When the model reaches the position immediately before “she”, where the pronoun must be predicted, the output gate fires $o_{t}^{(k)} \approx 1$ on the same slot: the gender becomes visible to the head, and the correct pronoun is generated. Throughout, the value in slot $k$ of $c_{t}$ has not changed; only its visibility has.

This is the kind of behaviour that no single-state recurrence can implement: the value and the visibility are entirely separate quantities, controlled by different parameters.
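The separation between value and visibility in this example can be mimicked with a single slot. In the sketch below the stored value `c_k` and the gate pre-activations are hypothetical numbers picked by hand, not quantities from a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

c_k = 2.0  # hypothetical value stored in slot k, held constant throughout

# Hand-picked pre-activations of the output gate for slot k (illustrative only).
steps = [
    ("earlier step", -6.0),             # gate nearly closed
    ("earlier step", -6.0),             # gate nearly closed
    ("step before the pronoun", 6.0),   # gate nearly open
]

visible = []
for label, z_k in steps:
    o_k = sigmoid(z_k)
    h_k = o_k * np.tanh(c_k)   # what the head and the next-step gates see
    visible.append(h_k)
    print(f"{label:>24}: c_k={c_k:.2f}  o_k={o_k:.3f}  h_k={h_k:.3f}")
```

The stored value is identical at all three steps; only the exposed component of the hidden state changes, from close to zero to close to $tanh (c_{t}^{(k)})$.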

Why a sigmoid, again

The output gate uses the same sigmoid as the forget and input gates, for the same three reasons given in Forget gate: bounded range $(0, 1)$, smooth differentiability, and monotonicity. The output gate is a mask, not content, so an output in $(0, 1)$ is the right range. Critically, $o_{t}$ multiplies $\tanh(c_{t})$ pointwise, exactly as $f_{t}$ multiplied $c_{t - 1}$: same mathematical structure, different operand, different role.

Mini-batch form

The vectorized form for a mini-batch of $B$ examples is, as always, obtained by replacing the per-example vectors by matrices whose columns are the per-example values:

$O_{t} = \sigma(W_{x o} X_{t} + W_{h o} H_{t - 1} + b_{o}), \qquad H_{t} = O_{t} \odot \tanh(C_{t}),$

with $O_{t}, H_{t} \in R^{n_{neurons} \times B}$ . The bias $b_{o}$ broadcasts across the $B$ columns of the affine result; the element-wise product $O_{t} ⊙ tanh (C_{t})$ acts on matrices of identical shape, with no broadcasting.
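A short NumPy sketch of the mini-batch form, under the column-per-example layout used in this note. The sizes and random values are assumptions. One practical detail is also an assumption of this sketch: the bias is stored with shape `(n_neurons, 1)`, because under NumPy's broadcasting rules a 1-D bias of length `n_neurons` would not line up with the columns of an `n_neurons × B` matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_inputs, n_neurons, B = 3, 4, 5   # illustrative sizes

W_xo = rng.normal(size=(n_neurons, n_inputs))
W_ho = rng.normal(size=(n_neurons, n_neurons))
b_o = rng.normal(size=(n_neurons, 1))              # column vector: broadcasts across the B columns
X_t = rng.normal(size=(n_inputs, B))               # one example per column
H_prev = np.tanh(rng.normal(size=(n_neurons, B)))
C_t = rng.normal(size=(n_neurons, B))

# Batched form: one matrix expression for all B examples.
O_t = sigmoid(W_xo @ X_t + W_ho @ H_prev + b_o)
H_t = O_t * np.tanh(C_t)                           # identical shapes, no broadcasting

# Per-example form, applied column by column, for comparison.
columns = []
for i in range(B):
    o_i = sigmoid(W_xo @ X_t[:, i] + W_ho @ H_prev[:, i] + b_o[:, 0])
    columns.append(o_i * np.tanh(C_t[:, i]))
H_cols = np.stack(columns, axis=1)

print("shapes:", O_t.shape, H_t.shape)
print("batched equals per-example:", np.allclose(H_t, H_cols))
```

The comparison confirms that the batched expression is the per-example formula applied independently to each column.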

All four gate operations are now defined. The note Putting all together assembles them into the complete forward pass of one LSTM cell.

Deep Learning: Zero to Hero
