Gated Recurrent Unit

The LSTM solves the vanishing-gradient barrier of vanilla RNNs at a precise cost: four affine maps per cell instead of one, and two separate states ( $c_{t}$ for storage, $h_{t}$ for read-out) instead of one. The Gated Recurrent Unit (Cho et al., 2014) asks a sharp question: how much of that machinery is actually load-bearing, and how much can be removed without losing the gradient-flow property that made the LSTM trainable in the first place?

The GRU answers by keeping the additive gated update that fixes the gradient, while compressing the rest:

one state ( $h_{t}$ ) instead of two;
three affine maps instead of four;
two gates instead of three.

The result is an architecture with $3/4$ the parameters of the LSTM at the same hidden width, comparable empirical performance on most sequence tasks, and a forward pass that reads as a single line of vector arithmetic.

Definition

Given $x_{t} \in \mathbb{R}^{n_{inputs}}$ and $h_{t - 1} \in \mathbb{R}^{n_{neurons}}$ , the GRU cell computes

$$
\begin{aligned}
r_{t} &= σ (W_{x r} x_{t} + W_{h r} h_{t - 1} + b_{r}) && \text{reset gate} \\
z_{t} &= σ (W_{x z} x_{t} + W_{h z} h_{t - 1} + b_{z}) && \text{update gate} \\
\tilde{h}_{t} &= \tanh (W_{x h} x_{t} + W_{hh} (r_{t} ⊙ h_{t - 1}) + b_{h}) && \text{candidate state} \\
h_{t} &= (1 - z_{t}) ⊙ h_{t - 1} + z_{t} ⊙ \tilde{h}_{t} && \text{hidden-state update}
\end{aligned}
$$

and returns $h_{t}$ .

There is no separate cell state and no output gate. The single hidden vector $h_{t}$ plays both the role that $c_{t}$ played in the LSTM (long-term memory) and the role that $h_{t}$ played there (externally visible output).
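The four equations translate almost line for line into code. The following is a minimal sketch for a single example, written with plain Python lists so that it has no dependencies; the helper names (`matvec`, `gru_cell`, `init_params`), the parameter dictionary layout, and the random initialization range are assumptions made for illustration, not part of the definition.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_cell(x, h_prev, p):
    """One GRU step for a single example; p maps names like 'W_xr' to weights."""
    def affine(gate, h_in):
        ax = matvec(p["W_x" + gate], x)
        ah = matvec(p["W_h" + gate], h_in)
        return [u + v + b for u, v, b in zip(ax, ah, p["b_" + gate])]

    r = [sigmoid(a) for a in affine("r", h_prev)]            # reset gate
    z = [sigmoid(a) for a in affine("z", h_prev)]            # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]           # r_t ⊙ h_{t-1}
    h_cand = [math.tanh(a) for a in affine("h", gated)]      # candidate state
    return [(1 - zi) * hp + zi * hc                          # convex update
            for zi, hp, hc in zip(z, h_prev, h_cand)]

def init_params(n_inputs, n_neurons, seed=0):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for gate in "rzh":
        p["W_x" + gate] = mat(n_neurons, n_inputs)
        p["W_h" + gate] = mat(n_neurons, n_neurons)
        p["b_" + gate] = [0.0] * n_neurons
    return p

params = init_params(n_inputs=2, n_neurons=3)
h = [0.0, 0.0, 0.0]                                          # h_0 = 0
for x in ([1.0, -0.5], [0.3, 0.8], [-1.2, 0.1]):             # a toy 3-step sequence
    h = gru_cell(x, h, params)
print(h)
```

The cell returns only the new hidden vector, which mirrors the point above: there is no second state to carry from step to step. An array library would replace the list comprehensions in practice.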

Dimensions used in this note

symbol	role	single example	mini-batch ( $B$ )
$x_{t}, X_{t}$	current input	$n_{inputs}$	$n_{inputs} \times B$
$h_{t - 1}, h_{t}, H_{t - 1}, H_{t}$	hidden state	$n_{neurons}$	$n_{neurons} \times B$
$r_{t}, R_{t}$	reset gate	$n_{neurons}$	$n_{neurons} \times B$
$z_{t}, Z_{t}$	update gate	$n_{neurons}$	$n_{neurons} \times B$
$\tilde{h}_{t}, \tilde{H}_{t}$	candidate state	$n_{neurons}$	$n_{neurons} \times B$
$W_{x ∙}$	input-to-hidden weights, $∙ \in \{r, z, h\}$	$n_{neurons} \times n_{inputs}$	idem
$W_{h ∙}$	hidden-to-hidden weights	$n_{neurons} \times n_{neurons}$	idem
$b_{∙}$	bias	$n_{neurons}$	$n_{neurons}$ (broadcast across columns)

All element-wise products $⊙$ in the cell act on operands of identical shape: no broadcasting is involved in the $⊙$ .

The architectural compression of the LSTM, in three moves

The GRU can be read most cleanly as three deliberate simplifications of the LSTM, each one motivated by a specific observation.

Move 1: drop the cell-state / hidden-state distinction

The LSTM maintains $c_{t}$ (unbounded, archive) and $h_{t}$ (bounded, dispatch) precisely because the two play incompatible roles inside a single vector. The GRU collapses them by bounding the hidden state directly: $h_{t}$ is now a convex combination of two bounded operands, $h_{t - 1}$ (already bounded by induction) and $\tilde{h}_{t} \in (- 1, 1)^{n_{neurons}}$ , so it stays bounded without an output gate or a final $tanh$ on the read-out.

This is a strict expressivity loss: the GRU cannot hold an unbounded private memory whose effect on downstream computation is independently masked. Empirically it matters less than the LSTM literature once suggested, but on tasks that benefit from “remember privately, report selectively” the LSTM still has the edge.

Move 2: replace forget + input by their convex combination

In the LSTM, the forget gate $f_{t}$ and the input gate $i_{t}$ are independent: the network can keep nothing and write nothing, keep everything and write a lot, or any combination of the two. The GRU forces $f_{t} + i_{t} = 1$ by reparameterizing both with a single update gate:

$$
f_{t} \leftrightarrow 1 - z_{t}, \qquad i_{t} \leftrightarrow z_{t} .
$$

Reading the update rule with these substitutions makes the parallel explicit:

$$
h_{t} = \underbrace{(1 - z_{t})}_{\text{forget}} ⊙ h_{t - 1} + \underbrace{z_{t}}_{\text{input}} ⊙ \tilde{h}_{t} .
$$

Convex combination is what bounds the GRU state

Because $z_{t} \in (0, 1)^{n_{neurons}}$ coordinate-wise, the update is a convex combination of $h_{t - 1}$ and $\tilde{h}_{t}$ at every coordinate. A convex combination of two values in $(- 1, 1)$ stays in $(- 1, 1)$ , so $∥ h_{t} ∥_{\infty} \leq 1$ for all $t$ by induction (starting from $h_{0} = 0$ ).

This is the structural reason the GRU does not need an output gate: boundedness is built into the recurrence. The price is that the network cannot keep one component of the past and add new content to that same component in the same step; the budget is fixed at $1$ per coordinate, and the update gate decides how to split it.
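The bound is easy to check numerically. The sketch below tracks one coordinate, assuming randomly drawn gate values and candidates as stand-ins for what a trained network would produce. For contrast, a second slot uses independent keep and write gates pinned at $1$ , a deliberately extreme LSTM-style setting chosen only for illustration.

```python
import math
import random

rng = random.Random(0)

h = 0.0            # GRU-style slot: keep + write = 1
c = 0.0            # LSTM-style slot with independent gates (contrast only)
max_abs_h = 0.0

for t in range(1000):
    z = rng.random()                          # update gate in [0, 1)
    cand = math.tanh(rng.uniform(-5.0, 5.0))  # candidate in (-1, 1)
    h = (1 - z) * h + z * cand                # convex combination
    max_abs_h = max(max_abs_h, abs(h))

    f, i = 1.0, 1.0                           # keep everything AND write fully
    c = f * c + i * 0.9                       # grows by 0.9 at every step

print(max_abs_h, c)
```

The GRU-style slot never leaves $(- 1, 1)$ , while the slot with untied gates grows linearly with the number of steps. That growth is what the LSTM tames with its output gate and read-out $tanh$ , and what the GRU rules out by construction.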

One knob instead of two: a fixed memory budget

With the update gate held fixed, $z_{t}^{(j)} = z^{(j)}$ , unrolling one coordinate of the update gives
$$
h_{t}^{(j)} = \sum_{k = 0}^{t - 1} (1 - z^{(j)})^{k} z^{(j)} \tilde{h}_{t - k}^{(j)} + (1 - z^{(j)})^{t} h_{0}^{(j)} .
$$
Each slot is an exponentially weighted moving average of its candidate history, the same object the LSTM cell state computes, with the decay $1 - z^{(j)}$ playing the role of the forget gate. The candidate from $k$ steps back enters with weight $(1 - z^{(j)})^{k} z^{(j)}$ ; together with the weight $(1 - z^{(j)})^{t}$ on the initial state $h_{0}^{(j)}$ , these weights sum to exactly $1$ (the convex combination of Move 2), so the slot is a genuine weighted average that stays bounded, with a memory horizon of about $1/ z^{(j)}$ steps. Under the identification $f \leftrightarrow 1 - z$ , this is exactly the LSTM horizon $1/ (1 - f)$ .

The difference is the constraint. The LSTM keeps with the gate $f_{t}$ and writes with an independent gate $i_{t}$ , so it can keep a slot in full and still write to it strongly in the same step. The GRU ties the two, $keep + write = 1$ per coordinate, giving each slot a fixed budget of one: remembering more of the past forces writing less of the present, and the reverse. That single constraint is the GRU’s central economy and its central limitation at once.
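The unrolled sum and the budget of one can both be verified directly. A small sketch for a single coordinate, assuming a fixed gate value $z = 0.2$ , an arbitrary initial state, and an arbitrary bounded candidate sequence (all three are illustrative choices):

```python
import math

z = 0.2                                              # fixed update gate
h0 = 0.5                                             # arbitrary initial state
cands = [math.sin(0.7 * t) for t in range(1, 31)]    # candidates for steps 1..30

# Step-by-step recurrence: h_t = (1 - z) h_{t-1} + z * candidate_t
h = h0
for cand in cands:
    h = (1 - z) * h + z * cand

# Unrolled form: the candidate from k steps back has weight (1 - z)^k * z
t = len(cands)
weights = [(1 - z) ** k * z for k in range(t)]
unrolled = sum(w * cands[t - 1 - k] for k, w in enumerate(weights))
unrolled += (1 - z) ** t * h0                        # leftover weight on h_0

total_weight = sum(weights) + (1 - z) ** t           # the fixed budget
horizon = 1 / z                                      # about 5 steps here

print(h, unrolled, total_weight, horizon)
```

The recurrence and the unrolled sum agree to rounding error, and the weights (including the residual weight on $h_{0}$ ) add up to one. Lowering $z$ lengthens the horizon, but only by shrinking the weight given to each new candidate.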

Move 3: the reset gate inside the candidate

The LSTM’s candidate $\tilde{c}_{t}$ depends on the full $h_{t - 1}$ . The GRU’s candidate $\tilde{h}_{t}$ depends on a gated version, $r_{t} ⊙ h_{t - 1}$ :

$$
\tilde{h}_{t} = \tanh (W_{x h} x_{t} + W_{hh} (r_{t} ⊙ h_{t - 1}) + b_{h}) .
$$

The reset gate $r_{t}$ lets the network selectively discard the past when computing fresh content. Two limiting cases clarify the design:

$r_{t} = 1$ : the candidate sees the full previous state. The update behaves like a soft interpolation between “keep the old memory” and “absorb a context-aware update”. This is the regime in which the GRU resembles a smoothed LSTM.
$r_{t} = 0$ : the candidate ignores the past entirely and becomes a pure function of the current input $x_{t}$ . The next state is then a convex blend of “the old state, untouched” and “a fresh hypothesis based only on what just arrived”.

The reset gate is therefore the GRU’s mechanism for breaking continuity when the past is no longer informative. The LSTM achieves the same effect indirectly through the forget gate closing on $c_{t - 1}$ ; the GRU exposes the decision as its own learned variable.
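The two limiting cases can be made concrete with a scalar candidate. In this sketch the reset gate is set by hand rather than computed from its own affine map, and the weights are hypothetical values picked for illustration.

```python
import math

def candidate(x, h_prev, r, w_xh, w_hh, b_h):
    """Scalar GRU candidate: tanh(w_xh * x + w_hh * (r * h_prev) + b_h)."""
    return math.tanh(w_xh * x + w_hh * (r * h_prev) + b_h)

w_xh, w_hh, b_h = 0.8, 1.5, 0.1     # hypothetical weights
x = 0.4

# r = 0: the candidate is the same whatever the previous state was
closed_a = candidate(x, -0.9, 0.0, w_xh, w_hh, b_h)
closed_b = candidate(x, 0.9, 0.0, w_xh, w_hh, b_h)

# r = 1: the candidate sees the full previous state
open_a = candidate(x, -0.9, 1.0, w_xh, w_hh, b_h)
open_b = candidate(x, 0.9, 1.0, w_xh, w_hh, b_h)

print(closed_a, closed_b)           # identical: pure function of x
print(open_a, open_b)               # different: context-aware
```

With the gate closed, two very different histories yield the same fresh hypothesis; with it open, the history shapes the candidate. Intermediate values of $r_{t}$ interpolate between the two per coordinate.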

The gradient picture

The whole reason the LSTM works on long sequences is the diagonal Jacobian $\partial c_{t} / \partial c_{t - 1} = diag (f_{t})$ derived in Cell state. The GRU inherits the same property, in a slightly more involved form, through the convex-combination update.

The direct contribution to $\partial h_{t} / \partial h_{t - 1}$ , holding the gates and the candidate fixed, is

$$
\boxed{ \left. \frac{\partial h_{t}}{\partial h_{t - 1}} \right|_{\text{direct}} = \operatorname{diag} (1 - z_{t}) }
$$

This is the GRU’s analogue of the constant error carousel: when $z_{t} \approx 0$ , the recurrent Jacobian on the direct path is close to the identity, so gradient flows back across the step essentially undisturbed. When $z_{t} \approx 1$ , the past is overwritten and the gradient stops.

The full Jacobian, with all indirect paths

The hidden state $h_{t}$ depends on $h_{t - 1}$ along three routes, two of which the boxed expression deliberately ignores:

1. The direct path through $(1 - z_{t}) ⊙ h_{t - 1}$ . Differentiating with respect to $h_{t - 1}$ at fixed $z_{t}$ and using the coordinate independence of $⊙$ gives $diag (1 - z_{t})$ .

2. The gate path through $z_{t} (x_{t}, h_{t - 1})$ : a change in $h_{t - 1}$ shifts $z_{t}$ , which shifts the convex weighting. This contributes a term proportional to $(\tilde{h}_{t} - h_{t - 1}) ⊙ σ^{'} (\cdot) \cdot W_{h z}$ .

3. The candidate path through $\tilde{h}_{t} (x_{t}, h_{t - 1})$ , mediated by the reset gate $r_{t}$ : this contributes a term proportional to $z_{t} ⊙ tanh^{'} (\cdot) \cdot W_{hh} (diag (r_{t}) + (reset-gate path))$ .

Paths 2 and 3 cross the recurrent weight matrices $W_{h z}, W_{hh}, W_{h r}$ , exactly the dense-matrix mechanism that produces vanishing/exploding gradients in the vanilla RNN. As in the LSTM, these are the short-range corrections; the long-range backbone is the direct path with diagonal Jacobian $diag (1 - z_{t})$ .

The take-away is the same two-track gradient flow described in Gradient in LSTM: a long-range, well-behaved highway through the gated convex combination, and a short-range, locally-corrective channel through the gates themselves.
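In the scalar case ( $n_{neurons} = n_{inputs} = 1$ ) the three paths reduce to three numbers that can be added and compared against a finite-difference derivative. The sketch below does this with hypothetical weights; the negative update-gate bias is an assumption chosen to keep $z_{t}$ small, so that the direct path dominates.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical scalar GRU parameters.
P = dict(w_xr=0.5, w_hr=-0.7, b_r=0.1,
         w_xz=0.3, w_hz=0.9, b_z=-2.0,     # negative bias keeps z small
         w_xh=1.1, w_hh=0.6, b_h=0.0)

def step(x, h, p=P):
    """Scalar GRU step; returns the new state and (r, z, candidate)."""
    r = sigmoid(p["w_xr"] * x + p["w_hr"] * h + p["b_r"])
    z = sigmoid(p["w_xz"] * x + p["w_hz"] * h + p["b_z"])
    c = math.tanh(p["w_xh"] * x + p["w_hh"] * (r * h) + p["b_h"])
    return (1 - z) * h + z * c, (r, z, c)

x, h = 0.4, 0.3
_, (r, z, c) = step(x, h)

# Path 1: direct carry.
direct = 1 - z
# Path 2: through the update gate, using sigma'(a) = z (1 - z).
gate_path = (c - h) * z * (1 - z) * P["w_hz"]
# Path 3: through the candidate, using tanh'(a) = 1 - c^2; the inner bracket
# is d(r * h)/dh = r + h * r (1 - r) * w_hr (the reset-gate path).
cand_path = z * (1 - c * c) * P["w_hh"] * (r + h * r * (1 - r) * P["w_hr"])
analytic = direct + gate_path + cand_path

# Central finite difference of the full step with respect to h.
eps = 1e-6
numeric = (step(x, h + eps)[0] - step(x, h - eps)[0]) / (2 * eps)

print(direct, gate_path, cand_path, analytic, numeric)
```

The sum of the three paths matches the numerical derivative, and with a small update gate most of it comes from the direct term $1 - z_{t}$ . The other two terms each carry a recurrent weight, which is the scalar shadow of the dense matrices on paths 2 and 3.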

The GRU as a learned gated residual

A small rearrangement of the update equation exposes the deepest structural reading of the GRU:

$$
h_{t} = (1 - z_{t}) ⊙ h_{t - 1} + z_{t} ⊙ \tilde{h}_{t} = h_{t - 1} + z_{t} ⊙ (\tilde{h}_{t} - h_{t - 1}) .
$$

In this form the GRU is a residual stream $h_{t - 1}$ to which a learned, gated increment $z_{t} ⊙ (\tilde{h}_{t} - h_{t - 1})$ is added at every step. The update gate $z_{t}$ is exactly the learned gate on the residual, and $\tilde{h}_{t} - h_{t - 1}$ is the candidate direction of movement in state space.

Recurrence as ResNet, made explicit

The form $h_{t} = h_{t - 1} + z_{t} ⊙ Δ_{t}$ with $Δ_{t} = \tilde{h}_{t} - h_{t - 1}$ is the temporal analogue of a ResNet block: an identity carry plus a learned, gated additive perturbation. The Jacobian on the identity carry is $I$ ; the Jacobian on the additive perturbation is small whenever $z_{t}$ is small.

The LSTM cell state is the same idea expressed differently: the carry there is $f_{t} ⊙ c_{t - 1}$ rather than $c_{t - 1}$ , but the structural intent is identical. Three gated architectures (the LSTM of 1997, the GRU of 2014, and the Highway Networks of Srivastava et al., 2015, which apply the same gating across the layers of a deep feedforward network) all settled on gated additive updates around an identity backbone, the LSTM nearly two decades before residual connections (He et al., 2015) were introduced in feedforward networks. The fact that all three converged on the same shape is not a coincidence: it is the simplest known shape that fixes the vanishing-gradient barrier without imposing a knife-edge on the recurrent spectrum, the vanilla RNN’s requirement that the recurrent weights’ eigenvalues sit right at $1$ for gradients to neither vanish nor explode.

Parameter count

Each of the three affine maps (reset, update, candidate) has the same shape as a single vanilla-RNN affine map:

$$
P_{GRU} = 3 (n_{neurons} n_{inputs} + n_{neurons}^{2} + n_{neurons}) .
$$

Direct comparison with LSTM and vanilla RNN

architecture	affine maps	parameters	ratio
vanilla RNN	1	$n_{neurons} n_{inputs} + n_{neurons}^{2} + n_{neurons}$	$1 \times$
GRU	3	$3 (n_{neurons} n_{inputs} + n_{neurons}^{2} + n_{neurons})$	$3 \times$
LSTM	4	$4 (n_{neurons} n_{inputs} + n_{neurons}^{2} + n_{neurons})$	$4 \times$

At the same hidden width $n_{neurons}$ , the GRU has $3/4$ the parameters of an LSTM and does roughly $3/4$ of the arithmetic per step. Comparing at the same parameter budget is the fairer benchmark: when the $n_{neurons}^{2}$ term dominates, that means a GRU with hidden width $n_{neurons}$ against an LSTM with hidden width $n_{neurons} \sqrt{3/4} \approx 0.87 n_{neurons}$ . On most sequence tasks the two architectures then perform comparably, with differences that fall within experimental noise.
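The counts are simple enough to compute by hand, but a short helper makes the comparison concrete. The function name `recurrent_params` and the example sizes (128 inputs, 256 hidden units) are assumptions chosen for illustration.

```python
import math

def recurrent_params(n_inputs, n_neurons, n_maps):
    """Parameters of a cell built from n_maps affine maps of vanilla-RNN shape."""
    return n_maps * (n_neurons * n_inputs + n_neurons ** 2 + n_neurons)

n_inputs, n_neurons = 128, 256

p_rnn = recurrent_params(n_inputs, n_neurons, 1)     # 98560
p_gru = recurrent_params(n_inputs, n_neurons, 3)     # 295680
p_lstm = recurrent_params(n_inputs, n_neurons, 4)    # 394240

# LSTM width that roughly matches the GRU budget when n_neurons^2 dominates.
lstm_width = round(n_neurons * math.sqrt(3 / 4))     # 222
p_lstm_matched = recurrent_params(n_inputs, lstm_width, 4)

print(p_rnn, p_gru, p_lstm, p_gru / p_lstm)
print(lstm_width, p_lstm_matched, p_lstm_matched / p_gru)
```

The $3/4$ ratio at equal width is exact. The matched-width rule is only approximate: the input term $n_{neurons} n_{inputs}$ scales linearly rather than quadratically in the width, so with these sizes the narrower LSTM still ends up a few percent above the GRU's budget.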

Mini-batch form

For a mini-batch of $B$ examples, per-example vectors become matrices with one column per example. Weights are shared across the batch (and across the unrolled time axis), so their shapes are unchanged. The cell becomes

$$
\begin{aligned}
R_{t} &= σ (W_{x r} X_{t} + W_{h r} H_{t - 1} + b_{r}), \\
Z_{t} &= σ (W_{x z} X_{t} + W_{h z} H_{t - 1} + b_{z}), \\
\tilde{H}_{t} &= \tanh (W_{x h} X_{t} + W_{hh} (R_{t} ⊙ H_{t - 1}) + b_{h}), \\
H_{t} &= (1 - Z_{t}) ⊙ H_{t - 1} + Z_{t} ⊙ \tilde{H}_{t} .
\end{aligned}
$$

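The biases broadcast across the $B$ columns; the element-wise products act on matrices of identical shape with no broadcasting. The input-side products $W_{x r} X_{t}$ , $W_{x z} X_{t}$ , $W_{x h} X_{t}$ can be fused into a single matrix multiplication (one GEMM) by stacking the weights, as in the LSTM implementation note, and so can the hidden-side products of the two gates. The candidate's hidden-side product $W_{hh} (R_{t} ⊙ H_{t - 1})$ needs $R_{t}$ first, so in this formulation it is a second, dependent GEMM. The initial state is conventionally $H_{0} = 0$ .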
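Fusing the input-side products amounts to stacking the three weight blocks row-wise, multiplying once, and slicing the result. The sketch below shows this with plain Python lists and hypothetical weights; the `matmul` helper is a stand-in for a real GEMM. It covers only the input side, since in the formulation used here the candidate's hidden-side product depends on $R_{t}$ and has to be computed afterwards.

```python
def matmul(A, B):
    """(m x k) times (k x n), both stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

n_inputs, n_neurons, batch = 2, 3, 4

# Hypothetical input-to-hidden weights, each n_neurons x n_inputs.
W_xr = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W_xz = [[-0.1, 0.0], [0.2, -0.3], [0.4, 0.1]]
W_xh = [[0.7, -0.2], [0.0, 0.9], [-0.5, 0.3]]

# Mini-batch input: n_inputs x batch, one column per example.
X = [[1.0, 0.0, -1.0, 0.5],
     [0.5, 2.0, 0.0, -1.5]]

# Fused: stack the weight blocks, multiply once, split the rows back out.
W_x = W_xr + W_xz + W_xh                         # (3 * n_neurons) x n_inputs
A = matmul(W_x, X)                               # (3 * n_neurons) x batch
A_r = A[:n_neurons]                              # pre-activation share of R_t
A_z = A[n_neurons:2 * n_neurons]                 # pre-activation share of Z_t
A_h = A[2 * n_neurons:]                          # pre-activation share of the candidate

print(len(A), len(A[0]))                         # 9 4
```

Each slice equals the product of its own weight block with $X_{t}$ , so the fused call changes the scheduling and not the arithmetic. Because $X_{t}$ does not depend on the recurrence, the same trick can be applied to all time steps of a sequence at once before the loop over $t$ starts.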

When to prefer GRU over LSTM

The decision is mostly empirical, but the architectural differences point to a useful default.

Prefer GRU when the model needs to be small or fast (mobile, real-time, edge deployment), when the dataset is moderate and the regularization budget is tight ( $25\%$ fewer parameters means $25\%$ less to overfit with), and when the task has no obvious benefit from separating storage from read-out.
Prefer LSTM when the task needs the network to keep information in memory without exposing it in its output. This is what “holding privately” means: the cell state $c_{t}$ can store a value for many steps while the output gate keeps it out of the visible hidden state $h_{t}$ , so the network can remember something now, emit unrelated outputs in the meantime, and surface it only when it becomes relevant. The GRU cannot separate the two, because its single state is at once the memory and the output. Prefer LSTM also when the dataset is large enough to train the extra parameters, and when fine-grained, independent control of forgetting versus writing matters.

In practice, on most tasks of moderate complexity, careful tuning matters far more than the choice between LSTM and GRU. Both have been overshadowed for large-scale sequence modelling by attention-based architectures, which dispense with the recurrent bottleneck entirely; the next section develops that idea.

Brief history

The GRU was introduced by Cho et al. in 2014 in the context of statistical machine translation, inside the very paper that introduced the RNN encoder-decoder architecture (the name sequence to sequence learning arrived months later, with Sutskever et al.’s LSTM-based scaling of the same design). The motivation was practical: encoder and decoder both needed a recurrent cell that trained reliably on long sequences but cost less than an LSTM, since translation models of the time were already large. The convex-combination update was inspired by earlier work on gated recurrent networks, but the specific form (one update gate doubling as forget+input, plus one reset gate inside the candidate) is the GRU’s own.

The architecture has aged well. It is still a standard recurrent cell in the major deep-learning libraries, still competitive with LSTM on most benchmarks where pure recurrence is appropriate, and still the cleanest single-line illustration of how a recurrent network can be both stateful and well-behaved under backpropagation through time.
