Encoder-decoder RNNs

Abstract

The taxonomy of sequential problems ends with its hardest entry: the unaligned many-to-many task, where a whole input sequence must be mapped to a whole output sequence of different length, with no correspondence between positions. Machine translation is the canonical case. The encoder-decoder architecture, better known as seq2seq (Sutskever, Vinyals, and Le, 2014), solves it with a division of labour:

one RNN, the encoder, reads the entire input and compresses it into a single context vector;

a second RNN, the decoder, writes the output one word at a time, conditioned on that vector.

The design powered Google Translate’s Neural Machine Translation system in 2016 and established a schema, encode then decode, that survives unchanged in today’s Transformers. It also carries a built-in flaw, a fixed-width bottleneck between the two halves, and that flaw is precisely what the attention mechanism of the next section was invented to fix.

The completed recurrent toolkit

This note is where the section’s pieces come together. Each of its four developments is a structural modification, not a tuning trick, adding one architectural lever at one precise cost:

the GRU trades the LSTM’s separation of storage from read-out for a single bounded state and fewer parameters;

stacking trades per-step single-cell expressivity for compositional, deep features, at the price of gradients that must now survive depth as well as time;

bidirectionality trades online and autoregressive applicability for a representation that depends on the entire sequence;

the encoder-decoder of this note trades emitting-while-reading for the ability to handle unaligned input-output pairs, at the price of squeezing the whole input through a fixed-width context vector.

The first three are mutually orthogonal: any cell (vanilla, LSTM, GRU) can be stacked, made bidirectional, or both. The encoder-decoder composes whichever combination the task affords; a stacked bidirectional LSTM was the standard encoder, and remains a strong baseline for sequence tagging.

The problem: sequences in, sequences out, no alignment

Every recurrent architecture met so far emits its predictions while reading. In an aligned many-to-many task this is exactly right: a single RNN reads $x_{t}$ , updates its state, and emits $\hat{y}_{t}$ , one output per input position. Two assumptions hide in that loop: the output has the same length as the input, and the $t$ -th output depends only on what has been read up to $t$ .

Translation violates both, and it is worth seeing exactly how, because the architecture of this note falls out of the failure analysis.

Lengths differ. “Are you free tomorrow?” has four words; its Italian translation “Sei libero domani?” has three. No one-output-per-input loop can produce this; the model needs to choose its own output length, and that length is only known once the whole input is understood.
No position-wise correspondence. Run the same example in reverse, from Italian to English. The very first English word, “Are” rather than “You”, encodes the fact that the sentence is a question, and in the Italian source that fact is carried by the final question mark: “Sei libero domani?” and “Sei libero domani.” differ only at the last position. A model that must emit the first output element after reading the first input element would have to commit to “Are” before the evidence for it has arrived.

The second point is the decisive one, and it proves more than it seems to: it is not that emitting-while-reading is inefficient for translation, it is that no such model can be correct, because the information needed for early outputs can sit arbitrarily late in the input. The conclusion is forced:

The architecture is one sentence made into wiring

The network must finish reading before it starts writing. Everything in this note is that sentence turned into architecture: one recurrent network whose only job is to read, one whose only job is to write, and a single vector handed from the first to the second.

The idea: two networks in series

The encoder-decoder (or sequence-to-sequence, seq2seq) architecture splits the task between two recurrent networks with separate parameters and asymmetric roles:

an encoder consumes the entire input sequence $x_{1 : T}$ and produces no outputs at all; its terminal state becomes the context vector $c$ , a fixed-shape summary of everything read;
a decoder receives $c$ as its initial state and generates the output sequence $\hat{y}_{1 : S}$ one element at a time, feeding each emitted element back in as its next input, until it decides to stop.

In the figure, an English sentence enters the green encoder cells (here LSTM cells, though the schema works with vanilla or GRU cells equally well), the orange context vector crosses the boundary, and the blue decoder cells emit the Italian translation word by word.

One architecture, discovered twice in one year

The design appeared in two independent 2014 papers, then reached production two years later:

Cho et al. (2014) coined the name RNN encoder-decoder and, inside that very paper, introduced the GRU cell as the recurrent unit to power it.

Sutskever, Vinyals, and Le (2014) scaled the idea up with deep LSTMs, called it sequence to sequence learning, and showed that one network trained end to end could match the mature statistical translation technology of the day, the so-called phrase-based systems assembled from word alignments and separately engineered components.

Google’s Neural Machine Translation (2016) brought the lineage to production: an encoder-decoder built from stacked and bidirectional LSTM layers, the two refinements of this section, plus the attentional interface of the next one.

Notation used in this note

symbol role shape
$x_{1 : T}$ input sequence, of length $T$ $T$ elements of dimension $n_{in}$
$h_{t}$ encoder hidden state at step $t$ $n$
$c$ context vector $n$
$s_{j}$ decoder hidden state at step $j$ $n$
$\hat{y}_{1 : S}$ output sequence, of length $S$ decided by the decoder $S$ elements
$\hat{y}_{j}$ word emitted at step $j$ , one entry of the vocabulary $V$ —
$p_{j}$ predicted distribution over $V$ at step $j$ $∣ V ∣$
$W$ decoder output projection $∣ V ∣ \times n$

Encoder and decoder are written with the same state width $n$ , which is what lets $s_{0} = c$ be a plain assignment; nothing forbids different widths, but then a learned projection must bridge the two.

symbol	role	shape
$x_{1 : T}$	input sequence, of length $T$	$T$ elements of dimension $n_{in}$
$h_{t}$	encoder hidden state at step $t$	$n$
$c$	context vector	$n$
$s_{j}$	decoder hidden state at step $j$	$n$
$\hat{y}_{1 : S}$	output sequence, of length $S$ decided by the decoder	$S$ elements
$\hat{y}_{j}$	word emitted at step $j$ , one entry of the vocabulary $V$	—
$p_{j}$	predicted distribution over $V$ at step $j$	$∣ V ∣$
$W$	decoder output projection	$∣ V ∣ \times n$

The encoder: a feature extractor for sequences

The encoder is an ordinary recurrent network running the familiar recurrence,

h_{t} = f (x_{t}, h_{t - 1}), t = 1, \dots, T, h_{0} = 0,

with one peculiarity: its outputs are ignored. No prediction head, no loss attached to any of its steps. Its entire purpose is the side effect of recurrence, the accumulation of the input’s content into the hidden state, and the only thing it hands onward is its final state:

c = h_{T} .

This $c$ is the context vector: a single, fixed-length vector that stands for the whole variable-length input.

Don't confuse $c$ with the figure's $C_{t}$

Three notational near-collisions surround the figure, worth defusing one by one:

the note’s $c$ , lowercase, is the context vector, the summary handed to the decoder;

the figure’s $C_{t}$ , uppercase, is the LSTM cell state, the cell’s internal memory lane, unrelated to the context vector beyond the unlucky choice of letter;

the figure writes $t$ for its final step, where this note writes $T$ .

The reconciliation happens at the hand-off. With vanilla or GRU cells the encoder’s full state is one vector, so $c = h_{T}$ plainly. With LSTM cells the full state is the pair of hidden and cell state, the decoder needs both to be initialized, and the box labelled “context vector” in the figure accordingly receives the pair $(h_{T}, C_{T})$ , its two incoming arrows.

The context vector is a sequence feature vector

The right mental model comes from vision. A CNN backbone is a learnable feature extractor: it maps a raw image to a feature vector that a small head then classifies. The encoder is the same object for sequences: it maps a raw input sequence to a sequence feature vector, learned end to end so that it contains whatever the decoder will need.

Nothing about $c$ is hand-designed; “summarize the sentence” is not programmed in. It is the behaviour the encoder is forced to learn, because $c$ is the only channel through which the loss, computed entirely on the decoder’s side, can reach it.

Notice what has been silently discarded: every intermediate state $h_{1}, \dots, h_{T - 1}$ . For now this looks like tidiness, keeping only the state that has read everything. Hold the thought; it returns at the end of the note as the architecture’s central weakness.

The decoder: a conditional language model

The decoder is a second recurrent network, with its own parameters, whose initial state is the context vector:

s_{0} = c, s_{j} = g (\hat{y}_{j - 1}, s_{j - 1}), p_{j} = softmax (W s_{j}),

for $j = 1, \dots, S$ . One pass of the loop does, in order:

consume the previously emitted word $\hat{y}_{j - 1}$ , represented numerically just as the input elements are, through a learned lookup;
update the hidden state to $s_{j}$ ;
produce, through the linear layer $W$ and a softmax, the probability distribution $p_{j}$ over the entire output vocabulary $V$ , the finite list of words and symbols the model is allowed to emit;
choose one entry from that distribution: it becomes the emitted word $\hat{y}_{j}$ , fed back as the next input.

Two special symbols, added to the vocabulary on purpose, make the loop well defined, and the second of them solves a problem quietly:

<START> is the conventional first input, needed because at $j = 1$ nothing has been emitted yet;
<END> is a vocabulary entry like any other, and when the decoder emits it, generation stops. The output length $S$ is therefore nowhere chosen by the designer: the model predicts when it is finished. The variable-length-output problem from the top of the note is solved not by machinery but by making length itself a prediction.

"Chosen" how? Greedy decoding and beam search

Step 4 hides a real decision, because the feedback loop makes every choice condition everything generated after it. The options:

Greedy decoding emits, at every step, the single most probable word. It is subtly suboptimal: the sequence of locally best words need not be the globally most probable sequence, and an early, individually safe choice can force the decoder into continuations it assigns low probability.

Exact search for the most probable output is unreachable in practice: the candidate sequences grow as $∣ V ∣^{S}$ .

Beam search is the production compromise: keep the $k$ most probable partial outputs alive at every step, extend each by one word, prune back to $k$ . Greedy decoding is the special case $k = 1$ .

Sampling from the distribution instead of maximizing it trades fidelity for variety.

What the decoder really is

Read the recurrence again with probabilistic eyes. At step $j$ the state $s_{j}$ is a function of the context $c$ and of every word generated so far, so the distribution the decoder emits is $p (y_{j} ∣ y_{1 : j - 1}, c)$ , and the product over steps reconstructs, by the chain rule of probability (the identity that factorizes any joint distribution into a product of conditionals, one per element, recapped at the start of the entropy chain-rule note), a model of the whole output sequence given the input:
$p (y_{1 : S} ∣ x_{1 : T}) = j = 1 \prod S p (y_{j} ∣ y_{1 : j - 1}, c) .$
A network that models $\prod_{j} p (y_{j} ∣ y_{1 : j - 1})$ is a language model; the decoder is exactly that, with one addition, the conditioning on $c$ .

Seq2seq is therefore a language model that has been given something to talk about: the output is shaped jointly by the input’s features (through $c$ ) and by the output’s own past (through the fed-back words), which is precisely the double conditioning the figure annotates. This reading is why the same decoder schema, conditioned on ever richer contexts, scales from translation to summarization to speech recognition without structural change.

Training versus inference: teacher forcing

At inference time the loop is autoregressive: each prediction is fed back as the next input, so the model walks through output space on its own footprints. Nothing else is available, since the true translation is exactly what is being computed.

At training time the true output sequence is available, and feeding back the model’s own predictions would be both slow to learn from and unstable: early in training the predictions are noise, so every step after the first would be conditioned on garbage, and the gradient signal for “what should follow $y_{1 : j - 1}$ ” would be diluted by the model never actually seeing the correct prefix.

The standard remedy is teacher forcing: at every step, the decoder’s input is the ground-truth previous word, regardless of what the model just predicted. The training signal is then assembled as usual:

the loss is the sum, over output positions, of the cross-entropy between the predicted distribution and the correct word at that position;
the gradient flows by backpropagation through time across the decoder, through $c$ , and on through the entire encoder: one computational graph, trained end to end.

Exposure bias: the price of teacher forcing

Teacher forcing trains the decoder exclusively on gold prefixes: at step $j$ , the $j - 1$ words the decoder has consumed so far are the reference translation’s, not its own (gold, as in gold standard, names exactly this, a prefix copied from the ground truth). At deployment it runs on its own prefixes instead; the model is never exposed, during training, to the distribution it will actually face.

The consequence is that errors compound: one early mistake at test time hands the next step a prefix unlike anything seen in training, which invites a second mistake. This train-test mismatch is called exposure bias, and the two regimes named in the figure are the standard responses to it:

strict teacher forcing always feeds the ground truth, and simply accepts the bias;

epsilon-greedy teacher forcing (in the literature, scheduled sampling, Bengio et al., 2015) feeds the ground truth with a probability that decays over training, weaning the decoder onto its own predictions so that the training-time distribution drifts toward the inference-time one.

The bottleneck that attention will break

Now collect the thought left hanging in the encoder section. Every fact the decoder will ever know about the input must pass through $c$ : a fixed-width vector, sized at design time, identical in capacity whether the input is a four-word greeting or a four-hundred-word legal clause. The architecture asks one vector to be a lossless summary of an unboundedly long sequence, and for long inputs that is not an engineering shortfall but an information-theoretic impossibility.

The failure even has a direction. As the limitations of recurrence established, a hidden state is a leaky accumulator: $h_{T}$ holds the recent past vividly and the distant past faintly, so $c$ systematically under-represents the beginning of the input.

The 2014 seq2seq paper contains a telling confession on this point: Sutskever et al. found translation improved markedly if the source sentence was fed in reversed, so that its first words, which tend to matter most for starting the output, landed last and freshest in the encoder’s memory. A trick that crude, working that well, is a symptom pointing straight at the disease, and Cho et al. measured the disease directly: translation quality falls off as source sentences grow longer.

The fix is already visible from here

Look at what the architecture throws away: the intermediate states $h_{1}, \dots, h_{T - 1}$ , each a fresh, position-specific reading of one part of the input, all deleted in favour of the single terminal state. The repair, in hindsight, almost states itself:

keep all the encoder states;

at each decoding step, let the decoder compute a content-dependent weighted summary of them, a different, custom-built context for every output position instead of one frozen $c$ for all of them.

Half of the wiring already exists: Cho’s variant of the architecture re-fed the same $c$ to the decoder at every step, and attention keeps that schema, merely upgrading the constant $c$ to a step-specific $c_{j}$ , rebuilt from all the encoder states for each output position.

Building that mechanism, the scoring, the softmax over positions, the weighted sum, is the business of the next section, where soft attention is constructed from first principles and then mounted onto exactly this architecture as the attentional interface that powered Google’s NMT. The encoder-decoder of this note is not made obsolete by attention; it is the chassis attention bolts onto.

From theory to code

Everything in this note, plus its repair, exists as runnable code: the hands-on translation note builds an Italian-English neural machine translation system line by line in PyTorch, with a bidirectional LSTM encoder, an autoregressive LSTM decoder with configurable teacher forcing, additive attention over the encoder states, and BLEU evaluation, the standard automatic score for translation quality. Read after this note, every architectural choice in that code should feel inevitable rather than conventional.

Recap

The problem. The unaligned many-to-many task forbids emitting outputs while reading, because early outputs can depend on late inputs; the encoder-decoder architecture is that constraint built into wiring.

The encoder reads everything and compresses it into the context vector, a learned sequence feature vector playing the role a CNN’s feature vector plays for images.

The decoder, initialized with that vector, runs as a conditional language model: one word per step, fed back, with the <END> symbol letting the model decide its own stopping point.

Training uses teacher forcing (gold prefixes, summed per-step cross-entropy, BPTT through both networks), at the cost of exposure bias, which scheduled sampling mitigates.

The flaw is load-bearing. The fixed-width bottleneck, with its recency bias, is the precise problem whose solution, attention, opens the next section; the encode-then-decode schema itself outlived its recurrent parts to become the skeleton of the Transformer.

Deep Learning: Zero to Hero

Explorer

Encoder-decoder RNNs

The problem: sequences in, sequences out, no alignment

The idea: two networks in series

The encoder: a feature extractor for sequences

The decoder: a conditional language model

Training versus inference: teacher forcing

The bottleneck that attention will break

Graph View

Table of Contents

Backlinks