Attentional Interface

What this note builds

The additive attention note built the score-softmax-combine pipeline for one decoding step. A translator, though, emits a whole sentence, so the real object is that pipeline run once per output word, wired permanently between the two networks of an encoder-decoder. This note follows the worked translation “Are you free tomorrow?” → “Sei libero domani?” across three steps and makes visible what a single-step formula hides: the encoder is read once, the decoder re-queries it at every step, and the attention shifts, so that a different slice of the source lights up for each word produced. Seeing the loop turns attention from a formula into an interface: the channel through which a decoder reads, on demand, from an encoder’s memory.

From one formula to a standing connection

Recall the result of the previous note. At a fixed decoding step $t$, with the decoder holding state $h_{t}^{D}$ and the encoder having left behind annotations $\{h_{i}^{E}\}_{i = 1}^{K}$, additive attention produces

$$α_{t, i} = \mathrm{softmax}_{i}\, s (h_{i}^{E}, h_{t}^{D}), \qquad c_{t} = \sum_{i = 1}^{K} α_{t, i}\, h_{i}^{E},$$

a context vector rebuilt for that one step. (Throughout, the formulas follow the figures’ colours: encoder states in green, the decoder query in blue, and attention quantities in purple.) The attentional interface is this block, installed as a permanent layer between encoder and decoder and evaluated afresh every time the decoder takes a step.
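As a minimal sketch of the block just described, the following pure-Python code runs one score-softmax-combine step. The parameter names `W_e`, `W_d` and `v`, the helper names, and every numeric value are assumptions made for illustration; the note itself fixes only the shape of the computation (a tanh of summed projections, a softmax, a weighted sum).

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_score(h_enc, h_dec, W_e, W_d, v):
    """Assumed form of the score: v . tanh(W_e h_i^E + W_d h_t^D)."""
    summed = zip(matvec(W_e, h_enc), matvec(W_d, h_dec))
    hidden = [math.tanh(a + b) for a, b in summed]
    return sum(vj * hj for vj, hj in zip(v, hidden))

def attend(annotations, h_dec, W_e, W_d, v):
    """Return (weights alpha_t, context c_t) for one decoding step."""
    scores = [additive_score(h, h_dec, W_e, W_d, v) for h in annotations]
    alpha = softmax(scores)
    dim = len(annotations[0])
    context = [sum(a * h[j] for a, h in zip(alpha, annotations)) for j in range(dim)]
    return alpha, context

# Toy values, chosen only for illustration.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3 encoder annotations
query = [0.5, -0.5]                                  # decoder state h_t^D
W_e = [[1.0, 0.0], [0.0, 1.0]]
W_d = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

alpha, context = attend(annotations, query, W_e, W_d, v)
print("weights:", alpha)
print("context:", context)
```

The weights sum to one, so the context is a convex combination of the annotations: it always lies inside the region spanned by the encoder states.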

This whole purple block is the "attentional interface"

The figure assembles in one frame everything the rest of this note animates, so it is worth reading slowly now. Left to right:

the green encoder cells emit the annotations $h_{i}^{E}$;

the purple box runs the additive attention of the previous note on them: the score $s (h_{i}^{E}, h_{t}^{D})$, the softmax into the weights $α_{t, i}$, and the weighted sum into the context $c_{t}$;

on the right, the blue decoder cell produces the query $h_{t}^{D}$.

The two strands then meet at the readout MLP, which receives both the context $c_{t}$ and the decoder state $h_{t}^{D}$ and emits the word (“Sei”).

That purple layer between the green encoder and the blue decoder is the attentional interface the whole note is about: not the encoder, not the decoder, but the channel between them. It is what lets the decoder, holding only its own query, reach into the encoder’s fixed memory and pull out exactly the slice it needs for the word it is about to emit. Everything that follows simply runs this block once per output word and watches what moves.

Two facts make the whole loop work and are easy to miss when staring at the formulas above:

the annotations $\{h_{i}^{E}\}$ are computed once, by a single forward pass of the encoder, and then held fixed for the entire output;
the decoder is autoregressive: each word it emits is fed back as its next input, advancing $h_{t}^{D} \to h_{t + 1}^{D}$ , which changes the query and therefore the attention.

The interface is where these meet: a fixed memory on one side, a moving query on the other.

Anatomy of a single step

The figure shows the first decoding step, $t = 0$, of the translation. Read it bottom-to-top, right side first:

The decoder, seeded with <START>, produces its state $h_{t}^{D}$. This is the query: the network’s current information need, “what do I emit first?”
The interface scores that query against every encoder annotation with the additive score (a $\tanh$ of the summed projections, drawn as the small networks feeding each + and tanh), normalises the $K$ scores with a softmax into weights $α_{t, i}$, and forms the context $c_{t}$ as the weighted sum $\sum_{i} α_{t, i} h_{i}^{E}$.
The context $c_{t}$ and the decoder state $h_{t}^{D}$ are handed to a small readout MLP, which emits the first word: “Sei”.

Reading the figure: which arrows are lit

The weights $α_{t, 0}$ and $α_{t, 1}$, on the source words “are” and “you”, are drawn in solid purple; the rest are greyed. The interface has decided that, to open the Italian sentence with “Sei” (the verb “are/you-are”), the relevant source evidence is exactly “are” and “you”. Everything needed to produce one target word has been funnelled through a single context vector $c_{t}$, and the coefficients that built it say where it drew that information from.

The loop: the spotlight sweeps the source

Now advance the decoder. The word just emitted is fed back as the next input, the decoder takes another recurrent step, its state moves, and the interface runs again, on the same encoder annotations but a new query.

At $t = 1$ the decoder has consumed its own previous output “Sei” and now produces a new state. The interface re-scores the fixed annotations against this new query, and the mass moves: “free” now dominates (its weight $α_{t, i}$ is drawn solid), and the readout emits “libero”. The MLP that produced “Sei” is greyed out, a finished step in the past.

At $t = 2$, fed “libero”, the decoder queries once more; this time “tomorrow?” lights up ($α_{t, K}$), and the output is “domani?”. One more step would emit the <END> symbol and stop.
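The feedback described in these steps can be sketched as a greedy decoding loop. Everything here is a hypothetical stand-in: `toy_step` replaces the real decoder, interface and readout with a lookup table, and its "state" is just a step counter, so the code shows only the control flow (emit a word, feed it back, stop at <END>).

```python
def greedy_decode(step, start="<START>", end="<END>", max_len=10):
    """Autoregressive loop: each emitted word becomes the next input."""
    words, prev, state = [], start, 0
    for _ in range(max_len):
        word, state = step(prev, state)
        if word == end:
            break
        words.append(word)
        prev = word  # feedback: the output drives the next step
    return words, state

# Hypothetical lookup table standing in for decoder + interface + readout.
TOY_NEXT = {
    "<START>": "Sei",
    "Sei": "libero",
    "libero": "domani?",
    "domani?": "<END>",
}

def toy_step(prev_word, state):
    """Return the next word and an updated (toy) decoder state."""
    return TOY_NEXT[prev_word], state + 1

translation, steps_taken = greedy_decode(toy_step)
print(" ".join(translation))  # Sei libero domani?
print(steps_taken)            # 4 (three words plus the <END> step)
```

The `max_len` guard is a practical assumption: a trained decoder is not guaranteed to emit <END>, so real loops usually cap the output length.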

The one moving part, and the whole point

Across the three figures the green encoder column never changes: the annotations $\{h_{i}^{E}\}$ are computed once and reused at every step. The only thing that moves is the query $h_{t}^{D}$, and because the score is query-conditional, the same set of source vectors is re-weighted differently each time. The bright band of attention therefore sweeps across the source, are/you → free → tomorrow?, tracking the word being produced.

This sweep is the soft alignment between source and target: it is not supplied but discovered by the model, and it is the precise object visualised, and carefully qualified, in the explainability note. The frozen single context vector of plain seq2seq has become a stream of custom contexts, one per word, each drawn from wherever in the source it needs to look.
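The sweep can be reproduced on a toy memory. The sketch below keeps four fixed annotations and changes only the query. Two loud assumptions: the annotations and queries are hand-picked one-hot-style vectors rather than learned states, and a plain dot-product score is swapped in for the additive score purely to keep the arithmetic readable. The sweep depends only on the score being query-conditional, not on its exact form.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(annotations, query):
    """Assumption: dot-product score in place of the additive score."""
    scores = [sum(hj * qj for hj, qj in zip(h, query)) for h in annotations]
    return softmax(scores)

source = ["are", "you", "free", "tomorrow?"]
# Fixed memory: built once, never modified inside the loop below.
annotations = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hand-picked queries, one per target word (illustrative, not learned).
queries = {
    "Sei": [2.0, 2.0, 0.0, 0.0],
    "libero": [0.0, 0.0, 3.0, 0.0],
    "domani?": [0.0, 0.0, 0.0, 3.0],
}

uniform = 1.0 / len(source)
lit = {}
for target_word, query in queries.items():
    alpha = attend(annotations, query)
    # A source word counts as "lit" when its weight exceeds the uniform level.
    lit[target_word] = [w for w, a in zip(source, alpha) if a > uniform]
    print(target_word, "<-", lit[target_word])
```

The same four vectors are read three times; only the weights over them change.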

Read once, query many times

It is worth stating the cost structure plainly, because it is both the practical justification for the design and the seed of later efficiency work.

What recomputes, and what does not

Producing a target of length $S$ from a source of length $K$ costs:

one encoder pass to build the $K$ annotations, done before decoding begins and never repeated;

per decoding step, the interface re-scores all $K$ annotations against the new query, softmaxes, and combines: $O (K)$ work, plus one autoregressive decoder step and one readout.

So the encoder is not re-run for every output word; only the cheap attention-and-readout is. The attention layer’s $O (K)$ per-step cost is exactly the term that local attention later shrinks by restricting each query to a window, and the autoregressive feedback is exactly what teacher forcing replaces with gold words at training time.
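A small instrumented sketch makes the accounting concrete. The functions below are placeholders that only count how often each piece runs; their arithmetic is arbitrary and the names are assumptions. For the worked example ($K = 4$ source tokens, $S = 3$ target words) the attention scores are evaluated $S \cdot K = 12$ times while the encoder runs once.

```python
def run_translation(K, S):
    """Count how often each component runs for source length K, target length S."""
    calls = {"encoder": 0, "score": 0, "decoder_step": 0, "readout": 0}

    def encode():
        calls["encoder"] += 1
        return [float(i) for i in range(K)]  # placeholder annotations

    def score(annotation, query):
        calls["score"] += 1
        return annotation * query            # placeholder score

    annotations = encode()                   # once, before decoding begins
    query = 1.0
    for _ in range(S):
        calls["decoder_step"] += 1
        query = 0.5 * query + 1.0            # placeholder state update
        _scores = [score(h, query) for h in annotations]  # O(K) per step
        calls["readout"] += 1
    return calls

calls = run_translation(K=4, S=3)
print(calls)  # {'encoder': 1, 'score': 12, 'decoder_step': 3, 'readout': 3}
```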

The readout: turning a context vector into a word

The figures label the final block simply “MLP”, and it deserves unpacking, because it is where the context is actually used. The context $c_{t}$ does not become the output directly; it is merged with the decoder state and then projected to the vocabulary. A standard form is the attentional hidden state

$$\tilde{h}_{t} = \tanh (W_{c} [c_{t}; h_{t}^{D}]), \qquad p_{t} = \mathrm{softmax} (W_{o} \tilde{h}_{t}),$$

a one-layer MLP on the concatenation of context and decoder state, followed by a linear projection and a softmax over the vocabulary $V$. The word is then chosen from $p_{t}$ (greedily, or by beam search, as in the seq2seq note) and fed back to drive the next step.
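The readout formula translates almost line for line into code. This is a sketch under stated assumptions: the four-word vocabulary, the weight values in `W_c` and `W_o`, and the input vectors are all invented for illustration, and bias terms are omitted, as in the formula above.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def readout(context, h_dec, W_c, W_o):
    """h~_t = tanh(W_c [c_t; h_t^D]),  p_t = softmax(W_o h~_t)."""
    merged = context + h_dec                     # concatenation [c_t; h_t^D]
    h_tilde = [math.tanh(x) for x in matvec(W_c, merged)]
    probs = softmax(matvec(W_o, h_tilde))
    return h_tilde, probs

# Illustrative values only.
vocab = ["Sei", "libero", "domani?", "<END>"]
context = [0.9, 0.1]                             # c_t from the interface
h_dec = [0.2, -0.3]                              # decoder state h_t^D
W_c = [[1.0, 0.0, 0.5, 0.0],
       [0.0, 1.0, 0.0, 0.5]]                     # shape (2, 4)
W_o = [[2.0, 0.0],
       [0.0, 2.0],
       [-1.0, -1.0],
       [0.0, 0.0]]                               # shape (|V|, 2)

h_tilde, probs = readout(context, h_dec, W_c, W_o)
best = max(range(len(vocab)), key=lambda k: probs[k])  # greedy choice
print(vocab[best])  # Sei
```

Greedy selection is used here only because it is the shortest to write; beam search would keep several candidates from `probs` instead of one.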

Where the context enters is a design choice

The figures inject $c_{t}$ into the readout (Luong’s placement). Bahdanau’s original interface instead feeds the context into the recurrence, so that $c_{t}$ helps compute the next decoder state rather than only the current output. Both are genuine attentional interfaces and both learn the same kind of alignment; they differ only in whether the context informs the decoder’s state or its prediction.

The choice, like the current-versus-previous query convention noted in the additive attention note, is an implementation detail layered on top of the one idea this note is about.
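The two placements differ mainly in the order of operations inside one decoding step, which a sketch with interchangeable parts can show. All functions here are toy scalar stand-ins with assumed names; the `trace` list exists only to record which piece runs when. The sketch leaves out details of both papers (for example Luong's input feeding), and in the Bahdanau-style step the readout still sees the context as well.

```python
import math

trace = []  # records the order in which the pieces run

def attend(annotations, query):
    """Toy scalar attention: softmax over annotation * query, then weighted sum."""
    trace.append("attend")
    scores = [h * query for h in annotations]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum(e / total * h for e, h in zip(exps, annotations))

def cell(h_prev, y_prev):
    trace.append("cell")
    return math.tanh(h_prev + y_prev)

def cell_with_context(h_prev, y_prev, c_t):
    trace.append("cell+context")
    return math.tanh(h_prev + y_prev + c_t)

def readout(c_t, h_t):
    trace.append("readout")
    return math.tanh(c_t + h_t)

def luong_step(h_prev, y_prev, annotations):
    """Context enters the readout: update the state, then attend, then predict."""
    h_t = cell(h_prev, y_prev)             # the recurrence does not see c_t
    c_t = attend(annotations, h_t)         # query is the current state
    return h_t, readout(c_t, h_t)

def bahdanau_step(h_prev, y_prev, annotations):
    """Context enters the recurrence: attend first, then update the state."""
    c_t = attend(annotations, h_prev)      # query is the previous state
    h_t = cell_with_context(h_prev, y_prev, c_t)
    return h_t, readout(c_t, h_t)

annotations = [0.2, -0.4, 0.9]
luong_step(0.0, 1.0, annotations)
luong_trace = list(trace)
trace.clear()
bahdanau_step(0.0, 1.0, annotations)
bahdanau_trace = list(trace)
print(luong_trace)     # ['cell', 'attend', 'readout']
print(bahdanau_trace)  # ['attend', 'cell+context', 'readout']
```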

Attention as an interface

The word “interface” is the conceptual payoff, not decoration. In plain seq2seq the decoder had to carry the entire source inside its own state, because that was its only access to it. The attentional interface severs that requirement: the source stays parked in the encoder’s annotations, and the decoder reaches back to read it whenever a step demands, through a differentiable, content-addressed query. Storage (the encoder) is decoupled from use (the decoder).

The same interface, everywhere downstream

Read this way, attention is a general connector between a controller and a memory, and the same interface recurs far beyond translation.

It is how a controller reads from external memory in Neural Turing Machines and Memory Networks (surveyed in the aggregation note). It is also the cross-attention block of an encoder-decoder Transformer, where a decoder layer queries the encoder’s outputs by the same score-softmax-combine logic, only with learned key/value projections and scaled dot-product scoring in place of the additive MLP. The mechanism animated in the three figures above is, structurally, the cross-attention of every modern translation model.
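For comparison, here is a minimal single-head, single-query sketch of that cross-attention form: the same score-softmax-combine pipeline, with projections `W_q`, `W_k`, `W_v` and a dot product scaled by the square root of the key dimension. The identity projections and toy vectors are assumptions chosen so the result is easy to check by hand; a real Transformer layer adds multiple heads, an output projection, and batching.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, memory, W_q, W_k, W_v):
    """One decoder query reading from the encoder outputs (the memory)."""
    q = matvec(W_q, query)
    keys = [matvec(W_k, m) for m in memory]
    values = [matvec(W_v, m) for m in memory]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]  # score
    alpha = softmax(scores)                                            # softmax
    dim = len(values[0])
    context = [sum(a * v[j] for a, v in zip(alpha, values)) for j in range(dim)]  # combine
    return alpha, context

identity = [[1.0, 0.0], [0.0, 1.0]]              # illustrative projections
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # encoder outputs
query = [0.0, 3.0]                               # decoder-side query

alpha, context = cross_attention(query, memory, identity, identity, identity)
print(alpha)
print(context)
```

Keys decide where to look and values decide what is read; in the additive interface of this note the annotations play both roles at once.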

Recap

The object. The attentional interface is additive attention installed permanently between encoder and decoder and evaluated once per output word.

The loop. The encoder runs once to produce fixed annotations; the autoregressive decoder feeds each emitted word back, moving the query, so the interface re-weights the same annotations differently at every step.

The sweep. Across “Sei” / “libero” / “domani?” the attention band moves are/you → free → tomorrow?: the soft alignment, discovered, not supplied.

Cost. One encoder pass, then $O (K)$ attention-plus-readout per step; the encoder is never re-run.

The readout. The context is merged with the decoder state by a small MLP and projected to a vocabulary distribution; where the context is injected (state vs output) is a design choice.

The framing. Attention decouples where information is stored from when it is used; the same interface becomes the cross-attention of the Transformer.

Sources

The attentional interface for translation is Bahdanau, Cho and Bengio (2014/2015), with the readout placement of Luong, Pham and Manning (2015); both are built on the additive attention and soft-attention of the preceding notes. The “attention as an interface between modules” framing follows Olah and Carter, Attention and Augmented Recurrent Neural Networks (Distill, 2016). Cross-attention in Transformers is from Vaswani et al. (2017).

Deep Learning: Zero to Hero
