Attention as explanation

What this note builds

Soft attention produces, as a free byproduct, a number for every input position saying how much the model drew on it. Laid out over the input, those numbers look like an explanation of the model’s decision, and the picture is seductive. This note asks how far that reading can be trusted. It takes the cleanest favourable case, the source–target alignment learned by a translation decoder, shows what it genuinely reveals (including a word reordering it discovers with no supervision), and then draws the line the most careful practitioners insist on: the difference between an alignment, a plausible explanation, and a faithful one. The conclusion is operational, not dismissive.

A map that comes for free

Almost everything inside a neural network is opaque: a hidden unit’s activation rarely means anything a human can name. Soft attention is the rare exception. Its three-step pipeline ends in a softmax over the input positions, producing weights $\alpha_{t,s} \in (0, 1)$ that sum to one over $s$, and those weights are interpretable on their face: $\alpha_{t,s}$ is the fraction of position $t$’s “focus” spent on input element $s$.
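A minimal sketch of that last step may make the claim concrete. It assumes the three steps are scoring, softmax normalisation and a weighted sum; the function name `soft_attention` and all the numbers are made up for illustration and do not come from any particular model.

```python
import numpy as np

def soft_attention(scores, values):
    """Turn raw scores into attention weights and a context vector.

    scores: shape (S,), one unnormalised score per input position.
    values: shape (S, D), one vector per input position.
    """
    # Subtracting the max before exponentiating is for numerical
    # stability only; it does not change the softmax output.
    exp = np.exp(scores - np.max(scores))
    weights = exp / exp.sum()      # alpha_{t,s} for one fixed t
    context = weights @ values     # weighted average of the values
    return weights, context

# Made-up scores for a 4-position input (illustration only).
scores = np.array([2.0, 0.5, -1.0, 0.1])
values = np.eye(4)                 # one-hot values, so context == weights
weights, context = soft_attention(scores, values)
```

Every weight lies strictly between zero and one and the weights sum to one, which is what licenses reading each as a share of focus.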

Mounted as the attentional interface of an encoder-decoder translator, this becomes a soft alignment. For each generated target word the decoder holds a distribution over all source words, recording which parts of the source it consulted to produce that word. Collect those distributions for a whole sentence and the result is a readable picture of the model at work.

Reading the figure

The sentence is being translated from English (bottom) into French (top).

The bottom row of A boxes is the bidirectional encoder reading the source “the agreement on the European Economic Area was signed in August 1992 .”; the $\leftrightarrow$ arrows are its forward and backward passes, so each A is an annotation that summarises the whole sentence as seen from one source position.

The top row of B boxes is the decoder emitting the French target “l’ accord sur la zone économique européenne a été signé en août 1992 .” one word at a time; the $\to$ arrows are its recurrence.

Each purple curve from a B box down to the A boxes is one attention weight $\alpha_{t,s}$; the darker and thicker the curve, the more mass that target word placed on that source word.

Most of the mass lies close to the diagonal, the signature of a roughly monotonic alignment: signé draws from signed, août from August, 1992 from 1992, each target word looking mainly at its source counterpart.

The near-diagonal backbone is unsurprising. The revealing part of the picture is the one place it breaks.

The crossing is the whole point

English orders the phrase adjective–adjective–noun: European Economic Area. French orders it noun–adjective–adjective: zone économique européenne. The correspondence therefore has to reverse across the phrase: European (first in English) maps to européenne (last in French), and Area (last in English) maps to zone (first in French). In the figure the purple curves over this span visibly cross.
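The crossing can be read off mechanically. The sketch below uses a hypothetical 3 × 3 block of weights for this phrase (the numbers are invented to mimic the figure, not taken from the paper), takes the most-attended source word for each target word, and counts how many pairs of links are reversed.

```python
import numpy as np

source = ["European", "Economic", "Area"]
target = ["zone", "économique", "européenne"]

# Hypothetical weights, one row per target word; each row sums to one.
alpha = np.array([
    [0.05, 0.10, 0.85],   # zone       -> mostly Area
    [0.10, 0.80, 0.10],   # économique -> mostly Economic
    [0.85, 0.10, 0.05],   # européenne -> mostly European
])

def hard_alignment(alpha):
    """Index of the most-attended source position for each target position."""
    return alpha.argmax(axis=1)

def count_crossings(aligned):
    """Count target pairs (i < j) whose aligned source positions are reversed."""
    n = len(aligned)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if aligned[i] > aligned[j])

aligned = hard_alignment(alpha)
pairs = [(target[t], source[s]) for t, s in enumerate(aligned)]
crossings = count_crossings(aligned)
```

A perfectly monotonic alignment gives zero crossings; a fully reversed three-word phrase gives three, one for each pair of links.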

Nothing in the training data labelled this alignment. The model was trained only to produce good translations, and the reordering emerged on its own as the alignment that made the objective easiest to satisfy. This is the demonstration, from the paper that introduced attention for translation (Bahdanau, Cho and Bengio, 2015, the source of this exact example), that attention recovers linguistically meaningful, non-monotonic alignment as a side effect of learning to translate.

It is also the concrete reason the monotonic prior of [[local-attention|local-m attention]] is only an approximation: the diagonal is the rule, and crossings like this are exactly where a hard diagonal assumption would fail.

A caveat about the figure itself

Published alignment maps, this one included, are usually among the cleanest a model produces, and they are chosen to make a point. Real attention is often more diffuse and noisier than the crisp picture here suggests: mass spread across several plausible source words, and the occasional weight that resists any tidy linguistic reading. The idealised figure is the right teaching example, not a promise that every map will be this legible.

As an alignment the map is not only interpretable but useful: it underpins copying a rare or out-of-vocabulary source word straight to its aligned target slot, and it lets a practitioner inspect which source span a mistranslation came from. For these purposes attention earns its reputation. The temptation, and the danger, is to ask the same picture to do more.

From alignment to explanation: the line not to cross

It is tempting to read the heatmap as the model’s reason: “it output européenne because it looked at European.” Whether that reading is licensed depends on a distinction that is easy to blur and essential to keep sharp. Three claims of increasing strength hide behind one picture:

Alignment. The weights mark a correspondence between input and output elements. For translation this is a genuine latent of the task, and attention approximates it well, as above.
Plausible explanation. The highlighted inputs look, to a human, like the right ones. Attention maps are highly plausible almost by construction.
Faithful explanation. The highlighted inputs are the ones that actually drove the output: change them and the output changes accordingly; the map reflects the model’s true computation.

Plausibility is not faithfulness

This is the trap at the centre of the topic. A map that agrees with human intuition is plausible; a map that reflects the computation that actually produced the output is faithful. These are different properties, and the second is far harder to earn.

Attention weights are optimised to make the model’s predictions good, not to be honest reports of their own importance, so plausibility is cheap and faithfulness is never guaranteed by the architecture. A picture that looks like an explanation has, by that fact alone, told you nothing about whether it is one. Keeping the two terms apart is by now standard practice in interpretability (Jacovi and Goldberg, 2020).

The deeper reason a map can mislead: the states are already contextual

There is a precise mechanism behind the gap, and it is the part most often missed. The vector attended to at position $s$ is not “the source token $s$”. It is the encoder annotation $\bar{h}_{s}$, and the bidirectional encoder has already mixed the whole sentence into every position, so $\bar{h}_{s}$ is a contextual summary centred on $s$, not an isolated token. Two consequences follow, and both loosen the tie between the map and the raw inputs:

putting weight on position $s$ pulls in information about $s$’s neighbourhood, not token $s$ alone, so “attention on $s$” overstates how localised the model’s reliance actually is;

a token can still reach the output through the recurrent path even where its attention weight is near zero, so low attention does not certify low influence.

“Attention on position $s$” and “reliance on input token $s$” are therefore different quantities. This is a concrete reason attention weights and leave-one-out importance disagree in the measurements reported below: the recurrence has already smeared each token’s contribution across positions before attention ever sees it.
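A deliberately simplified sketch shows the size of the effect. It replaces the bidirectional RNN with a fixed linear mixing matrix `M` (an assumption made only so the arithmetic is exact; a real encoder is nonlinear and its mixing is learned), and every number is invented.

```python
import numpy as np

# Hypothetical mixing matrix standing in for the encoder: row s says how much
# of each input token ends up in annotation h_s. Rows sum to one.
M = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0],
    [0.0, 0.2, 0.6, 0.2],
    [0.0, 0.0, 0.4, 0.6],
])

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # made-up token embeddings, one row per token
h = M @ x                        # contextual annotations

alpha = np.array([0.01, 0.97, 0.01, 0.01])   # attention over positions
context = alpha @ h

# Because everything here is linear, the context is also a weighted sum of
# the raw tokens, with weights alpha @ M rather than alpha.
token_weights = alpha @ M
```

In this toy, position 0 receives an attention weight of 0.01, yet token 0 contributes to the context with weight 0.2, because most of the attention fell on a neighbour whose annotation already contained it.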

The literature that pressed on this point is worth knowing precisely, because it is often quoted as a slogan and lost as an argument.

What the studies actually showed

Jain and Wallace (2019), “Attention is not Explanation”. On classification tasks with a single attention layer over an LSTM, they reported two findings. First, attention weights correlate only weakly with gradient-based and leave-one-out importance scores, so different notions of “which input mattered” disagree. Second, for a fixed input one can often construct a substantially different attention distribution that leaves the model’s prediction essentially unchanged. If many distributions yield the same output, none of them can be privileged as the explanation.
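The second finding is easy to reproduce in miniature. The toy below is constructed by hand so that two positions carry identical value vectors; it is not the search procedure used in the paper, only an illustration of why two very different maps can yield the same prediction. All names and numbers are hypothetical.

```python
import numpy as np

def predict(alpha, values, w):
    """Probability from a toy model: sigmoid of a linear readout of the context."""
    context = alpha @ values
    return 1.0 / (1.0 + np.exp(-(context @ w)))

# Hypothetical value vectors; positions 0 and 1 happen to be identical.
values = np.array([[1.0, 0.5], [1.0, 0.5], [-1.0, 0.0], [0.0, 0.0]])
w = np.array([2.0, 0.0])

alpha_a = np.array([0.7, 0.1, 0.1, 0.1])   # "the model looked at position 0"
alpha_b = np.array([0.1, 0.7, 0.1, 0.1])   # "the model looked at position 1"

p_a = predict(alpha_a, values, w)
p_b = predict(alpha_b, values, w)

# Total variation distance between the two maps (0 = identical, 1 = disjoint).
tv_distance = 0.5 * np.abs(alpha_a - alpha_b).sum()
```

The two distributions are far apart as heatmaps yet produce the same probability, so the output alone cannot say which of them is the explanation.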

Serrano and Smith (2019), “Is Attention Interpretable?” Erasing the highest-attention tokens frequently fails to flip the decision as readily as the weights would suggest, further loosening the tie between attention magnitude and causal importance.
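An erasure test of this kind can be sketched in a few lines. The version below assumes that erasing means zeroing one attention weight and renormalising the rest, and it uses an invented three-token example; the protocol in the paper may differ in detail.

```python
import numpy as np

alpha = np.array([0.5, 0.3, 0.2])       # attention weights (made up)
evidence = np.array([0.1, 2.0, -1.0])   # per-token contribution to the logit (made up)

def decision(alpha, evidence):
    """Positive class when the attention-weighted evidence is above zero."""
    return bool(alpha @ evidence > 0)

def erase(alpha, position):
    """Zero one attention weight and renormalise the rest to sum to one."""
    kept = alpha.copy()
    kept[position] = 0.0
    return kept / kept.sum()

original = decision(alpha, evidence)
flips = [decision(erase(alpha, s), evidence) != original
         for s in range(len(alpha))]
top = int(alpha.argmax())
```

Here erasing the most-attended token leaves the decision intact, while erasing the second-ranked token flips it: attention magnitude and causal importance have come apart.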

Wiegreffe and Pinter (2019), “Attention is not not Explanation”. The rebuttal argues the adversarial distributions are found per-instance in a way that does not constitute a genuine alternative model, and that under stricter protocols, for example forcing a model to be trained with a fixed alternative attention, the weights often cannot be freely swapped without hurting performance. Their conclusion is graded: attention is not a guaranteed faithful explanation, but it is not useless either, and whether it can be trusted depends on the model, the task, and how the claim is tested.

Why the figure above is the favourable case

The sharpest negative results target classification attention, where a single label must be explained by a distribution over many tokens, leaving great freedom to reshuffle weights without disturbing the output. Sequence-to-sequence alignment is more constrained: each target word genuinely corresponds to some source span, the alignment is a real latent of the task, and it can be checked against an external ground truth (human-annotated word alignments).
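Checking against ground truth can be as simple as the sketch below, which scores an argmax alignment with the alignment error rate (AER), a standard word-alignment metric. The weight matrix and the gold links are hypothetical; in practice the gold links would come from human annotation.

```python
import numpy as np

def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S a subset of P."""
    a_s = len(predicted & sure)
    a_p = len(predicted & possible)
    return 1.0 - (a_s + a_p) / (len(predicted) + len(sure))

# Hypothetical attention weights: rows are target words, columns source words.
alpha = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.10, 0.20, 0.60],
    [0.10, 0.20, 0.60, 0.10],
    [0.50, 0.30, 0.10, 0.10],   # argmax lands on source 0; gold says source 1
])

# One (target, source) link per target word, taken from the argmax.
predicted = {(t, int(s)) for t, s in enumerate(alpha.argmax(axis=1))}

# Hypothetical gold annotation; here every gold link is a "sure" link.
sure = {(0, 0), (1, 3), (2, 2), (3, 1)}
possible = set(sure)

aer = alignment_error_rate(predicted, sure, possible)
```

Three of the four predicted links match the gold set, giving an AER of 0.25. A classification heatmap has no comparable external reference to be scored against.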

The translation map above is therefore attention near its most trustworthy; a sentiment classifier’s heatmap is where the caution bites hardest. Treating the two as equally reliable, in either direction, is itself a common mistake.

The same story in vision, and its own trap

The image-captioning heatmaps of Show, Attend and Tell are the visual twin of this figure: there the attention map is painted over an image and tracks the object being named. Everything in this note transfers to them, with one extra pitfall specific to the visual case, already flagged in that note: the smooth, pixel-precise look of those overlays is an upsampling artefact of a coarse $14 \times 14$ attention grid, not evidence of fine localisation. Plausibility, once again, outruns what the underlying numbers actually support.
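The upsampling point can be checked directly. The sketch assumes a 224 × 224 image, so that each attention cell covers a 16 × 16 patch (the image size is an assumption; only the 14 × 14 grid comes from the text), and uses nearest-neighbour upsampling on random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
coarse = rng.random((14, 14))
coarse /= coarse.sum()          # 196 attention weights summing to one

scale = 16                      # assumed: 224 x 224 image / 14 x 14 grid

# Nearest-neighbour upsampling: every cell is copied into a 16 x 16 block.
overlay = np.kron(coarse, np.ones((scale, scale)))

# Averaging each 16 x 16 block recovers the coarse grid exactly.
recovered = overlay.reshape(14, scale, 14, scale).mean(axis=(1, 3))
```

The overlay has 50,176 pixels but is fully determined by 196 numbers. A smoother interpolation such as bilinear changes how the overlay looks, not how many independent values stand behind it.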

In modern Transformers the picture fragments

The single readable map of this RNN-era figure is a luxury of having one attention layer with one head. A Transformer spreads attention across many heads and many layers; there is no one map to read, different heads encode different and often non-linguistic relations, and schemes that collapse them into a single picture (attention rollout and the like) add assumptions of their own. The interpretability of attention does not get cleaner as models scale; it gets harder, which is one more reason to treat any single heatmap as a hypothesis rather than a verdict.
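To see where those assumptions enter, here is a sketch of one common formulation of attention rollout. The function name and input layout are choices made for this illustration, and each commented line marks a modelling decision rather than a fact about the network.

```python
import numpy as np

def attention_rollout(layers):
    """Collapse per-layer, per-head attention into one token-to-token map.

    layers: list of arrays of shape (heads, T, T), lowest layer first,
            with each row summing to one.
    """
    n = layers[0].shape[-1]
    rollout = np.eye(n)
    for attn in layers:
        a = attn.mean(axis=0)                   # assumption 1: heads are averaged
        a = 0.5 * a + 0.5 * np.eye(n)           # assumption 2: residual path gets equal weight
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows normalised (no-op for valid input)
        rollout = a @ rollout                   # assumption 3: layers compose linearly
    return rollout
```

The result is a single row-stochastic matrix that can be plotted like the figure above, but it describes the network only to the extent that all three assumptions hold.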

The honest reading

What an attention map is, and is not

It is a faithful record of where the context vector drew its mass: a true statement about the model’s computation, not a heuristic.

It is often a good alignment (especially for translation) and a plausible account of what the model used.

It is not guaranteed to be a faithful, unique explanation of why the model decided as it did.

The safe practice follows directly: use attention maps as an alignment tool and a hypothesis-generating diagnostic, and when a causal claim is at stake, corroborate them with independent evidence (gradient attributions, ablations, leave-one-out) rather than reading the heatmap as a certificate of the model’s reasoning.
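One way to run such a check is to compare the ranking given by attention with the ranking given by leave-one-out importance. The sketch below does this for a toy attention classifier with random parameters; the model, the names `q` and `w`, and the use of Kendall's tau as the agreement measure are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def model(x, q, w):
    """Toy attention classifier: returns (probability, attention weights)."""
    alpha = softmax(x @ q)
    context = alpha @ x
    prob = 1.0 / (1.0 + np.exp(-(context @ w)))
    return prob, alpha

def leave_one_out(x, q, w):
    """Absolute change in output probability when each token is deleted."""
    base, _ = model(x, q, w)
    return np.array([abs(base - model(np.delete(x, s, axis=0), q, w)[0])
                     for s in range(len(x))])

def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(a)
    total = sum(np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
                for i in range(n) for j in range(i + 1, n))
    return float(total) / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))   # six made-up token vectors
q = rng.normal(size=4)        # hypothetical attention query
w = rng.normal(size=4)        # hypothetical readout weights

_, alpha = model(x, q, w)
loo = leave_one_out(x, q, w)
tau = kendall_tau(alpha, loo)
```

A tau near one says the heatmap and the ablation agree on which tokens matter; a tau near zero or below is the signal to stop treating the heatmap as an explanation for that input.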

Recap

Attention hands the analyst a free, human-readable map of what the model attended to; in encoder-decoder translation this map is a learned soft alignment.

The figure’s near-diagonal mass plus the crossing at European Economic Area / zone économique européenne shows attention discovering non-monotonic alignment without supervision: a genuine, useful result.

The reading breaks down when alignment or plausibility is mistaken for faithfulness: a map that looks right is not thereby a true account of the model’s reasoning (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019).

Alignment in translation is the favourable case; classification heatmaps and multi-head Transformer attention are far harder to trust.

Operationally: an attention map is where the context vector drew its mass, a reliable diagnostic and alignment, not a certified explanation.

Sources

The English–French alignment example is from Bahdanau, Cho and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (2015), whose additive attention produces the weights drawn here. The explainability debate: Jain and Wallace, Attention is not Explanation (2019); Serrano and Smith, Is Attention Interpretable? (2019); Wiegreffe and Pinter, Attention is not not Explanation (2019). The faithfulness/plausibility distinction is formalised in Jacovi and Goldberg, Towards Faithfully Interpretable NLP Systems (2020). The vision counterpart, Xu et al. (2015), is developed in the aggregation note.

