Additive Attention

What this note builds

Encoder-decoder RNNs hand the decoder a single, frozen context vector and hope it summarises the whole source. It cannot. Additive attention (Bahdanau, Cho and Bengio, 2014) was the first mechanism to repair this, and the first attention mechanism of the modern era: at every decoding step it builds a fresh context vector by scoring the current decoder state against every encoder state, then taking a learned weighted average. This note constructs it in the three steps of the figures (projection into a common space, an additive compatibility score, softmax-and-combine) and dwells on the two things most often skipped: why the score is called “additive” (and what it is contrasted with), and why the score is, quite literally, a tiny neural network. It closes by placing additive attention in the family that runs from Bahdanau to the Transformer.

The bottleneck additive attention was built to break

The seq2seq note ended on a diagnosis. Every fact the decoder will ever know about the source must pass through one fixed-width context vector $h_{K}^{E}$ , the encoder’s final state. That single vector cannot losslessly summarise a long sentence, and because a recurrent hidden state is a leaky accumulator, it is biased toward the most recent source tokens and forgets the beginning.

The symptoms were concrete: translation quality falls as the source grows, and all RNNs lose early information once sequences pass roughly 40 steps, the familiar vanishing-gradient horizon.

The repair the seq2seq note foresaw is exactly Bahdanau’s: stop throwing away the intermediate encoder states, keep all of them, and let the decoder build a different context vector for every output word, a content-dependent weighted summary instead of one frozen vector for the whole sentence. This is the soft-attention pipeline, but with a decisive upgrade.

In the sentiment example the score was query-free, $s (h_{t})$ , because the task asked one fixed question. A translator asks a different question at every step (“which source word do I need to produce this target word?”), so the score must depend on the decoder’s current state too. Additive attention is the first query-conditional attention: the score reads both an encoder state and the decoder state.

A naming caution before any formula

What this note calls $h_{i}^{E}$ (encoder states) and $h_{t}^{D}$ (the decoder state) are, in the wider attention vocabulary, the keys/values and the query respectively. Bahdanau’s own paper calls the encoder states annotations and the scores energies. The objects are the same throughout; only the names change with the era.

The plan: one custom summary per output word

Fix a decoding step $t$. The decoder holds a state $h_{t}^{D}$, and the encoder has left behind a set of annotations $\{h_{i}^{E}\}_{i = 1}^{K}$, one per source position. Additive attention turns these into a context vector $c_{t}$ through the same three moves as every attention mechanism (score, normalise, combine), specialised as follows:

project the decoder state and each encoder state into a shared space where they can be compared;
score every encoder state against the decoder state with an additive compatibility function;
softmax the scores into weights and take the weighted average of the encoder states.

The three figures below are these three steps. The only genuinely new ingredient compared to the query-free pipeline is that the score now takes two arguments instead of one.

Dimensions used in this note

Colours follow the figures throughout: encoder states in green, the decoder query in blue, and attention quantities (score, weights, context) in purple.

symbol	role	shape
$h_{i}^{E}$	encoder annotation at source position $i$	$n_{E}$
$h_{t}^{D}$	decoder hidden state (the query) at step $t$	$n_{D}$
$W^{E}$	encoder projection into the common space	$d_{h} \times n_{E}$
$W^{D}$	decoder projection into the common space	$d_{h} \times n_{D}$
$v$	scoring vector (read-out of the common space)	$d_{h}$
$s (h_{i}^{E}, h_{t}^{D})$	additive score (energy) of position $i$ at step $t$	scalar
$α_{t, i}$	attention weight on position $i$ at step $t$	scalar, $\in (0, 1)$
$c_{t}$	context vector at step $t$	$n_{E}$
$K$	source length	scalar

Step 1: project both sides into a common space

Two linear layers, $W^{E} \in \mathbb{R}^{d_{h} \times n_{E}}$ and $W^{D} \in \mathbb{R}^{d_{h} \times n_{D}}$, map the encoder state and the decoder state into a common space of dimension $d_{h}$:

$$W^{E} h_{i}^{E} \in \mathbb{R}^{d_{h}}, \qquad W^{D} h_{t}^{D} \in \mathbb{R}^{d_{h}}.$$

The plain reason is shape: if $n_{E} \neq n_{D}$ the two vectors cannot even be added, and the projections fix that by sending both to length $d_{h}$. But the deeper reason survives even when the shapes already match.

If the dimensions already agree, why project at all?

It is tempting to think that when $n_{E} = n_{D}$ the projections are redundant. They are not. Equal dimension is not shared meaning. The encoder’s hidden space is organised to summarise source text; the decoder’s is organised to drive target generation. Two vectors of the same length can live in spaces whose coordinates mean entirely different things, so comparing them directly is comparing apples to oranges that happen to come in boxes of the same size.

The learned $W^{E}, W^{D}$ rotate and rescale both into a neutral space built for one purpose only: measuring compatibility. This is why the construction is robust to the asymmetries of real tasks (translation between languages of different structure, a question attending to a passage), where encoder and decoder genuinely encode different kinds of thing.
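A minimal sketch of Step 1, written with plain Python lists to stay dependency-free. The sizes and weight values are illustrative assumptions (a trained model learns $W^{E}$ and $W^{D}$), and `matvec` is a helper introduced only for this example.

```python
# Hypothetical sizes, chosen so that n_E != n_D.
n_E, n_D, d_h = 4, 3, 2

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

# Illustrative weights; in a real model these are learned.
W_E = [[0.1, -0.2, 0.3, 0.0],
       [0.5, 0.4, -0.1, 0.2]]   # shape d_h x n_E
W_D = [[0.2, 0.0, -0.3],
       [-0.4, 0.1, 0.6]]        # shape d_h x n_D

h_E = [1.0, 2.0, 0.5, -1.0]     # one encoder state, length n_E
h_D = [0.3, -0.7, 1.2]          # the decoder state, length n_D

p_E = matvec(W_E, h_E)          # projected encoder state, length d_h
p_D = matvec(W_D, h_D)          # projected decoder state, length d_h

# Only after projection do the lengths agree, so the sum is defined.
summed = [a + b for a, b in zip(p_E, p_D)]
print(len(h_E), len(h_D), len(p_E), len(p_D), len(summed))
```

The raw states have lengths 4 and 3 and cannot be added; both projections have length $d_{h} = 2$, which is what makes the sum in the next step well defined.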

Step 2: score each source word by an additive compatibility

The two projected vectors are summed, squashed by a $tanh$ , and read out to a single number by the vector $v$ :

$$s (h_{i}^{E}, h_{t}^{D}) = \underbrace{v^{⊤}}_{1 \times d_{h}} \tanh \big( \underbrace{W^{E} h_{i}^{E}}_{d_{h} \times 1} + \underbrace{W^{D} h_{t}^{D}}_{d_{h} \times 1} \big) \in \mathbb{R}.$$

The scalar $s (h_{i}^{E}, h_{t}^{D})$ is the importance of encoder state $i$ with respect to the current decoder state $h_{t}^{D}$: large when source position $i$ is relevant to the word about to be generated, small when it is not.
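The score can be sketched in a few lines of plain Python. The parameter values below are untrained, illustrative assumptions, and the function name `additive_score` is hypothetical; the point is only the shape of the computation: project, add, squash, read out.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def additive_score(h_E, h_D, W_E, W_D, v):
    """Return s = v^T tanh(W_E h_E + W_D h_D) as a single float."""
    pre = [a + b for a, b in zip(matvec(W_E, h_E), matvec(W_D, h_D))]
    hidden = [math.tanh(z) for z in pre]
    return sum(vk * hk for vk, hk in zip(v, hidden))

# Illustrative parameters with n_E = 3, n_D = 2, d_h = 2.
W_E = [[0.5, -0.3, 0.1],
       [0.2, 0.4, -0.6]]
W_D = [[0.7, -0.1],
       [-0.2, 0.3]]
v = [1.0, -0.5]

# K = 3 encoder states scored against one decoder state.
encoder_states = [[1.0, 0.0, 2.0],
                  [0.0, 1.0, -1.0],
                  [0.5, 0.5, 0.5]]
h_D = [1.0, -1.0]

scores = [additive_score(h, h_D, W_E, W_D, v) for h in encoder_states]
print(scores)  # one scalar per source position
```

One property follows directly from the formula: since $tanh$ is bounded in $(-1, 1)$, every score satisfies $|s| \le \sum_{k} |v_{k}|$, however large the hidden states become.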

The score is a tiny neural network, not a magic similarity

Stack the two projections side by side into one matrix $W = [W^{E} \; W^{D}] \in \mathbb{R}^{d_{h} \times (n_{E} + n_{D})}$. Because $W^{E} h_{i}^{E} + W^{D} h_{t}^{D} = [W^{E} \; W^{D}] [h_{i}^{E}; h_{t}^{D}]$, the score is exactly
$s = v^{⊤} \tanh (W [h_{i}^{E}; h_{t}^{D}]),$
a one-hidden-layer MLP with $d_{h}$ hidden units and a $tanh$ non-linearity, fed the concatenation of the two states and emitting a single scalar. There is nothing mysterious in the “compatibility function”: it is a small feedforward network trained to output a high number when an encoder state and the decoder state belong together. The hidden width $d_{h}$ is its capacity; $v$ is its output neuron. Seeing this collapses additive attention from a special-purpose formula to an instance of something already familiar.
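The equivalence between the two-projection form and the single-matrix MLP form can be checked numerically. This is a small sketch with illustrative, untrained values; the names `s_sum` and `s_mlp` are introduced here only to label the two ways of computing the same score.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

# Illustrative parameters with n_E = 3, n_D = 2, d_h = 2.
W_E = [[0.5, -0.3, 0.1],
       [0.2, 0.4, -0.6]]
W_D = [[0.7, -0.1],
       [-0.2, 0.3]]
v = [1.0, -0.5]
h_E = [1.0, 0.0, 2.0]
h_D = [1.0, -1.0]

# Form 1: two separate projections, then add.
pre_sum = [a + b for a, b in zip(matvec(W_E, h_E), matvec(W_D, h_D))]
s_sum = sum(vk * math.tanh(z) for vk, z in zip(v, pre_sum))

# Form 2: one matrix W = [W_E  W_D] applied to the concatenation [h_E; h_D].
W = [row_e + row_d for row_e, row_d in zip(W_E, W_D)]  # d_h x (n_E + n_D)
x = h_E + h_D                                          # list concatenation
s_mlp = sum(vk * math.tanh(z) for vk, z in zip(v, matvec(W, x)))

# The two agree up to floating-point rounding.
print(math.isclose(s_sum, s_mlp, rel_tol=1e-9))
```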

Why "additive", and what it is set against

The name comes from the $+$ inside the $tanh$: the query and key are added (after projection) before being scored. This is one of two families. Multiplicative (or dot-product) attention instead multiplies query and key, scoring them by an inner product such as $h_{t}^{⊤} \bar{h}_{s}$ or $h_{t}^{⊤} W \bar{h}_{s}$ (in Luong’s notation, $h_{t}$ is the decoder state and $\bar{h}_{s}$ a source state). These are the dot and general forms in Luong’s taxonomy; Luong’s third, concat, is additive attention by another name.

The Transformer inherits the multiplicative branch as scaled dot-product attention. “Additive” versus “multiplicative” is therefore not jargon about the output, both produce a scalar score, but a statement about how query and key are combined to produce it.
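To make the contrast concrete, here is a sketch of Luong's three score forms side by side. The function names are hypothetical labels chosen for this example, and the argument order follows Luong's notation (decoder state first, source state second).

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [dot(row, x) for row in W]

def score_dot(h_t, h_s):
    """Multiplicative, 'dot' form: requires equal dimensions."""
    return dot(h_t, h_s)

def score_general(h_t, W, h_s):
    """Multiplicative, 'general' form: one matrix between query and key."""
    return dot(h_t, matvec(W, h_s))

def score_concat(h_t, h_s, W, v):
    """Additive, 'concat' form: v^T tanh(W [h_t; h_s])."""
    hidden = [math.tanh(z) for z in matvec(W, h_t + h_s)]
    return dot(v, hidden)

h_t = [1.0, 2.0]
h_s = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.1, 0.2, -0.1, 0.0],
            [0.0, -0.3, 0.2, 0.1]]   # illustrative, d_h x (n_D + n_E)
v = [1.0, 1.0]

print(score_dot(h_t, h_s))                 # 11.0
print(score_general(h_t, identity, h_s))   # 11.0, general reduces to dot when W = I
print(score_concat(h_t, h_s, W_concat, v)) # a single scalar, like the others
```

All three return one scalar per query-key pair; they differ only in how the pair is combined on the way there.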

Step 3: softmax, then a weighted summary

The $K$ scores at step $t$ are turned into a probability distribution by a softmax, and the encoder states are averaged with those weights:

$$α_{t, i} = \frac{e^{s (h_{i}^{E}, h_{t}^{D})}}{\sum_{j = 1}^{K} e^{s (h_{j}^{E}, h_{t}^{D})}}, \qquad c_{t} = \sum_{i = 1}^{K} α_{t, i} h_{i}^{E}.$$

The weights $α_{t, i}$ sum to one, so $c_{t}$ is a convex combination of the encoder annotations: it lives in their convex hull and cannot blow up, exactly the boundedness the soft-attention note establishes. Crucially $c_{t}$ carries the step index $t$ : it is rebuilt from scratch at every output word, each time re-weighting the same encoder states toward whatever the decoder currently needs. The single frozen $h_{K}^{E}$ of plain seq2seq is replaced by a stream of custom summaries, and the recency-bias bottleneck is gone.
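Step 3 can be sketched as two small functions. The scores and encoder states below are made-up values for illustration, and subtracting the maximum inside `softmax` is a standard numerical-stability choice rather than part of the formula (it leaves the weights unchanged).

```python
import math

def softmax(scores):
    """Turn K real scores into K positive weights that sum to one."""
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted average of the encoder states, coordinate by coordinate."""
    n_E = len(encoder_states[0])
    return [sum(a * h[j] for a, h in zip(weights, encoder_states))
            for j in range(n_E)]

# Hypothetical scores for K = 3 source positions at one decoding step.
scores = [2.0, 0.5, -1.0]
encoder_states = [[1.0, 0.0],
                  [0.0, 1.0],
                  [4.0, 4.0]]

alpha = softmax(scores)
c_t = context_vector(alpha, encoder_states)
print(alpha, c_t)
```

Because the weights are positive and sum to one, each coordinate of `c_t` stays between the smallest and largest value that coordinate takes across the encoder states, which is the convex-hull property in concrete form.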

What the weights are, and a caution about reading them

The distribution $α_{t}$ is a soft alignment: it says how much each source word contributed to this target word. On translation these weights recover linguistically meaningful, even non-monotonic, alignments without ever being told them, which is the headline result of Bahdanau’s paper. That makes them a tempting explanation of the model’s choices, but a heatmap of attention is not a certified account of the model’s reasoning; the distinction between alignment, plausibility, and faithfulness is the subject of the explainability note.

A subtlety the figures smooth over: which decoder state is the query?

The diagrams use the current decoder state $h_{t}^{D}$ as the query. Bahdanau’s original formulation uses the previous decoder state to compute the attention for step $t$ , so that the context can be fed into the recurrence that produces $h_{t}^{D}$ . Implementations differ on this ordering, and on whether the context is concatenated to the decoder input or to its output. None of it changes the mechanism of this note; it is worth knowing only so that two correct diagrams that disagree on the index do not seem to contradict each other.
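Written in this note's symbols, Bahdanau's ordering would look roughly as follows. The decoder recurrence $f$ and the previous output token $y_{t - 1}$ are not defined elsewhere in this note and are introduced here only for illustration.

$$s_{t, i} = v^{⊤} \tanh (W^{E} h_{i}^{E} + W^{D} h_{t - 1}^{D}), \qquad c_{t} = \sum_{i = 1}^{K} α_{t, i} h_{i}^{E}, \qquad h_{t}^{D} = f (h_{t - 1}^{D}, y_{t - 1}, c_{t}),$$

with $α_{t, i}$ the softmax of $s_{t, i}$ over $i$. The only change from the diagrams is the index on the query: $h_{t - 1}^{D}$ in place of $h_{t}^{D}$.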

Additive versus multiplicative attention

Additive attention is one of two great branches; knowing the trade-off explains why the field eventually moved to the other.

	Additive (Bahdanau, 2014)	Multiplicative / dot-product (Luong, 2015; Transformer, 2017)
How query and key combine	sum inside a $tanh$	inner product
Score	$v^{⊤} tanh (W^{E} h_{i}^{E} + W^{D} h_{t}^{D})$	$h_{i}^{E ⊤} h_{t}^{D}$ (or with a weight matrix)
Extra parameters	$W^{E}, W^{D}, v$	none (dot) or one matrix (general)
Unequal dimensions	handled natively by the two projections	need a projection to match shapes first
Speed	a small MLP per pair	a single matrix multiply, highly parallel

Why the world moved to dot-product, and the trap that forced one fix

The two score families reach similar accuracy, but the dot product is a bare matrix multiplication: far faster and more memory-efficient on modern hardware than evaluating a small MLP for every query-key pair. That efficiency is why the Transformer is built on dot-product attention.

There is a catch, though, and it is the reason additive attention does not simply vanish from the story: for large dimension the raw dot product grows large in magnitude, pushing the softmax into saturated, near-zero-gradient regions, where unscaled dot-product attention actually underperforms additive attention. The Transformer’s repair is the $1/\sqrt{d_{k}}$ scaling factor in scaled dot-product attention, which restores the dot product to the well-behaved regime that additive scoring enjoyed for free. Additive attention is thus both the historical starting point and the benchmark that motivated the scaling trick.
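The saturation effect can be seen in a small simulation. This is a sketch under a stated assumption: query and key components are drawn as independent standard normals, so a dot product over $d$ components has standard deviation about $\sqrt{d}$. The function `mean_max_weight` is a hypothetical helper that reports how much of the softmax mass lands on the single largest score, averaged over random trials.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_max_weight(d, scale, K=10, trials=100, seed=0):
    """Average largest attention weight over random queries and keys.

    Assumption: all query/key components are i.i.d. standard normal.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scores = []
        for _ in range(K):
            k = [rng.gauss(0.0, 1.0) for _ in range(d)]
            scores.append(scale * sum(a * b for a, b in zip(q, k)))
        total += max(softmax(scores))
    return total / trials

d = 256
raw = mean_max_weight(d, scale=1.0)                    # unscaled dot product
scaled = mean_max_weight(d, scale=1.0 / math.sqrt(d))  # Transformer-style scaling
print(f"unscaled: {raw:.2f}  scaled: {scaled:.2f}")
```

Under these assumptions the unscaled softmax puts almost all of its mass on one key, which is the saturated regime described above, while the scaled version spreads the mass across several keys.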

How this connects

It is the query-conditional instance of the soft-attention pipeline, delivering the upgrade that note flagged as still missing.
It mounts onto the seq2seq chassis as the attentional interface; the encoder is typically bidirectional, so each annotation $h_{i}^{E}$ summarises the source as seen from position $i$ in both directions.
The soft alignments it produces are read, and carefully qualified, in the explainability note.
Its successor local attention restricts the same weighted average to a window, and the multiplicative branch leads on to the Transformer.
It exists as runnable code in the hands-on translation note, an Italian-English NMT system with additive attention over a bidirectional LSTM encoder.

Recap

The problem. A single frozen context vector cannot summarise a long source and is biased toward its recent tokens; additive attention replaces it with a fresh, per-step context.

Three steps. Project encoder and decoder states into a common space ( $W^{E}, W^{D}$ ); score every source position against the decoder by the additive energy $v^{⊤} tanh (W^{E} h_{i}^{E} + W^{D} h_{t}^{D})$ ; softmax the scores and take the weighted average $c_{t} = \sum_{i} α_{t, i} h_{i}^{E}$ .

Two demystifications. The score is a one-hidden-layer MLP on the concatenated states; “additive” names the sum of query and key, as opposed to the dot product of multiplicative attention.

Why projections matter. Equal dimension is not shared meaning; $W^{E}, W^{D}$ build a neutral space for measuring compatibility, which is what lets additive attention handle encoder/decoder asymmetries.

Its place in history. The first query-conditional attention; superseded in practice by the faster dot product, whose own failure at large dimension (saturated softmax) is fixed by the Transformer’s $1/\sqrt{d_{k}}$ scaling.

Sources

Additive attention is from Bahdanau, Cho and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (arXiv 2014, ICLR 2015). The multiplicative variants are from Luong, Pham and Manning (2015), treated in the local-attention note; scaled dot-product attention is from Vaswani et al. (2017).
