Local attention

What this note builds

The soft-attention pipeline and the additive attention of Bahdanau both attend to every source position at every decoding step. This note develops local attention (Luong, Pham and Manning, 2015): a variant that, at each step, looks only at a small window of source positions centred on an aligned position $p_{t}$. The construction is developed in full, including the two ways of choosing $p_{t}$ (monotonic, where it is assumed, and predictive, where it is learned), the Gaussian re-weighting that keeps the predicted position differentiable, and a careful correction of a common label: local attention is not hard attention. The note closes by tracing the idea forward to the sliding-window attention of modern long-context Transformers, where the same locality prior reappears at scale.

Where global attention starts to hurt

In an attention-based encoder-decoder, the encoder turns a source sequence of length $S$ into hidden states $\overset{ˉ}{h}_{1}, \dots, \overset{ˉ}{h}_{S}$, and at each decoding step $t$ the decoder holds a state $h_{t}$. Global attention, the classic construction, scores $h_{t}$ against all $S$ source states, normalises with a softmax, and forms the context vector as a weighted average over the whole source:

$$a_{t}(s) = \frac{\exp(\operatorname{score}(h_{t}, \overset{ˉ}{h}_{s}))}{\sum_{s'=1}^{S} \exp(\operatorname{score}(h_{t}, \overset{ˉ}{h}_{s'}))}, \qquad c_{t} = \sum_{s=1}^{S} a_{t}(s)\, \overset{ˉ}{h}_{s}.$$

This is exactly the three-step pipeline (score, normalise, combine), here made query-conditional: the score depends on the decoder query $h_{t}$ as well as the source key $\overset{ˉ}{h}_{s}$. The context is then merged with the decoder state into an attentional hidden state $\tilde{h}_{t} = \tanh(W_{c} [c_{t}; h_{t}])$, from which the output $y_{t}$ is produced.
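The score, normalise, combine steps can be sketched in a few lines of plain Python. This is a minimal illustration, not Luong's implementation: it assumes the dot-product score, uses lists of floats instead of tensors, and leaves out the merge into $\tilde{h}_{t}$. The names `global_attention`, `dot` and `softmax` and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention(h_t, source_states):
    """Return (weights a_t, context c_t) computed over all S source states."""
    scores = [dot(h_t, h_s) for h_s in source_states]   # score (dot product assumed)
    weights = softmax(scores)                           # normalise
    dim = len(source_states[0])
    context = [sum(w * h_s[i] for w, h_s in zip(weights, source_states))
               for i in range(dim)]                     # combine
    return weights, context

# Toy data (assumed): S = 3 encoder states of dimension 2 and one decoder state.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_t = [2.0, 0.0]
a_t, c_t = global_attention(h_t, source_states)
```

Both the list of scores and the weighted sum have one term per source state, which is where the dependence on $S$ comes from.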

The score function is a free choice

Luong’s framework allows several scoring functions, all interchangeable inside the pipeline: the dot product $h_{t}^{\top} \overset{ˉ}{h}_{s}$, the general bilinear form $h_{t}^{\top} W_{a} \overset{ˉ}{h}_{s}$, and the concat form $v_{a}^{\top} \tanh(W_{a} [h_{t}; \overset{ˉ}{h}_{s}])$, the last being the additive score of Bahdanau. Local attention is orthogonal to this choice: it changes which positions are scored, not how they are scored.
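A sketch of the three scores side by side may make the interchangeability concrete. It assumes plain Python lists for vectors and lists of rows for matrices; the function names and the toy parameter values are hypothetical, and the parameters would be learned in a real model.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def score_dot(h_t, h_s):
    # h_t^T h_s
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    # h_t^T W_a h_s
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    # v_a^T tanh(W_a [h_t; h_s]); list concatenation plays the role of [h_t; h_s]
    hidden = [math.tanh(x) for x in matvec(W_a, h_t + h_s)]
    return dot(v_a, hidden)

# Toy values (assumed).
h_t, h_s = [1.0, 2.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.25]]
v_a = [1.0, -1.0]

s_dot = score_dot(h_t, h_s)
s_general = score_general(h_t, h_s, identity)   # identity W_a reduces to the dot product
s_concat = score_concat(h_t, h_s, W_concat, v_a)
```

All three return a single scalar per (query, key) pair, so any of them can feed the same softmax.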

The weakness lies in the two sums of the global-attention equations (the softmax denominator and the context vector), both of which range over the entire source.

Two costs of looking everywhere

Computation. Producing $c_{t}$ costs one score evaluation per source position, so $O(S)$ per decoding step and $O(S \cdot T)$ over a target of length $T$. For long source sequences (long sentences, paragraphs, documents) this can come to dominate the decoder’s cost.

Focus. A softmax over hundreds or thousands of positions spreads a fixed unit of attention thinly. When the genuinely relevant source span is short and localised, scoring the entire source mostly adds distractors that the model must learn to suppress, and the resulting distribution blurs easily.

Both costs share a cause: the model is forced to consider the whole source even when it only needs to use a small part of it. Local attention removes that obligation.

The construction: aim a spotlight, then average inside it

The idea is one sentence. Instead of a floodlight over the whole source, local attention aims a spotlight of fixed radius $D$ at one source position $p_{t}$, and runs ordinary soft attention only on what the spotlight illuminates.

Concretely, at step $t$ the model first generates an aligned position $p_{t}$ in the source, then restricts the context vector to the window $[p_{t} - D, p_{t} + D]$:

$$c_{t} = \sum_{s = p_{t} - D}^{p_{t} + D} a_{t}(s)\, \overset{ˉ}{h}_{s},$$

where $D$ is a hyperparameter chosen in advance, not learned. Everything downstream, the attentional state $\tilde{h}_{t}$ and the output $y_{t}$, is unchanged: only the support of the weighted average has shrunk from $S$ positions to the $2D + 1$ positions inside the window.
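The windowed average can be sketched as follows. This is an illustration under stated assumptions: positions are 0-based (the text counts from 1), $p_{t}$ is already an integer, the score is the dot product, and the window is clipped to valid positions. `local_context` and the toy data are hypothetical names and values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def local_context(h_t, source_states, p_t, D):
    """Soft attention restricted to positions p_t - D .. p_t + D (0-based, clipped)."""
    S = len(source_states)
    lo, hi = max(0, p_t - D), min(S - 1, p_t + D)
    positions = list(range(lo, hi + 1))
    # Only the states inside the window are scored and normalised.
    weights = softmax([dot(h_t, source_states[s]) for s in positions])
    dim = len(h_t)
    context = [sum(w * source_states[s][i] for w, s in zip(weights, positions))
               for i in range(dim)]
    return positions, weights, context

# Toy data (assumed): S = 10 states of dimension 2, window radius D = 2.
source_states = [[float(s), 1.0] for s in range(10)]
h_t = [0.1, 0.0]
positions, a_t, c_t = local_context(h_t, source_states, p_t=4, D=2)
```

With $D = 2$ the weight vector has $2D + 1 = 5$ entries however long `source_states` is.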

Reading the figure

Both panels show the same attention layer feeding the same attentional state $\tilde{h}_{t}$ and output $y_{t}$:

On the left, global attention draws its alignment weights $a_{t}$ from every source state $\overset{ˉ}{h}_{s}$ (all arrows lit).

On the right, local attention first emits the aligned position $p_{t}$ (the arrow leaving the layer to the right), then draws weights only from the source states inside the window around $p_{t}$; the states outside it contribute nothing to $c_{t}$.

The context vector $c_{t}$, the alignment $a_{t}$, and the merge into $\tilde{h}_{t}$ are identical in form; the single difference is the set of positions the sum runs over.

The alignment vector becomes fixed-dimensional

In global attention the weight vector $a_{t}$ has length $S$, which varies from sentence to sentence. In local attention it always has length $2D + 1$, regardless of source length. This is more than an implementation convenience: it gives every step a context of constant size, decouples the attention layer’s cost from $S$, and turns the alignment into a small, fixed-width object the rest of the network can rely on.

The cost of the attention layer drops accordingly, from $O(S)$ per step to $O(2D + 1) = O(D)$, constant in the source length. Whether this helps, and how the window is aimed, depends entirely on how $p_{t}$ is chosen, the subject of the next section.
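The size of the saving is simple arithmetic, sketched here by counting score evaluations only. The function `score_evaluations` and the values of $S$, $T$ and $D$ are illustrative assumptions, not measurements.

```python
def score_evaluations(S, T, D=None):
    """Score evaluations over a whole target of length T.

    D=None means global attention (S scores per step); otherwise the
    window holds at most 2D + 1 positions per step.
    """
    per_step = S if D is None else min(S, 2 * D + 1)
    return per_step * T

# Assumed sizes: a 1000-token source, a 1000-token target, window radius 10.
global_cost = score_evaluations(S=1000, T=1000)
local_cost = score_evaluations(S=1000, T=1000, D=10)
ratio = global_cost / local_cost
```

Doubling $S$ doubles `global_cost` and leaves `local_cost` unchanged, which is the sense in which the attention layer's cost is constant in the source length.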

Two ways to aim the spotlight

Luong proposes two recipes for $p_{t}$, differing in whether the alignment is assumed or predicted.

Monotonic alignment (`local-m`)

The simplest choice assumes the source and target advance roughly in step, so that the relevant source position at decoding step $t$ is just $t$ itself:

$$p_{t} = t.$$

Inside the window the weights are the ordinary normalised scores, $a_{t}(s) = \operatorname{align}(h_{t}, \overset{ˉ}{h}_{s})$. There is nothing to learn in the aiming: the spotlight tracks the diagonal of the source–target alignment. This is a strong, cheap prior. It fits well when the two languages share word order, and it is visibly wrong when they reorder heavily (the verb-final structure of one language against the verb-medial structure of another).

Predictive alignment (`local-p`)

The richer choice lets the model predict where to look from the current decoder state:

$$p_{t} = S \cdot \operatorname{sigmoid}(v_{p}^{\top} \tanh(W_{p} h_{t})) \in [0, S],$$

with learned parameters $W_{p}, v_{p}$; the sigmoid keeps $p_{t}$ inside the source. Now $p_{t}$ is a real number, not an integer index, and the model must be free to learn it by gradient descent. The mechanism that makes this possible is a Gaussian re-weighting centred on $p_{t}$:

$$a_{t}(s) = \operatorname{align}(h_{t}, \overset{ˉ}{h}_{s}) \exp\left(-\frac{(s - p_{t})^{2}}{2\sigma^{2}}\right), \qquad \sigma = \frac{D}{2},$$

where $s$ ranges over the integer positions in the window. The Gaussian multiplies each in-window score by a bump that is largest at $s = p_{t}$ and falls off smoothly toward the window edges.

Its width $\sigma = D/2$ is tied to the window so that the bump has decayed substantially by the window boundary: at $|s - p_{t}| = D = 2\sigma$ the factor is $e^{-2} \approx 0.14$. This is also the practical justification for truncating the average at $\pm D$: positions further out would receive little weight anyway.
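The two formulas of this subsection can be sketched directly. The code below is a minimal illustration assuming plain Python lists; `predict_position`, `gaussian_factor`, `reweight` and all numeric values are hypothetical, and the re-weighted values are left unnormalised, as in the formula as written.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict_position(h_t, W_p, v_p, S):
    # p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real number in [0, S]
    hidden = [math.tanh(dot(row, h_t)) for row in W_p]
    z = dot(v_p, hidden)
    return S / (1.0 + math.exp(-z))

def gaussian_factor(s, p_t, D):
    # exp(-(s - p_t)^2 / (2 sigma^2)) with sigma = D / 2
    sigma = D / 2.0
    return math.exp(-((s - p_t) ** 2) / (2.0 * sigma ** 2))

def reweight(align, positions, p_t, D):
    # a_t(s) = align(h_t, h_s) * Gaussian factor, one value per window position
    return [a * gaussian_factor(s, p_t, D) for a, s in zip(align, positions)]

# Toy values (assumed): zero W_p gives z = 0, so p_t sits at the middle of the source.
p_t = predict_position([0.5, -0.5], [[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0], S=20)
a_t = reweight([0.2] * 5, [8, 9, 10, 11, 12], p_t, D=2)
```

In this toy case the uniform alignment weights come out peaked at the centre of the window, with the two edge positions scaled by $e^{-2}$.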

Why the Gaussian is the clever part

The gradient reaches the predicted $p_{t}$ through the smooth term $\exp(-(s - p_{t})^{2} / 2\sigma^{2})$, which is differentiable in $p_{t}$. The gradient of the loss can therefore flow back into $W_{p}, v_{p}$ and teach the model where to point the spotlight, all by ordinary backpropagation.

Had the window simply been “the $2D + 1$ integers nearest $p_{t}$” with no Gaussian, $p_{t}$ would enter only through a rounding-to-integer step, which has zero gradient almost everywhere, and the position predictor could never be trained. The Gaussian is precisely the device that turns a discrete “which positions” decision into a continuous, learnable one.
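A finite-difference check makes the contrast tangible. This is a toy sketch, not a training setup: the "context" is a scalar sum over made-up values, the Gaussian version sums over all positions for simplicity, and `gaussian_weighted`, `rounded_window` and `finite_diff` are hypothetical helpers.

```python
import math

def gaussian_weighted(p, values, D):
    """Values at integer positions, each weighted by the Gaussian bump around p."""
    sigma = D / 2.0
    return sum(v * math.exp(-((s - p) ** 2) / (2.0 * sigma ** 2))
               for s, v in enumerate(values))

def rounded_window(p, values, D):
    """Same values, but p only selects an integer window through rounding."""
    c = round(p)
    lo, hi = max(0, c - D), min(len(values) - 1, c + D)
    return sum(values[lo:hi + 1])

def finite_diff(f, p, eps=1e-4):
    # Central difference approximation of df/dp.
    return (f(p + eps) - f(p - eps)) / (2.0 * eps)

values = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]  # toy data (assumed)
D = 2
grad_gauss = finite_diff(lambda p: gaussian_weighted(p, values, D), 3.3)
grad_round = finite_diff(lambda p: rounded_window(p, values, D), 3.3)
```

Nudging $p$ changes the Gaussian-weighted sum, so `grad_gauss` is non-zero, while the rounded version does not move at all and `grad_round` is exactly zero.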

Window truncation at the boundaries

Near the start or end of the source the window $[p_{t} - D, p_{t} + D]$ runs off the edge; the out-of-range positions are simply dropped and the average is taken over the valid ones. The Gaussian’s rapid decay means the lost positions carry little weight, so the truncation is benign.
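Dropping the out-of-range positions amounts to clipping the window to the source. A minimal sketch, assuming 0-based positions and a possibly real-valued $p_{t}$; the helper name `window_positions` is hypothetical.

```python
import math

def window_positions(p_t, D, S):
    """Valid 0-based integer positions inside [p_t - D, p_t + D], clipped to 0..S-1."""
    lo = max(0, math.ceil(p_t - D))
    hi = min(S - 1, math.floor(p_t + D))
    return list(range(lo, hi + 1))

interior = window_positions(5, 2, 10)      # full window of 2D + 1 positions
near_start = window_positions(1, 3, 10)    # left part falls off the source
near_end = window_positions(8.5, 2, 10)    # right part falls off the source
```

Near the edges the list is simply shorter than $2D + 1$, and the average is taken over whatever remains.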

The pitfall: local attention is not hard attention

Local attention is sometimes glossed as “hard attention”, and the soft/hard trade-off that motivated it (below) makes the slip easy to commit. The conflation is worth dismantling carefully, because the two mechanisms differ in the property that matters most for training.

What hard attention actually is

True hard attention (Xu et al., 2015, in the image-captioning model discussed in the aggregation note) samples one position from the attention distribution and copies that single source state as the context. Because sampling a discrete index is not a differentiable operation, the scoring network cannot be trained by backpropagation; hard attention requires REINFORCE-style gradient estimators or continuous relaxations such as the Gumbel-Softmax, exactly as described in the soft-versus-hard discussion.

What local attention actually is

Local attention never samples and never collapses to a single position. Inside its window it performs the ordinary soft, weighted average of $2 D + 1$ states, and the whole computation, including the predicted $p_{t}$ via the Gaussian, is differentiable end-to-end. There is no REINFORCE, no relaxation, no high-variance gradient estimator: plain backpropagation suffices.

The accurate description, and Luong’s own, is that local attention is a blend of soft and hard. It borrows from hard attention the idea of focusing on a small region rather than the whole source, and it borrows from soft attention the differentiable weighted average that makes training easy. The motivation came from the soft/hard trade-off in Xu et al.; the implementation deliberately stays on the soft, trainable side of it.
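The difference between the two mechanisms fits in two small functions. This sketch assumes the attention weights are already computed; `hard_context`, `soft_context` and the toy values are hypothetical, and Python's `random.Random.choices` stands in for the sampling step.

```python
import random

def hard_context(weights, states, rng):
    """Hard attention: sample ONE index from the distribution and copy that state."""
    idx = rng.choices(range(len(states)), weights=weights, k=1)[0]
    return states[idx]

def soft_context(weights, states):
    """Soft attention (also used inside a local window): weighted average of states."""
    dim = len(states[0])
    return [sum(w * h[i] for w, h in zip(weights, states)) for i in range(dim)]

# Toy window of three states with their attention weights (assumed).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.2, 0.5, 0.3]

sampled = hard_context(weights, states, random.Random(0))
averaged = soft_context(weights, states)
```

`sampled` is always one of the original states, chosen by a discrete random draw; `averaged` is a smooth function of the weights, which is why backpropagation applies to it directly.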

A second, quieter pitfall: the encoder still reads everything

Local attention shrinks the cost of the attention layer from $O(S)$ to $O(D)$ per step. It does not make the model sub-linear in $S$ overall: the (often bidirectional) encoder still processes all $S$ tokens to produce the states $\overset{ˉ}{h}_{s}$ in the first place, at $O(S)$ cost. The saving is real but localised to the alignment and context computation, not to the encoder. Claiming local attention “makes the whole model cheaper on long inputs” overstates it.

What local attention really is: a locality prior

Set beside the general aggregation view, local attention has a clean reading. Soft attention aggregates a set of feature vectors with content-dependent weights; local attention aggregates the same set restricted to a neighbourhood of $p_{t}$. Restricting the set is an architectural commitment, an inductive bias: the model is told, before it sees any data, that the useful information for step $t$ lives near position $p_{t}$ and nowhere else.

That bias is helpful exactly when it is true. In translation between word-order-similar languages the alignment is nearly monotonic, so the diagonal prior of local-m is almost free accuracy; when reordering is heavy, local-p lets the model learn the offset instead of asserting it. As always with priors, the benefit is real when the assumption matches the data and a liability when it does not, which is why local-p (a softer, learned prior) is the more robust default of the two.

The trade against global attention, in one line

Global attention assumes nothing about where the relevant source lies and pays for that generality with $O (S)$ cost and a diluted softmax. Local attention assumes the relevant source is a contiguous window around a (possibly learned) centre, and is rewarded with constant cost and sharper focus, provided the assumption holds. The choice between them is a choice about how much structure to build in.

Where this idea went: sliding-window attention at scale

Local attention reads, in retrospect, as an early instance of a principle that became central once attention replaced recurrence entirely. The Transformer’s self-attention has cost $O(n^{2})$ in sequence length $n$, the modern incarnation of global attention’s “look everywhere” cost. The fix that scales is precisely Luong’s: restrict each position to a window.

Longformer (Beltagy et al., 2020) gives each token a fixed-width sliding window of neighbours plus a handful of global tokens, cutting attention to $O(n \cdot w)$ for window width $w$.

Big Bird (Zaheer et al., 2020) combines local windows with a few random and global connections, recovering long-range reach at near-linear cost.
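The shared sliding-window component of these models can be sketched as a set of allowed (query, key) pairs. This is an illustration only: it covers the local window and omits the global and random connections, and `sliding_window_pairs` and `half_width` are hypothetical names rather than either library's API.

```python
def sliding_window_pairs(n, half_width):
    """All (i, j) where token i may attend to token j, i.e. |i - j| <= half_width."""
    return [(i, j)
            for i in range(n)
            for j in range(max(0, i - half_width), min(n, i + half_width + 1))]

n, half_width = 8, 1                       # toy sizes (assumed)
pairs = sliding_window_pairs(n, half_width)
full_pairs = n * n                         # what unrestricted self-attention scores
window_bound = n * (2 * half_width + 1)    # the O(n * w) bound with w = 2 * half_width + 1
```

The number of scored pairs grows linearly in `n` for a fixed window, against `n * n` for full self-attention; the edge tokens have slightly fewer neighbours, so the count sits just under the bound.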

These live in the Transformer literature under efficient attention. The window has grown up, from a $2D + 1$ span around a predicted alignment in a 2015 NMT decoder into the structural backbone of long-context Transformers, but the move is identical: when attending to everything is too expensive or too unfocused, attend to a neighbourhood.

Recap

Local attention in one frame

Construction. At step $t$, generate an aligned position $p_{t}$ and take the soft-attention weighted average only over the window $[p_{t} - D, p_{t} + D]$; $D$ is a fixed hyperparameter and the alignment vector has constant length $2D + 1$.

Aiming. `local-m` sets $p_{t} = t$ (a monotonic-alignment prior, nothing learned); `local-p` predicts $p_{t} = S \cdot \operatorname{sigmoid}(\cdot)$ and uses a Gaussian re-weighting of width $\sigma = D/2$ to keep the predicted position differentiable.

Cost. The attention layer drops from $O(S)$ to $O(D)$ per step; the encoder still reads all $S$ tokens.

The label trap. Local attention is a differentiable blend of soft and hard, trained by plain backpropagation; it is not the sampling-based, REINFORCE-trained hard attention of Xu et al.

The lineage. It is the conceptual ancestor of the sliding-window sparse attention used in long-context Transformers.

Sources

Local attention, and the global/local distinction, are from Luong, Pham and Manning, Effective Approaches to Attention-based Neural Machine Translation (2015). The soft/hard trade-off that motivated it is from Xu et al., Show, Attend and Tell (2015); the additive global attention it streamlines is from Bahdanau, Cho and Bengio (2015), treated in the additive attention note.

Deep Learning: Zero to Hero
