The closing list of constraints in the previous note essentially writes the answer: a mechanism that produces a fixed-shape summary as a learned, content-dependent weighted combination of the hidden states, with a parameter count independent of the sequence length $T$.
The construction that satisfies these constraints is called soft attention. It is the seed of every modern attention mechanism, and in its minimal form it consists of three operations: a small scoring network, a softmax normalization, and a weighted sum.
The three-step pipeline

Given the sequence of hidden states $h_1, \dots, h_T$, with $h_t \in \mathbb{R}^d$, produced by an LSTM (or any other recurrent layer) on the input tweet, soft attention proceeds as follows.
Step 1: score every position
A small fully connected network reads each hidden state $h_t$ and emits a scalar attention score $e_t$:
$$e_t = \tanh(w^\top h_t + b)$$
where $w \in \mathbb{R}^d$ is a learned weight vector and $b$ is a scalar bias. The score $e_t$ is unnormalized; it represents a raw, position-by-position estimate of how relevant $h_t$ is for the downstream task.
The crucial property of this scoring network is that the same $(w, b)$ is used at every position. The pictorial unrolling in the diagram shows a separate small FC at each time step, but in implementation there is a single shared network, exactly as the recurrent cell is a single shared cell unrolled across time.
Step 2: normalize via softmax
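The raw scores are passed through a softmax to produce a probability distribution over positions:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}$$
By construction $\alpha_t > 0$ and $\sum_{t=1}^{T} \alpha_t = 1$. The vector $\alpha = (\alpha_1, \dots, \alpha_T)$ is the attention distribution: each $\alpha_t$ is the fraction of the model’s “focus” assigned to position $t$.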
Step 3: produce the context vector
The hidden states are combined by weighted average, with the weights given by $\alpha$:
$$c = \sum_{t=1}^{T} \alpha_t h_t$$
The context vector $c \in \mathbb{R}^d$ is the fixed-shape summary of the entire sequence, content-dependent and learned. It plays exactly the role the uniform average played in the rejected averaging fix, but with non-uniform weights that the network has learned to assign.
The sentiment head consumes $c$ in place of $h_T$, and the rest of the architecture is unchanged.
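The three steps can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not a reference implementation: the function name `soft_attention`, the toy hidden states `H`, and the weight vector `w` are made up for the example, and the score is taken to be a tanh of a shared linear projection as described above.

```python
import math

def soft_attention(hidden_states, w, b=0.0):
    """Return (alpha, c) for T hidden states, each a list of d floats."""
    # Step 1: one shared scoring network, score_t = tanh(w . h_t + b).
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b)
              for h in hidden_states]
    # Step 2: softmax over positions (max subtracted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [x / z for x in exps]
    # Step 3: context vector as the alpha-weighted average of the hidden states.
    d = len(hidden_states[0])
    c = [sum(a * h[j] for a, h in zip(alpha, hidden_states)) for j in range(d)]
    return alpha, c

# Toy inputs (illustrative values only): T = 4 positions, d = 3.
H = [[0.1, -0.2, 0.3], [1.0, 0.5, -0.5], [0.0, 0.0, 0.2], [-0.3, 0.8, 0.1]]
w = [0.5, -1.0, 2.0]
alpha, c = soft_attention(H, w)
```

The returned `c` has length `d` whatever the number of positions, which is the fixed-shape property the downstream head relies on.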
Dimensions used in this note
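| symbol | role | shape |
|---|---|---|
| $h_t$ | recurrent hidden state at position $t$ | $\mathbb{R}^d$ |
| $w$ | scoring-network weight vector (shared across $t$) | $\mathbb{R}^d$ |
| $b$ | scoring-network bias (shared across $t$, often omitted) | scalar |
| $e_t$ | raw attention score at position $t$ | scalar |
| $\alpha_t$ | normalized attention weight at position $t$ | scalar, in $[0, 1]$ |
| $\alpha$ | attention distribution over all positions | $\mathbb{R}^T$ |
| $c$ | context vector (fixed-shape summary) | $\mathbb{R}^d$ |
| $T$ | sequence length | scalar |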
Figure labels
The figures use the course-slide notation from the previous note: for and for . The three formulas they display, , the softmax, and , are exactly those derived above, and the colours match (hidden states blue, attention quantities purple).
What each step is structurally doing
The score network: learning a salience map
Step 1 is a tiny fully connected layer: $d$ input units, one output unit. Its only job is to map a hidden state to a scalar that says “how interesting is this position for the task”. The $\tanh$ is a convention, not a requirement; it bounds the score to $(-1, 1)$ and keeps the subsequent softmax numerically well-behaved, but the architecture works with most monotonic activations.
What makes this scoring network nontrivial is that the hidden state $h_t$ already encodes context, because it was produced by an LSTM that aggregated everything up to position $t$. So the score assigned to position $t$ is not just a function of the $t$-th word in isolation: it depends on the word in context. The same word “struggling”, occurring as a literal complaint or as a hyperbolic flourish in an affectionate sentence, will appear in two very different hidden states, and the score network can in principle distinguish them.
Softmax: turning scores into a distribution
Step 2 normalizes the scores into a probability distribution. Three properties of softmax matter here.
- Sum-to-one. The resulting weights are a budget: the model has exactly one unit of attention to spend, and it must decide how to allocate it across positions. Spending more on one position necessarily means spending less on others.
- Smooth maximization. Softmax is the canonical smooth approximation of the argmax operator. When one score dominates the others, $\alpha$ concentrates almost all mass on the corresponding position; when scores are roughly equal, $\alpha$ is roughly uniform.
- Differentiability. Every component of $\alpha$ is a smooth function of every score, so the gradient of the loss flows back through the softmax to the scoring network and from there to the LSTM. Soft attention is trained end-to-end, with no auxiliary losses or reinforcement-learning tricks.
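The sum-to-one and smooth-maximization properties are easy to see numerically. The sketch below uses a hand-written `softmax` helper and two made-up score vectors (both are assumptions for illustration): one with nearly equal scores, one with a single dominant score.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

# Nearly equal scores: the weights stay close to uniform (about 1/4 each).
near_uniform = softmax([0.10, 0.11, 0.09, 0.10])

# One dominant score: almost the whole budget goes to the first position.
peaked = softmax([8.0, 0.0, 0.0, 0.0])
```

In both cases the weights sum to one, so any mass gained by one position is taken from the others.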
Soft vs hard attention
Replacing softmax with argmax would give hard attention: the model picks one position and copies its hidden state as the context. Hard attention is more interpretable (the context comes from exactly one position) and more efficient (no need to weight every $h_t$), but argmax is not differentiable, so the scoring network cannot be trained by backpropagation.
Training hard-attention models requires REINFORCE-style gradient estimators (Mnih et al., 2014) or continuous relaxations like the Gumbel-Softmax (Jang et al., 2017). Soft attention buys end-to-end trainability by making the choice fractional rather than discrete; the word “soft” in the name refers exactly to this relaxation.
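A small experiment makes the differentiability gap concrete. In this sketch (helper names and toy values are assumptions for illustration), nudging one score slightly leaves the hard context exactly unchanged, so it carries no gradient signal back to the scorer, while the soft context moves with the nudge.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def soft_context(scores, H):
    """Weighted average of all hidden states (soft attention)."""
    alpha = softmax(scores)
    return [sum(a * h[j] for a, h in zip(alpha, H)) for j in range(len(H[0]))]

def hard_context(scores, H):
    """Copy of the single highest-scoring hidden state (hard attention)."""
    t_star = max(range(len(scores)), key=lambda t: scores[t])
    return list(H[t_star])

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.2, 0.9, 0.1]
nudged = [0.2, 0.9 + 1e-3, 0.1]  # tiny change to one score

hard_changed = hard_context(scores, H) != hard_context(nudged, H)
soft_changed = soft_context(scores, H) != soft_context(nudged, H)
```

The hard output is piecewise constant in the scores, which is why gradient estimators or relaxations are needed to train it.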
The context vector: a learned weighted summary
Step 3 collapses the $T$ hidden states into a single vector of the same dimension as one hidden state. Three properties make $c$ exactly the summary that was missing in the previous note.
- Fixed shape. $c \in \mathbb{R}^d$ regardless of $T$, so downstream layers have fixed-shape inputs.
- Content-dependent weighting. The mixing coefficients $\alpha_t$ depend on the actual values of the hidden states (through the scoring network), so salient positions dominate the summary.
- Bounded by construction. $c$ is a convex combination of the $h_t$ (since $\alpha_t \ge 0$ and $\sum_t \alpha_t = 1$), so $c$ lies in the convex hull of $\{h_1, \dots, h_T\}$. In particular $\|c\|$ is bounded by $\max_t \|h_t\|$, so the context vector cannot blow up.
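The norm bound follows from the triangle inequality: the norm of a weighted sum is at most the weighted sum of the norms, and with non-negative weights summing to one that is at most the largest norm. A quick numerical check, with made-up hidden states and weights chosen only for illustration:

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def convex_combination(weights, vectors):
    """Weighted sum of vectors; weights are assumed non-negative and sum to 1."""
    d = len(vectors[0])
    return [sum(a * v[j] for a, v in zip(weights, vectors)) for j in range(d)]

H = [[3.0, 4.0], [-1.0, 0.5], [0.0, -2.0]]   # toy hidden states
alpha = [0.7, 0.2, 0.1]                      # toy attention weights

c = convex_combination(alpha, H)
largest = max(norm(h) for h in H)
bound_holds = norm(c) <= largest
```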
Attention shortens the gradient's path, not only the forward one
The motivation in the previous note was a forward problem: the final hidden state $h_T$ forgets early tokens. Attention repairs a backward problem at the same time. In the plain recurrent classifier, the only route from the loss to an early hidden state $h_t$ runs backward through all $T - t$ recurrent steps, the very chain of Jacobians that makes the gradient vanish in BPTT. With attention, the context vector already contains the term $\alpha_t h_t$, so the loss reaches $h_t$ through a path of length $O(1)$: differentiating $c = \sum_k \alpha_k h_k$ gives
$$\frac{\partial c}{\partial h_t} = \alpha_t I + \sum_{k=1}^{T} h_k \left(\frac{\partial \alpha_k}{\partial h_t}\right)^{\top}$$
an unattenuated identity scaled by the attention weight, plus a smaller term from the score’s dependence on $h_t$. Attention is, in this precise sense, a content-based skip connection: it gives every position a direct, gated route to the loss, exactly as a residual connection gives every layer a direct route across depth. The two great cures for depth in these chapters, the additive identity path and content-based routing, turn out to be the same idea seen from two angles.
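For the tanh scorer used in this note, the second term can be written out: applying the softmax derivative and the chain rule gives a Jacobian of the context vector with respect to the hidden state at position t equal to the attention weight at t times (identity + (1 - score_t^2) (h_t - c) w^T). That closed form is derived here as a check rather than quoted from the note, and the sketch below compares it with central finite differences on toy values (all names and numbers are assumptions).

```python
import math

def attend(H, w, b):
    """Return (scores, alpha, c) for the tanh-scored soft attention layer."""
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in H]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [x / z for x in exps]
    c = [sum(a * h[j] for a, h in zip(alpha, H)) for j in range(len(w))]
    return scores, alpha, c

def analytic_jacobian(H, w, b, t):
    """d c / d h_t = alpha_t * (I + (1 - e_t^2) * (h_t - c) w^T)."""
    e, alpha, c = attend(H, w, b)
    d = len(w)
    gate = 1.0 - e[t] ** 2  # derivative of tanh at the pre-activation
    return [[alpha[t] * ((1.0 if i == j else 0.0) + gate * (H[t][i] - c[i]) * w[j])
             for j in range(d)] for i in range(d)]

def numeric_jacobian(H, w, b, t, eps=1e-6):
    """Central finite differences of c with respect to each entry of h_t."""
    d = len(w)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        plus = [list(h) for h in H]
        minus = [list(h) for h in H]
        plus[t][j] += eps
        minus[t][j] -= eps
        c_plus = attend(plus, w, b)[2]
        c_minus = attend(minus, w, b)[2]
        for i in range(d):
            J[i][j] = (c_plus[i] - c_minus[i]) / (2 * eps)
    return J

H = [[0.1, -0.2, 0.3], [1.0, 0.5, -0.5], [0.0, 0.0, 0.2], [-0.3, 0.8, 0.1]]
w = [0.5, -1.0, 2.0]
b = 0.1
t = 0  # an "early" position
Ja = analytic_jacobian(H, w, b, t)
Jn = numeric_jacobian(H, w, b, t)
max_gap = max(abs(Ja[i][j] - Jn[i][j]) for i in range(3) for j in range(3))
```

The identity part of the Jacobian is scaled only by the attention weight, with no product of recurrent Jacobians in front of it.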
Parameter cost
The whole mechanism costs $d + 1$ extra parameters
The scoring network is a single fully connected layer mapping $\mathbb{R}^d \to \mathbb{R}$, with a shared weight vector $w \in \mathbb{R}^d$ and (optionally) a scalar bias $b$. Because the same $(w, b)$ is used at every position $t$, the total number of new learnable parameters introduced by adding soft attention to a recurrent architecture is
$$d + 1$$
This is independent of $T$. Adding attention does not enlarge the model in proportion to sequence length; it adds a fixed, tiny overhead and lets the existing recurrent layer’s representations carry the actual content.
Derivation of the parameter count
The scoring network is one neuron with $d$ inputs and one output. The forward computation at position $t$ is
$$e_t = \tanh(w^\top h_t + b)$$
Its parameters are the $d$ weights $w$ and the scalar bias $b$, for a total of $d + 1$ scalars per network.
The same $(w, b)$ is applied at every position $t$: there are not $T$ scoring networks but one, reused $T$ times. (The pictorial unrolling in the diagram shows $T$ copies for visual clarity, exactly as recurrent cells are drawn unrolled across time. In implementation there is a single $w$ and a single $b$.) The total parameter count is therefore the per-network count, $d + 1$, with no multiplication by $T$.
In the running example with $d = 100$, soft attention adds 101 parameters to a recurrent architecture that already has tens of thousands. The architectural lift is enormous compared to the cost.
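The count can be spelled out as a two-line function. The second function is a hypothetical contrast, not something the note proposes: it shows what the cost would be if each position had its own scorer, using the 44-token length of the running example.

```python
def attention_param_count(d, use_bias=True):
    """Learnable scalars in the shared scorer: d weights plus an optional bias."""
    return d + (1 if use_bias else 0)

def per_position_scorer_count(d, T, use_bias=True):
    """Hypothetical unshared variant: one separate scorer per position."""
    return T * attention_param_count(d, use_bias)

shared = attention_param_count(100)              # the shared scorer of the note
unshared_44 = per_position_scorer_count(100, 44) # grows with sequence length
```

Only the unshared variant depends on the sequence length, which is exactly what weight sharing avoids.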
Interpretability: attention as a heat map
One side effect of soft attention is unusual for neural networks: the distribution $\alpha$ is directly interpretable. It is a heat map over positions, saying where the model is looking when producing its prediction. Visualizing $\alpha$ on the input sequence often reveals that the model has learned exactly the salience pattern a human reader would expect.
What the running example would yield
Suppose the LSTM-plus-attention model is trained for sentiment classification and converges. For the professor’s tweet, a plausible attention distribution over the 44 tokens might concentrate as follows:
token: “love”, “Deep”, “Learning”, “struggling”, “professor”, (every other token)

The mass concentrates on the affective word “love” and on the topic of affection (“Deep Learning”); the locally negative-sounding tokens receive small but non-zero mass, enough for the model to register them as context without letting them dominate. The context vector $c$ is therefore mostly the hidden state at “love”, lightly mixed with the surrounding context, and the sentiment head correctly predicts positive. The misunderstanding pattern from the previous note is gone.
The interpretability is a real consequence of the architecture, not a heuristic post-hoc tool. Whatever the network does with the rest of its computation, the bottleneck through which all sequence information must flow into the sentiment head is the single linear combination $c = \sum_t \alpha_t h_t$, and the coefficients of that combination are visible.
Attention weights are not guaranteed to be explanations
The heat map is suggestive, but reading $\alpha$ as the reason for a prediction is a known trap. Two findings temper it.
- First, for a fixed input one can often construct a different attention distribution that leaves the prediction essentially unchanged (Jain and Wallace, 2019): when many distributions yield the same output, no single one is the explanation.
- Second, attention weights do not always agree with other importance measures such as gradient attributions or leave-one-out scores. The rejoinder (Wiegreffe and Pinter, 2019) is more nuanced: attention is not a faithful explanation in general, but it is not useless either, and whether it can be trusted depends on the architecture and on how the claim is tested.
The safe reading is operational: $\alpha$ shows where the context vector draws its mass from, which is a fact about the model’s computation, not a certified account of why it decided as it did.
Pipeline recap

Putting the three steps next to each other, with the running example in mind:
- Hidden states from the recurrent layer. The LSTM processes the tweet word by word, producing $h_1, \dots, h_T$. Each $h_t$ is the network’s contextual representation of position $t$.
- Soft attention layer.
  - Raw scores. The scoring network computes $e_t = \tanh(w^\top h_t + b)$ for every $t$.
  - Softmax normalization. The scores are converted to weights $\alpha_t$ that sum to one.
  - Context vector. The hidden states are combined as $c = \sum_t \alpha_t h_t$.
- Prediction head. The context vector $c$ is fed to the sentiment classifier in place of the final hidden state.
The recurrent layer, the scoring network, and the prediction head are all trained jointly by backpropagation: the gradient of the classification loss flows back through the head, through the softmax, through both the scoring network and the hidden states, and into the recurrent cell. The scoring network learns to assign high mass to whichever positions help the head make correct predictions.
What this is not, yet
The mechanism above is the simplest possible version of attention, sufficient to motivate the idea but missing several features of the modern construction. The differences are worth naming explicitly, both to set expectations and to give vocabulary for the next steps.
- Query-free scoring. Here the score depends only on the hidden state itself, not on what is being asked of the model. Modern attention is query-conditional: $e_t = f(q, h_t)$, where $q$ encodes the current information need. The same hidden state can be highly relevant when one question is being asked and irrelevant when another is. The running example glosses over this because the “query” (which sentiment is this?) is fixed for the whole task, so a query-free score suffices.
- No key/value separation. Here the hidden state $h_t$ is used both to compute its score and to contribute to the context vector. Modern attention separates these two roles, using a learned projection $k_t = W_K h_t$ (the key) for scoring and another projection $v_t = W_V h_t$ (the value) for the weighted sum. This gives the network independent control over “what makes a position relevant” and “what information that position contributes once selected”.
- No dot-product scoring. Here the score is computed by a learned linear projection $w^\top h_t$. Modern attention uses scaled dot products $q^\top k_t / \sqrt{d_k}$ between query and key, which is both more expressive (the score depends on both vectors symmetrically) and trivially parallelizable.
- Single head. Here there is one set of scores producing one context vector. Modern attention uses multiple heads, each with its own scoring network, each producing its own context vector, concatenated together. Different heads learn to attend to different kinds of relationships.
Every one of these refinements is a strict extension of the mechanism above. The conceptual core, learned weighted combination of a sequence of feature vectors with weights given by a softmax over learned scores, is preserved exactly.
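To make the first three refinements concrete, here is a rough single-head sketch with a query, separate key and value projections, and scaled dot-product scores. It is an illustration under assumptions rather than the construction of any particular library: the function names, the projection matrices `W_K` and `W_V`, and the toy numbers are all made up for the example.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def dot_product_attention(q, H, W_K, W_V):
    """Query-conditioned attention: scores from keys, summary from values."""
    keys = [matvec(W_K, h) for h in H]      # what makes a position relevant
    values = [matvec(W_V, h) for h in H]    # what a position contributes
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    alpha = softmax(scores)
    d_v = len(values[0])
    c = [sum(a * v[j] for a, v in zip(alpha, values)) for j in range(d_v)]
    return alpha, c

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy hidden states
W_K = [[1.0, 0.0], [0.0, 1.0]]              # toy key projection
W_V = [[2.0, 0.0], [0.0, 2.0]]              # toy value projection

# Two different queries over the same hidden states.
alpha_a, c_a = dot_product_attention([4.0, 0.0], H, W_K, W_V)
alpha_b, c_b = dot_product_attention([0.0, 4.0], H, W_K, W_V)
```

With the same hidden states, the two queries rank the first two positions in opposite order, which is the query-conditional behaviour that a query-free scorer cannot express. The softmax and the weighted sum are unchanged from the minimal mechanism.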
Where soft attention is used beyond NLP sequences
The construction is more general than the NLP context in which it was introduced. The same three-step pipeline applies wherever a model needs to combine a set or sequence of feature vectors into a fixed-shape summary with learned, content-dependent weights: channels of a CNN feature map, spatial locations of an image, nodes of a graph, slots of an external memory. The next note surveys these generalizations.