One attention layer learns one attention pattern per query — but a sentence has many simultaneous relational structures (subject-verb agreement, coreference, prepositional attachment, …). Multi-head attention runs parallel attention computations, each with independent Q/K/V projections, lets each head specialise, then merges the results.
Why one head isn’t enough
A single self-attention layer with one set of projections $W^Q, W^K, W^V$ produces one attention pattern per query. Whatever subspace the projections happen to land in determines what kind of similarity gets measured.
But natural sentences contain many simultaneous relational structures:
- “The animal didn’t cross the street because it was too tired” — `it` should attend to `animal` (coreference).
- “The black cat that sleepily sat on the mat” — `sat` should attend to `cat` (subject-verb agreement) and to `sleepily` (adverbial modifier).
- Long-range syntactic dependencies, local positional ones, semantic affinities — all happening at once.
A single head has to compromise — its weights settle on the most loss-reducing average of all these patterns. Multi-head attention sidesteps the compromise by running $h$ attention computations in parallel, each with its own independent projections. Each head ends up specialising in a different relational structure.
The mechanism
Pick the number of heads $h$ (typically $h = 8$ in the original transformer). For each head $i$, instantiate three independent learnable matrices:

$$W_i^Q \in \mathbb{R}^{d_\text{model} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_\text{model} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_\text{model} \times d_v}$$
Each matrix maps from the model dimension $d_\text{model}$ to a smaller per-head dimension — typically $d_k = d_v = d_\text{model}/h$. So in the original transformer with $d_\text{model} = 512$ and $h = 8$, each head has $d_k = d_v = 64$.
For each head, run scaled dot-product attention independently:

$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

where $Q_i = X W_i^Q$, $K_i = X W_i^K$, $V_i = X W_i^V$.
Each head produces an $n \times d_v$ output. Concatenate all heads’ outputs along the feature axis:

$$\text{Concat}(\text{head}_1, \dots, \text{head}_h) \in \mathbb{R}^{n \times h d_v}$$
Then project through one final learnable matrix $W^O \in \mathbb{R}^{h d_v \times d_\text{model}}$ to merge them into one output sequence of the original dimension:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$
The output is again $n \times d_\text{model}$, ready to feed into the next layer.
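The mechanism above can be sketched head by head in NumPy. This is an illustrative toy, not a trained layer: the projection matrices are random stand-ins for learned weights, and the sizes ($d_\text{model} = 8$, $h = 2$) are shrunk from the original 512/8 for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, h = 5, 8, 2            # sequence length, model dim, number of heads
d_k = d_model // h                 # per-head dimension (d_k = d_v here)

X = rng.standard_normal((n, d_model))   # input sequence

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for i in range(h):
    # three independent projections per head (random stand-ins for learned weights)
    W_Q = rng.standard_normal((d_model, d_k))
    W_K = rng.standard_normal((d_model, d_k))
    W_V = rng.standard_normal((d_model, d_k))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # each (n, d_k)
    A = softmax(Q @ K.T / np.sqrt(d_k))       # (n, n): this head's attention pattern
    heads.append(A @ V)                       # (n, d_k)

concat = np.concatenate(heads, axis=-1)       # (n, h*d_k) = (n, d_model)
W_O = rng.standard_normal((h * d_k, d_model)) # final merge projection
out = concat @ W_O                            # (n, d_model) again
print(out.shape)                              # (5, 8)
```

Each head computes its own $(n, n)$ attention pattern; only the final `W_O` mixes information across heads.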
Why split into $h$ smaller heads instead of running $h$ full-sized heads?
The per-head dimension $d_k = d_\text{model}/h$ is chosen so the total compute and parameter cost of multi-head attention matches that of a single full-sized head. Each head is “cheaper” but there are $h$ of them; the bookkeeping cancels.
This is a deliberate design choice. You could instead run $h$ heads each of full size $d_k = d_\text{model}$ — but that would multiply the parameter and compute cost by $h$. The split keeps the total budget fixed while gaining the specialisation benefit.
TIP — Multi-head as ensemble
Conceptually, multi-head attention is something like running an ensemble of small attention layers and learning how to weight their outputs. Different heads end up looking at different parts of the input — some attend to neighbouring tokens (positional), some to syntactically related tokens (long-range), some to coreferent mentions, some to repeated patterns. The final projection learns how to combine the heads’ findings into a single representation that downstream layers can use.
What each head actually learns
Empirically, when you visualise the attention patterns of trained transformers, different heads exhibit distinct, interpretable behaviours:
- Positional heads — attend to the previous or next token regardless of content (basically a learned shift).
- Syntactic heads — attend to syntactically related tokens (subject of the verb, object of the preposition).
- Coreference heads — link pronouns to their antecedents.
- Rare-token heads — focus on less frequent words that carry more information.
- Diffuse heads — spread attention broadly, contributing average context.
The figure on the slides shows two heads on the same sentence “The Law will never be perfect, but its application should be just” — one attending sharply along a few specific edges (likely a syntactic head), the other diffusely attending across many edges (likely a content-mixing head). Same input, two different views.
This division of labour is emergent — nobody tells head 3 to be the coreference head. The training signal pushes the heads to specialise because diversifying their patterns reduces loss more than redundantly fitting the same pattern.
In code (compact form)
The full multi-head attention layer fits on a few lines if you batch the heads as an extra tensor dimension:
- Project $X$ into $Q$, $K$, $V$, each of shape $(n, d_\text{model})$.
- Reshape to $(h, n, d_k)$ — heads become a batch dimension.
- Run scaled dot-product attention in parallel across the head axis.
- Concatenate heads (reshape back to $(n, h \cdot d_v)$).
- Project through $W^O$ to $(n, d_\text{model})$.
The “multi-head” label refers to logical specialisation; computationally it’s a single batched matrix operation, which is why GPUs handle it efficiently.
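A minimal NumPy sketch of this batched form, under the same toy assumptions as before (random matrices standing in for learned weights, $d_\text{model} = 8$, $h = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_k = d_model // h

X = rng.standard_normal((n, d_model))
# One big projection per role; its columns hold all h heads side by side.
W_Q = rng.standard_normal((d_model, d_model))
W_K = rng.standard_normal((d_model, d_model))
W_V = rng.standard_normal((d_model, d_model))
W_O = rng.standard_normal((d_model, d_model))

def split_heads(M):
    # (n, d_model) -> (h, n, d_k): heads become a leading batch axis
    return M.reshape(n, h, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

# Scaled dot-product attention for all heads at once: (h, n, n)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)          # stability
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)

heads = A @ V                                          # (h, n, d_k)
concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # undo the split
out = concat @ W_O                                     # (n, d_model)
```

No per-head loop remains: the head axis rides along as a batch dimension through two batched matmuls, which is exactly why the operation maps well onto GPUs.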
Counting parameters
A multi-head attention layer has four sets of learned matrices:
- $h$ copies of $W_i^Q$, each $d_\text{model} \times d_k$ → total $h \cdot d_\text{model} \cdot d_k = d_\text{model}^2$ (since $d_k = d_\text{model}/h$).
- Similarly for $W_i^K$ and $W_i^V$ → $d_\text{model}^2$ each.
- One $W^O$ of shape $h d_v \times d_\text{model}$ → $d_\text{model}^2$.
Total: $4 d_\text{model}^2$ parameters per layer, regardless of $h$. The number of heads is a design choice that trades width-per-head against number-of-heads without changing the total budget.
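The count is easy to verify numerically with the original transformer's sizes ($d_\text{model} = 512$, $h = 8$):

```python
# Parameter count of one multi-head attention layer:
# h heads, each with three (d_model x d_k) projections, plus one
# (h*d_v x d_model) output matrix W_O.
d_model, h = 512, 8
d_k = d_v = d_model // h              # 64

per_head = 3 * d_model * d_k          # W_Q, W_K, W_V for one head
total = h * per_head + (h * d_v) * d_model   # all heads + W_O

assert total == 4 * d_model ** 2      # same total for any h that divides d_model
print(total)                          # 1048576
```

Changing $h$ reshuffles the same $4 d_\text{model}^2$ parameters between more, narrower heads and fewer, wider ones.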
Related
- self-attention — the underlying single-head operation
- cross-attention — multi-head attention with $Q$ from one sequence and $K$, $V$ from another
- transformer — the architecture that stacks multi-head attention with feed-forward and normalisation
- softmax — used inside each head’s attention
- dropout — typically applied to the attention weights and after the projection
Active Recall
Multi-head attention runs $h$ parallel attention heads each with its own $W_i^Q, W_i^K, W_i^V$. What is the total parameter cost compared to a single full-sized head, and why?
Each head has projections of size $d_\text{model} \times d_k$ where $d_k = d_\text{model}/h$. Across all $h$ heads the projections sum to $h \cdot d_\text{model} \cdot d_k = d_\text{model}^2$ — exactly the same as one full-sized head with $d_k = d_\text{model}$. Plus one final $W^O$ of $d_\text{model}^2$. The total $4 d_\text{model}^2$ parameters is independent of $h$. Multi-head doesn’t cost more parameters; it just splits the same budget across specialists.
Why is multi-head attention more expressive than single-head attention with the same parameter count?
A single head computes one attention distribution per query — it can only attend strongly along one relational axis at a time (syntactic or positional or coreferential, etc.). Multiple heads in parallel let each one specialise in a different pattern; the final projection learns to combine them. Same parameter budget, more relational structure captured per layer.
After running $h$ attention heads in parallel and producing $h$ output sequences each of dim $d_v$, what two operations turn them back into a single sequence of dim $d_\text{model}$?
First, concatenate the $h$ outputs along the feature axis to get a sequence of dim $h \cdot d_v = d_\text{model}$. Second, project through a learned matrix $W^O$ to mix the heads and return to the original model dimension. The concatenation preserves all heads’ information; $W^O$ learns how to combine them.
What kinds of patterns do trained transformer heads typically specialise into, and what enforces this specialisation?
Empirically, different heads end up doing different things — one attends to the previous token (positional), one tracks coreference (e.g. pronouns to antecedents), one captures subject-verb agreement, one diffusely averages context, etc. Nothing in the architecture enforces this; it emerges from gradient descent because diversifying head behaviours reduces loss more than redundantly fitting the same pattern. The $W^O$ projection learns to combine the diverse heads, completing the division of labour.