One attention layer learns one attention pattern per query — but a sentence has many simultaneous relational structures (subject-verb agreement, coreference, prepositional attachment, …). Multi-head attention runs parallel attention computations, each with independent Q/K/V projections, lets each head specialise, then merges the results.

Why one head isn’t enough

A single self-attention layer with one set of projections $W^Q, W^K, W^V$ produces one attention pattern per query. Whatever subspace the projections happen to land in determines what kind of similarity gets measured.

But natural sentences contain many simultaneous relational structures:

  • “The animal didn’t cross the street because it was too tired” — it should attend to animal (coreference).
  • “The black cat that sleepily sat on the mat” — sat should attend to cat (subject-verb agreement) and to sleepily (adverbial modifier).
  • Long-range syntactic dependencies, local positional ones, semantic affinities — all happening at once.

A single head has to compromise — its weights settle on the most loss-reducing average of all these patterns. Multi-head attention sidesteps the compromise by running $h$ attention computations in parallel, each with its own independent projections. Each head ends up specialising in a different relational structure.

The mechanism

Pick the number of heads $h$ (typically $h = 8$ in the original transformer). For each head $i = 1, \dots, h$, instantiate three independent learnable projection matrices:

$$W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$$

Each maps from the model dimension $d_{\text{model}}$ to a smaller per-head dimension — typically $d_k = d_v = d_{\text{model}} / h$. So in the original transformer with $d_{\text{model}} = 512$ and $h = 8$, each head has $d_k = d_v = 64$.

For each head, run scaled dot-product attention independently:

$$\text{head}_i = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

where $Q_i = X W_i^Q$, $K_i = X W_i^K$, $V_i = X W_i^V$.
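
As a concrete illustration, here is a minimal NumPy sketch of what one head computes (the names `head_i`, `W_q_i`, etc. are illustrative placeholders, not from any particular library):

```python
import numpy as np

def head_i(X, W_q_i, W_k_i, W_v_i):
    """One head: X is (n, d_model); the per-head projections are (d_model, d_k)."""
    Q_i, K_i, V_i = X @ W_q_i, X @ W_k_i, X @ W_v_i   # each (n, d_k)
    d_k = Q_i.shape[-1]
    scores = Q_i @ K_i.T / np.sqrt(d_k)               # (n, n) scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)      # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V_i                              # (n, d_k) output of this head
```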

Each head produces an output of shape $n \times d_v$. Concatenate all heads’ outputs along the feature axis:

$$\text{Concat}(\text{head}_1, \dots, \text{head}_h) \in \mathbb{R}^{n \times h d_v}$$

Then project through one final learnable matrix $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ to merge them into one output sequence of the original dimension:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$

The output is again $n \times d_{\text{model}}$, ready to feed into the next layer.

Why split into $h$ smaller heads instead of running $h$ full-sized heads?

The per-head dimension $d_k = d_v = d_{\text{model}} / h$ is chosen so the total compute and parameter cost of multi-head attention matches that of a single full-sized head. Each head is “cheaper” but there are $h$ of them; the bookkeeping cancels.

This is a deliberate design choice. You could instead run $h$ heads each of full size $d_{\text{model}}$ — but that would multiply the parameter and compute cost by $h$. The split keeps the total budget fixed while gaining the specialisation benefit.
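
To make the bookkeeping concrete, with the original transformer’s sizes ($d_{\text{model}} = 512$, $h = 8$, $d_k = d_v = 64$) the projection parameters of the $h$ split heads add up to exactly the cost of one full-sized head:

$$h \cdot 3 \cdot d_{\text{model}} \cdot d_k = 8 \cdot 3 \cdot 512 \cdot 64 = 3 \cdot 512^2 = 3 \cdot d_{\text{model}}^2,$$

whereas $h$ full-sized heads would cost $h \cdot 3 \cdot d_{\text{model}}^2$, eight times as much.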

TIP — Multi-head as ensemble

Conceptually, multi-head attention is something like running an ensemble of small attention layers and learning how to weight their outputs. Different heads end up looking at different parts of the input — some attend to neighbouring tokens (positional), some to syntactically related tokens (long-range), some to coreferent mentions, some to repeated patterns. The final projection learns how to combine the heads’ findings into a single representation that downstream layers can use.

What each head actually learns

Empirically, when you visualise the attention patterns of trained transformers, different heads exhibit distinct, interpretable behaviours:

  • Positional heads — attend to the previous or next token regardless of content (basically a learned shift).
  • Syntactic heads — attend to syntactically related tokens (subject of the verb, object of the preposition).
  • Coreference heads — link pronouns to their antecedents.
  • Rare-token heads — focus on less frequent words that carry more information.
  • Diffuse heads — spread attention broadly, contributing average context.

The figure on the slides shows two heads on the same sentence “The Law will never be perfect, but its application should be just” — one attending sharply along a few specific edges (likely a syntactic head), the other diffusely attending across many edges (likely a content-mixing head). Same input, two different views.

This division of labour is emergent — nobody tells head 3 to be the coreference head. The training signal pushes the heads to specialise because diversifying their patterns reduces loss more than redundantly fitting the same pattern.

In code (compact form)

The full multi-head attention layer fits on a few lines if you batch the heads as an extra tensor dimension:

  1. Project $X$ into $Q, K, V$, each of shape $(n, d_{\text{model}})$.
  2. Reshape to $(h, n, d_k)$ — heads become a batch dimension.
  3. Run scaled dot-product attention in parallel across the head axis.
  4. Concatenate heads (reshape back to $(n, h \cdot d_v) = (n, d_{\text{model}})$).
  5. Project through $W^O$ to $(n, d_{\text{model}})$.
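
Here is a minimal NumPy sketch of these five steps. The names are illustrative; each stacked matrix `Wq`, `Wk`, `Wv` is assumed to hold all heads’ per-head projections side by side, so splitting it along the feature axis recovers the individual heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)            # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model); Wq, Wk, Wv, Wo: (d_model, d_model); h: number of heads."""
    n, d_model = X.shape
    d_k = d_model // h

    # 1. Project into Q, K, V, each of shape (n, d_model).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # 2. Reshape to (h, n, d_k): heads become a batch dimension.
    def split(M):
        return M.reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)

    # 3. Scaled dot-product attention in parallel across the head axis.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, n)
    heads = softmax(scores) @ V                         # (h, n, d_k)

    # 4. Concatenate heads: reshape back to (n, h * d_k) = (n, d_model).
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)

    # 5. Project through Wo to (n, d_model).
    return concat @ Wo

# Example with the original transformer's sizes: 10 tokens, d_model=512, h=8 heads.
rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)   # (10, 512)
```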

The “multi-head” label refers to logical specialisation; computationally it’s a single batched matrix operation, which is why GPUs handle it efficiently.

Counting parameters

A multi-head attention layer has four sets of learned matrices:

  • $h$ copies of $W_i^Q$, each $d_{\text{model}} \times d_k$ → total $d_{\text{model}} \times d_{\text{model}}$ parameters (since $h \cdot d_k = d_{\text{model}}$).
  • Similarly $d_{\text{model}}^2$ parameters for the $W_i^K$ and the $W_i^V$ each.
  • One $W^O$ of shape $h d_v \times d_{\text{model}} = d_{\text{model}} \times d_{\text{model}}$.

Total: $4 d_{\text{model}}^2$ parameters per layer, regardless of $h$. The number of heads is a design choice that trades width-per-head against number-of-heads without changing the total budget.
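
With the original transformer’s $d_{\text{model}} = 512$, that is

$$4 \, d_{\text{model}}^2 = 4 \times 512^2 = 1{,}048{,}576 \approx 1\text{M}$$

parameters per attention layer, whether you use 1 head or 8.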

Related concepts

  • self-attention — the underlying single-head operation
  • cross-attention — multi-head attention with from one sequence and from another
  • transformer — the architecture that stacks multi-head attention with feed-forward and normalisation
  • softmax — used inside each head’s attention
  • dropout — typically applied to the attention weights and after the projection

Active Recall