Self-attention has every token query every other token in the same sequence. Cross-attention lets the decoder query a different sequence — the encoder’s latent code. Each decoder position computes “which parts of the input do I need to look at to produce my next output?” — and the answer is read straight out of the encoder.

The one-line difference from self-attention

In self-attention, queries, keys, and values all come from the same input X:

    Q = X W^Q,   K = X W^K,   V = X W^V

In cross-attention, the query comes from one source (the decoder’s hidden states X_dec) and the keys/values come from another (the encoder’s latent code Z):

    Q = X_dec W^Q,   K = Z W^K,   V = Z W^V

After that, the formula is identical to self-attention’s:

    Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

Same softmax, same scaling, same weighted sum of values. The mechanism is unchanged; only the sourcing of Q versus K and V differs. Most introductions to attention compress these together and just say “attention”. When the distinction matters — and in the transformer’s encoder-decoder bridge, it matters a lot — we call it cross-attention to flag the two-sequence setup.
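A minimal numpy sketch of that sourcing difference (the function name, the random weights, and the toy sizes are all illustrative, not taken from any particular implementation):

```python
import numpy as np

def cross_attention(dec_hidden, enc_latent, W_Q, W_K, W_V):
    """Single-head cross-attention: queries from the decoder,
    keys and values from the encoder's latent code."""
    Q = dec_hidden @ W_Q                               # (t, d_k)  from the decoder
    K = enc_latent @ W_K                               # (m, d_k)  from the encoder
    V = enc_latent @ W_V                               # (m, d_v)  from the encoder
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (t, m)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)          # softmax over encoder positions
    return weights @ V                                 # (t, d_v)

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))                          # 5 encoded input tokens
dec = rng.normal(size=(3, 8))                          # 3 decoder positions so far
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
print(cross_attention(dec, enc, W_Q, W_K, W_V).shape)  # (3, 8)
```

Passing the same tensor for both dec_hidden and enc_latent would recover plain self-attention; nothing inside the function changes.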

Where it lives in the transformer

The transformer decoder block has three sub-layers in sequence:

  1. Masked self-attention over the decoder’s own running outputs (causal, can’t see the future).
  2. Cross-attention, with Q from the decoder and K, V from the encoder’s latent code.
  3. Feed-forward MLP, applied position-wise.

The cross-attention sub-layer is the only place information flows from the encoder into the decoder. Without it, the decoder is a standalone language model with no idea what the input was. With it, the decoder can ask “given what I’ve generated so far, which input tokens are most relevant to my next output?” — and pull the relevant content directly from the encoder’s representation.
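A compressed numpy sketch of that three-sub-layer ordering (layer norms omitted, weights random; the parameter dictionary and helper names are invented for illustration, not a real API):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # model width (illustrative)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(q_in, kv_in, W_Q, W_K, W_V, causal=False):
    Q, K, V = q_in @ W_Q, kv_in @ W_K, kv_in @ W_V
    scores = Q @ K.T / np.sqrt(d)
    if causal:                               # hide future positions (sub-layer 1 only)
        scores = scores + np.triu(np.full(scores.shape, -1e9), k=1)
    return softmax(scores) @ V

def decoder_block(dec_x, enc_z, p):
    # 1. masked self-attention over the decoder's own running outputs
    dec_x = dec_x + attention(dec_x, dec_x, *p["self"], causal=True)
    # 2. cross-attention: Q from the decoder, K/V from the encoder's latent code
    dec_x = dec_x + attention(dec_x, enc_z, *p["cross"])
    # 3. position-wise feed-forward MLP
    return dec_x + np.maximum(dec_x @ p["ff1"], 0) @ p["ff2"]

params = {
    "self":  [rng.normal(size=(d, d)) for _ in range(3)],
    "cross": [rng.normal(size=(d, d)) for _ in range(3)],
    "ff1":   rng.normal(size=(d, 4 * d)),
    "ff2":   rng.normal(size=(4 * d, d)),
}
enc_z = rng.normal(size=(5, d))              # encoder latent code: 5 input tokens
dec_x = rng.normal(size=(3, d))              # decoder states: 3 output tokens so far
print(decoder_block(dec_x, enc_z, params).shape)   # (3, 8)
```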

Reading the architecture

Schematically, the decoder block at one position looks like:

      latent code from encoder (shared across all decoder positions)
                                  │
                                  ▼
                              [K, V]  ← projected with W^K, W^V from latent code
decoder hidden state ──► [Q] ──► cross-attention ──► output
                          ▲
                          └── projected with W^Q from decoder hidden state

Crucially:

  • The encoder is run once, producing a fixed latent code (one vector per input token). K and V are computed once from this code and then reused at every decoder time step.
  • The decoder is run repeatedly, once per output token. At each step, the latest decoder hidden states produce fresh Qs, but the encoder’s K and V stay constant.

This asymmetry is what the slides label “Encoder: executed once / Decoder: executed repeatedly”. Caching K and V from the encoder is a major efficiency win — you don’t redo encoder work per decoder step.
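A sketch of that caching pattern under the same toy assumptions (random stand-in activations; in a real model the decoder state would come from the earlier sub-layers):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Encoder: executed once
enc_latent = rng.normal(size=(5, d))          # stand-in for the encoder's output
K = enc_latent @ W_K                          # computed once ...
V = enc_latent @ W_V                          # ... and cached for every decoder step

# Decoder: executed repeatedly, one output token per step
for step in range(4):
    dec_state = rng.normal(size=(1, d))       # latest decoder hidden state (stand-in)
    q = dec_state @ W_Q                       # fresh query each step
    weights = softmax(q @ K.T / np.sqrt(d))   # (1, 5): attention over encoder tokens
    context = weights @ V                     # pulled from the cached encoder values
    print(step, context.shape)                # (1, 8); K and V are never recomputed
```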

What cross-attention buys

For each output position the decoder is generating, cross-attention computes:

Which input positions are relevant to me right now, and what content should I pull from them?

In machine translation (“Je suis étudiant” → “I am a student”):

  • When the decoder is generating “I”, cross-attention attends most strongly to the encoded “Je”.
  • When generating “am”, cross-attention attends most strongly to “suis”.
  • When generating “student”, cross-attention attends most strongly to “étudiant”.

The attention pattern is learned, not hardcoded — and not strictly aligned word-by-word in general. For language pairs with very different word order (e.g. English-Japanese), cross-attention learns the correct, often non-monotonic alignment automatically.

Cross-attention beyond translation

The same machinery shows up wherever a network needs to condition one sequence’s processing on another:

  • Encoder-decoder transformers (Vaswani et al. 2017) — translation, summarisation, T5, Whisper. The decoder cross-attends to the encoded source.
  • Latent diffusion models (Stable Diffusion) — the U-Net’s image-side features cross-attend to a text encoder’s output, conditioning image generation on the text prompt. See latent-diffusion-model.
  • Visual question answering, image captioning — the language decoder cross-attends to an image encoder’s feature map.
  • Multimodal models (CLIP-conditioned generators, audio-visual transformers) — one modality queries another via cross-attention.

Anywhere a modern generative architecture is described as “conditioned on” some other input, cross-attention is usually how the conditioning is implemented.

Comparison: self-attention vs. cross-attention

|  | Self-attention | Cross-attention |
| --- | --- | --- |
| Query source | Same as keys/values (one input X) | Different from keys/values |
| Key/value source | Same as query | “Memory” — typically encoder output |
| Mechanism | Scaled dot-product attention | Scaled dot-product attention (identical formula) |
| Output sequence length | Equals input length | Equals query length (decoder length) |
| Where it appears | Encoder layers; decoder’s first sub-layer (masked) | Decoder’s second sub-layer |
| Role | Mix info within one sequence | Pull info from another sequence |

Both are “attention” in the generic sense. The label “cross” emphasises that two distinct sequences are involved.

A note on shapes

If the encoder produced a latent code with m tokens and the decoder is currently at position t (so it has t queries so far), cross-attention’s matrices have shapes:

  • Q from decoder: t × d_k
  • K from encoder: m × d_k
  • V from encoder: m × d_v
  • Attention scores Q Kᵀ: t × m — every decoder query against every encoder key.
  • Output: t × d_v — one output per decoder query.

So the output sequence length matches the decoder, not the encoder. The encoder’s content is mixed in via the values, but the number of cross-attention outputs is set by how many queries the decoder has.
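The same bookkeeping written out as shape assertions (sizes are arbitrary placeholders):

```python
import numpy as np

m, t, d_k, d_v = 7, 3, 16, 16           # m encoder tokens, t decoder queries (toy sizes)
rng = np.random.default_rng(2)

Q = rng.normal(size=(t, d_k))           # from the decoder
K = rng.normal(size=(m, d_k))           # from the encoder
V = rng.normal(size=(m, d_v))           # from the encoder

scores = Q @ K.T / np.sqrt(d_k)         # (t, m): every decoder query vs. every encoder key
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ V                       # (t, d_v): one output per decoder query

assert scores.shape == (t, m) and out.shape == (t, d_v)
```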

Multi-head cross-attention

Like self-attention, cross-attention is run with multiple heads in parallel. Each head has its own W^Q, W^K, W^V — different heads can attend to different aspects of the encoder’s latent code. The architectural mechanics are identical to multi-head-attention; the only twist is that Q comes from one sequence and K, V from another.
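A sketch of the head split in the cross-attention case (sizes arbitrary; the reshape bookkeeping is the same as in multi-head self-attention, only the K/V source changes):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, n_heads = 16, 4
d_head = d_model // n_heads
t_dec, m_enc = 3, 5

dec = rng.normal(size=(t_dec, d_model))       # decoder hidden states
enc = rng.normal(size=(m_enc, d_model))       # encoder latent code
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(x):                           # (tokens, d_model) -> (heads, tokens, d_head)
    return x.reshape(x.shape[0], n_heads, d_head).transpose(1, 0, 2)

Q = split_heads(dec @ W_Q)                    # queries from the decoder
K = split_heads(enc @ W_K)                    # keys from the encoder
V = split_heads(enc @ W_V)                    # values from the encoder

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, t_dec, m_enc)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
heads = weights @ V                                     # (heads, t_dec, d_head)
out = heads.transpose(1, 0, 2).reshape(t_dec, d_model) @ W_O   # concat heads, then project
print(out.shape)                                        # (3, 16)
```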

Active Recall