Self-attention treats its input as a set of tokens, not a sequence. The dot products depend only on the contents of tokens $i$ and $j$, never on their positions. So the transformer must be told positions explicitly: a position-encoding vector is added to each token’s embedding before the first attention layer.

Why this is needed

Take a sequence of tokens “the cat sat on the mat” and feed it through a transformer encoder. Self-attention’s output for token $i$ is

$$y_i = \sum_j \operatorname{softmax}_j\!\left(\frac{x_i^\top x_j}{\sqrt{d}}\right) x_j$$

a function only of dot products between token contents, so the encoder is permutation-equivariant: shuffle the tokens and the outputs are shuffled identically, but as an unordered set they carry exactly the same information. Reorder “the cat sat” into “sat cat the” and the encoder produces the same set of output vectors, just in a different order.
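
To make the equivariance concrete, here is a minimal NumPy sketch of single-head self-attention (no learned projections, which this illustration doesn’t need; all names are illustrative): permuting the input rows just permutes the output rows.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with no learned projections:
    each output row is a softmax-weighted sum of the input rows."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # dot products of contents
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                          # 6 tokens, 8-dim embeddings
perm = rng.permutation(6)

# Permuting the input permutes the outputs identically -- same set of vectors.
assert np.allclose(self_attention(X)[perm], self_attention(X[perm]))
```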

This is fine for a bag of words but catastrophic for language. “Dog bites man” and “Man bites dog” are different sentences. The model must know order.

Convolutional layers and RNNs encode order implicitly — convolutions through their fixed receptive field, RNNs through sequential processing. Self-attention does neither. So order has to be injected explicitly into the input.

The fix: add a position vector to every token

Before the first encoder layer, modify each token embedding $x_t$ by adding a vector that depends only on the position $t$:

$$\tilde{x}_t = x_t + p_t$$

where $p_t$ is the positional encoding for position $t$. Now identity and position both contribute to the input. Two tokens with the same word but at different positions get different embeddings; subsequent attention layers can pick up on positional differences via the dot product.

This is added — not concatenated — partly because addition keeps the dimension fixed (no extra width to pay for) and partly because the network can learn to ignore the additive position component in subspaces where it shouldn’t matter.
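
A shape-level sketch of the two options, with illustrative sizes:

```python
import numpy as np

seq_len, d_model = 10, 512                 # illustrative sizes
tok = np.random.randn(seq_len, d_model)    # token embeddings
pos = np.random.randn(seq_len, d_model)    # positional encodings (any scheme)

added = tok + pos                                    # width stays 512
concatenated = np.concatenate([tok, pos], axis=-1)   # width doubles to 1024
print(added.shape, concatenated.shape)               # (10, 512) (10, 1024)
```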

Two common implementations

1. Sinusoidal (Vaswani et al. 2017)

The original transformer uses fixed, non-learnable sinusoids of geometrically spaced frequencies. For position $t$ and feature dimension index $i$:

$$p_{t,2i} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad p_{t,2i+1} = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$

Even-indexed dimensions get sines, odd-indexed dimensions get cosines, with frequencies geometrically spread from $1$ down to $1/10000$. The intuition:

  • High-frequency dimensions alternate quickly across positions → encode fine-grained position differences.
  • Low-frequency dimensions vary slowly → encode coarse positional regions.

Together they form a unique fingerprint per position. Two big advantages:

  • No learned parameters. Pure deterministic encoding.
  • Generalises to longer sequences than seen in training: the formula gives a defined encoding for any position, even ones the model never trained on.
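
A minimal NumPy sketch of the sinusoidal table, following the formula above (function and variable names are illustrative):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """p[t, 2i] = sin(t / 10000^(2i/d)), p[t, 2i+1] = cos(t / 10000^(2i/d))."""
    t = np.arange(max_len)[:, None]                              # (max_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(t * freqs)   # even dimensions: sines
    pe[:, 1::2] = np.cos(t * freqs)   # odd dimensions: cosines
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=128)
longer = sinusoidal_encoding(max_len=500, d_model=128)  # any length on demand
```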

2. Learned positional embeddings

The alternative used in BERT, GPT, and most modern LLMs: treat each position as a “vocabulary” item and learn a separate embedding vector per position. So position $0$ has its own $d$-dimensional learned vector, position $1$ has another, and so on up to a maximum sequence length $L_{\max}$.
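
A sketch of the learned variant, using a plain NumPy array as a stand-in for a trainable parameter (in a real model this would be a parameter such as PyTorch’s nn.Embedding, updated by gradient descent; names and sizes are illustrative):

```python
import numpy as np

max_len, d_model = 512, 128          # illustrative sizes
rng = np.random.default_rng(0)

# One learned d_model-dim vector per position, initialised like word embeddings.
# Stand-in for a trainable parameter; real models update it during training.
pos_table = rng.normal(scale=0.02, size=(max_len, d_model))

def add_learned_positions(token_emb):
    T = token_emb.shape[0]
    assert T <= max_len, "no learned vector exists for positions beyond max_len"
    return token_emb + pos_table[:T]  # row t is the vector for position t
```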

|  | Sinusoidal | Learned |
| --- | --- | --- |
| Parameters | 0 | $L_{\max} \times d$ |
| Generalises to longer sequences? | Yes (extrapolation) | No (positions beyond $L_{\max}$ are undefined) |
| Empirical performance | Comparable | Comparable to slightly better in practice |

Modern variants extend this further (relative position encodings, ALiBi, RoPE) but the basic idea — make the input position-aware — is the same.

On the slide

The transformer architecture diagram shows the positional encoding being added to the input embedding immediately before the first attention layer (the “⊕” symbol with the spiral). Once added, attention layers and feed-forward layers see only the combined embedding; they never receive position information separately.

Both encoder and decoder add their own positional encodings — the encoder to the input sequence, the decoder to its output-shifted-right sequence. Without them, the transformer would produce the same output for every permutation of its input.

Why “fixed sinusoids of varied frequency”?

A useful intuition: positional encodings should be distinguishable between positions and capable of carrying relative-distance signals. Sinusoids of varied frequencies do both:

  • Distinguishability: the combined sine/cosine pattern across all dimensions is unique per position, so two positions never collide.
  • Relative position via dot product: $p_t^\top p_{t+k}$ depends only on the offset $k$, not on the absolute position $t$, since for each frequency $\omega$, $\sin(\omega t)\sin(\omega(t+k)) + \cos(\omega t)\cos(\omega(t+k)) = \cos(\omega k)$. This gives attention a way to learn “how far apart are these two tokens?” via dot products of their position encodings, which is exactly what attention needs to track relative order.

The geometrically spaced frequencies give the encoding multi-scale resolution: high-frequency dimensions distinguish neighbouring positions; low-frequency ones distinguish distant ones.
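
A quick numerical check of the offset property (a self-contained sketch; names are illustrative):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    t = np.arange(max_len)[:, None]
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(t * freqs), np.cos(t * freqs)
    return pe

pe = sinusoidal_encoding(200, 128)

# For each frequency w: sin(wt)sin(w(t+k)) + cos(wt)cos(w(t+k)) = cos(wk),
# so the dot product p_t . p_{t+k} depends on the offset k alone.
k = 7
print(pe[10] @ pe[10 + k])     # same value...
print(pe[100] @ pe[100 + k])   # ...at any absolute position
```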

Without positional encoding: what breaks

A simple thought experiment: feed the encoder a permuted version of its input. Without positional encoding, every output position is a function only of a set of token contents — the per-position outputs would be the same set, just permuted. The decoder cross-attending to such a set would have no way to distinguish word order.

Concretely:

  • “Dog bites man” → outputs in some order encoding the set {dog, bites, man}.
  • “Man bites dog” → identical outputs in a different order, encoding the same set of facts.

Adding positional encoding breaks this symmetry. After the addition, the embedding for “dog” at position 0 differs from “dog” at position 2 — and downstream attention learns to use that difference.

  • self-attention — the layer that needs positional encoding to function on ordered sequences
  • transformer — uses positional encoding immediately after token embedding
  • word-embedding — what the positional encoding is added to
  • recurrent-neural-network — encodes order implicitly via sequential processing, which is why RNNs don’t need positional encoding
