BPE learns a vocabulary of subword units by iteratively merging the most frequent adjacent pair in a corpus; it is the token-learning algorithm behind GPT, RoBERTa, and many other large models.

Definition

Byte Pair Encoding (BPE) (Sennrich et al., 2016, adapted from a data-compression algorithm) is a token learner that builds a vocabulary of subword symbols by starting from individual characters and repeatedly merging the most frequent adjacent pair. The result is a vocabulary that represents frequent words as single units and rare or unseen words as sequences of subword pieces.

Algorithm

Token learner

Input: a training corpus and a target vocabulary size (equivalently, a number of merges k — a hyperparameter; typically 8K–32K, larger for modern LLMs).

Initialization: split every word into characters, appending a special end-of-word symbol _ to each word. Start with a vocabulary of all characters plus _.

Repeat k times (one merge per iteration):

  1. Count all adjacent symbol pairs across the corpus.
  2. Find the most frequent pair (e.g., e + r).
  3. Merge that pair into a new symbol er.
  4. Replace all occurrences of the pair with the new symbol.
  5. Add the new symbol to the vocabulary.

Each iteration shortens the corpus (every merged occurrence turns two symbols into one) and adds exactly one entry to the vocabulary.
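A minimal Python sketch of the learner under these assumptions: the corpus is already split into words and stored as a dict from space-separated symbol strings to frequencies, and the helper names get_stats, merge_pair, and learn_bpe are illustrative rather than from any particular library.

from collections import Counter

def get_stats(vocab):
    """Count every adjacent symbol pair across the corpus.

    vocab maps a word (a space-separated symbol string ending in '_')
    to its frequency, e.g. {'l o w _': 5, ...}.
    """
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of the pair with one merged symbol."""
    a, b = pair
    merged = a + b
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(merged)      # two symbols become one
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[' '.join(out)] = freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    """Token learner: return the ordered list of merges and the final corpus."""
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair (ties broken arbitrarily)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab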

Worked example — token learner

Corpus (4 word types, split into characters; _ = end-of-word):

l o w _          ×5    (low)
l o w e r _      ×2    (lower)
n e w e s t _    ×6    (newest)
w i d e s t _    ×3    (widest)

Initial vocabulary: {l, o, w, e, r, n, s, t, i, d, _}

Initial pair counts (top entries):

Pair   Count   Source
e+s    9       newest (×6), widest (×3)
s+t    9       newest (×6), widest (×3)
t+_    9       newest (×6), widest (×3)
w+e    8       lower (×2), newest (×6)
l+o    7       low (×5), lower (×2)
o+w    7       low (×5), lower (×2)
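As a quick sanity check, these counts can be reproduced with a few lines of self-contained Python over the toy corpus (variable names are illustrative):

from collections import Counter

# Toy corpus: word (split into symbols, '_' = end of word) -> frequency
corpus = {
    'l o w _': 5,
    'l o w e r _': 2,
    'n e w e s t _': 6,
    'w i d e s t _': 3,
}

pairs = Counter()
for word, freq in corpus.items():
    symbols = word.split()
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

for (a, b), count in pairs.most_common(6):
    print(f'{a}+{b}: {count}')
# Prints the top entries of the table above:
# e+s 9, s+t 9, t+_ 9, w+e 8, l+o 7, o+w 7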

Merge 1: e+s → es — replace every adjacent e s in the corpus:

l o w _          ×5    (unchanged — no e s)
l o w e r _      ×2    (unchanged — no e s)
n e w es t _     ×6    ← merged
w i d es t _     ×3    ← merged

Recount. New top pairs: es+t = 9, t+_ = 9.

Merge 2: es+t → est

n e w est _      ×6
w i d est _      ×3

Top pair: est+_ = 9.

Merge 3: est+_ → est_

l o w _          ×5
l o w e r _      ×2
n e w est_       ×6
w i d est_       ×3

Top pairs: l+o = 7, o+w = 7.

Merge 4: l+o → lo, then Merge 5: lo+w → low:

low _            ×5
low e r _        ×2
n e w est_       ×6
w i d est_       ×3

Merges 6–8: n+e → ne, ne+w → new, new+est_ → newest_

Merge   Pair        Count   New symbol
1       e+s         9       es
2       es+t        9       est
3       est+_       9       est_
4       l+o         7       lo
5       lo+w        7       low
6       n+e         6       ne
7       ne+w        6       new
8       new+est_    6       newest_
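This table can be reproduced by running the learner for eight iterations. The usage sketch below assumes the learn_bpe helper defined after the algorithm description above; note that where counts tie (e.g., e+s, s+t, t+_ all at 9) the choice of pair is implementation-dependent, and the Counter-based sketch happens to pick the same order as the table on this corpus.

corpus = {
    'l o w _': 5,
    'l o w e r _': 2,
    'n e w e s t _': 6,
    'w i d e s t _': 3,
}

merges, final = learn_bpe(corpus, num_merges=8)
for i, (a, b) in enumerate(merges, 1):
    print(f'Merge {i}: {a}+{b} -> {a + b}')
# Merge 1: e+s -> es ... Merge 8: new+est_ -> newest_

print(final)
# {'low _': 5, 'low e r _': 2, 'newest_': 6, 'w i d est_': 3}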

After enough merges, frequent words like low and newest become single tokens; rare words are represented by sequences of frequent subword pieces.

Worked example — token segmenter

New word: lowest (never seen in training). Split into characters: l o w e s t _

Apply every learned merge in the order they were learned — skip any merge that has no matching pair:

Step      Merge checked        Match?   Sequence after
Start                                   l o w e s t _
Merge 1   e+s → es             yes      l o w es t _
Merge 2   es+t → est           yes      l o w est _
Merge 3   est+_ → est_         yes      l o w est_
Merge 4   l+o → lo             yes      lo w est_
Merge 5   lo+w → low           yes      low est_
Merge 6   n+e → ne             no       low est_
Merge 7   ne+w → new           no       low est_
Merge 8   new+est_ → newest_   no       low est_

Result: ["low", "est_"] — a word never seen in training decomposes into two meaningful pieces.

End-of-word marker: _ vs </w>

The lecture notes use _ as the end-of-word marker. In code (including the original Sennrich et al. implementation and most Python tutorials) you will see </w> instead:

vocab = {
    'l o w </w>'     : 5,
    'l o w e r </w>' : 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

The choice of symbol does not affect the algorithm — </w> is just less likely to collide with actual characters in the corpus. The conceptual role is identical: mark the word boundary so that merged tokens do not span across word boundaries.

Token segmenter

At inference time, given a new word, apply the learned merges greedily in training order:

  1. Split the word into characters (+ _).
  2. Apply the first learned merge wherever it applies.
  3. Apply the second learned merge wherever it applies.
  4. Continue through all merges in order.

This is deterministic and runs in O(n × m) time, where n is the word length (in symbols) and m is the number of learned merges.

Crucial property: a word never seen in training will be decomposed into its constituent subword pieces. In the worst case, it decomposes to individual characters. The open-vocabulary problem is eliminated by construction.

TIP — Choosing vocabulary size

The target vocabulary size trades off two things: a large vocabulary means more words are single tokens (faster inference, more embedding parameters); a small vocabulary means more subword splitting (slower inference, handles rare words better). Common values: 32K–50K for English-only models, larger for multilingual models where the character inventory is much bigger.

Relationship to Other Subword Algorithms

BPE is the most widely used but not the only option. See subword-tokenization for a comparison of BPE, WordPiece, and Unigram LM.

  • subword-tokenization — the broader family of data-driven subword approaches
  • type-and-token — BPE directly addresses the open-vocabulary problem diagnosed by Heaps’ Law
  • tokenization — BPE replaces classical word tokenization in modern systems

Active Recall