Perplexity is the standard intrinsic metric for language models: the inverse probability of the test set, normalized per word, where lower means better.
Definition
The perplexity of a language model on a test set $W = w_1 w_2 \ldots w_N$ is:

$$\text{PP}(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$

Expanding with the chain rule:

$$\text{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$

For a bigram model:

$$\text{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
Range: probability is in $[0, 1]$; perplexity is in $[1, \infty)$. Lower perplexity = better model. Minimizing perplexity is the same as maximizing probability.
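The definition translates directly to code. Here is a minimal sketch in Python (the `perplexity` function and its input are illustrative names, not from any library): multiply the model's per-word probabilities and take the $-1/N$ power.

```python
import math

def perplexity(word_probs):
    """PP(W) = (p_1 * p_2 * ... * p_N) ** (-1/N).

    word_probs: the model's probability for each word in the test set.
    """
    n = len(word_probs)
    return math.prod(word_probs) ** (-1 / n)

# A model that assigns probability 0.5 to every word is as uncertain
# as a coin flip at each step -- perplexity 2:
print(perplexity([0.5, 0.5, 0.5, 0.5]))  # 2.0
```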
Intuition
The Shannon game
The clearest way to understand perplexity is through a guessing game. Imagine someone is about to say the next word, and you must keep guessing until you name it. How many guesses do you need on average?
- Very predictable text (“Happy ___ to you”) → you guess “birthday” on the first try.
- Very unpredictable text → you might need dozens of attempts.
The average number of guesses required is a direct measure of how surprised the model is by the text. Perplexity is exactly that number. A perplexity of 100 means the model is as confused at each step as if it were picking uniformly from 100 equally-likely words. A perplexity of 1 means the model always predicts the correct next word with certainty — no guessing needed.
Shannon (1951) ran a real version of this experiment: he had human subjects guess the next letter of English text, counted how many attempts they needed, and used this to estimate the entropy of English. The connection is direct: $\text{PP} = 2^{H}$, where $H$ is the entropy (in bits) of the model's distribution. Higher entropy = more uncertainty = more guesses = higher perplexity.
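A quick sanity check of the $\text{PP} = 2^{H}$ relationship, sketched in Python (`entropy_bits` is an illustrative helper): a uniform distribution over 100 outcomes has entropy $\log_2 100 \approx 6.64$ bits, and $2^{6.64} = 100$.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 100] * 100   # 100 equally likely words
h = entropy_bits(uniform)
print(h, 2 ** h)            # ~6.64 bits -> perplexity ~100.0
```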
Branching factor
Perplexity has a second equivalent interpretation: it is the weighted average branching factor of the language as seen by the model — informally, “how many equally likely words could come next at each position.”
Example. A language with three words: red, blue, green.
Model A — uniform: $P(\text{red}) = P(\text{blue}) = P(\text{green}) = \tfrac{1}{3}$.
Test set: “red red red red blue” ($N = 5$).
$\text{PP}(W) = \left( \left( \tfrac{1}{3} \right)^{5} \right)^{-1/5} = 3$. The model is always choosing among 3 equally likely options — branching factor is 3.
Model B — informed: $P(\text{red}) = 0.8$, $P(\text{blue}) = 0.1$, $P(\text{green}) = 0.1$.
$\text{PP}(W) = \left( 0.8^{4} \times 0.1 \right)^{-1/5} \approx 1.89$. Model B is less surprised by this red-heavy test set — its effective branching factor is only 1.89. It is a better model for this data.
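The same toy example as a minimal Python sketch (the `pp` helper and the model dictionaries are illustrative):

```python
import math

def pp(word_probs):
    """Perplexity from per-word probabilities."""
    return math.prod(word_probs) ** (-1 / len(word_probs))

test = ["red", "red", "red", "red", "blue"]          # N = 5
model_a = {"red": 1/3, "blue": 1/3, "green": 1/3}    # uniform
model_b = {"red": 0.8, "blue": 0.1, "green": 0.1}    # informed

print(pp([model_a[w] for w in test]))  # 3.0   -- branching factor 3
print(pp([model_b[w] for w in test]))  # ~1.89 -- less surprised by red-heavy data
```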
Why Perplexity Instead of Raw Probability
Raw probability shrinks as the test set gets longer — a 1000-word test set will always have lower probability than a 10-word one, regardless of model quality. This makes raw probability useless for comparing models on different-length test sets.
Perplexity fixes this by normalizing per word (taking the $N$th root). It is a per-word metric that allows fair comparison.
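A toy illustration of the length dependence, assuming an artificial constant per-word probability of 0.1: the raw probability collapses as the test set grows, while the per-word perplexity stays put.

```python
# The same model assigns each word probability 0.1 on two test sets:
raw_10  = 0.1 ** 10    # 1e-10
raw_100 = 0.1 ** 100   # 1e-100 -- far smaller, though the model is no worse

pp_10  = raw_10 ** (-1 / 10)     # 10.0
pp_100 = raw_100 ** (-1 / 100)   # 10.0 -- per-word perplexity is length-independent
print(pp_10, pp_100)
```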
COMMON MISCONCEPTION
Comparing perplexity across different test sets is not meaningful. Model A with PP = 120 on WSJ and Model B with PP = 95 on Twitter tells you nothing about which model is better. The test set must be the same for the comparison to be valid.
Perplexity and N-gram Order
More context = better predictions = lower perplexity. WSJ corpus (38M training words, 1.5M test words):
| N-gram Order | Unigram | Bigram | Trigram |
|---|---|---|---|
| Perplexity | 962 | 170 | 109 |
Going from unigram to bigram cuts perplexity by 5.7x. Bigram to trigram gives another 1.6x improvement. Higher-order models capture more context but face worse sparsity — eventually the gains plateau or reverse (see smoothing).
Worked Example: Perplexity on a Number Corpus
Training set: 100 numbers — 91 zeros and 1 each of digits 1 through 9.
Unigram probabilities: $P(0) = \frac{91}{100} = 0.91$, $P(d) = \frac{1}{100} = 0.01$ for $d \in \{1, \ldots, 9\}$.
Test set: 0 0 0 0 0 3 0 0 0 0 ($N = 10$).
Step 1 — Joint probability:
$P(W) = 0.91^{9} \times 0.01$. $0.91^{9} \approx 0.428$, so $P(W) \approx 0.428 \times 0.01 = 0.00428$.
Step 2 — Perplexity:
$\text{PP}(W) = P(W)^{-1/10} = 0.00428^{-1/10}$. $\ln 0.00428 \approx -5.45$, so $-\tfrac{1}{10} \ln 0.00428 \approx 0.545$, so $\text{PP}(W) \approx e^{0.545} \approx 1.73$.
Interpretation: the model’s effective branching factor on this test set is ~1.73. Most of the time the model is very confident (predicting 0 with probability 0.91), but the single “3” forces it to use the low-probability estimate, pulling perplexity above 1.
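The same computation as a short Python sketch (variable names are illustrative):

```python
# Training set: 100 numbers -- 91 zeros, one each of digits 1-9.
counts = {"0": 91, **{str(d): 1 for d in range(1, 10)}}
total = sum(counts.values())                        # 100
probs = {w: c / total for w, c in counts.items()}   # P(0)=0.91, P(d)=0.01

test = ["0"] * 5 + ["3"] + ["0"] * 4                # 0 0 0 0 0 3 0 0 0 0, N = 10
joint = 1.0
for w in test:
    joint *= probs[w]                               # 0.91**9 * 0.01 ~ 0.00428

print(joint ** (-1 / len(test)))                    # ~1.73
```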
TIP — Perplexity in log space
In practice, compute $\log \text{PP}(W) = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1 \ldots w_{i-1})$ and exponentiate at the end. This avoids underflow from multiplying many small probabilities.
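A minimal sketch of the log-space trick (`perplexity_logspace` is an illustrative name), including a case where the naive product underflows to zero but the log-space version survives:

```python
import math

def perplexity_logspace(word_probs):
    """exp(-(1/N) * sum(log p_i)) == (prod p_i) ** (-1/N), without underflow."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

probs = [0.01] * 10_000
print(math.prod(probs))            # 0.0 -- the raw product underflows
print(perplexity_logspace(probs))  # ~100.0 -- the log-space answer is correct
```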
Related
- n-gram-language-models — the model family evaluated by perplexity
- evaluation-methodology — perplexity is the standard intrinsic metric; see that page for extrinsic/intrinsic distinction and train/dev/test splits
- smoothing — zero probabilities make perplexity undefined (division by zero); smoothing fixes this
Active Recall
Given a bigram model where P(the|⟨s⟩) = 0.5, P(cat|the) = 0.3, P(⟨/s⟩|cat) = 0.8, compute the perplexity of the sentence "⟨s⟩ the cat ⟨/s⟩".
$P(W) = 0.5 \times 0.3 \times 0.8 = 0.12$ (three predicted words: the, cat, ⟨/s⟩, so $N = 3$). $\text{PP}(W) = 0.12^{-1/3} \approx 2.03$. The model’s average branching factor on this sentence is about 2 — at each step it is choosing among roughly 2 equally likely options.
Explain the branching factor interpretation of perplexity. If a model achieves perplexity 100 on English text, what does that mean informally?
Perplexity is the weighted average number of equally likely next words at each position. A perplexity of 100 means that, on average, the model is as uncertain as if it were choosing uniformly among 100 words at every position. A perfect model that always predicts the correct next word with probability 1 has perplexity 1 (no uncertainty). A model that treats all words as equally likely has perplexity equal to the vocabulary size.
Why does perplexity normalize by number of words rather than using raw probability to compare language models?
Raw probability decreases as the test set gets longer — a product of more terms, each $\le 1$. A model evaluated on a 10-word test set will always have higher probability than the same model on a 1000-word test set, so raw probability is length-dependent and cannot be compared across test sets. The $N$th root in the perplexity formula normalizes to a per-word scale, making the metric independent of test-set length.
Minimizing perplexity is equivalent to maximizing what? Explain why.
Maximizing the probability of the test set. Since $\text{PP}(W) = P(W)^{-1/N}$, perplexity is a monotonically decreasing function of probability (for fixed $N$). Higher probability → lower perplexity. The model that assigns the highest probability to the actual test data is the model with the lowest perplexity.
Model A has perplexity 120 on WSJ text. Model B has perplexity 95 on Twitter text. Can you conclude B is a better language model? Why or why not?
No. Perplexity depends on the test set — Twitter text and WSJ text have different vocabularies, sentence structures, and entropy. A lower perplexity on an easier test set does not mean a better model. To compare A and B, you must evaluate both on the same test set. The WSJ benchmark (unigram 962, bigram 170, trigram 109) is only meaningful because all models are tested on the same 1.5M-word WSJ test set.