Perplexity is the standard intrinsic metric for language models: the inverse probability of the test set, normalized per word, where lower means better.

Definition

The perplexity of a language model on a test set $W = w_1 w_2 \ldots w_N$ is:

$$\text{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$

Expanding with the chain rule:

$$\text{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$

For a bigram model:

$$\text{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$

Range: probability is in $[0, 1]$; perplexity is in $[1, \infty)$. Lower perplexity = better model. Minimizing perplexity is the same as maximizing probability.
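The definition translates directly into code; a minimal sketch, assuming the model supplies one probability per test word:

```python
import math

def perplexity(word_probs):
    """PP = (p_1 * p_2 * ... * p_N) ** (-1/N): the inverse probability
    of the sequence, normalized per word by the Nth root."""
    n = len(word_probs)
    return math.prod(word_probs) ** (-1 / n)

# A model that always assigns probability 1 has the minimum perplexity, 1.
print(perplexity([1.0, 1.0, 1.0]))   # 1.0
# Uniform 1/100 probabilities give perplexity ~100, regardless of length.
print(perplexity([0.01] * 5))
```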

Intuition

The Shannon game

The clearest way to understand perplexity is through a guessing game. Imagine someone is about to say the next word, and you must keep guessing until you name it. How many guesses do you need on average?

  • Very predictable text (“Happy ___ to you”) → you guess “birthday” on the first try.
  • Very unpredictable text → you might need dozens of attempts.

The average number of guesses required is a direct measure of how surprised the model is by the text, and perplexity packages the same idea as an effective vocabulary size. A perplexity of 100 means the model is, on average, as confused at each step as if it were picking uniformly from 100 equally-likely words. A perplexity of 1 means the model always predicts the correct next word with certainty — no guessing needed.

Shannon (1951) ran a real version of this experiment: he had human subjects guess the next letter of English text, counted how many attempts they needed, and used this to estimate the entropy of English. The connection is direct: $\text{PP}(W) = 2^{H(W)}$, where $H(W)$ is the entropy (in bits) of the model’s distribution. Higher entropy = more uncertainty = more guesses = higher perplexity.
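The entropy–perplexity link can be checked numerically for a single next-word distribution; a small sketch (the distributions are illustrative, not from the text):

```python
import math

def entropy_bits(dist):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Perplexity of a distribution is 2**H: the number of equally likely
# outcomes that would produce the same uncertainty.
uniform = [1 / 8] * 8            # 8 equally likely words -> H = 3 bits
print(2 ** entropy_bits(uniform))    # 8.0
skewed = [0.97] + [0.01] * 3     # nearly deterministic -> H close to 0
print(2 ** entropy_bits(skewed))     # a little above 1
```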

Branching factor

Perplexity has a second equivalent interpretation: it is the weighted average branching factor of the language as seen by the model — informally, “how many equally likely words could come next at each position.”

Example. A language with three words: red, blue, green.

Model A — uniform: $P(\text{red}) = P(\text{blue}) = P(\text{green}) = \frac{1}{3}$.

Test set: “red red red red blue” ($N = 5$).

$$\text{PP}(W) = \left(\left(\tfrac{1}{3}\right)^5\right)^{-\frac{1}{5}} = 3$$

The model is always choosing among 3 equally likely options — branching factor is 3.

Model B — informed: $P(\text{red}) = 0.8$, $P(\text{blue}) = 0.1$, $P(\text{green}) = 0.1$.

$$\text{PP}(W) = (0.8^4 \times 0.1)^{-\frac{1}{5}} \approx 1.89$$

Model B is less surprised by this red-heavy test set — its effective branching factor is only 1.89. It is a better model for this data.
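Both models can be checked numerically. Model B’s probabilities (0.8 for red, 0.1 each for blue and green) are an assumption here, chosen to be consistent with the 1.89 figure quoted above:

```python
import math

def perplexity(probs):
    # PP = (product of per-word probabilities) ** (-1/N)
    return math.prod(probs) ** (-1 / len(probs))

test_set = ["red", "red", "red", "red", "blue"]

model_a = {"red": 1/3, "blue": 1/3, "green": 1/3}   # uniform
model_b = {"red": 0.8, "blue": 0.1, "green": 0.1}   # assumed values

pp_a = perplexity([model_a[w] for w in test_set])
pp_b = perplexity([model_b[w] for w in test_set])
print(round(pp_a, 2), round(pp_b, 2))   # 3.0 1.89
```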

Why Perplexity Instead of Raw Probability

Raw probability shrinks as the test set gets longer — a 1000-word test set will always have lower probability than a 10-word one, regardless of model quality. This makes raw probability useless for comparing models on different-length test sets.

Perplexity fixes this by normalizing per word (the $N$th root). It is a per-word metric that allows fair comparison.
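The effect of the normalization is easy to see with a hypothetical model that assigns every word probability 0.5: the raw joint probability collapses as the test set grows, while per-word perplexity stays put.

```python
# Raw probability shrinks with test-set length even when per-word
# prediction quality is constant; the Nth-root normalization removes
# the length dependence.
for n in (10, 100, 1000):
    raw = 0.5 ** n                # joint probability of n words
    pp = raw ** (-1 / n)          # per-word perplexity
    print(n, raw, pp)             # raw shrinks toward 0; pp stays ~2.0
```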

COMMON MISCONCEPTION

Comparing perplexity across different test sets is not meaningful. Model A with PP = 120 on WSJ and Model B with PP = 95 on Twitter tells you nothing about which model is better. The test set must be the same for the comparison to be valid.

Perplexity and N-gram Order

More context = better predictions = lower perplexity. WSJ corpus (38M training words, 1.5M test words):

| N-gram Order | Unigram | Bigram | Trigram |
|--------------|---------|--------|---------|
| Perplexity   | 962     | 170    | 109     |

Going from unigram to bigram cuts perplexity by 5.7x. Bigram to trigram gives another 1.6x improvement. Higher-order models capture more context but face worse sparsity — eventually the gains plateau or reverse (see smoothing).

Worked Example: Perplexity on a Number Corpus

Training set: 100 numbers — 91 zeros and 1 each of digits 1 through 9.

Unigram probabilities: $P(0) = \frac{91}{100} = 0.91$, $P(d) = \frac{1}{100} = 0.01$ for $d \in \{1, \ldots, 9\}$.

Test set: 0 0 0 0 0 3 0 0 0 0 ($N = 10$).

Step 1 — Joint probability:

$$P(\text{test}) = 0.91^9 \times 0.01$$

$0.91^9 \approx 0.428$, so $P(\text{test}) \approx 0.00428$.

Step 2 — Perplexity:

$$\text{PP} = P(\text{test})^{-\frac{1}{10}}$$

$\ln P(\text{test}) \approx -5.45$, so $\ln \text{PP} \approx 0.545$, so $\text{PP} \approx e^{0.545} \approx 1.73$.

Interpretation: the model’s effective branching factor on this test set is ~1.73. Most of the time the model is very confident (predicting 0 with probability 0.91), but the single “3” forces it to use the low-probability estimate, pulling perplexity above 1.
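The two steps above can be reproduced directly from the training counts given in the text:

```python
import math

# Unigram model from the training counts: 91 zeros and one each of 1..9
# out of 100 numbers.
p = {0: 91 / 100, **{d: 1 / 100 for d in range(1, 10)}}

test_set = [0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
joint = math.prod(p[x] for x in test_set)   # 0.91**9 * 0.01
pp = joint ** (-1 / len(test_set))
print(round(joint, 5), round(pp, 2))        # 0.00428 1.73
```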

TIP — Perplexity in log space

In practice, compute $-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_1 \ldots w_{i-1})$ and exponentiate at the end: $\text{PP}(W) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_1 \ldots w_{i-1})\right)$. This avoids underflow from multiplying small probabilities.
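A minimal sketch of the log-space computation, with a case where the naive product actually underflows:

```python
import math

def perplexity_logspace(word_probs):
    """PP = exp(-(1/N) * sum(ln p_i)): sum log probabilities and
    exponentiate once at the end, instead of multiplying probabilities."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Multiplying 5000 probabilities of 0.001 underflows to exactly 0.0 in
# double precision, which would break the naive Nth-root formula; the
# log-space version is stable.
probs = [0.001] * 5000
print(math.prod(probs))             # 0.0  (underflow)
print(perplexity_logspace(probs))   # ~1000.0
```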

  • n-gram-language-models — the model family evaluated by perplexity
  • evaluation-methodology — perplexity is the standard intrinsic metric; see that page for extrinsic/intrinsic distinction and train/dev/test splits
  • smoothing — a zero-probability n-gram in the test set makes perplexity blow up (division by zero in $\frac{1}{P}$); smoothing fixes this

Active Recall