Interpolation and backoff combine higher-order and lower-order n-gram models to handle unseen contexts; interpolation mixes all orders simultaneously, while backoff falls through to lower orders only when higher orders fail.

Motivation

Add-one smoothing is too aggressive for n-gram-language-models: it makes the probability distribution nearly uniform. We need a more principled way to handle unseen n-grams.

The core idea: when a trigram has never been seen, the bigram or unigram might still carry useful information. Rather than adding pseudo-counts, use lower-order models as backup. There are two ways to do this:

  • Interpolation — always blend all orders together. Every prediction is a weighted mix of the trigram, bigram, and unigram models, regardless of whether the higher-order count is zero.
  • Backoff — use the highest-order model that has evidence. If the trigram count is zero, fall through to the bigram; if that is also zero, fall through to the unigram.

Linear Interpolation

Mix the predictions of unigram, bigram, and trigram models, i.e. take a weighted sum (a linear combination):

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$$

where the $\lambda_i$ are weights controlling how much each model contributes to the final prediction. A higher $\lambda_i$ means more trust in that model order. They must satisfy $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$, so the result is still a valid probability.

All three models contribute to every prediction. Even if the trigram count is zero (contributing $0$), the bigram and unigram terms ensure $\hat{P} > 0$ (assuming the unigram is always non-zero, which it is for any word in the vocabulary).
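As a concrete sketch, here is linear interpolation with fixed weights over a toy corpus (the counts, weights, and function name are illustrative, not from the original):

```python
from collections import Counter

def interpolated_prob(w1, w2, w3, trigrams, bigrams, unigrams, total,
                      lambdas=(0.6, 0.3, 0.1)):
    """Weighted mix of trigram, bigram, and unigram MLE estimates."""
    l1, l2, l3 = lambdas  # must be non-negative and sum to 1
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p_uni = unigrams[w3] / total
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
p = interpolated_prob("the", "cat", "sat", trigrams, bigrams, unigrams, len(tokens))
```

Even an unseen trigram gets a non-zero score here, because the unigram term is non-zero for any in-vocabulary word.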

Context-dependent lambdas

With fixed lambdas, the same weights apply to every prediction regardless of how reliable the trigram estimate is. But reliability depends on how often the specific context has been seen: a common context like “New York ___” has many observed trigrams and a trustworthy estimate; a rare context like “zygote morphology ___” has almost none and is noisy. Context-dependent lambdas adapt to this — when the context is common, weight the trigram more heavily; when it is rare, lean more on the bigram and unigram.

The lambdas can vary based on context:

$$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1(w_{n-2}^{n-1}) P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2(w_{n-2}^{n-1}) P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2}^{n-1}) P(w_n)$$

If the bigram context is common, the trigram estimate is reliable and $\lambda_1$ should be large. If the context is rare, the trigram estimate is noisy and $\lambda_1$ should be small, letting the bigram and unigram terms dominate.
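One common heuristic for a context-dependent weight is Witten-Bell style: trust the higher-order model in proportion to how often the context occurs ($N$) relative to how many distinct words follow it ($T$), i.e. $\lambda = N/(N+T)$. A minimal sketch, shown for a bigram model to keep it short (function name and toy corpus are illustrative):

```python
from collections import Counter

def context_lambda(context, bigram_counts):
    """Witten-Bell-style weight lambda = N / (N + T):
    N = total occurrences of the context,
    T = number of distinct words seen after it."""
    n = sum(c for (w1, _), c in bigram_counts.items() if w1 == context)
    t = sum(1 for (w1, _) in bigram_counts if w1 == context)
    return n / (n + t) if n else 0.0

tokens = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(tokens, tokens[1:]))
lam = context_lambda("the", bigrams)
```

A frequent context with few distinct continuations yields a lambda near 1 (trust the higher order); an unseen context yields 0 (fall back entirely on lower orders).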

How to set the lambdas?

The $\lambda$ values cannot be learned from the training data directly: maximizing training likelihood would put all the weight on the highest-order n-gram, since that is what the training counts favour. Instead, they are set using a separate held-out set:

  1. Compute n-gram probabilities on the training set (these are fixed).
  2. Set aside a held-out set (drawn from the same distribution as training and test).
  3. Search for the $\lambda$ values that maximize the probability of the held-out set.
  4. Fix those values.
  5. Evaluate perplexity on the test set once.

This is the same logic as the train/dev/test split from evaluation-methodology: the held-out set plays the role of the dev set for the interpolation weights.
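The procedure above can be sketched as a simple grid search. This toy version tunes a single $\lambda$ for a bigram/unigram mix (the corpus, grid, and function name are illustrative):

```python
import math
from collections import Counter

def heldout_loglik(lam, heldout, bigrams, unigrams, total):
    """Log-likelihood of held-out text under lam*P(bigram) + (1-lam)*P(unigram)."""
    ll = 0.0
    for w1, w2 in zip(heldout, heldout[1:]):
        p_bi = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
        p_uni = unigrams[w2] / total
        ll += math.log(lam * p_bi + (1 - lam) * p_uni)
    return ll

train = "the cat sat on the mat the dog sat".split()
heldout = "the cat sat the dog".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

# step 3: pick the lambda that maximizes held-out likelihood
best = max((lam / 10 for lam in range(1, 10)),
           key=lambda lam: heldout_loglik(lam, heldout, bigrams, unigrams, len(train)))
```

Note the training counts stay fixed during the search; only the mixing weight is fit on the held-out data, exactly as with a dev set.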

Backoff

Backoff uses only one model order at a time, starting from the highest:

  1. If $C(w_{n-2} w_{n-1} w_n) > 0$: use the trigram probability $P(w_n \mid w_{n-2} w_{n-1})$.
  2. Else if $C(w_{n-1} w_n) > 0$: back off to the bigram $P(w_n \mid w_{n-1})$.
  3. Else: back off to the unigram $P(w_n)$.

The discounting requirement

Backoff has a subtle problem: for a given context, the MLE trigram probabilities of the observed continuations already sum to 1. If we use them as-is and then also assign probability to unseen trigrams via backoff, the total exceeds 1, which is not a valid probability distribution.

The fix: discount the probabilities of observed n-grams (scale them down to a discounted estimate $P^*$) to free up probability mass for the backed-off cases. This is the essence of Katz backoff, which combines the discounted $P^*$ with a backoff weight $\alpha(\text{context})$ that distributes the freed mass over the lower-order model, so everything still sums to 1.

Stupid backoff

A simpler variant that skips the discounting:

$$S(w_n \mid w_{n-2} w_{n-1}) = \begin{cases} \dfrac{C(w_{n-2} w_{n-1} w_n)}{C(w_{n-2} w_{n-1})} & \text{if } C(w_{n-2} w_{n-1} w_n) > 0 \\[4pt] 0.4 \cdot S(w_n \mid w_{n-1}) & \text{otherwise} \end{cases}$$

Each time you back off, multiply by 0.4. This is not a true probability distribution (the values may not sum to 1), which is why it is written $S$ for "score" rather than $P$. But it is simple, fast, and works surprisingly well at scale: it was used by Google's Web LM with 1 trillion words of training data.
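Stupid backoff fits in a few lines; here is a sketch for a trigram model over a toy corpus (the constant 0.4 is from the source, the function name and corpus are illustrative):

```python
from collections import Counter

ALPHA = 0.4  # fixed back-off multiplier; applied once per back-off step

def stupid_backoff(w1, w2, w3, trigrams, bigrams, unigrams, total):
    """Score (not a probability): relative frequency of the longest
    observed n-gram, multiplied by ALPHA for each back-off step."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return ALPHA * bigrams[(w2, w3)] / unigrams[w2]
    return ALPHA * ALPHA * unigrams[w3] / total

tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

seen = stupid_backoff("the", "cat", "sat", trigrams, bigrams, unigrams, len(tokens))
unseen = stupid_backoff("on", "cat", "ran", trigrams, bigrams, unigrams, len(tokens))
```

The seen trigram gets its plain relative frequency; the unseen one falls through to the bigram estimate scaled by 0.4. No discounting or normalization is applied, which is exactly why the scores need not sum to 1.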

Interpolation vs Backoff

Interpolation generally outperforms backoff because it always uses information from all model orders. When the trigram count is zero, backoff discards the trigram entirely and switches to the bigram; interpolation keeps the trigram term (which contributes $0$) and blends in the bigram and unigram as it always does. Crucially, interpolation also draws on lower-order evidence even for well-attested trigrams, which backoff never does.

  • smoothing — the simpler approach (modify counts); interpolation and backoff are the more principled alternative
  • n-gram-language-models — the model family that these techniques apply to
  • perplexity — the metric used to evaluate whether interpolation/backoff improves the model
  • evaluation-methodology — held-out data for lambda estimation follows the same logic as train/dev/test

Active Recall