Smoothing assigns small non-zero probabilities to unseen n-grams, preventing the entire sentence probability from collapsing to zero when the model encounters an n-gram it never saw in training.

The Zero-Probability Problem

An n-gram model estimated by MLE assigns probability zero to any n-gram not seen in training. One zero anywhere in the chain makes the entire sentence probability zero.

Suppose training contains “ate lunch”, “ate dinner”, “ate the”, “ate a” — but never “ate breakfast”. MLE sets $P(\text{breakfast} \mid \text{ate}) = 0$, so:

$$P(\text{I ate breakfast}) = P(\text{I}) \times P(\text{ate} \mid \text{I}) \times P(\text{breakfast} \mid \text{ate}) = 0$$

If any test-set sentence gets probability zero, perplexity is undefined (division by zero). The model cannot be evaluated at all.

This is not a minor edge case. In the Shakespeare corpus ($V = 29{,}066$ types), 99.96% of possible bigrams were never observed. Any realistic test set will contain unseen bigrams.
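To make the collapse concrete, here is a minimal runnable sketch; the toy corpus and the `p_mle` helper are illustrative, not from the lab:

```python
from collections import Counter

# Toy training corpus: "ate breakfast" never occurs.
corpus = "i ate lunch . i ate dinner . i ate the soup . i ate a snack .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_mle(w_prev, w):
    """MLE bigram estimate: C(w_prev w) / C(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

# An observed bigram gets a sensible probability...
print(p_mle("ate", "lunch"))      # 0.25 (1 of the 4 "ate ..." bigrams)

# ...but one unseen bigram zeroes out the whole sentence.
sentence = ["i", "ate", "breakfast"]
p = 1.0
for prev, w in zip(sentence, sentence[1:]):
    p *= p_mle(prev, w)
print(p)                          # 0.0
```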

Add-One (Laplace) Smoothing

The simplest fix: pretend every bigram was seen one extra time.

Intuition. MLE treats a count of 0 as proof of impossibility. But absence of evidence is not evidence of absence — “ate breakfast” just didn’t appear in this corpus, not in all possible English. Laplace smoothing encodes a weaker belief: before seeing any data, assume every n-gram has been seen once. These are called pseudocounts. After observing the real corpus, you have actual counts + 1 for everything. No entry is ever zero, so no sentence ever gets probability zero.

The $V$ in the denominator is a bookkeeping consequence: if you add 1 to each of the $V$ possible next words in a row, the row total grows by $V$, and you must divide by $C(w_{n-1}) + V$ to keep the row summing to 1.

Bigram formula (MLE):

$$P_{\text{MLE}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

Bigram formula (add-one smoothed):

$$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$$

where $V$ is the vocabulary size (number of types). Adding 1 to every numerator requires adding $V$ to the denominator to keep the probabilities summing to 1.

Trigram formula (from lab):

$$P(w_n \mid w_{n-2} w_{n-1}) = \frac{C(w_{n-2} w_{n-1} w_n) + 1}{C(w_{n-2} w_{n-1}) + V}$$

The same pattern generalizes to any n-gram order.

Add-k Smoothing

Add-one is a special case of add-k smoothing: add a fractional count $k$ instead of 1.

$$P_{\text{add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}$$

Smaller $k$ means less aggressive redistribution. The choice of $k$ is a hyperparameter, often tuned on a dev set.
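A runnable sketch of add-k smoothing, with $k=1$ reducing to Laplace; the `make_addk_bigram` helper and toy corpus are mine, not from the lab:

```python
from collections import Counter

def make_addk_bigram(tokens, k=1.0):
    """Add-k smoothed bigram model: (C(prev, w) + k) / (C(prev) + k * V)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    V = len(unigrams)  # vocabulary size in types
    def p(w_prev, w):
        return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)
    return p

tokens = "i ate lunch i ate dinner i ate the i ate a".split()
p1 = make_addk_bigram(tokens, k=1.0)    # Laplace
p01 = make_addk_bigram(tokens, k=0.01)  # gentler add-k

# Unseen bigram is non-zero under both; smaller k steals less from seen events.
print(p1("ate", "breakfast"), p01("ate", "breakfast"))
print(p1("ate", "lunch"), p01("ate", "lunch"))
# Each context row still sums to 1.
print(sum(p1("ate", w) for w in set(tokens)))
```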

Why Add-One Is Too Aggressive for N-grams

Add-one adds 1 to every cell in the count table. When $V$ is large (e.g., $V = 1{,}446$ in the BeRP corpus), the total mass added to zero cells vastly outweighs the original counts. The probability for observed events drops dramatically while the mass gets spread thin across the roughly $V^2 \approx 2.1$ million table entries, almost all of which were previously zero.

The result: the model treats everything as nearly equally likely — precisely the opposite of what a good language model should do.

COMMON MISCONCEPTION

Add-one smoothing is not useless — it works fine for text classification and other models where the number of zero entries is proportionally much smaller. It specifically fails for n-gram language models because the count table is overwhelmingly sparse (99.96% zeros for Shakespeare bigrams). The right tool depends on the sparsity regime.

Worked Example 1: Smoothed vs Unsmoothed (BeRP Corpus)

Compute $P(\text{i want chinese food})$ using the Berkeley Restaurant Project bigram tables.

Unsmoothed (from Figure 3.2):

$$P(\text{want} \mid \text{i}) \times P(\text{chinese} \mid \text{want}) \times P(\text{food} \mid \text{chinese}) = 0.33 \times 0.0065 \times 0.52 \approx 1.1 \times 10^{-3}$$

Add-one smoothed (from Figure 3.7, $V = 1{,}446$):

$$0.21 \times 0.0029 \times 0.052 \approx 3.2 \times 10^{-5}$$

The counterintuitive result: smoothing lowered the probability of this observed sentence — from $1.1 \times 10^{-3}$ to $3.2 \times 10^{-5}$, a 35x decrease. Why? The probability mass that was redistributed to the hundreds of previously-zero bigrams had to come from somewhere — it came from the observed bigrams. $P(\text{want} \mid \text{i})$ dropped from 0.33 to 0.21; $P(\text{chinese} \mid \text{want})$ dropped from 0.0065 to 0.0029; $P(\text{food} \mid \text{chinese})$ dropped from 0.52 to 0.052.

This demonstrates why add-one is too aggressive: it steals too much mass from observed events to give to unobserved ones.
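The arithmetic can be checked directly from the probabilities quoted above (Figures 3.2 and 3.7):

```python
# MLE probabilities (Figure 3.2) vs. add-one smoothed (Figure 3.7).
unsmoothed = 0.33 * 0.0065 * 0.52   # P(want|i) * P(chinese|want) * P(food|chinese)
smoothed   = 0.21 * 0.0029 * 0.052

print(f"{unsmoothed:.1e}")           # 1.1e-03
print(f"{smoothed:.1e}")             # 3.2e-05
print(round(unsmoothed / smoothed))  # 35
```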

Worked Example 2: P(Sam|am) with Add-One

Corpus (4 sentences):

<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>

Treating <s> and </s> as regular tokens in the vocabulary.

Count: $C(\text{am}) = 3$ (am appears in sentences 1, 2, 3). $C(\text{am Sam}) = 2$ (sentences 1 and 3).

Vocabulary size: $V = 11$ types (<s>, </s>, I, am, Sam, do, not, like, green, eggs, and).

We need to count carefully: the exact $V$ depends on how boundary tokens are handled, but the lab gives $V = 11$.

Without smoothing: $P(\text{Sam} \mid \text{am}) = C(\text{am Sam}) / C(\text{am}) = 2/3 \approx 0.67$. With add-one: $(2 + 1)/(3 + 11) = 3/14 \approx 0.214$. Smoothing cuts the estimate to less than a third of its MLE value — the difference is dramatic because the corpus is tiny and $V$ is large relative to the counts.
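A quick check of the counts and the smoothed estimate, computed from the four-sentence corpus above (assuming $V$ counts both boundary tokens, as the lab does):

```python
from collections import Counter

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]

unigrams = Counter()
bigrams = Counter()
for s in sentences:
    ws = s.split()
    unigrams.update(ws)
    bigrams.update(zip(ws, ws[1:]))  # bigrams within sentences only

V = len(unigrams)  # 11 types, <s> and </s> included
c_am, c_am_sam = unigrams["am"], bigrams[("am", "Sam")]

print(c_am, c_am_sam, V)            # 3 2 11
print(c_am_sam / c_am)              # 2/3 unsmoothed
print((c_am_sam + 1) / (c_am + V))  # 3/14, about 0.214 with add-one
```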

Absolute Discounting

Add-one steals too much. Add-$k$ steals a tunable amount. Absolute discounting steals a fixed constant $d$ — typically $d = 0.75$ — from every non-zero count and redistributes that mass to the unseen n-grams via a lower-order model.

Where does $d = 0.75$ come from? Church and Gale (1991) divided 22 million words of AP Newswire into a training set and a held-out set, then for each training-set bigram count $c$, they averaged the count of the same bigrams in the held-out set. The pattern was striking:

| Bigram count in training | Average count in held-out |
|---|---|
| 0 | 0.0000270 |
| 1 | 0.448 |
| 2 | 1.25 |
| 3 | 2.24 |
| 4 | 3.23 |
| 5 | 4.21 |
| 6 | 5.23 |
| 7 | 6.21 |
| 8 | 7.21 |
| 9 | 8.26 |

For counts 2 and above, the held-out count is consistently about $c - 0.75$. So the MLE over-estimates each count by about 0.75 — subtracting that constant produces a much better estimate on unseen data.

Absolute discounting with interpolation:

$$P_{\text{AbsDiscount}}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d,\, 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)$$

The $\max(\cdot,\, 0)$ ensures we never produce a negative count. The interpolation weight $\lambda(w_{i-1})$ is chosen so the total probability mass per context sums to 1 — it recovers exactly the mass we discounted. Some implementations keep separate $d$ values for counts of 1 and 2, and a single $d$ for all higher counts.
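A runnable sketch of interpolated absolute discounting under the formula above; the `make_abs_discount_bigram` name and toy data are mine:

```python
from collections import Counter

def make_abs_discount_bigram(tokens, d=0.75):
    """Absolute discounting, interpolated with the unigram distribution."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    n_tokens = len(tokens)
    context_total = Counter()  # total bigram count out of each context
    followers = Counter()      # |{w : C(prev w) > 0}| = times we discounted
    for (prev, w), c in bigrams.items():
        context_total[prev] += c
        followers[prev] += 1
    def p(w_prev, w):
        disc = max(bigrams[(w_prev, w)] - d, 0) / context_total[w_prev]
        lam = d * followers[w_prev] / context_total[w_prev]  # recovered mass
        return disc + lam * unigrams[w] / n_tokens           # back off to unigram
    return p

p = make_abs_discount_bigram("a b a c a b d".split())
# The unseen bigram "a d" now gets mass via the unigram backoff,
# and each context row still sums to 1.
print(p("a", "d") > 0)
print(sum(p("a", w) for w in "abcd"))
```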

This sets up the question Kneser-Ney answers: should the lower-order model really be the plain unigram $P(w_i)$?

Kneser-Ney Smoothing

Kneser-Ney is absolute discounting with a smarter lower-order model. It is the most widely used n-gram smoothing method when compute-cheap LMs are needed.

The problem with unigrams as the backoff

Consider the Shannon game: “I can’t see without my reading ___“. The unigram frequency of “Kong” is higher than “glasses” in many corpora — but “Kong” almost always appears after “Hong”. It is a frequent word in very few contexts. The plain unigram will happily recommend “Kong” as a fill-in whenever absolute discounting falls back to it, even though it is a terrible choice after “reading”.

The key insight: the unigram is used exactly when we have not seen this bigram before. So what we really want is not “how frequent is $w$” but “how likely is $w$ to appear as a novel continuation — after a context where we haven’t seen it before?”

Continuation probability

Instead of the raw unigram $P(w)$, define:

$$P_{\text{CONTINUATION}}(w) \propto |\{v : C(vw) > 0\}|$$

This counts the number of distinct word types $v$ that $w$ has followed at least once — the number of different bigram types $w$ completes. A word that appears in many different contexts has high continuation probability; a word locked into one context (like “Kong” after “Hong”) has low continuation probability, regardless of its raw frequency.

Normalize by the total number of observed bigram types so it is a proper distribution:

$$P_{\text{CONTINUATION}}(w) = \frac{|\{v : C(vw) > 0\}|}{|\{(u', w') : C(u' w') > 0\}|}$$

Numerator: number of word types that precede $w$. Denominator: total distinct bigram types in the corpus. Equivalently, the denominator is $\sum_{w'} |\{v : C(v w') > 0\}|$ — the sum of numerators over all words.

Why this fixes “Kong”. “Kong” appears in only one bigram context (“Hong Kong”), so its numerator is 1. “Glasses” appears after many words (reading, broken, dark, sun, his, her, …), so its numerator is large. Continuation probability correctly prefers “glasses” even though raw unigram frequency prefers “Kong”.
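A toy illustration of the contrast; the corpus is engineered (by me, not from the text) so that a frequent word has only one preceding context:

```python
from collections import Counter

# "kong" is frequent but only ever follows "hong";
# "glasses" is rarer but completes many different bigram types.
tokens = ("hong kong " * 5 +
          "reading glasses sun glasses dark glasses her glasses").split()

unigrams = Counter(tokens)
bigram_types = set(zip(tokens, tokens[1:]))

def p_continuation(w):
    """Fraction of distinct bigram types that end in w."""
    preceders = sum(1 for (v, u) in bigram_types if u == w)
    return preceders / len(bigram_types)

print(unigrams["kong"], unigrams["glasses"])              # raw frequency prefers "kong"
print(p_continuation("kong"), p_continuation("glasses"))  # continuation prefers "glasses"
```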

The Kneser-Ney formula

Combine absolute discounting with continuation probability as the lower-order model:

$$P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d,\, 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)$$

The interpolation weight $\lambda(w_{i-1})$ is the normalizing constant that redistributes exactly the mass discounted by $d$:

$$\lambda(w_{i-1}) = \frac{d}{C(w_{i-1})}\, |\{w : C(w_{i-1} w) > 0\}|$$

Reading this: $\frac{d}{C(w_{i-1})}$ is the per-event normalized discount; $|\{w : C(w_{i-1} w) > 0\}|$ is the number of word types that can follow $w_{i-1}$, which equals the number of times we applied the discount. The product is the total mass we stole and now hand back to the continuation term.
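Putting the pieces together, a bigram-level sketch of interpolated Kneser-Ney with a single discount $d$; the `make_kn_bigram` helper is my name for it, not the lab's:

```python
from collections import Counter

def make_kn_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams (top level only, single discount d)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = Counter()  # C(w_prev), summed over outgoing bigrams
    followers = Counter()      # |{w : C(prev w) > 0}|
    preceders = Counter()      # |{v : C(v w) > 0}|
    for (prev, w), c in bigrams.items():
        context_total[prev] += c
        followers[prev] += 1
        preceders[w] += 1
    n_bigram_types = len(bigrams)
    def p(w_prev, w):
        p_cont = preceders[w] / n_bigram_types               # continuation probability
        lam = d * followers[w_prev] / context_total[w_prev]  # recovered discount mass
        disc = max(bigrams[(w_prev, w)] - d, 0) / context_total[w_prev]
        return disc + lam * p_cont
    return p

p = make_kn_bigram("a b a c a b d".split())
# Unseen bigrams get continuation mass; each context row still sums to 1.
print(p("a", "d") > 0)
print(sum(p("a", w) for w in "abcd"))
```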

Recursive formulation

For higher-order n-grams, Kneser-Ney applies recursively:

$$P_{\text{KN}}(w_i \mid w_{i-n+1:i-1}) = \frac{\max(c_{\text{KN}}(w_{i-n+1:i}) - d,\, 0)}{\sum_v c_{\text{KN}}(w_{i-n+1:i-1}\, v)} + \lambda(w_{i-n+1:i-1})\, P_{\text{KN}}(w_i \mid w_{i-n+2:i-1})$$

The trick is the count function $c_{\text{KN}}$:

$$c_{\text{KN}}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuationcount}(\cdot) & \text{for lower orders} \end{cases}$$

At the top level (e.g. trigram), use ordinary counts. At every recursive step below, use continuation counts — the number of distinct single-word contexts in which the sequence appears. This propagates the continuation intuition all the way down the recursion.

Extended Interpolated Kneser-Ney (Chen and Goodman, 1998) uses three separate $d$ values for counts 1, 2, and 3+ — this is the method most commonly deployed in practice.

N-gram Smoothing Summary

| Situation | Method |
|---|---|
| Text classification; few zeros | Add-1 (Laplace) |
| General-purpose n-gram LM | Extended Interpolated Kneser-Ney |
| Web-scale (trillions of words) | Stupid backoff (see interpolation-and-backoff) |

Add-1 is the pedagogical baseline — almost never used for n-gram LMs in practice because the sparsity regime is too punishing. Kneser-Ney dominates when model quality matters. Stupid backoff dominates when scale matters more than theoretical validity — at Google-web scale, discounting is expensive and the simple 0.4-multiplier heuristic outperforms more principled methods.

  • n-gram-language-models — smoothing modifies the probability estimates of n-gram models
  • perplexity — zero probabilities make perplexity undefined; smoothing restores computability
  • interpolation-and-backoff — the preferred alternative when add-one is too aggressive; stupid backoff for web-scale

Active Recall