Smoothing assigns small non-zero probabilities to unseen n-grams, preventing the entire sentence probability from collapsing to zero when the model encounters an n-gram it never saw in training.
The Zero-Probability Problem
An n-gram model estimated by MLE assigns probability zero to any n-gram not seen in training. One zero anywhere in the chain makes the entire sentence probability zero.
Suppose training contains “ate lunch”, “ate dinner”, “ate the”, “ate a” — but never “ate breakfast”. MLE sets $P(\text{breakfast} \mid \text{ate}) = 0$, so any sentence containing that bigram collapses:

$$P(\text{I ate breakfast}) = P(\text{I}) \cdot P(\text{ate} \mid \text{I}) \cdot P(\text{breakfast} \mid \text{ate}) = 0$$
If any test-set sentence gets probability zero, perplexity is undefined (division by zero). The model cannot be evaluated at all.
This is not a minor edge case. In the Shakespeare corpus ($V = 29{,}066$ types), 99.96% of possible bigrams were never observed. Any realistic test set will contain unseen bigrams.
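A minimal sketch of the failure mode, with made-up bigram probabilities standing in for a real table:

```python
# Hypothetical bigram probabilities along one sentence; the last bigram was never seen.
bigram_probs = [0.25, 0.33, 0.0065, 0.0]

sentence_prob = 1.0
for p in bigram_probs:
    sentence_prob *= p          # a single zero factor collapses the whole product

print(sentence_prob)            # 0.0
N = len(bigram_probs)
# perplexity = sentence_prob ** (-1 / N)   # raises ZeroDivisionError: 0.0 to a negative power
```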
Add-One (Laplace) Smoothing
The simplest fix: pretend every bigram was seen one extra time.
Intuition. MLE treats a count of 0 as proof of impossibility. But absence of evidence is not evidence of absence — “ate breakfast” just didn’t appear in this corpus, not in all possible English. Laplace smoothing encodes a weaker belief: before seeing any data, assume every n-gram has been seen once. These are called pseudocounts. After observing the real corpus, you have actual counts + 1 for everything. No entry is ever zero, so no sentence ever gets probability zero.
The $V$ in the denominator is a bookkeeping consequence: if you add 1 to each of the $V$ possible next words in a row, the row total grows by $V$, and you must divide by $C(w_{n-1}) + V$ to keep the row summing to 1.
Bigram formula (MLE):

$$P_{\text{MLE}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$
Bigram formula (add-one smoothed):

$$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$$
where $V$ is the vocabulary size (number of types). Adding 1 to every numerator requires adding $V$ to the denominator to keep the probabilities summing to 1.
Trigram formula (from lab):

$$P(w_n \mid w_{n-2} w_{n-1}) = \frac{C(w_{n-2} w_{n-1} w_n) + 1}{C(w_{n-2} w_{n-1}) + V}$$
The same pattern generalizes to any n-gram order.
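A minimal sketch of the bigram case, assuming whitespace-tokenized sentences padded with <s> and </s>; the function and variable names here are my own, not from the lab:

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Collect bigram counts, context (previous-word) counts, and the vocabulary."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def p_laplace(w, prev, bigrams, contexts, V):
    """Add-one smoothed P(w | prev) = (C(prev w) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (contexts[prev] + V)
```

Because `Counter` returns 0 for missing keys, an unseen bigram automatically gets the $1/(C(w_{n-1}) + V)$ floor instead of zero.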
Add-k Smoothing
Add-one is a special case of add-k smoothing: add a fractional count $k < 1$ instead of 1.

$$P_{\text{Add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}$$
Smaller $k$ means less aggressive redistribution. The choice of $k$ is a hyperparameter, often tuned on a dev set.
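The same sketch generalizes by swapping the hard-coded 1 for a parameter `k` (reusing the `bigrams` and `contexts` counts from the sketch above; `k=1` recovers add-one):

```python
def p_add_k(w, prev, bigrams, contexts, V, k=0.05):
    """Add-k smoothed P(w | prev) = (C(prev w) + k) / (C(prev) + k*V)."""
    return (bigrams[(prev, w)] + k) / (contexts[prev] + k * V)
```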
Why Add-One Is Too Aggressive for N-grams
Add-one adds 1 to every cell in the count table. When $V$ is large (e.g., $V = 1{,}446$ in the BeRP corpus), the total mass added to zero cells vastly outweighs the original counts. The probability for observed events drops dramatically while the mass gets spread thin across roughly two million possible bigram entries, almost all of them previously zero.
The result: the model treats everything as nearly equally likely — precisely the opposite of what a good language model should do.
COMMON MISCONCEPTION
Add-one smoothing is not useless — it works fine for text classification and other models where the number of zero entries is proportionally much smaller. It specifically fails for n-gram language models because the count table is overwhelmingly sparse (99.96% zeros for Shakespeare bigrams). The right tool depends on the sparsity regime.
Worked Example 1: Smoothed vs Unsmoothed (BeRP Corpus)
Compute $P(\text{i want chinese food})$ using the Berkeley Restaurant Project bigram tables.
Unsmoothed (from Figure 3.2): $P(\text{want} \mid \text{i}) = 0.33$, $P(\text{chinese} \mid \text{want}) = 0.0065$, $P(\text{food} \mid \text{chinese}) = 0.52$.
Add-one smoothed (from Figure 3.7, $V = 1{,}446$): $P(\text{want} \mid \text{i}) = 0.21$, $P(\text{chinese} \mid \text{want}) = 0.0029$, $P(\text{food} \mid \text{chinese}) = 0.052$.
The counterintuitive result: smoothing lowered the probability of this observed sentence — taking just these three bigram factors, from roughly $1.1 \times 10^{-3}$ to $3.2 \times 10^{-5}$, a 35x decrease. Why? The probability mass that was redistributed to the hundreds of previously-zero bigrams had to come from somewhere — it came from the observed bigrams. $P(\text{want} \mid \text{i})$ dropped from 0.33 to 0.21; $P(\text{chinese} \mid \text{want})$ dropped from 0.0065 to 0.0029; $P(\text{food} \mid \text{chinese})$ dropped from 0.52 to 0.052.
This demonstrates why add-one is too aggressive: it steals too much mass from observed events to give to unobserved ones.
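A quick check of the factor quoted above, using only the three bigram probabilities cited from Figures 3.2 and 3.7:

```python
# Product of the three observed bigram factors, before and after add-one smoothing.
unsmoothed = 0.33 * 0.0065 * 0.52     # ~1.1e-3
smoothed   = 0.21 * 0.0029 * 0.052    # ~3.2e-5
print(unsmoothed / smoothed)          # ~35
```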
Worked Example 2: P(Sam|am) with Add-One
Corpus (4 sentences):
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
Treating <s> and </s> as regular tokens in the vocabulary.
Counts: $C(\text{am}) = 3$ (“am” appears in sentences 1, 2, 3) and $C(\text{am Sam}) = 2$ (sentences 1 and 3).
Vocabulary size: $V = 11$ types (<s>, </s>, I, am, Sam, do, not, like, green, eggs, and).
We need to count carefully: the exact $V$ depends on how boundary tokens are handled, but counting <s> and </s> as types (as above) gives $V = 11$.
Without smoothing: $P(\text{Sam} \mid \text{am}) = 2/3 \approx 0.67$. With add-one: $P(\text{Sam} \mid \text{am}) = (2 + 1)/(3 + 11) = 3/14 \approx 0.214$. The drop is dramatic because the corpus is tiny and $V$ is large relative to the counts.
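A short sketch that reproduces these numbers from the four-sentence corpus, counting <s> and </s> as vocabulary types:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I am Sam </s>",
    "<s> I do not like green eggs and Sam </s>",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    toks = line.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

V = len(unigrams)                        # 11 types
c_am, c_am_sam = unigrams["am"], bigrams[("am", "Sam")]
print(c_am_sam / c_am)                   # MLE: 2/3 ~ 0.67
print((c_am_sam + 1) / (c_am + V))       # add-one: 3/14 ~ 0.214
```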
Absolute Discounting
Add-one steals too much. Add-$k$ steals a tunable amount. Absolute discounting steals a fixed constant $d$ — typically 0.75 — from every non-zero count and redistributes that mass to the unseen n-grams via a lower-order model.
Where does $d = 0.75$ come from? Church and Gale (1991) divided 22 million words of AP Newswire into a training set and a held-out set, then for each training-set bigram count $c$, they averaged the count of the same bigrams in the held-out set. The pattern was striking:
| Bigram count in training | Average count in held-out |
|---|---|
| 0 | 0.0000270 |
| 1 | 0.448 |
| 2 | 1.25 |
| 3 | 2.24 |
| 4 | 3.23 |
| 5 | 4.21 |
| 6 | 5.23 |
| 7 | 6.21 |
| 8 | 7.21 |
| 9 | 8.26 |
For counts 2 and above, the held-out count is consistently about 0.75 below the training count. So MLE over-estimates each count by about 0.75 — subtracting that constant produces a much better estimate on unseen data.
Absolute discounting with interpolation:

$$P_{\text{AbsDiscount}}(w_i \mid w_{i-1}) = \frac{\max\bigl(C(w_{i-1} w_i) - d,\; 0\bigr)}{C(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)$$
The $\max(\cdot,\, 0)$ ensures we never produce a negative count. The interpolation weight $\lambda(w_{i-1})$ is chosen so the total probability mass per context sums to 1 — it recovers exactly the mass we discounted. Some implementations keep separate discount values $d$ for counts of 1 and 2, and a single $d$ for all higher counts.
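A sketch of absolute discounting with interpolation, reusing the `bigrams` and `contexts` counts from the earlier add-one sketch; `unigram_probs` is assumed to be a dict of MLE unigram probabilities, and the context word is assumed to have been seen in training:

```python
def p_abs_discount(w, prev, bigrams, contexts, unigram_probs, d=0.75):
    """Absolute discounting interpolated with a plain unigram model."""
    discounted = max(bigrams[(prev, w)] - d, 0) / contexts[prev]
    # Number of distinct word types observed after `prev` = number of counts we discounted.
    n_followers = sum(1 for (p, _v) in bigrams if p == prev)
    lam = (d / contexts[prev]) * n_followers    # exactly the mass removed by the discount
    return discounted + lam * unigram_probs.get(w, 0.0)
```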
This sets up the question Kneser-Ney answers: should the lower-order model really be the plain unigram probability $P(w_i)$?
Kneser-Ney Smoothing
Kneser-Ney is absolute discounting with a smarter lower-order model. It is the most widely used n-gram smoothing method when compute-cheap LMs are needed.
The problem with unigrams as the backoff
Consider the Shannon game: “I can’t see without my reading ___“. The unigram frequency of “Kong” is higher than “glasses” in many corpora — but “Kong” almost always appears after “Hong”. It is a frequent word in very few contexts. The plain unigram will happily recommend “Kong” as a fill-in whenever absolute discounting falls back to it, even though it is a terrible choice after “reading”.
The key insight: the unigram probability is used exactly when we have not seen this bigram before. So what we really want is not “how frequent is $w$?” but “how likely is $w$ to appear as a novel continuation — after a context where we haven’t seen it before?”
Continuation probability
Instead of $P(w)$, define:

$$P_{\text{CONTINUATION}}(w) \propto \bigl|\{\, v : C(vw) > 0 \,\}\bigr|$$
This counts the number of distinct word types $v$ that $w$ has followed at least once — the number of different bigram types $w$ completes. A word that appears in many different contexts has high continuation probability; a word locked into one context (like “Kong” after “Hong”) has low continuation probability, regardless of its raw frequency.
Normalize by the total number of observed bigram types so it is a proper distribution:

$$P_{\text{CONTINUATION}}(w) = \frac{\bigl|\{\, v : C(vw) > 0 \,\}\bigr|}{\bigl|\{\, (u, u') : C(u u') > 0 \,\}\bigr|}$$
Numerator: number of word types that precede $w$. Denominator: total distinct bigram types in the corpus. Equivalently, the denominator is $\sum_{w'} \bigl|\{\, v : C(v w') > 0 \,\}\bigr|$ — the sum of numerators over all words.
Why this fixes “Kong”. “Kong” appears in only one bigram context (“Hong Kong”), so its numerator is 1. “Glasses” appears after many words (reading, broken, dark, sun, his, her, …), so its numerator is large. Continuation probability correctly prefers “glasses” even though raw unigram frequency prefers “Kong”.
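A sketch of the continuation distribution, again assuming the `bigrams` Counter from the earlier sketches (each key is one observed bigram type):

```python
from collections import defaultdict

def continuation_probs(bigrams):
    """P_CONTINUATION(w) = (# distinct types preceding w) / (# distinct bigram types)."""
    preceders = defaultdict(set)
    for (prev, w) in bigrams:
        preceders[w].add(prev)
    total_bigram_types = len(bigrams)    # number of distinct (prev, w) pairs observed
    return {w: len(pre) / total_bigram_types for w, pre in preceders.items()}
```

On such counts, a word like “Kong” that only ever follows “Hong” keeps a continuation probability of one bigram type out of the total, no matter how often “Hong Kong” occurs.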
The Kneser-Ney formula
Combine absolute discounting with continuation probability as the lower-order model:

$$P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max\bigl(C(w_{i-1} w_i) - d,\; 0\bigr)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)$$
The interpolation weight $\lambda(w_{i-1})$ is the normalizing constant that redistributes exactly the mass discounted by $d$:

$$\lambda(w_{i-1}) = \frac{d}{C(w_{i-1})}\, \bigl|\{\, w : C(w_{i-1} w) > 0 \,\}\bigr|$$
Reading this: $d / C(w_{i-1})$ is the per-event normalized discount; $\bigl|\{\, w : C(w_{i-1} w) > 0 \,\}\bigr|$ is the number of word types that can follow $w_{i-1}$, which equals the number of times we applied the discount. The product is the total mass we stole and now hand back to the continuation term.
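Putting the pieces together as a sketch: the discounted bigram term plus the normalized discount times the continuation probability, reusing `bigrams`, `contexts`, and the `continuation_probs` helper from above (single fixed discount, not the extended three-discount variant):

```python
def p_kneser_ney(w, prev, bigrams, contexts, p_cont, d=0.75):
    """Interpolated Kneser-Ney bigram probability with a single discount d.

    `p_cont` is the dict built by continuation_probs(bigrams)."""
    discounted = max(bigrams[(prev, w)] - d, 0) / contexts[prev]
    n_followers = sum(1 for (p, _v) in bigrams if p == prev)   # |{w : C(prev w) > 0}|
    lam = (d / contexts[prev]) * n_followers                   # mass to redistribute
    return discounted + lam * p_cont.get(w, 0.0)
```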
Recursive formulation
For higher-order n-grams, Kneser-Ney applies recursively:

$$P_{\text{KN}}(w_i \mid w_{i-n+1:i-1}) = \frac{\max\bigl(c_{\text{KN}}(w_{i-n+1:i}) - d,\; 0\bigr)}{\sum_{v} c_{\text{KN}}(w_{i-n+1:i-1}\, v)} + \lambda(w_{i-n+1:i-1})\, P_{\text{KN}}(w_i \mid w_{i-n+2:i-1})$$
The trick is the count function $c_{\text{KN}}$:

$$c_{\text{KN}}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuation-count}(\cdot) & \text{for all lower orders} \end{cases}$$
At the top level (e.g. trigram), use ordinary counts. At every recursive step below, use continuation counts — the number of distinct single-word contexts in which the sequence appears. This propagates the continuation intuition all the way down the recursion.
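A sketch of the count switch; it assumes `ngram_counts` is a Counter keyed by word tuples of every order up to the model's highest order, so that continuation contexts at order k can be read off the order-(k+1) keys:

```python
from collections import defaultdict

def make_c_kn(ngram_counts, highest_order):
    """Return c_kn(ngram): ordinary count at the top order, continuation count below."""
    left_contexts = defaultdict(set)
    for ngram in ngram_counts:
        if len(ngram) >= 2:
            left_contexts[ngram[1:]].add(ngram[0])   # distinct words seen before this suffix

    def c_kn(ngram):
        if len(ngram) == highest_order:
            return ngram_counts[ngram]               # raw count at the top level
        return len(left_contexts[ngram])             # continuation count everywhere below
    return c_kn
```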
Extended Interpolated Kneser-Ney (Chen and Goodman, 1998) uses three separate discount values $d$ for counts of 1, 2, and 3+ — this is the method most commonly deployed in practice.
N-gram Smoothing Summary
| Situation | Method |
|---|---|
| Text classification; few zeros | Add-1 (Laplace) |
| General-purpose n-gram LM | Extended Interpolated Kneser-Ney |
| Web-scale (trillions of words) | Stupid backoff (see interpolation-and-backoff) |
Add-1 is the pedagogical baseline — almost never used for n-gram LMs in practice because the sparsity regime is too punishing. Kneser-Ney dominates when model quality matters. Stupid backoff dominates when scale matters more than theoretical validity — at Google-web scale, discounting is expensive and the simple 0.4-multiplier heuristic outperforms more principled methods.
Related
- n-gram-language-models — smoothing modifies the probability estimates of n-gram models
- perplexity — zero probabilities make perplexity undefined; smoothing restores computability
- interpolation-and-backoff — the preferred alternative when add-one is too aggressive; stupid backoff for web-scale
Active Recall
Why does a single unseen bigram in the test set make perplexity impossible to compute, and how does smoothing fix this?
Perplexity involves the product $\bigl(\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}\bigr)^{1/N}$. If any $P(w_i \mid w_{i-1}) = 0$, one factor becomes infinite, making perplexity undefined. Smoothing ensures every bigram has $P > 0$ by adding a small pseudo-count to all entries, so no factor in the product is infinite.
Given (illustrative values) $C(w_{n-1} w_n) = 900$, $C(w_{n-1}) = 10{,}000$, and $V = 900$, compute both the MLE and add-one smoothed $P(w_n \mid w_{n-1})$. By what factor does smoothing change the estimate?
MLE: $900 / 10{,}000 = 0.090$. Add-one: $(900 + 1)/(10{,}000 + 900) \approx 0.083$. Smoothing reduced the estimate by a factor of about 1.09 — roughly an 8% drop. For a common bigram with decent counts, the effect is mild. For rare bigrams the effect is proportionally larger; for zero-count bigrams, the change is from 0 to $1/(C(w_{n-1}) + V)$.
Add-one smoothing can actually lower the probability of an observed sentence compared to unsmoothed MLE. Explain why this happens and what it reveals about add-one's weakness.
Adding 1 to every bigram count adds $V$ to each row’s denominator. When $V$ is large (e.g., 1,446 in BeRP), the denominator increases substantially, so even observed bigrams get lower probability. The total mass removed from observed events equals the mass given to the previously-zero cells, $(V - T)/(C(w_{n-1}) + V)$, where $T$ is the number of observed bigram types in that row. If almost all bigrams are zero — which is typical — most of the mass flows to unseen events, leaving observed events with much less. This reveals add-one’s weakness: it treats all unseen events as equally worthy of probability mass, regardless of how many there are.
Why is add-one smoothing acceptable for text classification but not for n-gram language models?
In text classification, the feature space (typically word-in-document indicators) is much denser relative to $V$ — most features have non-trivial counts. Adding 1 to each has a proportionally small effect. In n-gram language models, the space of possible n-grams is $V^n$ ($V^2$ for bigrams), and 99%+ of entries are zero. Adding 1 to every zero entry redistributes a massive fraction of the total probability mass, making the model nearly uniform. The sparsity regime determines whether add-one is acceptable.
Write the add-one smoothed formula for trigram probability estimation. What goes in the denominator?
$P(w_n \mid w_{n-2} w_{n-1}) = \frac{C(w_{n-2} w_{n-1} w_n) + 1}{C(w_{n-2} w_{n-1}) + V}$. The denominator is the count of the bigram prefix plus the vocabulary size $V$. The $+V$ compensates for adding 1 to each of the $V$ possible continuations of the prefix, ensuring the probabilities sum to 1.
Church and Gale compared bigram training counts to their held-out counts and found a consistent pattern. What was it, and what method does it justify?
For bigram training counts $c \geq 2$, the average held-out count was approximately $c - 0.75$. MLE systematically over-estimates each count by about 0.75. This justifies absolute discounting: subtract a fixed constant $d$ from every non-zero count and redistribute that mass to unseen n-grams via a lower-order model.
Why does Kneser-Ney replace the unigram $P(w)$ with a continuation probability? Use the "Kong" vs "glasses" example.
The lower-order model is used exactly when the bigram is unseen, so it should predict words that are plausible novel continuations. In raw unigram frequency, “Kong” beats “glasses” because “Hong Kong” is common. But “Kong” appears in only one bigram context (after “Hong”), while “glasses” appears after many contexts. Continuation probability counts the number of distinct bigram types a word completes, correctly ranking “glasses” above “Kong” for the sentence “I can’t see without my reading ___”.
Write the Kneser-Ney bigram formula and identify the role of each term.
$P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d,\, 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)$. The first term is the discounted bigram MLE — subtract $d$ (typically 0.75) from the count, clip at 0. The second term is the interpolation weight $\lambda(w_{i-1})$ times continuation probability — it redistributes exactly the mass discounted from all continuations of $w_{i-1}$, and scores words by how many distinct contexts they novelly continue.
In the recursive Kneser-Ney formulation, what changes at each level of the recursion?
The count function $c_{\text{KN}}$ changes. At the highest order (e.g. trigram), $c_{\text{KN}}$ is the ordinary count. For every lower order reached by recursion, $c_{\text{KN}}$ is the continuation count — the number of distinct single-word contexts in which the sequence appears — rather than the raw count. This propagates the “novel continuation” intuition all the way down the backoff chain, not just at the bigram-to-unigram transition.
In one sentence each, when would you choose add-1, Kneser-Ney, or stupid backoff?
Add-1: text classification or any domain where the zero rate is modest — it is simple and good enough. Kneser-Ney (extended interpolated): general n-gram LM where model quality matters — the default choice for perplexity-driven work. Stupid backoff: web-scale training (trillions of words) where proper discounting is too expensive and a valid probability distribution is not required.