Bayes’ rule inverts a conditional probability — it lets you compute $P(A \mid B)$ from $P(B \mid A)$, $P(A)$, and $P(B)$. In NLP it’s the machinery behind classifiers, noisy-channel decoders, and any system that reasons about unseen causes from observed evidence.
The Rule
Start from the definition of conditional probability two ways:

$$P(A \mid B)\,P(B) = P(A, B) = P(B \mid A)\,P(A)$$

Rearrange:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Every term has a name:
| Term | Name | Meaning |
|---|---|---|
| $P(A \mid B)$ | posterior | updated belief in $A$ after seeing $B$ |
| $P(B \mid A)$ | likelihood | how probable $B$ is if $A$ is true |
| $P(A)$ | prior | belief in $A$ before seeing $B$ |
| $P(B)$ | evidence (or marginal) | total probability of seeing $B$ under any hypothesis |
The Mantra
New evidence shouldn’t determine your belief in a vacuum — it should update a prior.
This is the whole point. The posterior is built from two ingredients: how well the evidence fits the hypothesis (likelihood) and how plausible the hypothesis was to begin with (prior). Dropping either gives the wrong answer.
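A minimal sketch of that update in Python; the function name and the toy numbers are illustrative, not from the source:

```python
def posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Toy numbers (assumed for the sketch): P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.4
prior_a = 0.3
like_b_given_a = 0.8
like_b_given_not_a = 0.4

# Evidence via the law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
evidence_b = like_b_given_a * prior_a + like_b_given_not_a * (1 - prior_a)

print(posterior(prior_a, like_b_given_a, evidence_b))  # ≈ 0.462
```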
Base Rate Neglect: The Steve Problem
Kahneman & Tversky’s canonical demonstration (adapted in 3Blue1Brown’s Bayes theorem, the geometry of changing beliefs):
Steve is a meek and tidy soul, with a need for order and structure and a passion for detail. Is Steve more likely to be a librarian or a farmer?
Most people say librarian — the description matches the stereotype. They’re reasoning only from the likelihood and ignoring the prior.
But in the US there are roughly 20 farmers per librarian. Work through the numbers with a representative sample of 210 people (10 librarians, 200 farmers), and suppose 40% of librarians fit Steve’s description vs only 10% of farmers:
- Librarians matching the description: $0.4 \times 10 = 4$
- Farmers matching the description: $0.1 \times 200 = 20$
Of the 24 people who match the description, 20 are farmers, so $P(\text{farmer} \mid \text{description}) = 20/24 \approx 0.83$. Steve is ~83% likely to be a farmer — despite the description screaming librarian. The likelihood ratio (4:1 in favour of librarian) is overwhelmed by the prior ratio (20:1 in favour of farmer). Ignoring the prior is base rate neglect, and it’s the systematic error Bayes’ rule corrects.
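The same arithmetic as a quick sketch (variable names are illustrative):

```python
# Priors from the population: ~20 farmers per librarian
n_librarians, n_farmers = 10, 200

# Likelihoods: how often each group fits Steve's description
p_desc_given_librarian = 0.40
p_desc_given_farmer = 0.10

# Expected counts matching the description
matching_librarians = p_desc_given_librarian * n_librarians  # 4
matching_farmers = p_desc_given_farmer * n_farmers           # 20

# Posterior: the fraction of matching people who are farmers
p_farmer_given_desc = matching_farmers / (matching_librarians + matching_farmers)
print(p_farmer_given_desc)  # ≈ 0.83
```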
Geometric Picture
Draw a 1×1 square representing the full space of possibilities.
- Partition by hypothesis. A vertical strip of width $P(A)$ covers the cases where $A$ is true; the remaining strip covers $\neg A$. The strip widths are the priors.
- Carve out evidence within each strip. Inside the $A$ strip, a sub-region of proportion $P(B \mid A)$ represents cases where $B$ is also observed. Same inside the $\neg A$ strip with proportion $P(B \mid \neg A)$.
- Restrict to the evidence. Throw away everything where $B$ doesn’t hold. You’re left with the two evidence-sub-regions.
- The posterior $P(A \mid B)$ is the fraction of the remaining area that sits in the $A$ strip.
The key takeaway from this picture: a small prior multiplies down the area, and a large likelihood alone can’t rescue a hypothesis whose strip is thin. The posterior is a proportion, not a score — a tiny slice times a generous likelihood can still be smaller than a fat slice times a mediocre likelihood.
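A quick worked instance of that point, with made-up numbers: take a thin strip $P(A) = 0.1$ with a generous likelihood $P(B \mid A) = 0.8$, and a fat strip $P(\neg A) = 0.9$ with a mediocre likelihood $P(B \mid \neg A) = 0.2$. Then

$$P(A \mid B) = \frac{0.8 \times 0.1}{0.8 \times 0.1 + 0.2 \times 0.9} = \frac{0.08}{0.26} \approx 0.31,$$

so the $A$ strip, despite a 4× larger likelihood, ends up with less than a third of the posterior.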
Where Bayes’ Rule Shows Up in NLP
- Naive Bayes classification — flip $P(\text{class} \mid \text{document})$ into $P(\text{document} \mid \text{class})\,P(\text{class})$, which is estimable from counts. The formula from Naive Bayes is literally Bayes’ rule with $A = \text{class}$ and $B = \text{document}$.
- Noisy channel models — for spelling correction, machine translation, and speech recognition: $\hat{w} = \arg\max_w P(w \mid o) = \arg\max_w P(o \mid w)\,P(w)$, where $w$ is the intended string and $o$ is the observed one. A language model supplies the prior $P(w)$; a channel model supplies the likelihood $P(o \mid w)$.
- Any probabilistic decoder — HMM decoding, probabilistic parsing, topic models. Anywhere you infer a latent cause from observed words, you are doing Bayesian inference.
The reason Bayes’ rule is useful is almost always that $P(A \mid B)$ is hard to estimate directly but $P(B \mid A)$ and $P(A)$ can be estimated from data. You invert the conditioning to get at the quantity you actually want.
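A minimal noisy-channel decoder sketch; the candidate list, language-model probabilities, and channel probabilities below are invented for illustration:

```python
# Toy noisy-channel spelling correction: observed typo "teh",
# candidate corrections w, scored by P(obs | w) * P(w).
observed = "teh"

# Prior P(w): a made-up unigram language model over candidates
prior = {"the": 0.05, "ten": 0.002, "tea": 0.001}

# Likelihood P(obs | w): a made-up channel model of typing errors
channel = {"the": 0.02, "ten": 0.001, "tea": 0.0005}

# Posterior is proportional to likelihood * prior; P(obs) cancels in the argmax
scores = {w: channel[w] * prior[w] for w in prior}
best = max(scores, key=scores.get)
print(best, scores)  # 'the' wins: 1e-3 vs 2e-6 vs 5e-7
```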
Dropping the Denominator in $\arg\max$
When you only need the most probable hypothesis — not its actual probability — the denominator $P(B)$ is irrelevant. It’s the same for every hypothesis, so it doesn’t affect the ranking:

$$\hat{A} = \arg\max_{A} P(A \mid B) = \arg\max_{A} \frac{P(B \mid A)\,P(A)}{P(B)} = \arg\max_{A} P(B \mid A)\,P(A)$$
This is the step that turns Bayes’ rule from “intractable marginalization problem” into “pick the hypothesis with the highest numerator” — and it’s what makes Naive Bayes practical.
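A small sketch of that equivalence, with made-up numbers: ranking by the unnormalized numerator and by the full posterior picks the same hypothesis, because dividing every score by the same $P(B)$ cannot reorder them.

```python
# Unnormalized scores P(B|A) * P(A) for three hypotheses (made-up numbers)
scores = {"A1": 0.008, "A2": 0.030, "A3": 0.012}

# The evidence P(B) is the sum of the numerators over all hypotheses
evidence = sum(scores.values())

# Full posteriors divide every score by the same constant
posteriors = {a: s / evidence for a, s in scores.items()}

# Both rankings pick the same hypothesis
assert max(scores, key=scores.get) == max(posteriors, key=posteriors.get)
print(max(scores, key=scores.get))  # A2
```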
Related
- naive-bayes — Bayes’ rule applied to (document, class) with conditional independence
- n-gram-language-models — supplies the prior in noisy-channel decoders
- evaluation-methodology — Bayesian reasoning underlies held-out validation and posterior predictive checks
Active Recall
State Bayes' rule and name every term.
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. $P(A \mid B)$ is the posterior (updated belief after seeing evidence). $P(B \mid A)$ is the likelihood (probability of observing $B$ if $A$ were true). $P(A)$ is the prior (belief in $A$ before seeing evidence). $P(B)$ is the evidence or marginal (total probability of observing $B$ under any hypothesis). The formula inverts the conditional — useful whenever $P(A \mid B)$ is hard to estimate directly but the other terms aren’t.
Why does the denominator drop out when you use Bayes' rule in a classifier?
The classifier picks $\arg\max_A P(A \mid B)$, not the actual probability. $P(B)$ is the same for every candidate $A$, so it scales every score identically and cannot change the ranking. Dropping it turns an intractable normalization problem (summing over every possible $A$) into a comparison of unnormalized scores — the step that makes Bayes’ rule practical for NLP.
Explain the Steve librarian/farmer example and the error it exposes.
Steve’s description matches a librarian stereotype — $P(\text{description} \mid \text{librarian})$ is high. Most people stop there and conclude he’s a librarian. They ignore the prior: there are ~20 farmers per librarian in the US. Even if librarians match the description 4× more often, the 20× larger pool of farmers dominates — Steve is ~83% likely to be a farmer. The error is base rate neglect: reasoning from likelihood without multiplying in the prior. Bayes’ rule forces you to include both.
In the geometric 1×1 square picture, what do the vertical strip widths and the sub-region areas represent?
Strip widths are priors — each hypothesis $A$ occupies a vertical strip of width $P(A)$ spanning the full square. Within each strip, a sub-region of proportion $P(B \mid A)$ represents the cases where the evidence $B$ is observed — those areas are the joint probabilities $P(A, B) = P(B \mid A)\,P(A)$. The posterior $P(A \mid B)$ is the fraction of the total evidence area (across all hypotheses) that sits in the $A$ strip.
Why is Bayes' rule useful in NLP — what does it let you swap?
It lets you replace a hard-to-estimate conditional $P(A \mid B)$ with a combination of terms that are estimable from data: $P(B \mid A)$ (often countable — e.g. “how often does this word appear in documents of this class”) and $P(A)$ (often just a corpus frequency). This inversion is the core trick in Naive Bayes, noisy-channel spelling correction, statistical MT, speech recognition, and HMM decoding.