Bayes’ rule inverts a conditional probability — it lets you compute $P(H \mid E)$ from $P(E \mid H)$, $P(H)$, and $P(E)$. In NLP it’s the machinery behind classifiers, noisy-channel decoders, and any system that reasons about unseen causes from observed evidence.

The Rule

Start from the definition of conditional probability two ways:

$$P(H \mid E) = \frac{P(H \cap E)}{P(E)} \qquad\qquad P(E \mid H) = \frac{P(H \cap E)}{P(H)}$$

Rearrange:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
Every term has a name:

| Term | Name | Meaning |
| --- | --- | --- |
| $P(H \mid E)$ | posterior | updated belief in $H$ after seeing $E$ |
| $P(E \mid H)$ | likelihood | how probable $E$ is if $H$ is true |
| $P(H)$ | prior | belief in $H$ before seeing $E$ |
| $P(E)$ | evidence (or marginal) | total probability of seeing $E$ under any hypothesis |

The Mantra

New evidence shouldn’t determine your belief in a vacuum — it should update a prior.

This is the whole point. The posterior is built from two ingredients: how well the evidence fits the hypothesis (likelihood) and how plausible the hypothesis was to begin with (prior). Dropping either gives the wrong answer.

Base Rate Neglect: The Steve Problem

Kahneman & Tversky’s canonical demonstration (adapted in 3Blue1Brown’s Bayes theorem, the geometry of changing beliefs):

Steve is a meek and tidy soul, with a need for order and structure and a passion for detail. Is Steve more likely to be a librarian or a farmer?

Most people say librarian — the description matches the stereotype. They’re reasoning only from the likelihood and ignoring the prior.

But in the US there are roughly 20 farmers per librarian. Work through the numbers with a representative sample of 210 people (10 librarians, 200 farmers), and suppose 40% of librarians fit Steve’s description vs only 10% of farmers:

  • Librarians matching the description: $0.4 \times 10 = 4$
  • Farmers matching the description: $0.1 \times 200 = 20$

Of the $4 + 20 = 24$ people matching the description, $20$ are farmers: $P(\text{farmer} \mid \text{description}) = 20/24 \approx 0.83$.

Steve is ~83% likely to be a farmer — despite the description screaming librarian. The likelihood ratio (4:1 in favour of librarian) is overwhelmed by the prior ratio (20:1 in favour of farmer). Ignoring the prior is base rate neglect, and it’s the systematic error Bayes’ rule corrects.
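The Steve arithmetic can be run directly as Bayes' rule — a minimal sketch using the illustrative numbers from the text (these are stylized figures, not real occupational statistics):

```python
# Worked Steve example: priors (base rates) vs likelihoods.
prior_librarian = 10 / 210          # 10 librarians in a sample of 210
prior_farmer = 200 / 210            # 200 farmers (20:1 prior ratio)
lik_librarian = 0.40                # P(description | librarian)
lik_farmer = 0.10                   # P(description | farmer) (4:1 likelihood ratio)

# Bayes' rule: posterior = likelihood * prior / evidence
joint_librarian = lik_librarian * prior_librarian   # 4/210
joint_farmer = lik_farmer * prior_farmer            # 20/210
evidence = joint_librarian + joint_farmer           # P(description)

posterior_farmer = joint_farmer / evidence
print(round(posterior_farmer, 3))   # 0.833
```

The 20:1 prior ratio beats the 4:1 likelihood ratio, which is exactly what the text's 83% figure reflects.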

Geometric Picture

Draw a 1×1 square representing the full space of possibilities.

  1. Partition by hypothesis. A vertical strip of width $P(H)$ covers the cases where $H$ is true; the remaining strip of width $P(\neg H)$ covers $\neg H$. The strip widths are the priors.
  2. Carve out evidence within each strip. Inside the $H$ strip, a sub-region of proportion $P(E \mid H)$ represents cases where $E$ is also observed. Same inside the $\neg H$ strip with proportion $P(E \mid \neg H)$.
  3. Restrict to the evidence. Throw away everything where $E$ doesn’t hold. You’re left with the two evidence sub-regions.
  4. The posterior $P(H \mid E)$ is the fraction of the remaining area that sits in the $H$ strip.

The key takeaway from this picture: a small prior multiplies down the area, and a large likelihood alone can’t rescue a hypothesis whose strip is thin. The posterior is a proportion, not a score — a tiny slice times a generous likelihood can still be smaller than a fat slice times a mediocre likelihood.
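The geometric picture is just arithmetic on areas of the unit square. A small sketch with hypothetical numbers — a thin prior strip ($P(H) = 0.05$) paired with a generous likelihood — shows how the thin strip still loses:

```python
# Areas in the unit square. Numbers are hypothetical, chosen so the
# H strip is thin but its likelihood is generous.
p_h = 0.05               # width of the H strip (the prior)
p_e_given_h = 0.9        # proportion of the H strip where E holds
p_e_given_not_h = 0.1    # proportion of the ¬H strip where E holds

area_h_and_e = p_e_given_h * p_h                  # 0.045
area_not_h_and_e = p_e_given_not_h * (1 - p_h)    # 0.095

# Posterior = fraction of the remaining (evidence) area inside the H strip.
posterior = area_h_and_e / (area_h_and_e + area_not_h_and_e)
print(round(posterior, 3))   # 0.321
```

Even with a 9:1 likelihood advantage, the thin strip yields a posterior under one third — the likelihood alone can’t rescue a hypothesis with a tiny prior.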

Where Bayes’ Rule Shows Up in NLP

  • Naive Bayes classification — flip $P(\text{class} \mid \text{document})$ into $P(\text{document} \mid \text{class})\,P(\text{class})$, which is estimable from counts. The formula from Naive Bayes is literally Bayes’ rule with $H = \text{class}$ and $E = \text{document}$.
  • Noisy channel models — for spelling correction, machine translation, and speech recognition: $\hat{w} = \arg\max_w P(x \mid w)\,P(w)$. A language model supplies the prior $P(w)$; a channel model supplies the likelihood $P(x \mid w)$.
  • Any probabilistic decoder — HMM decoding, probabilistic parsing, topic models. Anywhere you infer a latent cause from observed words, you are doing Bayesian inference.
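A toy noisy-channel spelling corrector makes the division of labour concrete. All probabilities here are made-up illustrative numbers, not estimates from any real corpus or error model:

```python
# Noisy channel: pick the intended word w maximizing P(x | w) * P(w),
# where x is the observed (possibly misspelled) string "teh".
candidates = {
    # word: (P(word) from a language model, P("teh" | word) from a channel model)
    "the": (0.05, 0.10),    # very common word; transposition error is plausible
    "ten": (0.005, 0.02),   # less common, less plausible edit
    "teh": (1e-7, 0.95),    # typing what you see is likely, but "teh" is rarely intended
}

best = max(candidates, key=lambda w: candidates[w][1] * candidates[w][0])
print(best)   # the
```

The channel model alone would pick "teh" (likelihood 0.95), but the language-model prior overwhelms it — the same prior-vs-likelihood trade-off as in the Steve problem.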

The reason Bayes’ rule is useful is almost always that $P(H \mid E)$ is hard to estimate directly but $P(E \mid H)$ and $P(H)$ can be estimated from data. You invert the conditioning to get at the quantity you actually want.

Dropping the Denominator in $\arg\max$

When you only need the most probable hypothesis — not its actual probability — the denominator $P(E)$ is irrelevant. It’s the same for every hypothesis, so it doesn’t affect the ranking:

$$\hat{H} = \arg\max_H \frac{P(E \mid H)\,P(H)}{P(E)} = \arg\max_H P(E \mid H)\,P(H)$$

This is the step that turns Bayes’ rule from “intractable marginalization problem” into “pick the hypothesis with the highest numerator” — and it’s what makes Naive Bayes practical.
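The ranking invariance is easy to check numerically. A minimal sketch with hypothetical class priors and likelihoods:

```python
# Dividing every hypothesis's score by the same P(E) cannot change the argmax.
priors = {"sports": 0.6, "politics": 0.4}        # hypothetical class priors
likelihoods = {"sports": 0.02, "politics": 0.05} # hypothetical P(document | class)

numerators = {c: likelihoods[c] * priors[c] for c in priors}
evidence = sum(numerators.values())              # P(document), same for all classes
posteriors = {c: numerators[c] / evidence for c in priors}

# The winner is identical with or without normalization.
assert max(numerators, key=numerators.get) == max(posteriors, key=posteriors.get)
print(max(numerators, key=numerators.get))   # politics
```

In practice this is why Naive Bayes implementations compare unnormalized (log-)numerators and never compute $P(E)$ at all.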

  • naive-bayes — Bayes’ rule applied to (document, class) with conditional independence
  • n-gram-language-models — supplies the prior in noisy-channel decoders
  • evaluation-methodology — Bayesian reasoning underlies held-out validation and posterior predictive checks

Active Recall