Bayes’ rule inverts a conditional probability — it lets you compute $P(A \mid B)$ from $P(B \mid A)$, $P(A)$, and $P(B)$. In NLP it’s the machinery behind classifiers, noisy-channel decoders, and any system that reasons about unseen causes from observed evidence.
The Rule
Start from the definition of conditional probability two ways:

$$P(A \mid B)\,P(B) = P(A, B) = P(B \mid A)\,P(A)$$

Rearrange:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Every term has a name:
| Term | Name | Meaning |
|---|---|---|
| $P(A \mid B)$ | posterior | updated belief in $A$ after seeing $B$ |
| $P(B \mid A)$ | likelihood | how probable $B$ is if $A$ is true |
| $P(A)$ | prior | belief in $A$ before seeing $B$ |
| $P(B)$ | evidence (or marginal) | total probability of seeing $B$ under any hypothesis |
The Mantra
New evidence shouldn’t determine your belief in a vacuum — it should update a prior.
This is the whole point. The posterior is built from two ingredients: how well the evidence fits the hypothesis (likelihood) and how plausible the hypothesis was to begin with (prior). Dropping either gives the wrong answer.
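A minimal sketch of that update in Python; the function name and the toy numbers are illustrative, not from the source:

```python
def posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Toy numbers (assumed for the sketch): P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.4
prior_a = 0.3
like_b_given_a = 0.8
like_b_given_not_a = 0.4

# Evidence via the law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
evidence_b = like_b_given_a * prior_a + like_b_given_not_a * (1 - prior_a)

print(posterior(prior_a, like_b_given_a, evidence_b))  # ≈ 0.462
```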
Base Rate Neglect: The Steve Problem
Kahneman & Tversky’s canonical demonstration (adapted in 3Blue1Brown’s Bayes theorem, the geometry of changing beliefs):
Steve is a meek and tidy soul, with a need for order and structure and a passion for detail. Is Steve more likely to be a librarian or a farmer?
Most people say librarian — the description matches the stereotype. They’re reasoning only from the likelihood and ignoring the prior.
But in the US there are roughly 20 farmers per librarian. Work through the numbers with a representative sample of 210 people (10 librarians, 200 farmers), and suppose 40% of librarians fit Steve’s description vs only 10% of farmers:
- Librarians matching the description: $0.4 \times 10 = 4$
- Farmers matching the description: $0.1 \times 200 = 20$
Of the 24 people who match the description, 20 are farmers, so $P(\text{farmer} \mid \text{description}) = 20/24 \approx 0.83$. Steve is ~83% likely to be a farmer — despite the description screaming librarian. The likelihood ratio (4:1 in favour of librarian) is overwhelmed by the prior ratio (20:1 in favour of farmer). Ignoring the prior is base rate neglect, and it’s the systematic error Bayes’ rule corrects.
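The same arithmetic as a quick sketch (variable names are illustrative):

```python
# Priors from the population: ~20 farmers per librarian
n_librarians, n_farmers = 10, 200

# Likelihoods: how often each group fits Steve's description
p_desc_given_librarian = 0.40
p_desc_given_farmer = 0.10

# Expected counts matching the description
matching_librarians = p_desc_given_librarian * n_librarians  # 4
matching_farmers = p_desc_given_farmer * n_farmers           # 20

# Posterior: the fraction of matching people who are farmers
p_farmer_given_desc = matching_farmers / (matching_librarians + matching_farmers)
print(p_farmer_given_desc)  # ≈ 0.83
```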
Geometric Picture
Draw a 1×1 square representing the full space of possibilities.
- Partition by hypothesis. A vertical strip of width $P(A)$ covers the cases where $A$ is true; the remaining strip covers $\neg A$. The strip widths are the priors.
- Carve out evidence within each strip. Inside the $A$ strip, a sub-region of proportion $P(B \mid A)$ represents cases where $B$ is also observed. Same inside the $\neg A$ strip with proportion $P(B \mid \neg A)$.
- Restrict to the evidence. Throw away everything where $B$ doesn’t hold. You’re left with the two evidence-sub-regions.
- The posterior $P(A \mid B)$ is the fraction of the remaining area that sits in the $A$ strip.
The key takeaway from this picture: a small prior multiplies down the area, and a large likelihood alone can’t rescue a hypothesis whose strip is thin. The posterior is a proportion, not a score — a tiny slice times a generous likelihood can still be smaller than a fat slice times a mediocre likelihood.
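A quick worked instance of that point, with made-up numbers: take a thin strip $P(A) = 0.1$ with a generous likelihood $P(B \mid A) = 0.8$, and a fat strip $P(\neg A) = 0.9$ with a mediocre likelihood $P(B \mid \neg A) = 0.2$. Then

$$P(A \mid B) = \frac{0.8 \times 0.1}{0.8 \times 0.1 + 0.2 \times 0.9} = \frac{0.08}{0.26} \approx 0.31,$$

so the $A$ strip, despite a 4× larger likelihood, ends up with less than a third of the posterior.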
Where Bayes’ Rule Shows Up in NLP
- Naive Bayes classification — flip $P(\text{class} \mid \text{document})$ into $P(\text{document} \mid \text{class})\,P(\text{class})$, which is estimable from counts. The formula from Naive Bayes is literally Bayes’ rule with $A = \text{class}$ and $B = \text{document}$.
- Noisy channel models — for spelling correction, machine translation, and speech recognition: $\hat{w} = \arg\max_w P(w \mid o) = \arg\max_w P(o \mid w)\,P(w)$, where $w$ is the intended string and $o$ is the observed one. A language model supplies the prior $P(w)$; a channel model supplies the likelihood $P(o \mid w)$.
- Any probabilistic decoder — HMM decoding, probabilistic parsing, topic models. Anywhere you infer a latent cause from observed words, you are doing Bayesian inference.
The reason Bayes’ rule is useful is almost always that $P(A \mid B)$ is hard to estimate directly but $P(B \mid A)$ and $P(A)$ can be estimated from data. You invert the conditioning to get at the quantity you actually want.
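A minimal noisy-channel decoder sketch; the candidate list, language-model probabilities, and channel probabilities below are invented for illustration:

```python
# Toy noisy-channel spelling correction: observed typo "teh",
# candidate corrections w, scored by P(obs | w) * P(w).
observed = "teh"

# Prior P(w): a made-up unigram language model over candidates
prior = {"the": 0.05, "ten": 0.002, "tea": 0.001}

# Likelihood P(obs | w): a made-up channel model of typing errors
channel = {"the": 0.02, "ten": 0.001, "tea": 0.0005}

# Posterior is proportional to likelihood * prior; P(obs) cancels in the argmax
scores = {w: channel[w] * prior[w] for w in prior}
best = max(scores, key=scores.get)
print(best, scores)  # 'the' wins: 1e-3 vs 2e-6 vs 5e-7
```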
Dropping the Denominator in $\arg\max$
When you only need the most probable hypothesis — not its actual probability — the denominator $P(B)$ is irrelevant. It’s the same for every hypothesis, so it doesn’t affect the ranking:

$$\hat{A} = \arg\max_{A} P(A \mid B) = \arg\max_{A} \frac{P(B \mid A)\,P(A)}{P(B)} = \arg\max_{A} P(B \mid A)\,P(A)$$
This is the step that turns Bayes’ rule from “intractable marginalization problem” into “pick the hypothesis with the highest numerator” — and it’s what makes Naive Bayes practical.
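A small sketch of that equivalence, with made-up numbers: ranking by the unnormalized numerator and by the full posterior picks the same hypothesis, because dividing every score by the same $P(B)$ cannot reorder them.

```python
# Unnormalized scores P(B|A) * P(A) for three hypotheses (made-up numbers)
scores = {"A1": 0.008, "A2": 0.030, "A3": 0.012}

# The evidence P(B) is the sum of the numerators over all hypotheses
evidence = sum(scores.values())

# Full posteriors divide every score by the same constant
posteriors = {a: s / evidence for a, s in scores.items()}

# Both rankings pick the same hypothesis
assert max(scores, key=scores.get) == max(posteriors, key=posteriors.get)
print(max(scores, key=scores.get))  # A2
```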
Related
- naive-bayes — Bayes’ rule applied to (document, class) with conditional independence
- n-gram-language-models — supplies the prior in noisy-channel decoders
- evaluation-methodology — Bayesian reasoning underlies held-out validation and posterior predictive checks
Active Recall
State Bayes' rule and name every term.
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. $P(A \mid B)$ is the posterior (updated belief after seeing evidence). $P(B \mid A)$ is the likelihood (probability of observing $B$ if $A$ were true). $P(A)$ is the prior (belief in $A$ before seeing evidence). $P(B)$ is the evidence or marginal (total probability of observing $B$ under any hypothesis). The formula inverts the conditional — useful whenever $P(A \mid B)$ is hard to estimate directly but the other terms aren’t.
Why does the denominator drop out when you use Bayes' rule in a classifier?
The classifier picks $\arg\max_A P(A \mid B)$, not the actual probability. $P(B)$ is the same for every candidate $A$, so it scales every score identically and cannot change the ranking. Dropping it turns an intractable normalization problem (summing over every possible $A$) into a comparison of unnormalized scores — the step that makes Bayes’ rule practical for NLP.
Explain the Steve librarian/farmer example and the error it exposes.
Steve’s description matches a librarian stereotype — $P(\text{description} \mid \text{librarian})$ is high. Most people stop there and conclude he’s a librarian. They ignore the prior: there are ~20 farmers per librarian in the US. Even if librarians match the description 4× more often, the 20× larger pool of farmers dominates — Steve is ~83% likely to be a farmer. The error is base rate neglect: reasoning from likelihood without multiplying in the prior. Bayes’ rule forces you to include both.
In the geometric 1×1 square picture, what do the vertical strip widths and the sub-region areas represent?
Strip widths are priors — each hypothesis $A$ occupies a vertical strip of width $P(A)$ spanning the full square. Within each strip, a sub-region of proportion $P(B \mid A)$ represents the cases where the evidence $B$ is observed — those areas are the joint probabilities $P(A, B) = P(B \mid A)\,P(A)$. The posterior $P(A \mid B)$ is the fraction of the total evidence area (across all hypotheses) that sits in the $A$ strip.
Why is Bayes' rule useful in NLP — what does it let you swap?
It lets you replace a hard-to-estimate conditional $P(A \mid B)$ with a combination of terms that are estimable from data: $P(B \mid A)$ (often countable — e.g. “how often does this word appear in documents of this class”) and $P(A)$ (often just a corpus frequency). This inversion is the core trick in Naive Bayes, noisy-channel spelling correction, statistical MT, speech recognition, and HMM decoding.