Bayes’ theorem is the formula for inverting a probabilistic relationship. You have a model that predicts the noisy measurement $y$ given the true value $x$ — that’s the likelihood $p(y \mid x)$. You have prior knowledge about how $x$ is distributed in the world — the prior $p(x)$. Bayes’ theorem combines them to give you what you actually want — the posterior $p(x \mid y)$, which says “given that I observed $y$, what’s the distribution over the true $x$?” The forward direction is mechanism; the backward direction is inference. Bayes goes from the first to the second.

Intuition pump: Steve the librarian (Kahneman & Tversky)

Before the formula, an intuition pump from Kahneman & Tversky:

Steve is shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality. A meek and tidy soul.

Is Steve more likely to be a librarian or a farmer?

Most people pick librarian — the description matches the stereotype. This is the wrong answer, and the reason it’s wrong is the entire point of Bayes’ theorem.

In the US there are roughly 20 farmers for every 1 librarian. Even if Steve’s description is, say, four times more typical of librarians than farmers, the sheer outnumbering still wins. Concretely:

  • Imagine 210 people: 10 librarians, 200 farmers.
  • Suppose 40% of librarians match the description: $0.4 \times 10 = 4$ matching librarians.
  • Suppose 10% of farmers match the description: $0.1 \times 200 = 20$ matching farmers.
  • Out of 24 matching people, only 4 are librarians — a posterior of $4/24 \approx 17\%$.

The description is more characteristic of librarians, but a Steve drawn from the population at random is still much more likely to be a farmer. The base rate (20-to-1) overwhelms the descriptive evidence.
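
The same arithmetic as a few lines of code: a minimal sketch of the hypothetical-population counting trick, using the 10/200 split and the 40%/10% match rates from the example above.

```python
# Posterior probability that Steve is a librarian, by counting a
# hypothetical population (numbers taken from the example above).
librarians, farmers = 10, 200          # base rates: 20 farmers per librarian
p_match_librarian = 0.40               # fraction of librarians fitting the description
p_match_farmer = 0.10                  # fraction of farmers fitting the description

matching_librarians = librarians * p_match_librarian   # 4
matching_farmers = farmers * p_match_farmer             # 20

posterior = matching_librarians / (matching_librarians + matching_farmers)
print(f"P(librarian | description) ≈ {posterior:.2f}")  # ≈ 0.17
```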

The cognitive failure mode this exposes — judging probability by stereotype-fit while ignoring how rare the stereotype-fitting category is — is called base-rate neglect. It’s the most common Bayesian mistake humans make, and the one Bayes’ theorem is built to correct.

COMMON MISCONCEPTION — Base-rate neglect

When given evidence that fits a hypothesis well, our intuition jumps to “the hypothesis is probably true.” But “the evidence fits the hypothesis well” is the likelihood $p(\text{evidence} \mid \text{hypothesis})$ — and the question we actually want answered is the posterior $p(\text{hypothesis} \mid \text{evidence})$, which also depends on the prior $p(\text{hypothesis})$ — how rare the hypothesis is in the first place. Bayes’ theorem is the correction: posterior ∝ likelihood × prior. Ignore the prior and you systematically overweight rare-but-stereotype-fitting explanations.

TIP — Concrete counts beat percentages

Notice how much easier the Steve calculation became when phrased as “10 librarians, 200 farmers, 40% match, 10% match” rather than “$p(\text{librarian} \mid \text{description}) = \frac{p(\text{description} \mid \text{librarian})\,p(\text{librarian})}{p(\text{description})}$.” Humans reason about probability much better in concrete counts than in abstract probabilities. When working a Bayesian problem and your intuition rebels at the formula, drop to a hypothetical population of 100 or 1000, count the matching cases, and read off the answer.

The geometry: restricted possibility space

The Steve calculation has a clean geometric reading. Picture all possibilities as a square. Split it left/right by hypothesis: the left strip has width $\tfrac{10}{210}$ (librarians), the right strip width $\tfrac{200}{210}$ (farmers). Within each strip, shade the fraction matching the evidence: a tall band (40%) inside the librarian strip, a short band (10%) inside the farmer strip.

Now condition on the evidence. Throw away everything unshaded. What’s left is two coloured rectangles: the librarian-and-evidence rectangle (small × tall = small area) and the farmer-and-evidence rectangle (large × short = larger area). The posterior is the librarian rectangle’s share of the total remaining shaded area.

TIP — Bayes is just the math of proportions

The whole content of Bayes’ theorem is geometric: restrict to the slice where the evidence is true, then read off the proportion where the hypothesis is also true. As Grant Sanderson puts it, “the actual math of probability is really just the math of proportions, where turning to geometry is exceedingly helpful.” Every Bayesian calculation, no matter how complicated the formula looks, is a measurement of two areas in this restricted-space picture.

What changes belief is whether the evidence shrinks the two strips unevenly. If the evidence-band height is the same in both strips (40% of librarians match, 40% of farmers match), the proportion of librarians among matches equals the prior — the evidence is uninformative. If the bands are very different (40% vs 10%), the proportion shifts toward the strip with the taller band — but only by an amount that the prior widths permit.
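
A small sketch of this area argument (the function name is illustrative, not from the text): the posterior is the hypothesis rectangle’s share of the shaded area, and when both bands have the same height the prior passes through unchanged.

```python
def posterior_from_areas(prior_width: float, h_hypothesis: float, h_alternative: float) -> float:
    """Share of the remaining shaded area that belongs to the hypothesis strip.

    prior_width   : width of the hypothesis strip, i.e. its prior probability
    h_hypothesis  : evidence-band height in the hypothesis strip, P(evidence | hypothesis)
    h_alternative : evidence-band height in the other strip, P(evidence | alternative)
    """
    area_hypothesis = prior_width * h_hypothesis
    area_alternative = (1 - prior_width) * h_alternative
    return area_hypothesis / (area_hypothesis + area_alternative)

print(posterior_from_areas(10 / 210, 0.40, 0.10))  # ≈ 0.17, the Steve posterior
print(posterior_from_areas(10 / 210, 0.40, 0.40))  # ≈ 0.048 = 10/210: equal bands return the prior
```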

From 3Blue1Brown

“Rationality is not about knowing facts. It’s about recognizing which facts are relevant.” — Grant Sanderson, on what Bayes really teaches.

The setup — formalising the geometry

The geometric picture above generalises to continuous variables. You have two random variables $y$ and $x$ — the “evidence” and the “hypothesis,” to keep the Steve example in mind. The joint distribution $p(x, y)$ can be factored two ways:

$$p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)$$

Equating the two factorisations and rearranging gives Bayes’ theorem:

$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}$$

The denominator $p(y)$ (the marginal likelihood or evidence) is just the normalising constant that makes the posterior integrate to 1. For most practical purposes you can treat it as a constant and write the proportionality:

$$p(x \mid y) \propto p(y \mid x)\,p(x)$$
The four named pieces

| Term | Name | Reading |
| --- | --- | --- |
| $p(x)$ | Prior | What we believe about $x$ before seeing $y$ |
| $p(y \mid x)$ | Likelihood (or noise model) | How $y$ is produced given $x$ |
| $p(x \mid y)$ | Posterior | What we believe about $x$ after seeing $y$ |
| $p(y)$ | Evidence (marginal likelihood) | Normalising constant |

The directionality is the key insight: likelihood goes forward (truth → observation), posterior goes backward (observation → truth). The forward direction is usually easy to model (you know how your sensor works); the backward direction is what you actually want at inference time.

Worked example: penguin flipper length

A long-running ecological study of Antarctic penguins records each bird’s flipper length. We have two measurements per bird:

  • $x$: actual flipper length (mm), measured by hand.
  • $y$: flipper length estimated from a computer-vision (CV) system on remote camera footage.

The CV system is noisy — it doesn’t always nail the right answer. We want to deploy it without humans in the loop, but we need to understand how trustworthy its measurements are.

Prior: $p(x)$

Histogram all the true flipper lengths $x$ in the dataset. You get something roughly Gaussian centred at about 190 mm. This is prior knowledge about the species — what penguin flipper lengths look like in the wild, regardless of any specific measurement.

The expected value is $\mathbb{E}[x] \approx 190$ mm — your best guess about an unknown penguin’s flipper length before any measurement.

Likelihood: $p(y \mid x)$

Now condition on a particular true value, say $x = 192$. Filter the dataset to only the rows where $x = 192$, and histogram the corresponding $y$ values. You get a distribution centred at 192 with some spread — the CV system on average gets it right but with some noise.

This is the likelihood $p(y \mid x)$ (also called the noise model or forward model): given the true value, what does the sensor produce? Typically zero-centred noise: $y = x + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ if the sensor is unbiased.

Posterior: $p(x \mid y)$

Now the question we actually care about: the CV system measured $y = 201$. What’s the distribution over the true $x$?

Filter the dataset to only rows where $y = 201$ (the CV measurement), histogram the corresponding $x$ values. The result is the posterior $p(x \mid y = 201)$ — not centred at 201, but somewhere between 201 and the prior mean of 190.

Why not centred at 201? Because two effects fight each other:

  • The likelihood says “if $x = 201$, the most likely measurement is $y = 201$ — so our best guess given $y = 201$ is $x = 201$.”
  • The prior says “very few penguins have $x \approx 201$; many more have $x \approx 190$ — so most observations of $y = 201$ probably came from a true $x$ closer to 190 with noise that pushed the measurement up.”

Bayes’ theorem combines them. The posterior pulls the likelihood-only estimate (201) back toward the prior mean (190) — a phenomenon called shrinkage or regression to the mean. The exact location depends on the relative widths of prior and likelihood.
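
A numerical sketch of this shrinkage; the specific widths (prior std 10 mm, sensor noise std 5 mm) are illustrative assumptions, not values from the study.

```python
import numpy as np

# Grid over plausible true flipper lengths x (mm).
x = np.linspace(150, 240, 2001)

# Assumed prior: flipper lengths roughly Gaussian around 190 mm (std 10 mm is a guess).
prior = np.exp(-0.5 * ((x - 190) / 10) ** 2)

# Assumed noise model: CV measurement y = x + zero-mean Gaussian noise with std 5 mm.
y_observed = 201.0
likelihood = np.exp(-0.5 * ((y_observed - x) / 5) ** 2)

# Bayes: posterior ∝ likelihood × prior, then normalise so it sums to 1 on the grid.
posterior = likelihood * prior
posterior /= posterior.sum()

print("posterior mean:", (x * posterior).sum())   # ≈ 198.8, pulled from 201 back toward 190
print("posterior mode:", x[posterior.argmax()])   # same pull, since prior and noise are Gaussian
```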

ASIDE — Why the best estimate isn’t 201

Naively you’d think “the sensor reported 201, the noise has zero mean, so the best estimate is 201.” That’s the maximum-likelihood estimate $\hat{x}_{\text{ML}} = \arg\max_x p(y \mid x) = 201$ — it uses only the likelihood. Bayes does better when you have prior information: rare values of $x$ are less likely to have produced a given $y$ than common values, even if the noise model treats them symmetrically. The MAP estimate, $\hat{x}_{\text{MAP}} = \arg\max_x p(y \mid x)\,p(x)$, takes the prior into account and pulls toward the prior mode.
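
For the special case of a Gaussian prior and Gaussian noise (an assumption; the histograms above are only roughly Gaussian), the pull toward the prior has a closed form: a precision-weighted average of the measurement and the prior mean.

$$\hat{x}_{\text{MAP}} = \mathbb{E}[x \mid y] = \frac{\sigma_\varepsilon^{-2}\, y + \sigma_0^{-2}\, \mu_0}{\sigma_\varepsilon^{-2} + \sigma_0^{-2}} \quad \text{for } x \sim \mathcal{N}(\mu_0, \sigma_0^2),\ \ y \mid x \sim \mathcal{N}(x, \sigma_\varepsilon^2)$$

With $\mu_0 = 190$, $y = 201$ and the illustrative widths from the sketch above ($\sigma_0 = 10$ mm, $\sigma_\varepsilon = 5$ mm) this gives $\hat{x} \approx 198.8$, between the measurement and the prior mean, and closer to whichever is more precise.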

Why this matters for generative modelling

The link to conditional generative models is direct: a conditional generative model learns

$$p_\theta(x \mid y)$$

— the posterior distribution. The neural network is a parameterised approximation to Bayes’ theorem.

Examples that fit this template:

| Task | $y$ (condition) | $x$ (target) |
| --- | --- | --- |
| Image colourisation | Greyscale image | Colour image |
| Super-resolution | Low-resolution image | High-resolution image |
| Inpainting | Image with hole | Filled image |
| Semantic-map → photo (pix2pix) | Segmentation mask | Photograph |
| Sketch → photo | Line sketch | Realistic image |
| Text-to-image | Caption | Generated image |

In every case, multiple $x$ values are plausible for a given $y$ — there’s no unique colourisation of a greyscale photo, no unique high-res reconstruction of a low-res image. The posterior $p(x \mid y)$ captures this distribution of plausible answers, not a single point estimate.

The MMSE estimate (and why MSE regression is blurry)

A classical exercise: given a set of noisy measurements $y_1, \dots, y_N$ of an unknown $x$, find the estimate $\hat{x}$ that minimises the squared-error loss

$$L(\hat{x}) = \sum_{i=1}^{N} (\hat{x} - y_i)^2$$

Setting the derivative $\frac{dL}{d\hat{x}} = 2\sum_{i=1}^{N}(\hat{x} - y_i)$ to zero gives the analytical solution $\hat{x} = \frac{1}{N}\sum_{i=1}^{N} y_i$, the sample mean. The mean is the MMSE estimate (Minimum Mean Squared Error) — it’s the estimator with the smallest expected squared error.
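
A quick numerical check of this claim, with arbitrary made-up measurements:

```python
import numpy as np

y = np.array([198.0, 203.5, 201.2, 199.8, 204.1])   # arbitrary noisy measurements

def squared_error(x_hat: float) -> float:
    return float(np.sum((x_hat - y) ** 2))

# Brute-force minimisation over a fine grid vs. the analytical answer.
grid = np.linspace(190, 210, 20001)
best = grid[np.argmin([squared_error(g) for g in grid])]

print(best, y.mean())   # both ≈ 201.32: the sample mean minimises the squared error
```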

Generalising to the conditional setting: a regression network trained with MSE loss to predict $x$ from $y$ will, at convergence, output the conditional mean $\mathbb{E}[x \mid y]$ — the average over all plausible explanations $x$ of the observed $y$.

This is what the generator loss in conditional GANs is fixing. If multiple $x$’s are plausible (e.g. multiple realistic colour assignments for a greyscale photo), the conditional mean is a blurry compromise between them — neither one nor the other, but their pixel-wise average. That’s why an MSE-regression colouriser produces washed-out, low-saturation images, and why a cGAN does much better: it samples from the posterior instead of averaging over it.
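
A toy illustration of why averaging over a multimodal posterior blurs; the two-colour pixel setup is invented for the example.

```python
import numpy as np

# Toy posterior over one pixel's RGB colour: a "red" and a "blue" explanation, equally plausible.
plausible_colours = np.array([
    [1.0, 0.0, 0.0],   # fully red explanation
    [0.0, 0.0, 1.0],   # fully blue explanation
])

mmse_output = plausible_colours.mean(axis=0)               # what an MSE-trained net converges to
sampled_output = plausible_colours[np.random.randint(2)]   # what sampling the posterior gives

print(mmse_output)     # [0.5 0.  0.5]: a washed-out purple that matches neither plausible answer
print(sampled_output)  # one of the two crisp, plausible colours
```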

Connections

  • conditional-generative-model — the direct application: conditional GMs learn $p_\theta(x \mid y)$, a parameterised posterior.
  • generative-model — unconditional GMs learn $p(x)$, the marginal; conditional GMs add a condition $y$.
  • maximum likelihood estimation — choosing $x$ to maximise $p(y \mid x)$ uses only the likelihood, ignoring the prior; the non-Bayesian counterpart of MAP estimation.
  • loss-function — MSE regression converges to the conditional mean (an MMSE estimate, integrating over the posterior); generative models sample from the posterior, preserving multimodality. The choice of loss determines which.