A rule for inverting conditional probabilities: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. The Bayesian view treats parameters as random variables with a prior distribution; observed data updates this prior to a posterior. Stands in contrast to the frequentist view, which treats $\theta$ as a fixed unknown to be point-estimated.
The Rule
For events:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

For continuous parameters $\theta$ with prior density $p(\theta)$ and likelihood $p(\mathcal{D} \mid \theta)$:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}$$
The denominator $p(\mathcal{D})$ — the marginal likelihood or “evidence” — is a normalising constant that ensures the posterior integrates to 1. It rarely needs to be computed explicitly because of the proportionality:

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$$
Or in slogan form: posterior ∝ likelihood × prior.
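The slogan can be made concrete with a grid approximation — a minimal sketch using hypothetical coin-flip data (7 heads in 10 flips) and a flat prior, where normalisation is a simple sum rather than the analytic evidence integral:

```python
import numpy as np

# Grid approximation of "posterior ∝ likelihood × prior" for a coin's
# bias theta, given hypothetical data: 7 heads and 3 tails.
theta = np.linspace(0.0, 1.0, 1001)       # candidate parameter values
prior = np.ones_like(theta)               # flat prior over [0, 1]
likelihood = theta**7 * (1 - theta)**3    # Bernoulli likelihood of 7H/3T

unnormalised = likelihood * prior
dtheta = theta[1] - theta[0]
# Divide by the (approximate) evidence so the density integrates to 1.
posterior = unnormalised / (unnormalised.sum() * dtheta)

print(posterior.sum() * dtheta)           # ≈ 1.0 — a proper density
print(theta[np.argmax(posterior)])        # mode ≈ 0.7 — equals the MLE, since the prior is flat
```

The evidence never appears as a separate formula — it is recovered implicitly as the normalising sum, which is exactly why it can be ignored when only the posterior's shape matters.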
What Each Term Means
- Prior $p(\theta)$ — what we believed about the parameters before seeing the data. Encodes domain knowledge, regularisation, or genuine ignorance (uniform, weak, or non-informative priors).
- Likelihood $p(\mathcal{D} \mid \theta)$ — how plausible the observed data is under each candidate $\theta$. The same object that MLE maximises.
- Posterior $p(\theta \mid \mathcal{D})$ — what we believe after seeing the data. Combines prior knowledge with observational evidence.
- Evidence $p(\mathcal{D})$ — the probability of the data marginalised over all $\theta$. Useful for model comparison; usually ignored when computing the posterior shape.
Frequentist vs Bayesian
The two paradigms differ on what $\theta$ is:
| Frequentist | Bayesian |
|---|---|
| $\theta$ is a fixed unknown constant | $\theta$ is a random variable with a distribution |
| Data is random; $\theta$ is not | Data is observed; $\theta$ is the unknown |
| Confidence in $\theta$ comes from imagining repeating the experiment | Confidence in $\theta$ is a probability distribution |
| Tool: MLE, hypothesis tests, confidence intervals | Tool: Bayes’ law, credible intervals, posterior predictive |
A Bayesian works with the one dataset they actually observed and quantifies uncertainty in $\theta$ as a probability distribution. A frequentist asks how the estimator would behave across many hypothetical datasets and uses cross-validation, bootstrapping, or sampling distributions.
The frameworks aren’t always at odds — many results overlap, and Bayesian inference with a flat prior often recovers the MLE. But the interpretations differ.
Worked Example — Disease Test
A disease affects 1% of the population. A test is 99% accurate (99% sensitivity and 99% specificity). You test positive. What’s the probability you have the disease?
Apply Bayes:

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+)}$$

The denominator expands by total probability:

$$P(+) = P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D) = 0.99 \times 0.01 + 0.01 \times 0.99 = 0.0198$$

So:

$$P(D \mid +) = \frac{0.99 \times 0.01}{0.0198} = 0.5$$
Despite the 99%-accurate test, a positive result only means a 50% chance of having the disease. The intuition: false positives from the 99% of the population who are healthy nearly match true positives from the 1% who are sick, because the prior is so small.
This is the classic illustration of why base rates matter — and why MLE-style “ignore the prior” reasoning misleads in low-prevalence settings.
Example — Bayesian Linear Regression (preview)
For linear regression, the Bayesian recipe is:
- Place a prior over weights, e.g., $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \tau^2 I)$ — small weights are a priori more likely.
- The likelihood, assuming Gaussian noise, is $p(\mathbf{y} \mid X, \mathbf{w}) = \mathcal{N}(X\mathbf{w}, \sigma^2 I)$.
- Apply Bayes: $p(\mathbf{w} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \mathbf{w})\,p(\mathbf{w})$.
- The posterior is itself Gaussian (by conjugacy); its mean is the MAP/Bayes estimate of $\mathbf{w}$.
The MAP estimate ends up being equivalent to ridge regression — the prior acts as L2 regularisation. This is one of the cleanest “regularisation = Bayesian prior” connections in ML.
(Bayesian linear regression itself is covered in later weeks; this is foreshadowing.)
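The MAP-equals-ridge connection can be sketched numerically. A minimal example under assumed values ($\tau = 1$, noise $\sigma = 0.1$, so $\lambda = \sigma^2/\tau^2$; the synthetic data and variable names are illustrative, not a fixed recipe):

```python
import numpy as np

# MAP under a Gaussian prior N(0, tau^2 I) with Gaussian noise N(0, sigma^2)
# coincides with ridge regression, lambda = sigma^2 / tau^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # synthetic design matrix
w_true = np.array([1.0, -2.0, 0.5])             # illustrative "true" weights
y = X @ w_true + 0.1 * rng.normal(size=50)      # targets with Gaussian noise

sigma2, tau2 = 0.1**2, 1.0**2
lam = sigma2 / tau2                             # ridge strength implied by the prior

# Closed-form ridge / posterior-mean solution: (X^T X + lambda I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)  # close to w_true
```

A tighter prior (smaller $\tau^2$) raises $\lambda$ and shrinks the weights harder toward zero — regularisation strength is just prior strength in disguise.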
When Priors Matter
- Small datasets — the likelihood is weak, so the prior dominates the posterior. Choose carefully.
- Domain knowledge — if you know weights should be small, sparse, or positive, encode that as a prior.
- Regularisation — many regularisation schemes have Bayesian interpretations (L2 ↔ Gaussian prior, L1 ↔ Laplace prior).
- Avoiding overfitting — even uninformative priors can prevent MLE pathologies (e.g., infinite weights in separable logistic regression).
When not to lean on priors:
- Lots of data — the likelihood overwhelms the prior; results are nearly identical to MLE.
- No defensible prior — choosing a “non-informative” prior is itself a non-trivial decision (Jeffreys prior, reference prior, etc.).
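The "small data: prior dominates, big data: likelihood dominates" trade-off is easy to see in a conjugate Beta–Bernoulli update — a sketch with illustrative numbers (a strong fair-coin prior, then the same prior against 1000× more data):

```python
# Beta(alpha, beta) prior + Bernoulli observations -> Beta(alpha + heads, beta + tails).
def posterior_mean(alpha, beta, heads, tails):
    # Posterior mean of the coin's bias under the conjugate update.
    return (alpha + heads) / (alpha + beta + heads + tails)

# Strong prior that the coin is fair: Beta(50, 50).
print(posterior_mean(50, 50, heads=7, tails=3))        # ≈ 0.518 — prior dominates
print(posterior_mean(50, 50, heads=7000, tails=3000))  # ≈ 0.698 — data dominates (MLE = 0.7)
```

With 10 flips the posterior barely moves off the prior's 0.5; with 10,000 flips the same prior is nearly irrelevant and the posterior mean sits next to the MLE.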
Connections
- maximum likelihood estimation — MAP under a flat prior (or in the large-data limit) recovers the MLE. MLE is the “no prior” special case.
- gaussian-distribution — the most common conjugate prior pair: Gaussian prior + Gaussian likelihood → Gaussian posterior, all in closed form.
- ordinary-least-squares — its MLE-equivalence under Gaussian noise can be extended to ridge regression by adding a Gaussian prior on $\mathbf{w}$.
Active Recall
State Bayes' law in the "posterior ∝ likelihood × prior" form. What's been dropped?
$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$. The denominator $p(\mathcal{D})$ — the marginal likelihood — has been absorbed into the proportionality. Since it doesn’t depend on $\theta$, it doesn’t change the shape of the posterior; it’s only needed when normalising or comparing across models.
A test is 99% accurate. The disease is rare (1% prevalence). You test positive. Why isn't your probability of having the disease close to 99%?
Because base rates matter. The 99% accuracy means few false positives per healthy person, but there are 99× more healthy people than sick people. So the count of false positives among healthy folks is comparable to the count of true positives among sick folks. After applying Bayes’ law: $P(D \mid +) = \frac{0.99 \times 0.01}{0.0198} = 0.5$. The prior pulls the posterior down hard.
What's the relationship between MLE and Bayesian inference?
MLE picks the single $\theta$ that maximises the likelihood, ignoring the prior. Bayesian inference returns the full posterior distribution over $\theta$, weighted by the prior. With a flat prior (or as data grows), the MAP (posterior mode) converges to the MLE. So MLE is the “no-prior” or “infinite-data” limit of Bayesian inference.