A rule for inverting conditional probabilities: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. The Bayesian view treats parameters $\theta$ as random variables with a prior distribution; observed data updates this prior to a posterior. Stands in contrast to the frequentist view, which treats $\theta$ as a fixed unknown to be point-estimated.

The Rule

For events:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

For continuous parameters $\theta$ with prior density $p(\theta)$ and likelihood $p(\mathcal{D} \mid \theta)$:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}$$

The denominator $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$ — the marginal likelihood or “evidence” — is a normalising constant that ensures the posterior integrates to 1. It rarely needs to be computed explicitly because of the proportionality:

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$$

Or in slogan form: posterior ∝ likelihood × prior.
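In code, the slogan is just a pointwise multiply followed by normalisation. A minimal numerical sketch — the data (7 heads in 10 coin flips) and the Beta(2, 2) prior are illustrative assumptions, not from the text above:

```python
import numpy as np

# Hypothetical setup: infer a coin's heads-probability theta from
# 7 heads in 10 flips, with a Beta(2, 2) prior. Everything is kept
# only "up to a constant" until the final normalisation step.
theta = np.linspace(0.001, 0.999, 999)     # grid of candidate parameters
prior = theta * (1 - theta)                # Beta(2, 2) density, up to a constant
likelihood = theta**7 * (1 - theta)**3     # binomial likelihood, up to a constant

unnormalised = likelihood * prior          # posterior ∝ likelihood × prior
step = theta[1] - theta[0]
posterior = unnormalised / (unnormalised.sum() * step)   # normalise numerically

# Analytically the posterior is Beta(9, 5), whose mean is 9/14 ≈ 0.643;
# the grid approximation recovers it without ever computing the evidence.
post_mean = (posterior * theta).sum() * step
print(post_mean)
```

Note that the evidence never appeared in closed form — we divided by a numerical integral, which is all the normalising constant ever does.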

What Each Term Means

  • Prior $p(\theta)$ — what we believed about the parameters before seeing the data. Encodes domain knowledge, regularisation, or genuine ignorance (uniform, weak, or non-informative priors).
  • Likelihood $p(\mathcal{D} \mid \theta)$ — how plausible the observed data is under each candidate $\theta$. Same object that MLE maximises.
  • Posterior $p(\theta \mid \mathcal{D})$ — what we believe after seeing the data. Combines prior knowledge with observational evidence.
  • Evidence $p(\mathcal{D})$ — the probability of the data marginalised over all $\theta$. Useful for model comparison; usually ignored when computing the posterior shape.

Frequentist vs Bayesian

The two paradigms differ on what $\theta$ is:

| Frequentist | Bayesian |
| --- | --- |
| $\theta$ is a fixed unknown constant | $\theta$ is a random variable with a distribution |
| Data is random; $\theta$ is not | Data is observed; $\theta$ is the unknown |
| Confidence in $\theta$ comes from imagining repeating the experiment | Confidence in $\theta$ is a probability distribution |
| Tool: MLE, hypothesis tests, confidence intervals | Tool: Bayes’ law, credible intervals, posterior predictive |

A Bayesian works with the one dataset they actually observed and quantifies uncertainty in $\theta$ as a probability distribution. A frequentist asks how the estimator would behave across many hypothetical datasets and uses cross-validation, bootstrapping, or sampling distributions.

The frameworks aren’t always at odds — many results overlap, and Bayesian inference with a flat prior often recovers the MLE. But the interpretations differ.
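The flat-prior/MLE overlap is easy to check numerically. A sketch with hypothetical coin data (7 heads in 10 flips, so the MLE is $k/n = 0.7$):

```python
import numpy as np

# Hypothetical data: 7 heads in 10 flips. The MLE of the heads-probability
# is k/n = 0.7.
theta = np.linspace(0.001, 0.999, 999)
likelihood = theta**7 * (1 - theta)**3   # binomial likelihood, up to a constant

flat_prior = np.ones_like(theta)         # uniform prior on (0, 1)
posterior = likelihood * flat_prior      # posterior shape = likelihood shape

# With a flat prior, the posterior mode (MAP) sits exactly where the
# likelihood peaks — i.e. at the MLE.
map_estimate = theta[np.argmax(posterior)]
print(map_estimate)   # 0.7
```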

Worked Example — Disease Test

A disease affects 1% of the population. A test is 99% accurate (99% sensitivity and 99% specificity). You test positive. What’s the probability you have the disease?

Apply Bayes:

$$P(\text{disease} \mid +) = \frac{P(+ \mid \text{disease})\,P(\text{disease})}{P(+)}$$

The denominator expands by total probability:

$$P(+) = P(+ \mid \text{disease})\,P(\text{disease}) + P(+ \mid \text{healthy})\,P(\text{healthy}) = 0.99 \times 0.01 + 0.01 \times 0.99 = 0.0198$$

So:

$$P(\text{disease} \mid +) = \frac{0.99 \times 0.01}{0.0198} = \frac{0.0099}{0.0198} = 0.5$$

Despite the 99%-accurate test, a positive result only means a 50% chance of having the disease. The intuition: the healthy 99% of the population produces exactly as many false positives ($0.99 \times 0.01 = 0.0099$) as the sick 1% produces true positives ($0.01 \times 0.99 = 0.0099$), because the prior is so small.
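The arithmetic above in a few lines of Python, with the numbers taken directly from the example:

```python
# Figures from the example: 1% prevalence, 99% sensitivity, 99% specificity.
prevalence = 0.01
sensitivity = 0.99        # P(+ | disease)
false_pos = 1 - 0.99      # P(+ | healthy) = 1 - specificity

# Denominator via total probability:
p_pos = sensitivity * prevalence + false_pos * (1 - prevalence)

# Bayes' rule:
p_disease_given_pos = sensitivity * prevalence / p_pos
print(p_disease_given_pos)   # 0.5
```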

This is the classic illustration of why base rates matter — and why MLE-style “ignore the prior” reasoning misleads in low-prevalence settings.

Example — Bayesian Linear Regression (preview)

For linear regression, the Bayesian recipe is:

  1. Place a prior over weights, e.g., $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma_w^2 I)$ — small weights are a priori more likely.
  2. The likelihood, assuming Gaussian noise, is $p(\mathbf{y} \mid X, \mathbf{w}) = \mathcal{N}(X\mathbf{w}, \sigma^2 I)$.
  3. Apply Bayes: $p(\mathbf{w} \mid X, \mathbf{y}) \propto p(\mathbf{y} \mid X, \mathbf{w})\,p(\mathbf{w})$.
  4. The posterior is itself Gaussian (by conjugacy); its mean is the MAP/Bayes estimate of $\mathbf{w}$.

The MAP estimate ends up being equivalent to ridge regression — the prior acts as L2 regularisation. This is one of the cleanest “regularisation = Bayesian prior” connections in ML.
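A sketch of that equivalence on toy data — the design matrix, true weights, noise level, and penalty strength below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # toy design matrix (assumed)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)    # Gaussian observation noise

# With prior w ~ N(0, s_w^2 I) and likelihood y | X, w ~ N(Xw, s^2 I),
# the posterior is Gaussian and its mean (= MAP) is
#     (X^T X + lam I)^{-1} X^T y,   where lam = s^2 / s_w^2
# — exactly the ridge-regression solution with penalty lam.
lam = 0.5
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)   # close to w_true, shrunk slightly toward zero by the prior
```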

(Bayesian linear regression itself is covered in later weeks; this is foreshadowing.)

When Priors Matter

  • Small datasets — the likelihood is weak, so the prior dominates the posterior. Choose carefully.
  • Domain knowledge — if you know weights should be small, sparse, or positive, encode that as a prior.
  • Regularisation — many regularisation schemes have Bayesian interpretations (L2 ↔ Gaussian prior, L1 ↔ Laplace prior).
  • Avoiding overfitting — even uninformative priors can prevent MLE pathologies (e.g., infinite weights in separable logistic regression).
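The separable-logistic-regression pathology shows up even in a toy sketch (the data, step size, and penalty below are made up for illustration): on perfectly separated 1-D data, gradient descent on the plain negative log-likelihood pushes the weight up without bound, while an L2 penalty — a Gaussian prior — pins it to a finite optimum.

```python
import numpy as np

# Toy perfectly separable data: sign(x) predicts y exactly, so the
# unregularised MLE has no finite optimum (w -> infinity).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(lam, steps=5000, lr=0.1):
    """Gradient descent on negative log-likelihood + (lam/2) * w**2."""
    w = 0.0
    for _ in range(steps):
        grad = np.sum((sigmoid(w * x) - y) * x) + lam * w
        w -= lr * grad
    return w

w_mle = fit(lam=0.0)   # keeps growing with more steps — no finite optimum
w_map = fit(lam=1.0)   # the Gaussian prior caps it at a finite value (~1 here)
print(w_mle, w_map)
```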

When not to lean on priors:

  • Lots of data — the likelihood overwhelms the prior; results are nearly identical to MLE.
  • No defensible prior — choosing a “non-informative” prior is itself a non-trivial decision (Jeffreys prior, reference prior, etc.).

Connections

  • maximum likelihood estimation — coincides with the MAP estimate under a flat prior (or in the large-data limit). MLE is the “no prior” special case.
  • gaussian-distribution — the most common conjugate prior pair: Gaussian prior + Gaussian likelihood → Gaussian posterior, all in closed form.
  • ordinary-least-squares — its MLE-equivalence under Gaussian noise can be extended to ridge regression by adding a Gaussian prior on $\mathbf{w}$.

Active Recall