Linear regression treated probabilistically: place a Gaussian prior on the weights, combine with the Gaussian likelihood, and recover a closed-form Gaussian posterior. The posterior mean is the MAP estimate (equivalent to ridge regression); the posterior covariance quantifies how confident we are in those weights. Predictions become a distribution rather than a point.

Motivation — The Underdetermined Case

OLS works when there are more observations than unknowns. With one input/output pair $(x_1, t_1)$ and two unknowns $(w_0, w_1)$, the equation $t_1 = w_0 + w_1 x_1$ has infinitely many solutions: every line through that single point fits perfectly. OLS doesn’t help; the system is underdetermined.

The Bayesian fix: declare an a priori belief about which $\mathbf{w}$’s are plausible. A natural choice is “small weights are more likely than large ones”, formalised as a Gaussian prior centred at zero. Now the posterior over $\mathbf{w}$ is a distribution, peaked where prior and data agree.

The Bayesian Setup

Likelihood (same as standard linear regression):

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(x_n),\, \beta^{-1}\right)$$

where $\beta$ is the noise precision (assumed known here).

Prior — Gaussian, centred at zero, with isotropic precision $\alpha$:

$$p(\mathbf{w}) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I}\right)$$

Larger $\alpha$ → tighter prior → strong preference for small weights. Smaller $\alpha$ → weaker prior → behaviour closer to MLE/OLS.

Posterior by Bayes’ law:

$$p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$$

Conjugacy — Why the Math Closes

A Gaussian likelihood combined with a Gaussian prior gives a Gaussian posterior. This is the canonical example of a conjugate prior — when the posterior stays in the same distributional family as the prior, the math closes in finite form. No integrals to estimate, no MCMC required.

Taking the log of the posterior:

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left(t_n - \mathbf{w}^\top \boldsymbol{\phi}(x_n)\right)^2 - \frac{\alpha}{2}\, \mathbf{w}^\top \mathbf{w} + \text{const}$$

This is quadratic in $\mathbf{w}$, so completing the square gives a Gaussian posterior:

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{m}_N,\, \mathbf{S}_N\right)$$

with:

$$\mathbf{m}_N = \beta\, \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta\, \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$$

where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with rows $\boldsymbol{\phi}(x_n)^\top$.
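To make the update concrete, here is a minimal NumPy sketch of the closed-form posterior. The toy dataset, the bias-plus-slope feature map, and the values of $\alpha$ and $\beta$ are all illustrative assumptions, not taken from the text above:

```python
import numpy as np

# Toy data from an illustrative line t = 0.5 + 2.0*x plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
t = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=20)

alpha, beta = 2.0, 25.0                          # prior precision, noise precision (illustrative)
Phi = np.column_stack([np.ones_like(x), x])      # design matrix: bias feature + raw input

# Posterior precision: S_N^{-1} = alpha*I + beta * Phi^T Phi
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)

# Posterior mean (also the MAP estimate): m_N = beta * S_N Phi^T t
m_N = beta * S_N @ Phi.T @ t

print("posterior mean:", np.round(m_N, 3))
print("posterior covariance:\n", np.round(S_N, 5))
```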

What the Posterior Tells You

  • The mean $\mathbf{m}_N$ is the MAP estimate: the most probable single $\mathbf{w}$ under the posterior. It coincides with the ridge regression solution: OLS with an L2 regularisation term added to the loss.
  • The covariance $\mathbf{S}_N$ quantifies uncertainty. A tight posterior means the data has narrowed $\mathbf{w}$ down confidently; a wide posterior means many $\mathbf{w}$’s are still plausible.
  • As $N$ grows, the posterior tightens. With only a few points, the posterior is barely narrower than the prior; with many, it shrinks to a tiny ellipse around the true weights (see the sketch below).
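A small sketch of that tightening, reusing the closed-form posterior on a synthetic line (data and hyperparameter values are illustrative); the determinant of $\mathbf{S}_N$ is one way to summarise the posterior volume:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0                          # illustrative hyperparameters

def posterior(x, t):
    """Closed-form posterior for a bias + slope model."""
    Phi = np.column_stack([np.ones_like(x), x])
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

x_all = rng.uniform(-1.0, 1.0, size=100)
t_all = 0.5 + 2.0 * x_all + rng.normal(scale=0.2, size=100)

for n in (1, 5, 20, 100):
    m_N, S_N = posterior(x_all[:n], t_all[:n])
    # det(S_N) measures the posterior "volume": it shrinks as N grows
    print(f"N={n:3d}  mean={np.round(m_N, 3)}  det(S_N)={np.linalg.det(S_N):.2e}")
```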

Predictive Distribution

A non-Bayesian model gives a single prediction for a new input $x_*$. A Bayesian model gives a distribution over predictions, integrating over all plausible $\mathbf{w}$’s:

$$p(t_* \mid x_*, \mathbf{t}) = \int p(t_* \mid x_*, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w}$$

For Gaussian likelihood + Gaussian posterior, this integral is closed-form Gaussian:

$$p(t_* \mid x_*, \mathbf{t}) = \mathcal{N}\!\left(t_* \mid \mathbf{m}_N^\top \boldsymbol{\phi}(x_*),\, \sigma_N^2(x_*)\right), \qquad \sigma_N^2(x_*) = \beta^{-1} + \boldsymbol{\phi}(x_*)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_*)$$

The variance has two contributions: $\beta^{-1}$ (irreducible noise) and $\boldsymbol{\phi}(x_*)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_*)$ (model uncertainty, which vanishes as data accumulates).
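A sketch of the predictive mean and variance under the same toy setup (data and hyperparameter values are illustrative); note how the model-uncertainty term grows for inputs far from the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
t = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=20)

alpha, beta = 2.0, 25.0
Phi = np.column_stack([np.ones_like(x), x])
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predict(x_star):
    """Predictive mean and variance at a new input."""
    phi = np.array([1.0, x_star])                # feature vector for the new input
    mean = m_N @ phi                             # m_N^T phi(x*)
    var = 1.0 / beta + phi @ S_N @ phi           # noise term + model-uncertainty term
    return mean, var

for x_star in (0.0, 0.5, 2.0):                   # 2.0 lies outside the training range
    mean, var = predict(x_star)
    print(f"x*={x_star:4.1f}  predictive mean={mean:6.3f}  std={np.sqrt(var):.3f}")
```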

Why this matters in practice

Bayesian regression doesn’t just give a point prediction; it gives a prediction with an uncertainty attached. For decision-making (medical risk, financial pricing), the variance is as important as the mean. A confident wrong answer is worse than a hedged correct one.

MAP and Ridge Regression

Maximising the log-posterior with respect to $\mathbf{w}$ is equivalent to minimising:

$$\frac{1}{2} \sum_{n=1}^{N} \left(t_n - \mathbf{w}^\top \boldsymbol{\phi}(x_n)\right)^2 + \frac{\lambda}{2}\, \mathbf{w}^\top \mathbf{w}, \qquad \lambda = \frac{\alpha}{\beta}$$

The first term is the OLS objective. The second is an L2 penalty on $\mathbf{w}$, with regularisation coefficient $\lambda = \alpha/\beta$. This is exactly ridge regression.

The Bayesian view explains why L2 regularisation works: it’s the negative log-prior of a zero-mean Gaussian belief. Larger $\alpha$ (tighter prior) → larger $\lambda$ (more shrinkage) → flatter, simpler models. Smaller $\alpha$ → looser prior → behaviour closer to MLE/OLS.

| $L_p$ regulariser | Equivalent prior |
| --- | --- |
| L2 ($\lVert\mathbf{w}\rVert_2^2$) | Gaussian: $p(\mathbf{w}) \propto \exp\!\left(-\tfrac{\alpha}{2}\lVert\mathbf{w}\rVert_2^2\right)$ |
| L1 ($\lVert\mathbf{w}\rVert_1$) | Laplace: $p(\mathbf{w}) \propto \exp\!\left(-\alpha\lVert\mathbf{w}\rVert_1\right)$ |
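A quick numerical check of the MAP/ridge equivalence on synthetic data with illustrative $\alpha$ and $\beta$: the posterior mean and the ridge solution with $\lambda = \alpha/\beta$ coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
t = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=30)
Phi = np.column_stack([np.ones_like(x), x])

alpha, beta = 2.0, 25.0
lam = alpha / beta                               # ridge coefficient lambda = alpha / beta

# Bayesian posterior mean
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Ridge solution: (Phi^T Phi + lambda*I)^{-1} Phi^T t
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(2), Phi.T @ t)

print("MAP mean :", np.round(m_N, 6))
print("ridge    :", np.round(w_ridge, 6))
print("identical:", np.allclose(m_N, w_ridge))
```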

When It Beats MLE

A canonical example: degree-8 polynomial regression on 10 data points.

  • MLE fits all 10 points (almost) exactly, but oscillates wildly between them — terrible test performance.
  • MAP with a Gaussian prior produces a smoother fit. The prior says “small weights are more likely,” which prevents the polynomial coefficients from blowing up to interpolate every training point.

Same data, same model class — different criterion, dramatically different generalisation.
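A sketch of that comparison, assuming a noisy sine-like target and illustrative hyperparameters (the original example’s data is not specified beyond “10 points”); the MLE coefficients typically blow up while the MAP fit stays tame:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=10)
x_test = np.linspace(0.0, 1.0, 200)
t_test = np.sin(2 * np.pi * x_test)

def poly_features(x, degree=8):
    return np.vander(x, degree + 1, increasing=True)   # [1, x, x^2, ..., x^8]

Phi = poly_features(x_train)
alpha, beta = 1e-3, 100.0                              # illustrative hyperparameters

# MLE / OLS: plain least squares, no prior
w_mle, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)

# MAP: posterior mean under the zero-mean Gaussian prior
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
w_map = beta * S_N @ Phi.T @ t_train

Phi_test = poly_features(x_test)
for name, w in [("MLE", w_mle), ("MAP", w_map)]:
    rmse = np.sqrt(np.mean((Phi_test @ w - t_test) ** 2))
    print(f"{name}: max |coef| = {np.abs(w).max():10.2f}   test RMSE = {rmse:.3f}")
```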

Practical Notes

  • Hyperparameters $\alpha, \beta$. Treated as fixed constants in the basic setup. In practice, choose them by cross-validation, by maximising the evidence (empirical Bayes), or by placing hyperpriors on them and integrating those out too.
  • Computation. Forming $\mathbf{S}_N$ requires inverting an $M \times M$ matrix ($M$ = number of basis functions), the same cost as OLS. The posterior mean then reduces to one matrix-vector product.
  • Online updates. The posterior from $N$ points becomes the prior for the $(N\!+\!1)$-th: you can update one point at a time without re-processing the dataset (see the sketch below).
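For the online-update point above, a sketch of the sequential rule, where each step’s posterior precision and mean become the next step’s prior (data and hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                          # illustrative hyperparameters

# Start from the prior: mean 0, precision alpha * I
m = np.zeros(2)
S_inv = alpha * np.eye(2)

def update(m, S_inv, x_n, t_n):
    """Absorb one observation; the current posterior acts as the prior."""
    phi = np.array([1.0, x_n])
    S_inv_new = S_inv + beta * np.outer(phi, phi)            # precision accumulates
    m_new = np.linalg.solve(S_inv_new, S_inv @ m + beta * phi * t_n)
    return m_new, S_inv_new

for _ in range(50):
    x_n = rng.uniform(-1.0, 1.0)
    t_n = 0.5 + 2.0 * x_n + rng.normal(scale=0.2)
    m, S_inv = update(m, S_inv, x_n, t_n)

print("mean after 50 one-at-a-time updates:", np.round(m, 3))
```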

What Could Go Wrong

  • Wrong noise model. If $\beta$ (the noise precision) is misspecified, the posterior is wrong. Either learn $\beta$ from data or place a prior on it (a Gamma prior is conjugate to a Gaussian precision).
  • Wrong prior shape. A zero-centred Gaussian assumes weights “should be small.” If the truth is sparse (few non-zero weights), use a Laplace prior — recovers L1 / lasso.
  • High dimensions. $\mathbf{S}_N^{-1}$ is $M \times M$; inverting it becomes expensive past a few thousand parameters.

Connections

Active Recall