Linear regression treated probabilistically: place a Gaussian prior on the weights, combine it with the Gaussian likelihood, and recover a closed-form Gaussian posterior. The posterior mean is the MAP estimate (equivalent to ridge regression); the posterior covariance quantifies how confident we are in those weights. Predictions become a distribution rather than a point.
Motivation — The Underdetermined Case
OLS works when there are more observations than unknowns. With one input/output pair $(x_1, t_1)$ and two unknowns $(w_0, w_1)$, the equation $t_1 = w_0 + w_1 x_1$ has infinitely many solutions — every line through that single point fits perfectly. OLS doesn’t help; the system is underdetermined.
The Bayesian fix: declare an a-priori belief about which $\mathbf{w}$’s are plausible. A natural choice is “small weights are more likely than large ones” — formalised as a Gaussian prior centred at zero. Now the posterior over $\mathbf{w}$ is a distribution, peaked where prior + data agree.
The Bayesian Setup
Likelihood (same as standard linear regression):

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \,\middle|\, \mathbf{w}^\top \boldsymbol{\phi}(x_n),\ \beta^{-1}\right)$$

where $\beta$ is the noise precision (assumed known here).
Prior — Gaussian, centred at zero, with isotropic precision $\alpha$:

$$p(\mathbf{w}) = \mathcal{N}\!\left(\mathbf{w} \,\middle|\, \mathbf{0},\ \alpha^{-1}\mathbf{I}\right)$$

Larger $\alpha$ → tighter prior → strong preference for small weights. Smaller $\alpha$ → weaker prior → behaviour closer to MLE/OLS.
Posterior by Bayes’ law:

$$p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$$
Conjugacy — Why the Math Closes
A Gaussian likelihood combined with a Gaussian prior gives a Gaussian posterior. This is the canonical example of a conjugate prior — when the posterior stays in the same distributional family as the prior, the math closes in finite form. No integrals to estimate, no MCMC required.
Taking the log of $p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$:

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(x_n) \right)^2 - \frac{\alpha}{2}\, \mathbf{w}^\top \mathbf{w} + \text{const}$$

This is quadratic in $\mathbf{w}$, so completing the square gives a Gaussian posterior:

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

with:

$$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta\, \boldsymbol{\Phi}^\top \boldsymbol{\Phi}, \qquad \mathbf{m}_N = \beta\, \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}$$

where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with rows $\boldsymbol{\phi}(x_n)^\top$.
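A minimal sketch of the posterior computation in NumPy — the function name `posterior` and the docstring conventions are mine, following the formulas above:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for Bayesian linear regression.

    Phi   : (N, M) design matrix with rows phi(x_n)^T
    t     : (N,) target vector
    alpha : prior precision;  beta : noise precision
    """
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi  # S_N^{-1} = alpha I + beta Phi^T Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ (Phi.T @ t)                    # m_N = beta S_N Phi^T t
    return m_N, S_N
```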
What the Posterior Tells You
- The mean $\mathbf{m}_N$ is the MAP estimate — the most probable single $\mathbf{w}$ under the posterior. It coincides with the ridge regression solution: it’s OLS with the regularisation term $\frac{\alpha}{\beta}\lVert \mathbf{w} \rVert^2$ added to the loss.
- The covariance $\mathbf{S}_N$ quantifies uncertainty. A tight posterior means the data has narrowed $\mathbf{w}$ down confidently; a wide posterior means many $\mathbf{w}$’s are still plausible.
- As $N$ grows, the posterior tightens. With only a handful of points, the posterior is barely narrower than the prior; with many, it’s a tiny ellipse around the true weights (illustrated in the sketch below).
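A quick illustration of that tightening, reusing `posterior` from the sketch above (the toy data and precision values are assumed, not from the source):

```python
# Watch the posterior volume (determinant of S_N) collapse as N grows.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
for N in (1, 10, 100):
    Phi = rng.normal(size=(N, 2))
    t = Phi @ w_true + 0.1 * rng.normal(size=N)
    m_N, S_N = posterior(Phi, t, alpha=1.0, beta=100.0)
    print(N, np.linalg.det(S_N))  # shrinks rapidly with more data
```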
Predictive Distribution
A non-Bayesian model gives a single prediction for a new input $x_\ast$. A Bayesian model gives a distribution over predictions, integrating over all plausible $\mathbf{w}$’s:

$$p(t_\ast \mid x_\ast, \mathbf{t}) = \int p(t_\ast \mid x_\ast, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w}$$

For Gaussian likelihood + Gaussian posterior, this integral is closed-form Gaussian:

$$p(t_\ast \mid x_\ast, \mathbf{t}) = \mathcal{N}\!\left(t_\ast \,\middle|\, \mathbf{m}_N^\top \boldsymbol{\phi}(x_\ast),\ \sigma_N^2(x_\ast)\right), \qquad \sigma_N^2(x_\ast) = \frac{1}{\beta} + \boldsymbol{\phi}(x_\ast)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_\ast)$$

The variance has two contributions: $1/\beta$ (irreducible noise) and $\boldsymbol{\phi}(x_\ast)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_\ast)$ (model uncertainty, vanishing as data accumulates).
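The predictive distribution is then one dot product and one quadratic form — again a sketch with assumed names, continuing from `posterior` above:

```python
def predict(phi_star, m_N, S_N, beta):
    """Predictive mean and variance at a new feature vector phi(x_*)."""
    mean = m_N @ phi_star
    var = 1.0 / beta + phi_star @ S_N @ phi_star  # noise + model uncertainty
    return mean, var
```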
Why this matters in practice
Bayesian regression doesn’t just give a point prediction $\hat{t}$ — it gives the full distribution $p(t_\ast \mid x_\ast, \mathbf{t})$. For decision-making (medical risk, financial pricing), the variance is as important as the mean. A confident wrong answer is worse than a hedged correct one.
MAP and Ridge Regression
Maximising the log-posterior with respect to $\mathbf{w}$:

$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(x_n) \right)^2 + \frac{\alpha}{2}\, \mathbf{w}^\top \mathbf{w} \right]$$

The first term is the OLS objective. The second is an L2 penalty on $\mathbf{w}$, with regularisation coefficient $\lambda = \alpha/\beta$. This is exactly ridge regression.
The Bayesian view explains why L2 regularisation works: it’s the negative log-prior of a zero-mean Gaussian belief. Larger $\alpha$ (tighter prior) → larger $\lambda$ (more shrinkage) → flatter, simpler models. Smaller $\alpha$ → looser prior → behaviour closer to MLE/OLS.
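A quick numerical check of the equivalence, reusing `posterior` from above (toy data and precision values assumed): the posterior mean should match the ridge solution with $\lambda = \alpha/\beta$.

```python
# Verify m_N equals the ridge solution with lambda = alpha / beta.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))
t = Phi @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=20)

alpha, beta = 2.0, 25.0
m_N, _ = posterior(Phi, t, alpha, beta)

lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ t)
assert np.allclose(m_N, w_ridge)  # identical up to floating point
```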
| $L_p$ regulariser | Equivalent prior |
|---|---|
| L2 ($\lVert \mathbf{w} \rVert_2^2$) | Gaussian: $p(w_j) \propto \exp\!\left(-\tfrac{\alpha}{2} w_j^2\right)$ |
| L1 ($\lVert \mathbf{w} \rVert_1$) | Laplace: $p(w_j) \propto \exp\!\left(-\gamma \lvert w_j \rvert\right)$ |
When It Beats MLE
A canonical example: degree-8 polynomial regression on 10 points.
- MLE fits all 10 points (almost) exactly, but oscillates wildly between them — terrible test performance.
- MAP with a Gaussian prior produces a smoother fit. The prior says “small weights are more likely,” which prevents the polynomial coefficients from blowing up to interpolate every training point.
Same data, same model class — different criterion, dramatically different generalisation.
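A sketch of that experiment — the degree and point count come from the text; the sine target, noise level, and precision values are assumptions:

```python
# Degree-8 polynomial, 10 points: MLE interpolates, MAP stays smooth.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)  # assumed toy target
Phi = np.vander(x, 9, increasing=True)                 # columns 1, x, ..., x^8

w_mle = np.linalg.lstsq(Phi, t, rcond=None)[0]         # huge, oscillating coefficients
w_map, _ = posterior(Phi, t, alpha=1e-3, beta=100.0)   # prior shrinks them
print(np.abs(w_mle).max(), np.abs(w_map).max())
```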
Practical Notes
- Hyperparameters $\alpha, \beta$. Treated as fixed constants in the basic setup. In practice, choose them by cross-validation, by maximising the evidence (empirical Bayes), or by placing hyperpriors and integrating those out too.
- Computation. Forming $\mathbf{S}_N$ requires inverting an $M \times M$ matrix — same cost as OLS. The posterior mean then reduces to one matrix-vector product.
- Online updates. The posterior from $N$ points becomes the prior for the $(N{+}1)$th — you can update one point at a time without re-processing the dataset (see the sketch after this list).
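A sketch of the online update in natural parameters (precision, and precision times mean); the helper names are mine:

```python
def init_state(M, alpha):
    """Prior as natural parameters: (S_0^{-1}, S_0^{-1} m_0) with m_0 = 0."""
    return alpha * np.eye(M), np.zeros(M)

def update(state, phi_n, t_n, beta):
    """Fold in one observation; the posterior so far becomes the new prior."""
    S_inv, h = state
    S_inv = S_inv + beta * np.outer(phi_n, phi_n)  # rank-1 precision update
    h = h + beta * t_n * phi_n
    return S_inv, h

# At any point: m_N = np.linalg.solve(S_inv, h), S_N = np.linalg.inv(S_inv).
```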
What Could Go Wrong
- Wrong noise model. If $\beta$ (the noise precision) is misspecified, the posterior is wrong. Either learn $\beta$ from data or place a prior on it (a Gamma prior is conjugate for a Gaussian’s precision).
- Wrong prior shape. A zero-centred Gaussian assumes weights “should be small.” If the truth is sparse (few non-zero weights), use a Laplace prior — that recovers L1 / lasso.
- High dimensions. $\mathbf{S}_N$ is $M \times M$; inverting it becomes expensive past a few thousand parameters.
Connections
- linear-regression — the underlying model whose weights we’re inferring.
- ordinary-least-squares — the MLE-equivalent point estimate; recovered as $\alpha \to 0$ (flat prior).
- ridge-regression — the MAP point estimate; recovered by maximising the posterior.
- bayes-law — the rule combining prior and likelihood.
- gaussian-distribution — the conjugate self-pairing that makes everything closed-form.
- maximum likelihood estimation — the no-prior limit; equivalent to OLS under Gaussian noise.
Active Recall
Why is Gaussian-likelihood + Gaussian-prior a "conjugate" pair?
Because the resulting posterior is also Gaussian — same family as the prior. Conjugacy means the math stays in closed form: no integrals to estimate numerically, no MCMC required. For Bayesian linear regression with Gaussian noise, this is what makes the entire inference tractable analytically.
What's the relationship between the Bayesian posterior mean and ridge regression?
They’re identical. Maximising the log-posterior is the same as minimising the OLS loss + $\frac{\alpha}{\beta}\lVert \mathbf{w} \rVert^2$. So the MAP estimate of Bayesian linear regression with a zero-mean Gaussian prior is exactly the ridge regression solution. The Bayesian view supplies the “why” for L2 regularisation: it’s the negative log of a Gaussian prior belief.
What does the posterior covariance tell you that a point estimate (OLS / ridge) doesn't?
Model uncertainty — how confident we are in those weights. A wide posterior means many $\mathbf{w}$’s are still plausible (typical with little data); a narrow posterior means data has pinned down the weights tightly. For predictions, this propagates to a predictive variance that grows in regions of input space far from training data, and shrinks where you have lots of evidence. Point-estimate methods give a single $\hat{\mathbf{w}}$ with no uncertainty.