Linear regression treated probabilistically: place a Gaussian prior on the weights, combine it with the Gaussian likelihood, and recover a closed-form Gaussian posterior. The posterior mean is the MAP estimate (equivalent to ridge regression); the posterior covariance quantifies how confident we are in those weights. Predictions become a distribution rather than a point.
Motivation — The Underdetermined Case
OLS works when there are more observations than unknowns. With one input/output pair $(x_1, t_1)$ and two unknowns $(w_0, w_1)$, the equation $t_1 = w_0 + w_1 x_1$ has infinitely many solutions — every line through that single point fits perfectly. OLS doesn’t help; the system is underdetermined.
The Bayesian fix: declare an a-priori belief about which $\mathbf{w}$’s are plausible. A natural choice is “small weights are more likely than large ones” — formalised as a Gaussian prior centred at zero. Now the posterior over $\mathbf{w}$ is a distribution, peaked where prior + data agree.
The Bayesian Setup
Likelihood (same as standard linear regression):

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \,\middle|\, \mathbf{w}^\top \boldsymbol{\phi}(x_n),\ \beta^{-1}\right)$$

where $\beta$ is the noise precision (assumed known here).
Prior — Gaussian, centred at zero, with isotropic precision $\alpha$:

$$p(\mathbf{w}) = \mathcal{N}\!\left(\mathbf{w} \,\middle|\, \mathbf{0},\ \alpha^{-1}\mathbf{I}\right)$$

Larger $\alpha$ → tighter prior → strong preference for small weights. Smaller $\alpha$ → weaker prior → behaviour closer to MLE/OLS.
Posterior by Bayes’ law:

$$p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$$
Conjugacy — Why the Math Closes
A Gaussian likelihood combined with a Gaussian prior gives a Gaussian posterior. This is the canonical example of a conjugate prior — when the posterior stays in the same distributional family as the prior, the math closes in finite form. No integrals to estimate, no MCMC required.
Taking the log of $p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$:

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(x_n) \right)^2 - \frac{\alpha}{2}\, \mathbf{w}^\top \mathbf{w} + \text{const}$$

This is quadratic in $\mathbf{w}$, so completing the square gives a Gaussian posterior:

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

with:

$$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta\, \boldsymbol{\Phi}^\top \boldsymbol{\Phi}, \qquad \mathbf{m}_N = \beta\, \mathbf{S}_N \boldsymbol{\Phi}^\top \mathbf{t}$$

where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with rows $\boldsymbol{\phi}(x_n)^\top$.
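A minimal sketch of the posterior computation in NumPy — the function name `posterior` and the docstring conventions are mine, following the formulas above:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for Bayesian linear regression.

    Phi   : (N, M) design matrix with rows phi(x_n)^T
    t     : (N,) target vector
    alpha : prior precision;  beta : noise precision
    """
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi  # S_N^{-1} = alpha I + beta Phi^T Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ (Phi.T @ t)                    # m_N = beta S_N Phi^T t
    return m_N, S_N
```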
What the Posterior Tells You
- The mean $\mathbf{m}_N$ is the MAP estimate — the most probable single $\mathbf{w}$ under the posterior. It coincides with the ridge regression solution: it’s OLS with the regularisation term $\frac{\alpha}{\beta}\lVert \mathbf{w} \rVert^2$ added to the loss.
- The covariance $\mathbf{S}_N$ quantifies uncertainty. A tight posterior means the data has narrowed $\mathbf{w}$ down confidently; a wide posterior means many $\mathbf{w}$’s are still plausible.
- As $N$ grows, the posterior tightens. With only a handful of points, the posterior is barely narrower than the prior; with many, it’s a tiny ellipse around the true weights (illustrated in the sketch below).
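A quick illustration of that tightening, reusing `posterior` from the sketch above (the toy data and precision values are assumed, not from the source):

```python
# Watch the posterior volume (determinant of S_N) collapse as N grows.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
for N in (1, 10, 100):
    Phi = rng.normal(size=(N, 2))
    t = Phi @ w_true + 0.1 * rng.normal(size=N)
    m_N, S_N = posterior(Phi, t, alpha=1.0, beta=100.0)
    print(N, np.linalg.det(S_N))  # shrinks rapidly with more data
```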
Predictive Distribution
A non-Bayesian model gives a single prediction for a new input $x_\ast$. A Bayesian model gives a distribution over predictions, integrating over all plausible $\mathbf{w}$’s:

$$p(t_\ast \mid x_\ast, \mathbf{t}) = \int p(t_\ast \mid x_\ast, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w}$$

For Gaussian likelihood + Gaussian posterior, this integral is closed-form Gaussian:

$$p(t_\ast \mid x_\ast, \mathbf{t}) = \mathcal{N}\!\left(t_\ast \,\middle|\, \mathbf{m}_N^\top \boldsymbol{\phi}(x_\ast),\ \sigma_N^2(x_\ast)\right), \qquad \sigma_N^2(x_\ast) = \frac{1}{\beta} + \boldsymbol{\phi}(x_\ast)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_\ast)$$

The variance has two contributions: $1/\beta$ (irreducible noise) and $\boldsymbol{\phi}(x_\ast)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_\ast)$ (model uncertainty, vanishing as data accumulates).
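The predictive distribution is then one dot product and one quadratic form — again a sketch with assumed names, continuing from `posterior` above:

```python
def predict(phi_star, m_N, S_N, beta):
    """Predictive mean and variance at a new feature vector phi(x_*)."""
    mean = m_N @ phi_star
    var = 1.0 / beta + phi_star @ S_N @ phi_star  # noise + model uncertainty
    return mean, var
```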
Why this matters in practice
Bayesian regression doesn’t just give a point prediction $\hat{t}$ — it gives the full distribution $p(t_\ast \mid x_\ast, \mathbf{t})$. For decision-making (medical risk, financial pricing), the variance is as important as the mean. A confident wrong answer is worse than a hedged correct one.
MAP and Ridge Regression
Maximising the log-posterior with respect to $\mathbf{w}$:

$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol{\phi}(x_n) \right)^2 + \frac{\alpha}{2}\, \mathbf{w}^\top \mathbf{w} \right]$$

The first term is the OLS objective. The second is an L2 penalty on $\mathbf{w}$, with regularisation coefficient $\lambda = \alpha/\beta$. This is exactly ridge regression.
The Bayesian view explains why L2 regularisation works: it’s the negative log-prior of a zero-mean Gaussian belief. Larger $\alpha$ (tighter prior) → larger $\lambda$ (more shrinkage) → flatter, simpler models. Smaller $\alpha$ → looser prior → behaviour closer to MLE/OLS.
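A quick numerical check of the equivalence, reusing `posterior` from above (toy data and precision values assumed): the posterior mean should match the ridge solution with $\lambda = \alpha/\beta$.

```python
# Verify m_N equals the ridge solution with lambda = alpha / beta.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))
t = Phi @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=20)

alpha, beta = 2.0, 25.0
m_N, _ = posterior(Phi, t, alpha, beta)

lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ t)
assert np.allclose(m_N, w_ridge)  # identical up to floating point
```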
| $L_p$ regulariser | Equivalent prior |
|---|---|
| L2 ($\lVert \mathbf{w} \rVert_2^2$) | Gaussian: $p(w_j) \propto \exp\!\left(-\tfrac{\alpha}{2} w_j^2\right)$ |
| L1 ($\lVert \mathbf{w} \rVert_1$) | Laplace: $p(w_j) \propto \exp\!\left(-\gamma \lvert w_j \rvert\right)$ |
When It Beats MLE
A canonical example: degree-8 polynomial regression on 10 points.
- MLE fits all 10 points (almost) exactly, but oscillates wildly between them — terrible test performance.
- MAP with a Gaussian prior produces a smoother fit. The prior says “small weights are more likely,” which prevents the polynomial coefficients from blowing up to interpolate every training point.
Same data, same model class — different criterion, dramatically different generalisation.
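A sketch of that experiment — the degree and point count come from the text; the sine target, noise level, and precision values are assumptions:

```python
# Degree-8 polynomial, 10 points: MLE interpolates, MAP stays smooth.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)  # assumed toy target
Phi = np.vander(x, 9, increasing=True)                 # columns 1, x, ..., x^8

w_mle = np.linalg.lstsq(Phi, t, rcond=None)[0]         # huge, oscillating coefficients
w_map, _ = posterior(Phi, t, alpha=1e-3, beta=100.0)   # prior shrinks them
print(np.abs(w_mle).max(), np.abs(w_map).max())
```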
Practical Notes
- Hyperparameters $\alpha, \beta$. Treated as fixed constants in the basic setup. In practice, choose them by cross-validation, by maximising the evidence (empirical Bayes), or by placing hyperpriors and integrating those out too.
- Computation. Forming $\mathbf{S}_N$ requires inverting an $M \times M$ matrix — same cost as OLS. The posterior mean then reduces to one matrix-vector product.
- Online updates. The posterior from $N$ points becomes the prior for the $(N{+}1)$th — you can update one point at a time without re-processing the dataset (see the sketch after this list).
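A sketch of the online update in natural parameters (precision, and precision times mean); the helper names are mine:

```python
def init_state(M, alpha):
    """Prior as natural parameters: (S_0^{-1}, S_0^{-1} m_0) with m_0 = 0."""
    return alpha * np.eye(M), np.zeros(M)

def update(state, phi_n, t_n, beta):
    """Fold in one observation; the posterior so far becomes the new prior."""
    S_inv, h = state
    S_inv = S_inv + beta * np.outer(phi_n, phi_n)  # rank-1 precision update
    h = h + beta * t_n * phi_n
    return S_inv, h

# At any point: m_N = np.linalg.solve(S_inv, h), S_N = np.linalg.inv(S_inv).
```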
What Could Go Wrong
- Wrong noise model. If $\beta$ (the noise precision) is misspecified, the posterior is wrong. Either learn $\beta$ from data or place a prior on it (a Gamma prior is conjugate for a Gaussian’s precision).
- Wrong prior shape. A zero-centred Gaussian assumes weights “should be small.” If the truth is sparse (few non-zero weights), use a Laplace prior — that recovers L1 / lasso.
- High dimensions. $\mathbf{S}_N$ is $M \times M$; inverting it becomes expensive past a few thousand parameters.
Connections
- linear-regression — the underlying model whose weights we’re inferring.
- ordinary-least-squares — the MLE-equivalent point estimate; recovered as $\alpha \to 0$ (flat prior).
- ridge-regression — the MAP point estimate; recovered by maximising the posterior.
- bayes-law — the rule combining prior and likelihood.
- gaussian-distribution — the conjugate self-pairing that makes everything closed-form.
- maximum likelihood estimation — the no-prior limit; equivalent to OLS under Gaussian noise.
Active Recall
Why is Gaussian-likelihood + Gaussian-prior a "conjugate" pair?
Because the resulting posterior is also Gaussian — same family as the prior. Conjugacy means the math stays in closed form: no integrals to estimate numerically, no MCMC required. For Bayesian linear regression with Gaussian noise, this is what makes the entire inference tractable analytically.
What's the relationship between the Bayesian posterior mean and ridge regression?
They’re identical. Maximising the log-posterior is the same as minimising the OLS loss + $\frac{\alpha}{\beta}\lVert \mathbf{w} \rVert^2$. So the MAP estimate of Bayesian linear regression with a zero-mean Gaussian prior is exactly the ridge regression solution. The Bayesian view supplies the “why” for L2 regularisation: it’s the negative log of a Gaussian prior belief.
What does the posterior covariance tell you that a point estimate (OLS / ridge) doesn't?
Model uncertainty — how confident we are in those weights. A wide posterior means many $\mathbf{w}$’s are still plausible (typical with little data); a narrow posterior means data has pinned down the weights tightly. For predictions, this propagates to a predictive variance that grows in regions of input space far from training data, and shrinks where you have lots of evidence. Point-estimate methods give a single $\hat{\mathbf{w}}$ with no uncertainty.