Linear regression with an additional L2 penalty on the weight magnitudes. The objective has a closed-form solution that’s just like OLS but with $\lambda I$ added to $X^\top X$. From the Bayesian view, ridge regression is MAP Bayesian linear regression under a zero-mean Gaussian prior on $\mathbf{w}$. The L2 term is the negative log-prior — regularisation has a probabilistic interpretation.
The Objective
$$J(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2$$

The first term is the OLS residual sum of squares. The second is the L2 regulariser with coefficient $\lambda$ (a minimal code version follows the list below).
- $\lambda = 0$: pure OLS.
- Small $\lambda$: light regularisation — weights are pulled mildly towards zero.
- Large $\lambda$: heavy regularisation — weights forced small, fit allowed to drift.
- $\lambda \to \infty$: $\mathbf{w} \to \mathbf{0}$, ignoring the data.
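A minimal sketch of the objective as code (NumPy; the function name `ridge_objective` is illustrative, not from this note):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Ridge objective: ||Xw - y||^2 + lam * ||w||^2."""
    residual = X @ w - y
    return residual @ residual + lam * (w @ w)
```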
Closed-Form Solution
Setting $\nabla_{\mathbf{w}} J(\mathbf{w}) = 0$ on the objective:

$$2X^\top(X\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w} = 0$$

Rearranging:

$$(X^\top X + \lambda I)\,\mathbf{w} = X^\top\mathbf{y} \quad\Rightarrow\quad \hat{\mathbf{w}}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top\mathbf{y}$$

Compare to the OLS normal equation: $\hat{\mathbf{w}}_{\text{OLS}} = (X^\top X)^{-1} X^\top\mathbf{y}$. The only change is adding $\lambda I$ before inverting.
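A minimal closed-form fit, assuming the intercept has already been handled (NumPy; `fit_ridge` is an illustrative name). Using `np.linalg.solve` rather than forming the inverse explicitly is the usual numerical practice.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y for the ridge weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```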
Why It Fixes OLS Pathologies
OLS fails when $X^\top X$ is singular (when there are more features than samples, $d > n$, or when columns are collinear). Adding $\lambda I$ shifts every eigenvalue up by $\lambda$, making the matrix invertible (a numerical check follows the list below):
- Rank-deficient $X$ ($d > n$). OLS has no unique solution. Ridge has a unique one for any $\lambda > 0$.
- Collinear features. $X^\top X$ is nearly singular; OLS weights swing wildly with small data changes. Ridge stabilises by trading bias for variance.
- Numerical conditioning. Even when $X^\top X$ is technically invertible, ill-conditioning amplifies floating-point errors. Ridge improves the condition number.
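A quick numerical check of the eigenvalue-shift argument on a deliberately collinear toy matrix (the data and the $\lambda$ value are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# two nearly identical columns -> X^T X is almost singular
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])

gram = X.T @ X
lam = 1.0
print(np.linalg.cond(gram))                    # enormous condition number
print(np.linalg.cond(gram + lam * np.eye(2)))  # every eigenvalue shifted up by lam -> modest
```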
The Bayesian View — Where Does $\lambda$ Come From?
Bayesian linear regression places a Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} I)$ and a Gaussian likelihood with noise precision $\beta$. The log-posterior is:

$$\log p(\mathbf{w} \mid X, \mathbf{y}) = -\frac{\beta}{2}\|X\mathbf{w} - \mathbf{y}\|^2 - \frac{\alpha}{2}\|\mathbf{w}\|^2 + \text{const}$$

Maximising it (the MAP estimate) is the same as minimising:

$$\|X\mathbf{w} - \mathbf{y}\|^2 + \frac{\alpha}{\beta}\|\mathbf{w}\|^2$$

This is exactly ridge regression with $\lambda = \alpha / \beta$.
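A small sanity check of the equivalence on toy data; the precisions `alpha` and `beta` are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

alpha, beta = 2.0, 25.0                                   # prior precision, noise precision
A = alpha * np.eye(3) + beta * X.T @ X                    # posterior precision
w_map = beta * np.linalg.solve(A, X.T @ y)                # posterior mean = MAP estimate

w_ridge = np.linalg.solve(X.T @ X + (alpha / beta) * np.eye(3), X.T @ y)
print(np.allclose(w_map, w_ridge))                        # True: lambda = alpha / beta
```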
Regularisation is a prior in disguise
L2 regularisation is the negative log of a zero-mean Gaussian prior on $\mathbf{w}$. The “regularisation strength” $\lambda = \alpha/\beta$ is the ratio of prior precision to noise precision: high prior precision ($\alpha$ large) or low noise precision ($\beta$ small) → strong regularisation. The Bayesian view turns “we should penalise large weights” from a heuristic into a statement about prior belief.
Other regularisers correspond to other priors (see the sketch after the table):
| Regulariser | Prior | Effect |
|---|---|---|
| $\lambda \lVert \mathbf{w} \rVert_2^2$ (L2) | Gaussian | Shrinks all weights smoothly toward 0 |
| $\lambda \lVert \mathbf{w} \rVert_1$ (L1, lasso) | Laplace | Drives some weights to exactly 0 (sparsity) |
| $\lambda_1 \lVert \mathbf{w} \rVert_1 + \lambda_2 \lVert \mathbf{w} \rVert_2^2$ (elastic net) | Mixture | Sparsity with stability |
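An illustrative check of the Effect column using scikit-learn's `Ridge` and `Lasso` (the toy data and penalty strengths are my own choices):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]                                  # only two informative features
y = X @ true_w + 0.5 * rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge exact zeros:", np.sum(ridge.coef_ == 0.0))   # typically 0: shrunk, not zeroed
print("lasso exact zeros:", np.sum(lasso.coef_ == 0.0))   # several: sparsity
```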
Effect on Generalisation
Ridge trades bias for variance:
- Bias up. Shrinking weights toward zero biases the fit — even the optimal $\mathbf{w}$ under regularisation isn’t quite the true weight vector.
- Variance down. The fit is less sensitive to noise in the training data — small data changes produce small weight changes.
For high-capacity models (high-degree polynomials, many basis functions), this is usually a net win on test performance: the variance reduction outweighs the bias increase.
A canonical illustration: degree-8 polynomial on 10 noisy points.
- $\lambda = 0$ (OLS): the fit interpolates every training point but oscillates wildly between them. Tiny training error, large test error.
- $\lambda > 0$ (ridge): the fit is smoother. Slightly larger training error, much smaller test error (see the sketch after this list).
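A sketch of that illustration, assuming a sine ground truth and $\lambda = 10^{-3}$ (both are placeholder choices, not from the note):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)     # 10 noisy points

Phi = np.vander(x, 9, increasing=True)                    # degree-8 polynomial features
x_grid = np.linspace(0, 1, 200)
Phi_grid = np.vander(x_grid, 9, increasing=True)
y_true = np.sin(2 * np.pi * x_grid)

for lam in [0.0, 1e-3]:
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(9), Phi.T @ y)
    print(f"lambda={lam}: train MSE {np.mean((Phi @ w - y) ** 2):.4f}, "
          f"test MSE {np.mean((Phi_grid @ w - y_true) ** 2):.4f}")
```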
Choosing $\lambda$
Cross-validation. For a grid of candidate $\lambda$’s (a minimal sketch follows the list):
- Split training data into $k$ folds.
- For each fold, train on the others and validate on the held-out fold.
- Pick the $\lambda$ that minimises mean validation error.
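A bare-bones version of that loop with the closed-form ridge solver (toy data, sequential folds without shuffling; all names are illustrative):

```python
import numpy as np

def cv_error(X, y, lam, k=5):
    """Mean validation MSE of closed-form ridge over k sequential folds."""
    folds = np.array_split(np.arange(X.shape[0]), k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(X.shape[0]), fold)
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(X.shape[1]),
                            X[train].T @ y[train])
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.3 * rng.normal(size=60)

lambdas = np.logspace(-4, 2, 13)
best_lam = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("chosen lambda:", best_lam)
```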
The Bayesian view suggests an alternative: empirical Bayes, where $\alpha$ (and thus $\lambda$) is chosen by maximising the evidence $p(\mathbf{y} \mid X, \alpha, \beta)$. This avoids cross-validation but requires more setup.
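A sketch of the empirical-Bayes route, using the standard Gaussian log marginal likelihood and a crude grid search over $(\alpha, \beta)$ (toy data; real implementations usually iterate fixed-point updates for $\alpha$ and $\beta$ instead):

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """log p(y | X, alpha, beta) for prior N(0, alpha^-1 I) and noise precision beta."""
    n, d = X.shape
    A = alpha * np.eye(d) + beta * X.T @ X            # posterior precision
    m = beta * np.linalg.solve(A, X.T @ y)            # posterior mean
    E = 0.5 * beta * np.sum((y - X @ m) ** 2) + 0.5 * alpha * (m @ m)
    return (0.5 * d * np.log(alpha) + 0.5 * n * np.log(beta)
            - E - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.3 * rng.normal(size=60)

grid = [(a, b) for a in np.logspace(-3, 3, 25) for b in np.logspace(-1, 3, 25)]
alpha_best, beta_best = max(grid, key=lambda ab: log_evidence(X, y, *ab))
print("implied lambda:", alpha_best / beta_best)
```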
Properties
- Convex. The objective is strictly convex (assuming $\lambda > 0$) → unique global optimum.
- Closed-form. No iteration required (for moderate feature dimension $d$).
- Stabilises against multicollinearity. Even severely correlated features give well-defined weights.
- Doesn’t produce sparse solutions. Weights shrink toward zero but rarely hit exactly zero. Use lasso (L1) if sparsity is desired.
What Could Go Wrong
- Standardisation matters. L2 penalises raw weight magnitudes, so feature scaling affects the result. Standardise inputs before fitting.
- The intercept is usually not penalised. Penalising it would shrink predictions toward zero rather than toward the mean target $\bar{y}$. Most implementations split it out (see the sketch after this list).
- Wrong $\lambda$. Too small → no regularisation, OLS pathologies return. Too large → excessive shrinkage, model underfits.
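A sketch of the usual pre-processing: standardise features, centre the targets, fit ridge on the centred problem, and recover an unpenalised intercept (all names are illustrative):

```python
import numpy as np

def fit_ridge_standardised(X, y, lam):
    """Standardise features, centre targets, fit ridge, recover an unpenalised intercept."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    y_mean = y.mean()
    Xs = (X - x_mean) / x_std
    w_s = np.linalg.solve(Xs.T @ Xs + lam * np.eye(X.shape[1]), Xs.T @ (y - y_mean))
    w = w_s / x_std                          # weights mapped back to the original feature scale
    intercept = y_mean - x_mean @ w
    return w, intercept
```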
Connections
- ordinary-least-squares — recovered as $\lambda \to 0$.
- bayesian-linear-regression — ridge is the MAP estimate; the prior is the L2 penalty.
- bayes-law — the rule that derives ridge from a Gaussian prior.
- linear-regression — the underlying model.
- generalization-bound — regularisation is one of the levers that controls model complexity (and thus the bound).
Active Recall
Why does adding $\lambda I$ to $X^\top X$ "fix" ill-conditioned matrices?
Because every eigenvalue of $X^\top X$ shifts up by $\lambda$. A near-zero eigenvalue $\varepsilon$ (the source of ill-conditioning) becomes $\varepsilon + \lambda$, which is well away from zero. The condition number — ratio of largest to smallest eigenvalue — improves accordingly. With $\lambda > 0$, the inverse always exists; with $\lambda$ large, the inverse is well-conditioned.
What's the Bayesian interpretation of the L2 penalty?
It’s the negative log-density of a zero-mean isotropic Gaussian prior on $\mathbf{w}$, scaled by the prior precision $\alpha$. Maximising the posterior (MAP) is the same as minimising “negative log-likelihood + negative log-prior” — and the “negative log-prior” of a Gaussian is exactly $\tfrac{\alpha}{2}\|\mathbf{w}\|^2$ (up to an additive constant).
A high-degree polynomial fits the training data perfectly but generalises poorly. How does ridge regression help?
It trades bias for variance. The L2 penalty discourages the polynomial coefficients from taking the extreme values needed to interpolate every training point. The fit becomes smoother — slightly worse on training data, much better on test data. Mathematically, ridge shrinks the eigenvector components of $\hat{\mathbf{w}}$ along directions of small data variance (where OLS would overfit) more than along directions of large data variance.
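That shrinkage picture can be checked via the SVD: writing $X = U\Sigma V^\top$, the ridge solution scales the component along the $i$-th right singular vector by $\sigma_i/(\sigma_i^2 + \lambda)$ instead of the OLS factor $1/\sigma_i$, i.e. an extra shrinkage of $\sigma_i^2/(\sigma_i^2 + \lambda)$ that is strongest where $\sigma_i$ is small. A quick numerical check on toy data (a sketch; the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)
lam = 2.0

U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))   # per-direction scaling s_i / (s_i^2 + lam)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(np.allclose(w_svd, w_ridge))                # True
```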