Linear regression with an additional L2 penalty on the weight magnitudes. The objective has a closed-form solution that's just like OLS but with $\lambda I$ added to $X^\top X$. From the Bayesian view, ridge regression is MAP Bayesian linear regression under a zero-mean Gaussian prior on $w$. The L2 term is the negative log-prior: regularisation has a probabilistic interpretation.

The Objective

$$J(w) = \|y - Xw\|^2 + \lambda \|w\|^2$$

The first term is the OLS residual sum of squares. The second is the L2 regulariser with coefficient $\lambda$.

  • $\lambda = 0$: pure OLS.
  • Small $\lambda$: light regularisation; weights are pulled mildly towards zero.
  • Large $\lambda$: heavy regularisation; weights are forced small and the fit is allowed to drift.
  • $\lambda \to \infty$: $w \to 0$, ignoring the data.
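A minimal NumPy sketch of the objective on assumed synthetic data (the function and variable names here are illustrative, not from the note):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """J(w) = ||y - Xw||^2 + lam * ||w||^2."""
    residual = y - X @ w
    return residual @ residual + lam * (w @ w)

# Tiny synthetic example (assumed data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=20)

print(ridge_objective(true_w, X, y, lam=0.0))   # OLS residual sum of squares only
print(ridge_objective(true_w, X, y, lam=10.0))  # adds 10 * ||w||^2 to the same RSS
```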

Closed-Form Solution

Setting $\nabla_w J(w) = 0$ on the objective:

$$\nabla_w J = -2 X^\top (y - Xw) + 2\lambda w = 0$$

Rearranging:

$$(X^\top X + \lambda I)\,\hat{w}_{\text{ridge}} = X^\top y \quad\Longrightarrow\quad \hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

Compare to the OLS normal equation: $\hat{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top y$. The only change is adding $\lambda I$ before inverting.
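A minimal NumPy sketch of the closed-form fit, on assumed synthetic data; solving the linear system is preferable to forming the inverse explicitly:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustration on assumed synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=50)

w_ols   = ridge_fit(X, y, lam=0.0)   # reduces to the OLS normal equations
w_ridge = ridge_fit(X, y, lam=1.0)   # shrunk towards zero
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```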

Why It Fixes OLS Pathologies

OLS fails when $X^\top X$ is singular (when $d > n$, or when columns are collinear). Adding $\lambda I$ shifts every eigenvalue up by $\lambda$, making the matrix invertible (see the sketch after this list):

  • Rank-deficient $X$ ($d > n$). OLS has no unique solution. Ridge has a unique one for any $\lambda > 0$.
  • Collinear features. $X^\top X$ is nearly singular; OLS weights swing wildly with small data changes. Ridge stabilises by trading bias for variance.
  • Numerical conditioning. Even when $X^\top X$ is technically invertible, ill-conditioning amplifies floating-point errors. Ridge improves the condition number.
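A short numerical check of the conditioning claim, using a deliberately collinear pair of columns (an assumed toy setup):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 1e-6 * rng.normal(size=100)   # nearly identical column -> collinearity
X = np.column_stack([x1, x2])

gram = X.T @ X
lam = 1.0
print(np.linalg.cond(gram))                    # huge: the Gram matrix is nearly singular
print(np.linalg.cond(gram + lam * np.eye(2)))  # modest after adding lambda * I
```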

The Bayesian View: Where Does $\lambda$ Come From?

Bayesian linear regression places a Gaussian prior $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$ and a Gaussian likelihood $p(y \mid X, w) = \mathcal{N}(y \mid Xw, \beta^{-1} I)$ with noise precision $\beta$. The log-posterior is:

$$\log p(w \mid X, y) = -\frac{\beta}{2} \|y - Xw\|^2 - \frac{\alpha}{2} \|w\|^2 + \text{const}$$

Maximising this (the MAP estimate) is the same as minimising:

$$\|y - Xw\|^2 + \frac{\alpha}{\beta} \|w\|^2$$

This is exactly ridge regression with $\lambda = \alpha / \beta$.
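A quick numerical check of this equivalence (a sketch with assumed precisions $\alpha$ and $\beta$, not values from the note): the MAP estimate under the Gaussian prior and likelihood coincides with the ridge solution at $\lambda = \alpha/\beta$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.3, size=40)

alpha, beta = 2.0, 25.0   # assumed prior precision and noise precision
lam = alpha / beta        # implied ridge strength

# MAP / posterior mean of Bayesian linear regression with these precisions.
S_inv = beta * X.T @ X + alpha * np.eye(4)
w_map = beta * np.linalg.solve(S_inv, X.T @ y)

# Ridge with lambda = alpha / beta.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

print(np.allclose(w_map, w_ridge))   # True: the two solutions coincide
```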

Regularisation is a prior in disguise

L2 regularisation is the negative log of a zero-mean Gaussian prior on $w$. The "regularisation strength" $\lambda = \alpha/\beta$ is the ratio of prior precision to noise precision: high prior precision ($\alpha$ large) or low noise precision ($\beta$ small) → strong regularisation. The Bayesian view turns "we should penalise large weights" from a heuristic into a statement about prior belief.

Other regularisers correspond to other priors:

| Regulariser | Prior | Effect |
|---|---|---|
| $\lambda \lVert w \rVert_2^2$ (L2) | Gaussian | Shrinks all weights smoothly toward 0 |
| $\lambda \lVert w \rVert_1$ (L1, lasso) | Laplace | Drives some weights to exactly 0 (sparsity) |
| $\lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2$ (elastic net) | Mixture | Sparsity with stability |

Effect on Generalisation

Ridge trades bias for variance:

  • Bias up. Shrinking weights toward zero biases the fit: even the optimal $w$ under regularisation isn't quite the true weight vector.
  • Variance down. The fit is less sensitive to noise in the training data — small data changes produce small weight changes.

For high-capacity models (high-degree polynomials, many basis functions), this is usually a net win on test performance: the variance reduction outweighs the bias increase.

A canonical illustration: degree-8 polynomial on 10 noisy points.

  • $\lambda = 0$ (OLS): the fit interpolates every training point but oscillates wildly between them. Tiny training error, large test error.
  • Small $\lambda > 0$ (ridge): the fit is smoother. Slightly larger training error, much smaller test error, as sketched below.
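A sketch of this illustration; the data-generating function, noise level, seed, and the ridge $\lambda$ are all assumptions here, since the note doesn't pin them down:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, size=10))
y = np.sin(3 * x) + rng.normal(scale=0.1, size=10)   # assumed noisy target

X = np.vander(x, N=9, increasing=True)   # degree-8 polynomial features (9 columns)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols   = ridge_fit(X, y, 0.0)    # typically oscillates between the 10 points
w_ridge = ridge_fit(X, y, 1e-3)   # typically a smoother fit

x_test = np.linspace(-1, 1, 200)
X_test = np.vander(x_test, N=9, increasing=True)
y_true = np.sin(3 * x_test)
print(np.mean((X_test @ w_ols   - y_true) ** 2))   # typically large test error
print(np.mean((X_test @ w_ridge - y_true) ** 2))   # typically much smaller
```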

Choosing $\lambda$

Cross-validation. For a grid of candidate $\lambda$ values:

  1. Split training data into $K$ folds.
  2. For each fold, train on the others and validate on the held-out fold.
  3. Pick the $\lambda$ that minimises mean validation error (see the sketch below).
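A minimal K-fold grid-search sketch using scikit-learn's KFold and Ridge (one reasonable implementation; note that scikit-learn calls the penalty coefficient alpha):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def choose_lambda(X, y, lambdas, n_splits=5):
    """Return the lambda with the lowest mean validation MSE over K folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    mean_errors = []
    for lam in lambdas:
        fold_errors = []
        for train_idx, val_idx in kf.split(X):
            model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])  # alpha = this note's lambda
            pred = model.predict(X[val_idx])
            fold_errors.append(np.mean((pred - y[val_idx]) ** 2))
        mean_errors.append(np.mean(fold_errors))
    return lambdas[int(np.argmin(mean_errors))]

# Example usage on assumed arrays X, y:
# best_lam = choose_lambda(X, y, lambdas=np.logspace(-4, 2, 13))
```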

The Bayesian view suggests an alternative: empirical Bayes, where $\alpha$ and $\beta$ (and thus $\lambda$) are chosen by maximising the evidence $p(y \mid X, \alpha, \beta)$. This avoids cross-validation but requires more setup.
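One practical shortcut, assuming scikit-learn is acceptable: BayesianRidge estimates the weight and noise precisions by approximately maximising the evidence, and the implied ridge strength is their ratio. A sketch on assumed synthetic data:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.2, size=60)

model = BayesianRidge().fit(X, y)
# sklearn naming: alpha_ = estimated noise precision, lambda_ = estimated weight precision.
implied_lambda = model.lambda_ / model.alpha_   # this note's lambda = prior precision / noise precision
print(implied_lambda)
```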

Properties

  • Convex. The objective is strictly convex (assuming $\lambda > 0$) → unique global optimum.
  • Closed-form. No iteration required (for moderate feature dimension $d$).
  • Stabilises against multicollinearity. Even severely correlated features give well-defined weights.
  • Doesn’t produce sparse solutions. Weights shrink toward zero but rarely hit exactly zero. Use lasso (L1) if sparsity is desired.

What Could Go Wrong

  • Standardisation matters. L2 penalises raw weight magnitudes, so feature scaling affects the result. Standardise inputs before fitting.
  • The intercept is usually not penalised. Penalising it would shrink predictions toward zero rather than toward the target mean $\bar{y}$. Most implementations split it out (see the sketch after this list).
  • Wrong $\lambda$. Too small → no regularisation, OLS pathologies return. Too large → excessive shrinkage, model underfits.
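A sketch that applies both precautions at once, on assumed synthetic data: standardise the features, centre the target so the unpenalised intercept is just the target mean, and penalise only the weights.

```python
import numpy as np

rng = np.random.default_rng(5)
# Features on wildly different scales (assumed example).
X = rng.normal(loc=[10.0, -3.0, 0.5], scale=[100.0, 0.1, 1.0], size=(80, 3))
y = X @ np.array([0.01, 5.0, 1.0]) + 7.0 + rng.normal(scale=0.5, size=80)

# Standardise features and centre the target; the intercept is then mean(y).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_cen = y - y.mean()

lam = 1.0
w = np.linalg.solve(X_std.T @ X_std + lam * np.eye(3), X_std.T @ y_cen)
intercept = y.mean()   # unpenalised: predictions shrink toward mean(y), not toward 0

pred = X_std @ w + intercept
```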

Connections

Active Recall