Linear regression with an additional L2 penalty on the weight magnitudes. The objective has a closed-form solution that’s just like OLS but with $\lambda I$ added to $X^\top X$. From the Bayesian view, ridge regression is MAP Bayesian linear regression under a zero-mean Gaussian prior on $\mathbf{w}$. The L2 term is the negative log-prior — regularisation has a probabilistic interpretation.
The Objective
$$J(\mathbf{w}) = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda \|\mathbf{w}\|^2$$

The first term is the OLS residual sum of squares. The second is the L2 regulariser with coefficient $\lambda$ (a minimal code version follows the list below).
- $\lambda = 0$: pure OLS.
- Small $\lambda$: light regularisation — weights are pulled mildly towards zero.
- Large $\lambda$: heavy regularisation — weights forced small, fit allowed to drift.
- $\lambda \to \infty$: $\mathbf{w} \to \mathbf{0}$, ignoring the data.
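A minimal sketch of the objective as code (NumPy; the function name `ridge_objective` is illustrative, not from this note):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Ridge objective: ||Xw - y||^2 + lam * ||w||^2."""
    residual = X @ w - y
    return residual @ residual + lam * (w @ w)
```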
Closed-Form Solution
Setting $\nabla_{\mathbf{w}} J(\mathbf{w}) = 0$ on the objective:

$$2X^\top(X\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w} = 0$$

Rearranging:

$$(X^\top X + \lambda I)\,\mathbf{w} = X^\top\mathbf{y} \quad\Rightarrow\quad \hat{\mathbf{w}}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top\mathbf{y}$$

Compare to the OLS normal equation: $\hat{\mathbf{w}}_{\text{OLS}} = (X^\top X)^{-1} X^\top\mathbf{y}$. The only change is adding $\lambda I$ before inverting.
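A minimal closed-form fit, assuming the intercept has already been handled (NumPy; `fit_ridge` is an illustrative name). Using `np.linalg.solve` rather than forming the inverse explicitly is the usual numerical practice.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y for the ridge weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```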
Why It Fixes OLS Pathologies
OLS fails when $X^\top X$ is singular (when there are more features than samples, $d > n$, or when columns are collinear). Adding $\lambda I$ shifts every eigenvalue up by $\lambda$, making the matrix invertible (a numerical check follows the list below):
- Rank-deficient $X$ ($d > n$). OLS has no unique solution. Ridge has a unique one for any $\lambda > 0$.
- Collinear features. $X^\top X$ is nearly singular; OLS weights swing wildly with small data changes. Ridge stabilises by trading bias for variance.
- Numerical conditioning. Even when $X^\top X$ is technically invertible, ill-conditioning amplifies floating-point errors. Ridge improves the condition number.
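A quick numerical check of the eigenvalue-shift argument on a deliberately collinear toy matrix (the data and the $\lambda$ value are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# two nearly identical columns -> X^T X is almost singular
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])

gram = X.T @ X
lam = 1.0
print(np.linalg.cond(gram))                    # enormous condition number
print(np.linalg.cond(gram + lam * np.eye(2)))  # every eigenvalue shifted up by lam -> modest
```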
The Bayesian View — Where Does $\lambda$ Come From?
Bayesian linear regression places a Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} I)$ and a Gaussian likelihood with noise precision $\beta$. The log-posterior is:

$$\log p(\mathbf{w} \mid X, \mathbf{y}) = -\frac{\beta}{2}\|X\mathbf{w} - \mathbf{y}\|^2 - \frac{\alpha}{2}\|\mathbf{w}\|^2 + \text{const}$$

Maximising it (the MAP estimate) is the same as minimising:

$$\|X\mathbf{w} - \mathbf{y}\|^2 + \frac{\alpha}{\beta}\|\mathbf{w}\|^2$$

This is exactly ridge regression with $\lambda = \alpha / \beta$.
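A small sanity check of the equivalence on toy data; the precisions `alpha` and `beta` are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

alpha, beta = 2.0, 25.0                                   # prior precision, noise precision
A = alpha * np.eye(3) + beta * X.T @ X                    # posterior precision
w_map = beta * np.linalg.solve(A, X.T @ y)                # posterior mean = MAP estimate

w_ridge = np.linalg.solve(X.T @ X + (alpha / beta) * np.eye(3), X.T @ y)
print(np.allclose(w_map, w_ridge))                        # True: lambda = alpha / beta
```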
Regularisation is a prior in disguise
L2 regularisation is the negative log of a zero-mean Gaussian prior on $\mathbf{w}$. The “regularisation strength” $\lambda = \alpha/\beta$ is the ratio of prior precision to noise precision: high prior precision ($\alpha$ large) or low noise precision ($\beta$ small) → strong regularisation. The Bayesian view turns “we should penalise large weights” from a heuristic into a statement about prior belief.
Other regularisers correspond to other priors (see the sketch after the table):
| Regulariser | Prior | Effect |
|---|---|---|
| $\lambda \lVert \mathbf{w} \rVert_2^2$ (L2) | Gaussian | Shrinks all weights smoothly toward 0 |
| $\lambda \lVert \mathbf{w} \rVert_1$ (L1, lasso) | Laplace | Drives some weights to exactly 0 (sparsity) |
| $\lambda_1 \lVert \mathbf{w} \rVert_1 + \lambda_2 \lVert \mathbf{w} \rVert_2^2$ (elastic net) | Mixture | Sparsity with stability |
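An illustrative check of the Effect column using scikit-learn's `Ridge` and `Lasso` (the toy data and penalty strengths are my own choices):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]                                  # only two informative features
y = X @ true_w + 0.5 * rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge exact zeros:", np.sum(ridge.coef_ == 0.0))   # typically 0: shrunk, not zeroed
print("lasso exact zeros:", np.sum(lasso.coef_ == 0.0))   # several: sparsity
```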
Effect on Generalisation
Ridge trades bias for variance:
- Bias up. Shrinking weights toward zero biases the fit — even the optimal $\mathbf{w}$ under regularisation isn’t quite the true weight vector.
- Variance down. The fit is less sensitive to noise in the training data — small data changes produce small weight changes.
For high-capacity models (high-degree polynomials, many basis functions), this is usually a net win on test performance: the variance reduction outweighs the bias increase.
A canonical illustration: degree-8 polynomial on 10 noisy points.
- $\lambda = 0$ (OLS): the fit interpolates every training point but oscillates wildly between them. Tiny training error, large test error.
- $\lambda > 0$ (ridge): the fit is smoother. Slightly larger training error, much smaller test error (see the sketch after this list).
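A sketch of that illustration, assuming a sine ground truth and $\lambda = 10^{-3}$ (both are placeholder choices, not from the note):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)     # 10 noisy points

Phi = np.vander(x, 9, increasing=True)                    # degree-8 polynomial features
x_grid = np.linspace(0, 1, 200)
Phi_grid = np.vander(x_grid, 9, increasing=True)
y_true = np.sin(2 * np.pi * x_grid)

for lam in [0.0, 1e-3]:
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(9), Phi.T @ y)
    print(f"lambda={lam}: train MSE {np.mean((Phi @ w - y) ** 2):.4f}, "
          f"test MSE {np.mean((Phi_grid @ w - y_true) ** 2):.4f}")
```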
Choosing $\lambda$
Cross-validation. For a grid of candidate $\lambda$’s (a minimal sketch follows the list):
- Split training data into $k$ folds.
- For each fold, train on the others and validate on the held-out fold.
- Pick the $\lambda$ that minimises mean validation error.
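A bare-bones version of that loop with the closed-form ridge solver (toy data, sequential folds without shuffling; all names are illustrative):

```python
import numpy as np

def cv_error(X, y, lam, k=5):
    """Mean validation MSE of closed-form ridge over k sequential folds."""
    folds = np.array_split(np.arange(X.shape[0]), k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(X.shape[0]), fold)
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(X.shape[1]),
                            X[train].T @ y[train])
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.3 * rng.normal(size=60)

lambdas = np.logspace(-4, 2, 13)
best_lam = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("chosen lambda:", best_lam)
```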
The Bayesian view suggests an alternative: empirical Bayes, where $\alpha$ (and thus $\lambda$) is chosen by maximising the evidence $p(\mathbf{y} \mid X, \alpha, \beta)$. This avoids cross-validation but requires more setup.
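A sketch of the empirical-Bayes route, using the standard Gaussian log marginal likelihood and a crude grid search over $(\alpha, \beta)$ (toy data; real implementations usually iterate fixed-point updates for $\alpha$ and $\beta$ instead):

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """log p(y | X, alpha, beta) for prior N(0, alpha^-1 I) and noise precision beta."""
    n, d = X.shape
    A = alpha * np.eye(d) + beta * X.T @ X            # posterior precision
    m = beta * np.linalg.solve(A, X.T @ y)            # posterior mean
    E = 0.5 * beta * np.sum((y - X @ m) ** 2) + 0.5 * alpha * (m @ m)
    return (0.5 * d * np.log(alpha) + 0.5 * n * np.log(beta)
            - E - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.3 * rng.normal(size=60)

grid = [(a, b) for a in np.logspace(-3, 3, 25) for b in np.logspace(-1, 3, 25)]
alpha_best, beta_best = max(grid, key=lambda ab: log_evidence(X, y, *ab))
print("implied lambda:", alpha_best / beta_best)
```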
Properties
- Convex. The objective is strictly convex (assuming $\lambda > 0$) → unique global optimum.
- Closed-form. No iteration required (for moderate feature dimension $d$).
- Stabilises against multicollinearity. Even severely correlated features give well-defined weights.
- Doesn’t produce sparse solutions. Weights shrink toward zero but rarely hit exactly zero. Use lasso (L1) if sparsity is desired.
What Could Go Wrong
- Standardisation matters. L2 penalises raw weight magnitudes, so feature scaling affects the result. Standardise inputs before fitting.
- The intercept is usually not penalised. Penalising it would shrink predictions toward zero rather than toward the mean target $\bar{y}$. Most implementations split it out (see the sketch after this list).
- Wrong $\lambda$. Too small → no regularisation, OLS pathologies return. Too large → excessive shrinkage, model underfits.
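A sketch of the usual pre-processing: standardise features, centre the targets, fit ridge on the centred problem, and recover an unpenalised intercept (all names are illustrative):

```python
import numpy as np

def fit_ridge_standardised(X, y, lam):
    """Standardise features, centre targets, fit ridge, recover an unpenalised intercept."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    y_mean = y.mean()
    Xs = (X - x_mean) / x_std
    w_s = np.linalg.solve(Xs.T @ Xs + lam * np.eye(X.shape[1]), Xs.T @ (y - y_mean))
    w = w_s / x_std                          # weights mapped back to the original feature scale
    intercept = y_mean - x_mean @ w
    return w, intercept
```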
Connections
- ordinary-least-squares — recovered as $\lambda \to 0$.
- bayesian-linear-regression — ridge is the MAP estimate; the prior is the L2 penalty.
- bayes-law — the rule that derives ridge from a Gaussian prior.
- linear-regression — the underlying model.
- generalization-bound — regularisation is one of the levers that controls model complexity (and thus the bound).
Active Recall
Why does adding $\lambda I$ to $X^\top X$ "fix" ill-conditioned matrices?
Because every eigenvalue of $X^\top X$ shifts up by $\lambda$. A near-zero eigenvalue $\varepsilon$ (the source of ill-conditioning) becomes $\varepsilon + \lambda$, which is well away from zero. The condition number — ratio of largest to smallest eigenvalue — improves accordingly. With $\lambda > 0$, the inverse always exists; with $\lambda$ large, the inverse is well-conditioned.
What's the Bayesian interpretation of the L2 penalty?
It’s the negative log-density of a zero-mean isotropic Gaussian prior on $\mathbf{w}$, scaled by the prior precision $\alpha$. Maximising the posterior (MAP) is the same as minimising “negative log-likelihood + negative log-prior” — and the “negative log-prior” of a Gaussian is exactly $\tfrac{\alpha}{2}\|\mathbf{w}\|^2$ (up to an additive constant).
A high-degree polynomial fits the training data perfectly but generalises poorly. How does ridge regression help?
It trades bias for variance. The L2 penalty discourages the polynomial coefficients from taking the extreme values needed to interpolate every training point. The fit becomes smoother — slightly worse on training data, much better on test data. Mathematically, ridge shrinks the eigenvector components of $\hat{\mathbf{w}}$ along directions of small data variance (where OLS would overfit) more than along directions of large data variance.
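That shrinkage picture can be checked via the SVD: writing $X = U\Sigma V^\top$, the ridge solution scales the component along the $i$-th right singular vector by $\sigma_i/(\sigma_i^2 + \lambda)$ instead of the OLS factor $1/\sigma_i$, i.e. an extra shrinkage of $\sigma_i^2/(\sigma_i^2 + \lambda)$ that is strongest where $\sigma_i$ is small. A quick numerical check on toy data (a sketch; the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)
lam = 2.0

U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))   # per-direction scaling s_i / (s_i^2 + lam)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(np.allclose(w_svd, w_ridge))                # True
```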