The fitting criterion for linear regression: choose the weights that minimise the sum of squared residuals between predictions and observed targets. This yields a closed-form solution via the normal equation $\Phi^\top\Phi\,\mathbf{w} = \Phi^\top\mathbf{y}$, computable in one matrix inversion, with no iterations.
The Objective
Given training data $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ and a linear model $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$ with $M$ basis functions, define the residual for example $n$ as:

$$r_n = y_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)$$

OLS picks the $\mathbf{w}$ that minimises the sum of squared residuals:

$$E(\mathbf{w}) = \sum_{n=1}^{N} r_n^2 = \sum_{n=1}^{N} \big(y_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\big)^2$$
The objective is convex and quadratic in $\mathbf{w}$, so its unique minimum is found by setting $\nabla_{\mathbf{w}} E = \mathbf{0}$.
Deriving the Normal Equation
Stack the basis-function evaluations into the design matrix $\Phi$ (an $N \times M$ matrix whose $n$-th row is $\boldsymbol{\phi}(\mathbf{x}_n)^\top$). Then:

$$E(\mathbf{w}) = \|\mathbf{y} - \Phi\mathbf{w}\|^2 = (\mathbf{y} - \Phi\mathbf{w})^\top(\mathbf{y} - \Phi\mathbf{w})$$
Differentiate w.r.t. $\mathbf{w}$ and set to zero:

$$\nabla_{\mathbf{w}} E = -2\,\Phi^\top(\mathbf{y} - \Phi\mathbf{w}) = \mathbf{0}$$
Rearranging gives the normal equation:

$$\Phi^\top\Phi\,\mathbf{w} = \Phi^\top\mathbf{y}$$
Solving (assuming $\Phi^\top\Phi$ is invertible):

$$\hat{\mathbf{w}} = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{y}$$
The matrix $\Phi^\dagger = (\Phi^\top\Phi)^{-1}\Phi^\top$ is the Moore–Penrose pseudoinverse of $\Phi$. So OLS is just $\hat{\mathbf{w}} = \Phi^\dagger\mathbf{y}$.
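A minimal NumPy sketch of this closed form (the toy data, the quadratic basis, and all variable names are illustrative assumptions, not from the note itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 1 + 2x + noise, fitted with the polynomial basis (1, x, x^2).
x = rng.uniform(-1, 1, size=50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)

# Design matrix Phi: one row per example, one column per basis function.
Phi = np.column_stack([np.ones_like(x), x, x**2])   # shape (N, M) = (50, 3)

# Normal equation: solve (Phi^T Phi) w = Phi^T y  (solve() is preferred to an explicit inverse).
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Same answer via the SVD-based least-squares routine, which is more robust numerically.
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w_hat)                          # roughly [1, 2, 0]
print(np.allclose(w_hat, w_lstsq))    # True
```

In practice `np.linalg.lstsq` (or a solve on the normal equations) is preferred to forming $(\Phi^\top\Phi)^{-1}$ explicitly, for the conditioning reasons discussed below.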
Worked Derivation (Degree-1 Polynomial)
Take $f(x; \mathbf{w}) = w_0 + w_1 x$. The objective is:

$$E(w_0, w_1) = \sum_{n=1}^{N} (y_n - w_0 - w_1 x_n)^2$$
Setting $\partial E / \partial w_0 = 0$ and $\partial E / \partial w_1 = 0$ gives a linear system:

$$\sum_{n} (y_n - w_0 - w_1 x_n) = 0, \qquad \sum_{n} x_n (y_n - w_0 - w_1 x_n) = 0$$

In matrix form:

$$\begin{pmatrix} N & \sum_n x_n \\ \sum_n x_n & \sum_n x_n^2 \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \begin{pmatrix} \sum_n y_n \\ \sum_n x_n y_n \end{pmatrix}$$
Inverting the $2 \times 2$ matrix gives the closed-form solution. The general matrix-form derivation uses exactly the same logic, just with $\Phi^\top\Phi$ in place of the $2 \times 2$ matrix.
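A quick numerical check of the degree-1 system (the toy numbers are assumptions for illustration):

```python
import numpy as np

# Toy data for y ≈ w0 + w1 x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

N = len(x)
# The 2x2 system derived above: [[N, sum x], [sum x, sum x^2]] [w0, w1] = [sum y, sum x y].
A = np.array([[N,       x.sum()],
              [x.sum(), (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

w0, w1 = np.linalg.solve(A, b)
print(w0, w1)   # intercept and slope of the least-squares line
```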
MLE Equivalence — The Probabilistic Justification
OLS says “minimise squared residuals” without saying why squared, as opposed to absolute error or any other power. The justification comes from a probabilistic model: assume

$$y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Then $p(y \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \mathcal{N}\big(y \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}), \sigma^2\big)$. The likelihood of the dataset is:

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}\big(y_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2\big)$$
Taking the log:

$$\ln p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\big)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln 2\pi$$
The only $\mathbf{w}$-dependent term is the sum of squared residuals, and it appears with a negative coefficient. So:
MLE = OLS under additive Gaussian noise
Under the assumption of i.i.d. Gaussian noise, the maximum-likelihood weights are exactly the OLS weights. This is the missing justification for “why squared error” — squared error isn’t arbitrary, it’s the negative log-likelihood of a Gaussian noise model (up to constants).
The MLE for $\sigma^2$ falls out of the same derivation: $\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} \big(y_n - \hat{\mathbf{w}}^\top \boldsymbol{\phi}(\mathbf{x}_n)\big)^2$, where $\hat{\mathbf{w}}$ is the OLS/MLE weight vector.
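A numerical check of this equivalence, sketched with NumPy/SciPy on a simulated dataset (the data, basis, and optimiser settings are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulate data that actually follows the assumed model: y = Phi w + Gaussian noise.
N = 200
x = rng.uniform(-2, 2, size=N)
Phi = np.column_stack([np.ones(N), x])              # basis: (1, x)
w_true, sigma_true = np.array([0.5, -1.3]), 0.4
y = Phi @ w_true + sigma_true * rng.standard_normal(N)

# OLS via the normal equation.
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Negative log-likelihood under i.i.d. Gaussian noise (dropping the 2*pi constant).
def neg_log_lik(params):
    w, log_sigma = params[:-1], params[-1]
    sigma2 = np.exp(2.0 * log_sigma)
    r = y - Phi @ w
    return 0.5 * np.sum(r**2) / sigma2 + 0.5 * N * np.log(sigma2)

res = minimize(neg_log_lik, x0=np.zeros(3))
w_mle, sigma2_mle = res.x[:-1], np.exp(2.0 * res.x[-1])

# Expected: MLE weights match OLS weights, and sigma^2 MLE is the mean squared residual.
print(np.allclose(w_mle, w_ols, atol=1e-3))
print(np.isclose(sigma2_mle, np.mean((y - Phi @ w_ols)**2), atol=1e-3))
```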
OLS makes no distributional assumption — but its justification does
The OLS objective itself is just “minimise squared residuals” — no statistics required. The probabilistic justification (Gaussian noise → squared loss is optimal) is layered on top. So you can use OLS without believing in Gaussian noise; you just lose the optimality argument.
What Could Go Wrong
- Ill-conditioned $\Phi^\top\Phi$. If columns of $\Phi$ are near-collinear (e.g., redundant features, a polynomial basis evaluated on a narrow range), the matrix $\Phi^\top\Phi$ is nearly singular. Inverting it is numerically unstable; small data changes swing the weights wildly. Use ridge regression or the pseudoinverse via SVD (see the sketch after this list).
- $M > N$. When the number of basis functions exceeds the number of examples, $\Phi^\top\Phi$ is rank-deficient and not invertible. Either reduce $M$, regularise, or use a method that doesn’t require inversion.
- Outliers. Squared loss heavily penalises large residuals. A single mislabelled point can shift the entire fit. Robust alternatives: MAE, Huber loss, RANSAC.
- Wrong noise model. If the true noise is heteroscedastic (variance depends on $\mathbf{x}$) or heavy-tailed, the MLE justification breaks down. OLS still minimises squared residuals, but is no longer the optimal estimator.
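A sketch of the ill-conditioning failure mode and the two standard fixes (the collinear toy data and the ridge strength are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two nearly collinear features: x2 is x1 plus tiny jitter.
N = 100
x1 = rng.standard_normal(N)
x2 = x1 + 1e-8 * rng.standard_normal(N)
Phi = np.column_stack([np.ones(N), x1, x2])
y = 3.0 + 2.0 * x1 + 0.1 * rng.standard_normal(N)

print(np.linalg.cond(Phi.T @ Phi))   # enormous condition number -> unstable inversion

# Fix 1: ridge regression, solve (Phi^T Phi + lambda I) w = Phi^T y.
lam = 1e-3
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)

# Fix 2: SVD-based pseudoinverse (what np.linalg.pinv and lstsq use internally).
w_pinv = np.linalg.pinv(Phi) @ y

print(w_ridge, w_pinv)   # both stay bounded instead of blowing up along the collinear direction
```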
Computational Cost
- Forming $\Phi^\top\Phi$: $O(NM^2)$.
- Inverting $\Phi^\top\Phi$ (or solving via Cholesky): $O(M^3)$.
- Total: $O(NM^2 + M^3)$.
For modest $M$ (up to a few thousand), OLS is extremely fast: one shot, no learning rate. For large $N$ or $M$, gradient descent or stochastic methods scale better, at the cost of iteration.
When OLS Beats Iterative Methods
- $M$ is small enough that $\Phi^\top\Phi$ fits in memory and inverts cheaply.
- The data is well-conditioned (no severe collinearity).
- You want exact, deterministic weights without learning-rate tuning.
When not to use OLS:
- $M$ is huge (e.g., kernel methods, where $M = N$) — the $M \times M$ matrix doesn’t fit.
- The design matrix is ill-conditioned — use ridge regression or SVD-based pseudoinverse.
- You need online updates — gradient descent on incoming data is more natural.
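For contrast, a minimal gradient-descent sketch for the same squared-error objective (the step size, iteration count, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Same kind of toy problem as above: y = Phi w + noise.
N = 1000
x = rng.uniform(-1, 1, size=N)
Phi = np.column_stack([np.ones(N), x])
y = Phi @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(N)

# Full-batch gradient descent on E(w) = ||y - Phi w||^2.
# Gradient: -2 Phi^T (y - Phi w); dividing by N keeps the step size scale-free in N.
w = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = -2.0 * Phi.T @ (y - Phi @ w) / N
    w -= lr * grad

w_exact = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(np.allclose(w, w_exact, atol=1e-6))   # True once converged: same minimiser, reached iteratively
```

The trade-off is exactly the one stated above: the iterative version needs a learning rate and many passes, but each step is cheap and works with streaming or mini-batch data.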
Connections
- linear-regression — the model OLS fits.
- design-matrix — the matrix that appears in the normal equation.
- maximum likelihood estimation — the principle that OLS implements under a Gaussian noise assumption.
- gaussian-distribution — the assumed noise distribution that makes MLE = OLS.
- gradient descent — iterative alternative when the closed form is impractical.
- hessian-matrix — for quadratic objectives the Hessian is constant, $\nabla^2 E = 2\,\Phi^\top\Phi$, so Newton’s method converges in one step (and that step is exactly the normal equation).
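A quick check of that last point (toy data assumed): one Newton step from $\mathbf{w} = \mathbf{0}$ lands exactly on the OLS solution.

```python
import numpy as np

rng = np.random.default_rng(4)
Phi = np.column_stack([np.ones(30), rng.standard_normal(30)])
y = Phi @ np.array([0.7, -0.2]) + 0.05 * rng.standard_normal(30)

# E(w) = ||y - Phi w||^2: gradient at w = 0 is -2 Phi^T y, Hessian is 2 Phi^T Phi.
w0 = np.zeros(2)
grad = -2.0 * Phi.T @ (y - Phi @ w0)
hess = 2.0 * Phi.T @ Phi

w_newton = w0 - np.linalg.solve(hess, grad)       # one Newton step
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # normal-equation solution

print(np.allclose(w_newton, w_ols))   # True
```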
Active Recall
Write the OLS solution in a single equation. What does $\Phi^\dagger$ stand for?
$\hat{\mathbf{w}} = \Phi^\dagger\mathbf{y}$, where $\Phi^\dagger = (\Phi^\top\Phi)^{-1}\Phi^\top$ is the Moore–Penrose pseudoinverse of $\Phi$.
Why is "minimise squared error" a reasonable choice of objective?
Under the assumption that observations are corrupted by additive i.i.d. Gaussian noise, the MLE for the regression weights is exactly the squared-error minimiser. So squared loss is the negative log-likelihood (up to constants) of a Gaussian noise model — minimising it is principled, not arbitrary.
True or false: OLS requires an assumption about the noise distribution.
False. OLS is purely an optimisation criterion (“minimise sum of squared residuals”) — no distributional assumption. The MLE-equivalence justification requires Gaussian noise, but you can still apply OLS without believing it.
When does the normal equation fail?
When $\Phi^\top\Phi$ is singular or ill-conditioned. This happens with collinear features, redundant basis functions, or when there are more features than examples ($M > N$). Mitigations: drop redundant features, regularise (ridge), or use the SVD-based pseudoinverse.