The fitting criterion for linear regression: choose the weights that minimise the sum of squared residuals between predictions and observed targets. Yields a closed-form solution via the normal equation $\Phi^\top \Phi \, \mathbf{w} = \Phi^\top \mathbf{t}$, computable in one matrix inversion, no iterations.

The Objective

Given training data $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$ and a linear model $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$, define the residual for example $n$ as:

$$r_n = t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)$$
OLS picks the $\mathbf{w}$ that minimises the sum of squared residuals:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2$$
The objective is convex and quadratic in $\mathbf{w}$, so its unique minimum is found by setting $\nabla_{\mathbf{w}} E(\mathbf{w}) = 0$.
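
A minimal sketch of this objective in NumPy, assuming a polynomial feature map as the basis; the data, degree, and candidate weights are illustrative, not fixed by the notes above:

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial basis: row n is (1, x_n, x_n^2, ..., x_n^degree)."""
    return np.vander(x, degree + 1, increasing=True)

def sum_squared_residuals(w, Phi, t):
    """E(w) = 1/2 * sum_n (t_n - w . phi(x_n))^2."""
    r = t - Phi @ w
    return 0.5 * np.dot(r, r)

# Illustrative data: the objective is smaller for weights closer to the trend.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])
Phi = design_matrix(x, degree=1)
print(sum_squared_residuals(np.array([1.0, 2.0]), Phi, t))  # near the trend
print(sum_squared_residuals(np.array([0.0, 0.0]), Phi, t))  # far from it
```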

Deriving the Normal Equation

Stack the basis-function evaluations into the design matrix $\Phi$ (an $N \times M$ matrix whose $n$-th row is $\boldsymbol{\phi}(\mathbf{x}_n)^\top$). Then:

$$E(\mathbf{w}) = \frac{1}{2} (\mathbf{t} - \Phi \mathbf{w})^\top (\mathbf{t} - \Phi \mathbf{w}) = \frac{1}{2} \|\mathbf{t} - \Phi \mathbf{w}\|^2$$
Differentiate w.r.t. $\mathbf{w}$ and set to zero:

$$\nabla_{\mathbf{w}} E(\mathbf{w}) = -\Phi^\top (\mathbf{t} - \Phi \mathbf{w}) = 0$$
Rearranging gives the normal equation:

$$\Phi^\top \Phi \, \mathbf{w} = \Phi^\top \mathbf{t}$$
Solving (assuming $\Phi^\top \Phi$ is invertible):

$$\mathbf{w}_{\text{OLS}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}$$
The matrix $\Phi^\dagger = (\Phi^\top \Phi)^{-1} \Phi^\top$ is the Moore–Penrose pseudoinverse of $\Phi$ (when $\Phi$ has full column rank). So OLS is just $\mathbf{w}_{\text{OLS}} = \Phi^\dagger \mathbf{t}$.
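
A sketch of the closed-form solve in NumPy, assuming nothing beyond the formulas above; solving the normal equations directly and applying `np.linalg.pinv` should agree when $\Phi^\top \Phi$ is well-conditioned:

```python
import numpy as np

def ols_normal_equation(Phi, t):
    """Solve (Phi^T Phi) w = Phi^T t directly (assumes Phi^T Phi is invertible)."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def ols_pseudoinverse(Phi, t):
    """w = pinv(Phi) @ t via the SVD-based Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(Phi) @ t

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 2.0 - 3.0 * x + 0.1 * rng.standard_normal(50)
Phi = np.vander(x, 2, increasing=True)   # columns: 1, x

print(ols_normal_equation(Phi, t))       # both should be close to [2, -3]
print(ols_pseudoinverse(Phi, t))
```

In practice an SVD-based routine such as `np.linalg.lstsq` is the safer default; the explicit normal-equation solve is shown here only to mirror the derivation.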

Worked Derivation (Degree-1 Polynomial)

Take $y(x, \mathbf{w}) = w_0 + w_1 x$. The objective is:

$$E(w_0, w_1) = \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - w_0 - w_1 x_n \bigr)^2$$
Setting $\partial E / \partial w_0 = 0$ and $\partial E / \partial w_1 = 0$ gives a linear system:

$$\sum_{n=1}^{N} (t_n - w_0 - w_1 x_n) = 0, \qquad \sum_{n=1}^{N} x_n (t_n - w_0 - w_1 x_n) = 0$$
In matrix form:

$$\begin{pmatrix} N & \sum_n x_n \\ \sum_n x_n & \sum_n x_n^2 \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \begin{pmatrix} \sum_n t_n \\ \sum_n x_n t_n \end{pmatrix}$$
Inverting the $2 \times 2$ matrix gives the closed-form solution. The general matrix-form derivation uses exactly the same logic, just with $\Phi^\top \Phi$ in place of the $2 \times 2$ matrix.
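
A sketch of the degree-1 case in NumPy, building the $2 \times 2$ system from the sums above; the comparison against `np.polyfit` is just a sanity check, not part of the derivation:

```python
import numpy as np

def fit_line_normal_equations(x, t):
    """Solve [[N, sum x], [sum x, sum x^2]] [w0, w1]^T = [sum t, sum x*t]^T."""
    N = len(x)
    A = np.array([[N, x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    b = np.array([t.sum(), (x * t).sum()])
    return np.linalg.solve(A, b)   # (w0, w1)

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=30)
t = 1.5 + 0.8 * x + 0.2 * rng.standard_normal(30)

w0, w1 = fit_line_normal_equations(x, t)
# np.polyfit returns the highest-degree coefficient first, so reverse for comparison.
print((w0, w1), np.polyfit(x, t, 1)[::-1])
```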

MLE Equivalence — The Probabilistic Justification

OLS picks “minimise squared residuals” without saying why squared, as opposed to absolute error or any other power. The justification comes from a probabilistic model: assume

$$t_n = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.}$$
Then $t_n \sim \mathcal{N}\bigl( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2 \bigr)$. The likelihood of the dataset is:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}\bigl( t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2 \bigr)$$
Taking the log:

$$\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln 2\pi$$
The only $\mathbf{w}$-dependent term is the sum of squared residuals, and it appears with a negative coefficient. So:

$$\mathbf{w}_{\text{ML}} = \arg\max_{\mathbf{w}} \, \ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2 = \mathbf{w}_{\text{OLS}}$$
MLE = OLS under additive Gaussian noise

Under the assumption of i.i.d. Gaussian noise, the maximum-likelihood weights are exactly the OLS weights. This is the missing justification for “why squared error” — squared error isn’t arbitrary, it’s the negative log-likelihood of a Gaussian noise model (up to constants).

The MLE for the noise variance $\sigma^2$ falls out of the same derivation: $\sigma^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} r_n^2$, where $r_n = t_n - \mathbf{w}_{\text{ML}}^\top \boldsymbol{\phi}(\mathbf{x}_n)$.
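
A sketch to check the equivalence numerically, assuming SciPy is available: minimise the Gaussian negative log-likelihood directly and compare against the normal-equation weights, then report the variance MLE. The basis, data, and parameterisation are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=200)
Phi = np.vander(x, 3, increasing=True)   # basis: 1, x, x^2
t = Phi @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.standard_normal(200)

def neg_log_likelihood(params):
    """Gaussian NLL in (w, log sigma); log-parameterised so sigma stays positive."""
    w, log_sigma = params[:-1], params[-1]
    sigma2 = np.exp(2.0 * log_sigma)
    r = t - Phi @ w
    return 0.5 * np.sum(r ** 2) / sigma2 + 0.5 * len(t) * np.log(2.0 * np.pi * sigma2)

w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
res = minimize(neg_log_likelihood, x0=np.zeros(Phi.shape[1] + 1))

print(w_ols)                               # normal-equation weights
print(res.x[:-1])                          # MLE weights, should match closely
print(np.mean((t - Phi @ w_ols) ** 2))     # sigma^2_ML = mean squared residual
```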

OLS makes no distributional assumption — but its justification does

The OLS objective itself is just “minimise squared residuals” — no statistics required. The probabilistic justification (Gaussian noise → squared loss is optimal) is layered on top. So you can use OLS without believing in Gaussian noise; you just lose the optimality argument.

What Could Go Wrong

  • Ill-conditioned $\Phi^\top \Phi$. If the columns of $\Phi$ are near-collinear (e.g., redundant features, a polynomial basis evaluated on a narrow range), the matrix $\Phi^\top \Phi$ is nearly singular. Inverting it is numerically unstable; small changes in the data swing the weights wildly. Use ridge regression or an SVD-based pseudoinverse (see the conditioning sketch after this list).
  • $M > N$. When the number of basis functions exceeds the number of examples, $\Phi^\top \Phi$ is rank-deficient and not invertible. Either reduce $M$, regularise, or use a method that doesn’t require inversion.
  • Outliers. Squared loss heavily penalises large residuals. A single mislabelled point can shift the entire fit. Robust alternatives: MAE, Huber loss, RANSAC.
  • Wrong noise model. If the true noise is heteroscedastic (variance depends on $\mathbf{x}$) or heavy-tailed, the MLE justification breaks down. OLS still minimises squared residuals, but is no longer the optimal estimator.
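
A sketch of the conditioning failure mode, assuming NumPy: a high-degree polynomial basis on a narrow input range makes $\Phi^\top \Phi$ nearly singular, the explicit inverse becomes fragile, and an SVD-based solve or a small ridge penalty stabilises things. The data, degree, and penalty are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(4.9, 5.1, size=40)                # narrow input range -> near-collinear columns
Phi = np.vander(x, 7, increasing=True)            # degree-6 polynomial basis
t = np.sin(x) + 0.01 * rng.standard_normal(40)

A = Phi.T @ Phi
print("condition number of Phi^T Phi:", np.linalg.cond(A))

w_naive = np.linalg.inv(A) @ Phi.T @ t            # explicit inverse: numerically fragile
w_svd = np.linalg.lstsq(Phi, t, rcond=None)[0]    # SVD-based least squares

lam = 1e-6                                        # small ridge penalty stabilises the solve
w_ridge = np.linalg.solve(A + lam * np.eye(A.shape[0]), Phi.T @ t)

for name, w in [("naive inverse", w_naive), ("lstsq (SVD)", w_svd), ("ridge", w_ridge)]:
    print(name, "residual norm:", np.linalg.norm(t - Phi @ w))
```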

Computational Cost

  • Forming $\Phi^\top \Phi$: $O(N M^2)$.
  • Inverting $\Phi^\top \Phi$ (or solving via Cholesky): $O(M^3)$.
  • Total: $O(N M^2 + M^3)$.

For modest $M$ (≤ a few thousand), OLS is extremely fast: one shot, no learning rate. For large $M$ or $N$, gradient descent or stochastic methods scale better, at the cost of iteration.
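
For contrast, a sketch of the iterative alternative: full-batch gradient descent on the squared loss. The step size and iteration count are arbitrary illustrative choices, not prescribed values:

```python
import numpy as np

def ols_gradient_descent(Phi, t, lr=0.01, n_iters=5000):
    """Minimise 1/2 ||t - Phi w||^2 by gradient descent; the gradient is -Phi^T (t - Phi w)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        w -= lr * (-Phi.T @ (t - Phi @ w))
    return w

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=100)
Phi = np.vander(x, 2, increasing=True)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(100)

print(ols_gradient_descent(Phi, t))              # iterative estimate
print(np.linalg.solve(Phi.T @ Phi, Phi.T @ t))   # closed form, for comparison
```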

When OLS Beats Iterative Methods

  • $M$ is small enough that $\Phi^\top \Phi$ fits in memory and inverts cheaply.
  • The data is well-conditioned (no severe collinearity).
  • You want exact, deterministic weights without learning-rate tuning.

When not to use OLS:

  • $M$ is huge (e.g., kernel methods, where $M = N$): the $M \times M$ matrix $\Phi^\top \Phi$ doesn’t fit.
  • The design matrix is ill-conditioned — use ridge regression or SVD-based pseudoinverse.
  • You need online updates — gradient descent on incoming data is more natural.

Connections

Active Recall