The fitting criterion for linear regression: choose the weights that minimise the sum of squared residuals between predictions and observed targets. Yields a closed-form solution via the normal equation $\Phi^\top \Phi \, \mathbf{w} = \Phi^\top \mathbf{t}$, computable in one matrix inversion, no iterations.

The Objective

Given training data $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$ and a linear model $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$, define the residual for example $n$ as:

$$r_n = t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)$$
OLS picks the $\mathbf{w}$ that minimises the sum of squared residuals:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2$$
The objective is convex and quadratic in $\mathbf{w}$, so its unique minimum is found by setting $\nabla_{\mathbf{w}} E(\mathbf{w}) = 0$.
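
A minimal sketch of this objective in NumPy, assuming a polynomial feature map as the basis; the data, degree, and candidate weights are illustrative, not fixed by the notes above:

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial basis: row n is (1, x_n, x_n^2, ..., x_n^degree)."""
    return np.vander(x, degree + 1, increasing=True)

def sum_squared_residuals(w, Phi, t):
    """E(w) = 1/2 * sum_n (t_n - w . phi(x_n))^2."""
    r = t - Phi @ w
    return 0.5 * np.dot(r, r)

# Illustrative data: the objective is smaller for weights closer to the trend.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])
Phi = design_matrix(x, degree=1)
print(sum_squared_residuals(np.array([1.0, 2.0]), Phi, t))  # near the trend
print(sum_squared_residuals(np.array([0.0, 0.0]), Phi, t))  # far from it
```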

Deriving the Normal Equation

Stack the basis-function evaluations into the design matrix $\Phi$ (an $N \times M$ matrix whose $n$-th row is $\boldsymbol{\phi}(\mathbf{x}_n)^\top$). Then:

$$E(\mathbf{w}) = \frac{1}{2} (\mathbf{t} - \Phi \mathbf{w})^\top (\mathbf{t} - \Phi \mathbf{w}) = \frac{1}{2} \|\mathbf{t} - \Phi \mathbf{w}\|^2$$
Differentiate w.r.t. $\mathbf{w}$ and set to zero:

$$\nabla_{\mathbf{w}} E(\mathbf{w}) = -\Phi^\top (\mathbf{t} - \Phi \mathbf{w}) = 0$$
Rearranging gives the normal equation:

$$\Phi^\top \Phi \, \mathbf{w} = \Phi^\top \mathbf{t}$$
Solving (assuming $\Phi^\top \Phi$ is invertible):

$$\mathbf{w}_{\text{OLS}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}$$
The matrix $\Phi^\dagger = (\Phi^\top \Phi)^{-1} \Phi^\top$ is the Moore–Penrose pseudoinverse of $\Phi$ (when $\Phi$ has full column rank). So OLS is just $\mathbf{w}_{\text{OLS}} = \Phi^\dagger \mathbf{t}$.
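
A sketch of the closed-form solve in NumPy, assuming nothing beyond the formulas above; solving the normal equations directly and applying `np.linalg.pinv` should agree when $\Phi^\top \Phi$ is well-conditioned:

```python
import numpy as np

def ols_normal_equation(Phi, t):
    """Solve (Phi^T Phi) w = Phi^T t directly (assumes Phi^T Phi is invertible)."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def ols_pseudoinverse(Phi, t):
    """w = pinv(Phi) @ t via the SVD-based Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(Phi) @ t

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 2.0 - 3.0 * x + 0.1 * rng.standard_normal(50)
Phi = np.vander(x, 2, increasing=True)   # columns: 1, x

print(ols_normal_equation(Phi, t))       # both should be close to [2, -3]
print(ols_pseudoinverse(Phi, t))
```

In practice an SVD-based routine such as `np.linalg.lstsq` is the safer default; the explicit normal-equation solve is shown here only to mirror the derivation.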

Worked Derivation (Degree-1 Polynomial)

Take $y(x, \mathbf{w}) = w_0 + w_1 x$. The objective is:

$$E(w_0, w_1) = \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - w_0 - w_1 x_n \bigr)^2$$
Setting $\partial E / \partial w_0 = 0$ and $\partial E / \partial w_1 = 0$ gives a linear system:

$$\sum_{n=1}^{N} (t_n - w_0 - w_1 x_n) = 0, \qquad \sum_{n=1}^{N} x_n (t_n - w_0 - w_1 x_n) = 0$$
In matrix form:

$$\begin{pmatrix} N & \sum_n x_n \\ \sum_n x_n & \sum_n x_n^2 \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \begin{pmatrix} \sum_n t_n \\ \sum_n x_n t_n \end{pmatrix}$$
Inverting the $2 \times 2$ matrix gives the closed-form solution. The general matrix-form derivation uses exactly the same logic, just with $\Phi^\top \Phi$ in place of the $2 \times 2$ matrix.
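
A sketch of the degree-1 case in NumPy, building the $2 \times 2$ system from the sums above; the comparison against `np.polyfit` is just a sanity check, not part of the derivation:

```python
import numpy as np

def fit_line_normal_equations(x, t):
    """Solve [[N, sum x], [sum x, sum x^2]] [w0, w1]^T = [sum t, sum x*t]^T."""
    N = len(x)
    A = np.array([[N, x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    b = np.array([t.sum(), (x * t).sum()])
    return np.linalg.solve(A, b)   # (w0, w1)

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=30)
t = 1.5 + 0.8 * x + 0.2 * rng.standard_normal(30)

w0, w1 = fit_line_normal_equations(x, t)
# np.polyfit returns the highest-degree coefficient first, so reverse for comparison.
print((w0, w1), np.polyfit(x, t, 1)[::-1])
```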

MLE Equivalence — The Probabilistic Justification

OLS picks “minimise squared residuals” without saying why squared, as opposed to absolute error or any other power. The justification comes from a probabilistic model: assume

$$t_n = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.}$$
Then $t_n \sim \mathcal{N}\bigl( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2 \bigr)$. The likelihood of the dataset is:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}\bigl( t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \sigma^2 \bigr)$$
Taking the log:

$$\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln 2\pi$$
The only $\mathbf{w}$-dependent term is the sum of squared residuals, and it appears with a negative coefficient. So:

$$\mathbf{w}_{\text{ML}} = \arg\max_{\mathbf{w}} \, \ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2 = \mathbf{w}_{\text{OLS}}$$
MLE = OLS under additive Gaussian noise

Under the assumption of i.i.d. Gaussian noise, the maximum-likelihood weights are exactly the OLS weights. This is the missing justification for “why squared error” — squared error isn’t arbitrary, it’s the negative log-likelihood of a Gaussian noise model (up to constants).

The MLE for the noise variance $\sigma^2$ falls out of the same derivation: $\sigma^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} r_n^2$, where $r_n = t_n - \mathbf{w}_{\text{ML}}^\top \boldsymbol{\phi}(\mathbf{x}_n)$.
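
A sketch to check the equivalence numerically, assuming SciPy is available: minimise the Gaussian negative log-likelihood directly and compare against the normal-equation weights, then report the variance MLE. The basis, data, and parameterisation are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=200)
Phi = np.vander(x, 3, increasing=True)   # basis: 1, x, x^2
t = Phi @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.standard_normal(200)

def neg_log_likelihood(params):
    """Gaussian NLL in (w, log sigma); log-parameterised so sigma stays positive."""
    w, log_sigma = params[:-1], params[-1]
    sigma2 = np.exp(2.0 * log_sigma)
    r = t - Phi @ w
    return 0.5 * np.sum(r ** 2) / sigma2 + 0.5 * len(t) * np.log(2.0 * np.pi * sigma2)

w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
res = minimize(neg_log_likelihood, x0=np.zeros(Phi.shape[1] + 1))

print(w_ols)                               # normal-equation weights
print(res.x[:-1])                          # MLE weights, should match closely
print(np.mean((t - Phi @ w_ols) ** 2))     # sigma^2_ML = mean squared residual
```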

OLS makes no distributional assumption — but its justification does

The OLS objective itself is just “minimise squared residuals” — no statistics required. The probabilistic justification (Gaussian noise → squared loss is optimal) is layered on top. So you can use OLS without believing in Gaussian noise; you just lose the optimality argument.

What Could Go Wrong

  • Ill-conditioned $\Phi^\top \Phi$. If the columns of $\Phi$ are near-collinear (e.g., redundant features, a polynomial basis evaluated on a narrow range), the matrix $\Phi^\top \Phi$ is nearly singular. Inverting it is numerically unstable; small changes in the data swing the weights wildly. Use ridge regression or an SVD-based pseudoinverse (see the conditioning sketch after this list).
  • $M > N$. When the number of basis functions exceeds the number of examples, $\Phi^\top \Phi$ is rank-deficient and not invertible. Either reduce $M$, regularise, or use a method that doesn’t require inversion.
  • Outliers. Squared loss heavily penalises large residuals. A single mislabelled point can shift the entire fit. Robust alternatives: MAE, Huber loss, RANSAC.
  • Wrong noise model. If the true noise is heteroscedastic (variance depends on $\mathbf{x}$) or heavy-tailed, the MLE justification breaks down. OLS still minimises squared residuals, but is no longer the optimal estimator.
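
A sketch of the conditioning failure mode, assuming NumPy: a high-degree polynomial basis on a narrow input range makes $\Phi^\top \Phi$ nearly singular, the explicit inverse becomes fragile, and an SVD-based solve or a small ridge penalty stabilises things. The data, degree, and penalty are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(4.9, 5.1, size=40)                # narrow input range -> near-collinear columns
Phi = np.vander(x, 7, increasing=True)            # degree-6 polynomial basis
t = np.sin(x) + 0.01 * rng.standard_normal(40)

A = Phi.T @ Phi
print("condition number of Phi^T Phi:", np.linalg.cond(A))

w_naive = np.linalg.inv(A) @ Phi.T @ t            # explicit inverse: numerically fragile
w_svd = np.linalg.lstsq(Phi, t, rcond=None)[0]    # SVD-based least squares

lam = 1e-6                                        # small ridge penalty stabilises the solve
w_ridge = np.linalg.solve(A + lam * np.eye(A.shape[0]), Phi.T @ t)

for name, w in [("naive inverse", w_naive), ("lstsq (SVD)", w_svd), ("ridge", w_ridge)]:
    print(name, "residual norm:", np.linalg.norm(t - Phi @ w))
```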

Computational Cost

  • Forming $\Phi^\top \Phi$: $O(N M^2)$.
  • Inverting $\Phi^\top \Phi$ (or solving via Cholesky): $O(M^3)$.
  • Total: $O(N M^2 + M^3)$.

For modest $M$ (≤ a few thousand), OLS is extremely fast: one shot, no learning rate. For large $M$ or $N$, gradient descent or stochastic methods scale better, at the cost of iteration.
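
For contrast, a sketch of the iterative alternative: full-batch gradient descent on the squared loss. The step size and iteration count are arbitrary illustrative choices, not prescribed values:

```python
import numpy as np

def ols_gradient_descent(Phi, t, lr=0.01, n_iters=5000):
    """Minimise 1/2 ||t - Phi w||^2 by gradient descent; the gradient is -Phi^T (t - Phi w)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        w -= lr * (-Phi.T @ (t - Phi @ w))
    return w

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=100)
Phi = np.vander(x, 2, increasing=True)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(100)

print(ols_gradient_descent(Phi, t))              # iterative estimate
print(np.linalg.solve(Phi.T @ Phi, Phi.T @ t))   # closed form, for comparison
```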

When OLS Beats Iterative Methods

  • $M$ is small enough that $\Phi^\top \Phi$ fits in memory and inverts cheaply.
  • The data is well-conditioned (no severe collinearity).
  • You want exact, deterministic weights without learning-rate tuning.

When not to use OLS:

  • $M$ is huge (e.g., kernel methods, where $M = N$): the $M \times M$ matrix $\Phi^\top \Phi$ doesn’t fit.
  • The design matrix is ill-conditioned — use ridge regression or SVD-based pseudoinverse.
  • You need online updates — gradient descent on incoming data is more natural.

Connections

Active Recall