THE CRUX: We've spent six weeks on classification — logistic regression, hard- and soft-margin SVMs. This week pivots to regression: predicting continuous outputs. Linear regression sounds trivial — fit a line — but two questions sit underneath. (1) How do we actually compute the answer? (2) Why do we use squared error and not absolute error or anything else?
The first answer is Ordinary Least Squares (OLS): minimise the sum of squared residuals, which has a closed-form solution via the normal equation $\mathbf{w}^* = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$. One matrix inversion, no iterations, no learning rate. The second answer is the probabilistic interpretation: assume $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and the MLE for $\mathbf{w}$ is exactly the OLS solution. Squared error isn’t arbitrary — it’s the negative log-likelihood of a Gaussian noise model, up to constants. This is the regression analog of week 2’s logistic-regression-as-MLE result: the noise model determines the loss.
Part 1: Linear Regression
From Classification to Regression
For six weeks we’ve been doing classification: the target $y$ ranges over a finite set of categories. This week, $y \in \mathbb{R}$ — the output is continuous. Examples:
- BMI → cardiac ejection fraction
- Customer features → recommended credit line (in dollars, not yes/no)
- Ad spend → revenue
The hypothesis form looks identical to logistic regression’s pre-sigmoid output:

$$\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$
But there’s no sigmoid, no threshold. The output is the prediction. What changes is the loss function and how we fit.
The Model
Linear regression assumes:

$$y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon = \sum_{j=0}^{M-1} w_j\, \phi_j(\mathbf{x}) + \epsilon$$

where $\boldsymbol{\phi}(\mathbf{x}) = \big(\phi_0(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x})\big)^\top$ is a vector of basis functions (with $\phi_0(\mathbf{x}) = 1$ for the intercept). The “linear” in linear regression refers to linearity in the parameters $\mathbf{w}$ — not in $\mathbf{x}$. The following are all linear regression models:

$$\hat{y} = w_0 + w_1 x, \qquad \hat{y} = w_0 + w_1 x + w_2 x^2, \qquad \hat{y} = w_0 + w_1 \sin(x) + w_2 e^{-x^2}$$

What unites them: the prediction depends linearly on $\mathbf{w}$ alone, no matter how non-linear it is in $\mathbf{x}$. This is what OLS exploits to give a closed-form solution — no matter how curvy the basis functions, the fitting problem is linear-algebraic.
Ordinary Least Squares
Define the residual $r_i = y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)$ and the objective:

$$E(\mathbf{w}) = \sum_{i=1}^{N} r_i^2 = \sum_{i=1}^{N} \big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2$$

OLS picks the $\mathbf{w}$ that minimises this. Setting $\nabla_{\mathbf{w}} E(\mathbf{w}) = \mathbf{0}$ gives a linear system; collecting it in matrix form using the design matrix $\boldsymbol{\Phi} \in \mathbb{R}^{N \times M}$ with $\Phi_{ij} = \phi_j(\mathbf{x}_i)$:

$$\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\, \mathbf{w} = \boldsymbol{\Phi}^\top \mathbf{y}$$

Solving:

$$\mathbf{w}^* = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y} = \boldsymbol{\Phi}^{+} \mathbf{y}$$

where $\boldsymbol{\Phi}^{+} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top$ is the Moore–Penrose pseudoinverse. One matrix inversion gives the exact answer.
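A minimal NumPy sketch of the normal-equation solve on synthetic data (the coefficients 2 and 3 are made up for illustration); the pseudoinverse route gives the same answer and is the numerically safer default:

```python
import numpy as np

# Synthetic data (illustrative): y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)

# Design matrix Phi with a constant column for the intercept
Phi = np.column_stack([np.ones_like(x), x])

# Normal equation: w* = (Phi^T Phi)^{-1} Phi^T y
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Pseudoinverse route -- same solution, better behaved numerically
w_pinv = np.linalg.pinv(Phi) @ y

print(w_normal)   # close to [2, 3]
print(w_pinv)
```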
Worked Example — Degree-1 Polynomial
For $\hat{y} = w_0 + w_1 x$, expanding $E(w_0, w_1) = \sum_{i=1}^{N} (y_i - w_0 - w_1 x_i)^2$ and setting $\partial E / \partial w_0 = \partial E / \partial w_1 = 0$ gives:

$$\sum_{i=1}^{N} y_i = N w_0 + w_1 \sum_{i=1}^{N} x_i, \qquad \sum_{i=1}^{N} x_i y_i = w_0 \sum_{i=1}^{N} x_i + w_1 \sum_{i=1}^{N} x_i^2$$

In matrix form:

$$\begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}$$

Inverting the $2 \times 2$ matrix gives $w_0$ and $w_1$.
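A quick numerical check of the $2 \times 2$ system, assuming a tiny made-up dataset and cross-checking against `np.polyfit`:

```python
import numpy as np

# Small synthetic dataset (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
N = len(x)

# The 2x2 normal-equation system for y_hat = w0 + w1 * x
A = np.array([[N,        x.sum()],
              [x.sum(),  (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
w0, w1 = np.linalg.solve(A, b)

# Cross-check against np.polyfit (degree 1 returns [w1, w0])
w1_ref, w0_ref = np.polyfit(x, y, deg=1)
print(w0, w1)
print(w0_ref, w1_ref)   # should agree
```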
Basis Expansion — The Bridge to Non-Linear
This is the same trick we used for SVMs in week 4: replace $\mathbf{x}$ with $\boldsymbol{\phi}(\mathbf{x})$ to gain non-linear capacity while staying within linear-fitting machinery. The model is “linear in $\phi$-space” but bends in input space.
Common basis families:
| Basis | Form | When |
|---|---|---|
| Polynomial | $\phi_j(x) = x^j$ | Low-degree smooth |
| Gaussian RBF | $\phi_j(x) = \exp\!\big(-\tfrac{(x - \mu_j)^2}{2 s^2}\big)$ | Local bumps |
| Sigmoidal | $\phi_j(x) = \sigma\!\big(\tfrac{x - \mu_j}{s}\big)$ | Saturation |
| tanh | $\phi_j(x) = \tanh\!\big(\tfrac{x - \mu_j}{s}\big)$ | Symmetric saturation |
Higher polynomial degree → tighter fit but more risk of overfitting. The classic plot: a degree-1 fit on 10 points gives a line that misses local structure; a degree-6 fit hits every point but oscillates wildly between them. Validation (covered in the coming weeks) is how we pick the right degree; the sketch below shows the trade-off numerically.
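A short sketch of the degree trade-off using a polynomial basis on a handful of noisy points (the data and degrees are illustrative): training error keeps falling with degree, which is exactly why validation, not training error, must choose it.

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

# Noisy samples of a smooth function (illustrative)
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

for degree in (1, 3, 6):
    Phi = poly_design(x, degree)
    w = np.linalg.pinv(Phi) @ y                 # OLS in phi-space
    train_mse = np.mean((Phi @ w - y) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}")

# Training MSE falls monotonically with degree; it is held-out (validation)
# error that eventually rises again -- that rise is the overfitting signal.
```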
Iterative Alternative — Gradient Descent
OLS fails when $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is too large to invert (millions of features) or near-singular (collinear features). Then gradient descent handles the same objective iteratively:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\, \nabla_{\mathbf{w}} E\big(\mathbf{w}^{(t)}\big)$$

where $\eta$ is the learning rate and $\nabla_{\mathbf{w}} E(\mathbf{w}) = -2\, \boldsymbol{\Phi}^\top (\mathbf{y} - \boldsymbol{\Phi} \mathbf{w})$. Convex objective → guaranteed convergence to the global optimum, but it requires learning-rate tuning.
The gradient-descent example from the lecture worked through a single step numerically: evaluate the gradient at the chosen starting point, step against it with the chosen learning rate, and confirm that the function value drops. The step improved the objective.
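A minimal sketch of gradient descent on the same least-squares objective, with an illustrative learning rate and synthetic data; the closed-form solution is printed alongside for comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100)
y = 1.0 - 2.0 * x + rng.normal(0, 0.1, size=100)
Phi = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)     # initial weights
eta = 0.05          # learning rate (illustrative; needs tuning in general)

for t in range(500):
    residual = y - Phi @ w
    grad = -2.0 * Phi.T @ residual / len(y)   # gradient of the mean squared error
    w = w - eta * grad

print(w)                                      # close to the OLS solution
print(np.linalg.pinv(Phi) @ y)                # closed-form answer for comparison
```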
Part 2: The Probabilistic View — Why Squared Error?
The Question OLS Doesn’t Answer
OLS says “minimise squared residuals.” But why squared? Why not absolute residuals, or fourth-power residuals? The OLS objective itself doesn’t say. We need a separate argument.
The answer comes from probabilistic modelling. Assume the targets are generated by:

$$y_i = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.}$$
That is: the underlying relationship is linear (in $\phi$-space), but observations are corrupted by additive Gaussian noise.
MLE for the Regression Weights
Under this model, $y_i \mid \mathbf{x}_i \sim \mathcal{N}\big(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i),\, \sigma^2\big)$. The likelihood of the dataset is:

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}\big(y_i \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i),\, \sigma^2\big)$$

Taking the log:

$$\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2$$

The only $\mathbf{w}$-dependent term is the sum of squared residuals, with a negative coefficient. So:

$$\mathbf{w}_{\text{MLE}} = \arg\max_{\mathbf{w}} \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \arg\min_{\mathbf{w}} \sum_{i=1}^{N}\big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2 = \mathbf{w}_{\text{OLS}}$$
MLE = OLS under additive Gaussian noise
The maximum-likelihood weights are identical to the OLS weights, when noise is i.i.d. Gaussian. This is the missing justification: squared error isn’t an arbitrary choice of loss — it’s the negative log-likelihood of a Gaussian noise model. Picking any other loss is implicitly assuming a different noise distribution.
The MLE for $\sigma^2$ falls out of the same derivation: $\sigma^2_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N} r_i^2$, where $r_i = y_i - \mathbf{w}_{\text{MLE}}^\top \boldsymbol{\phi}(\mathbf{x}_i)$.
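A quick numerical sanity check of this equivalence, assuming synthetic data and using `scipy.optimize.minimize` to maximise the Gaussian log-likelihood directly (a sketch, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=80)
y = 0.5 + 1.5 * x + rng.normal(0, 0.3, size=80)
Phi = np.column_stack([np.ones_like(x), x])

def neg_log_likelihood(params):
    """Gaussian NLL in (w, log sigma); the log keeps sigma positive."""
    w, log_sigma = params[:2], params[2]
    sigma2 = np.exp(2 * log_sigma)
    r = y - Phi @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + 0.5 * (r @ r) / sigma2

result = minimize(neg_log_likelihood, x0=np.zeros(3))
w_mle, sigma_mle = result.x[:2], np.exp(result.x[2])

w_ols = np.linalg.pinv(Phi) @ y
print(w_mle, w_ols)                                      # agree up to optimiser tolerance
print(sigma_mle**2, np.mean((y - Phi @ w_ols) ** 2))     # sigma^2 MLE = mean squared residual
```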
The Pattern Connecting Week 2 and Week 7
This is the same recipe we saw in week 2 for logistic-regression:
| Model | Noise / output model | MLE objective | Reduces to |
|---|---|---|---|
| Linear regression | $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ | Maximise Gaussian likelihood | Minimise squared error (= OLS) |
| Logistic regression | $y \mid \mathbf{x} \sim \text{Bernoulli}\big(\sigma(\mathbf{w}^\top \mathbf{x})\big)$ | Maximise Bernoulli likelihood | Minimise cross-entropy |
Same recipe (negative log-likelihood as loss), different distribution → different loss function. The “right” loss is determined by what noise/output model you’re willing to assume.
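To make the parallel explicit, the two negative log-likelihoods (dropping terms that do not depend on $\mathbf{w}$) are:

$$-\log p(\mathbf{y} \mid \mathbf{w}) \;\propto\; \sum_{i=1}^{N} \big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2 \quad \text{(Gaussian noise)}$$

$$-\log p(\mathbf{y} \mid \mathbf{w}) \;=\; -\sum_{i=1}^{N} \Big[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big], \quad p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i) \quad \text{(Bernoulli output)}$$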
Frequentist vs Bayesian — Setting Up Future Weeks
The MLE view treats $\mathbf{w}$ as a fixed unknown to be point-estimated. The Bayesian view (preview for later weeks) treats $\mathbf{w}$ as a random variable with a prior distribution; observed data updates this to a posterior via Bayes’ law:

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$$
Posterior ∝ likelihood × prior.
Under a Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$, the MAP estimate turns out to be ridge regression — adding $\lambda \lVert \mathbf{w} \rVert^2$ to the OLS objective. So L2 regularisation has a Bayesian interpretation: it’s a Gaussian prior on the weights. The “regularisation” lever and the “prior belief” lever are the same lever, viewed differently.
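A minimal NumPy sketch of the ridge / MAP estimate under these assumptions; the dataset, polynomial degree, and $\lambda$ are illustrative choices rather than values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)
Phi = np.vander(x, N=10, increasing=True)     # degree-9 polynomial basis

lam = 1e-3                                    # illustrative regularisation strength
M = Phi.shape[1]

# Ridge / MAP estimate: (Phi^T Phi + lambda * I)^{-1} Phi^T y
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Plain OLS for comparison -- typically much larger weights on this basis
w_ols = np.linalg.pinv(Phi) @ y
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```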
This connection (regularisation ↔ priors) is one of the cleanest unifying ideas in the module.
Part 3: Evaluating Regression Models
Common metrics for evaluating predictions $\hat{y}_i$ against targets $y_i$ on test data:
| Metric | Formula | Notes |
|---|---|---|
| Mean Squared Error (MSE) | $\tfrac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ | What OLS minimises (up to scale) |
| Root Mean Squared Error (RMSE) | $\sqrt{\tfrac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$ | Same units as $y$; interpretable |
| Mean Absolute Error (MAE) | $\tfrac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$ | Less sensitive to outliers |
| Coefficient of Determination ($R^2$) | $1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | Fraction of variance explained; 1 = perfect |
$R^2 = 1$ means the model explains all variance in $y$; $R^2 = 0$ means it does no better than the constant predictor $\hat{y} = \bar{y}$; $R^2 < 0$ means it’s worse than the mean baseline.
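A small helper implementing these four metrics in NumPy (a sketch; the example numbers are made up):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2 for a batch of predictions."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])
print(regression_metrics(y_true, y_pred))
```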
What Could Go Wrong
- Ill-conditioned $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$. Collinear features or polynomial bases on a narrow range produce nearly singular matrices. Inversion blows up; small changes in the data flip the weights wildly. Mitigations: drop redundant features, standardise inputs, add ridge regularisation (see the sketch after this list).
- More parameters than examples ($M > N$). $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is rank-deficient; OLS has no unique solution. Use regularisation or reduce $M$.
- Outliers. Squared loss penalises large residuals quadratically — a single mislabelled point can dominate the fit. Robust alternatives: MAE, Huber loss, RANSAC.
- Wrong polynomial degree. Too low → underfitting (linear fit on a quadratic relationship); too high → overfitting (degree-6 polynomial on 10 points oscillates wildly). Validation picks the sweet spot.
- Wrong noise model. If true noise is heteroscedastic (variance depends on $\mathbf{x}$) or heavy-tailed, the MLE-equivalence argument no longer applies. OLS is still computable; it’s just no longer the optimal estimator.
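The ill-conditioning point is easy to see numerically. A small sketch (synthetic inputs; the degree and range are illustrative) comparing condition numbers before and after standardising the inputs:

```python
import numpy as np

# High-degree polynomial basis on a narrow, un-standardised input range:
# the columns of Phi are nearly collinear, so Phi^T Phi is ill-conditioned.
x = np.linspace(100.0, 101.0, 50)
Phi = np.vander(x, N=6, increasing=True)
print(f"condition number (raw basis):          {np.linalg.cond(Phi.T @ Phi):.3e}")

# Mitigation: standardise the inputs before building the basis
x_std = (x - x.mean()) / x.std()
Phi_std = np.vander(x_std, N=6, increasing=True)
print(f"condition number (standardised basis): {np.linalg.cond(Phi_std.T @ Phi_std):.3e}")

# Ridge regularisation helps for the same reason: adding lam * I shifts every
# eigenvalue of Phi^T Phi up by lam, bounding the smallest one away from zero.
```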
Concepts Introduced This Week
- linear-regression — the model: $\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$, linear in parameters but not necessarily in inputs.
- ordinary-least-squares — the fitting criterion: minimise $\sum_{i=1}^{N} \big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2$, closed-form via the normal equation.
- design-matrix — the $N \times M$ matrix $\boldsymbol{\Phi}$ with $\Phi_{ij} = \phi_j(\mathbf{x}_i)$ that encodes “all training inputs through all basis functions.”
- gaussian-distribution — univariate and multivariate normal distributions; the noise model that justifies squared error.
- bayes-law — posterior ∝ likelihood × prior; the foundation for upcoming Bayesian regression.
Connections
- Pivots from week-01–week-06: the same supervised-learning framework, but with a continuous output instead of class labels.
- Mirrors logistic-regression in week 2: same MLE recipe, different noise model. Bernoulli noise → cross-entropy loss; Gaussian noise → squared loss.
- Reuses the basis-expansion trick from week 3: the model is linear in $\phi$-space, non-linear in $\mathbf{x}$-space.
- Sets up later weeks: regularisation (ridge, lasso) as Bayesian priors; Bayesian linear regression with closed-form posteriors; generalisation theory for choosing model complexity; cross-validation for picking hyperparameters like polynomial degree.
Open Questions
- How do we pick the polynomial degree (or RBF width / kernel hyperparameters) without cheating on the test set? Validation — covered next.
- What if $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is singular or ill-conditioned? Regularisation (ridge regression) — and its Bayesian interpretation as a Gaussian prior.
- Can we sample from a posterior over $\mathbf{w}$ and quantify uncertainty in our predictions, not just point-estimate? Bayesian linear regression — the posterior over weights is a closed-form Gaussian under conjugate priors.
- What if the noise isn’t Gaussian? Robust regression (MAE, Huber) for heavy tails; weighted regression for heteroscedasticity.