THE CRUX: We've spent six weeks on classification — logistic regression, hard- and soft-margin SVMs. This week pivots to regression: predicting continuous outputs. Linear regression sounds trivial — fit a line — but two questions sit underneath. (1) How do we actually compute the answer? (2) Why do we use squared error and not absolute error or anything else?
The first answer is Ordinary Least Squares (OLS): minimise the sum of squared residuals, which has a closed-form solution via the normal equation $\mathbf{w}^* = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$. One matrix inversion, no iterations, no learning rate. The second answer is the probabilistic interpretation: assume $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and the MLE for $\mathbf{w}$ is exactly the OLS solution. Squared error isn’t arbitrary — it’s the negative log-likelihood of a Gaussian noise model, up to constants. This is the regression analog of week 2’s logistic-regression-as-MLE result: the noise model determines the loss.
Part 1: Linear Regression
From Classification to Regression
For six weeks we’ve been doing classification: the target $y$ ranges over a finite set of categories. This week, $y \in \mathbb{R}$ — the output is continuous. Examples:
- BMI → cardiac ejection fraction
- Customer features → recommended credit line (in dollars, not yes/no)
- Ad spend → revenue
The hypothesis form looks identical to logistic regression’s pre-sigmoid output:

$$\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$
But there’s no sigmoid, no threshold. The output is the prediction. What changes is the loss function and how we fit.
The Model
Linear regression assumes:

$$y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon = \sum_{j=0}^{M-1} w_j\, \phi_j(\mathbf{x}) + \epsilon$$

where $\boldsymbol{\phi}(\mathbf{x}) = \big(\phi_0(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x})\big)^\top$ is a vector of basis functions (with $\phi_0(\mathbf{x}) = 1$ for the intercept). The “linear” in linear regression refers to linearity in the parameters $\mathbf{w}$ — not in $\mathbf{x}$. The following are all linear regression models:

$$\hat{y} = w_0 + w_1 x, \qquad \hat{y} = w_0 + w_1 x + w_2 x^2, \qquad \hat{y} = w_0 + w_1 \sin(x) + w_2 e^{-x^2}$$

What unites them: the prediction depends linearly on $\mathbf{w}$ alone, no matter how non-linear it is in $\mathbf{x}$. This is what OLS exploits to give a closed-form solution — no matter how curvy the basis functions, the fitting problem is linear-algebraic.
Ordinary Least Squares
Define the residual $r_i = y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)$ and the objective:

$$E(\mathbf{w}) = \sum_{i=1}^{N} r_i^2 = \sum_{i=1}^{N} \big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2$$

OLS picks the $\mathbf{w}$ that minimises this. Setting $\nabla_{\mathbf{w}} E(\mathbf{w}) = \mathbf{0}$ gives a linear system; collecting it in matrix form using the design matrix $\boldsymbol{\Phi} \in \mathbb{R}^{N \times M}$ with $\Phi_{ij} = \phi_j(\mathbf{x}_i)$:

$$\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\, \mathbf{w} = \boldsymbol{\Phi}^\top \mathbf{y}$$

Solving:

$$\mathbf{w}^* = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y} = \boldsymbol{\Phi}^{+} \mathbf{y}$$

where $\boldsymbol{\Phi}^{+} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top$ is the Moore–Penrose pseudoinverse. One matrix inversion gives the exact answer.
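A minimal NumPy sketch of the normal-equation solve on synthetic data (the coefficients 2 and 3 are made up for illustration); the pseudoinverse route gives the same answer and is the numerically safer default:

```python
import numpy as np

# Synthetic data (illustrative): y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)

# Design matrix Phi with a constant column for the intercept
Phi = np.column_stack([np.ones_like(x), x])

# Normal equation: w* = (Phi^T Phi)^{-1} Phi^T y
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Pseudoinverse route -- same solution, better behaved numerically
w_pinv = np.linalg.pinv(Phi) @ y

print(w_normal)   # close to [2, 3]
print(w_pinv)
```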
Worked Example — Degree-1 Polynomial
For $\hat{y} = w_0 + w_1 x$, expanding $E(w_0, w_1) = \sum_{i=1}^{N} (y_i - w_0 - w_1 x_i)^2$ and setting $\partial E / \partial w_0 = \partial E / \partial w_1 = 0$ gives:

$$\sum_{i=1}^{N} y_i = N w_0 + w_1 \sum_{i=1}^{N} x_i, \qquad \sum_{i=1}^{N} x_i y_i = w_0 \sum_{i=1}^{N} x_i + w_1 \sum_{i=1}^{N} x_i^2$$

In matrix form:

$$\begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}$$

Inverting the $2 \times 2$ matrix gives $w_0$ and $w_1$.
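A quick numerical check of the $2 \times 2$ system, assuming a tiny made-up dataset and cross-checking against `np.polyfit`:

```python
import numpy as np

# Small synthetic dataset (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
N = len(x)

# The 2x2 normal-equation system for y_hat = w0 + w1 * x
A = np.array([[N,        x.sum()],
              [x.sum(),  (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
w0, w1 = np.linalg.solve(A, b)

# Cross-check against np.polyfit (degree 1 returns [w1, w0])
w1_ref, w0_ref = np.polyfit(x, y, deg=1)
print(w0, w1)
print(w0_ref, w1_ref)   # should agree
```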
Basis Expansion — The Bridge to Non-Linear
This is the same trick we used for SVMs in week 4: replace $\mathbf{x}$ with $\boldsymbol{\phi}(\mathbf{x})$ to gain non-linear capacity while staying within linear-fitting machinery. The model is “linear in $\phi$-space” but bends in input space.
Common basis families:
| Basis | Form | When |
|---|---|---|
| Polynomial | $\phi_j(x) = x^j$ | Low-degree smooth |
| Gaussian RBF | $\phi_j(x) = \exp\!\big(-\tfrac{(x - \mu_j)^2}{2 s^2}\big)$ | Local bumps |
| Sigmoidal | $\phi_j(x) = \sigma\!\big(\tfrac{x - \mu_j}{s}\big)$ | Saturation |
| tanh | $\phi_j(x) = \tanh\!\big(\tfrac{x - \mu_j}{s}\big)$ | Symmetric saturation |
Higher polynomial degree → tighter fit but more risk of overfitting. The classic plot: a degree-1 fit on 10 points gives a line that misses local structure; a degree-6 fit hits every point but oscillates wildly between them. Validation (covered in the coming weeks) is how we pick the right degree; the sketch below shows the trade-off numerically.
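A short sketch of the degree trade-off using a polynomial basis on a handful of noisy points (the data and degrees are illustrative): training error keeps falling with degree, which is exactly why validation, not training error, must choose it.

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

# Noisy samples of a smooth function (illustrative)
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

for degree in (1, 3, 6):
    Phi = poly_design(x, degree)
    w = np.linalg.pinv(Phi) @ y                 # OLS in phi-space
    train_mse = np.mean((Phi @ w - y) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}")

# Training MSE falls monotonically with degree; it is held-out (validation)
# error that eventually rises again -- that rise is the overfitting signal.
```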
Iterative Alternative — Gradient Descent
OLS fails when $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is too large to invert (millions of features) or near-singular (collinear features). Then gradient descent handles the same objective iteratively:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\, \nabla_{\mathbf{w}} E\big(\mathbf{w}^{(t)}\big)$$

where $\eta$ is the learning rate and $\nabla_{\mathbf{w}} E(\mathbf{w}) = -2\, \boldsymbol{\Phi}^\top (\mathbf{y} - \boldsymbol{\Phi} \mathbf{w})$. Convex objective → guaranteed convergence to the global optimum, but it requires learning-rate tuning.
The gradient-descent example from the lecture worked through a single step numerically: evaluate the gradient at the chosen starting point, step against it with the chosen learning rate, and confirm that the function value drops. The step improved the objective.
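A minimal sketch of gradient descent on the same least-squares objective, with an illustrative learning rate and synthetic data; the closed-form solution is printed alongside for comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100)
y = 1.0 - 2.0 * x + rng.normal(0, 0.1, size=100)
Phi = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)     # initial weights
eta = 0.05          # learning rate (illustrative; needs tuning in general)

for t in range(500):
    residual = y - Phi @ w
    grad = -2.0 * Phi.T @ residual / len(y)   # gradient of the mean squared error
    w = w - eta * grad

print(w)                                      # close to the OLS solution
print(np.linalg.pinv(Phi) @ y)                # closed-form answer for comparison
```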
Part 2: The Probabilistic View — Why Squared Error?
The Question OLS Doesn’t Answer
OLS says “minimise squared residuals.” But why squared? Why not absolute residuals, or fourth-power residuals? The OLS objective itself doesn’t say. We need a separate argument.
The answer comes from probabilistic modelling. Assume the targets are generated by:

$$y_i = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.}$$
That is: the underlying relationship is linear (in $\phi$-space), but observations are corrupted by additive Gaussian noise.
MLE for the Regression Weights
Under this model, $y_i \mid \mathbf{x}_i \sim \mathcal{N}\big(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i),\, \sigma^2\big)$. The likelihood of the dataset is:

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}\big(y_i \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i),\, \sigma^2\big)$$

Taking the log:

$$\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2$$

The only $\mathbf{w}$-dependent term is the sum of squared residuals, with a negative coefficient. So:

$$\mathbf{w}_{\text{MLE}} = \arg\max_{\mathbf{w}} \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \arg\min_{\mathbf{w}} \sum_{i=1}^{N}\big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2 = \mathbf{w}_{\text{OLS}}$$
MLE = OLS under additive Gaussian noise
The maximum-likelihood weights are identical to the OLS weights, when noise is i.i.d. Gaussian. This is the missing justification: squared error isn’t an arbitrary choice of loss — it’s the negative log-likelihood of a Gaussian noise model. Picking any other loss is implicitly assuming a different noise distribution.
The MLE for $\sigma^2$ falls out of the same derivation: $\sigma^2_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N} r_i^2$, where $r_i = y_i - \mathbf{w}_{\text{MLE}}^\top \boldsymbol{\phi}(\mathbf{x}_i)$.
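A quick numerical sanity check of this equivalence, assuming synthetic data and using `scipy.optimize.minimize` to maximise the Gaussian log-likelihood directly (a sketch, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=80)
y = 0.5 + 1.5 * x + rng.normal(0, 0.3, size=80)
Phi = np.column_stack([np.ones_like(x), x])

def neg_log_likelihood(params):
    """Gaussian NLL in (w, log sigma); the log keeps sigma positive."""
    w, log_sigma = params[:2], params[2]
    sigma2 = np.exp(2 * log_sigma)
    r = y - Phi @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + 0.5 * (r @ r) / sigma2

result = minimize(neg_log_likelihood, x0=np.zeros(3))
w_mle, sigma_mle = result.x[:2], np.exp(result.x[2])

w_ols = np.linalg.pinv(Phi) @ y
print(w_mle, w_ols)                                      # agree up to optimiser tolerance
print(sigma_mle**2, np.mean((y - Phi @ w_ols) ** 2))     # sigma^2 MLE = mean squared residual
```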
The Pattern Connecting Week 2 and Week 7
This is the same recipe we saw in week 2 for logistic-regression:
| Model | Noise / output model | MLE objective | Reduces to |
|---|---|---|---|
| Linear regression | $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ | Maximise Gaussian likelihood | Minimise squared error (= OLS) |
| Logistic regression | $y \mid \mathbf{x} \sim \text{Bernoulli}\big(\sigma(\mathbf{w}^\top \mathbf{x})\big)$ | Maximise Bernoulli likelihood | Minimise cross-entropy |
Same recipe (negative log-likelihood as loss), different distribution → different loss function. The “right” loss is determined by what noise/output model you’re willing to assume.
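To make the parallel explicit, the two negative log-likelihoods (dropping terms that do not depend on $\mathbf{w}$) are:

$$-\log p(\mathbf{y} \mid \mathbf{w}) \;\propto\; \sum_{i=1}^{N} \big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2 \quad \text{(Gaussian noise)}$$

$$-\log p(\mathbf{y} \mid \mathbf{w}) \;=\; -\sum_{i=1}^{N} \Big[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big], \quad p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i) \quad \text{(Bernoulli output)}$$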
Frequentist vs Bayesian — Setting Up Future Weeks
The MLE view treats $\mathbf{w}$ as a fixed unknown to be point-estimated. The Bayesian view (preview for later weeks) treats $\mathbf{w}$ as a random variable with a prior distribution; observed data updates this to a posterior via Bayes’ law:

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$$
Posterior ∝ likelihood × prior.
Under a Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$, the MAP estimate turns out to be ridge regression — adding $\lambda \lVert \mathbf{w} \rVert^2$ to the OLS objective. So L2 regularisation has a Bayesian interpretation: it’s a Gaussian prior on the weights. The “regularisation” lever and the “prior belief” lever are the same lever, viewed differently.
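A minimal NumPy sketch of the ridge / MAP estimate under these assumptions; the dataset, polynomial degree, and $\lambda$ are illustrative choices rather than values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)
Phi = np.vander(x, N=10, increasing=True)     # degree-9 polynomial basis

lam = 1e-3                                    # illustrative regularisation strength
M = Phi.shape[1]

# Ridge / MAP estimate: (Phi^T Phi + lambda * I)^{-1} Phi^T y
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Plain OLS for comparison -- typically much larger weights on this basis
w_ols = np.linalg.pinv(Phi) @ y
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```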
This connection (regularisation ↔ priors) is one of the cleanest unifying ideas in the module.
Part 3: Evaluating Regression Models
Common metrics for evaluating predictions $\hat{y}_i$ against targets $y_i$ on test data:
| Metric | Formula | Notes |
|---|---|---|
| Mean Squared Error (MSE) | $\tfrac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ | What OLS minimises (up to scale) |
| Root Mean Squared Error (RMSE) | $\sqrt{\tfrac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$ | Same units as $y$; interpretable |
| Mean Absolute Error (MAE) | $\tfrac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$ | Less sensitive to outliers |
| Coefficient of Determination ($R^2$) | $1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | Fraction of variance explained; 1 = perfect |
$R^2 = 1$ means the model explains all variance in $y$; $R^2 = 0$ means it does no better than the constant predictor $\hat{y} = \bar{y}$; $R^2 < 0$ means it’s worse than the mean baseline.
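A small helper implementing these four metrics in NumPy (a sketch; the example numbers are made up):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2 for a batch of predictions."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])
print(regression_metrics(y_true, y_pred))
```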
What Could Go Wrong
- Ill-conditioned $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$. Collinear features or polynomial bases on a narrow range produce nearly singular matrices. Inversion blows up; small changes in the data flip the weights wildly. Mitigations: drop redundant features, standardise inputs, add ridge regularisation (see the sketch after this list).
- More parameters than examples ($M > N$). $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is rank-deficient; OLS has no unique solution. Use regularisation or reduce $M$.
- Outliers. Squared loss penalises large residuals quadratically — a single mislabelled point can dominate the fit. Robust alternatives: MAE, Huber loss, RANSAC.
- Wrong polynomial degree. Too low → underfitting (linear fit on a quadratic relationship); too high → overfitting (degree-6 polynomial on 10 points oscillates wildly). Validation picks the sweet spot.
- Wrong noise model. If true noise is heteroscedastic (variance depends on $\mathbf{x}$) or heavy-tailed, the MLE-equivalence argument no longer applies. OLS is still computable; it’s just no longer the optimal estimator.
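The ill-conditioning point is easy to see numerically. A small sketch (synthetic inputs; the degree and range are illustrative) comparing condition numbers before and after standardising the inputs:

```python
import numpy as np

# High-degree polynomial basis on a narrow, un-standardised input range:
# the columns of Phi are nearly collinear, so Phi^T Phi is ill-conditioned.
x = np.linspace(100.0, 101.0, 50)
Phi = np.vander(x, N=6, increasing=True)
print(f"condition number (raw basis):          {np.linalg.cond(Phi.T @ Phi):.3e}")

# Mitigation: standardise the inputs before building the basis
x_std = (x - x.mean()) / x.std()
Phi_std = np.vander(x_std, N=6, increasing=True)
print(f"condition number (standardised basis): {np.linalg.cond(Phi_std.T @ Phi_std):.3e}")

# Ridge regularisation helps for the same reason: adding lam * I shifts every
# eigenvalue of Phi^T Phi up by lam, bounding the smallest one away from zero.
```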
Concepts Introduced This Week
- linear-regression — the model: $\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$, linear in parameters but not necessarily in inputs.
- ordinary-least-squares — the fitting criterion: minimise $\sum_{i=1}^{N} \big(y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i)\big)^2$, closed-form via the normal equation.
- design-matrix — the $N \times M$ matrix $\boldsymbol{\Phi}$ with $\Phi_{ij} = \phi_j(\mathbf{x}_i)$ that encodes “all training inputs through all basis functions.”
- gaussian-distribution — univariate and multivariate normal distributions; the noise model that justifies squared error.
- bayes-law — posterior ∝ likelihood × prior; the foundation for upcoming Bayesian regression.
Connections
- Pivots from week-01–week-06: the same supervised-learning framework, but with a continuous output instead of class labels.
- Mirrors logistic-regression in week 2: same MLE recipe, different noise model. Bernoulli noise → cross-entropy loss; Gaussian noise → squared loss.
- Reuses the basis-expansion trick from week 3: the model is linear in $\phi$-space, non-linear in $\mathbf{x}$-space.
- Sets up later weeks: regularisation (ridge, lasso) as Bayesian priors; Bayesian linear regression with closed-form posteriors; generalisation theory for choosing model complexity; cross-validation for picking hyperparameters like polynomial degree.
Open Questions
- How do we pick the polynomial degree (or RBF width / kernel hyperparameters) without cheating on the test set? Validation — covered next.
- What if $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is singular or ill-conditioned? Regularisation (ridge regression) — and its Bayesian interpretation as a Gaussian prior.
- Can we sample from a posterior over $\mathbf{w}$ and quantify uncertainty in our predictions, not just point-estimate? Bayesian linear regression — the posterior over weights is a closed-form Gaussian under conjugate priors.
- What if the noise isn’t Gaussian? Robust regression (MAE, Huber) for heavy tails; weighted regression for heteroscedasticity.