A regression model that assumes the target is a linear function of (possibly transformed) input features plus noise: $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$. The standard fit is ordinary least squares — minimise the sum of squared residuals — which has a closed-form solution via the normal equation.
The Model
Linear regression posits that the target depends linearly on a fixed set of basis functions of the input:

$$\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) = \sum_{j=0}^{M-1} w_j \, \phi_j(\mathbf{x})$$
where:
- $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x}))^\top$ is a vector of basis functions, with $\phi_0(\mathbf{x}) = 1$ as a dummy for the intercept.
- $\mathbf{w} = (w_0, \ldots, w_{M-1})^\top$ is the weight vector to be learned.
- $\hat{y}$ is the predicted value; the actual observed $y$ is assumed to differ by some noise $\epsilon$.
The simplest case ($\boldsymbol{\phi}(\mathbf{x}) = (1, x_1, \ldots, x_D)^\top$) gives plain $\hat{y} = w_0 + w_1 x_1 + \cdots + w_D x_D$ — a hyperplane in input space.
Why It’s Called “Linear”
The name refers to linearity in the parameters $\mathbf{w}$, not linearity in $\mathbf{x}$. The following are all linear regression models, even though they bend dramatically in input space:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 \qquad \hat{y} = w_0 + w_1 \sin(x) \qquad \hat{y} = \sum_j w_j \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$$

What makes them "linear" is that $\partial \hat{y} / \partial w_j = \phi_j(\mathbf{x})$ is a function of $\mathbf{x}$ alone — not of $\mathbf{w}$. The prediction is a linear combination of pre-computed basis values; the unknowns enter linearly.
This matters because OLS only relies on linearity in $\mathbf{w}$ to give a closed-form solution. The same recipe handles polynomial regression, RBF regression, and arbitrary fixed-basis models.
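A quick sketch of that point (the data and seed are illustrative, not from the source): two models with very different shapes in input space are fit by the identical linear-in-$\mathbf{w}$ call, because only the pre-computed basis values change.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

# Two very different shapes in input space...
Phi_poly = np.column_stack([x**j for j in range(4)])                  # 1, x, x^2, x^3
Phi_sine = np.column_stack([np.ones_like(x), np.sin(2 * np.pi * x)])  # 1, sin(2*pi*x)

# ...fit by the exact same least-squares recipe, since w enters linearly.
w_poly, *_ = np.linalg.lstsq(Phi_poly, y, rcond=None)
w_sine, *_ = np.linalg.lstsq(Phi_sine, y, rcond=None)
```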
Common Basis Functions
| Basis | Form | When |
|---|---|---|
| Polynomial | $\phi_j(x) = x^j$ | Smooth low-degree relationships |
| Gaussian / RBF | $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$ | Local bumps; bell-shaped responses |
| Sigmoidal | $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, $\sigma(a) = \frac{1}{1 + e^{-a}}$ | Smooth saturation effects |
| tanh | $\phi_j(x) = \tanh\!\left(\frac{x - \mu_j}{s}\right)$ | Symmetric saturation |
For multi-input data, basis functions are typically applied per dimension or to selected combinations.
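A minimal sketch of the table's forms as NumPy feature maps (the function names, shared width `s`, and centre grid `mu` are my choices, not from the source). Each returns one column per basis function; prepend a column of ones for the intercept.

```python
import numpy as np

def poly_features(x, degree):
    """phi_j(x) = x^j for j = 0..degree (column j=0 doubles as the intercept)."""
    return np.column_stack([x**j for j in range(degree + 1)])

def rbf_features(x, mu, s):
    """Gaussian bumps exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s**2))

def sigmoid_features(x, mu, s):
    """Smooth saturating steps sigma((x - mu_j) / s)."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))

def tanh_features(x, mu, s):
    """Symmetric saturating steps tanh((x - mu_j) / s)."""
    return np.tanh((x[:, None] - mu[None, :]) / s)
```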
Fitting the Model
Two equivalent recipes for finding $\mathbf{w}$:

1. **Ordinary Least Squares (OLS)** — minimise the sum of squared residuals. Yields the normal equation $\mathbf{w} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$, where $\boldsymbol{\Phi}$ is the design matrix (a minimal sketch follows this list).
2. **Maximum Likelihood Estimation under Gaussian noise** — assume $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The MLE for $\mathbf{w}$ is identical to the OLS solution. See OLS § MLE equivalence.
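A minimal NumPy sketch of recipe 1 (the function name and toy data are illustrative): solve the normal equation with a linear solve rather than an explicit matrix inverse, which computes the same $\mathbf{w}$ more stably.

```python
import numpy as np

def fit_normal_equation(Phi, y):
    """w = (Phi^T Phi)^{-1} Phi^T y, computed as a linear solve."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Toy usage: recover the coefficients of a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
Phi = np.column_stack([np.ones_like(x), x, x**2])   # design matrix
y = 1 - 2 * x + 0.5 * x**2 + 0.05 * rng.standard_normal(40)
print(fit_normal_equation(Phi, y))                  # approx [1, -2, 0.5]
```

`np.linalg.lstsq(Phi, y, rcond=None)` reaches the same minimiser and also survives a rank-deficient $\boldsymbol{\Phi}$, which matters in the ill-conditioned cases discussed under Limitations.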
The probabilistic view (recipe 2) supplies the missing justification for why we minimise squared error: squared-error minimisation is exactly maximum-likelihood estimation under additive Gaussian noise.
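To spell out the equivalence (a standard derivation, included here for completeness): under $\epsilon \sim \mathcal{N}(0, \sigma^2)$ the log-likelihood of $N$ independent observations is

$$\ln p(\mathbf{y} \mid \mathbf{w}, \sigma^2) = -\frac{N}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i) \right)^2,$$

and only the second term depends on $\mathbf{w}$, so maximising the likelihood and minimising the sum of squared residuals select the same $\mathbf{w}$.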
Predictions and Evaluation
Once $\mathbf{w}$ is fit, predict $\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$ for a new input $\mathbf{x}$.
Common evaluation metrics:
| Metric | Formula | Notes |
|---|---|---|
| MSE | $\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ | What OLS minimises (up to scale) |
| RMSE | $\sqrt{\mathrm{MSE}}$ | Same units as $y$; interpretable |
| MAE | $\frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$ | Less sensitive to outliers than MSE |
| $R^2$ | $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | Fraction of variance explained; 1 = perfect |
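A small sketch of these metrics in NumPy (the helper name is mine): predictions come from $\boldsymbol{\Phi} \mathbf{w}$ as above, and each metric is a one-liner over the residuals.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 for a fitted model's predictions."""
    resid = y_true - y_pred
    mse = np.mean(resid**2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(resid)),
        "R2": 1.0 - np.sum(resid**2) / np.sum((y_true - np.mean(y_true)) ** 2),
    }
```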
Choosing the Polynomial Degree
A higher-degree polynomial can fit any training set arbitrarily well (with degree $\geq N - 1$, i.e. $M \geq N$ parameters, you can interpolate all $N$ points exactly). But this overfits — the curve thrashes wildly between training points and fails on test data.
This is where regularisation and validation enter the picture (later weeks). The intuition: pick the degree that fits the data well without contorting itself into the noise.
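A hedged sketch of that experiment (the data, split, and degrees are made up for illustration): sweep the degree, fit on a training half, and watch held-out error turn back up once the polynomial starts chasing noise.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(40)

x_tr, y_tr = x[:20], y[:20]   # training half
x_va, y_va = x[20:], y[20:]   # held-out half

def design(x, degree):
    return np.column_stack([x**j for j in range(degree + 1)])

for degree in (1, 3, 9, 15):
    w, *_ = np.linalg.lstsq(design(x_tr, degree), y_tr, rcond=None)
    val_mse = np.mean((design(x_va, degree) @ w - y_va) ** 2)
    print(degree, val_mse)    # dips at a moderate degree, rises again after
```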
Strengths and Limitations
Strengths:
- Closed-form solution via the normal equation. No iterations, no learning rate.
- Convex objective — a global optimum, unique when $\boldsymbol{\Phi}$ has full column rank.
- Interpretable coefficients — each $w_j$ is the (partial) effect of basis function $\phi_j$ on the output.
- Probabilistic backing — MLE under Gaussian noise gives the same answer.
- Extends to non-linear shapes via basis expansion without leaving the linear-regression machinery.
Limitations:
- Sensitive to outliers — squared loss penalises large residuals quadratically; one extreme point can dominate the fit. Use MAE or robust regression if outliers are an issue.
- Numerical issues — $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ becomes ill-conditioned when columns of $\boldsymbol{\Phi}$ are near-collinear, and singular when $M > N$. Then the normal-equation inverse fails outright; gradient descent or regularisation is needed (see the sketch after this list).
- Overfitting with too many basis functions — high-degree polynomials or many RBFs without regularisation will memorise noise.
- Linear-in-parameters is still a constraint — if the true relationship is genuinely non-linear in the parameters (e.g., $y = w_1 e^{w_2 x}$), linear regression with any fixed basis can only approximate it, never recover it.
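A minimal sketch of the regularisation escape hatch (a ridge penalty; the strength `lam` is a hypothetical choice): adding $\lambda I$ keeps the system invertible even when $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ alone is singular or near-collinear.

```python
import numpy as np

def fit_ridge(Phi, y, lam=1e-3):
    """Minimise ||y - Phi w||^2 + lam * ||w||^2.

    Closed form: w = (Phi^T Phi + lam * I)^{-1} Phi^T y.
    The lam * I term makes the matrix invertible even when M > N
    or when columns of Phi are near-collinear.
    """
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
```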
Connections
- ordinary-least-squares — the criterion and closed-form solution.
- design-matrix — the matrix whose rows are basis-function evaluations of each training input.
- non-linear-transformation — the basis-expansion trick, shared with SVMs.
- gaussian-distribution — the noise model that justifies squared loss.
- maximum-likelihood-estimation — the principle that links OLS to a probabilistic model.
- gradient-descent — alternative fitting method when $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is too large or singular to invert.
- logistic-regression — the classification analogue: linear in parameters, but with sigmoid + Bernoulli noise instead of identity + Gaussian noise.
Active Recall
Why is the model $\hat{y} = w_0 + w_1 x + w_2 x^2$ called "linear regression" even though it traces a parabola?
Linearity refers to the parameters $\mathbf{w}$, not the input $x$. We can write it as $\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(x)$ with $\boldsymbol{\phi}(x) = (1, x, x^2)^\top$. The derivative $\partial \hat{y} / \partial w_j = \phi_j(x)$ is a function of $x$ alone, not of $\mathbf{w}$ — that’s the linear-in-parameters property. OLS works because of this linearity, regardless of what $\boldsymbol{\phi}$ does in input space.
A dataset has 14 examples and 3 features (plus intercept). What are the dimensions of $\boldsymbol{\Phi}$, $\mathbf{w}$, and $\mathbf{y}$ in the normal equation?
$\boldsymbol{\Phi}$ is $14 \times 4$ (one row per example, one column per parameter including the intercept). $\mathbf{w}$ is $4 \times 1$. $\mathbf{y}$ is $14 \times 1$.
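A quick shape check of that answer (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((14, 3))          # 14 examples, 3 features
Phi = np.column_stack([np.ones(14), X])   # prepend the intercept column
y = rng.standard_normal((14, 1))
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(Phi.shape, w.shape, y.shape)        # (14, 4) (4, 1) (14, 1)
```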
Given training data $\{(x_i, y_i)\}$: $(0, 1), (1, 3), (2, 5), (3, 7)$. Linear regression $\hat{y} = w_0 + w_1 x$ fits perfectly. What are $w_0$ and $w_1$?
The relationship is $y = 2x + 1$, so $w_0 = 1$ and $w_1 = 2$. (Verify: $2 \cdot 0 + 1 = 1$, $2 \cdot 1 + 1 = 3$, $2 \cdot 2 + 1 = 5$, $2 \cdot 3 + 1 = 7$. ✓)