A regression model that assumes the target is a linear function of (possibly transformed) input features plus noise: $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \varepsilon$. The standard fit is ordinary least squares — minimise the sum of squared residuals — which has a closed-form solution via the normal equation.

The Model

Linear regression posits that the target depends linearly on a fixed set of basis functions of the input:

$$\hat{y}(\mathbf{x}; \mathbf{w}) = \sum_{j=0}^{M-1} w_j\, \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$$
where:

  • $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x}))^\top$ is a vector of basis functions, with $\phi_0(\mathbf{x}) = 1$ as a dummy for the intercept.
  • $\mathbf{w} = (w_0, w_1, \dots, w_{M-1})^\top$ is the weight vector to be learned.
  • $\hat{y}$ is the predicted value; the actual observed $y$ is assumed to differ by some noise $\varepsilon$.

The simplest case ($\boldsymbol{\phi}(\mathbf{x}) = (1, x_1, \dots, x_D)^\top$) gives plain $\hat{y} = w_0 + w_1 x_1 + \dots + w_D x_D$ — a hyperplane in input space.
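As a concrete sketch of this simplest case (assuming NumPy; the function name is mine), the prediction is just a dot product once a constant 1 is prepended for the intercept:

```python
import numpy as np

def predict_plain(w, x):
    """Hyperplane model: y_hat = w0 + w1*x1 + ... + wD*xD."""
    phi = np.concatenate(([1.0], x))   # phi_0(x) = 1 carries the intercept
    return w @ phi                     # linear combination of features

# Toy example with hand-picked weights (D = 2)
w = np.array([0.5, 2.0, -1.0])         # intercept, then one slope per input dimension
x = np.array([3.0, 4.0])
print(predict_plain(w, x))             # 0.5 + 2*3 - 1*4 = 2.5
```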

Why It’s Called “Linear”

The name refers to linearity in the parameters $\mathbf{w}$, not linearity in $\mathbf{x}$. The following are all linear regression models, even though they bend dramatically in input space:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3, \qquad \hat{y} = w_0 + w_1 \exp\!\left(-\tfrac{(x-\mu)^2}{2s^2}\right), \qquad \hat{y} = w_0 + w_1 \tanh\!\left(\tfrac{x-\mu}{s}\right)$$

What makes them “linear” is that each $\phi_j(\mathbf{x})$ is a function of $\mathbf{x}$ alone — not of $\mathbf{w}$. The prediction is a linear combination of pre-computed basis values; the unknowns enter linearly.

This matters because OLS only relies on linearity in $\mathbf{w}$ to give a closed-form solution. The same recipe handles polynomial regression, RBF regression, and arbitrary fixed-basis models.
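A tiny illustrative sketch (assuming NumPy): a model that curves sharply in $x$ is still a plain dot product once the basis values are pre-computed, so the same closed-form recipe applies.

```python
import numpy as np

def phi(x):
    """Fixed basis: non-linear in x, but the weights will enter linearly."""
    return np.array([1.0, x, x**2, np.sin(x)])

w = np.array([0.3, -1.2, 0.05, 2.0])   # hypothetical weights
x = 1.7
print(w @ phi(x))                       # linear in w, non-linear in x
```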

Common Basis Functions

| Basis | Form | When |
| --- | --- | --- |
| Polynomial | $\phi_j(x) = x^j$ | Smooth low-degree relationships |
| Gaussian / RBF | $\phi_j(x) = \exp\!\left(-\tfrac{(x - \mu_j)^2}{2s^2}\right)$ | Local bumps; bell-shaped responses |
| Sigmoidal | $\phi_j(x) = \sigma\!\left(\tfrac{x - \mu_j}{s}\right)$ with $\sigma(a) = \tfrac{1}{1 + e^{-a}}$ | Smooth saturation effects |
| tanh | $\phi_j(x) = \tanh\!\left(\tfrac{x - \mu_j}{s}\right)$ | Symmetric saturation |

For multi-input data, basis functions are typically applied per dimension or to selected combinations.
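A sketch of how these basis families turn a 1-D input vector into a design matrix $\boldsymbol{\Phi}$ (one row per data point, one column per basis function). The centres and widths below are illustrative choices, not prescribed values:

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns 1, x, x^2, ..., x^degree for a 1-D input vector x."""
    return np.vander(x, degree + 1, increasing=True)

def gaussian_basis(x, centres, width):
    """One bell-shaped bump per centre: exp(-(x - mu)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * width ** 2))

def sigmoid_basis(x, centres, width):
    """Smoothly saturating features: sigma((x - mu) / s)."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / width))

x = np.linspace(0, 1, 5)                        # tiny 1-D dataset
centres = np.array([0.25, 0.5, 0.75])

Phi_poly = polynomial_basis(x, degree=3)        # shape (5, 4)
Phi_rbf = np.column_stack([np.ones_like(x),     # keep the dummy intercept column
                           gaussian_basis(x, centres, width=0.2)])
print(Phi_poly.shape, Phi_rbf.shape)            # (5, 4) (5, 4)
```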

Fitting the Model

Two equivalent recipes for finding $\mathbf{w}$:

  1. Ordinary Least Squares (OLS) — minimise the sum of squared residuals. Yields the normal equation: $\mathbf{w} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$, where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix with rows $\boldsymbol{\phi}(\mathbf{x}_i)^\top$.

  2. Maximum Likelihood Estimation under Gaussian noise — assume $y_i = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. The MLE for $\mathbf{w}$ is identical to the OLS solution. See OLS § MLE equivalence.

The probabilistic view (recipe 2) supplies the missing justification for why we minimise squared error: it is exactly the maximum-likelihood choice under additive Gaussian noise.
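A hedged sketch of the fit on synthetic data (assuming NumPy): the normal equation spelled out directly, plus `np.linalg.lstsq` as the numerically safer route; both give the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2x - 0.5x^2 + Gaussian noise
x = rng.uniform(-3, 3, size=50)
y = 1 + 2 * x - 0.5 * x ** 2 + rng.normal(scale=0.3, size=50)

Phi = np.vander(x, 3, increasing=True)            # design matrix: columns 1, x, x^2

# Recipe 1: normal equation  w = (Phi^T Phi)^{-1} Phi^T y
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Recipe 2 gives the same answer; lstsq solves the same problem more stably
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w_normal)                                   # close to [1, 2, -0.5]
print(np.allclose(w_normal, w_lstsq))             # True (up to numerical precision)
```

`lstsq` avoids explicitly forming $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$, which squares the condition number of the problem.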

Predictions and Evaluation

Once $\mathbf{w}$ is fit, predict $\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$ for new inputs $\mathbf{x}$.

Common evaluation metrics:

| Metric | Formula | Notes |
| --- | --- | --- |
| MSE | $\tfrac{1}{N} \sum_i (y_i - \hat{y}_i)^2$ | What OLS minimises (up to scale) |
| RMSE | $\sqrt{\tfrac{1}{N} \sum_i (y_i - \hat{y}_i)^2}$ | Same units as $y$; interpretable |
| MAE | $\tfrac{1}{N} \sum_i \lvert y_i - \hat{y}_i \rvert$ | Less sensitive to outliers |
| $R^2$ | $1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | Fraction of variance explained; 1 = perfect |
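All four metrics are one-liners in NumPy; a sketch with a hypothetical helper:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                 # what OLS minimises (up to scale)
    rmse = np.sqrt(mse)                           # same units as y
    mae = np.mean(np.abs(residuals))              # less swayed by single outliers
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(regression_metrics(y_true, y_pred))
```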

Choosing the Polynomial Degree

A higher-degree polynomial can fit any training set arbitrarily well (with degree $N-1$ for $N$ training points, you can interpolate every point exactly). But this overfits — the curve thrashes wildly between training points and fails on test data.

This is where regularisation and validation enter the picture (later weeks). The intuition: pick the degree that fits the data well without contorting itself into the noise.
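A hedged sketch of the validation idea: fit every candidate degree on a training split and watch the held-out error, which typically falls and then rises once the polynomial starts chasing noise.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=40)   # noisy non-linear target

train, val = np.arange(30), np.arange(30, 40)                # simple hold-out split

for degree in range(1, 10):
    Phi_train = np.vander(x[train], degree + 1, increasing=True)
    Phi_val = np.vander(x[val], degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_train, y[train], rcond=None)
    val_mse = np.mean((y[val] - Phi_val @ w) ** 2)
    print(degree, round(val_mse, 4))   # typically falls, then rises as overfitting sets in
```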

Strengths and Limitations

Strengths:

  • Closed-form solution via the normal equation. No iterations, no learning rate.
  • Convex objective — unique global optimum.
  • Interpretable coefficients — each $w_j$ is the (partial) effect of feature $\phi_j(\mathbf{x})$ on the output.
  • Probabilistic backing — MLE under Gaussian noise gives the same answer.
  • Extends to non-linear shapes via basis expansion without leaving the linear-regression machinery.

Limitations:

  • Sensitive to outliers — squared loss penalises large residuals quadratically; one extreme point can dominate the fit. Use MAE or robust regression if outliers are an issue.
  • Numerical issues — $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ becomes ill-conditioned when columns of $\boldsymbol{\Phi}$ are near-collinear, or when $M > N$ (more basis functions than data points). Then OLS fails outright; gradient descent or regularisation is needed (a ridge sketch follows this list).
  • Overfitting with too many basis functions — high-degree polynomials or many RBFs without regularisation will memorise noise.
  • Linear-in-parameters is still a constraint — if the true relationship is genuinely non-linear-in-parameters (e.g., $y = w_1 e^{w_2 x}$), linear regression with any fixed basis can only approximate, never recover it.
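As a preview of the regularisation fix mentioned above (ridge regression is covered properly later; this is only a sketch under that assumption): adding $\lambda \mathbf{I}$ to $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ keeps the normal equation solvable even when columns are collinear or $M > N$.

```python
import numpy as np

def ridge_fit(Phi, y, lam=1e-3):
    """Regularised normal equation: w = (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Deliberately ill-posed: more basis functions (10) than data points (5)
x = np.linspace(0, 1, 5)
y = np.sin(2 * np.pi * x)
Phi = np.vander(x, 10, increasing=True)

print(ridge_fit(Phi, y, lam=1e-3))   # finite weights despite Phi^T Phi being singular
```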

Connections

Active Recall