A regression model that assumes the target is a linear function of (possibly transformed) input features plus noise: $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$. The standard fit is ordinary least squares — minimise the sum of squared residuals — which has a closed-form solution via the normal equation.
The Model
Linear regression posits that the target depends linearly on a fixed set of basis functions of the input:

$$\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) = \sum_{j=0}^{M-1} w_j \, \phi_j(\mathbf{x})$$
where:
- $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x}))^\top$ is a vector of basis functions, with $\phi_0(\mathbf{x}) = 1$ as a dummy for the intercept.
- $\mathbf{w} = (w_0, \ldots, w_{M-1})^\top$ is the weight vector to be learned.
- $\hat{y}$ is the predicted value; the actual observed $y$ is assumed to differ by some noise $\epsilon$.
The simplest case ($\boldsymbol{\phi}(\mathbf{x}) = (1, x_1, \ldots, x_D)^\top$) gives plain $\hat{y} = w_0 + w_1 x_1 + \cdots + w_D x_D$ — a hyperplane in input space.
Why It’s Called “Linear”
The name refers to linearity in the parameters $\mathbf{w}$, not linearity in $\mathbf{x}$. The following are all linear regression models, even though they bend dramatically in input space:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 \qquad \hat{y} = w_0 + w_1 \sin(x) \qquad \hat{y} = \sum_j w_j \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$$

What makes them "linear" is that $\partial \hat{y} / \partial w_j = \phi_j(\mathbf{x})$ is a function of $\mathbf{x}$ alone — not of $\mathbf{w}$. The prediction is a linear combination of pre-computed basis values; the unknowns enter linearly.
This matters because OLS only relies on linearity in $\mathbf{w}$ to give a closed-form solution. The same recipe handles polynomial regression, RBF regression, and arbitrary fixed-basis models.
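A quick sketch of that point (the data and seed are illustrative, not from the source): two models with very different shapes in input space are fit by the identical linear-in-$\mathbf{w}$ call, because only the pre-computed basis values change.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

# Two very different shapes in input space...
Phi_poly = np.column_stack([x**j for j in range(4)])                  # 1, x, x^2, x^3
Phi_sine = np.column_stack([np.ones_like(x), np.sin(2 * np.pi * x)])  # 1, sin(2*pi*x)

# ...fit by the exact same least-squares recipe, since w enters linearly.
w_poly, *_ = np.linalg.lstsq(Phi_poly, y, rcond=None)
w_sine, *_ = np.linalg.lstsq(Phi_sine, y, rcond=None)
```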
Common Basis Functions
| Basis | Form | When |
|---|---|---|
| Polynomial | $\phi_j(x) = x^j$ | Smooth low-degree relationships |
| Gaussian / RBF | $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$ | Local bumps; bell-shaped responses |
| Sigmoidal | $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right)$, $\sigma(a) = \frac{1}{1 + e^{-a}}$ | Smooth saturation effects |
| tanh | $\phi_j(x) = \tanh\!\left(\frac{x - \mu_j}{s}\right)$ | Symmetric saturation |
For multi-input data, basis functions are typically applied per dimension or to selected combinations.
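A minimal sketch of the table's forms as NumPy feature maps (the function names, shared width `s`, and centre grid `mu` are my choices, not from the source). Each returns one column per basis function; prepend a column of ones for the intercept.

```python
import numpy as np

def poly_features(x, degree):
    """phi_j(x) = x^j for j = 0..degree (column j=0 doubles as the intercept)."""
    return np.column_stack([x**j for j in range(degree + 1)])

def rbf_features(x, mu, s):
    """Gaussian bumps exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s**2))

def sigmoid_features(x, mu, s):
    """Smooth saturating steps sigma((x - mu_j) / s)."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))

def tanh_features(x, mu, s):
    """Symmetric saturating steps tanh((x - mu_j) / s)."""
    return np.tanh((x[:, None] - mu[None, :]) / s)
```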
Fitting the Model
Two equivalent recipes for finding $\mathbf{w}$:

1. **Ordinary Least Squares (OLS)** — minimise the sum of squared residuals. Yields the normal equation $\mathbf{w} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$, where $\boldsymbol{\Phi}$ is the design matrix (a minimal sketch follows this list).
2. **Maximum Likelihood Estimation under Gaussian noise** — assume $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The MLE for $\mathbf{w}$ is identical to the OLS solution. See OLS § MLE equivalence.
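A minimal NumPy sketch of recipe 1 (the function name and toy data are illustrative): solve the normal equation with a linear solve rather than an explicit matrix inverse, which computes the same $\mathbf{w}$ more stably.

```python
import numpy as np

def fit_normal_equation(Phi, y):
    """w = (Phi^T Phi)^{-1} Phi^T y, computed as a linear solve."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Toy usage: recover the coefficients of a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
Phi = np.column_stack([np.ones_like(x), x, x**2])   # design matrix
y = 1 - 2 * x + 0.5 * x**2 + 0.05 * rng.standard_normal(40)
print(fit_normal_equation(Phi, y))                  # approx [1, -2, 0.5]
```

`np.linalg.lstsq(Phi, y, rcond=None)` reaches the same minimiser and also survives a rank-deficient $\boldsymbol{\Phi}$, which matters in the ill-conditioned cases discussed under Limitations.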
The probabilistic view (recipe 2) supplies the missing justification for why we minimise squared error: squared-error minimisation is exactly maximum-likelihood estimation under additive Gaussian noise.
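To spell out the equivalence (a standard derivation, included here for completeness): under $\epsilon \sim \mathcal{N}(0, \sigma^2)$ the log-likelihood of $N$ independent observations is

$$\ln p(\mathbf{y} \mid \mathbf{w}, \sigma^2) = -\frac{N}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i) \right)^2,$$

and only the second term depends on $\mathbf{w}$, so maximising the likelihood and minimising the sum of squared residuals select the same $\mathbf{w}$.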
Predictions and Evaluation
Once $\mathbf{w}$ is fit, predict $\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$ for a new input $\mathbf{x}$.
Common evaluation metrics:
| Metric | Formula | Notes |
|---|---|---|
| MSE | $\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ | What OLS minimises (up to scale) |
| RMSE | $\sqrt{\mathrm{MSE}}$ | Same units as $y$; interpretable |
| MAE | $\frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$ | Less sensitive to outliers than MSE |
| $R^2$ | $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | Fraction of variance explained; 1 = perfect |
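A small sketch of these metrics in NumPy (the helper name is mine): predictions come from $\boldsymbol{\Phi} \mathbf{w}$ as above, and each metric is a one-liner over the residuals.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 for a fitted model's predictions."""
    resid = y_true - y_pred
    mse = np.mean(resid**2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(resid)),
        "R2": 1.0 - np.sum(resid**2) / np.sum((y_true - np.mean(y_true)) ** 2),
    }
```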
Choosing the Polynomial Degree
A higher-degree polynomial can fit any training set arbitrarily well (with degree $\geq N - 1$, i.e. $M \geq N$ parameters, you can interpolate all $N$ points exactly). But this overfits — the curve thrashes wildly between training points and fails on test data.
This is where regularisation and validation enter the picture (later weeks). The intuition: pick the degree that fits the data well without contorting itself into the noise.
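A hedged sketch of that experiment (the data, split, and degrees are made up for illustration): sweep the degree, fit on a training half, and watch held-out error turn back up once the polynomial starts chasing noise.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(40)

x_tr, y_tr = x[:20], y[:20]   # training half
x_va, y_va = x[20:], y[20:]   # held-out half

def design(x, degree):
    return np.column_stack([x**j for j in range(degree + 1)])

for degree in (1, 3, 9, 15):
    w, *_ = np.linalg.lstsq(design(x_tr, degree), y_tr, rcond=None)
    val_mse = np.mean((design(x_va, degree) @ w - y_va) ** 2)
    print(degree, val_mse)    # dips at a moderate degree, rises again after
```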
Strengths and Limitations
Strengths:
- Closed-form solution via the normal equation. No iterations, no learning rate.
- Convex objective — a global optimum, unique when $\boldsymbol{\Phi}$ has full column rank.
- Interpretable coefficients — each $w_j$ is the (partial) effect of basis function $\phi_j$ on the output.
- Probabilistic backing — MLE under Gaussian noise gives the same answer.
- Extends to non-linear shapes via basis expansion without leaving the linear-regression machinery.
Limitations:
- Sensitive to outliers — squared loss penalises large residuals quadratically; one extreme point can dominate the fit. Use MAE or robust regression if outliers are an issue.
- Numerical issues — $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ becomes ill-conditioned when columns of $\boldsymbol{\Phi}$ are near-collinear, and singular when $M > N$. Then the normal-equation inverse fails outright; gradient descent or regularisation is needed (see the sketch after this list).
- Overfitting with too many basis functions — high-degree polynomials or many RBFs without regularisation will memorise noise.
- Linear-in-parameters is still a constraint — if the true relationship is genuinely non-linear in the parameters (e.g., $y = w_1 e^{w_2 x}$), linear regression with any fixed basis can only approximate it, never recover it.
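A minimal sketch of the regularisation escape hatch (a ridge penalty; the strength `lam` is a hypothetical choice): adding $\lambda I$ keeps the system invertible even when $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ alone is singular or near-collinear.

```python
import numpy as np

def fit_ridge(Phi, y, lam=1e-3):
    """Minimise ||y - Phi w||^2 + lam * ||w||^2.

    Closed form: w = (Phi^T Phi + lam * I)^{-1} Phi^T y.
    The lam * I term makes the matrix invertible even when M > N
    or when columns of Phi are near-collinear.
    """
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
```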
Connections
- ordinary-least-squares — the criterion and closed-form solution.
- design-matrix — the matrix whose rows are basis-function evaluations of each training input.
- non-linear-transformation — the basis-expansion trick, shared with SVMs.
- gaussian-distribution — the noise model that justifies squared loss.
- maximum-likelihood-estimation — the principle that links OLS to a probabilistic model.
- gradient-descent — alternative fitting method when $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is too large or singular to invert.
- logistic-regression — the classification analogue: linear in parameters, but with sigmoid + Bernoulli noise instead of identity + Gaussian noise.
Active Recall
Why is the model $\hat{y} = w_0 + w_1 x + w_2 x^2$ called "linear regression" even though it traces a parabola?
Linearity refers to the parameters $\mathbf{w}$, not the input $x$. We can write it as $\hat{y} = \mathbf{w}^\top \boldsymbol{\phi}(x)$ with $\boldsymbol{\phi}(x) = (1, x, x^2)^\top$. The derivative $\partial \hat{y} / \partial w_j = \phi_j(x)$ is a function of $x$ alone, not of $\mathbf{w}$ — that’s the linear-in-parameters property. OLS works because of this linearity, regardless of what $\boldsymbol{\phi}$ does in input space.
A dataset has 14 examples and 3 features (plus intercept). What are the dimensions of $\boldsymbol{\Phi}$, $\mathbf{w}$, and $\mathbf{y}$ in the normal equation?
$\boldsymbol{\Phi}$ is $14 \times 4$ (one row per example, one column per parameter including the intercept). $\mathbf{w}$ is $4 \times 1$. $\mathbf{y}$ is $14 \times 1$.
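A quick shape check of that answer (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((14, 3))          # 14 examples, 3 features
Phi = np.column_stack([np.ones(14), X])   # prepend the intercept column
y = rng.standard_normal((14, 1))
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(Phi.shape, w.shape, y.shape)        # (14, 4) (4, 1) (14, 1)
```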
Given training data $\{(x_i, y_i)\}$: $(0, 1), (1, 3), (2, 5), (3, 7)$. Linear regression $\hat{y} = w_0 + w_1 x$ fits perfectly. What are $w_0$ and $w_1$?
The relationship is $y = 2x + 1$, so $w_0 = 1$ and $w_1 = 2$. (Verify: $2 \cdot 0 + 1 = 1$, $2 \cdot 1 + 1 = 3$, $2 \cdot 2 + 1 = 5$, $2 \cdot 3 + 1 = 7$. ✓)