A function that measures how far a model’s predictions are from the truth — the thing we minimise when we train.

Definition

A loss function (also called a cost function or objective function) maps a model's predictions and the true values to a single non-negative number. Smaller values mean better predictions. Training a model means finding parameters that minimise the loss:

$$\theta^* = \arg\min_{\theta} L(\theta)$$

The notation $\arg\min$ means "the argument (value) that produces the minimum" — not the minimum itself, but the parameter values that achieve it.

Why we need loss functions

A perceptron has parameters $w$ (weights) and $b$ (bias), but how do we find the right values? We need a way to score any candidate set of parameters: "these weights give a total error of 9; those give 7; those give 3." The loss function provides that score. Once we have it, learning becomes an optimisation problem.

Concrete example: temperature estimation

Suppose three noisy thermometers give readings $x_1$, $x_2$, $x_3$ (in °C). We want to estimate the true temperature $\theta$.

Using absolute error, an estimate $\hat{\theta}$ scores

$$L(\hat{\theta}) = |x_1 - \hat{\theta}| + |x_2 - \hat{\theta}| + |x_3 - \hat{\theta}|$$

An estimate further from the readings gives a larger total — worse. We sweep through all candidates and pick the one with the lowest total loss.
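The sweep described above can be sketched in a few lines of NumPy; the readings and the candidate grid below are illustrative values, not from these notes:

```python
import numpy as np

# Hypothetical thermometer readings in °C (illustrative, not from the notes).
readings = np.array([19.0, 21.0, 23.0])

# Candidate estimates swept over a fine grid.
candidates = np.linspace(15.0, 27.0, 121)

# Total absolute error for each candidate: sum_i |x_i - theta_hat|
losses = np.array([np.sum(np.abs(readings - t)) for t in candidates])

# np.argmin returns the *index* of the smallest loss; indexing with it
# recovers the arg min — the candidate value itself.
best = candidates[np.argmin(losses)]
print(f"best estimate: {best:.1f}, total loss: {losses.min():.2f}")
```

A side effect worth noticing: the minimiser of total absolute error is the median of the readings, a useful contrast with squared error, whose minimiser turns out to be the mean.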

Common loss functions

Mean Absolute Error (MAE)

The average of absolute differences:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

Treats all errors equally regardless of direction. The loss-vs-estimate curve is V-shaped (for a single data point) or piecewise linear.

Sum of Squared Errors (SSE)

$$\text{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Squaring penalises large errors more heavily than small ones. The loss-vs-estimate curve is a smooth parabola (for a single parameter), which makes it easier to optimise with calculus. SSE is not an arbitrary choice — maximum likelihood estimation shows it falls out naturally from assuming Gaussian noise.

Mean Squared Error (MSE)

SSE divided by the number of data points:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Normalising by $n$ makes the loss comparable across datasets of different sizes but does not change which parameters minimise it.
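The three losses can be written as one-liners; the target and prediction vectors below are illustrative values, not from the notes:

```python
import numpy as np

def mae(y, y_hat):
    """Mean Absolute Error: average of |error|."""
    return np.mean(np.abs(y - y_hat))

def sse(y, y_hat):
    """Sum of Squared Errors: no averaging, so it grows with dataset size."""
    return np.sum((y - y_hat) ** 2)

def mse(y, y_hat):
    """Mean Squared Error: SSE divided by n."""
    return sse(y, y_hat) / len(y)

# Illustrative targets and predictions.
y     = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.0, 6.0])

print(mae(y, y_hat))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
print(sse(y, y_hat))  # 0.25 + 1.0 + 0.0 = 1.25
print(mse(y, y_hat))  # 1.25 / 3 ≈ 0.4167
```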

A special case where we don’t need gradient descent: the analytical solution for SSE

For most loss functions and model architectures, optimisation is iterative — gradient descent climbs down the loss landscape one step at a time. But for the specific case of estimating a single value from measurements under sum-of-squared-errors, calculus gives a closed-form answer directly.

Setup. Find the estimate $\hat{\theta}$ minimising $L(\hat{\theta}) = \sum_{i=1}^{n} (x_i - \hat{\theta})^2$.

Take the derivative. Expanding and differentiating term-by-term:

$$\frac{dL}{d\hat{\theta}} = \sum_{i=1}^{n} 2(x_i - \hat{\theta}) \cdot (-1) = -2 \sum_{i=1}^{n} (x_i - \hat{\theta})$$

Set it to zero (necessary condition for a minimum on a smooth convex function):

$$-2 \sum_{i=1}^{n} (x_i - \hat{\theta}) = 0 \quad\Rightarrow\quad \sum_{i=1}^{n} x_i = n\hat{\theta} \quad\Rightarrow\quad \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The optimal estimate is the arithmetic mean of the measurements. For three thermometer readings $x_1, x_2, x_3$, the SSE-optimal estimate is $(x_1 + x_2 + x_3)/3$.
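As a sanity check, the closed-form answer can be compared against a brute-force sweep; the simulated measurements below are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(20.0, 2.0, size=100)  # simulated noisy measurements around 20 °C

# Closed-form SSE minimiser: the arithmetic mean.
theta_closed = x.mean()

# Brute-force check: evaluate SSE on a fine grid and take the arg min.
candidates = np.linspace(x.min(), x.max(), 10001)
sse = ((x[None, :] - candidates[:, None]) ** 2).sum(axis=1)
theta_sweep = candidates[np.argmin(sse)]

# The two agree up to the grid resolution.
print(theta_closed, theta_sweep)
```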

This is the same answer maximum likelihood estimation gives for Gaussian noise, but derived without any probability — purely as the calculus solution to the SSE minimisation. (The two derivations agreeing is what we’d expect: SSE is what MLE recommends under a Gaussian assumption.)

ASIDE — Why we still need gradient descent

Closed-form solutions exist only for simple losses with simple models. As soon as the model has non-linearities, multiple parameters with non-trivial dependencies, or a non-quadratic loss, the “set the derivative to zero and solve” approach produces equations you can’t rearrange by hand. Iterative gradient descent works in every case where the loss is differentiable, regardless of how messy the closed-form solution would be — which is why it’s the workhorse algorithm for neural networks.
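For this one-parameter SSE problem gradient descent is overkill, but it makes a compact demonstration that the iterative route reaches the same answer as the closed form. The readings, starting point, and step size below are illustrative choices:

```python
import numpy as np

x = np.array([19.0, 21.0, 23.0])  # illustrative thermometer readings

theta = 0.0  # arbitrary starting guess
lr = 0.1     # step size; must satisfy lr < 1/n here or the iteration diverges
for _ in range(200):
    grad = -2.0 * np.sum(x - theta)  # dL/dtheta for SSE (derived above)
    theta -= lr * grad               # step downhill

print(theta)  # converges to the mean of x, i.e. 21.0
```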

Why optimisation is hard

For a single parameter, we could plot the loss and visually find the minimum. But real models have many parameters — modern neural networks have billions. Each parameter can take any real value, creating an effectively infinite search space. Brute-force search is impossible; we need efficient algorithms like gradient descent (week 2).

Active Recall