A function that measures how far a model’s predictions are from the truth — the thing we minimise when we train.

Definition

A loss function (also called a cost function or objective function) maps a model's predictions and the true values to a single non-negative number. Smaller values mean better predictions. Training a model means finding parameters that minimise the loss:

$$\theta^* = \arg\min_{\theta} L(\theta)$$

The notation $\arg\min$ means "the argument (value) that produces the minimum" — not the minimum itself, but the parameter values that achieve it.

Why we need loss functions

A perceptron has parameters $w$ (weights) and $b$ (bias), but how do we find the right values? We need a way to score any candidate set of parameters: "these weights give a total error of 9; those give 7; those give 3." The loss function provides that score. Once we have it, learning becomes an optimisation problem.

Concrete example: temperature estimation

Suppose three noisy thermometers give readings $x_1$, $x_2$, $x_3$ (in °C). We want to estimate the true temperature $\theta$.

Using absolute error, an estimate $\hat{\theta}$ scores

$$L(\hat{\theta}) = |x_1 - \hat{\theta}| + |x_2 - \hat{\theta}| + |x_3 - \hat{\theta}|$$

An estimate further from the readings gives a larger total — worse. We sweep through all candidates and pick the one with the lowest total loss.
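The sweep described above can be sketched in a few lines of NumPy; the readings and the candidate grid below are illustrative values, not from these notes:

```python
import numpy as np

# Hypothetical thermometer readings in °C (illustrative, not from the notes).
readings = np.array([19.0, 21.0, 23.0])

# Candidate estimates swept over a fine grid.
candidates = np.linspace(15.0, 27.0, 121)

# Total absolute error for each candidate: sum_i |x_i - theta_hat|
losses = np.array([np.sum(np.abs(readings - t)) for t in candidates])

# np.argmin returns the *index* of the smallest loss; indexing with it
# recovers the arg min — the candidate value itself.
best = candidates[np.argmin(losses)]
print(f"best estimate: {best:.1f}, total loss: {losses.min():.2f}")
```

A side effect worth noticing: the minimiser of total absolute error is the median of the readings, a useful contrast with squared error, whose minimiser turns out to be the mean.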

Common loss functions

Mean Absolute Error (MAE)

The average of absolute differences:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

Treats all errors equally regardless of direction. The loss-vs-estimate curve is V-shaped (for a single data point) or piecewise linear.

Sum of Squared Errors (SSE)

$$\text{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Squaring penalises large errors more heavily than small ones. The loss-vs-estimate curve is a smooth parabola (for a single parameter), which makes it easier to optimise with calculus. SSE is not an arbitrary choice — maximum likelihood estimation shows it falls out naturally from assuming Gaussian noise.

Mean Squared Error (MSE)

SSE divided by the number of data points:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Normalising by $n$ makes the loss comparable across datasets of different sizes but does not change which parameters minimise it.
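The three losses can be written as one-liners; the target and prediction vectors below are illustrative values, not from the notes:

```python
import numpy as np

def mae(y, y_hat):
    """Mean Absolute Error: average of |error|."""
    return np.mean(np.abs(y - y_hat))

def sse(y, y_hat):
    """Sum of Squared Errors: no averaging, so it grows with dataset size."""
    return np.sum((y - y_hat) ** 2)

def mse(y, y_hat):
    """Mean Squared Error: SSE divided by n."""
    return sse(y, y_hat) / len(y)

# Illustrative targets and predictions.
y     = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.0, 6.0])

print(mae(y, y_hat))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
print(sse(y, y_hat))  # 0.25 + 1.0 + 0.0 = 1.25
print(mse(y, y_hat))  # 1.25 / 3 ≈ 0.4167
```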

A special case where we don’t need gradient descent: the analytical solution for SSE

For most loss functions and model architectures, optimisation is iterative — gradient descent climbs down the loss landscape one step at a time. But for the specific case of estimating a single value from measurements under sum-of-squared-errors, calculus gives a closed-form answer directly.

Setup. Find the estimate $\hat{\theta}$ minimising $L(\hat{\theta}) = \sum_{i=1}^{n} (x_i - \hat{\theta})^2$.

Take the derivative. Expanding and differentiating term-by-term:

$$\frac{dL}{d\hat{\theta}} = \sum_{i=1}^{n} 2(x_i - \hat{\theta}) \cdot (-1) = -2 \sum_{i=1}^{n} (x_i - \hat{\theta})$$

Set it to zero (necessary condition for a minimum on a smooth convex function):

$$-2 \sum_{i=1}^{n} (x_i - \hat{\theta}) = 0 \quad\Rightarrow\quad \sum_{i=1}^{n} x_i = n\hat{\theta} \quad\Rightarrow\quad \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The optimal estimate is the arithmetic mean of the measurements. For three thermometer readings $x_1, x_2, x_3$, the SSE-optimal estimate is $(x_1 + x_2 + x_3)/3$.
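As a sanity check, the closed-form answer can be compared against a brute-force sweep; the simulated measurements below are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(20.0, 2.0, size=100)  # simulated noisy measurements around 20 °C

# Closed-form SSE minimiser: the arithmetic mean.
theta_closed = x.mean()

# Brute-force check: evaluate SSE on a fine grid and take the arg min.
candidates = np.linspace(x.min(), x.max(), 10001)
sse = ((x[None, :] - candidates[:, None]) ** 2).sum(axis=1)
theta_sweep = candidates[np.argmin(sse)]

# The two agree up to the grid resolution.
print(theta_closed, theta_sweep)
```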

This is the same answer maximum likelihood estimation gives for Gaussian noise, but derived without any probability — purely as the calculus solution to the SSE minimisation. (The two derivations agreeing is what we’d expect: SSE is what MLE recommends under a Gaussian assumption.)

ASIDE — Why we still need gradient descent

Closed-form solutions exist only for simple losses with simple models. As soon as the model has non-linearities, multiple parameters with non-trivial dependencies, or a non-quadratic loss, the “set the derivative to zero and solve” approach produces equations you can’t rearrange by hand. Iterative gradient descent works in every case where the loss is differentiable, regardless of how messy the closed-form solution would be — which is why it’s the workhorse algorithm for neural networks.
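For this one-parameter SSE problem gradient descent is overkill, but it makes a compact demonstration that the iterative route reaches the same answer as the closed form. The readings, starting point, and step size below are illustrative choices:

```python
import numpy as np

x = np.array([19.0, 21.0, 23.0])  # illustrative thermometer readings

theta = 0.0  # arbitrary starting guess
lr = 0.1     # step size; must satisfy lr < 1/n here or the iteration diverges
for _ in range(200):
    grad = -2.0 * np.sum(x - theta)  # dL/dtheta for SSE (derived above)
    theta -= lr * grad               # step downhill

print(theta)  # converges to the mean of x, i.e. 21.0
```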

Why optimisation is hard

For a single parameter, we could plot the loss and visually find the minimum. But real models have many parameters — modern neural networks have billions. Each parameter can take any real value, creating an effectively infinite search space. Brute-force search is impossible; we need efficient algorithms like gradient descent (week 2).

Active Recall