A principled way to choose parameters: pick the values that make the observed data most probable.

Notation

Before diving in, a few symbols that appear throughout:

| Symbol | Meaning |
| --- | --- |
| $\theta$ | A parameter — any unknown value we want to estimate (e.g. the true temperature, or a weight in a neural network) |
| $\hat{\theta}$ | Our estimate (or candidate value) for $\theta$ |
| $\theta^*$ or $\hat{\theta}_{\text{MLE}}$ | The optimal estimate — the specific value of $\hat{\theta}$ that maximises the likelihood (or equivalently minimises the loss) |
| $L(\hat{\theta})$ | A loss-function — maps $\hat{\theta}$ to a number measuring how bad that estimate is. Lower is better. |
| $\arg\min$ / $\arg\max$ | The value of the argument that achieves the minimum / maximum (not the min/max value itself) |

Definition

Maximum likelihood estimation (MLE) finds the parameter value that maximises the likelihood of the observed data $x$:

$$\theta^* = \arg\max_{\theta} \; p(x \mid \theta)$$

MLE asks: “out of all possible parameter values, which one would have been most likely to produce the data we actually saw?”
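
To make the definition concrete, here is a minimal sketch (numpy only; the readings and the assumed known noise level $\sigma = 1$ are hypothetical): sweep candidate parameter values and keep the one under which the observed data has the highest probability density.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

observations = np.array([23.1, 22.7, 23.4])   # hypothetical temperature readings
candidates = np.linspace(20.0, 26.0, 601)     # candidate values for the true temperature

# Likelihood of each candidate: how probable the observed data is under that parameter value
likelihoods = [np.prod(gaussian_pdf(observations, mu, 1.0)) for mu in candidates]

mu_mle = candidates[np.argmax(likelihoods)]   # the candidate that best explains the data
print(mu_mle)                                 # close to the sample mean (about 23.07)
```

For a Gaussian with known $\sigma$, the exact maximiser is the sample mean; the grid search just makes the "sweep over candidate values" picture literal.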

Likelihood vs probability

The expression $p(x \mid \theta)$ can be read in two directions, and the distinction matters:

  • Probability (fixed $\theta$, varying data): “Given that the true temperature is 20°C, what is the probability of observing a reading of 23°C?” Here we know the parameter and ask about possible outcomes.
  • Likelihood (fixed data, varying $\theta$): “Given that we observed 23°C, how likely is it that the true temperature is 20°C? What about 21°C? 22°C?” Here we know the data and ask which parameter value best explains it.

The formula is the same — $p(x \mid \theta)$ — but the interpretation flips depending on what we treat as fixed and what we vary. In MLE, the data is fixed (we already observed it) and we sweep over parameter values, so we call it the likelihood of $\theta$, not the probability of the data.
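
A short sketch of the two readings, using a hypothetical `gaussian_pdf` helper (numpy only, $\sigma$ assumed known): the same function is evaluated once with the parameter held fixed and once with the data held fixed.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Probability view: fix the parameter (true temperature 20 °C), vary the possible data.
density_over_data = {x: gaussian_pdf(x, mu=20.0, sigma=1.5) for x in (18.0, 20.0, 23.0)}

# Likelihood view: fix the data (we observed 23 °C), vary the candidate parameter.
likelihood_over_params = {mu: gaussian_pdf(23.0, mu=mu, sigma=1.5) for mu in (20.0, 21.0, 22.0, 23.0)}

# Same formula both times; only what we treat as fixed changes.
```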

The normal distribution

MLE requires a probabilistic model of how data is generated. A common and well-justified assumption is that measurements follow a normal (Gaussian) distribution:

$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean (the true value we’re estimating) and $\sigma$ is the standard deviation (how spread out the noise is).

What the Gaussian says about the data:

  • The distribution is symmetric around $\mu$ — overestimates and underestimates are equally likely
  • It is bell-shaped: values near $\mu$ are most probable, and probability drops off exponentially as you move away
  • $\sigma$ controls the width of the bell — small $\sigma$ means measurements are tightly clustered, large $\sigma$ means they’re spread out
  • About 68% of observations fall within $\sigma$ of $\mu$, and about 95% within $2\sigma$ (checked numerically in the sketch after this list)
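
A quick numerical check of the 68/95 rule from the last bullet, as a sketch using numpy sampling (the values $\mu = 20$ and $\sigma = 1.5$ are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 20.0, 1.5                                        # assumed true value and noise level
samples = rng.normal(mu, sigma, size=100_000)

within_1_sigma = np.mean(np.abs(samples - mu) <= sigma)      # approximately 0.68
within_2_sigma = np.mean(np.abs(samples - mu) <= 2 * sigma)  # approximately 0.95
print(within_1_sigma, within_2_sigma)
```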

Why assume a Gaussian?

  • It’s everywhere in nature. Heights, weights, measurement errors, sensor noise — many real-world quantities approximately follow it. If you measure the same thing repeatedly, the readings will typically scatter in a bell curve around the true value.
  • Central Limit Theorem. When an observation is the result of many small, independent random effects, their sum tends toward a Gaussian regardless of what the individual effects look like. This is why the Gaussian shows up so often — most noise is the accumulation of many tiny disturbances (see the sketch after this list).
  • Mathematical convenience. The Gaussian has properties that make optimisation clean: its log is a simple quadratic (the $\ln$ and $\exp$ cancel), which means the MLE derivation lands on a smooth, differentiable loss function (SSE) rather than something harder to work with.
  • Reasonable default for noise. Even when the true noise distribution is unknown, the Gaussian is often a good first approximation. It’s symmetric (no systematic bias), unimodal (one peak), and its tails decay quickly (extreme outliers are rare). For many practical problems, this is close enough.
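
The Central Limit Theorem point can be seen directly in code. A sketch (numpy only, all numbers illustrative): each reading is the true value plus the sum of many small uniform disturbances, which are individually nothing like a bell curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each reading = true value + sum of 30 small, independent, uniformly distributed effects.
effects = rng.uniform(-0.1, 0.1, size=(50_000, 30))
readings = 20.0 + effects.sum(axis=1)

# The accumulated noise behaves like a Gaussian: roughly symmetric around the true value,
# with about 68% of readings within one standard deviation of the mean.
print(readings.mean(), readings.std())
print(np.mean(np.abs(readings - readings.mean()) <= readings.std()))   # approximately 0.68
```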

Deriving squared error from MLE

This is the key result: starting from a Gaussian noise assumption, MLE leads directly to minimising the sum of squared errors.

Setup. We have observations $x_1, x_2, \dots, x_n$, each independently drawn from $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the unknown true value.

Step 1 — Write the likelihood. Since observations are independent, the joint probability is the product of individual probabilities:

$$L(\mu) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

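As a sketch of Step 1 (hypothetical readings, $\sigma$ assumed known), the joint likelihood of one candidate $\mu$ is just the product of the per-observation densities:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

observations = np.array([23.1, 22.7, 23.4])   # hypothetical readings
mu, sigma = 23.0, 1.0                         # one candidate parameter value, assumed noise level

likelihood = np.prod(gaussian_pdf(observations, mu, sigma))   # product over independent observations
```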
Step 2 — Take the log. We work with the log-likelihood instead of the raw likelihood. This is valid because $\ln$ is monotonically increasing ($a < b \Leftrightarrow \ln a < \ln b$ for $a, b > 0$), so the $\mu$ that maximises the likelihood also maximises the log-likelihood. But why do we want to take the log in the first place? Three reasons:

  1. Turns products into sums — makes differentiation tractable. The likelihood is a product of PDFs. To find its maximum we’d need to differentiate using the product rule across all terms — algebraically brutal for large $n$. The log converts the product into a sum ($\ln \prod_i f_i = \sum_i \ln f_i$), and differentiating a sum is straightforward (just differentiate each term independently).

  2. Eliminates the Gaussian’s exponential. Each Gaussian PDF contains $\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$. The log and exp cancel: $\ln(e^z) = z$, leaving a simple polynomial, $-\frac{(x_i - \mu)^2}{2\sigma^2}$. Without the log, we’d be optimising a product of exponentials — far harder.

  3. Avoids numerical underflow. Individual probabilities can be extremely small (e.g. on the order of $10^{-50}$). Multiplying many such numbers causes a computer to round the result to zero, making optimisation impossible. Taking the log first converts these to manageable negative numbers (e.g. $\ln 10^{-50} \approx -115$), and adding them preserves numerical precision (demonstrated in the sketch after this list).
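
The underflow problem is easy to demonstrate. A sketch with 2,000 simulated observations (numpy only): the raw product of densities rounds to zero in floating point, while the sum of log-densities stays perfectly usable.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

rng = np.random.default_rng(0)
observations = rng.normal(20.0, 1.0, size=2_000)     # many observations

densities = gaussian_pdf(observations, mu=20.0, sigma=1.0)
raw_likelihood = np.prod(densities)                  # underflows to 0.0, useless for optimisation
log_likelihood = np.sum(np.log(densities))           # a finite, manageable negative number
print(raw_likelihood, log_likelihood)
```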

Applying the log:

$$\ln L(\mu) = \sum_{i=1}^{n} \left[ \ln\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$$

Step 3 — Drop constants (preserves optimum via translation invariance). The first term inside the sum ($\ln\frac{1}{\sqrt{2\pi\sigma^2}}$) and the denominator $2\sigma^2$ do not depend on $\mu$. Adding or subtracting a constant shifts the entire loss curve up or down uniformly — every candidate is affected equally — so the location of the maximum does not move. Multiplying by a positive constant ($\frac{1}{2\sigma^2}$) stretches the curve vertically but likewise leaves the peak in the same place. After dropping these:

$$\mu^* = \arg\max_{\mu} \; -\sum_{i=1}^{n} (x_i - \mu)^2$$

Step 4 — Flip sign (preserves optimum via negation). Multiplying by $-1$ flips the curve upside-down: every peak becomes a valley and vice versa. So the $\mu$ that maximises $-\sum_{i=1}^{n} (x_i - \mu)^2$ is the same $\mu$ that minimises $\sum_{i=1}^{n} (x_i - \mu)^2$:

$$\mu^* = \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2$$

This is the sum of squared errors (SSE) — the same loss-function we use to train models.
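
To see the equivalence end-to-end, a sketch (numpy only, hypothetical readings): minimising the SSE over candidate values of $\mu$ lands on the same answer as maximising the likelihood, namely the sample mean.

```python
import numpy as np

observations = np.array([23.1, 22.7, 23.4])    # hypothetical readings
candidates = np.linspace(20.0, 26.0, 6001)

sse = np.array([np.sum((observations - mu) ** 2) for mu in candidates])
mu_hat = candidates[np.argmin(sse)]

print(mu_hat, observations.mean())   # both about 23.067: the SSE minimiser is the sample mean
```

The same result falls out analytically: setting $\frac{d}{d\mu}\sum_i (x_i - \mu)^2 = -2\sum_i (x_i - \mu) = 0$ gives $\mu^* = \frac{1}{n}\sum_i x_i$.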

Three optimum-preserving operations (summary)

| Operation | Why it preserves the optimum |
| --- | --- |
| Apply $\ln$ | Monotonically increasing — doesn’t reorder values, so the peak stays in the same place |
| Drop additive constants / multiply by a positive scalar | Shifts or stretches the curve uniformly — all candidates move together, so the peak doesn’t shift |
| Negate (multiply by $-1$) | Flips max into min — converts $\arg\max$ to $\arg\min$ at the same location |
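
A sketch checking all three operations at once (numpy only, hypothetical readings): the index of the best candidate is unchanged by taking logs, adding or scaling by positive constants, and negating (with max swapped for min).

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

observations = np.array([23.1, 22.7, 23.4])
candidates = np.linspace(20.0, 26.0, 601)

likelihood = np.array([np.prod(gaussian_pdf(observations, mu, 1.0)) for mu in candidates])
log_lik = np.log(likelihood)              # apply ln
scaled = 2.0 * log_lik + 7.0              # add a constant, multiply by a positive scalar
negated = -scaled                         # negate: the peak becomes a valley

assert np.argmax(likelihood) == np.argmax(log_lik) == np.argmax(scaled) == np.argmin(negated)
```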

Why this matters

The key insight: we did not arbitrarily choose squared error as a loss function — it falls out naturally from our probabilistic assumptions. Specifically, if you believe:

  1. Each observation is generated independently
  2. The noise around the true value follows a normal distribution

then the mathematically correct thing to do is minimise the sum of squared errors. MLE derives the loss function rather than asserting it.

Why this matters for neural networks. MLE is the bridge between probability and training. When training a neural network:

  1. We assume our data comes from some underlying probability distribution
  2. We choose an appropriate distribution for the problem (Gaussian for regression, Bernoulli for binary classification, categorical for multi-class, etc.)
  3. Maximum likelihood automatically gives us the right loss function — we don’t guess or hand-pick it
  4. Gaussian noise assumption → Mean Squared Error (MSE)
  5. Categorical distribution assumption → Cross-Entropy Loss (covered in later weeks)

The entire training pipeline — forward pass, loss computation, backpropagation — rests on this probabilistic foundation. Every time you see a loss function in this module, ask: “what distribution assumption does this correspond to?”
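
As a closing sketch (numpy only, all numbers hypothetical), the Gaussian-to-MSE connection can be written out directly: for a fixed noise level $\sigma$, the negative Gaussian log-likelihood of a model's predictions is the SSE scaled by $\frac{1}{2\sigma^2}$ plus a constant, so minimising it is the same as minimising MSE.

```python
import numpy as np

targets = np.array([1.0, 2.0, 3.0])
predictions = np.array([1.1, 1.9, 3.2])        # hypothetical model outputs
sigma = 1.0                                    # assumed fixed noise level

n = len(targets)
sse = np.sum((targets - predictions) ** 2)

# Negative log-likelihood under a Gaussian noise model: constant + SSE / (2 * sigma^2)
nll = 0.5 * n * np.log(2 * np.pi * sigma ** 2) + sse / (2 * sigma ** 2)
mse = sse / n

# nll and mse differ only by additive constants and a positive scale factor,
# so the parameters that minimise one also minimise the other.
print(nll, mse)
```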

  • loss-function — MLE provides a probabilistic justification for squared error loss
  • perceptron — the model whose parameters we find via loss minimisation

Active Recall