A principled way to choose parameters: pick the values that make the observed data most probable.

Notation

Before diving in, a few symbols that appear throughout:

| Symbol | Meaning |
| --- | --- |
| $\theta$ | A parameter — any unknown value we want to estimate (e.g. the true temperature, or a weight in a neural network) |
| $\hat{\theta}$ | Our estimate (or candidate value) for $\theta$ |
| $\theta^*$ or $\hat{\theta}_{\text{MLE}}$ | The optimal estimate — the specific value of $\hat{\theta}$ that maximises the likelihood (or equivalently minimises the loss) |
| $L(\hat{\theta})$ | A loss-function — maps $\hat{\theta}$ to a number measuring how bad that estimate is. Lower is better. |
| $\arg\min$ / $\arg\max$ | The value of the argument that achieves the minimum / maximum (not the min/max value itself) |

Definition

Maximum likelihood estimation (MLE) finds the parameter value that maximises the likelihood of the observed data $x$:

$$\theta^* = \arg\max_{\theta} \; p(x \mid \theta)$$

MLE asks: “out of all possible parameter values, which one would have been most likely to produce the data we actually saw?”
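
To make the definition concrete, here is a minimal sketch (numpy only; the readings and the assumed known noise level $\sigma = 1$ are hypothetical): sweep candidate parameter values and keep the one under which the observed data has the highest probability density.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

observations = np.array([23.1, 22.7, 23.4])   # hypothetical temperature readings
candidates = np.linspace(20.0, 26.0, 601)     # candidate values for the true temperature

# Likelihood of each candidate: how probable the observed data is under that parameter value
likelihoods = [np.prod(gaussian_pdf(observations, mu, 1.0)) for mu in candidates]

mu_mle = candidates[np.argmax(likelihoods)]   # the candidate that best explains the data
print(mu_mle)                                 # close to the sample mean (about 23.07)
```

For a Gaussian with known $\sigma$, the exact maximiser is the sample mean; the grid search just makes the "sweep over candidate values" picture literal.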

Likelihood vs probability

The expression $p(x \mid \theta)$ can be read in two directions, and the distinction matters:

  • Probability (fixed $\theta$, varying data): “Given that the true temperature is 20°C, what is the probability of observing a reading of 23°C?” Here we know the parameter and ask about possible outcomes.
  • Likelihood (fixed data, varying $\theta$): “Given that we observed 23°C, how likely is it that the true temperature is 20°C? What about 21°C? 22°C?” Here we know the data and ask which parameter value best explains it.

The formula is the same — $p(x \mid \theta)$ — but the interpretation flips depending on what we treat as fixed and what we vary. In MLE, the data is fixed (we already observed it) and we sweep over parameter values, so we call it the likelihood of $\theta$, not the probability of the data.
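
A short sketch of the two readings, using a hypothetical `gaussian_pdf` helper (numpy only, $\sigma$ assumed known): the same function is evaluated once with the parameter held fixed and once with the data held fixed.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Probability view: fix the parameter (true temperature 20 °C), vary the possible data.
density_over_data = {x: gaussian_pdf(x, mu=20.0, sigma=1.5) for x in (18.0, 20.0, 23.0)}

# Likelihood view: fix the data (we observed 23 °C), vary the candidate parameter.
likelihood_over_params = {mu: gaussian_pdf(23.0, mu=mu, sigma=1.5) for mu in (20.0, 21.0, 22.0, 23.0)}

# Same formula both times; only what we treat as fixed changes.
```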

The normal distribution

MLE requires a probabilistic model of how data is generated. A common and well-justified assumption is that measurements follow a normal (Gaussian) distribution:

$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean (the true value we’re estimating) and $\sigma$ is the standard deviation (how spread out the noise is).

What the Gaussian says about the data:

  • The distribution is symmetric around $\mu$ — overestimates and underestimates are equally likely
  • It is bell-shaped: values near $\mu$ are most probable, and probability drops off exponentially as you move away
  • $\sigma$ controls the width of the bell — small $\sigma$ means measurements are tightly clustered, large $\sigma$ means they’re spread out
  • About 68% of observations fall within $\sigma$ of $\mu$, and about 95% within $2\sigma$ (checked numerically in the sketch after this list)
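
A quick numerical check of the 68/95 rule from the last bullet, as a sketch using numpy sampling (the values $\mu = 20$ and $\sigma = 1.5$ are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 20.0, 1.5                                        # assumed true value and noise level
samples = rng.normal(mu, sigma, size=100_000)

within_1_sigma = np.mean(np.abs(samples - mu) <= sigma)      # approximately 0.68
within_2_sigma = np.mean(np.abs(samples - mu) <= 2 * sigma)  # approximately 0.95
print(within_1_sigma, within_2_sigma)
```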

Why assume a Gaussian?

  • It’s everywhere in nature. Heights, weights, measurement errors, sensor noise — many real-world quantities approximately follow it. If you measure the same thing repeatedly, the readings will typically scatter in a bell curve around the true value.
  • Central Limit Theorem. When an observation is the result of many small, independent random effects, their sum tends toward a Gaussian regardless of what the individual effects look like. This is why the Gaussian shows up so often — most noise is the accumulation of many tiny disturbances (see the sketch after this list).
  • Mathematical convenience. The Gaussian has properties that make optimisation clean: its log is a simple quadratic (the $\ln$ and $\exp$ cancel), which means the MLE derivation lands on a smooth, differentiable loss function (SSE) rather than something harder to work with.
  • Reasonable default for noise. Even when the true noise distribution is unknown, the Gaussian is often a good first approximation. It’s symmetric (no systematic bias), unimodal (one peak), and its tails decay quickly (extreme outliers are rare). For many practical problems, this is close enough.
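
The Central Limit Theorem point can be seen directly in code. A sketch (numpy only, all numbers illustrative): each reading is the true value plus the sum of many small uniform disturbances, which are individually nothing like a bell curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each reading = true value + sum of 30 small, independent, uniformly distributed effects.
effects = rng.uniform(-0.1, 0.1, size=(50_000, 30))
readings = 20.0 + effects.sum(axis=1)

# The accumulated noise behaves like a Gaussian: roughly symmetric around the true value,
# with about 68% of readings within one standard deviation of the mean.
print(readings.mean(), readings.std())
print(np.mean(np.abs(readings - readings.mean()) <= readings.std()))   # approximately 0.68
```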

Deriving squared error from MLE

This is the key result: starting from a Gaussian noise assumption, MLE leads directly to minimising the sum of squared errors.

Setup. We have observations $x_1, x_2, \dots, x_n$, each independently drawn from $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the unknown true value.

Step 1 — Write the likelihood. Since observations are independent, the joint probability is the product of individual probabilities:

$$L(\mu) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

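As a sketch of Step 1 (hypothetical readings, $\sigma$ assumed known), the joint likelihood of one candidate $\mu$ is just the product of the per-observation densities:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

observations = np.array([23.1, 22.7, 23.4])   # hypothetical readings
mu, sigma = 23.0, 1.0                         # one candidate parameter value, assumed noise level

likelihood = np.prod(gaussian_pdf(observations, mu, sigma))   # product over independent observations
```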
Step 2 — Take the log. We work with the log-likelihood instead of the raw likelihood. This is valid because $\ln$ is monotonically increasing ($a < b \Leftrightarrow \ln a < \ln b$ for $a, b > 0$), so the $\mu$ that maximises the likelihood also maximises the log-likelihood. But why do we want to take the log in the first place? Three reasons:

  1. Turns products into sums — makes differentiation tractable. The likelihood is a product of PDFs. To find its maximum we’d need to differentiate using the product rule across all terms — algebraically brutal for large $n$. The log converts the product into a sum ($\ln \prod_i f_i = \sum_i \ln f_i$), and differentiating a sum is straightforward (just differentiate each term independently).

  2. Eliminates the Gaussian’s exponential. Each Gaussian PDF contains $\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$. The log and exp cancel: $\ln(e^z) = z$, leaving a simple polynomial, $-\frac{(x_i - \mu)^2}{2\sigma^2}$. Without the log, we’d be optimising a product of exponentials — far harder.

  3. Avoids numerical underflow. Individual probabilities can be extremely small (e.g. on the order of $10^{-50}$). Multiplying many such numbers causes a computer to round the result to zero, making optimisation impossible. Taking the log first converts these to manageable negative numbers (e.g. $\ln 10^{-50} \approx -115$), and adding them preserves numerical precision (demonstrated in the sketch after this list).
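
The underflow problem is easy to demonstrate. A sketch with 2,000 simulated observations (numpy only): the raw product of densities rounds to zero in floating point, while the sum of log-densities stays perfectly usable.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

rng = np.random.default_rng(0)
observations = rng.normal(20.0, 1.0, size=2_000)     # many observations

densities = gaussian_pdf(observations, mu=20.0, sigma=1.0)
raw_likelihood = np.prod(densities)                  # underflows to 0.0, useless for optimisation
log_likelihood = np.sum(np.log(densities))           # a finite, manageable negative number
print(raw_likelihood, log_likelihood)
```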

Applying the log:

$$\ln L(\mu) = \sum_{i=1}^{n} \left[ \ln\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$$

Step 3 — Drop constants (preserves optimum via translation invariance). The first term inside the sum ($\ln\frac{1}{\sqrt{2\pi\sigma^2}}$) and the denominator $2\sigma^2$ do not depend on $\mu$. Adding or subtracting a constant shifts the entire loss curve up or down uniformly — every candidate is affected equally — so the location of the maximum does not move. Multiplying by a positive constant ($\frac{1}{2\sigma^2}$) stretches the curve vertically but likewise leaves the peak in the same place. After dropping these:

$$\mu^* = \arg\max_{\mu} \; -\sum_{i=1}^{n} (x_i - \mu)^2$$

Step 4 — Flip sign (preserves optimum via negation). Multiplying by $-1$ flips the curve upside-down: every peak becomes a valley and vice versa. So the $\mu$ that maximises $-\sum_{i=1}^{n} (x_i - \mu)^2$ is the same $\mu$ that minimises $\sum_{i=1}^{n} (x_i - \mu)^2$:

$$\mu^* = \arg\min_{\mu} \sum_{i=1}^{n} (x_i - \mu)^2$$

This is the sum of squared errors (SSE) — the same loss-function we use to train models.
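
To see the equivalence end-to-end, a sketch (numpy only, hypothetical readings): minimising the SSE over candidate values of $\mu$ lands on the same answer as maximising the likelihood, namely the sample mean.

```python
import numpy as np

observations = np.array([23.1, 22.7, 23.4])    # hypothetical readings
candidates = np.linspace(20.0, 26.0, 6001)

sse = np.array([np.sum((observations - mu) ** 2) for mu in candidates])
mu_hat = candidates[np.argmin(sse)]

print(mu_hat, observations.mean())   # both about 23.067: the SSE minimiser is the sample mean
```

The same result falls out analytically: setting $\frac{d}{d\mu}\sum_i (x_i - \mu)^2 = -2\sum_i (x_i - \mu) = 0$ gives $\mu^* = \frac{1}{n}\sum_i x_i$.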

Three optimum-preserving operations (summary)

| Operation | Why it preserves the optimum |
| --- | --- |
| Apply $\ln$ | Monotonically increasing — doesn’t reorder values, so the peak stays in the same place |
| Drop additive constants / multiply by a positive scalar | Shifts or stretches the curve uniformly — all candidates move together, so the peak doesn’t shift |
| Negate (multiply by $-1$) | Flips max into min — converts $\arg\max$ to $\arg\min$ at the same location |
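
A sketch checking all three operations at once (numpy only, hypothetical readings): the index of the best candidate is unchanged by taking logs, adding or scaling by positive constants, and negating (with max swapped for min).

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

observations = np.array([23.1, 22.7, 23.4])
candidates = np.linspace(20.0, 26.0, 601)

likelihood = np.array([np.prod(gaussian_pdf(observations, mu, 1.0)) for mu in candidates])
log_lik = np.log(likelihood)              # apply ln
scaled = 2.0 * log_lik + 7.0              # add a constant, multiply by a positive scalar
negated = -scaled                         # negate: the peak becomes a valley

assert np.argmax(likelihood) == np.argmax(log_lik) == np.argmax(scaled) == np.argmin(negated)
```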

Why this matters

The key insight: we did not arbitrarily choose squared error as a loss function — it falls out naturally from our probabilistic assumptions. Specifically, if you believe:

  1. Each observation is generated independently
  2. The noise around the true value follows a normal distribution

then the mathematically correct thing to do is minimise the sum of squared errors. MLE derives the loss function rather than asserting it.

Why this matters for neural networks. MLE is the bridge between probability and training. When training a neural network:

  1. We assume our data comes from some underlying probability distribution
  2. We choose an appropriate distribution for the problem (Gaussian for regression, Bernoulli for binary classification, categorical for multi-class, etc.)
  3. Maximum likelihood automatically gives us the right loss function — we don’t guess or hand-pick it
  4. Gaussian noise assumption → Mean Squared Error (MSE)
  5. Categorical distribution assumption → Cross-Entropy Loss (covered in later weeks)

The entire training pipeline — forward pass, loss computation, backpropagation — rests on this probabilistic foundation. Every time you see a loss function in this module, ask: “what distribution assumption does this correspond to?”
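
As a closing sketch (numpy only, all numbers hypothetical), the Gaussian-to-MSE connection can be written out directly: for a fixed noise level $\sigma$, the negative Gaussian log-likelihood of a model's predictions is the SSE scaled by $\frac{1}{2\sigma^2}$ plus a constant, so minimising it is the same as minimising MSE.

```python
import numpy as np

targets = np.array([1.0, 2.0, 3.0])
predictions = np.array([1.1, 1.9, 3.2])        # hypothetical model outputs
sigma = 1.0                                    # assumed fixed noise level

n = len(targets)
sse = np.sum((targets - predictions) ** 2)

# Negative log-likelihood under a Gaussian noise model: constant + SSE / (2 * sigma^2)
nll = 0.5 * n * np.log(2 * np.pi * sigma ** 2) + sse / (2 * sigma ** 2)
mse = sse / n

# nll and mse differ only by additive constants and a positive scale factor,
# so the parameters that minimise one also minimise the other.
print(nll, mse)
```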

  • loss-function — MLE provides a probabilistic justification for squared error loss
  • perceptron — the model whose parameters we find via loss minimisation

Active Recall