The classification counterpart to squared error. Where squared error is what max likelihood prescribes under Gaussian noise, cross-entropy is what max likelihood prescribes for classes drawn from a Bernoulli with $p = \hat{y}$, the model's predicted probability.
Definition
For a binary classification dataset with labels $y_i \in \{0, 1\}$ and sigmoid-perceptron predictions $\hat{y}_i = \sigma(w^\top x_i + b)$:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Bigr]$$
The minus sign in front turns a log-likelihood maximisation into a loss minimisation.
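A minimal NumPy sketch of this definition (the function name and the `eps` clipping that keeps the logs finite are my additions, not part of the note):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch of predictions."""
    # Clip predictions away from exactly 0 or 1 so the logs stay finite.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.95, 0.05, 0.10])
print(binary_cross_entropy(y_true, y_pred))  # ≈ 0.80, dominated by the last (wrong) sample
```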
What each term does
For each training sample, exactly one of the two terms is active (because $y_i \in \{0, 1\}$):
| Label | Active term | Penalises |
|---|---|---|
| $y_i = 1$ | $-\log \hat{y}_i$ | $\hat{y}_i$ close to 0 (big loss if the model is confidently wrong) |
| $y_i = 0$ | $-\log(1 - \hat{y}_i)$ | $\hat{y}_i$ close to 1 (big loss if the model is confidently wrong) |
- Correct, confident predictions ($\hat{y}_i \approx 1$ when $y_i = 1$, or $\hat{y}_i \approx 0$ when $y_i = 0$) give $\approx 0$ loss.
- Wrong, confident predictions drive $\hat{y}_i \to 0$ when the true label is 1 (or vice versa), and $-\log \hat{y}_i \to \infty$. The loss is unbounded above — wrongness is punished severely (numeric check below).
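As a quick numeric check of the claims above (the probability values are arbitrary; only the trend matters):

```python
import numpy as np

# Per-sample loss for a label-1 example is -log(y_hat).
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y=1, y_hat={y_hat:>4}: loss = {-np.log(y_hat):.3f}")
# 0.99 -> 0.010  (confident and right: near-zero loss)
# 0.01 -> 4.605  (confident and wrong: large loss, unbounded as y_hat -> 0)
```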
Derivation from max likelihood
Assume each label is drawn from a Bernoulli distribution parameterised by the sigmoid output:

$$P(y_i = 1 \mid x_i) = \hat{y}_i, \qquad P(y_i = 0 \mid x_i) = 1 - \hat{y}_i$$
Using the $y_i \in \{0, 1\}$ encoding, this collapses into one expression valid for both labels:

$$P(y_i \mid x_i) = \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{\,1 - y_i}$$

(Plug in $y_i = 1$: get $\hat{y}_i$. Plug in $y_i = 0$: get $1 - \hat{y}_i$. ✓)
Now apply the standard maximum likelihood estimation recipe:
- Assume conditional independence: $P(y_1, \dots, y_N \mid x_1, \dots, x_N) = \prod_{i=1}^{N} P(y_i \mid x_i)$.
- Take the log to turn the product into a sum.
- Flip the sign to turn a maximisation into a minimisation.
The result is exactly binary cross-entropy (the $\tfrac{1}{N}$ averaging in the definition above is an optional rescaling that doesn't move the minimum):

$$L = -\sum_{i=1}^{N}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Bigr]$$
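Spelled out, with $\mathcal{L}$ denoting the likelihood (the symbol choice is mine) and $L$ the loss:

$$
\begin{aligned}
\mathcal{L}(w, b) &= \prod_{i=1}^{N} \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{\,1 - y_i}
  && \text{conditional independence} \\
\log \mathcal{L}(w, b) &= \sum_{i=1}^{N} \bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]
  && \text{log turns the product into a sum} \\
L(w, b) &= -\log \mathcal{L}(w, b)
  && \text{flip the sign: maximise} \rightarrow \text{minimise}
\end{aligned}
$$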
This mirrors the derivation of squared error under Gaussian noise — same pattern, different probability model.
Why cross-entropy and not squared error for classification
Squared error would run — sigmoid is differentiable, so the gradient $\partial L / \partial w$ exists and is non-zero. But it’s a worse fit for two reasons:
- Probabilistic meaning. Cross-entropy is what max likelihood recommends when your outputs are class probabilities. Squared error assumes Gaussian noise on a continuous target, which doesn’t match the Bernoulli structure of binary labels.
- Gradient magnitude. Squared error combined with sigmoid leads to gradients of the form $(\hat{y} - y)\,\sigma'(z)\,x$, which vanishes whenever $\sigma'(z)$ saturates — even when the error $(\hat{y} - y)$ is large. Cross-entropy cancels this sigmoid derivative cleanly, so the parameter gradient becomes simply $(\hat{y} - y)\,x$ — proportional to the error, no matter where you are on the sigmoid curve. Learning is faster and more stable (see the comparison sketch after this list).
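A quick sketch of that gradient comparison, assuming the squared-error form $\tfrac{1}{2}(\hat{y} - y)^2$ so that its $\partial L/\partial z$ carries the $\sigma'(z) = \hat{y}(1 - \hat{y})$ factor:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single sample with true label y = 1 but a strongly negative pre-activation:
# the sigmoid is saturated near 0, i.e. the model is confidently wrong.
y, z = 1.0, -8.0
y_hat = sigmoid(z)                              # ≈ 0.000335

grad_ce  = y_hat - y                            # cross-entropy dL/dz ≈ -1.0 (strong signal)
grad_mse = (y_hat - y) * y_hat * (1 - y_hat)    # squared-error dL/dz ≈ -3.3e-4 (vanishes)
print(grad_ce, grad_mse)
```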
Connection to gradient descent
Training a sigmoid classifier is just gradient descent on binary cross-entropy:
- Forward: $z = w^\top x + b$, $\hat{y} = \sigma(z)$.
- Loss: $L = -\bigl[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\bigr]$.
- Gradient (after chain rule): $\partial L / \partial w = (\hat{y} - y)\,x$, $\partial L / \partial b = \hat{y} - y$.
- Update: $w \leftarrow w - \eta\,(\hat{y} - y)\,x$, $b \leftarrow b - \eta\,(\hat{y} - y)$.
You’ll implement exactly this in the week 2 Python exercise.
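A minimal sketch of that loop in NumPy (the toy data, learning rate, and step count are my own choices, so treat it as an illustration rather than the exercise itself):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))               # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy linearly separable labels

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(500):
    y_hat = sigmoid(X @ w + b)              # forward pass
    err = y_hat - y                         # dL/dz for cross-entropy
    w -= eta * X.T @ err / len(y)           # dL/dw averaged over the batch
    b -= eta * err.mean()                   # dL/db averaged over the batch

loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(f"loss ≈ {loss:.3f}, accuracy = {((y_hat > 0.5) == y).mean():.0%}")
```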
Multiclass extension
For more than two classes, the natural output is a probability distribution over $K$ classes produced by softmax, and the labels are represented as one-hot vectors of length $K$. The loss becomes categorical cross-entropy:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}$$

where $y_{ik}$ is 1 if sample $i$ belongs to class $k$ (else 0), and $\hat{y}_{ik}$ is the softmax output for that class.
Because only one $y_{ik}$ is non-zero per sample, the inner sum collapses to a single term — $-\log \hat{y}_{ik}$ for the true class $k$ — so the per-sample loss is just $-\log$ of the probability the model assigned to the correct class. Binary cross-entropy is the special case for $K = 2$ (with the two classes implicit in $\hat{y}_i$ and $1 - \hat{y}_i$). Same recipe, same MLE derivation, different number of classes.
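A small NumPy sketch of the multiclass case (softmax over raw logits; the function names and example logits are my own):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    # The inner sum picks out -log(probability of the true class) per sample.
    return -np.mean(np.sum(y_onehot * np.log(np.clip(y_prob, eps, 1.0)), axis=1))

logits   = np.array([[2.0, 0.5, -1.0],
                     [0.1, 0.2,  3.0]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]], dtype=float)
print(categorical_cross_entropy(y_onehot, softmax(logits)))  # ≈ 1.58, mostly from sample 2
```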
Related
- sigmoid function — produces the binary probabilities that cross-entropy consumes
- softmax — produces the multi-class probabilities for the categorical extension
- loss-function — the general concept; this is the classification specialisation
- maximum likelihood estimation — the derivation of cross-entropy from first principles
- gradient descent — what minimises it
Active Recall
Write the binary cross-entropy loss and explain which term is active for each label.
$L = -\frac{1}{N}\sum_i \bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]$. When $y_i = 1$ the factor $(1 - y_i)$ zeroes the second term, leaving $-\log \hat{y}_i$. When $y_i = 0$ the factor $y_i$ zeroes the first term, leaving $-\log(1 - \hat{y}_i)$. Each training example contributes one of the two terms.
A sigmoid classifier outputs $\hat{y} = 0.01$ for a sample whose true label is $y = 1$. Compute the per-sample cross-entropy loss and explain why it is large.
$L = -\bigl[\, 1 \cdot \log 0.01 + 0 \cdot \log 0.99 \,\bigr] = -\log 0.01 = \log 100 \approx 4.6$. The loss is large because the model is confidently wrong — it said the probability of class 1 was only 1%, but the truth is class 1. Cross-entropy grows without bound as confidently-wrong predictions approach $\hat{y} \to 0$ (or $\hat{y} \to 1$ for a class-0 example).
Why is cross-entropy preferred over squared error when training a sigmoid classifier?
(1) It is what maximum likelihood recommends under a Bernoulli model for the labels, so it has a principled probabilistic interpretation. (2) Its gradient with respect to the pre-activation $z$ is simply $\hat{y} - y$, which cancels the sigmoid’s saturating derivative; squared-error gradients include an extra $\sigma'(z) = \hat{y}(1 - \hat{y})$ factor that vanishes in the saturated tails, making learning slow when the model is confidently wrong.
In the max likelihood derivation of cross-entropy, which step converts a product of probabilities into the summation we see in the loss, and why doesn't this change the optimum?
Taking the logarithm. $\log \prod_i P(y_i \mid x_i) = \sum_i \log P(y_i \mid x_i)$, and because log is monotonically increasing, the $\arg\max$ of the log equals the $\arg\max$ of the original — so the optimiser moves to the same parameter values. Log is also typical because it tames numerical underflow (products of many small probabilities go to 0) and gives gradients a clean additive form.
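A quick demonstration of the underflow point (the probability value and the sample count are arbitrary):

```python
import numpy as np

p = np.full(10_000, 0.9)   # 10,000 moderately confident per-sample probabilities

print(np.prod(p))          # 0.0 -- the raw product underflows double precision
print(np.log(p).sum())     # ≈ -1053.6 -- the log-likelihood stays representable
```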