The classification counterpart to squared error. Where squared error is what max likelihood prescribes under Gaussian noise, cross-entropy is what max likelihood prescribes for classes drawn from a Bernoulli with $p = \hat y$.

Definition

For a binary classification dataset with labels $y_i \in \{0, 1\}$ and sigmoid-perceptron predictions $\hat y_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$:

$$L = -\sum_{i=1}^{N} \left[ y_i \log \hat y_i + (1 - y_i) \log(1 - \hat y_i) \right]$$

The minus sign in front turns a log-likelihood maximisation into a loss minimisation.
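
A minimal NumPy sketch of this formula (the `eps` clipping is a numerical safeguard added here to avoid $\log 0$, not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy, summed over samples.

    y     : array of 0/1 labels
    y_hat : array of sigmoid outputs in (0, 1)
    eps   : clip to avoid log(0) -- numerical safeguard, not part of the maths
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Illustrative values: two confident-correct predictions, one uncertain one
y = np.array([1, 0, 1])
y_hat = np.array([0.95, 0.05, 0.6])
print(binary_cross_entropy(y, y_hat))  # ~0.61
```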

What each term does

For each training sample, exactly one of the two terms is active (because $y_i \in \{0, 1\}$):

| Label | Active term | Penalises |
| --- | --- | --- |
| $y_i = 1$ | $-\log \hat y_i$ | $\hat y_i$ close to 0 (big loss if the model is confidently wrong) |
| $y_i = 0$ | $-\log(1 - \hat y_i)$ | $\hat y_i$ close to 1 (big loss if the model is confidently wrong) |
  • Correct, confident predictions ($\hat y_i \to 1$ when $y_i = 1$, or $\hat y_i \to 0$ when $y_i = 0$) give $\approx 0$ loss.
  • Wrong, confident predictions drive $\hat y_i \to 0$ when the true label is 1 (or vice versa), and $-\log \hat y_i \to \infty$. The loss is unbounded above: wrongness is punished severely.
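
To make this concrete, here is what the active term $-\log \hat y_i$ costs for a true label of 1 across a range of predictions (values chosen for illustration):

```python
import math

# Per-sample loss -log(y_hat) when the true label is 1:
for y_hat in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"y_hat = {y_hat:4}: loss = {-math.log(y_hat):.2f}")
# y_hat = 0.99: loss = 0.01   (confident and correct: near-zero loss)
# y_hat = 0.01: loss = 4.61   (confident and wrong: loss blows up)
```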

Derivation from max likelihood

Assume each label is drawn from a Bernoulli distribution parameterised by the sigmoid output:

$$P(y_i = 1 \mid \mathbf{x}_i) = \hat y_i, \qquad P(y_i = 0 \mid \mathbf{x}_i) = 1 - \hat y_i$$

Using the $y_i \in \{0, 1\}$ encoding, this collapses into one expression valid for both labels:

$$P(y_i \mid \mathbf{x}_i) = \hat y_i^{\,y_i} (1 - \hat y_i)^{\,1 - y_i}$$

(Plug in $y_i = 1$: get $\hat y_i$. Plug in $y_i = 0$: get $1 - \hat y_i$. ✓)

Now apply the standard maximum likelihood estimation recipe:

  1. Assume conditional independence: $P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i)$.
  2. Take the log to turn the product into a sum.
  3. Flip the sign to turn a maximisation into a minimisation.

The result is exactly binary cross-entropy:

$$-\log P(\mathbf{y} \mid \mathbf{X}) = -\sum_{i=1}^{N} \left[ y_i \log \hat y_i + (1 - y_i) \log(1 - \hat y_i) \right] = L$$

This mirrors the derivation of squared error under Gaussian noise — same pattern, different probability model.
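
A quick numerical check of the equivalence, on assumed toy numbers: the negative log of the Bernoulli likelihood product matches the cross-entropy sum exactly.

```python
import numpy as np

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.8, 0.3, 0.9, 0.4])

# Likelihood of the dataset under the Bernoulli model (conditional independence)
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# Negative log-likelihood vs binary cross-entropy: identical
nll = -np.log(likelihood)
bce = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(nll, bce)  # both ~1.60
```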

Why cross-entropy and not squared error for classification

Squared error would run: sigmoid is differentiable, so the gradient $\sigma'(z)$ is never exactly zero. But it's a worse fit for two reasons:

  1. Probabilistic meaning. Cross-entropy is what max likelihood recommends when your outputs are class probabilities. Squared error assumes Gaussian noise on a continuous target, which doesn’t match the Bernoulli structure of binary labels.
  2. Gradient magnitude. Squared error combined with sigmoid leads to gradients of the form $(\hat y_i - y_i)\,\sigma'(z_i)\,\mathbf{x}_i$, which vanish whenever the sigmoid saturates, even when the error $(\hat y_i - y_i)$ is large. Cross-entropy cancels this sigmoid derivative cleanly, so the parameter gradient becomes simply $(\hat y_i - y_i)\,\mathbf{x}_i$: proportional to the error, no matter where you are on the sigmoid curve. Learning is faster and more stable (see the sketch after this list).

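A small sketch of the second point, at an illustrative saturated input $z = -8$ (the numbers are assumptions for demonstration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One badly-misclassified sample: true label 1, but z is very negative
y, z, x = 1.0, -8.0, 1.0
y_hat = sigmoid(z)                                      # ~0.00034: confidently wrong

# Squared error keeps the sigmoid derivative factor...
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat) * x    # ~ -6.7e-4 (vanishing)
# ...while cross-entropy cancels it:
grad_bce = (y_hat - y) * x                              # ~ -1.0 (full error signal)
print(grad_mse, grad_bce)
```
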
Connection to gradient descent

Training a sigmoid classifier is just gradient descent on binary cross-entropy:

  1. Forward: $\hat y_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$.
  2. Loss: $L = -\sum_i \left[ y_i \log \hat y_i + (1 - y_i) \log(1 - \hat y_i) \right]$.
  3. Gradient (after chain rule): $\nabla_{\mathbf{w}} L = \sum_i (\hat y_i - y_i)\,\mathbf{x}_i$, $\ \partial L / \partial b = \sum_i (\hat y_i - y_i)$.
  4. Update: $\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla_{\mathbf{w}} L$, $\ b \leftarrow b - \eta\,\partial L / \partial b$.

You’ll implement exactly this in the week 2 Python exercise.
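
For reference, a minimal NumPy sketch of that loop on toy 1-D data (not the week 2 exercise itself; the dataset, learning rate, and step count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data: class 1 tends to have larger x
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(X.shape[1])
b = 0.0
eta = 0.1                                # learning rate (arbitrary choice here)

for step in range(1000):
    y_hat = sigmoid(X @ w + b)           # 1. forward
    grad_w = X.T @ (y_hat - y)           # 3. gradient w.r.t. w
    grad_b = np.sum(y_hat - y)           #    gradient w.r.t. b
    w -= eta * grad_w                    # 4. update
    b -= eta * grad_b

print(sigmoid(X @ w + b).round(2))       # probabilities near 0,0,0,1,1,1
```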

Multiclass extension

For more than two classes, the natural output is a probability distribution over $K$ classes produced by softmax, and the labels are represented as one-hot vectors of length $K$. The loss becomes categorical cross-entropy:

$$L = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log \hat y_{i,k}$$

where $y_{i,k}$ is 1 if sample $i$ belongs to class $k$ (else 0), and $\hat y_{i,k}$ is the softmax output for that class.

Because only one $y_{i,k}$ is non-zero per sample, the inner sum collapses to a single term, $-\log \hat y_{i,c}$ for the correct class $c$, so the per-sample loss is just $-\log$ of the probability the model assigned to the correct class. Binary cross-entropy is the special case for $K = 2$ (with the two classes implicit in $\hat y_i$ and $1 - \hat y_i$). Same recipe, same MLE derivation, different number of classes.
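
A sketch of the collapsed per-sample computation for $K = 3$ (the logits and the max-shift inside softmax are illustrative assumptions, the shift being a standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])       # raw scores for K = 3 classes
y = np.array([1, 0, 0])                  # one-hot: true class is 0

y_hat = softmax(logits)                  # e.g. [0.659, 0.242, 0.099]
loss = -np.sum(y * np.log(y_hat))        # inner sum collapses to -log(y_hat[0])
print(loss, -np.log(y_hat[0]))           # identical: ~0.42
```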

Active Recall