The classification counterpart to squared error. Where squared error is what max likelihood prescribes under Gaussian noise, cross-entropy is what max likelihood prescribes for classes drawn from a Bernoulli with $p = \hat{y}$, the model's predicted probability.
Definition
For a binary classification dataset with labels $y_i \in \{0, 1\}$ and sigmoid-perceptron predictions $\hat{y}_i = \sigma(w^\top x_i + b)$:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Bigr]$$
The minus sign in front turns a log-likelihood maximisation into a loss minimisation.
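A minimal NumPy sketch of this definition (the function name and the `eps` clipping that keeps the logs finite are my additions, not part of the note):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch of predictions."""
    # Clip predictions away from exactly 0 or 1 so the logs stay finite.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.95, 0.05, 0.10])
print(binary_cross_entropy(y_true, y_pred))  # ≈ 0.80, dominated by the last (wrong) sample
```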
What each term does
For each training sample, exactly one of the two terms is active (because $y_i \in \{0, 1\}$):
| Label | Active term | Penalises |
|---|---|---|
| $y_i = 1$ | $-\log \hat{y}_i$ | $\hat{y}_i$ close to 0 (big loss if the model is confidently wrong) |
| $y_i = 0$ | $-\log(1 - \hat{y}_i)$ | $\hat{y}_i$ close to 1 (big loss if the model is confidently wrong) |
- Correct, confident predictions ($\hat{y}_i \approx 1$ when $y_i = 1$, or $\hat{y}_i \approx 0$ when $y_i = 0$) give $\approx 0$ loss.
- Wrong, confident predictions drive $\hat{y}_i \to 0$ when the true label is 1 (or vice versa), and $-\log \hat{y}_i \to \infty$. The loss is unbounded above — wrongness is punished severely (numeric check below).
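As a quick numeric check of the claims above (the probability values are arbitrary; only the trend matters):

```python
import numpy as np

# Per-sample loss for a label-1 example is -log(y_hat).
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y=1, y_hat={y_hat:>4}: loss = {-np.log(y_hat):.3f}")
# 0.99 -> 0.010  (confident and right: near-zero loss)
# 0.01 -> 4.605  (confident and wrong: large loss, unbounded as y_hat -> 0)
```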
Derivation from max likelihood
Assume each label is drawn from a Bernoulli distribution parameterised by the sigmoid output:

$$P(y_i = 1 \mid x_i) = \hat{y}_i, \qquad P(y_i = 0 \mid x_i) = 1 - \hat{y}_i$$
Using the $y_i \in \{0, 1\}$ encoding, this collapses into one expression valid for both labels:

$$P(y_i \mid x_i) = \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{\,1 - y_i}$$

(Plug in $y_i = 1$: get $\hat{y}_i$. Plug in $y_i = 0$: get $1 - \hat{y}_i$. ✓)
Now apply the standard maximum likelihood estimation recipe:
- Assume conditional independence: $P(y_1, \dots, y_N \mid x_1, \dots, x_N) = \prod_{i=1}^{N} P(y_i \mid x_i)$.
- Take the log to turn the product into a sum.
- Flip the sign to turn a maximisation into a minimisation.
The result is exactly binary cross-entropy (the $\tfrac{1}{N}$ averaging in the definition above is an optional rescaling that doesn't move the minimum):

$$L = -\sum_{i=1}^{N}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Bigr]$$
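Spelled out, with $\mathcal{L}$ denoting the likelihood (the symbol choice is mine) and $L$ the loss:

$$
\begin{aligned}
\mathcal{L}(w, b) &= \prod_{i=1}^{N} \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{\,1 - y_i}
  && \text{conditional independence} \\
\log \mathcal{L}(w, b) &= \sum_{i=1}^{N} \bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]
  && \text{log turns the product into a sum} \\
L(w, b) &= -\log \mathcal{L}(w, b)
  && \text{flip the sign: maximise} \rightarrow \text{minimise}
\end{aligned}
$$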
This mirrors the derivation of squared error under Gaussian noise — same pattern, different probability model.
Why cross-entropy and not squared error for classification
Squared error would run — sigmoid is differentiable, so the gradient $\partial L / \partial w$ exists and is non-zero. But it’s a worse fit for two reasons:
- Probabilistic meaning. Cross-entropy is what max likelihood recommends when your outputs are class probabilities. Squared error assumes Gaussian noise on a continuous target, which doesn’t match the Bernoulli structure of binary labels.
- Gradient magnitude. Squared error combined with sigmoid leads to gradients of the form $(\hat{y} - y)\,\sigma'(z)\,x$, which vanishes whenever $\sigma'(z)$ saturates — even when the error $(\hat{y} - y)$ is large. Cross-entropy cancels this sigmoid derivative cleanly, so the parameter gradient becomes simply $(\hat{y} - y)\,x$ — proportional to the error, no matter where you are on the sigmoid curve. Learning is faster and more stable (see the comparison sketch after this list).
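A quick sketch of that gradient comparison, assuming the squared-error form $\tfrac{1}{2}(\hat{y} - y)^2$ so that its $\partial L/\partial z$ carries the $\sigma'(z) = \hat{y}(1 - \hat{y})$ factor:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single sample with true label y = 1 but a strongly negative pre-activation:
# the sigmoid is saturated near 0, i.e. the model is confidently wrong.
y, z = 1.0, -8.0
y_hat = sigmoid(z)                              # ≈ 0.000335

grad_ce  = y_hat - y                            # cross-entropy dL/dz ≈ -1.0 (strong signal)
grad_mse = (y_hat - y) * y_hat * (1 - y_hat)    # squared-error dL/dz ≈ -3.3e-4 (vanishes)
print(grad_ce, grad_mse)
```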
Connection to gradient descent
Training a sigmoid classifier is just gradient descent on binary cross-entropy:
- Forward: $z = w^\top x + b$, $\hat{y} = \sigma(z)$.
- Loss: $L = -\bigl[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\bigr]$.
- Gradient (after chain rule): $\partial L / \partial w = (\hat{y} - y)\,x$, $\partial L / \partial b = \hat{y} - y$.
- Update: $w \leftarrow w - \eta\,(\hat{y} - y)\,x$, $b \leftarrow b - \eta\,(\hat{y} - y)$.
You’ll implement exactly this in the week 2 Python exercise.
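A minimal sketch of that loop in NumPy (the toy data, learning rate, and step count are my own choices, so treat it as an illustration rather than the exercise itself):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))               # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy linearly separable labels

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(500):
    y_hat = sigmoid(X @ w + b)              # forward pass
    err = y_hat - y                         # dL/dz for cross-entropy
    w -= eta * X.T @ err / len(y)           # dL/dw averaged over the batch
    b -= eta * err.mean()                   # dL/db averaged over the batch

loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(f"loss ≈ {loss:.3f}, accuracy = {((y_hat > 0.5) == y).mean():.0%}")
```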
Multiclass extension
For more than two classes, the natural output is a probability distribution over $K$ classes produced by softmax, and the labels are represented as one-hot vectors of length $K$. The loss becomes categorical cross-entropy:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}$$

where $y_{ik}$ is 1 if sample $i$ belongs to class $k$ (else 0), and $\hat{y}_{ik}$ is the softmax output for that class.
Because only one $y_{ik}$ is non-zero per sample, the inner sum collapses to a single term — $-\log \hat{y}_{ik}$ for the true class $k$ — so the per-sample loss is just $-\log$ of the probability the model assigned to the correct class. Binary cross-entropy is the special case for $K = 2$ (with the two classes implicit in $\hat{y}_i$ and $1 - \hat{y}_i$). Same recipe, same MLE derivation, different number of classes.
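A small NumPy sketch of the multiclass case (softmax over raw logits; the function names and example logits are my own):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    # The inner sum picks out -log(probability of the true class) per sample.
    return -np.mean(np.sum(y_onehot * np.log(np.clip(y_prob, eps, 1.0)), axis=1))

logits   = np.array([[2.0, 0.5, -1.0],
                     [0.1, 0.2,  3.0]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]], dtype=float)
print(categorical_cross_entropy(y_onehot, softmax(logits)))  # ≈ 1.58, mostly from sample 2
```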
Related
- sigmoid function — produces the binary probabilities that cross-entropy consumes
- softmax — produces the multi-class probabilities for the categorical extension
- loss-function — the general concept; this is the classification specialisation
- maximum likelihood estimation — the derivation of cross-entropy from first principles
- gradient descent — what minimises it
Active Recall
Write the binary cross-entropy loss and explain which term is active for each label.
$L = -\frac{1}{N}\sum_i \bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr]$. When $y_i = 1$ the factor $(1 - y_i)$ zeroes the second term, leaving $-\log \hat{y}_i$. When $y_i = 0$ the factor $y_i$ zeroes the first term, leaving $-\log(1 - \hat{y}_i)$. Each training example contributes one of the two terms.
A sigmoid classifier outputs $\hat{y} = 0.01$ for a sample whose true label is $y = 1$. Compute the per-sample cross-entropy loss and explain why it is large.
$L = -\bigl[\, 1 \cdot \log 0.01 + 0 \cdot \log 0.99 \,\bigr] = -\log 0.01 = \log 100 \approx 4.6$. The loss is large because the model is confidently wrong — it said the probability of class 1 was only 1%, but the truth is class 1. Cross-entropy grows without bound as confidently-wrong predictions approach $\hat{y} \to 0$ (or $\hat{y} \to 1$ for a class-0 example).
Why is cross-entropy preferred over squared error when training a sigmoid classifier?
(1) It is what maximum likelihood recommends under a Bernoulli model for the labels, so it has a principled probabilistic interpretation. (2) Its gradient with respect to the pre-activation $z$ is simply $\hat{y} - y$, which cancels the sigmoid’s saturating derivative; squared-error gradients include an extra $\sigma'(z) = \hat{y}(1 - \hat{y})$ factor that vanishes in the saturated tails, making learning slow when the model is confidently wrong.
In the max likelihood derivation of cross-entropy, which step converts a product of probabilities into the summation we see in the loss, and why doesn't this change the optimum?
Taking the logarithm. $\log \prod_i P(y_i \mid x_i) = \sum_i \log P(y_i \mid x_i)$, and because log is monotonically increasing, the $\arg\max$ of the log equals the $\arg\max$ of the original — so the optimiser moves to the same parameter values. Log is also typical because it tames numerical underflow (products of many small probabilities go to 0) and gives gradients a clean additive form.
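A quick demonstration of the underflow point (the probability value and the sample count are arbitrary):

```python
import numpy as np

p = np.full(10_000, 0.9)   # 10,000 moderately confident per-sample probabilities

print(np.prod(p))          # 0.0 -- the raw product underflows double precision
print(np.log(p).sum())     # ≈ -1053.6 -- the log-likelihood stays representable
```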