The negative log-likelihood for a probabilistic classifier: it measures the dissimilarity between the predicted class distribution and the true labels. For binary logistic regression, it is strictly convex in the weights $w$.
Definition
For binary classification with predicted probability $\hat{y}_i \in (0, 1)$ and ground-truth label $y_i \in \{0, 1\}$, the cross-entropy loss over a training set of size $N$ is:

$$L(w) = -\sum_{i=1}^{N} \Big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Big]$$
The two terms are a switch driven by the binary label:
- If $y_i = 1$, only the first term survives: $-\log \hat{y}_i$.
- If $y_i = 0$, only the second: $-\log(1 - \hat{y}_i)$.
In both cases, you’re penalised by the negative log of the probability the model assigns to the correct class.
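To make the switch concrete, here is a minimal NumPy sketch of the summed loss; the function and variable names (`binary_cross_entropy`, `y_true`, `y_prob`) are illustrative rather than any particular library's API.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob):
    """Summed loss: -sum_i [y_i*log(p_i) + (1 - y_i)*log(1 - p_i)]."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# The label acts as a switch: only one of the two terms is nonzero per example.
print(binary_cross_entropy([1], [0.9]))  # -log(0.9) ≈ 0.105
print(binary_cross_entropy([0], [0.9]))  # -log(0.1) ≈ 2.303
```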

Behaviour
The shape of $-\log p$ for $p \in (0, 1]$ does most of the work:
| Predicted probability of correct class | Loss contribution $-\log p$ |
|---|---|
| $1.0$ | $0$ |
| $0.9$ | $\approx 0.105$ |
| $0.5$ | $\approx 0.693$ |
| $0.1$ | $\approx 2.303$ |
| $0.01$ | $\approx 4.605$ |
| $\to 0$ | $\to \infty$ |
The loss explodes as the model confidently predicts the wrong class — confident wrongness is punished disproportionately, which is exactly what a calibrated probabilistic classifier should optimise.
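A quick check of this behaviour (a sketch; natural logarithm, the usual convention):

```python
import numpy as np

# Penalty -log(p) for the probability p assigned to the correct class.
for p in [0.9, 0.5, 0.1, 0.01, 1e-6]:
    print(f"p = {p:<8} loss = {-np.log(p):.3f}")
# The penalty grows without bound as p -> 0: confident wrongness dominates the total loss.
```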
Connection to Maximum Likelihood
Cross-entropy isn’t an arbitrary choice. It is exactly the negative log-likelihood under MLE. Starting from the likelihood and applying $-\log$:

$$-\log \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{\,1 - y_i} = -\sum_{i=1}^{N} \Big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Big] = L(w)$$
Maximising likelihood ↔ minimising cross-entropy. Same optimum, different signs.
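A small numerical check of that equivalence, with made-up labels and probabilities: the Bernoulli likelihood and $\exp(-L)$ should coincide.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0], dtype=float)
p = np.array([0.8, 0.3, 0.6, 0.9, 0.2])  # predicted P(y = 1) for each example

likelihood = np.prod(p**y * (1 - p)**(1 - y))           # product of Bernoulli terms
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # cross-entropy loss

print(np.isclose(likelihood, np.exp(-nll)))  # True: same optimum, different signs
```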

Why “Cross-Entropy”?
In information theory, the cross-entropy between two distributions $p$ (true) and $q$ (predicted) over a discrete outcome $x$ is:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
For each training example, the true distribution $p$ is a one-hot distribution (all probability mass on the actual label) and $q$ is the model’s predicted distribution $(1 - \hat{y}_i,\ \hat{y}_i)$ over the two classes. The per-example cross-entropy then reduces to $-\log$ of the probability assigned to the correct class, and the loss is the empirical sum of these terms across examples. So the loss measures the total dissimilarity between the model’s predicted distributions and the (one-hot) ground-truth distributions.
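To see the reduction concretely, a sketch of $H(p, q)$ with a one-hot true distribution; it collapses to $-\log$ of the probability the model assigns to the actual label.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), skipping outcomes with p(x) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

q = np.array([0.2, 0.8])         # model's predicted distribution over {0, 1}
p_onehot = np.array([0.0, 1.0])  # one-hot: the actual label is 1

print(cross_entropy(p_onehot, q))  # -log(0.8) ≈ 0.223
print(-np.log(q[1]))               # identical: the one-hot mass picks out a single term
```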
Convexity
The cross-entropy loss for logistic-regression is convex in the weights $w$ (and strictly convex whenever the design matrix has full column rank). This is the property that makes optimisation tractable: there is a single global minimum, and any reasonable iterative method (gradient descent, newton-raphson-method) will find it.
This convexity does not generalise to deeper models. A neural network with cross-entropy loss is non-convex in its parameters because the network’s output is a non-convex function of its weights — the loss itself is still convex in the output, but composing with the network breaks convexity.
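One way to check the logistic-regression claim numerically: the Hessian of the loss in $w$ is $X^\top \mathrm{diag}\big(\hat{y}_i(1 - \hat{y}_i)\big)\, X$, which is positive semi-definite. A sketch with random data (the names `X`, `w` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # design matrix
w = rng.normal(size=3)         # arbitrary weight vector

p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid(Xw)
H = X.T @ (X * (p * (1 - p))[:, None])  # X^T diag(p(1-p)) X

# All eigenvalues non-negative (up to numerical noise) => the loss is convex in w.
print(np.all(np.linalg.eigvalsh(H) >= -1e-10))  # True
```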
Gradient
The gradient of the cross-entropy loss with respect to the weights $w$, when $\hat{y}_i = \sigma(w^\top x_i)$, is remarkably clean:

$$\nabla_w L = \sum_{i=1}^{N} (\hat{y}_i - y_i)\, x_i$$
That is, the prediction error multiplied by the input vector, summed over the training set. The cleanness comes from a cancellation between the derivative of the $\log$ terms and the derivative of the sigmoid: the chain rule produces a factor of $\sigma'(z) = \hat{y}_i(1 - \hat{y}_i)$, which exactly cancels the $\hat{y}_i(1 - \hat{y}_i)$ in the denominator of $\partial L / \partial \hat{y}_i$, leaving the residual $\hat{y}_i - y_i$.
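A sketch verifying the closed form against a central finite-difference estimate, on random data (illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=4)

def loss(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 1.0 / (1.0 + np.exp(-X @ w))
analytic = X.T @ (p - y)  # sum_i (p_i - y_i) x_i

eps = 1e-6
numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```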
Related
- maximum likelihood estimation — cross-entropy is the negative log-likelihood
- logistic-regression — cross-entropy’s primary user in this module
- gradient descent — the natural minimiser
- convex-function — the property that makes cross-entropy easy to optimise
Active Recall
For a single training example with $y = 1$ and predicted $\hat{y} = 0.5$, what does the example contribute to the loss? What about with $y = 0$?
Both contribute $-\log 0.5 = \log 2 \approx 0.693$, the loss incurred when the model is at chance level. This is the “natural” reference point: a model that always predicts $\hat{y} = 0.5$ gets a per-example loss of $\log 2 \approx 0.693$.
Why does the loss go to infinity when the model predicts $\hat{y} \to 0$ for an example with $y = 1$?
When $y = 1$, the loss term is $-\log \hat{y}$. As $\hat{y} \to 0$, $-\log \hat{y} \to \infty$. Information-theoretically: the model claimed “this outcome is impossible,” but it happened. Any finite penalty would be an understatement of how badly that prediction misrepresented reality.
Show that the gradient of the per-example loss with respect to $w$ simplifies to $(\hat{y} - y)\,x$, given $\hat{y} = \sigma(w^\top x)$.
Let $z = w^\top x$ so $\hat{y} = \sigma(z)$. By the chain rule, $\frac{\partial \ell}{\partial w} = \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$. The derivative of $\ell = -\big[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big]$ with respect to $\hat{y}$ is $\frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$. Multiplying by $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$ and $\frac{\partial z}{\partial w} = x$, the $\hat{y}(1 - \hat{y})$ factors cancel, leaving $(\hat{y} - y)\,x$.