Sigmoid turns one number into a probability in (0, 1). Softmax turns a vector of numbers into a probability distribution — one that’s positive and sums to 1. It’s what you use in the output layer when the task is multi-class classification.

Definition

For a vector of raw pre-activation scores $z = (z_1, \dots, z_K)$ from the output layer of an MLP, the softmax of the $i$-th component is:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$

Two properties fall out immediately:

  1. Positive: $e^{z_i} > 0$ for every real $z_i$, so every output is strictly positive.
  2. Sums to 1: the denominator is the sum of all numerators, so $\sum_{i=1}^{K} \text{softmax}(z)_i = 1$.

Together, these properties mean the output can be interpreted as a probability distribution over the $K$ classes: $\hat{y}_i = \text{softmax}(z)_i \approx P(\text{class } i \mid \text{input})$.
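
In code, the definition is a few lines of NumPy. A minimal sketch (the max-subtraction is a standard numerical-stability trick, not part of the definition — it cancels in the ratio):

```python
import numpy as np

def softmax(z):
    """Softmax of a score vector z, with the standard max-shift.

    Subtracting max(z) cancels in the ratio (result is unchanged) but
    stops np.exp from overflowing when scores are large.
    """
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, -1.0, 0.5])
p = softmax(z)
print(p)          # ~[0.786 0.039 0.175] -- strictly positive
print(p.sum())    # 1.0 -- sums to one
```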

Why we need it

The sigmoid function handles binary classification: one output neuron producing $\hat{y} = \sigma(z) \in (0, 1)$, with the other class’s probability implicitly being $1 - \hat{y}$. That doesn’t scale to three or more classes — you can’t cleanly encode “class 1 vs class 2 vs class 3” with a single number between 0 and 1.

For $K$-class classification, the output layer has $K$ neurons. Each neuron produces a raw score $z_i$, but those raw scores aren’t directly usable as class probabilities (they can be negative, and they don’t sum to anything meaningful). Softmax takes the whole vector $z$ and rescales it into a proper probability distribution.

Example

A network has an output layer of 10 neurons (one per digit, 0–9) trained on MNIST. On some input image, the raw outputs might be:

$$z = (0.0,\; 0.0,\; 0.0,\; 0.0,\; 0.0,\; 0.0,\; 0.0,\; 3.6,\; 0.0,\; 0.0)$$

These aren’t probabilities — they’re unbounded raw scores. Softmax turns them into something like:

$$\hat{y} \approx (0.02,\; 0.02,\; 0.02,\; 0.02,\; 0.02,\; 0.02,\; 0.02,\; 0.80,\; 0.02,\; 0.02)$$

which reads as: “80% probability it’s a 7, 2% each for the others”. Now the network’s output is directly a probability distribution over digits. The predicted class is $\arg\max_i \hat{y}_i = 7$ — the index with the highest probability.
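
A quick sketch reproducing this example numerically — the raw scores are illustrative values chosen to give roughly the 80%/2% split above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # same stable softmax as in the sketch above
    return e / e.sum()

z = np.zeros(10)
z[7] = 3.6                    # digit 7 gets the largest raw score
p = softmax(z)

print(np.round(p, 2))   # ~0.02 everywhere except ~0.80 at index 7
print(np.argmax(p))     # 7 -- the predicted class
```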

One-hot encoding: the matching label format

If softmax produces a probability distribution over classes, the training labels should match that shape. The standard format is one-hot encoding.

For a sample whose true class is $c$, the label is a vector $y$ of length $K$ with:

  • A $1$ in position $c$ (“hot”).
  • A $0$ in every other position (“cool”).

Example for MNIST, label = 3:

$$y = (0,\; 0,\; 0,\; 1,\; 0,\; 0,\; 0,\; 0,\; 0,\; 0)$$

Formally:

$$y_i = \begin{cases} 1 & \text{if } i = c \\ 0 & \text{otherwise} \end{cases}$$

This one-hot vector pairs naturally with the softmax output vector of length $K$, and together they plug into the multi-class cross-entropy loss.
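
A minimal sketch of one-hot encoding in NumPy (the helper name one_hot is mine, not a library function):

```python
import numpy as np

def one_hot(c, K):
    """Length-K vector with a 1 in position c and 0 everywhere else."""
    y = np.zeros(K)
    y[c] = 1.0
    return y

print(one_hot(3, 10))   # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]

# For a whole batch of integer labels, indexing an identity matrix
# one-hot encodes them all at once:
labels = np.array([3, 0, 7])
print(np.eye(10)[labels])
```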

ASIDE — Why one-hot and not just an integer label?

You could label a sample whose true class is 3 with just the integer “3”, but then the network (which has 10 output neurons) would have no way to match its 10-component prediction against a scalar target. One-hot encoding maps the label into the same space as the prediction, so the loss function can compare component-by-component. It also generalises nicely to tasks where the label is genuinely multi-valued (e.g. “the image could be a 3 or a 5 with 50/50 probability”).

The softmax–sigmoid connection

Softmax is a strict generalisation of the sigmoid. In the binary case with $K = 2$ and scores $(z_1, z_2)$:

$$\text{softmax}(z)_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$$

So two-class softmax collapses to a sigmoid of the score difference. Conceptually: sigmoid is “softmax with only two classes, and the two scores reduce to one degree of freedom”.
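
The identity is easy to check numerically — a sketch with arbitrary scores:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z1, z2 = 1.3, -0.4
print(softmax(np.array([z1, z2]))[0])   # 0.8455...
print(sigmoid(z1 - z2))                 # identical: sigmoid of the difference
```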

Training with softmax

Softmax is used as the output-layer activation; the loss that pairs naturally with it is categorical cross-entropy (the multi-class extension of binary cross-entropy):

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} y_i^{(n)} \log \hat{y}_i^{(n)}$$

For each training example $n$, only one $y_i^{(n)}$ is 1 (and the rest are 0), so the inner sum collapses to a single term — $-\log \hat{y}_c^{(n)}$, the negative log-probability the network assigns to the correct class $c$. The loss is low when the network assigns high probability to the correct class and grows unboundedly as the network becomes confidently wrong. In the MNIST example above, the per-example loss would be $-\log 0.80 \approx 0.22$; had the network put only 2% on the true digit, it would be $-\log 0.02 \approx 3.9$.

(As with binary cross-entropy under sigmoid, the pairing is not arbitrary: the gradient of softmax + cross-entropy simplifies to $\partial L / \partial z_i = \hat{y}_i - y_i$, a clean prediction-minus-label form. This avoids the saturation problem that squared error would introduce.)
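
A sketch verifying that prediction-minus-label gradient against central finite differences (the specific scores and label here are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

z = np.array([0.2, -1.0, 1.5])   # arbitrary raw scores
y = np.eye(3)[0]                 # one-hot label: true class is 0

analytic = softmax(z) - y        # the claimed prediction-minus-label gradient

# Central finite differences on the composed loss L(z) = CE(y, softmax(z)):
h = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (cross_entropy(y, softmax(zp)) - cross_entropy(y, softmax(zm))) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # ~1e-10: the two forms agree
```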

Alternatives and gotchas

  • Mean squared error can still be used with softmax for classification — it’s differentiable and will “work” — but cross-entropy is preferred because its gradient is better-behaved, particularly when the network is confidently wrong (see the sketch after this list).
  • Softmax saturates in the same way sigmoid does: if one score $z_i$ is much larger than the others, its softmax approaches 1 and the gradient approaches 0 for that neuron. This is fine at the output layer (where you want confident predictions) but is not a reason to use softmax in hidden layers (you don’t want saturation partway through the network).
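
To see both points concretely, a rough sketch comparing finite-difference gradients of cross-entropy and squared error when the network is confidently wrong — cross-entropy keeps a usable learning signal while the squared-error gradient all but vanishes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

def mse(y, y_hat):
    return np.sum((y_hat - y) ** 2)

def numeric_grad(loss, z, y, h=1e-5):
    """Central finite-difference gradient of loss(y, softmax(z)) w.r.t. z."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[i] += h
        zm[i] -= h
        g[i] = (loss(y, softmax(zp)) - loss(y, softmax(zm))) / (2 * h)
    return g

# Confidently wrong: a huge raw score for class 0, but the true class is 1.
z = np.array([10.0, 0.0, 0.0])
y = np.eye(3)[1]

print(numeric_grad(cross_entropy, z, y))  # ~[ 1.0, -1.0, 0.0 ]: strong signal
print(numeric_grad(mse, z, y))            # all entries ~1e-4: almost no signal
```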

Active Recall