Sigmoid turns one number into a probability in (0, 1). Softmax turns a vector of numbers into a probability distribution — one that’s positive and sums to 1. It’s what you use in the output layer when the task is multi-class classification.
Definition
For a vector of raw pre-activation scores $z = (z_1, \dots, z_K)$ from the output layer of an MLP, the softmax of the $i$-th component is:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Two properties fall out immediately:
- Positive: $e^{z_i} > 0$ for any real $z_i$, so every output is strictly positive.
- Sums to 1: the denominator is the sum of all numerators, so $\sum_{i=1}^{K} \text{softmax}(z)_i = 1$.

Together, these properties mean the output can be interpreted as a probability distribution over the $K$ classes: $\text{softmax}(z)_i \approx P(\text{class } i \mid \text{input})$.
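A minimal NumPy sketch of the definition (the helper name `softmax` is mine, not from any particular library). Subtracting the maximum score first is the standard trick to avoid overflow, and it doesn’t change the result because softmax is shift-invariant:

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D score vector z; returns a probability distribution."""
    # Subtracting max(z) keeps exp() from overflowing for large scores
    # without changing the result (softmax is shift-invariant).
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([-1.0, 0.5, 2.0, 0.0])
p = softmax(z)
print(p)        # every entry strictly positive
print(p.sum())  # 1.0 (up to floating-point rounding)
```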
Why we need it
The sigmoid function handles binary classification: one output neuron producing $\sigma(z) \in (0, 1)$, with the other class’s probability implicitly being $1 - \sigma(z)$. That doesn’t scale to three or more classes — you can’t cleanly encode “class 1 vs class 2 vs class 3” with a single number between 0 and 1.
For $K$-class classification, the output layer has $K$ neurons. Each neuron produces a raw score $z_i$, but those raw scores aren’t directly usable as class probabilities (they can be negative, they don’t sum to anything meaningful). Softmax takes the whole vector $z$ and rescales it into a proper probability distribution.
Example
A network trained on MNIST has an output layer of 10 neurons (one per digit, 0–9). On some input image, the raw outputs $z_0, \dots, z_9$ are ten unbounded real scores — they aren’t probabilities (some may be negative, and they don’t sum to anything meaningful). Softmax turns them into something like:

$$\hat{y} \approx (0.02,\ 0.02,\ 0.02,\ 0.02,\ 0.02,\ 0.02,\ 0.02,\ 0.80,\ 0.02,\ 0.02)$$

which reads as: “80% probability it’s a 7, 2% each for the others”. Now the network’s output is directly a probability distribution over digits. The predicted class is $\arg\max_i \hat{y}_i = 7$ — the index with the highest probability.
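As a concrete illustration of the example above (the raw scores here are made up, chosen only so that digit 7 comes out around 80%; the `softmax` helper is the sketch from the Definition section):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Hypothetical raw scores for one image, indices = digits 0-9.
# Not from a real trained network; chosen so digit 7 dominates.
z = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.7, 0.0, 0.0])

y_hat = softmax(z)
print(np.round(y_hat, 2))     # roughly 0.02 everywhere, ~0.82 at index 7
print(int(np.argmax(y_hat)))  # predicted class: 7
```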
One-hot encoding: the matching label format
If softmax produces a probability distribution over classes, the training labels should match that shape. The standard format is one-hot encoding.
For a sample whose true class is $c$, the label $y$ is a vector of length $K$ with:
- A $1$ in position $c$ (“hot”).
- A $0$ in every other position (“cool”).

Example for MNIST, label = 3:

$$y = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)$$

Formally:

$$y_i = \begin{cases} 1 & \text{if } i = c \\ 0 & \text{otherwise} \end{cases}$$

This one-hot vector pairs naturally with the softmax output vector of length $K$, and together they plug into the multi-class cross-entropy loss.
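A hand-rolled sketch of building such labels (most frameworks ship their own helper for this; the function name `one_hot` is mine):

```python
import numpy as np

def one_hot(label, num_classes):
    """Return a vector of length num_classes with a 1 at index `label`."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

print(one_hot(3, 10))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```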
ASIDE — Why one-hot and not just an integer label?
You could label a digit-3 sample as just the integer 3, but then the network (which has 10 output neurons) wouldn’t know how to match its 10-dimensional prediction against a scalar target. One-hot encoding maps the label into the same space as the prediction, so the loss function can compare component-by-component. It also generalises nicely to tasks where the label is genuinely multi-valued (e.g. “the image could be a 3 or a 5 with 50/50 probability”).
The softmax–sigmoid connection
Softmax is a strict generalisation of the sigmoid. In the binary case with $K = 2$ and scores $z_1, z_2$:

$$\text{softmax}(z)_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$$
So two-class softmax collapses to a sigmoid of the score difference. Conceptually: sigmoid is “softmax with only two classes, and the two scores reduce to one degree of freedom”.
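A quick numerical check of the collapse (helper names are mine; the two scores are arbitrary):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z1, z2 = 1.3, -0.4
two_class = softmax(np.array([z1, z2]))[0]  # P(class 1) from two-class softmax
via_sigmoid = sigmoid(z1 - z2)              # sigmoid of the score difference
print(two_class, via_sigmoid)               # the two values agree
```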
Training with softmax
Softmax is used as the output-layer activation; the loss that pairs naturally with it is categorical cross-entropy (the multi-class extension of binary-cross-entropy):

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{K} y_i^{(n)} \log \hat{y}_i^{(n)}$$
For each training example $n$, only one $y_i^{(n)}$ is 1 (and the rest are 0), so the inner sum collapses to a single term — $-\log \hat{y}_c^{(n)}$, where $c$ is the true class. The loss is low when the network assigns high probability to the correct class and grows unboundedly as the network becomes confidently wrong.
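In code the collapse is easy to see: for a one-hot label, the full sum over classes and a direct negative log of the correct class’s probability give the same number (a sketch, not any framework’s actual loss implementation):

```python
import numpy as np

y_hat = np.array([0.1, 0.2, 0.7])  # softmax output for one example
y = np.array([0.0, 0.0, 1.0])      # one-hot label: true class is index 2

full_sum = -np.sum(y * np.log(y_hat))  # sum over all classes
single_term = -np.log(y_hat[2])        # just the correct class's term
print(full_sum, single_term)           # identical: ~0.357
```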
(As with binary cross-entropy under sigmoid, the pairing is not arbitrary: the gradient of softmax + cross-entropy with respect to each raw score $z_i$ simplifies to $\hat{y}_i - y_i$, a clean prediction-minus-label form. This avoids the saturation problem that squared error would introduce.)
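The prediction-minus-label form can be verified numerically; below, the analytic $\hat{y} - y$ is compared against a central finite-difference estimate of the per-example loss gradient with respect to the raw scores (a sketch under the same notation, with an arbitrary score vector and label):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.0, 2.0])
y = np.array([0.0, 0.0, 1.0])

analytic = softmax(z) - y  # the claimed gradient dL/dz

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(analytic)
print(numeric)  # matches the analytic gradient to several decimal places
```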
Alternatives and gotchas
- Mean squared error can still be used with softmax for classification — it’s differentiable and will “work” — but cross-entropy is preferred because its gradient is better-behaved, particularly when the network is confidently wrong.
- Softmax saturates in the same way sigmoid does: if one $z_i$ is much larger than the others, its softmax output approaches 1 and the gradient approaches 0 for that neuron (a numerical illustration follows below). This is fine at the output layer (where you want confident predictions) but is not a reason to use softmax in hidden layers (you don’t want saturation partway through the network).
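A small numerical illustration of that saturation: as one score grows, the winning probability pins near 1 and the derivative $p(1 - p)$ of that softmax component with respect to its own score shrinks toward 0:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Make one score progressively larger than the rest and watch the winning
# probability saturate while its local gradient p*(1-p) vanishes.
for big in [1.0, 3.0, 6.0, 12.0]:
    p = softmax(np.array([big, 0.0, 0.0]))[0]
    print(f"z_1 = {big:5.1f}   p_1 = {p:.6f}   dp_1/dz_1 = {p * (1 - p):.6f}")
```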
Related
- sigmoid function — the binary-class special case
- binary-cross-entropy — the two-class loss; softmax + categorical cross-entropy is the multi-class analogue
- multi-layer-perceptron — softmax is used at the output layer of an MLP for multi-class tasks
- maximum likelihood estimation — the principle from which softmax + cross-entropy falls out (categorical distribution over classes)
Active Recall
Given raw output scores $z = (1, 2, 3)$ from a three-class network, compute the softmax probabilities.
$e^{1} \approx 2.72$, $e^{2} \approx 7.39$, $e^{3} \approx 20.09$. Sum $\approx 30.19$. Softmax $\approx (0.09, 0.24, 0.67)$. The model assigns roughly 67% probability to class 3, 24% to class 2, 9% to class 1. Sanity check: the three probabilities sum to 1. ✓
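The same computation in a few lines (using the scores $(1, 2, 3)$ from the question):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])
exp_z = np.exp(z)                        # [ 2.72  7.39 20.09]
print(np.round(exp_z / exp_z.sum(), 2))  # [0.09 0.24 0.67]
```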
Why do we use one-hot encoding for class labels instead of just writing the class number as an integer?
The network’s output layer for a $K$-class problem produces a $K$-dimensional vector (of softmax probabilities). For the loss function to compare prediction and label component-by-component, the label must live in the same space — a vector of length $K$. One-hot is the minimal encoding that puts the label in this space and also has a clean probabilistic interpretation: the one-hot vector is the “perfect probability distribution” that assigns 100% to the correct class.
What is the relationship between softmax and sigmoid?
Softmax is the multi-class generalisation of sigmoid. In the binary case, softmax applied to two scores $z_1, z_2$ reduces to $\sigma(z_1 - z_2)$ — a sigmoid of the score difference. Both functions are derived from maximum likelihood under categorical distributions (Bernoulli for sigmoid, multinoulli / categorical for softmax). Both produce outputs interpretable as probabilities. Sigmoid is just the $K = 2$ special case.
Can softmax outputs ever be exactly 0 or exactly 1?
No. The numerator $e^{z_i}$ is always strictly positive, and the denominator is a sum of strictly positive terms, so every softmax output is in the open interval $(0, 1)$ — it can get arbitrarily close to 0 or 1 but never reach either exactly. This has a practical consequence: the cross-entropy loss is always finite for a softmax output, so the loss is well-defined. If you ever see softmax output “exactly 1.0” in practice, it’s floating-point saturation.
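The last point is easy to reproduce. Mathematically the outputs are strictly inside $(0, 1)$, but in float64 the losing classes underflow and the winner prints as exactly 1.0 (a small demonstration with an exaggerated score gap):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

p = softmax(np.array([1000.0, 0.0, 0.0]))
print(p)            # [1. 0. 0.] -- the true values are merely very close to 1 and 0
print(p[0] == 1.0)  # True, purely a floating-point artifact
```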
Why not use softmax as the activation for hidden layers as well as the output layer?
Softmax couples the output values across a whole layer (they must sum to 1), which makes its derivative more complex and forces a competition between units. In a hidden layer, you typically want each unit to learn an independent feature — softmax’s cross-unit coupling is the wrong inductive bias, and its saturation at confident values would also cause vanishing gradients within the network. Hidden layers use per-unit activations like sigmoid, tanh, or ReLU; softmax is reserved for the output layer of multi-class classifiers.