A smooth, squashed version of the sign function. Crucially, it is differentiable — which is what makes gradient-based learning for classification actually work.
Definition

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Some facts:
- Range: $\sigma(x) \in (0, 1)$ for all real $x$. Output never exactly reaches 0 or 1.
- Symmetry: $\sigma(-x) = 1 - \sigma(x)$. Centred at the origin: $\sigma(0) = \tfrac{1}{2}$.
- Saturation: As $x \to +\infty$, $\sigma(x) \to 1$. As $x \to -\infty$, $\sigma(x) \to 0$. Plateaus where the gradient is nearly zero.
- Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. Always positive, maximal at $x = 0$ (value $\tfrac{1}{4}$), decays toward 0 in the tails.
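A quick numerical check of these facts, as a minimal NumPy sketch (the function names are mine, not from the notes):

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x))."""
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                # stays strictly inside (0, 1)
print(sigmoid(x) + sigmoid(-x))  # symmetry: every entry equals 1
print(sigmoid_deriv(x))          # peaks at 0.25 for x = 0, ~0 in the tails
```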
Why it exists: the sign function is broken for gradient descent
The vanilla perceptron classifier uses $\hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$. The sign function’s derivative is zero everywhere (and undefined at zero), so when you plug the perceptron into a squared loss $L = (\hat{y} - y)^2$ and apply the chain rule, you get

$$\frac{\partial L}{\partial w_j} = 2(\hat{y} - y) \cdot \operatorname{sign}'(\mathbf{w}^\top \mathbf{x} + b) \cdot x_j = 0.$$

The gradient is identically zero, so gradient descent’s update never changes the parameters. Learning is impossible.
Swapping sign for sigmoid fixes this: sigmoid has a well-defined, mostly-nonzero derivative, so $\nabla L$ is meaningful and gradient descent can make progress.
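A small numeric illustration of the same point, assuming the squared loss above and a single training example (all parameter values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0])            # one training example
y = 1.0                             # its label
w, b = np.array([0.1, -0.3]), 0.05
z = w @ x + b

# sign: derivative is 0 almost everywhere, so the chain rule
# kills the entire gradient.
grad_sign = 2 * (np.sign(z) - y) * 0.0 * x
# sigmoid: sigma'(z) = sigma(z)(1 - sigma(z)) > 0, so a learning
# signal survives the chain rule.
y_hat = sigmoid(z)
grad_sigmoid = 2 * (y_hat - y) * y_hat * (1 - y_hat) * x

print(grad_sign)     # [0. 0.] -- parameters never move
print(grad_sigmoid)  # nonzero -- gradient descent can make progress
```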
The sigmoid perceptron (aka soft perceptron, aka logistic regression)
Replace the sign activation with sigmoid:

$$\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$
This is the same model you may have met in a stats course as logistic regression. Don’t let the name mislead you — logistic regression is a classification algorithm, not a regression algorithm.
| Perceptron variant | Activation | Output |
|---|---|---|
| Hard perceptron | sign | $\{-1, +1\}$ (hard decision) |
| Soft perceptron | sigmoid | value in $(0, 1)$ (probability) |
“Hard” and “soft” refer to the abruptness of the activation — sign flips instantaneously, sigmoid transitions smoothly.
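Both variants share the same pre-activation; only the squashing differs. A hedged sketch (the parameter values are made up):

```python
import numpy as np

def hard_perceptron(x, w, b):
    """Hard decision in {-1, +1}: abrupt flip at the boundary."""
    return np.sign(w @ x + b)

def soft_perceptron(x, w, b):
    """Smooth output in (0, 1): a probability, not just a side."""
    return 1 / (1 + np.exp(-(w @ x + b)))

w, b = np.array([0.5, -0.2]), 0.1
x = np.array([1.0, 1.0])
print(hard_perceptron(x, w, b))  # 1.0
print(soft_perceptron(x, w, b))  # ~0.599: same side, plus confidence
```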
Probabilistic interpretation
Because the output sits in $(0, 1)$, we can read it as a probability. For a binary classification problem with class labels $y \in \{0, 1\}$:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \hat{y}$$

and the probability of the other class is $P(y = 0 \mid \mathbf{x}) = 1 - \hat{y}$. Together:

$$P(y \mid \mathbf{x}) = \hat{y}^{\,y}\,(1 - \hat{y})^{1 - y}$$
This probabilistic view is what lets maximum likelihood estimation derive binary-cross-entropy as the natural loss function.
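Taking the negative log of that likelihood gives binary cross-entropy. A minimal sketch (the `eps` clipping is my own numerical guard, not part of the derivation):

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """-log P(y | x) = -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(bce(1, 0.9))  # ~0.105: confident and right, small loss
print(bce(1, 0.1))  # ~2.303: confident and wrong, large loss
```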
ASIDE — Why labels flipped from $\{-1, +1\}$ to $\{0, 1\}$
With the sign activation, outputs are $\{-1, +1\}$, so labelling the two classes $-1$ and $+1$ lines up with the model’s output. With the sigmoid activation, outputs live in $(0, 1)$, so labelling the classes $0$ and $1$ makes the math (and the probability interpretation) work out cleanly. The choice of label encoding is driven by mathematical convenience, not anything fundamental — it’s still a binary problem either way.
The decision rule
To turn a sigmoid probability into a hard classification, threshold at 0.5:

$$\text{class} = \begin{cases} 1 & \text{if } \sigma(\mathbf{w}^\top \mathbf{x} + b) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Since $\sigma(z) \geq 0.5$ iff $z \geq 0$, this is equivalent to thresholding the pre-activation at zero — so the decision boundary is still the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$, just as it was for the hard perceptron. The difference is that sigmoid additionally reports how confident it is.
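The two thresholds really are the same rule; here is a quick sketch checking it on random inputs (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([1.5, -2.0]), 0.3
X = rng.normal(size=(5, 2))       # five random inputs
z = X @ w + b                     # pre-activations

prob_rule = (1 / (1 + np.exp(-z)) >= 0.5).astype(int)  # threshold sigma(z) at 0.5
sign_rule = (z >= 0).astype(int)                        # threshold z at 0
print(np.array_equal(prob_rule, sign_rule))             # True: identical rule
```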
Weight magnitude controls confidence, not the boundary
A subtle but important point: scaling $\mathbf{w}$ and $b$ by the same factor leaves the decision boundary unchanged but changes how confident the sigmoid is near that boundary. Concretely, take two models:
- Model A: weights $\mathbf{w}$, bias $b$. Decision boundary: $\mathbf{w}^\top \mathbf{x} + b = 0$.
- Model B: weights $2\mathbf{w}$, bias $2b$. Decision boundary: $2(\mathbf{w}^\top \mathbf{x} + b) = 0$, the same set of points.
Same boundary. But evaluate both models on three penguins (features: flipper length in cm, body mass in kg). Writing $z_A = \mathbf{w}^\top \mathbf{x} + b$ for Model A’s pre-activation, Model B’s is $z_B = 2 z_A$. With illustrative pre-activations:

| Penguin | $z_A$ | $\sigma(z_A)$ (Model A) | $z_B = 2 z_A$ | $\sigma(z_B)$ (Model B) |
|---|---|---|---|---|
| 1 | $0.5$ | $0.62$ | $1.0$ | $0.73$ |
| 2 | $-0.5$ | $0.38$ | $-1.0$ | $0.27$ |
| 3 | $2.0$ | $0.88$ | $4.0$ | $0.98$ |
Same classifications (above-boundary → class 1, below → class 0), but Model B is much more confident: penguins near the boundary get probabilities pushed closer to 0 and 1.
The take-away:
- Direction of $\mathbf{w}$ — orientation of the decision boundary.
- $b$ relative to $\|\mathbf{w}\|$ — position of the boundary (specifically, offset $-b / \|\mathbf{w}\|$ along $\mathbf{w} / \|\mathbf{w}\|$ from the origin).
- Magnitude $\|\mathbf{w}\|$ — steepness of the sigmoid transition. Larger $\|\mathbf{w}\|$ → sharper transition, more confident predictions on either side. The boundary doesn’t move; the model just becomes more decisive about it.
Geometrically: $z = \mathbf{w}^\top \mathbf{x} + b$ measures distance from the boundary in units of $1 / \|\mathbf{w}\|$ (the signed Euclidean distance is $z / \|\mathbf{w}\|$). Doubling $\|\mathbf{w}\|$ doubles those distances, so the same input now lands twice as far from the boundary in the sigmoid’s eye, and gets squashed harder toward 0 or 1.
This also gives a different angle on weight decay: penalising $\|\mathbf{w}\|$ doesn’t directly stop the model from finding the right boundary — it just stops the model from being over-confident about it. Smoother transitions mean more cautious predictions, which usually generalise better.
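A tiny sketch of the scaling effect, using made-up parameters: the classification never changes, but the probability grows more extreme as $\|\mathbf{w}\|$ grows.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b = np.array([1.0, -0.5]), 0.2   # illustrative parameters
x = np.array([0.3, 0.4])            # a point near the boundary
z = w @ x + b                       # z = 0.3

for scale in (1, 2, 10):            # model (scale*w, scale*b)
    print(scale, round(sigmoid(scale * z), 3))
# 1 0.574   -- mildly confident
# 2 0.646   -- same side, more confident
# 10 0.953  -- same side, very confident
```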
Limitations
- Saturation kills gradients. For $|x| \gtrsim 4$, the curve is nearly flat, so $\sigma'(x) \approx 0$. Parameters whose pre-activation lands in the saturated region stop learning. This is the vanishing gradient problem — severe in deep networks; the sketch after this list shows how fast $\sigma'$ collapses. Modern networks often use ReLU or variants to avoid it.
- Still only linear boundaries. The sigmoid perceptron is a linear classifier — a single straight line (or hyperplane) in input space. It cannot solve XOR or other non-linearly-separable problems. The fix is combining multiple neurons into a multi-layer network (week 3).
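The sketch: evaluating $\sigma'$ a few steps into the tail shows the gradient signal collapsing.

```python
import numpy as np

def sigmoid_deriv(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

for x in (0.0, 2.0, 4.0, 10.0):
    print(x, sigmoid_deriv(x))
# 0.0  0.25       -- healthy gradient
# 2.0  0.105
# 4.0  0.0177
# 10.0 4.54e-05   -- effectively zero: learning stalls
```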
Related
- perceptron — the model in which sigmoid replaces sign
- gradient descent — the algorithm that needs a differentiable activation
- binary-cross-entropy — the loss derived from the probabilistic interpretation of sigmoid
- maximum likelihood estimation — the principle that connects sigmoid to cross-entropy
Active Recall
Write down the sigmoid function and its derivative, and state the derivative's maximum value.
$\sigma(x) = \frac{1}{1 + e^{-x}}$, with derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The derivative is maximised at $x = 0$, where $\sigma(0) = \tfrac{1}{2}$, giving $\sigma'(0) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$.
Why can't gradient descent learn a perceptron with the sign activation and squared loss, and how does sigmoid fix it?
The sign function’s derivative is 0 everywhere (and undefined at 0), so by the chain rule $\nabla L$ is identically zero and the parameters never update. Sigmoid’s derivative is well-defined and positive everywhere, so $\nabla L$ is a meaningful vector that actually moves the parameters downhill.
A sigmoid perceptron outputs, say, $\hat{y} = 0.8$ for input $\mathbf{x}$. Interpret this value and give the hard-threshold classification.
The output is interpretable as $P(y = 1 \mid \mathbf{x})$ — the model’s estimated probability that $\mathbf{x}$ belongs to class 1. With threshold 0.5, the hard classification is class 1, with relatively high confidence.
What is the vanishing gradient problem and when does it bite for sigmoid?
Sigmoid saturates in the tails: for $|x|$ large, $\sigma'(x) \approx 0$. Any parameter whose pre-activation is in the saturated region receives near-zero gradient and updates very slowly — it effectively stops learning. This is especially damaging in deep networks, where gradients multiply layer-by-layer and shrink exponentially toward the input.
Why do we switch the class label encoding from $\{-1, +1\}$ (hard perceptron) to $\{0, 1\}$ (sigmoid perceptron)?
It’s a mathematical convenience. Sigmoid outputs lie in $(0, 1)$, so encoding classes as 0 and 1 aligns labels with outputs and lets the output be read as a probability $P(y = 1 \mid \mathbf{x})$. With sign outputs of $\{-1, +1\}$, the $\{-1, +1\}$ label encoding is the natural match. Nothing changes about the problem — it’s still binary classification.
Two sigmoid classifiers have parameters $(\mathbf{w}, b)$ and $(2\mathbf{w}, 2b)$. Do they classify points differently? What does change?
They classify points identically — both have the same decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$, because scaling $\mathbf{w}$ and $b$ by the same factor leaves the equation unchanged. What changes is confidence: the second model has $\|\mathbf{w}\|$ twice as large, so its sigmoid transition is twice as steep. Points near the boundary get pushed harder toward 0 or 1, producing more extreme (more confident) probabilities. The direction and position of the boundary are set by $\mathbf{w}$ and $b$; the magnitude $\|\mathbf{w}\|$ controls only the sharpness.