A smooth, squashed version of the sign function. Crucially, it is differentiable — which is what makes gradient-based learning for classification actually work.

Definition

σ(z) = 1 / (1 + e^(−z))

Some facts:

  • Range: σ(z) ∈ (0, 1) for all real z. Output never exactly reaches 0 or 1.
  • Symmetry: σ(−z) = 1 − σ(z). Centred at the origin.
  • Saturation: As z → +∞, σ(z) → 1. As z → −∞, σ(z) → 0. Plateaus where the gradient is nearly zero.
  • Derivative: σ′(z) = σ(z)(1 − σ(z)). Always positive, maximal at z = 0 (value 1/4), decays toward 0 in the tails.
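These facts are easy to sanity-check numerically. A quick sketch (the function names are just illustrative):

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dsigmoid(z: float) -> float:
    """Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

# Range: strictly inside (0, 1), even deep in the tails.
assert 0.0 < sigmoid(-30.0) < sigmoid(30.0) < 1.0

# Symmetry: sigma(-z) == 1 - sigma(z).
assert abs(sigmoid(-2.5) - (1 - sigmoid(2.5))) < 1e-12

# Derivative is maximal at z = 0, where it equals 1/4...
assert abs(dsigmoid(0.0) - 0.25) < 1e-12
# ...and nearly vanishes in the saturated tails.
assert dsigmoid(10.0) < 1e-4
```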

Why it exists: the sign function is broken for gradient descent

The vanilla perceptron classifier uses ŷ = sign(w·x + b). The sign function’s derivative is zero everywhere (and undefined at zero), so when you plug the perceptron into a squared loss and apply the chain rule, you get

∂L/∂w = 2(ŷ − y) · sign′(w·x + b) · x = 0

The gradient is identically zero. Gradient descent’s update never changes the parameters. Learning is impossible.

Swapping sign for sigmoid fixes this: sigmoid has a well-defined, mostly-nonzero derivative, so ∂L/∂w = 2(ŷ − y) · σ′(w·x + b) · x is meaningful and gradient descent can make progress.
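You can see the difference with a finite-difference estimate of each activation’s slope (a toy check, not part of any training loop):

```python
import math

def sign(z: float) -> float:
    """Hard activation: flat everywhere, so zero gradient almost everywhere."""
    return 1.0 if z >= 0 else -1.0

def sigmoid(z: float) -> float:
    """Soft activation: slope sigma(z)(1 - sigma(z)) > 0 everywhere."""
    return 1.0 / (1.0 + math.exp(-z))

def finite_diff(f, z: float, h: float = 1e-6) -> float:
    """Central-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Away from z = 0, sign is flat: gradient descent gets no signal at all.
assert finite_diff(sign, 1.5) == 0.0
# The sigmoid gives a strictly positive gradient at the same point.
assert finite_diff(sigmoid, 1.5) > 0.0
```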

The sigmoid perceptron (aka soft perceptron, aka logistic regression)

Replace the sign activation with sigmoid:

ŷ = σ(w·x + b)

This is the same model you may have met in a stats course as logistic regression. Don’t let the name mislead you — logistic regression is a classification algorithm, not a regression algorithm.
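The whole model fits in a few lines. A minimal sketch (weights and inputs below are made-up numbers):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w: list[float], b: float, x: list[float]) -> float:
    """Sigmoid perceptron / logistic regression: P(y = 1 | x) = sigma(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Output is a probability in (0, 1), not a hard +/-1 decision.
p = predict_proba(w=[0.4, -0.2], b=0.1, x=[1.0, 2.0])
assert 0.0 < p < 1.0
```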

Perceptron variant    Activation    Output
Hard perceptron       sign          ±1 (hard decision)
Soft perceptron       sigmoid       value in (0, 1) (probability)

“Hard” and “soft” refer to the abruptness of the activation — sign flips instantaneously, sigmoid transitions smoothly.

Probabilistic interpretation

Because the output sits in (0, 1), we can read it as a probability. For a binary classification problem with class labels y ∈ {0, 1}:

P(y = 1 | x) = σ(w·x + b)

and the probability of the other class is P(y = 0 | x) = 1 − σ(w·x + b). Together:

P(y | x) = σ(w·x + b)^y · (1 − σ(w·x + b))^(1−y)

This probabilistic view is what lets maximum likelihood estimation derive binary cross-entropy as the natural loss function.
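The connection is direct: the negative log of the Bernoulli probability above is exactly the binary cross-entropy loss. A quick numerical check (names are illustrative):

```python
import math

def bernoulli_nll(p: float, y: int) -> float:
    """Negative log of P(y | x) = p^y * (1 - p)^(1 - y)."""
    return -math.log(p ** y * (1 - p) ** (1 - y))

def bce(p: float, y: int) -> float:
    """Binary cross-entropy in its usual written-out form."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# The two expressions agree for both labels, at any predicted probability p.
for p in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(bernoulli_nll(p, y) - bce(p, y)) < 1e-12
```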

ASIDE — Why labels flipped from {−1, +1} to {0, 1}

With the sign activation, outputs are ±1, so labelling the two classes {−1, +1} lines up with the model’s output. With the sigmoid activation, outputs live in (0, 1), so labelling the classes 0 and 1 makes the math (and the probability interpretation) work out cleanly. The choice of label encoding is driven by mathematical convenience, not anything fundamental — it’s still a binary problem either way.

The decision rule

To turn a sigmoid probability into a hard classification, threshold at 0.5:

ŷ = 1 if σ(w·x + b) ≥ 0.5, else 0

Since σ(z) ≥ 0.5 iff z ≥ 0, this is equivalent to thresholding the pre-activation at zero — so the decision boundary is still the hyperplane w·x + b = 0, just as it was for the hard perceptron. The difference is that sigmoid additionally reports how confident it is.
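The equivalence of the two thresholds is easy to verify on a handful of pre-activation values (a toy check):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def classify_via_sigmoid(z: float) -> int:
    """Threshold the probability at 0.5."""
    return 1 if sigmoid(z) >= 0.5 else 0

def classify_via_preactivation(z: float) -> int:
    """Threshold the raw pre-activation at 0 (the hard-perceptron rule)."""
    return 1 if z >= 0 else 0

# Thresholding sigma(z) at 0.5 is the same as thresholding z at 0.
for z in (-3.0, -0.01, 0.0, 0.01, 3.0):
    assert classify_via_sigmoid(z) == classify_via_preactivation(z)
```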

Weight magnitude controls confidence, not the boundary

A subtle but important point: scaling w and b by the same positive factor leaves the decision boundary unchanged but changes how confident the sigmoid is near that boundary. Concretely, take two models:

  • Model A: parameters w and b. Decision boundary: w·x + b = 0.
  • Model B: parameters cw and cb for some factor c > 1. Decision boundary: cw·x + cb = 0 — the same hyperplane.

Same boundary. But evaluate both models at three penguins (features: flipper length in cm, body mass in kg):

[Table: for each of the three penguins, the pre-activation and sigmoid probability under Models A and B.]

Same classifications (above-boundary → class 1, below → class 0), but Model B is much more confident: penguins near the boundary get probabilities pushed closer to 0 and 1.
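The effect can be reproduced with made-up numbers (w, b, c, and x below are illustrative, not the penguin values from the table):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w: list[float], b: float, x: list[float]) -> float:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b, c = [1.0, -0.5], 0.2, 10.0   # Model A, and a scaling factor for Model B
x = [0.4, 1.0]                     # a point sitting close to the boundary

p_A = predict_proba(w, b, x)
p_B = predict_proba([c * wi for wi in w], c * b, x)

# Same side of the boundary, so the same hard classification...
assert (p_A >= 0.5) == (p_B >= 0.5)
# ...but the scaled model's probability is pushed much closer to 0 or 1.
assert abs(p_B - 0.5) > abs(p_A - 0.5)
```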

The take-away:

  • Direction of w — orientation of the decision boundary.
  • b relative to ‖w‖ — position of the boundary (specifically, offset −b/‖w‖ along w from the origin).
  • Magnitude of ‖w‖ — steepness of the sigmoid transition. Larger ‖w‖ → sharper transition, more confident predictions on either side. The boundary doesn’t move; the model just becomes more decisive about it.

Geometrically: z = w·x + b measures distance from the boundary in units of 1/‖w‖. Doubling ‖w‖ doubles those distances, so the same input now lands twice as far from the boundary in the sigmoid’s eye, and gets squashed harder toward 0 or 1.

This also gives a different angle on weight decay: penalising ‖w‖ doesn’t directly stop the model from finding the right boundary — it just stops the model from being over-confident about it. Smoother transitions mean more cautious predictions, which usually generalise better.

Limitations

  • Saturation kills gradients. For large |z|, the curve is nearly flat, so σ′(z) ≈ 0. Parameters whose pre-activation lands in the saturated region stop learning. This is the vanishing gradient problem — severe in deep networks. Modern networks often use ReLU or variants to avoid it.
  • Still only linear boundaries. The sigmoid perceptron is a linear classifier — a single straight line (or hyperplane) in input space. It cannot solve XOR or other non-linearly-separable problems. The fix is combining multiple neurons into a multi-layer network (week 3).
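The XOR limitation is checkable by brute force: sweep a grid of candidate weights and biases and confirm that no linear boundary classifies all four XOR points correctly (a toy demonstration, not a proof — though the non-separability of XOR is a standard result):

```python
import itertools
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# The four XOR input/label pairs.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def classifies_xor(w1: float, w2: float, b: float) -> bool:
    """True if this single sigmoid neuron gets every XOR point right."""
    return all((sigmoid(w1 * x1 + w2 * x2 + b) >= 0.5) == bool(y)
               for (x1, x2), y in XOR.items())

# Sweep a grid of linear boundaries: none of them solves XOR.
grid = [i / 2 for i in range(-10, 11)]   # -5.0 ... 5.0 in steps of 0.5
assert not any(classifies_xor(w1, w2, b)
               for w1, w2, b in itertools.product(grid, repeat=3))
```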

Active Recall