A smooth, squashed version of the sign function. Crucially, it is differentiable — which is what makes gradient-based learning for classification actually work.
Definition

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Some facts:
- Range: $\sigma(x) \in (0, 1)$ for all real $x$. Output never exactly reaches 0 or 1.
- Symmetry: $\sigma(-x) = 1 - \sigma(x)$. Centred at the origin: $\sigma(0) = \tfrac{1}{2}$.
- Saturation: As $x \to +\infty$, $\sigma(x) \to 1$. As $x \to -\infty$, $\sigma(x) \to 0$. Plateaus where the gradient is nearly zero.
- Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. Always positive, maximal at $x = 0$ (value $\tfrac{1}{4}$), decays toward 0 in the tails.
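A quick numerical check of these facts, as a minimal NumPy sketch (the function names are mine, not from the notes):

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x))."""
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                # stays strictly inside (0, 1)
print(sigmoid(x) + sigmoid(-x))  # symmetry: every entry equals 1
print(sigmoid_deriv(x))          # peaks at 0.25 for x = 0, ~0 in the tails
```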
Why it exists: the sign function is broken for gradient descent
The vanilla perceptron classifier uses $\hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b)$. The sign function’s derivative is zero everywhere (and undefined at zero), so when you plug the perceptron into a squared loss $L = (\hat{y} - y)^2$ and apply the chain rule, you get

$$\frac{\partial L}{\partial w_j} = 2(\hat{y} - y) \cdot \operatorname{sign}'(\mathbf{w}^\top \mathbf{x} + b) \cdot x_j = 0.$$

The gradient is identically zero, so gradient descent’s update never changes the parameters. Learning is impossible.
Swapping sign for sigmoid fixes this: sigmoid has a well-defined, mostly-nonzero derivative, so $\nabla L$ is meaningful and gradient descent can make progress.
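A small numeric illustration of the same point, assuming the squared loss above and a single training example (all parameter values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0])            # one training example
y = 1.0                             # its label
w, b = np.array([0.1, -0.3]), 0.05
z = w @ x + b

# sign: derivative is 0 almost everywhere, so the chain rule
# kills the entire gradient.
grad_sign = 2 * (np.sign(z) - y) * 0.0 * x
# sigmoid: sigma'(z) = sigma(z)(1 - sigma(z)) > 0, so a learning
# signal survives the chain rule.
y_hat = sigmoid(z)
grad_sigmoid = 2 * (y_hat - y) * y_hat * (1 - y_hat) * x

print(grad_sign)     # [0. 0.] -- parameters never move
print(grad_sigmoid)  # nonzero -- gradient descent can make progress
```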
The sigmoid perceptron (aka soft perceptron, aka logistic regression)
Replace the sign activation with sigmoid:

$$\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$
This is the same model you may have met in a stats course as logistic regression. Don’t let the name mislead you — logistic regression is a classification algorithm, not a regression algorithm.
| Perceptron variant | Activation | Output |
|---|---|---|
| Hard perceptron | sign | $\{-1, +1\}$ (hard decision) |
| Soft perceptron | sigmoid | value in $(0, 1)$ (probability) |
“Hard” and “soft” refer to the abruptness of the activation — sign flips instantaneously, sigmoid transitions smoothly.
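Both variants share the same pre-activation; only the squashing differs. A hedged sketch (the parameter values are made up):

```python
import numpy as np

def hard_perceptron(x, w, b):
    """Hard decision in {-1, +1}: abrupt flip at the boundary."""
    return np.sign(w @ x + b)

def soft_perceptron(x, w, b):
    """Smooth output in (0, 1): a probability, not just a side."""
    return 1 / (1 + np.exp(-(w @ x + b)))

w, b = np.array([0.5, -0.2]), 0.1
x = np.array([1.0, 1.0])
print(hard_perceptron(x, w, b))  # 1.0
print(soft_perceptron(x, w, b))  # ~0.599: same side, plus confidence
```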
Probabilistic interpretation
Because the output sits in $(0, 1)$, we can read it as a probability. For a binary classification problem with class labels $y \in \{0, 1\}$:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \hat{y}$$

and the probability of the other class is $P(y = 0 \mid \mathbf{x}) = 1 - \hat{y}$. Together:

$$P(y \mid \mathbf{x}) = \hat{y}^{\,y}\,(1 - \hat{y})^{1 - y}$$
This probabilistic view is what lets maximum likelihood estimation derive binary-cross-entropy as the natural loss function.
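Taking the negative log of that likelihood gives binary cross-entropy. A minimal sketch (the `eps` clipping is my own numerical guard, not part of the derivation):

```python
import numpy as np

def bce(y, y_hat, eps=1e-12):
    """-log P(y | x) = -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(bce(1, 0.9))  # ~0.105: confident and right, small loss
print(bce(1, 0.1))  # ~2.303: confident and wrong, large loss
```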
ASIDE — Why labels flipped from $\{-1, +1\}$ to $\{0, 1\}$
With the sign activation, outputs are $\{-1, +1\}$, so labelling the two classes $-1$ and $+1$ lines up with the model’s output. With the sigmoid activation, outputs live in $(0, 1)$, so labelling the classes $0$ and $1$ makes the math (and the probability interpretation) work out cleanly. The choice of label encoding is driven by mathematical convenience, not anything fundamental — it’s still a binary problem either way.
The decision rule
To turn a sigmoid probability into a hard classification, threshold at 0.5:

$$\text{class} = \begin{cases} 1 & \text{if } \sigma(\mathbf{w}^\top \mathbf{x} + b) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Since $\sigma(z) \geq 0.5$ iff $z \geq 0$, this is equivalent to thresholding the pre-activation at zero — so the decision boundary is still the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$, just as it was for the hard perceptron. The difference is that sigmoid additionally reports how confident it is.
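The two thresholds really are the same rule; here is a quick sketch checking it on random inputs (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([1.5, -2.0]), 0.3
X = rng.normal(size=(5, 2))       # five random inputs
z = X @ w + b                     # pre-activations

prob_rule = (1 / (1 + np.exp(-z)) >= 0.5).astype(int)  # threshold sigma(z) at 0.5
sign_rule = (z >= 0).astype(int)                        # threshold z at 0
print(np.array_equal(prob_rule, sign_rule))             # True: identical rule
```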
Weight magnitude controls confidence, not the boundary
A subtle but important point: scaling $\mathbf{w}$ and $b$ by the same factor leaves the decision boundary unchanged but changes how confident the sigmoid is near that boundary. Concretely, take two models:
- Model A: weights $\mathbf{w}$, bias $b$. Decision boundary: $\mathbf{w}^\top \mathbf{x} + b = 0$.
- Model B: weights $2\mathbf{w}$, bias $2b$. Decision boundary: $2(\mathbf{w}^\top \mathbf{x} + b) = 0$, the same set of points.
Same boundary. But evaluate both models on three penguins (features: flipper length in cm, body mass in kg). Writing $z_A = \mathbf{w}^\top \mathbf{x} + b$ for Model A’s pre-activation, Model B’s is $z_B = 2 z_A$. With illustrative pre-activations:

| Penguin | $z_A$ | $\sigma(z_A)$ (Model A) | $z_B = 2 z_A$ | $\sigma(z_B)$ (Model B) |
|---|---|---|---|---|
| 1 | $0.5$ | $0.62$ | $1.0$ | $0.73$ |
| 2 | $-0.5$ | $0.38$ | $-1.0$ | $0.27$ |
| 3 | $2.0$ | $0.88$ | $4.0$ | $0.98$ |
Same classifications (above-boundary → class 1, below → class 0), but Model B is much more confident: penguins near the boundary get probabilities pushed closer to 0 and 1.
The take-away:
- Direction of $\mathbf{w}$ — orientation of the decision boundary.
- $b$ relative to $\|\mathbf{w}\|$ — position of the boundary (specifically, offset $-b / \|\mathbf{w}\|$ along $\mathbf{w} / \|\mathbf{w}\|$ from the origin).
- Magnitude $\|\mathbf{w}\|$ — steepness of the sigmoid transition. Larger $\|\mathbf{w}\|$ → sharper transition, more confident predictions on either side. The boundary doesn’t move; the model just becomes more decisive about it.
Geometrically: $z = \mathbf{w}^\top \mathbf{x} + b$ measures distance from the boundary in units of $1 / \|\mathbf{w}\|$ (the signed Euclidean distance is $z / \|\mathbf{w}\|$). Doubling $\|\mathbf{w}\|$ doubles those distances, so the same input now lands twice as far from the boundary in the sigmoid’s eye, and gets squashed harder toward 0 or 1.
This also gives a different angle on weight decay: penalising $\|\mathbf{w}\|$ doesn’t directly stop the model from finding the right boundary — it just stops the model from being over-confident about it. Smoother transitions mean more cautious predictions, which usually generalise better.
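A tiny sketch of the scaling effect, using made-up parameters: the classification never changes, but the probability grows more extreme as $\|\mathbf{w}\|$ grows.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b = np.array([1.0, -0.5]), 0.2   # illustrative parameters
x = np.array([0.3, 0.4])            # a point near the boundary
z = w @ x + b                       # z = 0.3

for scale in (1, 2, 10):            # model (scale*w, scale*b)
    print(scale, round(sigmoid(scale * z), 3))
# 1 0.574   -- mildly confident
# 2 0.646   -- same side, more confident
# 10 0.953  -- same side, very confident
```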
Limitations
- Saturation kills gradients. For $|x| \gtrsim 4$, the curve is nearly flat, so $\sigma'(x) \approx 0$. Parameters whose pre-activation lands in the saturated region stop learning. This is the vanishing gradient problem — severe in deep networks; the sketch after this list shows how fast $\sigma'$ collapses. Modern networks often use ReLU or variants to avoid it.
- Still only linear boundaries. The sigmoid perceptron is a linear classifier — a single straight line (or hyperplane) in input space. It cannot solve XOR or other non-linearly-separable problems. The fix is combining multiple neurons into a multi-layer network (week 3).
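The sketch: evaluating $\sigma'$ a few steps into the tail shows the gradient signal collapsing.

```python
import numpy as np

def sigmoid_deriv(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

for x in (0.0, 2.0, 4.0, 10.0):
    print(x, sigmoid_deriv(x))
# 0.0  0.25       -- healthy gradient
# 2.0  0.105
# 4.0  0.0177
# 10.0 4.54e-05   -- effectively zero: learning stalls
```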
Related
- perceptron — the model in which sigmoid replaces sign
- gradient descent — the algorithm that needs a differentiable activation
- binary-cross-entropy — the loss derived from the probabilistic interpretation of sigmoid
- maximum likelihood estimation — the principle that connects sigmoid to cross-entropy
Active Recall
Write down the sigmoid function and its derivative, and state the derivative's maximum value.
$\sigma(x) = \frac{1}{1 + e^{-x}}$, with derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The derivative is maximised at $x = 0$, where $\sigma(0) = \tfrac{1}{2}$, giving $\sigma'(0) = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$.
Why can't gradient descent learn a perceptron with the sign activation and squared loss, and how does sigmoid fix it?
The sign function’s derivative is 0 everywhere (and undefined at 0), so by the chain rule $\nabla L$ is identically zero and the parameters never update. Sigmoid’s derivative is well-defined and positive everywhere, so $\nabla L$ is a meaningful vector that actually moves the parameters downhill.
A sigmoid perceptron outputs, say, $\hat{y} = 0.8$ for input $\mathbf{x}$. Interpret this value and give the hard-threshold classification.
The output is interpretable as $P(y = 1 \mid \mathbf{x})$ — the model’s estimated probability that $\mathbf{x}$ belongs to class 1. With threshold 0.5, the hard classification is class 1, with relatively high confidence.
What is the vanishing gradient problem and when does it bite for sigmoid?
Sigmoid saturates in the tails: for $|x|$ large, $\sigma'(x) \approx 0$. Any parameter whose pre-activation is in the saturated region receives near-zero gradient and updates very slowly — it effectively stops learning. This is especially damaging in deep networks, where gradients multiply layer-by-layer and shrink exponentially toward the input.
Why do we switch the class label encoding from $\{-1, +1\}$ (hard perceptron) to $\{0, 1\}$ (sigmoid perceptron)?
It’s a mathematical convenience. Sigmoid outputs lie in $(0, 1)$, so encoding classes as 0 and 1 aligns labels with outputs and lets the output be read as a probability $P(y = 1 \mid \mathbf{x})$. With sign outputs of $\{-1, +1\}$, the $\{-1, +1\}$ label encoding is the natural match. Nothing changes about the problem — it’s still binary classification.
Two sigmoid classifiers have parameters $(\mathbf{w}, b)$ and $(2\mathbf{w}, 2b)$. Do they classify points differently? What does change?
They classify points identically — both have the same decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$, because scaling $\mathbf{w}$ and $b$ by the same factor leaves the equation unchanged. What changes is confidence: the second model has $\|\mathbf{w}\|$ twice as large, so its sigmoid transition is twice as steep. Points near the boundary get pushed harder toward 0 or 1, producing more extreme (more confident) probabilities. The direction and position of the boundary are set by $\mathbf{w}$ and $b$; the magnitude $\|\mathbf{w}\|$ controls only the sharpness.