TARGET DECK NeuralComputation::Week-02
Gradient descent fundamentals
What is the gradient descent update rule?
$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$, where $\theta_t$ is the parameter vector at step $t$, $\eta$ is the learning rate (a small positive scalar), and $\nabla_\theta L(\theta_t)$ is the gradient of the loss with respect to the parameters at $\theta_t$.
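As a runnable sketch of the update rule on a toy one-parameter loss $L(\theta) = (\theta - 3)^2$ (all names illustrative):

```python
# Minimal gradient descent on the toy loss L(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). The minimiser is theta = 3.

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate

for _ in range(100):
    theta = theta - eta * grad(theta)  # the update rule

print(round(theta, 4))  # converges to 3.0
```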
Why do we step in the direction of $-\nabla_\theta L$ instead of $+\nabla_\theta L$?
The gradient points in the direction of steepest increase of the loss. Since we want to decrease the loss, we step in the opposite direction. The minus sign in the update rule encodes “downhill”.
What does the learning rate control, and what happens at extreme values?
$\eta$ is the step size along the negative gradient.
- Too small: training is very slow; loss decreases inch-by-inch.
- Too large: the optimiser overshoots, oscillates, or even increases the loss.
- Just right: smooth, steady decrease. Typical values are in the range $10^{-4}$ to $10^{-1}$, found by trial and error.
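The three regimes can be demonstrated on a toy loss $L(\theta) = (\theta - 3)^2$ (illustrative names; the update is $\theta \leftarrow (1 - 2\eta)\theta + 6\eta$, so $\eta > 1$ diverges here):

```python
# Effect of the learning rate on the toy loss L(theta) = (theta - 3)^2
# with gradient 2 * (theta - 3).

def run(eta, steps=50):
    theta = 0.0
    for _ in range(steps):
        theta -= eta * 2.0 * (theta - 3.0)
    return theta

small = run(0.001)  # too small: barely moves in 50 steps
good = run(0.1)     # just right: essentially converged
big = run(1.1)      # too large: overshoots and blows up

print(abs(small - 3.0) > 0.5)   # True: still far from the minimum
print(abs(good - 3.0) < 1e-3)   # True: converged
print(abs(big - 3.0) > 1e3)     # True: diverged
```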
What two stopping criteria are used to terminate gradient descent?
- Convergence — the loss stops improving across iterations (changes are below a threshold).
- Maximum iterations — a manually set budget (e.g. 100 epochs), used because complex models may never fully converge.
Why does gradient descent only find a local minimum, not necessarily the global one?
The algorithm only sees the local slope at the current point — it walks downhill from wherever it started. If the loss surface has multiple basins, GD lands in whichever one its initialisation sits inside. In high dimensions we cannot map the landscape, so there is no guarantee we are in the deepest valley. Re-initialising or using momentum-based methods can help escape shallow basins.
Why classification breaks vanilla gradient descent
Why does gradient descent fail when training a sign-activated perceptron with squared-error loss?
The chain rule introduces a factor $\mathrm{sign}'(z)$, the derivative of the sign function. Sign is flat almost everywhere (derivative 0) and undefined at 0, so $\nabla_\theta L = 0$ identically. The update rule becomes $\theta_{t+1} = \theta_t - \eta \cdot 0 = \theta_t$: the parameters never change. Learning is impossible.
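A quick sketch of the flat derivative (illustrative names; a two-sided finite difference stands in for the analytic derivative):

```python
# Why a sign activation kills the gradient: a two-sided numerical
# derivative of sign is exactly 0 anywhere away from the jump at z = 0.

def sign(z):
    return 1.0 if z >= 0 else -1.0

def numerical_derivative(f, z, h=1e-6):
    return (f(z + h) - f(z - h)) / (2 * h)

print(numerical_derivative(sign, 2.5))   # 0.0
print(numerical_derivative(sign, -1.3))  # 0.0
# So eta * grad = 0 and theta never moves.
```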
Why does swapping the loss not fix the sign-activation gradient problem — why must the activation change?
The zero gradient comes from the sign function, not the loss. Whatever loss you compose with sign, the chain rule multiplies by $\mathrm{sign}'(z) = 0$ and wipes out the gradient. The activation has to be replaced with something differentiable.
Sigmoid
What is the sigmoid function, and what is its derivative?
Sigmoid is $\sigma(z) = \frac{1}{1 + e^{-z}}$. It maps $\mathbb{R} \to (0, 1)$ smoothly, so its output can be read as a probability $\hat{y} = \sigma(z)$. Its derivative is $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, strictly positive everywhere (though vanishingly small in the saturated tails), which is why gradient descent can flow through it.
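A sketch checking the derivative identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ numerically, and showing how small the derivative gets in the tails (illustrative names):

```python
import math

# Sigmoid and its derivative, with a finite-difference sanity check.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Identity check against a numerical derivative at z = 0.7:
h = 1e-6
numeric = (sigmoid(0.7 + h) - sigmoid(0.7 - h)) / (2 * h)
print(abs(numeric - sigmoid_deriv(0.7)) < 1e-6)  # True

# Saturation: at z = 10 the derivative is tiny (~4.5e-5).
print(sigmoid_deriv(10) < 1e-4)  # True
```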
What does it mean for sigmoid to saturate, and why is it a problem?
For very large $|z|$, $\sigma(z) \to 0$ or $\sigma(z) \to 1$, and $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \to 0$. When the gradient passes through a saturated sigmoid, it is multiplied by ~0: the upstream parameters get a near-zero update and effectively stop learning. This is the vanishing-gradient problem; it bites hardest in deep networks stacked from sigmoids.
Binary cross-entropy
What is the binary cross-entropy loss?
$\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$, where $y_i \in \{0, 1\}$ is the true label of example $i$ and $\hat{y}_i$ is the model's predicted probability. For each example, only one of the two terms is non-zero (whichever matches the true class), and that term is $-\log(\text{probability assigned to the true class})$.
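A minimal sketch of the loss over a small batch (illustrative names; `ps` holds the predicted probability of class 1):

```python
import math

# Binary cross-entropy averaged over a batch.
# y is the true label (0 or 1), p the predicted probability of class 1.

def bce(ys, ps):
    total = 0.0
    for y, p in zip(ys, ps):
        # Only the term matching the true class survives:
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(ys)

ys = [1, 0, 1]
ps = [0.9, 0.2, 0.6]
# Per-example losses: -log(0.9), -log(0.8), -log(0.6)
print(round(bce(ys, ps), 4))  # 0.2798
```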
From which probabilistic assumption does binary cross-entropy fall out of MLE?
A Bernoulli model for the labels: $P(y_i \mid x_i) = \hat{y}_i^{\,y_i}(1 - \hat{y}_i)^{1 - y_i}$ with $\hat{y}_i = \sigma(w^\top x_i)$. Taking the log of the product across independent observations and flipping the sign produces exactly the BCE expression. Same MLE recipe as MSE, just with Bernoulli instead of Gaussian.
A sigmoid-activated perceptron trained with cross-entropy is also known by another name. What is it?
Logistic regression. Despite the name, it is a classification algorithm — the “regression” refers to fitting a continuous probability $\hat{y} \in (0, 1)$, which is then thresholded for the class decision.
Loss / activation pairings
Summarise the task / activation / loss / probability model correspondences for the two cases covered in week 2.
| Task | Activation | Loss | MLE under |
| --- | --- | --- | --- |
| Regression | (none, linear) | Squared error | Gaussian noise |
| Binary classification | Sigmoid | Binary cross-entropy | Bernoulli labels |

Both are trained by gradient descent. What changes is the assumed data distribution; that fixes the loss/activation pair.
GD variants
What is the difference between batch, stochastic, and mini-batch gradient descent?
- Batch GD: compute the gradient over all training samples per step. Accurate gradient, very slow per step.
- Stochastic GD: compute the gradient on one randomly chosen sample. Very fast per step, very noisy trajectory.
- Mini-batch GD: use a random subset of size $B$ (typically 32–512). Best of both: fast per step, much less noisy than SGD. The de facto default.
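A mini-batch GD sketch on least-squares linear regression (synthetic data and all names illustrative; assumes NumPy):

```python
import numpy as np

# Mini-batch gradient descent fitting y = w*x + b by mean squared error.
# Synthetic data: y = 2x + 1 plus a little noise.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0
eta, batch_size = 0.1, 32

for epoch in range(200):
    idx = rng.permutation(len(X))        # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = w * xb + b - yb
        # Gradients of the MSE over this mini-batch only:
        w -= eta * 2.0 * np.mean(err * xb)
        b -= eta * 2.0 * np.mean(err)

print(round(w, 2), round(b, 2))  # close to the true 2.0 and 1.0
```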
Why does the noise in stochastic / mini-batch GD sometimes help training?
The gradient estimate fluctuates from step to step, so the optimiser does not follow a smooth path. This noise can knock the parameters out of shallow local minima and saddle points, helping to find better solutions than a perfectly smooth descent would.
What does momentum add to gradient descent, and what is its update rule?
Momentum maintains a velocity $v$ that accumulates exponentially decaying past gradients: $v_{t+1} = \beta v_t - \eta \nabla_\theta L(\theta_t)$, then $\theta_{t+1} = \theta_t + v_{t+1}$. Typical $\beta \approx 0.9$. The optimiser carries inertia through flat regions and dampens oscillations across narrow valleys, both speeding up convergence and helping escape shallow minima.
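A sketch of this velocity-based formulation on a toy quadratic (illustrative names; other momentum conventions exist):

```python
# Gradient descent with momentum on the toy loss L(theta) = (theta - 3)^2.

def grad(theta):
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, beta = 0.05, 0.9

for _ in range(200):
    v = beta * v - eta * grad(theta)  # decay old velocity, add new step
    theta = theta + v                 # move by the velocity

print(round(theta, 4))  # close to the minimiser 3
```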
What does Adam combine, and why is it the modern default?
Adam combines two ideas:
- Momentum — exponentially decaying running mean of gradients (first moment).
- Per-parameter adaptive learning rates — exponentially decaying running mean of squared gradients (second moment), used to scale each parameter’s step.
The result is robust across many problems with little tuning, which is why it became the default optimiser for most deep-learning training loops.
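A from-scratch sketch of the two moments and the bias-corrected step on a toy quadratic (standard $\beta_1, \beta_2$ values; the learning rate here is larger than the common $10^{-3}$ default so the toy converges quickly):

```python
import math

# Adam on the toy loss L(theta) = (theta - 3)^2.

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
m, v = 0.0, 0.0                    # first and second moment estimates
eta, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g       # running mean of gradients
    v = b2 * v + (1 - b2) * g * g   # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)       # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step

print(round(theta, 2))  # close to the minimiser 3
```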