TARGET DECK NeuralComputation::Week-02
Gradient descent fundamentals
What is the gradient descent update rule?
$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$, where $\theta_t$ is the parameter vector at step $t$, $\eta$ is the learning rate (a small positive scalar), and $\nabla_\theta L(\theta_t)$ is the gradient of the loss with respect to the parameters at $\theta_t$.
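As a runnable sketch of the update rule on a toy one-parameter loss $L(\theta) = (\theta - 3)^2$ (all names illustrative):

```python
# Minimal gradient descent on the toy loss L(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). The minimiser is theta = 3.

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate

for _ in range(100):
    theta = theta - eta * grad(theta)  # the update rule

print(round(theta, 4))  # converges to 3.0
```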
Why do we step in the direction of $-\nabla_\theta L$ instead of $+\nabla_\theta L$?
The gradient points in the direction of steepest increase of the loss. Since we want to decrease the loss, we step in the opposite direction. The minus sign in the update rule encodes “downhill”.
What does the learning rate control, and what happens at extreme values?
$\eta$ is the step size along the negative gradient.
- Too small: training is very slow; loss decreases inch-by-inch.
- Too large: the optimiser overshoots, oscillates, or even increases the loss.
- Just right: smooth, steady decrease. Typical values are in the range $10^{-4}$ to $10^{-1}$, found by trial and error.
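The three regimes can be demonstrated on a toy loss $L(\theta) = (\theta - 3)^2$ (illustrative names; the update is $\theta \leftarrow (1 - 2\eta)\theta + 6\eta$, so $\eta > 1$ diverges here):

```python
# Effect of the learning rate on the toy loss L(theta) = (theta - 3)^2
# with gradient 2 * (theta - 3).

def run(eta, steps=50):
    theta = 0.0
    for _ in range(steps):
        theta -= eta * 2.0 * (theta - 3.0)
    return theta

small = run(0.001)  # too small: barely moves in 50 steps
good = run(0.1)     # just right: essentially converged
big = run(1.1)      # too large: overshoots and blows up

print(abs(small - 3.0) > 0.5)   # True: still far from the minimum
print(abs(good - 3.0) < 1e-3)   # True: converged
print(abs(big - 3.0) > 1e3)     # True: diverged
```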
What two stopping criteria are used to terminate gradient descent?
- Convergence — the loss stops improving across iterations (changes are below a threshold).
- Maximum iterations — a manually set budget (e.g. 100 epochs), used because complex models may never fully converge.
Why does gradient descent only find a local minimum, not necessarily the global one?
The algorithm only sees the local slope at the current point — it walks downhill from wherever it started. If the loss surface has multiple basins, GD lands in whichever one its initialisation sits inside. In high dimensions we cannot map the landscape, so there is no guarantee we are in the deepest valley. Re-initialising or using momentum-based methods can help escape shallow basins.
Why classification breaks vanilla gradient descent
Why does gradient descent fail when training a sign-activated perceptron with squared-error loss?
The chain rule introduces a factor $\mathrm{sign}'(z)$, the derivative of the sign function. Sign is flat almost everywhere (derivative 0) and undefined at 0, so $\nabla_\theta L = 0$ identically. The update rule becomes $\theta_{t+1} = \theta_t - \eta \cdot 0 = \theta_t$: the parameters never change. Learning is impossible.
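A quick sketch of the flat derivative (illustrative names; a two-sided finite difference stands in for the analytic derivative):

```python
# Why a sign activation kills the gradient: a two-sided numerical
# derivative of sign is exactly 0 anywhere away from the jump at z = 0.

def sign(z):
    return 1.0 if z >= 0 else -1.0

def numerical_derivative(f, z, h=1e-6):
    return (f(z + h) - f(z - h)) / (2 * h)

print(numerical_derivative(sign, 2.5))   # 0.0
print(numerical_derivative(sign, -1.3))  # 0.0
# So eta * grad = 0 and theta never moves.
```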
Why does swapping the loss not fix the sign-activation gradient problem — why must the activation change?
The zero gradient comes from the sign function, not the loss. Whatever loss you compose with sign, the chain rule multiplies by $\mathrm{sign}'(z) = 0$ and wipes out the gradient. The activation has to be replaced with something differentiable.
Sigmoid
What is the sigmoid function, and what is its derivative?
Sigmoid is $\sigma(z) = \frac{1}{1 + e^{-z}}$. It maps $\mathbb{R} \to (0, 1)$ smoothly, so its output can be read as a probability $\hat{y} = \sigma(z)$. Its derivative is $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, strictly positive everywhere (though vanishingly small in the saturated tails), which is why gradient descent can flow through it.
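A sketch checking the derivative identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ numerically, and showing how small the derivative gets in the tails (illustrative names):

```python
import math

# Sigmoid and its derivative, with a finite-difference sanity check.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Identity check against a numerical derivative at z = 0.7:
h = 1e-6
numeric = (sigmoid(0.7 + h) - sigmoid(0.7 - h)) / (2 * h)
print(abs(numeric - sigmoid_deriv(0.7)) < 1e-6)  # True

# Saturation: at z = 10 the derivative is tiny (~4.5e-5).
print(sigmoid_deriv(10) < 1e-4)  # True
```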
What does it mean for sigmoid to saturate, and why is it a problem?
For very large $|z|$, $\sigma(z) \to 0$ or $\sigma(z) \to 1$, and $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \to 0$. When the gradient passes through a saturated sigmoid, it is multiplied by ~0: the upstream parameters get a near-zero update and effectively stop learning. This is the vanishing-gradient problem; it bites hardest in deep networks stacked from sigmoids.
Binary cross-entropy
What is the binary cross-entropy loss?
$\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]$, where $y_i \in \{0, 1\}$ is the true label of example $i$ and $\hat{y}_i$ is the model's predicted probability. For each example, only one of the two terms is non-zero (whichever matches the true class), and that term is $-\log(\text{probability assigned to the true class})$.
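A minimal sketch of the loss over a small batch (illustrative names; `ps` holds the predicted probability of class 1):

```python
import math

# Binary cross-entropy averaged over a batch.
# y is the true label (0 or 1), p the predicted probability of class 1.

def bce(ys, ps):
    total = 0.0
    for y, p in zip(ys, ps):
        # Only the term matching the true class survives:
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(ys)

ys = [1, 0, 1]
ps = [0.9, 0.2, 0.6]
# Per-example losses: -log(0.9), -log(0.8), -log(0.6)
print(round(bce(ys, ps), 4))  # 0.2798
```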
From which probabilistic assumption does binary cross-entropy fall out of MLE?
A Bernoulli model for the labels: $P(y_i \mid x_i) = \hat{y}_i^{\,y_i}(1 - \hat{y}_i)^{1 - y_i}$ with $\hat{y}_i = \sigma(w^\top x_i)$. Taking the log of the product across independent observations and flipping the sign produces exactly the BCE expression. Same MLE recipe as MSE, just with Bernoulli instead of Gaussian.
A sigmoid-activated perceptron trained with cross-entropy is also known by another name. What is it?
Logistic regression. Despite the name, it is a classification algorithm — the “regression” refers to fitting a continuous probability $\hat{y} \in (0, 1)$, which is then thresholded for the class decision.
Loss / activation pairings
Summarise the task / activation / loss / probability model correspondences for the two cases covered in week 2.
| Task | Activation | Loss | MLE under |
| --- | --- | --- | --- |
| Regression | (none, linear) | Squared error | Gaussian noise |
| Binary classification | Sigmoid | Binary cross-entropy | Bernoulli labels |

Both are trained by gradient descent. What changes is the assumed data distribution; that fixes the loss/activation pair.
GD variants
What is the difference between batch, stochastic, and mini-batch gradient descent?
- Batch GD: compute the gradient over all training samples per step. Accurate gradient, very slow per step.
- Stochastic GD: compute the gradient on one randomly chosen sample. Very fast per step, very noisy trajectory.
- Mini-batch GD: use a random subset of size $B$ (typically 32–512). Best of both: fast per step, much less noisy than SGD. The de facto default.
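A mini-batch GD sketch on least-squares linear regression (synthetic data and all names illustrative; assumes NumPy):

```python
import numpy as np

# Mini-batch gradient descent fitting y = w*x + b by mean squared error.
# Synthetic data: y = 2x + 1 plus a little noise.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0
eta, batch_size = 0.1, 32

for epoch in range(200):
    idx = rng.permutation(len(X))        # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = w * xb + b - yb
        # Gradients of the MSE over this mini-batch only:
        w -= eta * 2.0 * np.mean(err * xb)
        b -= eta * 2.0 * np.mean(err)

print(round(w, 2), round(b, 2))  # close to the true 2.0 and 1.0
```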
Why does the noise in stochastic / mini-batch GD sometimes help training?
The gradient estimate fluctuates from step to step, so the optimiser does not follow a smooth path. This noise can knock the parameters out of shallow local minima and saddle points, helping to find better solutions than a perfectly smooth descent would.
What does momentum add to gradient descent, and what is its update rule?
Momentum maintains a velocity $v$ that accumulates exponentially decaying past gradients: $v_{t+1} = \beta v_t - \eta \nabla_\theta L(\theta_t)$, then $\theta_{t+1} = \theta_t + v_{t+1}$. Typical $\beta \approx 0.9$. The optimiser carries inertia through flat regions and dampens oscillations across narrow valleys, both speeding up convergence and helping escape shallow minima.
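A sketch of this velocity-based formulation on a toy quadratic (illustrative names; other momentum conventions exist):

```python
# Gradient descent with momentum on the toy loss L(theta) = (theta - 3)^2.

def grad(theta):
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, beta = 0.05, 0.9

for _ in range(200):
    v = beta * v - eta * grad(theta)  # decay old velocity, add new step
    theta = theta + v                 # move by the velocity

print(round(theta, 4))  # close to the minimiser 3
```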
What does Adam combine, and why is it the modern default?
Adam combines two ideas:
- Momentum — exponentially decaying running mean of gradients (first moment).
- Per-parameter adaptive learning rates — exponentially decaying running mean of squared gradients (second moment), used to scale each parameter’s step.
The result is robust across many problems with little tuning, which is why it became the default optimiser for most deep-learning training loops.
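A from-scratch sketch of the two moments and the bias-corrected step on a toy quadratic (standard $\beta_1, \beta_2$ values; the learning rate here is larger than the common $10^{-3}$ default so the toy converges quickly):

```python
import math

# Adam on the toy loss L(theta) = (theta - 3)^2.

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
m, v = 0.0, 0.0                    # first and second moment estimates
eta, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    g = grad(theta)
    m = b1 * m + (1 - b1) * g       # running mean of gradients
    v = b2 * v + (1 - b2) * g * g   # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)       # bias correction
    v_hat = v / (1 - b2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step

print(round(theta, 2))  # close to the minimiser 3
```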