THE CRUX: We have a hypothesis set for logistic regression, but no learning algorithm $\mathcal{A}$. How do we actually find the weights $\mathbf{w}$ that best fit the data — and what do we do when the obvious approach is too slow?

We turn “best fit” into a precise objective using maximum likelihood, derive the cross-entropy loss, and minimise it with gradient descent. Then we confront gradient descent’s failure modes — differential curvature and learning-rate sensitivity — and fix them with Newton-Raphson, which becomes IRLS when applied to logistic regression.


Last week we built logistic regression as a hypothesis set: $\mathcal{H} = \{\, h_\mathbf{w}(\mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x}) \,\}$, where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid. We can score any candidate $h_\mathbf{w}$ on a test point. But “given training data, find the best $\mathbf{w}$” is still an empty box. This week fills it in.
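To make the scoring concrete, a minimal sketch (assuming NumPy and the convention that $\mathbf{x}$ already carries a leading 1 for the bias; the function names are mine, not fixed by the notes):

```python
import numpy as np

def sigmoid(z):
    # The logistic function sigma(z) = 1 / (1 + e^{-z}).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # h_w(x) = sigma(w . x): the probability the model assigns to class 1.
    return sigmoid(np.dot(w, x))
```

With $\mathbf{w} = \mathbf{0}$ every input scores 0.5 — the model is maximally uncertain until training picks a direction.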

There are two distinct pieces to nail down:

  1. What does “best” mean? We need an objective function — a loss $E(\mathbf{w})$ that quantifies how badly a given $\mathbf{w}$ fits the data.
  2. How do we minimise it? We need an algorithm that searches the space of weights for the one that drives $E(\mathbf{w})$ to its minimum.

Both pieces are non-trivial, and the standard answer to (2) — gradient descent — has subtle failure modes that motivate a more sophisticated approach.

Step 1: What Should the Loss Be?

It might be tempting to count misclassifications. But this is a discrete, non-differentiable function of $\mathbf{w}$ — moving $\mathbf{w}$ slightly typically doesn’t change the count, until suddenly it does. Useless for optimisation.

A better idea: since the model outputs probabilities, we should reward weights that assign high probability to the actual training labels. This is the principle of maximum likelihood estimation: pick the $\mathbf{w}$ for which the observed training data is most probable.

For a single example $(\mathbf{x}_i, y_i)$, the model assigns probability $h_\mathbf{w}(\mathbf{x}_i)^{y_i}\,\bigl(1 - h_\mathbf{w}(\mathbf{x}_i)\bigr)^{1 - y_i}$ to the actual label. Assuming i.i.d. examples, the joint probability of the entire training set is the product:

$$L(\mathbf{w}) = \prod_{i=1}^{N} h_\mathbf{w}(\mathbf{x}_i)^{y_i}\,\bigl(1 - h_\mathbf{w}(\mathbf{x}_i)\bigr)^{1 - y_i}$$

We want $\mathbf{w}^{*} = \arg\max_{\mathbf{w}} L(\mathbf{w})$.

ASIDE — Probability vs likelihood

Though they share a formula, the words mean different things. Probability treats the parameters as fixed and asks how likely the data is. Likelihood treats the data as fixed and asks how plausible different values of $\mathbf{w}$ are. We’re in likelihood mode: the data is given, $\mathbf{w}$ is what varies.

From Likelihood to Cross-Entropy

Products of probabilities are numerically nasty — multiplying small numbers gives an absurdly tiny result that underflows in floating-point arithmetic. The fix is to take the logarithm, which converts products into sums and is monotonic (so the $\arg\max$ is preserved):

$$\ell(\mathbf{w}) = \sum_{i=1}^{N} \Bigl[\, y_i \log h_\mathbf{w}(\mathbf{x}_i) + (1 - y_i)\log\bigl(1 - h_\mathbf{w}(\mathbf{x}_i)\bigr) \Bigr]$$

Convention prefers minimising over maximising, so we negate to get the cross-entropy loss:

$$E(\mathbf{w}) = -\sum_{i=1}^{N} \Bigl[\, y_i \log h_\mathbf{w}(\mathbf{x}_i) + (1 - y_i)\log\bigl(1 - h_\mathbf{w}(\mathbf{x}_i)\bigr) \Bigr]$$

The two terms are a switch: when $y_i = 1$, only the first term contributes ($-\log h_\mathbf{w}(\mathbf{x}_i)$); when $y_i = 0$, only the second ($-\log(1 - h_\mathbf{w}(\mathbf{x}_i))$). Either way, the loss is small when the model assigns high probability to the correct class, and explodes toward $+\infty$ when it confidently predicts the wrong class.

The cross-entropy loss for logistic regression is strictly convex in $\mathbf{w}$ — it has a single, unique global minimum. This is the property that makes everything else work.
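As a concrete sketch of computing the loss (assuming NumPy, a design matrix `X` with one row per example, and labels `y` in {0, 1} — the names and the `eps` guard are my additions, not part of the mathematical definition):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y, eps=1e-12):
    # E(w) = -sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]
    h = sigmoid(X @ w)
    h = np.clip(h, eps, 1.0 - eps)  # guard against log(0) at very confident predictions
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```

At $\mathbf{w} = \mathbf{0}$ every prediction is 0.5, so the loss is $N \log 2$ — a useful sanity check for any implementation.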

Step 2: How Do We Minimise It?

There’s no closed-form solution that minimises logistic regression’s cross-entropy (unlike linear regression with squared error). We need an iterative procedure.

The natural one is gradient descent: stand at a point, compute which way is “downhill” (the direction of steepest descent — the negative gradient), take a step, repeat:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\nabla E\bigl(\mathbf{w}^{(t)}\bigr)$$

where $\eta$ is the learning rate. For our cross-entropy loss, the gradient has a clean form:

$$\nabla E(\mathbf{w}) = \sum_{i=1}^{N} \bigl( h_\mathbf{w}(\mathbf{x}_i) - y_i \bigr)\,\mathbf{x}_i$$

It’s just the prediction error weighted by the input vector, summed over the training set. Beautifully simple.
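Put together, a bare-bones batch gradient descent might look like this (a sketch under the same assumed `X`, `y` conventions; the default `eta` and step count are illustrative, not recommendations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, X, y):
    # grad E(w) = sum_i (h_w(x_i) - y_i) x_i, vectorised as X^T (h - y).
    return X.T @ (sigmoid(X @ w) - y)

def gradient_descent(X, y, eta=0.1, n_steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= eta * gradient(w, X, y)
    return w
```

Note how the whole training set is summed on every step — the cost of one update grows linearly with $N$.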

ASIDE — Why gradient descent works at all

The gradient points in the direction of steepest increase. The negative gradient points in the direction of steepest decrease. The definition of derivative guarantees that for a small enough step, moving in the direction of $-\nabla E$ strictly reduces $E$ (provided the gradient is non-zero). This is a local property — gradient descent has no global view of the loss landscape, only the slope under its feet.

Because cross-entropy for logistic regression is strictly convex, if gradient descent converges to a critical point, that point is the global minimum. No worries about local minima.

The Trouble with $\eta$

The learning rate $\eta$ is a hyperparameter — we pick it, the algorithm does not. And it matters enormously:

  • $\eta$ too large: the step overshoots the minimum, possibly landing on the opposite slope at a higher loss. Iterations bounce around or even diverge.
  • $\eta$ too small: each step moves a tiny distance. Convergence is technically guaranteed but takes thousands of iterations to make progress that a moderate $\eta$ would make in tens.

Picking $\eta$ is partly trial and error, but there’s a deeper problem lurking.

The Real Problem: Differential Curvature

Even with a perfectly tuned $\eta$, gradient descent struggles when the loss surface curves at different rates along different axes. Imagine the contour plot of $E(w_1, w_2) = w_1^2 + 4w_2^2$ — concentric ellipses, stretched along $w_1$ because the curvature in $w_2$ is four times stronger.

What does gradient descent do here? At any point, the gradient is much larger in the $w_2$ direction than in $w_1$. A single $\eta$ has to handle both. Set it small enough to avoid overshooting in $w_2$, and progress along $w_1$ becomes glacial. The trajectory zig-zags across the narrow valley while creeping slowly toward the optimum.

This isn’t a contrived edge case — it’s the typical situation when input features have different scales. Standardising features (subtracting the mean, dividing by the standard deviation) helps but doesn’t fully cure the issue: the loss landscape’s curvature is determined by feature interactions, not just feature scales.
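To make the zig-zag tangible, here is plain gradient descent on a toy quadratic with four-times-stronger curvature along one axis (an illustration of mine, not the logistic loss):

```python
import numpy as np

def run_gd(eta, w0=(10.0, 10.0), n_steps=50):
    # E(w1, w2) = w1^2 + 4 w2^2, so grad E = (2 w1, 8 w2).
    w = np.array(w0)
    path = [w.copy()]
    for _ in range(n_steps):
        w = w - eta * np.array([2.0 * w[0], 8.0 * w[1]])
        path.append(w.copy())
    return np.array(path)

path = run_gd(eta=0.2)
# Each step multiplies w1 by (1 - 2*eta) = 0.6 (steady shrink) but w2 by
# (1 - 8*eta) = -0.6: the w2 coordinate overshoots and flips sign every
# iteration, zig-zagging across the valley.
```

Shrinking `eta` tames the oscillation in $w_2$ but makes $w_1$ crawl — there is no single value that suits both axes.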

The Insight: Use Curvature Information

Gradient descent uses only first-order information — the slope. But the slope tells you nothing about how it itself is changing. If the slope is changing rapidly in some direction, that’s a sign you should take smaller steps; if it’s barely changing, you can afford a larger step.

The second-order derivative captures exactly this: how quickly the gradient is changing. If we can build that into our update rule, we get an algorithm that automatically adapts its step size to the local curvature.

The vehicle is the Taylor polynomial. A degree-2 Taylor approximation of $E$ around the current weight $w^{(t)}$ is a quadratic that matches the true loss in value, slope, and curvature at $w^{(t)}$:

$$E(w) \approx E\bigl(w^{(t)}\bigr) + E'\bigl(w^{(t)}\bigr)\bigl(w - w^{(t)}\bigr) + \tfrac{1}{2}\,E''\bigl(w^{(t)}\bigr)\bigl(w - w^{(t)}\bigr)^{2}$$

This quadratic has a closed-form minimum. Setting its derivative to zero and solving for $w$ gives the Newton-Raphson update:

$$w^{(t+1)} = w^{(t)} - \frac{E'\bigl(w^{(t)}\bigr)}{E''\bigl(w^{(t)}\bigr)}$$

Notice what’s happening: the first derivative still gives the direction, but the step size is now $1 / E''\bigl(w^{(t)}\bigr)$ — automatic, no $\eta$ to tune. Where curvature is high (second derivative large), steps shrink; where curvature is low, steps grow.
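A minimal one-dimensional sketch. The function here is an arbitrary convex example of mine (not the logistic loss), chosen because its minimiser, $\ln 2$, is known exactly:

```python
import math

def newton_1d(f_prime, f_double_prime, w0, n_steps=20):
    # w <- w - f'(w) / f''(w): the local curvature sets the step size.
    w = w0
    for _ in range(n_steps):
        w = w - f_prime(w) / f_double_prime(w)
    return w

# f(w) = e^w - 2w is convex with its minimum at w = ln 2.
w_star = newton_1d(lambda w: math.exp(w) - 2.0,  # f'
                   lambda w: math.exp(w),        # f''
                   w0=0.0)
```

Notice there is no learning rate anywhere — the division by $f''$ does the job.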

Multivariate: The Hessian

For multi-dimensional $\mathbf{w}$, the second derivative becomes the Hessian — a matrix of all second-order partial derivatives:

$$\mathbf{H}_{jk} = \frac{\partial^2 E}{\partial w_j\,\partial w_k}$$

It captures not just curvature along each axis but the interactions between axes. The Newton-Raphson update generalises to:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}\,\nabla E\bigl(\mathbf{w}^{(t)}\bigr)$$

For logistic regression specifically, this update is called Iteratively Reweighted Least Squares (IRLS). Writing $\mathbf{X}$ for the design matrix, $\mathbf{h}$ for the vector of predictions, and $\mathbf{S} = \mathrm{diag}\bigl(h_i(1 - h_i)\bigr)$, the gradient is $\mathbf{X}^\top(\mathbf{h} - \mathbf{y})$ and the Hessian is $\mathbf{X}^\top\mathbf{S}\mathbf{X}$, giving:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \bigl(\mathbf{X}^\top\mathbf{S}\mathbf{X}\bigr)^{-1}\,\mathbf{X}^\top(\mathbf{h} - \mathbf{y})$$

Both gradient and Hessian depend on $\mathbf{w}$, so we update $\mathbf{w}$, recompute, update again — hence “iterative.” The procedure typically converges in far fewer iterations than gradient descent.

TIP — Why is it called "Reweighted Least Squares"?

The Newton update can be rewritten as a weighted least-squares problem at each iteration, where the weights are $s_i = h_\mathbf{w}(\mathbf{x}_i)\bigl(1 - h_\mathbf{w}(\mathbf{x}_i)\bigr)$ — i.e., proportional to the model’s uncertainty about each example. Examples the model is unsure about (probability near 0.5) get the highest weight; examples it’s already very confident about contribute little. The “reweighting” changes at every iteration as $\mathbf{w}$ changes.
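A compact sketch of the IRLS update (same assumed `X`, `y` conventions as before; I solve the linear system rather than forming $\mathbf{H}^{-1}$ explicitly, which is the standard practical choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, n_iters=20):
    # Newton-Raphson for logistic regression:
    #   gradient g = X^T (h - y)
    #   Hessian  H = X^T S X,  with S = diag(h_i * (1 - h_i))
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ w)
        s = h * (1.0 - h)               # per-example "uncertainty" weights
        g = X.T @ (h - y)
        H = X.T @ (X * s[:, None])
        w = w - np.linalg.solve(H, g)   # solve H d = g instead of inverting H
    return w
```

One caveat: on perfectly separable data the optimum is at infinity and `H` becomes ill-conditioned, so real implementations add regularisation or a convergence check rather than a fixed iteration count.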

Trade-offs

IRLS isn’t free. Each iteration requires inverting a $d \times d$ Hessian — $O(d^3)$ work. For high-dimensional problems (many features), this becomes prohibitive, and gradient descent (which is $O(Nd)$ per step but needs more steps) wins.

Also, the quadratic approximation is only locally accurate. Far from the minimum, Newton-Raphson can overshoot or even move uphill if the Hessian is non-positive-definite. For logistic regression with cross-entropy this isn’t a problem (loss is convex, Hessian is positive semi-definite), but it’s a real concern in non-convex settings like neural networks — which is why deep learning relies on first-order methods despite their slower per-iteration progress.


Concepts Introduced This Week

  • Maximum likelihood estimation; probability vs likelihood
  • Cross-entropy loss and its convexity
  • Gradient descent; the learning rate $\eta$
  • Differential curvature and the zig-zag failure mode
  • Taylor approximation; Newton-Raphson; the Hessian
  • Iteratively Reweighted Least Squares (IRLS)

Connections

  • Builds on week-01: completes the supervised learning framework’s missing piece — the algorithm $\mathcal{A}$.
  • Sets up later weeks: gradient descent is the workhorse for nearly every algorithm in the module. Cross-entropy reappears in neural networks. The MLE principle generalises to most probabilistic models.

Open Questions

  • What if the data isn’t linearly separable? (Answered in weeks 3–5: non-linear transforms, kernels, SVMs.)
  • How do we control overfitting when $\mathcal{H}$ is too expressive? (Answered later: regularisation, validation, VC theory.)
  • For very large datasets, can we avoid summing over all examples per gradient step? (Answered when stochastic gradient descent is introduced.)