The Newton-Raphson method applied to the cross-entropy-loss for logistic-regression. Each iteration solves a weighted least-squares problem whose weights depend on the model’s current uncertainty, hence “iteratively reweighted.”
## The Update Rule
For logistic-regression with cross-entropy-loss, writing $p_i = \sigma(w^\top x_i)$ for the predicted probability of example $i$, the gradient and Hessian are:

$$\nabla_w L = \sum_{i=1}^{n} (p_i - y_i)\, x_i \qquad H = \sum_{i=1}^{n} p_i (1 - p_i)\, x_i x_i^\top$$

The Newton-Raphson update is:

$$w \leftarrow w - H^{-1} \nabla_w L$$

Both $\nabla_w L$ and $H$ depend on $w$ (through the $p_i$), so they must be recomputed at each iteration, which is what makes the procedure iterative.
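As a concrete sketch, here is the update loop in NumPy, assuming labels $y_i \in \{0, 1\}$; the function name `irls` and its signature are illustrative, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, n_iters=10):
    """Newton-Raphson for logistic regression: X is (n, d), y in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)             # predictions depend on the current w
        grad = X.T @ (p - y)           # gradient of the cross-entropy loss
        s = p * (1 - p)                # diagonal of S: per-example weights
        H = X.T @ (s[:, None] * X)     # Hessian X^T S X, without forming S
        w -= np.linalg.solve(H, grad)  # Newton step: solve rather than invert
    return w
```

Solving the linear system with `np.linalg.solve` rather than explicitly forming $H^{-1}$ is the usual numerically safer choice.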
## Why “Reweighted Least Squares”?
The Hessian above can be written compactly. Let $S$ be the $n \times n$ diagonal matrix with $S_{ii} = p_i(1 - p_i)$, and let $X$ be the design matrix with rows $x_i^\top$. Then:

$$H = X^\top S X$$

This has the same form as the normal-equations matrix in weighted least squares, where each example is weighted by $S_{ii}$. The Newton-Raphson update can be rearranged into a weighted least-squares problem at each iteration: with the working response $z = Xw + S^{-1}(y - p)$, the update becomes

$$w \leftarrow (X^\top S X)^{-1} X^\top S z$$

with weights that change as $w$ changes — hence “iteratively reweighted least squares.”
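Reusing the names from the sketch above, one Newton step can equivalently be computed as a weighted least-squares solve on the working response $z$ (the division by $s$ assumes no prediction has saturated to exactly $0$ or $1$):

```python
# One IRLS iteration, phrased as weighted least squares.
p = sigmoid(X @ w)
s = p * (1 - p)           # weights S_ii; must be nonzero to form z
z = X @ w + (y - p) / s   # working response z = Xw + S^{-1}(y - p)

# Weighted normal equations: (X^T S X) w_new = X^T S z.
w_new = np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (s * z))
# Algebraically identical to the Newton step w - H^{-1} grad above.
```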
The weights have a clean meaning: $p_i(1 - p_i)$ is the variance of the predicted Bernoulli for example $i$ — i.e., how uncertain the model is about that example.
| Predicted $p_i$ | Weight $p_i(1 - p_i)$ | Meaning |
|---|---|---|
| $0.5$ | $0.25$ (max) | Model is at chance — example highly informative |
| $0.1$ or $0.9$ | $0.09$ | Reasonably confident |
| $0.01$ or $0.99$ | $\approx 0.01$ | Very confident — example contributes little |
| $0$ or $1$ | $0$ | Saturated — no influence on update |
So at every iteration, IRLS pays the most attention to examples near the decision boundary (where the model is unsure) and largely ignores examples it is already confident about. As the model fits the data better, the set of “hard” examples evolves, and the weights shift accordingly.
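The weight column is just $p(1 - p)$; a quick check that the values match the table:

```python
import numpy as np

p = np.array([0.5, 0.9, 0.99, 1.0])
print(p * (1 - p))   # [0.25  0.09  0.0099  0.]
```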
## Why It Beats Gradient Descent
Gradient descent struggles with the differential curvature of the loss surface — eigenvalues of the Hessian differ across directions, so a single learning rate can’t be both aggressive enough to make progress in shallow directions and cautious enough to avoid overshooting in steep directions.
IRLS sidesteps this entirely: the inverse Hessian rescales the gradient direction-by-direction. Steps are automatically:
- Shrunk in directions with high curvature (steep walls of the loss surface).
- Stretched in directions with low curvature (flat valleys).
The result is far fewer iterations, especially when features are correlated or have very different scales.
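To see why, compare one gradient step with one Newton step on a toy quadratic whose curvature differs by a factor of 100 across axes — a stand-in for an ill-conditioned loss surface; the numbers are made up for illustration:

```python
import numpy as np

H = np.diag([100.0, 1.0])   # curvature 100 along axis 0, 1 along axis 1
x = np.array([1.0, 1.0])    # start equally far from the minimum at the origin
grad = H @ x

# Gradient descent: eta must stay below 2/100 to avoid diverging along
# the steep axis, which leaves almost no progress along the shallow one.
eta = 0.01
x_gd = x - eta * grad                    # -> [0.0, 0.99]

# Newton: H^{-1} rescales each direction by its own curvature,
# reaching the minimum of a quadratic in a single step.
x_newton = x - np.linalg.solve(H, grad)  # -> [0.0, 0.0]
```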
## Why It Works for Logistic Regression
Two properties make IRLS especially well-suited to logistic regression:
- The cross-entropy loss is strictly convex in $w$. The Hessian is positive definite (well, semi-definite — it can be singular if a feature is collinear with another), so the Newton step always points downhill. No worry about non-convexity-induced uphill steps (a quick numerical check follows this list).
- The quadratic approximation of cross-entropy is reasonably accurate: the deviations from quadratic are not too large. If they were, a learning rate $\eta$ could be reintroduced for safety: $w \leftarrow w - \eta\, H^{-1} \nabla_w L$ with $\eta < 1$.
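A sanity check of the positive-semidefiniteness claim (not a proof), reusing `sigmoid`, `X`, and `w` from the sketch in The Update Rule:

```python
p = sigmoid(X @ w)
H = X.T @ ((p * (1 - p))[:, None] * X)   # X^T S X
eigvals = np.linalg.eigvalsh(H)          # eigvalsh: H is symmetric
assert eigvals.min() >= -1e-10           # nonnegative up to round-off
```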
Convergence is typically extremely fast — a handful of iterations rather than the hundreds or thousands gradient descent might need.
## Cost
Each IRLS iteration requires:
- $O(nd^2)$ to assemble $H = X^\top S X$ (for $n$ examples and $d$ features).
- $O(d^3)$ to invert the Hessian.

For modest $d$ (up to perhaps a few thousand features), this is fast. For high-dimensional problems (text classification with millions of features, image features), the $O(d^3)$ per iteration becomes prohibitive and gradient descent (or stochastic variants) wins despite needing more iterations.
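A rough way to feel the $O(d^3)$ wall; timings are machine-dependent and the sizes arbitrary, with a random positive-definite matrix standing in for $X^\top S X$:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
for d in (100, 1000, 4000):
    A = rng.normal(size=(d, d))
    H = A @ A.T + np.eye(d)      # positive-definite stand-in for X^T S X
    g = np.ones(d)
    t0 = time.perf_counter()
    np.linalg.solve(H, g)        # the cubic-cost step in each IRLS iteration
    print(d, time.perf_counter() - t0)
```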
## Related
- newton-raphson-method — IRLS is the application of Newton-Raphson to logistic regression
- logistic-regression — the model IRLS fits
- cross-entropy-loss — the loss IRLS minimises
- hessian-matrix — the matrix IRLS inverts at every iteration
- gradient descent — cheaper alternative when $d$ is large
## Active Recall
Why does the Hessian for logistic regression weight each training example by $p_i(1 - p_i)$? What does that imply about which examples drive the update at each iteration?
$p_i(1 - p_i)$ is the variance of the predicted Bernoulli — it measures the model’s uncertainty about that example. Examples near the decision boundary have $p_i \approx 0.5$ and weight close to $0.25$ (the maximum). Examples the model is highly confident about ($p_i$ near $0$ or $1$) have weight near zero. So IRLS effectively focuses on the “hard” examples — those near the boundary — and ignores the easy ones, until the boundary moves and a new set of examples becomes hard.
How does IRLS avoid the differential-curvature problem that hobbles gradient descent?
Gradient descent uses a single scalar learning rate $\eta$ that scales every dimension of the update equally. When the loss has very different curvatures along different axes (eigenvalues of $H$ differ wildly), no single $\eta$ works — too aggressive in steep directions, too timid in shallow ones. IRLS multiplies the gradient by $H^{-1}$, which in the eigenbasis of $H$ divides each component by the corresponding eigenvalue. Steep directions get shrunk; shallow directions get stretched; the result is a Newton step that points more or less straight at the minimum.
IRLS converges quadratically near the optimum and typically needs far fewer iterations than gradient descent. Why isn't IRLS used everywhere?
Cost per iteration. Each IRLS step requires assembling and inverting the Hessian — $O(nd^2 + d^3)$ work and $O(d^2)$ memory. For logistic regression with hundreds or thousands of features, that’s fine. For models with millions of parameters (deep networks, high-dimensional sparse features), the per-iteration cost dwarfs the savings from fewer iterations, and gradient descent (cheap per step, many steps) wins on wall-clock time.