The Newton-Raphson method applied to the cross-entropy loss for logistic regression. Each iteration solves a weighted least-squares problem whose weights depend on the model’s current uncertainty, hence “iteratively reweighted.”

The Update Rule

For logistic regression with cross-entropy loss, the gradient and Hessian are:

∇L(w) = Σᵢ (σ(wᵀxᵢ) − yᵢ) xᵢ
H = Σᵢ σ(wᵀxᵢ)(1 − σ(wᵀxᵢ)) xᵢxᵢᵀ

where σ is the sigmoid and yᵢ ∈ {0, 1}.

The Newton-Raphson update is:

w ← w − H⁻¹ ∇L(w)

Both ∇L(w) and H depend on w (through the predictions σ(wᵀxᵢ)), so they must be recomputed at each iteration — making the procedure iterative.
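As a concrete sketch (NumPy; the helper names are mine, not from the original), one Newton step assembles exactly these quantities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(w, X, y):
    """One Newton-Raphson step for the logistic cross-entropy loss.

    X: (n, d) design matrix, y: (n,) labels in {0, 1}, w: (d,) weights.
    """
    p = sigmoid(X @ w)                   # predicted probabilities sigma(w^T x_i)
    grad = X.T @ (p - y)                 # gradient of the loss
    H = (X.T * (p * (1 - p))) @ X        # Hessian: X^T S X with S = diag(p(1-p))
    return w - np.linalg.solve(H, grad)  # solve H step = grad, don't invert H
```

Using `np.linalg.solve` instead of explicitly forming H⁻¹ is both cheaper and more numerically stable.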

Why “Reweighted Least Squares”?

The Hessian above can be written compactly. Let S be the diagonal matrix with Sᵢᵢ = σ(wᵀxᵢ)(1 − σ(wᵀxᵢ)), and let X be the design matrix with rows xᵢᵀ. Then:

H = XᵀSX

This has the same form as the normal-equations matrix XᵀSX in weighted least squares, where each example i is weighted by sᵢ = σ(wᵀxᵢ)(1 − σ(wᵀxᵢ)). The Newton-Raphson update can be rearranged into a weighted least-squares problem at each iteration, with weights that change as w changes — hence “iteratively reweighted least squares.”
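A minimal sketch of that rearrangement (the function name `irls_step` and the stability clip are my assumptions): the Newton step equals a weighted least-squares solve against a “working response” z = Xw + S⁻¹(y − p):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_step(w, X, y, eps=1e-8):
    """One IRLS step written as a weighted least-squares solve.

    Algebraically identical to the Newton-Raphson step.
    """
    p = sigmoid(X @ w)
    s = np.clip(p * (1 - p), eps, None)   # weights, clipped to avoid division by 0
    z = X @ w + (y - p) / s               # working response
    XtS = X.T * s                         # row-scaling applies S = diag(s)
    # Weighted normal equations: (X^T S X) w_new = X^T S z
    return np.linalg.solve(XtS @ X, XtS @ z)
```

Expanding XᵀSz = XᵀSXw + Xᵀ(y − p) shows this returns w − H⁻¹∇L(w), i.e., exactly the Newton update.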

The weights have a clean meaning: sᵢ = σᵢ(1 − σᵢ) is the variance of the predicted Bernoulli distribution for example i — i.e., how uncertain the model is about that example.

| Predicted σᵢ | Weight σᵢ(1 − σᵢ) | Meaning |
|---|---|---|
| 0.5 | 0.25 (max) | Model is at chance — example highly informative |
| 0.9 or 0.1 | 0.09 | Reasonably confident |
| 0.99 or 0.01 | ≈ 0.01 | Very confident — example contributes little |
| ≈ 1 or ≈ 0 | ≈ 0 | Saturated — no influence on update |

So at every iteration, IRLS pays the most attention to examples near the decision boundary (where the model is unsure) and largely ignores examples it is already confident about. As the model fits the data better, the set of “hard” examples evolves, and the weights shift accordingly.
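The weight values above are just σ(1 − σ) evaluated at those probabilities:

```python
# Bernoulli variance sigma * (1 - sigma) at a few predicted probabilities
for sigma in [0.5, 0.9, 0.99]:
    print(f"sigma = {sigma:>4}  weight = {sigma * (1 - sigma):.4f}")
# sigma =  0.5  weight = 0.2500
# sigma =  0.9  weight = 0.0900
# sigma = 0.99  weight = 0.0099
```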

Why It Beats Gradient Descent

Gradient descent struggles when the curvature of the loss surface varies across directions — the eigenvalues of the Hessian differ, so a single learning rate can’t be both aggressive enough to make progress in shallow directions and cautious enough to avoid overshooting in steep directions.

IRLS sidesteps this entirely: the inverse Hessian rescales the gradient direction-by-direction. Steps are automatically:

  • Shrunk in directions with high curvature (steep walls of the loss surface).
  • Stretched in directions with low curvature (flat valleys).

The result is far fewer iterations, especially when features are correlated or have very different scales.
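A toy illustration of the scaling point (the dataset, learning rate, and iteration counts are my assumptions, not from the original): with one feature on a 100× larger scale, gradient descent is forced to use a tiny learning rate to stay stable in the steep direction, while Newton steps rescale automatically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 100.0])  # badly scaled features
y = (X @ np.array([1.0, 0.01]) + rng.normal(size=200) > 0).astype(float)

# Gradient descent: the steep (large-scale) direction caps the learning rate.
w_gd = np.zeros(2)
for _ in range(1000):
    w_gd -= 1e-6 * (X.T @ (sigmoid(X @ w_gd) - y))

# Newton / IRLS: the inverse Hessian rescales each direction.
w_nt = np.zeros(2)
for _ in range(10):
    p = sigmoid(X @ w_nt)
    H = (X.T * (p * (1 - p))) @ X
    w_nt -= np.linalg.solve(H, X.T @ (p - y))

print(loss(w_gd, X, y), loss(w_nt, X, y))  # Newton reaches a much lower loss
```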

Why It Works for Logistic Regression

Two properties make IRLS especially well-suited to logistic regression:

  1. The cross-entropy loss is convex in w — strictly convex when X has full column rank. The Hessian XᵀSX is positive definite in that case (it becomes singular if a feature is collinear with another), so the Newton step always points downhill. No worry about non-convexity-induced uphill steps.
  2. The quadratic approximation of cross-entropy is reasonably accurate — the deviations from quadratic are not too large. If they were, a learning rate η ∈ (0, 1] could be reintroduced for safety: w ← w − η H⁻¹ ∇L(w).

Convergence is typically extremely fast — a handful of iterations rather than the hundreds or thousands gradient descent might need.
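A sketch of the full loop under the same assumptions as above (names and tolerance are mine), stopping when the step becomes tiny:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, tol=1e-8, max_iter=25):
    """Fit logistic regression by IRLS; returns (weights, iterations used)."""
    w = np.zeros(X.shape[1])
    for it in range(1, max_iter + 1):
        p = sigmoid(X @ w)
        g = X.T @ (p - y)                 # gradient
        H = (X.T * (p * (1 - p))) @ X     # Hessian X^T S X
        step = np.linalg.solve(H, g)
        w -= step
        if np.linalg.norm(step) < tol:    # converged
            break
    return w, it
```

On a well-conditioned, non-separable problem this typically reports a single-digit iteration count.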

Cost

Each IRLS iteration requires:

  • O(nd²) to assemble the Hessian XᵀSX.
  • O(d³) to solve the Newton system (invert the Hessian).

For modest d (up to perhaps a few thousand features), this is fast. For high-dimensional problems (text classification with millions of features, image features), the O(d³) per iteration becomes prohibitive and gradient descent (or stochastic variants) wins despite needing more iterations.

Active Recall