A second-order iterative optimisation method: at each step, replace the loss with its degree-2 Taylor approximation and jump to that approximation’s minimum. Uses curvature to set step size automatically — no learning rate to tune.

The Idea

Gradient descent uses the slope (first derivative) and an externally-supplied step size. Its weakness: it has no idea how the slope is changing, so a single learning rate must cope with regions of very different curvature, which it does poorly.

Newton-Raphson uses both the slope and the curvature (second derivative). At each iterate, it builds a quadratic approximation of the loss using a degree-2 taylor-polynomial, then jumps directly to the minimum of that quadratic.

Univariate Update

The degree-2 Taylor approximation of $L(\theta)$ around the current iterate $\theta_t$:

$$\hat L(\theta) = L(\theta_t) + L'(\theta_t)\,(\theta - \theta_t) + \tfrac{1}{2}\,L''(\theta_t)\,(\theta - \theta_t)^2$$

This is a parabola. Setting its derivative to zero:

$$\hat L'(\theta) = L'(\theta_t) + L''(\theta_t)\,(\theta - \theta_t) = 0 \quad\Longrightarrow\quad \theta = \theta_t - \frac{L'(\theta_t)}{L''(\theta_t)}$$

So the update rule is:

$$\theta_{t+1} = \theta_t - \frac{L'(\theta_t)}{L''(\theta_t)}$$

Reading the formula:

  • Direction: the sign of $-L'(\theta_t)$ — same as gradient descent — so we move opposite the gradient (assuming we’re in a convex region with $L''(\theta_t) > 0$).
  • Step size: $1/L''(\theta_t)$ — automatically inversely proportional to curvature. High curvature → small step; low curvature → large step. No learning rate.
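As a minimal numeric sketch (not from the original note): the univariate update applied to the illustrative convex loss $L(\theta) = e^{\theta} - 2\theta$, whose minimum is at $\theta = \ln 2$.

```python
import math

def L_prime(theta):         # L'(theta) for L(theta) = exp(theta) - 2*theta
    return math.exp(theta) - 2.0

def L_double_prime(theta):  # L''(theta), always positive, so the loss is convex
    return math.exp(theta)

theta = 2.0                 # arbitrary starting point
for t in range(6):
    theta = theta - L_prime(theta) / L_double_prime(theta)  # jump to the parabola's minimum
    print(t, theta)         # converges rapidly to ln(2) ≈ 0.6931
```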

Multivariate Update

For a multi-dimensional parameter vector $\boldsymbol{\theta}$, the second derivative becomes the Hessian $\mathbf{H}$, with entries $H_{ij} = \partial^2 L / \partial\theta_i\,\partial\theta_j$, and the update becomes:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \mathbf{H}^{-1}\,\nabla L(\boldsymbol{\theta}_t)$$

The inverse Hessian $\mathbf{H}^{-1}$ rescales the gradient component-by-component (in the eigenbasis of $\mathbf{H}$): it shrinks updates in steep (high-curvature) directions and stretches updates in shallow (low-curvature) directions, exactly cancelling the differential-curvature pathology that plagues gradient descent.
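A sketch of the multivariate step in numpy, assuming the gradient and Hessian are available as arrays; the quadratic loss and the matrix A below are illustrative choices, and the linear system is solved rather than forming $\mathbf{H}^{-1}$ explicitly.

```python
import numpy as np

def newton_step(theta, grad, hess):
    """One Newton-Raphson step: theta - H^{-1} grad, computed as a linear solve."""
    return theta - np.linalg.solve(hess, grad)

# Illustrative quadratic loss L(theta) = 0.5 * theta^T A theta - b^T theta,
# with very different curvature along the two axes.
A = np.array([[100.0, 0.0],
              [0.0,   1.0]])
b = np.array([1.0, 1.0])

theta = np.zeros(2)
grad = A @ theta - b          # gradient of the quadratic at theta
hess = A                      # Hessian of a quadratic is constant
theta = newton_step(theta, grad, hess)
print(theta)                  # [0.01, 1.0] -- the exact minimiser, reached in one step
```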

Why “Iterative”?

If the loss were exactly quadratic, Newton-Raphson would reach the minimum in a single step — the parabola we approximate with is the function itself.

For real losses (like cross-entropy-loss), the quadratic is only a local approximation. After one Newton step, we recompute the gradient and Hessian at the new point and rebuild the quadratic there, then step again. The procedure converges, typically in far fewer iterations than gradient descent, because each iteration uses a much better local model.
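A hedged sketch of the full loop, recomputing the gradient and Hessian at each iterate; the helper name newton_raphson, the stopping rule, and the example loss are illustrative assumptions, not anything from the note.

```python
import numpy as np

def newton_raphson(grad_fn, hess_fn, theta0, tol=1e-10, max_iter=50):
    """Repeat Newton steps, rebuilding the quadratic model at each new iterate."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_fn(theta)
        if np.linalg.norm(g) < tol:                        # gradient essentially zero: done
            break
        theta = theta - np.linalg.solve(hess_fn(theta), g)  # one Newton step
    return theta

# Illustrative convex but non-quadratic loss:
# L(t1, t2) = exp(t1) + exp(t2) - t1 - 2*t2 + 0.5*(t1 - t2)**2
grad_fn = lambda t: np.array([np.exp(t[0]) - 1.0 + (t[0] - t[1]),
                              np.exp(t[1]) - 2.0 - (t[0] - t[1])])
hess_fn = lambda t: np.array([[np.exp(t[0]) + 1.0, -1.0],
                              [-1.0, np.exp(t[1]) + 1.0]])

print(newton_raphson(grad_fn, hess_fn, theta0=[0.0, 0.0]))
```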

Convergence Behaviour

Near a (well-conditioned) minimum, Newton-Raphson exhibits quadratic convergence: the number of correct digits roughly doubles each iteration. Gradient descent typically achieves only linear convergence (correct digits grow linearly).
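A small illustration of the digit-doubling behaviour (an assumed example, not from the note): minimising $L(\theta) = \theta^3/3 - 2\theta$ on $\theta > 0$, with a local minimum at $\sqrt{2}$, where the Newton update reduces to the classic Babylonian iteration.

```python
import math

theta = 1.5                                              # start in the convex region theta > 0
for _ in range(4):
    theta = theta - (theta**2 - 2.0) / (2.0 * theta)     # Newton step for L'(theta) = theta^2 - 2
    print(f"{theta:.15f}   error = {abs(theta - math.sqrt(2)):.1e}")
# errors: ~2e-3, ~2e-6, ~2e-12, then machine precision -- correct digits roughly double
```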

Caveats:

  • Non-convex regions: if the Hessian is not positive definite, the “Newton step” can point uphill. Modifications (damped Newton, trust regions, adding a multiple of the identity to $\mathbf{H}$) handle this; see the sketch after this list.
  • Far from the optimum: the quadratic approximation may be poor, and a Newton step can overshoot. A learning rate $\eta$ can be reintroduced as a safety knob: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\mathbf{H}^{-1}\nabla L(\boldsymbol{\theta}_t)$ with $0 < \eta \le 1$.
  • Cost: $O(d^3)$ per iteration to invert $\mathbf{H}$ (for $d$ parameters), and $O(d^2)$ memory to store it. For high-dimensional models this is prohibitive.
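A minimal sketch combining the two safeguards above; the parameter names eta and ridge and their default values are illustrative assumptions.

```python
import numpy as np

def damped_newton_step(theta, grad, hess, eta=0.5, ridge=1e-4):
    """One safeguarded Newton step.
    ridge: adds a multiple of the identity so the modified Hessian is positive definite.
    eta:   step-size knob in (0, 1] to avoid overshooting far from the optimum.
    """
    direction = np.linalg.solve(hess + ridge * np.eye(len(theta)), grad)
    return theta - eta * direction
```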

When to Use Newton-Raphson

| Setting | Newton-Raphson? |
| --- | --- |
| Logistic regression, small to moderate $d$ | Yes — fast convergence, no learning-rate tuning |
| GLMs (Poisson, Gamma, etc.) | Yes — same idea |
| Deep neural networks | Rarely — Hessian too expensive, often non-convex |
| Convex problems with moderate $d$ | Yes — feasible and reliable |
| Problems where the Hessian has nice structure | Yes — exploit structure for efficiency |

For logistic regression, applying Newton-Raphson gives the algorithm called Iteratively Reweighted Least Squares (IRLS).
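A sketch of that Newton/IRLS update for logistic regression, assuming a dense design matrix X and labels y in {0, 1}; the function name and iteration count are illustrative choices.

```python
import numpy as np

def logistic_newton(X, y, n_iter=10):
    """Newton-Raphson for logistic regression (equivalently IRLS), as a sketch.
    X: (n, d) design matrix, y: (n,) labels in {0, 1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted probabilities
        grad = X.T @ (p - y)                    # gradient of the negative log-likelihood
        W = p * (1.0 - p)                       # per-example weights (diagonal of the weight matrix)
        hess = X.T @ (W[:, None] * X)           # Hessian = X^T W X
        w = w - np.linalg.solve(hess, grad)     # Newton / IRLS update
    return w
```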

Active Recall