A second-order iterative optimisation method: at each step, replace the loss with its degree-2 Taylor approximation and jump to that approximation’s minimum. Uses curvature to set step size automatically — no learning rate to tune.
The Idea
Gradient descent uses the slope (first derivative) and an externally supplied step size. Its weakness: it has no idea how the slope is changing, so a single learning rate must cope with differential curvature, and it copes poorly.
Newton-Raphson uses both the slope and the curvature (second derivative). At each iterate, it builds a quadratic approximation of the loss using a degree-2 taylor-polynomial, then jumps directly to the minimum of that quadratic.
Univariate Update
The degree-2 Taylor approximation of $L(w)$ around the current iterate $w_t$:

$$\hat{L}(w) = L(w_t) + L'(w_t)\,(w - w_t) + \tfrac{1}{2}\,L''(w_t)\,(w - w_t)^2$$

This is a parabola. Setting its derivative to zero:

$$\hat{L}'(w) = L'(w_t) + L''(w_t)\,(w - w_t) = 0$$

So the update rule is:

$$w_{t+1} = w_t - \frac{L'(w_t)}{L''(w_t)}$$
Reading the formula:
- Direction: the sign of $-L'(w_t)$ — same as gradient descent — so we move opposite the gradient (assuming we’re in a convex region with $L''(w_t) > 0$).
- Step size: $|L'(w_t)| / L''(w_t)$ — automatically inversely proportional to curvature. High curvature → small step; low curvature → large step. No learning rate.
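A minimal sketch in Python of the univariate loop, assuming a scalar parameter; the function name `newton_raphson_1d`, the example loss $L(w) = w^4/4 - w$, and the tolerance are illustrative choices, not from the source:

```python
def newton_raphson_1d(grad, curv, w0, tol=1e-10, max_iter=50):
    """Univariate Newton-Raphson: w <- w - L'(w) / L''(w)."""
    w = w0
    for _ in range(max_iter):
        step = grad(w) / curv(w)   # curvature sets the step size automatically
        w -= step
        if abs(step) < tol:        # stop once the step is negligible
            break
    return w

# Example: minimise L(w) = w**4 / 4 - w, whose minimum is at w = 1.
w_min = newton_raphson_1d(lambda w: w**3 - 1, lambda w: 3 * w**2, w0=2.0)
print(w_min)  # ~1.0
```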
Multivariate Update
For multi-dimensional $w$, the second derivative becomes the Hessian $H(w_t) = \nabla^2 L(w_t)$, and the update becomes:

$$w_{t+1} = w_t - H(w_t)^{-1}\,\nabla L(w_t)$$

The inverse Hessian rescales the gradient component-by-component (in the eigenbasis of $H$): it shrinks updates in steep (high-curvature) directions and stretches updates in shallow (low-curvature) directions, exactly cancelling the differential-curvature pathology that plagues gradient descent.
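A minimal sketch of the multivariate step with NumPy (the function names are illustrative); solving the linear system $H\,\delta = \nabla L$ is preferred to forming $H^{-1}$ explicitly:

```python
import numpy as np

def newton_step(grad, hess, w):
    """One Newton-Raphson step: w <- w - H(w)^{-1} grad(w), via a linear solve."""
    g = grad(w)                    # gradient, shape (d,)
    H = hess(w)                    # Hessian, shape (d, d)
    delta = np.linalg.solve(H, g)  # solve H @ delta = g rather than inverting H
    return w - delta
```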
Why “Iterative”?
If the loss were exactly quadratic, Newton-Raphson would reach the minimum in a single step — the parabola we approximate with is the function itself.
For real losses (like cross-entropy-loss), the quadratic is only a local approximation. After one Newton step, we recompute the gradient and Hessian at the new point and rebuild the quadratic there, then step again. The procedure converges, typically in far fewer iterations than gradient descent, because each iteration uses a much better local model.
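To see the one-step claim concretely, here is a quick check on an assumed quadratic $L(w) = \tfrac{1}{2} w^\top A w - b^\top w$ with positive-definite $A$; the matrices below are illustrative:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite, so L is a convex bowl
b = np.array([1.0, -1.0])
w0 = np.array([5.0, 5.0])               # arbitrary starting point

grad = A @ w0 - b                       # gradient of 0.5 w^T A w - b^T w at w0
w1 = w0 - np.linalg.solve(A, grad)      # single Newton step (the Hessian is A)

print(np.allclose(A @ w1, b))           # True: w1 already satisfies the optimality condition
```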
Convergence Behaviour
Near a (well-conditioned) minimum, Newton-Raphson exhibits quadratic convergence: the number of correct digits roughly doubles each iteration. Gradient descent typically achieves only linear convergence (correct digits grow linearly).
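A small numerical illustration of the digit doubling, using the illustrative loss $L(w) = e^w - 2w$ with minimum at $w^* = \ln 2$:

```python
import math

w, w_star = 2.0, math.log(2.0)
for t in range(6):
    w -= (math.exp(w) - 2.0) / math.exp(w)  # Newton step: L'(w) / L''(w)
    print(t, abs(w - w_star))               # error roughly squares each iteration
```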
Caveats:
- Non-convex regions: if the Hessian is not positive definite, the “Newton step” can point uphill. Modifications (damped Newton, trust regions, adding a multiple of the identity to $H$) handle this.
- Far from the optimum: the quadratic approximation may be poor, and a Newton step can overshoot. A learning rate can be reintroduced as a safety knob: $w_{t+1} = w_t - \eta\, H(w_t)^{-1} \nabla L(w_t)$ with $0 < \eta \le 1$ (see the sketch after this list).
- Cost: $O(d^3)$ per iteration to invert (or factorise) $H$, and $O(d^2)$ memory to store it. For high-dimensional models this is prohibitive.
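A minimal sketch combining two of the safeguards above: a multiple of the identity added to $H$ and a step-size knob $\eta$ (the function name and default values are illustrative):

```python
import numpy as np

def damped_newton_step(grad, hess, w, lam=1e-3, eta=1.0):
    """Newton step with Hessian damping (H + lam*I) and an optional step size eta."""
    g = grad(w)
    H = hess(w) + lam * np.eye(len(w))      # nudge eigenvalues toward positive
    return w - eta * np.linalg.solve(H, g)  # eta < 1 guards against overshooting
```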
When to Use Newton-Raphson
| Setting | Newton-Raphson? |
|---|---|
| Logistic regression, small to moderate $d$ | Yes — fast convergence, no learning-rate tuning |
| GLMs (Poisson, Gamma, etc.) | Yes — same idea |
| Deep neural networks | Rarely — Hessian too expensive, often non-convex |
| Convex problems with moderate $d$ | Yes — feasible and reliable |
| Problems where Hessian has nice structure | Yes — exploit structure for efficiency |
For logistic regression, applying Newton-Raphson gives the algorithm called Iteratively Reweighted Least Squares.
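A minimal sketch of that Newton-Raphson/IRLS loop for logistic regression, assuming a NumPy design matrix `X` and binary labels `y`; the fixed iteration count stands in for a proper convergence check:

```python
import numpy as np

def logistic_newton(X, y, n_iter=10):
    """Fit logistic regression by Newton-Raphson (a.k.a. IRLS).

    X: (n, d) design matrix; y: (n,) labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
        grad = X.T @ (p - y)                # gradient of the cross-entropy loss
        W = p * (1.0 - p)                   # per-sample weights p(1 - p)
        H = X.T @ (X * W[:, None])          # Hessian: X^T diag(W) X
        w -= np.linalg.solve(H, grad)       # Newton step
    return w
```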
Related
- iteratively-reweighted-least-squares — Newton-Raphson applied to logistic regression
- taylor-polynomial — the degree-2 approximation Newton-Raphson is built on
- hessian-matrix — the multivariate second-derivative used in the update
- gradient descent — first-order alternative; cheaper per step but slower to converge
Active Recall
Derive the Newton-Raphson update rule from the degree-2 Taylor polynomial of $L$ around $w_t$.
Write $\hat{L}(w) = L(w_t) + L'(w_t)(w - w_t) + \tfrac{1}{2}L''(w_t)(w - w_t)^2$. To find its minimum, differentiate with respect to $w$: $\hat{L}'(w) = L'(w_t) + L''(w_t)(w - w_t)$. Set to zero: $w = w_t - L'(w_t)/L''(w_t)$. Taking this minimiser as the next iterate gives the update rule $w_{t+1} = w_t - L'(w_t)/L''(w_t)$.
Why does Newton-Raphson not need a learning rate, while gradient descent does?
Gradient descent uses only the slope, which gives a direction but not a magnitude — there’s no information available to decide how far to move. Newton-Raphson uses the curvature (second derivative or Hessian), which sets the step size automatically: a parabola with slope $L'$ and curvature $L''$ at the current point has its minimum at distance $|L'|/L''$ from that point. The second derivative supplies the missing scale.
Newton-Raphson converges quadratically — much faster than gradient descent. Why don't deep learning practitioners use it?
Two reasons. First, the Hessian for a deep network with millions of parameters is a $d \times d$ matrix with $d^2$ entries — infeasible to store, let alone invert at $O(d^3)$ cost per iteration. Second, deep-network losses are non-convex, so the Hessian is often not positive definite — the Newton step can point uphill. First-order methods (gradient descent, Adam) trade per-iteration progress for tractable per-iteration cost and avoid the non-convexity pitfall. Approximations like L-BFGS recover some of the Newton-Raphson benefit at lower cost for medium-scale problems.