Three optimization algorithms appear in weeks 1–5, each chosen to fit the structure of a different learning problem. Gradient descent is the default for unconstrained smooth losses. Newton-Raphson (and its statistical incarnation IRLS) accelerates convergence using local curvature, but at a per-step cost. SMO solves a constrained QP that the other two can’t handle directly. The choice is dictated by the loss surface, the constraints, and the dimensionality.

The Three Algorithms at a Glance

| Algorithm | Update rule | Uses | Constraints? | Per-step cost | Convergence |
| --- | --- | --- | --- | --- | --- |
| Gradient Descent | $w \leftarrow w - \eta \nabla L(w)$ | Gradient only | Unconstrained (penalty for soft constraints) | $O(nd)$ per gradient eval | Linear; tuning $\eta$ is awkward |
| Newton-Raphson / IRLS | $w \leftarrow w - H^{-1} \nabla L(w)$ | Gradient + Hessian | Unconstrained | $O(d^3)$ for Hessian inverse | Quadratic near optimum |
| SMO | Pair-wise analytic update on $(\alpha_i, \alpha_j)$ | Kernel evaluations + KKT check | Box + linear equality | $O(n)$ per error update | Heuristic; depends on pair selection |

Why Three Different Algorithms?

The module’s three classifiers each present a different optimization problem:

| Classifier | Loss / objective | Optimizer | Why |
| --- | --- | --- | --- |
| logistic-regression | Cross-entropy — convex, smooth, unconstrained | GD or Newton/IRLS | Smooth + unconstrained → GD works; logistic loss has cheap Hessian → Newton wins |
| support-vector-machine (hard) | $\min \tfrac{1}{2}\lVert w \rVert^2$ s.t. $y_i(w^\top x_i + b) \ge 1$ | Dualise → QP → SMO | Inequality constraints make GD inapplicable directly |
| soft-margin-svm | $\min \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i$ s.t. $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ | Dualise → box-constrained QP → SMO | Same as above; box constraint makes generic QP slow at scale |

The pattern: convex + smooth + unconstrained → GD/Newton; convex + constrained → dualise then SMO.

Gradient Descent — The Default

GD takes one principled step per iteration: move opposite the gradient, scaled by the learning rate $\eta$: $w \leftarrow w - \eta \nabla L(w)$.

Strengths.

  • One hyperparameter ($\eta$).
  • Works on any differentiable loss.
  • Memory: $O(d)$. Scales to millions of parameters.

Weaknesses.

  • Linear convergence — many iterations near the minimum where gradients shrink.
  • Step size is uniform across directions. On an elliptical loss like $L(w) = \tfrac{1}{2}(w_1^2 + \kappa w_2^2)$ with $\kappa \gg 1$, the same $\eta$ that's safe along $w_2$ (large curvature) is too small along $w_1$ — and vice versa. The descent path zig-zags; the sketch after this list makes it concrete.
  • Sensitive to feature scale and to the curvature of the loss surface.
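
A minimal sketch of that zig-zag, assuming numpy and a made-up curvature ratio $\kappa = 100$ (an illustrative choice, not a value from the module):

```python
import numpy as np

# Elliptical quadratic loss L(w) = 0.5 * (w1^2 + kappa * w2^2).
# kappa = 100 is an arbitrary illustrative curvature ratio.
kappa = 100.0
grad = lambda w: np.array([w[0], kappa * w[1]])

w = np.array([10.0, 1.0])
eta = 0.019  # just under the stability limit 2/kappa for the steep direction

for step in range(10):
    w = w - eta * grad(w)
    print(f"step {step:2d}  w1={w[0]: .4f}  w2={w[1]: .4f}")
# w2 flips sign every step (zig-zag) while w1 barely moves:
# a single eta cannot suit both curvatures at once.
```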

Where it appears in this module. Logistic regression with cross-entropy loss. The loss is convex, smooth, and unconstrained — exactly the setting GD handles cleanly.
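
A minimal batch-GD sketch for exactly this setting, assuming numpy; the data, learning rate, and iteration count below are illustrative choices, not the module's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent on the mean cross-entropy loss.

    X: (n, d) design matrix, y: (n,) labels in {0, 1}.
    The gradient of the mean cross-entropy is X^T (p - y) / n.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)          # predicted probabilities
        grad = X.T @ (p - y) / n    # gradient of the loss
        w -= eta * grad             # the GD step: w <- w - eta * grad
    return w

# Tiny made-up example
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)
print(fit_logistic_gd(X, y))
```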

Newton-Raphson / IRLS — Curvature-Aware

Newton’s update rescales the gradient by the inverse Hessian, so the step size is chosen per direction based on local curvature:

$$w \leftarrow w - H^{-1} \nabla L(w), \qquad H = \nabla^2 L(w)$$

The intuition: a quadratic approximation of $L$ around the current $w$ has its minimum at exactly this updated point. If $L$ is quadratic, Newton finds the optimum in one step. For non-quadratic but convex losses, you iterate until the quadratic approximation locks in.
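
A quick numerical check of the one-step claim on a made-up positive-definite quadratic (the matrix `A` and vector `b` below are arbitrary illustrations):

```python
import numpy as np

# Quadratic loss L(w) = 0.5 * w^T A w - b^T w, so grad = A w - b and Hessian = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

w = np.zeros(2)
grad = A @ w - b
w_one_step = w - np.linalg.solve(A, grad)  # one Newton step

print(w_one_step)                 # matches the exact minimiser A^{-1} b
print(np.linalg.solve(A, b))
```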

IRLS is the same algorithm specialised to logistic-regression-style losses: the Hessian factorises into a re-weighted least-squares form, and each Newton step is computed by solving a weighted normal equation rather than inverting a matrix from scratch.

Strengths.

  • Quadratic convergence near the optimum — in practice, a handful of iterations.
  • No learning rate to tune.
  • Curvature-aware: handles ill-conditioned losses where GD zig-zags.

Weaknesses.

  • $O(d^3)$ per iteration to invert the Hessian (or solve the corresponding linear system). Becomes infeasible past ~$10^4$ parameters.
  • Requires the Hessian to be positive definite (true for convex losses; can fail for non-convex).
  • Memory: $O(d^2)$ for the Hessian itself.

Where it appears in this module. Logistic regression as IRLS — the standard fitting algorithm in glm / statsmodels. The “weighted least squares” view comes from substituting the logistic Hessian into Newton’s update; the weights track per-example variance $p_i(1 - p_i)$.
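
A minimal IRLS sketch in the weighted-normal-equation form described above, assuming numpy; the weight floor (1e-10) and iteration cap are my own numerical-safety choices, not part of the module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_irls(X, y, n_iters=25, tol=1e-8):
    """Iteratively Reweighted Least Squares for logistic regression.

    Each iteration solves the weighted normal equations
        (X^T W X) w_new = X^T W z
    with W = diag(p_i (1 - p_i)) and working response
        z = X w + W^{-1} (y - p),
    which is exactly the Newton step for the cross-entropy loss.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)
        W = p * (1 - p)                                # per-example weights (variances)
        z = X @ w + (y - p) / np.maximum(W, 1e-10)     # working response, guarded against 0 weights
        XtW = X.T * W                                  # X^T W without forming diag(W)
        w_new = np.linalg.solve(XtW @ X, XtW @ z)      # weighted normal equations
        if np.linalg.norm(w_new - w) < tol:            # stop when the Newton step is tiny
            return w_new
        w = w_new
    return w
```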

Sequential Minimal Optimization — For Constrained QPs

SMO is a decomposition method, not a generic optimizer. It exists because the SVM dual is a constrained QP:

$$\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0$$

Generic QP solvers run in $O(n^3)$ time and $O(n^2)$ memory — both blow up past ~$10^4$ examples. SMO’s trick: at each step, fix all but two multipliers $\alpha_i, \alpha_j$, and solve the resulting two-variable subproblem analytically.

Why two and not one. The equality constraint $\sum_i \alpha_i y_i = 0$ pins any single multiplier in place — you can’t change it alone without violating the constraint. Two is the minimum number that can move while preserving feasibility.
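
A sketch of that two-variable analytic update, following Platt’s clipping formulas; the error cache, bias update, and pair-selection heuristics are omitted, and the precomputed kernel matrix `K` is a simplification of my own (a full SMO avoids storing it):

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, E, C):
    """One analytic SMO step on the pair (alpha_i, alpha_j).

    alpha: current multipliers, y: labels in {-1, +1},
    K: precomputed kernel (Gram) matrix, E: current prediction errors,
    C: box-constraint upper bound. Returns updated (alpha_i, alpha_j),
    or None if this pair cannot make progress.
    """
    if i == j:
        return None
    # Feasible segment for alpha_j implied by the box and the equality constraint
    if y[i] != y[j]:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    if L >= H:
        return None
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the pair direction
    if eta <= 0:
        return None                            # skip the degenerate case in this sketch
    # Unconstrained optimum for alpha_j, then clip to the feasible segment
    aj_new = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    # Move alpha_i the opposite way so that sum(alpha * y) stays fixed
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)
    return ai_new, aj_new
```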

Strengths.

  • No matrix factorisation. Each step is an analytic formula.
  • Memory: $O(n)$ for the multipliers and cached errors. Doesn’t store the full Gram matrix.
  • Naturally maintains feasibility throughout.

Weaknesses.

  • Pair-selection heuristics matter enormously. Lazy heuristics (random-random) are 10×–100× slower than the max-step heuristic.
  • Convergence is heuristic, not provably fast. Stop on KKT-within-tolerance.
  • Specialised to SVM-shaped duals — not a drop-in replacement for GD.

Where it appears in this module. Hard- and soft-margin SVMs. The kernel trick lives inside SMO via $K(x_i, x_j)$ in the analytic update — kernels and SMO are joined at the hip.

Choosing Between Them

Convex, smooth, unconstrained?
├── Small d (≤ 10⁴) and Hessian cheap → Newton/IRLS
└── Large d → GD (or stochastic GD)

Convex with inequality + equality constraints?
└── Dualise → SVM-shaped QP → SMO

The constraint structure usually decides for you. The dimensionality breaks ties between GD and Newton.

What’s Common Across All Three

All three algorithms share the same outer skeleton: start somewhere feasible, iterate until a stopping criterion is met. The differences are in:

  1. What information they use. GD uses the gradient. Newton uses gradient + Hessian. SMO uses kernel evaluations and the KKT residual.
  2. How they handle constraints. GD ignores them (or absorbs them via penalty). Newton ignores them. SMO is constraint-native.
  3. Stopping criterion. GD/Newton: gradient norm small. SMO: KKT conditions satisfied within tolerance.

The “find the optimum of a convex function” problem looks deceptively uniform on the surface — but the structure of the constraint set, the curvature of the loss, and the cost of computing derivatives all push you toward different algorithms.

Open Questions for Later Weeks

  • Stochastic gradient descent (SGD) — for losses defined as sums over millions of examples, mini-batches replace full-gradient evaluations. Will appear when the module covers neural networks.
  • Coordinate descent for primal SVM with linear kernels — LIBLINEAR-style solvers beat SMO when the kernel is the identity.
  • Interior-point methods as an alternative to SMO for small-to-medium QPs.
  • Convergence proofs — we’ve taken on faith that SMO converges. The formal argument (Platt, 1998) uses convexity of the dual objective + careful pair selection.

Connections