TARGET DECK MachineLearning::Week-02
Maximum Likelihood Estimation
What is the principle of maximum likelihood estimation (MLE)?
Pick the parameters for which the observed training data is most probable: $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} p(\mathcal{D} \mid \theta)$. Treat the data $\mathcal{D}$ as fixed; treat $\theta$ as the variable. Choose the $\theta$ that makes the data look most plausible under the model.
Why do we minimise the negative log-likelihood instead of maximising the likelihood directly?
Two reasons:
- Numerical: products of small probabilities underflow in floating-point arithmetic. The logarithm converts $\prod_i p(x_i \mid \theta)$ into $\sum_i \log p(x_i \mid \theta)$ — a sum of moderate-magnitude numbers.
- Convention: optimisers minimise. Negating gives a loss to minimise without changing the argmin ($\log$ is monotonic).
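A minimal sketch of the underflow point, using hypothetical per-example probabilities (1000 examples, each with probability 0.01):

```python
import math

# Hypothetical per-example probabilities: 1000 examples, each with p = 0.01.
probs = [0.01] * 1000

# Direct product: 10^-2000 is far below the smallest float64, so it underflows.
likelihood = math.prod(probs)

# Sum of logs stays a moderate-magnitude number.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # 0.0 (underflow)
print(log_likelihood)  # ≈ -4605.17
```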
What is the cross-entropy loss for logistic regression?
$\mathcal{L}(\mathbf{w}) = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$ where $\hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$. The two terms act as a switch: when $y_i = 1$, only $\log \hat{y}_i$ contributes; when $y_i = 0$, only $\log(1 - \hat{y}_i)$ contributes. Strictly convex in $\mathbf{w}$, so a unique global minimum exists.
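A direct transcription of the formula, on a hypothetical two-example dataset (stdlib only; `sigmoid`, `X`, `y` are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, X, y):
    # L(w) = -sum_i [ y_i log(yhat_i) + (1 - y_i) log(1 - yhat_i) ]
    total = 0.0
    for x_i, y_i in zip(X, y):
        y_hat = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total -= y_i * math.log(y_hat) + (1 - y_i) * math.log(1 - y_hat)
    return total

# Toy 2-feature data (hypothetical).
X = [(1.0, 2.0), (1.0, -1.0)]
y = [1, 0]

# At w = 0 every prediction is 0.5, so the loss is 2 log 2.
print(cross_entropy((0.0, 0.0), X, y))  # ≈ 1.386
```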
A logistic regression model assigns $\hat{y} \approx 1$ to a true-label-1 example. What does it contribute to the loss? What if $\hat{y} \approx 0$ instead?
When $y = 1$, the contribution is $-\log \hat{y}$.
- $\hat{y} \approx 1$: $-\log \hat{y} \approx 0$ — almost zero (correct, confident).
- $\hat{y} \approx 0$: $-\log \hat{y}$ is large — heavy penalty (wrong, confident).
Cross-entropy heavily punishes confidently-wrong predictions; the loss diverges to $+\infty$ as $\hat{y} \to 0$ for the true class.
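The asymmetry is easy to see numerically (illustrative $\hat{y}$ values, not from the original card):

```python
import math

# Per-example loss for a true-label-1 example is -log(yhat).
for y_hat in (0.99, 0.5, 0.01):
    print(f"{y_hat}: {-math.log(y_hat):.2f}")
# 0.99: 0.01  (confident and correct: near-zero loss)
# 0.5:  0.69
# 0.01: 4.61  (confident and wrong: large penalty; -> infinity as yhat -> 0)
```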
Gradient Descent
What is the gradient descent update rule, and what is the role of $\eta$?
$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w})$
- $\nabla_{\mathbf{w}} \mathcal{L}$ is the gradient — the direction of steepest increase.
- The negative gradient $-\nabla_{\mathbf{w}} \mathcal{L}$ points downhill.
- $\eta$ is the learning rate — the step-size hyperparameter.
Each step moves a distance proportional to $\eta \, \|\nabla_{\mathbf{w}} \mathcal{L}\|$ in the direction that locally reduces $\mathcal{L}$ fastest.
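The update rule as a loop, minimising a hypothetical 1-D quadratic $(w - 3)^2$ (example of mine, not from the card):

```python
def gradient_descent(grad, w, eta, steps):
    # Repeated update: w <- w - eta * grad(w)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Minimise L(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w=0.0, eta=0.1, steps=100)
print(w_star)  # ≈ 3.0
```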
What is the gradient of the cross-entropy loss for logistic regression?
$\nabla_{\mathbf{w}} \mathcal{L} = \sum_i (\hat{y}_i - y_i) \, \mathbf{x}_i$ — the prediction error $(\hat{y}_i - y_i)$ scales each input vector. Beautifully simple: when the prediction matches the label, that example contributes nothing to the gradient.
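The same formula in code, on the same kind of hypothetical toy data as above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_gradient(w, X, y):
    # grad = sum_i (yhat_i - y_i) * x_i : the prediction error scales each input.
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        y_hat = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        for j, x_ij in enumerate(x_i):
            grad[j] += (y_hat - y_i) * x_ij
    return grad

X = [(1.0, 2.0), (1.0, -1.0)]
y = [1, 0]

# At w = 0 both predictions are 0.5, so the errors are -0.5 and +0.5.
print(ce_gradient([0.0, 0.0], X, y))  # [0.0, -1.5]
```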
What goes wrong when the gradient descent learning rate is too large or too small?
- $\eta$ too large: the step overshoots the minimum, possibly landing on the opposite slope at higher loss. Iterations bounce around or diverge.
- $\eta$ too small: each step makes negligible progress; convergence is technically guaranteed but might take thousands of iterations to do what a moderate $\eta$ does in tens.
No single $\eta$ is universally correct — it requires tuning, and the right value depends on the loss landscape’s curvature.
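All three regimes on one hypothetical loss $\mathcal{L}(w) = w^2$ (gradient $2w$), starting from $w = 1$:

```python
def run_gd(eta, steps=50):
    # Minimise L(w) = w^2 (gradient 2w), starting from w = 1.
    w = 1.0
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(run_gd(eta=1.1))   # |w| grows each step: divergence
print(run_gd(eta=1e-4))  # barely moves from 1.0: glacial progress
print(run_gd(eta=0.1))   # ≈ 0: converges quickly
```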
What is differential curvature and why does it cause gradient descent to zig-zag?
When the loss surface curves at different rates along different axes — e.g. $\mathcal{L}(w_1, w_2) = w_1^2 + 100\,w_2^2$ — the gradient is much larger in the steep direction than the gentle one. A single $\eta$ has to handle both: small enough to avoid overshoot in the steep direction, but then progress is glacial in the gentle direction. The trajectory zig-zags across the narrow valley while creeping toward the optimum.
Newton-Raphson and IRLS
What is the Newton-Raphson update rule (1D), and why does it not need a learning rate?
$w \leftarrow w - \dfrac{\mathcal{L}'(w)}{\mathcal{L}''(w)}$
The first derivative gives the direction; the second derivative scales the step. Where curvature is high ($\mathcal{L}''(w)$ large), steps shrink automatically; where it’s low, steps grow. The step size is adapted to local curvature — no $\eta$ to tune. The rule comes from minimising the degree-2 Taylor approximation of $\mathcal{L}$ exactly.
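On a quadratic the degree-2 Taylor approximation is exact, so one Newton step lands on the minimum. A sketch with a hypothetical loss $100(w - 2)^2$:

```python
def newton_1d(grad, hess, w, steps):
    # w <- w - L'(w) / L''(w): curvature-scaled step, no learning rate.
    for _ in range(steps):
        w = w - grad(w) / hess(w)
    return w

# Minimise L(w) = 100 * (w - 2)^2: quadratic, so a single step is exact.
w_star = newton_1d(lambda w: 200 * (w - 2), lambda w: 200.0, w=0.0, steps=1)
print(w_star)  # 2.0
```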
What is the multivariate Newton-Raphson update, and what does the Hessian capture?
$\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1} \nabla_{\mathbf{w}} \mathcal{L}$, where the Hessian $\mathbf{H}$ is the matrix of second-order partial derivatives, $H_{jk} = \partial^2 \mathcal{L} / \partial w_j \partial w_k$. It captures curvature along each axis and interactions between axes. Inverting it produces direction-aware step sizes that handle differential curvature automatically.
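One multivariate Newton step on the narrow-valley quadratic $\mathcal{L}(w_1, w_2) = w_1^2 + 100\,w_2^2$ (hypothetical example; the 2×2 inverse is written out by hand to stay dependency-free):

```python
def newton_step_2d(grad, H, w):
    # One Newton step: w <- w - H^{-1} grad(w), with a hand-inverted 2x2 Hessian.
    g1, g2 = grad(w)
    (a, b), (c, d) = H
    det = a * d - b * c
    # H^{-1} = (1/det) [[d, -b], [-c, a]]
    s1 = (d * g1 - b * g2) / det
    s2 = (-c * g1 + a * g2) / det
    return (w[0] - s1, w[1] - s2)

# L(w1, w2) = w1^2 + 100*w2^2: very different curvature per axis.
grad = lambda w: (2 * w[0], 200 * w[1])
H = ((2.0, 0.0), (0.0, 200.0))
print(newton_step_2d(grad, H, (1.0, 1.0)))  # (0.0, 0.0) in a single step
```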
What is Iteratively Reweighted Least Squares (IRLS)?
Newton-Raphson applied to logistic regression’s cross-entropy loss. The Hessian is $\mathbf{H} = \sum_i \hat{y}_i (1 - \hat{y}_i) \, \mathbf{x}_i \mathbf{x}_i^\top$. The “reweighting” weight $\hat{y}_i (1 - \hat{y}_i)$ is largest when $\hat{y}_i = 0.5$ (model uncertain) and small when $\hat{y}_i \approx 0$ or $\hat{y}_i \approx 1$ (model confident). Each iteration is a weighted least-squares step where the weights track current uncertainty.
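A minimal IRLS sketch for a 2-parameter (bias + one feature) model on hypothetical non-separable data — gradient $\sum_i (\hat{y}_i - y_i)\mathbf{x}_i$ and the weighted Hessian above, with the 2×2 solve written out by hand:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def irls(X, y, steps=10):
    # Newton-Raphson on cross-entropy: grad = sum (yhat - y) x,
    # Hessian H = sum yhat (1 - yhat) x x^T.
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        h = [[0.0, 0.0], [0.0, 0.0]]
        for x_i, y_i in zip(X, y):
            p = sigmoid(w[0] * x_i[0] + w[1] * x_i[1])
            s = p * (1 - p)  # reweighting weight: largest at p = 0.5
            for j in range(2):
                g[j] += (p - y_i) * x_i[j]
                for k in range(2):
                    h[j][k] += s * x_i[j] * x_i[k]
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        # w <- w - H^{-1} g (2x2 inverse written out)
        w = [w[0] - (h[1][1] * g[0] - h[0][1] * g[1]) / det,
             w[1] - (h[0][0] * g[1] - h[1][0] * g[0]) / det]
    return w

# Hypothetical data: x = (1, feature); labels overlap so the MLE is finite.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
y = [0, 1, 0, 1]
w = irls(X, y, steps=10)
print(w)  # [intercept, slope]; slope > 0 since labels trend upward in x
```

Note the data is deliberately not linearly separable: on separable data the MLE weights diverge and the Hessian becomes singular.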
Why is IRLS rarely used for deep neural networks despite needing no learning rate?
Two reasons:
- Cost: each iteration builds and inverts a $d \times d$ Hessian — $O(d^2)$ memory and $O(d^3)$ time for $d$ parameters. For millions of parameters, infeasible.
- Non-convexity: deep network losses are non-convex, so the local quadratic Taylor approximation can mislead. The Hessian may not be positive-definite, and the “Newton step” may point uphill.
Gradient descent (and variants like Adam) trades per-iteration progress for tractable per-iteration cost.
Convexity
What does it mean for a function to be convex, and why does convexity matter for optimisation?
A function $f$ is convex if for any two points $\mathbf{u}$ and $\mathbf{v}$ and any $\lambda \in [0, 1]$: $f(\lambda \mathbf{u} + (1 - \lambda) \mathbf{v}) \le \lambda f(\mathbf{u}) + (1 - \lambda) f(\mathbf{v})$. Convexity matters because it guarantees every local minimum is a global minimum — gradient descent (or any descent method) that converges to a stationary point converges to the optimum. No worries about getting stuck in suboptimal valleys.
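The inequality says the function lies on or below the chord between any two points; a quick numeric check on a hypothetical convex function $(w - 1)^2$:

```python
# Check f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v) on a grid of lambdas.
f = lambda w: (w - 1) ** 2  # convex

u, v = -3.0, 4.0
for k in range(11):
    lam = k / 10
    on_curve = f(lam * u + (1 - lam) * v)
    on_chord = lam * f(u) + (1 - lam) * f(v)
    assert on_curve <= on_chord + 1e-12
print("convexity inequality holds at all sampled lambdas")
```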
A loss has long, narrow elliptical contours, elongated along the $w_1$ axis. Where will gradient descent struggle, and why?
The narrow direction is $w_2$ (steep walls); the long direction is $w_1$ (gentle slope along the valley). Gradient descent zig-zags across $w_2$ while inching slowly along $w_1$. A single $\eta$ can’t simultaneously be aggressive enough in $w_1$ and conservative enough in $w_2$. Newton-Raphson would handle this automatically by using curvature-adapted step sizes.
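The zig-zag is visible in a few iterations on a hypothetical valley loss $\mathcal{L}(w_1, w_2) = w_1^2 + 100\,w_2^2$ (steep in $w_2$, gentle in $w_1$):

```python
# Gradient descent on L(w1, w2) = w1^2 + 100*w2^2: grad = (2*w1, 200*w2).
eta = 0.009  # just under the stability limit 1/100 for the steep direction
w1, w2 = 1.0, 1.0
traj = []
for _ in range(5):
    w1, w2 = w1 - eta * 2 * w1, w2 - eta * 200 * w2
    traj.append((w1, w2))
print(traj)
# w2 flips sign every step (zig-zag across the valley);
# w1 creeps toward 0 at ~1.8% per step.
```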