TARGET DECK MachineLearning::Week-02
Maximum Likelihood Estimation
What is the principle of maximum likelihood estimation (MLE)?
Pick the parameters for which the observed training data is most probable: $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} p(\mathcal{D} \mid \theta)$. Treat the data $\mathcal{D}$ as fixed; treat $\theta$ as the variable. Choose the $\theta$ that makes the data look most plausible under the model.
Why do we minimise the negative log-likelihood instead of maximising the likelihood directly?
Two reasons:
- Numerical: products of small probabilities underflow in floating-point arithmetic. The logarithm converts $\prod_i p(x_i \mid \theta)$ into $\sum_i \log p(x_i \mid \theta)$ — a sum of moderate-magnitude numbers.
- Convention: optimisers minimise. Negating gives a loss to minimise without changing the argmin ($\log$ is monotonic).
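A minimal sketch of the underflow point, using hypothetical per-example probabilities (1000 examples, each with probability 0.01):

```python
import math

# Hypothetical per-example probabilities: 1000 examples, each with p = 0.01.
probs = [0.01] * 1000

# Direct product: 10^-2000 is far below the smallest float64, so it underflows.
likelihood = math.prod(probs)

# Sum of logs stays a moderate-magnitude number.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # 0.0 (underflow)
print(log_likelihood)  # ≈ -4605.17
```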
What is the cross-entropy loss for logistic regression?
$\mathcal{L}(\mathbf{w}) = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$ where $\hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$. The two terms act as a switch: when $y_i = 1$, only $\log \hat{y}_i$ contributes; when $y_i = 0$, only $\log(1 - \hat{y}_i)$ contributes. Strictly convex in $\mathbf{w}$, so a unique global minimum exists.
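A direct transcription of the formula, on a hypothetical two-example dataset (stdlib only; `sigmoid`, `X`, `y` are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, X, y):
    # L(w) = -sum_i [ y_i log(yhat_i) + (1 - y_i) log(1 - yhat_i) ]
    total = 0.0
    for x_i, y_i in zip(X, y):
        y_hat = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total -= y_i * math.log(y_hat) + (1 - y_i) * math.log(1 - y_hat)
    return total

# Toy 2-feature data (hypothetical).
X = [(1.0, 2.0), (1.0, -1.0)]
y = [1, 0]

# At w = 0 every prediction is 0.5, so the loss is 2 log 2.
print(cross_entropy((0.0, 0.0), X, y))  # ≈ 1.386
```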
A logistic regression model assigns $\hat{y} \approx 1$ to a true-label-1 example. What does it contribute to the loss? What if $\hat{y} \approx 0$ instead?
When $y = 1$, the contribution is $-\log \hat{y}$.
- $\hat{y} \approx 1$: $-\log \hat{y} \approx 0$ — almost zero (correct, confident).
- $\hat{y} \approx 0$: $-\log \hat{y}$ is large — heavy penalty (wrong, confident).
Cross-entropy heavily punishes confidently-wrong predictions; the loss diverges to $+\infty$ as $\hat{y} \to 0$ for the true class.
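The asymmetry is easy to see numerically (illustrative $\hat{y}$ values, not from the original card):

```python
import math

# Per-example loss for a true-label-1 example is -log(yhat).
for y_hat in (0.99, 0.5, 0.01):
    print(f"{y_hat}: {-math.log(y_hat):.2f}")
# 0.99: 0.01  (confident and correct: near-zero loss)
# 0.5:  0.69
# 0.01: 4.61  (confident and wrong: large penalty; -> infinity as yhat -> 0)
```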
Gradient Descent
What is the gradient descent update rule, and what is the role of $\eta$?
$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w})$
- $\nabla_{\mathbf{w}} \mathcal{L}$ is the gradient — the direction of steepest increase.
- The negative gradient $-\nabla_{\mathbf{w}} \mathcal{L}$ points downhill.
- $\eta$ is the learning rate — the step-size hyperparameter.
Each step moves a distance proportional to $\eta \, \|\nabla_{\mathbf{w}} \mathcal{L}\|$ in the direction that locally reduces $\mathcal{L}$ fastest.
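The update rule as a loop, minimising a hypothetical 1-D quadratic $(w - 3)^2$ (example of mine, not from the card):

```python
def gradient_descent(grad, w, eta, steps):
    # Repeated update: w <- w - eta * grad(w)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Minimise L(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w=0.0, eta=0.1, steps=100)
print(w_star)  # ≈ 3.0
```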
What is the gradient of the cross-entropy loss for logistic regression?
$\nabla_{\mathbf{w}} \mathcal{L} = \sum_i (\hat{y}_i - y_i) \, \mathbf{x}_i$ — the prediction error $(\hat{y}_i - y_i)$ scales each input vector. Beautifully simple: when the prediction matches the label, that example contributes nothing to the gradient.
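The same formula in code, on the same kind of hypothetical toy data as above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_gradient(w, X, y):
    # grad = sum_i (yhat_i - y_i) * x_i : the prediction error scales each input.
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        y_hat = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        for j, x_ij in enumerate(x_i):
            grad[j] += (y_hat - y_i) * x_ij
    return grad

X = [(1.0, 2.0), (1.0, -1.0)]
y = [1, 0]

# At w = 0 both predictions are 0.5, so the errors are -0.5 and +0.5.
print(ce_gradient([0.0, 0.0], X, y))  # [0.0, -1.5]
```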
What goes wrong when the gradient descent learning rate is too large or too small?
- $\eta$ too large: the step overshoots the minimum, possibly landing on the opposite slope at higher loss. Iterations bounce around or diverge.
- $\eta$ too small: each step makes negligible progress; convergence is technically guaranteed but might take thousands of iterations to do what a moderate $\eta$ does in tens.
No single $\eta$ is universally correct — it requires tuning, and the right value depends on the loss landscape’s curvature.
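All three regimes on one hypothetical loss $\mathcal{L}(w) = w^2$ (gradient $2w$), starting from $w = 1$:

```python
def run_gd(eta, steps=50):
    # Minimise L(w) = w^2 (gradient 2w), starting from w = 1.
    w = 1.0
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(run_gd(eta=1.1))   # |w| grows each step: divergence
print(run_gd(eta=1e-4))  # barely moves from 1.0: glacial progress
print(run_gd(eta=0.1))   # ≈ 0: converges quickly
```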
What is differential curvature and why does it cause gradient descent to zig-zag?
When the loss surface curves at different rates along different axes — e.g. $\mathcal{L}(w_1, w_2) = w_1^2 + 100\,w_2^2$ — the gradient is much larger in the steep direction than the gentle one. A single $\eta$ has to handle both: small enough to avoid overshoot in the steep direction, but then progress is glacial in the gentle direction. The trajectory zig-zags across the narrow valley while creeping toward the optimum.
Newton-Raphson and IRLS
What is the Newton-Raphson update rule (1D), and why does it not need a learning rate?
$w \leftarrow w - \dfrac{\mathcal{L}'(w)}{\mathcal{L}''(w)}$
The first derivative gives the direction; the second derivative scales the step. Where curvature is high ($\mathcal{L}''(w)$ large), steps shrink automatically; where it’s low, steps grow. The step size is adapted to local curvature — no $\eta$ to tune. The rule comes from minimising the degree-2 Taylor approximation of $\mathcal{L}$ exactly.
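On a quadratic the degree-2 Taylor approximation is exact, so one Newton step lands on the minimum. A sketch with a hypothetical loss $100(w - 2)^2$:

```python
def newton_1d(grad, hess, w, steps):
    # w <- w - L'(w) / L''(w): curvature-scaled step, no learning rate.
    for _ in range(steps):
        w = w - grad(w) / hess(w)
    return w

# Minimise L(w) = 100 * (w - 2)^2: quadratic, so a single step is exact.
w_star = newton_1d(lambda w: 200 * (w - 2), lambda w: 200.0, w=0.0, steps=1)
print(w_star)  # 2.0
```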
What is the multivariate Newton-Raphson update, and what does the Hessian capture?
$\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1} \nabla_{\mathbf{w}} \mathcal{L}$, where the Hessian $\mathbf{H}$ is the matrix of second-order partial derivatives, $H_{jk} = \partial^2 \mathcal{L} / \partial w_j \partial w_k$. It captures curvature along each axis and interactions between axes. Inverting it produces direction-aware step sizes that handle differential curvature automatically.
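One multivariate Newton step on the narrow-valley quadratic $\mathcal{L}(w_1, w_2) = w_1^2 + 100\,w_2^2$ (hypothetical example; the 2×2 inverse is written out by hand to stay dependency-free):

```python
def newton_step_2d(grad, H, w):
    # One Newton step: w <- w - H^{-1} grad(w), with a hand-inverted 2x2 Hessian.
    g1, g2 = grad(w)
    (a, b), (c, d) = H
    det = a * d - b * c
    # H^{-1} = (1/det) [[d, -b], [-c, a]]
    s1 = (d * g1 - b * g2) / det
    s2 = (-c * g1 + a * g2) / det
    return (w[0] - s1, w[1] - s2)

# L(w1, w2) = w1^2 + 100*w2^2: very different curvature per axis.
grad = lambda w: (2 * w[0], 200 * w[1])
H = ((2.0, 0.0), (0.0, 200.0))
print(newton_step_2d(grad, H, (1.0, 1.0)))  # (0.0, 0.0) in a single step
```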
What is Iteratively Reweighted Least Squares (IRLS)?
Newton-Raphson applied to logistic regression’s cross-entropy loss. The Hessian is $\mathbf{H} = \sum_i \hat{y}_i (1 - \hat{y}_i) \, \mathbf{x}_i \mathbf{x}_i^\top$. The “reweighting” weight $\hat{y}_i (1 - \hat{y}_i)$ is largest when $\hat{y}_i = 0.5$ (model uncertain) and small when $\hat{y}_i \approx 0$ or $\hat{y}_i \approx 1$ (model confident). Each iteration is a weighted least-squares step where the weights track current uncertainty.
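A minimal IRLS sketch for a 2-parameter (bias + one feature) model on hypothetical non-separable data — gradient $\sum_i (\hat{y}_i - y_i)\mathbf{x}_i$ and the weighted Hessian above, with the 2×2 solve written out by hand:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def irls(X, y, steps=10):
    # Newton-Raphson on cross-entropy: grad = sum (yhat - y) x,
    # Hessian H = sum yhat (1 - yhat) x x^T.
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        h = [[0.0, 0.0], [0.0, 0.0]]
        for x_i, y_i in zip(X, y):
            p = sigmoid(w[0] * x_i[0] + w[1] * x_i[1])
            s = p * (1 - p)  # reweighting weight: largest at p = 0.5
            for j in range(2):
                g[j] += (p - y_i) * x_i[j]
                for k in range(2):
                    h[j][k] += s * x_i[j] * x_i[k]
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        # w <- w - H^{-1} g (2x2 inverse written out)
        w = [w[0] - (h[1][1] * g[0] - h[0][1] * g[1]) / det,
             w[1] - (h[0][0] * g[1] - h[1][0] * g[0]) / det]
    return w

# Hypothetical data: x = (1, feature); labels overlap so the MLE is finite.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
y = [0, 1, 0, 1]
w = irls(X, y, steps=10)
print(w)  # [intercept, slope]; slope > 0 since labels trend upward in x
```

Note the data is deliberately not linearly separable: on separable data the MLE weights diverge and the Hessian becomes singular.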
Why is IRLS rarely used for deep neural networks despite needing no learning rate?
Two reasons:
- Cost: each iteration builds and inverts a $d \times d$ Hessian — $O(d^2)$ memory and $O(d^3)$ time for $d$ parameters. For millions of parameters, infeasible.
- Non-convexity: deep network losses are non-convex, so the local quadratic Taylor approximation can mislead. The Hessian may not be positive-definite, and the “Newton step” may point uphill.
Gradient descent (and variants like Adam) trades per-iteration progress for tractable per-iteration cost.
Convexity
What does it mean for a function to be convex, and why does convexity matter for optimisation?
A function $f$ is convex if for any two points $\mathbf{u}$ and $\mathbf{v}$ and any $\lambda \in [0, 1]$: $f(\lambda \mathbf{u} + (1 - \lambda) \mathbf{v}) \le \lambda f(\mathbf{u}) + (1 - \lambda) f(\mathbf{v})$. Convexity matters because it guarantees every local minimum is a global minimum — gradient descent (or any descent method) that converges to a stationary point converges to the optimum. No worries about getting stuck in suboptimal valleys.
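The inequality says the function lies on or below the chord between any two points; a quick numeric check on a hypothetical convex function $(w - 1)^2$:

```python
# Check f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v) on a grid of lambdas.
f = lambda w: (w - 1) ** 2  # convex

u, v = -3.0, 4.0
for k in range(11):
    lam = k / 10
    on_curve = f(lam * u + (1 - lam) * v)
    on_chord = lam * f(u) + (1 - lam) * f(v)
    assert on_curve <= on_chord + 1e-12
print("convexity inequality holds at all sampled lambdas")
```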
A loss has long, narrow elliptical contours, elongated along the $w_1$ axis. Where will gradient descent struggle, and why?
The narrow direction is $w_2$ (steep walls); the long direction is $w_1$ (gentle slope along the valley). Gradient descent zig-zags across $w_2$ while inching slowly along $w_1$. A single $\eta$ can’t simultaneously be aggressive enough in $w_1$ and conservative enough in $w_2$. Newton-Raphson would handle this automatically by using curvature-adapted step sizes.
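The zig-zag is visible in a few iterations on a hypothetical valley loss $\mathcal{L}(w_1, w_2) = w_1^2 + 100\,w_2^2$ (steep in $w_2$, gentle in $w_1$):

```python
# Gradient descent on L(w1, w2) = w1^2 + 100*w2^2: grad = (2*w1, 200*w2).
eta = 0.009  # just under the stability limit 1/100 for the steep direction
w1, w2 = 1.0, 1.0
traj = []
for _ in range(5):
    w1, w2 = w1 - eta * 2 * w1, w2 - eta * 200 * w2
    traj.append((w1, w2))
print(traj)
# w2 flips sign every step (zig-zag across the valley);
# w1 creeps toward 0 at ~1.8% per step.
```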