TARGET DECK MachineLearning::Week-06
Cross-Cutting Synthesis
What do logistic regression and hard-margin SVM have in common, and how do they differ?
Common: same hypothesis class, a linear function of $\mathbf{x}$ (or $\phi(\mathbf{x})$): $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$. Both produce linear decision boundaries. Different: the fitting criterion.
- Logistic regression: minimise cross-entropy (probabilistic / MLE-derived); every training point contributes to the gradient.
- Hard-margin SVM: maximise margin (geometric); only support vectors determine the boundary.
Same hypothesis class, different objectives: this leads to different solutions even on the same data.
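A minimal sketch of this (assuming scikit-learn is available; the data and `C` values are illustrative) fits both models on the same separable data and prints the different $(\mathbf{w}, b)$ each objective picks:

```python
# Same hypothesis class (w^T x + b), different fitting criteria -> different w, b.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

log_reg = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)  # ~unregularised cross-entropy
svm = SVC(kernel="linear", C=1e6).fit(X, y)                    # large C ~ hard margin

print("logistic  w, b:", log_reg.coef_, log_reg.intercept_)
print("SVM       w, b:", svm.coef_, svm.intercept_)
```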
Compare the three optimisation algorithms covered: gradient descent, Newton-Raphson/IRLS, and SMO. When do you use each?
| Algorithm | Used for | Per-iteration cost | Iterations |
| --- | --- | --- | --- |
| Gradient descent | Logistic regression, neural networks | $O(nd)$ gradient pass | Many; needs a step size |
| Newton-Raphson / IRLS | Logistic regression (small $d$) | $O(nd^2 + d^3)$ Hessian inverse | Few; no step size |
| SMO | SVM dual (soft- or hard-margin) | Analytic pair update | Many but cheap |

GD scales with parameters; IRLS uses curvature but doesn't scale to large $d$; SMO exploits the dual's structure (equality + box constraints).
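A rough sketch of the first two rows (numpy only; the data, step size, and iteration counts are illustrative) shows many cheap gradient steps versus a handful of Hessian solves reaching the same optimum:

```python
# Gradient descent vs Newton-Raphson/IRLS on the logistic-regression cross-entropy.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])   # bias column
w_true = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Gradient descent: O(nd) per iteration, needs a step size, many iterations.
w_gd = np.zeros(d)
for _ in range(5000):
    grad = X.T @ (sigmoid(X @ w_gd) - y) / n
    w_gd -= 0.5 * grad

# Newton-Raphson / IRLS: O(nd^2 + d^3) per iteration (Hessian solve), few iterations.
w_nr = np.zeros(d)
for _ in range(10):
    p = sigmoid(X @ w_nr)
    grad = X.T @ (p - y)
    H = X.T @ (X * (p * (1 - p))[:, None])   # Hessian of the cross-entropy
    w_nr -= np.linalg.solve(H, grad)

print("GD:    ", np.round(w_gd, 3))
print("Newton:", np.round(w_nr, 3))
```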
For an SVM with $n$ examples and a Gaussian kernel, why is the dual preferred over the primal?
The Gaussian kernel implies an infinite-dimensional $\phi(\mathbf{x})$, so the primal would have infinitely many variables. The dual has only $n$ variables (one per training point), and the kernel trick computes $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ in $O(d)$ regardless of $\dim \phi$. The dual depends only on inner products, so $\phi(\mathbf{x})$ is never explicitly formed.
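A small illustration (numpy only; the data and $\gamma$ are arbitrary) of working purely with the $n \times n$ Gram matrix, never forming $\phi$:

```python
# Each Gram-matrix entry costs O(d), however large dim(phi) is (here infinite).
import numpy as np

rng = np.random.default_rng(2)
n, d, gamma = 100, 5, 0.5
X = rng.normal(size=(n, d))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2
K = np.exp(-gamma * sq_dists)                               # n x n Gram matrix

print(K.shape)   # (100, 100): one dual variable per training point, no phi needed
```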
Why does the kernel trick work, and what does Mercer's condition guarantee?
The kernel trick replaces $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$ everywhere with $k(\mathbf{x}, \mathbf{x}')$, sidestepping any explicit construction of $\phi$. Mercer's condition (symmetric + positive-semidefinite Gram matrix) guarantees that some embedding $\phi$ exists such that $k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$, even if you can't write it down. PSD also keeps the SVM dual a convex QP.
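A numerical sanity check of Mercer's condition (numpy; the data are arbitrary): a Gaussian Gram matrix has non-negative eigenvalues, while a made-up similarity need not:

```python
# Gaussian kernel Gram matrix: symmetric PSD. A non-kernel "similarity": not PSD.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)   # squared pairwise distances

K_gauss = np.exp(-0.5 * sq)                     # valid kernel
K_bad = -sq                                     # zero trace, so it must have a negative eigenvalue

print(np.linalg.eigvalsh(K_gauss).min())   # >= 0 up to numerical noise
print(np.linalg.eigvalsh(K_bad).min())     # strictly negative -> fails Mercer
```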
How does basis expansion change the optimisation problem for logistic regression and SVM?
Procedurally: substitute $\mathbf{x} \mapsto \phi(\mathbf{x})$ everywhere; the loss/objective, gradient, Hessian, and update rules are otherwise unchanged. Functionally: the model is still linear in $\mathbf{w}$, so convexity is preserved. The decision boundary becomes non-linear in the original $\mathbf{x}$-space (e.g. a circle for a quadratic $\phi$).
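A sketch of basis expansion in practice (assuming scikit-learn; the circular toy data are illustrative): expand to degree-2 features and run ordinary logistic regression unchanged:

```python
# Basis-expand x -> (x1, x2, x1^2, x1*x2, x2^2); the solver is unchanged and still
# convex in w, but the boundary is a circle in the original x-space.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)   # label = inside the unit circle

Phi = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(Phi, y)  # same algorithm, new features

print(clf.score(Phi, y))   # near 1.0: linear in phi(x), non-linear in x
```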
What's the role of complementary slackness in producing SVM sparsity?
KKT complementary slackness: $\alpha_i g_i(\mathbf{w}^\star) = 0$ for every constraint $g_i(\mathbf{w}) \le 0$. For the SVM, $g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) \le 0$. So either:
- $\alpha_i = 0$ and $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 1$: the point lies strictly outside the margin and contributes nothing.
- $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$: the point lies exactly on the margin (a support vector); $\alpha_i > 0$ is allowed.
Only support vectors enter $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$. Sparsity is structural, not heuristic.
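This sparsity is easy to see empirically; a sketch assuming scikit-learn (data are illustrative), where `dual_coef_` holds only the non-zero $\alpha_i y_i$:

```python
# Complementary slackness in action: only a handful of points get alpha_i > 0.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(len(svm.support_), "support vectors out of", len(X))
print(svm.dual_coef_)                          # only the non-zero alpha_i * y_i
```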
What does strong duality require for the SVM, and what does it buy us?
Strong duality requires:
- Convex primal (yes — quadratic objective, linear constraints)
- Slater’s condition: a strictly feasible point exists (yes for separable data; always for soft-margin, since large enough slacks $\xi_i$ give strict feasibility)
When it holds, $p^\star = d^\star$: solving the dual is exactly equivalent to solving the primal. We get all the benefits (kernel trick, smaller variable count) without any loss of optimality.
What's the relationship between the hard-margin and soft-margin SVM dual?
Identical except for one constraint:
- Hard-margin: $\alpha_i \ge 0$
- Soft-margin: $0 \le \alpha_i \le C$, the box constraint
Slack variables $\xi_i$ and their multipliers vanish in the dual derivation; the entire effect of slack collapses to the upper bound $C$. The kernel trick, KKT conditions, and prediction formulas are otherwise unchanged.
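A sketch of the box constraint's effect (assuming scikit-learn; the `C` values and data are illustrative): every $\alpha_i$ is capped at $C$, and smaller $C$ typically yields more support vectors:

```python
# Soft-margin dual: same problem as hard-margin except 0 <= alpha_i <= C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])  # overlapping classes
y = np.array([-1] * 100 + [1] * 100)

for C in (0.1, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    alphas = np.abs(svm.dual_coef_).ravel()   # |alpha_i * y_i| = alpha_i
    print(f"C={C}: {len(svm.support_)} SVs, max alpha = {alphas.max():.3f} <= C")
```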
Conceptual Foundations
What is i.i.d. and why do all generalisation arguments rely on it?
Independent and identically distributed: training and test examples are independent draws from the same fixed joint distribution $P(\mathbf{x}, y)$. Generalisation guarantees say things like “training error close to test error with high probability”; they require the test data to follow the same distribution as the training data. Distribution shift breaks every such guarantee.
When is logistic regression's loss strictly convex, and why does that matter?
The cross-entropy loss is strictly convex in $\mathbf{w}$ (assuming the design matrix $X$ has full column rank). It matters because every local minimum is a global minimum; gradient descent converging to a stationary point converges to the optimum. No worries about getting stuck in suboptimal valleys.
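A quick numerical check (numpy; the data and $\mathbf{w}$ are arbitrary): the cross-entropy Hessian $X^\top S X$ with $S = \mathrm{diag}(p_i(1-p_i))$ has strictly positive eigenvalues when $X$ has full column rank:

```python
# Strict convexity check: all eigenvalues of the Hessian are positive.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))     # full column rank with probability 1
w = rng.normal(size=3)

p = 1 / (1 + np.exp(-X @ w))
H = X.T @ (X * (p * (1 - p))[:, None])   # Hessian of the cross-entropy at w

print(np.linalg.eigvalsh(H))             # all strictly positive
```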
The polynomial kernel corresponds to what implicit feature space?
All monomials of degree $p$ in the components of $\mathbf{x}$. For $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^p$, the implicit feature space has $\binom{d+p-1}{p}$ dimensions. The kernel computes in $O(d)$ regardless of how large that becomes; e.g., $d = 100$ inputs need only about 100 multiplications per kernel evaluation, even though the number of degree-$p$ monomials is astronomically large.
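A sketch of the cost gap (numpy; $d$ and $p$ here are arbitrary choices, not the lecture's numbers): one kernel evaluation versus the count of implicit monomial features:

```python
# One polynomial-kernel evaluation: ~d multiplications plus a power,
# regardless of how many monomials the implicit feature space contains.
import numpy as np
from math import comb

d, p = 100, 5
x, z = np.random.default_rng(8).normal(size=(2, d))

k = (x @ z) ** p                   # cost ~ d multiplications + one power
n_monomials = comb(d + p - 1, p)   # distinct degree-p monomials in d variables

print(k, n_monomials)              # tiny cost, huge implicit dimension
```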
Why does maximising the margin give better generalisation than just any separating hyperplane?
Geometrically, a wide margin means small perturbations to a training point (or any new test point) won’t flip its class; the boundary is robust. Theoretically (formalised in week 9), restricting to fat hyperplanes of margin $\gamma$ shrinks the effective VC dimension to roughly $R^2/\gamma^2$ (with $R$ the radius of the data), independent of input dimension. This explains why high-dimensional kernel SVMs generalise even with infinite-dimensional $\phi$.