TARGET DECK MachineLearning::Week-03
Non-linear Transformation
What is a non-linear transformation (basis expansion), and what does it allow a linear model to do?
A pre-processing map $\phi : \mathbb{R}^d \to \mathbb{R}^D$ (typically with $D > d$) that lifts each input $x$ into a higher-dimensional feature space. A linear model in the lifted space corresponds to a non-linear decision boundary in the original space: circles, polynomials, anything determined by the choice of $\phi$. The model is still linear in the weights $w$, so optimisation machinery and convexity are preserved.
For input $x = (x_1, x_2)$, what is the polynomial basis expansion of degree 2?
All monomials of total degree $\le 2$: $\phi(x) = (1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2)$. The cross term $x_1 x_2$ is essential; without it, the model can only express boundaries that are sums of separate single-variable functions of $x_1$ and $x_2$ (no rotated ellipses, for example).
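A minimal numpy sketch of this expansion (the helper name `expand_deg2` is just for illustration); scikit-learn's `PolynomialFeatures(degree=2)` produces the same set of monomials for arbitrary input dimension:

```python
import numpy as np

def expand_deg2(x):
    """Map (x1, x2) to all monomials of total degree <= 2."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

print(expand_deg2(np.array([2.0, 3.0])))  # [1. 2. 3. 4. 9. 6.]
```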
Why does basis expansion not break the convexity of logistic regression's loss?
Convexity of the cross-entropy loss is a property of how the loss depends on $w$, not of the inputs $x$. After basis expansion, the loss becomes $\ell(w) = -\sum_i \big[\, y_i \log \sigma(w^\top \phi(x_i)) + (1 - y_i) \log\big(1 - \sigma(w^\top \phi(x_i))\big) \,\big]$: still linear in $w$ inside the sigmoid, still the same convex composition. The new inputs $\phi(x_i)$ are constants from the optimiser's perspective. Linearity in $w$ is preserved; nothing in optimisation changes.
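A sketch of that claim in code, using a hand-rolled cross-entropy loss and a generic optimiser on toy data (the degree-2 lift is inlined so the block is self-contained). The lifted features enter only as a fixed matrix `Phi`; the loss is optimised over $w$ exactly as without the expansion:

```python
import numpy as np
from scipy.optimize import minimize

def cross_entropy(w, Phi, y):
    """Logistic-regression loss; the rows of Phi are fixed lifted inputs."""
    p = 1.0 / (1.0 + np.exp(-(Phi @ w)))   # sigmoid of a function linear in w
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[0.5, 1.0], [2.0, 2.0], [-1.0, 0.5], [1.5, -2.0]])
y = np.array([0, 1, 0, 1])
Phi = np.array([[1, x1, x2, x1**2, x2**2, x1 * x2] for x1, x2 in X])  # constants to the optimiser
result = minimize(cross_entropy, np.zeros(Phi.shape[1]), args=(Phi, y))
print(result.fun)   # converges to a low loss; convexity in w means no spurious local minima
```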
What two costs do you pay for using a high-degree polynomial basis expansion?
- Dimensionality: the number of monomials of total degree $\le k$ in $d$ variables grows as $\binom{d + k}{k} = O(d^k)$. With 100 inputs and $k = 3$, $\phi(x)$ has more than 170,000 components: heavy on memory, computation, and especially sample complexity (see the count check after this list).
- Overfitting: more capacity means the model can fit training noise. A degree-4 polynomial often has enough capacity to classify a modest training set perfectly while generalising worse than a linear classifier.
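A quick check of the dimensionality count, using the closed form $\binom{d+k}{k}$ for the number of monomials of total degree at most $k$ in $d$ variables (constant term included):

```python
from math import comb

def n_monomials(d, k):
    """Monomials of total degree <= k in d variables, constant term included."""
    return comb(d + k, k)

print(n_monomials(2, 2))     # 6: matches the degree-2 expansion above
print(n_monomials(100, 3))   # 176851: the "more than 170,000 components" figure
```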
Margin and SVM
What is the margin of a separating hyperplane?
The perpendicular distance from the decision boundary to the closest training example: $\gamma = \min_i \dfrac{y_i (w^\top x_i + b)}{\lVert w \rVert}$. A wider margin means more buffer between the boundary and the data; small perturbations are less likely to flip a label. The quantity SVMs maximise.
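A small numpy sketch of that distance, with a hypothetical hyperplane and toy points:

```python
import numpy as np

def margin(w, b, X, y):
    """Smallest signed perpendicular distance from the boundary w.x + b = 0 to the data."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(w, b, X, y))   # ~1.414: the distance of (-1,-1) from the line x1 + x2 = 0
```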
What is a support vector, and why are non-support-vectors irrelevant to the SVM solution?
A support vector is a training point sitting exactly on the margin envelope, i.e. with $y_i (w^\top x_i + b) = 1$. These are the points closest to the decision boundary in each class. If you deleted every non-support-vector and re-trained, you'd get the same hyperplane; only the support vectors actively constrain the optimum, while every other point satisfies its constraint with slack and contributes nothing.
Write the canonical hard-margin SVM primal problem.
$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 \ \ \text{for all } i.$$
The objective minimises $\lVert w \rVert^2$ (which maximises the margin $1/\lVert w \rVert$); each constraint requires every training point to lie on the correct side of the margin envelope. A convex quadratic program.
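A sketch of solving this QP directly, assuming cvxpy is installed and the toy data is linearly separable:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (w^T x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()
print(w.value, b.value, 1.0 / np.linalg.norm(w.value))  # weights, bias, margin
```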
Why does maximising margin reduce to minimising ?
Apparent paradox: bigger weights should give bigger margin, right? No. After canonical rescaling (where the closest training point has $y_i (w^\top x_i + b) = 1$), its perpendicular distance to the boundary is exactly $1 / \lVert w \rVert$. To make that distance large, $\lVert w \rVert$ must be small. Larger weights mean a steeper linear function, which can only achieve the value $1$ closer to the boundary, corresponding to a smaller margin.
What is quadratic programming (QP), and why is the SVM primal a QP?
A QP minimises a convex quadratic objective subject to linear inequality (and possibly equality) constraints. The SVM primal $\min_{w,b} \tfrac{1}{2}\lVert w \rVert^2$ s.t. $y_i (w^\top x_i + b) \ge 1$ matches: quadratic objective (in $w$), linear inequality constraints (in $w$ and $b$). Off-the-shelf QP solvers handle this in polynomial time, and convexity guarantees a unique global optimum.
What is the canonical representation of a separating hyperplane?
The scaling of $(w, b)$ where the closest training point satisfies $y_i (w^\top x_i + b) = 1$. Used because the geometric hyperplane is invariant to multiplying $(w, b)$ by any $c > 0$; the canonical scaling fixes that ambiguity by demanding the closest point hits the value $1$. Under this normalisation the margin is $1 / \lVert w \rVert$ and the constraint becomes $y_i (w^\top x_i + b) \ge 1$ for all $i$.
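A numpy sketch of the rescaling: divide $(w, b)$ by the value the closest point achieves, so that point then hits exactly 1:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([3.0, 3.0]), 0.0            # some arbitrary positive scaling of the boundary

scale = np.min(y * (X @ w + b))             # value achieved by the closest point (here 6)
w_c, b_c = w / scale, b / scale             # canonical representation
print(np.min(y * (X @ w_c + b_c)))          # 1.0: closest point sits on the margin envelope
print(1.0 / np.linalg.norm(w_c))            # ~1.414: the margin under canonical scaling
```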
Comparison
A degree-4 polynomial classifier perfectly classifies all training data; a linear classifier misclassifies a few. Which is likely to generalise better, and why?
Often the linear classifier. Training accuracy is not the goal — generalisation is. The degree-4 polynomial has enough capacity to memorise training noise, drawing wildly contorted boundaries to capture every point; those contortions are unlikely to reflect the true underlying structure. Capacity must be matched to data — a principle formalised later by VC dimension and bias–variance analysis.
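A small experiment sketch, assuming scikit-learn; the data has a genuinely linear boundary plus label noise, and the exact scores depend on the seed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # true boundary is linear
flip = rng.random(len(y)) < 0.15
y[flip] = 1 - y[flip]                       # 15% label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression(C=1e6, max_iter=10000).fit(X_tr, y_tr)
poly4 = make_pipeline(PolynomialFeatures(degree=4),
                      LogisticRegression(C=1e6, max_iter=10000)).fit(X_tr, y_tr)

for name, model in [("linear", linear), ("degree-4", poly4)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
# The higher-capacity model typically shows the larger gap between train and test accuracy.
```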
What's the structural difference between SVM and logistic regression in how they use training data?
- Logistic regression: every training point contributes to the gradient at every iteration; gradient touches all examples.
- SVM: only support vectors (typically a tiny fraction) determine the hyperplane; non-SVs could be deleted without changing the optimum.
SVMs achieve a kind of data efficiency by structurally identifying which examples actually matter. Logistic regression has no such notion of “irrelevant” training points.
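A sketch of that structural claim with scikit-learn's linear-kernel SVC (large C approximates a hard margin; exact numbers depend on the generated blobs):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
svm = SVC(kernel="linear", C=1e6).fit(X, y)
print(len(svm.support_), "support vectors out of", len(X))

# Keep only the support vectors and refit: same hyperplane up to numerical tolerance.
svm_sv = SVC(kernel="linear", C=1e6).fit(X[svm.support_], y[svm.support_])
print(svm.coef_, svm.intercept_)
print(svm_sv.coef_, svm_sv.intercept_)
```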
After basis expansion, what is the SVM primal in $\phi$-space?
Same form, $\min_{w,\,b} \tfrac{1}{2}\lVert w \rVert^2$ s.t. $y_i (w^\top \phi(x_i) + b) \ge 1$, but now $w$ lives in $\phi$-space. The decision boundary is non-linear in the original space (e.g., a circle from the embedding $\phi(x) = (x_1, x_2, x_1^2 + x_2^2)$). The QP machinery is unchanged; the practical concern is that $\dim \phi$ may be huge, motivating the dual formulation and kernels in week 4.
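A sketch of the circle example, assuming scikit-learn: an inner disc versus an outer ring is not linearly separable in the original two dimensions, but a linear SVM on the lift $(x_1, x_2, x_1^2 + x_2^2)$ separates it perfectly:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])  # radii
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)          # inner disc vs outer ring

phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2])   # the lift
clf = SVC(kernel="linear", C=1e6).fit(phi, y)
print(clf.score(phi, y))   # 1.0: a hyperplane in phi-space is a circle in x-space
```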