TARGET DECK MachineLearning::Week-01
Supervised Learning Framework
What are the five components of the supervised learning framework?
A learning problem is specified by:
- Unknown target distribution $P(x, y)$ — what we’re trying to learn
- Training data $D = \{(x_i, y_i)\}_{i=1}^{n}$ — drawn i.i.d. from a fixed joint $P(x, y)$
- Hypothesis set $\mathcal{H}$ — the family of candidate functions
- Learning algorithm $\mathcal{A}$ — the procedure for picking one
- Final hypothesis — the chosen function, $\hat{f} \in \mathcal{H}$
Every algorithm in the module differs only in $\mathcal{H}$ and $\mathcal{A}$.
What is the i.i.d. assumption and why does it matter for supervised learning?
Independent and identically distributed: training and test examples are drawn independently from the same joint $P(x, y)$. It matters because all generalisation guarantees rest on it — if the deployment distribution differs from training (distribution shift), there is no theoretical reason a learned model should perform well, regardless of how good training error looked.
What is generalisation, and why is training accuracy alone not enough?
Generalisation is the ability to perform well on unseen examples drawn from the same $P(x, y)$. A model that memorises the training set achieves zero training error but may be useless on new data — that’s a lookup table, not learning. Generalisation is the actual goal; training accuracy is a (sometimes misleading) proxy.
Logistic Regression
Why can't we just model $P(y = 1 \mid x)$ with a linear function $w^\top x$ directly?
Because $w^\top x$ ranges over $(-\infty, \infty)$ but a probability must lie in $[0, 1]$. The linear combination $w^\top x$ has no built-in constraint preventing values like $-0.4$ or $1.3$, neither of which is a valid probability.
What is the logit function and why is it useful for logistic regression?
The logit maps a probability to the real line: $\operatorname{logit}(p) = \ln\frac{p}{1-p}$; $p \in (0, 1)$ but $\operatorname{logit}(p) \in (-\infty, \infty)$. Modelling $\operatorname{logit}(P(y=1 \mid x)) = w^\top x$ puts both sides on the same scale. Inverting gives the sigmoid: $P(y=1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$.
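The logit/sigmoid round trip is easy to check numerically. A minimal stdlib-only sketch (the helper names `logit` and `sigmoid` are just illustrative):

```python
import math

def logit(p):
    # Map a probability in (0, 1) to the real line.
    return math.log(p / (1 - p))

def sigmoid(z):
    # Inverse of the logit: map any real score back to (0, 1).
    return 1 / (1 + math.exp(-z))

# Round trip: sigmoid(logit(p)) recovers p for any p in (0, 1).
for p in (0.1, 0.5, 0.9):
    print(p, sigmoid(logit(p)))
```

Note that `logit(0.5) == 0`: a probability of one half corresponds to a score of exactly zero, which is why the decision boundary sits at $w^\top x = 0$.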
What is the formula for the sigmoid function and what are its key properties?
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$
- Range: $(0, 1)$ — outputs are valid probabilities
- $\sigma(0) = 0.5$ — maximum uncertainty at the boundary
- Monotonically increasing, smooth, S-shaped
- Saturates: $\sigma(z) \to 1$ as $z \to +\infty$ and $\sigma(z) \to 0$ as $z \to -\infty$
Used to convert linear scores into class-1 probabilities.
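Each listed property can be verified directly. A quick check, where `sigmoid` is an assumed straightforward implementation of the formula above:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

assert sigmoid(0) == 0.5                    # maximum uncertainty at the boundary
assert 0 < sigmoid(-5) < sigmoid(5) < 1     # outputs stay in (0, 1), monotone
assert sigmoid(10) > 0.9999                 # saturates towards 1 for large positive z
assert sigmoid(-10) < 0.0001                # saturates towards 0 for large negative z
```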
For a logistic regression model with $w = (2, -1)$, $b = 0.1$ and input $x = (1.5, 1)$, what is the predicted class and confidence?
Compute the linear score: $z = w^\top x + b = 2(1.5) + (-1)(1) + 0.1 = 2.1$. Since $z > 0$, predict class 1. Confidence: $\sigma(2.1) = \frac{1}{1 + e^{-2.1}} \approx 0.89$ — about 89%.
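The same arithmetic as a runnable check; the weights $(2, -1)$, bias $0.1$, and input $(1.5, 1)$ are hypothetical illustrative values:

```python
import math

w, b = [2.0, -1.0], 0.1    # hypothetical weights and bias
x = [1.5, 1.0]             # hypothetical input

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear score
p = 1 / (1 + math.exp(-z))                     # sigmoid -> P(y = 1 | x)

print(z, round(p, 2))   # score 2.1, p ≈ 0.89, so predict class 1
```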
How do you interpret a logistic regression weight in terms of odds?
A unit increase in feature $x_j$ multiplies the odds of class 1 by $e^{w_j}$: since $\ln \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = w^\top x$, adding 1 to $x_j$ adds $w_j$ to the log-odds. If $w_j = 0.5$, a one-unit bump in $x_j$ raises the odds by a factor of $e^{0.5} \approx 1.65$ — a 65% increase. This is why logistic regression is popular in social/natural sciences: the coefficients are directly interpretable.
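A numerical sanity check of the odds interpretation; the coefficient `w_j = 0.5` and the baseline score `z = 0.2` are arbitrary illustrative choices:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def odds(p):
    # Odds of class 1: P(y=1) / P(y=0).
    return p / (1 - p)

w_j = 0.5   # hypothetical coefficient on feature x_j
z = 0.2     # hypothetical score before the bump

before = odds(sigmoid(z))
after = odds(sigmoid(z + w_j))   # bump x_j by one unit: score rises by w_j

print(after / before, math.exp(w_j))   # both ≈ 1.65: odds multiply by e^{w_j}
```

The ratio is exactly $e^{w_j}$ regardless of the baseline score, which is what makes the coefficient interpretation clean.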
Decision Boundary
What is the decision boundary of a logistic regression classifier?
The hyperplane $w^\top x = 0$. Points where $w^\top x > 0$ are predicted as class 1 (since $\sigma(w^\top x) > 0.5$); points where $w^\top x < 0$ are class 0. The further a point is from the boundary, the more confident the prediction.
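A sketch of how distance from the boundary maps to confidence, using hypothetical weights $w = (1, -1)$ so the boundary is the line $x_1 = x_2$:

```python
import math

w = [1.0, -1.0]   # hypothetical weights; decision boundary is x1 = x2

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1 / (1 + math.exp(-z))          # confidence in class 1
    return (1 if z > 0 else 0), p

print(predict([3.0, 0.0]))   # far from boundary: class 1, p near 1
print(predict([0.1, 0.0]))   # near boundary: class 1, p barely above 0.5
print(predict([0.0, 3.0]))   # other side: class 0, p near 0
```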
Why is logistic regression called a linear model when the sigmoid is non-linear?
“Linear” refers to linearity in the parameters $w$, not in the inputs $x$. The decision boundary $w^\top x = 0$ is a hyperplane (linear in $x$), and the cross-entropy loss is convex in $w$. The sigmoid is a fixed non-linearity wrapping a linear-in-$w$ score.
Discriminative vs Generative
What is the difference between discriminative and generative classifiers?
- Discriminative: model $P(y \mid x)$ directly. Doesn’t ask how features are distributed; just learns the decision boundary. Logistic regression, SVM, neural networks.
- Generative: model $P(x \mid y)$ and $P(y)$, then derive $P(y \mid x)$ via Bayes’ rule. Asks “what does a class-1 input look like?” Naïve Bayes, Gaussian discriminant analysis.
Discriminative is generally better for raw classification accuracy; generative supplies more information (sampling, missing-feature handling) at the cost of stronger modelling assumptions.
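A toy example of the generative route: the 1-D Gaussian class-conditionals and the equal prior below are invented for illustration, but the Bayes'-rule step is the general recipe:

```python
import math

def gauss(x, mu, sigma):
    # Gaussian density: a hypothetical model of P(x | y) for each class.
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

prior1 = 0.5                                   # assumed prior P(y = 1)
p_x_given_1 = gauss(1.2, mu=1.0, sigma=1.0)    # class-1 inputs cluster near +1
p_x_given_0 = gauss(1.2, mu=-1.0, sigma=1.0)   # class-0 inputs cluster near -1

# Bayes' rule: P(y=1 | x) = P(x | y=1) P(y=1) / P(x)
posterior1 = p_x_given_1 * prior1 / (p_x_given_1 * prior1 + p_x_given_0 * (1 - prior1))
print(posterior1)   # ≈ 0.92: x = 1.2 is much more likely under class 1
```

Because the class-conditionals are modelled explicitly, the same machinery can also sample new inputs or handle a missing feature, which a purely discriminative model cannot.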
A model gets 100% training accuracy but 55% test accuracy. What has gone wrong?
The model has overfitted: it memorised the training data instead of learning generalisable structure. Likely cause: $\mathcal{H}$ is too expressive relative to the amount of training data, letting the algorithm fit noise. Remedies (covered later): regularisation, more data, simpler model class, validation-based hyperparameter selection.