A learning paradigm where a model is trained on input–output pairs and evaluated on its ability to predict outputs for unseen inputs.
Definition
Supervised learning is the task of learning a function $g: \mathcal{X} \to \mathcal{Y}$ from a set of training examples $\{(x_1, y_1), \dots, (x_N, y_N)\}$, where each $x_i$ is an input and $y_i$ is the corresponding output (label). The goal is for $g$ to approximate an unknown target function $f$ well enough to generalize to new data drawn from the same distribution.
The Learning Framework
Every supervised learning problem has five components:
- Unknown target function $f: \mathcal{X} \to \mathcal{Y}$ — the true relationship between inputs and outputs. We never see $f$ directly; we only observe noisy samples from it.
- Training data $\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ — a finite set of pairs drawn i.i.d. from an unknown joint distribution $P(x, y)$.
- Hypothesis set $\mathcal{H}$ — the family of candidate functions the algorithm is allowed to consider (e.g., all linear functions, all polynomials of degree $\le d$). This is the assumption we bring to the table about what $f$ might look like.
- Learning algorithm $\mathcal{A}$ — the procedure that searches $\mathcal{H}$ for the hypothesis that best fits the training data.
- Final hypothesis $g$ — the output of $\mathcal{A}$; our best approximation of $f$.
The choice of $\mathcal{H}$ is critical. Too small and $f$ may lie outside it; too large and the algorithm may overfit. This tension runs through the entire module.
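The five components can be seen end to end in a minimal sketch (a hypothetical toy problem using NumPy least squares; the linear target and noise level are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown target function f (we only ever see noisy samples of it).
def f(x):
    return 2.0 * x + 1.0

# Training data D: N input-output pairs with observation noise.
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
y = f(x) + rng.normal(0.0, 0.1, size=N)

# Hypothesis set H: all linear functions h(x) = w * x + b.
# Learning algorithm A: least squares searches H for the best fit.
X = np.column_stack([x, np.ones(N)])
(w, b), *_ = np.linalg.lstsq(X, y, rcond=None)

# Final hypothesis g: our approximation of f.
def g(x_new):
    return w * x_new + b

print(w, b)  # should land close to the true values 2.0 and 1.0
```

Because $\mathcal{H}$ here contains the true $f$, the final hypothesis recovers it up to noise; the interesting cases are when it does not.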
Input and Output Spaces
The input space $\mathcal{X}$ is typically $d$-dimensional, i.e., $\mathcal{X} \subseteq \mathbb{R}^d$. Each dimension (feature) can be:
- Numeric — e.g., age, salary (already real-valued).
- Ordinal — e.g., expertise (ordered categories, often mapped to numbers like $1 < 2 < 3$).
- Categorical — e.g., car brand (no natural ordering). Typically encoded via one-hot encoding: each category becomes a binary dimension.
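One-hot encoding can be sketched by hand in a few lines (plain Python; the brand values are illustrative):

```python
# One-hot encode a categorical "car brand" feature.
brands = ["Fiat", "VW", "Toyota", "VW", "Fiat"]

# Fix a category order, then map each value to a binary indicator vector.
categories = sorted(set(brands))
encoded = [[1 if b == c else 0 for c in categories] for b in brands]

print(categories)  # ['Fiat', 'Toyota', 'VW']
print(encoded[0])  # Fiat -> [1, 0, 0]
```

Each category gets its own dimension, so no spurious ordering is imposed on the model's inputs.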
The output space $\mathcal{Y}$ determines the task type:
- Regression: $\mathcal{Y} = \mathbb{R}$ (predict a continuous value, e.g., house price).
- Classification: $\mathcal{Y}$ is a finite set of categories. Binary classification ($\mathcal{Y} = \{-1, +1\}$) is the most common; multi-class ($|\mathcal{Y}| > 2$) extends naturally.
Noisy Targets
In practice the training data rarely comes from a deterministic $f$. Instead, $y$ is drawn from a conditional distribution $P(y \mid x)$, meaning the same input can map to different outputs. This noise is not a bug — it reflects genuine uncertainty in the real world. The target distribution $P(y \mid x)$ subsumes the deterministic case (where $P(y \mid x)$ is a point mass on $f(x)$).
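A small simulation makes this concrete (a hypothetical noisy target, assuming NumPy): querying the same input repeatedly yields different outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical target distribution P(y | x):
# mean sin(2*pi*x), plus Gaussian observation noise.
def sample_y(x, noise_std=0.3):
    return np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std)

# Five draws at the SAME input x = 0.25 give five different outputs,
# scattered around the deterministic part sin(pi/2) = 1.0.
samples = [sample_y(0.25) for _ in range(5)]
print(samples)
```

Setting `noise_std=0` would recover the deterministic special case, where every draw at $x$ returns the same value.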
Other Learning Paradigms
Supervised learning is one of three major paradigms:
| Paradigm | Data | Goal |
|---|---|---|
| Supervised | Labeled pairs $(x_i, y_i)$ | Learn $f: x \mapsto y$ |
| Unsupervised | Inputs $x_i$ only (no labels) | Find structure (clusters, density) |
| Reinforcement | States, actions, rewards | Learn policy that maximizes cumulative reward |

Related
- logistic-regression — first supervised classification algorithm in this module
- generalization — the property that separates learning from memorization
- decision boundary — geometric view of classification hypotheses
Active Recall
Name the five components of the supervised learning framework and explain what each one contributes.
(1) Unknown target function $f$ — the true mapping we want to approximate. (2) Training data $\mathcal{D}$ — the finite sample of input–output pairs we observe. (3) Hypothesis set $\mathcal{H}$ — the family of functions the algorithm is allowed to search. (4) Learning algorithm $\mathcal{A}$ — the procedure that picks the best hypothesis from $\mathcal{H}$. (5) Final hypothesis $g$ — the learned approximation of $f$.
Why does the choice of hypothesis set matter? What goes wrong if it is too small or too large?
If $\mathcal{H}$ is too small, the true function $f$ may not be representable within it, so no amount of data will produce a good approximation (underfitting). If $\mathcal{H}$ is too large, the algorithm can fit the training noise and fail on unseen data (overfitting). The art is choosing $\mathcal{H}$ large enough to contain a good approximation of $f$ but small enough that the algorithm can reliably find it from limited data.
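This trade-off can be demonstrated numerically (a sketch using NumPy's `polyfit`, with an assumed quadratic target and nested polynomial hypothesis sets):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of a quadratic target f(x) = x^2.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = x_train**2 + rng.normal(0.0, 0.1, size=10)
x_test = np.linspace(-1.0, 1.0, 100)
y_test = x_test**2  # noise-free test targets

train_err, test_err = {}, {}
for degree in (0, 2, 9):
    # H_degree = polynomials of degree <= degree; polyfit plays the
    # role of the learning algorithm A (least squares over H_degree).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err[degree], test_err[degree])

# Training error shrinks as the hypothesis set grows; degree 9 (10 points,
# 10 coefficients) interpolates the noise exactly. Test error is typically
# best near the true degree 2.
```

Degree 0 underfits (constant model, $f$ not in $\mathcal{H}$); degree 9 overfits (zero training error, fitted noise); degree 2 matches the target's complexity.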
A dataset has a "car brand" feature with values {Fiat, VW, Toyota}. Explain why you cannot feed these directly into a model that expects numeric inputs, and describe the standard fix.
The categories have no natural numeric ordering — assigning Fiat = 1, VW = 2, Toyota = 3 would falsely imply that Toyota is “greater than” Fiat. One-hot encoding creates a separate binary dimension per category (e.g., Fiat $\to (1, 0, 0)$, VW $\to (0, 1, 0)$, Toyota $\to (0, 0, 1)$), preserving the fact that categories are unordered.
What does it mean for the target to be a distribution $P(y \mid x)$ rather than a deterministic function $f$?
It means the same input can produce different outputs on different occasions — there is inherent noise or uncertainty in the data. The deterministic case is a special case where $P(y \mid x)$ is a point mass on a single value $f(x)$. In practice, the learning algorithm must cope with this noise rather than trying to fit every training point exactly.