A learning paradigm where a model is trained on input–output pairs and evaluated on its ability to predict outputs for unseen inputs.
Definition
Supervised learning is the task of learning a function $g: \mathcal{X} \to \mathcal{Y}$ from a set of training examples $\{(x_1, y_1), \dots, (x_N, y_N)\}$, where each $x_i$ is an input and $y_i$ is the corresponding output (label). The goal is for $g$ to approximate an unknown target function $f$ well enough to generalize to new data drawn from the same distribution.
The Learning Framework
Every supervised learning problem has five components:
- Unknown target function $f: \mathcal{X} \to \mathcal{Y}$ — the true relationship between inputs and outputs. We never see $f$ directly; we only observe noisy samples from it.
- Training data $\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ — a finite set of pairs drawn i.i.d. from an unknown joint distribution $P(x, y)$.
- Hypothesis set $\mathcal{H}$ — the family of candidate functions the algorithm is allowed to consider (e.g., all linear functions, all polynomials of degree $\le d$). This is the assumption we bring to the table about what $f$ might look like.
- Learning algorithm $\mathcal{A}$ — the procedure that searches $\mathcal{H}$ for the hypothesis that best fits the training data.
- Final hypothesis $g$ — the output of $\mathcal{A}$; our best approximation of $f$.
The choice of $\mathcal{H}$ is critical. Too small and $f$ may lie outside it; too large and the algorithm may overfit. This tension runs through the entire module.
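The five components can be seen end to end in a minimal sketch (a hypothetical toy problem using NumPy least squares; the linear target and noise level are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown target function f (we only ever see noisy samples of it).
def f(x):
    return 2.0 * x + 1.0

# Training data D: N input-output pairs with observation noise.
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
y = f(x) + rng.normal(0.0, 0.1, size=N)

# Hypothesis set H: all linear functions h(x) = w * x + b.
# Learning algorithm A: least squares searches H for the best fit.
X = np.column_stack([x, np.ones(N)])
(w, b), *_ = np.linalg.lstsq(X, y, rcond=None)

# Final hypothesis g: our approximation of f.
def g(x_new):
    return w * x_new + b

print(w, b)  # should land close to the true values 2.0 and 1.0
```

Because $\mathcal{H}$ here contains the true $f$, the final hypothesis recovers it up to noise; the interesting cases are when it does not.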
Input and Output Spaces
The input space $\mathcal{X}$ is typically $d$-dimensional, i.e., $\mathcal{X} \subseteq \mathbb{R}^d$. Each dimension (feature) can be:
- Numeric — e.g., age, salary (already real-valued).
- Ordinal — e.g., expertise (ordered categories, often mapped to numbers like $1 < 2 < 3$).
- Categorical — e.g., car brand (no natural ordering). Typically encoded via one-hot encoding: each category becomes a binary dimension.
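One-hot encoding can be sketched by hand in a few lines (plain Python; the brand values are illustrative):

```python
# One-hot encode a categorical "car brand" feature.
brands = ["Fiat", "VW", "Toyota", "VW", "Fiat"]

# Fix a category order, then map each value to a binary indicator vector.
categories = sorted(set(brands))
encoded = [[1 if b == c else 0 for c in categories] for b in brands]

print(categories)  # ['Fiat', 'Toyota', 'VW']
print(encoded[0])  # Fiat -> [1, 0, 0]
```

Each category gets its own dimension, so no spurious ordering is imposed on the model's inputs.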
The output space $\mathcal{Y}$ determines the task type:
- Regression: $\mathcal{Y} = \mathbb{R}$ (predict a continuous value, e.g., house price).
- Classification: $\mathcal{Y}$ is a finite set of categories. Binary classification ($\mathcal{Y} = \{-1, +1\}$) is the most common; multi-class ($|\mathcal{Y}| > 2$) extends naturally.
Noisy Targets
In practice the training data rarely comes from a deterministic $f$. Instead, $y$ is drawn from a conditional distribution $P(y \mid x)$, meaning the same input can map to different outputs. This noise is not a bug — it reflects genuine uncertainty in the real world. The target distribution $P(y \mid x)$ subsumes the deterministic case (where $P(y \mid x)$ is a point mass on $f(x)$).
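A small simulation makes this concrete (a hypothetical noisy target, assuming NumPy): querying the same input repeatedly yields different outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical target distribution P(y | x):
# mean sin(2*pi*x), plus Gaussian observation noise.
def sample_y(x, noise_std=0.3):
    return np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std)

# Five draws at the SAME input x = 0.25 give five different outputs,
# scattered around the deterministic part sin(pi/2) = 1.0.
samples = [sample_y(0.25) for _ in range(5)]
print(samples)
```

Setting `noise_std=0` would recover the deterministic special case, where every draw at $x$ returns the same value.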
Other Learning Paradigms
Supervised learning is one of three major paradigms:
| Paradigm | Data | Goal |
|---|---|---|
| Supervised | Labeled pairs $(x_i, y_i)$ | Learn $f: x \mapsto y$ |
| Unsupervised | Inputs $x_i$ only (no labels) | Find structure (clusters, density) |
| Reinforcement | States, actions, rewards | Learn policy that maximizes cumulative reward |

Related
- logistic-regression — first supervised classification algorithm in this module
- generalization — the property that separates learning from memorization
- decision boundary — geometric view of classification hypotheses
Active Recall
Name the five components of the supervised learning framework and explain what each one contributes.
(1) Unknown target function $f$ — the true mapping we want to approximate. (2) Training data $\mathcal{D}$ — the finite sample of input–output pairs we observe. (3) Hypothesis set $\mathcal{H}$ — the family of functions the algorithm is allowed to search. (4) Learning algorithm $\mathcal{A}$ — the procedure that picks the best hypothesis from $\mathcal{H}$. (5) Final hypothesis $g$ — the learned approximation of $f$.
Why does the choice of hypothesis set matter? What goes wrong if it is too small or too large?
If $\mathcal{H}$ is too small, the true function $f$ may not be representable within it, so no amount of data will produce a good approximation (underfitting). If $\mathcal{H}$ is too large, the algorithm can fit the training noise and fail on unseen data (overfitting). The art is choosing $\mathcal{H}$ large enough to contain a good approximation of $f$ but small enough that the algorithm can reliably find it from limited data.
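This trade-off can be demonstrated numerically (a sketch using NumPy's `polyfit`, with an assumed quadratic target and nested polynomial hypothesis sets):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of a quadratic target f(x) = x^2.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = x_train**2 + rng.normal(0.0, 0.1, size=10)
x_test = np.linspace(-1.0, 1.0, 100)
y_test = x_test**2  # noise-free test targets

train_err, test_err = {}, {}
for degree in (0, 2, 9):
    # H_degree = polynomials of degree <= degree; polyfit plays the
    # role of the learning algorithm A (least squares over H_degree).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err[degree], test_err[degree])

# Training error shrinks as the hypothesis set grows; degree 9 (10 points,
# 10 coefficients) interpolates the noise exactly. Test error is typically
# best near the true degree 2.
```

Degree 0 underfits (constant model, $f$ not in $\mathcal{H}$); degree 9 overfits (zero training error, fitted noise); degree 2 matches the target's complexity.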
A dataset has a "car brand" feature with values {Fiat, VW, Toyota}. Explain why you cannot feed these directly into a model that expects numeric inputs, and describe the standard fix.
The categories have no natural numeric ordering — assigning Fiat = 1, VW = 2, Toyota = 3 would falsely imply that Toyota is “greater than” Fiat. One-hot encoding creates a separate binary dimension per category (e.g., Fiat $\to (1, 0, 0)$, VW $\to (0, 1, 0)$, Toyota $\to (0, 0, 1)$), preserving the fact that categories are unordered.
What does it mean for the target to be a distribution $P(y \mid x)$ rather than a deterministic function $f$?
It means the same input can produce different outputs on different occasions — there is inherent noise or uncertainty in the data. The deterministic case is a special case where $P(y \mid x)$ is a point mass on a single value $f(x)$. In practice, the learning algorithm must cope with this noise rather than trying to fit every training point exactly.