Three meta-principles that govern how learning experiments should be designed and interpreted, beyond the specific algorithms. Occam’s Razor: among hypotheses that fit, prefer the simplest. Sampling Bias: if training data is sampled from a distribution that differs from the deployment distribution, no learning theorem rescues you. Data Snooping: any data that has affected any step of the learning process — including preprocessing, feature design, hyperparameter choice, or model selection — has lost its ability to fairly assess the result.

1. Occam’s Razor

Entities must not be multiplied beyond necessity. — William of Occam (1287–1347)

An explanation of the data should be made as simple as possible, but no simpler. — Albert Einstein (probably; widely attributed)

The Occam’s Razor principle in machine learning: the simplest model that fits the data is also the most plausible. Two questions to make this precise:

What does “simple” mean?

  • Hypothesis: simple means small — few parameters, a short description.
  • Hypothesis set: simple means small — small VC dimension, few hypotheses.

These two are related: a small $|\mathcal{H}|$ implies simple hypotheses. A hypothesis drawn from a low-complexity set is automatically a low-complexity hypothesis: if $\mathcal{H}$ contains only $2^{\ell}$ hypotheses, each one can be specified with $\ell$ bits, so it has a short description. The two senses of simplicity are aligned, even though they’re conceptually distinct.

Why is simpler better?

“Better” doesn’t mean “more elegant” — it means better out-of-sample performance. A simpler model has a tighter generalisation bound (a smaller $d_{\mathrm{vc}}$, hence a smaller complexity term in the VC bound), so when training error is comparable, simpler wins on test error.
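For reference, the VC generalisation bound in the form used in Learning From Data; a simpler hypothesis set has a smaller growth function $m_{\mathcal{H}}$, and therefore a smaller penalty term:

```latex
% VC generalisation bound: with probability at least 1 - delta, the
% out-of-sample error of the learned hypothesis g is bounded by its
% in-sample error plus a complexity penalty that grows with the growth
% function m_H(2N) (and hence with d_vc) and shrinks with the sample size N.
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g)
  \;+\; \sqrt{\frac{8}{N}\,\ln\!\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}
```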

The practical rule: start linear, then ask whether the data is being over-modelled. Don’t add flexibility unless the simpler model is structurally inadequate. This pairs with regularisation — even within a complex hypothesis set, the regulariser pushes the optimiser toward “effectively simpler” hypotheses.
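A minimal sketch of the “start linear” habit, using scikit-learn on synthetic, roughly linear data; the data-generating process, the degree 10, and the ridge penalty are illustrative assumptions, not part of the original notes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 1))        # small sample, where excess capacity hurts most
y = 1.5 * X[:, 0] + 0.2 * rng.normal(size=30)   # roughly linear target plus noise (an assumption)

models = {
    "linear": LinearRegression(),
    "degree-10 polynomial": make_pipeline(PolynomialFeatures(10), LinearRegression()),
    "degree-10 + ridge": make_pipeline(PolynomialFeatures(10), Ridge(alpha=1.0)),
}

for name, model in models.items():
    # scikit-learn reports negated MSE; flip the sign so lower is better
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name:>22}: cross-validated MSE = {mse:.3f}")

# Typically the plain linear fit and the regularised polynomial score similarly,
# while the unregularised degree-10 polynomial overfits the 30-point sample.
```

The ridge variant illustrates the last point: a complex hypothesis set plus a regulariser behaves like an effectively simpler model.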

COMMON MISCONCEPTION — "Simple" means "fewer features"

Not always. A 2-feature degree-10 polynomial has more capacity than a 100-feature linear model. Complexity is about the hypothesis set’s expressivity, captured by VC dimension, not the input dimension. A linear classifier in $\mathbb{R}^{100}$ can be simpler than a degree-3 polynomial in $\mathbb{R}^{10}$.

2. Sampling Bias

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

The technical statement: if training data is drawn from a distribution $P_{\text{train}}$ but deployment happens under a different distribution $P_{\text{deploy}}$, the VC bound fails — its proof assumes training and test points are drawn i.i.d. from the same distribution.

The philosophical statement: studying hard for a maths exam and then being tested on English gives no guarantee about your score. The mismatch between what you trained on and what you’re evaluated on is fatal — no clever algorithm rescues a model trained on the wrong distribution.

Examples in Practice

  • Survey bias: a poll of land-line phone owners predicts presidential elections poorly when many voters use only mobiles.
  • Survivorship bias: a model of “successful startups” trained only on companies that survived ignores the companies that failed.
  • Class imbalance: training on 99% benign, 1% malignant means the model can achieve 99% accuracy by always predicting “benign” — but generalises terribly when the true rate matters.
  • Temporal drift: a model trained on 2020 data deployed in 2025 sees a shifted input distribution.

What Can You Do About It?

If $P_{\text{train}} \neq P_{\text{deploy}}$ and you can’t re-collect data:

  • Re-weight training examples to match the deployment distribution (importance weighting; see the sketch after this list).
  • Distributionally robust training: optimise for worst-case performance across a family of distributions.
  • Domain adaptation: explicit transfer-learning techniques.
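A minimal sketch of the re-weighting idea in a toy 1-D setting where both densities are known exactly; in practice the density ratio itself must be estimated, which is the hard part:

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy shift: training inputs come from N(0, 1), deployment inputs from N(1, 1).
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 1))
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=500) > 0.5).astype(int)

# Importance weights w(x) = p_deploy(x) / p_train(x), known exactly in this toy case.
w = norm.pdf(X_train[:, 0], loc=1.0) / norm.pdf(X_train[:, 0], loc=0.0)

# Most scikit-learn estimators accept per-example weights through sample_weight,
# so the weighted training loss approximates the loss under the deployment distribution.
clf = LogisticRegression().fit(X_train, y_train, sample_weight=w)
```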

But these are partial fixes. The cleanest approach is prevention: collect training data from the actual deployment distribution. If the distribution will shift, plan to retrain.

3. Data Snooping

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

The strongest of the three principles. Any influence — preprocessing choice, feature engineering, model selection, hyperparameter tuning — counts as “the data affecting a step”. Once snooped, the data is no longer independent of the trained model, and no test-set guarantee survives.

Examples of Data Snooping

  • Normalising the entire dataset (training + test) before splitting: the test set’s mean and variance have leaked into the training preprocessing (see the sketch after this list).
  • Looking at the test set’s labels to design features: those features are tuned to whatever pattern existed in the test set.
  • Trying many models, reporting the best test error: the test set was used as a selection criterion. The reported number is biased downward.
  • Reading a paper, then “discovering” a feature engineering trick that turns out to help on the same benchmark: research-community-level snooping.
  • Cross-validating to pick a model, then reporting the best CV error: the CV error is the selection score, not an unbiased estimate of $E_{\text{out}}$ for the chosen model. (You’d need a fresh test set for that.)
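A minimal sketch of the first example, contrasting leaky and clean normalisation with scikit-learn; the synthetic data and StandardScaler are illustrative — any preprocessing step fit to the data has the same issue:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# SNOOPED: the scaler is fit on all 200 rows, so test-set statistics leak
# into the preprocessing that the training pipeline will use.
leaky_scaler = StandardScaler().fit(X)
X_tr_leaky, X_te_leaky = leaky_scaler.transform(X_tr), leaky_scaler.transform(X_te)

# CLEAN: fit the scaler on the training split only, then apply it to both splits.
clean_scaler = StandardScaler().fit(X_tr)
X_tr_clean, X_te_clean = clean_scaler.transform(X_tr), clean_scaler.transform(X_te)

# On i.i.d. data the numerical gap is small; with temporal or distributional
# structure (as in the trading example below) the leak can flip the conclusion.
```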

The Discipline

The cleanest workflow (a code sketch follows the list):

  1. Split into $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ before doing anything.
  2. Lock $\mathcal{D}_{\text{test}}$ in the safe — never look at it, never normalise with it, never preprocess with it.
  3. Use $\mathcal{D}_{\text{train}}$ for everything: feature design, model fitting, hyperparameter tuning (via further internal validation/cross-validation splits).
  4. One evaluation on $\mathcal{D}_{\text{test}}$ at the end. If you peek and decide to “try one more thing,” you’ve burned it — the test error is no longer a clean estimate of $E_{\text{out}}$.
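A minimal sketch of steps 1–4 with scikit-learn; the model, hyperparameter grid, and split sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Step 1: split first, before any preprocessing or modelling decisions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 2-3: everything that touches data lives inside a pipeline fit on the
# training split only; hyperparameters are chosen by internal cross-validation.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

# Step 4: a single evaluation on the locked-away test set, reported once.
print("final test accuracy:", search.score(X_te, y_te))
```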

The lecture’s warning, made concrete: in a financial trading example, “snooping” (using statistics from the test period to normalise the data before training) produces an apparently profitable strategy, while the same strategy with proper training-only normalisation loses money. The model didn’t learn signal — it learned the test set’s distribution.

ASIDE — Why data snooping is the hardest principle to follow

Sampling bias is hard to detect but at least conceptually clear (“get representative data”). Occam’s Razor is a default (“start simple”). Data snooping is insidious: every paper, every prior experience, every glance at exploratory data analysis is a potential snoop. In the strictest sense, the moment you saw that the dataset has 1,000 features rather than 10, your hypothesis class is “informed” by the data. Practitioners aim for minimal snooping, not zero.

How the Three Interact

What each violation does to $E_{\text{out}}$:

  • Occam’s Razor violated (model too complex): overfitting; the VC bound is loose.
  • Sampling Bias violated (wrong distribution): $E_{\text{out}}$ under the deployment distribution is much worse than $E_{\text{out}}$ under the training distribution; the bound doesn’t apply.
  • Data Snooping violated (test set used): the reported $E_{\text{out}}$ estimate is optimistic; the true $E_{\text{out}}$ is worse.

All three corrupt the chain of inference from training to deployment. They’re separately deadly and jointly catastrophic.

  • generalization-bound — the formal theorem whose assumptions all three principles correspond to: bounded complexity (Occam), i.i.d. sampling (no sampling bias), independent test set (no snooping).
  • validation — disciplined data hygiene relies on it.
  • overfitting — Occam’s Razor’s failure mode.
  • regularization — Occam’s Razor mechanically enforced.

Active Recall