THE CRUX: Last week ended with two open threads. (1) Linear regression's MLE/OLS is a point estimate: what if we want uncertainty over $\mathbf{w}$, or a principled justification for regularisation? (2) We've been assuming throughout that "training error close to test error" is reasonable, but on what grounds? When does learning provably generalise, and how much data do we need?

The two halves of week 8 each answer one. (1) Bayesian linear regression places a Gaussian prior on $\mathbf{w}$ and updates to a Gaussian posterior via Bayes' law. The posterior mean is the MAP estimate, which is exactly ridge regression, so L2 regularisation is the negative log of a Gaussian prior. Regularisation isn't a heuristic; it's a probabilistic statement about what weights are plausible. (2) Hoeffding's inequality gives the foundational concentration bound: for a fixed hypothesis $h$, training error and test error differ by more than $\epsilon$ with probability at most $2e^{-2\epsilon^2 N}$. Combined with a union bound over a hypothesis set of size $M$, this becomes the generalisation bound: $E_{\text{out}} \le E_{\text{in}} + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ with confidence $1-\delta$. Two opposing forces (large $M$ helps fit, small $M$ helps generalise) formalise the bias-variance trade-off.


Part 1: Bayesian Regression

The Underdetermined Case

Linear regression with $y = w_0 + w_1 x$ requires at least two $(x, y)$ pairs to identify the slope and intercept. With one data point there are infinitely many solutions: every line through that point fits perfectly. OLS doesn't help; the system is underdetermined.

The Bayesian fix: declare a prior on which $\mathbf{w}$'s are plausible before seeing the data. A natural choice is "small weights are more likely than large ones", formalised as a Gaussian prior centred at zero. Now the posterior is a distribution over all admissible lines through the point: not a single answer, but a quantified set of possibilities.

This is more than a hack for tiny datasets. It generalises: even when OLS would work, the Bayesian view supplies model uncertainty, principled regularisation, and a graceful way to update beliefs as data arrives.

The Setup

Likelihood (same as standard linear regression, with noise precision $\beta$):

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \beta^{-1})$$

The prior is a zero-mean Gaussian with isotropic precision $\alpha$:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$$

Posterior by Bayes' law:

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})$$

Why the Math Closes — Conjugacy

A Gaussian likelihood combined with a Gaussian prior gives a Gaussian posterior. This is the canonical example of conjugacy: the posterior stays in the same family as the prior, so there are no integrals to approximate and no MCMC. Everything is in closed form.

Taking the log:

$$\ln p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = -\frac{\beta}{2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 - \frac{\alpha}{2}\, \mathbf{w}^\top \mathbf{w} + \text{const}$$

This is quadratic in $\mathbf{w}$, so completing the square gives a Gaussian:

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

with:

$$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\mathbf{X}^\top\mathbf{X}, \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N\mathbf{X}^\top\mathbf{y}$$
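In NumPy the update is a few lines (a minimal sketch, not the lecture's code; `X` is assumed to be the $N \times D$ design matrix with a bias column, `alpha` and `beta` the prior and noise precisions):

```python
import numpy as np

def posterior(X, y, alpha, beta):
    """Gaussian posterior N(w | m_N, S_N) for Bayesian linear regression."""
    D = X.shape[1]
    S_N_inv = alpha * np.eye(D) + beta * X.T @ X   # S_N^{-1} = alpha I + beta X^T X
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ (X.T @ y)                   # m_N = beta S_N X^T y
    return m_N, S_N
```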

MAP = Ridge Regression

Maximising the log-posterior (the MAP estimate) is the same as minimising:

$$\sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \frac{\alpha}{\beta}\,\|\mathbf{w}\|^2$$

The first term is the OLS loss. The second is an L2 penalty on $\mathbf{w}$ with regularisation coefficient $\lambda = \alpha/\beta$. This is exactly ridge regression.
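A quick numerical check that the two coincide (a sketch with synthetic data; the values of `alpha` and `beta` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, 20)])  # bias + one feature
y = -0.3 + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(20)

alpha, beta = 2.0, 25.0
lam = alpha / beta

# Ridge solution: (lam I + X^T X)^{-1} X^T y
w_ridge = np.linalg.solve(lam * np.eye(2) + X.T @ X, X.T @ y)

# Posterior mean: m_N = beta S_N X^T y, with S_N^{-1} = alpha I + beta X^T X
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ (X.T @ y)

assert np.allclose(w_ridge, m_N)  # the MAP estimate is the ridge solution
```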

Regularisation is a prior in disguise

The L2 penalty is the negative log of a zero-mean Gaussian prior. Maximising the posterior = minimising "negative log-likelihood + negative log-prior" = minimising "OLS loss + L2 penalty". The Bayesian view explains why L2 regularisation works: it's a probabilistic statement that small weights are a priori more plausible than large ones.

Other regularisers correspond to other priors: L1 (lasso) ↔ Laplace prior; mixtures correspond to elastic net; structured priors give group lasso, etc.
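For example, a zero-mean Laplace prior $p(w_j) = \frac{1}{2b} e^{-|w_j|/b}$ on each weight has negative log

$$-\ln p(\mathbf{w}) = \frac{1}{b}\sum_j |w_j| + \text{const} = \frac{1}{b}\,\|\mathbf{w}\|_1 + \text{const},$$

so MAP estimation under a Laplace prior minimises "OLS loss + L1 penalty", i.e. the lasso.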

Posterior Behaviour as Data Grows

A sequence of plots (lecture slides 16–17) traces what happens to the posterior over $\mathbf{w}$ as $N$ grows from 1 to 1000:

  • $N = 1$: posterior is broad, barely tighter than the prior.
  • A few points: a clear elongated ellipse appears, confirming the slope-intercept correlation.
  • Tens of points: posterior shrinks to a small region.
  • $N = 1000$: a tiny spot, almost a point estimate.

Equivalently: sample 10 lines from the posterior at each $N$ and plot them. With small $N$ the lines fan out in many directions; with large $N$ they're nearly indistinguishable. Model uncertainty shrinks as data accumulates.
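This is easy to reproduce: draw weight vectors from $\mathcal{N}(\mathbf{m}_N, \mathbf{S}_N)$ and plot each one as a line (a self-contained sketch, with the same `alpha`/`beta` conventions as above):

```python
import numpy as np

def sample_lines(X, y, alpha=2.0, beta=25.0, n_samples=10, seed=1):
    """Draw n_samples weight vectors from the posterior; each row is one line."""
    rng = np.random.default_rng(seed)
    S_N = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
    m_N = beta * S_N @ (X.T @ y)
    return rng.multivariate_normal(m_N, S_N, size=n_samples)  # rows: (intercept, slope)
```

With one row in `X` the sampled (intercept, slope) pairs scatter widely; with hundreds of rows they cluster tightly around the OLS solution.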

Predictive Distribution

A non-Bayesian model gives a single prediction $\hat{y}$ for each new input $\mathbf{x}$. Bayesian regression integrates over all plausible $\mathbf{w}$'s:

$$p(y \mid \mathbf{x}, \mathbf{X}, \mathbf{y}) = \int p(y \mid \mathbf{x}, \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \, d\mathbf{w}$$

The result is Gaussian: mean $\mathbf{m}_N^\top \mathbf{x}$, variance $\sigma_N^2(\mathbf{x}) = \beta^{-1} + \mathbf{x}^\top \mathbf{S}_N \mathbf{x}$. The variance has two parts: irreducible noise ($\beta^{-1}$) and model uncertainty ($\mathbf{x}^\top \mathbf{S}_N \mathbf{x}$, which vanishes as data accumulates).
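In code this is one line each for the mean and variance (a sketch; `m_N` and `S_N` are the posterior quantities from the helper above, and `x` is assumed to include the bias component):

```python
import numpy as np

def predictive(x, m_N, S_N, beta):
    """Mean and variance of the Gaussian predictive distribution at input x."""
    mean = m_N @ x                   # m_N^T x
    var = 1.0 / beta + x @ S_N @ x   # irreducible noise + model uncertainty
    return mean, var
```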

When MAP Beats MLE

A degree-8 polynomial fit to noisy points:

  • MLE interpolates training points exactly, oscillates wildly between them, and fails on test data.
  • MAP with a Gaussian prior is smoother. Slightly larger training error, dramatically smaller test error.

Same model class, same data — different criterion, qualitatively different behaviour. The prior prevents the polynomial coefficients from blowing up to chase noise.
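A minimal reproduction (a sketch assuming a sine-plus-noise target with 9 points, which may not match the lecture's exact dataset):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 9)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(9)  # noisy targets

Phi = np.vander(x, 9, increasing=True)  # degree-8 polynomial features (9 coefficients)

# MLE: exact interpolation of the 9 points, wildly oscillating coefficients
w_mle = np.linalg.solve(Phi, y)

# MAP / ridge with lam = alpha / beta: coefficients stay small, fit is smooth
lam = 1e-3
w_map = np.linalg.solve(lam * np.eye(9) + Phi.T @ Phi, Phi.T @ y)

print(np.abs(w_mle).max(), np.abs(w_map).max())  # MAP coefficients are far smaller
```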

Part 2: Is Learning Feasible?

The Question

The whole machine-learning enterprise rests on a leap: from observed training data, infer something about unseen test data. Is this leap justified? Or are we just memorising?

Deterministic answer: NO. From a finite sample, you cannot say anything certain about points outside it. Anyone can construct a function that agrees with you on training points and disagrees everywhere else.

Probabilistic answer: YES. If training and test are i.i.d. from the same distribution, then with high probability the training-set behaviour resembles the population behaviour. Quantifying how high — and as a function of what — is the job of Hoeffding’s inequality.

Hoeffding’s Inequality

For i.i.d. random variables $X_1, \dots, X_N$ with $X_n \in [0, 1]$:

$$P\!\left(\left|\frac{1}{N}\sum_{n=1}^{N} X_n - \mathbb{E}[X]\right| > \epsilon\right) \le 2e^{-2\epsilon^2 N}$$

Two messages:

  1. You can bound the unknown using what you observe. The sample mean (which you know) approximates the true mean (which you don't), with a quantified probability of error.
  2. It is universal. The right-hand side depends only on $\epsilon$ and $N$, not on the underlying distribution: the bound works for any random variables bounded in $[0, 1]$.
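The bound is easy to verify empirically; a sketch with Bernoulli(0.3) variables (any distribution bounded in $[0, 1]$ would do):

```python
import numpy as np

rng = np.random.default_rng(3)
N, eps, trials = 100, 0.1, 100_000

# Sample means of N Bernoulli(0.3) draws; the true mean is 0.3
sample_means = rng.binomial(N, 0.3, size=trials) / N

empirical = np.mean(np.abs(sample_means - 0.3) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)

print(empirical, bound)  # empirical frequency (~0.03) sits below the bound (~0.27)
```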

For ML, set $X_n = \mathbb{1}[h(\mathbf{x}_n) \ne y_n]$, the misclassification indicator. Then the sample mean is $E_{\text{in}}(h)$ (training error) and the true mean is $E_{\text{out}}(h)$ (test error). Hoeffding becomes:

$$P\big(|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\big) \le 2e^{-2\epsilon^2 N}$$

The basic feasibility result

"Training error is probably close to test error, when the training set is large enough." That's what Hoeffding gives, and it's the foundation of every formal generalisation argument in ML.

Accuracy and Confidence — PAC

Setting $\delta = 2e^{-2\epsilon^2 N}$, the bound reads: $|E_{\text{in}} - E_{\text{out}}| \le \epsilon$ with probability at least $1 - \delta$, where:

  • $\epsilon$ (accuracy): how close training and test errors are.
  • $1 - \delta$ (confidence): the probability with which the claim holds.

A target function is Probably Approximately Correct (PAC) learnable if for every $\epsilon > 0$ and $\delta > 0$ there exists an algorithm and a sample size $N(\epsilon, \delta)$ achieving the above. PAC is the formal sense in which ML "works": not perfect correctness, but probably approximately correct, given enough data.

The One-Hypothesis Catch

Hoeffding requires $h$ to be fixed before observing the data. But learning algorithms choose $h$ from a hypothesis set $\mathcal{H}$ based on the training data, so the chosen hypothesis depends on the sample and the per-example errors are no longer i.i.d. with respect to a fixed $h$.

Fix: apply Hoeffding to each candidate $h_m \in \mathcal{H} = \{h_1, \dots, h_M\}$, then take a union bound over the selected hypothesis $g$:

$$P\big(|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\big) \le \sum_{m=1}^{M} P\big(|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\big) \le 2M e^{-2\epsilon^2 N}$$

The factor $M$ is the price for letting the algorithm choose among $M$ hypotheses.
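A classic simulation makes the price concrete. Take $M$ fair coins as "hypotheses", each with true error 0.5; selecting the one with the best training error makes $E_{\text{in}}$ look deceptively good (a sketch; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, trials = 50, 1000, 2_000

# Training error of each of M hypotheses; every one has true error 0.5
E_in = rng.binomial(N, 0.5, size=(trials, M)) / N

best = E_in.min(axis=1)  # training error of the selected hypothesis g

print(best.mean())  # well below the true error 0.5: selection widens the gap
```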

The Generalization Bound

Setting $\delta = 2Me^{-2\epsilon^2 N}$, solving for $\epsilon$, and rearranging:

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}$$

with probability at least $1 - \delta$. This is the generalization bound.

Reading the variables:

| Variable | Effect | What you control |
| --- | --- | --- |
| $N$ (training-set size) | More data → smaller gap | Get more data |
| $M$ (hypothesis-set size) | Larger $M$ → larger gap | Choose simpler models |
| $\delta$ (confidence tolerance) | Smaller $\delta$ → larger gap | Trade-off |
| $E_{\text{in}}$ (training error) | Smaller → tighter bound on $E_{\text{out}}$ | Train better |

The square root means halving the gap requires quadrupling $N$. Tightening the confidence (smaller $\delta$) or adding hypotheses (larger $M$) enters the bound only logarithmically, so both are cheap.

The Two Central Questions

The bound splits learning into:

  1. Can $E_{\text{out}}$ be close to $E_{\text{in}}$? This is generalisation; it favours small $M$.
  2. Can $E_{\text{in}}$ be made small enough? This is fitting; it favours large $M$ (more options).

These pull in opposite directions:

| | Generalisation | Fitting |
| --- | --- | --- |
| Small $M$ | ✓ small gap | ✗ poor fit |
| Large $M$ | ✗ large gap | ✓ good fit |

Picking the right $M$ balances them. This is the formal source of the bias-variance / model-complexity trade-off: too simple → underfit (high bias); too complex → overfit (high variance).

Worked Example — How Much Data?

Target: accuracy $\epsilon = 0.1$, confidence $1 - \delta = 0.95$, $M = 100$ candidate hypotheses. Require $2Me^{-2\epsilon^2 N} \le \delta$:

$$N \ge \frac{1}{2\epsilon^2} \ln \frac{2M}{\delta} = \frac{1}{2(0.1)^2} \ln \frac{200}{0.05} = 50 \ln 4000 \approx 414.7$$

So 415 examples suffice for 10% accuracy with 95% confidence over 100 hypotheses.
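The same arithmetic as a tiny helper (a sketch; `eps`, `delta`, `M` as above):

```python
import math

def sample_size(eps, delta, M):
    """Smallest N with 2 * M * exp(-2 * eps^2 * N) <= delta."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(sample_size(0.1, 0.05, 100))   # 415
print(sample_size(0.05, 0.05, 100))  # halving eps quadruples N: 1659
```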

The $M = \infty$ Problem

Linear regression has $M = \infty$: there is a continuum of weight vectors. Naively $\ln M = \infty$, and the bound is vacuous. The same holds for SVMs, neural networks, and almost everything practical.

Fix: replace $M$ with a finite "effective" complexity. Many hypotheses in $\mathcal{H}$ produce the same labelling on a fixed training set, so the bad events overlap massively. Counting distinct labellings (rather than distinct hypotheses) gives a finite quantity even when $\mathcal{H}$ is infinite.

Define a dichotomy as a hypothesis "as seen by" $N$ specific inputs, i.e. the labelling pattern it produces on them. The number of distinct dichotomies is at most $2^N$ for binary classification (finite for any finite $N$), and for "structured" hypothesis sets it grows polynomially in $N$ rather than exponentially.

For lines in $\mathbb{R}^2$:

| $N$ (inputs) | Max dichotomies | $2^N$ |
| --- | --- | --- |
| 1 | 2 | 2 |
| 2 | 4 | 4 |
| 3 | 8 (or 6 if collinear) | 8 |
| 4 | 14 | 16 |

At $N = 4$, lines in $\mathbb{R}^2$ can no longer realise every possible labelling (14 < 16): some labellings are impossible regardless of where you place the points. This is the structural restriction that makes learning with infinite $\mathcal{H}$ feasible.
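A Monte Carlo check of the $N = 4$ row (a sketch; random search could in principle miss a rare pattern, but typically finds all 14):

```python
import numpy as np

rng = np.random.default_rng(5)
points = rng.standard_normal((4, 2))  # 4 points, almost surely in general position

# Collect the distinct labellings realised by random lines sign(w . x + b)
patterns = set()
for _ in range(100_000):
    w = rng.standard_normal(2)
    b = rng.standard_normal()
    patterns.add(tuple(points @ w + b > 0))

print(len(patterns))  # 14, not 2^4 = 16: two labellings are unrealisable by any line
```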

The replacement complexity is the growth function $m_{\mathcal{H}}(N)$, formalised in upcoming weeks via the VC dimension.


Concepts Introduced This Week

  • bayesian-linear-regression — full Bayesian treatment of linear regression: prior, conjugate Gaussian likelihood, closed-form Gaussian posterior, predictive distribution.
  • ridge-regression — L2-regularised linear regression; the MAP estimate of Bayesian linear regression under a Gaussian prior; the practical fix for ill-conditioned $\mathbf{X}^\top\mathbf{X}$.
  • hoeffding-inequality — the master concentration bound: sample mean and true mean differ by more than $\epsilon$ with probability at most $2e^{-2\epsilon^2 N}$. Foundation of PAC learning.
  • generalization-bound — Hoeffding + union bound: $E_{\text{out}} \le E_{\text{in}} + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ with confidence $1 - \delta$. The formal source of the bias-variance trade-off.

Connections

  • Builds on week-07: regularisation (ridge) is the Bayesian completion of linear regression — the prior supplies the regularisation. The two halves of the module’s “regression” arc are now joined: MLE/OLS for point estimates, MAP/ridge and full Bayesian for uncertainty.
  • Builds on bayes-law: the abstract formula “posterior ∝ likelihood × prior” gets its first concrete payoff. Conjugate Gaussian-Gaussian pairing keeps everything tractable.
  • Builds on gaussian-distribution: the Gaussian’s role expands — it’s no longer just the noise model, it’s also the prior, and (consequently) the posterior. Conjugacy means the family is closed under inference.
  • Sets up later weeks: VC dimension (formal way to replace $M$ with a finite quantity); cross-validation (practical way to pick model complexity); deep learning generalisation (where Hoeffding-style bounds are vacuous and tighter notions are needed).

Open Questions

  • How do we replace $M$ with something finite for infinite hypothesis sets? VC dimension, covered in the next weeks. The lecture's "dichotomy" preview is the setup.
  • How do we choose $\alpha$ (or the ratio $\lambda = \alpha/\beta$) in Bayesian regression? Cross-validation, empirical Bayes (maximise the evidence), or hierarchical priors with hyperpriors integrated out.
  • Are Hoeffding-style bounds useful in practice for deep networks? Generally no: the effective $M$ is astronomical and the bound is vacuous. Tighter notions (Rademacher complexity, PAC-Bayes, margin-based bounds) are active research topics.
  • What if priors and noise aren’t Gaussian? Conjugacy is lost; closed-form inference fails. Then we use approximate inference: MCMC, variational Bayes, Laplace approximations.