THE CRUX: Last week ended with two open threads. (1) Linear regression's MLE/OLS is a point estimate: what if we want uncertainty over $\mathbf{w}$, or a principled justification for regularisation? (2) We've been assuming throughout that "training error close to test error" is reasonable, but on what grounds? When does learning provably generalise, and how much data do we need?

The two halves of week 8 each answer one. (1) Bayesian linear regression places a Gaussian prior on $\mathbf{w}$ and updates to a Gaussian posterior via Bayes' law. The posterior mean is the MAP estimate, which is exactly ridge regression, so L2 regularisation is the negative log of a Gaussian prior. Regularisation isn't a heuristic; it's a probabilistic statement about what weights are plausible. (2) Hoeffding's inequality gives the foundational concentration bound: for a fixed hypothesis $h$, training error and test error differ by more than $\epsilon$ with probability at most $2e^{-2\epsilon^2 N}$. Combined with a union bound over a hypothesis set of size $M$, this becomes the generalisation bound: $E_{\text{out}} \le E_{\text{in}} + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ with confidence $1-\delta$. Two opposing forces (large $M$ helps fit, small $M$ helps generalise) formalise the bias-variance trade-off.


Part 1: Bayesian Regression

The Underdetermined Case

Linear regression with $y = w_0 + w_1 x$ requires at least two $(x, y)$ pairs to identify the slope and intercept. With one data point there are infinitely many solutions: every line through that point fits perfectly. OLS doesn't help; the system is underdetermined.

The Bayesian fix: declare a prior on which $\mathbf{w}$'s are plausible before seeing the data. A natural choice is "small weights are more likely than large ones", formalised as a Gaussian prior centred at zero. Now the posterior is a distribution over all admissible lines through the point: not a single answer, but a quantified set of possibilities.

This is more than a hack for tiny datasets. It generalises: even when OLS would work, the Bayesian view supplies model uncertainty, principled regularisation, and a graceful way to update beliefs as data arrives.

The Setup

Likelihood (same as standard linear regression, with noise precision $\beta$):

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \beta^{-1})$$

The prior is a zero-mean Gaussian with isotropic precision $\alpha$:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$$

Posterior by Bayes' law:

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w})$$

Why the Math Closes — Conjugacy

A Gaussian likelihood combined with a Gaussian prior gives a Gaussian posterior. This is the canonical example of conjugacy: the posterior stays in the same family as the prior, so there are no integrals to approximate and no MCMC. Everything is in closed form.

Taking the log:

$$\ln p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = -\frac{\beta}{2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 - \frac{\alpha}{2}\, \mathbf{w}^\top \mathbf{w} + \text{const}$$

This is quadratic in $\mathbf{w}$, so completing the square gives a Gaussian:

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

with:

$$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\mathbf{X}^\top\mathbf{X}, \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N\mathbf{X}^\top\mathbf{y}$$
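In NumPy the update is a few lines (a minimal sketch, not the lecture's code; `X` is assumed to be the $N \times D$ design matrix with a bias column, `alpha` and `beta` the prior and noise precisions):

```python
import numpy as np

def posterior(X, y, alpha, beta):
    """Gaussian posterior N(w | m_N, S_N) for Bayesian linear regression."""
    D = X.shape[1]
    S_N_inv = alpha * np.eye(D) + beta * X.T @ X   # S_N^{-1} = alpha I + beta X^T X
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ (X.T @ y)                   # m_N = beta S_N X^T y
    return m_N, S_N
```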

MAP = Ridge Regression

Maximising the log-posterior (the MAP estimate) is the same as minimising:

$$\sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \frac{\alpha}{\beta}\,\|\mathbf{w}\|^2$$

The first term is the OLS loss. The second is an L2 penalty on $\mathbf{w}$ with regularisation coefficient $\lambda = \alpha/\beta$. This is exactly ridge regression.
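A quick numerical check that the two coincide (a sketch with synthetic data; the values of `alpha` and `beta` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, 20)])  # bias + one feature
y = -0.3 + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(20)

alpha, beta = 2.0, 25.0
lam = alpha / beta

# Ridge solution: (lam I + X^T X)^{-1} X^T y
w_ridge = np.linalg.solve(lam * np.eye(2) + X.T @ X, X.T @ y)

# Posterior mean: m_N = beta S_N X^T y, with S_N^{-1} = alpha I + beta X^T X
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ (X.T @ y)

assert np.allclose(w_ridge, m_N)  # the MAP estimate is the ridge solution
```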

Regularisation is a prior in disguise

The L2 penalty is the negative log of a zero-mean Gaussian prior. Maximising the posterior = minimising "negative log-likelihood + negative log-prior" = minimising "OLS loss + L2 penalty". The Bayesian view explains why L2 regularisation works: it's a probabilistic statement that small weights are a priori more plausible than large ones.

Other regularisers correspond to other priors: L1 (lasso) ↔ Laplace prior; mixtures correspond to elastic net; structured priors give group lasso, etc.
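For example, a zero-mean Laplace prior $p(w_j) = \frac{1}{2b} e^{-|w_j|/b}$ on each weight has negative log

$$-\ln p(\mathbf{w}) = \frac{1}{b}\sum_j |w_j| + \text{const} = \frac{1}{b}\,\|\mathbf{w}\|_1 + \text{const},$$

so MAP estimation under a Laplace prior minimises "OLS loss + L1 penalty", i.e. the lasso.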

Posterior Behaviour as Data Grows

A sequence of plots (lecture slides 16–17) traces what happens to the posterior over $\mathbf{w}$ as $N$ grows from 1 to 1000:

  • $N = 1$: posterior is broad, barely tighter than the prior.
  • A few points: a clear elongated ellipse appears, confirming the slope-intercept correlation.
  • Tens of points: posterior shrinks to a small region.
  • $N = 1000$: a tiny spot, almost a point estimate.

Equivalently: sample 10 lines from the posterior at each $N$ and plot them. With small $N$ the lines fan out in many directions; with large $N$ they're nearly indistinguishable. Model uncertainty shrinks as data accumulates.
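This is easy to reproduce: draw weight vectors from $\mathcal{N}(\mathbf{m}_N, \mathbf{S}_N)$ and plot each one as a line (a self-contained sketch, with the same `alpha`/`beta` conventions as above):

```python
import numpy as np

def sample_lines(X, y, alpha=2.0, beta=25.0, n_samples=10, seed=1):
    """Draw n_samples weight vectors from the posterior; each row is one line."""
    rng = np.random.default_rng(seed)
    S_N = np.linalg.inv(alpha * np.eye(X.shape[1]) + beta * X.T @ X)
    m_N = beta * S_N @ (X.T @ y)
    return rng.multivariate_normal(m_N, S_N, size=n_samples)  # rows: (intercept, slope)
```

With one row in `X` the sampled (intercept, slope) pairs scatter widely; with hundreds of rows they cluster tightly around the OLS solution.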

Predictive Distribution

A non-Bayesian model gives a single prediction $\hat{y}$ for each new input $\mathbf{x}$. Bayesian regression integrates over all plausible $\mathbf{w}$'s:

$$p(y \mid \mathbf{x}, \mathbf{X}, \mathbf{y}) = \int p(y \mid \mathbf{x}, \mathbf{w}) \, p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \, d\mathbf{w}$$

The result is Gaussian: mean $\mathbf{m}_N^\top \mathbf{x}$, variance $\sigma_N^2(\mathbf{x}) = \beta^{-1} + \mathbf{x}^\top \mathbf{S}_N \mathbf{x}$. The variance has two parts: irreducible noise ($\beta^{-1}$) and model uncertainty ($\mathbf{x}^\top \mathbf{S}_N \mathbf{x}$, which vanishes as data accumulates).
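In code this is one line each for the mean and variance (a sketch; `m_N` and `S_N` are the posterior quantities from the helper above, and `x` is assumed to include the bias component):

```python
import numpy as np

def predictive(x, m_N, S_N, beta):
    """Mean and variance of the Gaussian predictive distribution at input x."""
    mean = m_N @ x                   # m_N^T x
    var = 1.0 / beta + x @ S_N @ x   # irreducible noise + model uncertainty
    return mean, var
```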

When MAP Beats MLE

A degree-8 polynomial fit to noisy points:

  • MLE interpolates training points exactly, oscillates wildly between them, and fails on test data.
  • MAP with a Gaussian prior is smoother. Slightly larger training error, dramatically smaller test error.

Same model class, same data — different criterion, qualitatively different behaviour. The prior prevents the polynomial coefficients from blowing up to chase noise.
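A minimal reproduction (a sketch assuming a sine-plus-noise target with 9 points, which may not match the lecture's exact dataset):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 9)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(9)  # noisy targets

Phi = np.vander(x, 9, increasing=True)  # degree-8 polynomial features (9 coefficients)

# MLE: exact interpolation of the 9 points, wildly oscillating coefficients
w_mle = np.linalg.solve(Phi, y)

# MAP / ridge with lam = alpha / beta: coefficients stay small, fit is smooth
lam = 1e-3
w_map = np.linalg.solve(lam * np.eye(9) + Phi.T @ Phi, Phi.T @ y)

print(np.abs(w_mle).max(), np.abs(w_map).max())  # MAP coefficients are far smaller
```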

Part 2: Is Learning Feasible?

The Question

The whole machine-learning enterprise rests on a leap: from observed training data, infer something about unseen test data. Is this leap justified? Or are we just memorising?

Deterministic answer: NO. From a finite sample, you cannot say anything certain about points outside it. Anyone can construct a function that agrees with you on training points and disagrees everywhere else.

Probabilistic answer: YES. If training and test are i.i.d. from the same distribution, then with high probability the training-set behaviour resembles the population behaviour. Quantifying how high — and as a function of what — is the job of Hoeffding’s inequality.

Hoeffding’s Inequality

For i.i.d. random variables $X_1, \dots, X_N$ with $X_n \in [0, 1]$:

$$P\!\left(\left|\frac{1}{N}\sum_{n=1}^{N} X_n - \mathbb{E}[X]\right| > \epsilon\right) \le 2e^{-2\epsilon^2 N}$$

Two messages:

  1. You can bound the unknown using what you observe. The sample mean (which you know) approximates the true mean (which you don't), with a quantified probability of error.
  2. It is universal. The right-hand side depends only on $\epsilon$ and $N$, not on the underlying distribution: the bound works for any random variables bounded in $[0, 1]$.
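The bound is easy to verify empirically; a sketch with Bernoulli(0.3) variables (any distribution bounded in $[0, 1]$ would do):

```python
import numpy as np

rng = np.random.default_rng(3)
N, eps, trials = 100, 0.1, 100_000

# Sample means of N Bernoulli(0.3) draws; the true mean is 0.3
sample_means = rng.binomial(N, 0.3, size=trials) / N

empirical = np.mean(np.abs(sample_means - 0.3) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)

print(empirical, bound)  # empirical frequency (~0.03) sits below the bound (~0.27)
```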

For ML, set $X_n = \mathbb{1}[h(\mathbf{x}_n) \ne y_n]$, the misclassification indicator. Then the sample mean is $E_{\text{in}}(h)$ (training error) and the true mean is $E_{\text{out}}(h)$ (test error). Hoeffding becomes:

$$P\big(|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\big) \le 2e^{-2\epsilon^2 N}$$

The basic feasibility result

"Training error is probably close to test error, when the training set is large enough." That's what Hoeffding gives, and it's the foundation of every formal generalisation argument in ML.

Accuracy and Confidence — PAC

Setting $\delta = 2e^{-2\epsilon^2 N}$, the bound reads: $|E_{\text{in}} - E_{\text{out}}| \le \epsilon$ with probability at least $1 - \delta$, where:

  • $\epsilon$ (accuracy): how close training and test errors are.
  • $1 - \delta$ (confidence): the probability with which the claim holds.

A target function is Probably Approximately Correct (PAC) learnable if for every $\epsilon > 0$ and $\delta > 0$ there exists an algorithm and a sample size $N(\epsilon, \delta)$ achieving the above. PAC is the formal sense in which ML "works": not perfect correctness, but probably approximately correct, given enough data.

The One-Hypothesis Catch

Hoeffding requires $h$ to be fixed before observing the data. But learning algorithms choose $h$ from a hypothesis set $\mathcal{H}$ based on the training data, so the chosen hypothesis depends on the sample and the per-example errors are no longer i.i.d. with respect to a fixed $h$.

Fix: apply Hoeffding to each candidate $h_m \in \mathcal{H} = \{h_1, \dots, h_M\}$, then take a union bound over the selected hypothesis $g$:

$$P\big(|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\big) \le \sum_{m=1}^{M} P\big(|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\big) \le 2M e^{-2\epsilon^2 N}$$

The factor $M$ is the price for letting the algorithm choose among $M$ hypotheses.
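A classic simulation makes the price concrete. Take $M$ fair coins as "hypotheses", each with true error 0.5; selecting the one with the best training error makes $E_{\text{in}}$ look deceptively good (a sketch; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, trials = 50, 1000, 2_000

# Training error of each of M hypotheses; every one has true error 0.5
E_in = rng.binomial(N, 0.5, size=(trials, M)) / N

best = E_in.min(axis=1)  # training error of the selected hypothesis g

print(best.mean())  # well below the true error 0.5: selection widens the gap
```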

The Generalization Bound

Setting $\delta = 2Me^{-2\epsilon^2 N}$, solving for $\epsilon$, and rearranging:

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N} \ln \frac{2M}{\delta}}$$

with probability at least $1 - \delta$. This is the generalization bound.

Reading the variables:

| Variable | Effect | What you control |
| --- | --- | --- |
| $N$ (training-set size) | More data → smaller gap | Get more data |
| $M$ (hypothesis-set size) | Larger $M$ → larger gap | Choose simpler models |
| $\delta$ (confidence tolerance) | Smaller $\delta$ → larger gap | Trade-off |
| $E_{\text{in}}$ (training error) | Smaller → tighter bound on $E_{\text{out}}$ | Train better |

The square root means halving the gap requires quadrupling $N$. Tightening the confidence (smaller $\delta$) or adding hypotheses (larger $M$) enters the bound only logarithmically, so both are cheap.

The Two Central Questions

The bound splits learning into:

  1. Can $E_{\text{out}}$ be close to $E_{\text{in}}$? This is generalisation; it favours small $M$.
  2. Can $E_{\text{in}}$ be made small enough? This is fitting; it favours large $M$ (more options).

These pull in opposite directions:

| | Generalisation | Fitting |
| --- | --- | --- |
| Small $M$ | ✓ small gap | ✗ poor fit |
| Large $M$ | ✗ large gap | ✓ good fit |

Picking the right $M$ balances them. This is the formal source of the bias-variance / model-complexity trade-off: too simple → underfit (high bias); too complex → overfit (high variance).

Worked Example — How Much Data?

Target: accuracy $\epsilon = 0.1$, confidence $1 - \delta = 0.95$, $M = 100$ candidate hypotheses. Require $2Me^{-2\epsilon^2 N} \le \delta$:

$$N \ge \frac{1}{2\epsilon^2} \ln \frac{2M}{\delta} = \frac{1}{2(0.1)^2} \ln \frac{200}{0.05} = 50 \ln 4000 \approx 414.7$$

So 415 examples suffice for 10% accuracy with 95% confidence over 100 hypotheses.
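The same arithmetic as a tiny helper (a sketch; `eps`, `delta`, `M` as above):

```python
import math

def sample_size(eps, delta, M):
    """Smallest N with 2 * M * exp(-2 * eps^2 * N) <= delta."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(sample_size(0.1, 0.05, 100))   # 415
print(sample_size(0.05, 0.05, 100))  # halving eps quadruples N: 1659
```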

The $M = \infty$ Problem

Linear regression has $M = \infty$: there is a continuum of weight vectors. Naively $\ln M = \infty$, and the bound is vacuous. The same holds for SVMs, neural networks, and almost everything practical.

Fix: replace $M$ with a finite "effective" complexity. Many hypotheses in $\mathcal{H}$ produce the same labelling on a fixed training set, so the bad events overlap massively. Counting distinct labellings (rather than distinct hypotheses) gives a finite quantity even when $\mathcal{H}$ is infinite.

Define a dichotomy as a hypothesis "as seen by" $N$ specific inputs, i.e. the labelling pattern it produces on them. The number of distinct dichotomies is at most $2^N$ for binary classification (finite for any finite $N$), and for "structured" hypothesis sets it grows polynomially in $N$ rather than exponentially.

For lines in $\mathbb{R}^2$:

| $N$ (inputs) | Max dichotomies | $2^N$ |
| --- | --- | --- |
| 1 | 2 | 2 |
| 2 | 4 | 4 |
| 3 | 8 (or 6 if collinear) | 8 |
| 4 | 14 | 16 |

At $N = 4$, lines in $\mathbb{R}^2$ can no longer realise every possible labelling (14 < 16): some labellings are impossible regardless of where you place the points. This is the structural restriction that makes learning with infinite $\mathcal{H}$ feasible.
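A Monte Carlo check of the $N = 4$ row (a sketch; random search could in principle miss a rare pattern, but typically finds all 14):

```python
import numpy as np

rng = np.random.default_rng(5)
points = rng.standard_normal((4, 2))  # 4 points, almost surely in general position

# Collect the distinct labellings realised by random lines sign(w . x + b)
patterns = set()
for _ in range(100_000):
    w = rng.standard_normal(2)
    b = rng.standard_normal()
    patterns.add(tuple(points @ w + b > 0))

print(len(patterns))  # 14, not 2^4 = 16: two labellings are unrealisable by any line
```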

The replacement complexity is the growth function $m_{\mathcal{H}}(N)$, formalised in upcoming weeks via the VC dimension.


Concepts Introduced This Week

  • bayesian-linear-regression — full Bayesian treatment of linear regression: prior, conjugate Gaussian likelihood, closed-form Gaussian posterior, predictive distribution.
  • ridge-regression — L2-regularised linear regression; the MAP estimate of Bayesian linear regression under a Gaussian prior; the practical fix for ill-conditioned $\mathbf{X}^\top\mathbf{X}$.
  • hoeffding-inequality — the master concentration bound: sample mean and true mean differ by more than $\epsilon$ with probability at most $2e^{-2\epsilon^2 N}$. Foundation of PAC learning.
  • generalization-bound — Hoeffding + union bound: $E_{\text{out}} \le E_{\text{in}} + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ with confidence $1 - \delta$. The formal source of the bias-variance trade-off.

Connections

  • Builds on week-07: regularisation (ridge) is the Bayesian completion of linear regression — the prior supplies the regularisation. The two halves of the module’s “regression” arc are now joined: MLE/OLS for point estimates, MAP/ridge and full Bayesian for uncertainty.
  • Builds on bayes-law: the abstract formula “posterior ∝ likelihood × prior” gets its first concrete payoff. Conjugate Gaussian-Gaussian pairing keeps everything tractable.
  • Builds on gaussian-distribution: the Gaussian’s role expands — it’s no longer just the noise model, it’s also the prior, and (consequently) the posterior. Conjugacy means the family is closed under inference.
  • Sets up later weeks: VC dimension (formal way to replace $M$ with a finite quantity); cross-validation (practical way to pick model complexity); deep learning generalisation (where Hoeffding-style bounds are vacuous and tighter notions are needed).

Open Questions

  • How do we replace $M$ with something finite for infinite hypothesis sets? VC dimension, covered in the next weeks. The lecture's "dichotomy" preview is the setup.
  • How do we choose $\alpha$ (or the ratio $\lambda = \alpha/\beta$) in Bayesian regression? Cross-validation, empirical Bayes (maximise the evidence), or hierarchical priors with hyperpriors integrated out.
  • Are Hoeffding-style bounds useful in practice for deep networks? Generally no: the effective $M$ is astronomical and the bound is vacuous. Tighter notions (Rademacher complexity, PAC-Bayes, margin-based bounds) are active research topics.
  • What if priors and noise aren’t Gaussian? Conjugacy is lost; closed-form inference fails. Then we use approximate inference: MCMC, variational Bayes, Laplace approximations.