A probabilistic bound: the sample mean of $N$ i.i.d. random variables (each in $[0, 1]$) deviates from the true mean by more than $\epsilon$ with probability at most $2e^{-2\epsilon^2 N}$. In ML terms: training error and test error are close with high probability, for any fixed hypothesis. The exponential decay in $N$ is what makes learning feasible at all.

The Inequality

For i.i.d. random variables $x_1, \dots, x_N$ with $x_i \in [0, 1]$ and sample mean $\nu = \frac{1}{N}\sum_{i=1}^N x_i$:

$$P\big(|\nu - \mu| > \epsilon\big) \le 2e^{-2\epsilon^2 N}$$

The sample mean $\nu$ concentrates around the true mean $\mu$. The probability of being “$\epsilon$-far” decays exponentially in $N$ (the sample size), not polynomially.
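
A quick empirical check (my own sketch, not from the original derivation): simulate Bernoulli samples and compare the observed frequency of large deviations to the bound.

```python
import math
import random

def hoeffding_bound(eps: float, n: int) -> float:
    """Right-hand side of Hoeffding's inequality for variables in [0, 1]."""
    return 2 * math.exp(-2 * eps**2 * n)

def deviation_frequency(mu: float, n: int, eps: float, trials: int = 20_000) -> float:
    """Fraction of trials where the mean of n Bernoulli(mu) draws lands
    more than eps away from mu."""
    hits = 0
    for _ in range(trials):
        nu = sum(random.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            hits += 1
    return hits / trials

mu, n, eps = 0.5, 100, 0.1
print(f"observed: {deviation_frequency(mu, n, eps):.4f}")  # typically ~0.035
print(f"bound:    {hoeffding_bound(eps, n):.4f}")          # 2e^{-2} ~ 0.271
```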

Two messages:

  1. You can bound the unknown using what you observe. The sample mean $\nu$ (which you compute) is an estimate of the true mean $\mu$ (which you don’t know). Hoeffding tells you how much they can differ.
  2. The bound is universal. The right-hand side is independent of the underlying distribution — it works for any distribution bounded in $[0, 1]$, with no further assumptions (see the sketch below).
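
To make the universality concrete, a sketch (the distributions and parameters are my own illustrative choices) checking the same bound against three different distributions on $[0, 1]$:

```python
import math
import random

def bound(eps: float, n: int) -> float:
    return 2 * math.exp(-2 * eps**2 * n)

# Three different distributions, all supported on [0, 1], with known means.
samplers = {
    "uniform":        (lambda: random.random(),              0.5),
    "bernoulli(0.3)": (lambda: float(random.random() < 0.3), 0.3),
    "beta(2, 5)":     (lambda: random.betavariate(2, 5),     2 / 7),
}

n, eps, trials = 200, 0.08, 5_000
for name, (draw, mu) in samplers.items():
    hits = sum(
        abs(sum(draw() for _ in range(n)) / n - mu) > eps
        for _ in range(trials)
    )
    print(f"{name:14s} observed {hits / trials:.4f}  <=  bound {bound(eps, n):.4f}")
```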

Application to Machine Learning

Set $e_i = \mathbb{1}[h(x_i) \neq f(x_i)]$ — the 0/1 misclassification indicator for example $x_i$ under a fixed hypothesis $h$, with $f$ the target function. Then:

  • $E_{\text{in}}(h) = \frac{1}{N}\sum_{i=1}^{N} e_i$ — the in-sample (training) error.
  • $E_{\text{out}}(h) = \mathbb{E}[e_i] = P[h(x) \neq f(x)]$ — the out-of-sample (true) error.

Plugging into Hoeffding:

$$P\big(|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\big) \le 2e^{-2\epsilon^2 N}$$

The basic feasibility statement

“Training error is probably close to test error, when the training set is large enough.” That’s all Hoeffding says — but it’s the foundation of every formal generalization argument in ML.
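
A toy simulation of that statement (the target $f$, hypothesis $h$, and input distribution here are my own illustrative choices): for a hypothesis fixed before seeing any data, $E_{\text{in}}$ tracks $E_{\text{out}}$ more and more tightly as $N$ grows.

```python
import random

# Fixed hypothesis h, chosen before seeing any data. Under uniform inputs
# on [0, 1], h disagrees with the target f exactly on (0.3, 0.5], so
# E_out(h) = 0.2.
def f(x): return x > 0.3
def h(x): return x > 0.5

E_OUT = 0.2

for n in (10, 100, 1_000, 10_000):
    xs = [random.random() for _ in range(n)]
    e_in = sum(h(x) != f(x) for x in xs) / n   # training error on this sample
    print(f"N={n:>6}  E_in={e_in:.4f}  |E_in - E_out|={abs(e_in - E_OUT):.4f}")
```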

Accuracy and Confidence

Rewriting with $\delta = 2e^{-2\epsilon^2 N}$:

  • $\epsilon$ is the accuracy: how close $E_{\text{in}}$ and $E_{\text{out}}$ are.
  • $1 - \delta$ is the confidence: the probability that the bound holds.
  • Solving for $\epsilon$: $\epsilon = \sqrt{\frac{1}{2N}\ln\frac{2}{\delta}}$.

So with probability at least $1 - \delta$:

$$|E_{\text{in}} - E_{\text{out}}| \le \sqrt{\frac{1}{2N}\ln\frac{2}{\delta}}$$
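
The solved-for tolerance as a function of sample size and confidence, as a minimal sketch:

```python
import math

def epsilon(n: int, delta: float) -> float:
    """Accuracy guaranteed with probability >= 1 - delta after n samples:
    eps = sqrt(ln(2/delta) / (2n))."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

# e.g. N = 1000 samples at 95% confidence (delta = 0.05):
print(f"{epsilon(1000, 0.05):.4f}")   # 0.0429
```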

PAC Learnability

A target function $f$ is PAC-learnable (Probably Approximately Correct) if for any tolerances $\epsilon > 0$ and $\delta > 0$, there exists an algorithm and a sample size $N(\epsilon, \delta)$ such that:

$$P\big(|E_{\text{in}} - E_{\text{out}}| \le \epsilon\big) \ge 1 - \delta$$

In words: with as much accuracy as you want and as much confidence as you want, you can find $N$ large enough that your training error approximates your test error to that tolerance.

PAC is the formal sense in which ML “works”: not “we will be perfectly correct” but “we will be probably approximately correct, for any $\epsilon$ and $\delta$ we care about, given enough data.”
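
Inverting the bound gives the required sample size for a single fixed hypothesis (the $M$-hypothesis version appears below); a sketch:

```python
import math

def pac_sample_size(eps: float, delta: float) -> int:
    """Smallest n with 2 * exp(-2 * eps^2 * n) <= delta, for ONE fixed
    hypothesis (no union bound yet)."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(pac_sample_size(0.1, 0.05))   # 185
```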

The Catch — One Hypothesis vs the Final Hypothesis

Hoeffding requires the hypothesis $h$ to be fixed before looking at the data. But in ML, we choose our final hypothesis $g$ from a hypothesis set $\mathcal{H}$ based on the training data. A data-dependent choice breaks the fixed-hypothesis premise that Hoeffding needs, so the bound doesn’t apply to $g$ directly.

Fix: apply Hoeffding to each candidate hypothesis and take a union bound. If we choose $g$ from $M$ hypotheses $h_1, \dots, h_M$:

$$P\big(|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\big) \le \sum_{m=1}^{M} P\big(|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\big) \le 2M e^{-2\epsilon^2 N}$$

The factor $M$ is the price for letting the algorithm pick from a set rather than committing to one hypothesis in advance. See generalization-bound for what this means in practice.
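
A sketch of why the fixed-$h$ premise matters (toy setup of my own): $M$ hypotheses that are all equally useless, each with $E_{\text{out}} = 0.5$; picking the one with the best training error makes $E_{\text{in}}(g)$ look far better than it is.

```python
import random

def best_in_sample_error(M: int, n: int) -> float:
    """Minimum training error over M hypotheses whose predictions are
    independent coin flips (every one has E_out = 0.5)."""
    return min(
        sum(random.random() < 0.5 for _ in range(n)) / n   # E_in of one h
        for _ in range(M)
    )

n, trials = 50, 300
for M in (1, 10, 100, 500):
    avg = sum(best_in_sample_error(M, n) for _ in range(trials)) / trials
    print(f"M={M:>4}  avg chosen E_in = {avg:.3f}   (true E_out = 0.5)")
```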

Worked Numerical Example

A bin contains marbles with true mean $\mu$. We sample $N = 1000$ marbles. What’s the bound on $P(|\nu - \mu| > 0.03)$, where $\nu$ is the sample mean?

$2\epsilon^2 N = 2(0.03)^2(1000) = 1.8$ implies $e^{-2\epsilon^2 N} = e^{-1.8} \approx 0.165$, so $2e^{-1.8} \approx 0.33$.

So the bound is about 33%.
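
Checking the arithmetic, plus a Monte Carlo comparison (the simulation needs a concrete $\mu$; I use an illustrative $\mu = 0.5$, which the bound itself doesn’t depend on):

```python
import math
import random

eps, n = 0.03, 1000
print(f"Hoeffding bound: {2 * math.exp(-2 * eps**2 * n):.3f}")   # 2e^{-1.8} = 0.330

mu, trials = 0.5, 5_000
hits = sum(
    abs(sum(random.random() < mu for _ in range(n)) / n - mu) > eps
    for _ in range(trials)
)
print(f"observed:        {hits / trials:.3f}")   # typically ~0.054
```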

How Much Data Do We Need?

Given target accuracy $\epsilon$, confidence parameter $\delta$, and hypothesis set size $M$:

$$N \ge \frac{1}{2\epsilon^2} \ln\frac{2M}{\delta}$$

Worked example: $\epsilon = 0.1$, $\delta = 0.05$, $M = 100$:

$$N \ge \frac{1}{2(0.1)^2} \ln\frac{2 \cdot 100}{0.05} = 50 \ln 4000 \approx 50 \times 8.29 \approx 415$$

So 415 examples suffice for 10% accuracy with 95% confidence over 100 hypotheses.
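
The same computation as a minimal sketch:

```python
import math

def sample_size(eps: float, delta: float, M: int) -> int:
    """Hoeffding + union bound sample complexity:
    n >= ln(2M / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(sample_size(0.1, 0.05, 100))   # 415
```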

Notice the scaling:

  • $N \propto 1/\epsilon^2$ — halving $\epsilon$ quadruples the required $N$.
  • $N \propto \ln(1/\delta)$ — improving confidence is cheap (logarithmic).
  • $N \propto \ln M$ — adding hypotheses is also cheap (logarithmic). The sketch below makes all three concrete.
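
Concrete numbers, reusing the sample_size sketch from above:

```python
import math

def sample_size(eps: float, delta: float, M: int) -> int:
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(sample_size(0.10, 0.05,  100))   # 415   (baseline)
print(sample_size(0.05, 0.05,  100))   # 1659  (halve eps: ~4x the data)
print(sample_size(0.10, 0.005, 100))   # 530   (10x the confidence: small cost)
print(sample_size(0.10, 0.05, 1000))   # 530   (10x the hypotheses: same small cost)
```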

Tightness

Hoeffding is not tight — it’s a worst-case upper bound. For specific distributions, sharper bounds exist (Bernstein, Bennett). In practice the shape matters more than the constant: exponential decay in $N$ is what guarantees learning is feasible at scale.
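
To see the slack, a sketch comparing the bound to the exact two-sided tail of a fair coin (parameters are mine, matching the worked example above):

```python
import math

def exact_two_sided_tail(n: int, eps: float, p: float = 0.5) -> float:
    """Exact P(|sample mean - p| > eps) for a Binomial(n, p) sample mean."""
    lo, hi = (p - eps) * n, (p + eps) * n
    return sum(
        math.comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n + 1)
        if k < lo or k > hi
    )

n, eps = 1000, 0.03
print(f"exact tail:      {exact_two_sided_tail(n, eps):.4f}")    # ~0.054
print(f"Hoeffding bound: {2 * math.exp(-2 * eps**2 * n):.4f}")   # 0.3306
```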

What Could Go Wrong

  • The i.i.d. assumption. If training and test data come from different distributions, Hoeffding doesn’t apply. This is “covariate shift” or “distribution shift” (sketched below).
  • Variables outside $[0, 1]$. Hoeffding for general bounded variables $x_i \in [a, b]$ exists; the constant changes (the exponent becomes $-2\epsilon^2 N / (b - a)^2$). For unbounded distributions (heavy tails), use Bernstein-type bounds with variance terms.
  • Choosing $h$ after looking at the data. This is the whole reason we need the union bound and the factor $M$ — see “The Catch” above.
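
A toy sketch of the first failure mode (setup is mine, reusing the threshold example from earlier): train on one input distribution, evaluate on a shifted one that concentrates exactly where $h$ and $f$ disagree.

```python
import random

def f(x): return x > 0.3     # target
def h(x): return x > 0.5     # hypothesis; disagrees with f on (0.3, 0.5]

n = 10_000
train = [random.random() for _ in range(n)]           # uniform on [0, 1]
test  = [random.uniform(0.3, 0.5) for _ in range(n)]  # shifted test inputs

e_in  = sum(h(x) != f(x) for x in train) / n   # ~0.2
e_out = sum(h(x) != f(x) for x in test)  / n   # ~1.0: the guarantee is void
print(f"E_in = {e_in:.3f}, shifted E_out = {e_out:.3f}")
```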

Connections

  • generalization-bound — Hoeffding + union bound gives the master generalization bound.
  • gaussian-distribution — for sums of bounded variables, Hoeffding is a finite-sample counterpart to the CLT: it gives a deviation bound without requiring Gaussianity in the limit.
  • bayes-law — different framework for uncertainty (probability distributions over parameters); Hoeffding is the frequentist counterpart.

Active Recall