A probabilistic bound: the sample mean of $N$ i.i.d. random variables (each in $[0, 1]$) deviates from the true mean by more than $\epsilon$ with probability at most $2e^{-2\epsilon^2 N}$. In ML terms: training error and test error are close with high probability, for any fixed hypothesis. The exponential decay in $N$ is what makes learning feasible at all.

The Inequality

For i.i.d. random variables $x_1, \dots, x_N$ with $x_i \in [0, 1]$ and sample mean $\nu = \frac{1}{N}\sum_{i=1}^N x_i$:

$$P\big(|\nu - \mu| > \epsilon\big) \le 2e^{-2\epsilon^2 N}$$

The sample mean $\nu$ concentrates around the true mean $\mu$. The probability of being “$\epsilon$-far” decays exponentially in $N$ (the sample size), not polynomially.
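
A quick empirical check (my own sketch, not from the original derivation): simulate Bernoulli samples and compare the observed frequency of large deviations to the bound.

```python
import math
import random

def hoeffding_bound(eps: float, n: int) -> float:
    """Right-hand side of Hoeffding's inequality for variables in [0, 1]."""
    return 2 * math.exp(-2 * eps**2 * n)

def deviation_frequency(mu: float, n: int, eps: float, trials: int = 20_000) -> float:
    """Fraction of trials where the mean of n Bernoulli(mu) draws lands
    more than eps away from mu."""
    hits = 0
    for _ in range(trials):
        nu = sum(random.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            hits += 1
    return hits / trials

mu, n, eps = 0.5, 100, 0.1
print(f"observed: {deviation_frequency(mu, n, eps):.4f}")  # typically ~0.035
print(f"bound:    {hoeffding_bound(eps, n):.4f}")          # 2e^{-2} ~ 0.271
```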

Two messages:

  1. You can bound the unknown using what you observe. The sample mean $\nu$ (which you compute) is an estimate of the true mean $\mu$ (which you don’t know). Hoeffding tells you how much they can differ.
  2. The bound is universal. The right-hand side is independent of the underlying distribution — it works for any distribution bounded in $[0, 1]$, with no further assumptions (see the sketch below).
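
To make the universality concrete, a sketch (the distributions and parameters are my own illustrative choices) checking the same bound against three different distributions on $[0, 1]$:

```python
import math
import random

def bound(eps: float, n: int) -> float:
    return 2 * math.exp(-2 * eps**2 * n)

# Three different distributions, all supported on [0, 1], with known means.
samplers = {
    "uniform":        (lambda: random.random(),              0.5),
    "bernoulli(0.3)": (lambda: float(random.random() < 0.3), 0.3),
    "beta(2, 5)":     (lambda: random.betavariate(2, 5),     2 / 7),
}

n, eps, trials = 200, 0.08, 5_000
for name, (draw, mu) in samplers.items():
    hits = sum(
        abs(sum(draw() for _ in range(n)) / n - mu) > eps
        for _ in range(trials)
    )
    print(f"{name:14s} observed {hits / trials:.4f}  <=  bound {bound(eps, n):.4f}")
```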

Application to Machine Learning

Set $e_i = \mathbb{1}[h(x_i) \neq f(x_i)]$ — the 0/1 misclassification indicator for example $x_i$ under a fixed hypothesis $h$, with $f$ the target function. Then:

  • $E_{\text{in}}(h) = \frac{1}{N}\sum_{i=1}^{N} e_i$ — the in-sample (training) error.
  • $E_{\text{out}}(h) = \mathbb{E}[e_i] = P[h(x) \neq f(x)]$ — the out-of-sample (true) error.

Plugging into Hoeffding:

$$P\big(|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\big) \le 2e^{-2\epsilon^2 N}$$

The basic feasibility statement

“Training error is probably close to test error, when the training set is large enough.” That’s all Hoeffding says — but it’s the foundation of every formal generalization argument in ML.
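
A toy simulation of that statement (the target $f$, hypothesis $h$, and input distribution here are my own illustrative choices): for a hypothesis fixed before seeing any data, $E_{\text{in}}$ tracks $E_{\text{out}}$ more and more tightly as $N$ grows.

```python
import random

# Fixed hypothesis h, chosen before seeing any data. Under uniform inputs
# on [0, 1], h disagrees with the target f exactly on (0.3, 0.5], so
# E_out(h) = 0.2.
def f(x): return x > 0.3
def h(x): return x > 0.5

E_OUT = 0.2

for n in (10, 100, 1_000, 10_000):
    xs = [random.random() for _ in range(n)]
    e_in = sum(h(x) != f(x) for x in xs) / n   # training error on this sample
    print(f"N={n:>6}  E_in={e_in:.4f}  |E_in - E_out|={abs(e_in - E_OUT):.4f}")
```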

Accuracy and Confidence

Rewriting with $\delta = 2e^{-2\epsilon^2 N}$:

  • $\epsilon$ is the accuracy: how close $E_{\text{in}}$ and $E_{\text{out}}$ are.
  • $1 - \delta$ is the confidence: the probability that the bound holds.
  • Solving for $\epsilon$: $\epsilon = \sqrt{\frac{1}{2N}\ln\frac{2}{\delta}}$.

So with probability at least $1 - \delta$:

$$|E_{\text{in}} - E_{\text{out}}| \le \sqrt{\frac{1}{2N}\ln\frac{2}{\delta}}$$
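
The solved-for tolerance as a function of sample size and confidence, as a minimal sketch:

```python
import math

def epsilon(n: int, delta: float) -> float:
    """Accuracy guaranteed with probability >= 1 - delta after n samples:
    eps = sqrt(ln(2/delta) / (2n))."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

# e.g. N = 1000 samples at 95% confidence (delta = 0.05):
print(f"{epsilon(1000, 0.05):.4f}")   # 0.0429
```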

PAC Learnability

A target function $f$ is PAC-learnable (Probably Approximately Correct) if for any tolerances $\epsilon > 0$ and $\delta > 0$, there exists an algorithm and a sample size $N(\epsilon, \delta)$ such that:

$$P\big(|E_{\text{in}} - E_{\text{out}}| \le \epsilon\big) \ge 1 - \delta$$

In words: with as much accuracy as you want and as much confidence as you want, you can find $N$ large enough that your training error approximates your test error to that tolerance.

PAC is the formal sense in which ML “works”: not “we will be perfectly correct” but “we will be probably approximately correct, for any $\epsilon$ and $\delta$ we care about, given enough data.”
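
Inverting the bound gives the required sample size for a single fixed hypothesis (the $M$-hypothesis version appears below); a sketch:

```python
import math

def pac_sample_size(eps: float, delta: float) -> int:
    """Smallest n with 2 * exp(-2 * eps^2 * n) <= delta, for ONE fixed
    hypothesis (no union bound yet)."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(pac_sample_size(0.1, 0.05))   # 185
```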

The Catch — One Hypothesis vs the Final Hypothesis

Hoeffding requires the hypothesis $h$ to be fixed before looking at the data. But in ML, we choose our final hypothesis $g$ from a hypothesis set $\mathcal{H}$ based on the training data. A data-dependent choice breaks the fixed-hypothesis premise that Hoeffding needs, so the bound doesn’t apply to $g$ directly.

Fix: apply Hoeffding to each candidate hypothesis and take a union bound. If we choose $g$ from $M$ hypotheses $h_1, \dots, h_M$:

$$P\big(|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\big) \le \sum_{m=1}^{M} P\big(|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\big) \le 2M e^{-2\epsilon^2 N}$$

The factor $M$ is the price for letting the algorithm pick from a set rather than committing to one hypothesis in advance. See generalization-bound for what this means in practice.
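
A sketch of why the fixed-$h$ premise matters (toy setup of my own): $M$ hypotheses that are all equally useless, each with $E_{\text{out}} = 0.5$; picking the one with the best training error makes $E_{\text{in}}(g)$ look far better than it is.

```python
import random

def best_in_sample_error(M: int, n: int) -> float:
    """Minimum training error over M hypotheses whose predictions are
    independent coin flips (every one has E_out = 0.5)."""
    return min(
        sum(random.random() < 0.5 for _ in range(n)) / n   # E_in of one h
        for _ in range(M)
    )

n, trials = 50, 300
for M in (1, 10, 100, 500):
    avg = sum(best_in_sample_error(M, n) for _ in range(trials)) / trials
    print(f"M={M:>4}  avg chosen E_in = {avg:.3f}   (true E_out = 0.5)")
```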

Worked Numerical Example

A bin contains marbles with true mean $\mu$. We sample $N = 1000$ marbles. What’s the bound on $P(|\nu - \mu| > 0.03)$, where $\nu$ is the sample mean?

$2\epsilon^2 N = 2(0.03)^2(1000) = 1.8$ implies $e^{-2\epsilon^2 N} = e^{-1.8} \approx 0.165$, so $2e^{-1.8} \approx 0.33$.

So the bound is about 33%.
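
Checking the arithmetic, plus a Monte Carlo comparison (the simulation needs a concrete $\mu$; I use an illustrative $\mu = 0.5$, which the bound itself doesn’t depend on):

```python
import math
import random

eps, n = 0.03, 1000
print(f"Hoeffding bound: {2 * math.exp(-2 * eps**2 * n):.3f}")   # 2e^{-1.8} = 0.330

mu, trials = 0.5, 5_000
hits = sum(
    abs(sum(random.random() < mu for _ in range(n)) / n - mu) > eps
    for _ in range(trials)
)
print(f"observed:        {hits / trials:.3f}")   # typically ~0.054
```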

How Much Data Do We Need?

Given target accuracy $\epsilon$, confidence parameter $\delta$, and hypothesis set size $M$:

$$N \ge \frac{1}{2\epsilon^2} \ln\frac{2M}{\delta}$$

Worked example: $\epsilon = 0.1$, $\delta = 0.05$, $M = 100$:

$$N \ge \frac{1}{2(0.1)^2} \ln\frac{2 \cdot 100}{0.05} = 50 \ln 4000 \approx 50 \times 8.29 \approx 415$$

So 415 examples suffice for 10% accuracy with 95% confidence over 100 hypotheses.
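
The same computation as a minimal sketch:

```python
import math

def sample_size(eps: float, delta: float, M: int) -> int:
    """Hoeffding + union bound sample complexity:
    n >= ln(2M / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(sample_size(0.1, 0.05, 100))   # 415
```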

Notice the scaling:

  • $N \propto 1/\epsilon^2$ — halving $\epsilon$ quadruples the required $N$.
  • $N \propto \ln(1/\delta)$ — improving confidence is cheap (logarithmic).
  • $N \propto \ln M$ — adding hypotheses is also cheap (logarithmic). The sketch below makes all three concrete.
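
Concrete numbers, reusing the sample_size sketch from above:

```python
import math

def sample_size(eps: float, delta: float, M: int) -> int:
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(sample_size(0.10, 0.05,  100))   # 415   (baseline)
print(sample_size(0.05, 0.05,  100))   # 1659  (halve eps: ~4x the data)
print(sample_size(0.10, 0.005, 100))   # 530   (10x the confidence: small cost)
print(sample_size(0.10, 0.05, 1000))   # 530   (10x the hypotheses: same small cost)
```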

Tightness

Hoeffding is not tight — it’s a worst-case upper bound. For specific distributions, sharper bounds exist (Bernstein, Bennett). In practice the shape matters more than the constant: exponential decay in $N$ is what guarantees learning is feasible at scale.
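
To see the slack, a sketch comparing the bound to the exact two-sided tail of a fair coin (parameters are mine, matching the worked example above):

```python
import math

def exact_two_sided_tail(n: int, eps: float, p: float = 0.5) -> float:
    """Exact P(|sample mean - p| > eps) for a Binomial(n, p) sample mean."""
    lo, hi = (p - eps) * n, (p + eps) * n
    return sum(
        math.comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n + 1)
        if k < lo or k > hi
    )

n, eps = 1000, 0.03
print(f"exact tail:      {exact_two_sided_tail(n, eps):.4f}")    # ~0.054
print(f"Hoeffding bound: {2 * math.exp(-2 * eps**2 * n):.4f}")   # 0.3306
```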

What Could Go Wrong

  • The i.i.d. assumption. If training and test data come from different distributions, Hoeffding doesn’t apply. This is “covariate shift” or “distribution shift” (sketched below).
  • Variables outside $[0, 1]$. Hoeffding for general bounded variables $x_i \in [a, b]$ exists; the constant changes (the exponent becomes $-2\epsilon^2 N / (b - a)^2$). For unbounded distributions (heavy tails), use Bernstein-type bounds with variance terms.
  • Choosing $h$ after looking at the data. This is the whole reason we need the union bound and the factor $M$ — see “The Catch” above.
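
A toy sketch of the first failure mode (setup is mine, reusing the threshold example from earlier): train on one input distribution, evaluate on a shifted one that concentrates exactly where $h$ and $f$ disagree.

```python
import random

def f(x): return x > 0.3     # target
def h(x): return x > 0.5     # hypothesis; disagrees with f on (0.3, 0.5]

n = 10_000
train = [random.random() for _ in range(n)]           # uniform on [0, 1]
test  = [random.uniform(0.3, 0.5) for _ in range(n)]  # shifted test inputs

e_in  = sum(h(x) != f(x) for x in train) / n   # ~0.2
e_out = sum(h(x) != f(x) for x in test)  / n   # ~1.0: the guarantee is void
print(f"E_in = {e_in:.3f}, shifted E_out = {e_out:.3f}")
```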

Connections

  • generalization-bound — Hoeffding + union bound gives the master generalization bound.
  • gaussian-distribution — for sums of bounded variables, Hoeffding is a finite-sample counterpart to the CLT: it gives a deviation bound without requiring Gaussianity in the limit.
  • bayes-law — different framework for uncertainty (probability distributions over parameters); Hoeffding is the frequentist counterpart.

Active Recall