A classifier’s score isn’t a single number — it’s a confusion matrix, from which accuracy, precision, recall, and F1 are derived. Accuracy lies on imbalanced classes; the others don’t.

The Confusion Matrix

For a binary classifier, every test example falls into one of four cells based on its gold label (the human-annotated correct answer) and the system output label:

|                 | gold positive       | gold negative       |
|-----------------|---------------------|---------------------|
| system positive | true positive (tp)  | false positive (fp) |
| system negative | false negative (fn) | true negative (tn)  |

These four counts are the basis for every evaluation metric below.

Accuracy — and Why It Fails on Imbalanced Classes

“Of all predictions, what fraction were correct?” Intuitive, but it breaks under class imbalance.

Example (from the slides): 1,000,000 social media posts; 100 are about Delicious Pie Co; 999,900 are unrelated. A stupid classifier that says “not about pie” for every post achieves:

  • 999,900 true negatives, 100 false negatives, 0 true positives, 0 false positives.
  • Accuracy = 999,900 / 1,000,000 = 99.99%.

Completely useless at finding pie-related posts — its recall on the positive class is zero — yet it scores near-perfect on accuracy. For any task where you care about a minority class, accuracy is the wrong headline number.
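A minimal sketch of the same arithmetic in Python (the counts are the slide’s; the variable names are mine):

```python
# "Always say 'not about pie'" classifier on 1,000,000 posts (100 positive).
tp, fp, fn, tn = 0, 0, 100, 999_900

accuracy = (tp + tn) / (tp + fp + fn + tn)         # 0.9999
recall   = tp / (tp + fn) if (tp + fn) else 0.0    # 0.0 — finds no pie posts

print(f"accuracy = {accuracy:.4%}, positive-class recall = {recall:.2f}")
```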

Precision and Recall

Both are defined with respect to a specific class (usually the positive class):

  • Precision = tp / (tp + fp): of all items the system labelled positive, what fraction actually are? → “how trustworthy are its positive predictions?”
  • Recall = tp / (tp + fn): of all items that actually are positive, what fraction did the system find? → “how much of the positive class did it catch?”

The “say no to everything” classifier above has tp = 0, so both precision and recall collapse to 0. Neither is fooled by class imbalance the way accuracy is.

Trade-off: tightening a classifier (raising a decision threshold) typically raises precision but lowers recall; loosening it does the opposite. Which matters more depends on the cost of each error type — a cancer screener prefers recall (missing a case is deadly); a spam filter prefers precision (incorrectly filtering legitimate mail is disruptive).

F1 Score

A single number combining precision and recall:

$$F_1 = \frac{2PR}{P + R}$$

This is the harmonic mean of precision and recall. The harmonic mean punishes extremes — if either P or R is near zero, F1 is near zero. The “say no” classifier with P = R = 0 has F1 = 0; no amount of gaming the other metric saves it.

Why “harmonic mean”

The harmonic mean of n positive numbers x₁, …, xₙ is

$$\mathrm{HM}(x_1, \dots, x_n) = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$$

It’s the reciprocal of the arithmetic mean of reciprocals. For two numbers P and R:

$$\mathrm{HM}(P, R) = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}$$

That second form is exactly F1. The reciprocals are why the HM collapses when any input is tiny: if P → 0, then 1/P → ∞, which dominates the denominator sum and pulls the HM toward 0 no matter what R is. The arithmetic mean doesn’t have this property — AM(0, 1) = 0.5, but HM(0, 1) = 0.

The general F-measure

F1 is a special case of the general F-measure, which allows weighting P and R differently. Two equivalent parameterizations:

$$F = \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}} \qquad\qquad F_\beta = \frac{(\beta^2 + 1)\,PR}{\beta^2 P + R}$$

The two forms are related by β² = (1 − α)/α. In the α form, α is the relative weight on precision; in the β form, β is the relative emphasis on recall (think of β as “recall is β times as important as precision”).

F1 corresponds to α = 1/2 or equivalently β = 1 — precision and recall weighted equally. β > 1 (e.g. F2) weights recall more heavily; β < 1 (e.g. F0.5) weights precision more.
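A small helper makes the β weighting concrete (a sketch; the function name and example numbers are mine, not from the slides):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-measure: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A high-recall, low-precision classifier looks better as beta grows:
print(f_beta(0.5, 0.9, beta=1.0))   # F1   ≈ 0.64
print(f_beta(0.5, 0.9, beta=2.0))   # F2   ≈ 0.78 (recall-weighted)
print(f_beta(0.5, 0.9, beta=0.5))   # F0.5 ≈ 0.55 (precision-weighted)
```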

Worked calculation (from the lab)

Confusion matrix:

| Actual / Predicted | pos | neg |
|--------------------|-----|-----|
| pos                | 80  | 20  |
| neg                | 30  | 70  |

For the pos class: tp = 80, fp = 30, fn = 20, tn = 70.

  • Accuracy = (80 + 70) / 200 = 0.75
  • Precision = 80 / (80 + 30) = 80/110 ≈ 0.73
  • Recall = 80 / (80 + 20) = 80/100 = 0.80
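The same numbers checked in a few lines of Python (a sketch; F1 is added for completeness even though the lab asks only for the three values above):

```python
# pos-class counts from the lab's confusion matrix
tp, fn = 80, 20   # actual pos row: predicted pos / predicted neg
fp, tn = 30, 70   # actual neg row: predicted pos / predicted neg

accuracy  = (tp + tn) / (tp + fp + fn + tn)                 # 0.75
precision = tp / (tp + fp)                                  # ≈ 0.73
recall    = tp / (tp + fn)                                  # 0.80
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.76

print(accuracy, round(precision, 2), recall, round(f1, 2))
```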

Multi-class Precision and Recall

When there are more than two classes, precision and recall are defined per class: for class c, treat “is c” vs “is not c” as a binary problem and count tp/fp/fn/tn accordingly.

Worked example (from the slides)

Three-way email classification — urgent, normal, spam — with the confusion matrix (rows = system output, columns = gold):

|               | gold urgent | gold normal | gold spam |
|---------------|-------------|-------------|-----------|
| system urgent | 8           | 10          | 1         |
| system normal | 5           | 60          | 50        |
| system spam   | 3           | 30          | 200       |

Per-class precision (true positives over all predictions of that class — the row sum):

  • precision(urgent) = 8 / (8 + 10 + 1) = 8/19 ≈ 0.42
  • precision(normal) = 60 / (5 + 60 + 50) = 60/115 ≈ 0.52
  • precision(spam) = 200 / (3 + 30 + 200) = 200/233 ≈ 0.86

Per-class recall (true positives over all gold instances of that class — the column sum):

  • recall(urgent) = 8 / (8 + 5 + 3) = 8/16 = 0.50
  • recall(normal) = 60 / (10 + 60 + 30) = 60/100 = 0.60
  • recall(spam) = 200 / (1 + 50 + 200) = 200/251 ≈ 0.80

The classifier is good at spam, mediocre at normal, weak at urgent. A single-number summary needs a way to combine these three per-class scores — see below.

Macro vs Micro Averaging

Combining per-class precision (or recall, or F1) into one number takes two forms:

Macroaveraging

Compute precision (or recall, or F1) for each class separately, then average across classes.

Treats every class equally regardless of how many examples it has. Good when all classes matter equally (e.g. sentiment analysis with balanced positive/negative/neutral). Bad when a rare class with few examples produces a noisy estimate that drags the average down.

Microaveraging

Pool all the true positives, false positives, and false negatives across classes first, then compute a single precision.

Treats every example equally. Dominated by the majority class. Good when you care about average-case performance; bad when minority classes matter.

The two can differ dramatically. Continuing the urgent/normal/spam example, first convert the 3×3 matrix into three per-class binary confusion matrices (one per class: “is this class” vs “isn’t”):

Class 1 — Urgent:

|               | true urgent | true not |
|---------------|-------------|----------|
| system urgent | 8           | 11       |
| system not    | 8           | 340      |

Class 2 — Normal:

|               | true normal | true not |
|---------------|-------------|----------|
| system normal | 60          | 55       |
| system not    | 40          | 212      |

Class 3 — Spam:

|             | true spam | true not |
|-------------|-----------|----------|
| system spam | 200       | 33       |
| system not  | 51        | 83       |

Pooled (for microaveraging): sum the tp, fp, fn, tn across all three binary matrices:

|            | true yes | true no |
|------------|----------|---------|
| system yes | 268      | 99      |
| system no  | 99       | 635     |

Now compute each average:

  • Macro-averaged precision = (0.42 + 0.52 + 0.86) / 3 ≈ 0.60
  • Micro-averaged precision = pooled tp / (pooled tp + pooled fp) = 268 / 367 ≈ 0.73

The micro value (0.73) is higher than the macro value (0.60) because the large, easy class (spam, 200 true positives) dominates the pooled sum, while the small hard class (urgent, precision 0.42) drags the macro average down. Macro-averaging exposes the per-class weakness; micro-averaging hides it.
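The whole macro-vs-micro computation from the 3×3 matrix, as a sketch in plain Python (no evaluation library assumed):

```python
# rows = system output, columns = gold; class order: urgent, normal, spam
confusion = [
    [8,  10,   1],   # system urgent
    [5,  60,  50],   # system normal
    [3,  30, 200],   # system spam
]

per_class_precision = []
pooled_tp = pooled_fp = 0
for c in range(len(confusion)):
    tp = confusion[c][c]
    fp = sum(confusion[c]) - tp          # rest of the row: wrong predictions of class c
    per_class_precision.append(tp / (tp + fp))
    pooled_tp += tp
    pooled_fp += fp

macro = sum(per_class_precision) / len(per_class_precision)  # (0.42+0.52+0.86)/3 ≈ 0.60
micro = pooled_tp / (pooled_tp + pooled_fp)                   # 268/367 ≈ 0.73
print(round(macro, 2), round(micro, 2))
```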

Worked multi-class (from the lab)

Confusion matrix:

| Actual / Predicted | pos | neut | neg |
|--------------------|-----|------|-----|
| pos                | 100 | 20   | 10  |
| neut               | 330 | 120  | 20  |
| neg                | 15  | 25   | 95  |

  • Accuracy = (100 + 120 + 95) / 735 ≈ 0.43
  • Positive: precision = 100/445 = 0.22, recall = 100/130 = 0.77, F1 = 0.34
  • Neutral: precision = 120/165 = 0.73, recall = 120/470 = 0.26, F1 = 0.38
  • Negative: precision = 95/125 = 0.76, recall = 95/135 = 0.70, F1 = 0.73
  • Macro-average: precision = 0.57, recall = 0.57, F1 = 0.57
  • Micro-average: precision = 0.75, recall = 0.61, F1 = 0.67

The classifier is strongly biased toward predicting pos (it over-fires), which hurts its precision on that class. Macro-averaging exposes the per-class imbalance; micro-averaging hides it.

Devsets and Cross-Validation

Before evaluating on the test set, you need a separate “dress rehearsal” set for tuning — otherwise every hyperparameter choice leaks information from the test set into the model.

Three-way split

┌──────────────┬──────────────────┬──────────┐
│ Training set │ Development set  │ Test set │
│              │    ("devset")    │          │
└──────────────┴──────────────────┴──────────┘
  • Training set: fit parameters (counts, priors, smoothed likelihoods).
  • Development set (devset): tune hyperparameters — the smoothing constant, feature choices, thresholds, whether to use binary or multinomial NB.
  • Test set: report the final number. Used once, at the very end.

Train on training, tune on devset, report on test. The devset prevents overfitting to the test set — if you keep rerunning on the test set and tweaking until the number improves, you’ve leaked test-set information into your model and your reported score overestimates real-world performance.

The paradox: you want as much data as possible for training and as much as possible for the devset. Every example you put in one is an example not in the other. Cross-validation resolves this.

k-fold cross-validation

Instead of a fixed devset, rotate which slice of the data acts as devset.

Iter 1: [ Dev │        Training                   ]
Iter 2: [     │ Dev │  Training                   ]
Iter 3: [           │ Dev │      Training         ]
 …                                                  → Test set (untouched)
Iter k: [                  Training          │ Dev ]

Split the non-test data into k equal folds (typically k = 10). Run k training iterations; in each, use one fold as the devset and the remaining k − 1 folds as training. Pool the dev-set performance scores across folds to get a more stable estimate. The test set stays untouched across all folds — it’s only used at the end.

Why pool: a single 90/10 split might get an easy (or hard) devset by chance. Averaging across 10 different devsets cancels out that variance. The trade-off is cost — k-fold is k times more expensive to train.
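A minimal sketch of the fold rotation in Python (`train_and_score` is a hypothetical callback standing in for whatever training-plus-dev-evaluation you run; it is not from the lab):

```python
import random

def k_fold_scores(data, k, train_and_score):
    """Rotate each of k folds through the devset role; return the k dev scores."""
    data = data[:]                 # shuffle a copy, not the caller's list
    random.shuffle(data)
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        dev   = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(train_and_score(train, dev))
    return scores

# Cross-validated estimate = mean of the k dev scores:
# scores = k_fold_scores(labelled_examples, k=10, train_and_score=my_trainer)
# print(sum(scores) / len(scores))
```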

Statistical Significance: Hypothesis Testing

Given classifiers A and B with F1 of 0.81 and 0.83 on the same test set, is B really better — or did it get lucky on this test set? Asking the question formally:

Setup: effect size and hypotheses

  • Let M(A, x) be classifier A’s performance on test set x under some metric (accuracy, F1, whatever), and M(B, x) likewise for B.
  • The effect size is the observed difference: δ(x) = M(A, x) − M(B, x).
  • Null hypothesis H₀: A is not better than B — the true effect is ≤ 0; any observed positive δ(x) is coincidence.
  • Alternative H₁: A is better than B, so δ(x) > 0.

We want to reject H₀. The question is: assuming H₀ is true, how likely is it that we would see a δ(x) as large as the one we observed?

The p-value

δ(x) is a random variable ranging over test sets x drawn from the underlying population. The p-value is the probability that, across those hypothetical test sets, we would see an effect size at least as large as the one actually observed — if the null hypothesis were true.

  • Very small p-value (say below 0.01 or 0.05) → the observed effect is unlikely under H₀; we reject H₀ and call the result statistically significant.
  • Not-small p-value → we cannot distinguish the observation from noise; we fail to reject H₀.

The threshold (0.05, 0.01) is a social convention, not a mathematical constant.

Parametric vs non-parametric tests

Parametric tests (e.g. the paired t-test) assume the differences follow a specific distribution, usually Gaussian. In NLP, the metrics are typically non-linear functions of counts (F1, BLEU, etc.) and not Gaussian-distributed. So NLP overwhelmingly uses non-parametric, sampling-based tests:

  1. Approximate randomization — randomly swap A’s and B’s outputs on each test item many times to see how often you get an effect as large as δ(x) by chance.
  2. Paired bootstrap — resample the test set with replacement to simulate many alternative test sets (see below).

Both are paired tests: they assume each test item has a pair of observations (one from each system on the same test item). Pairing is much more powerful than unpaired testing because it removes per-example variance.

The Paired Bootstrap Test

The bootstrap (Efron & Tibshirani, 1993) repeatedly draws a large number of samples with replacement from an original sample — each draw is a bootstrap sample. It works for any metric: accuracy, precision, recall, F1, BLEU, etc.

Intuition: build a distribution from one test set

You have one real test set x of size n. You can’t collect more test sets, but you can simulate new ones by resampling rows from x with replacement. Each simulated test set has the same size n but a different composition — some rows duplicated, others missing. Run the bootstrap b times (say b = 10,000) and you have a distribution of δ values, from which you can estimate how “accidental” the original δ(x) is.

A concrete example

Consider a baby test set of n = 10 documents, scored by accuracy. Each document falls into one of four outcomes: both A and B correct, only A correct, only B correct, or both wrong:

| doc        | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | % correct |
|------------|---|---|---|---|---|---|---|---|---|----|-----------|
| A correct? | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓  | 70%       |
| B correct? | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗  | 50%       |

On the real test set, A scores 70%, B scores 50%, and δ(x) = 0.70 − 0.50 = 0.20.

Now draw 10,000 bootstrap samples x⁽¹⁾, …, x⁽ᵇ⁾. Each one is 10 documents (columns) drawn uniformly at random with replacement from the 10 above. Compute A%, B%, and δ(x⁽ⁱ⁾) on each.

The subtlety: shift to zero mean

Naïvely, you might count: in what fraction of bootstrap samples is δ(x⁽ⁱ⁾) ≥ δ(x)? That’s wrong, because you drew the bootstrap samples from x, which is itself biased by δ(x) = 0.20 in favour of A. Under H₀ (A no better than B), the expected δ is 0, not 0.20. So you have to shift the observed bootstrap δ’s down by δ(x) to centre the distribution at zero, then ask how often the shifted value exceeds the original δ(x).

Equivalently: count how often δ(x⁽ⁱ⁾) ≥ 2·δ(x):

$$\text{p-value}(x) = \frac{1}{b} \sum_{i=1}^{b} \mathbb{1}\!\left[\,\delta(x^{(i)}) \ge 2\,\delta(x)\,\right]$$

If only 47 out of 10,000 bootstrap samples show δ(x⁽ⁱ⁾) ≥ 2·δ(x), then the p-value is 47 / 10,000 = 0.0047 — the observed 0.20 gap is unlikely to be an artefact of the particular test set, and we reject H₀: A is significantly better than B.

Algorithm (after Berg-Kirkpatrick et al., 2012)

function Bootstrap(test set x, num samples b) returns p-value(x)
    Calculate δ(x)                      # observed effect size
    s ← 0
    for i = 1 to b do
        Draw bootstrap sample x⁽ⁱ⁾ of size n:
            for j = 1 to n do
                Select an item of x uniformly at random (with replacement)
                and add it to x⁽ⁱ⁾
        Calculate δ(x⁽ⁱ⁾)
        if δ(x⁽ⁱ⁾) > 2·δ(x):          # "accidentally" large under H₀
            s ← s + 1
    p-value(x) ≈ s / b
    return p-value(x)
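A Python version of the pseudocode (a sketch: each test item carries both systems’ outcomes so the pairing is preserved; the example data re-uses the 10-document toy set above):

```python
import random

def paired_bootstrap(items, delta, b=10_000):
    """items: per-example records carrying both A's and B's output.
    delta: function mapping a sample of items to M(A) - M(B) on that sample.
    Returns the bootstrap p-value for "A is better than B"."""
    observed = delta(items)
    n = len(items)
    s = 0
    for _ in range(b):
        sample = [random.choice(items) for _ in range(n)]   # resample with replacement
        if delta(sample) >= 2 * observed:                    # "accidentally" large under H0
            s += 1
    return s / b

# Accuracy as the metric: each item is (A correct?, B correct?).
docs = [(1, 1), (1, 0), (1, 1), (0, 1), (1, 0),
        (0, 1), (1, 0), (1, 1), (0, 0), (1, 0)]
acc_delta = lambda xs: (sum(a for a, _ in xs) - sum(b for _, b in xs)) / len(xs)
print(paired_bootstrap(docs, acc_delta))
```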

Why it’s paired

Each bootstrap sample is a re-selection of whole test items, each of which carries both A’s and B’s output. This preserves the pairing — we never compare A on one set of items to B on a different set. Per-example variance (some examples are easy, some are hard) cancels out, making the test more powerful than comparing two separately-sampled distributions.

Pros

  • No distributional assumptions — works for F1, BLEU, ROUGE, any non-linear metric.
  • Works on small test sets where parametric tests misestimate significance.
  • Conceptually simple to implement: resample, recompute, count.

Active Recall