Evaluation methodology determines how we measure model quality; the core choices are extrinsic vs intrinsic evaluation and the three-way train/dev/test data split.

Definition

Evaluation methodology is the set of decisions about how to measure whether a model is any good. Two orthogonal choices govern it:

  1. What to measure — extrinsic (task performance) vs intrinsic (model-level metric).
  2. What data to measure on — and how to split available data to get honest answers.

These choices apply to every model in the course, not just n-gram-language-models.

Extrinsic vs Intrinsic Evaluation

Extrinsic (in-vivo)

Embed the model in a real downstream task (machine translation, speech recognition, spam classification), run the task, and measure a task-specific score (e.g., the number of words translated correctly).

  • Advantage: directly measures what you care about.
  • Disadvantage: expensive, slow, task-specific. A model that excels at MT may fail at speech recognition.

Intrinsic (in-vitro)

Measure the model’s own quality directly, using a metric that does not require embedding it in an application. For language models, this metric is perplexity.

  • Advantage: fast, cheap, general — one number that applies to any language model.
  • Disadvantage: may not correlate perfectly with real-task performance. A model with low perplexity is not guaranteed to improve your downstream system.

In practice, intrinsic evaluation is used during development (fast iteration) and extrinsic evaluation is used for final validation (does it actually help?).
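To make the intrinsic case concrete, here is a minimal sketch of the perplexity computation: the exponential of the average negative log-probability the model assigns to each token of held-out text. The `log_probs` list interface is an assumption for illustration; real toolkits expose per-token probabilities in their own ways.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-probability per token.

    `log_probs` is a list of natural-log probabilities the model assigns
    to each token of a held-out text (a hypothetical interface chosen
    for illustration).
    """
    n = len(log_probs)
    avg_neg_ll = -sum(log_probs) / n   # average negative log-likelihood
    return math.exp(avg_neg_ll)

# A model that assigns probability 0.1 to each of 4 tokens has
# perplexity 10 -- as "surprised" as a uniform 10-way choice per token.
print(perplexity([math.log(0.1)] * 4))  # 10.0
```

Lower is better: a perfect model that assigned probability 1 to every token would have perplexity 1.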

The Data Split

Train/test split

The most basic split:

  • Training set: the data used to learn model parameters (e.g., count n-grams).
  • Test set: a held-out dataset the model has never seen, used to measure generalisation.

The test set must be different from the training set. If they overlap, the model has already memorised the test data and will assign it artificially high probability — making itself look better than it really is. This is called “training on the test set” and it is bad science.
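One cheap safeguard is to check for verbatim overlap before evaluating. A minimal sketch, assuming sentences are the unit of comparison; exact matching catches only the most obvious leakage (duplicated sentences), and near-duplicates need fuzzier methods:

```python
def leaked_test_items(train_sentences, test_sentences):
    """Return test sentences that also appear verbatim in training data.

    Exact-match overlap is only the crudest form of leakage, but
    catching it is cheap and worth doing before any evaluation.
    """
    train = set(train_sentences)
    return [s for s in test_sentences if s in train]

train = ["the cat sat on the mat", "language models are fun"]
test = ["language models are fun", "a genuinely unseen sentence"]
print(leaked_test_items(train, test))  # ['language models are fun']
```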

The three-way split (train/dev/test)

A two-way split is not enough. If you test on the test set, notice the result, adjust your model, and test again, you are implicitly tuning to the test set’s characteristics. After enough iterations you have effectively trained on it.

The fix: introduce a third dataset.

  Dataset             Purpose                                When used            How often
  Training            Learn model parameters                 During training      Once
  Development (dev)   Compare models, tune hyperparameters   During development   Many times
  Test                Final evaluation of the chosen model   At the very end      Once

The dev set absorbs all the iterative testing. The test set is reserved for one final evaluation so the reported number is honest.
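A minimal sketch of producing the three-way split, assuming the corpus is a list of items and using illustrative 80/10/10 proportions (real projects size the splits to the data they have and the precision their dev comparisons need):

```python
import random

def three_way_split(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve off dev and test slices.

    `items` might be sentences or whole documents; splitting at the
    document level avoids leaking near-duplicate sentences across splits.
    """
    rng = random.Random(seed)          # fixed seed: reproducible split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]  # everything else is training data
    return train, dev, test

train, dev, test = three_way_split(list(range(100)))
print(len(train), len(dev), len(test))  # 80 10 10
```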

COMMON MISCONCEPTION

The dev set and the test set serve different purposes even though both are “held out.” The dev set is your scratch paper — you can look at it as often as you like. The test set is the sealed exam — you open it once, report the score, and stop. Using the test set repeatedly during development turns it into a second dev set, which defeats its purpose.

Choosing Splits

All three datasets should be drawn from the same distribution:

  • Task-specific model (e.g., legal document classifier): train, dev, and test should all contain legal documents.
  • General-purpose model: train and test should both be diverse — different domains, authors, time periods, language varieties. Otherwise the model looks good on one narrow slice and fails on everything else.
Related Concepts

  • perplexity — the standard intrinsic metric introduced alongside this methodology
  • n-gram-language-models — the first model family where this evaluation framework is applied
  • corpora — corpus composition determines whether train/test splits are representative
