Types are distinct word forms in a vocabulary; tokens are individual occurrences in running text. Their relationship follows a predictable sub-linear growth law.

Definition

  • Token: a single instance of a word as it appears in a sequence of text. In “the cat sat on the mat”, the word “the” produces two tokens (it appears twice).
  • Type: a distinct word form, i.e., each unique element of the vocabulary. In the same sentence, “the” is only one type, regardless of how many times it appears.

The sentence “to be or not to be” contains 6 tokens but only 4 types: to, be, or, not (to and be each appear twice, but they still count as one type each).
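
The distinction is mechanical enough to show in a few lines. Below is a minimal sketch in Python, assuming a naive whitespace tokenizer (real tokenizers also handle punctuation, case folding, and so on):

```python
# Counting tokens vs. types with a naive whitespace tokenizer.
text = "to be or not to be"

tokens = text.split()   # every occurrence counts: 6 tokens
types = set(tokens)     # only distinct word forms count: 4 types

print(len(tokens), tokens)         # 6 ['to', 'be', 'or', 'not', 'to', 'be']
print(len(types), sorted(types))   # 4 ['be', 'not', 'or', 'to']
```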

This distinction matters throughout NLP because model size, data requirements, and statistical estimates all depend on which count you mean.

Heaps’ Law (Herdan’s Law)

As a corpus grows, new word types keep appearing, but the rate slows. This relationship is captured by Heaps’ Law:

  |V| = k · N^β

where N is the number of tokens, |V| is the vocabulary size, and the constants k and β are determined empirically:

  • β: typically around 0.67–0.75 for English

The sub-linear exponent (β < 1) means vocabulary grows much more slowly than corpus size: doubling the corpus does not double the vocabulary. Church & Gale (1990) found approximately 44,000 types in the first million tokens of a Wall Street Journal corpus.
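
To make the numbers concrete, here is a short sketch of what the law predicts. The constants are assumptions chosen for illustration, not fitted values: β = 0.7 sits in the typical English range, and k ≈ 2.8 is back-solved so the curve passes near the Church & Gale figure of ~44,000 types at one million tokens.

```python
# Heaps' Law sketch: |V| = k * N^beta with illustrative constants
# (beta = 0.7 assumed; k = 2.8 back-solved from ~44,000 types at 1M tokens).

def heaps_vocab(n_tokens: int, k: float = 2.8, beta: float = 0.7) -> float:
    """Predicted vocabulary size for a corpus of n_tokens tokens."""
    return k * n_tokens ** beta

for n in (100_000, 1_000_000, 2_000_000, 10_000_000):
    print(f"N = {n:>10,} tokens  ->  |V| ~ {heaps_vocab(n):>9,.0f} types")

# Doubling the corpus multiplies the vocabulary by only 2^0.7 ~ 1.62, not 2.
print(f"growth per doubling: {2 ** 0.7:.2f}x")
```

Running this shows roughly 44,000 types at 1M tokens growing to only about 72,000 at 2M: the corpus doubled, but the vocabulary grew by a factor of ~1.62.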

Implications

  1. Open-vocabulary problem: no matter how large your training corpus, you will encounter unseen words at test time (see the sketch after this list). Heaps’ Law quantifies how quickly the rate of new types falls off; because β > 0, it never reaches zero.
  2. Vocabulary cap decisions: choosing a fixed vocabulary size for a model (e.g., roughly 50,000 entries in many LLMs) is a design tradeoff: a larger vocabulary captures more types but increases parameter count, since each entry needs its own embedding row.
  3. Subword tokenization motivation: rather than treating rare word types as unknown, modern systems decompose them into subword units (see subword-tokenization), which sidesteps the open-vocabulary problem.
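
A tiny sketch of the first implication, using made-up train and test sentences (assumptions for illustration only): any type absent from the training vocabulary surfaces as out-of-vocabulary (OOV) at test time.

```python
# Out-of-vocabulary (OOV) rate: test tokens whose type was never seen in training.
train_tokens = "the cat sat on the mat".split()
test_tokens = "the dog sat on the rug".split()

vocab = set(train_tokens)   # types observed during training

oov = [t for t in test_tokens if t not in vocab]
print(f"OOV tokens: {oov}, rate: {len(oov) / len(test_tokens):.0%}")
# OOV tokens: ['dog', 'rug'], rate: 33%
```
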
Related

  • corpora — corpus size is what drives Heaps’ Law
  • tokenization — tokenization determines what counts as a token
  • subword-tokenization — one practical response to the open-vocabulary problem

Active Recall