A corpus is a collection of text or speech used to train, evaluate, or study NLP systems; its composition determines what those systems can and cannot do.

Definition

A corpus (pl. corpora) is any structured collection of linguistic data — written text, transcribed speech, or both — assembled for analysis or model training. Every NLP system is shaped by the corpora it has been exposed to, which makes corpus selection a design decision with real consequences.

Dimensions of Variation

Corpora differ along several axes, each of which affects the models trained on them:

Language and variety. English NLP often defaults to Standard American English, but language is not monolithic. African American English (AAE), Nigerian English, Singaporean English, and dozens of other varieties have distinct phonology, morphology, and syntax. A model trained only on formal written Standard English will generalise poorly — or produce systematically biased errors — when applied to other varieties.

Code-switching. Multilingual speakers routinely mix languages within a conversation or even a sentence (“Por favor, pass the salt”). Corpora that strip code-switching out, or that are monolingual by construction, leave models unable to handle real-world multilingual text.
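As a toy illustration of why code-switched text needs per-token handling, a rule-based tagger over the example sentence might look like the sketch below. The tiny wordlists are hypothetical stand-ins for real lexicons or a trained language identifier:

```python
# Toy per-token language ID for a code-switched sentence.
# SPANISH and ENGLISH are illustrative wordlists, not real lexicons.
SPANISH = {"por", "favor", "la", "sal"}
ENGLISH = {"please", "pass", "the", "salt"}

def tag_tokens(sentence):
    """Tag each token as 'es', 'en', or 'unk' by wordlist lookup."""
    tags = []
    for token in sentence.lower().replace(",", "").split():
        if token in SPANISH:
            tags.append((token, "es"))
        elif token in ENGLISH:
            tags.append((token, "en"))
        else:
            tags.append((token, "unk"))
    return tags

print(tag_tokens("Por favor, pass the salt"))
# [('por', 'es'), ('favor', 'es'), ('pass', 'en'), ('the', 'en'), ('salt', 'en')]
```

A monolingual pipeline would force a single language label onto the whole sentence and mishandle half the tokens; per-token tagging is the minimum a code-switching-aware system needs.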

Genre and register. Newswire, social media, legal documents, medical records, and spoken transcripts have very different statistical properties. A sentiment analyser trained on movie reviews will not straightforwardly transfer to clinical notes.

Time period. Language changes. A corpus from the 1990s will underrepresent contemporary slang, new named entities, and shifting word meanings.

Author demographics. Who wrote the texts matters. If a corpus over-represents certain demographics (e.g., male, Western, highly educated authors), models trained on it will encode those perspectives and may perform worse on text from other groups.

Corpus Datasheets

Gebru et al. (2020) proposed datasheets for datasets: structured documentation recording how a corpus was collected, what it contains, its intended uses, and its known limitations. Bender & Friedman (2018) similarly argued for data statements, emphasising the importance of documenting speaker demographics and annotation practices.

The goal in both cases is to make the assumptions embedded in a corpus explicit, so that downstream users can reason about what biases and gaps they are inheriting.
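To make this concrete, a machine-readable documentation record could be sketched as a simple dataclass. The field names below are illustrative only; they are not the schema proposed in either paper:

```python
from dataclasses import dataclass, field

@dataclass
class CorpusDatasheet:
    """Illustrative corpus documentation record (hypothetical fields)."""
    name: str
    collection_method: str
    languages: list
    time_span: str
    intended_uses: list
    known_limitations: list = field(default_factory=list)

sheet = CorpusDatasheet(
    name="example-news-corpus",
    collection_method="crawl of publicly available newswire",
    languages=["en"],
    time_span="2015-2020",
    intended_uses=["language modeling", "NER evaluation"],
    known_limitations=["formal register only", "US/UK sources over-represented"],
)
print(sheet.known_limitations)
```

Shipping a record like this alongside the corpus lets a downstream user check, before training, whether the data's register, languages, and time span match their deployment setting.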

  • type-and-token — corpus size drives vocabulary growth via Heaps’ Law
  • tokenization — how raw corpus text is segmented into usable units
  • text-normalization — normalization choices interact with corpus conventions
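The type-and-token relationship mentioned above can be sketched in a few lines: count distinct word types against total tokens, and compare with Heaps' Law, V(N) = kN^β. The example text and the constants k and β below are illustrative ballpark values, not fitted to any real corpus:

```python
# Type/token counting and a Heaps' Law estimate V(N) = k * N**beta.
def types_and_tokens(text):
    """Return (number of types, number of tokens) for whitespace-split text."""
    tokens = text.lower().split()
    return len(set(tokens)), len(tokens)

def heaps_estimate(n_tokens, k=44.0, beta=0.49):
    # k and beta are typical ballpark values; real corpora vary widely.
    return k * n_tokens ** beta

text = "the cat sat on the mat and the dog sat on the log"
n_types, n_tokens = types_and_tokens(text)
print(n_types, n_tokens)  # 8 types, 13 tokens
```

On a toy text the estimate is meaningless, but on a growing corpus the same two functions show vocabulary size climbing sublinearly with token count, which is why larger corpora keep surfacing new word types.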

Active Recall