A language model does exactly one thing: it assigns probabilities to sequences of words. Everything else — generation, autocomplete, spell-check, translation rescoring, ChatGPT — is built on top of that one primitive. Given a sequence $w_1, \dots, w_n$, an LM tells you the joint probability $P(w_1, \dots, w_n)$, or equivalently the conditional $P(w_t \mid w_1, \dots, w_{t-1})$, from which the joint can be recovered by the chain rule. The remarkable thing is how much you get from such a narrow primitive.
The definition
A language model assigns probabilities to sequences. Given a vocabulary $V$ and a sequence $w_1, w_2, \dots, w_n$ with each $w_i \in V$, the model defines the joint probability:

$$P(w_1, w_2, \dots, w_n)$$
The closely related (and more useful) task is the next-word distribution:

$$P(w_t \mid w_1, \dots, w_{t-1})$$
These two are linked by the chain rule of probability:

$$P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})$$
So an LM that can do either task can do the other. Almost every concrete LM (n-gram, RNN, GPT) directly outputs the conditional next-word distribution; the joint is computed by multiplying them along the sequence.
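To see the chain rule operationally, here is a minimal Python sketch; the table of prefix-conditioned probabilities is invented purely for illustration and stands in for a real model's output:

```python
import math

# Conditionals P(w_t | w_1..w_{t-1}), keyed by the full prefix.
# These numbers are invented purely for illustration.
cond = {
    ((), "the"): 0.5,
    (("the",), "cat"): 0.2,
    (("the", "cat"), "sat"): 0.3,
}

def joint_log_prob(words):
    """log P(w_1..w_n) = sum_t log P(w_t | w_1..w_{t-1})  (chain rule)."""
    return sum(math.log(cond[(tuple(words[:t]), w)])
               for t, w in enumerate(words))

print(math.exp(joint_log_prob(["the", "cat", "sat"])))  # 0.5 * 0.2 * 0.3 = 0.03
```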
ASIDE — How big is the joint, naively?
Suppose the vocabulary has $|V|$ words and sequences have length $n$. The joint distribution is a table over $|V|^n$ entries; for realistic values there aren’t enough atoms in the observable universe to store one probability per entry. So we cannot just enumerate it — we must exploit structure (the chain rule) and impose simplifying assumptions (Markov, neural compression). This combinatorial blowup is the motivation for everything in the LM literature.
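A back-of-envelope check in Python; the vocabulary size and sequence length here are illustrative choices, not values from the text:

```python
# Size of the naive joint table, |V|**n, versus the ~1e80 atoms
# in the observable universe. V and n are illustrative.
V, n = 100_000, 20
entries = V ** n
print(f"{entries:.3e}")   # 1.000e+100
print(entries > 10**80)   # True: more table cells than atoms
```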
Two flavours: autoregressive vs. masked
Language models split by what context they’re allowed to use to predict a word.
| Flavour | Predicts | Context | Generation? | Examples |
|---|---|---|---|---|
| Autoregressive (causal) | $P(w_t \mid w_{<t})$ | left only | yes — sample next token, append, repeat | n-gram, RNN, LSTM, GPT |
| Masked | $P(w_t \mid w_{<t}, w_{>t})$ | left + right | no (not directly) | BERT, ELMo |
- Autoregressive models predict the next word from past words only. This makes them naturally generative: pick a starting context, sample the next word, append it, and repeat (see the rollout sketch after this list). ChatGPT, autocomplete, machine translation decoders all use autoregressive LMs.
- Masked language models predict a word from words on both sides — left and right context. They cannot generate by left-to-right rollout because each prediction needs future context that doesn’t exist yet during generation. But they produce excellent contextualized representations of words, which is what BERT-style models are used for: feed embeddings into a classifier, retrieval system, etc.
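Here is a minimal autoregressive rollout in Python. The next-word distribution is stubbed with a toy table (invented for this example) so the loop actually runs; any real LM would replace the stub:

```python
import random

# Toy next-word distributions keyed by context, purely for illustration.
toy = {
    (): {"the": 1.0},
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 1.0},
    ("the", "dog"): {"ran": 1.0},
}

def next_word_probs(context):
    """Stand-in for a real LM's P(w_t | w_1..w_{t-1})."""
    return toy.get(tuple(context), {"</s>": 1.0})

def generate(max_len=10):
    context = []
    for _ in range(max_len):
        dist = next_word_probs(context)
        words, probs = zip(*dist.items())
        w = random.choices(words, weights=probs)[0]  # sample the next word
        if w == "</s>":                              # end-of-sequence symbol
            break
        context.append(w)                            # append and repeat
    return " ".join(context)

print(generate())  # e.g. "the cat sat" or "the dog ran"
```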
The Markov assumption underlying the n-gram model below applies to the autoregressive case.
Where language models are used
The narrowness of the LM primitive (just probabilities over sequences) is deceptive — once you have one, an enormous range of applications opens up:
- Autocomplete / next-word prediction. Phone keyboards, IDE code completion. Sample (or argmax) from $P(w_t \mid w_1, \dots, w_{t-1})$.
- Spelling and grammar correction. Score competing candidate sentences and pick the highest-probability one. “OpenAI is great. I love there products” gets corrected because $P(\text{their}) > P(\text{there})$ in context (see the scoring sketch after this list).
- Machine translation, speech recognition. The acoustic/translation model proposes candidates; the LM rescores them by fluency in the target language. This is why training a better LM improves an unrelated task — the LM is reused as a fluency scorer.
- Chatbots / question answering. Modern LLMs reframe everything as text completion. Prompt: “Q: What is the capital of the UAE? A:” — sampling the continuation gives “Abu Dhabi”. The chatbot is just an autoregressive LM continuing a structured prompt.
- Author identification. Train a separate LM on each candidate author’s writings; for a disputed text, the author whose LM assigns it the lowest perplexity (highest probability) is the most likely author.
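To make the rescoring idea concrete, here is a sketch that scores the two spelling candidates from above. The bigram probabilities are invented for illustration; a real LM would replace the table:

```python
import math

# Toy bigram probabilities P(w_t | w_{t-1}), invented for illustration.
# "<s>" marks the start of the sentence.
bigram = {
    ("<s>", "i"): 0.2, ("i", "love"): 0.1,
    ("love", "their"): 0.03, ("love", "there"): 0.002,
    ("their", "products"): 0.2, ("there", "products"): 0.001,
}

def lm_log_prob(sentence):
    """Stand-in for a real LM's log P(sentence), here a bigram product."""
    words = ["<s>"] + sentence.lower().split()
    return sum(math.log(bigram.get(pair, 1e-9))
               for pair in zip(words, words[1:]))

candidates = ["I love their products", "I love there products"]
print(max(candidates, key=lm_log_prob))  # -> "I love their products"
```

The same score-and-compare pattern covers translation rescoring and author identification: swap in a different candidate set and a different LM.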
TIP — The "general-purpose task solver" framing
A modern LLM does not have a separate module for translation, summarisation, code generation, etc. They all reduce to: “given this prompt as text, what is the most likely continuation as text?” The training objective (next-word prediction) is unchanged from a 1990s n-gram model — what changed is the architecture (Transformer), the data scale (trillions of tokens), and post-training alignment (instruction tuning, RLHF). The probabilistic-sequence-modelling frame is the through-line.
The taxonomy of language models in this module
The module covers a chronological progression of LMs. Each fixes a problem with the previous one:
| Era | Family | Example | Fixes |
|---|---|---|---|
| 1990s | Statistical | n-gram | first practical LM; counts in a big table |
| 2003 | Feed-forward neural | Bengio et al. NPLM | learned word embeddings → handles unseen contexts |
| 2010 | Recurrent neural | RNN | unbounded context, sequential processing |
| 2014 | Gated recurrent | LSTM, stacked LSTM | long-range dependencies via gating |
| 2018 | Bi-directional | ELMo | contextualized embeddings |
| 2017–now | Transformer-based | BERT, GPT family | parallel training, very long context, attention |
| 2020+ | Large language models | GPT-3, GPT-4, Claude | scale + instruction tuning + RLHF unlock general-purpose problem solving |
Each row inherits the same probabilistic framing — $P(w_1, \dots, w_n)$ or $P(w_t \mid w_1, \dots, w_{t-1})$ — and changes the parametric form used to compute it.
Connections
- Underpins n-gram-language-model — the simplest concrete LM, derived from the chain rule plus a Markov assumption.
- Evaluated by perplexity — the standard intrinsic metric for any LM regardless of architecture.
- Generation requires decoding-strategies — sampling/beam/greedy methods to turn the next-word distribution into actual text.
- Modern LMs use word embeddings — dense vectors that solve the “synonymy / unseen context” problem n-grams suffer from.
- Implemented by RNNs, LSTMs, and Transformer-based architectures (covered in week 11).