A language model does exactly one thing: it assigns probabilities to sequences of words. Everything else — generation, autocomplete, spell-check, translation rescoring, ChatGPT — is built on top of that one primitive. Given a sequence $w_1, w_2, \dots, w_n$, an LM tells you $P(w_1, w_2, \dots, w_n)$, or equivalently the conditional $P(w_i \mid w_1, \dots, w_{i-1})$, from which the joint can be recovered by the chain rule. The remarkable thing is how much you get from such a narrow primitive.

The definition

A language model assigns probabilities to sequences. Given a vocabulary $V$ and a sequence $w_1, w_2, \dots, w_n$ with each $w_i \in V$, the model assigns a probability to the whole sequence:

$$P(w_1, w_2, \dots, w_n)$$

The closely related (and more useful) task is the next-word distribution:

$$P(w_n \mid w_1, w_2, \dots, w_{n-1})$$

These two are linked by the chain rule of probability:

$$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

So an LM that can do either task can do the other. Almost every concrete LM (n-gram, RNN, GPT) directly outputs the conditional next-word distribution; the joint is computed by multiplying them along the sequence.
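
For example, for a three-word sequence the product unrolls as:

$$P(w_1, w_2, w_3) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2)$$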

ASIDE — How big is the joint, naively?

Suppose the vocabulary has $|V|$ words and sequences have length $n$. The joint distribution is then a table with $|V|^n$ entries; for realistic sizes (say $|V| \approx 50{,}000$ and $n = 20$, roughly $10^{94}$ entries) there aren’t enough atoms in the observable universe (around $10^{80}$) to store one probability per entry. So we cannot just enumerate it; we must exploit structure (the chain rule) and impose simplifying assumptions (Markov, neural compression). This combinatorial blowup is the motivation for everything in the LM literature.
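
A quick back-of-envelope check of those illustrative sizes:

```python
# Back-of-envelope check with the illustrative sizes above:
# a 50,000-word vocabulary and 20-token sequences.
V, n = 50_000, 20
entries = V ** n
print(f"{entries:.2e}")  # ~9.54e+93 table entries, vs. roughly 1e80 atoms in the observable universe
```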

Two flavours: autoregressive vs. masked

Language models split by what context they’re allowed to use to predict a word.

Flavour                 | Predicts    | Context      | Generation?                             | Examples
Autoregressive (causal) | next word   | left only    | yes — sample next token, append, repeat | n-gram, RNN, LSTM, GPT
Masked                  | masked word | left + right | no (not directly)                       | BERT, ELMo
  • Autoregressive models predict the next word from past words only. This makes them naturally generative: pick a starting context, sample the next word, append it, and repeat (a minimal sampling sketch follows below). ChatGPT, autocomplete, and machine translation decoders all use autoregressive LMs.
  • Masked language models predict a word from words on both sides — left and right context. They cannot generate by left-to-right rollout because each prediction needs future context that doesn’t exist yet during generation. But they produce excellent contextualized representations of words, which is what BERT-style models are used for: feed embeddings into a classifier, retrieval system, etc.

The Markov assumption underlying the n-gram model below applies to the autoregressive case.
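
A minimal sketch of that sample-append-repeat loop in Python, with a toy hand-written next-token table standing in for a trained model (the vocabulary and probabilities are made up for illustration):

```python
import random

# Toy next-token distributions, conditioned only on the previous token for brevity.
# A real autoregressive LM conditions on the full left context.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}

def generate(max_len=10):
    tokens = ["<s>"]
    while len(tokens) < max_len:
        probs = NEXT[tokens[-1]]                         # next-word distribution
        words, weights = zip(*probs.items())
        nxt = random.choices(words, weights=weights)[0]  # sample (greedy argmax also works)
        tokens.append(nxt)
        if nxt == "</s>":                                # stop at end-of-sequence
            break
    return tokens

print(" ".join(generate()))  # e.g. "<s> the cat sat </s>"
```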

Where language models are used

The narrowness of the LM primitive (just probabilities over sequences) is deceptive — once you have one, an enormous range of applications opens up:

  • Autocomplete / next-word prediction. Phone keyboards, IDE code completion. Sample (or argmax) from $P(w_{n+1} \mid w_1, \dots, w_n)$.
  • Spelling and grammar correction. Score competing candidate sentences and pick the highest-probability one. “OpenAI is great. I love there products” gets corrected because $P(\text{their}) > P(\text{there})$ in this context (see the scoring sketch after this list).
  • Machine translation, speech recognition. The acoustic/translation model proposes candidates; the LM rescores them by fluency in the target language. This is why training a better LM can improve a seemingly unrelated task: the LM is reused as a fluency scorer.
  • Chatbots / question answering. Modern LLMs reframe everything as text completion. Prompt: “Q: What is the capital of UAE? A:” — sampling the continuation gives “Abu Dhabi”. The chatbot is just an autoregressive LM continuing a structured prompt.
  • Author identification. Train a separate LM on each candidate author’s writings; the author whose LM assigns the disputed text the lowest perplexity is the most likely author.
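
A minimal sketch of LM rescoring for the spell-correction example above, using a toy bigram table with hand-picked log-probabilities in place of a trained LM (the values are made up for illustration):

```python
import math

# Toy bigram log-probabilities -- illustrative values only.
BIGRAM_LOGPROBS = {
    ("love", "their"):     math.log(0.010),
    ("love", "there"):     math.log(0.001),
    ("their", "products"): math.log(0.020),
    ("there", "products"): math.log(0.002),
}

def sentence_logprob(tokens, floor=math.log(1e-6)):
    """Sum bigram log-probabilities; unseen bigrams get a small floor value."""
    return sum(BIGRAM_LOGPROBS.get(pair, floor)
               for pair in zip(tokens, tokens[1:]))

candidates = [
    "I love there products".split(),
    "I love their products".split(),
]
best = max(candidates, key=sentence_logprob)
print(" ".join(best))  # -> "I love their products"
```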

TIP — The "general-purpose task solver" framing

A modern LLM does not have a separate module for translation, summarisation, code generation, etc. They all reduce to: “given this prompt as text, what is the most likely continuation as text?” The training objective (next-word prediction) is unchanged from a 1990s n-gram model — what changed is the architecture (Transformer), the data scale (trillions of tokens), and post-training alignment (instruction tuning, RLHF). The probabilistic-sequence-modelling frame is the through-line.

The taxonomy of language models in this module

The module covers a chronological progression of LMs. Each fixes a problem with the previous one:

Era      | Family                | Example              | Fixes
1990s    | Statistical           | n-gram               | first practical LM; counts in a big table
2003     | Feed-forward neural   | Bengio et al. NPLM   | learned word embeddings → handles unseen contexts
2010     | Recurrent neural      | RNN                  | unbounded context, sequential processing
2014     | Gated recurrent       | LSTM, stacked LSTM   | long-range dependencies via gating
2018     | Bi-directional        | ELMo                 | contextualized embeddings
2017–now | Transformer-based     | BERT, GPT family     | parallel training, very long context, attention
2020+    | Large language models | GPT-3, GPT-4, Claude | scale + instruction tuning + RLHF unlock general-purpose problem solving

Each row inherits the same probabilistic framing — $P(w_1, \dots, w_n)$ or $P(w_i \mid w_1, \dots, w_{i-1})$ — and changes the parametric form used to compute it.
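
To make "changes the parametric form" concrete, here is a simplified contrast (illustrative forms only): a count-based trigram model estimates the conditional from corpus counts, while a neural LM computes it with a softmax over a learned hidden state $h_{i-1}$:

$$P_{\text{trigram}}(w_i \mid w_{i-2}, w_{i-1}) = \frac{\mathrm{count}(w_{i-2}\, w_{i-1}\, w_i)}{\mathrm{count}(w_{i-2}\, w_{i-1})}
\qquad
P_{\text{neural}}(w_i \mid w_{<i}) = \mathrm{softmax}(W h_{i-1} + b)_{w_i}$$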

Connections

  • Underpins n-gram-language-model — the simplest concrete LM, derived from the chain rule plus a Markov assumption.
  • Evaluated by perplexity — the standard intrinsic metric for any LM regardless of architecture.
  • Generation requires decoding-strategies — sampling/beam/greedy methods to turn the next-word distribution into actual text.
  • Modern LMs use word embeddings — dense vectors that solve the “synonymy / unseen context” problem n-grams suffer from.
  • Implemented by RNNs, LSTMs, and Transformer-based architectures (covered in week 11).