The first practical language model, and the right place to start because every later neural LM solves problems n-grams have. The recipe: take the chain rule of probability, replace the full history with the last $n-1$ words (the Markov assumption), and estimate each conditional by counting in a corpus and dividing. The reason it works at all is that local word context carries most of the predictive signal. The reason it fails is that “local” means at most four or so words of context in practice, the conditional table grows as $V^n$, and any combination you didn’t see at training time gets probability zero.

The chain rule + Markov assumption

The chain rule of probability, applied to a sentence $w_1 w_2 \dots w_T$:

$$P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$

This is exact, but unhelpful — the conditional $P(w_t \mid w_1, \dots, w_{t-1})$ requires knowing the joint over arbitrary-length histories, which is the exact intractability we wanted to escape.

The Markov assumption truncates: assume each word depends only on the previous $n-1$ words:

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$$

For example, a bigram model (memory of 1) approximates:

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-1})$$

So the joint becomes:

$$P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-1})$$

| Name | $n$ | Conditional |
|---|---|---|
| Unigram | 1 | $P(w_t)$ |
| Bigram | 2 | $P(w_t \mid w_{t-1})$ |
| Trigram | 3 | $P(w_t \mid w_{t-2}, w_{t-1})$ |
| 4-gram | 4 | $P(w_t \mid w_{t-3}, \dots, w_{t-1})$ |
| 5-gram | 5 | $P(w_t \mid w_{t-4}, \dots, w_{t-1})$ |
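
As an illustration (the sentence is a made-up example, not from the lecture), the bigram and trigram factorisations of “the cat sat” are:

$$P(\text{the cat sat}) \approx P(\text{the})\, P(\text{cat} \mid \text{the})\, P(\text{sat} \mid \text{cat}) \quad \text{(bigram)}$$

$$P(\text{the cat sat}) \approx P(\text{the})\, P(\text{cat} \mid \text{the})\, P(\text{sat} \mid \text{the}, \text{cat}) \quad \text{(trigram)}$$

(Sentence-boundary tokens, introduced below, would add a start factor and an end factor to each.)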

TIP — Why the Markov assumption is plausible

Most predictive signal for the next word is short-range: “once upon a” → “time” needs three words of context, not three hundred. The assumption is that beyond a few words back, additional context contributes diminishing returns. It’s wrong in important cases (“The soups that I made from the new cookbook I bought yesterday were delicious” — subject–verb agreement spans 12 words), but it’s a starting point. Neural LMs exist precisely to relax it.

Estimating probabilities: MLE by counting

The maximum-likelihood estimate of a bigram probability is:

$$P(w_2 \mid w_1) = \frac{C(w_1\, w_2)}{C(w_1)}$$

Numerator: how many times $w_1$ was followed by $w_2$ in the corpus. Denominator: how many times $w_1$ appeared at all (equivalently, the row sum of the bigram count table). For unigrams, just $P(w) = C(w)/N$, where $N$ is the total token count.

That’s it. No gradient descent, no parameters in the neural sense — just count and divide. Training an n-gram LM is a single pass over the corpus building the count tables. The “model” is the count tables.
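
A minimal counting sketch in Python (the toy corpus and variable names are hypothetical, purely for illustration): it pads each sentence with <SOS>/<EOS>, builds both count tables in a single pass, and computes the MLE conditional by dividing.

```python
from collections import defaultdict

SOS, EOS = "<SOS>", "<EOS>"

def train_bigram_lm(sentences):
    """One pass over the corpus: build unigram and bigram count tables."""
    unigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sentence in sentences:
        tokens = [SOS] + sentence.split() + [EOS]
        for prev, curr in zip(tokens, tokens[1:]):
            unigram_counts[prev] += 1          # count of prev as a left context
            bigram_counts[(prev, curr)] += 1   # count of the pair (prev, curr)
    return unigram_counts, bigram_counts

def bigram_prob(unigram_counts, bigram_counts, prev, curr):
    """MLE estimate P(curr | prev) = C(prev curr) / C(prev); zero if unseen."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

# Hypothetical toy corpus, not the Berkeley Restaurant data.
corpus = ["i want to eat chinese food", "i want to spend less"]
uni, bi = train_bigram_lm(corpus)
print(bigram_prob(uni, bi, "want", "to"))  # 1.0 — "want" is always followed by "to" here
```

The “model” really is just the two dictionaries; there is nothing to optimise after the counting pass.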

ASIDE — Sentence boundary tokens

A bigram LM needs to know how sentences begin and end, since the first word has no left context and the last has no right neighbour. The standard trick: prepend <SOS> (start-of-sentence) and append <EOS> (end-of-sentence) to every training sentence. Then $P(w_1 \mid \text{<SOS>})$ and $P(\text{<EOS>} \mid w_T)$ are well-defined, and the joint over a full sentence picks up multiplicative factors for “this is a plausible way to start” and “this is a plausible way to end.” Without <EOS> the model has no notion of when to stop generating.

Worked example: the Berkeley Restaurant corpus

A classic teaching dataset (recorded restaurant phone orders) gives raw bigram counts like:

| $C(w_1\,w_2)$ | i | want | to | eat | chinese | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| i | 5 | 827 | 0 | 9 | 0 | 0 | 0 | 2 |
| want | 2 | 0 | 608 | 1 | 6 | 6 | 5 | 1 |
| to | 2 | 0 | 4 | 686 | 2 | 0 | 6 | 211 |
| eat | 0 | 0 | 2 | 0 | 16 | 2 | 42 | 0 |

Dividing by row sums (the unigram counts) gives $P(w_2 \mid w_1)$:

| $P(w_2 \mid w_1)$ | i | want | to | eat | chinese | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| i | 0.002 | 0.33 | 0 | 0.0036 | 0 | 0 | 0 | 0.00079 |
| want | 0.0022 | 0 | 0.66 | 0.0011 | 0.0065 | 0.0065 | 0.0054 | 0.0011 |
| to | 0.00083 | 0 | 0.0017 | 0.28 | 0.00083 | 0 | 0.0025 | 0.087 |
| eat | 0 | 0 | 0.0027 | 0 | 0.021 | 0.0027 | 0.056 | 0 |

You can read off useful patterns: $P(\text{to} \mid \text{want}) = 0.66$ (high — “want to” is common); $P(\text{eat} \mid \text{to}) = 0.28$ (high — “to eat” is common); $P(\text{food} \mid \text{to}) = 0$ (sensibly low — “to food” is ungrammatical). Sentence probability comes from chaining bigram factors:

$$P(\text{<SOS>}\, w_1 \dots w_T\, \text{<EOS>}) = P(w_1 \mid \text{<SOS>}) \left[\prod_{t=2}^{T} P(w_t \mid w_{t-1})\right] P(\text{<EOS>} \mid w_T)$$
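
A minimal scoring sketch in Python: the only probabilities used are the three read off the truncated table above, and the boundary factors are omitted because they are not shown in the excerpt.

```python
import math

# Bigram probabilities read off the (truncated) Berkeley Restaurant table above.
bigram_probs = {
    ("i", "want"): 0.33,
    ("want", "to"): 0.66,
    ("to", "eat"): 0.28,
}

def score_fragment(tokens, probs):
    """Chain bigram factors in log space to avoid underflow on long sentences."""
    log_p = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        p = probs.get((prev, curr), 0.0)
        if p == 0.0:
            return float("-inf")  # the zeros problem, discussed below
        log_p += math.log(p)
    return log_p

log_p = score_fragment(["i", "want", "to", "eat"], bigram_probs)
print(math.exp(log_p))  # ≈ 0.33 * 0.66 * 0.28 ≈ 0.061
```

Working in log space matters in practice: chaining dozens of factors below 1 underflows double precision quickly.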

The size problem: why we stop at 5-grams

For vocabulary size $V$, an $n$-gram model has up to $V^n$ entries in its conditional table:

| $n$ | Table size |
|---|---|
| 1 | $V$ (trivial) |
| 2 | $V^2$ (feasible) |
| 3 | $V^3$ (very large) |
| 4 | $V^4$ (extreme) |
| 5 | $V^5$ (enormous; even English-only, a single machine struggles) |
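
To make the growth concrete (the vocabulary size here is a hypothetical round number, not a figure from the lecture): with $V = 10^5$, the bigram table has up to $10^{10}$ cells, the trigram table $10^{15}$, and the 5-gram table $10^{25}$, vastly more cells than any corpus could populate or any machine could store densely.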

Empirically, the lecturer notes that at Microsoft Research circa 2008, the production LMs were 5-grams for English (the largest language with enough data) and 4-grams for everything else. Training data and storage become limiting before modelling power does. Recent work (e.g., infini-grams, Liu et al. 2024) sidesteps this by storing the corpus in a suffix array instead of precomputing counts — letting you compute $n$-gram probabilities for any $n$ at lookup time over 5 trillion words of web text. Still, the fundamental scaling problem is a key motivation for neural language models that compress the next-word distribution into a (much smaller) parametric model.

The zeros problem

For any $n \ge 2$, most syntactically valid n-grams will not appear in the training corpus simply because there are too many. A bigram count of zero forces the conditional probability to zero, which forces the entire sentence probability to zero whenever such a bigram appears (since the joint is a product). This wrecks both:

  • Generation — words with zero probability are never sampled, so legal continuations are unreachable.
  • Evaluation — perplexity involves the inverse probability of the test set, $P(w_{1:N})^{-1/N}$, so a zero anywhere in the test set gives perplexity infinity.

The breakfast example

Training corpus contains “ate lunch”, “ate dinner”, “ate a”, “ate the” but never “ate breakfast”. Test corpus contains “… ate breakfast”. The bigram model says $P(\text{breakfast} \mid \text{ate}) = 0$, so any test sentence containing this bigram has probability zero, and perplexity blows up. Yet “ate breakfast” is plainly grammatical English — the corpus just didn’t happen to include it.

The classical fix is smoothing (add-one / Laplace, Kneser-Ney, etc.) — redistribute a small probability mass to unseen n-grams. Smoothing is a substantial subfield in itself, with whole papers per technique. The deeper fix, however, is to abandon discrete counting and represent words by dense vectors: if “lunch” and “breakfast” have similar vectors, then $P(\text{breakfast} \mid \text{ate})$ can be high even without the exact bigram appearing in training. This is the n-gram → neural-LM transition.
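
A minimal add-one (Laplace) smoothing sketch, reusing the hypothetical defaultdict count tables from the counting sketch above; the key change is that every conditional gets a nonzero floor of $1/(C(w_1) + V)$.

```python
def laplace_bigram_prob(unigram_counts, bigram_counts, vocab_size, prev, curr):
    """Add-one smoothing: P(curr | prev) = (C(prev curr) + 1) / (C(prev) + V).
    Unseen bigrams get a small but nonzero probability instead of exactly 0."""
    return (bigram_counts[(prev, curr)] + 1) / (unigram_counts[prev] + vocab_size)

V = len(uni)  # rough vocabulary size of the toy corpus
print(laplace_bigram_prob(uni, bi, V, "eat", "breakfast"))  # > 0 despite never being seen
```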

Generating from an n-gram LM (the Shannon visualisation)

Given a trained model, you can sample sequences. For a bigram LM:

  1. Start with <SOS>.
  2. Sample $w_1 \sim P(\cdot \mid \text{<SOS>})$.
  3. Sample $w_2 \sim P(\cdot \mid w_1)$.
  4. Continue until you sample <EOS>.
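
A minimal sampling sketch in Python, assuming the `uni`/`bi` count tables and `SOS`/`EOS` constants from the counting sketch above; `random.choices` does the weighted draw.

```python
import random

def sample_sentence(unigram_counts, bigram_counts, max_len=30):
    """Shannon-style generation: repeatedly sample the next word from P(. | prev)."""
    tokens = [SOS]
    while len(tokens) < max_len:
        prev = tokens[-1]
        # Candidate next words and their conditional probabilities P(w | prev).
        # (Scanning every bigram each step is slow but keeps the sketch simple.)
        candidates = [(w2, c / unigram_counts[prev])
                      for (w1, w2), c in bigram_counts.items() if w1 == prev]
        if not candidates:
            break
        words, probs = zip(*candidates)
        nxt = random.choices(words, weights=probs, k=1)[0]
        if nxt == EOS:
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])

print(sample_sentence(uni, bi))  # e.g. "i want to eat chinese food"
```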

The sample quality depends sharply on $n$ and on the training data. Trained on Shakespeare:

| $n$ | Sample |
|---|---|
| 1 | To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have |
| 2 | Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow. |
| 3 | Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ‘tis done. |
| 4 | King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in; |

Higher $n$ → more locally fluent but increasingly verbatim from training (memorisation, not learning). The lecturer notes that the corpus also fingerprints the style: trained on the Wall Street Journal, the same setup produces financial-sounding gibberish (“They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest”). A side effect of an LM is author identification: train one LM per candidate author, score the disputed text under each, and the lowest-perplexity model wins.

For sampling mechanics (turning a probability distribution into an actual word draw), see decoding-strategies.

Two structural problems with n-grams (the real motivation for neural LMs)

The size problem and zeros problem are practical. Two deeper, structural shortcomings drive the neural transition:

  1. No long-distance dependencies. Because $n \le 5$ in practice, the model cannot enforce constraints that span more than 5 words. Subject–verb agreement across a long relative clause (“The soups that I made from the new cookbook I bought yesterday were amazing”) is invisible to a 5-gram. Increasing $n$ further is precluded by the size problem.
  2. No notion of similarity between words. Each word is a discrete, opaque token — cat is no more similar to dog than to Wednesday. So learning that “the dog ran” is a likely sentence teaches the model nothing about “the cat ran” or “the puppy ran”. Every n-gram must be observed for itself; there’s no generalisation across semantically similar contexts.

Both problems dissolve once you represent words as dense vectors in $\mathbb{R}^d$. Vectors automatically encode similarity (close vectors = similar words), and the matrix multiplications + non-linearities used by neural LMs to combine them produce smooth, never-zero probability distributions over the next word. This is the bridge to the neural era.

Connections

  • Foundational case of language-model — the simplest concrete LM, useful as the conceptual baseline against which all later models are compared.
  • Motivates word embeddings — needed to fix the zeros and synonymy problems.
  • Motivates RNN-based LMs — needed for unbounded context and long-range dependencies.
  • Evaluated by perplexity — the standard intrinsic metric, applied identically to n-gram and neural LMs.
  • Generation uses decoding-strategies — the sampling/greedy/beam machinery applies to any autoregressive LM.

Common pitfalls

  • Forgetting the <EOS> token — without it, the LM cannot decide when to stop generating; sentence probabilities are also miscalibrated because they don’t include the “this is a complete sentence” factor.
  • Confusing chain rule with Markov assumption — the chain rule is exact and decomposes the joint into conditionals. The Markov assumption is the additional simplification (truncating each conditional to a finite history) that makes the model tractable. They are independent ideas; you need both.
  • Believing higher $n$ is always better — past a point, larger $n$ overfits to the training corpus (memorising verbatim n-grams) rather than learning generalisable structure. The benefit also flattens: trigrams are dramatically better than bigrams; 5-grams are only marginally better than 4-grams on most tasks (compare the Shakespeare samples above).