Maximum likelihood estimation picks parameter values that make the observed training data as probable as possible under the model. For the categorical distributions used throughout NLP, this reduces to “count and normalize.”

The Principle

Given a model with parameters $\theta$ and training data $D$, MLE picks the parameters that maximize the probability of $D$:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\, P(D \mid \theta)$$

“What parameter values would make me least surprised by the data I actually saw?” The data is fixed; we tune $\theta$. If the model assumes the examples $x_1, \dots, x_n$ are independent:

$$P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$$
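
A tiny worked instance (an invented two-word vocabulary, not from the module): suppose $D$ is the token sequence yes, yes, yes, no and the only parameter is $\theta = P(\text{yes})$. Then

$$P(D \mid \theta) = \theta^{3}(1-\theta), \qquad \frac{d}{d\theta}\log P(D \mid \theta) = \frac{3}{\theta} - \frac{1}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta} = \tfrac{3}{4},$$

which is just the relative frequency of yes — a preview of the counting result below.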

Why It Reduces to Counting

Most NLP models estimate categorical distributions — probabilities over a vocabulary, a class set, or a set of transitions. MLE for a categorical always has the same closed-form solution: relative frequencies.

Sketch. Suppose you’re estimating $\theta_w = P(w)$ for each word $w$ in vocabulary $V$, having observed counts $c(w)$ summing to $N$. The log-likelihood of the data is

$$\log P(D \mid \theta) = \sum_{w \in V} c(w) \log \theta_w.$$

Maximize under the simplex constraint $\sum_{w \in V} \theta_w = 1$ (Lagrange multiplier) and you get

$$\hat{\theta}_w = \frac{c(w)}{N}.$$

The relative-frequency estimator isn’t a rule of thumb — it is the MLE. Every “count this and divide by that” formula in the module is an instance of the same theorem.
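
As a concrete (if toy) illustration, here is count-and-normalize in Python; the corpus and function name are invented for this sketch:

```python
from collections import Counter

def mle_categorical(observations):
    """MLE of a categorical distribution: count and normalize (relative frequencies)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {outcome: count / total for outcome, count in counts.items()}

# Toy unigram estimate: c("the") = 3 out of N = 7 tokens, so theta_hat["the"] = 3/7.
tokens = ["the", "cat", "sat", "the", "cat", "the", "mat"]
theta_hat = mle_categorical(tokens)
print(theta_hat["the"])  # 0.42857...
```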

Examples Already in the Module

| Model | Parameter | MLE |
| --- | --- | --- |
| Unigram LM | $P(w)$ | $\frac{C(w)}{N}$ |
| Bigram LM | $P(w_n \mid w_{n-1})$ | $\frac{C(w_{n-1} w_n)}{C(w_{n-1})}$ |
| N-gram LM | $P(w_n \mid w_{n-N+1:n-1})$ | $\frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})}$ |
| Naive Bayes prior | $P(c)$ | $\frac{N_c}{N_{\text{doc}}}$ |
| Naive Bayes word likelihood | $P(w \mid c)$ | $\frac{\mathrm{count}(w, c)}{\sum_{w' \in V} \mathrm{count}(w', c)}$ |

Each is the MLE of a categorical. Conditional MLE (estimating a distribution given some context, like the previous word or the class) is just unconditional MLE done separately for each conditioning context — you partition the data by context, then count and normalize within each partition. That’s why the bigram formula has $C(w_{n-1})$ in the denominator: within the partition of bigrams starting with $w_{n-1}$, the MLE of $P(w_n \mid w_{n-1})$ is the relative frequency of $w_n$.
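
A sketch of conditional MLE as partition-then-normalize, again with an invented toy corpus:

```python
from collections import Counter, defaultdict

def bigram_mle(tokens):
    """Conditional MLE: partition bigram counts by the preceding word, normalize each partition."""
    counts = defaultdict(Counter)
    for prev, curr in zip(tokens, tokens[1:]):
        counts[prev][curr] += 1
    return {
        prev: {w: c / sum(nexts.values()) for w, c in nexts.items()}
        for prev, nexts in counts.items()
    }

tokens = "the cat sat on the mat".split()
p = bigram_mle(tokens)
print(p["the"])  # {'cat': 0.5, 'mat': 0.5} — relative frequencies within the "the" partition
```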

Log-Likelihood

Because $\log$ is monotonic, $\arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \log P(D \mid \theta)$. You always use the log version:

  • Numerical stability — a product of thousands of small probabilities underflows on any realistic corpus, while the sum of log-probabilities stays in a sane range (this is also why n-gram LMs and Naive Bayes decode in log space); see the sketch after this list.
  • Gradient-friendly — for models trained by gradient ascent (logistic regression, neural nets), log turns products into sums, so gradients decompose cleanly across examples.
  • Equivalent to cross-entropy minimization — maximizing log-likelihood of the data under a model is the same objective as minimizing the cross-entropy between the empirical data distribution and the model distribution. The “cross-entropy loss” used to train most modern NLP systems is literally negative log-likelihood.
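
A quick numerical illustration of the underflow point (the probabilities here are invented):

```python
import math

# 2,000 token probabilities around 1e-4: the raw product underflows to 0.0,
# while the sum of logs is an ordinary (large, negative) float.
probs = [1e-4] * 2000

product = 1.0
for p in probs:
    product *= p

log_likelihood = sum(math.log(p) for p in probs)

print(product)                        # 0.0 — underflow
print(log_likelihood)                 # ≈ -18420.7
print(-log_likelihood / len(probs))   # average negative log-likelihood = cross-entropy (in nats)
```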

The Zero-Probability Problem

MLE has a famous weakness: if an event never appears in training, MLE assigns it probability zero. In NLP, this is catastrophic because:

  • Sentence probabilities are products of conditional probabilities. One zero anywhere zeroes the whole sentence.
  • perplexity is undefined when any test-set probability is zero.
  • Even large training corpora leave most possible n-grams unseen (Heaps’ Law guarantees the vocabulary keeps growing). Zeros are the rule, not the exception.

MLE’s maxim — “don’t pretend you’ve seen what you haven’t” — is fair in the abstract but too harsh for finite training sets. The standard fix is smoothing: reserve some probability mass for unseen events. Smoothed estimators are no longer strictly MLE — they trade off training-data likelihood for held-out generalization. That trade-off is the whole game of statistical modelling.
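
A minimal sketch of the problem and the add-1 fix (vocabulary and counts invented; see smoothing for the real estimators):

```python
from collections import Counter

vocab = ["the", "cat", "sat", "dog"]           # "dog" never appears in training
train = ["the", "cat", "sat", "the", "cat"]
counts = Counter(train)
N, V = len(train), len(vocab)

mle     = {w: counts[w] / N for w in vocab}
add_one = {w: (counts[w] + 1) / (N + V) for w in vocab}

print(mle["dog"])      # 0.0 — any sentence containing "dog" gets probability zero
print(add_one["dog"])  # ≈ 0.111 — some probability mass reserved for the unseen word
```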

MLE vs MAP vs Fully Bayesian

Three ways to pick $\theta$, ordered by how much prior information they use:

  • MLE — use only the data: $\hat{\theta} = \arg\max_{\theta} P(D \mid \theta)$.
  • Maximum a posteriori (MAP) — combine data with a prior: $\hat{\theta} = \arg\max_{\theta} P(D \mid \theta)\,P(\theta)$. Add-1 (Laplace) smoothing is exactly MAP estimation of a categorical under a symmetric Dirichlet prior — as if you’d pre-observed one pseudo-count of every outcome (a short derivation follows this list). See bayes-rule for the inversion.
  • Fully Bayesian — integrate over $\theta$ instead of picking a point estimate. Expensive but principled; central to topic models, Bayesian deep learning, and Gaussian processes.
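
A quick sketch of the add-1 claim: run the same Lagrange argument as in the counting section, but on the posterior instead of the likelihood. With one pseudo-count per outcome, the posterior is proportional to $\prod_{w} \theta_w^{c(w)+1}$, so

$$\hat{\theta}^{\text{MAP}}_w = \frac{c(w) + 1}{N + |V|},$$

which is exactly the add-1 (Laplace) formula.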

For most of this module, MLE with smoothing is the default. The smoothing step is how MAP’s prior sneaks back into an otherwise likelihood-only estimator.

  • n-gram-language-models — MLE estimator for bigram, trigram, and higher-order probabilities
  • naive-bayes — MLE estimator for the class prior and per-class word likelihoods
  • smoothing — the standard fix for MLE’s zero-probability problem
  • perplexity — the metric that collapses when any MLE estimate is zero
  • bayes-rule — MAP estimation uses Bayes’ rule to introduce a prior over parameters

Active Recall