Maximum likelihood estimation picks parameter values that make the observed training data as probable as possible under the model. For the categorical distributions used throughout NLP, this reduces to “count and normalize.”
## The Principle
Given a model with parameters $\theta$ and training data $D$, MLE picks the parameters $\hat{\theta}$ that maximize the probability of $D$:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(D; \theta)$$

“What parameter values would make me least surprised by the data I actually saw?” The data $D$ is fixed; we tune $\theta$. If the model assumes examples are independent:

$$P(D; \theta) = \prod_{i=1}^{N} P(x_i; \theta)$$
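A tiny worked instance, not from the module, just to make the principle concrete: flip a coin with unknown heads probability $\theta$ and observe $D = \text{H, H, T}$. Then

$$P(D; \theta) = \theta^2 (1 - \theta), \qquad \frac{d}{d\theta} \log P(D; \theta) = \frac{2}{\theta} - \frac{1}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{2}{3},$$

which is exactly count-and-normalize: 2 heads out of 3 flips.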
## Why It Reduces to Counting
Most NLP models estimate categorical distributions — probabilities over a vocabulary, a class set, or a set of transitions. MLE for a categorical always has the same closed-form solution: relative frequencies.
Sketch. Suppose you’re estimating $\theta_w = P(w)$ for each word $w$ in vocabulary $V$, having observed counts $c(w)$ summing to $N$. The log-likelihood of the data is

$$\ell(\theta) = \sum_{w \in V} c(w) \log \theta_w$$

Maximize $\ell(\theta)$ under the simplex constraint $\sum_{w \in V} \theta_w = 1$ (Lagrange multiplier) and you get

$$\hat{\theta}_w = \frac{c(w)}{N}$$
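For completeness, one way to fill in the Lagrange-multiplier step (same notation as the sketch, with $\lambda$ the multiplier on the simplex constraint):

$$\mathcal{L}(\theta, \lambda) = \sum_{w \in V} c(w) \log \theta_w + \lambda \Big(1 - \sum_{w \in V} \theta_w\Big), \qquad \frac{\partial \mathcal{L}}{\partial \theta_w} = \frac{c(w)}{\theta_w} - \lambda = 0 \;\Rightarrow\; \theta_w = \frac{c(w)}{\lambda}.$$

Summing over $w$ and using $\sum_{w} \theta_w = 1$ forces $\lambda = \sum_{w} c(w) = N$, hence $\hat{\theta}_w = c(w)/N$.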
The relative-frequency estimator isn’t a rule of thumb — it is the MLE. Every “count this and divide by that” formula in the module is an instance of the same theorem.
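A minimal code sketch of that theorem on a made-up toy corpus (the corpus and variable names are illustrative, not anything from the module):

```python
from collections import Counter

# Toy corpus (made up); in practice this is the tokenized training set.
tokens = "the cat sat on the mat the cat slept".split()

counts = Counter(tokens)       # c(w) for every word type
total = sum(counts.values())   # N, the total token count

# MLE of the unigram categorical: relative frequency c(w) / N.
p_mle = {w: c / total for w, c in counts.items()}

print(p_mle["the"])          # 3/9 = 0.333...
print(sum(p_mle.values()))   # 1.0, a proper probability distribution
```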
## Examples Already in the Module
| Model | Parameter | MLE |
|---|---|---|
| Unigram LM | $P(w)$ | $\frac{c(w)}{N}$ |
| Bigram LM | $P(w_i \mid w_{i-1})$ | $\frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$ |
| N-gram LM | $P(w_i \mid w_{i-n+1:i-1})$ | $\frac{c(w_{i-n+1:i})}{c(w_{i-n+1:i-1})}$ |
| Naive Bayes prior | $P(c)$ | $\frac{N_c}{N_{\text{doc}}}$ |
| Naive Bayes word likelihood | $P(w \mid c)$ | $\frac{\text{count}(w, c)}{\sum_{w' \in V} \text{count}(w', c)}$ |
Each is the MLE of a categorical. Conditional MLE (estimating a distribution given some context, like the previous word or the class) is just unconditional MLE done separately for each conditioning context — you partition the data by context, then count and normalize within each partition. That’s why the bigram formula has $c(w_{i-1})$ in the denominator: within the partition of bigrams starting with $w_{i-1}$, the MLE of $P(w_i \mid w_{i-1})$ is the relative frequency of $w_i$.
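A sketch of conditional MLE as partition-then-count-and-normalize, on made-up sentences (`<s>` is a start-of-sentence marker; everything here is illustrative):

```python
from collections import Counter, defaultdict

sentences = [["<s>", "the", "cat", "sat"],
             ["<s>", "the", "dog", "sat"],
             ["<s>", "a", "cat", "slept"]]

# Partition bigram counts by the conditioning context (the previous word).
bigram_counts = defaultdict(Counter)
for sent in sentences:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[prev][cur] += 1

# Within each partition, MLE is relative frequency: c(prev, cur) / c(prev).
p_bigram = {prev: {cur: n / sum(nexts.values()) for cur, n in nexts.items()}
            for prev, nexts in bigram_counts.items()}

print(p_bigram["the"])   # {'cat': 0.5, 'dog': 0.5}
print(p_bigram["<s>"])   # {'the': 0.666..., 'a': 0.333...}
```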
## Log-Likelihood
Because $\log$ is monotonic, $\arg\max_{\theta} P(D; \theta) = \arg\max_{\theta} \log P(D; \theta)$. You always use the log version:
- Numerical stability — $\prod_i P(x_i; \theta)$ underflows on any realistic corpus; the sum of log-probabilities $\sum_i \log P(x_i; \theta)$ stays in a sane range (this is also why n-gram LMs and Naive Bayes decode in log space).
- Gradient-friendly — for models trained by gradient ascent (logistic regression, neural nets), log turns products into sums, so gradients decompose cleanly across examples.
- Equivalent to cross-entropy minimization — maximizing log-likelihood of the data under a model is the same objective as minimizing the cross-entropy between the empirical data distribution and the model distribution. The “cross-entropy loss” used to train most modern NLP systems is literally negative log-likelihood.
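A quick numerical sketch of the stability point (the per-token probability is an invented round number, chosen only to show the underflow):

```python
import math

per_token_prob = 1e-5   # invented; a typical word probability is small
n_tokens = 100_000      # a modest corpus

prob = per_token_prob ** n_tokens               # product of probabilities
log_prob = n_tokens * math.log(per_token_prob)  # sum of log-probabilities

print(prob)       # 0.0, underflows double precision long before 100k tokens
print(log_prob)   # about -1151292.5, perfectly representable

# Average negative log-likelihood per token is the cross-entropy (in nats here).
print(-log_prob / n_tokens)   # 11.5129..., i.e. -log(1e-5)
```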
## The Zero-Probability Problem
MLE has a famous weakness: if an event never appears in training, MLE assigns it probability zero. In NLP, this is catastrophic because:
- Sentence probabilities are products of conditional probabilities. One zero anywhere zeroes the whole sentence.
- perplexity is undefined when any test-set probability is zero.
- Even large training corpora leave most possible n-grams unseen (Heaps’ Law guarantees the vocabulary keeps growing). Zeros are the rule, not the exception.
MLE’s maxim — “don’t pretend you’ve seen what you haven’t” — is fair in the abstract but too harsh for finite training sets. The standard fix is smoothing: reserve some probability mass for unseen events. Smoothed estimators are no longer strictly MLE — they trade off training-data likelihood for held-out generalization. That trade-off is the whole game of statistical modelling.
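A small illustration of the failure mode with hypothetical MLE bigram probabilities (the values are invented for the example):

```python
# Hypothetical MLE bigram estimates; ("cat", "barked") never occurred in training.
p = {
    ("<s>", "the"): 0.6,
    ("the", "cat"): 0.5,
    ("cat", "barked"): 0.0,
    ("barked", "</s>"): 0.4,
}

sentence = ["<s>", "the", "cat", "barked", "</s>"]

# Sentence probability is the product of the conditional probabilities.
prob = 1.0
for bigram in zip(sentence, sentence[1:]):
    prob *= p.get(bigram, 0.0)   # unseen events get MLE probability 0

print(prob)   # 0.0: the single unseen bigram wipes out the whole sentence.
# In log space the same zero becomes -inf, so perplexity is undefined.
```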
## MLE vs MAP vs Fully Bayesian
Three ways to pick $\theta$, ordered by how much prior information they use:
- MLE — use only the data: $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(D; \theta)$.
- Maximum a posteriori (MAP) — combine data with a prior: $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(D \mid \theta)\,P(\theta)$. Add-1 (Laplace) smoothing is exactly MAP estimation of a categorical under a symmetric Dirichlet prior with all parameters equal to 2 — as if you’d pre-observed one pseudo-count of every outcome. See bayes-rule for the inversion.
- Fully Bayesian — integrate over $\theta$ instead of picking a point estimate. Expensive but principled; central to topic models, Bayesian deep learning, and Gaussian processes.
For most of this module, MLE with smoothing is the default. The smoothing step is how MAP’s prior sneaks back into an otherwise likelihood-only estimator.
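A sketch contrasting the MLE and add-1 (MAP-style) estimates on the same toy unigram counts used earlier; the extra vocabulary word `dog` is invented to play the role of an unseen event:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat slept".split()
counts = Counter(tokens)
vocab = set(tokens) | {"dog"}   # "dog" is in the vocabulary but unseen in training
N, V = sum(counts.values()), len(vocab)

p_mle = {w: counts[w] / N for w in vocab}                # unseen word gets 0
p_add1 = {w: (counts[w] + 1) / (N + V) for w in vocab}   # one pseudo-count each

print(p_mle["dog"], p_add1["dog"])   # 0.0 vs 1/16 = 0.0625
print(p_mle["the"], p_add1["the"])   # 3/9 = 0.333... vs 4/16 = 0.25
```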
## Related
- n-gram-language-models — MLE estimator for bigram, trigram, and higher-order probabilities
- naive-bayes — MLE estimator for the class prior and per-class word likelihoods
- smoothing — the standard fix for MLE’s zero-probability problem
- perplexity — the metric that collapses when any MLE estimate is zero
- bayes-rule — MAP estimation uses Bayes’ rule to introduce a prior over parameters
## Active Recall
State the MLE principle in one sentence, then state the closed-form solution for a categorical distribution.
Principle: choose the parameters that make the observed training data as probable as possible under the model — $\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(D; \theta)$. Closed form for a categorical over vocabulary $V$ with counts $c(w)$ and total $N$: $\hat{\theta}_w = c(w)/N$. Relative frequencies aren’t a heuristic — they fall out of taking the derivative of the log-likelihood subject to the simplex constraint.
Why is every count-and-normalize formula in n-gram LMs and Naive Bayes the same theorem?
Each one estimates a categorical distribution. Unigram LM = one categorical over the vocabulary $V$. Bigram LM = one categorical per previous word (the distribution over next words). Naive Bayes likelihood = one categorical per class (the distribution over words given that class). MLE of any categorical is relative frequency — count events and divide by the total in the conditioning context. The formulas differ only in what you condition on.
Why do we always maximize log-likelihood rather than likelihood itself?
Three reasons. Numerical: products of many small probabilities underflow 64-bit floats; sums of logs don’t. Mathematical: log is monotonic, so the $\arg\max$ is preserved. Algorithmic: log turns products into sums, and gradients decompose cleanly across training examples — essential for any gradient-based learner. Minimizing cross-entropy is the same objective under a different name.
Why is MLE's zero-probability problem especially catastrophic in NLP?
NLP models compute sentence probabilities as products of many conditional probabilities. A single zero in the product zeroes the whole sentence, regardless of how well every other factor scored. Heaps’ Law guarantees that even huge training corpora leave most bigrams and n-grams unseen, so zeros are the common case, not an edge case. The fix is smoothing — move small amounts of probability mass onto unseen events.
How is add-1 (Laplace) smoothing related to MLE?
Add-1 smoothing is MAP estimation of a categorical under a symmetric Dirichlet prior with all parameters equal to 2 — it’s mathematically equivalent to observing one pseudo-count of every outcome before the real data arrives, then applying MLE. (A flat Dirichlet(1) prior would leave the MLE unchanged.) MLE uses only observed counts; MAP combines observed counts with a prior. Moving from MLE to add-1 smoothing is moving from pure likelihood maximization to maximum-a-posteriori estimation with a mild, symmetric prior.
What's the difference between MLE and conditional MLE, and why is conditional MLE what you actually use for bigram LMs?
MLE estimates a single distribution over all events. Conditional MLE estimates a separate distribution for each value of the conditioning variable. For bigrams you want $P(w_i \mid w_{i-1})$ — one distribution over next words for every possible previous word. Operationally, partition the corpus by $w_{i-1}$, then do standard MLE within each partition: $\hat{P}(w_i \mid w_{i-1}) = c(w_{i-1}, w_i)/c(w_{i-1})$. The denominator is the size of the partition, not the overall corpus — which is why the formula divides by the previous word’s count, not the total token count.