Pointwise mutual information measures how much more (or less) often two words co-occur than chance would predict. Positive PMI means the pair is associated; zero means independence; negative means they avoid each other.

Definition

For two words $w$ and $c$:

$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$$

The denominator, $P(w)P(c)$, is what the joint probability would be if $w$ and $c$ occurred independently. The numerator, $P(w, c)$, is what we actually observe. Their ratio quantifies association; the log turns the ratio into an additive quantity.

  • $\mathrm{PMI}(w, c) > 0$: the pair co-occurs more than chance → associated.
  • $\mathrm{PMI}(w, c) = 0$: independent.
  • $\mathrm{PMI}(w, c) < 0$: co-occurs less than chance → anti-associated.

The base of the log is a convention: base 2 gives an interpretation in bits; base $e$ gives nats. The worksheet uses natural log.
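As a quick sketch in Python (the helper name pmi and its signature are mine, not from the course materials):

```python
import math

def pmi(p_joint: float, p_x: float, p_y: float, base: float = math.e) -> float:
    """PMI from probability estimates; base e gives nats, base 2 gives bits."""
    return math.log(p_joint / (p_x * p_y), base)
```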

Worked Example (Worksheet Q6)

From the lab:

Compute:

A positive PMI means "sunny" and "weather" co-occur more often than chance: there is a real association in the corpus. This is useful as a basic sanity check on whether a word pair meaningfully clusters, without needing labels.
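The lab's actual estimates aren't reproduced here, but with stand-in numbers the computation looks like this (the probabilities below are purely illustrative):

```python
import math

# Illustrative numbers only -- not the worksheet's actual estimates.
p_sunny, p_weather = 0.01, 0.02      # hypothetical unigram probabilities
p_sunny_weather = 0.001              # hypothetical joint probability

score = math.log(p_sunny_weather / (p_sunny * p_weather))   # natural log, per the worksheet
print(score)   # ln(5) ≈ 1.61 > 0, so the pair would count as associated
```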

Estimating From a Corpus

$P(w)$ is estimated from the unigram count of $w$; $P(w, c)$ from the bigram/co-occurrence count within a chosen window. So PMI can be computed directly from the term-context co-occurrence matrix:

$$\mathrm{PMI}(w, c) = \log \frac{\mathrm{count}(w, c)\cdot N}{\mathrm{count}(w)\,\mathrm{count}(c)}$$

where $N$ is the total number of word pairs observed (or the total token count, depending on conventions).
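A rough sketch of that pipeline, assuming a symmetric window and taking $N$ to be the number of (word, context) pairs observed; the function name and window size are my own choices:

```python
import math
from collections import Counter

def pmi_from_corpus(tokens, window=2):
    """Estimate PMI for each (word, context) pair seen within +/- `window` tokens."""
    word_counts, pair_counts = Counter(), Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                word_counts[w] += 1                 # marginal count of w as a target
                pair_counts[(w, tokens[j])] += 1    # co-occurrence count within the window
    n = sum(pair_counts.values())                   # N = total pairs observed
    return {(w, c): math.log(k * n / (word_counts[w] * word_counts[c]))
            for (w, c), k in pair_counts.items()}

tokens = "the sunny weather made the sunny day feel warm".split()
print(pmi_from_corpus(tokens)[("sunny", "weather")])
```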

PMI vs TF-IDF

Both PMI and tf-idf solve the same underlying problem — raw frequency is a bad weighting, very common words carry no distinctive signal — but from different angles:

  • TF-IDF corrects for ubiquity across documents (words that appear everywhere contribute nothing). Document-centred.
  • PMI corrects for baseline frequency (a word co-occurring with "the" is less informative than one co-occurring with "apricot", because "the" co-occurs with everything). Pair-centred.

They often agree in practice: both downweight stopwords and upweight distinctive collocations. The practical choice comes from the matrix you have. If your matrix is term-by-document, tf-idf is natural. If it’s term-by-context-word, PMI is natural.

Positive PMI (PPMI)

PMI ranges from $-\infty$ to $+\infty$, but negative values are problematic for three reasons:

  1. Interpretation is weak. A negative PMI means the pair co-occurs less than chance. That’s a real signal in principle, but far less intuitive than the positive direction.
  2. Unreliable without enormous corpora. If $P(w)$ and $P(c)$ are each on the order of $10^{-6}$, the chance-level joint is $10^{-12}$. Telling whether the observed joint is "significantly different" from $10^{-12}$ requires an astronomically large corpus. On anything short of that, the negative tail is pure noise.
  3. Humans aren’t calibrated on “unrelatedness.” Similarity judgments correlate well with positive PMI; nobody reliably rates pairs as anti-related.

The standard fix is Positive PMI (PPMI), which clips negative values at zero:

$$\mathrm{PPMI}(w, c) = \max\bigl(\mathrm{PMI}(w, c),\ 0\bigr)$$

PPMI also sidesteps the $\log 0 = -\infty$ problem for unobserved pairs (which simply become 0). PPMI-weighted co-occurrence matrices are the classical baseline that word2vec's skip-gram with negative sampling was shown (Levy & Goldberg, 2014) to implicitly factorise.
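A minimal sketch of the clip from raw counts (names are mine); note how an unobserved pair maps straight to 0 instead of $\log 0 = -\infty$:

```python
import math

def ppmi(pair_count: int, count_w: int, count_c: int, n: int) -> float:
    """PPMI from raw counts: unobserved pairs give 0 instead of log(0) = -inf."""
    if pair_count == 0:
        return 0.0
    return max(math.log2(pair_count * n / (count_w * count_c)), 0.0)
```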

Computing PPMI on a Term-Context Matrix

Given a co-occurrence matrix $F$ with $W$ rows (words) and $C$ columns (contexts), where $f_{ij}$ is the number of times word $w_i$ occurs with context $c_j$, the joint and marginal probabilities are:

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}, \qquad p_{i*} = \frac{\sum_{j} f_{ij}}{\sum_{i}\sum_{j} f_{ij}}, \qquad p_{*j} = \frac{\sum_{i} f_{ij}}{\sum_{i}\sum_{j} f_{ij}}$$

and each PPMI cell (using log base 2, as in the example below) is

$$\mathrm{PPMI}_{ij} = \max\Bigl(\log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}},\ 0\Bigr)$$

Worked example (Jurafsky’s cherry/strawberry/digital/information)

Raw counts:

|             | computer | data | result | pie | sugar | count(w) |
|-------------|----------|------|--------|-----|-------|----------|
| cherry      | 2        | 8    | 9      | 442 | 25    | 486      |
| strawberry  | 0        | 0    | 1      | 60  | 19    | 80       |
| digital     | 1670     | 1683 | 85     | 5   | 4     | 3447     |
| information | 3325     | 3982 | 378    | 5   | 13    | 7703     |
| count(c)    | 4997     | 5673 | 473    | 512 | 61    | 11716    |

Computing one cell, PMI(information, data):

$$P(w=\text{information}, c=\text{data}) = \frac{3982}{11716} = 0.3399$$

$$P(w=\text{information}) = \frac{7703}{11716} = 0.6575, \qquad P(c=\text{data}) = \frac{5673}{11716} = 0.4842$$

$$\mathrm{PMI}(\text{information}, \text{data}) = \log_2 \frac{0.3399}{0.6575 \times 0.4842} \approx 0.09$$

Resulting PPMI matrix (negatives clipped to 0):

|             | computer | data | result | pie  | sugar |
|-------------|----------|------|--------|------|-------|
| cherry      | 0        | 0    | 0      | 4.38 | 3.30  |
| strawberry  | 0        | 0    | 0      | 4.10 | 5.51  |
| digital     | 0.18     | 0.01 | 0      | 0    | 0     |
| information | 0.02     | 0.09 | 0.28   | 0    | 0     |

The matrix encodes the semantic split cleanly: cherry and strawberry light up on pie and sugar (food field); digital and information light up faintly on the tech columns. All the “stopword-ish” cells collapse to 0.
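A small NumPy sketch that implements the probability and clipping steps above and reproduces the table from the raw counts (the function name ppmi_matrix is mine):

```python
import numpy as np

def ppmi_matrix(F):
    """PPMI-weight a word-by-context count matrix (rows = words, columns = contexts)."""
    p_ij = F / F.sum()                      # joint probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)   # word marginals (row sums)
    p_j = p_ij.sum(axis=0, keepdims=True)   # context marginals (column sums)
    with np.errstate(divide="ignore"):      # log2(0) -> -inf, clipped on the next line
        pmi = np.log2(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0.0)

#          computer   data  result   pie  sugar
counts = np.array([
    [            2,      8,      9,  442,    25],   # cherry
    [            0,      0,      1,   60,    19],   # strawberry
    [         1670,   1683,     85,    5,     4],   # digital
    [         3325,   3982,    378,    5,    13],   # information
], dtype=float)

print(np.round(ppmi_matrix(counts), 2))
# e.g. PPMI(cherry, pie) ≈ 4.38 and PPMI(information, data) ≈ 0.09, matching the table
```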

Weighting PMI: The Rare-Word Bias

Even with the PPMI clip, PMI has one more bias worth knowing about: it overweights rare events. Very rare words can have very high PMI values on the (few) contexts where they co-occur — not because the association is strong, but because their marginal probability is tiny and dividing by a near-zero denominator inflates the log.

Two standard mitigations:

1. Raise context probabilities to the power α = 0.75

Smooth the context marginal by raising counts to a fractional power before normalising:

$$P_\alpha(c) = \frac{\mathrm{count}(c)^\alpha}{\sum_{c'} \mathrm{count}(c')^\alpha}, \qquad \alpha = 0.75$$

Because $\alpha < 1$, this boosts rare contexts' implied probability and dampens frequent contexts'. Concretely, if two events have $P(a) = 0.99$ and $P(b) = 0.01$:

$$P_\alpha(a) = \frac{0.99^{0.75}}{0.99^{0.75} + 0.01^{0.75}} \approx 0.97, \qquad P_\alpha(b) = \frac{0.01^{0.75}}{0.99^{0.75} + 0.01^{0.75}} \approx 0.03$$

The rare event's effective probability went from 0.01 to 0.03, a 3× boost. This reduces the PMI inflation on rare contexts. (Incidentally, $\alpha = 0.75$ is the same smoothing that word2vec's negative-sampling distribution uses; Mikolov et al. settled on it empirically.)
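A quick numerical check of that example (the helper name is mine):

```python
import numpy as np

def smoothed_context_probs(context_counts, alpha=0.75):
    """Context marginals with counts raised to alpha before normalising."""
    powered = np.asarray(context_counts, dtype=float) ** alpha
    return powered / powered.sum()

# Raw probabilities 0.99 and 0.01 (e.g. counts of 99 and 1 out of 100):
print(smoothed_context_probs([99, 1]))   # ≈ [0.97, 0.03]
```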

2. Add-one smoothing

Add 1 to every count in the co-occurrence matrix before computing probabilities. Same trick as Laplace smoothing for n-gram models — it doesn’t remove rare-word bias entirely but dampens the worst cases.
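A toy illustration of the dampening effect, with made-up counts; the function repeats the same PPMI computation as above so the snippet runs on its own:

```python
import numpy as np

def ppmi_matrix(F):
    p = F / F.sum()
    p_i, p_j = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):
        return np.maximum(np.log2(p / (p_i * p_j)), 0.0)

toy = np.array([[1.0, 0.0, 0.0],        # a word observed only once
                [50.0, 400.0, 550.0]])  # a frequent word
print(ppmi_matrix(toy)[0, 0])           # ≈ 4.29: high PMI driven purely by rarity
print(ppmi_matrix(toy + 1.0)[0, 0])     # ≈ 3.25: add-one smoothing dampens it
```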

Why PMI Matters

Two reasons it keeps appearing in the course:

  1. Interpretable baseline. PPMI on a term-context matrix with a dimensionality reduction step (SVD → LSA) is a strong, fully explainable alternative to dense neural embeddings, and often within a few percentage points of word2vec on similarity benchmarks.
  2. Theoretical bridge. Word2vec’s training objective approximates PMI factorisation — understanding PMI is understanding what skip-gram is really computing underneath the loss function.
  • tf-idf — document-weighted alternative; solves the same problem from a different angle
  • vector-semantics — the matrix framework PMI computes over
  • cosine-similarity — still the right metric for comparing PMI-weighted vectors
  • word2vec — implicitly factorises a shifted-PMI matrix (Levy & Goldberg 2014)

Active Recall