Raw word frequency is a bad representation because very frequent words like the and it are not discriminative. TF-IDF fixes this by multiplying each word’s term frequency by its inverse document frequency — upweighting terms that are common inside a document but rare across the corpus.

The Problem with Raw Counts

The term-document and term-context matrices represent each cell as a raw count. Counts are useful — if sugar appears a lot near apricot, that tells you something. But stop-words (the, it, they, of) appear a lot near every word, so they drown the signal in noise. A co-occurrence with the is not evidence of anything.

The paradox: frequent words carry the most co-occurrence mass but the least discriminative information. TF-IDF and PMI are two classical ways to resolve it.

The Formula

Each cell in the weighted matrix is the product of two terms. TF captures how important the term is inside this document; IDF captures how distinctive the term is across the corpus.

Term frequency (tf)

The simplest choice is the raw count, count(t, d). In practice counts are log-squashed so that a word appearing 100 times isn’t treated as 100× more important than one appearing once:

tf(t, d) = log10(count(t, d) + 1)

The +1 inside the log keeps the value defined when a word doesn’t occur (giving tf = 0), and the log dampens the runaway effect of high-count words. This is the sentiment-classifier intuition restated in weighting form (see binary multinomial NB): occurrence matters more than raw frequency.

ALTERNATIVE CONVENTION

Some texts use tf(t, d) = 1 + log10(count(t, d)) (for count(t, d) > 0, else 0) — a sublinear variant that gives a word appearing once a weight of 1 rather than log10(2) ≈ 0.30. These two conventions produce different numerical values (and tf-idf tables will disagree) but the same relative ordering. The course uses log10(count(t, d) + 1) and we follow that here.
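
A minimal sketch of the two conventions in plain Python (the function names are mine, not from the course):

  import math

  def tf_plus_one(count: int) -> float:
      """Course convention: log10(count + 1); defined (and 0) when count == 0."""
      return math.log10(count + 1)

  def tf_sublinear(count: int) -> float:
      """Alternative convention: 1 + log10(count) for count > 0, else 0."""
      return 1 + math.log10(count) if count > 0 else 0.0

  for c in (0, 1, 2, 100):
      print(c, round(tf_plus_one(c), 3), round(tf_sublinear(c), 3))
  # count=1   -> 0.301 vs 1.0   (different values, same relative ordering)
  # count=100 -> 2.004 vs 3.0   (100x the count is ~2-3x the weight, not 100x)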

Document frequency (df)

The document frequency df(t) is the number of documents the term t appears in — not the collection frequency (the total count across all documents). The distinction matters:

            Collection Frequency    Document Frequency
Romeo                        113                     1
action                       113                    31

Both words appear 113 times in Shakespeare. But Romeo is concentrated in one play (it’s highly distinctive) while action is spread across 31 plays (it’s generic). Document frequency catches this; collection frequency doesn’t.
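
The same distinction in a few lines of Python, using toy documents rather than the real Shakespeare corpus:

  from collections import Counter

  docs = [
      ["romeo", "romeo", "romeo", "action"],   # toy "documents", not real plays
      ["action", "battle"],
      ["action", "good"],
  ]

  collection_freq = Counter(tok for doc in docs for tok in doc)
  doc_freq = Counter(tok for doc in docs for tok in set(doc))

  print(collection_freq["romeo"], doc_freq["romeo"])    # 3 1: frequent but concentrated
  print(collection_freq["action"], doc_freq["action"])  # 3 3: same total, but spread out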

Inverse document frequency (idf)

idf(t) = log10(N / df(t))

where N is the total number of documents in the collection. The log compresses the dynamic range; N / df(t) on its own would vary over many orders of magnitude.

Illustrative values from the Shakespeare collection (37 plays):

Word         df     idf
Romeo         1    1.57
salad         2    1.27
Falstaff      4    0.967
forest       12    0.489
battle       21    0.246
wit          34    0.037
fool         36    0.012
good         37    0
sweet        37    0

Good and sweet appear in every play, so their idf is exactly 0 — the product zeroes out. TF-IDF says: “a word that’s everywhere tells you nothing about which document you’re in.” Romeo, by contrast, gets a very high idf — it’s a strong fingerprint for Romeo and Juliet.
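
The idf column can be reproduced directly from the df values and N = 37; a quick sanity check in Python:

  import math

  N = 37  # plays in the collection
  df = {"Romeo": 1, "salad": 2, "Falstaff": 4, "forest": 12,
        "battle": 21, "wit": 34, "fool": 36, "good": 37, "sweet": 37}

  for word, d in df.items():
      print(f"{word:10s} df={d:2d}  idf={math.log10(N / d):.3f}")
  # Romeo -> 1.568, battle -> 0.246, good/sweet -> 0.000, matching the table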

What “Document” Means

The definition is flexible: a play, a Wikipedia article, a paragraph, a tweet, or any fixed-size chunk of text. For word-similarity work, people often treat each paragraph as a document. The key requirement is that the partition produces useful discriminative signal — if every “document” is a single sentence, idf values saturate; if every “document” is an entire book, they under-discriminate.

Worked Application: Shakespeare Plays

From raw counts to tf-idf weights, using the four-play term-document matrix:

Raw counts:

           As You Like It    Twelfth Night    Julius Caesar    Henry V
battle                  1                0                7         13
good                  114               80               62         89
fool                   36               58                1          4
wit                    20               15                2          3

TF-IDF weights (same data, re-weighted with tf-idf(t, d) = tf(t, d) × idf(t)):

           As You Like It    Twelfth Night    Julius Caesar    Henry V
battle              0.074                0             0.22       0.28
good                    0                0                0          0
fool                0.019            0.021           0.0036     0.0083
wit                 0.049            0.044            0.018      0.022

Observations:

  • good is zero everywhere. idf = 0 because it appears in every play (df = 37, N = 37). Multiplying by log(1) = 0 wipes it out. This is the point: good was the loudest dimension in the raw matrix and contributed nothing distinctive.
  • battle dominates the history plays. Its idf is non-trivial and its tf is high in Julius Caesar and Henry V — exactly the dimension that separates histories from comedies.
  • fool still discriminates comedies from histories, but with much smaller absolute values because its idf is modest.

The net effect: tf-idf-weighted vectors separate documents by what is distinctive to them, not by what is frequent.
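
A sketch that rebuilds the weighted table from the raw counts and the df values listed earlier (N = 37); small differences from the printed figures are just rounding:

  import math

  N = 37
  plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
  counts = {
      "battle": [1, 0, 7, 13],
      "good":   [114, 80, 62, 89],
      "fool":   [36, 58, 1, 4],
      "wit":    [20, 15, 2, 3],
  }
  df = {"battle": 21, "good": 37, "fool": 36, "wit": 34}  # from the idf table above

  def tf_idf(count: int, df_t: int) -> float:
      tf = math.log10(count + 1)   # course tf convention
      idf = math.log10(N / df_t)   # idf over the 37-play collection
      return tf * idf

  print("word  ", plays)
  for word, row in counts.items():
      print(word, [round(tf_idf(c, df[word]), 3) for c in row])
  # battle [0.074, 0.0, 0.222, 0.282]
  # good   [0.0, 0.0, 0.0, 0.0]
  # fool   [0.019, 0.021, 0.004, 0.008]
  # wit    [0.049, 0.044, 0.018, 0.022]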

Why TF-IDF Works as an Embedding

Each row of the tf-idf matrix is a word embedding (weighted counts of the documents the word appears in). Each column is a document embedding. These vectors are:

  • Long — on the order of |V| dimensions, i.e. 20,000–50,000 if the columns are context words.
  • Sparse — most entries are zero because most vocabulary words don’t co-occur in any single window/document.
  • Interpretable — every dimension is a named word or document; you can read off which contexts a word lives in.

Compared to dense predictive embeddings, tf-idf is cheaper (no training, just counts and logs), doesn’t require a GPU, is deterministic, and remains the default baseline in information retrieval — the algorithm that underlies classical search engines and is the first thing any retrieval system should beat.

Its weaknesses are the mirror of word2vec’s strengths. TF-IDF doesn’t know that car and automobile are synonymous — they’re different dimensions and their vectors don’t overlap unless they happen to co-occur with the same words. Dense embeddings solve this by construction.
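
A toy illustration of that failure mode, with made-up sparse vectors and a hand-rolled cosine:

  import math

  def norm(v: dict) -> float:
      return math.sqrt(sum(x * x for x in v.values()))

  def cosine(u: dict, v: dict) -> float:
      dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
      return dot / (norm(u) * norm(v))

  # Toy tf-idf vectors (made-up weights over document dimensions):
  # car and automobile never share a document, so their supports are disjoint.
  car        = {"doc1": 0.8, "doc2": 0.3}
  automobile = {"doc3": 0.7, "doc4": 0.4}

  print(cosine(car, automobile))  # 0.0: tf-idf sees no similarity between synonyms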

  • vector-semantics — the framework tf-idf sits inside (weighted count vectors)
  • pmi — the information-theoretic alternative to tf-idf for weighting co-occurrences
  • cosine-similarity — the standard metric over tf-idf vectors (length normalisation is crucial for retrieval)
  • word2vec — the dense predictive alternative that generalises across synonyms tf-idf can’t equate

Active Recall