Define the meaning of a word by its distribution in language use. Represent that distribution as a vector in a multi-dimensional space. Similar words become nearby points; every downstream NLP model computes on those points instead of strings.

Two Ideas, One Method

Vector semantics rests on two ideas that become one technique when combined:

Idea 1 — Defining meaning by linguistic distribution. From Wittgenstein’s Philosophical Investigations §43: “The meaning of a word is its use in the language.” Operationalised by Zellig Harris (1954): “If A and B have almost identical environments we say that they are synonyms.” A word’s meaning is a function of the company it keeps — the neighbouring words or grammatical environments it occurs in.

Idea 2 — Meaning as a point in space. The connotation sketch in lexical semantics already represented a word as a 3D vector over valence, arousal, dominance. Generalise from 3 to thousands of dimensions, and the whole vocabulary becomes a cloud of points where geometry recovers semantic relationships.

The payoff of combining them: build the space automatically by seeing which words are nearby in text. No hand-annotation required.

The ongchoi intuition

Suppose you’ve never seen ongchoi before but encounter:

  • Ongchoi is delicious sautéed with garlic.
  • Ongchoi is superb over rice.
  • Ongchoi leaves with salty sauces.

And you’ve also seen:

  • …spinach sautéed with garlic over rice.
  • Chard stems and leaves are delicious.
  • Collard greens and other salty leafy greens.

Without anyone telling you, the shared context (leaves, delicious, sautéed, salty) is enough to conclude that ongchoi is a leafy green like spinach, chard, or collard greens. The distribution did the work.

Embeddings: Vectors That Know Things

A word’s vector in this space is called an embedding — the word is embedded into a (typically short, typically learned) space whose geometry reflects meaning. The phrase “every modern NLP algorithm uses embeddings as the representation of word meaning” is not hyperbole — embeddings, not strings, are the input to essentially every modern classifier, parser, retriever, and generator.

The practical payoff is generalisation. Consider a sentiment classifier:

  • With words (strings / one-hot): “the previous word was terrible” is a feature that only fires on exactly terrible. A test document containing awful gets no credit.
  • With embeddings: the previous word is represented by its vector, and the vector for awful lies close to the vector for terrible, so any feature computed from that vector fires almost as strongly. The classifier can generalise to similar-but-unseen words; a toy comparison follows below.
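
A minimal sketch of that contrast (the 4-dimensional embedding values below are invented for illustration; real embeddings are learned and much longer):

```python
# One-hot features vs dense embeddings for "terrible" vs "awful".
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: every word is its own axis, so distinct words are orthogonal.
vocab = ["terrible", "awful", "sunny"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["terrible"], one_hot["awful"]))   # 0.0 -- no credit

# Dense embeddings (invented values): similar words get similar vectors.
emb = {
    "terrible": np.array([-1.2, 0.3, -0.8, 0.1]),
    "awful":    np.array([-1.1, 0.4, -0.7, 0.2]),
    "sunny":    np.array([ 0.9, -0.2, 1.1, 0.5]),
}
print(cosine(emb["terrible"], emb["awful"]))   # ~0.99 -- feature still fires
print(cosine(emb["terrible"], emb["sunny"]))   # negative -- dissimilar word
```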

In the rest of this module, computing moves from string representations to meaning representations. Zhuangzi put it best: “words are for meaning; once you get the meaning, you can forget the words.”

Two Kinds of Embedding

The module covers two ways of constructing embeddings, compared side by side:

|               | tf-idf                                    | word2vec                                             |
|---------------|-------------------------------------------|------------------------------------------------------|
| Vector length | 20,000–50,000                             | 50–1000                                              |
| Density       | sparse (mostly zero)                      | dense (mostly non-zero)                              |
| How made      | weighted counts of nearby words/documents | classifier trained to predict whether words co-occur |
| Baseline use  | information retrieval workhorse           | default word representation for deep models          |

Later in the module, contextual embeddings (ELMo, BERT) produce a different vector for each occurrence of a word, not one fixed vector per type. This handles polysemy — the two senses of mouse get distinct vectors in context.
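
A sketch of that behaviour, assuming the Hugging Face transformers and torch packages; bert-base-uncased is one common checkpoint, not one the module mandates:

```python
# Contextual embeddings: each occurrence of "mouse" gets its own vector.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence, word):
    """Return the contextual vector of `word`'s first occurrence."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (n_tokens, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = vector_for("the mouse chewed through the wall", "mouse")
v2 = vector_for("click the left mouse button", "mouse")
# Same string, two different vectors: cosine is noticeably below 1.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```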

The Matrix Behind the Vectors

Both tf-idf and (conceptually) word2vec rest on co-occurrence matrices — tables counting how often word A appears with word B or in document D.

Term-document matrix

Each column is a document, each row is a vocabulary word, each cell is the count of that word in that document. Jurafsky’s recurring example:

|        | As You Like It | Twelfth Night | Julius Caesar | Henry V |
|--------|---------------:|--------------:|--------------:|--------:|
| battle |              1 |             0 |             7 |      13 |
| good   |            114 |            80 |            62 |      89 |
| fool   |             36 |            58 |             1 |       4 |
| wit    |             20 |            15 |             2 |       3 |

Two ways to read this matrix:

  • Columns = document vectors. As You Like It = (1, 114, 36, 20). Documents with similar vectors are similar documents — this is the vector space model of information retrieval (Salton 1971). The two comedies look alike; the two histories look alike.
  • Rows = word vectors. battle = (1, 0, 7, 13) is “the kind of word that occurs in Julius Caesar and Henry V.” fool = (36, 58, 1, 4) is “the kind of word that occurs in comedies, especially Twelfth Night.” Words with similar row vectors occur in similar documents.
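
A minimal sketch that encodes the table above and checks the comedy/history claim with cosine similarity (defined formally in the cosine-similarity note):

```python
# Term-document matrix from the Shakespeare table; columns are documents.
import numpy as np

words = ["battle", "good", "fool", "wit"]
docs = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
X = np.array([
    [  1,  0,  7, 13],   # battle
    [114, 80, 62, 89],   # good
    [ 36, 58,  1,  4],   # fool
    [ 20, 15,  2,  3],   # wit
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(X[:, 0], X[:, 1]))  # comedies: ~0.95
print(cosine(X[:, 2], X[:, 3]))  # histories: ~1.00
print(cosine(X[:, 1], X[:, 2]))  # comedy vs history: ~0.81, lower
```

Note how the very frequent good inflates all three scores; suppressing that kind of uninformative frequency is exactly the job of tf-idf weighting.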

Term-context (word-word) matrix

More common in modern practice. Columns are context words (a window around each occurrence), not documents. Cells count how often the row-word appeared with the column-word within a window.

|             | aardvark | computer | data | result | pie | sugar |
|-------------|---------:|---------:|-----:|-------:|----:|------:|
| cherry      |        0 |        2 |    8 |      9 | 442 |    25 |
| strawberry  |        0 |        0 |    0 |      1 |  60 |    19 |
| digital     |        0 |     1670 | 1683 |     85 |   5 |     4 |
| information |        0 |     3325 | 3982 |    378 |   5 |    13 |

The row vectors of digital and information are nearly parallel (both heavy on computer, data); the row vectors of cherry and strawberry are similar on pie. Two words are similar in meaning if their context vectors are similar. This is Harris’s distributional hypothesis, expressed in matrix form.
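
A sketch of how such a matrix gets counted, using a toy corpus echoing the ongchoi examples and a ±4-token window (window size is a free choice; real matrices are counted over millions of tokens):

```python
# Count word-word co-occurrences within a +/-4 token window.
from collections import defaultdict

corpus = [
    "ongchoi is delicious sauteed with garlic".split(),
    "spinach sauteed with garlic over rice".split(),
    "chard stems and leaves are delicious".split(),
]

WINDOW = 4
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][sent[j]] += 1

# counts[w] is w's context vector; ongchoi and spinach already share
# the contexts "sauteed" and "with" in this tiny corpus.
print(dict(counts["ongchoi"]))
print(dict(counts["spinach"]))
```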

What’s a “document”?

For tf-idf, a document can be a play, a Wikipedia article, or a paragraph-sized chunk of a corpus — whatever partition gives useful discriminative signal. For a word-word matrix, the “document” shrinks to a small window around each token.

Similarity = Geometry

Once words are points, similarity becomes a geometric question. The next concepts answer it:

  • cosine-similarity measures angular similarity between vectors — the standard metric.
  • tf-idf weights raw counts to suppress uninformative frequent words and highlight distinctive ones.
  • pmi does the same job through an information-theoretic lens.
  • word2vec replaces counting with predicting — learn a compact dense vector whose dot products approximate co-occurrence probabilities.
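
A numerical sketch of that predictive view (the 3-dimensional vectors and the apricot/jam word pair are invented for illustration; word2vec learns real vectors by gradient descent over a corpus):

```python
# Skip-gram scoring rule: P(co-occur | w, c) = sigmoid(w . c).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w_apricot = np.array([ 0.8,  1.1, -0.2])   # target word vector
c_jam     = np.array([ 0.9,  0.8, -0.1])   # plausible context word
c_matrix  = np.array([-0.7, -0.5,  1.2])   # implausible context word

print(sigmoid(w_apricot @ c_jam))     # ~0.84: likely co-occurrence
print(sigmoid(w_apricot @ c_matrix))  # ~0.21: likely a negative sample
```

Training pushes the dot product up for observed word-context pairs and down for sampled negatives, which is why the learned dot products end up tracking co-occurrence probabilities.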
Linked concepts:

  • lexical-semantics — the linguistic structure this representation is trying to recover
  • cosine-similarity — measuring similarity between vectors
  • tf-idf — sparse weighted-count embedding (first concrete instance)
  • pmi — information-theoretic re-weighting
  • word2vec — dense, learned, predictive embedding (the modern default)
  • bag-of-words — the document-level analogue: a document is a multiset of word types
  • text-classification — downstream consumer of whatever representation we build

Active Recall