Word2vec learns dense, short word embeddings by training a binary classifier to predict whether a pair of words co-occurred. The classifier itself is thrown away; the vectors it learned as parameters are what we keep. It’s a self-supervised, predict-don’t-count alternative to tf-idf and PPMI.

Sparse vs Dense, Count vs Predict

tf-idf and PPMI produce sparse, long vectors (20,000–50,000 dimensions, most entries zero). Word2vec (Mikolov et al., 2013) produces dense, short vectors (50–1000 dimensions, most entries non-zero). Why bother?

  • Fewer parameters downstream. Feeding a 300-dimensional dense vector into a classifier is much cheaper than a 50,000-dimensional sparse one.
  • Better generalisation. Car and automobile are synonyms but live on distinct sparse dimensions, so a feature like “previous word was car” won’t trigger on automobile. Dense embeddings place both near each other, so anything learned about one generalises to the other.
  • Empirically better. On most downstream tasks, dense embeddings simply work better than sparse ones. The mechanism is compression: reducing to a low-rank approximation forces the model to find the structure that matters.

Word2vec is the canonical dense embedding method. Alternatives include GloVe (Pennington et al., 2014), SVD / LSA (a classical dimensionality reduction of the co-occurrence matrix), and the modern contextual embeddings (ELMo, BERT) that replace “one vector per lemma” with “one vector per occurrence.”

Skip-Gram with Negative Sampling (SGNS)

Word2vec comes in two flavours — skip-gram and CBOW. The course focuses on skip-gram with negative sampling (SGNS):

Idea: predict rather than count. Instead of tallying how often apricot co-occurs with every context word, train a classifier on a binary prediction task: “is word c likely to appear near apricot?” The task itself is throwaway — nobody cares about the classifier’s output at test time. What we keep is the weights, which become the word embeddings.

Big idea: self-supervision. A word that actually occurs near apricot in the corpus counts as the “gold correct answer” for supervised learning. No human labels needed. Any running text generates training data.

Four-step algorithm

  1. Build positive examples: take a target word and each context word in a small window around it.
  2. Build negative examples: for each positive example, sample random words from the vocabulary as non-neighbours.
  3. Train a logistic regression classifier to distinguish positive from negative pairs, using vector dot product as the similarity input.
  4. Extract the learned weights as the embeddings.
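The four steps can be sketched in a few lines. Everything here is illustrative (the function name `training_pairs` is invented, and negatives are drawn uniformly for simplicity; real word2vec weights the draw by frequency, as discussed below):

```python
import random

def training_pairs(tokens, window=2, k=2, seed=0):
    """Sketch of SGNS data generation: positive pairs from a +/- `window`
    context, and k uniformly sampled negatives per positive pair."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    positives, negatives = [], []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            positives.append((target, tokens[j]))
            for _ in range(k):
                negatives.append((target, rng.choice(vocab)))
    return positives, negatives

sent = "lemon a tablespoon of apricot jam a pinch".split()
pos, neg = training_pairs(sent, window=2, k=2)
# the four positives for the target "apricot":
print([p for p in pos if p[0] == "apricot"])
```

Step 3 (the classifier) and step 4 (keeping the weights) are covered in the sections below.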

Training data

Assume a word window around the target apricot in the sentence “lemon, a tablespoon of apricot jam, a pinch…”:

...lemon, a [tablespoon of apricot jam, a] pinch...
              c1          c2  target    c3    c4

Positive examples (one per context word in the window):

| t | c |
|---------|------------|
| apricot | tablespoon |
| apricot | of |
| apricot | jam |
| apricot | a |

For each positive example, draw $k$ negative examples by sampling words by frequency (the original paper actually uses unigram frequency raised to the 3/4 power; the course treatment uses plain frequency). With $k = 2$:

| t | c | label |
|---------|----------|-------|
| apricot | aardvark | − |
| apricot | my | − |
| apricot | where | − |
| apricot | coaxial | − |
| apricot | seven | − |
| apricot | forever | − |
| apricot | dear | − |
| apricot | if | − |
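The frequency-weighted draw can be sketched as follows. `neg_sampler` is a hypothetical helper; `random.choices` does the actual weighted sampling:

```python
import random
from collections import Counter

def neg_sampler(tokens, alpha=0.75, seed=0):
    """Sample negatives proportional to unigram frequency ** alpha.
    alpha=0.75 is the original paper's weighting; alpha=1.0 gives the
    plain frequency sampling used in the course treatment."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] ** alpha for w in words]
    rng = random.Random(seed)
    return lambda: rng.choices(words, weights=weights, k=1)[0]

sample = neg_sampler("the the the cat sat on the mat".split())
draws = Counter(sample() for _ in range(10_000))
# "the" is still drawn most often, but less dominantly than its raw
# frequency (4/8 of the tokens) would dictate, because alpha < 1
print(draws)
```

Raising frequencies to the 3/4 power flattens the distribution slightly, so rare words get sampled as negatives a bit more often than their raw counts would allow.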

Similarity via dot product

The classifier’s core: two words are similar if their embeddings have a high dot product. Cosine is just normalised dot product. Define

$$\text{score}(w, c) = \mathbf{c} \cdot \mathbf{w}$$

and turn it into a probability via the sigmoid from logistic regression:

$$P(+ \mid w, c) = \sigma(\mathbf{c} \cdot \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{c} \cdot \mathbf{w})}$$

Assuming independence across the $L$ context words in the window:

$$P(+ \mid w, c_{1:L}) = \prod_{i=1}^{L} \sigma(\mathbf{c}_i \cdot \mathbf{w})$$

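A minimal sketch of this scoring step, with plain Python lists as vectors (`p_positive` and `p_positive_window` are illustrative names, not library functions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_positive(w, c):
    """P(+ | w, c) = sigma(c . w): probability that (w, c) is a true
    co-occurrence, from the dot product of the two embeddings."""
    return sigmoid(sum(wi * ci for wi, ci in zip(w, c)))

def p_positive_window(w, contexts):
    """Independence assumption: multiply over the context words."""
    p = 1.0
    for c in contexts:
        p *= p_positive(w, c)
    return p

# toy 3-d vectors, invented for illustration
w_apricot  = [0.5, 1.0, -0.3]
c_jam      = [0.4, 0.8, -0.2]    # similar direction -> high probability
c_aardvark = [-0.5, -0.9, 0.4]   # opposite direction -> low probability
print(p_positive(w_apricot, c_jam) > p_positive(w_apricot, c_aardvark))  # True
```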
The loss function

For one positive example $c_{\text{pos}}$ and $k$ negative examples $c_{\text{neg}_1} \dots c_{\text{neg}_k}$ paired with target $w$, minimise the cross-entropy loss:

$$L_{CE} = -\left[ \log \sigma(\mathbf{c}_{\text{pos}} \cdot \mathbf{w}) + \sum_{i=1}^{k} \log \sigma(-\mathbf{c}_{\text{neg}_i} \cdot \mathbf{w}) \right]$$

In words: maximise the similarity of the target with the true context word, minimise the similarity of the target with each negative-sampled non-neighbour.
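The loss for one (target, positive, negatives) bundle follows directly from that formula; a sketch, with `sgns_loss` as an invented helper name:

```python
import math

def sgns_loss(w, c_pos, c_negs):
    """L = -[ log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w) ]"""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = -math.log(sig(dot(c_pos, w)))
    for c_neg in c_negs:
        loss -= math.log(sig(-dot(c_neg, w)))
    return loss

# aligned positive, anti-aligned negative -> small loss (~0.63 here)
print(sgns_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]]))
```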

Learning: SGD

Stochastic gradient descent walks each parameter in the direction of steepest descent on this loss. The update rule per training example:

$$\theta^{t+1} = \theta^{t} - \eta \, \nabla_\theta L(f(x; \theta), y)$$

where $\eta$ is the learning rate. The first term of the loss pulls apricot closer to jam (positive); the second pushes apricot away from aardvark, coaxial, etc. (negatives). After enough epochs, words that share many contexts end up near each other, and words that don’t end up apart.
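One per-example update can be sketched with NumPy, using the standard per-example SGNS gradients (stated in the comments without derivation); `sgns_step` is a hypothetical helper, and the vectors are mutated in place:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos, c_negs, eta=0.1):
    """One SGD step on the SGNS loss, with gradients:
      dL/dc_pos = (sigma(c_pos.w) - 1) * w
      dL/dc_neg =  sigma(c_neg.w)      * w
      dL/dw     = (sigma(c_pos.w) - 1) * c_pos + sum_i sigma(c_neg_i.w) * c_neg_i
    All updates use the pre-step parameter values."""
    s_pos = sigmoid(c_pos @ w)
    grad_w = (s_pos - 1.0) * c_pos          # copy of old c_pos, scaled
    c_pos -= eta * (s_pos - 1.0) * w        # (s_pos - 1) < 0: pulls c_pos toward w
    for c_neg in c_negs:
        s_neg = sigmoid(c_neg @ w)
        grad_w += s_neg * c_neg
        c_neg -= eta * s_neg * w            # pushes c_neg away from w
    w -= eta * grad_w                       # w updated last, from old values

rng = np.random.default_rng(0)
w, c_pos = rng.normal(size=4), rng.normal(size=4)
c_negs = [rng.normal(size=4) for _ in range(2)]

before = sigmoid(c_pos @ w)
for _ in range(100):
    sgns_step(w, c_pos, c_negs)
# P(+ | w, c_pos) rises as target and positive context attract each other
print(before, sigmoid(c_pos @ w))
```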

Initialisation is random $d$-dimensional vectors. Convergence in practice takes a single pass or a few passes over a large corpus (SGNS is trained on billions of tokens).

Two Sets of Embeddings

SGNS actually learns two embeddings per word: one as a target (matrix $W$, used when the word is the centre of its window) and one as a context (matrix $C$, used when the word is a neighbour). Final representations usually sum the two: $\mathbf{w}_i + \mathbf{c}_i$.

Formally the full parameter set is

$$\theta = \begin{bmatrix} W \\ C \end{bmatrix}$$

with $2|V|$ rows total: a target-role and a context-role vector for each word in the vocabulary.
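A shape-level sketch of the two matrices and their combination (the vocabulary size, dimensionality, and random values are invented for illustration):

```python
import numpy as np

# Hypothetical trained parameters: |V| = 5 words, d = 3 dimensions.
V, d = 5, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(V, d))   # target-role embeddings (one row per word)
C = rng.normal(size=(V, d))   # context-role embeddings

theta = np.vstack([W, C])     # the full parameter matrix: 2|V| rows
embeddings = W + C            # a common choice for the final vectors
print(theta.shape, embeddings.shape)
```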

Properties of Trained Embeddings

Window size shapes what gets learned

Small windows capture syntactic / taxonomic similarity — words that could substitute grammatically. Hogwarts’s nearest neighbours become other fictional schools: Sunnydale, Evernight, Blandings.

Large windows capture topical relatedness — words in the same semantic field. Hogwarts’s nearest neighbours become the Harry Potter world: Dumbledore, half-blood, Malfoy.

This is the similarity vs relatedness distinction from lexical semantics, now tunable via a hyperparameter.

Analogical relations (the parallelogram method)

A long-standing observation (Rumelhart & Abrahamson 1973 in psychology; Turney & Littman 2005 and Mikolov et al. 2013 in NLP): analogies can be solved by vector arithmetic. To solve apple is to tree as grape is to ___, compute

$$\mathbf{b}^* = \overrightarrow{\textit{tree}} - \overrightarrow{\textit{apple}} + \overrightarrow{\textit{grape}}$$

With trained embeddings, the closest word to $\mathbf{b}^*$ is vine. Similarly, $\overrightarrow{\textit{king}} - \overrightarrow{\textit{man}} + \overrightarrow{\textit{woman}}$ lands near queen, and $\overrightarrow{\textit{Paris}} - \overrightarrow{\textit{France}} + \overrightarrow{\textit{Italy}}$ lands near Rome.

The visual is a parallelogram in embedding space: opposite sides represent the same semantic relation (gender, capital-of-country).
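The parallelogram arithmetic, sketched with tiny hand-built 2-d vectors (invented purely to illustrate the method, not trained embeddings; `solve_analogy` is a hypothetical helper):

```python
import numpy as np

def solve_analogy(a, b, a2, vectors):
    """Parallelogram method: return the word whose vector is closest
    (by cosine) to b - a + a2, excluding the three query words, as is
    standard practice."""
    target = vectors[b] - vectors[a] + vectors[a2]
    best, best_sim = None, -np.inf
    for word, v in vectors.items():
        if word in (a, b, a2):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Hand-built vectors where "royalty" and "gender" are separate axes.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}
print(solve_analogy("man", "king", "woman", vecs))  # -> queen
```

Note the exclusion of the query words themselves: without it, the nearest neighbour of $\mathbf{b}^*$ is often one of the inputs, a detail the caveats below touch on.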

Caveats. The parallelogram method only works for frequent words, small distances, and certain relations — capitals-of-countries and parts-of-speech do reliably; many others don’t (Linzen 2016, Gladkova et al. 2016, Ethayarajh et al. 2019a). Understanding when and why analogies work remains an open research question.

Diachronic embeddings: meaning change over time

Hamilton, Leskovec & Jurafsky (2016) trained separate embeddings on different decades of text (~30M Google Books, 1850–1990) to track semantic shift:

  • gay: 1900s neighbours include daft, tasteful, cheerful; 1990s neighbours include lesbian, homosexual, bisexual.
  • broadcast: 1850s sits near sow, seed, scatter (agricultural sense); 1990s sits near television, radio, bbc.
  • awful: 1850s is near majestic, solemn, awe; 1990s near wonderful, weird, terrible.

Each word’s embedding moves through space as its usage changes. A simple, striking use of vector semantics as a historical-linguistics tool.

Embeddings reflect cultural bias

The same method that recovers king − man + woman ≈ queen also produces:

  • father : doctor :: mother : nurse
  • man : computer programmer :: woman : homemaker (Bolukbasi et al. 2016)

The embeddings are doing exactly what they’re trained to do — reflect the statistical regularities of the training corpus. When the corpus is the web, those regularities encode gender, racial, and cultural stereotypes. Systems that consume embeddings — hiring searches, content ranking, translation — propagate those biases downstream.

Garg, Schiebinger, Jurafsky & Zou (2018) used diachronic embeddings to quantify historical bias, finding e.g. that competence adjectives (smart, wise, brilliant) were biased toward men in 1910s embeddings with the bias decreasing 1960–1990, and that dehumanising adjectives (barbaric, monstrous, bizarre) were biased toward Asians in 1930s embeddings — patterns that match independently-measured old surveys. Embeddings become a window into the cultural record.
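A simplified probe in the spirit of these bias measurements, with toy vectors invented for illustration (the real studies use trained embeddings and averaged word lists for each group, not single vectors):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def gender_association(v, v_he, v_she):
    """Positive score: the word sits closer to the 'he' vector;
    negative: closer to 'she'. A simplified version of the
    association measures in Bolukbasi et al. / Garg et al."""
    return cos(v, v_he) - cos(v, v_she)

# Toy 2-d vectors, hand-built so the stereotype is visible.
he, she = np.array([1.0, 0.2]), np.array([-1.0, 0.2])
nurse = np.array([-0.8, 0.6])
programmer = np.array([0.9, 0.5])
print(gender_association(nurse, he, she))       # negative: skews "she"
print(gender_association(programmer, he, she))  # positive: skews "he"
```

Tracking such scores across decade-by-decade embeddings is exactly what turns the representation into a measurement instrument for the cultural record.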

This is the harms story again, one layer deeper: not “biased training labels” but “biased text itself.” No classifier is in scope yet, but the bias is already baked into the representation that every classifier will use.

Summary: How to Train Word2vec

  1. Start with random $d$-dimensional vectors as initial embeddings.
  2. Take a corpus, extract positive pairs from co-occurrence windows and negative pairs by frequency-sampling the vocabulary.
  3. Train the skip-gram classifier (logistic regression over dot-product similarity) by SGD to distinguish the two classes.
  4. Throw away the classifier weights you don’t need; keep the word embeddings (optionally summing target and context matrices).

The classifier is a scaffold. The embeddings are the product.

  • vector-semantics — the framework: words as points, meaning from distribution
  • tf-idf — the sparse, count-based alternative word2vec replaces for most uses
  • pmi — Levy & Goldberg (2014) showed skip-gram-with-negative-sampling implicitly factorises a shifted-PMI matrix
  • cosine-similarity — the metric over trained embeddings
  • lexical-semantics — the structure the embeddings are trying to recover
  • harms-in-classification — embeddings propagate corpus bias into every downstream model

Active Recall