Cosine similarity measures the angle between two vectors — the dot product divided by the product of their lengths. It’s the standard way to compute word similarity because it separates direction (what the vectors mean) from magnitude (how often the words appeared).

Dot Product and Its Flaw

The dot product is the starting point:

$$\mathrm{dot}(\mathbf{v}, \mathbf{w}) = \mathbf{v} \cdot \mathbf{w} = \sum_{i=1}^{N} v_i w_i$$

It’s high when the two vectors have large values in the same dimensions (which intuitively means they co-occur with similar words) and low when their large-value dimensions disagree. So far so good.
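As a quick sketch in NumPy (the vectors here are made-up toys, not corpus counts):

```python
import numpy as np

# Toy vectors; the values are invented.
v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

# Sum of element-wise products: 1*4 + 2*5 + 3*6 = 32.
print(np.dot(v, w))  # 32.0
```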

The flaw: the dot product also grows with vector length. Define length as

$$|\mathbf{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}$$

Frequent words like of, the, you have long vectors: they co-occur with many things, many times. So their dot product with anything is high, simply because their coordinates are big. A raw dot product would say the is “similar” to every word in the vocabulary, a useless metric.
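A sketch of the flaw, with invented counts: a hypothetical frequent word has large coordinates everywhere, so its raw dot product with anything comes out large:

```python
import numpy as np

# Hypothetical co-occurrence counts over three context dimensions.
the     = np.array([900.0, 850.0, 920.0])  # frequent word: big everywhere
cherry  = np.array([40.0, 1.0, 0.0])       # content word, food field
digital = np.array([0.0, 2.0, 45.0])       # content word, tech field

# Length |v| = sqrt(sum of squared coordinates).
print(np.linalg.norm(the))      # ~1542.4: a long vector

# Raw dot products: 'the' looks similar to everything just by being long.
print(np.dot(the, cherry))      # 36850.0
print(np.dot(the, digital))     # 43100.0
print(np.dot(cherry, digital))  # 2.0
```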

The Cosine Fix

Divide by the lengths, and the magnitude dependence cancels:

$$\cos(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v} \cdot \mathbf{w}}{|\mathbf{v}|\,|\mathbf{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}$$

This is just the cosine of the angle $\theta$ between the two vectors, following from the identity $\mathbf{v} \cdot \mathbf{w} = |\mathbf{v}|\,|\mathbf{w}|\cos\theta$. Geometrically it asks “what direction do the vectors point, ignoring how long they are?”
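A minimal implementation of the formula (a sketch; it assumes neither vector is all zeros):

```python
import numpy as np

def cosine(v, w):
    # cos(v, w) = (v . w) / (|v| |w|)
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

print(cosine(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # ~1.0: same direction, different length
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))            # 0.0: orthogonal
```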

Range

  • $+1$: vectors point in the same direction (perfect similarity).
  • $0$: vectors are orthogonal (no similarity).
  • $-1$: vectors point in opposite directions.

For term-term matrices with raw counts, all coordinates are non-negative, so the cosine ranges over $[0, 1]$; you’ll never see negative cosines. For PMI vectors (which can have negative entries when PMI is allowed to be negative) or for dense embeddings, the full range $[-1, 1]$ is possible.
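A quick check of both regimes, with toy numbers:

```python
import numpy as np

def cosine(v, w):
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Raw counts: no coordinate is negative, so 0.0 is the floor.
print(cosine(np.array([3.0, 0.0, 1.0]), np.array([0.0, 5.0, 0.0])))  # 0.0

# Vectors with negative entries (PMI-weighted or dense) can reach -1.
print(cosine(np.array([0.5, -1.2]), np.array([-0.5, 1.2])))          # -1.0
```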

Worked Example (from the slides)

Using a small word-word submatrix over the contexts pie, data, computer:

             pie   data   computer
cherry       442      8          2
digital        5   1683       1670
information    5   3982       3325

Cosine between cherry and information:

$$\cos(\text{cherry}, \text{information}) = \frac{442 \times 5 + 8 \times 3982 + 2 \times 3325}{\sqrt{442^2 + 8^2 + 2^2}\;\sqrt{5^2 + 3982^2 + 3325^2}} = \frac{40716}{442.1 \times 5187.7} \approx 0.017$$

Cosine between digital and information:

$$\cos(\text{digital}, \text{information}) = \frac{5 \times 5 + 1683 \times 3982 + 1670 \times 3325}{\sqrt{5^2 + 1683^2 + 1670^2}\;\sqrt{5^2 + 3982^2 + 3325^2}} = \frac{12254481}{2371.0 \times 5187.7} \approx 0.996$$

The numbers match intuition: digital and information both live in the tech semantic field, while cherry lives in the food semantic field. Their cosine similarity (0.996 vs 0.017) captures that separation. Crucially, information is a much longer vector than digital (it appears more often), but the cosine doesn’t care — it normalises length out.
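The table values can be plugged straight in to check those numbers:

```python
import numpy as np

def cosine(v, w):
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Rows of the submatrix, over contexts (pie, data, computer).
cherry      = np.array([442.0, 8.0, 2.0])
digital     = np.array([5.0, 1683.0, 1670.0])
information = np.array([5.0, 3982.0, 3325.0])

print(cosine(cherry, information))   # ~0.0178 (the slides round to 0.017)
print(cosine(digital, information))  # ~0.996
```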

Worked Example (Worksheet Q5)

For word vectors king and queen:

  • Dot product:

Very similar — as expected from two words in the same semantic field of monarchy.
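The worksheet’s actual numbers aren’t reproduced here, but the calculation has the same shape for any pair; the vectors below are made-up stand-ins:

```python
import numpy as np

# Hypothetical stand-ins; substitute the worksheet's actual vectors.
king  = np.array([0.8, 0.3, 0.9])
queen = np.array([0.7, 0.4, 0.85])

dot = float(np.dot(king, queen))
cos = dot / (np.linalg.norm(king) * np.linalg.norm(queen))
print(round(dot, 3))  # 1.445
print(round(cos, 3))  # 0.994: nearly parallel, i.e. very similar
```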

Why Cosine Is the Default

Three practical reasons:

  1. Length independence. Frequent words aren’t automatically “similar to everything” just because their coordinates are large.
  2. Matches human similarity judgments. On datasets like SimLex-999, cosine of learned embeddings correlates well with humans’ pairwise ratings.
  3. Cheap to compute. One dot product and two norms — linear in vector dimension.

For normalised vectors (length = 1), cosine and dot product are identical — many systems pre-normalise embeddings so they can use the faster raw dot product during retrieval.
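A sketch of that trick with an invented embedding matrix (shapes and data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10_000, 300))   # toy embedding matrix: 10k words, 300 dims

# Normalise every row to unit length once, up front.
E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)

# Cosine against a query is now a single matrix-vector product.
q = E_unit[42]                       # some word's unit vector as the query
sims = E_unit @ q                    # cosine similarity to every word at once
print(sims[42])                      # ~1.0: a vector's cosine with itself
print(np.argsort(sims)[::-1][:5])    # indices of the 5 nearest neighbours
```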

Pitfall: Cosine Is Not a Metric

$1 - \cos(\mathbf{v}, \mathbf{w})$ is sometimes called cosine distance, but it isn’t a proper distance metric: it doesn’t satisfy the triangle inequality in general. For most NLP tasks this doesn’t matter; for nearest-neighbour search with tree indexes it can. When it does matter, use Euclidean distance on length-normalised vectors, which is monotonically related to cosine and is a metric.
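A hand-picked 2-D counterexample makes the failure concrete; the last line uses the identity $\|\mathbf{v} - \mathbf{w}\|^2 = 2 - 2\cos(\mathbf{v}, \mathbf{w})$ for unit vectors:

```python
import numpy as np

def cos_dist(v, w):
    return 1.0 - float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0]) / np.sqrt(2)   # unit vector halfway between a and c
c = np.array([0.0, 1.0])

print(cos_dist(a, c))                   # 1.0
print(cos_dist(a, b) + cos_dist(b, c))  # ~0.586: the two-leg path is "shorter"

# On unit vectors, Euclidean distance ranks neighbours identically
# and does satisfy the triangle inequality.
print(np.linalg.norm(a - c) ** 2)       # 2.0 = 2 - 2*cos(a, c)
```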

Related

  • vector-semantics — what the vectors being compared represent
  • tf-idf — cosine is the standard metric over tf-idf vectors; length normalisation is critical for retrieval
  • pmi — cosine also works over PPMI-weighted vectors
  • word2vec — dense embeddings are typically compared via cosine; the skip-gram objective is itself dot-product-based

Active Recall