The bag-of-words representation treats a document as an unordered multiset of tokens — a vector of word counts, with word position discarded entirely.

Definition

Given a document $d$ tokenized into $w_1, w_2, \ldots, w_n$, the bag of words is the mapping from each type $v$ in the vocabulary $V$ to its count in $d$:

$$\operatorname{count}(v, d) = \sum_{i=1}^{n} \mathbf{1}[w_i = v]$$

The representation loses word order and syntactic structure. The documents “dog bites man” and “man bites dog” have the same bag-of-words representation.
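A minimal sketch of this definition, assuming whitespace tokenization for brevity:

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each type to its token count in the document; positions are discarded."""
    return Counter(tokens)

# Both word orders collapse to the same bag.
assert bag_of_words("dog bites man".split()) == bag_of_words("man bites dog".split())
print(bag_of_words("dog bites man".split()))
# Counter({'dog': 1, 'bites': 1, 'man': 1})
```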

Why It’s Useful

Despite the obvious information loss, bag of words is the default starting representation for text classification. Reasons:

  1. Simplicity — the representation is a sparse integer vector over the vocabulary; no parsing, no feature engineering beyond tokenization.
  2. Classifier independence — naive-bayes, logistic regression, SVMs, and even early neural networks all consume it directly (see the sketch after this list).
  3. Strong empirical baseline — for topical tasks (spam, news categorization, MeSH labelling) the words that appear are already informative enough that order mostly doesn’t help.
  4. Matches the Naive Bayes independence assumption — if you’re going to assume features are conditionally independent given the class, there is no reason to carry order information that the model cannot use.
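
A sketch of point 2, assuming a scikit-learn pipeline and a hypothetical toy corpus: the same sparse count matrix feeds three different model families unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = spam, 0 = ham.
docs = [
    "buy cheap pills now",
    "meeting agenda attached",
    "cheap pills buy buy",
    "project status meeting",
]
labels = [1, 0, 1, 0]

# One sparse integer count vector per document, over the corpus vocabulary.
X = CountVectorizer().fit_transform(docs)

# The identical feature matrix works for all three classifiers.
for clf in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    print(type(clf).__name__, clf.fit(X, labels).predict(X))
```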

Variants

| Variant | What counts as a feature |
| --- | --- |
| Count | raw token frequency |
| Binary | did $v$ occur at all? Used in binary multinomial NB for sentiment |
| TF-IDF | term frequency weighted by inverse document frequency (not covered this week) |

Binary counts often beat raw counts for sentiment — whether the word fantastic occurs is more informative than how many times.
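
A sketch of the count vs. binary variants, assuming scikit-learn's CountVectorizer (its binary=True option clips each non-zero count to 1; the review snippet is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["fantastic fantastic fantastic plot, terrible acting"]

count_vec = CountVectorizer().fit(doc)
binary_vec = CountVectorizer(binary=True).fit(doc)  # non-zero counts become 1

print(dict(zip(count_vec.get_feature_names_out(),
               count_vec.transform(doc).toarray()[0].tolist())))
# {'acting': 1, 'fantastic': 3, 'plot': 1, 'terrible': 1}
print(dict(zip(binary_vec.get_feature_names_out(),
               binary_vec.transform(doc).toarray()[0].tolist())))
# {'acting': 1, 'fantastic': 1, 'plot': 1, 'terrible': 1}
```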

What Bag of Words Discards

  • Word order: “not good” and “good not” become identical. This is the reason negation handling has to be bolted on as a preprocessing step (see the sketch after this list) — otherwise “I don’t like this movie” and “I like this movie” differ only in the word don't, which the model treats as a single independent feature.
  • Syntactic structure: any information carried by the position of a word relative to others — subject vs object, modifier vs head — is lost.
  • Phrasal semantics: “New York” becomes two tokens; the phrase no longer functions as a single entity.
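
As a sketch of how negation handling gets bolted on: prefix every token after a negator with NOT_ until the next punctuation mark, so negated words become distinct features. The negator list and prefix below are illustrative choices, not a standard.

```python
import re

NEGATORS = {"not", "no", "never", "don't", "didn't", "isn't"}  # illustrative, not exhaustive

def mark_negation(tokens):
    """Prefix NOT_ to tokens following a negator, until the next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        if tok.lower() in NEGATORS:
            negating = True
            out.append(tok)
        elif re.fullmatch(r"[.,!?;:]+", tok):
            negating = False
            out.append(tok)
        else:
            out.append("NOT_" + tok if negating else tok)
    return out

print(mark_negation("i don't like this movie .".split()))
# ['i', "don't", 'NOT_like', 'NOT_this', 'NOT_movie', '.']
```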

Connection to Language Models

A bag-of-words classifier with all word features and no position information is equivalent to a unigram language model per class — Naive Bayes scores a document under class $c$ by $P(c) \prod_{i} P(w_i \mid c)$, and the $\prod_{i} P(w_i \mid c)$ factor is exactly what a unigram LM computes. This is covered in Relationship to language modelling.
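
A minimal sketch of that equivalence, assuming add-one smoothing and hypothetical per-class counts (in practice the counts come from the training set):

```python
import math
from collections import Counter

def nb_log_score(tokens, class_counts, prior, vocab_size):
    """log P(c) + sum_i log P(w_i | c): the sum term is a unigram LM for the class."""
    total = sum(class_counts.values())
    return math.log(prior) + sum(
        math.log((class_counts[w] + 1) / (total + vocab_size))  # add-one smoothing
        for w in tokens
    )

# Hypothetical counts for a "positive" class over a 3-word vocabulary.
pos_counts = Counter({"great": 3, "fun": 2})
print(round(nb_log_score("great fun".split(), pos_counts, prior=0.5, vocab_size=3), 3))
# -2.367
```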

Related

  • naive-bayes — the canonical consumer of bag-of-words features
  • tokenization — bag of words is really a bag of whatever the tokenizer produces
  • type-and-token — bag of words is a mapping from types to token counts within a document
  • sentiment-analysis — where the lossiness of bag-of-words becomes a real problem (negation, irony)

Active Recall