The bag-of-words representation treats a document as an unordered multiset of tokens — a vector of word counts, with word position discarded entirely.
Definition
Given a document $d$ tokenized into $w_1, w_2, \ldots, w_n$, the bag of words is the mapping from each type $v$ in the vocabulary $V$ to its count in $d$:

$$\text{count}(v, d) = \left|\{\, i : w_i = v \,\}\right|$$
The representation loses word order and syntactic structure. The documents “dog bites man” and “man bites dog” have the same bag-of-words representation.
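As a quick illustration (the token lists are invented), the mapping is just a counter over tokens, and the two orderings above collapse to the same object:

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each type to its count in the document; token order is discarded."""
    return Counter(tokens)

# the two word orders collapse to the same representation
print(bag_of_words(["dog", "bites", "man"]) == bag_of_words(["man", "bites", "dog"]))  # True
```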
Why It’s Useful
Despite the obvious information loss, bag of words is the default starting representation for text classification. Reasons:
- Simplicity — the representation is a sparse integer vector over the vocabulary; no parsing, no feature engineering beyond tokenization.
- Classifier independence — naive-bayes, logistic regression, SVMs, and even early neural networks all consume it directly.
- Strong empirical baseline — for topical tasks (spam, news categorization, MeSH labelling) the words that appear are already informative enough that order mostly doesn’t help.
- Matches the Naive Bayes independence assumption — if you’re going to assume features are conditionally independent given the class, there is no reason to carry order information that the model cannot use.
Variants
| Variant | What counts as a feature |
|---|---|
| Count | $\text{count}(w, d)$ — raw token frequency |
| Binary | $\mathbb{1}[\text{count}(w, d) > 0]$ — did $w$ occur at all? Used in binary multinomial NB for sentiment |
| TF-IDF | frequency weighted by inverse document frequency (not covered this week) |
Binary counts often beat raw counts for sentiment — whether the word fantastic occurs is more informative than how many times.
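A small sketch of the count and binary variants on an invented review; the repeated word dominates the count features but contributes only once after binary clipping:

```python
from collections import Counter

def count_features(tokens):
    """Count variant: raw token frequency per type."""
    return dict(Counter(tokens))

def binary_features(tokens):
    """Binary variant: 1 if the type occurs at all, however many times it repeats."""
    return {w: 1 for w in set(tokens)}

review = "fantastic fantastic fantastic plot fantastic acting".split()
print(count_features(review))   # 'fantastic' contributes a count of 4
print(binary_features(review))  # 'fantastic' contributes exactly once
```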
What Bag of Words Discards
- Word order: “not good” and “good not” become identical. This is the reason negation handling has to be bolted on as a preprocessing step (a sketch of one common bolt-on follows this list) — otherwise “I don’t like this movie” and “I like this movie” differ only in the word don’t, which the model treats as a single independent feature.
- Syntactic structure: any information carried by the position of a word relative to others — subject vs object, modifier vs head — is lost.
- Phrasal semantics: “New York” becomes two tokens; the phrase no longer functions as a single entity.
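The usual bolt-on for negation is to mark every token between a negation word and the next punctuation, so negated and non-negated occurrences become distinct features. A minimal sketch of that heuristic (the negator list and function name here are illustrative, not from this note):

```python
import re

NEGATORS = {"not", "no", "never", "don't", "doesn't", "didn't", "isn't", "won't"}

def mark_negation(tokens):
    """Prefix tokens after a negation word with NOT_ until the next punctuation,
    so "don't like" yields the distinct feature NOT_like rather than plain "like"."""
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):   # punctuation ends the negation scope
            negating = False
            out.append(tok)
        elif tok.lower() in NEGATORS:        # negator starts the scope
            negating = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negating else tok)
    return out

print(mark_negation(["i", "don't", "like", "this", "movie"]))
# ['i', "don't", 'NOT_like', 'NOT_this', 'NOT_movie']
```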
Connection to Language Models
A bag-of-words classifier with all word features and no position is equivalent to a unigram language model per class — Naive Bayes scores a document under class $c$ by $P(c)\prod_{i} P(w_i \mid c)$, exactly what a unigram LM does. This is covered in Relationship to language modelling.
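A small sketch of that equivalence, assuming add-one smoothing and a made-up two-class toy corpus: the per-class model is just a unigram LM, and the Naive Bayes decision is the argmax of the log prior plus summed log unigram probabilities.

```python
import math
from collections import Counter

def train_unigram_per_class(docs_by_class, vocab):
    """Per-class unigram LM with add-one smoothing: P(w | c)."""
    models = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        models[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return models

def nb_score(doc, prior, likelihoods):
    """log P(c) + sum_i log P(w_i | c): the Naive Bayes score is a unigram LM score."""
    return math.log(prior) + sum(math.log(likelihoods[w]) for w in doc if w in likelihoods)

# toy corpus (made-up data)
docs_by_class = {
    "pos": [["fantastic", "plot"], ["great", "acting"]],
    "neg": [["boring", "plot"], ["bad", "acting"]],
}
vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
models = train_unigram_per_class(docs_by_class, vocab)

doc = ["fantastic", "acting"]
scores = {c: nb_score(doc, 0.5, models[c]) for c in models}
print(max(scores, key=scores.get))  # 'pos'
```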
Related
- naive-bayes — the canonical consumer of bag-of-words features
- tokenization — bag of words is really a bag of whatever the tokenizer produces
- type-and-token — bag of words is a mapping from types to token counts within a document
- sentiment-analysis — where the lossiness of bag-of-words becomes a real problem (negation, irony)
Active Recall
What information does the bag-of-words representation throw away, and which NLP problems does that make harder?
It throws away word order and all syntactic structure. This makes negation handling hard (“don’t like” looks like “don’t” + “like” independently), multiword expressions disappear, and any task that depends on who-did-what-to-whom (relation extraction, question answering) is impossible from BoW alone. For topical classification, the loss is usually tolerable.
Why does binary bag-of-words often outperform count bag-of-words for sentiment analysis?
For sentiment, occurrence carries most of the signal — seeing the word fantastic once tells you almost everything about its contribution; seeing it five times adds little. Raw counts let very repetitive documents dominate the likelihood calculation with redundant evidence. Binary clipping makes each informative word count exactly once per document, which matches the structure of the task better.