Text classification assigns a predefined category to a document — the general pattern that covers spam detection, sentiment analysis, authorship identification, topic labelling, and language ID.

Definition

Input: a document $d$ and a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$. Output: a predicted class $\hat{c} \in C$.

The set of classes is fixed in advance: text classification is a closed-label task. Choosing the labels, and the annotation process that produces the training examples, is a design decision that shapes everything the model does downstream.

Example Tasks

| Task | Classes |
| --- | --- |
| Spam detection | spam / not spam |
| Sentiment analysis | positive / negative (sometimes neutral) |
| Authorship identification | a fixed set of candidate authors (the Federalist Papers problem: Hamilton, Madison, Jay) |
| Topic categorization | MeSH subject hierarchy, newswire topics |
| Language ID | a fixed set of languages, often using character n-grams as features (see the sketch after this table) |
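As a quick illustration of the character n-gram features mentioned for language ID, here is a minimal sketch; the boundary-padding convention is one common choice, not a fixed standard:

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Sliding character n-grams, padded so word edges become features too."""
    padded = f"_{text}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("the"))  # ['_th', 'the', 'he_']
```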

Two Classification Strategies

1. Hand-coded rules

Rules over features: spam := black-list-address OR ("dollars" AND "have been selected"). When experts refine rules carefully in a narrow domain, precision can be very high. But maintaining rules is expensive, and rules are brittle: they miss paraphrases, so recall is low. Hand-coded rules are a high-precision, low-recall regime.
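A minimal sketch of how the rule above might be encoded, assuming a toy blacklist; the names here are illustrative, not from any particular system:

```python
# Hypothetical blacklist of sender addresses (illustrative only).
BLACKLIST = {"wsmith@example.com"}

def is_spam(sender: str, body: str) -> bool:
    """spam := black-list-address OR ("dollars" AND "have been selected")"""
    text = body.lower()
    if sender in BLACKLIST:
        return True
    return "dollars" in text and "have been selected" in text

print(is_spam("friend@example.org",
              "Millions of dollars ... you have been selected!"))  # True
```

Real rule systems chain many such tests, and each new paraphrase ("$$$", "you've won") needs another hand-written rule, which is exactly why recall stays low.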

2. Supervised machine learning

Given a training set of hand-labelled documents $(d_1, c_1), \ldots, (d_m, c_m)$, learn a classifier $\gamma : d \to \hat{c}$. Candidates include Naive Bayes, logistic regression, $k$-nearest neighbours, neural networks, and prompted/fine-tuned LLMs. This week introduces Naive Bayes as the simplest, most interpretable baseline: fast to train, robust on small datasets, and surprisingly hard to beat on many tasks.
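A minimal sketch of multinomial Naive Bayes over bag-of-words counts, with add-one (Laplace) smoothing and log-space arithmetic; the toy data and function names are illustrative:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (token_list, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # per-class token counts
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, len(docs)

def predict(tokens, class_counts, word_counts, vocab, n_docs):
    best_class, best_logp = None, float("-inf")
    for c, n_c in class_counts.items():
        logp = math.log(n_c / n_docs)            # log prior, MLE estimate
        total = sum(word_counts[c].values())
        for w in tokens:
            if w not in vocab:                   # skip words never seen in training
                continue
            # add-one smoothed log likelihood of P(w | c)
            logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

docs = [("you have been selected dollars".split(), "spam"),
        ("meeting agenda attached".split(), "ham")]
model = train(docs)
print(predict("dollars selected".split(), *model))  # spam
```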

Where This Sits in the Module

Text classification is the first real prediction task in NLP after language modelling. Where n-gram LMs asked “what is the probability of this sentence?”, classification asks “what category does this document belong to?” — a decision rather than a distribution. The machinery overlaps: bag-of-words representations, MLE parameter estimation, Laplace smoothing, and log-space arithmetic all reappear.
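As a toy illustration of why log-space arithmetic reappears: a product of many small probabilities underflows to zero in floating point, while the equivalent sum of logs stays well-behaved.

```python
import math

probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p

print(product)                          # 0.0 (underflow: true value is 1e-500)
print(sum(math.log(p) for p in probs))  # ≈ -1151.29, perfectly representable
```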

Active Recall