Tokenization is the task of segmenting a stream of text into meaningful units (tokens); it is the first step in virtually every NLP pipeline and harder than it appears.

Definition

Tokenization takes a raw character sequence as input and produces a sequence of tokens — the atomic units the downstream system will reason over. What counts as a token depends on the application: in most Western-language NLP, tokens are roughly words; in other settings they may be subwords, characters, or morphemes.

Space-Based Tokenization

The naive approach is to split on whitespace. The classic Unix one-liner below goes slightly further, treating every run of non-letter characters as a delimiter and then counting token frequencies:

tr -sc 'A-Za-z' '\n' < file.txt | sort | uniq -c | sort -rn

This works well for clean, formal English but fails immediately on nearly everything else.
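
A quick Python sketch (the example sentence is invented) shows the failure mode: a bare whitespace split leaves punctuation glued to words and cannot separate contractions.

text = "Mr. O'Neill said New York's finals weren't close."
print(text.split())
# ['Mr.', "O'Neill", 'said', 'New', "York's", 'finals', "weren't", 'close.']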

Hard Cases

Abbreviations and punctuation

A period can end a sentence, mark an abbreviation, or appear in a number:

Token         Period role
Dr.           Abbreviation marker
$45.55        Decimal separator
google.com    Part of URL
She left.     Sentence boundary

A simple split-on-period rule will mangle all of these. Handling abbreviations requires either a lookup list (Mr., Dr., Ph.D., m.p.h.) or a trained classifier.
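
A minimal sketch of the lookup-list strategy (the abbreviation set and helper function are illustrative, not drawn from any particular library):

ABBREVIATIONS = {"mr.", "dr.", "ph.d.", "m.p.h."}

def period_ends_sentence(token):
    # A period is treated as a sentence boundary only if the token is not
    # a known abbreviation and does not look like a number or a URL.
    t = token.lower()
    if t in ABBREVIATIONS:
        return False                     # Dr.
    if any(ch.isdigit() for ch in t):
        return False                     # $45.55
    if "." in t.rstrip("."):
        return False                     # google.com
    return t.endswith(".")               # She left.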

Clitics

In English, we're, I'd, and can't should often be split into their component morphemes (we + 're, I + 'd, can + n't). French is even more complex: j'ai (“I have”), l'homme (“the man”), du (a contraction of de + le).
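
A rough regex sketch of English clitic splitting (the two rules below are illustrative; real tokenizers use much longer rule lists, and conventions differ on whether can't becomes ca + n't, as here, or can + n't):

import re

CLITIC_RULES = [
    (re.compile(r"n't\b"), r" n't"),                    # can't  -> ca n't
    (re.compile(r"('re|'ve|'ll|'d|'s|'m)\b"), r" \1"),  # we're  -> we 're
]

def split_clitics(text):
    for pattern, replacement in CLITIC_RULES:
        text = pattern.sub(replacement, text)
    return text.split()

print(split_clitics("we're sure I'd say they can't"))
# ['we', "'re", 'sure', 'I', "'d", 'say', 'they', 'ca', "n't"]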

Multiword expressions

New York, ad hoc, rock 'n' roll, and part-time function as single semantic units but span multiple whitespace-delimited tokens. Deciding whether to treat them as one token or several depends on the application.
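
One common strategy is a second pass that merges known expressions after an initial split; NLTK's MWETokenizer works this way (the expression list below is illustrative):

from nltk.tokenize import MWETokenizer

# Merge known multiword expressions back into single tokens.
mwe = MWETokenizer([("New", "York"), ("ad", "hoc")], separator=" ")
print(mwe.tokenize("They met in New York on an ad hoc basis".split()))
# ['They', 'met', 'in', 'New York', 'on', 'an', 'ad hoc', 'basis']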

Chinese and Japanese

These languages use no spaces between words. A single character (hanzi in Chinese) is roughly a morpheme-sized unit, but words are sequences of one or more characters, and segmentation is genuinely ambiguous:

姚明进入总决赛
(Yao Ming enters the finals)

Depending on the segmenter, this can be parsed as 3 words, 5 words, or 7 single characters. The correct segmentation is 3 words: 姚明 / 进入 / 总决赛.
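
A short sketch using the third-party jieba segmenter (assuming the package is installed); character-level segmentation is shown for comparison:

import jieba  # third-party Chinese word segmenter

sentence = "姚明进入总决赛"
print(jieba.lcut(sentence))   # word-level segmentation, e.g. 姚明 / 进入 / 总决赛
print(list(sentence))         # character-level: one token per character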

NLTK Regex Tokenizer

A practical step up: define a regex pattern for the shapes of tokens you want, and collect them rather than splitting on delimiters.

import nltk

pattern = r'''(?x)             # verbose mode: whitespace and comments in the pattern are ignored
    (?:[A-Z]\.)+               # abbreviations: U.S.A.
  | \w+(?:-\w+)*               # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?         # currency and percentages: $45.55, 82%
  | \.\.\.                     # ellipsis
  | [][.,;"'?():_`-]           # punctuation marks (hyphen last so it is literal, not a range)
'''

tokens = nltk.regexp_tokenize(text, pattern)   # text: the input string to tokenize

The pattern is a cascade of alternatives, tried in order at each position, so more specific shapes must come first: if \w+ preceded the abbreviation alternative, U.S.A. would be broken into single letters and stray periods. This handles many edge cases that whitespace splitting cannot.
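
For instance (a minimal usage sketch; the sample sentence is illustrative):

text = "That U.S.A. poster-print costs $12.40..."
print(nltk.regexp_tokenize(text, pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']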

COMMON MISCONCEPTION

Tokenization is not a solved problem that can be ignored. The errors introduced here propagate forward — a mis-tokenized word is an incorrect feature for every downstream component. In low-resource languages or noisy text (tweets, clinical notes, code), tokenization quality is often the binding constraint on overall system performance.

Active Recall