A formal language for specifying text search patterns, built from a small set of operators that compose into arbitrarily complex matchers.

Definition

A regular expression (regex) is a sequence of characters that defines a pattern over strings. A regex engine takes a pattern and an input string and reports the substrings (matches) that conform to the pattern. In NLP, regexes appear at every layer: corpus preprocessing, tokenization, feature extraction, and rule-based systems like ELIZA.


Syntax

Basic building blocks

Construct  Meaning                                                         Example       Matches                 Test string
[abc]      Any one of a, b, c                                              [wW]oodchuck  woodchuck, Woodchuck    “The Woodchuck ate his dinner”
[a-z]      Any character in range                                          [A-Z]         any uppercase letter    “Drenched Blossoms”
[^abc]     Anything except a, b, c (negation, only when ^ is first in [])  [^A-Z]        any non-uppercase char  “Oyfn pripetchik”
.          Any single character                                            beg.n         begin, began, begun     “We must begin now”
^          Start of line (outside [])                                      ^The          The only at line start  “The quick brown fox”
$          End of line                                                     end$          end only at line end    “reach the end”

^ HAS THREE DISTINCT MEANINGS

  • First character inside []: negation. [^abc] matches any character that is not a, b, or c.
  • Outside []: start-of-line anchor. ^The matches The only at the beginning of a line.
  • Non-first character inside []: literal caret. [e^] matches e or ^; [^e^] matches any character that is neither e nor ^.

The same symbol, three unrelated jobs — which one applies is determined entirely by position.
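
The three roles can be checked directly with Python's re module. This is a small sketch; the test strings and variable names are made up for illustration.

```python
import re

# 1. First inside []: negation. Every character that is not a, b, or c.
negated = re.findall(r'[^abc]', 'abcde')
print(negated)      # ['d', 'e']

# 2. Outside []: start-of-line anchor. Only the line-initial "The" matches.
anchored = re.findall(r'^The', 'The cat saw The dog')
print(anchored)     # ['The']

# 3. Inside [] but not first: a literal caret.
literal = re.findall(r'[e^]', 'e^x')
print(literal)      # ['e', '^']
```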


Shorthand character classes

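These shorthands stand for character classes and appear constantly in real patterns. The equivalents shown are for ASCII text; on Python 3 str patterns the shorthands also cover Unicode digits, letters, and whitespace unless re.ASCII is set:

Shorthand  Equivalent (ASCII)  Meaning
\d         [0-9]               Any digit
\D         [^0-9]              Any non-digit
\w         [a-zA-Z0-9_]        Any word character (letter, digit, underscore)
\W         [^a-zA-Z0-9_]       Any non-word character
\s         [ \t\n\r\f\v]       Any whitespace
\S         [^ \t\n\r\f\v]      Any non-whitespace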

Examples: \d+ matches one or more digits. \w+ matches a whole word token. \s+ matches any run of whitespace and is the basis of whitespace tokenization.
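
A quick sketch of the three examples on one made-up sentence. Note how \w+ drops the trailing period while whitespace splitting keeps it attached to the last token.

```python
import re

text = "Order 66  shipped\tto 4 stores."

digits = re.findall(r'\d+', text)   # runs of digits
words = re.findall(r'\w+', text)    # runs of word characters
tokens = re.split(r'\s+', text)     # whitespace tokenization

print(digits)   # ['66', '4']
print(words)    # ['Order', '66', 'shipped', 'to', '4', 'stores']
print(tokens)   # ['Order', '66', 'shipped', 'to', '4', 'stores.']
```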


Word boundary \b

\b is a zero-width assertion: it matches a position rather than consuming any characters. That position lies between a \w character and a \W character, or between a \w character and the start or end of the string.

\bthe\b

This matches the as a standalone word, but not the inside there, other, or these.

Without \b:

the     →   matches "the", "there", "other", "hypothesis", …

With \b:

\bthe\b →   matches "the"   in "I went to the store"
             no match        in "there" or "other"

This is the standard fix for the false-positive problem when searching for whole words. You almost always want \b when matching specific words in NLP.
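
The difference is easy to see by counting matches on a sample sentence (the sentence is an illustrative assumption):

```python
import re

text = "I went to the store, then there was another one"

# Bare pattern: also fires inside "then", "there", and "another".
without_boundary = re.findall(r'the', text)

# With word boundaries: only the standalone word.
with_boundary = re.findall(r'\bthe\b', text)

print(len(without_boundary))  # 4
print(len(with_boundary))     # 1
```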


Quantifiers

Quantifiers attach to the immediately preceding element and control how many times it can appear.

Quantifier  Meaning                 Example     Matches
*           Zero or more            oo*h!       oh!, ooh!, oooh!, ooooh!
+           One or more             o+h!        oh!, ooh!, oooh! (not h!)
?           Zero or one (optional)  colou?r     color, colour
{n}         Exactly n               [0-9]{4}    2026 in “Call by 2026”
{n,m}       Between n and m         [a-z]{2,5}  be, began, begin
{n,}        n or more               \d{3,}      1800 in “Call 1800 today”

Examples:

  • colou?r — matches color and colour (the u is optional)
  • [0-9]{4} — matches exactly four digits (e.g. a year)
  • \d+\.\d{2} — matches a price like 45.99
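
The following sketch exercises these quantifiers on a few made-up strings. It also shows that oo*h! and o+h! accept the same strings, since both require at least one o.

```python
import re

samples = ['h!', 'oh!', 'ooh!', 'oooh!']

# fullmatch requires the whole string to match the pattern.
star = [s for s in samples if re.fullmatch(r'oo*h!', s)]
plus = [s for s in samples if re.fullmatch(r'o+h!', s)]
print(star)   # ['oh!', 'ooh!', 'oooh!']
print(plus)   # ['oh!', 'ooh!', 'oooh!']

spellings = re.findall(r'colou?r', 'color and colour')
years = re.findall(r'[0-9]{4}', 'Call by 2026')
prices = re.findall(r'\d+\.\d{2}', 'It costs 45.99 today')
print(spellings)  # ['color', 'colour']
print(years)      # ['2026']
print(prices)     # ['45.99']
```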

Greedy vs non-greedy

By default, quantifiers are greedy — they consume as many characters as possible while still allowing the overall match to succeed.

re.findall(r'<.+>', '<a>hello</a>')   # → ['<a>hello</a>']   (greedy)
re.findall(r'<.+?>', '<a>hello</a>')  # → ['<a>', '</a>']    (non-greedy)

Add ? after a quantifier (*?, +?, ??) to make it non-greedy: it will match as few characters as possible. Non-greedy is usually what you want when parsing structured text like HTML tags.


Grouping and alternation

How is a group different from just writing a pattern?

Without groups, quantifiers (*, +, ?, {n}) attach to the single token immediately before them — never more. So cat+ does not mean “one or more repetitions of cat”; it means “ca followed by one or more ts”. The + only sees the t.

This is the same problem as operator precedence in arithmetic. In 2 + 3 × 4, the × only applies to 3 and 4. Parentheses let you override that: (2 + 3) × 4. Regex parentheses work identically — they bundle multiple tokens into a single unit so that a quantifier or alternation applies to all of them together:

cat+        →   cat, catt, cattt, …       (+ applies only to t)
(cat)+      →   cat, catcat, catcatcat, … (+ applies to the whole word)

the cat|dog →   "the cat"  or  "dog"      (| splits at low precedence)
the (cat|dog) → "the cat"  or  "the dog"  (| applies only inside the group)

Grouping is the primary job. The capturing side-effect is a bonus.
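
A short check of how grouping changes what a quantifier or alternation applies to. The helper name matches_fully is an assumption introduced for this sketch.

```python
import re

def matches_fully(pattern, s):
    """True when the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# + binds to the single preceding token unless a group widens its scope.
print(matches_fully(r'cat+', 'cattt'))      # True
print(matches_fully(r'cat+', 'catcat'))     # False
print(matches_fully(r'(cat)+', 'catcat'))   # True

# | has the lowest precedence unless a group confines it.
print(matches_fully(r'the cat|dog', 'dog'))        # True
print(matches_fully(r'the cat|dog', 'the dog'))    # False
print(matches_fully(r'the (cat|dog)', 'the dog'))  # True
```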

Capturing: a capturing group also stores whatever text matched inside it into a numbered slot — \1 for the first group, \2 for the second, left to right. You can then reference that slot in two places:

  • In a replacement string — reuse or rearrange the captured text:
    re.sub(r'(\w+) (\w+)', r'\2 \1', 'Smith John')  →  'John Smith'
  • Inside the same pattern — enforce that a later part must equal what was already captured:
    ([a-z]+)\s+\1 matches "the the" (second word must repeat the first) but not "the cat"

If you only need grouping for scope — and have no use for the captured text — use (?:...), a non-capturing group, to avoid allocating an unnecessary slot.

Construct  Meaning
(abc)      Capturing group: groups and saves the match as \1, \2, …
(?:abc)    Non-capturing group: groups without saving
a|b        Alternation: match a or b

Use (?:...) when you need grouping for a quantifier but don’t want to create a capture group:

(?:cat|dog)s?   →   cat, cats, dog, dogs   (no capture created)
(cat|dog)s?     →   same matches, but \1 holds 'cat' or 'dog'

Multiple capture groups are indexed left to right:

(\w+)\s(\w+)    →   \1 = first word,  \2 = second word
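
In Python the numbered slots are read from the match object. This sketch uses illustrative input strings and shows that a non-capturing group leaves no slot behind.

```python
import re

m = re.search(r'(\w+)\s(\w+)', 'Smith John')
print(m.group(0))   # 'Smith John'  (the whole match)
print(m.group(1))   # 'Smith'
print(m.group(2))   # 'John'

capturing = re.search(r'(cat|dog)s?', 'two dogs')
non_capturing = re.search(r'(?:cat|dog)s?', 'two dogs')
print(capturing.group(0), capturing.groups())          # dogs ('dog',)
print(non_capturing.group(0), non_capturing.groups())  # dogs ()
```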

Backreferences in patterns

A backreference (\1, \2, …) can appear inside the pattern itself — not just in the replacement. This matches text that is identical to what was captured earlier in the same match.

re.findall(r'([a-zA-Z]+)\s+\1', text)

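This matches any word that appears twice consecutively, separated by whitespace: "the the", "Humbert Humbert", but not "the cat". The capture group matches the first word; \1 requires the second word to be identical. Two caveats: because the pattern contains a capture group, re.findall returns the captured word ("the") rather than the whole match ("the the"), and without \b anchors the pattern also fires on a repeated prefix, such as the "the the" inside "the theory".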

Practical uses:

  • Detecting duplicate words in text
  • Finding repeated tokens in corpora
  • Validating that two parts of a pattern match the same value
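
A minimal sketch of a duplicate-word detector built on a backreference. The function name find_duplicates and the added \b anchors are choices made for this example rather than part of the pattern shown above.

```python
import re

# \b on both sides keeps "the theory" from counting as a repeated "the".
DUPLICATE = re.compile(r'\b([a-zA-Z]+)\s+\1\b')

def find_duplicates(text):
    """Return each word that is immediately repeated in text."""
    return [m.group(1) for m in DUPLICATE.finditer(text)]

sentence = "I saw the the cat in the theory of Humbert Humbert"
print(find_duplicates(sentence))   # ['the', 'Humbert']

# Without the anchors, a repeated prefix is reported as a duplicate.
print(re.findall(r'([a-zA-Z]+)\s+\1', 'the theory'))   # ['the']
```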

COMMON MISCONCEPTION

Backreferences in patterns (\1 inside the regex) are different from backreferences in substitution strings (\1 in the replacement). In a pattern, \1 is a constraint — it says “this position must match whatever group 1 already captured.” In a replacement, \1 is an interpolation — it inserts the captured text. The syntax looks the same; the meaning is different.


Operator precedence (high → low)

  1. () — parentheses
  2. * + ? {} — quantifiers
  3. Sequences and anchors (concatenation)
  4. | — alternation

So a|bc* parses as a | (b(c*)): either a, or b followed by any number of c. To mean “either a or b, followed by any number of c” you would write (a|b)c*.
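
The parse can be confirmed by testing which whole strings each pattern accepts. The helper accepts is a name assumed for this sketch.

```python
import re

def accepts(pattern, s):
    """True when the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# a|bc* is a | (b(c*)): "ac" is rejected because c* belongs to the b branch.
print(accepts(r'a|bc*', 'a'))      # True
print(accepts(r'a|bc*', 'bccc'))   # True
print(accepts(r'a|bc*', 'ac'))     # False

# (a|b)c* lets c* follow either letter.
print(accepts(r'(a|b)c*', 'ac'))   # True
print(accepts(r'(a|b)c*', 'bcc'))  # True
```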


Escaping special characters

The characters . * + ? ^ $ { } [ ] ( ) | \ all have special meaning in regex. To match them literally, prefix with \:

You want to match      Write
A literal period       \.
A literal asterisk     \*
A literal parenthesis  \( or \)
A literal backslash    \\

Example: \$\d+\.\d{2} matches a price like $45.99 — the \$ and \. match literal symbols, \d+ and \d{2} match digit runs.
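
The sketch below contrasts the escaped pattern with an unescaped one on a made-up string, and uses the standard-library helper re.escape to build a pattern from arbitrary literal text.

```python
import re

text = 'Lunch was $45.99, not 45x99'

escaped = re.findall(r'\$\d+\.\d{2}', text)
print(escaped)     # ['$45.99']

# An unescaped . matches any character, so "45x99" slips through.
unescaped = re.findall(r'\d+.\d{2}', text)
print(unescaped)   # ['45.99', '45x99']

# re.escape backslash-escapes the special characters in a literal string.
literal = re.findall(re.escape('$45.99'), text)
print(literal)     # ['$45.99']
```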


Python raw strings

Python processes escape sequences (\n, \t, etc.) in strings before the regex engine ever sees the pattern. This causes double-escaping problems:

# BAD: Python converts \b to a backspace character; regex never sees \b
re.findall('\bcat\b', text)
 
# GOOD: raw string r'...' passes the characters literally to the regex engine
re.findall(r'\bcat\b', text)

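Always use r'...' for regex patterns in Python. Without it, \b silently becomes a backspace character and the pattern never matches a word boundary. Sequences such as \d, \w, and \s happen to survive because Python leaves unrecognized escapes untouched, but they trigger a warning in recent Python versions (a SyntaxWarning since 3.12), so the raw-string habit is the only reliable one.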
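
The effect on \b can be inspected before any matching happens, by looking at the characters each string literal actually contains (the sample text is an assumption):

```python
import re

plain = '\bcat\b'    # Python turns each \b into one backspace character
raw = r'\bcat\b'     # backslash and b both reach the regex engine

print(len(plain), len(raw))    # 5 7
print(plain[0] == '\x08')      # True: a backspace, not a word boundary

text = 'a cat sat'
print(re.findall(plain, text))   # []
print(re.findall(raw, text))     # ['cat']
```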


Lookahead Assertions

Lookaheads match a position without consuming characters.

Construct  Meaning
(?=...)    Positive lookahead: position is followed by the pattern
(?!...)    Negative lookahead: position is not followed by the pattern

re.findall(r'Windows(?! NT)', text)   # "Windows" only when NOT followed by " NT"
re.findall(r'\d+(?= dollars)', text)  # digits only when followed by " dollars"

Lookaheads are useful when you want to match something based on what comes after it, without including that context in the match itself.
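
A runnable sketch of both lookaheads on an illustrative sentence. The second pattern adds ` \w+` after the negative lookahead purely to show which occurrences of "Windows" survived.

```python
import re

text = "Windows NT, Windows 95 and Windows XP cost 40 dollars or 30 euros"

# Positive lookahead: " dollars" must follow but is not part of the match.
amounts = re.findall(r'\d+(?= dollars)', text)
print(amounts)    # ['40']

# Negative lookahead: skip "Windows" when " NT" follows.
versions = re.findall(r'Windows(?! NT) \w+', text)
print(versions)   # ['Windows 95', 'Windows XP']
```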


Substitutions and Capture Groups

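In sed/Perl notation, the substitution s/pattern/replacement/ replaces each match of pattern with replacement; in Python the same operation is re.sub(pattern, replacement, text). Capture groups let you reuse matched text in the replacement: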

import re
 
# Wrap every number in angle brackets: "34 items" → "<34> items"
re.sub(r'([0-9]+)', r'<\1>', '34 items in 2 boxes')
# → '<34> items in <2> boxes'
 
# Swap first and last name: "Smith John" → "John Smith"
re.sub(r'(\w+) (\w+)', r'\2 \1', 'Smith John')
# → 'John Smith'

ELIZA’s core mechanism is exactly this:

re.sub(r".* I'M (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1", text)

Multiple capture groups are referenced as \1, \2, … in the replacement string.
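
A minimal sketch of how such a rule could sit inside a responder loop. The RULES list holds only the rule shown above; the FALLBACK reply, the respond function, and the use of re.IGNORECASE are assumptions made for this example, not a reconstruction of the original ELIZA program.

```python
import re

# (pattern, response template) pairs, tried in order.
RULES = [
    (r".* I'M (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
]
FALLBACK = "PLEASE GO ON"   # hypothetical default reply

def respond(utterance):
    """Return the first rule's rewritten output, or the fallback."""
    for pattern, template in RULES:
        # subn reports how many substitutions were made.
        reply, count = re.subn(pattern, template, utterance,
                               flags=re.IGNORECASE)
        if count:
            return reply
    return FALLBACK

print(respond("Well I'm sad today"))   # I AM SORRY TO HEAR YOU ARE sad
print(respond("Hello there"))          # PLEASE GO ON
```

Because the pattern starts and ends with .*, the match spans the whole utterance, so the substitution replaces the entire input with the response.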


Building a Pattern Iteratively

Regex engineering is not “write once, it works”. It is a loop:

  1. Write a first-pass pattern — simple, covering the obvious cases.
  2. Test against real text. Find:
    • False positives: strings that match but shouldn’t.
    • False negatives: strings that should match but don’t.
  3. Tighten (add constraints) to reduce false positives.
  4. Broaden (add alternatives) to reduce false negatives.
  5. Repeat.

Worked example: find the word the in text.

Attempt 1:   [tT]he

False positives: matches the inside there, other, theology.

Attempt 2:   [tT]he[^a-zA-Z]

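False negatives: misses the at end of line (no character follows). A false positive also remains: the pattern still matches the inside bathe, because nothing constrains the preceding character.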

Attempt 3:   \b[tT]he\b

This correctly matches the and The as standalone words, and nothing else. Word boundaries handle both the “followed by non-letter” and “end of string” edge cases simultaneously.
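
The loop can be made mechanical with a small test harness. The example strings and the evaluate helper below are assumptions chosen to mirror the three attempts.

```python
import re

SHOULD_MATCH = ["the cat", "The end", "reach the", "(the)"]
SHOULD_NOT_MATCH = ["there", "other", "theology", "bathe now"]

def evaluate(pattern):
    """Return (false_positives, false_negatives) for pattern."""
    false_pos = [s for s in SHOULD_NOT_MATCH if re.search(pattern, s)]
    false_neg = [s for s in SHOULD_MATCH if not re.search(pattern, s)]
    return false_pos, false_neg

print(evaluate(r'[tT]he'))
# (['there', 'other', 'theology', 'bathe now'], [])
print(evaluate(r'[tT]he[^a-zA-Z]'))
# (['bathe now'], ['reach the'])
print(evaluate(r'\b[tT]he\b'))
# ([], [])
```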


Precision and Recall

Any regex pattern makes errors in two directions:

  • False positives (Type I): the pattern fires when it shouldn’t → lower precision.
  • False negatives (Type II): the pattern misses what it should catch → lower recall.

Tightening a pattern raises precision but risks lowering recall. Broadening raises recall but risks lowering precision. There is no free lunch — practical regex engineering is managing this tradeoff deliberately.
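
To make the tradeoff concrete, the sketch below scores three patterns for the word "the" against a tiny hand-labelled set. The data and the precision_recall helper are hypothetical, and real evaluation would need far more examples.

```python
import re

# Hypothetical labelled examples: (text, should_the_pattern_fire)
LABELLED = [
    ("the cat", True), ("The end", True), ("reach the", True),
    ("there", False), ("other", False), ("bathe now", False),
]

def precision_recall(pattern, labelled=LABELLED):
    tp = fp = fn = 0
    for text, expected in labelled:
        fired = re.search(pattern, text) is not None
        if fired and expected:
            tp += 1          # true positive
        elif fired:
            fp += 1          # false positive (Type I)
        elif expected:
            fn += 1          # false negative (Type II)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for pattern in (r'[tT]he', r'[tT]he[^a-zA-Z]', r'\b[tT]he\b'):
    prec, rec = precision_recall(pattern)
    print(f"{pattern:20} precision={prec:.2f} recall={rec:.2f}")
# [tT]he               precision=0.50 recall=1.00
# [tT]he[^a-zA-Z]      precision=0.67 recall=0.67
# \b[tT]he\b           precision=1.00 recall=1.00
```

On this data, tightening the first pattern into the second raises precision but lowers recall.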


Python re Module

import re
 
re.findall(r'\b[A-Z]\w*', text)        # all capitalized words, including one-letter ones like "I"
re.search(r'\bcat\b', text)            # first match object (or None)
re.match(r'\d+', text)                 # match only at START of string
re.sub(r'\bcolou?r\b', 'color', text)  # normalize spelling
re.split(r'\s+', text)                 # split on any whitespace run
re.compile(r'\d{4}', re.IGNORECASE)    # pre-compile for reuse

Key flags:

Flag                  Effect
re.IGNORECASE (re.I)  Case-insensitive matching
re.MULTILINE (re.M)   ^ and $ match start/end of each line, not just the string
re.DOTALL (re.S)      . matches newline characters too
re.VERBOSE (re.X)     Allows whitespace and # comments inside the pattern
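
A brief sketch of the first three flags on made-up two-line input. Flags are combined with the | operator.

```python
import re

text = "The start\nthe end"

# Without MULTILINE, ^ matches only at the start of the string.
string_start = re.findall(r'^the', text, re.I)
line_starts = re.findall(r'^the', text, re.I | re.M)
print(string_start)   # ['The']
print(line_starts)    # ['The', 'the']

# Without DOTALL, . refuses to match a newline.
no_dotall = re.findall(r'a.b', 'a\nb')
dotall = re.findall(r'a.b', 'a\nb', re.S)
print(no_dotall)      # []
print(dotall)         # ['a\nb']
```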

re.VERBOSE is useful for complex patterns:

pattern = re.compile(r'''
    (?:[A-Z]\.)+       # abbreviations: U.S.A.
  | \w+(?:-\w+)*       # hyphenated words
  | \$?\d+(?:\.\d+)?   # prices and numbers
  | \.\.\.             # ellipsis
''', re.VERBOSE)
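
Applied with findall, a pattern like this acts as a simple tokenizer. The sketch repeats the pattern so that it runs on its own; the sample sentence is an assumption. Alternatives are tried left to right, so the abbreviation rule gets the first chance at "U.S.A.".

```python
import re

pattern = re.compile(r'''
    (?:[A-Z]\.)+       # abbreviations: U.S.A.
  | \w+(?:-\w+)*       # hyphenated words
  | \$?\d+(?:\.\d+)?   # prices and numbers
  | \.\.\.             # ellipsis
''', re.VERBOSE)

sentence = "The U.S.A. poster costs $45.99 ... state-of-the-art"
tokens = pattern.findall(sentence)
print(tokens)
# ['The', 'U.S.A.', 'poster', 'costs', '$45.99', '...', 'state-of-the-art']
```

All groups are non-capturing, which is what lets findall return whole tokens instead of group contents.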

Role of Regex in NLP

Regular expressions play a surprisingly large role in NLP:

  • Sophisticated sequences of regular expressions are often the first model tried for any text processing task — before any machine learning is involved. They are fast to write, fully interpretable, and require no training data.
  • For harder tasks, machine learning classifiers take over — when patterns become too complex or too numerous to enumerate by hand, learned models are more practical.
  • Even then, regex doesn’t disappear: it is used for pre-processing text before it reaches a classifier, and regex-derived features (e.g. “does this token match \d+?”) are fed directly into classifiers as input signals.
  • Regex is also well-suited for capturing generalizations — a single pattern like \$\d+(?:\.\d{2})? covers every price format without needing labelled examples.

The practical takeaway: when starting any new NLP task, write a regex first. It sets a baseline, reveals edge cases, and often turns out to be good enough.
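
As a sketch of the regex-derived features mentioned above, the function below turns a token into a dictionary of boolean signals that a classifier could consume. The feature names and the token_features helper are hypothetical.

```python
import re

# Hypothetical feature set: name -> pattern the whole token must match.
FEATURE_PATTERNS = {
    'is_number': r'\d+',
    'is_price': r'\$\d+(?:\.\d{2})?',
    'is_capitalized': r'[A-Z]\w*',
}

def token_features(token):
    """Map each feature name to True/False for this token."""
    return {name: re.fullmatch(pattern, token) is not None
            for name, pattern in FEATURE_PATTERNS.items()}

print(token_features('$45.99'))
# {'is_number': False, 'is_price': True, 'is_capitalized': False}
print(token_features('2026'))
# {'is_number': True, 'is_price': False, 'is_capitalized': False}
```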


  • eliza — uses regex substitution as its sole reasoning mechanism
  • tokenization — early tokenizers are regex cascades; NLTK uses re.VERBOSE patterns
  • text-normalization — normalization rules are implemented as substitutions

Active Recall