A formal language for specifying text search patterns, built from a small set of operators that compose into arbitrarily complex matchers.
Definition
A regular expression (regex) is a sequence of characters that defines a pattern over strings. A regex engine takes a pattern and an input string and returns the substrings (matches) that conform to the pattern. In NLP, regexes appear at every layer: corpus preprocessing, tokenization, feature extraction, and rule-based systems like ELIZA.
Syntax
Basic building blocks
| Construct | Meaning | Example | Matches | Test string |
|---|---|---|---|---|
| [abc] | Any one of a, b, c | [wW]oodchuck | woodchuck, Woodchuck | “The Woodchuck ate his dinner” |
| [a-z] | Any character in range | [A-Z] | any uppercase letter | “Drenched Blossoms” |
| [^abc] | Anything except a, b, c (negation, only when ^ is first in []) | [^A-Z] | any non-uppercase char | “Oyfn pripetchik” |
| . | Any single character (except newline by default) | beg.n | begin, began, begun | “We must begin now” |
| ^ | Start of line (outside []) | ^The | The only at line start | “The quick brown fox” |
| $ | End of line | end$ | end only at line end | “reach the end” |
^ HAS THREE DISTINCT MEANINGS
- First character inside [] → negation: [^abc] matches any character that is not a, b, or c
- Outside [] → start-of-line anchor: ^The matches The only at the beginning of a line
- Non-first character inside [] → literal caret: [e^] matches e or ^; [^e^] matches any character that is neither e nor ^
The same symbol, three unrelated jobs; which one applies is determined entirely by position.
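The three roles of the caret can be checked directly in Python. This is a minimal sketch; the sample strings are invented for illustration.

```python
import re

text = "The cat\nthe e^x rule"

# Outside []: ^ is an anchor. re.MULTILINE lets it match at every line start.
line_starts = re.findall(r'^[Tt]he', text, flags=re.MULTILINE)

# First character inside []: ^ negates the class.
non_lowercase = re.findall(r'[^a-z]+', "ab12cd")

# Later position inside []: ^ is a literal caret.
e_or_caret = re.findall(r'[e^]', "e^x")

print(line_starts)    # ['The', 'the']
print(non_lowercase)  # ['12']
print(e_or_caret)     # ['e', '^']
```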
Shorthand character classes
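These shorthands expand to character classes and appear constantly in real patterns. The equivalents below are the usual textbook definitions; in Python 3 str patterns, \d, \w, and \s additionally match Unicode digits, letters, and whitespace unless the re.ASCII flag is set:
| Shorthand | Equivalent | Meaning |
|---|---|---|
| \d | [0-9] | Any digit |
| \D | [^0-9] | Any non-digit |
| \w | [a-zA-Z0-9_] | Any word character (letter, digit, underscore) |
| \W | [^a-zA-Z0-9_] | Any non-word character |
| \s | [ \t\n\r\f] | Any whitespace |
| \S | [^ \t\n\r\f] | Any non-whitespace |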
Examples: \d+ matches one or more digits. \w+ matches a whole word token. \s+ matches any run of whitespace and is the basis of whitespace tokenization.
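A short sketch of the three examples above, run on an invented sample string:

```python
import re

# Note: this is a normal (non-raw) string on purpose, so \t is a real tab.
sample = "Order 66  shipped\ton 2024-05-01"

digits = re.findall(r'\d+', sample)  # runs of digits
words = re.findall(r'\w+', sample)   # runs of word characters
tokens = re.split(r'\s+', sample)    # whitespace tokenization

print(digits)  # ['66', '2024', '05', '01']
print(words)   # ['Order', '66', 'shipped', 'on', '2024', '05', '01']
print(tokens)  # ['Order', '66', 'shipped', 'on', '2024-05-01']
```

Note how \w+ splits the date at the hyphens while whitespace tokenization keeps it whole; which behaviour is preferable depends on the task.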
Word boundary \b
\b is a zero-width assertion: it matches a position rather than consuming any characters. The position must have a \w character on one side and, on the other side, either a \W character or the start or end of the string.
\bthe\b
This matches the as a standalone word, but not the inside there, other, or these.
Without \b:
the → matches "the", "there", "other", "hypothesis", …
With \b:
\bthe\b → matches "the" in "I went to the store"
no match in "there" or "other"
This is the standard fix for the false-positive problem when searching for whole words. You almost always want \b when matching specific words in NLP.
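The difference is easy to see by counting matches with and without the boundaries. A minimal sketch on an invented sentence:

```python
import re

sentence = "I went to the store; there was no other theme."

# No boundaries: also fires inside "there", "other", and "theme".
without_b = re.findall(r'the', sentence)

# With boundaries: only the standalone word.
with_b = [m.start() for m in re.finditer(r'\bthe\b', sentence)]

print(len(without_b))  # 4
print(with_b)          # [10]
```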
Quantifiers
Quantifiers attach to the immediately preceding element and control how many times it can appear.
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| * | Zero or more | oo*h! | oh!, ooh!, oooh!, ooooh! |
| + | One or more | o+h! | oh!, ooh!, oooh! (not h!) |
| ? | Zero or one (optional) | colou?r | color, colour |
| {n} | Exactly n | [0-9]{4} | 2026 in “Call by 2026” |
| {n,m} | Between n and m | [a-z]{2,5} | be, began, begin |
| {n,} | n or more | \d{3,} | 1800 in “Call 1800 today” |
Examples:
- colou?r: matches color and colour (the u is optional)
- [0-9]{4}: matches exactly four digits (e.g. a year)
- \d+\.\d{2}: matches a price like 45.99
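The quantifier behaviour can be verified with re.fullmatch, which succeeds only if the whole string matches. The helper name full is an illustrative choice, not part of the re module.

```python
import re

def full(pattern, s):
    """Return True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

candidates = ["h!", "oh!", "ooh!"]
star = [s for s in candidates if full(r'oo*h!', s)]  # one o, then zero or more o
plus = [s for s in candidates if full(r'o+h!', s)]   # one or more o
optional = [s for s in ["color", "colour", "colouur"] if full(r'colou?r', s)]
year = re.findall(r'[0-9]{4}', "Call by 2026")
price = re.findall(r'\d+\.\d{2}', "It costs 45.99 today")

print(star, plus)  # ['oh!', 'ooh!'] ['oh!', 'ooh!']
print(optional)    # ['color', 'colour']
print(year, price) # ['2026'] ['45.99']
```

Both oo*h! and o+h! require at least one o, so they accept the same strings.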
Greedy vs non-greedy
By default, quantifiers are greedy — they consume as many characters as possible while still allowing the overall match to succeed.
re.findall(r'<.+>', '<a>hello</a>') # → ['<a>hello</a>'] (greedy)
re.findall(r'<.+?>', '<a>hello</a>') # → ['<a>', '</a>'] (non-greedy)
Add ? after a quantifier (*?, +?, ??) to make it non-greedy: it will match as few characters as possible. Non-greedy is usually what you want when parsing structured text like HTML tags.
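A common alternative to a non-greedy quantifier, not covered above, is a negated character class that simply cannot cross the closing delimiter. The sketch below compares the three variants on an invented HTML fragment.

```python
import re

html = '<a href="x">hello</a> <b>bye</b>'

greedy = re.findall(r'<.+>', html)      # one match spanning the whole string
lazy = re.findall(r'<.+?>', html)       # each tag separately
negated = re.findall(r'<[^>]+>', html)  # same tags, without a lazy quantifier

print(greedy)   # ['<a href="x">hello</a> <b>bye</b>']
print(lazy)     # ['<a href="x">', '</a>', '<b>', '</b>']
print(negated)  # ['<a href="x">', '</a>', '<b>', '</b>']
```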
Grouping and alternation
How is a group different from just writing a pattern?
Without groups, quantifiers (*, +, ?, {n}) attach to the single token immediately before them — never more. So cat+ does not mean “one or more repetitions of cat”; it means “ca followed by one or more ts”. The + only sees the t.
This is the same problem as operator precedence in arithmetic. In 2 + 3 × 4, the × only applies to 3 and 4. Parentheses let you override that: (2 + 3) × 4. Regex parentheses work identically — they bundle multiple tokens into a single unit so that a quantifier or alternation applies to all of them together:
cat+ → cat, catt, cattt, … (+ applies only to t)
(cat)+ → cat, catcat, catcatcat, … (+ applies to the whole word)
the cat|dog → "the cat" or "dog" (| splits at low precedence)
the (cat|dog) → "the cat" or "the dog" (| applies only inside the group)
Grouping is the primary job. The capturing side-effect is a bonus.
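The four patterns above can be checked against whole strings. This is a small sketch; full is a hypothetical helper wrapping re.fullmatch.

```python
import re

def full(pattern, s):
    """Return True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

words = ["ca", "cat", "catt", "catcat"]
t_repeats = [w for w in words if full(r'cat+', w)]       # + sees only the t
word_repeats = [w for w in words if full(r'(cat)+', w)]  # + sees the group

phrases = ["the cat", "the dog", "dog"]
loose = [p for p in phrases if full(r'the cat|dog', p)]      # "the cat" OR "dog"
grouped = [p for p in phrases if full(r'the (cat|dog)', p)]  # "the" + either noun

print(t_repeats)     # ['cat', 'catt']
print(word_repeats)  # ['cat', 'catcat']
print(loose)         # ['the cat', 'dog']
print(grouped)       # ['the cat', 'the dog']
```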
Capturing: a capturing group also stores whatever text matched inside it into a numbered slot — \1 for the first group, \2 for the second, left to right. You can then reference that slot in two places:
- In a replacement string, to reuse or rearrange the captured text: re.sub(r'(\w+) (\w+)', r'\2 \1', 'Smith John') → "John Smith"
- Inside the same pattern, to enforce that a later part must equal what was already captured: ([a-z]+)\s+\1 matches "the the" (the second word must repeat the first) but not "the cat"
If you only need grouping for scope — and have no use for the captured text — use (?:...), a non-capturing group, to avoid allocating an unnecessary slot.
| Construct | Meaning |
|---|---|
| (abc) | Capturing group: groups and saves match as \1, \2, … |
| (?:abc) | Non-capturing group: groups without saving |
| a\|b | Alternation: match a or b |
Use (?:...) when you need grouping for a quantifier but don’t want to create a capture group:
(?:cat|dog)s? → cat, cats, dog, dogs (no capture created)
(cat|dog)s? → same matches, but \1 holds 'cat' or 'dog'
Multiple capture groups are indexed left to right:
(\w+)\s(\w+) → \1 = first word, \2 = second word
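In Python, the numbered slots are read from the match object with group(n). A minimal sketch using the examples above:

```python
import re

m = re.search(r'(\w+)\s(\w+)', "Smith John")
whole, first, second = m.group(0), m.group(1), m.group(2)

# Non-capturing group: same matches, but no slot is created.
non_capturing = re.search(r'(?:cat|dog)s?', "two dogs")
capturing = re.search(r'(cat|dog)s?', "two dogs")

print(whole, first, second)      # Smith John Smith John
print(non_capturing.group(0))    # dogs
print(non_capturing.groups())    # ()
print(capturing.group(1))        # dog
```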
Backreferences in patterns
A backreference (\1, \2, …) can appear inside the pattern itself — not just in the replacement. This matches text that is identical to what was captured earlier in the same match.
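re.findall(r'([a-zA-Z]+)\s+\1', text)
This matches any word that appears twice consecutively, separated by whitespace: "the the", "Humbert Humbert", but not "the cat". The capture group matches the first word; \1 requires the second word to be identical. Because the pattern has no word boundaries, it can also fire on partial words (the "the the" inside "the theater"); write \b([a-zA-Z]+)\s+\1\b to require whole words.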
Practical uses:
- Detecting duplicate words in text
- Finding repeated tokens in corpora
- Validating that two parts of a pattern match the same value
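A duplicate-word detector along these lines might look as follows. The sketch compares the bare backreference pattern with a variant wrapped in \b; the sample text is invented. Note that re.findall returns the captured word, not the full repeated span.

```python
import re

text = "Paris in the the spring; see the theater"

# Without boundaries, "the" + "the" inside "theater" also counts as a repeat.
naive = re.findall(r'([a-zA-Z]+)\s+\1', text)

# With boundaries, only genuine whole-word repetitions are reported.
strict = re.findall(r'\b([a-zA-Z]+)\s+\1\b', text)

print(naive)   # ['the', 'the']
print(strict)  # ['the']
```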
COMMON MISCONCEPTION
Backreferences in patterns (\1 inside the regex) are different from backreferences in substitution strings (\1 in the replacement). In a pattern, \1 is a constraint: it says “this position must match whatever group 1 already captured.” In a replacement, \1 is an interpolation: it inserts the captured text. The syntax looks the same; the meaning is different.
Operator precedence (high → low)
- (): parentheses
- * + ? {}: quantifiers
- Sequences and anchors (concatenation)
- |: alternation
So a|bc* parses as a | (b(c*)): either a, or b followed by any number of c. To mean “either a or b, followed by any number of c” you would write (a|b)c*.
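The two readings can be told apart by testing strings that begin with a and continue with c. A minimal sketch; full is a hypothetical helper wrapping re.fullmatch.

```python
import re

def full(pattern, s):
    """Return True if the entire string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

strings = ["a", "b", "bcc", "ac", "acc"]

ungrouped = [s for s in strings if full(r'a|bc*', s)]  # a, or b then c*
grouped = [s for s in strings if full(r'(a|b)c*', s)]  # (a or b) then c*

print(ungrouped)  # ['a', 'b', 'bcc']
print(grouped)    # ['a', 'b', 'bcc', 'ac', 'acc']
```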
Escaping special characters
The characters . * + ? ^ $ { } [ ] ( ) | \ all have special meaning in regex. To match them literally, prefix with \:
| You want to match | Write |
|---|---|
| A literal period | \. |
| A literal asterisk | \* |
| A literal parenthesis | \( or \) |
| A literal backslash | \\ |
Example: \$\d+\.\d{2} matches a price like $45.99 — the \$ and \. match literal symbols, \d+ and \d{2} match digit runs.
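When the literal text comes from data rather than being typed by hand, Python's re.escape can add the backslashes automatically. The sketch below shows both manual and automatic escaping; the sample strings are invented.

```python
import re

# Manual escaping: \$ and \. are literal, \d+ and \d{2} are digit runs.
prices = re.findall(r'\$\d+\.\d{2}', "Lunch was $45.99, not 4599")

# Automatic escaping of an arbitrary literal string.
literal = "1+1=2 (really?)"
escaped = re.escape(literal)
found = re.search(escaped, "They wrote 1+1=2 (really?) on the board")

print(prices)          # ['$45.99']
print(found.group(0))  # 1+1=2 (really?)
```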
Python raw strings
Python processes escape sequences (\n, \t, etc.) in strings before the regex engine ever sees the pattern. This causes double-escaping problems:
# BAD: Python converts \b to a backspace character; regex never sees \b
re.findall('\bcat\b', text)
# GOOD: raw string r'...' passes the characters literally to the regex engine
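re.findall(r'\bcat\b', text)
Always use r'...' for regex patterns in Python. Without it, \b silently becomes a backspace character. Unrecognized escapes such as \d, \w, and \s reach the regex engine unchanged only because Python does not treat them as escapes, and recent Python versions flag them with an invalid-escape warning.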
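The difference is visible in the strings themselves, before any regex is involved. A minimal sketch:

```python
import re

plain = '\bcat\b'  # Python turns each \b into one backspace character
raw = r'\bcat\b'   # backslash and b are kept as two separate characters

text = "the cat sat"
plain_matches = re.findall(plain, text)  # searches for backspace + cat + backspace
raw_matches = re.findall(raw, text)      # searches for the whole word cat

print(len(plain), len(raw))  # 5 7
print(plain_matches)         # []
print(raw_matches)           # ['cat']
```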
Lookahead Assertions
Lookaheads match a position without consuming characters.
| Construct | Meaning |
|---|---|
| (?=...) | Positive lookahead: position is followed by the pattern |
| (?!...) | Negative lookahead: position is not followed by the pattern |
re.findall(r'Windows(?! NT)', text) # "Windows" only when NOT followed by " NT"
re.findall(r'\d+(?= dollars)', text) # digits only when followed by " dollars"
Lookaheads are useful when you want to match something based on what comes after it, without including that context in the match itself.
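To see that the lookahead context is checked but not consumed, the sketch below runs both patterns on an invented sentence and inspects what follows each match.

```python
import re

text = "Windows NT, Windows 95 and Windows XP cost 300 dollars or 250 euros"

# What comes right after each "Windows" that passed the negative lookahead.
after_windows = [text[m.end():m.end() + 3]
                 for m in re.finditer(r'Windows(?! NT)', text)]

# Only the digits are returned; " dollars" is required but not part of the match.
dollar_amounts = re.findall(r'\d+(?= dollars)', text)

print(after_windows)   # [' 95', ' XP']
print(dollar_amounts)  # ['300']
```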
Substitutions and Capture Groups
The substitution s/pattern/replacement/ replaces matches. Capture groups let you reuse matched text in the replacement:
import re
# Wrap every number in angle brackets: "34 items" → "<34> items"
re.sub(r'([0-9]+)', r'<\1>', '34 items in 2 boxes')
# → '<34> items in <2> boxes'
# Swap first and last name: "Smith John" → "John Smith"
re.sub(r'(\w+) (\w+)', r'\2 \1', 'Smith John')
# → 'John Smith'
ELIZA’s core mechanism is exactly this:
re.sub(r".* I'M (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1", text)
Multiple capture groups are referenced as \1, \2, … in the replacement string.
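A toy responder in the spirit of ELIZA can be sketched as an ordered list of (pattern, reply) rules. The rule list, the respond function, and the fallback reply are illustrative assumptions, not ELIZA's actual script.

```python
import re

# Hypothetical rule list: the first matching rule wins.
RULES = [
    (r".* I'M (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* ALL .*", "IN WHAT WAY"),
    (r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    # Uppercase and pad with spaces so the " WORD " patterns also match at the edges.
    padded = f" {utterance.upper()} "
    for pattern, reply in RULES:
        m = re.fullmatch(pattern, padded)
        if m:
            return m.expand(reply)  # fills in \1 from the capture group
    return "PLEASE GO ON"

print(respond("Well, I'm sad today"))    # I AM SORRY TO HEAR YOU ARE SAD
print(respond("They are all the same"))  # IN WHAT WAY
print(respond("Hello"))                  # PLEASE GO ON
```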
Building a Pattern Iteratively
Regex engineering is not “write once, it works”. It is a loop:
- Write a first-pass pattern — simple, covering the obvious cases.
- Test against real text. Find:
- False positives: strings that match but shouldn’t.
- False negatives: strings that should match but don’t.
- Tighten (add constraints) to reduce false positives.
- Broaden (add alternatives) to reduce false negatives.
- Repeat.
Worked example: find the word the in text.
Attempt 1: [tT]he
False positives: matches the inside there, other, theology.
Attempt 2: [tT]he[^a-zA-Z]
False negatives: misses the at end of line (no character follows).
Attempt 3: \b[tT]he\b
This correctly matches the and The as standalone words, and nothing else. Word boundaries handle both the “followed by non-letter” and “end of string” edge cases simultaneously.
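The three attempts can be compared by counting their matches on a test sentence that contains exactly three standalone occurrences. The sentence is invented for illustration.

```python
import re

# Standalone occurrences: "The", "the" (before "theology"), and the final "the".
text = "The other day, there it was: the theology book. I saw the"

attempts = [r'[tT]he', r'[tT]he[^a-zA-Z]', r'\b[tT]he\b']
counts = {pattern: len(re.findall(pattern, text)) for pattern in attempts}

for pattern, n in counts.items():
    print(f"{pattern!r}: {n} matches")
# Attempt 1 over-matches (6), attempt 2 under-matches (2), attempt 3 finds all 3.
```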
Precision and Recall
Any regex pattern makes errors in two directions:
- False positives (Type I): the pattern fires when it shouldn’t → lower precision.
- False negatives (Type II): the pattern misses what it should catch → lower recall.
Tightening a pattern raises precision but risks lowering recall. Broadening raises recall but risks lowering precision. There is no free lunch — practical regex engineering is managing this tradeoff deliberately.
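Given a handful of labelled strings, precision and recall of a pattern can be computed directly. The sketch below is one way to do it; the labelled examples and the precision_recall function are hypothetical and only meant to make the tradeoff concrete.

```python
import re

# (text, should_match) pairs: the target is the standalone word "the"/"The".
labeled = [("the", True), ("The", True), ("there", False),
           ("other", False), ("bathe", False), ("then", False)]

def precision_recall(pattern, examples):
    tp = fp = fn = 0
    for text, is_target in examples:
        predicted = re.search(pattern, text) is not None
        if predicted and is_target:
            tp += 1
        elif predicted and not is_target:
            fp += 1
        elif not predicted and is_target:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(r'the', labeled))          # (0.2, 0.5)
print(precision_recall(r'\b[tT]he\b', labeled))   # (1.0, 1.0)
```

Broadening the to [tT]he lifts recall to 1.0 on this set but leaves the four false positives in place; adding \b is what removes them.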
Python re Module
import re
re.findall(r'\b[A-Z]\w+', text) # all capitalized words (as whole words)
re.search(r'\bcat\b', text) # first match object (or None)
re.match(r'\d+', text) # match only at START of string
re.sub(r'\bcolou?r\b', 'color', text) # normalize spelling
re.split(r'\s+', text) # split on any whitespace run
re.compile(r'\d{4}', re.IGNORECASE) # pre-compile for reuse
Key flags:
| Flag | Effect |
|---|---|
| re.IGNORECASE (re.I) | Case-insensitive matching |
| re.MULTILINE (re.M) | ^ and $ match start/end of each line, not just the string |
| re.DOTALL (re.S) | . matches newline characters too |
| re.VERBOSE (re.X) | Allows whitespace and # comments inside the pattern |
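The effect of each flag is easiest to see by running the same pattern with and without it. A minimal sketch on an invented three-line string:

```python
import re

doc = "Title: Regex\nthe end\nTHE END"

no_flags = re.findall(r'^the', doc)                       # ^ only at string start
multiline = re.findall(r'^the', doc, flags=re.MULTILINE)  # ^ at every line start
both = re.findall(r'^the', doc, flags=re.MULTILINE | re.IGNORECASE)

dot_default = re.findall(r'Regex.the', doc)                 # . stops at the newline
dot_all = re.findall(r'Regex.the', doc, flags=re.DOTALL)    # . crosses the newline

print(no_flags, multiline, both)  # [] ['the'] ['the', 'THE']
print(dot_default, dot_all)       # [] ['Regex\nthe']
```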
re.VERBOSE is useful for complex patterns:
pattern = re.compile(r'''
(?:[A-Z]\.)+ # abbreviations: U.S.A.
| \w+(?:-\w+)* # hyphenated words
| \$?\d+(?:\.\d+)? # prices and numbers
| \.\.\. # ellipsis
''', re.VERBOSE)
Role of Regex in NLP
Regular expressions play a surprisingly large role in NLP:
- Sophisticated sequences of regular expressions are often the first model tried for any text processing task — before any machine learning is involved. They are fast to write, fully interpretable, and require no training data.
- For harder tasks, machine learning classifiers take over — when patterns become too complex or too numerous to enumerate by hand, learned models are more practical.
- Even then, regex doesn’t disappear: it is used for pre-processing text before it reaches a classifier, and regex-derived features (e.g. “does this token match \d+?”) are fed directly into classifiers as input signals.
- Regex is also well-suited for capturing generalizations: a single pattern like \$\d+(?:\.\d{2})? covers every dollar amount written in the form $45 or $45.99 without needing labelled examples.
The practical takeaway: when starting any new NLP task, write a regex first. It sets a baseline, reveals edge cases, and often turns out to be good enough.
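Regex-derived features of the kind mentioned above could be produced by a small function that tests each token against a set of named patterns. The feature names and patterns here are illustrative assumptions, not a standard feature set.

```python
import re

# Hypothetical feature set: name -> compiled pattern, each tested on the whole token.
FEATURES = {
    "is_number": re.compile(r'\d+'),
    "is_price": re.compile(r'\$\d+(?:\.\d{2})?'),
    "is_capitalized": re.compile(r'[A-Z]\w*'),
    "has_hyphen": re.compile(r'\w+(?:-\w+)+'),
}

def token_features(token):
    """Return a dict of boolean features suitable as classifier input."""
    return {name: pattern.fullmatch(token) is not None
            for name, pattern in FEATURES.items()}

print(token_features("$45.99"))
# {'is_number': False, 'is_price': True, 'is_capitalized': False, 'has_hyphen': False}
print(token_features("state-of-the-art")["has_hyphen"])  # True
```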
Related
- eliza: uses regex substitution as its sole reasoning mechanism
- tokenization: early tokenizers are regex cascades; NLTK uses re.VERBOSE patterns
- text-normalization: normalization rules are implemented as substitutions
Active Recall
What is the difference between [^abc] and ^abc in a regular expression?
Inside a character class [...], a caret ^ in first position is a negation operator: [^abc] matches any character that is not a, b, or c. Outside a character class, ^ is a line-start anchor: ^abc matches abc only at the beginning of a line. Same symbol, two unrelated meanings; context (inside vs outside []) determines which.
Why does \bcat\b behave differently from cat, and when does \b matter in NLP?
\b is a zero-width assertion matching the position between a word character (\w) and a non-word character (\W). cat matches cat anywhere: inside concatenate, bobcat, or category. \bcat\b matches cat only as a standalone word. In NLP, searching for content words without word boundaries produces systematic false positives on substrings.
What goes wrong if you write re.findall('\bcat\b', text) in Python, and how do you fix it?
Python processes escape sequences before the regex engine sees the string. '\b' in a Python string literal is the backspace character (ASCII 8), not the regex word boundary. The regex engine never receives \b. Fix: use a raw string r'\bcat\b'. Raw strings pass the literal characters \ and b to the regex engine, which then interprets \b as a word boundary.
What is the difference between a greedy and a non-greedy quantifier? Give an example where the choice matters.
Greedy (*, +): matches as many characters as possible. Non-greedy (*?, +?): matches as few as possible. On <a>hello</a>, the pattern <.+> greedily matches the whole string <a>hello</a> as one match; <.+?> non-greedily matches <a> and </a> as two separate matches. Non-greedy is necessary whenever you want to match the shortest possible span between delimiters.
What is the operator precedence in regular expressions, and why does it matter for reading a|bc*?
Precedence (high to low): parentheses, then quantifiers (* + ? {}), then sequences/anchors, then alternation |. So a|bc* parses as a | (b(c*)): either a, or b followed by any number of c. To mean “either a or b, followed by any number of c” you need (a|b)c*. Misreading precedence is a common source of incorrect patterns.
How does the substitution operator use capture groups, and what is a concrete NLP use case?
Capture groups (...) save the matched substring as \1, \2, etc. The replacement string can reference them. Example: re.sub(r'([0-9]+)', r'<\1>', text) wraps every number in angle brackets. In ELIZA, s/.* I'M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/ echoes back the captured feeling word to produce therapist-like output.
Which of the following regular expressions will correctly match email addresses in the format username@domain.extension?
- A) ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
- B) \[a-zA-Z]+@[a-zA-Z]+\.[a-zA-Z]{2,4}$
- C) [a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}$
- D) ^[a-zA-Z0-9._%+-]@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
A is correct. It anchors with ^...$, matches one or more username characters from the set [a-zA-Z0-9._%+-]+, then @, then one or more domain characters, then \. (literal dot), then 2+ letters for the extension. B uses \[ which is a literal bracket, not a character class, and restricts extensions to 2–4 chars (excluding newer TLDs). C uses # instead of @. D lacks the + after the username class, so it matches only a single character before @.
Write a single regex that matches the set of all strings where each a is immediately preceded and followed by a b, over the alphabet {a, b}. What is the key insight?
/(b+(ab+)*)?/, matched against the whole string. The outer ? allows the empty string. Otherwise the string starts with one or more bs, and every a is followed by one or more bs. The key insight: either there are no as at all (just bs), or every a sits inside a b...b sandwich, which means the string begins with b and each a is immediately followed by b. A single b may serve as the right neighbour of one a and the left neighbour of the next, which is why babab is accepted. Verifying edge cases: b and bb match (no as at all); bab matches; ba fails (the trailing a is not followed by b); a fails; baab fails (the first a is followed by another a).
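Answers to exercises like this are easy to get subtly wrong, so it helps to check a candidate pattern by brute force against a direct statement of the requirement. The sketch below tests the candidate (b+(ab+)*)? on every string over {a, b} up to a chosen length; satisfies_spec and agrees are hypothetical helper names.

```python
import re
from itertools import product

def satisfies_spec(s):
    """True if every 'a' has a 'b' immediately before and immediately after it."""
    return all(0 < i < len(s) - 1 and s[i - 1] == 'b' and s[i + 1] == 'b'
               for i, ch in enumerate(s) if ch == 'a')

candidate = re.compile(r'(b+(ab+)*)?')

def agrees(max_len=8):
    """Compare the regex with the spec on all strings of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if (candidate.fullmatch(s) is not None) != satisfies_spec(s):
                return False
    return True

print(agrees())  # True
```

Exhaustive checking up to a small length is not a proof, but it catches most mistakes, such as a pattern that rejects b or bab.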
Write a regex that matches any string containing both the whole word grotto and the whole word raven, in either order.
/\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b/: two alternatives connected by |, one for each ordering. Both use \b to enforce whole-word matching (so grottos does not match). The .* between them allows any content between the two words. This pattern illustrates a recurring idiom: to require two independent patterns both appear in a string, write both orderings explicitly with .* between them.
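The either-order pattern can be tried on a few invented sentences. As an aside, two positive lookaheads anchored at the start give an equivalent check on single-line input without spelling out both orderings; that variant is an alternative formulation, not part of the answer above.

```python
import re

either_order = re.compile(r'\bgrotto\b.*\braven\b|\braven\b.*\bgrotto\b')
with_lookaheads = re.compile(r'^(?=.*\bgrotto\b)(?=.*\braven\b)')

samples = {
    "the raven flew into the grotto": True,
    "a grotto, then a raven": True,
    "grottos and a raven": False,  # "grottos" is not the whole word "grotto"
    "just a raven": False,
}

results = {s: either_order.search(s) is not None for s in samples}
lookahead_results = {s: with_lookaheads.search(s) is not None for s in samples}

print(results == samples)            # True
print(lookahead_results == samples)  # True
```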
Write a regex that captures the first word of an English sentence into group 1, handling any leading punctuation.
^[^a-zA-Z]*([a-zA-Z]+): the ^ anchors to line start. [^a-zA-Z]* skips any non-letter characters at the beginning (e.g., leading quotes, dashes, or whitespace). ([a-zA-Z]+) captures the first run of letters as group 1. The key insight is using a negated character class to skip punctuation rather than trying to enumerate all possible punctuation characters.