TARGET DECK NLP::Week-01
Week 1 Flashcards
Write your own flashcards here using the [!question]- callout format. Each callout becomes one Anki card (front = question, back = answer).
How to write a good flashcard
- One concept per card — don’t bundle multiple ideas
- Prefer “write a regex that…” or “what happens when…” over “define X”
- Include a concrete example in the answer when possible
- Keep answers short enough to review in under 30 seconds
Regex
What is a regex?
A regular expression (regex) is a sequence of characters that defines a search pattern over strings. A regex engine takes the pattern and an input string, and returns all substrings that match. In NLP, regex is used for tokenization, text normalization, feature extraction, and rule-based systems like ELIZA.
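A minimal sketch of that pattern→matches workflow in Python (the sample text and pattern are made up for illustration):

```python
import re

# findall returns every non-overlapping substring that matches the pattern
pattern = r"[A-Za-z]+"          # "one or more letters": a crude word matcher
print(re.findall(pattern, "ELIZA was built in 1966."))
# ['ELIZA', 'was', 'built', 'in']
```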
What are the three meanings of `^` in regex, and what determines which one applies?
- First character inside `[]` → negation: `[^abc]` matches anything except a, b, c
- Outside `[]` → start-of-line anchor: `^The` matches `The` only at line start
- Non-first character inside `[]` → literal caret: `[e^]` matches `e` or `^`

Position alone determines the meaning — the same symbol, three unrelated jobs.
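A quick sketch of all three meanings in Python (sample strings invented):

```python
import re

text = "The cap ^ symbol"
print(re.findall(r"[^abc]", "abcde"))   # ['d', 'e']  (negation inside [])
print(re.findall(r"^The", text))        # ['The']     (start-of-line anchor)
print(re.findall(r"[e^]", text))        # ['e', '^']  (literal caret inside [])
```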
What does `\b` match, and how is it different from constructs like `\d` or `.`?

`\b` is a zero-width assertion — it matches a position between a word character (`\w`) and a non-word character (`\W`), not an actual character. Unlike `\d` or `.`, it consumes nothing. `\bcat\b` matches `cat` as a standalone word but not inside `concatenate`. Without `\b`, `cat` matches anywhere — including inside other words.
What goes wrong with `re.findall('\bcat\b', text)` in Python, and why is there no error?

Python interprets `'\b'` as the backspace character (ASCII 8) before the regex engine ever sees it. The regex engine receives a backspace instead of a word boundary — so it silently matches the wrong thing with no error. Fix: use a raw string `r'\bcat\b'`, which passes the literal `\b` to the regex engine.
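A minimal reproduction of the pitfall (sample text made up):

```python
import re

text = "the cat sat; concatenate"
# '\b' in a plain string literal is a backspace (ASCII 8), so the regex engine
# never sees a word-boundary assertion; the pattern silently matches nothing here
print(re.findall('\bcat\b', text))   # []
# A raw string passes the literal \b through as a word boundary
print(re.findall(r'\bcat\b', text))  # ['cat']
```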
What does `<.+>` match on the string `<a>hello</a>`, and why?

It matches `<a>hello</a>` — the entire string, not just `<a>`. By default, `+` is greedy: it consumes as many characters as possible. The `.+` grabs everything, then backtracks just enough to find the last `>`. Use `<.+?>` (non-greedy) to get `<a>` and `</a>` as separate matches.
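The same contrast in code:

```python
import re

s = "<a>hello</a>"
print(re.findall(r"<.+>", s))    # ['<a>hello</a>']  (greedy + grabs everything)
print(re.findall(r"<.+?>", s))   # ['<a>', '</a>']   (non-greedy stops at the first >)
```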
`\1` appears in both patterns and replacements. What does it mean in each, and how are they different?

- In a pattern: `\1` is a constraint — “this position must match exactly what group 1 already captured.” E.g., `([a-z]+)\s+\1` matches `the the` but not `the cat`.
- In a replacement: `\1` is an insertion — “paste whatever group 1 captured here.” E.g., `re.sub(r'(\w+)', r'<\1>', text)` wraps each word in angle brackets.

Same syntax, different semantics.
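Both uses side by side (sample strings invented):

```python
import re

# Pattern use: \1 constrains the match to repeat group 1, so this finds doubled words
print(re.findall(r"([a-z]+)\s+\1", "the the cat"))   # ['the']
# Replacement use: \1 pastes group 1's capture into the output
print(re.sub(r"(\w+)", r"<\1>", "the cat"))          # '<the> <cat>'
```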
What does `cat+` match, and why doesn't it mean "one or more `cat`"?

`cat+` means "`ca` followed by one or more `t`" — it matches `cat`, `catt`, `cattt`, etc. Quantifiers attach to the single token immediately before them, not the whole preceding word. To match one or more repetitions of the full word, you need a group: `(cat)+` → `cat`, `catcat`, `catcatcat`.
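A quick check of the scoping rule:

```python
import re

print(bool(re.fullmatch(r"cat+", "cattt")))     # True:  'ca' plus one or more 't'
print(bool(re.fullmatch(r"cat+", "catcat")))    # False: + applies only to the final 't'
print(bool(re.fullmatch(r"(cat)+", "catcat")))  # True:  the group repeats the whole word
```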
How does `a|bc*` parse, and why is it not `(a|b)` followed by `c*`?

It parses as `a | (b(c*))` — either `a`, or `b` followed by zero or more `c`. The `|` operator has the lowest precedence in regex (below quantifiers and concatenation), so it splits the entire expression. To get "either `a` or `b`, followed by `c*`", you must write `(a|b)c*`.
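The precedence difference, verified in code:

```python
import re

print(bool(re.fullmatch(r"a|bc*", "ac")))    # False: pattern means 'a' OR 'b then c*'
print(bool(re.fullmatch(r"a|bc*", "bcc")))   # True
print(bool(re.fullmatch(r"(a|b)c*", "ac")))  # True:  grouping changes the scope of |
```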
What are the two directions a regex pattern can make mistakes in, and how do you address each?
- False positives (Type I): the pattern matches strings it shouldn’t → lowers precision. Fix: tighten the pattern by adding constraints (e.g., add `\b` to stop matching substrings).
- False negatives (Type II): the pattern misses strings it should catch → lowers recall. Fix: broaden the pattern by adding alternatives (e.g., `colou?r` to handle both spellings).

There is no free lunch — tightening raises precision but risks lowering recall, and vice versa. Practical regex engineering is iterating between the two deliberately.
What is the role of regex in NLP?
- Regex is often the first model tried for any text processing task — fast to write, interpretable, no training data needed.
- For harder tasks, ML classifiers take over, but regex is still used for pre-processing text before it reaches a classifier, and as features fed into classifiers (e.g., “does this token match `\d+`?”).
- Regex is well-suited for capturing generalizations — a single pattern like `\$\d+(?:\.\d{2})?` covers every price format without labelled examples.
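The price pattern from the card, run over made-up text:

```python
import re

print(re.findall(r"\$\d+(?:\.\d{2})?", "Tickets: $5, $12.50, and $100.00 each"))
# ['$5', '$12.50', '$100.00']
```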
What is a lookahead assertion, and give an example of each type?
A lookahead matches a position without consuming any characters — it checks what comes next but doesn’t include it in the match.
- Positive lookahead `(?=...)`: matches if the pattern is ahead. `\d+(?= dollars)` matches `100` in “100 dollars” but not in “100 cats”.
- Negative lookahead `(?!...)`: matches if the pattern is not ahead. `Windows(?! NT)` matches “Windows” in “Windows 10” but not in “Windows NT”.
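Both lookaheads on the card's own examples:

```python
import re

print(re.findall(r"\d+(?= dollars)", "100 dollars, 100 cats"))    # ['100']
print(re.findall(r"Windows(?! NT)", "Windows 10 vs Windows NT"))  # ['Windows'] (only the first)
```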
ELIZA
What is ELIZA?
ELIZA (Weizenbaum, 1966) is the first chatbot — a program that simulates a Rogerian psychotherapist. It operates entirely through regex pattern matching: each user utterance is scanned against an ordered list of substitution rules, and the first match fires a template response that echoes and reframes the input. It has no world model, no memory, and no understanding — only a list of rules.
How does ELIZA work? Trace through an example input.
ELIZA uses regex substitution rules in the form `s/pattern/replacement/`. Rules are applied sequentially — each substitution feeds the next.

For input `"He says I'm depressed much of the time"`:
- Pronoun inversion rules fire first: `I'm` → `YOU ARE`
- The string becomes `"He says YOU ARE depressed much of the time"`
- Rule `s/.*YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/` matches
  - `.*` absorbs the surrounding text, `(depressed|sad)` captures `"depressed"` as `\1`
- Output: `"I AM SORRY TO HEAR YOU ARE depressed"`

The `.*` at the start and end is essential — without it, the rule would only fire on the exact string `YOU ARE depressed` with nothing else around it.
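A minimal sketch of this rule cascade in Python; the two rules are simplified stand-ins for illustration, not Weizenbaum's original script:

```python
import re

rules = [
    (r"\bI'm\b", "YOU ARE"),  # pronoun inversion
    (r".*YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
]

def respond(utterance):
    # Apply the rules in order; each substitution feeds the next
    for pattern, replacement in rules:
        utterance = re.sub(pattern, replacement, utterance)
    return utterance

print(respond("He says I'm depressed much of the time"))
# I AM SORRY TO HEAR YOU ARE depressed
```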
What is the significance of ELIZA, and what are its limitations?
Significance: ELIZA produced the ELIZA effect — users attributed understanding and empathy to the program despite knowing it was rule-based. This raised early questions about the relationship between linguistic behaviour and actual cognition.
Limitations:
- No memory — each utterance is processed in isolation; it cannot refer back to earlier exchanges
- No semantic understanding — it only matches surface patterns, not meaning
- Domain-dependent — Weizenbaum chose Rogerian therapy deliberately because the therapist mostly reflects words back, hiding the system’s shallowness. Any other domain would break it
Why did Weizenbaum choose Rogerian psychotherapy as ELIZA's domain?
- Rogerian therapists mostly reflect the patient’s words back and ask open-ended questions
- They rarely assert facts about the world
- This maps directly onto regex substitution rules that echo and reframe input
- Most other domains (doctor, lawyer, tutor) require factually correct, contextually specific statements — which pattern matching cannot generate
- ELIZA doesn’t overcome the limits of regex — it picks a domain where those limits don’t show
Types and Tokens
What is the difference between a type and a token? Give an example.
- Token: an instance of a word in running text
- Type: an element of the vocabulary — each unique word form
- Example 1 (slides): “they lay back on the San Francisco grass and looked at the stars and their” → 15 tokens, 13 types — `the` and `and` each appear twice but count as one type each
- Example 2: “to be or not to be” → 6 tokens, 4 types (`to`, `be`, `or`, `not`) — `to` and `be` are repeated but count once each
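A rough count in code (whitespace splitting stands in for real tokenization here):

```python
text = "to be or not to be"
tokens = text.split()            # instances in running text
types = set(tokens)              # unique word forms
print(len(tokens), len(types))   # 6 4
```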
What is Heaps' Law and what does each variable represent?
Heaps' Law: $V = kN^\beta$
- $N$ = number of tokens (corpus size)
- $V$ = vocabulary size (number of types)
- $k$ = a positive constant
- $\beta$ = a sub-linear exponent, with $0 < \beta < 1$
It describes how vocabulary size grows as a corpus gets larger.
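A toy illustration of the sub-linear growth, with invented values for $k$ and $\beta$:

```python
# Hypothetical constants, purely for illustration
k, beta = 50, 0.5
for N in (10_000, 100_000, 1_000_000):
    V = k * N ** beta
    print(f"N = {N:>9,}  ->  V = {V:,.0f}")
# 100x more tokens yields only ~10x more types when beta = 0.5
```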
What does $\beta < 1$ in Heaps' Law tell you intuitively?
- Vocabulary grows sub-linearly — much slower than corpus size
- Doubling the corpus does not double the vocabulary
- New types keep appearing but at a decreasing rate
- Even an enormous corpus will not have seen every word that appears at test time
What is the open-vocabulary problem, and why is it unavoidable?
- The open-vocabulary problem: no matter how large your training corpus, you will encounter unseen words at test time
- Heaps’ Law proves it’s unavoidable — $\beta > 0$ means new types keep appearing indefinitely, just at a slower rate
- Rare words, names, neologisms, typos, and domain-specific terms will always slip through
Why does Heaps' Law motivate subword tokenization rather than just enlarging the vocabulary?
- A bigger vocabulary only captures types seen in training — truly novel forms (names, neologisms, typos) remain out-of-vocabulary regardless of size
- Subword tokenization decomposes any word into units that are almost certainly in the vocabulary (at worst, individual characters)
- This eliminates the open-vocabulary problem by construction rather than mitigating it by scale
Corpora
What is a corpus?
- A corpus (pl. corpora) is a collection of text or speech assembled for analysis or model training
- Every NLP system is shaped by the corpora it was trained on — corpus selection is a design decision with real consequences
- A corpus is not a neutral sample of “language” — it is a sample of specific people’s language use under specific conditions
Along what dimensions do corpora vary, and why does this matter for NLP?
- Language — 7097 languages in the world; most NLP defaults to English
- Variety — e.g., African American English, Singaporean English — distinct grammar, not just accent
- Code-switching — multilingual speakers mix languages within sentences
- Genre — newswire, fiction, social media, legal, medical — very different statistical properties
- Author demographics — age, gender, ethnicity, SES of writers
Each dimension affects model performance: a model trained on one slice will generalise poorly to another.
What is a corpus datasheet and why is it needed?
- A datasheet (Gebru et al., 2020) is structured documentation recording how a corpus was collected, what it contains, its intended uses, and known limitations
- Also called a data statement (Bender & Friedman, 2018)
- Covers: motivation, situation, collection process, annotation process, language variety, demographics
- The goal is to make the assumptions embedded in a corpus explicit, so downstream users can reason about what biases and gaps they are inheriting
Why does training on a biased corpus produce systematic errors, not random ones?
- The model learns statistical patterns from whatever text it sees
- If certain language varieties are absent, the model has no evidence for their patterns and falls back on the majority variety
- This produces errors correlated with speaker identity — the model performs worse on some groups by construction
- These are not random mistakes — they are systematic bias built into the training data
Lemmas and Wordforms
What is the difference between a lemma and a wordform?
- Lemma: same stem, same part of speech, same rough word sense — `cat` and `cats` share the lemma `cat`
- Wordform: the full inflected surface form — `cat` and `cats` are different wordforms
- A lemma groups together inflected variants; a wordform distinguishes them
- Whether you count lemmas or wordforms affects your type count: “the cat sat on the mats near the cats” has 7 wordform types but only 6 lemmas (`the`, `cat`, `sit`, `on`, `mat`, `near`)
Tokenization
What is tokenization?
- The task of segmenting a raw character stream into tokens — the atomic units the downstream system reasons over
- It is the first step in virtually every NLP pipeline
- What counts as a token depends on the application: words, subwords, characters, or morphemes
Why is tokenization harder than it appears? Give examples.
- Periods have multiple roles: abbreviation (`Dr.`), decimal (`$45.55`), URL (`google.com`), sentence boundary (`She left.`) — a split-on-period rule mangles all of these
- Clitics: `we're` → `we` + `'re`, `can't` → `can` + `n't` — need to split within a “word”
- Multiword expressions: `New York`, `rock 'n' roll` — multiple whitespace tokens that function as one semantic unit
- Languages without spaces: Chinese, Japanese — word boundaries must be inferred, not read off the text
Why is Chinese tokenization fundamentally different from English?
- Chinese text has no spaces between words — word boundaries must be inferred from meaning and context
- Segmentation is genuinely ambiguous: 姚明进入总决赛 can be parsed as 3 words, 5 words, or 7 characters
- Needs a dedicated word segmentation model trained on annotated data — whitespace splitting is not an option
What is the difference between "split on delimiters" and "collect matches" tokenization?
- Split on delimiters: define what separates tokens (e.g., whitespace, punctuation) and split there
- Collect matches: define what a token looks like (its shape) using regex patterns and collect matches
- Collect matches is more expressive — you can have patterns for abbreviations (`(?:[A-Z]\.)+`), hyphenated words, currencies, all tried in priority order
- Some tokens like `U.S.A.` contain characters that would be treated as delimiters under a split approach
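A sketch of a "collect matches" tokenizer; the token shapes below are illustrative, not a complete tokenizer:

```python
import re

pattern = r"""(?x)           # verbose mode: whitespace and comments ignored
    (?:[A-Z]\.)+             # abbreviations like U.S.A.
  | \$?\d+(?:\.\d+)?%?       # prices and percentages like $45.55, 82%
  | \w+(?:-\w+)*             # words, optionally hyphenated
  | \.\.\.                   # ellipsis
  | [][.,;"'?():_-]          # single punctuation characters
"""
print(re.findall(pattern, "The U.S.A. price rose to $45.55 (up 2%)."))
# ['The', 'U.S.A.', 'price', 'rose', 'to', '$45.55', '(', 'up', '2%', ')', '.']
```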
Text Normalization
What is text normalization?
Text normalization is the process of transforming raw text into a canonical, consistent form before it is fed into an NLP system.
- Raw text is messy: inconsistent casing, spelling variants (`colour` vs `color`), inflected word forms (`running`, `ran`, `runs`), punctuation attached to words, multiple sentence boundaries
- Normalization reduces this surface variation so the downstream model sees equivalent inputs as the same
- It is a pre-processing stage — applied once to the corpus before any learning or inference
- The three canonical steps are: tokenization → word-format normalization → sentence segmentation
- What counts as “normal” is task-dependent: IR folds case, NER preserves it; chatbots keep contractions, some classifiers split them

Tags: text-normalization preprocessing
[!question]- What are the 3 steps of text normalization?
- Tokenizing words — segmenting a character stream into tokens
- Normalizing word formats — mapping token variants to a canonical form (case folding, lemmatization, stemming)
- Segmenting sentences — identifying sentence boundaries
Every NLP task requires this pipeline. The steps are ordered — tokenization must happen first because the other two depend on having tokens.
What is the difference between stemming and lemmatization?
- Lemmatization: reduces a word to its dictionary headword (lemma) using full morphological analysis — result is always a real word (`better` → `good`, `sang` → `sing`)
- Stemming: strips affixes using heuristic rules without a lexicon — result may not be a real word (`universe` → `univers`) and may conflate unrelated words (`wand`/`wander` → `wand`)
- Use stemming for IR — crude conflation raises recall at acceptable precision cost
- Use lemmatization when word meaning matters — translation, classification, QA
When should and shouldn't you use case folding? Give examples.
- Use it: information retrieval — `The` and `the` should match the same documents; folding removes irrelevant variation
- Avoid it: named entity recognition — `Apple` (company) vs `apple` (fruit) are different entities
- Avoid it: sentiment analysis — `US` (United States) vs `us` (first-person pronoun) carry different signals
- Case folding is lossy — the information lost sometimes matters
Subword Tokenization
What is subword tokenization and why is it needed?
- Operates between character-level and word-level — learns a vocabulary of word fragments from data
- Word-level fails in two ways: unseen words collapse to `<UNK>` (OOV problem), and morphologically rich languages produce enormous vocabularies
- Subword solves both: any string decomposes into known subword pieces (worst case: individual characters), and related forms share units (`running`, `runner` both contain `run`)
- Every modern large language model uses subword tokenization
What are the two components of every subword tokenizer?
- Token learner: runs offline on a training corpus → produces a vocabulary of subword types
- Token segmenter: runs at inference time → applies the learned vocabulary to segment new text
They are separated because learning is expensive (processes entire corpus) while segmentation must be fast (runs on every input). Learn once, segment many times.
How does BPE differ from WordPiece?
- BPE (Sennrich et al., 2016): merges the pair with the highest raw frequency — used by GPT-2/3/4, RoBERTa
- WordPiece (Schuster & Nakajima, 2012): merges the pair that most improves language model likelihood — used by BERT, DistilBERT
- WordPiece is more principled (optimises a meaningful objective) but slower to train
- Both produce similar vocabularies in practice
Edit Distance
What is edit distance, and what are the two cost conventions?
- The minimum number of single-character operations (insertion, deletion, substitution) needed to transform one string into another
- Two conventions exist — always check which one a question uses:
- Levenshtein (Jurafsky & Martin): ins = 1, del = 1, sub = 2
- Uniform: ins = 1, del = 1, sub = 1
- The same pair of strings produces a different distance under the two conventions
What is the recurrence relation for edit distance?
$$
D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \mathrm{sub}(i,j) & \text{(substitution / match)} \end{cases}
$$
- $\mathrm{sub}(i,j) = 0$ if the characters match, $2$ (Levenshtein) or $1$ (uniform) if they differ
- Base cases: $D(i,0) = i$, $D(0,j) = j$
- Answer: $D(n,m)$ (bottom-right cell)
- Complexity: $O(nm)$ time and space
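A direct implementation of this recurrence (Levenshtein costs by default):

```python
def edit_distance(source, target, sub_cost=2):
    """Minimum edit distance; ins = del = 1, sub = sub_cost (2 = Levenshtein, 1 = uniform)."""
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                              # base case: delete everything
    for j in range(1, m + 1):
        D[0][j] = j                              # base case: insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + sub) # substitution or match
    return D[n][m]

print(edit_distance("SIT", "SAT"))               # 2
print(edit_distance("intention", "execution"))   # 8
```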
Walk through the edit distance DP table for "SIT" → "SAT" (Levenshtein: sub = 2).
Base cases: row 0 = 0 1 2 3, col 0 = 0 1 2 3. Bold = backtrace path.
"" S A T "" 0 1 2 3 S 1 0 1 2 I 2 1 2 2 T 3 2 2 2 Reading the path (top-left → bottom-right):
- (0,0)→(1,1): S = S → match, cost 0
- (1,1)→(2,2): I ≠ A → substitution, cost +2
- (2,2)→(3,3): T = T → match, cost 0
Edit distance = 2
Alignment:
S I T
↕ ↕ ↕
S A T

Tags: edit-distance dynamic-programming worked-example
What is backtrace in edit distance, and how does it recover the alignment?
- Each cell stores a pointer to which predecessor gave the minimum: ← (insertion), ↓ (deletion), ↖ (match/substitution)
- Start at $D(n,m)$, follow pointers back to $D(0,0)$
- Each step reads off one edit operation; reversing gives the alignment in forward order
- Complexity: $O(n+m)$ for the backtrace itself (the path has at most $n+m$ steps)
When would you use weighted edit distance instead of uniform costs?
- Uniform costs assume all errors are equally likely — in reality some are far more common
- Spelling correction: confused character pairs (`m`/`n`, `b`/`d`) should have lower substitution cost
- Keyboard proximity: adjacent keys (`s`/`d`) are more likely to be confused by a typist
- OCR correction: visually similar characters (`0`/`O`, `l`/`1`) should cost less to substitute
- Requires a confusion matrix — empirical frequencies of which characters get confused, from error logs or misspelling corpora
Byte Pair Encoding (BPE)
How does the BPE token learner algorithm work?
- Split every word in the corpus into individual characters plus an end-of-word marker (`_`)
- Start with a vocabulary of all characters plus `_`
- Repeat $k$ times:
  - Count all adjacent symbol pairs across the corpus
  - Find the most frequent pair (e.g., `e` + `s` → `es`)
  - Merge that pair into a new symbol everywhere in the corpus
  - Add the new symbol to the vocabulary
- Each merge adds one vocabulary entry and shortens the corpus wherever the pair occurs
- Frequent words eventually become single tokens; rare words stay as sequences of subword pieces
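A sketch of the token learner on a toy word-frequency corpus (the words and counts are invented):

```python
from collections import Counter

def bpe_learn(word_freqs, k):
    """Learn k BPE merges from a {word: frequency} corpus."""
    # Each word becomes a tuple of symbols ending in the end-of-word marker
    corpus = {tuple(word) + ("_",): freq for word, freq in word_freqs.items()}
    vocab = {sym for word in corpus for sym in word}
    merges = []
    for _ in range(k):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent pair
        merges.append(best)
        vocab.add("".join(best))
        # Replace the pair with the merged symbol everywhere in the corpus
        new_corpus = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = bpe_learn({"low": 5, "lowest": 2, "newer": 6, "wider": 3}, k=6)
print(merges)   # the 6 merges, in the order they were learned
```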
How does the BPE segmenter handle a word it has never seen?
- Split the new word into characters + end-of-word marker
- Apply every learned merge in the order it was learned, greedily
- Skip any merge whose pair doesn’t appear in the current sequence
- Example: unseen word `lowest` → `l o w e s t _` → merges produce `["low", "est_"]`
- Key property: degrades gracefully — whole words → subword pieces → individual characters. Never produces `<UNK>`
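A sketch of the segmenter; the merge list here is hypothetical (in practice it comes from the token learner above):

```python
def bpe_segment(word, merges):
    """Replay learned merges, in order, on an unseen word."""
    symbols = list(word) + ["_"]
    for a, b in merges:                          # merges in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)                # the pair occurs here: merge it
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "s"), ("es", "t"), ("est", "_"), ("l", "o"), ("lo", "w")]
print(bpe_segment("lowest", merges))   # ['low', 'est_']
```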
What is the precision/recall tradeoff in regex pattern design?
- Precision = of the strings the pattern matched, what fraction were correct (true positives / (true positives + false positives))
- Recall = of the strings that should have matched, what fraction did (true positives / (true positives + false negatives))
- Tightening a pattern (more constraints) → raises precision, risks lowering recall
- Broadening a pattern (more alternatives) → raises recall, risks lowering precision
- There is no free lunch — practical regex engineering iterates between the two
What does the Porter Stemmer do, and give an example rule?
- A widely-used rule-based stemmer for English — strips affixes via a cascade of heuristic rewrite rules
- Example rules: `ATIONAL → ATE` (e.g., `relational → relate`), `ING → ε` if the stem contains a vowel (e.g., `motoring → motor`), `SSES → SS` (e.g., `caresses → caress`)
- Output may not be a real word (`computer → comput`)
- Used heavily in information retrieval, where crude conflation raises recall at acceptable precision cost
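NLTK ships an implementation if you want to poke at it (assumes `nltk` is installed; later rule steps may trim further, e.g. `relational` can come out as `relat` rather than `relate`):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["relational", "motoring", "caresses", "computer"]:
    print(word, "->", stemmer.stem(word))
```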
What is sentence segmentation, and why is it harder than splitting on `.`, `!`, `?`?
- The task of identifying sentence boundaries in a token stream
- `!` and `?` are relatively unambiguous; `.` is the troublemaker
- A period can mark: sentence end, abbreviation (`Dr.`, `Inc.`), decimal (`3.14`), URL (`google.com`), or ellipsis (`...`)
- The standard approach is a binary classifier at every period: “end of sentence” vs “not”, using features like word shape, capitalization of the next token, and abbreviation lists
- Often handled jointly with tokenization — both involve the same period-disambiguation decision