Text classifiers — including simple ones like Naive Bayes — can produce systematic harm when they encode biases from their training data. Good aggregate metrics can coexist with serious per-group failures.

Three Kinds of Harm

Representational harms

Harms caused by a system that demeans a social group, such as by perpetuating negative stereotypes.

Kiritchenko and Mohammad (2018) examined 200 sentiment analysis systems on pairs of sentences that were identical except for a name: common African American names (e.g., Shaniqua) vs. European American names (e.g., Stephanie). Example pair: “I talked to Shaniqua yesterday” vs. “I talked to Stephanie yesterday”.

Result: systems systematically assigned lower sentiment and more negative emotion to sentences with African American names — despite the sentences being semantically identical.

Downstream harm: these sentiment tools are widely used in marketing research and mental health studies, so biased scores cause African Americans to be treated differently by the downstream systems that consume the outputs. The bias perpetuates existing stereotypes rather than creating new ones, but it amplifies and automates them.
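
A minimal sketch of this kind of paired-template probe. Everything here is illustrative: score_sentiment is a hypothetical stand-in for whatever classifier is under audit, and the name lists are far smaller than a real study would use.

    # Paired-template sentiment bias probe in the style of Kiritchenko
    # and Mohammad (2018). `score_sentiment` is a hypothetical stand-in
    # for the classifier under audit; assume it returns a score in [0, 1].

    TEMPLATES = [
        "I talked to {name} yesterday.",
        "{name} feels great today.",
    ]

    # Tiny illustrative name lists; a real audit would use larger, vetted lists.
    AA_NAMES = ["Shaniqua", "Darnell"]
    EA_NAMES = ["Stephanie", "Greg"]

    def mean_score(names, score_sentiment):
        scores = [score_sentiment(t.format(name=n))
                  for t in TEMPLATES for n in names]
        return sum(scores) / len(scores)

    def bias_gap(score_sentiment):
        """Positive gap = European American names receive higher sentiment."""
        return (mean_score(EA_NAMES, score_sentiment)
                - mean_score(AA_NAMES, score_sentiment))

An unbiased system should produce a gap near zero; the finding above is that most systems produced a systematic positive gap.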

Harms of censorship

Toxicity detection is the text classification task of detecting hate speech, abuse, harassment, or other toxic language — widely deployed in online content moderation.

Toxicity classifiers have been documented to incorrectly flag non-toxic sentences that simply mention minority identities: sentences containing words like “blind” or “gay” are over-flagged as toxic regardless of context. Groups affected include (a measurement sketch follows this list):

  • Women (Park et al., 2018)
  • Disabled people (Hutchinson et al., 2020)
  • Gay people (Dixon et al., 2018; Oliva et al., 2021)
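
A sketch of how this over-flagging is typically measured: run clearly non-toxic template sentences mentioning each identity term through the classifier and compare false-positive rates. Here is_toxic is a hypothetical stand-in for the classifier under audit, and the term list is illustrative only.

    # Identity-term over-flagging audit in the spirit of Dixon et al. (2018).
    # `is_toxic` is a hypothetical stand-in returning True when the
    # classifier flags a text as toxic.

    NON_TOXIC_TEMPLATES = [
        "I am a {identity} person.",
        "My friend is {identity}.",
    ]

    IDENTITY_TERMS = ["gay", "blind", "straight", "tall"]  # illustrative only

    def false_positive_rate(identity, is_toxic):
        texts = [t.format(identity=identity) for t in NON_TOXIC_TEMPLATES]
        # Every template is non-toxic, so any flag is a false positive.
        return sum(is_toxic(s) for s in texts) / len(texts)

    def audit(is_toxic):
        # Large per-term differences indicate identity-term bias.
        return {term: false_positive_rate(term, is_toxic)
                for term in IDENTITY_TERMS}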

Downstream harms:

  • Speech by these groups is censored disproportionately.
  • Their speech becomes less visible online.
  • Writers learn (explicitly or implicitly) to avoid identity-referring vocabulary, making them less likely to write about themselves.

Performance disparities

Text classifiers perform worse on many languages of the world due to lack of data or labels — the long tail of low-resource languages.

They also perform worse even on varieties of high-resource languages. Example: language identification, typically a first step in an NLP pipeline (“is this post in English?”), performs worse on English written by African Americans (Blodgett & O’Connor, 2017) or by writers from India (Jurgens et al., 2017), leading to their content being dropped from downstream English-language pipelines entirely.
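
A sketch of how this disparity can be measured, assuming a corpus of posts that are all actually in English, each labeled with the writer's variety, and a hypothetical detect_language function standing in for the identifier under audit:

    # Per-variety language-ID recall, in the spirit of Blodgett and
    # O'Connor (2017). `detect_language` is a hypothetical stand-in.

    from collections import defaultdict

    def recall_by_variety(posts, detect_language):
        """posts: iterable of (text, variety) pairs, all in English.

        Returns the fraction of posts correctly identified as English,
        per variety. A gap between varieties means one group's content
        is disproportionately dropped from English-only pipelines."""
        hits, totals = defaultdict(int), defaultdict(int)
        for text, variety in posts:
            totals[variety] += 1
            hits[variety] += (detect_language(text) == "en")
        return {v: hits[v] / totals[v] for v in totals}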

Causes

Three classes of cause, each with examples:

  1. Issues in the data: NLP systems amplify biases already present in training data. If a corpus underrepresents African American English, the model has no evidence for its patterns and falls back on majority-variety patterns. The biases come from who writes what, what gets published, and what gets scraped.
  2. Problems in the labels: annotators bring their own biases and disagreements; some phenomena are labelled inconsistently; rare categories get conflated.
  3. Problems in the algorithm: the choice of what the model is trained to optimize (e.g., average accuracy) can itself encode a bias. Optimizing average accuracy gives a classifier no incentive to perform well on minority groups or rare classes, as the sketch after this list illustrates.
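
A toy numeric sketch of that third point; all numbers are made up for illustration:

    # High average accuracy can coexist with poor minority-group
    # performance. Numbers are illustrative, not from any study.

    n_majority, n_minority = 950, 50          # 95% / 5% group split
    acc_majority, acc_minority = 0.99, 0.40   # per-group accuracy

    overall = (n_majority * acc_majority
               + n_minority * acc_minority) / (n_majority + n_minority)
    print(f"overall accuracy: {overall:.2f}")    # ~0.96, looks fine
    print(f"minority accuracy: {acc_minority}")  # 0.40, a serious failure

Raising minority-group accuracy from 0.40 to 0.99 would improve overall accuracy by only about three points, so a learner optimizing average accuracy feels almost no pressure to do it.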

Prevalence and Solutions

Prevalence: the same problems occur throughout NLP, not just in Naive Bayes classifiers and not just in sentiment analysis. They appear in search, translation, named entity recognition, and large language models. Bigger models amplify the biases in their data rather than removing them.

Solutions: there is no general mitigation. Harm mitigation is an active research area. What exists:

  • Standard benchmarks for measuring disparate performance (e.g. across demographic slices).
  • Bias-measurement tools that surface the gaps that aggregate metrics hide.
  • Per-application mitigations — data augmentation for underrepresented varieties, fairness constraints at training time, post-hoc calibration.
  • Model cards — a lightweight documentation practice (see below) that doesn’t fix bias but makes it visible, so downstream users can make informed decisions.

None of these solve the problem in general. Each requires thinking carefully about the specific task, who is affected, and what kind of harm is at stake.

Model Cards (Mitchell et al., 2019)

Model cards are a documentation standard: for each algorithm released, publish a short, structured card describing how the model was trained, what it’s intended for, and how it performs across different groups. The idea is borrowed from nutrition labels — you can’t taste what’s in the food, so the label makes the ingredients explicit.

Each model card should document:

  • Training algorithms and parameters — what was actually trained, with what hyperparameters, what optimizer, and what compute budget.
  • Training data sources, motivation, and preprocessing — where the data came from, why this data was chosen, what filtering/cleaning was applied before training. This is where corpus datasheets feed in.
  • Evaluation data sources, motivation, and preprocessing — what the model was benchmarked against, and the provenance of those benchmarks.
  • Intended use and users — the scenarios the model was designed for, and the scenarios it was explicitly not designed for.
  • Model performance across different demographic or other groups and environmental situations — disaggregated metrics, not just aggregate numbers. This is the critical fairness-audit piece: slice performance by gender, race, age, language variety, dialect, etc., and report per-slice precision/recall/F1 (a sketch follows this list).
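
A sketch of that last, disaggregated-metrics piece, using scikit-learn. The group labels are illustrative, and the code assumes binary 0/1 classification labels:

    # Disaggregated evaluation: per-slice precision/recall/F1 instead of
    # one aggregate number. Assumes binary 0/1 labels; `groups` is a
    # parallel list of slice labels (e.g. language variety) per example.

    from sklearn.metrics import precision_recall_fscore_support

    def metrics_by_group(y_true, y_pred, groups):
        report = {}
        for g in sorted(set(groups)):
            idx = [i for i, grp in enumerate(groups) if grp == g]
            p, r, f1, _ = precision_recall_fscore_support(
                [y_true[i] for i in idx],
                [y_pred[i] for i in idx],
                average="binary",
                zero_division=0,  # report 0 instead of warning on empty slices
            )
            report[g] = {"precision": p, "recall": r, "f1": f1, "n": len(idx)}
        return report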

Model cards don’t fix any of the underlying biases — biased training data produces biased models regardless of how well you document them. What they do is make the biases legible: a downstream practitioner picking a sentiment classifier for an app can read the model card, see that it underperforms on African American English, and choose differently (or at least build in compensating checks). Documentation shifts harm from invisible-and-automatic to visible-and-decidable.
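
A minimal sketch of what such a card might look like as a structured object. The field names follow the list above, and every value is an illustrative placeholder:

    # Skeleton model card following the fields listed above. All values
    # are placeholders, not a real model's details.

    model_card = {
        "model_details": {"algorithm": "multinomial naive Bayes",
                          "hyperparameters": {"alpha": 1.0}},
        "training_data": {"sources": "...", "motivation": "...",
                          "preprocessing": "..."},
        "evaluation_data": {"sources": "...", "motivation": "...",
                            "preprocessing": "..."},
        "intended_use": {"in_scope": ["..."], "out_of_scope": ["..."]},
        "performance": {  # disaggregated, not just aggregate
            "overall": {"f1": None},
            "by_language_variety": {"AAE": {"f1": None}},
        },
    }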

Connection to Other Weeks

  • corpora (week 1): corpus datasheets / data statements (Gebru et al. 2020; Bender & Friedman 2018) exist specifically to make the assumptions embedded in a corpus explicit — a prerequisite for reasoning about what biases a model might inherit.
  • classification-evaluation: aggregate metrics like accuracy or macro-F1 hide per-group disparities. Evaluation for fairness requires slicing the test set by demographic group and reporting metrics separately.

Active Recall