A pretext task is an artificial challenge the algorithm sets for itself. Cut a patch from the centre of an image, cut another from a corner, and ask the network to classify where the corner patch came from relative to the centre. To answer correctly, the network has to recognise what’s in the image — a cat’s face implies the ear belongs top-right of it, not bottom-left. The position label is fake (the algorithm made it up); but solving the fake task forces real visual understanding.

The pattern

  1. Pick a transformation that scrambles the input in a known, parameterisable way.
  2. The “label” is the transformation’s parameter — the rotation angle, the patch position, the missing pixels, the colour-vs-greyscale flag.
  3. Train the network to invert or predict that parameter. Standard supervised classification, except the label was generated by the algorithm.
  4. Discard the prediction head. Keep the encoder.
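The four steps above can be sketched with rotation as the transformation — a minimal numpy example in which `make_rotation_batch` is a hypothetical helper name, not from any library:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rotation_batch(images, rng):
    # Steps 1-2: the transformation is rotation by k * 90 degrees;
    # the self-generated "label" is k itself.
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack([np.rot90(img, int(k)) for img, k in zip(images, labels)])
    return rotated, labels

images = rng.random((8, 32, 32, 3))   # toy batch of HWC images
x, y = make_rotation_batch(images, rng)
# Step 3 trains any 4-way classifier on (x, y); step 4 keeps only its encoder.
```

No real network is trained here — the point is that the (input, label) pair is manufactured entirely from unlabelled data.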

The encoder’s mid-layer features are the deliverable — they’ve absorbed whatever visual understanding was needed to solve the pretext.

The canonical example: context prediction (Doersch et al. 2015)

[Doersch, Gupta, Efros — Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015.]

The task

From an image, sample a 3×3 grid of patches. Pick the centre patch (always position “5”) and one of the eight surrounding patches (positions 1–4 or 6–9). Feed both patches into a Siamese network (two parallel encoder copies sharing weights, joining at fc7). Ask: “which of the 8 positions did the second patch come from?”

The label is one of the eight positions — 8-way classification. The loss is standard cross-entropy.
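The sampling step can be sketched in a few lines of numpy — `sample_patch_pair` is a hypothetical helper, and this omits the gaps and jitter between patches that the real pipeline adds to stop trivial edge-matching:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_patch_pair(image, patch=32):
    # Tile the image into a 3x3 grid of non-overlapping patches.
    grid = [image[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
            for r in range(3) for c in range(3)]
    centre = grid[4]                    # position "5" in the note's numbering
    label = int(rng.integers(0, 8))     # the 8-way classification target
    neighbour = grid[label if label < 4 else label + 1]  # skip the centre slot
    return centre, neighbour, label

image = rng.random((96, 96, 3))
centre, neighbour, label = sample_patch_pair(image)
```

Both patches then go through the shared-weight encoder copies; only the `label` supervises the network.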

Why this forces understanding

To answer correctly, the network must:

  • Recognise what’s in each patch — “this is a cat’s nose,” “this is a cat’s ear.”
  • Know how the parts of objects relate spatially — “ears go above eyes, eyes go above nose, nose goes above mouth.”

A network that only does pixel statistics (colour histograms, edge counts) cannot distinguish “ear above nose” from “ear below nose.” Solving the patch-position task requires semantic content recognition. So the encoder’s features end up capturing object structure — and they transfer to downstream tasks like classification, detection, retrieval.

The result

After pre-training, the network’s fc6 features (just below the position classifier) cluster semantically. Nearest-neighbour search in fc6 space returns visually similar and semantically related results — wheels near other wheels, faces near other faces — without any class label having been seen. R-CNN initialised with these pre-trained weights beats training from scratch by 5% mAP on PASCAL VOC, getting close to ImageNet-supervised pre-training.
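The retrieval probe is just nearest-neighbour search in feature space — a sketch, assuming features are already extracted into a 2-D array (`nearest_neighbours` is a made-up name):

```python
import numpy as np

def nearest_neighbours(query, bank, k=5):
    # Cosine similarity between one query feature (e.g. an fc6 vector)
    # and a bank of features; returns indices of the k most similar rows.
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return np.argsort(-(b @ q))[:k]

bank = np.random.default_rng(3).normal(size=(10, 16))
idx = nearest_neighbours(bank[7] * 2.0, bank, k=3)   # scaled copy of row 7
```

If the encoder learned semantics, the top hits for a wheel patch are other wheels — no labels involved at any point.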

The Clever Hans warning — chromatic aberration

Doersch et al. discovered the network was achieving suspiciously high accuracy on the patch-position task not by recognising content, but by reading chromatic aberration — a colour-fringing artefact in which the lens shifts the colour channels relative to one another, increasingly towards the image edges, so a patch’s colour statistics betray its absolute position in the frame. The network had found a shortcut that revealed position from low-level optics, no scene understanding required.

The fix: remove colour information — randomly dropping colour channels during training — so the shortcut no longer pays off. With that, the network was forced to actually look at content, and the learned features became meaningful.
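The shortcut-breaking augmentation is simple to sketch — zeroing two of the three channels at random (the paper’s exact recipe differs in detail; this shows the idea, and `drop_colour_channels` is a made-up name):

```python
import numpy as np

rng = np.random.default_rng(2)

def drop_colour_channels(patch):
    # Zero out two of the three colour channels at random, so cross-channel
    # shifts (chromatic aberration) no longer encode absolute position.
    dropped = rng.choice(3, size=2, replace=False)
    out = patch.copy()
    out[..., dropped] = 0.0
    return out

patch = rng.random((32, 32, 3))
augmented = drop_colour_channels(patch)
```

Applied independently per patch, this destroys the colour cue while leaving structure and texture intact.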

This is a textbook clever-hans-effect. The pretext task created a spurious correlation (lens position ↔ colour shift) that solved the proxy without solving the intended problem. Designing a pretext task always carries this risk; testing on diverse downstream tasks is the only way to know if the encoder learned something real.

Other pretext tasks (briefly)

  Task                               | Label                           | What it forces
  Context prediction (Doersch 2015)  | 1 of 8 patch positions          | Object spatial structure
  Rotation prediction (Gidaris 2018) | 0°, 90°, 180°, 270°             | Canonical orientation of objects
  Jigsaw puzzles (Noroozi 2016)      | Permutation of 9 shuffled tiles | Whole-image structure
  Inpainting (Pathak 2016)           | Predict a masked-out region     | Local context understanding
  Colourisation (Zhang 2016)         | Predict colour from greyscale   | Texture-to-colour associations

All share the recipe: apply a known transformation, predict it back, throw away the predictor head.

Status: superseded but instructive

Pretext-task methods were the dominant SSL approach roughly 2014–2018. They’ve been mostly displaced by contrastive methods (SimCLR onwards), which produce stronger representations without needing a handcrafted pretext. But pretext tasks remain pedagogically important:

  • They make it explicit that the proxy task is throwaway — the encoder is the product. This insight is reusable across all SSL.
  • They expose the shortcut problem vividly via the chromatic aberration story — a Clever Hans failure mode that recurs in subtler forms throughout deep learning.
  • They’re computationally cheap compared to contrastive methods (no large batches needed) and remain useful in resource-constrained settings.

Connections

  • self-supervised-learning — the umbrella; pretext tasks are the second of the three families covered.
  • contrastive-learning — the modern alternative; mostly displaced pretext tasks but uses the same “throwaway proxy task” pattern.
  • representation-learning — the goal pretext tasks serve.
  • clever-hans-effect — the recurring failure mode; chromatic aberration is the textbook example.
  • autoencoder — another SSL family, with a different proxy (reconstruct the input).
  • transfer-learning — pretext-pretrained encoders transfer the same way ImageNet-pretrained ones do.