A pretext task is an artificial challenge the algorithm sets for itself. Cut a patch from the centre of an image, cut another from a corner, and ask the network to classify where the corner patch came from relative to the centre. To answer correctly, the network has to recognise what’s in the image — a cat’s face implies the ear belongs top-right of it, not bottom-left. The position label is fake (the algorithm made it up); but solving the fake task forces real visual understanding.
The pattern
- Pick a transformation that scrambles the input in a known, parameterisable way.
- The “label” is whatever the transformation did — the rotation angle, the patch position, the missing pixels, the colour vs greyscale flag.
- Train the network to invert/predict the transformation. Standard supervised classification, except the label was generated by the algorithm.
- Discard the prediction head. Keep the encoder.
The encoder’s mid-layer features are the deliverable — they’ve absorbed whatever visual understanding was needed to solve the pretext.
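The recipe can be sketched in a few lines. This is a minimal, hypothetical illustration using rotation prediction as the transformation (the function name and shapes are illustrative, not from any specific codebase) — the point is that the labels are manufactured by the algorithm itself:

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Self-label a batch: rotate each image by a random multiple of 90
    degrees; the 'label' is simply the rotation index the algorithm chose."""
    ks = rng.integers(0, 4, size=len(images))      # 0,1,2,3 -> 0/90/180/270 degrees
    rotated = np.stack([np.rot90(img, k) for img, k in zip(images, ks)])
    return rotated, ks                             # (inputs, made-up labels)

# A classifier would be trained on (rotated, ks) with cross-entropy; the
# prediction head is then discarded and only the encoder is kept.
rng = np.random.default_rng(0)
imgs = rng.random((8, 32, 32, 3))                  # 8 square RGB images
x, y = make_rotation_batch(imgs, rng)
```

No human ever labels anything: the supervision signal exists only because the algorithm knows which rotation it applied.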
The canonical example: context prediction (Doersch et al. 2015)
[Doersch, Gupta, Efros — Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015.]
The task
From an image, sample a 3×3 grid of patches. Pick the centre patch (always position “5”) and one of the eight surrounding patches (positions 1–4 or 6–9). Feed both patches into a Siamese network (two parallel encoder copies sharing weights, joining at fc7). Ask: “which of the 8 positions did the second patch come from?”
The label is one of the eight positions — 8-way classification. Loss is standard cross-entropy.
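The sampling step can be sketched as follows. This is a simplified, assumed version: the actual method also inserts gaps and random jitter between patches so the network cannot exploit edge continuity, which is omitted here for clarity:

```python
import numpy as np

def sample_patch_pair(image, patch, rng):
    """Cut a 3x3 grid of patch-sized tiles from the top-left of `image`;
    return (centre tile, one random neighbour tile, position label 0-7).
    Simplified: the real pipeline adds gaps and jitter between patches."""
    grid = [image[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
            for r in range(3) for c in range(3)]
    centre = grid[4]                               # position "5" in 1-based terms
    label = rng.integers(0, 8)                     # which of the 8 neighbours
    neighbour = grid[label if label < 4 else label + 1]  # skip index 4 (centre)
    return centre, neighbour, label

rng = np.random.default_rng(1)
img = rng.random((96, 96, 3))                      # big enough for a 3x3 grid of 32px tiles
c, n, y = sample_patch_pair(img, 32, rng)
```

Both tiles would then go through the shared-weight encoder copies, and the 8-way head predicts `y`.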
Why this forces understanding
To answer correctly, the network must:
- Recognise what’s in each patch — “this is a cat’s nose,” “this is a cat’s ear.”
- Know how the parts of objects relate spatially — “ears go above eyes, eyes go above nose, nose goes above mouth.”
A network that only does pixel statistics (colour histograms, edge counts) cannot distinguish “ear above nose” from “ear below nose.” Solving the patch-position task requires semantic content recognition. So the encoder’s features end up capturing object structure — and they transfer to downstream tasks like classification, detection, retrieval.
The result
After pre-training, the network’s fc6 features (just below the position classifier) cluster semantically. Nearest-neighbour search in fc6 space returns visually similar and semantically related results — wheels near other wheels, faces near other faces — without any class label having been seen. R-CNN initialised with these pre-trained weights beats training from scratch by 5% mAP on PASCAL VOC, getting close to ImageNet-supervised pre-training.
The Clever Hans warning — chromatic aberration
Doersch et al. discovered the network was achieving near-perfect accuracy on the patch-position task not by recognising content, but by reading chromatic aberration — a colour-fringe artefact at lens edges that makes the centre of an image look slightly different in colour from the corners. The network had found a shortcut that revealed absolute position from low-level optics, no scene understanding required.
The fix: drop colour channels (random colour-channel removal during training) to break the shortcut. With that, the network was forced to actually look at content, and the learned features became meaningful.
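A simplified sketch of what such a colour-dropping augmentation might look like (this is an assumed variant, not the paper's exact implementation): keep one randomly chosen channel and replace the others with noise, destroying the relative channel shifts that chromatic aberration produces:

```python
import numpy as np

def drop_colour_channels(image, rng):
    """Keep one randomly chosen colour channel; fill the other two with
    noise so chromatic-aberration cues (relative channel shifts) vanish.
    A simplified, illustrative variant of colour dropping."""
    out = rng.normal(0.5, 0.1, size=image.shape)   # noise in the dropped channels
    keep = rng.integers(0, 3)                      # which channel survives
    out[..., keep] = image[..., keep]
    return out, keep

rng = np.random.default_rng(2)
img = rng.random((32, 32, 3))
aug, kept = drop_colour_channels(img, rng)
```

With only one channel intact per training example, the network cannot compare channels to read off absolute position, so the shortcut closes.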
This is a textbook clever-hans-effect. The pretext task created a spurious correlation (lens position ↔ colour shift) that let the network solve the proxy without solving the intended problem. Designing a pretext task always carries this risk; testing on diverse downstream tasks is the only way to know whether the encoder learned something real.
Other pretext tasks (briefly)
| Task | Label | What it forces |
|---|---|---|
| Context prediction (Doersch 2015) | 1 of 8 patch positions | Object spatial structure |
| Rotation prediction (Gidaris 2018) | 0°, 90°, 180°, 270° | Canonical orientation of objects |
| Jigsaw puzzles (Noroozi 2016) | Permutation of 9 shuffled tiles | Whole-image structure |
| Inpainting (Pathak 2016) | Predict a masked-out region | Local context understanding |
| Colourisation (Zhang 2016) | Predict colour from greyscale | Texture-to-colour associations |
All share the recipe: invent a derangement, predict it back, throw away the predictor head.
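The jigsaw row of the table makes the shared recipe concrete. A hypothetical sketch of label generation (in the actual method, a small subset of maximally different permutations is used rather than all 9! possibilities; the two permutations below are illustrative only):

```python
import numpy as np

def make_jigsaw_example(image, patch, perms, rng):
    """Shuffle the 9 tiles of `image` by a permutation drawn from the fixed
    set `perms`; the label is the index of the chosen permutation."""
    tiles = np.stack([image[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
                      for r in range(3) for c in range(3)])
    label = rng.integers(0, len(perms))
    return tiles[list(perms[label])], label        # reordered tiles, fake label

# Illustrative permutation set: identity and full reversal.
perms = [(0, 1, 2, 3, 4, 5, 6, 7, 8), (8, 7, 6, 5, 4, 3, 2, 1, 0)]
rng = np.random.default_rng(3)
img = rng.random((96, 96, 3))
shuffled, y = make_jigsaw_example(img, 32, perms, rng)
```

Same pattern as every row of the table: a known derangement, a label nobody wrote, and an encoder that must understand the image to undo the scramble.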
Status: superseded but instructive
Pretext-task methods were the dominant SSL approach roughly 2014–2018. They’ve been mostly displaced by contrastive methods (SimCLR onwards), which produce stronger representations without needing a handcrafted pretext. But pretext tasks remain pedagogically important:
- They make it explicit that the proxy task is throwaway — the encoder is the product. This insight is reusable across all SSL.
- They expose the shortcut problem vividly via the chromatic aberration story — a Clever Hans failure mode that recurs in subtler forms throughout deep learning.
- They’re computationally cheap compared to contrastive methods (no large batches needed) and remain useful in resource-constrained settings.
A friend designs a pretext task: "Take a square image, and predict whether it's been rotated 90° clockwise or not." Why might this learn weaker features than Doersch's context prediction?
Two related reasons. First, task difficulty. A binary classification is much easier than 8-way — the network can solve it with shallow features and doesn’t need deep object recognition. Pretext tasks have to be hard enough that the network must understand content to solve them; trivial pretexts produce trivial features. Second, shortcut potential. A 90° rotation creates very obvious low-level cues — edge orientations flip — so the network can answer by detecting whether most edges are horizontal or vertical, without learning what’s in the image. Doersch’s 8-way classification is harder to shortcut because all eight spatial relations have similar low-level statistics, so the network is more likely to actually need semantic recognition.
Why does the chromatic-aberration shortcut count as a "Clever Hans" failure rather than a useful learned feature?
Because the feature it learned (lens position from colour shift) is specific to the proxy task and does not transfer. Recognising chromatic aberration tells you nothing about cat ears, road signs, or tumour boundaries — the things you actually want the encoder to know about. The network’s training accuracy on patch-position prediction was high, but its downstream-task accuracy stayed low because it hadn’t learned anything useful. That gap — high pretext accuracy, low downstream usefulness — is the diagnostic signature of a Clever Hans shortcut. The fix is to break the shortcut (drop colour channels) and verify that downstream performance recovers.
Connections
- self-supervised-learning — the umbrella; pretext tasks are the second of the three families covered.
- contrastive-learning — the modern alternative; mostly displaced pretext tasks but uses the same “throwaway proxy task” pattern.
- representation-learning — the goal pretext tasks serve.
- clever-hans-effect — the recurring failure mode; chromatic aberration is the textbook example.
- autoencoder — another SSL family, with a different proxy (reconstruct the input).
- transfer-learning — pretext-pretrained encoders transfer the same way ImageNet-pretrained ones do.