A pretext task is an artificial challenge the algorithm sets for itself. Cut a patch from the centre of an image, cut another from a corner, and ask the network to classify where the corner patch came from relative to the centre. To answer correctly, the network has to recognise what’s in the image — a cat’s face implies the ear belongs top-right of it, not bottom-left. The position label is fake (the algorithm made it up); but solving the fake task forces real visual understanding.
The pattern
- Pick a transformation that scrambles the input in a known, parameterisable way.
- The “label” is whatever the transformation did — the rotation angle, the patch position, the missing pixels, the colour vs greyscale flag.
- Train the network to invert/predict the transformation. Standard supervised classification, except the label was generated by the algorithm.
- Discard the prediction head. Keep the encoder.
The encoder’s mid-layer features are the deliverable — they’ve absorbed whatever visual understanding was needed to solve the pretext.
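The recipe can be sketched in a few lines. This is a minimal, hypothetical illustration using rotation prediction as the transformation (the function name and shapes are illustrative, not from any specific codebase) — the point is that the labels are manufactured by the algorithm itself:

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Self-label a batch: rotate each image by a random multiple of 90
    degrees; the 'label' is simply the rotation index the algorithm chose."""
    ks = rng.integers(0, 4, size=len(images))      # 0,1,2,3 -> 0/90/180/270 degrees
    rotated = np.stack([np.rot90(img, k) for img, k in zip(images, ks)])
    return rotated, ks                             # (inputs, made-up labels)

# A classifier would be trained on (rotated, ks) with cross-entropy; the
# prediction head is then discarded and only the encoder is kept.
rng = np.random.default_rng(0)
imgs = rng.random((8, 32, 32, 3))                  # 8 square RGB images
x, y = make_rotation_batch(imgs, rng)
```

No human ever labels anything: the supervision signal exists only because the algorithm knows which rotation it applied.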
The canonical example: context prediction (Doersch et al. 2015)
[Doersch, Gupta, Efros — Unsupervised Visual Representation Learning by Context Prediction, ICCV 2015.]
The task
From an image, sample a 3×3 grid of patches. Pick the centre patch (always position “5”) and one of the eight surrounding patches (positions 1–4 or 6–9). Feed both patches into a Siamese network (two parallel encoder copies sharing weights, joining at fc7). Ask: “which of the 8 positions did the second patch come from?”
The label is one of the eight positions — 8-way classification. Loss is standard cross-entropy.
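The sampling step can be sketched as follows. This is a simplified, assumed version: the actual method also inserts gaps and random jitter between patches so the network cannot exploit edge continuity, which is omitted here for clarity:

```python
import numpy as np

def sample_patch_pair(image, patch, rng):
    """Cut a 3x3 grid of patch-sized tiles from the top-left of `image`;
    return (centre tile, one random neighbour tile, position label 0-7).
    Simplified: the real pipeline adds gaps and jitter between patches."""
    grid = [image[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
            for r in range(3) for c in range(3)]
    centre = grid[4]                               # position "5" in 1-based terms
    label = rng.integers(0, 8)                     # which of the 8 neighbours
    neighbour = grid[label if label < 4 else label + 1]  # skip index 4 (centre)
    return centre, neighbour, label

rng = np.random.default_rng(1)
img = rng.random((96, 96, 3))                      # big enough for a 3x3 grid of 32px tiles
c, n, y = sample_patch_pair(img, 32, rng)
```

Both tiles would then go through the shared-weight encoder copies, and the 8-way head predicts `y`.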
Why this forces understanding
To answer correctly, the network must:
- Recognise what’s in each patch — “this is a cat’s nose,” “this is a cat’s ear.”
- Know how the parts of objects relate spatially — “ears go above eyes, eyes go above nose, nose goes above mouth.”
A network that only does pixel statistics (colour histograms, edge counts) cannot distinguish “ear above nose” from “ear below nose.” Solving the patch-position task requires semantic content recognition. So the encoder’s features end up capturing object structure — and they transfer to downstream tasks like classification, detection, retrieval.
The result
After pre-training, the network’s fc6 features (just below the position classifier) cluster semantically. Nearest-neighbour search in fc6 space returns visually similar and semantically related results — wheels near other wheels, faces near other faces — without any class label having been seen. R-CNN initialised with these pre-trained weights beats training from scratch by 5% mAP on PASCAL VOC, getting close to ImageNet-supervised pre-training.
The Clever Hans warning — chromatic aberration
Doersch et al. discovered the network was achieving near-perfect accuracy on the patch-position task not by recognising content, but by reading chromatic aberration — a colour-fringe artefact at lens edges that makes the centre of an image look slightly different in colour from the corners. The network had found a shortcut that revealed absolute position from low-level optics, no scene understanding required.
The fix: drop colour channels (random colour-channel removal during training) to break the shortcut. With that, the network was forced to actually look at content, and the learned features became meaningful.
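A simplified sketch of what such a colour-dropping augmentation might look like (this is an assumed variant, not the paper's exact implementation): keep one randomly chosen channel and replace the others with noise, destroying the relative channel shifts that chromatic aberration produces:

```python
import numpy as np

def drop_colour_channels(image, rng):
    """Keep one randomly chosen colour channel; fill the other two with
    noise so chromatic-aberration cues (relative channel shifts) vanish.
    A simplified, illustrative variant of colour dropping."""
    out = rng.normal(0.5, 0.1, size=image.shape)   # noise in the dropped channels
    keep = rng.integers(0, 3)                      # which channel survives
    out[..., keep] = image[..., keep]
    return out, keep

rng = np.random.default_rng(2)
img = rng.random((32, 32, 3))
aug, kept = drop_colour_channels(img, rng)
```

With only one channel intact per training example, the network cannot compare channels to read off absolute position, so the shortcut closes.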
This is a textbook clever-hans-effect. The pretext task created a spurious correlation (lens position ↔ colour shift) that let the network solve the proxy without solving the intended problem. Designing a pretext task always carries this risk; testing on diverse downstream tasks is the only way to know whether the encoder learned something real.
Other pretext tasks (briefly)
| Task | Label | What it forces |
|---|---|---|
| Context prediction (Doersch 2015) | 1 of 8 patch positions | Object spatial structure |
| Rotation prediction (Gidaris 2018) | 0°, 90°, 180°, 270° | Canonical orientation of objects |
| Jigsaw puzzles (Noroozi 2016) | Permutation of 9 shuffled tiles | Whole-image structure |
| Inpainting (Pathak 2016) | Predict a masked-out region | Local context understanding |
| Colourisation (Zhang 2016) | Predict colour from greyscale | Texture-to-colour associations |
All share the recipe: invent a derangement, predict it back, throw away the predictor head.
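The jigsaw row of the table makes the shared recipe concrete. A hypothetical sketch of label generation (in the actual method, a small subset of maximally different permutations is used rather than all 9! possibilities; the two permutations below are illustrative only):

```python
import numpy as np

def make_jigsaw_example(image, patch, perms, rng):
    """Shuffle the 9 tiles of `image` by a permutation drawn from the fixed
    set `perms`; the label is the index of the chosen permutation."""
    tiles = np.stack([image[r*patch:(r+1)*patch, c*patch:(c+1)*patch]
                      for r in range(3) for c in range(3)])
    label = rng.integers(0, len(perms))
    return tiles[list(perms[label])], label        # reordered tiles, fake label

# Illustrative permutation set: identity and full reversal.
perms = [(0, 1, 2, 3, 4, 5, 6, 7, 8), (8, 7, 6, 5, 4, 3, 2, 1, 0)]
rng = np.random.default_rng(3)
img = rng.random((96, 96, 3))
shuffled, y = make_jigsaw_example(img, 32, perms, rng)
```

Same pattern as every row of the table: a known derangement, a label nobody wrote, and an encoder that must understand the image to undo the scramble.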
Status: superseded but instructive
Pretext-task methods were the dominant SSL approach roughly 2014–2018. They’ve been mostly displaced by contrastive methods (SimCLR onwards), which produce stronger representations without needing a handcrafted pretext. But pretext tasks remain pedagogically important:
- They make it explicit that the proxy task is throwaway — the encoder is the product. This insight is reusable across all SSL.
- They expose the shortcut problem vividly via the chromatic aberration story — a Clever Hans failure mode that recurs in subtler forms throughout deep learning.
- They’re computationally cheap compared to contrastive methods (no large batches needed) and remain useful in resource-constrained settings.
A friend designs a pretext task: "Take a square image, and predict whether it's been rotated 90° clockwise or not." Why might this learn weaker features than Doersch's context prediction?
Two related reasons. First, task difficulty. A binary classification is much easier than 8-way — the network can solve it with shallow features and doesn’t need deep object recognition. Pretext tasks have to be hard enough that the network must understand content to solve them; trivial pretexts produce trivial features. Second, shortcut potential. A 90° rotation creates very obvious low-level cues — edge orientations flip — so the network can answer by detecting whether most edges are horizontal or vertical, without learning what’s in the image. Doersch’s 8-way classification is harder to shortcut because all eight spatial relations have similar low-level statistics, so the network is more likely to actually need semantic recognition.
Why does the chromatic-aberration shortcut count as a "Clever Hans" failure rather than a useful learned feature?
Because the feature it learned (lens position from colour shift) is specific to the proxy task and does not transfer. Recognising chromatic aberration tells you nothing about cat ears, road signs, or tumour boundaries — the things you actually want the encoder to know about. The network’s training accuracy on patch-position prediction was high, but its downstream-task accuracy stayed low because it hadn’t learned anything useful. That gap — high pretext accuracy, low downstream usefulness — is the diagnostic signature of a Clever Hans shortcut. The fix is to break the shortcut (drop colour channels) and verify that downstream performance recovers.
Connections
- self-supervised-learning — the umbrella; pretext tasks are the second of the three families covered.
- contrastive-learning — the modern alternative; mostly displaced pretext tasks but uses the same “throwaway proxy task” pattern.
- representation-learning — the goal pretext tasks serve.
- clever-hans-effect — the recurring failure mode; chromatic aberration is the textbook example.
- autoencoder — another SSL family, with a different proxy (reconstruct the input).
- transfer-learning — pretext-pretrained encoders transfer the same way ImageNet-pretrained ones do.