TIP — Detective vs artist

An autoencoder is an artist: shown a dog, it must redraw the dog pixel-by-pixel. To do well it must memorise everything — including the grass texture and the lighting on the dog’s fur, neither of which has anything to do with “dog-ness.” Contrastive learning is a detective: shown a photo of a suspect and a photo of the same suspect in a hat and sunglasses, all that’s required is to confirm “same person.” No drawing needed. The detective can ignore hat / sunglasses / lighting / pose because those don’t change identity — and that’s exactly the kind of representation we want.

Contrastive learning replaces “rebuild this image” with “tell same-content apart from different-content.” Two augmented views of one dog should map to nearby vectors; that dog and a penguin should map to distant vectors. The network never has to draw a single pixel — it just has to know which images are “the same thing.” That turns out to be enough to learn representations as good as fully-supervised ones, on no labels at all.

The core idea

Pick a batch of images. For each image $x$, generate two random augmentations $t(x)$ and $t'(x)$ — crop, flip, colour jitter, blur, etc. They look pixel-different but show the same object. Encode all augmentations through a network; in the resulting feature space:

  • Pull the two views of the same image together (positive pair).
  • Push them away from views of all other images in the batch (negatives).

That’s it. No reconstruction, no labels — just a “find-your-twin” game played in feature space.

The network has to learn what’s invariant under the augmentations: a dog is still a dog whether cropped, recoloured, or blurred. Learning this invariance forces the encoder to extract content (object identity, structure) and ignore appearance (colour, exact crop).

SimCLR: the canonical recipe

[Chen, Kornblith, Norouzi, Hinton — A Simple Framework for Contrastive Learning of Visual Representations, 2020.]

The pipeline

For a minibatch of $N$ images $\{x_1, \dots, x_N\}$:

  1. Augment. Sample two augmentation operators $t, t'$ per image. Get $t(x_k)$, $t'(x_k)$. Now you have $2N$ augmented views.
  2. Encode. Pass each view through a base encoder $f$ (typically ResNet-50): $h_i = f(\tilde{x}_i)$. The $h_i$ vectors are the latent representations — what we’ll keep at the end.
  3. Project. Pass each $h_i$ through a small MLP projection head $g$: $z_i = g(h_i)$. The $z_i$ vectors are where the loss is computed. Note: although $z$ is also a “latent” vector in the literal sense (hidden, internal), in SimCLR the kept representation is $h$, not $z$ — see latent-representation for why.
  4. Contrast. Compute pairwise similarities $\mathrm{sim}(z_i, z_j) = z_i^\top z_j / (\|z_i\|\,\|z_j\|)$ (cosine similarity). Apply contrastive loss.
  5. Update. Backprop through $f$ and $g$ jointly.

After training, throw away $g$ and use $f$ (and the representation $h$) for downstream tasks.
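The five steps can be sketched end-to-end with toy stand-ins — hypothetical linear maps for the encoder and projection head, and Gaussian noise standing in for augmentation (real SimCLR uses a ResNet-50, a 2-layer MLP head, and image augmentations):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D_IN, D_H, D_Z = 8, 32, 16, 4  # batch, input dim, h dim, z dim (toy sizes)

# Stand-ins for the real networks (toy random linear maps).
W_f = rng.normal(size=(D_IN, D_H))
W_g = rng.normal(size=(D_H, D_Z))

def f(x):   # base encoder: x -> h (kept after training)
    return np.maximum(x @ W_f, 0.0)

def g(h):   # projection head: h -> z (used only for the loss, then discarded)
    return h @ W_g

x = rng.normal(size=(N, D_IN))               # a batch of N "images"
view1 = x + 0.1 * rng.normal(size=x.shape)   # two random "augmentations"
view2 = x + 0.1 * rng.normal(size=x.shape)

views = np.concatenate([view1, view2])       # 2N augmented views
h = f(views)                                 # latent representations, (2N, D_H)
z = g(h)                                     # loss-space projections, (2N, D_Z)
print(h.shape, z.shape)                      # (16, 16) (16, 4)
```

The only point of the sketch is the shapes: $2N$ views go in, $h$ is what survives training, $z$ exists only so the loss has something to compare.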

The loss (NT-Xent)

For a positive pair $(i, j)$ — two augmentations of the same image — within a batch of $2N$ views:

$$
\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}
$$

where $\mathrm{sim}(u, v) = u^\top v / (\|u\|\,\|v\|)$ and $\tau$ is the temperature hyperparameter.

Reading the formula:

  • Numerator (pull). Similarity between the positive pair. The loss decreases as this grows. The network maximises $\mathrm{sim}(z_i, z_j)$.
  • Denominator (push). Sum of similarities with all other views in the batch ($2N - 2$ negatives, plus the positive term itself). The loss decreases as the negative terms shrink. The network minimises $\mathrm{sim}(z_i, z_k)$ for $k \neq i, j$.
  • Temperature $\tau$. Scales the logits. Low $\tau$ → sharp decisions, hard distinctions; high $\tau$ → softer, more permissive. Typical $\tau \approx 0.1$–$0.5$.

The total batch loss averages $\ell_{i,j}$ over all positive pairs in the batch, each pair counted in both orders ($\ell_{i,j}$ and $\ell_{j,i}$).

It’s just a softmax cross-entropy classification problem in disguise: “given $z_i$, which of the other $2N - 1$ vectors is its augmentation twin?”
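That framing translates directly to code. A minimal NumPy sketch of NT-Xent — the pairing convention (row $i$ twinned with row $i + N$) and the toy usage below are assumptions of this sketch, not part of any library API:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N projections; rows i and i+N are a positive pair."""
    n2 = z.shape[0]                                   # 2N views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine = dot of unit vectors
    sim = z @ z.T / tau                               # temperature-scaled logits
    np.fill_diagonal(sim, -np.inf)                    # a view never contrasts with itself
    pos = (np.arange(n2) + n2 // 2) % n2              # index of each view's twin
    # softmax cross-entropy: -log probability assigned to the twin
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logp[np.arange(n2), pos].mean()

rng = np.random.default_rng(0)
zA = rng.normal(size=(4, 8))
aligned = np.concatenate([zA, zA + 0.01 * rng.normal(size=zA.shape)])  # true twins
shuffled = np.concatenate([zA, rng.normal(size=zA.shape)])             # unrelated "twins"
print(nt_xent(aligned) < nt_xent(shuffled))  # aligned positive pairs give lower loss
```

Masking the diagonal with $-\infty$ before the softmax is exactly the $\mathbb{1}_{[k \neq i]}$ indicator in the formula.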

Why a projection head?

This is the surprising design choice — and why “the representation” can mean two different things.

CAUTION — In SimCLR, $h$ is the keeper, not $z$

|  | Comes from | Used for | After training |
| --- | --- | --- | --- |
| $h$ | Encoder $f$ | Downstream tasks | Keep |
| $z$ | Projection head $g$ | Computing the contrastive loss | Discard |

Empirically, training a linear classifier on $h$ outperforms training one on $z$ — by a lot. The standard explanation: the contrastive loss forces $z$ to be invariant to exactly the augmentations applied (rotation, colour, etc.). That’s good for the contrastive task but throws away information that’s useful for some downstream tasks. The projection head absorbs that information loss, leaving $h$ to retain richer general-purpose features.

This is a clean separation worth remembering: the proxy task gets its specialised vector ($z$); the kept representation ($h$) comes from one layer earlier.

Augmentation choice matters enormously

The augmentations define what the encoder is told to be invariant to. Bad choice → bad representation. SimCLR’s recipe (cropping + colour distortion is the killer combo) was found empirically.

Why both? Cropping alone leaves a positional / colour-statistics shortcut: two crops of the same image have similar colour histograms, so the network just matches colour statistics and ignores content. Adding heavy colour distortion breaks that shortcut, forcing actual content recognition. This is a Clever-Hans-style failure mode caught and patched (clever-hans-effect).

Standard SimCLR augmentations: random crop+resize, random horizontal flip, random colour jitter, random colour drop (greyscale), Gaussian blur. Rotation works for some datasets, hurts others (e.g. natural-orientation matters for digits).
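As a concrete (if toy) illustration of such a pipeline — pure NumPy stand-ins for flip, colour jitter, and greyscale; real SimCLR also applies random crop+resize and blur, typically via an image library rather than raw arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip(img):
    # Random horizontal flip with probability 0.5.
    return img[:, ::-1] if rng.random() < 0.5 else img

def colour_jitter(img, strength=0.4):
    # Per-channel gain: crudely mimics brightness/colour distortion.
    scale = 1.0 + strength * rng.uniform(-1, 1, size=3)
    return np.clip(img * scale, 0.0, 1.0)

def random_greyscale(img, p=0.2):
    # Random colour drop: average the channels with probability p.
    if rng.random() < p:
        grey = img.mean(axis=2, keepdims=True)
        return np.repeat(grey, 3, axis=2)
    return img

def augment(img):
    return random_greyscale(colour_jitter(random_flip(img)))

img = rng.uniform(size=(32, 32, 3))          # an HxWx3 float image in [0, 1]
view1, view2 = augment(img), augment(img)    # two views of the same image
print(view1.shape == view2.shape == img.shape)  # True: same shape, different pixels
```

Each call samples fresh random parameters, so the two views differ in pixels while sharing content — exactly the positive-pair construction the loss needs.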

Negatives matter — bigger batch is better

The loss’s denominator is summed over all other views in the batch. More views = more negatives = harder discrimination task = better representations. This is why SimCLR uses huge batch sizes (4096+ on TPUs). Methods like MoCo address this with a memory bank of cached negatives; methods like BYOL claim to remove the need for negatives entirely (the why is debated).
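The memory-bank idea can be sketched as a FIFO queue of past projections — a hypothetical minimal class, not MoCo’s actual implementation (real MoCo adds a momentum-updated key encoder and normalisation):

```python
import numpy as np
from collections import deque

class MemoryBank:
    """FIFO cache of past projections, reused as extra negatives.

    Decouples the number of negatives from the batch size.
    """
    def __init__(self, size):
        self.queue = deque(maxlen=size)   # oldest embeddings fall off the end

    def push(self, z):                    # cache this batch's projections
        self.queue.extend(np.asarray(z))

    def negatives(self):                  # all cached embeddings, shape (K, d)
        return np.stack(self.queue) if self.queue else np.empty((0, 0))

bank = MemoryBank(size=6)
rng = np.random.default_rng(0)
for _ in range(4):                        # 4 batches of 2 -> 8 pushed, 6 kept
    bank.push(rng.normal(size=(2, 5)))
print(bank.negatives().shape)             # (6, 5)
```

A batch of 2 now contrasts against 6 cached negatives instead of 2 — the same decoupling that lets MoCo avoid SimCLR’s 4096+ batches.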

Results

The headline: SimCLR with a linear classifier on top of $h$ matches the top-1 accuracy of a fully-supervised ResNet-50 on ImageNet (at 4× encoder width; the base model comes close).

| Method | Architecture | Top-1 |
| --- | --- | --- |
| Supervised ResNet-50 | ResNet-50 | ~76% |
| SimCLR | ResNet-50 | 69.3% |
| SimCLR (2×) | ResNet-50 (2×) | 74.2% |
| SimCLR (4×) | ResNet-50 (4×) | 76.5% |

With only 1% of ImageNet labels, SimCLR (4×) reaches 85.8% top-5 — beating heavily-engineered semi-supervised baselines that do use the label distribution structure. In other words: with a strong self-supervised pretrained encoder, you barely need labels at all.

Application: unsupervised dataset visualization (t-SimCNE)

A nice byproduct of strong contrastive representations: if you constrain the projection space to be 2-dimensional and replace cosine similarity with a Cauchy kernel $1 / (1 + \|z_i - z_j\|^2)$, the contrastive objective becomes a visualization method. t-SimCNE (Böhm, Berens, Kobak 2023) does exactly this and produces 2-d embeddings of CIFAR-10 where:

  • Each class forms a coherent macro-cluster — all without ever showing a class label.
  • Within each macro-cluster, sub-structure emerges that captures finer semantics: “bright horses” / “dark horses” / “mounted horses” / “horse heads” all separate; “red cars” / “metallic cars” / “colourful cars” / “duplicate cars” form their own neighbourhoods.

The takeaway is that contrastive learning isn’t just for downstream classifiers. The induced latent geometry is rich enough to be a competitor to t-SNE / UMAP for unsupervised data exploration.
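The Cauchy similarity that replaces cosine is simple enough to state in a few lines (a sketch of the kernel only; t-SimCNE couples it with the full contrastive objective over 2-d outputs):

```python
import numpy as np

def cauchy_sim(a, b):
    # Cauchy kernel: near-identical points score ~1; distant points decay
    # polynomially (heavier-tailed than a Gaussian, as in t-SNE).
    return 1.0 / (1.0 + np.sum((a - b) ** 2))

near = cauchy_sim(np.array([0.0, 0.0]), np.array([0.1, 0.0]))
far  = cauchy_sim(np.array([0.0, 0.0]), np.array([5.0, 0.0]))
print(round(near, 3), round(far, 3))   # 0.99 0.038
```

The heavy tail is what spreads clusters apart in 2-d instead of crushing them together — the same trick that distinguishes t-SNE from its Gaussian predecessor.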

What contrastive learning gets you over autoencoders

|  | Autoencoder | Contrastive |
| --- | --- | --- |
| Loss target | Reconstruct pixels | Tell same-vs-different |
| What it preserves | Whatever explains the most pixels (pose, colour, large structure) | Whatever survives augmentation (content, semantic identity) |
| Throws away | Fine details; everything below the bottleneck capacity | Anything an augmentation can change (colour, crop, exact pixel layout) |
| Downstream representation | Bottleneck $z$ | Encoder output $h$ |
| Performance vs supervised | Substantially worse | Comparable |

The shift from “rebuild the picture” to “recognise the same thing” is the whole story. Reconstruction wastes capacity on pixel detail; contrast forces the encoder to focus on what’s actually invariant about an object.

Connections

  • latent-representation — clarifies why $h$ (encoder output) is the kept latent representation in SimCLR while $z$ (projection) is discarded — the opposite of the autoencoder convention.
  • self-supervised-learning — contrastive is the strongest current example.
  • representation-learning — the goal contrastive serves.
  • autoencoder — the alternative SSL family; contrastive outperforms reconstruction-based methods on most vision benchmarks.
  • data-augmentation — turns from a regulariser into the training signal itself; the augmentation set defines what the encoder learns to ignore.
  • transfer-learning — SimCLR-pretrained encoders are now a default starting point for vision transfer, in place of (or alongside) ImageNet-supervised backbones.
  • clever-hans-effect — augmentation choice is the main defence against the network solving the contrastive task via colour-statistics shortcuts instead of content.