Raw data is high-dimensional and dumb. A million pixels do not say “this is a cat”; they say “intensities at a million coordinates.” Representation learning is the deliberate effort to translate that mess into a short vector where each dimension means something — features that downstream algorithms can actually work with. The “right” representation makes hard problems easy: a linear classifier on top of a good representation can match a deep CNN trained from raw pixels with full supervision.

What “representation” means here

Given raw input x (pixels, words, audio), find a function f that produces a vector z = f(x) — the representation — with two properties (a minimal sketch follows the list):

  1. Lower-dimensional / structured. z has many fewer dimensions than x, or at least more semantically meaningful axes.
  2. Useful for downstream tasks. Classification, retrieval, clustering, generation — whatever you want to do with x becomes easier given z.
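
Concretely, a minimal sketch of such an encoder f in PyTorch; the layer sizes and the 28×28 input are illustrative assumptions, not anything fixed by the text:

```python
import torch
import torch.nn as nn

# f: raw pixels in, short vector out. All sizes here are illustrative.
encoder = nn.Sequential(
    nn.Flatten(),            # 1x28x28 image -> 784 raw numbers
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 16),      # z: 16 axes instead of 784 pixel intensities
)

x = torch.rand(1, 1, 28, 28)  # a raw input
z = encoder(x)                # its representation, shape (1, 16)
```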

The latent space is sometimes called the embedding space, especially in NLP (word embeddings) or retrieval contexts. The per-input vector z is the latent-representation of x — “latent” because it’s an internal vector the network builds, not directly observed in the input or output.

ASIDE — Beyond images: word embeddings as representation learning

The same idea recurs across domains. Word embeddings (word2vec, GloVe) take raw word strings and learn a vector representation where semantically similar words end up close (king near queen, Paris near France). The training task is different — typically predicting a word from its context — but the recipe is identical: invent a self-supervised proxy task, train an encoder, throw the proxy head away, keep the embedding. The whole modern transformer pipeline (BERT, GPT, sentence embeddings, CLIP) is representation learning at scale.
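
As a rough illustration of that recipe (a toy three-sentence corpus and arbitrary hyperparameters, purely to show the shape of the loop), gensim’s Word2Vec runs the whole pipeline:

```python
from gensim.models import Word2Vec

# Toy corpus; real embeddings need vastly more text.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["paris", "is", "in", "france"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# The context-prediction head is gone; the embedding table is the deliverable.
vec_king = model.wv["king"]
print(model.wv.similarity("king", "queen"))
```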

Why we need it

Raw inputs are bad inputs:

  • Too many dimensions. A 1000×1000 image has ~1M numbers. Most ML algorithms choke or overfit on that scale.
  • The wrong axes. “Pixel intensity at row 73, column 412” is not a feature anyone cares about. We care about “is there a face here,” “what’s the pose,” “what’s the colour scheme” — none of which are individual pixels.
  • Sparse meaning. Real images sit on a tiny manifold inside the vast pixel space. Only a vanishingly small fraction of the possible greyscale images correspond to anything meaningful. The representation should describe positions on that manifold, not in the ambient pixel space.

The old way: humans hand-designed features (PCA, SIFT, HOG). The new way: let a neural network learn the encoder from data.
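
For contrast, the old way fits in a few lines; a hedged sketch with scikit-learn’s PCA on placeholder data (the shapes are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 784)                # 100 flattened 28x28 images
Z = PCA(n_components=16).fit_transform(X)   # 784 dims -> 16 dims, linearly
print(Z.shape)                              # (100, 16)
```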

How representations get learned (this module)

Three flavours covered in week 6, each defined by the proxy task that forces the encoder to extract structure:

| Method | Training signal | What’s kept |
| --- | --- | --- |
| autoencoder | Reconstruct x through a bottleneck | Bottleneck code z |
| pretext-task (Doersch context prediction, rotation prediction) | Predict a label fabricated from the data | Mid-layer features (e.g. AlexNet fc6) |
| contrastive-learning (SimCLR) | Augmentations of the same image close, others far | Encoder output h (not the projection z) |
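
As a concrete instance of the table’s first row, a minimal bottleneck autoencoder sketch in PyTorch (layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(32, 784)          # a batch of flattened images
z = encoder(x)                   # bottleneck code: the part that is kept
x_hat = decoder(z)               # reconstruction: the throwaway proxy task
loss = F.mse_loss(x_hat, x)      # train on this, then discard the decoder
```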

Plus the supervised special case:

  • transfer-learning — train a CNN on a large labelled source dataset (ImageNet); the early conv layers’ features transfer to other tasks. Same idea, different proxy task (classification on a related domain instead of self-supervision).

What unites all four: the proxy task is throwaway. After training, you discard the part of the network specific to it and keep the encoder. The encoder is the deliverable.
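
A hedged sketch of that discard-the-head step for the transfer-learning case, using torchvision’s pretrained ResNet-50 (assumes torchvision ≥ 0.13 for the weights enum):

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Identity()          # discard the proxy head (1000-way classifier)
for p in model.parameters():
    p.requires_grad = False       # freeze: the encoder is the deliverable

# model(images) now returns 2048-d representations instead of class logits.
```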

How to know a representation is good

You can’t tell by looking at z — it’s just a vector of numbers. The verdict is empirical:

  • Linear evaluation. Freeze the encoder, train a linear classifier on top with limited labels. If accuracy is high, the representation has separated the classes for you. SimCLR’s h matches fully supervised ResNet-50 accuracy on ImageNet under linear evaluation — a representation good enough that “draw a line through it” suffices. (A sketch follows this list.)
  • Few-shot / semi-supervised. Same as above but with very few labels. A good representation works with 1% of the labels.
  • Clustering. Plot z for unlabelled data; if natural classes form distinct clusters (without ever showing class labels), the representation is capturing the relevant structure. Autoencoders cluster MNIST digits this way (autoencoder).
  • Nearest-neighbour search. Pick a query, find the images closest in z-space. If they’re semantically similar (same object, similar pose), the representation is good.
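
A minimal linear-evaluation sketch with scikit-learn; random arrays stand in for a frozen encoder’s outputs, so the accuracy printed here is chance, not a result:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Z_train = rng.normal(size=(500, 16))     # stand-in for frozen encoder outputs
y_train = rng.integers(0, 10, size=500)
Z_test = rng.normal(size=(100, 16))
y_test = rng.integers(0, 10, size=100)

# If the representation is good, a line through z-space separates the classes.
clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print("linear-probe accuracy:", clf.score(Z_test, y_test))
```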

Disentanglement: an aspirational property

A representation is disentangled if individual dimensions of z correspond to independent factors of variation. For an MNIST autoencoder with a two-dimensional code: z₁ = stroke thickness, z₂ = digit angle. Sweeping one while holding the other fixed varies only that physical attribute.

Disentanglement is desirable because it makes the representation interpretable and controllable (e.g. “make this digit thicker” → adjust z₁). It’s also rare and unreliable — autoencoders sometimes achieve it, sometimes not, depending on architecture, initialisation, and data. There’s no architectural trick that guarantees it.
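
A sketch of that sweep; the decoder below is an untrained stand-in for a real 2-D-code MNIST decoder, so it only shows the mechanics:

```python
import torch
import torch.nn as nn

# Untrained stand-in for a trained 2-D-code MNIST decoder.
decoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 784))

z2_fixed = torch.tensor(0.5)
for z1 in torch.linspace(-2.0, 2.0, steps=7):
    z = torch.stack([z1, z2_fixed]).unsqueeze(0)   # shape (1, 2)
    image = decoder(z).reshape(28, 28)
    # With a disentangled code, only stroke thickness would change across
    # this loop; the digit's angle (z2) stays fixed.
```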

The Tale of Two Representations

CAUTION — “z” means different things in AEs vs. SimCLR

Don’t conflate the bottleneck z of an autoencoder with the projection-head output z of SimCLR. They share the same letter but play different roles.

| | Autoencoder | SimCLR |
| --- | --- | --- |
| Encoder produces | z (the bottleneck) | h |
| Projection head produces | — (no head) | z |
| After training, keep | z | h |
| Why | z is the compressed representation | h retains general info; z is specialised to the contrastive task |

In both cases, the kept vector is what we mean by “the representation.”
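
The clash, spelled out in code: a PyTorch sketch with made-up sizes, not SimCLR’s actual architecture:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128))
projection_head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))

x = torch.rand(8, 784)
h = encoder(x)              # SimCLR keeps this: the general representation
z = projection_head(h)      # used only inside the contrastive loss, then dropped

# In an autoencoder, the bottleneck (called z there) plays the role h plays here.
```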

Connections

  • latent-representation — the per-input vector z that is the representation; clarifies the AE-vs-SimCLR naming clash.
  • autoencoder — simplest representation-learning model; reconstruction as proxy task.
  • contrastive-learning — modern, stronger proxy task; matches supervised performance.
  • pretext-task — older self-supervised approach; useful baseline.
  • transfer-learning — supervised representation learning (uses labels for the proxy task).
  • self-supervised-learning — the umbrella covering AE, pretext, and contrastive.
  • clever-hans-effect — what happens when the representation is good for the wrong reason: it solves the proxy via a shortcut, doesn’t capture the intended structure.