Raw data is high-dimensional and dumb. A million pixels do not say “this is a cat”; they say “intensities at a million coordinates.” Representation learning is the deliberate effort to translate that mess into a short vector where each dimension means something — features that downstream algorithms can actually work with. The “right” representation makes hard problems easy: a linear classifier on top of a good representation can match a deep CNN trained from raw pixels with full supervision.
What “representation” means here
Given raw input x (pixels, words, audio), find a function f that produces a vector z = f(x) — the representation — with two properties:
- Lower-dimensional / structured. z has many fewer dimensions than x, or at least more semantically meaningful axes.
- Useful for downstream tasks. Classification, retrieval, clustering, generation — whatever you want to do with x becomes easier given z.
The latent space is sometimes called the embedding space, especially in NLP (word embeddings) or retrieval contexts. The per-input vector z is the latent-representation of x — “latent” because it’s an internal vector the network builds, not directly observed in the input or output.
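To pin the notation down in code: an encoder is just a function f mapping a high-dimensional x to a short z. A minimal numpy sketch with random (untrained) weights, purely to fix the shape contract:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical untrained encoder: one random linear map plus a
# nonlinearity. A real encoder is a trained deep network; this only
# illustrates the shape contract z = f(x).
D, d = 784, 32                 # raw dimension (28x28 pixels) vs. code size
W = rng.normal(0, 1 / np.sqrt(D), size=(d, D))

def f(x):
    """Encoder: map raw input x of shape (D,) to representation z of shape (d,)."""
    return np.tanh(W @ x)

x = rng.random(D)              # a fake flattened "image"
z = f(x)                       # 32 numbers instead of 784
```

Everything in this module is about making the weights of f meaningful rather than random.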
ASIDE — Beyond images: word embeddings as representation learning
The same idea recurs across domains. Word embeddings (word2vec, GloVe) take raw word strings and learn a vector representation where semantically similar words end up close (king near queen, Paris near France). The training task is different — typically predicting a word from its context — but the recipe is identical: invent a self-supervised proxy task, train an encoder, throw the proxy head away, keep the embedding. The whole modern transformer pipeline (BERT, GPT, sentence embeddings, CLIP) is representation learning at scale.
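“End up close” is usually measured with cosine similarity. A toy sketch with invented 3-D vectors (not real word2vec weights; real embeddings are learned and typically 100- to 300-dimensional):

```python
import numpy as np

# Invented toy embeddings for illustration only.
emb = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.8, 0.9, 0.1]),
    "banana": np.array([0.1, 0.0, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, ~0 for unrelated ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_king_queen  = cosine(emb["king"], emb["queen"])
sim_king_banana = cosine(emb["king"], emb["banana"])
# Related words score much higher than unrelated ones.
```

The same similarity measure reappears below in nearest-neighbour evaluation and in contrastive losses.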
Why we need it
Raw inputs are bad inputs:
- Too many dimensions. A 1000 × 1000 image has ~1M numbers. Most ML algorithms choke or overfit on that scale.
- The wrong axes. “Pixel intensity at row 73, column 412” is not a feature anyone cares about. We care about “is there a face here,” “what’s the pose,” “what’s the colour scheme” — none of which are individual pixels.
- Sparse meaning. Real images sit on a tiny manifold inside the vast pixel space. Only a vanishing fraction of the possible greyscale images corresponds to anything meaningful. The representation should describe positions on that manifold, not in the ambient pixel space.
The old way: humans hand-designed features (PCA, SIFT, HOG). The new way: let a neural network learn the encoder from data.
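The old way can be made concrete: PCA is a linear encoder whose weights come from a singular value decomposition rather than gradient descent. A minimal numpy sketch on synthetic data that genuinely lies near a 2-D plane inside a 50-D ambient space:

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 samples living (approximately) on a 2-D plane inside 50-D space.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# PCA: centre the data, then project onto the top-k right singular vectors.
Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T             # the learned (linear) representation

# Fraction of variance captured by the top-2 components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
```

PCA can only find linear structure; the learned encoders below exist precisely because real image manifolds are not flat.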
How representations get learned (this module)
Three flavours covered in week 6, each defined by the proxy task that forces the encoder to extract structure:
| Method | Training signal | What’s kept |
|---|---|---|
| autoencoder | Reconstruct through a bottleneck | Bottleneck code |
| pretext-task (Doersch context prediction, rotation prediction) | Predict a label fabricated from the data | Mid-layer features (e.g. AlexNet fc6) |
| contrastive-learning (SimCLR) | Augmentations of same image close, others far | Encoder output h (not the projection z) |
Plus the supervised special case:
- transfer-learning — train a CNN on a large labelled source dataset (ImageNet); the early conv layers’ features transfer to other tasks. Same idea, different proxy task (classification on a related domain instead of self-supervision).
What unites all four: the proxy task is throwaway. After training, you discard the part of the network specific to it and keep the encoder. The encoder is the deliverable.
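The “label fabricated from the data” idea is easiest to see for rotation prediction: every unlabelled image yields four training pairs, with the rotation index as the label, and no human annotation anywhere. A numpy sketch (random arrays stand in for images; the prediction network itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rotation_batch(images):
    """Fabricate a 4-way classification dataset from unlabelled images.

    Each image is rotated by 0/90/180/270 degrees; the label is the
    rotation index. A network trained to predict it must learn object
    structure -- and the prediction head is discarded afterwards.
    """
    xs, ys = [], []
    for img in images:
        for k in range(4):                 # k quarter-turns
            xs.append(np.rot90(img, k))
            ys.append(k)
    return np.stack(xs), np.array(ys)

images = rng.random((8, 32, 32))           # 8 fake unlabelled "images"
X, y = make_rotation_batch(images)
# 8 unlabelled images -> 32 (image, rotation-label) training pairs
```

The labels cost nothing, which is the entire point: supervision is manufactured from the data itself.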
How to know a representation is good
You can’t tell by looking at z — it’s just a vector of numbers. The verdict is empirical:
- Linear evaluation. Freeze the encoder, train a linear classifier on top with limited labels. If accuracy is high, the representation has separated classes for you. SimCLR’s representation matches fully supervised ResNet-50 accuracy on ImageNet under linear evaluation — a representation good enough that “draw a line through it” suffices.
- Few-shot / semi-supervised. Same as above but with very few labels. A good representation works with 1% of the labels.
- Clustering. Plot z for unlabelled data; if natural classes form distinct clusters (without ever showing class labels), the representation is capturing the relevant structure. Autoencoders cluster MNIST digits this way (autoencoder).
- Nearest-neighbour search. Pick a query, find images closest in z-space. If they’re semantically similar (same object, similar pose), the representation is good.
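The linear-evaluation protocol itself fits in a few lines: freeze the encoder, fit only a linear head on z. In this sketch the “encoder” is a fixed random projection and the data is a deliberately easy toy problem; the point is the protocol, not the numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "encoder": a fixed random projection standing in for a
# pre-trained network. Only the linear head below gets trained.
W_enc = rng.normal(size=(16, 100))
def encode(X):
    return np.tanh(X @ W_enc.T)

# Two well-separated classes in raw 100-D space.
X0 = rng.normal(loc=-2.0, size=(100, 100))
X1 = rng.normal(loc=+2.0, size=(100, 100))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

Z = encode(X)                              # frozen features, never updated

# Train only the linear head: logistic regression by plain gradient descent.
w, b = np.zeros(Z.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(Z @ w + b)))     # predicted P(class 1)
    w -= 0.5 * (Z.T @ (p - y) / len(y))
    b -= 0.5 * (p - y).mean()

acc = ((Z @ w + b > 0).astype(int) == y).mean()
```

If accuracy stays high with the encoder frozen, the classes were already separated in z-space, which is exactly what the protocol is designed to detect.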
Disentanglement: an aspirational property
A representation is disentangled if individual dimensions of z correspond to independent factors of variation. For an MNIST autoencoder with a 2-D code: z₁ = stroke thickness, z₂ = digit angle. Sweeping one while holding the other fixed varies only that physical attribute.
Disentanglement is desirable because it makes the representation interpretable and controllable (e.g. “make this digit thicker” → adjust z₁). It’s also rare and unreliable — autoencoders sometimes achieve it, sometimes not, depending on architecture, init, data. There’s no architectural trick that guarantees it.
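A latent sweep is how disentanglement gets checked in practice: vary one dimension of z, hold the rest fixed, decode, and see what changes. Real decoders are learned; the toy decoder below is hand-built so that z₁ controls thickness and z₂ controls angle by construction, which is exactly the property a trained autoencoder may or may not exhibit:

```python
import numpy as np

def decode(z, size=28):
    """Toy hand-built decoder standing in for a learned one: draws a
    stroke whose half-width is z[0] and whose angle (radians) is z[1]."""
    img = np.zeros((size, size))
    c = size // 2
    direction = np.array([np.cos(z[1]), np.sin(z[1])])
    r = int(max(1, round(z[0])))               # stroke half-width from z[0]
    for t in np.linspace(-c + 1, c - 1, 4 * size):
        row, col = (c + t * direction).astype(int)
        img[max(0, row - r):row + r, max(0, col - r):col + r] = 1.0
    return img

# Sweep z1 (thickness) while holding z2 (angle) fixed:
thin  = decode(np.array([1.0, 0.3]))
thick = decode(np.array([3.0, 0.3]))
# Only the amount of ink changes; the stroke's orientation does not.
```

With a real autoencoder the same sweep is run through the trained decoder, and whether a single attribute changes is precisely what you cannot guarantee in advance.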
The Tale of Two Representations
CAUTION — “z” means different things in AEs vs. SimCLR
Don’t conflate the bottleneck z of an autoencoder with the projection-head output z of SimCLR. They share the same letter but play different roles.
| | Autoencoder | SimCLR |
|---|---|---|
| Encoder produces | z (the bottleneck) | h |
| Projection head produces | — (no head) | z = g(h) |
| After training, keep | z | h |
| Why | z is the compressed representation | h retains general info; z is specialised to the contrastive task |
In both cases, the kept vector is what we mean by “the representation.”
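As a code sketch of the SimCLR convention (random stand-in weights, illustration only): training computes the contrastive loss on z = g(h), but deployment keeps h and discards g:

```python
import numpy as np

rng = np.random.default_rng(0)

W_f = rng.normal(size=(2048, 784)) * 0.01   # stand-in encoder weights
W_g = rng.normal(size=(128, 2048)) * 0.01   # stand-in projection head

def f(x):
    """Encoder -> h. This is the part you keep after training."""
    return np.maximum(0.0, W_f @ x)

def g(h):
    """Projection head -> z. Used only inside the contrastive loss."""
    return W_g @ h

x = rng.random(784)
h = f(x)                                    # 2048-d: the representation
z = g(h)                                    # 128-d: discarded at deployment
```

In an autoencoder the analogous split is encoder/decoder: the decoder plays the throwaway role, and the bottleneck output is what you keep.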
Why is representation learning often described as "the heart of modern deep learning"?
Because the deep network’s final layer (the supervised classifier) is small and easy — what’s hard, and what makes deep learning work, is the learned features that come out of the layers below. A pre-trained network’s penultimate-layer features (fc7 in AlexNet-style architectures) can be transferred to a totally different task with a tiny new head; that’s the whole power of transfer-learning. Self-supervised methods generalise this — the representation is the deliverable, the original task is incidental. SimCLR shows that with no labels at all you can produce features as good as fully supervised ones. The representation, not the classifier, is the thing the network is really for.
Connections
- latent-representation — the per-input vector that is the representation; clarifies the AE-vs-SimCLR naming clash.
- autoencoder — simplest representation-learning model; reconstruction as proxy task.
- contrastive-learning — modern, stronger proxy task; matches supervised performance.
- pretext-task — older self-supervised approach; useful baseline.
- transfer-learning — supervised representation learning (uses labels for the proxy task).
- self-supervised-learning — the umbrella covering AE, pretext, and contrastive.
- clever-hans-effect — what happens when the representation is good for the wrong reason: it solves the proxy via a shortcut, doesn’t capture the intended structure.