“Latent” literally means hidden. A latent representation is a vector inside the network that you don’t see directly — it’s an internal encoding the network builds en route from input to output. The key word is “representation”: each latent vector is supposed to stand for the input in a more useful form. Pixels become embeddings; embeddings are what algorithms can actually work with. The slippery part is that “the latent representation” picks out different vectors depending on which architecture you’re looking at — and confusing which one is “the keeper” is one of the most common errors in week 6.

Definitions

  • Latent vector z (or h): the activations of some hidden layer in the network, viewed as a coordinate vector. The dimensionality is usually much smaller than the input dimension.
  • Latent space Z: the vector space these vectors live in. Often ℝ^d with d ≪ D, the input dimension.
  • Latent representation of x: the specific latent vector z the network produces when fed x. Sometimes called the embedding or code.

“Latent” because the vector isn’t part of the input or the output — it’s an intermediate, hidden state of the computation. We can extract it (it’s just an activation), but we don’t supervise it directly; it emerges from training.
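Since the latent is just an activation, pulling it out of a trained network takes a few lines. A minimal sketch (PyTorch assumed; the network and sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# A toy network: 784 -> 128 -> 2 -> 784. The 2-d activation in the middle is
# what we call the latent vector z; nothing in the loss refers to it by name.
net = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 2),                  # <- the hidden layer we treat as latent
    nn.ReLU(),
    nn.Linear(2, 784),
)

captured = {}
def save_latent(module, inputs, output):
    captured["z"] = output.detach()     # stash the activation as it flows past

net[2].register_forward_hook(save_latent)

x = torch.rand(1, 784)                  # a flattened input
_ = net(x)                              # ordinary forward pass
z = captured["z"]                       # the latent representation of x, shape (1, 2)
```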

Properties we want

For the latent representation to be useful, we want z to be:

  1. Lower-dimensional than x — so it’s a compression, forcing the network to throw away the unimportant.
  2. Structured — semantically similar inputs map to nearby points; semantically different inputs map to distant points (this gives clustering for free; see autoencoder).
  3. Smooth / continuous — small changes in z correspond to small, meaningful changes in the decoded image (enables latent walks).
  4. Useful downstream — a small classifier on z should solve tasks that are hard to solve from the raw input.

These properties are not guaranteed by the architecture; they emerge from training under the right objective. A well-trained autoencoder produces all four; a badly configured one (e.g. no bottleneck, so the network can simply learn the identity map) produces none of them.
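A sketch of property 4 in code: a linear probe on z. PyTorch assumed; the encoder here is a stand-in for one you have already trained, and the data is random, purely to show the wiring:

```python
import torch
import torch.nn as nn

D, d, n_classes = 784, 16, 10
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, d))  # stand-in for a trained encoder
probe = nn.Linear(d, n_classes)                 # the only thing we train
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(32, D)                           # a batch of flattened images (dummy data)
y = torch.randint(0, n_classes, (32,))          # their labels

with torch.no_grad():
    z = encoder(x)                              # latents are frozen; only the probe learns
loss = loss_fn(probe(z), y)
opt.zero_grad(); loss.backward(); opt.step()
```

If a probe this small reaches good accuracy while the encoder stays frozen, the latent space has earned property 4.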

The Tale of Two Representations

CAUTION — "latent representation" picks out different vectors in AE vs SimCLR

The same words refer to different network locations. Get this wrong and downstream interpretation breaks.

| | Autoencoder | SimCLR |
|---|---|---|
| Encoder produces | z (the bottleneck) | h |
| Projection head produces | (no head) | z |
| Loss is computed on | x̂ vs x | z_i vs z_j across the batch |
| The “latent representation” we keep | z | h |
| What you discard after training | (nothing) | z and the projection head |

Why the same letter, different meaning?

In an autoencoder, z is the only hidden vector that matters — it’s the bottleneck, the choke point through which information must pass. We named it “the latent representation” because it’s the only candidate.
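As a sketch of the autoencoder column (PyTorch assumed, toy sizes): the loss only ever compares x̂ with x; z is shaped indirectly, by being the only path between them.

```python
import torch
import torch.nn as nn

D, d = 784, 2
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, d))
decoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))

x = torch.rand(8, D)
z = encoder(x)                              # the bottleneck: the latent representation we keep
x_hat = decoder(z)                          # the output we train on
loss = nn.functional.mse_loss(x_hat, x)     # reconstruction loss: x_hat vs x, never z directly
```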

In SimCLR, there are two candidate latent vectors: h from the encoder, and z from the projection head. Both are “hidden” in the literal sense. But z is optimised purely for the contrastive task — it has been compressed in a way that discards information the contrastive objective doesn’t need (e.g. colour and rotation, the very attributes the augmentations vary). h retains richer, general-purpose features. We keep h and throw away z.
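The SimCLR side, as a comparable sketch (PyTorch assumed; the names f and g follow the paper’s convention, sizes are illustrative): the contrastive loss sees z, but h is what survives training.

```python
import torch
import torch.nn as nn

D, h_dim, z_dim = 784, 128, 32
f = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, h_dim))   # encoder
g = nn.Sequential(nn.Linear(h_dim, z_dim))                               # projection head

x_i, x_j = torch.rand(8, D), torch.rand(8, D)   # two augmented views of the same batch
h_i, h_j = f(x_i), f(x_j)                       # kept after training
z_i, z_j = g(h_i), g(h_j)                       # what the contrastive loss compares across the batch
# ... compute the NT-Xent loss on z_i, z_j, backprop, repeat ...
# After training: keep f (to compute h for new inputs); discard g and its z.
```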

So when a SimCLR paper says “we use the representation h for downstream tasks,” h is what they mean by “the latent representation.” When an autoencoder paper says “the latent code z is used for clustering,” z is what they mean. Same word, different vector.

The general rule: the latent representation is whichever vector you keep after training, regardless of which letter it carries.

Latent walks: probing the space

Once trained, you can investigate what each dimension of the latent space corresponds to by sweeping it.

For an autoencoder with a 2-d bottleneck on MNIST 7s:

  • Encode a real “7” → get z = (z₁, z₂).
  • Hold z₂ fixed; sweep z₁ across a range; decode at each step; look at the sequence of reconstructions.

What you see: the digit physically morphs along one axis of variation (curvature / rotation / thickness / etc.). The network has implicitly assigned an interpretable physical attribute to that latent dimension. We did not tell it to; it discovered it.
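A sketch of that sweep, assuming the trained 2-d encoder/decoder from the earlier sketch and a real MNIST “7” already loaded as seven_image (all three names are placeholders):

```python
import torch

x = seven_image.flatten().unsqueeze(0)              # shape (1, 784); assumed to be a real "7"
with torch.no_grad():
    z = encoder(x)                                   # shape (1, 2): (z1, z2)
    frames = []
    for z1 in torch.linspace(-3.0, 3.0, steps=20):   # sweep the first latent dimension
        z_step = z.clone()
        z_step[0, 0] = z1                            # vary z1, keep z2 as encoded
        frames.append(decoder(z_step).reshape(28, 28))
# Plot `frames` side by side: the digit should morph smoothly along one factor of variation.
```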

When an axis of z corresponds cleanly to a single human-interpretable factor (and other axes don’t also vary that factor), the representation is disentangled. Disentanglement is desirable but rare — most trained AEs produce mixed factors, where a single physical attribute is encoded across multiple latent dimensions.

Why “latent” rather than “feature”?

Subtle distinction worth knowing:

  • Feature is older terminology, often referring to named attributes (edge orientation, colour histogram, SIFT descriptor) — things a human or a hand-designed algorithm decided to extract.
  • Latent variable / representation is statistical / probabilistic terminology — the variable is hidden, not directly observed, and we infer it from data. The name comes from probabilistic modelling (latent factor models, latent Dirichlet allocation, etc.) where latent variables explain observed data via some generative process.

In modern deep learning the terms blur. “Latent” carries a connotation of “the network learned this for itself; we don’t know what it means until we probe it.” “Feature” carries a connotation of “this stands for a nameable attribute.” For week 6’s purposes, treat them as synonyms but lean on “latent” when emphasising that the encoding is unsupervised and discovered.

What the latent representation is not

  • Not the output. The output of an AE is x̂ (a reconstruction); the latent is z (the compressed code). The output of a SimCLR encoder is h; the projection z is computed from it but isn’t kept.
  • Not the parameters. The parameters are the network’s weights — they don’t change per input. The latent representation is per-input.
  • Not the same across runs. Different random seeds produce different latent representations of the same input. The dimensions are not “named” — only the space has structure, and even that is only partially shared between training runs.

Connections

  • autoencoder — the bottleneck is the latent representation; reconstruction loss shapes its meaning.
  • contrastive-learning — the encoder output is the kept latent representation; the projection is discarded.
  • representation-learning — the broader goal of learning useful z / h, of which “latent representation” is the per-input output.
  • self-supervised-learning — provides the training signal that gives the latent space its structure.
  • u-net — the encoder/decoder architecture is similar, but U-Net’s bottleneck is not used as a “kept” representation — the skip connections let information bypass it, and the goal is the segmentation output, not a compressed code.