A generative model is a probability distribution wearing a neural network. The data — images, text, audio — lives on some unknown distribution $p_{\text{data}}(x)$. We never see $p_{\text{data}}$ directly; we only see samples drawn from it (the training set). The goal is to fit a parameterised distribution $p_\theta(x)$ that’s close enough to $p_{\text{data}}$ that new samples $x \sim p_\theta$ look indistinguishable from real data. “Looks like a face that doesn’t exist” is $p_\theta$ just doing its job.
The setup: density estimation
We have a dataset — say, MNIST digits or photos of faces. These are samples from some underlying distribution $p_{\text{data}}(x)$. We don’t know $p_{\text{data}}$:
- We can’t write it down mathematically (faces aren’t a Gaussian).
- We can’t compute $p_{\text{data}}(x)$ for an arbitrary $x$ (no closed form).
- All we have are samples.
A generative model fits a parameterised distribution $p_\theta(x)$ — typically defined by a neural network — to the data, with the explicit goal:

$$p_\theta(x) \approx p_{\text{data}}(x)$$

This is density estimation with a neural network in place of the closed-form parametric family that classical statistics would use.
What we want from $p_\theta$
There are two distinct things you might want from a fitted distribution:
1. Evaluation — given an $x$, compute (or at least score) $p_\theta(x)$. Useful for anomaly detection, likelihood-based comparison, or training via maximum likelihood.
2. Sampling — draw $x \sim p_\theta$. The new $x$ is a generated example: not in the training set, but plausibly from the same distribution.
Different model families prioritise these differently. GANs only do (2); VAEs do both approximately; autoregressive models do both exactly. For most modern applications — generating images, text, audio — sampling is what matters.
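To make the two capabilities concrete, here is the closed-form case classical statistics would use: a 1-d Gaussian fitted by maximum likelihood. A minimal numpy sketch; the "unknown" distribution is invented here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training set": samples from an unknown distribution (here secretly N(3, 1.5^2)).
data = rng.normal(loc=3.0, scale=1.5, size=10_000)

# Density estimation with a closed-form parametric family: MLE for a Gaussian
# is just the sample mean and standard deviation.
mu_hat, sigma_hat = data.mean(), data.std()

# Capability 1 — evaluation: score an arbitrary x under the fitted model.
def log_prob(x):
    return -0.5 * np.log(2 * np.pi * sigma_hat**2) - (x - mu_hat) ** 2 / (2 * sigma_hat**2)

# Capability 2 — sampling: draw new points from the fitted model.
new_samples = rng.normal(mu_hat, sigma_hat, size=5)

print(mu_hat, sigma_hat)              # close to the true 3.0 and 1.5
print(log_prob(3.0) > log_prob(20.0))  # near the mode beats far in the tail
```

A neural generative model replaces the two-parameter family $(\mu, \sigma)$ with a network, because no closed-form family fits faces.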
Why this is hard for images
The pixel space is enormous: a $28 \times 28$ greyscale image has $256^{784} \approx 10^{1888}$ possible configurations — vastly more than atoms in the observable universe ($\sim 10^{80}$). Real images occupy a vanishingly small subset of that space; almost all configurations are noise. The structure of the distribution lives on a thin manifold inside the ambient space.
A second hardness: pixels are strongly correlated. You can’t decompose

$$p(x) = \prod_i p(x_i)$$

because the colour of pixel $i$ depends on the colour of pixel $j$ in deeply context-dependent ways. The naive factorisation works for pure noise (where pixels are independent) but fails for any meaningful image distribution. Capturing the joint distribution faithfully is the central technical challenge.
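A small numpy sketch of this failure mode: fit per-pixel marginals on a toy two-pixel dataset (invented here for illustration) and sample each pixel independently. Every marginal is matched, but the correlation between pixels vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images" with two pixels that are strongly correlated:
# pixel 1 is pixel 0 plus a little noise (think: neighbouring skin pixels).
x0 = rng.normal(size=10_000)
x1 = x0 + 0.1 * rng.normal(size=10_000)
images = np.stack([x0, x1], axis=1)

# Naive factorised model: sample each pixel from its own marginal, independently.
fake0 = rng.normal(images[:, 0].mean(), images[:, 0].std(), size=10_000)
fake1 = rng.normal(images[:, 1].mean(), images[:, 1].std(), size=10_000)

real_corr = np.corrcoef(images[:, 0], images[:, 1])[0, 1]
fake_corr = np.corrcoef(fake0, fake1)[0, 1]
print(real_corr)  # near 1: the pixels move together
print(fake_corr)  # near 0: each pixel plausible alone, joint structure gone
```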
Families of generative models
Four broad families dominate modern generative modelling:
| Family | Mechanism | Examples |
|---|---|---|
| Latent variable | Sample $z$ from a simple prior (typically $\mathcal{N}(0, I)$); decode through a network: $x = G_\theta(z)$. | GANs, VAEs |
| Diffusion | Start from pure noise; iteratively denoise toward a sample. | DDPM, Stable Diffusion (week 8) |
| Autoregressive | Factorise $p(x) = \prod_i p(x_i \mid x_{<i})$; predict one token / pixel at a time conditioned on the rest. | GPT, PixelCNN |
| Normalising flows | Invertible neural networks; transform a simple base distribution into the target distribution exactly. | RealNVP, Glow (not covered in this module) |
This module covers GANs (week 7), diffusion (week 8), and autoregressive models (week 9 onwards). Each has different trade-offs in sample quality, training stability, sampling speed, and likelihood evaluation.
The latent-variable framing (used by GANs and VAEs)
The simplest recipe for a sampling-only generative model:
- Pick a prior distribution $p(z)$ over a low-dimensional latent space — typically $z \sim \mathcal{N}(0, I)$ in $\mathbb{R}^d$. This is fixed by us, not learned.
- Define a generator $G_\theta$ — a neural network with parameters $\theta$.
- To sample: draw $z \sim \mathcal{N}(0, I)$, then output $x = G_\theta(z)$.
The model distribution $p_\theta$ is the push-forward of $p(z)$ through $G_\theta$ — the distribution you get by pushing each Gaussian sample through the network. We never write $p_\theta(x)$ in closed form; we just sample from it by sampling $z$ and decoding.
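A minimal numpy sketch of the push-forward idea, with a fixed hand-written map standing in for a trained generator:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed "generator": a nonlinear map from a 2-d latent to a 2-d output.
# (In a real GAN this would be a trained neural network G_theta.)
def G(z):
    return np.tanh(z @ np.array([[1.0, 0.5], [-0.5, 1.0]]) + 0.3)

# Sampling from the model distribution = push-forward of N(0, I) through G:
z = rng.standard_normal((10_000, 2))   # z ~ N(0, I)
x = G(z)                               # x ~ p_theta, by construction

# Note what we did NOT do: evaluate p_theta(x) for any x. The density of
# this push-forward has no closed form — we can only sample from it.
print(x.shape)                      # (10000, 2)
print(x.min() > -1, x.max() < 1)    # tanh keeps every sample inside (-1, 1)
```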
CAUTION — This is not the latent representation of an input
In an autoencoder or SimCLR, $z = f(x)$ — the encoder takes an input and produces a latent. In a GAN/VAE generator, $z$ is sampled from a prior — there is no input. The two uses share the letter $z$ and the name “latent” but mean different things. See latent-representation for the AE-vs-SimCLR clash; the GAN adds a third meaning: random noise that the generator expands into a sample, with no encoding step.
How do we train $p_\theta$?
You can’t compute a loss like “the distance between $p_\theta$ and $p_{\text{data}}$” directly, because you don’t know either distribution explicitly. Each generative-model family solves this differently:
- GANs train an auxiliary network (the discriminator) to distinguish samples from the two distributions; the generator learns by trying to fool it. Implicit density estimation. See generative-adversarial-network.
- VAEs maximise an evidence lower bound (ELBO) that approximates likelihood of training data under the model.
- Diffusion models train a network to predict the noise added at each step of a diffusion process; sampling is reversed denoising.
- Autoregressive models maximise exact log-likelihood by training each conditional as a classification problem.
The clever part of each family is the training objective that lets you optimise without ever evaluating either explicitly.
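Of the four objectives, the autoregressive one is the easiest to make concrete. A toy sketch over 3-bit “images”, using counting in place of a neural network to estimate each conditional (the dataset and the counting estimator are illustrative; a network plays the same role at scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset of 3-"pixel" binary images where bit i tends to copy bit i-1.
n = 20_000
b0 = rng.integers(0, 2, size=n)
b1 = np.where(rng.random(n) < 0.9, b0, 1 - b0)
b2 = np.where(rng.random(n) < 0.9, b1, 1 - b1)
data = np.stack([b0, b1, b2], axis=1)

# Autoregressive "model" by counting: estimate p(x_i = 1 | x_{i-1}) from data.
p0 = data[:, 0].mean()
p1_given = [data[data[:, 0] == v, 1].mean() for v in (0, 1)]
p2_given = [data[data[:, 1] == v, 2].mean() for v in (0, 1)]

def log_lik(x):
    # Chain rule: log p(x) = log p(x0) + log p(x1|x0) + log p(x2|x1) — exact.
    p = [p0 if x[0] == 1 else 1 - p0,
         p1_given[x[0]] if x[1] == 1 else 1 - p1_given[x[0]],
         p2_given[x[1]] if x[2] == 1 else 1 - p2_given[x[1]]]
    return float(np.sum(np.log(p)))

print(log_lik([1, 1, 1]))  # "smooth" image: relatively likely
print(log_lik([1, 0, 1]))  # alternating image: much less likely
```

This is the exact-likelihood property the table above claims for autoregressive models: every conditional is a small supervised prediction problem, and their product is the true joint.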
Generation vs reconstruction
Worth nailing the distinction with autoencoders:
| | Autoencoder | Generative model |
|---|---|---|
| Goal | Reconstruct the same input | Generate a new, similar sample |
| Input at inference | A specific $x$ | Random $z \sim p(z)$ (or a condition) |
| Output relationship to input | $\hat{x}$ approximates the same instance $x$ | $x$ is from the same distribution, not the same instance |
| Why we care | The latent $z$ is the deliverable | The generated $x$ is the deliverable |
An AE that generates a perfect copy of a training image has succeeded; a generative model that does the same has memorised and failed. Generation requires producing samples not in the training set.
A friend says: "If I want to generate new images, I'll just train an autoencoder, then sample random points in latent space and decode them." Why does this approach mostly fail?
Because the autoencoder’s latent space is shaped only by the reconstruction loss — nothing constrains the encoder to fill that space densely or smoothly. Most points in $\mathbb{R}^d$ that you sample randomly will be in unvisited regions of the latent space (the encoder mapped real inputs to a thin manifold inside $\mathbb{R}^d$, not all of it). Decoding from those holes produces garbage. This is exactly the problem VAEs were designed to solve — they explicitly regularise the latent distribution toward $\mathcal{N}(0, I)$, so random samples fall in covered regions. GANs sidestep the problem entirely by only sampling from $p(z)$ and training the generator to map all of it to plausible outputs; there’s no encoder, no reconstruction, just $x = G_\theta(z)$.
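The “holes” argument can be made concrete with toy geometry. Assume, purely for illustration, that the encoder mapped all real inputs onto a unit circle in a 2-d latent space; real encoders produce higher-dimensional thin manifolds, but the effect is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend an autoencoder's encoder mapped real images onto a thin 1-d curve
# (a unit circle) inside a 2-d latent space — the "visited" region.
angles = rng.uniform(0, 2 * np.pi, size=5_000)
encoded = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Friend's plan: sample z ~ N(0, I) and decode. How close do those z's land
# to the region the decoder was ever trained on?
z = rng.standard_normal((1_000, 2))
dist_to_manifold = np.abs(np.linalg.norm(z, axis=1) - 1.0)  # distance to circle

frac_in_holes = (dist_to_manifold > 0.25).mean()
print(frac_in_holes)  # well over half of random z's miss the visited curve
```

In high-dimensional latent spaces the effect is far more severe: essentially all random draws land in holes, which is why naive AE sampling decodes to garbage.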
Why is naively factorising $p(x) = \prod_i p(x_i)$ fine for pure noise but disastrous for natural images?
Pure noise is per-pixel independent — by definition, each pixel is drawn independently from a uniform or Gaussian distribution. The factorisation $p(x) = \prod_i p(x_i)$ is exact. Natural images have strong long-range correlations: knowing one pixel of a face tells you a lot about its neighbours (skin colour continuity), about distant pixels (the eyes are roughly symmetric), and about global structure (faces have a typical layout). Treating pixels as independent throws all that away — you’d generate an image where each pixel is plausible in isolation (right brightness, right colour) but the joint configuration is total noise. Capturing the joint distribution is the whole point of generative modelling; the per-pixel marginal $p(x_i)$ is trivially easy and useless on its own.
Connections
- generative-adversarial-network — the implicit-density approach; this week’s main subject.
- conditional-generative-model — extends generative models to learn $p(x \mid y)$ rather than $p(x)$, allowing controlled generation from a condition $y$.
- autoencoder — not a generative model in the strict sense, but the architectural ancestor of VAEs and the encoder-decoder pattern that recurs throughout. See the table above for the goal-level distinction.
- latent-representation — clarifies the various uses of $z$ in different architectures; in a GAN, $z$ is sampled from a prior, not produced by an encoder.
- bayes-theorem — the probabilistic backbone for understanding conditional generative models (what does it mean to learn a posterior?).
- self-supervised-learning — generative modelling is a form of self-supervised learning (no labels needed; the data supervises itself).