A generative model is a probability distribution wearing a neural network. The data — images, text, audio — lives on some unknown distribution $p_{\text{data}}(x)$. We never see $p_{\text{data}}$ directly; we only see samples drawn from it (the training set). The goal is to fit a parameterised distribution $p_\theta(x)$ that’s close enough to $p_{\text{data}}$ that new samples $x \sim p_\theta$ look indistinguishable from real data. “Looks like a face that doesn’t exist” is just $p_\theta$ doing its job.

The setup: density estimation

We have a dataset $\{x_1, \dots, x_N\}$ — say, MNIST digits or photos of faces. These are samples from some underlying distribution $p_{\text{data}}(x)$. We don’t know $p_{\text{data}}$:

  • We can’t write it down mathematically (faces aren’t a Gaussian).
  • We can’t compute $p_{\text{data}}(x)$ for an arbitrary $x$ (no closed form).
  • All we have are samples.

A generative model fits a parameterised distribution $p_\theta(x)$ — typically a neural network — to the data, with the explicit goal:

$$p_\theta(x) \approx p_{\text{data}}(x)$$

This is density estimation with a neural network in place of the closed-form parametric family that classical statistics would use.
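To see the classical version before the neural one, here is a minimal maximum-likelihood sketch in PyTorch (the library choice and the toy numbers are my assumptions for illustration, not from the source): a two-parameter Gaussian family fitted to 1-D samples by minimising average negative log-likelihood. A deep generative model swaps this closed-form family for a neural network; the objective stays the same in spirit.

```python
import torch

# Toy "dataset": 1-D samples from an unknown distribution
# (secretly N(5, 2^2), but we pretend not to know that).
data = torch.randn(1000) * 2.0 + 5.0

# Parametric family p_theta(x) = N(mu, sigma^2); theta = (mu, log_sigma).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)

opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
for _ in range(500):
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    nll = -dist.log_prob(data).mean()   # maximum likelihood = minimise average NLL
    opt.zero_grad()
    nll.backward()
    opt.step()

print(mu.item(), log_sigma.exp().item())  # should land near 5.0 and 2.0
```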

What we want from $p_\theta$

There are two distinct things you might want from a fitted distribution:

  1. Evaluation — given an $x$, compute (or at least score) $p_\theta(x)$. Useful for anomaly detection, likelihood-based comparison, or training via maximum likelihood.
  2. Sampling — draw $x \sim p_\theta$. The new $x$ is a generated example: not in the training set, but plausibly from the same distribution.

Different model families prioritise these differently. GANs only do (2); VAEs do both approximately; autoregressive models do both exactly. For most modern applications — generating images, text, audio — sampling is what matters.
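The two operations map directly onto the `log_prob` / `sample` interface of torch.distributions. The sketch below uses a standard Normal as a stand-in for a fitted $p_\theta$ (an illustration of the interface, not a real model):

```python
import torch
from torch.distributions import Normal

p = Normal(loc=0.0, scale=1.0)   # stand-in for a fitted p_theta

# (1) Evaluation: score an arbitrary x under the model.
x = torch.tensor(0.5)
print(p.log_prob(x))             # log p_theta(x): cheap here, unavailable in a GAN

# (2) Sampling: draw new x ~ p_theta.
print(p.sample((5,)))            # five generated "examples"
```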

Why this is hard for images

The pixel space is enormous: a $28 \times 28$ greyscale image has $256^{784}$ possible configurations — more than atoms in the observable universe ($\sim 10^{80}$). Real images occupy a vanishingly small subset of that space; almost all configurations are noise. The structure of the distribution lives on a thin manifold inside the ambient space.
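A quick sanity check on that arithmetic (assuming MNIST-sized 28×28 images with 256 grey levels, as above):

```python
import math

# Configurations of a 28x28 greyscale image with 256 intensity levels:
# 256^784 = 10^(784 * log10 256).
n_pixels = 28 * 28
print(n_pixels * math.log10(256))   # ~1888, i.e. about 10^1888 configurations
# versus roughly 10^80 atoms in the observable universe
```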

A second hardness: pixels are strongly correlated. You can’t decompose

$$p(x) = \prod_i p(x_i)$$

because the colour of pixel $i$ depends on the colour of pixel $j$ in deeply context-dependent ways. The naive factorisation works for pure noise (where pixels are independent) but fails for any meaningful image distribution. Capturing the joint distribution $p(x_1, \dots, x_n)$ faithfully is the central technical challenge.
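A two-pixel toy (my own construction, not from the source) makes the failure concrete: if two binary pixels always agree, the product of marginals still assigns probability 0.25 to each configuration where they disagree, configurations the true joint never produces.

```python
import numpy as np

# True joint: two binary pixels that always agree.
# All mass sits on (0, 0) and (1, 1).
samples = np.array([[0, 0], [1, 1]] * 500)   # 1000 draws from the joint

p1 = samples[:, 0].mean()   # marginal P(x1 = 1) = 0.5
p2 = samples[:, 1].mean()   # marginal P(x2 = 1) = 0.5

# The naive factorisation p(x1)p(x2) invents mass for impossible images.
print("P(x1=0, x2=1) under independence:", (1 - p1) * p2)   # 0.25
print("P(x1=0, x2=1) empirically:",
      ((samples[:, 0] == 0) & (samples[:, 1] == 1)).mean())  # 0.0
```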

Families of generative models

Four broad families dominate modern generative modelling:

| Family | Mechanism | Examples |
| --- | --- | --- |
| Latent variable | Sample $z$ from a simple prior (typically $\mathcal{N}(0, I)$); decode through a network: $x = G_\theta(z)$. | GANs, VAEs |
| Diffusion | Start from pure noise; iteratively denoise toward a sample. | DDPM, Stable Diffusion (week 8) |
| Autoregressive | Factorise $p(x) = \prod_i p(x_i \mid x_{<i})$; predict one token / pixel at a time conditioned on the preceding ones. | GPT, PixelCNN |
| Normalising flows | Invertible neural networks; transform a simple base distribution into the target distribution exactly. | RealNVP, Glow (not covered in this module) |

This module covers GANs (week 7), diffusion (week 8), and autoregressive models (week 9 onwards). Each has different trade-offs in sample quality, training stability, sampling speed, and likelihood evaluation.

The latent-variable framing (used by GANs and VAEs)

The simplest recipe for a sampling-only generative model:

  1. Pick a prior distribution $p(z)$ over a low-dimensional latent space — typically $\mathcal{N}(0, I)$ in $\mathbb{R}^d$. This is fixed by us, not learned.
  2. Define a generator $G_\theta(z)$ — a neural network with parameters $\theta$.
  3. To sample: draw $z \sim p(z)$, then output $x = G_\theta(z)$.

The model distribution $p_\theta$ is the push-forward of $p(z)$ through $G_\theta$ — the distribution you get by pushing each Gaussian sample through the network. We never write $p_\theta(x)$ in closed form; we just sample from it by sampling $z$ and decoding.
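A minimal sketch of the recipe, assuming an untrained MLP generator and illustrative dimensions (the architecture and sizes are placeholders; until $\theta$ is trained, the outputs are noise, not faces):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784   # placeholder sizes, e.g. flattened 28x28 images

# Step 2: the generator G_theta (an arbitrary untrained MLP here).
G = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, data_dim),
    nn.Sigmoid(),                # pixel intensities in [0, 1]
)

# Steps 1 and 3: draw z from the fixed prior, decode into a sample.
z = torch.randn(16, latent_dim)  # z ~ N(0, I), a batch of 16
x = G(z)                         # x ~ p_theta: the push-forward of the prior
print(x.shape)                   # torch.Size([16, 784])
```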

CAUTION — This is not the latent representation of an input

In an autoencoder or SimCLR, $z = f(x)$ — the encoder takes an input and produces a latent. In a GAN/VAE generator, $z$ is sampled from a prior — there is no input. The two uses share the letter $z$ and the word “latent” but mean different things. See latent-representation for the AE-vs-SimCLR clash; the GAN adds a third meaning: random noise that the generator expands into a sample, with no encoding step.

How do we train $G_\theta$?

You can’t compute the loss “distance between $p_\theta$ and $p_{\text{data}}$” directly, because you don’t know either distribution explicitly. Each generative-model family solves this differently:

  • GANs train an auxiliary network (the discriminator) to distinguish samples from the two distributions; the generator learns by trying to fool it. Implicit density estimation. See generative-adversarial-network.
  • VAEs maximise an evidence lower bound (ELBO), a tractable lower bound on the log-likelihood of the training data under the model.
  • Diffusion models train a network to predict the noise added at each step of a forward noising process; sampling runs that process in reverse, denoising step by step.
  • Autoregressive models maximise exact log-likelihood by training each conditional $p(x_i \mid x_{<i})$ as a classification problem (see the sketch below).

The clever part of each family is the training objective that lets you optimise $\theta$ without ever evaluating either density explicitly.
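As one concrete instance of these objectives, here is a sketch of the autoregressive one, under the simplifying assumption that each conditional sees only the previous token (a first-order Markov model; real models condition on the full prefix $x_{<i}$). Cross-entropy summed over positions is the model’s exact negative log-likelihood by the chain rule:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, seq_len, batch = 256, 16, 8                   # illustrative sizes
tokens = torch.randint(0, vocab, (batch, seq_len))   # stand-in training batch

# Deliberately tiny model: each conditional sees only the previous token.
embed = nn.Embedding(vocab, 128)
head = nn.Linear(128, vocab)

logits = head(embed(tokens[:, :-1]))   # predictions for positions 1..seq_len-1
targets = tokens[:, 1:]

# Cross-entropy over the conditionals p(x_i | x_{i-1}) is the exact NLL of the
# sequence under this model (the unconditional p(x_0) term is omitted here).
nll = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
print(nll)   # minimise with any optimiser to do maximum likelihood
```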

Generation vs reconstruction

Worth nailing the distinction with autoencoders:

| | Autoencoder | Generative model |
| --- | --- | --- |
| Goal | Reconstruct the same input $x$ | Generate a new, similar sample |
| Input at inference | A specific $x$ | Random $z$ (or a condition) |
| Output relationship to input | $\hat{x} \approx x$ | $x$ is from the same distribution, not the same instance |
| Why we care | The latent $z$ is the deliverable | The generated $x$ is the deliverable |

An AE that generates a perfect copy of a training image has succeeded; a generative model that does the same has memorised and failed. Generation requires producing samples not in the training set.

Connections

  • generative-adversarial-network — the implicit-density approach; this week’s main subject.
  • conditional-generative-model — extends generative models to learn $p(x \mid y)$ rather than $p(x)$, allowing controlled generation from a condition $y$.
  • autoencoder — not a generative model in the strict sense, but the architectural ancestor of VAEs and the encoder-decoder pattern that recurs throughout. See the table above for the goal-level distinction.
  • latent-representation — clarifies the various uses of $z$ in different architectures; in a GAN, $z$ is sampled from a prior, not produced by an encoder.
  • bayes-theorem — the probabilistic backbone for understanding conditional generative models (what does it mean to learn a posterior?).
  • self-supervised-learning — generative modelling is a form of self-supervised learning (no labels needed; the data supervises itself).