A generative model is a probability distribution wearing a neural network. The data — images, text, audio — lives on some unknown distribution $p_{\text{data}}(x)$. We never see $p_{\text{data}}$ directly; we only see samples drawn from it (the training set). The goal is to fit a parameterised distribution $p_\theta(x)$ that’s close enough to $p_{\text{data}}$ that new samples $x \sim p_\theta$ look indistinguishable from real data. “Looks like a face that doesn’t exist” is $p_\theta$ just doing its job.
The setup: density estimation
We have a dataset — say, MNIST digits or photos of faces. These are samples from some underlying distribution $p_{\text{data}}(x)$. We don’t know $p_{\text{data}}$:
- We can’t write it down mathematically (faces aren’t a Gaussian).
- We can’t compute $p_{\text{data}}(x)$ for an arbitrary $x$ (no closed form).
- All we have are samples.
A generative model fits a parameterised distribution $p_\theta(x)$ — typically defined by a neural network — to the data, with the explicit goal:

$$p_\theta(x) \approx p_{\text{data}}(x)$$

This is density estimation with a neural network in place of the closed-form parametric family that classical statistics would use.
What we want from $p_\theta$
There are two distinct things you might want from a fitted distribution:
1. Evaluation — given an $x$, compute (or at least score) $p_\theta(x)$. Useful for anomaly detection, likelihood-based comparison, or training via maximum likelihood.
2. Sampling — draw $x \sim p_\theta$. The new $x$ is a generated example: not in the training set, but plausibly from the same distribution.
Different model families prioritise these differently. GANs only do (2); VAEs do both approximately; autoregressive models do both exactly. For most modern applications — generating images, text, audio — sampling is what matters.
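To make the two capabilities concrete, here is the closed-form case classical statistics would use: a 1-d Gaussian fitted by maximum likelihood. A minimal numpy sketch; the "unknown" distribution is invented here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training set": samples from an unknown distribution (here secretly N(3, 1.5^2)).
data = rng.normal(loc=3.0, scale=1.5, size=10_000)

# Density estimation with a closed-form parametric family: MLE for a Gaussian
# is just the sample mean and standard deviation.
mu_hat, sigma_hat = data.mean(), data.std()

# Capability 1 — evaluation: score an arbitrary x under the fitted model.
def log_prob(x):
    return -0.5 * np.log(2 * np.pi * sigma_hat**2) - (x - mu_hat) ** 2 / (2 * sigma_hat**2)

# Capability 2 — sampling: draw new points from the fitted model.
new_samples = rng.normal(mu_hat, sigma_hat, size=5)

print(mu_hat, sigma_hat)              # close to the true 3.0 and 1.5
print(log_prob(3.0) > log_prob(20.0))  # near the mode beats far in the tail
```

A neural generative model replaces the two-parameter family $(\mu, \sigma)$ with a network, because no closed-form family fits faces.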
Why this is hard for images
The pixel space is enormous: a $28 \times 28$ greyscale image has $256^{784} \approx 10^{1888}$ possible configurations — vastly more than atoms in the observable universe ($\sim 10^{80}$). Real images occupy a vanishingly small subset of that space; almost all configurations are noise. The structure of the distribution lives on a thin manifold inside the ambient space.
A second hardness: pixels are strongly correlated. You can’t decompose

$$p(x) = \prod_i p(x_i)$$

because the colour of pixel $i$ depends on the colour of pixel $j$ in deeply context-dependent ways. The naive factorisation works for pure noise (where pixels are independent) but fails for any meaningful image distribution. Capturing the joint distribution faithfully is the central technical challenge.
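A small numpy sketch of this failure mode: fit per-pixel marginals on a toy two-pixel dataset (invented here for illustration) and sample each pixel independently. Every marginal is matched, but the correlation between pixels vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images" with two pixels that are strongly correlated:
# pixel 1 is pixel 0 plus a little noise (think: neighbouring skin pixels).
x0 = rng.normal(size=10_000)
x1 = x0 + 0.1 * rng.normal(size=10_000)
images = np.stack([x0, x1], axis=1)

# Naive factorised model: sample each pixel from its own marginal, independently.
fake0 = rng.normal(images[:, 0].mean(), images[:, 0].std(), size=10_000)
fake1 = rng.normal(images[:, 1].mean(), images[:, 1].std(), size=10_000)

real_corr = np.corrcoef(images[:, 0], images[:, 1])[0, 1]
fake_corr = np.corrcoef(fake0, fake1)[0, 1]
print(real_corr)  # near 1: the pixels move together
print(fake_corr)  # near 0: each pixel plausible alone, joint structure gone
```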
Families of generative models
Four broad families dominate modern generative modelling:
| Family | Mechanism | Examples |
|---|---|---|
| Latent variable | Sample $z$ from a simple prior (typically $\mathcal{N}(0, I)$); decode through a network: $x = G_\theta(z)$. | GANs, VAEs |
| Diffusion | Start from pure noise; iteratively denoise toward a sample. | DDPM, Stable Diffusion (week 8) |
| Autoregressive | Factorise $p(x) = \prod_i p(x_i \mid x_{<i})$; predict one token / pixel at a time conditioned on the rest. | GPT, PixelCNN |
| Normalising flows | Invertible neural networks; transform a simple base distribution into the target distribution exactly. | RealNVP, Glow (not covered in this module) |
This module covers GANs (week 7), diffusion (week 8), and autoregressive models (week 9 onwards). Each has different trade-offs in sample quality, training stability, sampling speed, and likelihood evaluation.
The latent-variable framing (used by GANs and VAEs)
The simplest recipe for a sampling-only generative model:
- Pick a prior distribution $p(z)$ over a low-dimensional latent space — typically $z \sim \mathcal{N}(0, I)$ in $\mathbb{R}^d$. This is fixed by us, not learned.
- Define a generator $G_\theta$ — a neural network with parameters $\theta$.
- To sample: draw $z \sim \mathcal{N}(0, I)$, then output $x = G_\theta(z)$.
The model distribution $p_\theta$ is the push-forward of $p(z)$ through $G_\theta$ — the distribution you get by pushing each Gaussian sample through the network. We never write $p_\theta(x)$ in closed form; we just sample from it by sampling $z$ and decoding.
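A minimal numpy sketch of the push-forward idea, with a fixed hand-written map standing in for a trained generator:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed "generator": a nonlinear map from a 2-d latent to a 2-d output.
# (In a real GAN this would be a trained neural network G_theta.)
def G(z):
    return np.tanh(z @ np.array([[1.0, 0.5], [-0.5, 1.0]]) + 0.3)

# Sampling from the model distribution = push-forward of N(0, I) through G:
z = rng.standard_normal((10_000, 2))   # z ~ N(0, I)
x = G(z)                               # x ~ p_theta, by construction

# Note what we did NOT do: evaluate p_theta(x) for any x. The density of
# this push-forward has no closed form — we can only sample from it.
print(x.shape)                      # (10000, 2)
print(x.min() > -1, x.max() < 1)    # tanh keeps every sample inside (-1, 1)
```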
CAUTION — This is not the latent representation of an input
In an autoencoder or SimCLR, $z = f(x)$ — the encoder takes an input and produces a latent. In a GAN/VAE generator, $z$ is sampled from a prior — there is no input. The two uses share the letter $z$ and the name “latent” but mean different things. See latent-representation for the AE-vs-SimCLR clash; the GAN adds a third meaning: random noise that the generator expands into a sample, with no encoding step.
How do we train $p_\theta$?
You can’t compute a loss like “the distance between $p_\theta$ and $p_{\text{data}}$” directly, because you don’t know either distribution explicitly. Each generative-model family solves this differently:
- GANs train an auxiliary network (the discriminator) to distinguish samples from the two distributions; the generator learns by trying to fool it. Implicit density estimation. See generative-adversarial-network.
- VAEs maximise an evidence lower bound (ELBO) that approximates likelihood of training data under the model.
- Diffusion models train a network to predict the noise added at each step of a diffusion process; sampling is reversed denoising.
- Autoregressive models maximise exact log-likelihood by training each conditional as a classification problem.
The clever part of each family is the training objective that lets you optimise without ever evaluating either explicitly.
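Of the four objectives, the autoregressive one is the easiest to make concrete. A toy sketch over 3-bit “images”, using counting in place of a neural network to estimate each conditional (the dataset and the counting estimator are illustrative; a network plays the same role at scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset of 3-"pixel" binary images where bit i tends to copy bit i-1.
n = 20_000
b0 = rng.integers(0, 2, size=n)
b1 = np.where(rng.random(n) < 0.9, b0, 1 - b0)
b2 = np.where(rng.random(n) < 0.9, b1, 1 - b1)
data = np.stack([b0, b1, b2], axis=1)

# Autoregressive "model" by counting: estimate p(x_i = 1 | x_{i-1}) from data.
p0 = data[:, 0].mean()
p1_given = [data[data[:, 0] == v, 1].mean() for v in (0, 1)]
p2_given = [data[data[:, 1] == v, 2].mean() for v in (0, 1)]

def log_lik(x):
    # Chain rule: log p(x) = log p(x0) + log p(x1|x0) + log p(x2|x1) — exact.
    p = [p0 if x[0] == 1 else 1 - p0,
         p1_given[x[0]] if x[1] == 1 else 1 - p1_given[x[0]],
         p2_given[x[1]] if x[2] == 1 else 1 - p2_given[x[1]]]
    return float(np.sum(np.log(p)))

print(log_lik([1, 1, 1]))  # "smooth" image: relatively likely
print(log_lik([1, 0, 1]))  # alternating image: much less likely
```

This is the exact-likelihood property the table above claims for autoregressive models: every conditional is a small supervised prediction problem, and their product is the true joint.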
Generation vs reconstruction
Worth nailing the distinction with autoencoders:
| | Autoencoder | Generative model |
|---|---|---|
| Goal | Reconstruct the same input | Generate a new, similar sample |
| Input at inference | A specific $x$ | Random $z \sim p(z)$ (or a condition) |
| Output relationship to input | $\hat{x}$ approximates the same instance $x$ | $x$ is from the same distribution, not the same instance |
| Why we care | The latent $z$ is the deliverable | The generated $x$ is the deliverable |
An AE that generates a perfect copy of a training image has succeeded; a generative model that does the same has memorised and failed. Generation requires producing samples not in the training set.
A friend says: "If I want to generate new images, I'll just train an autoencoder, then sample random points in latent space and decode them." Why does this approach mostly fail?
Because the autoencoder’s latent space is shaped only by the reconstruction loss — nothing constrains the encoder to fill that space densely or smoothly. Most points in $\mathbb{R}^d$ that you sample randomly will be in unvisited regions of the latent space (the encoder mapped real inputs to a thin manifold inside $\mathbb{R}^d$, not all of it). Decoding from those holes produces garbage. This is exactly the problem VAEs were designed to solve — they explicitly regularise the latent distribution toward $\mathcal{N}(0, I)$, so random samples fall in covered regions. GANs sidestep the problem entirely by only sampling from $p(z)$ and training the generator to map all of it to plausible outputs; there’s no encoder, no reconstruction, just $x = G_\theta(z)$.
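The “holes” argument can be made concrete with toy geometry. Assume, purely for illustration, that the encoder mapped all real inputs onto a unit circle in a 2-d latent space; real encoders produce higher-dimensional thin manifolds, but the effect is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend an autoencoder's encoder mapped real images onto a thin 1-d curve
# (a unit circle) inside a 2-d latent space — the "visited" region.
angles = rng.uniform(0, 2 * np.pi, size=5_000)
encoded = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Friend's plan: sample z ~ N(0, I) and decode. How close do those z's land
# to the region the decoder was ever trained on?
z = rng.standard_normal((1_000, 2))
dist_to_manifold = np.abs(np.linalg.norm(z, axis=1) - 1.0)  # distance to circle

frac_in_holes = (dist_to_manifold > 0.25).mean()
print(frac_in_holes)  # well over half of random z's miss the visited curve
```

In high-dimensional latent spaces the effect is far more severe: essentially all random draws land in holes, which is why naive AE sampling decodes to garbage.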
Why is naively factorising $p(x) = \prod_i p(x_i)$ fine for pure noise but disastrous for natural images?
Pure noise is per-pixel independent — by definition, each pixel is drawn independently from a uniform or Gaussian distribution. The factorisation $p(x) = \prod_i p(x_i)$ is exact. Natural images have strong long-range correlations: knowing one pixel of a face tells you a lot about its neighbours (skin colour continuity), about distant pixels (the eyes are roughly symmetric), and about global structure (faces have a typical layout). Treating pixels as independent throws all that away — you’d generate an image where each pixel is plausible in isolation (right brightness, right colour) but the joint configuration is total noise. Capturing the joint distribution is the whole point of generative modelling; the per-pixel marginal $p(x_i)$ is trivially easy and useless on its own.
Connections
- generative-adversarial-network — the implicit-density approach; this week’s main subject.
- conditional-generative-model — extends generative models to learn $p(x \mid y)$ rather than $p(x)$, allowing controlled generation from a condition $y$.
- autoencoder — not a generative model in the strict sense, but the architectural ancestor of VAEs and the encoder-decoder pattern that recurs throughout. See the table above for the goal-level distinction.
- latent-representation — clarifies the various uses of $z$ in different architectures; in a GAN, $z$ is sampled from a prior, not produced by an encoder.
- bayes-theorem — the probabilistic backbone for understanding conditional generative models (what does it mean to learn a posterior?).
- self-supervised-learning — generative modelling is a form of self-supervised learning (no labels needed; the data supervises itself).