THE CRUX: Last week we learned to encode data into useful representations. This week the goal flips — given only training samples, can we build a network that generates new samples from the same distribution? What does it even mean to "learn a distribution" when we can't write it down, can't evaluate it, and only ever see samples from it?

The week’s answer is implicit density estimation: don’t try to write down p_data(x) at all — train a generator network whose samples are statistically indistinguishable from real data. The key mechanism is adversarial training: pair the generator with a discriminator that tries to spot fakes, and let them duel. At equilibrium, the generator’s samples match the real distribution. The week then extends this to conditional generation — given a sketch, produce a photo; given a low-res image, produce a high-res one — by feeding a condition into both networks and reframing the problem as posterior estimation via bayes-theorem.

Where we left off

Week 6 was about learning representations of data via autoencoders, contrastive learning, and pretext tasks. The deliverable was always the encoder — the latent vector was the prize, and reconstruction / contrast / context-prediction was scaffolding to train it.

This week the deliverable flips. We don’t care about an encoder; we want a network whose outputs, when sampled, look like real data we’ve never seen. New faces. New street scenes. New microscopy images. The encoder pattern still appears (cGAN’s discriminator and pix2pix’s U-Net generator are both encoder-shaped), but now the goal is to generate, not to compress.

The framing: density estimation we can’t actually do

Real data comes from some unknown distribution p_data(x). We never get to see p_data directly:

  • We can’t write it down (faces aren’t a Gaussian).
  • We can’t evaluate it at arbitrary points.
  • We only have samples — the training set.

A generative model fits a parameterised distribution p_θ(x) — typically a neural network — with the goal p_θ ≈ p_data. Then to sample a new x, we sample from p_θ.

This is density estimation with neural networks instead of classical parametric families. Two things make it hard for images:

  1. Pixel space is enormous: an H×W RGB image with 8-bit pixels has 256^(3HW) possible configurations, and meaningful images are a vanishingly thin manifold inside that vastness.
  2. Pixels are heavily correlated. The distribution doesn’t factor; the joint structure is what makes images images.

TIP — Why this is the same problem as week 6 (but flipped)

Week 6 said: real images are a tiny island in pixel space, so we should learn the structure of that island via an encoder. Week 7 says: real images are a tiny island in pixel space, so we should learn to sample from it via a generator. The hardness is the same hardness; the deliverable is different.

See generative-model for the full landscape (latent variable models, diffusion, autoregressive, normalising flows).

The latent-variable trick

The simplest sampling-only generative model:

  1. Pick a prior over a small latent space: z ~ N(0, I), with dim(z) ≪ dim(x).
  2. Define a generator network G_θ: z ↦ x.
  3. To sample: draw z ~ N(0, I); output x = G_θ(z).

The model distribution p_θ is the push-forward of the Gaussian through G_θ — implicitly defined; you can sample from it but never write it in closed form.
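The push-forward idea can be made concrete with a toy sketch. The map below is an arbitrary fixed nonlinearity standing in for a trained G_θ (the matrix A and the tanh are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5], [-0.5, 1.0]])  # fixed toy weights

def generator(z):
    # Stand-in for a trained G_θ: a nonlinear map from 2-D latent to 2-D sample.
    return np.tanh(z @ A) + 0.1 * z

# Sampling from the implicit model distribution: z ~ N(0, I), x = G(z)
z = rng.standard_normal((1000, 2))
x = generator(z)

# We can draw unlimited samples, but there is no closed-form density
# p_θ(x) to evaluate: the distribution is defined only by the map.
print(x.shape)  # (1000, 2)
```

Any nonlinear map would do; the point is that sampling is trivial while density evaluation is not.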

CAUTION — This is not the latent of an autoencoder

In an AE or SimCLR, z = E(x) — the encoder consumes an input and outputs a latent. In a GAN/VAE, z is a sample from a prior — there is no input to encode. The two uses share the word “latent” but differ fundamentally. See latent-representation for the AE-vs-SimCLR clash; the GAN adds a third meaning: random noise that the generator expands into a sample.

The remaining problem: how do we train G_θ? We can’t compute p_θ(x) to score it against p_data — we don’t have either explicitly. Each generative-model family solves this differently. GANs use adversarial training.

GANs: training by adversarial duel

A generative adversarial network (generative-adversarial-network) pairs the generator G_θ with a second network — a discriminator D_φ — and trains them against each other.

  • G_θ generates fake samples x̂ = G_θ(z) from random z ~ N(0, I).
  • D_φ takes inputs x (real or fake) and outputs a probability D_φ(x) ∈ (0, 1) that the input is real. It’s a binary classifier with a sigmoid head.
  • D_φ is trained to detect fakes: push D_φ(x) → 1 for real x, D_φ(G_θ(z)) → 0 for fakes.
  • G_θ is trained to fool D_φ: push D_φ(G_θ(z)) → 1.

It’s a forger-and-detective game. The forger produces counterfeits from random inspiration; the detective inspects each note and announces real-or-fake; both improve over time. At equilibrium, the forger’s fakes are indistinguishable from real — D_φ can do no better than 50/50, and G_θ’s output distribution matches p_data.

The losses

Discriminator loss — standard binary cross-entropy on a mixed batch (label 1 for real, 0 for fake):

    L_D = −E_{x~p_data}[log D_φ(x)] − E_{z~N(0,I)}[log(1 − D_φ(G_θ(z)))]

Generator loss (theoretical) — minimise the negative of D_φ’s fake-batch loss (the real-batch term has zero gradient w.r.t. θ and is dropped):

    L_G = E_{z~N(0,I)}[log(1 − D_φ(G_θ(z)))]

Generator loss (practical) — same equilibrium, better gradients:

    L_G = −E_{z~N(0,I)}[log D_φ(G_θ(z))]

Together G_θ and D_φ play a min-max game:

    min_θ max_φ  E_{x~p_data}[log D_φ(x)] + E_{z~N(0,I)}[log(1 − D_φ(G_θ(z)))]
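The three losses can be sketched as plain functions of the discriminator’s output probabilities. This is a minimal numpy illustration (the EPS floor is my addition for numerical safety, not part of the definitions):

```python
import numpy as np

EPS = 1e-8  # numerical floor to keep log() finite

def d_loss(d_real, d_fake):
    # BCE on a mixed batch: label 1 for real, 0 for fake.
    return -np.mean(np.log(d_real + EPS)) - np.mean(np.log(1 - d_fake + EPS))

def g_loss_theoretical(d_fake):
    # Minimise log(1 - D(G(z))): the negated fake-batch term of L_D.
    return np.mean(np.log(1 - d_fake + EPS))

def g_loss_practical(d_fake):
    # Non-saturating form: maximise log D(G(z)).
    return -np.mean(np.log(d_fake + EPS))

d_real = np.array([0.9, 0.8])   # D fairly confident these are real
d_fake = np.array([0.1, 0.2])   # D fairly confident these are fake
print(d_loss(d_real, d_fake))    # small: D is doing well here
print(g_loss_practical(d_fake))  # large: G still has work to do
```

In real implementations these would be computed from logits (e.g. a BCE-with-logits loss) rather than probabilities, for stability.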

Why “theoretical” and “practical” generator losses differ

The theoretical form saturates exactly where you don’t want it to — early in training, when D_φ(G_θ(z)) ≈ 0 (the generator’s outputs are obvious garbage), the gradient is flat. The practical form has the opposite gradient profile: large when D_φ(G_θ(z)) is small (lots of learning signal early), saturating only as D_φ(G_θ(z)) approaches the sigmoid’s upper limit of 1 (where you’re already done).

Both losses share the same minimum (push D_φ(G_θ(z)) → 1). Every real GAN implementation uses the practical form. The theoretical form survives in textbooks because it’s the one the JSD-minimisation proof uses.
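The saturation argument can be checked with one line of calculus. Taking gradients w.r.t. D’s output probability (a simplification; in practice gradients flow through D’s logits) at a point where D easily spots the fake:

```python
# Gradient magnitude of each generator loss w.r.t. D's output, evaluated
# early in training where D(G(z)) is near 0.
d = 0.01  # D is confident the sample is fake

grad_theoretical = abs(-1.0 / (1.0 - d))  # d/dD [log(1 - D)]: about 1
grad_practical = abs(-1.0 / d)            # d/dD [-log D]: about 100

print(grad_theoretical, grad_practical)
```

At D = 0.99 the profiles flip: the theoretical loss has the large gradient and the practical one saturates, which is exactly where learning is already done.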

ASIDE — The Jensen-Shannon connection

Goodfellow’s 2014 paper proved: if D has infinite capacity and is perfectly optimised at every step, then the optimal discriminator at every point x is D*(x) = p_data(x) / (p_data(x) + p_θ(x)), and substituting it back turns the min-max objective into 2·JSD(p_data ‖ p_θ) − log 4. Minimising the GAN objective minimises the Jensen-Shannon divergence between data and model — and JSD is zero iff the two distributions are equal. So adversarial training implicitly minimises a real divergence, even though we never compute it directly. Caveats: a real D has finite capacity and isn’t perfectly optimised at every step, and the practical loss differs from the theoretical one. The result is a guarantee in spirit, not a proof of practical convergence.

The training algorithm

Initialize θ, φ randomly
for t = 1 ... T:
    for k = 1 ... K:                    # train D for K steps (often K=1)
        sample real batch {x_i} ~ p_data
        sample latent batch {z_i} ~ N(0,I)
        L_D = -avg(log D_φ(x_i)) - avg(log(1 - D_φ(G_θ(z_i))))
        φ ← φ - α ∇_φ L_D
    sample latent batch {z_i} ~ N(0,I)  # train G for 1 step
    L_G = -avg(log D_φ(G_θ(z_i)))       # practical generator loss
    θ ← θ - β ∇_θ L_G
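The loop above can be run end-to-end on a toy 1-D problem small enough for manual gradients. Everything here is a hypothetical illustration, not from the notes: real data is N(4, 1), the generator is linear (G(z) = a·z + b), and the discriminator is logistic (D(x) = σ(w·x + c)):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.0, 0.0           # generator params θ
w, c = 0.0, 0.0           # discriminator params φ
lr, batch = 0.02, 128

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for step in range(3000):
    # --- discriminator step (K=1) ---
    x_real = 4.0 + rng.standard_normal(batch)
    z = rng.standard_normal(batch)
    x_fake = a * z + b
    s_real = sigmoid(w * x_real + c)
    s_fake = sigmoid(w * x_fake + c)
    # BCE gradient w.r.t. the logit is (prediction - label)
    grad_w = np.mean((s_real - 1) * x_real) + np.mean(s_fake * x_fake)
    grad_c = np.mean(s_real - 1) + np.mean(s_fake)
    w -= lr * grad_w
    c -= lr * grad_c
    # --- generator step, practical loss -log D(G(z)) ---
    z = rng.standard_normal(batch)
    x_fake = a * z + b
    s_fake = sigmoid(w * x_fake + c)
    # chain rule: through D's logit (slope w) back into a and b
    grad_a = np.mean((s_fake - 1) * w * z)
    grad_b = np.mean((s_fake - 1) * w)
    a -= lr * grad_a
    b -= lr * grad_b

samples = a * rng.standard_normal(10000) + b
print(round(samples.mean(), 2))  # drifts toward the real mean of 4
```

Even in one dimension the min-max dynamics oscillate around the equilibrium rather than settling exactly, which foreshadows the convergence questions at the end of these notes.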

Two stochastic gradient descents on opposite objectives, alternated. Backprop flows from L_G through D_φ back into G_θ — the generator’s output is just an activation map fed into D_φ, so the chain rule pushes gradients all the way back to θ.

After training: throw the discriminator away

Once trained, only G_θ is needed. To generate: sample z ~ N(0, I), compute x = G_θ(z). The discriminator was scaffolding for training; at inference it has no role.

DCGAN and beyond

The original 2014 GAN used MLPs. Real images need convolutions:

  • DCGAN (Radford et al. 2016) — generator made of transposed convolutions / upsampling (analogous to the decoder half of a U-Net, starting from z and growing to a full image); discriminator is a standard CNN classifier. DCGAN’s main contribution was a recipe of architectural choices (no FC layers, batch norm everywhere, LeakyReLU in D, no pooling) that consistently train. Most modern GAN architectures descend from DCGAN.
  • Latent interpolation — sampling G_θ((1 − t)·z₁ + t·z₂) for t ∈ [0, 1] produces a smooth morph between G_θ(z₁) and G_θ(z₂). Empirically observed; no theoretical guarantee, but reliable in practice.
  • BigGAN (2019), StyleGAN (2019, behind thispersondoesnotexist.com), StyleGAN-T (text-to-image) — progressively scaled-up architectures that improved sample quality from “ugly digits” (2014) to “photorealistic faces” (2018) to “controllable text-to-image” (2023).
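Latent interpolation is a one-liner once a generator exists. Here a frozen random map stands in for a trained G_θ (an illustrative assumption; real generators are deep networks but still smooth functions of z):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))  # frozen weights standing in for a trained G_θ

def generator(z):
    # Any smooth map of z interpolates smoothly; this is the mechanism,
    # not evidence that trained GANs behave this way.
    return np.tanh(z @ W)

z1, z2 = rng.standard_normal(8), rng.standard_normal(8)
frames = [generator((1 - t) * z1 + t * z2) for t in np.linspace(0, 1, 7)]

# Endpoints reproduce G(z1) and G(z2); intermediate frames morph between them.
print(len(frames))  # 7
```

With a trained DCGAN the same linspace-in-z loop produces the face-morphing grids from the lecture.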

GANs were the state of the art for image generation roughly 2014–2020; they were displaced by diffusion models (which week 8 covers) but remain useful in many image-to-image tasks where their implicit-density approach is well-suited.

Bayesian basics — the bridge to conditional models

The transition to conditional generation goes through Bayes. We pause for a refresher.

The setup: two random variables y (observation) and x (truth). The joint factors two ways:

    p(x, y) = p(y | x) p(x) = p(x | y) p(y)

Equating the two factorisations gives Bayes’ theorem:

    p(x | y) = p(y | x) p(x) / p(y)

The four named pieces:

  Term       Name        Reading
  p(x)       Prior       What we believe about x before seeing y
  p(y | x)   Likelihood  How y is produced given x
  p(x | y)   Posterior   What we believe about x after seeing y
  p(y)       Evidence    The normaliser: the marginal probability of the observation

The directionality matters: likelihood goes forward (truth → observation), posterior goes backward (observation → truth). The forward direction is usually easy to model (sensor physics); the backward direction is what we want at inference. Bayes converts one to the other.

Worked example from the slides: penguin flipper length x measured by hand, computer-vision estimate y. The prior p(x) is the dataset histogram of true flipper lengths (centred at 190 mm). The likelihood is the histogram of CV measurements when the truth is 192 (centred at 192 with noise spread). The posterior — the actual question we care about, “given the CV said 201, what’s the true value?” — is not centred at 201; it’s pulled toward the prior mean of 190 by Bayes shrinkage.
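The shrinkage effect can be reproduced with the conjugate Gaussian formula. The variances below are illustrative guesses, not values from the slides:

```python
# Gaussian prior + Gaussian likelihood -> Gaussian posterior with a
# precision-weighted mean, pulled from the observation toward the prior.
mu0, s0 = 190.0, 10.0   # prior over true flipper length x: N(190, 10^2)
sigma = 5.0             # likelihood: y | x ~ N(x, 5^2), CV measurement noise
y = 201.0               # observed CV estimate

post_var = 1.0 / (1.0 / s0**2 + 1.0 / sigma**2)
post_mean = post_var * (mu0 / s0**2 + y / sigma**2)
print(round(post_mean, 1))  # 198.8: between 190 and 201
```

The sharper the likelihood (small sigma) the closer the posterior sits to y; the sharper the prior, the stronger the pull toward 190.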

MMSE — why MSE regression converges to the conditional mean

A classical exercise: given noisy measurements y₁, …, yₙ of an unknown constant, the estimate x̂ that minimises the sum of squared errors Σᵢ (yᵢ − x̂)² is the sample mean. Generalising: a regression network trained with MSE loss converges to E[x | y] — the conditional mean over all plausible explanations of the observation.
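A quick numerical check of the classical exercise: brute-force the minimiser of the sum of squared errors and compare it with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = 4.0 + rng.standard_normal(500)  # noisy measurements of a constant

# Grid-search the estimate minimising the sum of squared errors
candidates = np.linspace(0, 8, 8001)
sse = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)
best = candidates[sse.argmin()]

print(best, y.mean())  # agree to within the grid resolution
```

The SSE is a parabola in the estimate, so its minimiser is exactly the mean; the grid search just makes that visible without calculus.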

This sets up the next section’s payoff.

Conditional generative models

Switch from learning the marginal p(x) to learning the posterior p(x | y). The neural network is now a parameterised approximation to Bayes.

Tasks that fit this framing:

  Task                             Condition y                  Target x
  Colourisation                    Greyscale image              Colour image
  Super-resolution                 Low-res image                High-res image
  Inpainting                       Image with masked region     Filled image
  Semantic-map → photo (pix2pix)   Class-coloured segmentation  Realistic photo
  Sketch → photo                   Outline drawing              Realistic photo
  Text-to-image                    Caption                      Generated image

In every case multiple x’s are plausible for the same y — multiple legitimate colourings, multiple high-res reconstructions, multiple photos matching the same caption. The posterior p(x | y) captures that diversity; the task is to sample from it.

Why MSE regression gives blurry outputs

Train an MLP to map y ↦ x with MSE loss. It converges to E[x | y] — the average of all plausible answers. When the posterior is multimodal (red flower vs yellow flower), the mean is a desaturated grey-pink between them — neither one nor the other. The week-07 problem set drives this home with a digit example: the optimal regression output for “draw a 1” is the pixelwise average of all training “1”s, full of fractional pixel values that appear in no single training example.
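The averaging failure is easy to see on toy data. The two 4×4 "modes" below are hypothetical stand-ins for two equally plausible answers to the same condition:

```python
import numpy as np

# Two equally plausible answers for the same condition y
mode_a = np.zeros((4, 4)); mode_a[:, 1] = 1.0   # stroke in column 1
mode_b = np.zeros((4, 4)); mode_b[:, 2] = 1.0   # stroke in column 2

# The MSE-optimal output is the posterior mean: a 0.5-valued blur that
# matches neither mode.
mse_optimal = 0.5 * (mode_a + mode_b)

# A generative model instead draws one mode per sample.
rng = np.random.default_rng(0)
sample = mode_a if rng.random() < 0.5 else mode_b
print(mse_optimal.max(), sample.max())  # 0.5 vs 1.0
```

Every generative sample is a crisp member of the posterior; only the average is blurry.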

A generative approach to conditional modelling samples from the posterior instead of averaging across it. Each sample is one plausible answer; running inference multiple times produces diverse outputs. See conditional-generative-model.

cGAN and pix2pix

The cGAN extension is a direct surgical edit of the GAN setup:

  • Generator G_θ(z, y) — takes both the noise and the condition.
  • Discriminator D_φ(x, y) — takes both the condition and the candidate target; outputs the probability that the pair is real.

Critically, D_φ sees the condition. Without it, G_θ could produce any realistic x regardless of y (always output a beautiful brown shoe regardless of the input sketch) and still fool D_φ. With the condition, D_φ checks that the pair (x, y) matches — punishing G_θ for ignoring y.
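A common way to give D_φ the pair is channel-wise concatenation. This sketch shows the shapes only; the sizes are illustrative, not from the notes:

```python
import numpy as np

B, H, W = 4, 64, 64
y = np.zeros((B, 1, H, W))   # condition: e.g. a greyscale sketch
x = np.zeros((B, 3, H, W))   # candidate target: real photo or G(z, y)

# The discriminator's CNN then convolves over condition and candidate
# jointly, so mismatched pairs can be detected.
d_input = np.concatenate([y, x], axis=1)
print(d_input.shape)  # (4, 4, 64, 64)
```

Concatenation is the pix2pix-style choice for image conditions; other condition types (class labels, text embeddings) are injected by embedding-and-broadcast instead.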

pix2pix (Isola et al. 2017) is the canonical image-to-image cGAN: U-Net generator + PatchGAN discriminator + L1+GAN hybrid loss. The L1 anchors the output near the ground truth; the GAN sharpens textures to realistic values rather than the L1’s blurry conditional mean. Together they produce outputs that are both faithful to the input and crisp.

Conditional models naturally express uncertainty

Because G_θ(z, y) depends on the random z, running it multiple times on the same y produces different x̂’s. The variation across samples is an estimate of posterior uncertainty: high-confidence regions (real cell boundaries in a microscopy image) stay consistent; low-confidence regions (background noise) vary. This is impossible with a plain MSE-regression network.

Concepts introduced this week

  • generative-model — the broad framing: density estimation p_θ ≈ p_data; the four families (latent variable, diffusion, autoregressive, normalising flows).
  • generative-adversarial-network — generator + discriminator, BCE-on-mixed-batch loss, theoretical vs practical generator losses, min-max game, Jensen-Shannon proof, training algorithm, DCGAN.
  • bayes-theorem — prior, likelihood, posterior; the penguin walkthrough; MMSE / why regression converges to the conditional mean.
  • conditional-generative-model — cGAN, pix2pix, posterior sampling, why MSE regression gives blurry outputs, uncertainty quantification.

Connections

  • Builds on autoencoder / u-net — the encoder-decoder architecture pattern recurs (DCGAN’s generator is a decoder; pix2pix’s generator is a U-Net), but the training signal is no longer reconstruction. The GAN substitutes the decoder’s reconstruction loss with a discriminator’s adversarial loss.
  • Builds on binary-cross-entropy and sigmoid function — the discriminator is a standard binary classifier; the GAN losses are BCE in a min-max wrapper.
  • Builds on backpropagation — gradients flow from L_G through D_φ back into G_θ via the chain rule; the generator’s output is just an activation map consumed by the discriminator.
  • Sets up week 8 (diffusion) — diffusion models are another approach to the same problem (sample from given only training samples) but with a fundamentally different training objective: predict the noise added at each step of a forward diffusion process. They have largely displaced GANs as the state of the art for image generation since 2020.
  • Sets up later weeks (autoregressive, multimodal) — text-to-image (covered briefly here as a cGAN application) becomes a major topic via diffusion + CLIP-style contrastive embeddings; autoregressive models cover language modelling end-to-end.

Open questions

  • Mode collapse — GANs sometimes find one or a few outputs that consistently fool D_φ and produce only those (a perfect “7” every time). Symptom: low diversity. Fixes (minibatch discrimination, Wasserstein loss, careful capacity balancing) are partial. The deeper question of why the JS-divergence-minimising equilibrium isn’t always the one SGD finds is still active research.
  • Why latent-space interpolation is smooth — we observe smooth morphs in DCGAN/StyleGAN, but there’s no theoretical reason this should hold. The implicit smoothness of G_θ is a happy empirical accident.
  • Why z gets ignored in some cGANs — pix2pix-style models often learn to ignore the noise input z and produce nearly deterministic outputs. The condition dominates; the noise contributes little to output diversity. Fixing this is the motivation for noise-injection techniques in newer architectures and for switching to conditional diffusion (where noise plays a more structural role in the model).

Problem-set lessons

  • Q1 (regression for image generation): Train a regressor to map a digit class to an image with MSE loss and multiple training examples per digit. The optimum is the pixelwise average of the training images — fractional pixel values that look like no specific digit. The MMSE estimator collapses multimodal posteriors into their mean. The generative-model fix: sample from the posterior instead of averaging it.
  • Q2 (cGAN tensor sizes): A discriminator on a batch of 4 colour images takes input of shape (4, 3, H, W) and outputs shape (4, 1) — one real-vs-fake probability per image. Standard CNN classifier on the input side; the only conditional twist is that the input is an image (real or generated), and in a cGAN the condition is concatenated with it.
  • Q3 (true/false statements about GANs):
    • “Discriminator only used during training” → True: discarded at inference.
    • “Generator maximises D_φ’s ability to distinguish” → False: the generator minimises it (fools D_φ). It’s the discriminator that maximises distinguishing ability.
    • “Discriminator minimises probability of correctly classifying real data” → False: it maximises correct classification on both real and fake.
    • “To compute discriminator gradients we first need generator gradients” → False: the two are updated in separate SGD steps. When updating D_φ, G_θ’s parameters are fixed (no gradients needed). When updating G_θ, gradients flow through D_φ (chain rule) but D_φ’s parameters are fixed and not updated.