A GAN is a forger and a detective locked in a room. The forger ($G$) makes counterfeit notes from random inspiration. The detective ($D$) inspects each note and decides “real or fake?” The forger improves by studying which fakes get caught; the detective improves by seeing more fakes. At equilibrium, the forger’s counterfeits are indistinguishable from real currency — the discriminator can do no better than coin-flip — and the detective is no longer useful. Throw the detective away. The forger is the deliverable.
Architecture
Two networks, trained jointly:
- Generator $G_\theta$ with parameters $\theta$. Takes a sample $z$ from a fixed prior $p(z)$ (typically $\mathcal{N}(0, I)$) and produces a sample $G_\theta(z)$ in image space.
- Discriminator $D_\phi$ with parameters $\phi$. Takes an input $x$ and outputs a scalar probability $D_\phi(x) \in (0, 1)$ that $x$ is real (drawn from $p_{\text{data}}$) rather than fake (drawn from $p_\theta$). Final layer is a sigmoid; it is just a binary classifier.
The model distribution $p_\theta$ is the push-forward of $p(z)$ through $G_\theta$ — implicit, never written down. To sample: draw $z \sim \mathcal{N}(0, I)$, return $G_\theta(z)$.
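A minimal sketch of the two networks in PyTorch. The layer sizes, latent dimensionality, and MLP (rather than convolutional) design are illustrative assumptions, not taken from any particular paper:

```python
import torch
import torch.nn as nn

LATENT_DIM = 100   # dimensionality of z (a common but arbitrary choice)
IMG_DIM = 28 * 28  # flattened image size, assuming MNIST-like data

# Generator G_theta: latent z -> image space
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_DIM), nn.Tanh(),   # pixels in [-1, 1]
)

# Discriminator D_phi: image -> probability of "real"
D = nn.Sequential(
    nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),      # scalar in (0, 1)
)

z = torch.randn(16, LATENT_DIM)  # z ~ N(0, I)
fake = G(z)                      # a batch of 16 samples from p_theta
p_real = D(fake)                 # D's probability that each sample is real
```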
ASIDE — Forger / detective intuition
Goodfellow’s original framing: imagine a counterfeiter trying to print fake currency. They start with random inspiration ($z$) and produce fakes ($G_\theta(z)$). A detective inspects every note $x$ (real or fake) and announces “$D_\phi(x)$ = probability this is real.” The counterfeiter studies which fakes the detective catches and refines their craft; the detective sees more fakes and gets better at spotting them. The objective is equilibrium: when the forger’s fakes are indistinguishable from real notes, the detective can do no better than 50/50 guessing. At that point, the forger’s distribution matches the real one.
The discriminator’s loss
The discriminator solves a standard binary classification problem: label $y = 1$ for real samples ($x \sim p_{\text{data}}$), label $y = 0$ for fakes ($x = G_\theta(z)$, equivalently $x \sim p_\theta$). Standard binary cross-entropy, written with the sign flipped to a minimisation:

$$\mathcal{L}_D(\phi) = -\,\mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D_\phi(x)\bigr] - \mathbb{E}_{z \sim p(z)}\bigl[\log\bigl(1 - D_\phi(G_\theta(z))\bigr)\bigr]$$

The two terms are independent losses on the two batches: real samples (label 1) and fake samples (label 0). In practice, with mini-batches of $N_{D,r}$ real samples and $N_{D,f}$ fake samples:

$$\mathcal{L}_D(\phi) = -\frac{1}{N_{D,r}} \sum_{i=1}^{N_{D,r}} \log D_\phi(x_i) - \frac{1}{N_{D,f}} \sum_{i=1}^{N_{D,f}} \log\bigl(1 - D_\phi(G_\theta(z_i))\bigr)$$
For fixed generator parameters $\theta$, the discriminator is updated by SGD on $\phi$ to minimise this.
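In code, continuing the sketch above (`real` stands for an assumed mini-batch of training images; the raw `torch.log` form mirrors the formula, though production code would prefer the numerically safer `BCEWithLogitsLoss`):

```python
z = torch.randn(real.size(0), LATENT_DIM)  # latent batch for the fakes
d_real = D(real)                  # D's scores on the real batch
d_fake = D(G(z).detach())         # detach: this loss updates phi only, not theta
loss_D = -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())
```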
The generator’s loss — theoretical and practical
The generator’s job is to fool the discriminator. The natural objective is the negative of $D$’s loss on the fake batch (since $G$ never sees real data, the real-data term is irrelevant — its gradient w.r.t. $\theta$ is zero):
Theoretical loss (matches the min-max derivation):

$$\mathcal{L}_G^{\text{theory}}(\theta) = \mathbb{E}_{z \sim p(z)}\bigl[\log\bigl(1 - D_\phi(G_\theta(z))\bigr)\bigr]$$

The generator wants to minimise this (push $D_\phi(G_\theta(z)) \to 1$, fooling the detective). The discriminator wants to maximise the same quantity. This is the two-player min-max game:

$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D_\phi(x)\bigr] + \mathbb{E}_{z \sim p(z)}\bigl[\log\bigl(1 - D_\phi(G_\theta(z))\bigr)\bigr]$$
The saturating-gradient problem
The theoretical loss has a fatal practical flaw early in training. When the generator is bad (its samples are obvious fakes), $D_\phi(G_\theta(z)) \approx 0$, so $\log(1 - D_\phi(G_\theta(z))) \approx \log 1 = 0$ — a flat region of the curve. The gradient is small exactly when the generator is worst and most needs a strong learning signal. Conversely, when the generator is excellent and $D_\phi(G_\theta(z)) \approx 1$, the gradient is huge — even though training has nearly converged. Backwards from what we want.
Practical loss (the fix, used in real implementations):

$$\mathcal{L}_G^{\text{practical}}(\theta) = -\,\mathbb{E}_{z \sim p(z)}\bigl[\log D_\phi(G_\theta(z))\bigr]$$

Same goal — push $D_\phi(G_\theta(z)) \to 1$ — but the gradient is large when $D_\phi(G_\theta(z))$ is small (training start) and saturates when $D_\phi(G_\theta(z)) \approx 1$ (training end). Exactly the right shape.
The two losses share the same optimum ($D_\phi(G_\theta(z)) \to 1$) but have different gradient profiles. The practical loss is what every GAN implementation actually uses.
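Both generator losses in code, continuing the same sketch:

```python
z = torch.randn(16, LATENT_DIM)
d_fake = D(G(z))                                # no detach: gradients must reach theta
loss_G_theory = torch.log(1 - d_fake).mean()    # saturates when d_fake ~ 0 (early)
loss_G_practical = -torch.log(d_fake).mean()    # large gradients when d_fake ~ 0
```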
TIP — How to remember the loss flip
Both forms of the generator loss are trying to fool $D$ into outputting “1” on fakes. The theoretical form is “minimise the log-probability of being caught” — saturates badly when the model is bad. The practical form is “maximise the log-probability of fooling the detective” — saturates badly when the model is good (which is fine — at that point you’re done). Same equilibrium, opposite saturation behaviour.
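The flip is easy to verify numerically. The gradient that reaches the generator passes through the discriminator's final sigmoid, so a minimal check (assumption: a scalar logit standing in for $D$'s pre-sigmoid output) differentiates each loss with respect to that logit:

```python
import torch

for logit in (-4.0, 0.0, 4.0):   # D(G(z)) ~ 0.02 (bad G), 0.5, 0.98 (good G)
    l1 = torch.tensor(logit, requires_grad=True)
    torch.log(1 - torch.sigmoid(l1)).backward()   # theoretical loss term
    l2 = torch.tensor(logit, requires_grad=True)
    (-torch.log(torch.sigmoid(l2))).backward()    # practical loss term
    d = torch.sigmoid(torch.tensor(logit)).item()
    # theoretical grad = -sigmoid(logit): vanishes exactly when G is bad
    # practical grad = sigmoid(logit) - 1: strong (~ -1) when G is bad
    print(f"D={d:.2f}  theory={l1.grad.item():+.3f}  practical={l2.grad.item():+.3f}")
```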
The training algorithm
Outer loop iterates many epochs. Inner step alternates D and G updates:
    Initialize θ, φ randomly
    for t = 1 ... T:
        # Train D for K steps (often K=1)
        for k = 1 ... K:
            Sample real batch {x_i} ~ p_data, size N_{D,r}
            Sample latent batch {z_i} ~ N(0,I), size N_{D,f}
            L_D = -(1/N_{D,r}) Σ log D_φ(x_i) - (1/N_{D,f}) Σ log(1 - D_φ(G_θ(z_i)))
            φ ← φ - α ∇_φ L_D
        # Train G for 1 step
        Sample latent batch {z_i} ~ N(0,I), size N_G
        L_G = -(1/N_G) Σ log D_φ(G_θ(z_i))   # practical generator loss
        θ ← θ - β ∇_θ L_G
Two stochastic gradient descents on opposite objectives, alternated. Backprop through $D_\phi$ provides the gradient signal that flows back into $G_\theta$ — the generator’s output is just an activation map fed into $D_\phi$, so the chain rule pushes gradients all the way back to $\theta$ via $D_\phi$’s layers.
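The same loop in runnable PyTorch, reusing `G`, `D`, and `LATENT_DIM` from the architecture sketch; `dataloader` is an assumed iterable of real image batches, and Adam with these learning rates is a conventional choice rather than part of the algorithm:

```python
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
K = 1                                      # D steps per G step

for real in dataloader:                    # real: (N, IMG_DIM) batch from p_data
    # --- Train D for K steps ---
    for _ in range(K):
        z = torch.randn(real.size(0), LATENT_DIM)
        d_real = D(real)
        d_fake = D(G(z).detach())          # detach freezes theta during the D step
        loss_D = -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Train G for 1 step (practical loss) ---
    z = torch.randn(real.size(0), LATENT_DIM)
    loss_G = -torch.log(D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```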
Why this works — the Jensen-Shannon connection
Goodfellow’s 2014 paper proved a theoretical guarantee:
Optimal discriminator. For fixed generator parameters $\theta$ and assuming $D$ has infinite capacity, the optimal discriminator at every point $x$ is

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}$$

Substituting this back into the min-max objective and simplifying yields:

$$\max_\phi \Bigl( \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D_\phi(x)\bigr] + \mathbb{E}_{z \sim p(z)}\bigl[\log\bigl(1 - D_\phi(G_\theta(z))\bigr)\bigr] \Bigr) = 2\,\mathrm{JSD}\bigl(p_{\text{data}} \,\|\, p_\theta\bigr) - \log 4$$

where $\mathrm{JSD}$ is the Jensen-Shannon divergence — a symmetric measure of distance between two distributions, with $\mathrm{JSD}(p \,\|\, q) = 0$ if and only if $p = q$.
Conclusion. If $D$ is perfectly optimised at every step before $G$ is updated, then minimising the objective over $\theta$ minimises $\mathrm{JSD}(p_{\text{data}} \,\|\, p_\theta)$ — driving the model distribution to exactly match the data distribution.
Caveats: in practice, $D$ is not perfectly optimised at every step (we run only $K$ SGD updates), $D_\phi$ doesn’t have infinite capacity, and we use the practical generator loss instead of the theoretical one. So the theoretical guarantee is loose. But it provides intuition for why the adversarial setup works at all — there’s a real divergence being minimised behind the scenes.
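The identity itself is easy to check numerically where both densities are available in closed form, e.g. for two small discrete distributions (a sketch; the distributions are arbitrary examples):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])       # stand-in for p_data
q = np.array([0.2, 0.2, 0.6])       # stand-in for p_theta
m = 0.5 * (p + q)                   # mixture used by the JSD

def kl(a, b):
    """Kullback-Leibler divergence KL(a || b) for discrete distributions."""
    return float(np.sum(a * np.log(a / b)))

jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)

d_star = p / (p + q)                # optimal discriminator from the formula above
objective = float(np.sum(p * np.log(d_star)) + np.sum(q * np.log(1 - d_star)))

print(objective, 2 * jsd - np.log(4))   # the two numbers coincide
```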
Inference: just keep the generator
After training, the discriminator is thrown away. The trained generator alone defines the model distribution $p_\theta$. To generate a new sample:
- Draw $z \sim \mathcal{N}(0, I)$.
- Compute $x = G_\theta(z)$.
That’s it. The same generator can produce unlimited new samples by drawing fresh $z$ vectors. No labels, no encoder, no input image.
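In code, that is the entirety of inference (reusing the illustrative `G` and `LATENT_DIM`):

```python
with torch.no_grad():                  # no gradients needed at inference
    z = torch.randn(64, LATENT_DIM)    # 64 fresh latent vectors
    samples = G(z)                     # 64 novel samples; D is never touched
```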
Variants worth knowing
- DCGAN (Radford et al. 2016) — the first practical convolutional GAN. Generator uses transposed convolutions / upsampling (analogous to a U-Net decoder, but starting from $z$ and growing to a full image). Discriminator is a standard CNN classifier. DCGAN-style architectures are still the default for many image-generation tasks.
- Latent interpolation in DCGAN — sampling $G_\theta\bigl((1-\lambda) z_1 + \lambda z_2\bigr)$ for $\lambda \in [0, 1]$ produces a smooth morph between $G_\theta(z_1)$ and $G_\theta(z_2)$; see the sketch after this list. Empirically observed; no theoretical guarantee that the latent space is smooth, but it usually is.
- BigGAN (Brock et al. 2019) — scaled-up DCGAN-style architecture trained on large datasets at high resolution; produced the first photorealistic class-conditional ImageNet generation.
- StyleGAN (Karras et al.) — the architecture behind thispersondoesnotexist.com; introduces style mixing and disentangled latent control.
- Conditional GANs (cGAN / pix2pix) — generate $x$ conditioned on an input $y$, instead of unconditionally. See conditional-generative-model.
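The latent-interpolation trick mentioned above, as a sketch (again reusing the illustrative `G` and `LATENT_DIM`):

```python
z1, z2 = torch.randn(LATENT_DIM), torch.randn(LATENT_DIM)
with torch.no_grad():
    frames = [G((1 - lam) * z1 + lam * z2)          # move along a line in z-space
              for lam in torch.linspace(0, 1, 8)]   # 8 steps from z1 to z2
# decoding each intermediate z typically yields a smooth visual morph
```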
Common pitfalls
- Mode collapse. $G$ discovers one or a few outputs that consistently fool $D$ and starts producing only those — generating a perfect “7” every time, ignoring all other digits. Symptom: low diversity in samples. Defences: minibatch discrimination, Wasserstein loss, careful architecture choice.
- Discriminator dominance. If $D$ becomes near-perfect early, $D_\phi(G_\theta(z)) \approx 0$ for all $z$, the theoretical generator loss saturates, and $G$ can’t learn. Defences: keep $D$ slightly weaker (lower learning rate, fewer updates per step), use the practical loss not the theoretical one, balance capacity.
- Training instability. The min-max equilibrium is fragile; small changes to architecture, learning rate, or batch size can derail training entirely. GAN training is notoriously empirical — DCGAN’s main contribution was a recipe of architectural choices (no fully-connected layers, batch norm everywhere, LeakyReLU in $D$) that consistently train.
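A sketch of what that DCGAN-style discriminator recipe looks like, assuming 64×64 RGB inputs (channel counts are illustrative): strided convolutions do the downsampling instead of pooling or fully-connected layers, batch norm sits on the hidden blocks, and LeakyReLU is used throughout:

```python
import torch.nn as nn

# DCGAN-style discriminator for (N, 3, 64, 64) inputs; channel sizes illustrative
D_dcgan = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),   # 64x64 -> 32x32
    nn.Conv2d(64, 128, 4, stride=2, padding=1),
    nn.BatchNorm2d(128), nn.LeakyReLU(0.2),                        # 32x32 -> 16x16
    nn.Conv2d(128, 256, 4, stride=2, padding=1),
    nn.BatchNorm2d(256), nn.LeakyReLU(0.2),                        # 16x16 -> 8x8
    nn.Conv2d(256, 1, 8), nn.Sigmoid(),                            # 8x8 -> 1x1 score
    nn.Flatten(),                                                  # (N, 1) probability
)
```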
The discriminator is trained to maximise $\mathbb{E}_{x \sim p_{\text{data}}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_\phi(G_\theta(z)))]$. Why is this just binary cross-entropy in disguise?
Because the two terms together are the negative log-likelihood of the labels under the discriminator’s predictions — exactly binary-cross-entropy. Real samples have label $y = 1$; for them the BCE term is $-\log D_\phi(x)$. Fake samples have label $y = 0$; for them the BCE term is $-\log(1 - D_\phi(G_\theta(z)))$. Sum and negate: minimising BCE = maximising the objective above. The “min-max” framing of GANs is just standard supervised binary classification on the discriminator’s side, with the twist that the data distribution it’s classifying against ($p_\theta$) is itself being learned by the generator.
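The equivalence in code (a sketch; `F.binary_cross_entropy` computes exactly the mean negative log-likelihood described above):

```python
import torch
import torch.nn.functional as F

d_real = torch.rand(8, 1) * 0.98 + 0.01   # stand-in D outputs on a real batch
d_fake = torch.rand(8, 1) * 0.98 + 0.01   # stand-in D outputs on a fake batch

# Hand-written GAN discriminator loss...
loss_manual = -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())
# ...equals BCE with label 1 on the real batch and label 0 on the fake batch
loss_bce = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
print(loss_manual.item(), loss_bce.item())  # identical up to float error
```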
A friend trains a GAN and notices that when the generator is at its worst (early training, $D_\phi(G_\theta(z)) \approx 0$), the gradients on $\theta$ are tiny. What loss function are they probably using, and how should they fix it?
They are probably using the theoretical generator loss $\mathbb{E}_z[\log(1 - D_\phi(G_\theta(z)))]$, which is flat near $D_\phi(G_\theta(z)) = 0$ — exactly where training starts. The fix is to switch to the practical generator loss $-\mathbb{E}_z[\log D_\phi(G_\theta(z))]$, which has large gradients when $D_\phi(G_\theta(z))$ is small. Both losses share the same minimum (push $D_\phi(G_\theta(z)) \to 1$); they differ only in the gradient profile during training. The practical form gives strong learning signal early when the generator is bad, and saturates softly when the generator is good — exactly the desired schedule.
After GAN training is complete, why don't we use the discriminator for anything?
Because its job was to provide a training signal for the generator, not to be useful in itself. At the equilibrium of training, $D_\phi(x) \approx \tfrac{1}{2}$ for both real and fake inputs — the discriminator is no longer informative; it’s been trained into a corner where it can’t tell the two apart. Even slightly before equilibrium, $D$’s purpose is supervision, not classification of new inputs (we already have plenty of real data; we don’t need a network to confirm it’s real). All the value is in $G_\theta$, which knows how to map noise to plausible samples. So at inference: keep $G_\theta$, discard $D_\phi$.
A friend asks: "If GANs minimise Jensen-Shannon divergence between $p_\theta$ and $p_{\text{data}}$, why don't we just compute JSD and minimise it directly with gradient descent?"
Because computing JSD directly requires evaluating both $p_{\text{data}}$ and $p_\theta$ at arbitrary points $x$ — and we have neither of those in closed form. $p_{\text{data}}$ is the unknown true distribution we only see samples from; $p_\theta$ is implicitly defined by the generator (the push-forward of a Gaussian through $G_\theta$, which we can sample from but not evaluate). The genius of the GAN setup is that adversarial training optimises JSD without ever computing it — the discriminator implicitly estimates the ratio $p_{\text{data}}(x) / (p_{\text{data}}(x) + p_\theta(x))$, and the generator’s loss reduces to JSD when $D$ is optimal. Implicit density estimation is the only tractable approach when explicit densities aren’t available.
Connections
- Built on binary-cross-entropy — the discriminator loss is BCE; the practical generator loss is half of it.
- Built on sigmoid function — the discriminator’s final activation produces $D_\phi(x) \in (0, 1)$.
- Built on gradient descent and backpropagation — gradients flow from $\mathcal{L}_G$ through $D_\phi$ back into $G_\theta$ (the generator’s output is just another activation map that $D$ consumes).
- Built on convolutional-neural-network and upsampling — DCGAN’s generator uses transposed convolutions to grow $z$ into a full image; its discriminator is a standard CNN.
- Family member of generative-model — the implicit-density-estimation branch.
- Extended by conditional-generative-model — adds a condition to control what gets generated; the basis of pix2pix, image-to-image translation, super-resolution, and class-conditional generation.
- Latent vector $z$ in a GAN is not an encoded representation — it’s a sample from a fixed prior. Contrast with latent-representation in autoencoders / SimCLR where $z$ comes from an encoder.
- Compared to autoencoder — both have a “decoder-shaped” network, but the AE’s decoder reconstructs a specific input via the bottleneck while the GAN’s generator transforms random noise into novel samples. Different goals, different training signals.