A GAN is a forger and a detective locked in a room. The forger (the generator $G$) makes counterfeit notes from random inspiration. The detective (the discriminator $D$) inspects each note and decides “real or fake?” The forger improves by studying which fakes get caught; the detective improves by seeing more fakes. At equilibrium, the forger’s counterfeits are indistinguishable from real currency — the discriminator can do no better than coin-flip — and the detective is no longer useful. Throw the detective away. The forger is the deliverable.

Architecture

Two networks, trained jointly:

  • Generator $G_\theta$ with parameters $\theta$. Takes a sample $z$ from a fixed prior $p(z)$ (typically $\mathcal{N}(0, I)$) and produces a sample $G_\theta(z)$ in image space.
  • Discriminator $D_\phi$ with parameters $\phi$. Takes an input $x$ and outputs a scalar probability $D_\phi(x) \in (0, 1)$ that $x$ is real (drawn from $p_{\text{data}}$) rather than fake (drawn from $p_\theta$). Final layer is a sigmoid; it is just a binary classifier.

The model distribution $p_\theta$ is the push-forward of the prior $p(z)$ through $G_\theta$ — implicit, never written down. To sample: draw $z \sim \mathcal{N}(0, I)$, return $G_\theta(z)$.
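In code, the sampling recipe is two lines. A minimal sketch in plain Python, with a toy affine map standing in for a trained network:

```python
import random

# `g_theta` is a hypothetical stand-in for a trained generator network.
# Any deterministic map from latent space to image space defines a model
# distribution by push-forward, even though its density p_theta is never
# written down explicitly.
def g_theta(z):
    return [2.0 * zi + 1.0 for zi in z]  # placeholder, not a real network

def sample(latent_dim=8):
    z = [random.gauss(0.0, 1.0) for _ in range(latent_dim)]  # z ~ N(0, I)
    return g_theta(z)                                        # x = G_theta(z)
```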

ASIDE — Forger / detective intuition

Goodfellow’s original framing: imagine a counterfeiter trying to print fake currency. They start with random inspiration ($z$) and produce fakes ($G_\theta(z)$). A detective inspects every note $x$ (real or fake) and announces “$D_\phi(x) =$ probability this is real.” The counterfeiter studies which fakes the detective catches and refines their craft; the detective sees more fakes and gets better at spotting them. The objective is equilibrium: when the forger’s fakes are indistinguishable from real notes, the detective can do no better than 50/50 guessing. At that point, the forger’s distribution matches the real one.

The discriminator’s loss

The discriminator solves a standard binary classification problem: label 1 for real samples ($x \sim p_{\text{data}}$), label 0 for fakes ($x \sim p_\theta$, equivalently $G_\theta(z)$ for $z \sim p(z)$). Standard binary cross-entropy, written with the sign flipped to a minimisation:

$$\mathcal{L}_D(\phi) = -\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_\phi(x)\big] \;-\; \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big]$$

The two terms are independent losses on the two batches: real samples (label 1) and fake samples (label 0). In practice, with mini-batches of $N_{D,r}$ real samples and $N_{D,f}$ fake samples:

$$\mathcal{L}_D = -\frac{1}{N_{D,r}} \sum_{i=1}^{N_{D,r}} \log D_\phi(x_i) \;-\; \frac{1}{N_{D,f}} \sum_{i=1}^{N_{D,f}} \log\big(1 - D_\phi(G_\theta(z_i))\big)$$

For fixed generator parameters $\theta$, the discriminator is updated by SGD on $\phi$ to minimise this.
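The two-term minibatch loss is a direct transcription into plain Python, assuming $D$’s outputs on each batch are already computed:

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Minibatch BCE for the discriminator.

    d_real: D_phi(x_i) for real samples (target label 1)
    d_fake: D_phi(G_theta(z_i)) for generated samples (target label 0)
    """
    real_term = -sum(math.log(d + eps) for d in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - d + eps) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator gets a small loss...
low = discriminator_loss([0.9, 0.95], [0.1, 0.05])
# ...one that outputs 0.5 everywhere gets log 4 ~ 1.386, the coin-flip baseline.
coin = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

The coin-flip value $\log 4$ is the same constant that reappears in the Jensen-Shannon analysis later.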

The generator’s loss — theoretical and practical

The generator’s job is to fool the discriminator. The natural objective is the negative of $D$’s loss on the fake batch (since $G$ doesn’t see real data, the real-data term is irrelevant — its gradient w.r.t. $\theta$ is zero).

Theoretical loss (matches the min-max derivation):

$$\mathcal{L}_G^{\text{theory}}(\theta) = \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big]$$

The generator wants to minimise this (push $D_\phi(G_\theta(z)) \to 1$, fooling the detective). The discriminator wants to maximise the same quantity. This is the two-player min-max game:

$$\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_\phi(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big]$$

The saturating-gradient problem

The theoretical loss has a fatal practical flaw early in training. When the generator is bad (its samples are obvious fakes), $D_\phi(G_\theta(z)) \approx 0$, so $\log(1 - D_\phi(G_\theta(z))) \approx \log 1 = 0$ — the flat region of the $\log(1 - u)$ curve near $u = 0$. The gradient is small exactly when the generator is worst and most needs a strong learning signal. Conversely, when the generator is excellent and $D_\phi(G_\theta(z)) \approx 1$, the gradient is huge — even though training has nearly converged. Backwards from what we want.

Practical loss (the fix, used in real implementations):

$$\mathcal{L}_G^{\text{practical}}(\theta) = -\,\mathbb{E}_{z \sim p(z)}\big[\log D_\phi(G_\theta(z))\big]$$

Same goal — push $D_\phi(G_\theta(z)) \to 1$ — but the gradient is large when $D_\phi(G_\theta(z))$ is small (training start) and saturates as $D_\phi(G_\theta(z)) \to 1$ (training end). Exactly the right shape.

The two losses optimise the same equilibrium ($p_\theta = p_{\text{data}}$) but with different gradient profiles. The practical loss is what GAN implementations actually use.
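The opposite saturation profiles are easy to verify numerically. Writing $u = D_\phi(G_\theta(z))$, the theoretical loss $\log(1-u)$ has $|dL/du| = 1/(1-u)$ while the practical loss $-\log u$ has $|dL/du| = 1/u$:

```python
# Gradient magnitudes of the two generator losses w.r.t. u = D_phi(G_theta(z)).
# Theoretical loss  log(1 - u):  |dL/du| = 1 / (1 - u)  -> tiny when u ~ 0
# Practical loss   -log(u):      |dL/du| = 1 / u        -> tiny when u ~ 1
for u in (0.01, 0.99):  # bad generator vs. near-converged generator
    g_theory = 1.0 / (1.0 - u)
    g_practical = 1.0 / u
    print(f"u={u}: theoretical |grad|={g_theory:.2f}, "
          f"practical |grad|={g_practical:.2f}")
```

At $u = 0.01$ the theoretical gradient is ~1 while the practical one is 100; at $u = 0.99$ the roles swap.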

TIP — How to remember the loss flip

Both forms of the generator loss are trying to fool $D$ into outputting “1” on fakes. The theoretical form is “minimise the log-probability of being caught” — saturates badly when the model is bad. The practical form is “maximise the log-probability of fooling the detective” — saturates badly when the model is good (which is fine — at that point you’re done). Same equilibrium, opposite saturation behaviour.

The training algorithm

Outer loop iterates many epochs. Inner step alternates D and G updates:

Initialize θ, φ randomly
for t = 1 ... T:
    # Train D for K steps (often K=1)
    for k = 1 ... K:
        Sample real batch {x_i} ~ p_data, size N_{D,r}
        Sample latent batch {z_i} ~ N(0,I), size N_{D,f}
        L_D = -(1/N_{D,r}) Σ log D_φ(x_i) - (1/N_{D,f}) Σ log(1 - D_φ(G_θ(z_i)))
        φ ← φ - α ∇_φ L_D
    # Train G for 1 step
    Sample latent batch {z_i} ~ N(0,I), size N_G
    L_G = -(1/N_G) Σ log D_φ(G_θ(z_i))     # practical generator loss
    θ ← θ - β ∇_θ L_G

Two stochastic gradient descents on opposite objectives, alternated. Backprop through $D_\phi$ provides the gradient signal that flows back into $G_\theta$ — the generator’s output $G_\theta(z)$ is just an activation map fed into $D_\phi$, so the chain rule pushes gradients all the way back to $\theta$ via $\nabla_\theta D_\phi(G_\theta(z))$.
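The alternating loop can be made concrete with a toy 1-D GAN — hand-derived gradients, a logistic-regression discriminator, an affine generator. Everything beyond the algorithm itself (the data distribution, parameterisation, learning rates) is an illustrative assumption:

```python
import math
import random

random.seed(0)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Real data: N(4, 1). Generator: G(z) = t0 + t1*z. Discriminator: sigmoid(a*x + b).
t0, t1 = 0.0, 1.0            # generator parameters (theta)
a, b = 0.0, 0.0              # discriminator parameters (phi)
alpha, beta, N = 0.05, 0.05, 64

for step in range(2000):
    # --- Train D one step: minimise BCE on a real batch and a fake batch ---
    real = [random.gauss(4.0, 1.0) for _ in range(N)]
    fake = [t0 + t1 * random.gauss(0.0, 1.0) for _ in range(N)]
    ga = gb = 0.0
    for x in real:                       # d(BCE)/d(logit) = D(x) - label
        e = sigmoid(a * x + b) - 1.0     # label 1
        ga += e * x / N; gb += e / N
    for x in fake:
        e = sigmoid(a * x + b)           # label 0
        ga += e * x / N; gb += e / N
    a -= alpha * ga; b -= alpha * gb
    # --- Train G one step: practical loss -log D(G(z)) ---
    gt0 = gt1 = 0.0
    for _ in range(N):
        z = random.gauss(0.0, 1.0)
        e = sigmoid(a * (t0 + t1 * z) + b) - 1.0  # d(-log D)/d(logit)
        gt0 += e * a / N; gt1 += e * a * z / N
    t0 -= beta * gt0; t1 -= beta * gt1
```

With these settings the generator mean $t_0$ drifts from 0 toward the data mean of 4, and the discriminator’s advantage shrinks as the two distributions overlap.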

Why this works — the Jensen-Shannon connection

Goodfellow’s 2014 paper proved a theoretical guarantee:

Optimal discriminator. For fixed generator parameters $\theta$ and assuming $D$ has infinite capacity, the optimal discriminator at every point $x$ is

$$D^{*}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}$$

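This follows from maximising the objective pointwise in $x$ — a short derivation, sketched along the lines of the 2014 paper’s argument:

```latex
% For fixed G, the value of the game is an integral over x:
V(D) = \int_x \Big[\, p_{\text{data}}(x)\,\log D(x)
       \;+\; p_\theta(x)\,\log\!\big(1 - D(x)\big) \Big]\, dx

% Maximise the integrand at each x separately. With constants
% (a, b) = (p_data(x), p_theta(x)), the map y -> a log y + b log(1 - y)
% on (0, 1) is concave; setting its derivative to zero:
\frac{a}{y} - \frac{b}{1 - y} = 0
\quad\Longrightarrow\quad
y^{*} = \frac{a}{a + b}
      = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}
      = D^{*}(x)
```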
Substituting this back into the min-max objective and simplifying yields:

$$\max_\phi \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_\phi(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big] \;=\; 2\,\mathrm{JSD}\big(p_{\text{data}} \,\|\, p_\theta\big) - \log 4$$

where $\mathrm{JSD}$ is the Jensen-Shannon divergence — a symmetric measure of distance between two distributions, with $\mathrm{JSD}(p \,\|\, q) \ge 0$ and $\mathrm{JSD}(p \,\|\, q) = 0$ iff $p = q$.

Conclusion. If $D$ is perfectly optimised at every step before $G$ is updated, then minimising over $\theta$ minimises $\mathrm{JSD}(p_{\text{data}} \,\|\, p_\theta)$ — driving the model distribution to exactly match the data distribution.

Caveats: in practice, $D$ is not perfectly optimised at every step (we run only $K$ SGD updates), $D$ doesn’t have infinite capacity, and we use the practical generator loss instead of the theoretical one. So the theoretical guarantee is loose. But it provides intuition for why the adversarial setup works at all — there’s a real divergence being minimised behind the scenes.
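A quick numeric sanity check of the JSD properties quoted above, using discrete distributions in plain Python for simplicity:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence for discrete distributions (0 log 0 = 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: symmetrised KL against the midpoint m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
same = jsd(p, p)      # 0.0 — identical distributions
diff = jsd(p, q)      # positive; JSD is bounded above by log 2
```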

Inference: just keep the generator

After training, the discriminator is thrown away. The trained generator alone defines the model distribution. To generate a new sample:

  1. Draw $z \sim \mathcal{N}(0, I)$.
  2. Compute $x = G_\theta(z)$.

That’s it. The same generator can produce unlimited new samples by drawing fresh $z$ vectors. No labels, no encoder, no input image.

Variants worth knowing

  • DCGAN (Radford et al. 2016) — the first practical convolutional GAN. Generator uses transposed convolutions / upsampling (analogous to a U-Net decoder, but starting from $z$ and growing to a full image). Discriminator is a standard CNN classifier. DCGAN-style architectures are still the default for many image-generation tasks.
  • Latent interpolation in DCGAN — sampling $G_\theta\big((1 - t)\,z_1 + t\,z_2\big)$ for $t \in [0, 1]$ produces a smooth morph between $G_\theta(z_1)$ and $G_\theta(z_2)$. Empirically observed; no theoretical guarantee that the latent space is smooth, but it usually is.
  • BigGAN (Brock et al. 2019) — scaled-up DCGAN-style architecture trained on large datasets at high resolution; produced the first photorealistic class-conditional ImageNet generation.
  • StyleGAN (Karras et al.) — the architecture behind thispersondoesnotexist.com; introduces style mixing and disentangled latent control.
  • Conditional GANs (cGAN / pix2pix) — generate $x$ conditioned on an input $c$, instead of unconditionally. See conditional-generative-model.
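The latent-interpolation trick from the list above is just linear interpolation in $z$-space before the generator is applied. A sketch (the generator `g` here is a toy stand-in, not a real network):

```python
import random

def lerp(z1, z2, t):
    # Linear interpolation in latent space: (1 - t) * z1 + t * z2.
    return [(1 - t) * u + t * v for u, v in zip(z1, z2)]

def morph(g, z1, z2, steps=8):
    # Decode evenly spaced points on the segment between z1 and z2.
    return [g(lerp(z1, z2, k / (steps - 1))) for k in range(steps)]

# Hypothetical generator for illustration only:
g = lambda z: [round(v, 3) for v in z]
z1 = [random.gauss(0, 1) for _ in range(4)]
z2 = [random.gauss(0, 1) for _ in range(4)]
frames = morph(g, z1, z2)   # frames[0] = g(z1), frames[-1] = g(z2)
```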

Common pitfalls

  • Mode collapse. $G$ discovers one or a few outputs that consistently fool $D$ and starts producing only those — generating a perfect “7” every time, ignoring all other digits. Symptom: low diversity in samples. Defences: minibatch discrimination, Wasserstein loss, careful architecture choice.
  • Discriminator dominance. If $D$ becomes near-perfect early, $D_\phi(G_\theta(z)) \approx 0$ for all $z$, the theoretical generator loss saturates, and $G$ can’t learn. Defences: keep $D$ slightly weaker (lower learning rate, fewer updates per step), use the practical loss not the theoretical one, balance capacity.
  • Training instability. The min-max equilibrium is fragile; small changes to architecture, learning rate, or batch size can derail training entirely. GAN training is notoriously empirical — DCGAN’s main contribution was a recipe of architectural choices (no fully-connected layers, batch norm everywhere, LeakyReLU in $D$) that consistently train.

Connections

  • Built on binary-cross-entropy — the discriminator loss is BCE; the practical generator loss is BCE on the fake batch with the label flipped to 1.
  • Built on sigmoid function — the discriminator’s final activation produces $D_\phi(x) \in (0, 1)$.
  • Built on gradient descent and backpropagation — gradients flow from the loss through $D_\phi$ back into $G_\theta$ (the generator’s output is just another activation map that $D_\phi$ consumes).
  • Built on convolutional-neural-network and upsampling — DCGAN’s generator uses transposed convolutions to grow $z$ into a full image; its discriminator is a standard CNN.
  • Family member of generative-model — the implicit-density-estimation branch.
  • Extended by conditional-generative-model — adds a condition to control what gets generated; the basis of pix2pix, image-to-image translation, super-resolution, and class-conditional generation.
  • Latent vector $z$ in a GAN is not an encoded representation — it’s a sample from a fixed prior. Contrast with latent-representation in autoencoders / SimCLR, where $z$ comes from an encoder.
  • Compared to autoencoder — both have a “decoder-shaped” network, but the AE’s decoder reconstructs a specific input via the bottleneck while the GAN’s generator transforms random noise into novel samples. Different goals, different training signals.