Unconditional generative models produce a sample from nothing — “draw any face.” Conditional generative models produce a sample tied to a specific input — “colourise this greyscale photo of a cat.” The shift is small in architecture (feed the condition into both generator and discriminator) but huge in capability: every interesting image-to-image task — colourisation, super-resolution, semantic-map-to-photo, sketch-to-photo, inpainting — plus text-to-image is a conditional generative problem. The probabilistic framing is exactly Bayesian posterior estimation.

The posterior framing

An unconditional generative model learns the marginal $p(x)$.

A conditional generative model learns the posterior

$$p(x \mid c)$$

— the distribution over targets $x$ given a condition $c$. To generate: input a condition $c$, sample $x \sim p(x \mid c)$.

This is direct Bayesian inference parameterised by a neural network. The condition $c$ is the observation; $x$ is the latent quantity we want to infer; the network learns the posterior.

Tasks that fit this framing

| Task | Condition $c$ | Target $x$ |
|---|---|---|
| Image colourisation | Greyscale image | Colour image |
| Super-resolution | Low-res image | High-res image |
| Inpainting | Image with masked region | Filled image |
| Semantic-map → photo | Class-coloured segmentation map | Realistic photo |
| Sketch → photo | Outline drawing | Realistic photo |
| Denoising | Noisy image | Clean image |
| Text-to-image | Caption (string) | Generated image |
| Class-conditional generation | Class label (one-hot) | Image of that class |
| Image-to-segmentation | Photo | Posterior over segmentations |

In every case, multiple $x$’s are plausible for a given $c$. A greyscale photo of a flower could be coloured red or yellow or pink. A low-res image has many compatible high-res reconstructions. The posterior captures this multimodality; the task is to sample from it, not to produce a single point estimate.
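A toy NumPy sketch of what “multimodal posterior” means in practice: the flower example above reduced to a single hue value per sample. The two modes and their widths are invented numbers, not from any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior p(x | c) for one fixed condition: a greyscale flower whose
# plausible hues are bimodal (red-ish near 0.0, yellow-ish near 1.0).
def sample_posterior(n=5):
    modes = rng.integers(2, size=n)             # pick one mode per sample
    return rng.normal(modes.astype(float), 0.05)

print(sample_posterior())  # each draw is ONE plausible hue, near 0.0 or 1.0
```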

Why regression fails (the blurry-output problem)

The most natural-seeming approach: train a regression network to map $c$ to $x$, with MSE loss between the predicted $\hat{x}$ and the ground-truth $x$.

This systematically fails for tasks with multimodal posteriors. The reason is mathematical: MSE regression converges to the conditional mean,

$$\hat{x}(c) = \mathbb{E}[x \mid c]$$

— the average of all plausible $x$’s for a given $c$. When the posterior has several modes (e.g. flower-could-be-red OR flower-could-be-yellow), the mean lands between them — a washed-out grey-pink that’s neither, with low saturation and blurred edges.
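A minimal numerical check of this claim, using the same invented bimodal setup as the sketch above: for a fixed condition, the constant output that minimises MSE is the average of the two modes, matching neither:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal targets for one fixed condition: red (hue ~0.0) in half the
# training pairs, yellow (hue ~1.0) in the other half (made-up numbers).
y = np.concatenate([rng.normal(0.0, 0.05, 500),
                    rng.normal(1.0, 0.05, 500)])

# With a fixed input, MSE regression collapses to a single constant output;
# scan candidate constants to find the MSE minimiser.
candidates = np.linspace(-0.5, 1.5, 2001)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
print(candidates[mse.argmin()], y.mean())  # both ~0.5: between the modes
```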

ASIDE — Worked example from the week-07 problem set

Train a regressor to map a digit label $c$ to a pixel image of that digit. The training set has multiple distinct hand-drawn examples per digit. After training to optimum (minimum MSE), what’s the network’s output for $c = 1$? The pixelwise average of all training images of “1” — fractional pixel values that appear in no individual training image, looking nothing like any specific “1” but compromising between all of them. Same for every other digit. The MMSE-optimal output is a blur of every plausible target, because that minimises the squared distance to all of them simultaneously. A conditional generative model picks one plausible $x$ instead of averaging.
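The same effect, sketched with synthetic stand-in data (hypothetical: a vertical stroke whose position varies across “writers”; real handwritten digits behave the same way). The pixelwise average is faint and smeared:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "all training images of the digit 1": the same vertical
# stroke, drawn at a slightly different horizontal position each time.
ones = np.zeros((100, 28, 28))
for img in ones:
    col = rng.integers(10, 18)
    img[4:24, col:col + 2] = 1.0    # binary pixels within each example

mmse_output = ones.mean(axis=0)     # what the MSE-optimal network emits
print(mmse_output.max())            # ~0.25: a faint smear, not a crisp stroke
```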

The fix: use a generative approach. Sample from the posterior instead of averaging over it. Every sample is a single plausible $x$; running inference multiple times produces diverse outputs (different colourings, different reconstructions), each individually realistic.

Conditional GAN (cGAN)

The most direct extension of the GAN framework to the conditional setting (Mirza & Osindero 2014; pix2pix by Isola et al. 2017 made it practical for image-to-image translation):

  • Generator $G$ — takes both a latent noise vector $z$ and the condition $c$, produces $\hat{x} = G(z, c)$.
  • Discriminator $D$ — takes both the condition $c$ and the candidate $x$ (real or fake), outputs the probability of $(c, x)$ being a real pair.

The training pairs are $(c, x)$ from the dataset (real positive examples) and $(c, G(z, c))$ (fake examples sharing the same $c$). Critically, the discriminator sees the condition, not just the candidate target — it learns to detect mismatched pairs (e.g. a sketch of a shoe paired with a generated handbag) as fake.

Loss is the standard GAN objective with everything conditioned on $c$:

$$\min_G \max_D \;\; \mathbb{E}_{(c,x)}\big[\log D(c, x)\big] + \mathbb{E}_{c,\, z}\big[\log\big(1 - D(c, G(z, c))\big)\big]$$

After training, generate by inputting a condition $c$ and sampling a fresh $z$ for each desired output. Different $z$’s give different plausible $x$’s — capturing the posterior’s multimodality.
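A minimal PyTorch sketch of one training step under this objective. `G(z, c)` and `D(c, x)` are assumed to be defined elsewhere with these signatures and to output logits; this is illustrative, not a reference implementation:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def cgan_step(G, D, opt_G, opt_D, c, x_real, z_dim=64):
    z = torch.randn(x_real.size(0), z_dim)
    x_fake = G(z, c)

    # Discriminator: real (c, x) pairs -> 1, fake (c, G(z, c)) pairs -> 0.
    d_real = D(c, x_real)
    d_fake = D(c, x_fake.detach())          # don't backprop into G here
    loss_D = bce(d_real, torch.ones_like(d_real)) \
           + bce(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator: make D label the fake pair as real, for the same condition.
    d_fake = D(c, x_fake)
    loss_G = bce(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```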

pix2pix (Isola et al. 2017)

A specific cGAN architecture for paired image-to-image translation:

  • Generator is a U-Net encoder-decoder taking the condition image as input.
  • Discriminator is a “PatchGAN” — classifies overlapping image patches as real-or-fake rather than the whole image, encouraging local texture realism.
  • Loss combines the cGAN adversarial loss with an additional L1 reconstruction loss between $G(z, c)$ and the ground truth $x$. The L1 term keeps the output close to the target on average; the GAN term ensures the output looks real (sharp, plausible textures), not just average-close.

Pix2pix established the template for paired image-to-image translation: paired training data, U-Net generator, PatchGAN discriminator, hybrid L1+GAN loss. Later work (CycleGAN, etc.) extended this to unpaired translation.
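A sketch of a PatchGAN-style discriminator, assuming 3-channel condition and candidate images and logit outputs. Channel widths follow the paper’s C64-C128-C256-C512 convention, but its normalisation layers are omitted here for brevity:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Condition and candidate are concatenated along the channel axis;
    the output is an N x N grid of per-patch real/fake logits rather
    than a single scalar for the whole image."""

    def __init__(self, in_ch=6):  # e.g. 3-ch condition + 3-ch candidate
        super().__init__()
        def block(ci, co, stride):
            return [nn.Conv2d(ci, co, kernel_size=4, stride=stride, padding=1),
                    nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(
            *block(in_ch, 64, 2), *block(64, 128, 2),
            *block(128, 256, 2), *block(256, 512, 1),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # logit per patch
        )

    def forward(self, c, x):
        return self.net(torch.cat([c, x], dim=1))
```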

Why the L1 + GAN hybrid loss?

L1 alone (or L2) gives a blurry conditional-mean output — same problem as MSE regression. GAN alone produces realistic-looking but possibly off-target outputs (the generator might produce a beautiful red shoe when the input sketch demands a specific brown one). The combination:

  • L1 term anchors the output near the ground truth (correct overall colour, layout).
  • GAN term sharpens textures and details to realistic (not average) values.

The result: outputs that are both faithful to the input and crisp/realistic.
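In code, the generator side of the hybrid objective might look like the following sketch. $\lambda = 100$ is the weighting reported in the pix2pix paper; the rest reuses the assumed interfaces from the sketches above:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_loss(D, c, x_real, x_fake, lam=100.0):
    d_fake = D(c, x_fake)
    adv = bce(d_fake, torch.ones_like(d_fake))  # GAN term: look real to D
    rec = l1(x_fake, x_real)                    # L1 term: stay near the target
    return adv + lam * rec
```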

Conditional models naturally capture uncertainty

A nice feature of the generative framing: because $\hat{x} = G(z, c)$ depends on the random $z$, running the generator multiple times on the same condition $c$ produces different outputs $\hat{x}^{(1)}, \hat{x}^{(2)}, \dots$. The variation across these samples is an estimate of the posterior’s uncertainty.

In a microscopy-denoising example (Prakash et al. 2020), a conditional generative model trained to denoise low-photon images produces outputs where:

  • High-confidence structures (real cell boundaries) are consistent across samples — the posterior is sharply peaked there.
  • Low-confidence regions (background noise) vary across samples — the posterior is broad there.

Averaging the samples gives a smoother estimate; the variance gives an uncertainty map. This is impossible with a plain MSE-regression network, which collapses both into a single output.
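A sketch of how such a sample mean and uncertainty map could be computed, reusing the assumed `G(z, c)` interface from above:

```python
import torch

@torch.no_grad()
def posterior_summary(G, c, n=32, z_dim=64):
    """Sample the learned posterior n times for one condition c of shape
    (1, C, H, W) and summarise the draws."""
    z = torch.randn(n, z_dim)
    xs = G(z, c.expand(n, *c.shape[1:]))  # n plausible outputs, same condition
    return xs.mean(dim=0), xs.std(dim=0)  # smooth estimate, per-pixel uncertainty
```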

How $z$ behaves in conditional models

The latent $z$ in a cGAN plays a different role than in unconditional GANs. It’s still sampled from a fixed prior, but in many cGAN setups (especially pix2pix) the network learns to largely ignore $z$ and treat the conditioning $c$ as the dominant input — producing nearly the same output regardless of $z$. This makes the model effectively deterministic, which is fine for tasks like colourisation where we want a single high-quality output, but undermines the diversity benefit of the generative framing.

Modern conditional generative models (conditional diffusion, conditional VAEs) handle this differently — for example, by injecting noise at multiple layers or using stronger latent regularisation — but the core principle is the same: $c$ controls what gets generated, $z$ provides the variation needed to express the posterior.

Connections

  • Built on generative-adversarial-network — cGAN is a conditional extension of the GAN framework; same min-max game, same losses, same training algorithm, with $c$ added as input to both networks.
  • Built on u-net — pix2pix’s generator is a U-Net; the encoder-decoder + skip connections architecture is well-suited to image-to-image tasks where the output is structurally similar to the input.
  • Built on bayes-theorem — conditional GMs learn $p(x \mid c)$, a parameterised posterior.
  • Family member of generative-model — the conditional branch.
  • Improvement over MSE regression for posterior estimation — regression converges to the conditional mean (blurry); generative models sample from the posterior (sharp, diverse).