Unconditional generative models produce a sample from nothing — “draw any face.” Conditional generative models produce a sample tied to a specific input — “colourise this greyscale photo of a cat.” The shift is small in architecture (feed the condition into both generator and discriminator) but huge in capability: every interesting image-to-image task — colourisation, super-resolution, semantic-map-to-photo, sketch-to-photo, inpainting, text-to-image — is a conditional generative problem. The probabilistic framing is exactly Bayesian posterior estimation.
The posterior framing
An unconditional generative model learns the marginal $p(x)$.
A conditional generative model learns the posterior $p(x \mid c)$ — the distribution over targets $x$ given a condition $c$. To generate: input a condition $c$, sample $x \sim p(x \mid c)$.
This is direct Bayesian inference parameterised by a neural network. The condition $c$ is the observation; $x$ is the latent quantity we want to infer; the network learns the posterior.
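For orientation, the Bayes'-theorem reading of that posterior (a standard identity; in practice the network learns $p(x \mid c)$ directly rather than through these factors):

```latex
% Posterior over targets x given condition c, via Bayes' theorem:
%   p(c | x): likelihood -- how consistent the condition is with a candidate target
%   p(x):     prior      -- how plausible the target is a priori
p(x \mid c) = \frac{p(c \mid x)\, p(x)}{p(c)}
```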
Tasks that fit this framing
| Task | Condition $c$ | Target $x$ |
|---|---|---|
| Image colourisation | Greyscale image | Colour image |
| Super-resolution | Low-res image | High-res image |
| Inpainting | Image with masked region | Filled image |
| Semantic-map → photo | Class-coloured segmentation mask | Realistic photo |
| Sketch → photo | Outline drawing | Realistic photo |
| Denoising | Noisy image | Clean image |
| Text-to-image | Caption (string) | Generated image |
| Class-conditional generation | Class label (one-hot) | Image of that class |
| Image-to-segmentation | Photo | Posterior over segmentations |
In every case, multiple $x$'s are plausible for a given $c$. A greyscale photo of a flower could be coloured red or yellow or pink. A low-res image has many compatible high-res reconstructions. The posterior captures this multimodality; the task is to sample from it, not to produce a single point estimate.
Why regression fails (the blurry-output problem)
The most natural-seeming approach: train a regression network to map $c$ to $x$, with MSE loss between the predicted $\hat{x}(c)$ and the ground-truth $x$.
This systematically fails for tasks with multimodal posteriors. The reason is mathematical: MSE regression converges to the conditional mean,
$$\hat{x}(c) = \mathbb{E}[x \mid c]$$
— the average of all plausible $x$'s for a given $c$. When the posterior has several modes (e.g. flower-could-be-red OR flower-could-be-yellow), the mean lands between them — a washed-out grey-pink that's neither, with low saturation and blurred edges.
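A one-line check that the MSE minimiser really is the conditional mean (a standard least-squares argument; $f$ is a hypothetical unconstrained predictor):

```latex
% For each condition c, minimise the expected squared error over f(c):
\arg\min_{f(c)} \; \mathbb{E}_{x \sim p(x \mid c)}\!\left[\lVert x - f(c) \rVert^2\right]
  = \mathbb{E}[x \mid c]
% (set the gradient w.r.t. f(c) to zero: -2\,\mathbb{E}[x - f(c)] = 0)
```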
ASIDE — Worked example from the week-07 problem set
Train a regressor to map a digit label to a pixel image of that digit. The training set has multiple distinct hand-drawn examples per digit. After training to optimum (minimum MSE), what's the network's output for the input “1”? The pixelwise average of all training images of “1” — fractional pixel values that appear nowhere in the training set, looking nothing like any specific “1” but compromising between all of them. Same for every other digit. The MMSE-optimal output is a blur of every plausible target, because that minimises the squared distance to all of them simultaneously. A conditional generative model picks one plausible $x$ instead of averaging.
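A toy numerical version of the aside, with two hypothetical 5×5 binary “1”s standing in for the problem-set data: the MSE-optimal output for the input “1” is their pixelwise mean, which contains fractional values matching neither.

```python
import numpy as np

# Two distinct hand-drawn "1"s as 5x5 binary images (illustrative, not real data).
one_a = np.array([[0, 0, 1, 0, 0],
                  [0, 1, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 1, 1, 1, 0]], dtype=float)
one_b = np.array([[0, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 0],
                  [0, 0, 0, 1, 0],
                  [0, 0, 1, 1, 1]], dtype=float)

# For a fixed input ("1"), the MSE-optimal prediction is the mean of its targets:
mmse_output = np.stack([one_a, one_b]).mean(axis=0)
print(mmse_output)  # 0.5 wherever the two strokes disagree -- a blur matching neither
```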
The fix: use a generative approach. Sample from the posterior instead of averaging over it. Every sample is a single plausible $x$; running inference multiple times produces diverse outputs (different colourings, different reconstructions), each individually realistic.
Conditional GAN (cGAN)
The most direct extension of the GAN framework to the conditional setting (Mirza & Osindero 2014; pix2pix by Isola et al. 2017 made it practical for image-to-image translation):
- Generator $G(z, c)$ — takes both a latent noise vector $z$ and the condition $c$, produces $\hat{x} = G(z, c)$.
- Discriminator $D(c, x)$ — takes both the condition $c$ and the candidate $x$ (real or fake), outputs the probability of $(c, x)$ being a real pair.
The training pairs are $(c, x)$ from the dataset (real positive examples) and $(c, G(z, c))$ (fake examples sharing the same $c$). Critically, the discriminator sees the condition, not just the candidate target — it learns to detect mismatched pairs (e.g. a sketch of a shoe paired with a generated handbag) as fake.
Loss is the standard GAN objective with everything conditioned on $c$:
$$\min_G \max_D \;\; \mathbb{E}_{(c, x)}\big[\log D(c, x)\big] + \mathbb{E}_{c,\, z}\big[\log\big(1 - D(c, G(z, c))\big)\big]$$
After training, generate by inputting a condition $c$ and sampling a fresh $z$ for each desired output. Different $z$'s give different plausible $\hat{x}$'s — capturing the posterior's multimodality.
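A minimal PyTorch sketch of one cGAN training step, assuming `G` and `D` are modules with the signatures above and that `c` and `x` are batched tensors (the names, `z_dim`, and optimiser setup are placeholders, not a reference implementation):

```python
import torch
import torch.nn.functional as F

def cgan_step(G, D, opt_G, opt_D, c, x, z_dim=128):
    """One training step. c: condition batch, x: real target batch."""
    z = torch.randn(c.size(0), z_dim, device=c.device)  # fresh noise each step
    x_fake = G(z, c)                                    # candidate target

    # Discriminator: score real pairs (c, x) as 1, fake pairs (c, x_fake) as 0.
    opt_D.zero_grad()
    d_real = D(c, x)
    d_fake = D(c, x_fake.detach())                      # don't backprop into G here
    loss_D = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    loss_D.backward()
    opt_D.step()

    # Generator: fool D into scoring the fake pair as real.
    opt_G.zero_grad()
    loss_G = F.binary_cross_entropy_with_logits(D(c, x_fake), torch.ones_like(d_fake))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```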
pix2pix (Isola et al. 2017)
A specific cGAN architecture for paired image-to-image translation:
- Generator is a U-Net encoder-decoder taking the condition image as input.
- Discriminator is a “PatchGAN” — classifies overlapping image patches as real-or-fake rather than the whole image, encouraging local texture realism (sketched below).
- Loss combines the cGAN adversarial loss with an additional L1 reconstruction loss between $G(z, c)$ and the ground truth $x$. The L1 term keeps the output close to the target on average; the GAN term ensures the output looks real (sharp, plausible textures), not just average-close.
Pix2pix established the template for paired image-to-image translation: paired training data, U-Net generator, PatchGAN discriminator, hybrid L1+GAN loss. Later work (CycleGAN, etc.) extended this to unpaired translation.
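A sketch of a PatchGAN-style conditional discriminator; the depth and channel widths below are illustrative assumptions, not the exact 70×70 PatchGAN from the paper. The key point is that it outputs a grid of per-patch logits rather than one scalar:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Concatenate condition and candidate along channels; output an H'xW' grid
    of real/fake logits, one per receptive-field patch."""
    def __init__(self, in_ch=6, width=64):  # e.g. 3-ch condition + 3-ch candidate
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(width, width * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(width * 2),
            nn.LeakyReLU(0.2),
            nn.Conv2d(width * 2, width * 4, 4, stride=2, padding=1),
            nn.BatchNorm2d(width * 4),
            nn.LeakyReLU(0.2),
            nn.Conv2d(width * 4, 1, 4, stride=1, padding=1),  # per-patch logit map
        )

    def forward(self, c, x):
        return self.net(torch.cat([c, x], dim=1))  # (B, 1, H', W')
```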
Why the L1 + GAN hybrid loss?
L1 alone converges to the conditional median, L2 to the conditional mean — either way a blurry compromise across modes, the same problem as MSE regression. GAN alone produces realistic-looking but possibly off-target outputs (the generator might produce a beautiful red shoe when the input sketch demands a specific brown one). The combination:
- L1 term anchors the output near the ground truth (correct overall colour, layout).
- GAN term sharpens textures and details to realistic (not average) values.
The result: outputs that are both faithful to the input and crisp/realistic.
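A sketch of the hybrid generator objective, reusing the placeholder `G`/`D` interfaces from the cGAN sketch above (`lambda_l1=100` is the L1 weight reported in the pix2pix paper, but treat it as a tunable assumption):

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(G, D, c, x, z, lambda_l1=100.0):
    """Hybrid loss: GAN term for realism + L1 term for faithfulness to the target."""
    x_fake = G(z, c)
    d_fake = D(c, x_fake)
    # GAN term: push D to score the fake pair as real (sharp, plausible textures).
    loss_gan = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    # L1 term: anchor the output near the ground truth (overall colour, layout).
    loss_l1 = F.l1_loss(x_fake, x)
    return loss_gan + lambda_l1 * loss_l1
```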
Conditional models naturally capture uncertainty
A nice feature of the generative framing: because $G(z, c)$ depends on the random $z$, running the generator multiple times on the same condition $c$ produces different outputs $\hat{x}_1, \hat{x}_2, \ldots$ The variation across these samples is an estimate of the posterior's uncertainty.
In a microscopy-denoising example (Prakash et al. 2020), a conditional generative model trained to denoise low-photon images produces outputs where:
- High-confidence structures (real cell boundaries) are consistent across samples — the posterior is sharply peaked there.
- Low-confidence regions (background noise) vary across samples — the posterior is broad there.
Averaging the samples gives a smoother estimate; the variance gives an uncertainty map. This is impossible with a plain MSE-regression network, which collapses both into a single output.
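A sketch of turning repeated samples into a mean estimate and an uncertainty map, under the same `G`/`z_dim` conventions as the earlier sketches:

```python
import torch

@torch.no_grad()
def posterior_summary(G, c, n_samples=32, z_dim=128):
    """Sample n outputs for one condition batch; return pixelwise mean and std."""
    samples = torch.stack([
        G(torch.randn(c.size(0), z_dim, device=c.device), c)
        for _ in range(n_samples)
    ])                                   # (n_samples, B, C, H, W)
    return samples.mean(dim=0), samples.std(dim=0)  # smooth estimate, uncertainty map
```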
How $z$ behaves in conditional models
The latent $z$ in a cGAN plays a different role than in unconditional GANs. It's still sampled from a fixed prior, but in many cGAN setups (especially pix2pix) the network learns to largely ignore $z$ and treat the conditioning $c$ as the dominant input — producing nearly the same output regardless of $z$. This makes the model effectively deterministic, which is fine for tasks like colourisation where we want a single high-quality output, but undermines the diversity benefit of the generative framing.
Modern conditional generative models (conditional diffusion, conditional VAEs) handle this differently — for example, by injecting noise at multiple layers or using stronger latent regularisation — but the core principle is the same: $c$ controls what gets generated, $z$ provides the variation needed to express the posterior.
A friend trains an MSE-regression network to colourise greyscale flower photos. Outputs come out muted and washed-out — the reds aren't red, the yellows aren't yellow. What's happening, and what should they switch to?
They are seeing the MMSE-regression problem in action. A specific greyscale flower photo could be a red flower, a yellow flower, a pink flower — multiple plausible colourings, all equally consistent with the greyscale input. MSE regression converges to the conditional mean, which is the pixelwise average of all those plausible colourings — a desaturated, neutral-toned compromise. The fix is a conditional generative model (a cGAN with the L1+GAN loss like pix2pix, or a conditional diffusion model). A generative model samples one plausible colouring at a time instead of averaging across all of them, so individual outputs come out vivid and crisp. Running inference multiple times produces diverse colourings of the same input, each individually realistic — which is the right uncertainty behaviour for a problem with no unique answer.
The discriminator in a cGAN sees both the condition $c$ and the candidate $x$, instead of just $x$ alone. Why is this critical?
Without the condition, the discriminator could only check “is this a realistic-looking image?” — the generator could then learn to produce any realistic image regardless of $c$ (e.g. always output a beautiful brown handbag, regardless of which sketch was input) and still fool the discriminator. With the condition, the discriminator checks “is this a realistic pair $(c, x)$?” — punishing the generator for outputs that don't match the input. The condition forces the generator to actually use $c$ as a constraint, not just as a starting noise source. This is what turns a generative model into a conditional one in the proper sense.
Connections
- Built on generative-adversarial-network — cGAN is a conditional extension of the GAN framework; same min-max game, same losses, same training algorithm, with $c$ added as input to both networks.
- Built on u-net — pix2pix’s generator is a U-Net; the encoder-decoder + skip connections architecture is well-suited to image-to-image tasks where the output is structurally similar to the input.
- Built on bayes-theorem — conditional GMs learn $p(x \mid c)$, a parameterised posterior.
- Family member of generative-model — the conditional branch.
- Improvement over MSE regression for posterior estimation — regression converges to the conditional mean (blurry); generative models sample from the posterior (sharp, diverse).