A diffusion model breaks the hard problem of “go from noise to a face in one step” into the easy problem of “remove a little noise” applied 1000 times. The trick is a mirrored pair of processes: define a fixed forward process that destroys data into Gaussian noise step by step, then train a neural network to walk that process backwards. The forward process is just math (noise + scaling); the reverse process is the learning. At test time, sample pure Gaussian noise and apply the network 1000 times — out comes a face that doesn’t exist.

The two processes

A DDPM (denoising diffusion probabilistic model) consists of two processes operating on a sequence of variables $x_0, x_1, \ldots, x_T$, with $T$ typically around 1000:

  • Forward diffusion — fixed (no learned parameters). Take a clean data sample $x_0$ and gradually add Gaussian noise until $x_T$ — pure noise that has lost all information about $x_0$.
  • Reverse diffusion — learned. Sample $x_T \sim \mathcal{N}(0, I)$ and progressively denoise to produce a fresh sample $x_0$. The reverse process is the generative path and where all training happens.

ASIDE — Why "diffusion"?

The name borrows from physics. A drop of ink in still water spreads gradually until the pigment is uniformly distributed and the original shape is gone — a system relaxing from low entropy to high entropy via small random kicks. The forward process here is the same idea applied to pixels: each step nudges the image with a tiny amount of Gaussian noise, and after enough steps all structure is destroyed and you’re left with pure noise. The reverse process is the (much harder) thermodynamic miracle: watching the diffusion run backwards.

Forward diffusion: the noising chain

Each forward step is a small Gaussian perturbation:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

In sample form:

$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\beta_1, \ldots, \beta_T$ is the noise schedule — a sequence of small positive numbers (typically ramping linearly from $\beta_1 \approx 10^{-4}$ to $\beta_T \approx 0.02$). The schedule is a hyperparameter we choose; common choices are linear or cosine.

TIP — Why the scaling?

The $\sqrt{1-\beta_t}$ scaling keeps each step a gentle mix of “old image” and “new noise” rather than just piling noise on top. Without it, the variance of $x_t$ would grow without bound as $t$ increases — values would explode out of the meaningful range. With it, the variance stays approximately fixed. After many steps, $x_T$ ends up distributed as $\mathcal{N}(0, I)$ — the same simple distribution we can sample from at test time. It’s how the chain terminates at noise in a controlled way, rather than diverging.
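
A quick numerical check of the variance claim, as a minimal NumPy sketch (the schedule endpoints are the standard DDPM linear-schedule values):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # standard DDPM linear schedule

x = rng.standard_normal(10_000)        # toy unit-variance "data"
for beta in betas:
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * eps   # one forward step

# Variance stays ~1.0. Without the sqrt(1 - beta) factor it would grow
# by beta each step, ending near 1 + betas.sum() ~ 11.
print(np.var(x))
```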

The closed-form shortcut: $q(x_t \mid x_0)$

The composition of all forward steps from $x_0$ to $x_t$ collapses into a single Gaussian. Define $\alpha_t = 1 - \beta_t$ and the cumulative product

$$\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$$

Then

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$$

In sample form:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

This is enormously useful for training: instead of running the forward process step by step (1000 sequential operations) to produce a noisy training example at timestep $t$, you can sample $x_t$ in one operation directly from $x_0$. The schedule is designed so that $\bar\alpha_T \approx 0$ — meaning $x_T$ has essentially zero contribution from $x_0$ and is pure noise as required.
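
In code the shortcut is a single line. A minimal PyTorch sketch; the later snippets reuse these schedule tensors:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # noise schedule β_1..β_T
alphas = 1.0 - betas                           # α_t = 1 - β_t
alpha_bar = torch.cumprod(alphas, dim=0)       # ᾱ_t; index t-1 holds ᾱ_t

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in one operation (the closed-form shortcut)."""
    ab = alpha_bar[t - 1].view(-1, 1, 1, 1)    # per-example ᾱ_t, broadcast over CHW
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

x0 = torch.randn(8, 3, 32, 32)                 # stand-in batch of images
t = torch.randint(1, T + 1, (8,))              # a random timestep per example
x_t = q_sample(x0, t, torch.randn_like(x0))
```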

Reverse diffusion: the generative path

To generate a sample, start from pure noise and iteratively denoise:

$$x_T \rightarrow x_{T-1} \rightarrow \cdots \rightarrow x_1 \rightarrow x_0$$

The ideal reverse step is the conditional distribution $q(x_{t-1} \mid x_t)$ — the true posterior over the slightly cleaner image given the slightly noisier one. In general this is intractable — there’s no closed form and we can’t sample from it directly.

The DDPM workaround: approximate it with a Gaussian whose mean is predicted by a neural network. As long as $\beta_t$ is small at each step, the true posterior is well-approximated by a Gaussian — so the parametric form

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$

is an honest approximation (not a heuristic). Here:

  • $\mu_\theta(x_t, t)$ is the network’s prediction — a learnable mean of the same shape as the image.
  • $\sigma_t^2$ is fixed in vanilla DDPM (a function of the $\beta$ schedule, not learned).

In sample form:

$$x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$

The network only has to learn the mean of the reverse step. The variance is given by the schedule.

The big simplification: predict the noise, not the mean

A theoretical derivation (in the DDPM paper) shows that predicting $\mu_\theta(x_t, t)$ directly is equivalent to predicting the noise $\epsilon$ that was added in the forward shortcut from $x_0$ to $x_t$. The two are linked by:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right)$$

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t$ are as defined above, and $\epsilon_\theta$ is a neural network whose output has the same shape as $x_t$ (it predicts the noise image).
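
As a sketch, the same conversion in code, reusing `alphas` and `alpha_bar` from the snippet above (scalar `t` for simplicity):

```python
def mu_from_eps(x_t, t, eps_pred):
    """Reverse-step mean μ_θ(x_t, t) recovered from the predicted noise (scalar t)."""
    a, ab = alphas[t - 1], alpha_bar[t - 1]
    return (x_t - (1 - a) / (1 - ab).sqrt() * eps_pred) / a.sqrt()
```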

Why predict noise instead of the mean directly? Two reasons:

  1. Easier learning. Predicting noise is a cleaner regression target than predicting a clean image. The noise is approximately the same scale at every timestep; the clean image’s range varies.
  2. Same loss, different framing. The loss becomes simple MSE between predicted noise and actual noise — a supervised problem with abundant training data (we made the noise ourselves).

The network: a U-Net that takes a timestep

The architecture predicting $\epsilon_\theta(x_t, t)$ is typically a U-Net (encoder-decoder with skip connections), often augmented with ResNet blocks and self-attention layers. Two crucial details:

  • Input is the noisy image $x_t$, output is the predicted noise $\epsilon_\theta(x_t, t)$ — both have the same shape (same number of pixels, same channels).
  • The timestep $t$ is also fed in as a learned vector embedding (via fully-connected layers from the scalar $t$). The network needs to know “which step of denoising are we at?” — different timesteps require different denoising strategies (lots of noise removal vs fine-detail polishing).

The U-Net is deterministic: same input, same output. Stochasticity in the overall reverse process comes from the explicit noise term added at each sampling step, not from the network.
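
A common concrete choice for the timestep input, used by the DDPM paper, is a Transformer-style sinusoidal embedding of $t$ passed through a small MLP. A minimal sketch; the widths 128 and 512 are arbitrary choices:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps: (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# The embedding is usually projected by an MLP, then added into each residual block:
time_mlp = nn.Sequential(nn.Linear(128, 512), nn.SiLU(), nn.Linear(512, 512))
t_emb = time_mlp(timestep_embedding(torch.randint(1, 1000, (8,)), 128))   # (8, 512)
```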

Training algorithm (DDPM Algorithm 1)

repeat:
    x_0 ~ q(x_0)                      # sample a clean training image
    t   ~ Uniform({1, ..., T})        # pick a random timestep
    ε   ~ N(0, I)                     # sample fresh Gaussian noise
    # Forward shortcut: jump straight from x_0 to x_t in one operation
    x_t = √(α̅_t) · x_0 + √(1 - α̅_t) · ε
    # Loss: MSE between true noise and predicted noise
    L = || ε - ε_θ(x_t, t) ||²
    take gradient step on ∇_θ L
until converged

Three things worth noting:

  • It’s just supervised learning. The forward shortcut gives us an $(x_t, \epsilon)$ pair for free; the network learns to predict $\epsilon$ from $x_t$ and $t$. Loss = MSE. Gradient descent. Backprop. All standard.
  • No iterative forward roll-out. The shortcut means each training step costs one network forward+backward pass, not $T$ passes. Critical for training feasibility.
  • The network sees all timesteps interleaved. Random sampling of $t$ each iteration means the same network learns to denoise at $t \approx T$ (lots of noise) and $t \approx 1$ (just a trace of noise) — the timestep embedding tells it which regime it’s in.
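
The same algorithm as a runnable PyTorch sketch. The single conv layer is a stand-in for the U-Net (it even ignores $t$), and the random batch stands in for real data; `q_sample` and the schedule tensors come from the forward-shortcut snippet:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for ε_θ; a real U-Net also takes t
opt = torch.optim.Adam(model.parameters(), lr=2e-4)

for step in range(1000):
    x0 = torch.randn(8, 3, 32, 32)            # stand-in for a batch of training images
    t = torch.randint(1, T + 1, (8,))         # random timestep per example
    eps = torch.randn_like(x0)                # the noise the network must predict
    x_t = q_sample(x0, t, eps)                # forward shortcut: one operation
    loss = F.mse_loss(model(x_t), eps)        # ||ε - ε_θ(x_t, t)||²
    opt.zero_grad()
    loss.backward()
    opt.step()
```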

Sampling algorithm (DDPM Algorithm 2)

x_T ~ N(0, I)                          # start from pure Gaussian noise
for t = T, T-1, ..., 1:
    z ~ N(0, I) if t > 1 else 0
    # Apply one reverse step using predicted noise
    x_{t-1} = (1/√α_t) · (x_t - ((1-α_t)/√(1-α̅_t)) · ε_θ(x_t, t)) + σ_t · z
return x_0

The big bracketed expression is exactly $\mu_\theta(x_t, t)$ — the predicted mean of the reverse Gaussian. Adding $\sigma_t \cdot z$ samples from that Gaussian. After $T$ such steps, $x_0$ is your generated image.
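
The same loop as a PyTorch sketch, taking $\sigma_t^2 = \beta_t$ (one of the two fixed-variance choices in the DDPM paper) and reusing the schedule tensors from above:

```python
@torch.no_grad()
def sample(model, shape):
    """Full reverse chain: pure Gaussian noise in, generated image out."""
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        a, ab = alphas[t - 1], alpha_bar[t - 1]
        eps_pred = model(x)                              # a real U-Net also takes t
        mean = (x - (1 - a) / (1 - ab).sqrt() * eps_pred) / a.sqrt()
        x = mean + betas[t - 1].sqrt() * z               # add σ_t · z (zero at t = 1)
    return x
```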

CAUTION — Sampling is slow

The reverse loop runs sequentially — each denoising step requires the result of the previous one, so the $T$ U-Net calls cannot be parallelised. With $T = 1000$ and a heavy U-Net, generating one image can take seconds even on a top GPU. This is the major drawback of diffusion models compared to GANs (which generate in one forward pass). The race to fix this — DDIM, DPM-Solver, distilled diffusion, latent diffusion — is one of the most active research areas in the field.

What the model is really learning

The diffusion model is fundamentally a denoiser. It is not — strictly speaking — an image-generation model. It happens to generate images as a side effect of repeated denoising of pure noise. Every training example, every loss evaluation, every U-Net forward pass is solving the same problem: “given this noisy image and what timestep it is, what noise was added?”

Generation works because if you start from pure noise (which the model has been trained to handle, because $x_T$ in the forward process is essentially pure noise) and apply denoising 1000 times, you end up at a clean image consistent with the data distribution $q(x_0)$. The chain of small denoising steps does the work that a GAN’s generator does in one step.

Quality vs diversity vs speed: the trilemma

| Family | Quality | Diversity | Speed |
| --- | --- | --- | --- |
| GAN | High | Low (mode collapse) | Fast (1 forward pass) |
| Diffusion | Very high | High (covers modes well) | Slow ($T$ forward passes) |
| VAE / normalising flows | Moderate | High | Fast |

Diffusion models excel at the first two and are weak on the third. The “race against the clock” is what motivates latent diffusion (do diffusion in a small latent space, decode once at the end), DDIM (skip steps deterministically), and other accelerated samplers.

Conditional diffusion: feeding in $c$

The same architecture extends trivially to conditional generation $p(x \mid c)$, where $c$ is a class label, a low-resolution image, a text prompt, etc.

The U-Net consumes $c$ in addition to $x_t$ and $t$:

  • Scalar $c$ (class label) — embed as a vector via lookup or small MLP, add to the timestep embedding.
  • Image $c$ (low-res, masked, etc.) — concatenate channel-wise with $x_t$ before the U-Net.
  • Text $c$ — encode with a text encoder (CLIP, T5), inject via cross-attention layers throughout the U-Net.

Loss is unchanged — predict the noise that was added, conditioned on $c$. This is the workhorse pattern behind every text-to-image model in production. See conditional-generative-model for the broader posterior-estimation framing that applies equally to cGANs and conditional diffusion.
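
For the class-label case the injection can be a single addition: embed the label and sum it with the timestep embedding. A sketch; `num_classes` and the width 512 are hypothetical choices:

```python
import torch.nn as nn

num_classes = 10                                   # hypothetical label vocabulary
label_emb = nn.Embedding(num_classes, 512)         # same width as the timestep embedding

def cond_embedding(t_emb, c):
    """Combined conditioning vector fed to every U-Net residual block."""
    return t_emb + label_emb(c)                    # (B, 512) + (B, 512)
```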

Key extensions (high-level)

  • Cascaded diffusion. Generate at low resolution (e.g. 32×32), then use a chain of conditional super-resolution diffusion models to scale up (32→64→256). Each stage is conditioned on the previous stage’s output. Faster than one giant model trained at full resolution, and produces sharper outputs at high res.
  • Super-resolution diffusion (SR3). Train $p(x \mid c)$ where $x$ is high-res and $c$ is low-res. Concatenate upsampled $c$ to $x_t$ in the U-Net. Beats bicubic interpolation and MSE-regression super-resolution by a wide margin (regression converges to the conditional mean — blurry; diffusion samples from the posterior — sharp).
  • Image-to-image diffusion (Palette). Same recipe with $c$ being any image-domain condition: greyscale→colour, masked→inpainted, low-quality JPEG→clean, panorama uncropping. Uses one model architecture across all tasks.
  • Diffusion for segmentation. Run forward diffusion on an image, extract intermediate U-Net feature maps, train a small MLP classifier on top to predict per-pixel class labels. The diffusion U-Net’s features turn out to be excellent representations even though it was trained for noise prediction (a representation-learning side effect).
  • Latent diffusion (Stable Diffusion). The decisive practical breakthrough — perform diffusion in the small latent space of a frozen autoencoder rather than in pixel space. Massive speedup, plus a clean place to inject conditioning via cross-attention. Behind essentially every modern text-to-image system.

Common pitfalls

  • Receptive field too small. A shallow U-Net (e.g. only two pooling/upsampling levels) has a limited receptive field — pixel predictions are influenced only by local context. For tasks requiring global consistency (eyes the same colour, both shoes the same pair), this causes asymmetric / inconsistent outputs. Fix: deeper U-Net with more pooling levels, or add self-attention layers that aggregate global context. (Week-08 problem set Q2.)
  • Confusing $x_T$ with a latent code. The noise at $t = T$ is not an encoding of $x_0$. It’s pure Gaussian noise that has lost all information about the original. Diffusion models are not autoencoders — there is no compressed latent representation of the input. The noise plays the same role as the random $z$ at the start of a GAN: a fresh sample from a simple distribution that the model expands into structured output.
  • Both processes are stochastic. The forward process is random (different noise sample → different $x_t$); the reverse process is random (different $z$’s in sampling → different $x_0$). Even with the same starting noise $x_T$, the reverse process produces different images on different runs because each step injects fresh $z$.
  • Predicting the mean directly is harder than predicting noise. Empirically and theoretically. Always parameterise the network to output noise, and recover $\mu_\theta$ analytically.

Connections

  • Built on u-net — the noise-predictor architecture is a U-Net, often with ResNet blocks and self-attention. The encoder-decoder shape is well-suited to image-shape input/output.
  • Built on bayes-theorem — the reverse process is Bayesian inference: given the observation $x_t$, infer the posterior over $x_{t-1}$. The DDPM derivation involves repeated application of Bayes’ rule between the forward and reverse processes.
  • Built on backpropagation / gradient descent — training is standard supervised learning with MSE loss. The forward shortcut makes each gradient step cheap.
  • Family member of generative-model — competes with GANs and VAEs for image generation. Currently the dominant paradigm for image synthesis quality.
  • Extended by latent-diffusion-model — performing diffusion in the latent space of a frozen autoencoder rather than pixel space; the basis of Stable Diffusion.
  • Extends to conditional generation — same recipe with a condition added as input. Text-to-image, image-to-image translation, super-resolution, inpainting all follow this pattern.
  • Compared to autoencoder — both have an encoder-decoder shape (the U-Net), but the AE compresses input to a meaningful latent and reconstructs; diffusion adds noise (no learned encoding) and learns to remove it. The “noise at $t = T$” is not a latent code in the AE sense — it’s pure noise containing no information about $x_0$.
  • Compared to GAN — both go from noise to image, but: GAN uses one fast forward pass with adversarial training (unstable, mode-collapse-prone); diffusion uses many slow denoising passes with stable supervised training (high quality, high diversity).