Pixel-space DDPM is slow because it has to denoise millions of pixels for $T \approx 1000$ steps. Latent diffusion fixes this with one architectural insight: most pixels are redundant — train an autoencoder to compress images into a small latent grid (say $64 \times 64 \times 4$ instead of $512 \times 512 \times 3$), then run the entire diffusion process in that latent space, and decode once at the end. Same training objective, same U-Net machinery, but everything happens on a representation that's ~48× smaller. The result is Stable Diffusion: high-resolution text-to-image generation runnable on consumer GPUs.
The architectural shift
A standard DDPM operates on pixel-space images $x \in \mathbb{R}^{H \times W \times 3}$. For high-resolution generation (large $H \times W$), 1000 U-Net forward passes per sample are computationally extreme.
A latent diffusion model (LDM) inserts a frozen autoencoder around the diffusion process:
- Encoder $\mathcal{E}$ — compresses an image $x \in \mathbb{R}^{H \times W \times 3}$ into a small latent representation $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$, where $h = H/f$ and $w = W/f$ for a downsampling factor $f$ (typical compression factor: 4×, 8×, or 16× spatially).
- Diffusion in latent space — the entire DDPM forward and reverse processes operate on $z$, not $x$. The U-Net predicts noise on the latent grid.
- Decoder $\mathcal{D}$ — at the end of sampling, decode the final latent back to pixel space: $\tilde{x} = \mathcal{D}(z_0)$.
The autoencoder is trained separately (not jointly with the diffusion model), then frozen. Once $\mathcal{E}$ and $\mathcal{D}$ are trained, the diffusion model only ever sees latents.
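A minimal sketch of this wiring in PyTorch (module names and shapes are hypothetical, assuming a 512×512 RGB input and an 8× autoencoder):

```python
import torch

class LatentDiffusion(torch.nn.Module):
    """Sketch: a pretrained, frozen autoencoder wrapped around a diffusion U-Net."""

    def __init__(self, encoder, decoder, unet):
        super().__init__()
        self.encoder, self.decoder, self.unet = encoder, decoder, unet
        # Freeze E and D: the U-Net trains, the autoencoder does not.
        for p in list(self.encoder.parameters()) + list(self.decoder.parameters()):
            p.requires_grad_(False)

    def to_latent(self, x):            # x: (B, 3, 512, 512) pixels
        with torch.no_grad():
            return self.encoder(x)     # z: (B, 4, 64, 64) latent grid

    def to_pixels(self, z):            # called once, after sampling finishes
        with torch.no_grad():
            return self.decoder(z)     # (B, 3, 512, 512) reconstruction
```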
The loss function
Vanilla DDPM loss (operating in pixel space):

$$L_{\mathrm{DDPM}} = \mathbb{E}_{x,\; \epsilon \sim \mathcal{N}(0, I),\; t}\Big[\, \big\lVert \epsilon - \epsilon_\theta(x_t, t) \big\rVert_2^2 \,\Big]$$

Latent diffusion loss (operating in latent space):

$$L_{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(x),\; \epsilon \sim \mathcal{N}(0, I),\; t}\Big[\, \big\lVert \epsilon - \epsilon_\theta(z_t, t) \big\rVert_2^2 \,\Big]$$
The only change: $x_t$ is replaced by $z_t$ in the U-Net's input. Everything else — noise schedule, sampling algorithm, training procedure — is identical to vanilla DDPM. The expectation now ranges over training latents $z = \mathcal{E}(x)$ rather than training images $x$.
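As a sketch, one training step under this objective (PyTorch; `alphas_cumprod` is the precomputed $\bar\alpha_t$ noise schedule from DDPM, the other names are illustrative):

```python
import torch
import torch.nn.functional as F

def ldm_training_step(unet, encoder, x, alphas_cumprod):
    # Identical to a DDPM training step, except we noise z = E(x), not x itself.
    with torch.no_grad():
        z = encoder(x)                                    # (B, 4, 64, 64)
    t = torch.randint(0, alphas_cumprod.numel(), (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)                             # target noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps   # forward process q(z_t | z)
    return F.mse_loss(unet(z_t, t), eps)                  # ||eps - eps_theta(z_t, t)||^2
```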
Why it works
Two facts about images make latent diffusion practical:
- Pixel-space images are redundant. Most natural images can be compressed by a learned autoencoder by 4–16× spatially with negligible perceptual loss — neighbouring pixels are highly correlated, fine textures can be regenerated from coarse summaries, etc. So you can lose a lot of pixel-level information without losing what makes the image look like an image.
- Diffusion’s hard work is in the structure, not the pixels. The semantic content of an image (faces, objects, scene layout) lives at coarse spatial scales. Diffusion at this scale is enough to capture the structure; the autoencoder’s decoder fills in the fine pixel detail (texture) that doesn’t need to be diffused over.
In practice: a 512×512 image autoencodes to a 64×64 latent (8× spatial compression, 64× fewer “pixels”). The U-Net acts on the 64×64 grid — far cheaper per step. Training is cheaper too because you can fit larger batches.
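The back-of-envelope arithmetic, using the figures above:

```python
pixel_positions = 512 * 512                 # spatial positions per U-Net step, pixel space
latent_positions = 64 * 64                  # spatial positions per U-Net step, latent space
print(pixel_positions / latent_positions)   # 64.0, i.e. 64x fewer positions per step
```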
Conditional latent diffusion: text-to-image
The architecture is built for conditioning. The U-Net is augmented with cross-attention layers that consume an external embedding $\tau_\theta(y)$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V, \qquad Q = W_Q\,\varphi(z_t),\quad K = W_K\,\tau_\theta(y),\quad V = W_V\,\tau_\theta(y)$$
Concretely:
- $y$ is the conditioning input — a text prompt, a semantic map, a class label, a low-res image, a layout, etc.
- $\tau_\theta$ is a domain-specific encoder for $y$ — for text, a transformer-based text encoder (often a frozen CLIP text encoder); for images, a CNN; for class labels, an embedding lookup.
- The U-Net's cross-attention layers consume $\tau_\theta(y)$ as keys and values, with the latent feature map as queries — letting every spatial location in $z_t$ attend to the conditioning information.
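A single-head sketch of such a cross-attention layer (PyTorch; real implementations are multi-head, and the dimensions here are illustrative):

```python
import torch

class CrossAttention(torch.nn.Module):
    def __init__(self, latent_dim, cond_dim, attn_dim):
        super().__init__()
        self.W_Q = torch.nn.Linear(latent_dim, attn_dim, bias=False)  # from latents
        self.W_K = torch.nn.Linear(cond_dim, attn_dim, bias=False)    # from condition
        self.W_V = torch.nn.Linear(cond_dim, attn_dim, bias=False)

    def forward(self, z_feat, cond):
        # z_feat: (B, h*w, latent_dim), the flattened latent feature map -> queries
        # cond:   (B, L, cond_dim), text-encoder token embeddings -> keys and values
        Q, K, V = self.W_Q(z_feat), self.W_K(cond), self.W_V(cond)
        scores = Q @ K.transpose(1, 2) / K.shape[-1] ** 0.5   # (B, h*w, L)
        return torch.softmax(scores, dim=-1) @ V   # every latent position attends to the prompt
```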
This is the architectural template behind Stable Diffusion — Stability AI's open-source release that made high-quality text-to-image generation broadly accessible. Closed-source systems (DALL-E 2, Imagen, Midjourney) build on the same core idea of text-conditioned diffusion, with varied details (the original Imagen, notably, diffuses in pixel space with cascaded super-resolution).
The full inference pipeline
To generate an image from a text prompt:
```
e = τ_θ("Paris in milky way")        # encode condition (CLIP text encoder)
z_T ~ N(0, I)                        # sample noise in latent space (small!)
for t = T, ..., 1:                   # reverse diffusion in latent space
    z_{t-1} = denoise(z_t, t, e)     # U-Net + cross-attention with e
x_0 = D(z_0)                         # decode latent back to pixel space
return x_0
```
The slow loop is the diffusion sampling, but it's now in the small latent space, so each step is cheap. The decoder runs once, at the very end. Compare with pixel-space diffusion, where every one of the $T$ U-Net calls would be on the full-resolution image.
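To make the pseudocode concrete, here is the same loop sketched against the Hugging Face diffusers and transformers libraries (assuming a Stable Diffusion v1.5 checkpoint; classifier-free guidance is omitted for brevity, so treat this as illustrative rather than production sampling code):

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint ID
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")           # E and D
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")  # eps_theta
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

with torch.no_grad():
    # Encode condition: e = tau_theta(prompt), via the frozen CLIP text encoder
    tokens = tokenizer("Paris in milky way", padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    e = text_encoder(tokens.input_ids)[0]

    # Sample noise in latent space: z_T ~ N(0, I), shape (1, 4, 64, 64)
    z = torch.randn(1, unet.config.in_channels, 64, 64)

    # Reverse diffusion, entirely in latent space
    scheduler.set_timesteps(50)
    for t in scheduler.timesteps:
        eps = unet(z, t, encoder_hidden_states=e).sample
        z = scheduler.step(eps, t, z).prev_sample

    # Decode once: x_0 = D(z_0); 0.18215 is SD v1's latent scaling factor
    x = vae.decode(z / 0.18215).sample  # (1, 3, 512, 512), values roughly in [-1, 1]
```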
What you give up
The compression isn’t free — the encoder lossily compresses, so:
- The autoencoder needs to be good for the LDM to be good. Stable Diffusion uses an autoencoder trained with perceptual and patch-based adversarial (GAN) losses, so reconstructions look right perceptually even if they aren't pixel-perfect.
- Some fine pixel-level effects can’t be perfectly captured (very fine text, intricate patterns, etc.) — they live below the autoencoder’s resolution.
- The latent space is learned, not interpretable. You don’t get clean disentanglement.
In exchange you get: 5–50× faster training, 5–50× faster sampling, and the ability to run and fine-tune on consumer hardware. Worth it.
Why this is the architecture for modern text-to-image
Three reasons LDM dominates production text-to-image:
- Compute scales. You can train at high resolution because most compute happens in latent space. Pixel-space diffusion at 512×512+ is prohibitively expensive.
- Conditioning is clean. Cross-attention with frozen text encoders gives strong text-following without retraining the text encoder. Swap CLIP for a better encoder, swap the U-Net’s cross-attention scheme — modular.
- Released open. Stability AI's open release of Stable Diffusion (weights + code, under the CreativeML OpenRAIL-M license) seeded a massive ecosystem (the Diffusers library, ControlNet, LoRAs, fine-tunes). Closed alternatives (DALL-E, Midjourney, Imagen) used the same principle without sharing weights.
ASIDE — "Latent" here means three different things
Don’t confuse the three uses of “latent” floating around this module:
- Autoencoder latent (latent-representation): the encoder output $z = \mathcal{E}(x)$, kept as a learned compressed representation of a specific input. Used for clustering / classification.
- GAN/VAE latent (generative-adversarial-network $z$): a random sample from a prior $z \sim \mathcal{N}(0, I)$, fed into a generator. No encoder, no input image — just a fresh noise vector.
- Latent diffusion latent (this page, $z_t$): a spatial grid (e.g. 64×64×4) that is the autoencoder's encoded form of an input image, and the space in which the diffusion process operates. So an LDM mixes the autoencoder sense (output of $\mathcal{E}$) with a diffusion sense (the $z_T \to \dots \to z_0$ chain). At inference, $z_T$ is sampled from $\mathcal{N}(0, I)$ — playing the GAN-prior role — and the reverse process produces $z_0$, which the decoder turns into a pixel image.
Same word, three different vectors with three different roles. Treat each in context.
A friend says: "Why not just train a normal pixel-space DDPM with a smaller U-Net for speed?" Why is latent diffusion better?
A smaller U-Net loses capacity without changing the problem — you’d still be predicting noise on pixels each step, just with fewer parameters to do it with. The result: faster but lower-quality (the network can’t model the data well). Latent diffusion changes the problem — it shrinks the spatial dimensions the U-Net operates on, so even a large, capable U-Net runs cheaply per step. You get speed and capacity. The autoencoder takes the perceptually-irrelevant pixel detail off the diffusion model’s hands, freeing it to focus on the structural decisions that actually matter. The right way to think about it: pixel-space diffusion does double duty (modelling structure and low-level pixel details); LDM factors those concerns — the autoencoder handles pixels, diffusion handles structure.
The autoencoder in an LDM is trained separately and then frozen during diffusion training. Why not jointly train them?
Several reasons. Stability: the autoencoder converges to a clean reconstruction objective in isolation; jointly training would entangle its loss with the diffusion loss and risk degenerate solutions (e.g. an encoder that maps everything to noise, which the diffusion model then learns to reverse trivially). Reusability: a single trained autoencoder can serve many diffusion models (different conditioning, different domains, different fine-tunes) — you only train the expensive autoencoder once. Modularity: decoupling means you can swap the autoencoder for a better one (higher fidelity, different latent dimensionality) without retraining the diffusion U-Net. The Stable Diffusion ecosystem heavily exploits this — countless fine-tunes share the same autoencoder backbone.
Connections
- Built on diffusion-model — LDM is “DDPM in latent space.” Same forward process, same reverse process, same noise-prediction objective, same U-Net training. The shift is: where do you do the diffusion?
- Built on autoencoder — the encoder/decoder pair is a standard autoencoder, trained separately and then frozen. Decoupling is essential.
- Built on u-net — the noise predictor is a U-Net, but acting on the small latent grid rather than full-resolution pixels. Augmented with self-attention and cross-attention.
- Built on conditional-generative-model — text-to-image is conditional generation $p(x \mid y)$; LDM provides the architecture and the cross-attention mechanism for injecting the text condition.
- Latent disambiguation: see latent-representation. In LDM, “latent” refers to a spatial grid that’s the autoencoder’s compressed representation; not the per-input vector of an autoencoder, and not the random noise prior of a GAN. Same word, three different meanings — keep them straight.
- Practical instance — Stable Diffusion (Stability AI 2022) is the canonical LDM; the same template underlies DALL-E 3 and most modern text-to-image diffusion systems (the original Imagen is a notable exception, diffusing in pixel space).