Modelling a high-dimensional joint distribution is hard. Modelling a single one-dimensional conditional distribution given the past is easy. Autoregressive models exploit the chain rule of probability to turn the hard problem into a sequence of easy ones — and recover the joint by multiplying.
The factorisation
For any random vector $\mathbf{x} = (x_1, x_2, \dots, x_D)$, the chain rule of probability gives:

$$p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d \mid x_1, \dots, x_{d-1})$$

This identity holds for any joint distribution, regardless of the dimensionality or structure. It's not an approximation. The autoregressive idea is to parameterise each conditional with a neural network:

$$p_\theta(x_d \mid x_1, \dots, x_{d-1})$$

A single shared network with parameters $\theta$ takes the previous components $x_1, \dots, x_{d-1}$ as input and outputs the distribution over the next component $x_d$. Train by maximum likelihood, and you have a tractable density estimate of the full joint.
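A minimal sketch (not from the source) of how the factorisation turns the joint log-likelihood into a sum of one-dimensional conditional log-probabilities. Here `model` is a hypothetical callable that maps a prefix to a `torch.distributions` object over the next component:

```python
import torch

def joint_log_prob(model, x):
    """log p_theta(x) = sum_d log p_theta(x_d | x_1, ..., x_{d-1}).

    `model` is a hypothetical callable: given the prefix x[:d] it returns a
    torch.distributions object over the next component x[d].
    """
    total = torch.tensor(0.0)
    for d in range(len(x)):
        cond = model(x[:d])                 # p_theta(x_d | x_<d); empty prefix for d = 0
        total = total + cond.log_prob(x[d])
    return total
```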
ASIDE — Why this is a huge simplification
Directly modelling $p(\mathbf{x})$ for high-dimensional $\mathbf{x}$ (e.g. an image, or a 10-second audio waveform sampled at 16 kHz) is intractable — you’d have to specify a probability density over a space with as many dimensions as there are pixels or samples (160,000 dimensions for the audio example). Autoregressive factorisation reduces this to predicting one one-dimensional distribution at a time (just $p_\theta(x_d \mid x_1, \dots, x_{d-1})$), conditioned on what came before. Each step is something a network can plausibly do; the chain rule guarantees the product equals the joint.
What the network outputs
For each step, the network outputs a distribution over $x_d$ (not just a point estimate). How that distribution is parameterised depends on the data:
| Data type | Output distribution | Typical implementation |
|---|---|---|
| Discrete (text, classes) | Categorical over vocab | Softmax over vocab logits |
| Discrete-valued audio (8-bit) | Categorical over 256 levels | Softmax over 256 logits |
| Continuous | Gaussian, mixture, or histogram | Mean+variance heads, or discretised bins |
Slide 57 shows the histogram view: the model can output a discretised distribution over possible values — a histogram-like vector over the discrete bin index — or a parametric form like a Gaussian. Either works. Sampling at generation time is then drawing from that output distribution.
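A minimal sketch of the two kinds of output head in PyTorch; names such as `hidden_dim` and `n_bins` are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

hidden_dim, n_bins = 128, 256              # illustrative sizes, not from the slides
h = torch.randn(hidden_dim)                # per-step hidden vector from the network

# Histogram view: logits over discrete bins, softmax gives the distribution.
categorical_head = nn.Linear(hidden_dim, n_bins)
p_bins = torch.softmax(categorical_head(h), dim=-1)

# Parametric view: separate heads give the mean and (log-)variance of a Gaussian.
mean_head, logvar_head = nn.Linear(hidden_dim, 1), nn.Linear(hidden_dim, 1)
p_gauss = torch.distributions.Normal(mean_head(h), logvar_head(h).exp().sqrt())
```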
Examples
WaveNet — audio (van den Oord et al. 2016)
Raw audio is a 1D sequence of pressure samples (16 kHz means 16,000 numbers per second). WaveNet models $p(x_t \mid x_1, \dots, x_{t-1})$ for each sample with a stack of dilated causal convolutions:
- Causal: filter only sees past samples, never future ones (just like masked attention).
- Dilated: each successive layer skips an exponentially growing number of timesteps (dilation 1, 2, 4, 8, …) so the receptive field grows exponentially with depth.
A WaveNet with $n$ layers of dilations $1, 2, 4, \dots, 2^{n-1}$ has a receptive field of $2^n$ samples, so a modest stack covers a long window of audio. Output is a softmax over discretised audio levels (typically 256, via $\mu$-law companding).
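A minimal sketch of a dilated causal 1-D convolution in PyTorch. The left-only padding is what enforces causality; this is the building block only, not the full WaveNet with its gated units and skip connections:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated causal 1-D convolution: output at time t depends only on inputs <= t."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # pad the past only, never the future
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field
# exponentially with depth.
layers = [CausalConv1d(32, dilation=2 ** i) for i in range(8)]
```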
PixelRNN / PixelCNN — images (van den Oord et al. 2016)
Raster-scan an image into a sequence: $x_1$ = top-left pixel, $x_2$ = next pixel along, …, $x_D$ = bottom-right pixel. Then model

$$p(\mathbf{x}) = \prod_{d=1}^{D} p_\theta(x_d \mid x_1, \dots, x_{d-1})$$
with a network that produces a categorical distribution over 256 intensities for each pixel given everything that came before in raster order. PixelRNN uses an RNN; PixelCNN uses masked convolutions that only see pixels above and to the left.
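A sketch of the masked-convolution idea in PyTorch (a "type A" mask, which also hides the centre pixel). The mask zeroes the kernel weights that would touch the current pixel or anything later in raster order:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution: the kernel only sees pixels above, and to
    the left on the current row, in raster order. Assumes an odd kernel size."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        k = kernel_size
        mask = torch.ones(1, 1, k, k)
        mask[:, :, k // 2, k // 2:] = 0      # centre pixel and everything to its right
        mask[:, :, k // 2 + 1:, :] = 0       # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask        # zero out 'future' kernel weights
        return super().forward(x)
```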
Transformer-based language models (GPT family)
Same factorisation, with the conditional implemented by a transformer with masked self-attention. Each token attends only to previous tokens; the final hidden state at each position is projected to a softmax over the vocabulary, giving the distribution over the next token.
GPT, the decoder side of the original transformer, and any “causal language model” you’ve heard of are all autoregressive in this sense. The architecture changes; the factorisation doesn’t.
Training: parallel and fast
Autoregressive models are trained by maximum likelihood: maximise

$$\log p_\theta(\mathbf{x}) = \sum_{d=1}^{D} \log p_\theta(x_d \mid x_1, \dots, x_{d-1})$$

summed over the training data.
Crucially, during training the entire ground-truth sequence is known. So all conditional predictions can be computed in parallel in a single forward pass — feed the full target sequence in, get all predictions out, compute the per-position cross-entropy losses, sum, backpropagate. This is teacher forcing: each position is given the ground-truth previous tokens as input, regardless of what the model would have generated.
For a transformer, masked self-attention enforces the “only look at previous tokens” constraint structurally — the lower-triangular mask ensures position $d$ never attends to positions $> d$. So training is one forward pass: $D$ predictions and $D$ loss terms in $O(D^2)$ time (the attention cost), all parallelisable on GPUs.
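A minimal teacher-forced training step, assuming a hypothetical `model` that maps token ids to next-token logits with causal masking applied internally. All positions are scored in one forward pass:

```python
import torch
import torch.nn.functional as F

def training_loss(model, tokens):
    """One teacher-forced pass. `tokens` are ground-truth ids of shape (batch, T+1).

    `model` is a hypothetical causal LM: given ids (batch, T) it returns
    next-token logits (batch, T, vocab), with masked self-attention inside.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token d+1 from the prefix
    logits = model(inputs)                            # all positions in one pass
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # (batch*T, vocab)
        targets.reshape(-1),                          # (batch*T,)
    )
```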
Generation: sequential and slow
At inference, the ground truth isn’t available. To generate a sample:
- Sample $x_1 \sim p_\theta(x_1)$.
- Run the network with input $x_1$ to get $p_\theta(x_2 \mid x_1)$. Sample $x_2$.
- Run the network with input $(x_1, x_2)$ to get $p_\theta(x_3 \mid x_1, x_2)$. Sample $x_3$.
- …
- Run the network with input $(x_1, \dots, x_{D-1})$ to get $p_\theta(x_D \mid x_1, \dots, x_{D-1})$. Sample $x_D$.
$D$ sequential network passes — they cannot be parallelised, because each step depends on the previous sample. This is the fundamental cost of autoregressive models: training is fast, generation is slow.
For a 1024-token GPT generation, you do 1024 forward passes back-to-back. For a 10-second WaveNet sample at 16 kHz, you do 160,000 forward passes. This is why generative LLMs are heavily optimised at inference (KV-caching, speculative decoding, etc.) — the per-step cost dominates.
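For contrast with the training step above, a naive sampling loop with the same hypothetical `model`: one forward pass per new token, re-running on the full prefix each time. A real implementation would add KV caching so each step reuses earlier computation:

```python
import torch

@torch.no_grad()
def generate(model, prompt, n_new):
    """Naive sequential sampling: one forward pass per new token.

    `model` is the same hypothetical causal LM as in the training sketch.
    """
    seq = prompt                                      # (1, T0) token ids
    for _ in range(n_new):
        logits = model(seq)[:, -1, :]                 # distribution over the next token
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)       # feed the sample back in
    return seq
```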
The asymmetry, in a table
| | Training | Generation |
|---|---|---|
| Inputs available | Full ground-truth sequence | Only what’s been generated so far |
| Parallelism | All positions in one pass | Strictly sequential, $D$ passes |
| Cost | $O(D^2)$ for a transformer (one pass) | $O(D^3)$ if naively recomputed; $O(D^2)$ with caching |
| What’s being computed | Likelihood of the ground truth under the model | A new sample by composition |
The mismatch between training and generation modes is sometimes called exposure bias — at training the model always sees correct previous tokens; at generation it sees its own (potentially wrong) previous samples. Various decoding strategies and training tricks (scheduled sampling, etc.) address this.
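A sketch of the scheduled-sampling idea (the exact recipe varies; this is an illustrative assumption, not the original algorithm): with some probability, each ground-truth input token is replaced by a token the model itself produced, so training occasionally resembles generation:

```python
import torch

def scheduled_sampling_inputs(tokens, model_samples, p_use_model):
    """Mix ground-truth inputs with the model's own samples (sketch).

    tokens, model_samples: integer tensors of the same shape.
    With probability p_use_model each input position uses the model's sample,
    exposing the model to its own (possibly wrong) predictions during training.
    """
    use_model = torch.rand_like(tokens, dtype=torch.float) < p_use_model
    return torch.where(use_model, model_samples, tokens)
```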
Why “ensure correct receptive field”
Slide 57 emphasises that autoregressive models must ensure the network only sees the past. Get this wrong and the model can cheat by peeking at the future during training, then fail catastrophically at generation when the future isn’t available.
Three structural ways to enforce causality:
- Masked attention (transformers) — set future-position attention scores to $-\infty$ before the softmax (sketched below).
- Causal convolutions (WaveNet) — pad asymmetrically so the filter’s output at time $t$ depends only on times $\le t$.
- Masked convolutions (PixelCNN) — zero out the kernel weights corresponding to future pixels in raster order.
All three implement the same constraint in different architectural shapes.
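A sketch of the attention version in PyTorch: build an upper-triangular mask of future positions and set their scores to $-\infty$ before the softmax, so they receive zero weight:

```python
import torch

T = 5
scores = torch.randn(T, T)                                   # raw attention scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))             # position d can't see > d
weights = torch.softmax(scores, dim=-1)                      # future positions get zero weight
```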
Related
- generative-model — the broader family; autoregressive is one approach among GANs, VAEs, diffusion, normalising flows
- language-model — most modern language models are autoregressive (next-token prediction)
- transformer — the dominant architecture for autoregressive language modelling
- self-attention — masked self-attention enforces causality structurally
- convolution — causal/masked convolutions enforce causality in WaveNet and PixelCNN
- decoding-strategies — how to actually sample from the per-step distributions (greedy, beam, top-k, top-p, temperature)
- diffusion-model — a non-autoregressive generative alternative; trades sequential generation for iterative denoising
Active Recall
Write the autoregressive factorisation of a joint distribution for $\mathbf{x} = (x_1, \dots, x_D)$ and explain why this is a useful framing.
$p(\mathbf{x}) = \prod_{d=1}^{D} p_\theta(x_d \mid x_1, \dots, x_{d-1})$. Each factor is a one-dimensional conditional distribution given the past — far easier to model than the high-dimensional joint directly. The chain rule of probability guarantees the product equals the true joint exactly (this is not an approximation). One neural network with parameters $\theta$ implements all the conditionals.
Why are autoregressive models fast to train but slow to generate?
At training, the full ground-truth sequence is available, so all conditional predictions can be computed in a single parallel forward pass (teacher forcing). At generation, each next-token prediction depends on the previously sampled token, which can only be computed after the previous step finishes. So generation requires $D$ sequential forward passes and is fundamentally serial. This is why LLM inference is the bottleneck and why KV-caching / speculative decoding exist.
A friend trains a transformer language model and notices that during training the model gets near-perfect accuracy by attending to position $d+1$ when predicting the token at position $d+1$. What's gone wrong, and how should the architecture fix it?
The friend forgot to apply causal masking to the self-attention. Without the lower-triangular mask, position $d$ can attend to future positions, including the target at $d+1$, which trivially gives away the answer. At training the model exploits this; at generation, future positions don’t exist yet, so the model collapses. The fix is masked self-attention: set the attention scores for positions $> d$ to $-\infty$ before the softmax, so they contribute zero weight.
WaveNet uses dilated causal convolutions. What does each adjective mean and what does each contribute?
Causal: the filter at output time $t$ only sees inputs at times $\le t$ (no peeking into the future). Implemented by asymmetric padding so the filter is left-aligned. Necessary for autoregressive sampling. Dilated: each successive layer skips an exponentially growing number of timesteps (dilation 1, 2, 4, 8, …). The receptive field grows exponentially with depth, so a stack of $n$ layers covers on the order of $2^n$ timesteps without needing $2^n$ layers or a filter of width $2^n$. Together: long context window, no future leakage, modest compute.
Compare the autoregressive approach to a diffusion model for image generation. What's traded off?
Autoregressive (PixelRNN/CNN): sequential pixel-by-pixel generation; tractable likelihood; very slow generation ($D$ network passes, one per pixel); modest sample quality. Diffusion: iterative denoising over $T$ steps for the whole image at once; sample quality state-of-the-art; generation cost is $T$ network passes regardless of resolution; likelihood not directly tractable. Autoregressive trades quality and parallelism for exact tractability; diffusion trades exactness for quality and parallelism per step.