Modelling a high-dimensional joint distribution is hard. Modelling a single one-dimensional conditional distribution given the past is easy. Autoregressive models exploit the chain rule of probability to turn the hard problem into a sequence of easy ones — and recover the joint by multiplying.

The factorisation

For any random vector $\mathbf{x} = (x_1, \dots, x_D)$, the chain rule of probability gives:

$$p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d \mid x_1, \dots, x_{d-1})$$

This identity holds for any joint distribution, regardless of the dimensionality or structure. It’s not an approximation. The autoregressive idea is to parameterise each conditional with a neural network:

$$p_\theta(x_d \mid x_1, \dots, x_{d-1})$$

A single shared network with parameters $\theta$ takes the previous components $x_1, \dots, x_{d-1}$ as input and outputs the distribution over the next component $x_d$. Train by maximum likelihood, and you have a tractable density estimate of the full joint.
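To make the factorisation concrete, here is a minimal NumPy sketch. The function `cond_probs` is a hypothetical stand-in for the shared network $p_\theta(x_d \mid x_1, \dots, x_{d-1})$; the only point is that the exact joint log-probability is a sum of one-dimensional conditional log-probabilities.

```python
import numpy as np

K = 4  # toy vocabulary size

def cond_probs(prefix):
    """Stand-in for the shared network: a categorical distribution over the
    next symbol, depending (deterministically, via the seed) on the prefix."""
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % 2**32)
    logits = rng.normal(size=K)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def log_joint(x):
    """log p(x) = sum_d log p(x_d | x_1, ..., x_{d-1}) -- the chain rule."""
    return sum(np.log(cond_probs(x[:d])[x[d]]) for d in range(len(x)))

print(log_joint([2, 0, 3, 1]))  # tractable, exact log-likelihood of the sequence
```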

ASIDE — Why this is a huge simplification

Directly modelling $p(\mathbf{x})$ for high-dimensional $\mathbf{x}$ (e.g. a $256 \times 256$ image, or a 10-second audio waveform sampled at 16 kHz) is intractable — you’d have to specify a probability density over $65{,}536$- or $160{,}000$-dimensional space. Autoregressive factorisation reduces this to predicting a one-dimensional distribution at a time (just $p(x_d \mid x_1, \dots, x_{d-1})$), conditioned on what came before. Each step is something a network can plausibly do; the chain rule guarantees the product equals the joint.

What the network outputs

For each step, the network outputs a distribution over $x_d$ (not just a point estimate). How that distribution is parameterised depends on the data:

| Data type | Output distribution | Typical implementation |
| --- | --- | --- |
| Discrete (text, classes) | Categorical over vocab | Softmax over vocab logits |
| Discrete-valued audio (8-bit) | Categorical over 256 levels | Softmax over 256 logits |
| Continuous | Gaussian, mixture, or histogram | Mean+variance heads, or discretised bins |

Slide 57 shows the histogram view: the model can output a discretised distribution over possible values of $x_d$ — a histogram-like vector over discrete bin indices — or a parametric form like a Gaussian. Either works. Sampling at generation time is then just drawing from that output distribution.
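A sketch of the two common heads in PyTorch; the names `categorical_head` and `gaussian_head` are illustrative, not from any particular library.

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size = 128, 256

categorical_head = nn.Linear(hidden_dim, vocab_size)  # logits -> softmax
gaussian_head = nn.Linear(hidden_dim, 2)              # mean and log-variance

h = torch.randn(1, hidden_dim)  # the network's summary of the past x_1..x_{d-1}

# Discrete case: categorical distribution over 256 levels, sample one.
probs = torch.softmax(categorical_head(h), dim=-1)
x_discrete = torch.multinomial(probs, num_samples=1)

# Continuous case: parametric Gaussian density, sample from it.
mean, log_var = gaussian_head(h).chunk(2, dim=-1)
x_continuous = mean + torch.exp(0.5 * log_var) * torch.randn_like(mean)
```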

Examples

WaveNet — audio (van den Oord et al. 2016)

Raw audio is a 1D sequence of pressure samples (16 kHz means 16,000 numbers per second). WaveNet models $p(x_t \mid x_1, \dots, x_{t-1})$ for each sample $x_t$ with a stack of dilated causal convolutions:

  • Causal: filter only sees past samples, never future ones (just like masked attention).
  • Dilated: each successive layer skips an exponentially growing number of timesteps (dilation 1, 2, 4, 8, …) so the receptive field grows exponentially with depth.

A WaveNet with $L$ layers of dilations $1, 2, 4, \dots, 2^{L-1}$ has a receptive field of $2^L$ samples — e.g. 14 such layers cover $16{,}384$ samples, about one second of 16 kHz audio, with a modest stack. Output is a softmax over discretised audio levels (typically 256 via $\mu$-law companding).
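A minimal PyTorch sketch of one dilated causal convolution layer, assuming kernel size 2 and left-only padding; the gated activations, residual and skip connections of the real WaveNet are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Left-pad by dilation * (kernel_size - 1) so the output at time t
    depends only on inputs at times <= t."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = dilation * (kernel_size - 1)
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # pad on the left only

# Dilations 1, 2, 4, ..., 512: receptive field doubles with each layer.
layers = [CausalConv1d(16, dilation=2 ** i) for i in range(10)]
x = torch.randn(1, 16, 16000)  # one second of 16 kHz audio
for layer in layers:
    x = torch.relu(layer(x))   # final receptive field: 2^10 = 1024 samples
```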

PixelRNN / PixelCNN — images (van den Oord et al. 2016)

Raster-scan an image into a sequence: $x_1$ = top-left pixel, $x_2$ = next pixel along, …, $x_D$ = bottom-right pixel (so $D = HW$ for an $H \times W$ image). Then model

$$p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d \mid x_1, \dots, x_{d-1})$$

with a network that produces a categorical distribution over 256 intensities for each pixel given everything that came before in raster order. PixelRNN uses an RNN; PixelCNN uses masked convolutions that only see pixels above and to the left.
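A sketch of a PixelCNN-style masked convolution in PyTorch: kernel weights at "future" positions in raster order are zeroed. This is the stricter type-A mask (which also hides the centre pixel), used for the first layer in the original paper.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at the centre pixel, everything to its
    right on the centre row, and all rows below: causal in raster order."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0   # centre pixel and everything to its right
        mask[kh // 2 + 1:, :] = 0     # all rows below the centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.conv2d(
            x, self.weight * self.mask, self.bias,
            self.stride, self.padding, self.dilation, self.groups)

conv = MaskedConv2d(1, 32, kernel_size=5, padding=2)
out = conv(torch.randn(1, 1, 28, 28))  # same spatial size, sees only past pixels
```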

Transformer-based language models (GPT family)

Same factorisation, with the conditional $p_\theta(x_d \mid x_1, \dots, x_{d-1})$ implemented by a transformer with masked self-attention. Each token attends only to previous tokens; the final hidden state at each position is projected to a softmax over the vocabulary, giving the distribution over the next token.

GPT, the decoder side of the original transformer, and any “causal language model” you’ve heard of are all autoregressive in this sense. The architecture changes; the factorisation doesn’t.
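The structural trick is just a lower-triangular mask on the attention scores. A toy sketch (single head, no batching, illustrative shapes):

```python
import torch

D = 5                                    # sequence length
scores = torch.randn(D, D)               # raw scores, i.e. q @ k^T / sqrt(d_k)

# Allowed positions form a lower-triangular matrix; future positions get -inf
# before the softmax, so they receive exactly zero attention weight.
mask = torch.tril(torch.ones(D, D, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)  # row d attends only to positions <= d
```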

Training: parallel and fast

Autoregressive models are trained by maximum likelihood: maximise

$$\log p_\theta(\mathbf{x}) = \sum_{d=1}^{D} \log p_\theta(x_d \mid x_1, \dots, x_{d-1})$$

summed over the training data.

Crucially, during training the entire ground-truth sequence is known. So all $D$ conditional predictions can be computed in parallel in a single forward pass — feed the full target sequence in, get all predictions out, compute the per-position cross-entropy losses, sum, backpropagate. This is teacher forcing: each position is given the ground-truth previous tokens as input, regardless of what the model would have generated.

For a transformer, masked self-attention enforces the “only look at previous tokens” constraint structurally — the lower-triangular mask ensures position $d$ never attends to positions $> d$. So training is one forward pass: $D$ predictions and $D$ loss terms in $O(D^2)$ time (the attention cost), all parallelisable on GPUs.
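A sketch of one teacher-forced training step, assuming a hypothetical `model` that maps token ids of shape (batch, length) to next-token logits of shape (batch, length, vocab) under an internal causal mask:

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens, optimizer):
    """One teacher-forced step: all positions predicted in a single pass."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict the next token
    logits = model(inputs)                           # (batch, D-1, vocab), parallel
    loss = F.cross_entropy(                          # per-position cross-entropy,
        logits.reshape(-1, logits.size(-1)),         # averaged over positions here
        targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```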

Generation: sequential and slow

At inference, the ground truth isn’t available. To generate a sample:

  1. Sample $x_1 \sim p_\theta(x_1)$.
  2. Run the network with input $x_1$ to get $p_\theta(x_2 \mid x_1)$. Sample $x_2$.
  3. Run the network with input $(x_1, x_2)$ to get $p_\theta(x_3 \mid x_1, x_2)$. Sample $x_3$.
  4. Continue: run the network with input $(x_1, \dots, x_{d-1})$ to get $p_\theta(x_d \mid x_1, \dots, x_{d-1})$, sample $x_d$, and repeat until $x_D$.

$D$ sequential network passes — cannot be parallelised, because each step depends on the previous sample. This is the fundamental cost of autoregressive models: training is fast, generation is slow.
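The same hypothetical `model` used sequentially at generation time; note that each iteration is a full forward pass and cannot start until the previous sample exists.

```python
import torch

@torch.no_grad()
def generate(model, start_token, length):
    """Ancestral sampling: one forward pass per generated token."""
    seq = torch.tensor([[start_token]])            # shape (1, 1)
    for _ in range(length - 1):
        logits = model(seq)[:, -1, :]              # distribution over the next token
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, nxt], dim=1)         # append the sample, repeat
    return seq
```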

For a 1024-token GPT generation, you do 1024 forward passes back-to-back. For a 10-second WaveNet sample at 16 kHz, you do 160,000 forward passes. This is why generative LLMs are heavily optimised at inference (KV-caching, speculative decoding, etc.) — the per-step cost dominates.
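To see why caching helps, here is a toy single-head sketch of KV-caching (shapes illustrative): keys and values of past positions are computed once and reused, so each new step does $O(d)$ attention work rather than recomputing attention over the whole prefix.

```python
import torch

d_k = 8
K_cache, V_cache = [], []  # keys/values of past positions, computed once

def attend_one_step(q, k, v):
    """Attention output for a single new position, reusing cached K and V."""
    K_cache.append(k)
    V_cache.append(v)
    K = torch.stack(K_cache)                      # (t, d_k)
    V = torch.stack(V_cache)
    w = torch.softmax(K @ q / d_k ** 0.5, dim=0)  # attend over the t past positions
    return w @ V                                  # (d_k,) output for the new token

for t in range(5):                                # one new token per step
    q = k = v = torch.randn(d_k)
    out = attend_one_step(q, k, v)
```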

The asymmetry, in a table

|  | Training | Generation |
| --- | --- | --- |
| Inputs available | Full ground-truth sequence | Only what’s been generated so far |
| Parallelism | All positions in one pass | Strictly sequential, $D$ passes |
| Cost for transformer | $O(D^2)$ (one pass) | $O(D^3)$ if naively recomputed; $O(D^2)$ with caching |
| What’s being computed | Likelihood of the ground truth under the model | A new sample, composed step by step |

The mismatch between training and generation modes is sometimes called exposure bias — during training the model always sees correct previous tokens; at generation it sees its own (potentially wrong) previous samples. Various decoding strategies and training tricks (scheduled sampling, etc.) address this.

Why “ensure correct receptive field”

Slide 57 emphasises that autoregressive models must ensure the network only sees the past. Get this wrong and the model can cheat by peeking at the future during training, then fail catastrophically at generation when the future isn’t available.

Three structural ways to enforce causality:

  • Masked attention (transformers) — set future-position attention scores to $-\infty$.
  • Causal convolutions (WaveNet) — pad asymmetrically so the filter’s output at time $t$ depends only on times $\le t$.
  • Masked convolutions (PixelCNN) — zero out the kernel weights corresponding to future pixels in raster order.

All three implement the same constraint in different architectural shapes.

  • generative-model — the broader family; autoregressive is one approach among GANs, VAEs, diffusion, normalising flows
  • language-model — most modern language models are autoregressive (next-token prediction)
  • transformer — the dominant architecture for autoregressive language modelling
  • self-attention — masked self-attention enforces causality structurally
  • convolution — causal/masked convolutions enforce causality in WaveNet and PixelCNN
  • decoding-strategies — how to actually sample from the per-step distributions (greedy, beam, top-k, top-p, temperature)
  • diffusion-model — a non-autoregressive generative alternative; trades sequential generation for iterative denoising

Active Recall