An RNN is the architectural answer to “what if the input is a sequence of arbitrary length?” Feed-forward networks need a fixed-size input; convolutional networks can handle variable spatial extent but not variable depth in the temporal sense. The RNN trick: maintain a hidden state vector that summarises everything seen so far, and update it at each step using the same weights. The hidden state is the model’s running memory; the weight reuse is what lets a single network process inputs of any length. This was the standard architecture for language modelling between feed-forward LMs (2003) and Transformers (2017) — and it shows up again as the conceptual ancestor of state-space models in modern research.

The setup

The input is a sequence $w_1, w_2, \dots, w_T$ (e.g., a sentence’s word embeddings — see word-embedding). The RNN maintains a hidden state that’s updated at each step:

$$h_t = g(W h_{t-1} + U x_t)$$

where:

  • $x_t$ is the embedding of the input token $w_t$ (lookup in embedding matrix $E$: $x_t = E_{w_t}$).
  • $W$ updates the hidden state from its previous value.
  • $U$ projects the new input.
  • $g$ is a non-linearity (tanh, sigmoid, ReLU).
  • $h_0$ is an initial hidden state (often the zero vector or a learned parameter).

For each step $t$ (e.g., language modelling) the output distribution is computed from the hidden state:

$$\hat{y}_t = \operatorname{softmax}(V h_t)$$

where $V$ is the output projection. For an autoregressive LM this is $P(w_{t+1} \mid w_1, \dots, w_t)$ — a distribution over the next word.
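
A minimal sketch of this recurrence in NumPy (the dimensions and the 0.1 initialisation scale are arbitrary toy choices; the names E, W, U, V mirror the symbols above, with g = tanh):

```python
import numpy as np

rng = np.random.default_rng(0)
V_SIZE, D_EMB, D_HID = 1000, 64, 128        # toy vocabulary / embedding / hidden sizes

E = rng.normal(0, 0.1, (V_SIZE, D_EMB))     # embedding matrix (one row per word)
W = rng.normal(0, 0.1, (D_HID, D_HID))      # hidden-to-hidden update
U = rng.normal(0, 0.1, (D_HID, D_EMB))      # input projection
V = rng.normal(0, 0.1, (V_SIZE, D_HID))     # output projection

def softmax(z):
    z = z - z.max()                          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(h_prev, token_id):
    """One step: h_t = tanh(W h_{t-1} + U x_t), y_t = softmax(V h_t)."""
    x = E[token_id]                          # embedding lookup
    h = np.tanh(W @ h_prev + U @ x)          # state update (same W, U every step)
    y = softmax(V @ h)                       # distribution over the next token
    return h, y

h = np.zeros(D_HID)                          # h_0: zero initial state
for tok in [5, 42, 7]:                       # arbitrary toy token ids
    h, y = rnn_step(h, tok)
```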

The crucial point: weight sharing across time

A feed-forward network with $T$ inputs would need $T$ separate hidden layers, each with its own weights. Total parameters scale as $O(T)$. The RNN uses the same $W, U$ at every step:

$$h_1 = g(W h_0 + U x_1), \quad h_2 = g(W h_1 + U x_2), \quad \dots, \quad h_T = g(W h_{T-1} + U x_T)$$

A single set of parameters handles sequences of any length. This is what makes RNNs unbounded-context language models in principle (each step’s hidden state is a function of all previous inputs through the chain of updates).

Compared to n-gram LMs limited to a context of $n-1$ previous words, an RNN can in principle condition on the entire prefix — this is the reason RNNs improved perplexity dramatically when introduced.
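
To make the length-independence concrete, a toy check reusing rnn_step and the weights from the sketch above: the same parameters fold a 3-token and a 300-token sequence into a hidden state.

```python
def encode(token_ids):
    """Fold a sequence of any length into a single hidden state."""
    h = np.zeros(D_HID)
    for tok in token_ids:
        h, _ = rnn_step(h, tok)      # same W, U, V reused at every step
    return h

h_short = encode([1, 2, 3])                      # length 3
h_long  = encode(rng.integers(0, V_SIZE, 300))   # length 300: same parameters
```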

TIP — Same weights = same operator

The weight-sharing across time is the recurrent analogue of weight-sharing across spatial positions in CNNs. A CNN slides one filter across the image; an RNN slides one update operator across the sequence. The benefit in both cases: parameters don’t scale with input size, and patterns learned at one position automatically transfer to others (translation/time-shift equivariance).

Training: backpropagation through time

To train, unroll the RNN over the sequence (drawing each step explicitly), compute losses at each output step, and run standard backpropagation through the unrolled graph. This is called backpropagation through time (BPTT). The gradient with respect to a single hidden state flows back through every later step that depended on it:

$$\frac{\partial \mathcal{L}}{\partial h_t} = \sum_{s \ge t} \frac{\partial \mathcal{L}_s}{\partial h_s} \prod_{k=t}^{s-1} \frac{\partial h_{k+1}}{\partial h_k}, \qquad \frac{\partial h_{k+1}}{\partial h_k} = \operatorname{diag}\!\big(g'\big)\, W$$

Each “contribution from step $s$” multiplies through $\partial h_{k+1} / \partial h_k$ once per intervening step. After 50 steps, you’ve multiplied by $W$ fifty times — and this is the source of the next problem.
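
A sketch of BPTT itself, reusing the toy rnn_step, E, W, U, V from above (biases and the embedding gradient are omitted for brevity): the forward pass stores every hidden state, and the backward pass walks the unrolled graph in reverse, with dh_next carrying gradient to earlier steps via one multiplication by $W^\top$ per step.

```python
def bptt(token_ids, targets):
    """Unroll, then backpropagate through every step (classic BPTT)."""
    # Forward pass: store the states needed by the backward pass
    hs, ys = {-1: np.zeros(D_HID)}, {}
    for t, tok in enumerate(token_ids):
        hs[t], ys[t] = rnn_step(hs[t - 1], tok)
    # Backward pass: gradient at step t flows into all earlier steps via W
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    dh_next = np.zeros(D_HID)
    for t in reversed(range(len(token_ids))):
        dy = ys[t].copy(); dy[targets[t]] -= 1   # softmax + cross-entropy gradient
        dV += np.outer(dy, hs[t])
        dh = V.T @ dy + dh_next                  # local gradient + gradient from later steps
        dz = dh * (1 - hs[t] ** 2)               # through the tanh
        dW += np.outer(dz, hs[t - 1])
        dU += np.outer(dz, E[token_ids[t]])
        dh_next = W.T @ dz                       # multiply by W^T once per step backwards
    return dW, dU, dV

dW, dU, dV = bptt([5, 42, 7], [42, 7, 9])        # toy inputs and next-word targets
```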

The vanishing/exploding gradient problem

If $W$’s singular values are slightly less than 1, the gradient shrinks geometrically as it flows backwards in time — eventually to numerical zero. The model can’t learn from inputs more than ~10 steps in the past. If singular values are slightly greater than 1, the gradient grows geometrically and overflows. Either way, long-range dependencies become unlearnable — a paradox, since the architectural promise of RNNs is unbounded context.
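
A toy numerical check of this geometric behaviour. To isolate the effect of $W$, the sketch drops the nonlinearity’s diagonal factor (the linear-RNN simplification, where the 50-step Jacobian is exactly $W^{50}$) and uses a symmetric matrix so the spectral norm compounds exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
M = rng.normal(size=(d, d))
A = (M + M.T) / 2                    # symmetric, so spectral norm = spectral radius
A /= np.linalg.norm(A, 2)            # rescale largest singular value to exactly 1

for sigma in (0.9, 1.1):             # slightly below / slightly above 1
    J = np.eye(d)
    for _ in range(50):
        J = (sigma * A) @ J          # one multiplication per step backwards in time
    print(sigma, np.linalg.norm(J, 2))
# 0.9 -> 0.9**50 ~ 5e-3 (vanishing);  1.1 -> 1.1**50 ~ 1.2e2 (exploding)
```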

Symptoms:

  • The hidden state near the start of a long sequence has essentially no influence on outputs near the end. Whatever was important about the start has been “forgotten” because gradient signal couldn’t propagate that far.
  • Training becomes unstable on long sequences — exploding gradients cause loss to spike unpredictably; vanishing gradients cause it to plateau.

The standard fixes:

  • Gradient clipping — bound $\lVert \nabla \rVert$ at a maximum value to prevent explosions (see the sketch after this list).
  • Gating mechanisms — replace the simple recurrence with a gated one. This is the move from RNN to LSTM: explicit “remember” and “forget” gates allow gradients to flow through long-range “shortcuts” rather than chains of multiplications.
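
A minimal sketch of global-norm clipping (the max_norm=5.0 threshold is an arbitrary illustrative value; PyTorch ships this operation as torch.nn.utils.clip_grad_norm_):

```python
def clip_by_global_norm(grads, max_norm=5.0):
    """If the combined L2 norm of all gradients exceeds max_norm, rescale them."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

dW, dU, dV = clip_by_global_norm([dW, dU, dV])   # e.g. applied to the BPTT gradients above
```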

The lecturer summarises:

“The gradient is going to slowly move around to zero. Then similarly you can have an exploding gradient problem, because if you have some large numbers, they will be multiplied many times. The advantage of the RNN was still that it’s an elegant solution where you can have an input of any size.”

The other major problem: not parallelisable

To compute $h_t$, you need $h_{t-1}$. To compute $h_{t-1}$, you need $h_{t-2}$. The recurrence is sequential — you cannot compute $h_t$ without first computing all $h_s$ for $s < t$. Both training and inference therefore process one token at a time.

For long sequences this is slow on GPU hardware that thrives on parallelism. A 1000-token sequence requires 1000 sequential network applications — not 1000 parallel ones. This is what would later motivate Transformers: the attention mechanism replaces the recurrence with an operation that can be parallelised across positions, unlocking GPU efficiency.
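
One concrete illustration, reusing the toy weights from the first sketch: the input projections $U x_t$ for all $t$ batch into a single matmul, but the hidden-state chain cannot be batched, because each step consumes the previous step’s output.

```python
def forward_all(token_ids):
    """Hidden states for a whole sequence: the input side batches, the recurrence doesn't."""
    X = E[token_ids]                  # (T, D_EMB): all embedding lookups at once
    UX = X @ U.T                      # (T, D_HID): one big parallel matmul
    h, hs = np.zeros(D_HID), []
    for t in range(len(token_ids)):   # inherently serial: h_t needs h_{t-1} first
        h = np.tanh(W @ h + UX[t])
        hs.append(h)
    return np.stack(hs)
```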

RNNs as language models

For language modelling, the input at step $t$ is the one-hot encoding (or embedding) of word $w_t$, and the output is the predicted distribution for $w_{t+1}$:

$$\hat{y}_t = \operatorname{softmax}(V h_t) = P(w_{t+1} \mid w_1, \dots, w_t)$$

Training maximises the log-likelihood of the next word given the prefix — equivalently, minimises cross-entropy. Generation works by sampling $w_{t+1} \sim \hat{y}_t$, feeding it back as the next input, and continuing (see decoding-strategies). This is the canonical autoregressive LM training and inference loop, identical in spirit to GPT despite the very different architecture.
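
A sketch of that loop, reusing rnn_step and the toy weights from above (pure ancestral sampling; temperature, top-k and other refinements are covered in decoding-strategies):

```python
def generate(start_id, n_tokens):
    """Sample w_{t+1} ~ y_t, feed it back in as the next input, repeat."""
    h, tok, out = np.zeros(D_HID), start_id, []
    for _ in range(n_tokens):
        h, y = rnn_step(h, tok)              # y = P(next word | prefix, via h)
        tok = int(rng.choice(V_SIZE, p=y))   # ancestral sampling from the distribution
        out.append(tok)
    return out

print(generate(start_id=5, n_tokens=10))     # ten sampled toy token ids
```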

Advantages over feed-forward NLMs (Bengio 2003):

  • Unbounded context (can in principle use the entire prefix, not just the last $n-1$ words).
  • Better perplexity on standard benchmarks.

Disadvantages:

  • Vanishing gradients limit effective context to ~tens of words despite unbounded theoretical context.
  • Sequential processing → slow training, poor GPU utilisation.
  • Unidirectional — natural for autoregressive generation, but means each hidden state only sees left context. (Bidirectional RNNs/LSTMs fix this for representation tasks.)

Stacking and bidirectionality

Two enhancements that improve quality:

  • Stacked / deep RNN. Use the outputs of one RNN layer as inputs to a second, then a third, the same way feed-forward networks gain capacity via depth. Each layer learns to summarise progressively higher-level features.
  • Bidirectional RNN. Run one RNN forwards and another backwards, then concatenate (or sum) their hidden states at each position (see the sketch after this list). Result: each position’s representation depends on both left and right context. Used in tasks where the entire input is available (classification, sequence labelling) — but not directly usable for autoregressive generation, since it’d need the future to predict the present.
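
A sketch of the bidirectional combination, reusing rnn_step from above. For brevity it reuses one set of toy weights for both directions; a real BiRNN trains separate parameters per direction.

```python
def bidirectional_encode(token_ids):
    """Concatenate forward and backward hidden states at each position."""
    def run(ids):
        h, hs = np.zeros(D_HID), []
        for tok in ids:
            h, _ = rnn_step(h, tok)
            hs.append(h)
        return hs
    fwd = run(token_ids)                 # left-to-right pass
    bwd = run(token_ids[::-1])[::-1]     # right-to-left pass, re-aligned to positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```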

A stacked bidirectional LSTM is the architecture behind ELMo (2018), which produced the first widely-used contextualized word embeddings before BERT.

Connections

  • Sequence model used by RNN-based LMs — the dominant LM architecture between Bengio 2003 and the Transformer era.
  • Generalised by LSTMs — gating mechanisms address vanishing gradients while keeping the recurrent structure.
  • Superseded by Transformers (week 11) — attention replaces recurrence with an operation that can be parallelised across positions and that propagates information without the multiplication-chain bottleneck.
  • Trained by backpropagation through time, which is just standard backpropagation applied to the unrolled graph.