Break a messy nested expression into an explicit graph of small steps. Each node holds an intermediate result; each edge is one elementary operation. Once the function is drawn as a DAG, computing gradients with the chain rule becomes mechanical.

Definition

A computation graph is a directed acyclic graph (DAG) where:

  • Nodes represent variables — inputs, outputs, or intermediate results.
  • Edges represent elementary operations (addition, multiplication, squaring, applying an activation, etc.). The edge goes from operand(s) to result.

Because it’s acyclic, there’s no loop — values flow in one direction, from inputs through intermediate results to the final output. Because it’s directed, each node has clearly defined predecessors (nodes it depends on) and successors (nodes that depend on it).
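To make the definition concrete, here is a minimal sketch of one way to represent such a DAG in code (the `Node` class and its fields are illustrative, not from any particular library):

```python
# Minimal sketch of a computation-graph node (illustrative, not a real autodiff API).
# Each node holds an intermediate result; edges are recorded as predecessor links.

class Node:
    def __init__(self, value, op=None, parents=()):
        self.value = value        # the value held at this node
        self.op = op              # name of the elementary operation (None for inputs)
        self.parents = parents    # predecessor nodes, i.e. the operands

# Build the graph for u = w * x: values flow from inputs to the result.
w = Node(2.0)
x = Node(3.0)
u = Node(w.value * x.value, op="mul", parents=(w, x))

print(u.value)                        # 6.0
print([p.value for p in u.parents])   # [2.0, 3.0]
```

Because every node records its parents, the structure is directed by construction, and building nodes only from already-existing nodes keeps it acyclic.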

Why bother: chain rule gets systematic

Given a composite expression like L = ((wx + b) − y)², you could compute ∂L/∂w by a flurry of chain-rule applications in one go. But for a bigger network the expression is hundreds of levels deep with shared sub-expressions, and hand-derivation becomes error-prone and compute-heavy.

A computation graph fixes this by:

  1. Factoring the composite expression into a series of simple, named intermediate results — one per node.
  2. Localising the derivative work: each edge contributes a single elementary derivative that depends only on its two endpoints.
  3. Chaining those local derivatives through the graph to recover any derivative of the output with respect to an input.

Once the graph is drawn, the chain rule mechanically tells you how to combine the local derivatives along any path from input to output.

Worked example: rewriting a regression loss as a graph

Take the squared-error loss for a single training example:

    L = (ŷ − y)²,   where ŷ = wx + b

Introduce intermediate variables to isolate each elementary operation:

    u = w·x        (multiply)
    ŷ = u + b      (add)
    e = ŷ − y      (subtract)
    L = e²         (square)

Draw this as a chain:

    w → u → ŷ → e → L

Each arrow is one elementary operation. Each node is a function of its predecessor.

Now to compute ∂L/∂w: walk the chain and multiply the local derivatives.

    Edge       Local derivative
    w → u      ∂u/∂w = x
    u → ŷ      ∂ŷ/∂u = 1
    ŷ → e      ∂e/∂ŷ = 1
    e → L      ∂L/∂e = 2e

Chain rule gives:

    ∂L/∂w = ∂L/∂e · ∂e/∂ŷ · ∂ŷ/∂u · ∂u/∂w = 2e · 1 · 1 · x = 2(ŷ − y)x

The graph transforms a single tangled derivative into a systematic walk.
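This walk can be checked numerically with a short script (the concrete values w = 2, b = 1, x = 3, y = 5 are illustrative; `y_hat` stands for ŷ):

```python
# Forward pass: compute each intermediate once, in chain order.
w, b = 2.0, 1.0          # parameters (illustrative values)
x, y = 3.0, 5.0          # input feature and ground-truth label

u = w * x                # multiply
y_hat = u + b            # add
e = y_hat - y            # subtract
L = e ** 2               # square

# Backward walk: multiply the local derivatives along the chain.
dL_de = 2 * e            # edge e -> L
de_dyhat = 1.0           # edge y_hat -> e
dyhat_du = 1.0           # edge u -> y_hat
du_dw = x                # edge w -> u

dL_dw = dL_de * de_dyhat * dyhat_du * du_dw
print(dL_dw)             # 2*(y_hat - y)*x = 2 * 2 * 3 = 12.0
```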

What the chain rule does in a graph

The chain rule’s job in a computation graph is to trace a path. Pick any input node a and any output node z; multiply the local derivatives along every edge on the path connecting them; the product is ∂z/∂a. The yellow arrows above show the path for the simplest case — one input, one output, no branching. When the graph branches, you sum over paths (next section).

Branching graphs and shared subexpressions

Straight-line chains are easy. Real computations branch and merge — and this is where the graph representation earns its keep.

Consider f(x, y) = x² + x·y. The variable x appears twice. Draw the graph:

  • x, y are leaf (input) nodes.
  • u = x² (intermediate).
  • v = x·y (intermediate).
  • f = u + v (output).

Here x feeds into both u and v, so f depends on x through two distinct paths. Computing ∂f/∂x requires summing contributions from each path — this is the multivariate chain rule, covered in backpropagation:

    ∂f/∂x = ∂f/∂u · ∂u/∂x + ∂f/∂v · ∂v/∂x = 1·2x + 1·y = 2x + y
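The sum over paths can be sketched directly. As an illustration, take f(x, y) = x² + x·y, where x feeds two intermediates (values chosen arbitrarily):

```python
# Sum-over-paths for f(x, y) = x**2 + x*y: x reaches f through two routes.
x, y = 3.0, 4.0

u = x ** 2               # path 1: x -> u -> f
v = x * y                # path 2: x -> v -> f
f = u + v

# Local derivatives on each path, then sum the two path contributions.
df_du, du_dx = 1.0, 2 * x
df_dv, dv_dx = 1.0, y

df_dx = df_du * du_dx + df_dv * dv_dx
print(df_dx)             # 2*x + y = 10.0
```

Each term in the sum is one path's product of local derivatives; dropping either term would miss part of x's influence on f.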

ASIDE — Branching is why computation graphs matter for networks

A neural network is full of branching: a single weight early in the network influences many downstream neurons (and therefore many paths to the loss). Without an explicit graph, bookkeeping these paths by hand becomes unmanageable. With the graph, each node just accumulates contributions from all its successors — a local computation.

Node types

When drawing a computation graph for a neural network (e.g. the week 3 linear-regression graph), the nodes fall into three visual categories:

  • Input nodes (blue circles in the slides) — features x and ground-truth labels y. Given, not learned.
  • Parameter nodes (orange circles) — learnable weights w and biases b. These are what gradient descent updates.
  • Computation nodes (yellow squares) — intermediate results: weighted sums, activations, differences, the loss itself. Produced by combining predecessor nodes via elementary operations.

Only parameter nodes receive gradient updates; input nodes stay fixed, and computation nodes are recomputed on each forward pass.
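The three roles can be sketched as one gradient-descent step on the regression loss (learning rate and values are made up for illustration):

```python
lr = 0.1                 # learning rate (illustrative)
w, b = 2.0, 1.0          # parameter nodes: these get updated
x, y = 3.0, 5.0          # input nodes: fixed, never updated

# Computation nodes: recomputed on each forward pass.
y_hat = w * x + b
grad_w = 2 * (y_hat - y) * x   # dL/dw for L = (y_hat - y)**2
grad_b = 2 * (y_hat - y)       # dL/db

# Only the parameters change.
w -= lr * grad_w         # w moves toward reducing L (≈ 0.8 here)
b -= lr * grad_b         # b likewise (0.6 here)
```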

Reusing intermediate derivatives (efficiency)

In the regression graph above, both ∂L/∂w and ∂L/∂b require ∂L/∂ŷ. A naive pen-and-paper approach recomputes it. A real implementation:

  • Traverses the graph backwards from the output.
  • Computes each intermediate partial once.
  • Caches it for use by all downstream dependents.

This is the core insight that makes backpropagation efficient enough for networks with billions of parameters: the same intermediate partials appear in many final derivatives, so compute them once and reuse.
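The compute-once-and-reuse idea can be sketched for the regression chain (a toy reverse pass, not a real autodiff library; `y_hat` stands for ŷ):

```python
# Toy reverse pass over the regression graph: the shared partial is computed
# once and cached, then reused by every parameter gradient that needs it.
w, b = 2.0, 1.0
x, y = 3.0, 5.0

# Forward pass (values).
u = w * x
y_hat = u + b
e = y_hat - y
L = e ** 2

# Backward pass (gradients), from output toward inputs.
dL_de = 2 * e
dL_dyhat = dL_de * 1.0       # computed once and cached...

# ...and reused by both parameter gradients instead of being re-derived.
dL_dw = dL_dyhat * 1.0 * x   # via y_hat -> u -> w
dL_db = dL_dyhat * 1.0       # via y_hat -> b

print(dL_dw, dL_db)          # 12.0 4.0
```

In a network, the cached quantity at each node is exactly the accumulated contribution from all of its successors, so the reverse pass visits every edge only once.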

  • backpropagation — the algorithm that walks the computation graph in both directions (forward for values, backward for gradients)
  • multi-layer-perceptron — a neural network is a computation graph with a specific structure

Active Recall