Break a messy nested expression into an explicit graph of small steps. Each node holds an intermediate result; each edge is one elementary operation. Once the function is drawn as a DAG, computing gradients with the chain rule becomes mechanical.
Definition
A computation graph is a directed acyclic graph (DAG) where:
- Nodes represent variables — inputs, outputs, or intermediate results.
- Edges represent elementary operations (addition, multiplication, squaring, applying an activation, etc.). The edge goes from operand(s) to result.
Because it’s acyclic, there’s no loop — values flow in one direction, from inputs through intermediate results to the final output. Because it’s directed, each node has clearly defined predecessors (nodes it depends on) and successors (nodes that depend on it).
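To make the structure concrete, here is a minimal Python sketch (the `Node` class and `topo_order` helper are illustrative names, not from any library; the example graph prefigures the regression loss worked through below):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    """One variable in the graph: an input, a parameter, or an intermediate result."""
    name: str
    op: str | None = None  # elementary operation that produced this node (None for leaves)
    parents: list[Node] = field(default_factory=list)  # operands this node depends on

def topo_order(output: Node) -> list[Node]:
    """Return nodes in dependency order; acyclicity guarantees this terminates."""
    order: list[Node] = []
    seen: set[int] = set()

    def visit(n: Node) -> None:
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.parents:
            visit(p)
        order.append(n)

    visit(output)
    return order

# Wire up L = (w*x + b - y)^2 as a sequence of elementary steps.
w, x, b, y = Node("w"), Node("x"), Node("b"), Node("y")
z    = Node("z", "mul", [w, x])      # z = w * x
yhat = Node("yhat", "add", [z, b])   # yhat = z + b
e    = Node("e", "sub", [yhat, y])   # e = yhat - y
L    = Node("L", "square", [e])      # L = e^2
print([n.name for n in topo_order(L)])  # ['w', 'x', 'z', 'b', 'yhat', 'y', 'e', 'L']
```

Evaluating forward means visiting nodes in exactly this order; differentiating backward visits them in reverse.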
Why bother: chain rule gets systematic
Given a composite expression like $L = (wx + b - y)^2$, you could compute $\partial L/\partial w$ by a flurry of chain-rule applications in one go. But for a bigger network the expression is a hundred levels deep with shared sub-expressions, and hand-derivation becomes error-prone and compute-heavy.
A computation graph fixes this by:
- Factoring the composite expression into a series of simple, named intermediate results — one per node.
- Localising the derivative work: each edge contributes a single elementary derivative that depends only on its two endpoints.
- Chaining those local derivatives through the graph to recover any derivative $\partial(\text{output})/\partial(\text{input})$ you need.
Once the graph is drawn, the chain rule mechanically tells you how to combine the local derivatives along any path from input to output.
Worked example: rewriting a regression loss as a graph
Take the squared-error loss for a single training example:

$$L = (\hat{y} - y)^2, \qquad \hat{y} = wx + b$$

Introduce intermediate variables to isolate each elementary operation:

$$z = wx, \qquad \hat{y} = z + b, \qquad e = \hat{y} - y, \qquad L = e^2$$

Draw this as a chain (with $x$, $b$, and $y$ feeding in from the side):

$$w \longrightarrow z \longrightarrow \hat{y} \longrightarrow e \longrightarrow L$$
Each arrow is one elementary operation. Each node is a function of its predecessor.
Now to compute $\partial L/\partial w$: walk the chain and multiply the local derivatives.
| Edge | Local derivative |
|---|---|
| $w \to z$ | $\partial z/\partial w = x$ |
| $z \to \hat{y}$ | $\partial \hat{y}/\partial z = 1$ |
| $\hat{y} \to e$ | $\partial e/\partial \hat{y} = 1$ |
| $e \to L$ | $\partial L/\partial e = 2e$ |

Chain rule gives:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial e} \cdot \frac{\partial e}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w} = 2e \cdot 1 \cdot 1 \cdot x = 2(\hat{y} - y)\,x$$
The graph transforms a single tangled derivative into a systematic walk.
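A quick numeric sanity check of that walk, with arbitrarily chosen values (plain Python; the finite-difference comparison at the end just confirms the product of local derivatives):

```python
# Forward pass for L = (w*x + b - y)^2 at arbitrary values.
w, x, b, y = 1.5, 2.0, 0.5, 3.0
z = w * x        # z = 3.0
yhat = z + b     # yhat = 3.5
e = yhat - y     # e = 0.5
L = e ** 2       # L = 0.25

# Local derivatives along the chain w -> z -> yhat -> e -> L.
dz_dw = x        # d(w*x)/dw
dyhat_dz = 1.0   # d(z + b)/dz
de_dyhat = 1.0   # d(yhat - y)/dyhat
dL_de = 2 * e    # d(e^2)/de

dL_dw = dL_de * de_dyhat * dyhat_dz * dz_dw
print(dL_dw)  # 2*(yhat - y)*x = 2 * 0.5 * 2.0 = 2.0

# Finite-difference check: nudge w and watch L move.
h = 1e-6
print((((w + h) * x + b - y) ** 2 - L) / h)  # ~2.0
```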
What the chain rule does in a graph
The chain rule’s job in a computation graph is to trace a path. Pick any input node $u$ and any output node $v$; multiply the local derivatives along every edge on the path connecting them; the product is $\partial v/\partial u$. The worked example above shows the simplest case — one input, one output, no branching. When the graph branches, you sum over paths (next section).
Branching graphs and shared subexpressions
Straight-line chains are easy. Real computations branch and merge — and this is where the graph representation earns its keep.
Consider $e = (a + b)(b + 1)$. The variable $b$ appears twice. Draw the graph:
- $a$ and $b$ are leaf (input) nodes.
- $c = a + b$ (intermediate).
- $d = b + 1$ (intermediate).
- $e = c \cdot d$ (output).
Here $b$ feeds into both $c$ and $d$, so $e$ depends on $b$ through two distinct paths. Computing $\partial e/\partial b$ requires summing contributions from each path — this is the multivariate chain rule, covered in backpropagation:

$$\frac{\partial e}{\partial b} = \frac{\partial e}{\partial c}\frac{\partial c}{\partial b} + \frac{\partial e}{\partial d}\frac{\partial d}{\partial b} = d \cdot 1 + c \cdot 1 = c + d$$
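A numeric sketch of the two-path sum (values arbitrary; the finite-difference line confirms the analytic result $c + d$):

```python
# e = (a + b) * (b + 1): b reaches e through two paths.
a, b = 2.0, 3.0
c = a + b     # path 1 passes through c (c = 5.0)
d = b + 1     # path 2 passes through d (d = 4.0)
e = c * d     # e = 20.0

# Local derivatives on each edge.
de_dc, de_dd = d, c      # product rule on e = c*d
dc_db, dd_db = 1.0, 1.0  # both sums pass b through with slope 1

# Multivariate chain rule: sum over paths b -> c -> e and b -> d -> e.
de_db = de_dc * dc_db + de_dd * dd_db
print(de_db)  # c + d = 9.0

h = 1e-6
print(((a + b + h) * (b + h + 1) - e) / h)  # ~9.0
```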
ASIDE — Branching is why computation graphs matter for networks
A neural network is full of branching: a single weight early in the network influences many downstream neurons (and therefore many paths to the loss). Without an explicit graph, bookkeeping these paths by hand becomes unmanageable. With the graph, each node just accumulates contributions from all its successors — a local computation.
Node types
When drawing a computation graph for a neural network (e.g. the week 3 linear-regression graph), the nodes fall into three visual categories:
- Input nodes (blue circles in the slides) — features $x_i$ and ground-truth labels $y$. Given, not learned.
- Parameter nodes (orange circles) — learnable weights $w_i$ and biases $b$. These are what gradient descent updates.
- Computation nodes (yellow squares) — intermediate results: weighted sums, activations, differences, the loss itself. Produced by combining predecessor nodes via elementary operations.
Only parameter nodes receive gradient updates; input nodes stay fixed, and computation nodes are recomputed on each forward pass.
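This three-way split maps directly onto how frameworks tag tensors. A minimal PyTorch sketch of the same regression loss (scalar values chosen arbitrarily):

```python
import torch

# Input nodes: given data, no gradients tracked.
x = torch.tensor(2.0)
y = torch.tensor(3.0)

# Parameter nodes: the leaves gradient descent will update.
w = torch.tensor(1.5, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

# Computation nodes: rebuilt on every forward pass.
yhat = w * x + b
loss = (yhat - y) ** 2

loss.backward()        # walk the graph backwards
print(w.grad, b.grad)  # tensor(2.), tensor(1.): gradients land on parameter nodes
print(x.grad)          # None: input nodes receive no updates
```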
Reusing intermediate derivatives (efficiency)
In the branching graph above, both $\partial e/\partial a$ and $\partial e/\partial b$ require $\partial e/\partial c$. A naive pen-and-paper approach recomputes it. A real implementation:
- Traverses the graph backwards from the output.
- Computes each intermediate partial once.
- Caches it for use by all downstream dependents.
This is the core insight that makes backpropagation efficient enough for networks with billions of parameters: the same intermediate partials appear in many final derivatives, so compute them once and reuse.
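A toy sketch of that backward traversal over the branching graph from earlier. All names here (`parents`, `local_grads`, the hard-coded visit order) are assumptions made for brevity, not a real autodiff API:

```python
from collections import defaultdict

def backward(output, parents, local_grads):
    """Reverse-mode pass: each node's partial is computed once, then reused.

    parents[n]          -- nodes n was computed from
    local_grads[(n, p)] -- d(n)/d(p), the elementary derivative on edge p -> n
    """
    grad = defaultdict(float)
    grad[output] = 1.0
    # Visit in reverse dependency order (hard-coded; a real system topo-sorts).
    for n in ["e", "c", "d"]:
        for p in parents[n]:
            grad[p] += grad[n] * local_grads[(n, p)]  # accumulate across paths
    return grad

# e = (a + b) * (b + 1) at a=2, b=3  =>  c=5, d=4, e=20.
parents = {"e": ["c", "d"], "c": ["a", "b"], "d": ["b"]}
local_grads = {
    ("e", "c"): 4.0, ("e", "d"): 5.0,  # d(c*d)/dc = d, d(c*d)/dd = c
    ("c", "a"): 1.0, ("c", "b"): 1.0,
    ("d", "b"): 1.0,
}

g = backward("e", parents, local_grads)
print(g["a"], g["b"])  # 4.0 9.0; grad['c'] was computed once and reused by both
```

One sweep touches every edge exactly once, which is what "compute once and reuse" buys.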
Related
- backpropagation — the algorithm that walks the computation graph in both directions (forward for values, backward for gradients)
- multi-layer-perceptron — a neural network is a computation graph with a specific structure
Active Recall
What does it mean for a computation graph to be "directed" and "acyclic"? Why do both properties matter?
Directed: each edge has a direction (operand → result), so every node has well-defined predecessors (what it depends on) and successors (what depends on it). This lets us order computations — forward and backward passes both need that ordering. Acyclic: no loops. Cycles would mean a value depends on itself, making both evaluation and gradient computation ill-defined. Together, the two properties guarantee a well-defined topological order in which you can evaluate (forward) or differentiate (backward).
For $L = (wx - y)^2$, draw the computation graph and compute $\partial L/\partial w$.
Graph: $w, x \to z = wx \to e = z - y \to L = e^2$. $w$ enters at the $z$ node. Local derivatives on the path $w \to z \to e \to L$: $\partial z/\partial w = x$, $\partial e/\partial z = 1$, $\partial L/\partial e = 2e$. Chain: $\partial L/\partial w = 2e \cdot 1 \cdot x = 2(wx - y)\,x$.
In the graph for $e = (a + b)(b + 1)$, why does $\partial e/\partial b$ have two terms whereas $\partial e/\partial a$ has only one?
$a$ reaches $e$ through exactly one path ($a \to c \to e$), so the chain rule gives one product. $b$ reaches $e$ through two paths ($b \to c \to e$ and $b \to d \to e$), so we must sum the contributions from both paths — this is the multivariate chain rule. In network terms: parameters that feed into multiple downstream neurons must accumulate gradient contributions from every path to the loss.
You're given a computation graph for $L = (u_1 u_2 + u_3 u_4 + u_5 - u_6)^2$. Identify which leaf variables are inputs, which are learnable parameters, and which is the ground-truth label of a linear regression problem.
Working backward from the regression structure where $\hat{y} = w_1 x_1 + w_2 x_2 + b$: $L$ is the loss, $u_6$ is the ground-truth label (it gets subtracted from the prediction), and the predicted-value subtree is $u_1 u_2 + u_3 u_4 + u_5$. Comparing to $w_1 x_1 + w_2 x_2 + b$, each multiplication is a (parameter × input) pair and the standalone $u_5$ is the bias. So one valid assignment is parameters: $u_1, u_3, u_5$ (the two weights and the bias); inputs: $u_2, u_4$ (the two features); ground-truth label: $u_6$. The two products are interchangeable, and $u_1$/$u_2$ (likewise $u_3$/$u_4$) can swap roles within each multiplication.
Why do modern deep learning frameworks (PyTorch, TensorFlow) build a computation graph rather than just computing derivatives symbolically?
Building the graph makes gradient computation local and reusable: each node only needs to know how its output depends on its immediate inputs. The framework can then walk the graph backwards, accumulating gradients, and crucially can share intermediate results between the many derivatives it needs to compute. Symbolic differentiation of a full expression produces enormous formulae with massive redundancy; the graph-based approach (automatic differentiation) is exponentially more efficient for the kind of deeply nested expressions neural networks define.