Computational graphs

The essential role of computational graphs

Differentiating a modern network by hand is out of the question: its loss depends on billions of parameters, and written as a single formula the derivative would be an expression of unmanageable size, with no hope of evaluating it once, let alone once per parameter. Training nonetheless needs that entire gradient at every step, and obtains it at a cost close to running the model once. The reason is structural.

The output of a network is built through a long sequence of elementary operations, which makes its loss a deeply composite function of the parameters. Differentiating such a function is, in principle, only the chain rule; the real difficulty is bookkeeping, tracking how a small change in one early quantity propagates all the way to the final scalar without recomputing the same intermediate pieces over and over. A computational graph is the structure that handles that bookkeeping: it records, operation by operation, which quantity is produced from which.

Important

A computational graph turns a complicated function into a network of simple local operations. Once that structure is visible, differentiation stops being opaque symbol-pushing and becomes an organized, almost mechanical, application of the chain rule.

Read this way, the model name is irrelevant and only the structure of the computation matters. What is usually called “backpropagation” is the name this structure-driven differentiation takes in deep learning: reverse-mode automatic differentiation on a computational graph.

A concrete companion to this note

The development here is abstract on purpose. The micrograd series builds exactly this object in a few dozen lines of Python: a scalar Value that records the operation and operands that produced it, then differentiates itself through a single backward() call. Reading the two together, the graph described here and the running code there, is the quickest way to turn the abstract account in this note into something concrete.

From formula to graph

The equivalence between a compact formula and a graph-ready factorization is easiest to see side by side.

Compact expression	Graph-ready factorization
$a = σ (x_{0} w_{0} + x_{1} w_{1} + x_{2} w_{2} + b)$	$i_{0} = x_{0} w_{0}, i_{1} = x_{1} w_{1}, i_{2} = x_{2} w_{2}$ $z = i_{0} + i_{1} + i_{2} + b$ $a = σ (z)$

This factorization is precisely what the computational graph records; the repeated colors only help the eye follow the same quantities across the two columns. The two columns hold the same function, yet they foreground different things. The compact formula foregrounds the end result; the factorization foregrounds the dependency structure, which quantity feeds which. Differentiation cares only about the second.

What a computational graph is

A computational graph is the directed graph that lays a computation bare, recording which quantities are combined by which operations, and in what order, to produce the final result.

In the convention adopted here:

nodes represent operations such as multiplication, addition, or sigmoid;
edges represent the flow of numerical values from one operation to the next;
leaves typically correspond to external quantities such as inputs, parameters, or constants;
the root is the final quantity of interest, which in learning is usually a scalar loss.

Conventions vary

Some texts place variables at nodes and operations on edges. Others, including the diagram below, use the opposite convention. Nothing essential depends on that choice. What matters is that the graph faithfully captures which quantities depend on which others.

A single forward evaluation never loops back on itself: every node draws only on values already produced upstream, so the graph is acyclic. That acyclicity is what lets the computation be evaluated at all, and it is also what lets differentiation proceed in an orderly sweep, since both the forward and the backward pass can follow a topological order of the nodes.

The same object, in code

This is precisely the structure micrograd records: each node (a Value) stores the operands that produced it (_prev) and the operation that combined them (_op). Constructing such a node and inspecting _prev and _op is the subject of The Value object.

One forward pass, and the path of a single gradient

The diagram expands a small 3-1 MLP into its computational graph. Reading left to right is the forward pass: three mult nodes form the products $i_{0} = x_{0} w_{0}$ , $i_{1} = x_{1} w_{1}$ , $i_{2} = x_{2} w_{2}$ ; the add node combines them with the bias into the pre-activation $z = i_{0} + i_{1} + i_{2} + b$ ; the sigmoid node returns $a = 1/ (1 + e^{- z})$ .

Beneath each node sit its local derivatives, one per input. The red elements isolate a single question, how $w_{0}$ affects $a$ . Because $w_{0}$ reaches $a$ through exactly one route, $w_{0} \to i_{0} \to z \to a$ , the answer is the ordinary chain rule: the product of the three local derivatives lying on that route,
$\frac{\partial a}{\partial w _{0}} = \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial i _{0}} \cdot \frac{\partial i _{0}}{\partial w _{0}} .$
The dashed line is that route walked backward through the graph; the grayed-out derivatives ( $\frac{\partial i _{0}}{\partial x _{0}}$ , $\frac{\partial z}{\partial i _{1}}$ , and the rest) lie on other routes and take no part in this gradient.

A single route is the easy case. As soon as a quantity reaches the output through more than one route, the product turns into a sum over routes, the multivariate chain rule developed in Branching, merging, and gradient accumulation below. Where the local derivatives themselves come from is the next section; the product above is reassembled in The chain rule on a graph. The very same neuron gradient is computed by hand, node by node, in micrograd’s Manual backpropagation through a neuron.

Strictly speaking, training usually appends one more node on top of $a$ , namely a scalar loss

L = ℓ (a, y),

because the true training objective is not the activation itself but the loss.
However, the graph shown above is already sufficient to understand the key local mechanisms.

Backpropagation runs on any differentiable graph

Nothing in this picture is specific to a perceptron. A computational graph is only a visual account of a computation: nodes are operations, edges carry the data between them. Backpropagation runs on any such graph, provided each operation can report a local derivative with respect to each of its inputs. The MLP drawn here is the simplest instance; convolutional networks, recurrent networks, and Transformers are differentiable graphs in exactly the same sense, and the same backward pass trains them all. The precise condition, and the operations that break it, are taken up in the final section of this note.

The principle of locality

The power of the graph comes from a single discipline: no node has to understand the computation it belongs to. A node that produces

v = ϕ (u_{1}, \dots, u_{k})

needs to know only two facts about itself: its forward rule, how to compute $v$ from its inputs $u_{1}, \dots, u_{k}$ , and its local derivative rules, how $v$ responds when one input $u_{j}$ is nudged. Nothing more. Equipped with these, the node can play its part with no view of the graph beyond its own inputs and output.

For the neuron of the figure, both ingredients can be read off at once:

Node	Forward rule	Local derivatives
`mult`	$i_{0} = x_{0} w_{0}$	$\frac{\partial i _{0}}{\partial x _{0}} = w_{0}, \frac{\partial i _{0}}{\partial w _{0}} = x_{0}$
`add`	$z = i_{0} + i_{1} + i_{2} + b$	$\frac{\partial z}{\partial i _{0}} = 1, \frac{\partial z}{\partial i _{1}} = 1, \frac{\partial z}{\partial i _{2}} = 1, \frac{\partial z}{\partial b} = 1$
`sigmoid`	$a = σ (z)$	$\frac{\partial a}{\partial z} = σ (z) (1 - σ (z)) = a (1 - a)$

Important

The global derivative is not stored anywhere in the graph. It emerges by composing local derivatives. This locality is the entire reason computational graphs scale.

Once every primitive in a large computation carries its own forward rule and local derivatives, differentiation becomes modular: the gradient of the whole is assembled from parts that were each trivial to differentiate in isolation.

In micrograd this modularity is literal: every operation stores its own forward value and a small _backward rule for its local derivatives. A composite such as $tanh$ is then a free choice, either given one such rule directly or split into primitives that each carry theirs, as worked through in micrograd’s Detour: Breaking Down tanh.

The chain rule on a graph

With the local pieces in place, the graph turns a question about influence into a question about routes. Asking how one quantity affects another amounts to asking which paths connect them, and the chain rule reads the answer straight off those paths.

Suppose the goal is to compute the effect of $w_{0}$ on the final activation $a$ .
The graph reveals the dependency chain immediately:

w_{0} ⟶ i_{0} ⟶ z ⟶ a .

The chain rule then gives

\frac{\partial a}{\partial w _{0}} = \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial i _{0}} \cdot \frac{\partial i _{0}}{\partial w _{0}} .

Substituting the local derivatives,

\frac{\partial a}{\partial w _{0}} = a (1 - a) \cdot 1 \cdot x_{0} = a (1 - a) x_{0} .

If a scalar loss is attached on top, the same pattern simply extends one step further:

\frac{\partial L}{\partial w _{0}} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial i _{0}} \cdot \frac{\partial i _{0}}{\partial w _{0}} .

The chain rule is, in this light, the graph’s own law of motion for sensitivity: the rule by which a perturbation introduced at one node is carried, edge by edge and factor by factor, all the way to the output.

Branching, merging, and gradient accumulation

The previous example followed a single chain of dependence. Real graphs are rarely so tidy: a quantity often reaches the output through several routes at once. This multi-path case is the one informal descriptions get wrong, by following a single path and quietly dropping the rest. The rule that handles it correctly is simple and exact.

Accumulation rule

When a quantity affects the final output through several downstream operations, its total derivative is the sum of the contributions arriving from each of them.

To see why, consider

u = f (x), v = g (x), s = h (u, v) .

Then $x$ influences $s$ through two branches, one through $u$ and one through $v$ .
The multivariate chain rule gives

\frac{\partial s}{\partial x} = \frac{\partial s}{\partial u} \frac{\partial u}{\partial x} + \frac{\partial s}{\partial v} \frac{\partial v}{\partial x} . (1)

Equation $(1)$ is the origin of gradient accumulation.

In graph language, if a value $x_{v}$ feeds several downstream operations $c$ , then

\frac{\partial s}{\partial x _{v}} = v \to c \sum \frac{\partial s}{\partial x _{c}} \frac{\partial x _{c}}{\partial x _{v}} . (2)

where the sum runs over every operation $c$ that takes $x_{v}$ as an input.

This formula is one of the central equations of backpropagation.
It states that the total sensitivity of the final scalar $s$ to an intermediate quantity $x_{v}$ is obtained by summing the sensitivities transmitted back from every immediate downstream use of $x_{v}$ .

Example

Consider
$u = x^{2}, v = sin x, s = u + v .$
Then
$\frac{\partial s}{\partial x} = \frac{\partial s}{\partial u} \frac{\partial u}{\partial x} + \frac{\partial s}{\partial v} \frac{\partial v}{\partial x} = 1 \cdot 2 x + 1 \cdot cos x = 2 x + cos x .$
The derivative is a sum because the influence of $x$ reaches $s$ through two distinct downstream branches.

The slogan “multiply derivatives along a path” names only half of the rule.
The correct picture is:

multiply local derivatives along each path;
sum the contributions across all paths.

The same rule elsewhere on the site

This sum-over-paths structure is derived in closed form in The big picture, where the loss gradient becomes a sum, over every path from a weight to the loss, of the product of the edge factors along that path. In code, that summation is exactly what forces micrograd to accumulate with self.grad += ... rather than =: when one value feeds several downstream operations, each contributes a term, and overwriting would keep only the last. It is shown there as a concrete bug and its one-character fix in Automatic backpropagation part 2.

Adjoint variables and the backward pass

The backward pass revolves around one quantity attached to every node. Let $s$ denote the scalar output of interest, which in training is the loss. Then for every value produced inside the graph, the output $x_{v}$ of some node $v$ , its adjoint is the derivative of $s$ with respect to that value:

\overset{x}{ˉ}_{v} ≜ \frac{\partial s}{\partial x _{v}} .

In words, $\overset{x}{ˉ}_{v}$ answers a single question: if $x_{v}$ were nudged by a tiny amount, by how much would the final scalar $s$ change? This same quantity is also called the upstream gradient at node $v$ .

Why it is called the upstream gradient

“Upstream” is meant relative to the backward pass, which flows opposite to the forward computation. In the forward graph, values run from the inputs toward $s$ , so everything sitting between $x_{v}$ and $s$ lies downstream of $x_{v}$ . The backward pass reverses this: it starts at $s$ and travels back toward the inputs, and along that reversed current the gradient reaches $x_{v}$ from the side nearer $s$ , that is, from upstream. The two words describe two opposite flows over the same graph: downstream is where the values go, upstream is where the gradient comes from. And because $\overset{x}{ˉ}_{v}$ has flowed all the way back through the region between $x_{v}$ and $s$ , it already sums up the total effect of everything downstream of $x_{v}$ .

Why is this the right quantity to track? Because $\overset{x}{ˉ}_{v}$ already holds everything the rest of the graph has to say about $x_{v}$ . In the sum-over-paths picture, $\partial s / \partial x_{v}$ is the total of the path products running from $x_{v}$ all the way to $s$ , so folding all of those paths into the single number $\overset{x}{ˉ}_{v}$ means that, once it is known, nothing downstream of $x_{v}$ ever has to be looked at again.

Now make this operational. Suppose a node $c$ computes

x_{c} = ϕ_{c} (x_{p_{1}}, \dots, x_{p_{k}}),

with inputs $x_{p_{1}}, \dots, x_{p_{k}}$ . Once the adjoint $\overset{x}{ˉ}_{c}$ of its output is known, the node can hand each input the share of the sensitivity that passes through it. Read the chain rule one factor at a time: nudging $x_{p_{j}}$ moves the output $x_{c}$ by the local derivative $\partial x_{c} / \partial x_{p_{j}}$ , and $s$ responds to a move in $x_{c}$ by $\overset{x}{ˉ}_{c}$ , so the change in $s$ that travels back through this one operation is their product. That product is the contribution $c$ sends to $x_{p_{j}}$ :

\overset{x}{ˉ}_{p_{j}} + = \overset{x}{ˉ}_{c} \frac{\partial x _{c}}{\partial x _{p_{j}}} . (3)

The update accumulates (+=, not =) for a concrete reason: the same input $x_{p_{j}}$ may feed several downstream operations, and each of them sends back a term of this form. Their sum is exactly the accumulation rule, Equation $(2)$ ; writing = would keep only the last contributor and silently discard the rest.

Local backpropagation rule

Equation $(3)$ is the entire backward pass in miniature: the chain rule for a single operation. Run it at every node, pushing each adjoint back onto that node’s inputs and accumulating, and it rebuilds the global gradient $\partial s / \partial x_{v}$ for every quantity in the graph.

The backward pass assembles every adjoint in a single sweep. It seeds the output with

\overset{s}{ˉ} = \frac{\partial s}{\partial s} = 1,

and then visits the nodes in reverse topological order, the forward order run backward. Each node, when visited, applies rule $(3)$ to push contributions onto its own inputs. The ordering is what makes this correct: a node is reached only after every operation it feeds has already been processed, so by the time the algorithm uses a node’s adjoint $\overset{x}{ˉ}_{v}$ , all the terms that make it up have already arrived and $\overset{x}{ˉ}_{v}$ is final. No adjoint is ever read before it is complete.

Core operational picture

The forward pass computes values. The backward pass computes adjoints. Backpropagation is the systematic propagation of adjoints through the graph using local derivative rules.

For learning problems, the scalar $s$ is typically the loss $L$ .
Thus, the backward pass produces quantities such as

\frac{\partial L}{\partial w _{0}}, \frac{\partial L}{\partial w _{1}}, \frac{\partial L}{\partial b},

which are exactly the gradients required by gradient-based optimization.

Bridge to micrograd: the same algorithm in code

Every abstract object in this section has a one-to-one counterpart in the micrograd Value class:

Here (abstract) In micrograd (code)
adjoint $\overset{x}{ˉ}_{v} = \partial s / \partial x_{v}$ v.grad
seed $\overset{s}{ˉ} = 1$ o.grad = 1.0
local backward rule, Eq. $(3)$ the _backward closure, e.g. self.grad += out.grad * other.data
accumulation across branches (Eq. $(2)$ ) self.grad += ..., the fix for reused nodes
reverse topological order for node in reversed(topo): node._backward()

One naming note, to head off a common stumble. Micrograd stores a node’s inputs in _prev, and Karpathy’s traversal happens to name the loop variable child (for child in v._prev). That word is just a local label for the inputs. This note sidesteps the ambiguity entirely by never using “parent” or “child”: it speaks only of a node’s inputs and the operations downstream of it, so no directional convention has to be memorized.

Here (abstract)	In micrograd (code)
adjoint $\overset{x}{ˉ}_{v} = \partial s / \partial x_{v}$	`v.grad`
seed $\overset{s}{ˉ} = 1$	`o.grad = 1.0`
local backward rule, Eq. $(3)$	the `_backward` closure, e.g. `self.grad += out.grad * other.data`
accumulation across branches (Eq. $(2)$ )	`self.grad += ...`, the fix for reused nodes
reverse topological order	`for node in reversed(topo): node._backward()`

Why backpropagation is efficient

Scope of this section

The present section explains why backpropagation is structurally efficient when viewed as graph-based reverse-mode differentiation. The note On the efficiency of BP addresses a different question: how this efficiency compares, at the level of computational cost, with naive gradient estimation methods such as finite differences.

At a purely conceptual level, one may imagine computing a derivative by enumerating every path from a parameter to the loss, multiplying local derivatives along each path, and then summing over all paths.
That is exactly the closed-form sum-over-paths formula. It is correct, but as an algorithm it is hopeless: the number of distinct paths can grow exponentially with depth, so evaluating it directly is infeasible.

Backpropagation avoids this combinatorial explosion through reuse.

The adjoint $\overset{x}{ˉ}_{v} = \partial s / \partial x_{v}$ is a cached summary of all downstream paths from $x_{v}$ to the output $s$ .
Once this quantity has been computed, all downstream complexity has been compressed into a single number, vector, or tensor.
There is no need to recompute that downstream influence separately for every ancestor.

This is a dynamic-programming viewpoint:

the forward pass computes and stores intermediate values;
the backward pass computes and stores intermediate adjoints;
each node is processed once in the forward direction and once in the reverse direction.

Success

Backpropagation is efficient because of how the chain rule is organized: shared subcomputations are computed once and reused, rather than recomputed for every ancestor. The adjoint is the cached summary that makes this reuse possible, collapsing the exponential path sum into a single linear-time sweep.

This explains the structural source of the speedup.
The dedicated note On the efficiency of BP then turns that structural insight into an explicit cost comparison against naive numerical differentiation.

The cheap-gradient principle

The reuse has a striking consequence. Reverse mode computes the gradient with respect to every input at a cost within a small constant factor (in practice two to four times) of evaluating the function once, no matter how many inputs there are. This constant-factor guarantee is a classical result, the Baur-Strassen theorem (1983), and it is the reason training a model with billions of parameters is even thinkable: one backward pass costs about as much as one forward pass, not one forward pass per parameter.

Backpropagation and automatic differentiation

Backpropagation belongs to a wider subject, automatic differentiation. The clearest way to place it is to set it beside the other ways of computing a derivative, the ones it is most often confused with. There are four:

Method	Main idea	Nature of result	Main limitation
Manual differentiation	Derive formulas by hand	Exact symbolic derivative	Impossible to scale to large programs
Symbolic differentiation	Manipulate algebraic expressions symbolically	Exact symbolic derivative	Expression swell and poor fit to real code
Numerical differentiation	Use finite differences	Approximate derivative	Slow and numerically fragile
Automatic differentiation	Differentiate the executed computation by composing local derivatives	Exact up to floating-point arithmetic	Requires differentiable local primitives

What sets automatic differentiation apart is that it neither approximates, as finite differences do, nor manipulates symbols, as a computer-algebra system would. It differentiates the computation exactly as it was executed, operation by operation, which is precisely the work the computational graph was built to support.

Important

Backpropagation is the reverse mode of automatic differentiation. Like symbolic differentiation its result is exact, but it is read off the computation as actually executed, one operation at a time, which is what lets it scale to programs of any size.

Why reverse mode is the right mode for learning

Automatic differentiation comes in two directions, and the choice between them is not a matter of taste. For a map

f : R^{n} \to R^{m},

forward mode carries sensitivities along with the computation, from inputs toward outputs, while reverse mode propagates them the opposite way, from outputs back toward inputs. Their costs are mirror images: forward mode pays once for each input it wants to differentiate, reverse mode once for each output. The first is the right tool when the inputs of interest are few, the second when the outputs are few.

Training a neural network sits about as far toward the reverse-mode extreme as a problem can. The function being differentiated is the loss,

L (θ) \in R,

a single scalar perched at the end of a computation that may depend on billions of parameters $θ$ . To assemble the entire gradient, forward mode would have to be rerun once per parameter, resweeping the whole network to recover each coordinate; reverse mode seeds the lone scalar with $1$ and reads off every parameter derivative in one backward sweep. The asymmetry is not marginal: it is the difference between billions of forward passes and a single backward one.

Important

This is why backpropagation is the natural differentiation strategy for deep learning: the output dimension is usually $1$ , while the parameter dimension is enormous.

This input-output asymmetry is the structural reason reverse mode dominates gradient-based learning.

The price reverse mode pays, and what the gradient is for

The asymmetry is not free. Forward mode carries sensitivities alongside the forward computation and needs almost no extra memory, at the cost of one pass per input direction. Reverse mode recovers the whole gradient in a single backward sweep, but only by storing the intermediate forward values that the local backward rules consume. Time is therefore bought with memory, which is why large-model training relies on the recomputation strategies discussed below.

A separation worth keeping clear: backpropagation produces the gradient, and nothing more. Turning that gradient into a parameter update is the distinct task of gradient descent and its variants, set in context in how a network is trained.

Scalar picture versus tensor reality

Everything so far has been told in scalars, where the logic is at its most transparent: one number in, one number out, one derivative between them. Real systems work on vectors, matrices, and higher-order tensors. The move costs nothing conceptually, and that is a gift of mathematical abstraction: the chain rule and the principle of locality were never statements about numbers, only about how a composed map factors into simpler ones, so they hold word for word whether the object travelling along an edge is a single scalar or a high-order tensor. What the move does cost is care with notation, because the derivative between two tensors is itself an array.

A single concrete step makes the jump painless. Take the smallest genuinely vector-valued operation, a linear layer with two inputs and two neurons, $Y = WX$ :

X = (x_{1} x_{2}), W = (2113), Y = (y_{1} y_{2}) = (2 x_{1} + x_{2} x_{1} + 3 x_{2}) .

There is no single number “the derivative of $Y$ with respect to $X$ ” here. The honest object is the table of all four partial derivatives, the Jacobian, which for a linear layer is just the weight matrix $W$ :

\frac{\partial Y}{\partial X} = (\partial y_{1} / \partial x_{1} \partial y_{2} / \partial x_{1} \partial y_{1} / \partial x_{2} \partial y_{2} / \partial x_{2}) = (2113) .

Now run one backward step. Suppose the rest of the network has already reported how much the loss $s$ cares about each of this layer’s outputs, the upstream gradient written as a row, $\overset{ˉ}{Y} = (\overset{y}{ˉ}_{1}, \overset{y}{ˉ}_{2}) = (1, 2)$ . The gradient the layer must hand back to its inputs is the row-times-matrix product

\overset{ˉ}{X} = \overset{ˉ}{Y} \frac{\partial Y}{\partial X} = (1, 2) (2113) = (= 4 1 \cdot 2 + 2 \cdot 1, = 7 1 \cdot 1 + 2 \cdot 3) = (4, 7) .

Look at one entry. The gradient reaching input $x_{1}$ is $\overset{x}{ˉ}_{1} = 4$ , and it is a sum of two pieces: $1 \cdot 2$ coming back through neuron $y_{1}$ and $2 \cdot 1$ coming back through neuron $y_{2}$ , because $x_{1}$ feeds both neurons. This is the accumulation rule from the scalar case, now carried out automatically by the matrix multiply, one row of $\overset{ˉ}{Y}$ against one column of the Jacobian per input. Everything that follows is exactly this picture, written for tensors of any shape (and for general operations, where the Jacobian varies with $X$ rather than being the constant $W$ of a linear layer, hence the notation $J_{f} (X)$ ).

In general, a local operation takes an input tensor $X$ and produces an output tensor $Y$ :

Y = f (X) .

To connect this with standard multivariable calculus, one may temporarily imagine flattening $X$ into a vector in $R^{n}$ and $Y$ into a vector in $R^{m}$ .
At that point, the derivative of $f$ at the current point is represented by its Jacobian matrix

J_{f} (X) \in R^{m \times n},

whose entries are

(J_{f} (X))_{ij} = \frac{\partial ( Y ) _{i}}{\partial ( X ) _{j}} .

In other words, the Jacobian collects all first-order partial derivatives of the output components with respect to the input components: its row $i$ is the gradient of the single output $(Y)_{i}$ with respect to every input, and its column $j$ holds the sensitivity of every output to the single input $(X)_{j}$ .

The Jacobian is important because it is the linear object that locally approximates the operation:

Δ Y \approx J_{f} (X) Δ X .

Note

This is the tensor-valued analogue of the ordinary derivative of a scalar function: it describes how a small perturbation in the input propagates to a first-order perturbation in the output.

However, reverse-mode differentiation does not usually need the full Jacobian explicitly.

What is given during the backward pass is an upstream gradient

\overset{ˉ}{Y} = \frac{\partial s}{\partial Y},

where $s$ is the final scalar objective, typically the loss.
This is the gradient arriving at the present operation from the side of the graph nearer to $s$ .
What is needed is the corresponding gradient with respect to the operation’s input, the downstream gradient

\overset{ˉ}{X} = \frac{\partial s}{\partial X},

defined exactly like $\overset{ˉ}{Y}$ but for the input tensor rather than the output. Computing $\overset{ˉ}{X}$ from $\overset{ˉ}{Y}$ is the one job a local operation has to do on the backward pass.

Because $s$ depends on $X$ only through $Y$ , the multivariate chain rule builds each input sensitivity by summing the contributions routed through the $m$ output components:

\frac{\partial s}{\partial ( X ) _{j}} = i = 1 \sum m \frac{\partial s}{\partial ( Y ) _{i}} \frac{\partial ( Y ) _{i}}{\partial ( X ) _{j}}, j = 1, \dots, n .

Read the right-hand side off the matrices: $\frac{\partial s}{\partial ( Y ) _{i}}$ is entry $i$ of the upstream gradient, and $\frac{\partial ( Y ) _{i}}{\partial ( X ) _{j}}$ is entry $(i, j)$ of the Jacobian, so the sum over $i$ is precisely the row vector $\overset{ˉ}{Y}$ times column $j$ of $J_{f}$ . Writing both gradients as rows (the convention this note uses), $\overset{ˉ}{Y} \in R^{1 \times m}$ and $\overset{ˉ}{X} \in R^{1 \times n}$ , all $n$ of these equations collapse into a single matrix product:

1 \times n \overset{ˉ}{X} = 1 \times m \overset{ˉ}{Y} m \times n J_{f} (X) . (4)

The shapes settle the rule on their own: a $1 \times m$ row times an $m \times n$ matrix can only produce a $1 \times n$ row, which is exactly the gradient with respect to $X$ . This contraction is the vector-Jacobian product (VJP): the upstream gradient $\overset{ˉ}{Y}$ is fed through the local Jacobian to yield the downstream gradient $\overset{ˉ}{X}$ .

About transpose conventions

Many texts use column-vector gradients instead. In that convention the same rule is written as
$\overset{ˉ}{X} = J_{f} (X)^{⊤} \overset{ˉ}{Y} .$
The mathematics is the same; only the orientation convention changes.

Why autodiff systems remain practical

Modern frameworks do not explicitly materialize gigantic Jacobian tensors for every layer. Each primitive operation implements a local backward rule that maps an upstream gradient to the corresponding input gradients. This preserves the locality principle even in high-dimensional tensor computations.

For example:

matrix multiplication implements a backward rule using transposed matrices rather than a gigantic explicit Jacobian;
convolution layers use structured local backward operators rather than flattening everything into an enormous dense matrix;
elementwise activations act through a diagonal Jacobian, applied as a cheap element-wise multiply without ever building the diagonal matrix (the same $diag (f^{'})$ structure derived coordinate by coordinate in Backpropagation through time).

The scalar graph is therefore the conceptual prototype of the tensor case, not a toy: the locality principle and the chain rule carry over unchanged, and only the object sitting at each node turns from a number into an array.

From mathematical graph to framework graph

Up to this point, the discussion has focused on the mathematical computational graph: the dependency structure required by the chain rule. Modern deep-learning frameworks implement that same idea, but not always in the same architectural way. Two distinctions are especially important.

Dynamic versus staged graphs

In some systems, the graph is built as the program executes. This is the define-by-run style associated with frameworks such as PyTorch. The executed operations themselves reveal the graph, and the backward pass is attached to the concrete computation that actually happened. The micrograd Value class is the smallest possible example of this style: the graph exists only as the _prev links the operations leave behind, and backward() simply walks them in reverse.

In other systems, the computation is first represented in a more explicit intermediate form and then transformed, optimized, or compiled before execution. This staged or compiled style appears in systems such as JAX/XLA and in graph-compilation pipelines more generally.

Info

The mathematical object is the same in both cases: a differentiable computation with a dependency structure. What changes is when and how that structure is materialized, optimized, and executed.

So the phrase computational graph can refer to more than one layer of reality:

the conceptual graph needed by the chain rule;
the internal graph representation used by a framework;
the optimized or compiled graph that is actually scheduled on hardware.

These layers are related, but they are not always identical.

Graph structure versus execution strategy

The clean graph drawn on paper is not necessarily the graph executed literally by the framework.

For example:

several primitive operations may be fused into a single kernel to reduce memory traffic and launch overhead;
the backward pass may recompute some forward activations instead of storing all of them, in order to save memory;
the graph may be rewritten internally into a more hardware-efficient form before execution.

Memory is often the real bottleneck

The conceptual description of backpropagation says: store the intermediate quantities needed for the backward pass. In modern large-scale models, however, storing all activations can be prohibitively expensive. This is why practical systems often use recomputation strategies such as gradient checkpointing: some intermediate values are deliberately discarded during the forward pass and recomputed later during the backward pass, trading extra compute for lower memory usage.

This does not alter the mathematical content of backpropagation. It changes the execution strategy used to realize that mathematics under real hardware constraints.

What can and cannot be backpropagated through

Backpropagation applies whenever the computation can be decomposed into local operations that provide valid derivative rules.

A loose way to put this is “the whole graph must be differentiable,” but the real requirement is narrower and purely local. Differentiability is not needed everywhere: an isolated kink does no harm. And it is not a property of the composed function at all. On the backward pass each operation is queried on its own for a local derivative rule, and nothing ever assembles, or tests, differentiability for the graph as a whole.

Differentiability requirement

Backpropagation requires each local operation to admit a usable local derivative rule, at least almost everywhere or in the generalized sense employed by the framework.

This is why piecewise-differentiable operations such as ReLU or max can still be trained successfully: although they are not differentiable everywhere in the classical sense, they still admit workable local backward rules.

By contrast, operations that sever the computational dependency in a genuinely non-differentiable way, such as hard discrete choices or sampling steps without reparameterization, do not support ordinary gradient flow through the naive graph.

In practice, modern machine learning sometimes works around this limitation through surrogate constructions.

Examples include:

reparameterization tricks, which rewrite stochastic nodes in differentiable form
straight-through estimators, which deliberately use a backward rule that does not coincide with the true derivative of the forward operation.

Warning

Such techniques should not be confused with ordinary backpropagation through a genuinely differentiable graph. They are pragmatic extensions or approximations introduced precisely because the naive graph would otherwise block gradient flow.

The criterion is therefore about the graph, not the label “neural network”: whether a differentiable pathway runs from the scalar objective back to the parameters.

Summary

Seen through the graph, the parts of backpropagation fall into place as a single coherent procedure. The graph supplies, in a single structure, the decomposition of a tangled computation into elementary operations, a place to attach each operation’s local derivative, the reason gradients multiply along paths and add across branches, and the reuse that makes the entire backward sweep cheap. It is, at the same time, the exact bridge from a trained network back to reverse-mode automatic differentiation.

The full procedure is short to state:

express the forward computation as a graph of simple differentiable primitives;
evaluate the graph forward and cache the intermediate values needed by local derivative rules;
initialize the adjoint of the scalar objective to $1$ ;
traverse the graph backward, accumulating adjoints using the local chain-rule update;
read off the resulting derivatives with respect to every parameter.

Important

Backpropagation is no special invention of neural networks. It is reverse-mode automatic differentiation running on a computational graph, and once that is seen, the apparent intricacy of gradient-based learning resolves into three plain ideas working in concert: composition, locality, and the chain rule.

Deep Learning: Zero to Hero

Explorer

Computational graphs

The essential role of computational graphs

From formula to graph

What a computational graph is

The principle of locality

The chain rule on a graph

Branching, merging, and gradient accumulation

Adjoint variables and the backward pass

Why backpropagation is efficient

Backpropagation and automatic differentiation

Why reverse mode is the right mode for learning

Scalar picture versus tensor reality

From mathematical graph to framework graph

Dynamic versus staged graphs

Graph structure versus execution strategy

What can and cannot be backpropagated through

Summary

Graph View

Table of Contents

Backlinks