Read the Overview First
This note is part of the Micrograd series. Before continuing, it is essential to read Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.
Implementing the backward function for a whole expression graph
The previous note attached a local _backward function to each operation node. However, those functions still had to be called manually, one node at a time, in the correct order.
This note automates that process in two steps:
- build a topological ordering of the expression graph, so the local
_backwardfunctions can be called in the correct order; - wrap that graph traversal inside a new
backward()method on theValueclass, so gradients can be computed with a single call such aso.backward().
Rebuilding the graph to reset gradients
The previous _backward() calls modified the grad fields in the existing graph.
To start again from a clean state, the same graph is rebuilt. Since the Value constructor initializes grad to 0.0, rebuilding the graph resets all gradients before another backward pass.
The same bias value is kept from the previous neuron example so the forward value and the resulting gradients remain numerically convenient and easy to inspect.
x1 = Value(2.0, label = 'x1')
x2 = Value(0.0, label = 'x2')
w1 = Value(-3.0,label ='w1')
w2 = Value(1.0, label = 'w2')
b = Value(6.8813735870195432, label = 'b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
draw_dot(o)Calling _backward() in topological order
The output node is seeded first:
The goal is to call every local _backward() function in the correct order, from the output node back toward the input nodes.
Why topological order is needed
A node should run its
_backward()function only after its own gradient has been fully computed.For example,
nis used to computeo = tanh(n), son._backward()should run only aftero._backward()has written its contribution inton.grad.A topological ordering gives a safe dependency order for the graph. Reversing that order allows backpropagation to start from the output and move backward toward the leaves.
The helper function build_topo(v) recursively visits the predecessors stored in v._prev. A node is added to topo only after its children have been visited. Starting from the output node o, this produces an ordering that contains all nodes needed for the backward pass.
Because a node is appended after its predecessors, topo is ordered from the leaf nodes toward the output node. Backpropagation needs the opposite direction: it must start from the output and move backward toward the leaves. This is why the loop iterates over reversed(topo).
Iterating over reversed(topo) then calls the local _backward() function of each node from the output side of the graph toward the inputs. This removes the need to manually call each _backward() function in the correct order, which would be error-prone and tedious for larger graphs.
o.grad = 1.0
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(o)
for node in reversed(topo):
node._backward()
draw_dot(o)Value class: sixth version
The topological sort and the backward pass can now be wrapped inside a new method called backward on the Value class.
With this version, calling o.backward() computes the gradients of all nodes in the graph with respect to the output node o.
The method performs the same steps shown above:
- build the topological order starting from
self; - seed the output gradient with
self.grad = 1.0; - visit nodes in reverse topological order and call each stored
_backward()function.
class Value:
def __init__(self, data, _children=(), _op='', label=''):
self.data = data
self.grad = 0.0
self._backward = lambda: None
self._prev = set(_children)
self._op = _op
self.label = label
def __repr__(self):
return f"Value(data={self.data})"
def __add__(self, other):
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad = out.grad * 1.0
other.grad= out.grad * 1.0
out._backward = _backward
return out
def __mul__(self, other):
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad = out.grad * other.data
other.grad = out.grad * self.data
out._backward = _backward
return out
def tanh(self):
x = self.data
t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
out = Value(t, (self,), 'tanh')
def _backward():
self.grad = out.grad * (1 - t**2)
out._backward = _backward
return out
def backward(self):
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
self.grad = 1.0
for node in reversed(topo):
node._backward()Rebuilding the graph with the sixth version
The graph is rebuilt again so all gradients start from zero before calling the new backward() method.
The neuron still has two inputs, x1 and x2; two weights, w1 and w2; and the same bias value as before. The bias value is chosen so that the backpropagation values remain convenient to inspect. Changing it is a useful way to see how the computed values change.
The computation is still:
and:
x1 = Value(2.0, label = 'x1')
x2 = Value(0.0, label = 'x2')
w1 = Value(-3.0,label ='w1')
w2 = Value(1.0, label = 'w2')
b = Value(6.8813735870195432, label = 'b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
draw_dot(o)Calling backward()
The full backward pass can now be triggered with a single call:
o.backward()
draw_dot(o)Fixing a backpropagation bug when one node is used multiple times
A remaining bug
The sixth version automates the traversal order, but it still contains an important bug: gradients are overwritten instead of accumulated.
This appears when the same node contributes to the final output through more than one path.
Bug example 1: a + a
In this example, b = a + a.
The correct derivative is:
Therefore, a.grad should be 2.0, not 1.0.
However, the current implementation uses:
self.grad = out.grad * 1.0This is not correct when there are two paths from a to b. In this graph, both inputs of the addition node point to the same object a, so the two gradient contributions must be added together.
The local derivative is not the problem
The local derivative of addition is still correct: each input contributes with a factor of
1.0. The bug is caused by the assignment operator=.When the first input position of
b = a + acontributes toa.grad, the value is set to1.0. When the second input position contributes, the same assignment setsa.gradto1.0again instead of adding another1.0.The final gradient therefore remains
1.0, even though the correct total contribution is:
a = Value(3.0, label='a')
b = a + a; b.label='b'
b.backward()
draw_dot(b) Why only one
anode is visibleIn
b = a + a, there are not two differentaobjects. The sameValueobjectais used twice as the two inputs of the addition.The graph visualization shows only one
anode because eachValuestores its predecessors in_prev = set(_children). Asetremoves duplicates, so the two references to the same objectacollapse into one visible predecessor.The important point is that the addition operation still receives two input positions, and both positions refer to the same object. During the backward pass, this creates two gradient contributions for that same
a: one from the left input position and one from the right input position.
Bug example 2: shared variables in a larger graph
The same issue appears in a larger expression where a and b are reused.
Here, both variables are used to calculate d, and they are also used to calculate e:
a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b; d.label='d'
e = a + b; e.label='e'
f = d * e; f.label='f'
f.backward()
draw_dot(f)The multivariate chain rule and gradient accumulation
In a computation graph, a variable does not always travel along a single straight path. Often, variables are reused and branch into multiple paths.
In the example above, a and b do not go to only one place. They are used to compute d, and they are also used to compute e.
If a standard assignment (self.grad = ...) is used during backpropagation, the gradient flowing back from the final output f behaves like this:
- the path flowing back through
dcalculates its gradient contribution foraand setsa.grad; - the path flowing back through
ecalculates another gradient contribution foraand overwritesa.grad.
Because = overwrites the previous value, only the gradient from the last processed path is remembered. The information from the other branch is lost.
The gradient must accumulate all paths
When a node has multiple outgoing branches, its total effect on the final output is the sum of the gradient contributions flowing back from each path.
For the graph above, a contributes to f through both d and e, so the total derivative is:
More generally, if a node feeds into multiple later nodes , then:
Why
+=is neededTo translate this mathematical sum into code, the accumulation operator (
+=) must be used instead of the assignment operator (=).By writing
self.grad += ..., the backward pass collects and adds together the incoming contributions from all paths without overwriting them. In the example above, the gradient flowing back fromdis added to the gradient flowing back frome, giving the correct combined total gradient foraandb.
For accumulation to work correctly, each gradient needs a clean starting value. This is why every variable starts with self.grad = 0.0: the gradient begins at 0, and each incoming contribution is safely added to it.
Value class: seventh version
The bug is fixed by changing the local _backward functions so they accumulate gradients instead of overwriting them.
This means replacing assignments such as:
self.grad = ...with accumulations such as:
self.grad += ...This update is applied to the local backward functions for addition, multiplication, and tanh, so gradient contributions from multiple paths are summed instead of lost.
class Value:
def __init__(self, data, _children=(), _op='', label=''):
self.data = data
self.grad = 0.0
self._backward = lambda: None
self._prev = set(_children)
self._op = _op
self.label = label
def __repr__(self):
return f"Value(data={self.data})"
def __add__(self, other):
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad * 1.0
other.grad += out.grad * 1.0
out._backward = _backward
return out
def __mul__(self, other):
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += out.grad * other.data
other.grad += out.grad * self.data
out._backward = _backward
return out
def tanh(self):
x = self.data
t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
out = Value(t, (self,), 'tanh')
def _backward():
self.grad += out.grad * (1 - t**2)
out._backward = _backward
return out
def backward(self):
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
self.grad = 1.0
for node in reversed(topo):
node._backward() Bug example 1 is fixed
The bug is now solved for
b = a + a.The gradient of
ais correctly computed as2.0instead of1.0, because the addition node now accumulates the contributions from both paths fromatob.
a = Value(3.0, label='a')
b = a + a; b.label='b'
b.backward()
draw_dot(b) Bug example 2 is fixed
The bug is also solved for the larger graph.
The gradients of
aandbare now computed correctly because the local_backwardfunctions for addition and multiplication accumulate gradient contributions instead of overwriting them.
a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b; d.label='d'
e = a + b; e.label='e'
f = d * e; f.label='f'
f.backward()
draw_dot(f)