Read the Overview First

This note is part of the Micrograd series. Before continuing, it is essential to read Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

Implementing the backward function for a whole expression graph

The previous note attached a local _backward function to each operation node. However, those functions still had to be called manually, one node at a time, in the correct order.

This note automates that process in two steps:

  1. build a topological ordering of the expression graph, so the local _backward functions can be called in the correct order;
  2. wrap that graph traversal inside a new backward() method on the Value class, so gradients can be computed with a single call such as o.backward().

Rebuilding the graph to reset gradients

The previous _backward() calls modified the grad fields in the existing graph.

To start again from a clean state, the same graph is rebuilt. Since the Value constructor initializes grad to 0.0, rebuilding the graph resets all gradients before another backward pass.

The same bias value is kept from the previous neuron example so the forward value and the resulting gradients remain numerically convenient and easy to inspect.

x1 = Value(2.0, label = 'x1')
x2 = Value(0.0, label = 'x2')
w1 = Value(-3.0,label ='w1')
w2 = Value(1.0, label = 'w2')
b = Value(6.8813735870195432, label = 'b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
 
draw_dot(o)

Calling _backward() in topological order

The output node is seeded first:

The goal is to call every local _backward() function in the correct order, from the output node back toward the input nodes.

Why topological order is needed

A node should run its _backward() function only after its own gradient has been fully computed.

For example, n is used to compute o = tanh(n), so n._backward() should run only after o._backward() has written its contribution into n.grad.

A topological ordering gives a safe dependency order for the graph. Reversing that order allows backpropagation to start from the output and move backward toward the leaves.

The helper function build_topo(v) recursively visits the predecessors stored in v._prev. A node is added to topo only after its children have been visited. Starting from the output node o, this produces an ordering that contains all nodes needed for the backward pass.

Because a node is appended after its predecessors, topo is ordered from the leaf nodes toward the output node. Backpropagation needs the opposite direction: it must start from the output and move backward toward the leaves. This is why the loop iterates over reversed(topo).

Iterating over reversed(topo) then calls the local _backward() function of each node from the output side of the graph toward the inputs. This removes the need to manually call each _backward() function in the correct order, which would be error-prone and tedious for larger graphs.

o.grad = 1.0 
 
topo = []
visited = set()
def build_topo(v):
  if v not in visited:
    visited.add(v)
    for child in v._prev:
      build_topo(child)
    topo.append(v)
build_topo(o)
 
 
for node in reversed(topo):
  node._backward()
 
 
draw_dot(o)

Value class: sixth version

The topological sort and the backward pass can now be wrapped inside a new method called backward on the Value class.

With this version, calling o.backward() computes the gradients of all nodes in the graph with respect to the output node o.

The method performs the same steps shown above:

  1. build the topological order starting from self;
  2. seed the output gradient with self.grad = 1.0;
  3. visit nodes in reverse topological order and call each stored _backward() function.
class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._backward = lambda: None 
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
 
    def _backward(): 
      self.grad = out.grad * 1.0 
      other.grad= out.grad * 1.0 
    out._backward = _backward
 
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
 
    def _backward(): 
      self.grad  = out.grad * other.data 
      other.grad = out.grad * self.data  
    out._backward = _backward
 
    return out
  
  def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh') 
 
    def _backward(): 
      self.grad = out.grad * (1 - t**2) 
    out._backward = _backward
 
    return out
  
  
  def backward(self):
 
    topo = []
    visited = set()
    def build_topo(v):
      if v not in visited:
        visited.add(v)
        for child in v._prev:
          build_topo(child)
        topo.append(v)
    build_topo(self) 
    
    self.grad = 1.0
    for node in reversed(topo):
      node._backward()

Rebuilding the graph with the sixth version

The graph is rebuilt again so all gradients start from zero before calling the new backward() method.

The neuron still has two inputs, x1 and x2; two weights, w1 and w2; and the same bias value as before. The bias value is chosen so that the backpropagation values remain convenient to inspect. Changing it is a useful way to see how the computed values change.

The computation is still:

and:

x1 = Value(2.0, label = 'x1')
x2 = Value(0.0, label = 'x2')
w1 = Value(-3.0,label ='w1')
w2 = Value(1.0, label = 'w2')
b = Value(6.8813735870195432, label = 'b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
 
draw_dot(o)

Calling backward()

The full backward pass can now be triggered with a single call:

o.backward()
 
draw_dot(o)

Fixing a backpropagation bug when one node is used multiple times

A remaining bug

The sixth version automates the traversal order, but it still contains an important bug: gradients are overwritten instead of accumulated.

This appears when the same node contributes to the final output through more than one path.

Bug example 1: a + a

In this example, b = a + a.

The correct derivative is:

Therefore, a.grad should be 2.0, not 1.0.

However, the current implementation uses:

self.grad = out.grad * 1.0

This is not correct when there are two paths from a to b. In this graph, both inputs of the addition node point to the same object a, so the two gradient contributions must be added together.

The local derivative is not the problem

The local derivative of addition is still correct: each input contributes with a factor of 1.0. The bug is caused by the assignment operator =.

When the first input position of b = a + a contributes to a.grad, the value is set to 1.0. When the second input position contributes, the same assignment sets a.grad to 1.0 again instead of adding another 1.0.

The final gradient therefore remains 1.0, even though the correct total contribution is:

a = Value(3.0, label='a')
b = a + a; b.label='b'
b.backward()
draw_dot(b) 

Why only one a node is visible

In b = a + a, there are not two different a objects. The same Value object a is used twice as the two inputs of the addition.

The graph visualization shows only one a node because each Value stores its predecessors in _prev = set(_children). A set removes duplicates, so the two references to the same object a collapse into one visible predecessor.

The important point is that the addition operation still receives two input positions, and both positions refer to the same object. During the backward pass, this creates two gradient contributions for that same a: one from the left input position and one from the right input position.

Bug example 2: shared variables in a larger graph

The same issue appears in a larger expression where a and b are reused.

Here, both variables are used to calculate d, and they are also used to calculate e:

a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b; d.label='d'
e = a + b; e.label='e'
f = d * e; f.label='f'
 
f.backward()
 
draw_dot(f)

The multivariate chain rule and gradient accumulation

In a computation graph, a variable does not always travel along a single straight path. Often, variables are reused and branch into multiple paths.

In the example above, a and b do not go to only one place. They are used to compute d, and they are also used to compute e.

If a standard assignment (self.grad = ...) is used during backpropagation, the gradient flowing back from the final output f behaves like this:

  1. the path flowing back through d calculates its gradient contribution for a and sets a.grad;
  2. the path flowing back through e calculates another gradient contribution for a and overwrites a.grad.

Because = overwrites the previous value, only the gradient from the last processed path is remembered. The information from the other branch is lost.

The gradient must accumulate all paths

When a node has multiple outgoing branches, its total effect on the final output is the sum of the gradient contributions flowing back from each path.

For the graph above, a contributes to f through both d and e, so the total derivative is:

More generally, if a node feeds into multiple later nodes , then:

Why += is needed

To translate this mathematical sum into code, the accumulation operator (+=) must be used instead of the assignment operator (=).

By writing self.grad += ..., the backward pass collects and adds together the incoming contributions from all paths without overwriting them. In the example above, the gradient flowing back from d is added to the gradient flowing back from e, giving the correct combined total gradient for a and b.

For accumulation to work correctly, each gradient needs a clean starting value. This is why every variable starts with self.grad = 0.0: the gradient begins at 0, and each incoming contribution is safely added to it.

Value class: seventh version

The bug is fixed by changing the local _backward functions so they accumulate gradients instead of overwriting them.

This means replacing assignments such as:

self.grad = ...

with accumulations such as:

self.grad += ...

This update is applied to the local backward functions for addition, multiplication, and tanh, so gradient contributions from multiple paths are summed instead of lost.

class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._backward = lambda: None 
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
 
    def _backward(): 
      self.grad  += out.grad * 1.0
      other.grad += out.grad * 1.0
    out._backward = _backward
 
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
 
    def _backward(): 
      self.grad  += out.grad * other.data
      other.grad += out.grad * self.data
    out._backward = _backward
 
    return out
  
  def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh') 
 
    def _backward(): 
      self.grad += out.grad * (1 - t**2)
    out._backward = _backward
 
    return out
  
  
  def backward(self):
 
    topo = []
    visited = set()
    def build_topo(v):
      if v not in visited:
        visited.add(v)
        for child in v._prev:
          build_topo(child)
        topo.append(v)
    build_topo(self) 
    
    self.grad = 1.0 
    for node in reversed(topo): 
      node._backward() 

Bug example 1 is fixed

The bug is now solved for b = a + a.

The gradient of a is correctly computed as 2.0 instead of 1.0, because the addition node now accumulates the contributions from both paths from a to b.

a = Value(3.0, label='a')
b = a + a; b.label='b'
b.backward()
draw_dot(b) 

Bug example 2 is fixed

The bug is also solved for the larger graph.

The gradients of a and b are now computed correctly because the local _backward functions for addition and multiplication accumulate gradient contributions instead of overwriting them.

a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b; d.label='d'
e = a + b; e.label='e'
f = d * e; f.label='f'
 
f.backward()
 
draw_dot(f)