Read the Overview First

This note is part of the Micrograd series. Before continuing, it is essential to read Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

The previous notes performed manual backpropagation, one local derivative at a time. That process is useful for understanding the chain rule, but it quickly becomes tedious and infeasible as the computation graph grows.

First Step Toward Automatic Backpropagation

This note takes the first step toward ending the suffering of manual backpropagation. It does not yet implement the final backward() method for the whole graph. Instead, it prepares the Value class for that method: each operation will attach a small local _backward function to the output node it creates.

For now, those local _backward functions will still be called manually in the correct order. In the next note, the graph traversal will be automated with a topological ordering and then wrapped into a proper Value.backward() method, so a single call such as o.backward() will compute the gradients for the whole graph.

Implementing the backward function for each operation

The core idea is to store, inside each Value object, a small function that knows how to propagate gradients through the operation that created that node.

Value class: fifth version

This version adds a new attribute called _backward. The attribute stores the local backward rule for a node: given the gradient of the output node, it knows how to propagate that gradient to the node’s inputs.

At construction time, every Value receives a default _backward function that does nothing. Operation nodes then replace that default with the specific backward rule for the operation that created them.

Why store a backward function on each node?

During backpropagation, each node must know how to pass its gradient to the nodes that produced it.

A multiplication node must propagate gradients using the multiplication rule, an addition node using the addition rule, and a tanh node using the derivative of tanh.

Storing a small _backward function inside each output node makes every node carry its own local backward rule. Later, backpropagation can simply visit the nodes and call node._backward().

class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._backward = lambda: None
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
 
    # Closure: keeps access to self, other, and out from this addition.
    def _backward():
      self.grad = out.grad * 1.0
      other.grad = out.grad * 1.0
    out._backward = _backward
 
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
 
    # Closure: keeps access to self, other, and out from this multiplication.
    def _backward():
      self.grad = out.grad * other.data
      other.grad = out.grad * self.data
    out._backward = _backward
 
    return out
  
  def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh') 
 
    # Closure: keeps access to self, out, and t from this tanh operation.
    def _backward():
      self.grad = out.grad * (1 - t**2)
    out._backward = _backward
 
    return out

The local rules implemented in the class are the same rules used during manual backpropagation:

  • For an addition node, the local derivative with respect to each input is 1.0, so the incoming gradient is copied to both inputs.
  • For a multiplication node, the local derivative with respect to one input is the value of the other input, so the incoming gradient is multiplied by that opposite input.
  • For a tanh node, the local derivative is . Since the forward pass already computed t, the backward function can use 1 - t**2.

Rebuilding the computation graph

The same neuron graph is rebuilt using the updated Value class.

The graph contains two inputs, x1 and x2; two weights, w1 and w2; and a bias term b. The bias value is chosen so that the resulting backpropagation numbers are convenient and easy to inspect. Changing this value is a useful way to see how the forward value and the resulting gradients change.

The neuron computes:

and then applies tanh to obtain the output:

x1 = Value(2.0, label = 'x1')
x2 = Value(0.0, label = 'x2')
w1 = Value(-3.0,label ='w1')
w2 = Value(1.0, label = 'w2')
b = Value(6.8813735870195432, label = 'b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
 
draw_dot(o)

Calling the local backward functions

Before the full automatic backward() method is implemented, the local _backward() functions can be called manually in reverse topological order.

The output node is seeded first:

The Value class initializes every gradient to 0.0, so the output gradient must be set to 1.0 before any local backward function is called.

Then each _backward() call propagates the gradient one operation farther backward through the graph:

  • o._backward() propagates through tanh into n
  • n._backward() propagates through the addition into x1w1x2w2 and b
  • b._backward() is a no-op because b is a leaf node
  • x1w1x2w2._backward() propagates through the addition into x1w1 and x2w2
  • x2w2._backward() propagates through the multiplication into x2 and w2
  • x1w1._backward() propagates through the multiplication into x1 and w1
o.grad = 1.0
o._backward()
n._backward()
b._backward()
x1w1x2w2._backward()
x2w2._backward()
x1w1._backward()
draw_dot(o)