Automatic backpropagation part 1

Read the Overview First

This note is part of the Micrograd series. Before continuing, it is essential to read Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

The previous notes performed manual backpropagation, one local derivative at a time. That process is useful for understanding the chain rule, but it quickly becomes tedious and infeasible as the computation graph grows.

First Step Toward Automatic Backpropagation

This note takes the first step toward ending the suffering of manual backpropagation. It does not yet implement the final backward() method for the whole graph. Instead, it prepares the Value class for that method: each operation will attach a small local _backward function to the output node it creates.

For now, those local _backward functions will still be called manually in the correct order. In the next note, the graph traversal will be automated with a topological ordering and then wrapped into a proper Value.backward() method, so a single call such as o.backward() will compute the gradients for the whole graph.

Implementing the backward function for each operation

The core idea is to store, inside each Value object, a small function that knows how to propagate gradients through the operation that created that node.

Value class: fifth version

This version adds a new attribute called _backward. The attribute stores the local backward rule for a node: given the gradient of the output node, it knows how to propagate that gradient to the node’s inputs.

At construction time, every Value receives a default _backward function that does nothing. Operation nodes then replace that default with the specific backward rule for the operation that created them.

Why store a backward function on each node?

During backpropagation, each node must know how to pass its gradient to the nodes that produced it.

A multiplication node must propagate gradients using the multiplication rule, an addition node using the addition rule, and a tanh node using the derivative of tanh.

Storing a small _backward function inside each output node makes every node carry its own local backward rule. Later, backpropagation can simply visit the nodes and call node._backward().

class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._backward = lambda: None
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
 
    # Closure: keeps access to self, other, and out from this addition.
    def _backward():
      self.grad = out.grad * 1.0
      other.grad = out.grad * 1.0
    out._backward = _backward
 
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
 
    # Closure: keeps access to self, other, and out from this multiplication.
    def _backward():
      self.grad = out.grad * other.data
      other.grad = out.grad * self.data
    out._backward = _backward
 
    return out
  
  def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh') 
 
    # Closure: keeps access to self, out, and t from this tanh operation.
    def _backward():
      self.grad = out.grad * (1 - t**2)
    out._backward = _backward
 
    return out

_backward is an attribute that stores a function

In Python, functions are ordinary objects: they can be assigned to variables, stored inside attributes, and called later.

In __init__, self._backward = lambda: None creates a default callable attribute on every Value object. The expression lambda: None is an anonymous function that does nothing when called.

Inside __add__, __mul__, and tanh, def _backward(): ... defines a local function for one specific operation.

In methods such as __add__, __mul__, and tanh, the assignment out._backward = _backward does not call that function. It stores the local function inside the _backward attribute of the output node out. The function is called only later, when code such as out._backward() is executed.

Closures in this code

A closure is a function that can still use variables from the place where it was created.

For example, when x1w1 = x1 * w1 is executed, __mul__ creates the output node x1w1 and defines a local _backward function. That function keeps access to the exact self, other, and out objects from that multiplication.

Later, when x1w1._backward() is called, those objects correspond to x1, w1, and x1w1. This is why the stored function can update the correct gradients: x1.grad and w1.grad.

The local rules implemented in the class are the same rules used during manual backpropagation:

For an addition node, the local derivative with respect to each input is 1.0, so the incoming gradient is copied to both inputs.
For a multiplication node, the local derivative with respect to one input is the value of the other input, so the incoming gradient is multiplied by that opposite input.
For a tanh node, the local derivative is $1 - tanh (x)^{2}$ . Since the forward pass already computed t, the backward function can use 1 - t**2.

Rebuilding the computation graph

The same neuron graph is rebuilt using the updated Value class.

The graph contains two inputs, x1 and x2; two weights, w1 and w2; and a bias term b. The bias value is chosen so that the resulting backpropagation numbers are convenient and easy to inspect. Changing this value is a useful way to see how the forward value and the resulting gradients change.

The neuron computes:

n = x_{1} w_{1} + x_{2} w_{2} + b

and then applies tanh to obtain the output:

o = tanh (n)

x1 = Value(2.0, label = 'x1')
x2 = Value(0.0, label = 'x2')
w1 = Value(-3.0,label ='w1')
w2 = Value(1.0, label = 'w2')
b = Value(6.8813735870195432, label = 'b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
 
draw_dot(o)

Calling the local backward functions

Before the full automatic backward() method is implemented, the local _backward() functions can be called manually in reverse topological order.

Why the Parentheses Matter in node._backward()

The expression node._backward() first looks up the _backward attribute stored on node.

That attribute contains a function, so the final () calls that stored function.

Without parentheses, node._backward only returns the function stored in the attribute.

With parentheses, node._backward() calls that stored function and performs the gradient update.

This looks similar to calling a regular method, but the important detail is that different nodes can store different _backward functions. A multiplication output stores a multiplication backward function, an addition output stores an addition backward function, and a leaf node stores the default function that does nothing.

Since the function is stored directly on the node, no argument has to be passed to _backward(). The function already knows which nodes to update because it was created as a closure.

The output node is seeded first:

\frac{\partial o}{\partial o} = 1

The Value class initializes every gradient to 0.0, so the output gradient must be set to 1.0 before any local backward function is called.

Then each _backward() call propagates the gradient one operation farther backward through the graph:

o._backward() propagates through tanh into n
n._backward() propagates through the addition into x1w1x2w2 and b
b._backward() is a no-op because b is a leaf node
x1w1x2w2._backward() propagates through the addition into x1w1 and x2w2
x2w2._backward() propagates through the multiplication into x2 and w2
x1w1._backward() propagates through the multiplication into x1 and w1

o.grad = 1.0
o._backward()
n._backward()
b._backward()
x1w1x2w2._backward()
x2w2._backward()
x1w1._backward()
draw_dot(o)

Deep Learning: Zero to Hero

Explorer

Automatic backpropagation part 1

Implementing the backward function for each operation

Value class: fifth version

Rebuilding the computation graph

Calling the local backward functions

Graph View

Table of Contents