Read the Overview First
This note is part of the Micrograd series. Before continuing, it is essential to read Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.
The previous notes performed manual backpropagation, one local derivative at a time. That process is useful for understanding the chain rule, but it quickly becomes tedious and infeasible as the computation graph grows.
First Step Toward Automatic Backpropagation
This note takes the first step toward ending the suffering of manual backpropagation. It does not yet implement the final
backward()method for the whole graph. Instead, it prepares theValueclass for that method: each operation will attach a small local_backwardfunction to the output node it creates.For now, those local
_backwardfunctions will still be called manually in the correct order. In the next note, the graph traversal will be automated with a topological ordering and then wrapped into a properValue.backward()method, so a single call such aso.backward()will compute the gradients for the whole graph.
Implementing the backward function for each operation
The core idea is to store, inside each Value object, a small function that knows how to propagate gradients through the operation that created that node.
Value class: fifth version
This version adds a new attribute called _backward. The attribute stores the local backward rule for a node: given the gradient of the output node, it knows how to propagate that gradient to the node’s inputs.
At construction time, every Value receives a default _backward function that does nothing. Operation nodes then replace that default with the specific backward rule for the operation that created them.
Why store a backward function on each node?
During backpropagation, each node must know how to pass its gradient to the nodes that produced it.
A multiplication node must propagate gradients using the multiplication rule, an addition node using the addition rule, and a
tanhnode using the derivative oftanh.Storing a small
_backwardfunction inside each output node makes every node carry its own local backward rule. Later, backpropagation can simply visit the nodes and callnode._backward().
class Value:
def __init__(self, data, _children=(), _op='', label=''):
self.data = data
self.grad = 0.0
self._backward = lambda: None
self._prev = set(_children)
self._op = _op
self.label = label
def __repr__(self):
return f"Value(data={self.data})"
def __add__(self, other):
out = Value(self.data + other.data, (self, other), '+')
# Closure: keeps access to self, other, and out from this addition.
def _backward():
self.grad = out.grad * 1.0
other.grad = out.grad * 1.0
out._backward = _backward
return out
def __mul__(self, other):
out = Value(self.data * other.data, (self, other), '*')
# Closure: keeps access to self, other, and out from this multiplication.
def _backward():
self.grad = out.grad * other.data
other.grad = out.grad * self.data
out._backward = _backward
return out
def tanh(self):
x = self.data
t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
out = Value(t, (self,), 'tanh')
# Closure: keeps access to self, out, and t from this tanh operation.
def _backward():
self.grad = out.grad * (1 - t**2)
out._backward = _backward
return out
_backwardis an attribute that stores a functionIn Python, functions are ordinary objects: they can be assigned to variables, stored inside attributes, and called later.
In
__init__,self._backward = lambda: Nonecreates a default callable attribute on everyValueobject. The expressionlambda: Noneis an anonymous function that does nothing when called.Inside
__add__,__mul__, andtanh,def _backward(): ...defines a local function for one specific operation.In methods such as
__add__,__mul__, andtanh, the assignmentout._backward = _backwarddoes not call that function. It stores the local function inside the_backwardattribute of the output nodeout. The function is called only later, when code such asout._backward()is executed.
Closures in this code
A closure is a function that can still use variables from the place where it was created.
For example, when
x1w1 = x1 * w1is executed,__mul__creates the output nodex1w1and defines a local_backwardfunction. That function keeps access to the exactself,other, andoutobjects from that multiplication.Later, when
x1w1._backward()is called, those objects correspond tox1,w1, andx1w1. This is why the stored function can update the correct gradients:x1.gradandw1.grad.
The local rules implemented in the class are the same rules used during manual backpropagation:
- For an addition node, the local derivative with respect to each input is
1.0, so the incoming gradient is copied to both inputs. - For a multiplication node, the local derivative with respect to one input is the value of the other input, so the incoming gradient is multiplied by that opposite input.
- For a
tanhnode, the local derivative is . Since the forward pass already computedt, the backward function can use1 - t**2.
Rebuilding the computation graph
The same neuron graph is rebuilt using the updated Value class.
The graph contains two inputs, x1 and x2; two weights, w1 and w2; and a bias term b. The bias value is chosen so that the resulting backpropagation numbers are convenient and easy to inspect. Changing this value is a useful way to see how the forward value and the resulting gradients change.
The neuron computes:
and then applies tanh to obtain the output:
x1 = Value(2.0, label = 'x1')
x2 = Value(0.0, label = 'x2')
w1 = Value(-3.0,label ='w1')
w2 = Value(1.0, label = 'w2')
b = Value(6.8813735870195432, label = 'b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
draw_dot(o)Calling the local backward functions
Before the full automatic backward() method is implemented, the local _backward() functions can be called manually in reverse topological order.
Why the Parentheses Matter in
node._backward()The expression
node._backward()first looks up the_backwardattribute stored onnode.That attribute contains a function, so the final
()calls that stored function.Without parentheses,
node._backwardonly returns the function stored in the attribute.With parentheses,
node._backward()calls that stored function and performs the gradient update.This looks similar to calling a regular method, but the important detail is that different nodes can store different
_backwardfunctions. A multiplication output stores a multiplication backward function, an addition output stores an addition backward function, and a leaf node stores the default function that does nothing.Since the function is stored directly on the node, no argument has to be passed to
_backward(). The function already knows which nodes to update because it was created as a closure.
The output node is seeded first:
The Value class initializes every gradient to 0.0, so the output gradient must be set to 1.0 before any local backward function is called.
Then each _backward() call propagates the gradient one operation farther backward through the graph:
o._backward()propagates throughtanhintonn._backward()propagates through the addition intox1w1x2w2andbb._backward()is a no-op becausebis a leaf nodex1w1x2w2._backward()propagates through the addition intox1w1andx2w2x2w2._backward()propagates through the multiplication intox2andw2x1w1._backward()propagates through the multiplication intox1andw1
o.grad = 1.0
o._backward()
n._backward()
b._backward()
x1w1x2w2._backward()
x2w2._backward()
x1w1._backward()
draw_dot(o)