Read the Overview First
This note is part of the Micrograd series. Before continuing, it is essential to read Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.
Seeding the output gradient
The previous note ended with a computation graph ready for backpropagation. The first manual step is to initialize the gradient of the output node itself.
Since the output is the loss function , the starting gradient is:
If changes by a tiny amount , then itself changes by the same amount . Therefore the derivative of with respect to itself is .
For any scalar function , a derivative can be estimated numerically with the finite-difference quotient:
In the special case of the output node itself, this becomes:
How the numerical checks are read
Each numerical check rebuilds the same scalar graph twice.
L1stores the original scalar output, whileL2stores the scalar output after one quantity has been perturbed by a tiny amounth.Since
Lis aValueobject, the raw scalar number is read fromL.data. The printed quantity(L2 - L1) / his the finite-difference estimate of the corresponding derivative.
def numerical_gradient_check_output():
h = 0.0001
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L1 = L.data
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L2 = L.data + h
print((L2 - L1) / h)
numerical_gradient_check_output()0.9999999999976694
The numerical result is essentially , so the output node can be seeded with gradient 1.0.
L.grad = 1.0
draw_dot(L)Why
draw_dot(L)is available hereThe original material is developed as a Jupyter notebook and then distributed across multiple notes. The function
draw_dotwas defined in the previous note, The Value Object: Micrograd’s core, and is reused here to visualize the updated computation graph.
Backpropagating through L = d * f
The final operation in the graph is:
The goal is to compute the gradients of with respect to the two inputs of this multiplication, and .
First, can be viewed as a function of while is held constant:
Using the finite-difference definition,
and substituting gives:
Therefore:
By symmetry of multiplication:
These two values are stored in d.grad and f.grad.
d.grad = f.data
f.grad = d.data
draw_dot(L)Numerical checks for d and f
The gradients just assigned can be checked numerically. For , the perturbed graph can be built by adding Value(h) when constructing d. Equivalently, the already-created node could be nudged with d.data += h; the explicit graph construction is used below.
def numerical_gradient_check_d():
h = 0.0001
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L1 = L.data
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c + Value(h); d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L2 = L.data
print((L2 - L1) / h)
numerical_gradient_check_d()-1.9999999999953388
The result matches .
def numerical_gradient_check_f():
h = 0.0001
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L1 = L.data
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0 + h, label='f')
L = d * f; L.label='L'
L2 = L.data
print((L2 - L1) / h)
numerical_gradient_check_f()3.9999999999995595
The result matches .
Backpropagating through d = e + c
The next operation to move through is:
The derivative of with respect to is:
Therefore:
By symmetry of addition:
The gradient already stored on d is:
Applying the chain rule gives:
and similarly:
Both gradients can now be written into the graph.
c.grad = -2.0
e.grad = -2.0
draw_dot(L)Numerical check for e
The gradient with respect to e can be checked by perturbing e.data by a tiny amount before recomputing d and then .
def numerical_gradient_check_e():
h = 0.000001
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L1 = L.data
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
e.data += h
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L2 = L.data
print((L2 - L1) / h)
numerical_gradient_check_e()-2.000000000279556
The result matches .
Backpropagating through e = a * b
The final operation to move through is:
The gradient flowing into this multiplication is already known:
To compute , the chain rule gives:
Since :
Therefore:
Similarly:
and therefore:
Numerical checks for a and b
The final two gradients can also be checked with finite differences. Perturbing a gives:
def numerical_gradient_check_a():
h = 0.0001
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L1 = L.data
a = Value(2.0 + h, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L2 = L.data
print((L2 - L1) / h)
numerical_gradient_check_a()6.000000000021544
Perturbing b gives:
def numerical_gradient_check_b():
h = 0.0001
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L1 = L.data
a = Value(2.0, label='a')
b = Value(-3.0 + h, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L2 = L.data
print((L2 - L1) / h)
numerical_gradient_check_b()-4.000000000008441
The numerical checks match the manually derived gradients:
The final leaf gradients can now be written into the graph.
a.grad = 6.0
b.grad = -4.0
draw_dot(L)Manual backpropagation recap
Core idea
Manual backpropagation is the recursive application of the chain rule backward through the computation graph. Each local operation contributes a local derivative, and each node stores the derivative of the final output with respect to that node.