Read the Overview First

This note is part of the Micrograd series. Before continuing, it is essential to read Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

Seeding the output gradient

The previous note ended with a computation graph ready for backpropagation. The first manual step is to initialize the gradient of the output node itself.

Since the output is the loss function , the starting gradient is:

If changes by a tiny amount , then itself changes by the same amount . Therefore the derivative of with respect to itself is .

For any scalar function , a derivative can be estimated numerically with the finite-difference quotient:

In the special case of the output node itself, this becomes:

def numerical_gradient_check_output():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data + h
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_output()
0.9999999999976694

The numerical result is essentially , so the output node can be seeded with gradient 1.0.

L.grad = 1.0
draw_dot(L)

Backpropagating through L = d * f

The final operation in the graph is:

The goal is to compute the gradients of with respect to the two inputs of this multiplication, and .

First, can be viewed as a function of while is held constant:

Using the finite-difference definition,

and substituting gives:

Therefore:

By symmetry of multiplication:

These two values are stored in d.grad and f.grad.

d.grad = f.data
f.grad = d.data
draw_dot(L)

Numerical checks for d and f

The gradients just assigned can be checked numerically. For , the perturbed graph can be built by adding Value(h) when constructing d. Equivalently, the already-created node could be nudged with d.data += h; the explicit graph construction is used below.

def numerical_gradient_check_d():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c + Value(h); d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_d()
-1.9999999999953388

The result matches .

def numerical_gradient_check_f():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0 + h, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_f()
3.9999999999995595

The result matches .

Backpropagating through d = e + c

The next operation to move through is:

The derivative of with respect to is:

Therefore:

By symmetry of addition:

The gradient already stored on d is:

Applying the chain rule gives:

and similarly:

Both gradients can now be written into the graph.

c.grad = -2.0
e.grad = -2.0
draw_dot(L)

Numerical check for e

The gradient with respect to e can be checked by perturbing e.data by a tiny amount before recomputing d and then .

def numerical_gradient_check_e():
  h = 0.000001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  e.data += h
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_e()
-2.000000000279556

The result matches .

Backpropagating through e = a * b

The final operation to move through is:

The gradient flowing into this multiplication is already known:

To compute , the chain rule gives:

Since :

Therefore:

Similarly:

and therefore:

Numerical checks for a and b

The final two gradients can also be checked with finite differences. Perturbing a gives:

def numerical_gradient_check_a():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0 + h, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_a()
6.000000000021544

Perturbing b gives:

def numerical_gradient_check_b():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0 + h, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_b()
-4.000000000008441

The numerical checks match the manually derived gradients:

The final leaf gradients can now be written into the graph.

a.grad = 6.0
b.grad = -4.0
draw_dot(L)

Manual backpropagation recap

Core idea

Manual backpropagation is the recursive application of the chain rule backward through the computation graph. Each local operation contributes a local derivative, and each node stores the derivative of the final output with respect to that node.