Manual Backpropagation example

Read the Overview First

This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

Seeding the output gradient

The previous note ended with a computation graph ready for backpropagation. The first manual step is to initialize the gradient of the output node itself.

Since the output is the loss function $L$, the starting gradient is:

\frac{\partial L}{\partial L} = 1

If $L$ changes by a tiny amount $h$, then $L$ itself changes by the same amount $h$. Therefore the derivative of $L$ with respect to itself is $1$.

For any scalar function $g$ , a derivative can be estimated numerically with the finite-difference quotient:

\frac{g ( x + h ) - g ( x )}{h}

In the special case of the output node itself, $g$ is the identity and the quotient becomes:

\frac{( L + h ) - L}{h} = \frac{h}{h} = 1
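As a rough illustration, the quotient can be wrapped in a small helper. The name `finite_difference` and the default step size are assumptions made for this sketch; they do not appear in the original notebook. The value $-8.0$ is what the graph in this note evaluates to, $(2.0 \times -3.0 + 10.0) \times -2.0$.

```python
def finite_difference(g, x, h=0.0001):
    """Forward-difference estimate of dg/dx at x (hypothetical helper)."""
    return (g(x + h) - g(x)) / h


# The output node maps L to itself, so g is the identity function here.
L_value = -8.0
estimate = finite_difference(lambda L: L, L_value)
print(estimate)  # close to 1, up to floating-point rounding
```

The estimate is not exactly 1 only because of floating-point rounding in the subtraction.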

How the numerical checks are read

Each numerical check rebuilds the same scalar graph twice. `L1` stores the original scalar output, while `L2` stores the scalar output after one quantity has been perturbed by a tiny amount `h`.

Since `L` is a `Value` object, the raw scalar number is read from `L.data`. The printed quantity `(L2 - L1) / h` is the finite-difference estimate of the corresponding derivative.

def numerical_gradient_check_output():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data + h  # nudge the output itself by h
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_output()

0.9999999999976694

The numerical result is essentially $1$, so the output node can be seeded with gradient 1.0.

L.grad = 1.0
draw_dot(L)

Why draw_dot(L) is available here

The original material is developed as a Jupyter notebook and then distributed across multiple notes. The function draw_dot was defined in the previous note, The Value Object: Micrograd’s core, and is reused here to visualize the updated computation graph.

Backpropagating through `L = d * f`

The final operation in the graph is:

L = d \times f

The goal is to compute the gradients of $L$ with respect to the two inputs of this multiplication, $d$ and $f$.

First, $L$ can be viewed as a function of $d$ while $f$ is held constant:

L (d) = d \times f

Using the finite-difference quotient, which is exact here for any nonzero $h$ because $L$ is linear in $d$,

\frac{\partial L}{\partial d} = \frac{L ( d + h ) - L ( d )}{h}

and substituting $L = d \times f$ gives:

\frac{L ( d + h ) - L ( d )}{h} = \frac{( d + h ) f - df}{h} = \frac{df + h f - df}{h} = \frac{h f}{h} = f

Therefore:

\frac{\partial L}{\partial d} = f = - 2.0

By symmetry of multiplication:

\frac{\partial L}{\partial f} = d = 4.0

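These two values are stored in d.grad and f.grad. Strictly, each local derivative is multiplied by the upstream gradient L.grad, which is 1.0 here, so the product is left implicit.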

d.grad = f.data
f.grad = d.data
draw_dot(L)
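The assignments above can be reproduced outside the notebook with a small stand-in for `Value`. The class below is an assumption: it models only the fields and operators this note uses (`data`, `grad`, `label`, `+`, `*`) and is not the full class from the previous note. The upstream factor `L.grad` is written out explicitly to show where it enters.

```python
class Value:
    """Minimal stand-in for the Value class from the previous note (assumed interface)."""

    def __init__(self, data, label=''):
        self.data = data
        self.grad = 0.0
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)


a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

L.grad = 1.0               # seed: dL/dL
d.grad = f.data * L.grad   # local derivative f, scaled by the upstream gradient
f.grad = d.data * L.grad   # local derivative d, scaled by the upstream gradient

print(d.grad, f.grad)  # -2.0 4.0
```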

Numerical checks for `d` and `f`

The gradients just assigned can be checked numerically. For $d$, the perturbed graph can be built by adding Value(h) when constructing d. Equivalently, the node could be nudged in place with d.data += h, provided this happens before L is computed from it; the explicit graph construction is used below.

def numerical_gradient_check_d():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c + Value(h); d.label='d'  # nudge d by h
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_d()

-1.9999999999953388

The result matches $\frac{\partial L}{\partial d} = -2.0$.
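The in-place alternative, `d.data += h`, can be sketched as follows. The helper `forward` and its `nudge_d` parameter are hypothetical names introduced for this example, and the `Value` class is a minimal stand-in rather than the class from the previous note. The nudge is applied before `L` is computed; nudging `d.data` afterwards would leave `L.data` unchanged, because `L.data` has already been evaluated.

```python
class Value:
    """Minimal stand-in for the Value class from the previous note (assumed interface)."""

    def __init__(self, data, label=''):
        self.data = data
        self.grad = 0.0
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)


def forward(nudge_d=0.0):
    """Build the graph once and return L.data, optionally nudging d in place."""
    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a * b; e.label = 'e'
    d = e + c; d.label = 'd'
    d.data += nudge_d          # in-place nudge, applied before L is computed
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    return L.data


h = 0.0001
estimate = (forward(nudge_d=h) - forward()) / h
print(estimate)  # close to -2.0
```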

def numerical_gradient_check_f():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0 + h, label='f')  # nudge f by h
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_f()

3.9999999999995595

The result matches $\frac{\partial L}{\partial f} = 4.0$.

Backpropagating through `d = e + c`

The next operation to move through is:

d = e + c

The derivative of $d$ with respect to $c$ is:

\frac{\partial d}{\partial c} = \frac{d ( c + h ) - d ( c )}{h} = \frac{( c + h + e ) - ( c + e )}{h} = \frac{h}{h} = 1.0

Therefore:

\frac{\partial d}{\partial c} = 1.0

By symmetry of addition:

\frac{\partial d}{\partial e} = 1.0

The gradient already stored on d is:

\texttt{d.grad} = \frac{\partial L}{\partial d} = - 2.0

Applying the chain rule gives:

\frac{\partial L}{\partial c} = \frac{\partial L}{\partial d} \frac{\partial d}{\partial c} = (- 2.0) (1.0) = - 2.0

and similarly:

\frac{\partial L}{\partial e} = \frac{\partial L}{\partial d} \frac{\partial d}{\partial e} = (- 2.0) (1.0) = - 2.0

Both gradients can now be written into the graph.

c.grad = -2.0
e.grad = -2.0
draw_dot(L)
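The chain-rule step through the addition can be written out with plain floats. In this sketch the helper `forward` and the variable names are assumptions, and the graph is evaluated without `Value` objects. The note itself only checks `e` numerically; the finite-difference check for `c` is added here for illustration.

```python
def forward(a, b, c, f):
    """Plain-float version of the graph in this note: L = (a * b + c) * f."""
    e = a * b
    d = e + c
    return d * f


dL_dd = -2.0   # gradient already stored on d
dd_dc = 1.0    # local derivative of d = e + c with respect to c
dd_de = 1.0    # local derivative of d = e + c with respect to e

# Chain rule: upstream gradient times local derivative.
dL_dc = dL_dd * dd_dc
dL_de = dL_dd * dd_de
print(dL_dc, dL_de)  # -2.0 -2.0

# Finite-difference check for c.
h = 0.0001
numeric_dL_dc = (forward(2.0, -3.0, 10.0 + h, -2.0) - forward(2.0, -3.0, 10.0, -2.0)) / h
print(numeric_dL_dc)  # close to -2.0
```

An addition node passes the upstream gradient to both of its inputs unchanged, which is why `c` and `e` receive the same value as `d`.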

Numerical check for `e`

The gradient with respect to e can be checked by perturbing e.data by a tiny amount before recomputing d and then $L$. This check uses a smaller step, h = 0.000001, than the others.

def numerical_gradient_check_e():
  h = 0.000001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  e.data += h  # nudge e before d and L are computed
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_e()

-2.000000000279556

The result matches $\frac{\partial L}{\partial e} = -2.0$.

Backpropagating through `e = a * b`

The final operation to move through is:

e = a \times b

The gradient flowing into this multiplication is already known:

\frac{\partial L}{\partial e} = - 2.0

To compute $\frac{\partial L}{\partial a}$, the chain rule gives:

\frac{\partial L}{\partial a} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial a}

Since $e = ab$:

\frac{\partial e}{\partial a} = \frac{( a + h ) b - ab}{h} = \frac{ab + hb - ab}{h} = \frac{hb}{h} = b = - 3.0

Therefore:

\frac{\partial L}{\partial a} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial a} = (- 2.0) (- 3.0) = 6.0

Similarly:

\frac{\partial e}{\partial b} = a = 2.0

and therefore:

\frac{\partial L}{\partial b} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial b} = (- 2.0) (2.0) = - 4.0
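Both multiplications in the graph follow the same rule: each input receives the other input's value, scaled by the upstream gradient. A minimal sketch, where `mul_backward` is a hypothetical helper name rather than anything from the notebook:

```python
def mul_backward(x, y, upstream):
    """Gradients of out = x * y with respect to x and y, given dL/d(out)."""
    return (y * upstream, x * upstream)


# L = d * f with upstream gradient dL/dL = 1.0
print(mul_backward(4.0, -2.0, 1.0))    # (-2.0, 4.0)

# e = a * b with upstream gradient dL/de = -2.0
print(mul_backward(2.0, -3.0, -2.0))   # (6.0, -4.0)
```

The only difference between the two calls is the upstream gradient: 1.0 at the output node, and $-2.0$ once the gradient has passed through `d` and `e`.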

Numerical checks for `a` and `b`

The final two gradients can also be checked with finite differences. Perturbing a gives:

def numerical_gradient_check_a():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0 + h, label='a')  # nudge a by h
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_a()

6.000000000021544

Perturbing b gives:

def numerical_gradient_check_b():
  h = 0.0001
 
  a = Value(2.0, label='a')
  b = Value(-3.0, label='b')
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L1 = L.data
 
  a = Value(2.0, label='a')
  b = Value(-3.0 + h, label='b')  # nudge b by h
  c = Value(10.0, label='c')
  e = a*b; e.label='e'
  d = e + c; d.label='d'
  f = Value(-2.0, label='f')
  L = d * f; L.label='L'
  L2 = L.data
 
  print((L2 - L1) / h)
 
 
numerical_gradient_check_b()

-4.000000000008441

The numerical checks match the manually derived gradients:

\frac{\partial L}{\partial a} = 6.0, \quad \frac{\partial L}{\partial b} = - 4.0

The final leaf gradients can now be written into the graph.

a.grad = 6.0
b.grad = -4.0
draw_dot(L)

Manual backpropagation recap

Core idea

Manual backpropagation is the recursive application of the chain rule backward through the computation graph. Each local operation contributes a local derivative, and each node stores the derivative of the final output $L$ with respect to that node.
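The whole pass can be condensed into one plain-float sketch that walks the graph backward and then compares each leaf gradient against a finite-difference estimate. The helper `forward` and the dictionaries `inputs`, `grads`, and `numeric` are assumptions made for this example; the notebook itself works with `Value` objects and one check function per node.

```python
def forward(a, b, c, f):
    """Plain-float version of the graph in this note: L = (a * b + c) * f."""
    return (a * b + c) * f


inputs = {'a': 2.0, 'b': -3.0, 'c': 10.0, 'f': -2.0}
a, b, c, f = inputs['a'], inputs['b'], inputs['c'], inputs['f']

# Forward pass for the intermediate nodes.
e = a * b
d = e + c

# Backward pass, from the output toward the leaves.
dL_dL = 1.0                 # seed
dL_dd = f * dL_dL           # through L = d * f
grads = {'f': d * dL_dL}
grads['c'] = 1.0 * dL_dd    # through d = e + c
dL_de = 1.0 * dL_dd
grads['a'] = b * dL_de      # through e = a * b
grads['b'] = a * dL_de

# Finite-difference estimate for every leaf.
h = 0.0001
numeric = {}
for name in inputs:
    nudged = dict(inputs)
    nudged[name] += h
    numeric[name] = (forward(**nudged) - forward(**inputs)) / h
    print(name, grads[name], numeric[name])
```

Each printed line should show the manual gradient next to a numerical estimate that agrees with it up to floating-point rounding.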
