Manual backpropagation through a neuron

Read the Overview First

This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

The `tanh` activation

The next example performs manual backpropagation through a tiny neuron. The neuron uses the hyperbolic tangent activation function, $\tanh$.

The function can be written in terms of exponentials as:

$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$
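As a quick sanity check, the exponential form can be compared against the standard library's `math.tanh`. This is a minimal sketch; the helper name `tanh_exp` and the sample points are assumptions made for illustration, not part of the note's code.

```python
import math

def tanh_exp(x: float) -> float:
    """tanh computed from the exponential form (e^{2x} - 1) / (e^{2x} + 1)."""
    e2x = math.exp(2 * x)
    return (e2x - 1) / (e2x + 1)

# Compare against the standard-library implementation at a few sample points.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
max_diff = max(abs(tanh_exp(x) - math.tanh(x)) for x in samples)
print(max_diff)  # expected to be at floating-point rounding level
```

One caveat of this form: `math.exp(2 * x)` raises `OverflowError` once `2 * x` exceeds roughly 709, whereas `math.tanh` simply saturates at 1.0. For the small inputs used in this note that limit is never reached.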

The shape of $\tanh$ can be visualized directly:

plt.plot(np.arange(-5, 5, 0.2), np.tanh(np.arange(-5, 5, 0.2))); plt.grid();

Adding `tanh` to `Value`

This fourth version of the `Value` class receives a new method, `tanh()`, which computes the hyperbolic tangent of the scalar stored in `self.data`.

Since $\tanh$ is a unary operation, the output node has only one predecessor: the current object itself, `self`.

import math  # needed by Value.tanh below

class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    return out
 
  def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh')
    return out
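The following sketch shows how the new method behaves on a single node. To keep it self-contained it uses a trimmed-down stand-in for `Value` that keeps only the constructor and `tanh()`; the variable names `a` and `out` and the input value 0.5 are arbitrary choices for illustration.

```python
import math

class Value:
    """Trimmed-down stand-in for the note's Value class (only what tanh needs)."""

    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
        return Value(t, (self,), 'tanh')

a = Value(0.5, label='a')
out = a.tanh()

print(out.data)           # value of tanh(0.5)
print(out._op)            # the operation tag recorded on the output node
print(out._prev == {a})   # whether a is the only predecessor
```

The output node records `a` as its single predecessor, which is what lets a later backward pass route the gradient from `out` back to `a`.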

Building the neuron

The neuron has two inputs:

$$x_{1} = 2.0, \quad x_{2} = 0.0$$

and two weights:

$$w_{1} = -3.0, \quad w_{2} = 1.0$$

The bias is set to:

$$b = 6.8813735870195432$$

Info

This particular bias value is chosen so that the local derivative of $\tanh$ at the neuron's pre-activation comes out to 0.5, which makes every gradient in the manual backpropagation a convenient round number. Changing it is a useful way to see how the gradients change.

The neuron first computes the weighted sum:

$$n = x_{1} w_{1} + x_{2} w_{2} + b$$

and then applies the activation:

$$o = \tanh(n)$$

x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')
 
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
 
o = n.tanh(); o.label = 'o'
 
draw_dot(o)
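The same forward pass can be reproduced with plain floats, which makes the intermediate numbers easy to inspect without the graph. This is only a sketch: the variables below are ordinary Python floats standing in for the `Value` objects, and `math.tanh` stands in for `Value.tanh()`.

```python
import math

x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432

n = x1 * w1 + x2 * w2 + b   # weighted sum: -6.0 + 0.0 + b
o = math.tanh(n)            # activation

print(n)          # close to 0.8814
print(o)          # close to 0.7071
print(1 - o**2)   # close to 0.5
```

The last line hints at why this bias is convenient: with these inputs, $n \approx 0.8814$ and $o \approx 0.7071$, so $o^{2}$ is very nearly one half.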

Seeding the output gradient

Manual backpropagation starts from the output node. Since the output is $o$, its gradient with respect to itself is:

$$\frac{\partial o}{\partial o} = 1$$

o.grad = 1.0
 
draw_dot(o)

Backpropagating through `tanh`

The output node is:

$$o = \tanh(n)$$

The derivative of $\tanh$ is:

$$\frac{\partial o}{\partial n} = 1 - \tanh^{2}(n) = 1 - o^{2}$$

For the current forward value, $o \approx 0.7071$, so $1 - o^{2} = 0.5$ and:

$$\text{n.grad} = 0.5$$

1 - o.data**2   # local derivative of tanh at n; evaluates to about 0.5
n.grad = 0.5
 
draw_dot(o)
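The value 0.5 can be cross-checked numerically. The sketch below compares the analytic derivative $1 - o^{2}$ with a central finite difference on plain floats; the step size `h` is an arbitrary choice for this illustration, and `n` is the pre-activation value from the forward pass.

```python
import math

n = 0.8813735870195432   # pre-activation from the forward pass (-6.0 + 0.0 + b)
o = math.tanh(n)

analytic = 1 - o**2      # local derivative from the formula

h = 1e-6                 # small step, chosen arbitrarily for this sketch
numeric = (math.tanh(n + h) - math.tanh(n - h)) / (2 * h)

print(analytic, numeric)  # both should be close to 0.5
```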

Backpropagating through the addition nodes

The node $n$ is an addition:

$$n = x_{1} w_{1} + x_{2} w_{2} + b$$

More explicitly, the graph stores:

$$n = (x_{1} w_{1} + x_{2} w_{2}) + b$$

Addition node local derivative

For an addition node, each local derivative is $1$ because increasing either input by a small amount increases the output by exactly the same amount.

For example, if $z = x + y$, then:
$\frac{\partial z}{\partial x} = 1, \frac{\partial z}{\partial y} = 1$

Therefore:

$$\frac{\partial n}{\partial b} = 1, \quad \frac{\partial n}{\partial (x_{1} w_{1} + x_{2} w_{2})} = 1$$

Since `n.grad` stores $\frac{\partial o}{\partial n} = 0.5$, the chain rule gives:

$$\frac{\partial o}{\partial b} = \frac{\partial o}{\partial n} \cdot \frac{\partial n}{\partial b} = 0.5 \cdot 1 = 0.5$$

and:

$$\frac{\partial o}{\partial (x_{1} w_{1} + x_{2} w_{2})} = \frac{\partial o}{\partial n} \cdot \frac{\partial n}{\partial (x_{1} w_{1} + x_{2} w_{2})} = 0.5 \cdot 1 = 0.5$$

Thus:

b.grad = 0.5
x1w1x2w2.grad = 0.5

The node `x1w1x2w2` is also an addition:

$$x_{1} w_{1} + x_{2} w_{2}$$

Again, both local derivatives are 1:

$$\frac{\partial (x_{1} w_{1} + x_{2} w_{2})}{\partial (x_{1} w_{1})} = 1, \quad \frac{\partial (x_{1} w_{1} + x_{2} w_{2})}{\partial (x_{2} w_{2})} = 1$$

Applying the chain rule:

$$\frac{\partial o}{\partial (x_{1} w_{1})} = \frac{\partial o}{\partial (x_{1} w_{1} + x_{2} w_{2})} \cdot \frac{\partial (x_{1} w_{1} + x_{2} w_{2})}{\partial (x_{1} w_{1})} = 0.5 \cdot 1 = 0.5$$

and:

$$\frac{\partial o}{\partial (x_{2} w_{2})} = \frac{\partial o}{\partial (x_{1} w_{1} + x_{2} w_{2})} \cdot \frac{\partial (x_{1} w_{1} + x_{2} w_{2})}{\partial (x_{2} w_{2})} = 0.5 \cdot 1 = 0.5$$

Thus:

x1w1.grad = 0.5
x2w2.grad = 0.5
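Because addition passes the upstream gradient through unchanged, nudging any of the three summands should move the output at the same rate of about 0.5. The sketch below checks this with central finite differences on plain floats; the helper names `output` and `slope` and the step size `h` are assumptions made for this illustration.

```python
import math

def output(x1w1: float, x2w2: float, b: float) -> float:
    """Output of the neuron as a function of the three summands feeding n."""
    return math.tanh((x1w1 + x2w2) + b)

x1w1, x2w2, b = -6.0, 0.0, 6.8813735870195432
h = 1e-6  # small step, chosen arbitrarily for this sketch

def slope(i: int) -> float:
    """Central finite difference of the output with respect to the i-th summand."""
    up = [x1w1, x2w2, b]
    down = [x1w1, x2w2, b]
    up[i] += h
    down[i] -= h
    return (output(*up) - output(*down)) / (2 * h)

grads = {name: slope(i) for i, name in enumerate(['x1w1', 'x2w2', 'b'])}
print(grads)  # each entry should be close to 0.5
```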

Backpropagating through the multiplication nodes

The remaining local operations are multiplications:

$$x_{1} w_{1} = x_{1} \times w_{1}, \quad x_{2} w_{2} = x_{2} \times w_{2}$$

Multiplication node local derivative

For a multiplication node, the local derivative with respect to one input is the other input. If $z = xy$, then changing $x$ while keeping $y$ fixed changes the output at rate $y$, and changing $y$ while keeping $x$ fixed changes the output at rate $x$. For the two products in this neuron:
$\frac{\partial (x_{1} w_{1})}{\partial x_{1}} = w_{1}, \quad \frac{\partial (x_{1} w_{1})}{\partial w_{1}} = x_{1}$
and:
$\frac{\partial (x_{2} w_{2})}{\partial x_{2}} = w_{2}, \quad \frac{\partial (x_{2} w_{2})}{\partial w_{2}} = x_{2}$
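The product rule above can be illustrated numerically. In this sketch, `x` and `y` are plain floats set to the same values as $x_{1}$ and $w_{1}$, and the step size `h` is an arbitrary choice; a forward difference is enough here because a product is linear in each input taken separately.

```python
x, y = 2.0, -3.0   # same values as x1 and w1 in this note
h = 1e-6           # small step, chosen arbitrarily for this sketch

dz_dx = ((x + h) * y - x * y) / h   # should be close to y
dz_dy = (x * (y + h) - x * y) / h   # should be close to x

print(dz_dx, dz_dy)
```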

Applying the chain rule gives:

$$\frac{\partial o}{\partial x_{1}} = \frac{\partial o}{\partial (x_{1} w_{1})} \cdot \frac{\partial (x_{1} w_{1})}{\partial x_{1}} = 0.5 \cdot \text{w1.data} = 0.5 \cdot (-3.0) = -1.5$$

$$\frac{\partial o}{\partial w_{1}} = \frac{\partial o}{\partial (x_{1} w_{1})} \cdot \frac{\partial (x_{1} w_{1})}{\partial w_{1}} = 0.5 \cdot \text{x1.data} = 0.5 \cdot 2.0 = 1.0$$

$$\frac{\partial o}{\partial x_{2}} = \frac{\partial o}{\partial (x_{2} w_{2})} \cdot \frac{\partial (x_{2} w_{2})}{\partial x_{2}} = 0.5 \cdot \text{w2.data} = 0.5 \cdot 1.0 = 0.5$$

$$\frac{\partial o}{\partial w_{2}} = \frac{\partial o}{\partial (x_{2} w_{2})} \cdot \frac{\partial (x_{2} w_{2})}{\partial w_{2}} = 0.5 \cdot \text{x2.data} = 0.5 \cdot 0.0 = 0.0$$

These gradients are assigned directly from the already-computed upstream gradients and the local multiplication derivatives.

x1.grad = x1w1.grad * w1.data
w1.grad = x1w1.grad * x1.data
 
x2.grad = x2w2.grad * w2.data
w2.grad = x2w2.grad * x2.data
 
draw_dot(o)
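As a final cross-check, all five leaf gradients can be compared against central finite differences of the whole neuron. This is a sketch on plain floats rather than `Value` objects; the function name `neuron`, the dictionaries, and the step size `h` are assumptions made for illustration. The `manual` values are the ones derived by hand in this note.

```python
import math

def neuron(x1, w1, x2, w2, b):
    """Forward pass of the neuron on plain floats."""
    return math.tanh(x1 * w1 + x2 * w2 + b)

inputs = {'x1': 2.0, 'w1': -3.0, 'x2': 0.0, 'w2': 1.0, 'b': 6.8813735870195432}
manual = {'x1': -1.5, 'w1': 1.0, 'x2': 0.5, 'w2': 0.0, 'b': 0.5}

h = 1e-6  # small step, chosen arbitrarily for this sketch
numeric = {}
for name in inputs:
    up = dict(inputs)
    down = dict(inputs)
    up[name] += h
    down[name] -= h
    numeric[name] = (neuron(**up) - neuron(**down)) / (2 * h)

for name in inputs:
    print(name, manual[name], round(numeric[name], 6))
```

The gradient of `w2` is zero because `x2` is zero: with these inputs, changing `w2` does not move the weighted sum at all.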
