Read the Overview First

This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

The tanh activation

The next example performs manual backpropagation through a tiny neuron. The neuron uses the hyperbolic tangent activation function, $\tanh$.

The function can be written in terms of exponentials as:

$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$

The shape of $\tanh$ can be visualized directly:

xs = np.arange(-5, 5, 0.2)
plt.plot(xs, np.tanh(xs)); plt.grid();
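As a quick sanity check, the exponential form can be compared with the standard library's `math.tanh`. This is only a sketch: the helper name `tanh_exp` and the sample points are chosen here for illustration and are not part of the series code.

```python
import math

def tanh_exp(x):
    """tanh written with exponentials: (e^(2x) - 1) / (e^(2x) + 1)."""
    e2x = math.exp(2 * x)
    return (e2x - 1) / (e2x + 1)

# Compare against the standard-library implementation at a few points.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
max_diff = max(abs(tanh_exp(x) - math.tanh(x)) for x in samples)
print(max_diff)  # a very small number, at the level of floating-point rounding
```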

Adding tanh to Value

This fourth version of the Value class receives a new method, tanh(), which computes the hyperbolic tangent of the scalar stored in self.data.

Since $\tanh$ is a unary operation, the output node has only one predecessor: the current object itself, self.

import math

class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    return out
 
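  def tanh(self):
    x = self.data
    # tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    # Unary operation: the only predecessor of the output node is self
    out = Value(t, (self,), 'tanh')
    return out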
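One caveat about the exponential formula is worth knowing, although it does not affect the small values used in this note: `math.exp` raises `OverflowError` once its argument is large enough, whereas `math.tanh` simply saturates at 1. The sketch below illustrates this with a standalone helper, `tanh_exp` (a name assumed here), that mirrors the formula inside `Value.tanh`.

```python
import math

def tanh_exp(x):
    # Same formula as the tanh method above, applied to a plain float.
    return (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)

try:
    tanh_exp(400.0)   # math.exp(800.0) exceeds the float range
    overflowed = False
except OverflowError:
    overflowed = True

saturated = math.tanh(400.0)  # the library function saturates instead
print(overflowed, saturated)
```

If large inputs ever matter, calling `math.tanh(x)` inside the method would be one possible alternative. The series code keeps the explicit formula.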

Building the neuron

The neuron has two inputs:

$$x_1 = 2.0, \quad x_2 = 0.0$$

and two weights:

$$w_1 = -3.0, \quad w_2 = 1.0$$

The bias is set to:

$$b = 6.8813735870195432$$

Info

This particular bias value is chosen so that the manual backpropagation produces convenient round numbers. Changing it is a useful way to see how the backpropagation values change.
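One way to see why the numbers come out round is to run the forward pass with plain floats. With this bias, the pre-activation value lands (to floating-point precision) on $\operatorname{asinh}(1)$, where $\tanh$ equals $1/\sqrt{2}$, so $1 - o^2$ is $0.5$. The snippet below is an illustrative check, not part of the series code.

```python
import math

# Forward pass with plain floats, using the values from this note.
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432

n = x1 * w1 + x2 * w2 + b   # weighted sum plus bias
o = math.tanh(n)            # activation
local_grad = 1 - o ** 2     # derivative of tanh at n

print(n, math.asinh(1.0))
print(o, local_grad)
```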

The neuron first computes the weighted sum:

$$n = x_1 w_1 + x_2 w_2 + b$$

and then applies the activation:

$$o = \tanh(n)$$

x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')
 
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
 
o = n.tanh(); o.label = 'o'
 
draw_dot(o)

Seeding the output gradient

Manual backpropagation starts from the output node. Since the output is $o$, its gradient with respect to itself is:

$$\frac{\partial o}{\partial o} = 1$$

o.grad = 1.0
 
draw_dot(o)

Backpropagating through tanh

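The output node is:

$$o = \tanh(n)$$

The derivative of $\tanh$ is:

$$\frac{\partial o}{\partial n} = 1 - \tanh^2(n) = 1 - o^2$$

For the current forward value, $o \approx 0.7071$, this derivative is $0.5$ (up to floating-point rounding), so: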

1 - o.data**2  # local derivative do/dn, approximately 0.5
n.grad = 0.5
 
draw_dot(o)
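The analytic derivative can be cross-checked with a central finite difference. This is a sketch with plain floats; the step size `h` is an arbitrary illustrative choice.

```python
import math

n = 0.8813735870195432   # forward value of n in this note
h = 1e-6                 # illustrative step size

analytic = 1 - math.tanh(n) ** 2
numeric = (math.tanh(n + h) - math.tanh(n - h)) / (2 * h)

print(analytic, numeric)  # both should be close to 0.5
```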

Backpropagating through the addition nodes

The node $n$ is an addition:

$$n = x_1 w_1 + x_2 w_2 + b$$

More explicitly, the graph stores:

$$n = (x_1 w_1 + x_2 w_2) + b$$

Addition node local derivative

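For an addition node, each local derivative is $1$, because increasing either input by a small amount increases the output by exactly the same amount.

For example, if $f = u + v$, then:

$$\frac{\partial f}{\partial u} = 1, \quad \frac{\partial f}{\partial v} = 1$$

Therefore:

$$\frac{\partial n}{\partial (x_1 w_1 + x_2 w_2)} = 1, \quad \frac{\partial n}{\partial b} = 1$$

Since n.grad stores $\frac{\partial o}{\partial n}$, the chain rule gives:

$$\frac{\partial o}{\partial b} = \frac{\partial o}{\partial n} \cdot \frac{\partial n}{\partial b} = 0.5 \cdot 1 = 0.5$$

and:

$$\frac{\partial o}{\partial (x_1 w_1 + x_2 w_2)} = \frac{\partial o}{\partial n} \cdot \frac{\partial n}{\partial (x_1 w_1 + x_2 w_2)} = 0.5 \cdot 1 = 0.5$$

Thus: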

b.grad = 0.5
x1w1x2w2.grad = 0.5

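The node x1w1x2w2 is also an addition, with inputs x1w1 and x2w2:

$$(x_1 w_1) + (x_2 w_2)$$

Again, both local derivatives are 1:

$$\frac{\partial (x_1 w_1 + x_2 w_2)}{\partial (x_1 w_1)} = 1, \quad \frac{\partial (x_1 w_1 + x_2 w_2)}{\partial (x_2 w_2)} = 1$$

Applying the chain rule:

$$\frac{\partial o}{\partial (x_1 w_1)} = \frac{\partial o}{\partial (x_1 w_1 + x_2 w_2)} \cdot \frac{\partial (x_1 w_1 + x_2 w_2)}{\partial (x_1 w_1)} = 0.5 \cdot 1 = 0.5$$

and:

$$\frac{\partial o}{\partial (x_2 w_2)} = \frac{\partial o}{\partial (x_1 w_1 + x_2 w_2)} \cdot \frac{\partial (x_1 w_1 + x_2 w_2)}{\partial (x_2 w_2)} = 0.5 \cdot 1 = 0.5$$

Thus: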

x1w1.grad = 0.5
x2w2.grad = 0.5
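The two addition steps can be summarized in a few lines of plain Python. The helper `add_backward` is hypothetical, introduced only to show that an addition node passes its upstream gradient unchanged to both inputs.

```python
def add_backward(upstream):
    """Hypothetical helper: gradients for both inputs of an addition node.

    Each local derivative is 1, so both inputs receive upstream * 1.
    """
    return upstream * 1.0, upstream * 1.0

n_grad = 0.5                                        # do/dn from the tanh step
x1w1x2w2_grad, b_grad = add_backward(n_grad)        # n = x1w1x2w2 + b
x1w1_grad, x2w2_grad = add_backward(x1w1x2w2_grad)  # x1w1x2w2 = x1w1 + x2w2

print(b_grad, x1w1x2w2_grad, x1w1_grad, x2w2_grad)  # 0.5 0.5 0.5 0.5
```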

Backpropagating through the multiplication nodes

The remaining local operations are multiplications: the node x1w1 is the product $x_1 \cdot w_1$, and the node x2w2 is the product $x_2 \cdot w_2$.

Multiplication node local derivative

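For a multiplication node, the local derivative with respect to one input is the other input. This happens because, if $f = u \cdot v$, then changing $u$ while keeping $v$ fixed changes the output at rate $v$, and changing $v$ while keeping $u$ fixed changes the output at rate $u$.

Therefore:

$$\frac{\partial (x_1 w_1)}{\partial x_1} = w_1, \quad \frac{\partial (x_1 w_1)}{\partial w_1} = x_1$$

and:

$$\frac{\partial (x_2 w_2)}{\partial x_2} = w_2, \quad \frac{\partial (x_2 w_2)}{\partial w_2} = x_2$$

Applying the chain rule gives:

$$\frac{\partial o}{\partial x_1} = \frac{\partial o}{\partial (x_1 w_1)} \cdot w_1 = 0.5 \cdot (-3.0) = -1.5$$

$$\frac{\partial o}{\partial w_1} = \frac{\partial o}{\partial (x_1 w_1)} \cdot x_1 = 0.5 \cdot 2.0 = 1.0$$

$$\frac{\partial o}{\partial x_2} = \frac{\partial o}{\partial (x_2 w_2)} \cdot w_2 = 0.5 \cdot 1.0 = 0.5$$

$$\frac{\partial o}{\partial w_2} = \frac{\partial o}{\partial (x_2 w_2)} \cdot x_2 = 0.5 \cdot 0.0 = 0.0$$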

These gradients are assigned directly from the already-computed upstream gradients and the local multiplication derivatives.

x1.grad = x1w1.grad * w1.data  # 0.5 * -3.0 = -1.5
w1.grad = x1w1.grad * x1.data  # 0.5 *  2.0 =  1.0
 
x2.grad = x2w2.grad * w2.data  # 0.5 *  1.0 =  0.5
w2.grad = x2w2.grad * x2.data  # 0.5 *  0.0 =  0.0
 
draw_dot(o)
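The manually assigned leaf gradients can be verified numerically. The sketch below rebuilds the neuron with plain floats and estimates each partial derivative by central differences. The helper names `neuron` and `numeric_grad` and the step size are assumptions made for this check, not part of the Value class.

```python
import math

def neuron(x1, w1, x2, w2, b):
    # Same expression as the graph in this note, with plain floats.
    return math.tanh(x1 * w1 + x2 * w2 + b)

def numeric_grad(f, args, i, h=1e-6):
    """Central-difference estimate of df/d(args[i]); h is an illustrative step."""
    up = list(args)
    up[i] += h
    down = list(args)
    down[i] -= h
    return (f(*up) - f(*down)) / (2 * h)

args = [2.0, -3.0, 0.0, 1.0, 6.8813735870195432]  # x1, w1, x2, w2, b
names = ['x1', 'w1', 'x2', 'w2', 'b']
grads = {name: numeric_grad(neuron, args, i) for i, name in enumerate(names)}

for name, g in grads.items():
    print(name, round(g, 4))
```

The estimates should agree with the values assigned by hand: $-1.5$ for x1, $1.0$ for w1, $0.5$ for x2, $0.0$ for w2, and $0.5$ for b.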