Read the Overview First
This note is part of the Micrograd series. Before continuing, it is essential to read the Micrograd overview, which explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.
The tanh activation
The next example performs manual backpropagation through a tiny neuron. The neuron will use the hyperbolic tangent activation function, $\tanh$.
The function can be written in terms of exponentials as:
$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$
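As a quick sanity check, the exponential form can be compared against the standard library's `math.tanh`. This is only a minimal sketch: the helper name `tanh_exp` is an assumption introduced here for illustration and is not part of the series code.

```python
import math

def tanh_exp(x):
    # Hypothetical helper: tanh written as (e^{2x} - 1) / (e^{2x} + 1)
    e2x = math.exp(2 * x)
    return (e2x - 1) / (e2x + 1)

# Compare against the standard-library implementation at a few sample points
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert math.isclose(tanh_exp(x), math.tanh(x), rel_tol=1e-12, abs_tol=1e-12)
```

One caveat of this form is that `math.exp(2 * x)` raises `OverflowError` for very large positive `x`, whereas `math.tanh` does not. For the small values used in this note that is not a concern.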
The shape of $\tanh$ can be visualized directly:
plt.plot(np.arange(-5, 5, 0.2), np.tanh(np.arange(-5, 5, 0.2))); plt.grid();

Adding tanh to Value
This fourth version of the Value class receives a new method, tanh(), which computes the hyperbolic tangent of the scalar stored in self.data.
Since $\tanh$ is a unary operation, the output node has only one predecessor: the current object itself, self.
import math

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0               # filled in during backpropagation
        self._prev = set(_children)   # predecessor nodes in the expression graph
        self._op = _op                # operation that produced this node
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

    def tanh(self):
        x = self.data
        # tanh(x) = (e^{2x} - 1) / (e^{2x} + 1)
        t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')   # unary op: self is the only predecessor
        return out

Building the neuron

The neuron has two inputs:
$$x_1 = 2.0, \quad x_2 = 0.0$$
and two weights:
$$w_1 = -3.0, \quad w_2 = 1.0$$
The bias is set to:
$$b = 6.8813735870195432$$
Info
This particular bias value is chosen so that the manual backpropagation produces convenient round numbers. Changing it is a useful way to see how the backpropagation values change.
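Why this bias gives round numbers can be checked with plain floats. The sketch below copies the input, weight, and bias values from the code in this note and uses ordinary Python numbers rather than Value objects. With these values the weighted sum plus bias (called n in the code) is about 0.8814, which agrees with $\ln(1 + \sqrt{2})$ to floating-point precision, so its $\tanh$ is $\sqrt{2}/2$ and $1 - \tanh^2$ comes out at 0.5.

```python
import math

# Plain floats copied from this note's neuron (not Value objects)
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432

n = x1 * w1 + x2 * w2 + b   # weighted sum plus bias
o = math.tanh(n)            # neuron output

print(n)          # approximately 0.8814
print(o)          # approximately 0.7071, i.e. sqrt(2)/2
print(1 - o**2)   # approximately 0.5, the local derivative of tanh at n
```

Changing `b` in this sketch shows how quickly the local derivative moves away from 0.5.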
The neuron first computes the weighted sum:
$$n = x_1 w_1 + x_2 w_2 + b$$
and then applies the activation:
$$o = \tanh(n)$$
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.tanh(); o.label = 'o'
draw_dot(o)

Seeding the output gradient
Manual backpropagation starts from the output node. Since the output is $o$, its gradient with respect to itself is:
$$\frac{\partial o}{\partial o} = 1$$
o.grad = 1.0
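draw_dot(o)

Backpropagating through tanh
The output node is:
$$o = \tanh(n)$$
The derivative of $\tanh$ is:
$$\frac{\partial o}{\partial n} = 1 - \tanh^2(n) = 1 - o^2$$
For the current forward value, this derivative is 0.5, so:
$$\frac{\partial o}{\partial n} = 0.5$$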
1 - o.data**2
n.grad = 0.5
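The value 0.5 can also be cross-checked numerically, without relying on the closed-form derivative. The sketch below applies a central finite difference to plain floats. The step size `h` and the variable names are illustrative assumptions, not part of the series code.

```python
import math

# Forward value of n, recomputed from the inputs, weights, and bias of this note
n_data = 2.0 * -3.0 + 0.0 * 1.0 + 6.8813735870195432
h = 1e-6   # small step, chosen for illustration

# Central difference approximation of d tanh(n) / dn
numeric = (math.tanh(n_data + h) - math.tanh(n_data - h)) / (2 * h)
# Closed-form derivative, 1 - tanh(n)^2
analytic = 1 - math.tanh(n_data) ** 2

print(numeric, analytic)   # both approximately 0.5
```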
draw_dot(o)

Backpropagating through the addition nodes
The node $n$ is an addition:
$$n = x_1 w_1 + x_2 w_2 + b$$
More explicitly, the graph stores:
$$n = \text{x1w1x2w2} + b$$
Addition node local derivative
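For an addition node, each local derivative is $1$ because increasing either input by a small amount increases the output by exactly the same amount.
For example, if $z = x + y$, then:
$$\frac{\partial z}{\partial x} = 1, \quad \frac{\partial z}{\partial y} = 1$$
Therefore:
$$\frac{\partial n}{\partial \text{x1w1x2w2}} = 1, \quad \frac{\partial n}{\partial b} = 1$$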
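The claim that both local derivatives of an addition equal 1 can be confirmed by nudging each input in turn. This is a minimal sketch on plain floats: the function name `add` and the values of `x`, `y`, and `h` are arbitrary choices for illustration.

```python
def add(x, y):
    return x + y

x, y, h = 2.0, -3.0, 1e-3

# Nudge one input at a time and measure the change in the output
d_dx = (add(x + h, y) - add(x, y)) / h
d_dy = (add(x, y + h) - add(x, y)) / h

print(d_dx, d_dy)   # both approximately 1.0
```

Because the local derivative is 1, an addition node simply passes its upstream gradient unchanged to both inputs.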
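Since n.grad stores $\frac{\partial o}{\partial n}$, the chain rule gives:
$$\frac{\partial o}{\partial b} = \frac{\partial o}{\partial n} \cdot \frac{\partial n}{\partial b} = 0.5 \cdot 1 = 0.5$$
and:
$$\frac{\partial o}{\partial \text{x1w1x2w2}} = \frac{\partial o}{\partial n} \cdot \frac{\partial n}{\partial \text{x1w1x2w2}} = 0.5 \cdot 1 = 0.5$$
Thus: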
b.grad = 0.5
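x1w1x2w2.grad = 0.5
The node x1w1x2w2 is also an addition:
$$\text{x1w1x2w2} = \text{x1w1} + \text{x2w2}$$
Again, both local derivatives are 1:
$$\frac{\partial \text{x1w1x2w2}}{\partial \text{x1w1}} = 1, \quad \frac{\partial \text{x1w1x2w2}}{\partial \text{x2w2}} = 1$$
Applying the chain rule:
$$\frac{\partial o}{\partial \text{x1w1}} = \frac{\partial o}{\partial \text{x1w1x2w2}} \cdot 1 = 0.5$$
and:
$$\frac{\partial o}{\partial \text{x2w2}} = \frac{\partial o}{\partial \text{x1w1x2w2}} \cdot 1 = 0.5$$
Thus: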
x1w1.grad = 0.5
x2w2.grad = 0.5

Backpropagating through the multiplication nodes
The remaining local operations are multiplications:
$$\text{x1w1} = x_1 \cdot w_1, \quad \text{x2w2} = x_2 \cdot w_2$$
Multiplication node local derivative
For a multiplication node, the local derivative with respect to one input is the other input. This happens because, if $z = x \cdot y$, then changing $x$ while keeping $y$ fixed changes the output at rate $y$, and changing $y$ while keeping $x$ fixed changes the output at rate $x$.
$$\frac{\partial z}{\partial x} = y$$
and:
$$\frac{\partial z}{\partial y} = x$$
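The same nudging experiment illustrates the multiplication rule. This is a sketch on plain floats: the function name `mul` and the step `h` are illustrative assumptions, and the values of `x` and `y` mirror x1 and w1 from this note.

```python
def mul(x, y):
    return x * y

x, y, h = 2.0, -3.0, 1e-3

# Nudge one input at a time; the measured rate should equal the other input
d_dx = (mul(x + h, y) - mul(x, y)) / h   # approximately y = -3.0
d_dy = (mul(x, y + h) - mul(x, y)) / h   # approximately x =  2.0

print(d_dx, d_dy)
```

This is why a multiplication node routes the upstream gradient to each input scaled by the value of the other input.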
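Applying the chain rule gives:
$$\begin{aligned}
\frac{\partial o}{\partial x_1} &= \frac{\partial o}{\partial \text{x1w1}} \cdot w_1 = 0.5 \cdot (-3.0) = -1.5 \\
\frac{\partial o}{\partial w_1} &= \frac{\partial o}{\partial \text{x1w1}} \cdot x_1 = 0.5 \cdot 2.0 = 1.0 \\
\frac{\partial o}{\partial x_2} &= \frac{\partial o}{\partial \text{x2w2}} \cdot w_2 = 0.5 \cdot 1.0 = 0.5 \\
\frac{\partial o}{\partial w_2} &= \frac{\partial o}{\partial \text{x2w2}} \cdot x_2 = 0.5 \cdot 0.0 = 0.0
\end{aligned}$$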
These gradients are assigned directly from the already-computed upstream gradients and the local multiplication derivatives.
x1.grad = x1w1.grad * w1.data
w1.grad = x1w1.grad * x1.data
x2.grad = x2w2.grad * w2.data
w2.grad = x2w2.grad * x2.data
draw_dot(o)
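The manually assigned gradients can be verified end to end with a numerical gradient check. The sketch below is an independent cross-check on plain floats, not part of the Value class: the function `neuron`, the dictionary `params`, and the step `h` are names and choices introduced here for illustration.

```python
import math

def neuron(x1, x2, w1, w2, b):
    # Forward pass on plain floats: o = tanh(x1*w1 + x2*w2 + b)
    return math.tanh(x1 * w1 + x2 * w2 + b)

# Values copied from this note
params = {'x1': 2.0, 'x2': 0.0, 'w1': -3.0, 'w2': 1.0, 'b': 6.8813735870195432}
h = 1e-6   # small step, chosen for illustration

numeric_grads = {}
for name in params:
    plus = dict(params)
    plus[name] += h
    minus = dict(params)
    minus[name] -= h
    # Central difference estimate of the derivative of o with respect to this input
    numeric_grads[name] = (neuron(**plus) - neuron(**minus)) / (2 * h)

for name, g in numeric_grads.items():
    print(f"{name}: {g:.4f}")
```

The estimates should agree with the values obtained by hand, up to finite-difference error: x1.grad = -1.5, w1.grad = 1.0, x2.grad = 0.5, w2.grad = 0.0, and b.grad = 0.5. The zero gradient for w2 follows from x2 being 0.0, so changing w2 cannot affect the output.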