Read the Overview First
This note is part of the Micrograd series. Before continuing, it is essential to read Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.
Breaking up tanh
So far, tanh has been implemented as one single operation node. This note takes a small detour and rebuilds the same activation from more elementary operations.
The hyperbolic tangent can be written as:

$$
\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}
$$

Instead of calling n.tanh() directly, the output will be built as:

$$
e = \exp(2n), \qquad o = \frac{e - 1}{e + 1}
$$
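As a quick numerical sanity check, the exponential form of tanh can be compared against the standard library on plain floats. This is only a sketch: the helper name tanh_from_exp is hypothetical and is not part of the Value class.

```python
import math

def tanh_from_exp(x):
    """tanh(x) written with a single exponential, e = exp(2x)."""
    e = math.exp(2 * x)
    return (e - 1) / (e + 1)

# Compare against math.tanh at a few sample points
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert math.isclose(tanh_from_exp(x), math.tanh(x), rel_tol=1e-9, abs_tol=1e-12)
```

The two forms agree up to floating-point rounding, which is why the decomposed graph can stand in for the single tanh node.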
Operation granularity is a design choice
The level of abstraction used to represent operations in the computation graph is chosen by the designer of the autograd engine.
tanh can be represented as one high-level operation node, or it can be decomposed into lower-level nodes such as multiplication, exponentiation, subtraction, addition, and division.
Both choices are valid as long as each operation can perform its forward pass and provide the correct local gradient for the backward pass.
Value class: eighth version
To express tanh through elementary operations, the Value class needs a few more operators:
- scalar-friendly addition and multiplication;
- reverse addition and reverse multiplication, so expressions such as 2 * n work;
- exponentiation by a scalar, so division can be implemented as multiplication by a reciprocal;
- negation, subtraction, and true division;
- an exp() method, so exponentials can become graph nodes with their own backward rule.
Scalar operands
The __add__ and __mul__ methods convert raw Python numbers into Value objects when needed. This is done by checking whether other is already a Value; if it is not, it is wrapped as Value(other).
This allows expressions such as:
a + 1
a * 2
without requiring the scalar to be manually wrapped as Value(1) or Value(2).
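The wrapping step can be seen in isolation with a stripped-down, forward-only stand-in. MiniValue is a hypothetical class used only for this illustration; it has no gradients and is not the Value class of this note.

```python
class MiniValue:
    """Hypothetical forward-only stand-in for Value (no gradient tracking)."""

    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # Wrap raw Python numbers so both operands expose .data
        other = other if isinstance(other, MiniValue) else MiniValue(other)
        return MiniValue(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, MiniValue) else MiniValue(other)
        return MiniValue(self.data * other.data)

a = MiniValue(3.0)
s = a + 1   # 1 is wrapped as MiniValue(1) inside __add__
p = a * 2   # 2 is wrapped as MiniValue(2) inside __mul__
```

Because the scalar is wrapped into a node, it also appears in the graph as a leaf; its gradient is computed but simply never used.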
Division through exponentiation
Division does not need a separate primitive operation. It can be expressed using the identity:

$$
\frac{a}{b} = a \cdot b^{-1}
$$
Therefore, implementing exponentiation by a scalar constant is enough to support division. In particular, division becomes the special case where the exponent is -1.
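A small check on plain floats shows that the reciprocal form gives both the expected quotient and the expected partial derivatives. The variable names here are illustrative assumptions, and the numerical derivatives use central differences.

```python
import math

a, b = 6.0, 2.0

# Division written as multiplication by a power with exponent -1
quotient = a * b ** -1

# Analytic partial derivatives of a * b**-1
d_da = b ** -1          # derivative with respect to a
d_db = -a * b ** -2     # derivative with respect to b (power rule with k = -1)

# Central finite differences as an independent check
h = 1e-6
num_da = ((a + h) * b ** -1 - (a - h) * b ** -1) / (2 * h)
num_db = (a * (b + h) ** -1 - a * (b - h) ** -1) / (2 * h)

assert math.isclose(d_da, num_da, rel_tol=1e-6)
assert math.isclose(d_db, num_db, rel_tol=1e-6)
```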
The implementation only supports powers where the exponent is an int or a float. This keeps the operation simple: self is a Value, but other is treated as a fixed scalar constant.
For example, in:

$$
y = x^k
$$

the current implementation supports the case where x is a Value node and k is a plain Python number.
In that case, the power node has only one differentiable input: the base x. Therefore, its local derivative is:

$$
\frac{\partial y}{\partial x} = k \, x^{k-1}
$$

and the backward rule only needs to add a contribution to x.grad.
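The power rule used by the backward pass can be verified numerically for a few scalar exponents, including the k = -1 case that division relies on. The helper names below are hypothetical and operate on plain floats, not on Value objects.

```python
import math

def power_local_grad(x, k):
    """Analytic local derivative of x**k with respect to x, for constant k."""
    return k * x ** (k - 1)

def numeric_grad(x, k, h=1e-6):
    """Central finite-difference estimate of the same derivative."""
    return ((x + h) ** k - (x - h) ** k) / (2 * h)

x = 2.0
for k in (3, 0.5, -1):
    assert math.isclose(power_local_grad(x, k), numeric_grad(x, k), rel_tol=1e-6)
```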
If the exponent were a Value
If the exponent k were also a Value, then the operation would have two differentiable inputs: the base x and the exponent k.
The backward pass would need to update both gradients separately. In both updates, the multiplication by y.grad comes from the chain rule: the local derivative of the power node is multiplied by the gradient arriving from the final output.
For the base:

$$
\texttt{x.grad} \mathrel{+}= k \, x^{k-1} \cdot \texttt{y.grad}
$$

For the exponent:

$$
\texttt{k.grad} \mathrel{+}= x^{k} \ln(x) \cdot \texttt{y.grad}
$$

The second local derivative would be (for x > 0):

$$
\frac{\partial y}{\partial k} = x^{k} \ln(x)
$$
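A minimal sketch of what the two-input case would compute, written on plain floats. This is not part of the Value class in this note; the function name pow_local_grads and the incoming gradient y_grad are assumptions made for illustration, and the logarithm requires a positive base.

```python
import math

def pow_local_grads(x, k):
    """Local derivatives of y = x**k with respect to both inputs (assumes x > 0)."""
    dy_dx = k * x ** (k - 1)
    dy_dk = x ** k * math.log(x)
    return dy_dx, dy_dk

x, k = 2.0, 3.0
y_grad = 1.0  # hypothetical gradient arriving from the output

dy_dx, dy_dk = pow_local_grads(x, k)
x_grad = dy_dx * y_grad   # contribution to the base
k_grad = dy_dk * y_grad   # contribution to the exponent

# Independent check of the exponent derivative with a central difference
h = 1e-6
num_dk = (x ** (k + h) - x ** (k - h)) / (2 * h)
assert math.isclose(dy_dk, num_dk, rel_tol=1e-6)
```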
For this implementation, only scalar powers are supported, so the local derivative of:

$$
y = x^k
$$

is simply:

$$
\frac{\partial y}{\partial x} = k \, x^{k-1}
$$
During backpropagation, this local derivative is multiplied by the incoming gradient out.grad.
Reverse multiplication
The method __rmul__ handles multiplication when the Value object appears on the right side of *.
For example, when Python evaluates:
2 * a
it first asks the integer 2 whether it knows how to multiply itself by a Value object. A normal integer does not know how to do that. Python then asks the right-hand object a whether it can handle the reverse operation by calling:
a.__rmul__(2)
The implementation returns:
self * other
which evaluates the expression as a * 2. The existing __mul__ method already knows how to wrap the raw number 2 into Value(2) and finish the computation.
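The dispatch order can be observed with a toy class that records which methods Python calls. Scaled is a hypothetical class used only to expose the mechanism; it is not the Value class.

```python
class Scaled:
    """Hypothetical toy class that logs operator dispatch."""

    def __init__(self, data):
        self.data = data
        self.calls = []

    def __mul__(self, other):
        self.calls.append('__mul__')
        other_data = other.data if isinstance(other, Scaled) else other
        return Scaled(self.data * other_data)

    def __rmul__(self, other):
        # Reached only after the left operand declined the multiplication
        self.calls.append('__rmul__')
        return self * other

a = Scaled(3.0)

# The integer declines by returning NotImplemented
declined = (2).__mul__(a) is NotImplemented

b = 2 * a
print(a.calls)  # ['__rmul__', '__mul__']
```

The log shows the fallback to __rmul__ first, followed by the ordinary __mul__ that it delegates to.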
Other convenience operators
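Reverse addition can be implemented as normal addition because addition is commutative:

$$
\text{other} + \text{self} = \text{self} + \text{other}
$$

Negation can be implemented as multiplication by -1, and subtraction can be implemented as addition of the negated value:

$$
-a = a \cdot (-1), \qquad a - b = a + (-b)
$$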
Exponential operation
The new exp() method computes the exponential of the scalar stored in self.data and creates an exp node in the graph.
The local derivative of $e^x$ is $e^x$ itself. Since the output node already stores exp(x) in out.data, the backward rule can use out.data * out.grad.
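The reuse of the forward value in the backward rule can be checked on plain floats. The names out_data and out_grad below mirror the attributes of the output node but are ordinary variables here, and the incoming gradient is an assumed value.

```python
import math

x = 1.5
out_data = math.exp(x)   # forward value, as it would be stored in out.data
out_grad = 1.0           # hypothetical gradient arriving at the exp node

# Backward rule: d/dx exp(x) = exp(x), already available as out_data
x_grad = out_data * out_grad

# Central finite difference as an independent check
h = 1e-6
numeric = (math.exp(x + h) - math.exp(x - h)) / (2 * h)
assert math.isclose(x_grad, numeric, rel_tol=1e-6)
```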
import math

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        # Wrap raw Python numbers so that a + 1 works
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0
        out._backward = _backward
        return out

    def __mul__(self, other):
        # Wrap raw Python numbers so that a * 2 works
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data
        out._backward = _backward
        return out

    def __pow__(self, other):
        # The exponent is a fixed scalar constant, not a graph node
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data ** other, (self,), f'**{other}')

        def _backward():
            # Power rule: d/dx x**k = k * x**(k - 1)
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out

    def __rmul__(self, other):  # other * self
        return self * other

    def __radd__(self, other):  # other + self
        return self + other

    def __truediv__(self, other):  # self / other
        return self * other**-1

    def __neg__(self):  # -self
        return self * -1

    def __sub__(self, other):  # self - other
        return self + (-other)

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')

        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out

    def exp(self):
        x = self.data
        out = Value(math.exp(x), (self,), 'exp')

        def _backward():
            # d/dx exp(x) = exp(x), which is already stored in out.data
            self.grad += out.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order guarantees a node is processed after all of its consumers
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        self.grad = 1.0
        for node in reversed(topo):
            node._backward()
Rebuilding the neuron with decomposed tanh
The same neuron example is rebuilt.
It has two inputs:

$$
x_1 = 2.0, \qquad x_2 = 0.0
$$

two weights:

$$
w_1 = -3.0, \qquad w_2 = 1.0
$$

and the same bias value:

$$
b = 6.8813735870195432
$$
This bias is chosen so that backpropagation produces convenient round numbers. Changing it is a useful way to see how the forward value and gradients change.
The neuron first computes:

$$
n = x_1 w_1 + x_2 w_2 + b
$$

Then, instead of using n.tanh(), it builds the output through the exponential form of tanh:

$$
o = \frac{e^{2n} - 1}{e^{2n} + 1}
$$
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
# tanh built from elementary operations instead of n.tanh()
e = (2*n).exp()
o = (e - 1) / (e + 1)
o.label = 'o'
o.backward()
draw_dot(o)
Takeaway
This exercise has two main lessons.
- Rebuilding tanh from its atomic components gives a useful opportunity to practice extending the small autograd engine with new operations, such as exponentiation, exponentials, and their corresponding derivatives. In this sense, tanh is not only an activation function to reproduce, but also a concrete example for testing and strengthening the machinery of automatic differentiation.
- This example also shows that the level of abstraction of the operations implemented in the graph is a design choice. At first, tanh was implemented directly as a single operation. Later, it was decomposed into more elementary operations, such as exponentials, subtraction, and division. Both approaches are valid: the difference is the granularity at which the computational graph is built.
What matters for automatic differentiation
Each node produces an output as a function of its input values.
What matters for automatic differentiation is not the specific operation itself, but whether the operation can perform its forward pass and compute the corresponding local gradients during the backward pass.
Once those local gradients are available, the chain rule can connect them together and backpropagation can continue through the graph. Therefore, the function connecting a node’s inputs to its output, and the level of abstraction used to represent it, are design choices left to the programmer.
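The equivalence of the two granularities can be illustrated by chaining the local derivatives of the decomposed graph by hand and comparing the result with the single-node tanh derivative. This is a sketch on plain floats using the inputs of the neuron example; the intermediate names (u, num, den) are assumptions introduced here and do not appear in the Value class.

```python
import math

# Pre-activation of the neuron example: x1*w1 + x2*w2 + b
n = 2.0 * -3.0 + 0.0 * 1.0 + 6.8813735870195432

# Forward pass through the decomposed graph
u = 2 * n
e = math.exp(u)
num = e - 1
den = e + 1
o = num / den

# Backward pass: one local derivative per node, connected by the chain rule
do_dnum = 1 / den
do_dden = -num / den ** 2
do_de = do_dnum * 1.0 + do_dden * 1.0   # e feeds both num and den, so contributions add
do_du = do_de * e                        # exp node: local derivative is its own output
do_dn = do_du * 2                        # u = 2 * n

# Single-node alternative: d/dn tanh(n) = 1 - tanh(n)**2
single_node = 1 - math.tanh(n) ** 2
assert math.isclose(do_dn, single_node, rel_tol=1e-9)
```

Both routes give a gradient of about 0.5 with respect to n, which is the round number the bias was chosen to produce.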