Read the Overview First

This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

Breaking up tanh

So far, tanh has been implemented as one single operation node. This note takes a small detour and rebuilds the same activation from more elementary operations.

The hyperbolic tangent can be written as:

$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$

Instead of calling n.tanh() directly, the output will be built as:

e = (2*n).exp()
o = (e - 1) / (e + 1)
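As a quick sanity check, the exponential form can be compared against Python's built-in math.tanh on plain floats. This is only an illustrative sketch: the helper name tanh_via_exp is hypothetical and does not appear in the notes.

```python
import math

def tanh_via_exp(x):
    """Hypothetical helper: tanh built from a single exponential, e^(2x)."""
    e = math.exp(2 * x)
    return (e - 1) / (e + 1)

# Compare against the library implementation at a few sample points.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert math.isclose(tanh_via_exp(x), math.tanh(x), rel_tol=1e-12, abs_tol=1e-12)
```

The two agree to floating-point precision for moderate inputs, which is all the neuron example in this note requires.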

Operation granularity is a design choice

The level of abstraction used to represent operations in the computation graph is chosen by the designer of the autograd engine.

tanh can be represented as one high-level operation node, or it can be decomposed into lower-level nodes such as multiplication, exponentiation, subtraction, addition, and division.

Both choices are valid as long as each operation can perform its forward pass and provide the correct local gradient for the backward pass.
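To make the equivalence concrete, the sketch below compares the two granularities on plain floats, without any graph machinery. It assumes an arbitrary sample input n = 0.5; the variable names are illustrative only. The coarse version uses the single local derivative of tanh, while the fine version multiplies the local derivatives of the elementary steps, as the chain rule would during backpropagation.

```python
import math

n = 0.5  # arbitrary sample input (an assumption for illustration)

# Coarse graph: one tanh node with local derivative 1 - tanh(n)^2.
coarse = 1 - math.tanh(n) ** 2

# Fine graph: n -> u = 2n -> e = exp(u) -> o = (e - 1) / (e + 1)
u = 2 * n
e = math.exp(u)
du_dn = 2.0                # derivative of 2n with respect to n
de_du = e                  # derivative of exp(u) with respect to u
do_de = 2 / (e + 1) ** 2   # derivative of (e - 1) / (e + 1) with respect to e
fine = do_de * de_du * du_dn

# Both routes give the same derivative do/dn.
assert math.isclose(coarse, fine, rel_tol=1e-9)
```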

Value class: eighth version

To express tanh through elementary operations, the Value class needs a few more operators:

  • scalar-friendly addition and multiplication;
  • reverse addition and reverse multiplication, so expressions such as 2 * n work;
  • exponentiation by a scalar, so division can be implemented as multiplication by a reciprocal;
  • negation, subtraction, and true division;
  • an exp() method, so exponentials can become graph nodes with their own backward rule.

Scalar operands

The __add__ and __mul__ methods convert raw Python numbers into Value objects when needed. This is done by checking whether other is already a Value; if it is not, it is wrapped as Value(other).

This allows expressions such as:

a + 1
a * 2

without requiring the scalar to be manually wrapped as Value(1) or Value(2).
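The wrapping step can be seen in isolation with a stripped-down stand-in for Value. The class name Scalar is hypothetical and only performs the forward pass; it is a sketch of the isinstance check, not the class used in these notes.

```python
class Scalar:
    """Hypothetical, forward-only stand-in for Value."""

    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # Wrap raw Python numbers so the rest of the method can assume a Scalar.
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data * other.data)

a = Scalar(3.0)
print((a + 1).data)  # 4.0
print((a * 2).data)  # 6.0
```

Note that this only covers a number on the right-hand side; a number on the left (such as 2 * a) needs the reverse operators discussed later in this note.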

Division through exponentiation

Division does not need a separate primitive operation. It can be expressed using the identity:

$$\frac{a}{b} = a \cdot b^{-1}$$

Therefore, implementing exponentiation by a scalar constant is enough to support division. In particular, division becomes the special case where the exponent is -1.

The implementation only supports powers where the exponent is an int or a float. This keeps the operation simple: self is a Value, but other is treated as a fixed scalar constant.

For example, in:

$$y = x^k$$

the current implementation supports the case where x is a Value node and k is a plain Python number.

In that case, the power node has only one differentiable input: the base x. Therefore, its local derivative is:

$$\frac{\partial y}{\partial x} = k \, x^{k-1}$$

and the backward rule only needs to add a contribution to x.grad.
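The power rule can be checked numerically with a central finite difference. The sketch below uses plain floats and a hypothetical helper, power_local_grad; the sample point x = 3.0 and the step size are arbitrary choices. The case k = -1 is the one that division relies on.

```python
import math

def power_local_grad(x, k):
    """Hypothetical helper: local derivative of y = x**k with respect to x (k constant)."""
    return k * x ** (k - 1)

x, h = 3.0, 1e-6
for k in (2, 3, 0.5, -1):
    # Central difference approximates dy/dx independently of the formula above.
    numeric = ((x + h) ** k - (x - h) ** k) / (2 * h)
    assert math.isclose(power_local_grad(x, k), numeric, rel_tol=1e-6)
```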

If the exponent were a Value

If the exponent k were also a Value, then the operation would have two differentiable inputs: the base x and the exponent k.

The backward pass would need to update both gradients separately. In both updates, the multiplication by y.grad comes from the chain rule: the local derivative of the power node is multiplied by the gradient arriving from the final output.

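For the base:

$$x.\mathrm{grad} \mathrel{+}= k \, x^{k-1} \cdot y.\mathrm{grad}$$

For the exponent:

$$k.\mathrm{grad} \mathrel{+}= x^{k} \ln(x) \cdot y.\mathrm{grad}$$

The second local derivative would be:

$$\frac{\partial y}{\partial k} = x^{k} \ln(x)$$

which is defined only for a positive base x.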
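Although the Value class in this note does not support a differentiable exponent, both local derivatives can be verified on plain floats. The following is a sketch under the assumption that x > 0; the helper name pow_local_grads is hypothetical.

```python
import math

def pow_local_grads(x, k):
    """Hypothetical helper: local derivatives of y = x**k with respect to x and k.

    Assumes x > 0 so that log(x) is defined.
    """
    dy_dx = k * x ** (k - 1)
    dy_dk = x ** k * math.log(x)
    return dy_dx, dy_dk

x, k, h = 2.0, 3.0, 1e-6
dy_dx, dy_dk = pow_local_grads(x, k)

# Central finite differences in each input, holding the other fixed.
num_dx = ((x + h) ** k - (x - h) ** k) / (2 * h)
num_dk = (x ** (k + h) - x ** (k - h)) / (2 * h)
assert math.isclose(dy_dx, num_dx, rel_tol=1e-6)
assert math.isclose(dy_dk, num_dk, rel_tol=1e-6)

# In a backward pass, each local derivative is scaled by the upstream gradient.
y_grad = 1.0
x_grad = dy_dx * y_grad
k_grad = dy_dk * y_grad
```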

For this implementation, only scalar powers are supported, so the local derivative of:

$$y = x^k$$

is simply:

$$\frac{\partial y}{\partial x} = k \, x^{k-1}$$

During backpropagation, this local derivative is multiplied by the incoming gradient out.grad.

Reverse multiplication

The method __rmul__ handles multiplication when the Value object appears on the right side of *.

For example, when Python evaluates:

2 * a

it first asks the integer 2 whether it knows how to multiply itself by a Value object. A normal integer does not know how to do that. Python then asks the right-hand object a whether it can handle the reverse operation by calling:

a.__rmul__(2)

The implementation returns:

self * other

which evaluates the expression as a * 2. The existing __mul__ method already knows how to wrap the raw number 2 into Value(2) and finish the computation.
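The dispatch described above can be observed directly. The sketch below again uses a hypothetical forward-only class named Scalar rather than the full Value class; int.__mul__ and the NotImplemented sentinel are standard Python.

```python
class Scalar:
    """Hypothetical, forward-only stand-in for Value."""

    def __init__(self, data):
        self.data = data

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data * other.data)

    def __rmul__(self, other):
        # Called for `other * self` when `other` cannot handle the product.
        return self * other

a = Scalar(3.0)

# The integer does not know about Scalar and signals that with NotImplemented...
assert (2).__mul__(a) is NotImplemented
# ...so Python falls back to a.__rmul__(2), which reuses __mul__.
assert (2 * a).data == 6.0
```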

Other convenience operators

Reverse addition can be implemented as normal addition because addition is commutative:

$$b + a = a + b$$

Negation can be implemented as multiplication by -1, and subtraction can be implemented as addition of the negated value:

$$-a = a \cdot (-1), \qquad a - b = a + (-b)$$
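These identities can be tried out on a hypothetical forward-only class (named Scalar here for illustration). One caveat: the class in this note defines subtraction only for a Value on the left, so an expression such as 1 - a is not covered. The __rsub__ method below is an assumption about how that case could be handled; it is not part of the notes' implementation and is not needed for the tanh example.

```python
class Scalar:
    """Hypothetical, forward-only stand-in for Value."""

    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data * other.data)

    def __radd__(self, other):   # other + self
        return self + other

    def __neg__(self):           # -self
        return self * -1

    def __sub__(self, other):    # self - other
        return self + (-other)

    def __rsub__(self, other):   # other - self (assumed extension, not in the notes)
        return (-self) + other

a, b = Scalar(5.0), Scalar(2.0)
assert (1 + a).data == 6.0
assert (-a).data == -5.0
assert (a - b).data == 3.0
assert (a - 1).data == 4.0
assert (1 - a).data == -4.0
```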

Exponential operation

The new exp() method computes the exponential of the scalar stored in self.data and creates an exp node in the graph. The local derivative of $e^x$ is $e^x$ itself. Since the output node already stores exp(x) in out.data, the backward rule can use out.data * out.grad.
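The reuse of the forward result in the backward rule can be sketched with a minimal node type. Node and exp_node are hypothetical names for illustration; the upstream gradient of 2.0 is set by hand to stand in for whatever would arrive from later nodes in a real graph.

```python
import math

class Node:
    """Hypothetical minimal node: data, grad, and a backward closure."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

def exp_node(x):
    out = Node(math.exp(x.data))

    def _backward():
        # d/dx e^x = e^x, which is already stored in out.data.
        x.grad += out.data * out.grad
    out._backward = _backward
    return out

x = Node(1.3)
y = exp_node(x)
y.grad = 2.0      # pretend an upstream gradient of 2 arrives at y
y._backward()
assert math.isclose(x.grad, 2.0 * math.exp(1.3))
```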

import math  # math.exp is used by tanh() and exp()

class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._backward = lambda: None 
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')
 
    def _backward(): 
      self.grad  += out.grad * 1.0  
      other.grad += out.grad * 1.0  
    out._backward = _backward
 
    return out
 
  def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')
 
    def _backward(): 
      self.grad  += out.grad * other.data  
      other.grad += out.grad * self.data   
    out._backward = _backward
 
    return out
  
 
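  def __pow__(self, other):
    # The exponent is a fixed constant, not a graph node, so only self receives a gradient.
    assert isinstance(other, (int, float)), "only supporting int/float powers for now"
    out = Value(self.data ** other, (self,), f'**{other}')
 
    def _backward():
      # Power rule: d(x**k)/dx = k * x**(k-1), scaled by the upstream gradient.
      self.grad += other * (self.data ** (other - 1)) * out.grad
    out._backward = _backward
 
    return out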
 
  def __rmul__(self, other):  # other * self
    return self * other
  
  def __radd__(self, other):  # other + self
    return self + other
 
  def __truediv__(self, other):  # self / other, computed as self * other**-1
    return self * other**-1
  
  def __neg__(self):  # -self
    return self * -1
  
  def __sub__(self, other):  # self - other
    return self + (-other)
 
  def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh') 
 
    def _backward(): 
      self.grad += (1 - t**2) * out.grad 
    out._backward = _backward
 
 
    return out
  
 
  def exp(self):
    x = self.data
    out = Value(math.exp(x), (self,), 'exp')
 
    def _backward():
      self.grad += out.data * out.grad
    out._backward = _backward
 
    return out
  
  
  def backward(self):
 
    topo = []
    visited = set()
    def build_topo(v):
      if v not in visited:
        visited.add(v)
        for child in v._prev:
          build_topo(child)
        topo.append(v)
    build_topo(self) 
    
    self.grad = 1.0 
    for node in reversed(topo): 
      node._backward() 

Rebuilding the neuron with decomposed tanh

The same neuron example is rebuilt.

It has two inputs:

$$x_1 = 2.0, \qquad x_2 = 0.0$$

two weights:

$$w_1 = -3.0, \qquad w_2 = 1.0$$

and the same bias value:

$$b = 6.8813735870195432$$

This bias is chosen so that backpropagation produces convenient round numbers. Changing it is a useful way to see how the forward value and gradients change.

The neuron first computes:

$$n = x_1 w_1 + x_2 w_2 + b$$

Then, instead of using n.tanh(), it builds the output through the exponential form of tanh:

$$o = \frac{e^{2n} - 1}{e^{2n} + 1}$$

x1 = Value(2.0, label = 'x1')
x2 = Value(0.0, label = 'x2')
w1 = Value(-3.0,label ='w1')
w2 = Value(1.0, label = 'w2')
b = Value(6.8813735870195432, label = 'b')
x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
 
e = (2*n).exp() 
o = (e - 1) / (e + 1)
 
o.label = 'o'
o.backward()
draw_dot(o)
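One way to confirm that the decomposed graph behaves like the single tanh node is to compare its output and gradients against the closed-form tanh and its derivative. The sketch below is self-contained and therefore repeats a condensed copy of the class (labels, _op, __repr__, and tanh are left out, and draw_dot is not used); treat it as an illustrative check rather than as part of the notes' code.

```python
import math

class Value:
    """Condensed copy of the class above (labels, _op, repr, and tanh omitted)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only int/float powers"
        out = Value(self.data ** other, (self,))
        def _backward():
            self.grad += other * self.data ** (other - 1) * out.grad
        out._backward = _backward
        return out

    def __rmul__(self, other):
        return self * other

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __truediv__(self, other):
        return self * other ** -1

    def exp(self):
        out = Value(math.exp(self.data), (self,))
        def _backward():
            self.grad += out.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

x1, x2 = Value(2.0), Value(0.0)
w1, w2 = Value(-3.0), Value(1.0)
b = Value(6.8813735870195432)
n = x1 * w1 + x2 * w2 + b

e = (2 * n).exp()          # e feeds both (e - 1) and (e + 1)
o = (e - 1) / (e + 1)
o.backward()

# Reference values from the closed-form tanh and its derivative.
t = math.tanh(n.data)
assert math.isclose(o.data, t, rel_tol=1e-9)
assert math.isclose(n.grad, 1 - t ** 2, rel_tol=1e-9)
assert math.isclose(x1.grad, w1.data * n.grad)
assert math.isclose(w1.grad, x1.data * n.grad)
```

Because e is used twice, its gradient receives two contributions; the += accumulation in each backward rule is what makes the decomposed graph produce the same result as the single tanh node (an output of about 0.7071 and a gradient of about 0.5 on n for this bias).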

Takeaway

This exercise has two main lessons.

  • Rebuilding tanh from its atomic components gives a useful opportunity to practice extending the small autograd engine with new operations, such as exponentiation, exponentials, and their corresponding derivatives. In this sense, tanh is not only an activation function to reproduce, but also a concrete example for testing and strengthening the machinery of automatic differentiation.
  • This example also shows that the level of abstraction of the operations implemented in the graph is a design choice. At first, tanh was implemented directly as a single operation. Later, it was decomposed into more elementary operations, such as exponentials, subtraction, and division. Both approaches are valid: the difference is the granularity at which the computational graph is built.

What matters for automatic differentiation

Each node produces an output as a function of its input values.

What matters for automatic differentiation is not the specific operation itself, but whether the operation can perform its forward pass and compute the corresponding local gradients during the backward pass.

Once those local gradients are available, the chain rule can connect them together and backpropagation can continue through the graph. Therefore, the function connecting a node’s inputs to its output, and the level of abstraction used to represent it, are design choices left to the programmer.