Read the Overview First

This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

Building the Value Object

Neural networks can be viewed as large mathematical expressions built by composing many small scalar operations. Plain Python numbers are enough to compute the final value of such an expression, but they do not remember the operations that produced it. Once backpropagation enters the picture, that memory becomes essential: each intermediate value must know not only its numerical data, but also how it depends on earlier values.

Micrograd introduces this memory through the Value class. A Value starts as a thin wrapper around a scalar number, but it will gradually become a node in a computation graph: it will store its data, remember the operation that created it, keep references to its parent values, and eventually hold the gradient of the final output with respect to itself.

Value class: first version

class Value:
 
  def __init__(self, data):
    self.data = data
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data)
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data)
    return out
 

What this first version does

This first version of Value is only a scalar wrapper. Each object stores one number in self.data.

  • __init__ is called when a new Value is created and saves the input number inside the object.
  • __repr__ controls how a Value is displayed, so printing it shows something readable like Value(data=2.0).
  • __add__ defines what happens when two Value objects are added with a + b.
  • __mul__ defines what happens when two Value objects are multiplied with a * b.

These special Python methods are called dunder methods because their names start and end with double underscores. Python calls them automatically behind the scenes:

  • creating Value(2.0) calls __init__
  • displaying a Value calls __repr__
  • writing a + b calls a.__add__(b), and writing a * b calls a.__mul__(b).

In both __add__ and __mul__, the raw scalar values are read from .data, combined using normal Python arithmetic, and wrapped back into a new Value. This keeps the result inside the same custom data structure instead of falling back to a plain Python number.
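
As a minimal sketch of the dispatch described above, the snippet below reuses the first-version Value class and compares the operator form with the explicit dunder call. The variable names via_operator, via_method, and mixed_ok are chosen here for illustration only. It also shows a limitation of this first version: because __add__ reads other.data, adding a plain Python number to a Value fails.

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)


a = Value(2.0)
b = Value(-3.0)

# The operator and the explicit method call perform the same operation.
via_operator = a + b
via_method = a.__add__(b)
print(via_operator, via_method)  # Value(data=-1.0) Value(data=-1.0)

# The result is a new Value, not a plain float.
print(type(a * b).__name__)  # Value

# A plain number has no .data attribute, so this first version rejects it.
try:
    a + 1.0
    mixed_ok = True
except AttributeError:
    mixed_ok = False
print(mixed_ok)  # False
```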

Testing scalar arithmetic

At this point, a few Value objects can be instantiated to check that the first version behaves like a tiny scalar arithmetic system: it can display values, add them, and multiply them using ordinary Python syntax.

a = Value(2.0)
a # Python will internally use the __repr__ function to return the string "Value(data={self.data})"
Value(data=2.0)
b = Value(-3.0)
c = Value(10.0)
print('a + b =',a + b) # Python will internally call a.__add__(b)  (other = b and self =a)
d = a*b + c  # equivalent to (a.__mul__(b)).__add__(c)
print('d = a*b +c =',d)
a + b = Value(data=-1.0)
d = a*b +c = Value(data=4.0)

Recording the Computation Graph

Value class: second version

class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+') 
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    return out
 

The first version of Value only wrapped a scalar number. This second version starts turning each Value into a node of a computation graph: every result now remembers which values produced it and which operation was used. This is necessary because backpropagation requires knowing the local structure of the computation graph in order to move backward from the final output to the values that contributed to it.

What changed in this version

| Attribute | Meaning |
| --- | --- |
| data | The scalar value stored in the node. |
| _children | The input Value objects passed to the constructor when a new value is produced by an operation. |
| _prev | The internal set of direct predecessor nodes, built from _children. |
| _op | The operation that produced the current node, encoded as a string, such as '+' or '*'. |
| label | An optional name used later to make graph visualizations easier to read. |

In __add__ and __mul__, the line out = Value(...) constructs the Value that represents the result of the operation. The constructor receives the numeric result, the two Value objects that produced it, and the symbol of the operation. For example, Value(self.data * other.data, (self, other), '*') stores the result of the multiplication, remembers self and other as the parent nodes, and records '*' as the local operation.

A directly created value such as Value(2.0) is a leaf node: no previous values produced it and no operation created it. Therefore _children=(), _prev=set(), and _op=''.

The distinction between _children and _prev is intentional: _children is the constructor argument, while _prev is the internal set used by the object to remember its direct predecessors.

By storing this local history, each Value object becomes a traceable node in the computation graph. Later, backpropagation will traverse this graph backward to propagate gradients from the final output back to the values that produced it.
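
The difference between a leaf node and a node produced by an operation can be checked directly. The sketch below reuses the second-version Value class; the values x, y, p, and s are hypothetical and exist only for this illustration. It also shows one consequence of storing predecessors in a set: when the same object is used twice in one operation, it appears only once in _prev.

```python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


x = Value(3.0, label='x')  # hypothetical leaf values
y = Value(4.0, label='y')

# A leaf has no predecessors and no operation.
print(x._prev, repr(x._op))  # set() ''

# A product remembers both inputs and the operation symbol.
p = x * y
print(sorted(v.label for v in p._prev), p._op)  # ['x', 'y'] *

# Because _prev is a set, using the same object twice stores it once.
s = x + x
print(len(s._prev), s.data)  # 1 6.0
```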

Building the example graph

The updated Value class can now be used to build a small scalar expression while preserving its local graph structure. In the example below, a, b, c, and f are created directly as leaf nodes. The intermediate values e, d, and L are then produced by arithmetic operations. The labels do not affect the computation; they are only names that will make the graph easier to read when it is visualized.

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L
Value(data=-8.0)

Inspecting _prev and _op

The _prev attribute can be inspected to check which Value objects directly produced a given node. In this case, d was created by adding e and c, where e is the result of a * b. Therefore, _prev contains the two direct inputs to that addition: Value(data=-6.0) and Value(data=10.0).

d._prev
{Value(data=-6.0), Value(data=10.0)}

The _op attribute records the operation that produced the node. Since d was produced by adding two Value objects, its operation symbol is '+'.

d._op
'+'
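
Since __repr__ shows only data and a set has no guaranteed display order, reading _prev directly can be hard to follow on larger graphs. One possible convenience, sketched below, is a small helper that summarizes a node by label. The function describe is hypothetical and not part of the Value class; it sorts the input labels only to make the output stable.

```python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


def describe(node):
    """Hypothetical helper: summarize a node's operation and direct inputs."""
    if not node._op:
        return f"{node.label} (leaf)"
    # Sort labels because a set does not preserve insertion order.
    inputs = sorted(child.label for child in node._prev)
    return f"{node.label} <- {node._op} {inputs}"


a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'

print(describe(d))  # d <- + ['c', 'e']
print(describe(e))  # e <- * ['a', 'b']
print(describe(a))  # a (leaf)
```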

Visualizing the computation graph

Now that each Value remembers its direct predecessors in _prev and the operation that produced it in _op, the full computation graph can be visualized. The goal is to start from the final output L, trace backward through all the values that contributed to it, and render those relationships as a Graphviz diagram.

This visualization does not change the computation itself. It simply provides a readable picture of the forward pass: input values flow into operations, operations produce new values, and the final node sits at the end of the expression.

from graphviz import Digraph
 
def trace(root):
  # Walk backward from root and collect the graph that produced it.
  nodes, edges = set(), set()
 
  def build(v):
    # Visit each Value object only once, even if it is reused downstream.
    if v not in nodes:
      nodes.add(v)
 
      # _prev stores the direct inputs that were used to create v.
      for child in v._prev:
        edges.add((child, v))
        build(child)
 
  build(root)
  return nodes, edges
 
 
def draw_dot(root):
  # Draw the graph left-to-right, matching the forward flow of computation.
  dot = Digraph(format='svg', graph_attr={
      'rankdir': 'LR',
      'bgcolor': 'transparent',
      'fontname': 'sans-serif'
  })
 
  nodes, edges = trace(root)
  for n in nodes:
    # Use the Python object id as a unique Graphviz node name.
    uid = str(id(n))
 
    # Show the human label and the scalar value stored in this Value object.
    label = "{ %s | data %.4f}" % (n.label, n.data)
 
    if n._op:
        # Non-leaf Values were produced by an operation, so draw the Value node
        # and a separate operation node for readability.
        dot.node(name=uid, label=label, shape='Mrecord',
                 fillcolor='#40C4FF', color='#00B0FF', style='filled', fontname='sans-serif')
        dot.node(name=uid + n._op, label=n._op, shape='circle',
                 fillcolor='#FFEA00', color='#FFD600', style='filled', fontname='sans-serif')
        dot.edge(uid + n._op, uid, color='#0033FF', penwidth='2.0')
    else:
        # Leaf Values were created directly, so no operation node is needed.
        dot.node(name=uid, label=label, shape='Mrecord',
                 fillcolor='#00E676', color='#00C853', style='filled', fontname='sans-serif')
 
  for n1, n2 in edges:
    # Draw each dependency as input Value -> operation -> result Value.
    dot.edge(str(id(n1)), str(id(n2)) + n2._op, color='#D500F9', penwidth='2.0')
 
  return dot
 
 
draw_dot(L)

Value nodes vs operation nodes

The real computation graph is made of Value objects: they store data, _prev, _op, and label. The yellow operation nodes are only a visualization trick. They make the graph easier to read as input Value -> operation -> result Value.

How draw_dot(root) reads the graph

draw_dot first calls trace(root), then draws each Value as a record-shaped node. Leaf values are shown in green because they were created directly. Values produced by operations are shown in blue, with a separate yellow node for the operation stored in _op.
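
The traversal step can also be checked without Graphviz. The sketch below repeats the second-version Value class and the trace function so that it runs on its own, then counts what trace collects for the example expression. The names node_labels and edge_labels are introduced here for illustration.

```python
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')


def trace(root):
    # Walk backward from root, collecting every node and every (input, result) edge.
    nodes, edges = set(), set()

    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)

    build(root)
    return nodes, edges


a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

nodes, edges = trace(L)
node_labels = sorted(n.label for n in nodes)
edge_labels = sorted((src.label, dst.label) for src, dst in edges)

print(len(nodes), len(edges))  # 7 6
print(node_labels)  # ['L', 'a', 'b', 'c', 'd', 'e', 'f']
print(edge_labels)
```

Each edge points from an input Value to the Value it helped produce, which is the direction draw_dot uses when it routes the edge through the operation node.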

For more details about the Graphviz-specific parts of the code, such as Digraph, node, edge, and graph_attr, consult the Python Graphviz API reference.

Forward pass recap

At this stage, the computation graph can represent mathematical expressions built from two scalar operations:

  • addition: +
  • multiplication: *

Every quantity in the graph is still a scalar. Even with only these two operations, the graph is already expressive enough to perform a complete forward pass and compute a final output L.

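The computation starts from several leaf values:

$$a = 2.0, \quad b = -3.0, \quad c = 10.0, \quad f = -2.0$$

These values are combined step by step into intermediate values e and d, and finally into one single scalar output:

$$e = a \cdot b = -6.0, \qquad d = e + c = 4.0, \qquad L = d \cdot f = -8.0$$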

The graph above represents this forward pass: information flows from the input values, through the operations, until the final result is produced.
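
For comparison, the same forward pass can be written with plain Python floats. This short sketch uses the example values from this note and produces the same numbers, but none of the intermediate results remember how they were computed, which is exactly what the Value class adds.

```python
# Leaf values from the example graph, as plain floats.
a, b, c, f = 2.0, -3.0, 10.0, -2.0

e = a * b  # intermediate product
d = e + c  # intermediate sum
L = d * f  # final scalar output

print(e, d, L)  # -6.0 4.0 -8.0
```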


Preparing for Backpropagation

With the forward graph in place, the next step is backpropagation. Backpropagation starts from the node of the computation graph representing the final output L.

Starting point

The starting point is the derivative of L with respect to itself:

$$\frac{\partial L}{\partial L} = 1$$

This anchors the backward pass: the output changes one-for-one with itself.

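From there, the backward pass continues through the graph and computes the derivative of L with respect to every intermediate and leaf value:

$$\frac{\partial L}{\partial d}, \quad \frac{\partial L}{\partial f}, \quad \frac{\partial L}{\partial e}, \quad \frac{\partial L}{\partial c}, \quad \frac{\partial L}{\partial a}, \quad \frac{\partial L}{\partial b}$$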

Leaf nodes: weights and data

In the computation graph, some leaf nodes will represent:

  • the weights of the neural network
  • the input data

Usually, the gradients with respect to the weights are the ones that matter most, i.e. the derivatives of the loss function with respect to the weights, because those are the values updated during training.

The input data is normally fixed, so the gradient with respect to it is usually not used for parameter updates.

Instead, gradient information is used to iteratively adjust the weights and reduce the loss.
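
To make the idea of a weight update concrete, here is a minimal sketch with plain floats. Everything in it is an assumption made for illustration: a hypothetical one-weight model with prediction w * x, a squared-error loss, a hand-derived gradient, and an arbitrarily chosen learning rate. It is not the training loop used later in the series.

```python
# Hypothetical setup: one weight w, one fixed input x, one target y.
x, y = 3.0, 6.0
w = 1.0
learning_rate = 0.01  # assumed value, chosen only for this example


def loss(weight):
    # Squared error between the prediction weight * x and the target y.
    return (weight * x - y) ** 2


# Derivative of the loss with respect to w, derived by hand:
# d/dw (w*x - y)^2 = 2 * (w*x - y) * x
grad_w = 2.0 * (w * x - y) * x

# One update step: move the weight against the gradient.
w_new = w - learning_rate * grad_w

print(grad_w)                 # -18.0
print(loss(w_new) < loss(w))  # True
```

Only w is updated. The input x and the target y stay fixed, which mirrors the distinction between weights and data described above.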

Value class: third version

class Value:
 
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    # Gradient of the final output L with respect to this Value.
    self.grad = 0.0
 
    # Keep the local graph structure needed to traverse the expression backward.
    self._prev = set(_children)
    self._op = _op
    self.label = label
 
  def __repr__(self):
    return f"Value(data={self.data})"
 
  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')
    return out
 
  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')
    return out
 

The role of grad

Each Value has a grad attribute that stores the derivative of the final output L with respect to that specific Value.

For example, if a Value represents a, then a.grad will eventually hold:

$$\frac{\partial L}{\partial a}$$

It starts at 0.0 because, before backpropagation, no local sensitivity of L with respect to this Value has been computed yet. Initializing every gradient to 0.0 means that, at the start, each node is treated as if a tiny change to that value does not affect the output L. This matches the derivative intuition from Derivatives Refresh: a derivative measures how much the output changes when the input is nudged by a very small amount.
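
The nudging intuition can be turned into a quick numerical estimate of the value that a.grad is expected to hold after backpropagation. The sketch below assumes the example expression from this note, L = (a*b + c)*f, written as a plain function. The names forward, h, and estimate are introduced here for illustration, and the step size is an arbitrary small number.

```python
def forward(a, b, c, f):
    # Same expression as the example graph: e = a*b, d = e + c, L = d*f.
    return (a * b + c) * f


h = 1e-4  # assumed small nudge

base = forward(2.0, -3.0, 10.0, -2.0)
nudged = forward(2.0 + h, -3.0, 10.0, -2.0)

# Finite-difference estimate of the derivative of L with respect to a.
estimate = (nudged - base) / h

print(base)                # -8.0
print(round(estimate, 4))  # 6.0
```

The estimate agrees with the analytic result b * f = (-3.0) * (-2.0) = 6.0, since L depends on a only through the product a*b, which is then scaled by f.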

Rebuilding the graph with grad

The previous code block is now executed again using the updated Value class.

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L
Value(data=-8.0)

Redrawing the graph

Now that the graph has been rebuilt with the updated objects, it can be drawn again using a slightly extended visualization that also shows the new grad field for every node.

from graphviz import Digraph
 
def trace(root):
  nodes, edges = set(), set()
  def build(v):
    if v not in nodes:
      nodes.add(v)
      for child in v._prev:
        edges.add((child, v))
        build(child)
  build(root)
  return nodes, edges
 
def draw_dot(root):
  dot = Digraph(format='svg', graph_attr={
      'rankdir': 'LR', 
      'bgcolor': 'transparent',
      'fontname': 'sans-serif'
  }) 
 
  nodes, edges = trace(root)
  for n in nodes:
    uid = str(id(n))
    
    label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad)
    
    if n._op:
        dot.node(name=uid, label=label, shape='Mrecord', 
                 fillcolor='#40C4FF', color='#00B0FF', style='filled', fontname='sans-serif')
        dot.node(name=uid + n._op, label=n._op, shape='circle', 
                 fillcolor='#FFEA00', color='#FFD600', style='filled', fontname='sans-serif')
        dot.edge(uid + n._op, uid, color='#0033FF', penwidth='2.0')
    else:    
        dot.node(name=uid, label=label, shape='Mrecord', 
                 fillcolor='#00E676', color='#00C853', style='filled', fontname='sans-serif')
 
  for n1, n2 in edges:
    dot.edge(str(id(n1)), str(id(n2)) + n2._op, color='#D500F9', penwidth='2.0')
 
  return dot
 
 
 
draw_dot(L)

What changes in the graph visualization

Now every object of the Value class has a grad attribute. This means that every node in the computation graph can store both its scalar value and its gradient. For each node, two pieces of information can now be displayed:

  • data: the actual scalar value stored in the node
  • grad: the gradient of the final output with respect to that node
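
The only change in draw_dot is the label string. As a small sketch, the same format expression can be evaluated on its own with example values (here the label, data, and initial grad of node a) to see the text that is passed to Graphviz. In a record-shaped Graphviz node, the | characters separate the fields shown inside the box.

```python
# Same format string as in the updated draw_dot, applied to example values.
label = "{ %s | data %.4f | grad %.4f }" % ('a', 2.0, 0.0)

print(label)  # { a | data 2.0000 | grad 0.0000 }
```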

Ready for the backward pass

Forward values first, gradients next

At this point, the forward pass gives the value of each node. The next step is to fill in the grad values by running the backward pass.

The graph is therefore ready for performing backpropagation.