Read the Overview First
This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.
Building the Value Object
Neural networks can be viewed as large mathematical expressions built by composing many small scalar operations. Plain Python numbers are enough to compute the final value of such an expression, but they do not remember the operations that produced it. Once backpropagation enters the picture, that memory becomes essential: each intermediate value must know not only its numerical data, but also how it depends on earlier values.
Micrograd introduces this memory through the Value class. A Value starts as a thin wrapper around a scalar number, but it will gradually become a node in a computation graph: it will store its data, remember the operation that created it, keep references to its parent values, and eventually hold the gradient of the final output with respect to itself.
Value class: first version
class Value:
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f"Value(data={self.data})"
    def __add__(self, other):
        out = Value(self.data + other.data)
        return out
    def __mul__(self, other):
        out = Value(self.data * other.data)
        return out
What this 1st version does
This first version of `Value` is only a scalar wrapper. Each object stores one number in `self.data`.
- `__init__` is called when a new `Value` is created and saves the input number inside the object.
- `__repr__` controls how a `Value` is displayed, so printing it shows something readable like `Value(data=2.0)`.
- `__add__` defines what happens when two `Value` objects are added with `a + b`.
- `__mul__` defines what happens when two `Value` objects are multiplied with `a * b`.
These special Python methods are called dunder methods because their names start and end with double underscores. Python calls them automatically behind the scenes:
- creating `Value(2.0)` calls `__init__`
- displaying a `Value` calls `__repr__`
- writing `a + b` calls `a.__add__(b)`, and writing `a * b` calls `a.__mul__(b)`.
In both `__add__` and `__mul__`, the raw scalar values are read from `.data`, combined using normal Python arithmetic, and wrapped back into a new `Value`. This keeps the result inside the same custom data structure instead of falling back to a plain Python number.
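As a quick check of the dispatch described above, the following sketch repeats the first-version class so that it runs on its own, then compares operator syntax with an explicit dunder call. The variable names `via_operator` and `via_dunder` are illustrative and not part of the original notes.

```python
class Value:
    """First-version scalar wrapper, repeated so the sketch is self-contained."""
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return f"Value(data={self.data})"
    def __add__(self, other):
        return Value(self.data + other.data)
    def __mul__(self, other):
        return Value(self.data * other.data)

a, b = Value(2.0), Value(-3.0)

# Operator syntax and the explicit dunder call wrap the same number.
via_operator = a + b
via_dunder = a.__add__(b)
print(via_operator, via_dunder)

# The result of an operation is a new Value, not a plain Python float.
product = a * b
print(type(product).__name__)
```

Both paths build a fresh `Value`, which is what allows later versions of the class to attach extra bookkeeping to every result.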
Testing scalar arithmetic
At this point, a few Value objects can be instantiated to check that the first version behaves like a tiny scalar arithmetic system: it can display values, add them, and multiply them using ordinary Python syntax.
a = Value(2.0)
a  # Python internally calls __repr__, which returns the string "Value(data=2.0)"
Output:
Value(data=2.0)
b = Value(-3.0)
c = Value(10.0)
print('a + b =', a + b)  # Python internally calls a.__add__(b) (self = a, other = b)
d = a*b + c  # equivalent to (a.__mul__(b)).__add__(c)
print('d = a*b + c =', d)
Output:
a + b = Value(data=-1.0)
d = a*b + c = Value(data=4.0)
Recording the Computation Graph
Value class: second version
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
    def __repr__(self):
        return f"Value(data={self.data})"
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out
The first version of Value only wrapped a scalar number. This second version starts turning each Value into a node of a computation graph: every result now remembers which values produced it and which operation was used. This is necessary because backpropagation requires knowing the local structure of the computation graph in order to move backward from the final output to the values that contributed to it.
What changed in this version
| Attribute | Meaning |
| --- | --- |
| `data` | The scalar value stored in the node. |
| `_children` | The input `Value` objects passed to the constructor when a new value is produced by an operation. |
| `_prev` | The internal set of direct predecessor nodes, built from `_children`. |
| `_op` | The operation that produced the current node, encoded as a string, such as `'+'` or `'*'`. |
| `label` | An optional name used later to make graph visualizations easier to read. |

In `__add__` and `__mul__`, the line `out = Value(...)` constructs the `Value` that represents the result of the operation. The constructor receives the numeric result, the two `Value` objects that produced it, and the symbol of the operation. For example, `Value(self.data * other.data, (self, other), '*')` stores the result of the multiplication, remembers `self` and `other` as the parent nodes, and records `'*'` as the local operation.
A directly created value such as `Value(2.0)` is a leaf node: no previous values produced it and no operation created it. Therefore `_children=()`, `_prev=set()`, and `_op=''`.
The distinction between `_children` and `_prev` is intentional: `_children` is the constructor argument, while `_prev` is the internal set used by the object to remember its direct predecessors.
By storing this local history, each Value object becomes a traceable node in the computation graph. Later, backpropagation will traverse this graph backward to propagate gradients from the final output back to the values that produced it.
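The difference between a leaf node and a node produced by an operation can be checked directly. The sketch below repeats the second-version class so that it is self-contained; the names `x`, `y`, and `z` are illustrative.

```python
class Value:
    """Second-version Value, repeated so the sketch runs on its own."""
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
    def __repr__(self):
        return f"Value(data={self.data})"
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

x = Value(2.0, label='x')    # leaf: created directly
y = Value(-3.0, label='y')   # leaf: created directly
z = x * y                    # produced by an operation

# A leaf has no predecessors and an empty operation symbol.
print(x._prev, repr(x._op))

# The product remembers both inputs (the very same objects) and the '*' symbol.
print(z._prev == {x, y}, z._op)
```

Because `Value` defines neither `__eq__` nor `__hash__`, the `_prev` set distinguishes objects by identity, so two leaves holding the same number would still be stored as separate predecessors.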
Building the example graph
The updated Value class can now be used to build a small scalar expression while preserving its local graph structure. In the example below, a, b, c, and f are created directly as leaf nodes. The intermediate values e, d, and L are then produced by arithmetic operations. The labels do not affect the computation; they are only names that will make the graph easier to read when it is visualized.
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L
Output:
Value(data=-8.0)
Inspecting _prev and _op
The _prev attribute can be inspected to check which Value objects directly produced a given node. In this case, d was created by adding e and c, where e is the result of a * b. Therefore, _prev contains the two direct inputs to that addition: Value(data=-6.0) and Value(data=10.0).
d._prev
Output:
{Value(data=-6.0), Value(data=10.0)}
The _op attribute records the operation that produced the node. Since d was produced by adding two Value objects, its operation symbol is '+'.
d._op
Output:
'+'
Visualizing the computation graph
Now that each Value remembers its direct predecessors in _prev and the operation that produced it in _op, the full computation graph can be visualized. The goal is to start from the final output $L$, trace backward through all the values that contributed to it, and render those relationships as a Graphviz diagram.
This visualization does not change the computation itself. It simply provides a readable picture of the forward pass: input values flow into operations, operations produce new values, and the final node sits at the end of the expression.
from graphviz import Digraph
def trace(root):
    # Walk backward from root and collect the graph that produced it.
    nodes, edges = set(), set()
    def build(v):
        # Visit each Value object only once, even if it is reused downstream.
        if v not in nodes:
            nodes.add(v)
            # _prev stores the direct inputs that were used to create v.
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges
def draw_dot(root):
    # Draw the graph left-to-right, matching the forward flow of computation.
    dot = Digraph(format='svg', graph_attr={
        'rankdir': 'LR',
        'bgcolor': 'transparent',
        'fontname': 'sans-serif'
    })
    nodes, edges = trace(root)
    for n in nodes:
        # Use the Python object id as a unique Graphviz node name.
        uid = str(id(n))
        # Show the human label and the scalar value stored in this Value object.
        label = "{ %s | data %.4f}" % (n.label, n.data)
        if n._op:
            # Non-leaf Values were produced by an operation, so draw the Value node
            # and a separate operation node for readability.
            dot.node(name=uid, label=label, shape='Mrecord',
                     fillcolor='#40C4FF', color='#00B0FF', style='filled', fontname='sans-serif')
            dot.node(name=uid + n._op, label=n._op, shape='circle',
                     fillcolor='#FFEA00', color='#FFD600', style='filled', fontname='sans-serif')
            dot.edge(uid + n._op, uid, color='#0033FF', penwidth='2.0')
        else:
            # Leaf Values were created directly, so no operation node is needed.
            dot.node(name=uid, label=label, shape='Mrecord',
                     fillcolor='#00E676', color='#00C853', style='filled', fontname='sans-serif')
    for n1, n2 in edges:
        # Draw each dependency as input Value -> operation -> result Value.
        dot.edge(str(id(n1)), str(id(n2)) + n2._op, color='#D500F9', penwidth='2.0')
    return dot
draw_dot(L)
How `trace(root)` works
`trace` starts from the final output node, `root` (`L` in this example), and recursively follows each node’s `_prev` links backward. It returns two sets: `nodes`, containing all reachable `Value` objects, and `edges`, containing pairs `(child, v)` in which `child` is a direct input of `v`, to describe which values produced which result.
Why sets are used
A computation graph is not always a simple tree: the same `Value` object can be reached through more than one path when tracing backward from the final output. The `nodes` set therefore works as a visited set, ensuring that each object is traversed and drawn only once. The `edges` set similarly avoids drawing the same dependency more than once.
Value nodes vs operation nodes
The real computation graph is made of `Value` objects: they store `data`, `_prev`, `_op`, and `label`. The yellow operation nodes are only a visualization trick. They make the graph easier to read as `input Value -> operation -> result Value`.
How `draw_dot(root)` reads the graph
`draw_dot` first calls `trace(root)`, then draws each `Value` as a record-shaped node. Leaf values are shown in green because they were created directly. Values produced by operations are shown in blue, with a separate yellow node for the operation stored in `_op`.
For more details about the Graphviz-specific parts of the code, such as `Digraph`, `node`, `edge`, and `graph_attr`, consult the Python Graphviz API reference.
Why `id(n)` is used
Graphviz needs a unique internal name for every node. `id(n)` gives an identifier that is unique among the Python objects alive at the same time, so two different `Value` objects can safely have the same label or numeric data without colliding in the diagram.
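The visited-set behaviour of `trace` can be checked without Graphviz. The sketch below is a minimal, self-contained example: it repeats the second-version `Value` class and the `trace` function, and builds a small hypothetical graph (not the one used in this note) in which the leaf `a` is reached through two different paths.

```python
class Value:
    """Second-version Value, repeated so the sketch runs on its own."""
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

def trace(root):
    # Collect every reachable Value and every (input, result) dependency.
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:          # skip Values that were already visited
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

a = Value(2.0, label='a')
b = Value(3.0, label='b')
s = a + b; s.label = 's'          # first use of a
out = a * s; out.label = 'out'    # second use of a

nodes, edges = trace(out)
node_labels = sorted(n.label for n in nodes)
edge_labels = sorted((src.label, dst.label) for src, dst in edges)
print(node_labels)
print(edge_labels)
```

Although `a` feeds both `s` and `out`, it appears once in `nodes`, while each of its two outgoing dependencies appears as a separate pair in `edges`.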
Forward pass recap
At this stage, the computation graph can represent mathematical expressions built from two scalar operations:
- addition: `+`
- multiplication: `*`
Every quantity in the graph is still a scalar. Even with only these two operations, the graph is already expressive enough to perform a complete forward pass and compute a final output $L$.
The computation starts from several leaf values:
$$a = 2.0, \quad b = -3.0, \quad c = 10.0, \quad f = -2.0$$
These values are combined step by step into intermediate values e and d, and finally into one single scalar output:
$$e = a \cdot b = -6.0, \qquad d = e + c = 4.0, \qquad L = d \cdot f = -8.0$$
The graph above represents this forward pass: information flows from the input values, through the operations, until the final result is produced.
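The forward pass in this recap can be replayed with plain Python floats. The leaf values are the ones used for the example graph earlier in this note; nothing else is assumed.

```python
# Leaf values from the example graph.
a, b, c, f = 2.0, -3.0, 10.0, -2.0

e = a * b    # first intermediate value
d = e + c    # second intermediate value
L = d * f    # final scalar output

print(e, d, L)  # -6.0 4.0 -8.0
```

The numbers match the `Value` graph, but plain floats keep no record of how `L` was produced, which is exactly the information the `_prev` and `_op` attributes add.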
Preparing for Backpropagation
With the forward graph in place, the next step is backpropagation. Backpropagation starts from the node of the computation graph representing the final output:
$$L$$
What backpropagation computes
From the output node, the computation moves backward through the graph, computing gradients for all intermediate values. For every value in the graph, the quantity of interest is:
$$\frac{\partial L}{\partial \text{node}}$$
That is, how much the final output changes when that particular node changes.
Starting point
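The starting point is the derivative of $L$ with respect to itself:
$$\frac{\partial L}{\partial L} = 1$$
This anchors the backward pass: the output changes one-for-one with itself.
From there, the backward pass continues through the graph and computes:
$$\frac{\partial L}{\partial d}, \quad \frac{\partial L}{\partial f}, \quad \frac{\partial L}{\partial e}, \quad \frac{\partial L}{\partial c}, \quad \frac{\partial L}{\partial a}, \quad \frac{\partial L}{\partial b}$$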
How to read these gradients
These gradients tell how each value in the computation graph influences the final output $L$.
A positive gradient means that increasing the node tends to increase $L$. A negative gradient means that increasing the node tends to decrease $L$. A gradient close to zero means that small changes to that node have little local effect on $L$.
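The sign interpretation above can be illustrated numerically. The sketch below is not backpropagation: it estimates each gradient by nudging one leaf value by a small step `h` and measuring how the output of the example expression changes. The helper names `forward` and `numerical_grad` and the step size are assumptions made for this illustration.

```python
def forward(values):
    """Forward pass of the example expression L = (a*b + c) * f."""
    return (values['a'] * values['b'] + values['c']) * values['f']

def numerical_grad(values, name, h=1e-6):
    """Estimate dL/d(name) by nudging one leaf value by a small step h."""
    nudged = dict(values)   # copy, so the original values stay untouched
    nudged[name] += h
    return (forward(nudged) - forward(values)) / h

leaves = {'a': 2.0, 'b': -3.0, 'c': 10.0, 'f': -2.0}
estimates = {name: numerical_grad(leaves, name) for name in leaves}
for name, g in estimates.items():
    print(f"dL/d{name} ~ {g:.4f}")
```

For these leaf values the estimates for `a` and `f` are positive, so increasing either one increases the output, while the estimates for `b` and `c` are negative, so increasing either one decreases it.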
Why this matters for neural networks
This is the central quantity needed in neural networks. In that setting, $L$ is usually a loss function, and the derivative of interest is the derivative of the loss with respect to each weight $w_i$ of the neural network:
$$\frac{\partial L}{\partial w_i}$$
In this small graph, the leaf variables include values such as:
$$a, \quad b, \quad c, \quad f$$
Some of these variables will eventually represent the weights of a neural network. The goal is to understand how those weights are impacting the loss function $L$.
Leaf nodes: weights and data
In the computation graph, some leaf nodes will represent:
- the weights of the neural network
- the input data
Usually, the gradients with respect to the weights are the ones that matter most, i.e. the derivatives of the loss function with respect to the weights, because those are the values updated during training.
The input data is normally fixed, so its gradient is usually not used for parameter updates.
Instead, gradient information is used to iteratively adjust the weights and reduce the loss.
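To make the idea of adjusting weights concrete, here is a minimal sketch under loudly stated assumptions: a hypothetical one-weight model `prediction = w * x` with a squared-error loss, an analytically written gradient, and an arbitrary learning rate `lr`. None of these names or values come from the Micrograd notes; they only illustrate how a weight gradient is used while the data stays fixed.

```python
# Hypothetical setup: one weight, one fixed data point, squared-error loss.
x, y = 3.0, 6.0   # input data and target, kept fixed during training
w = 0.5           # the single trainable weight
lr = 0.01         # learning rate (an assumed hyperparameter)

def loss(w):
    return (w * x - y) ** 2

history = [loss(w)]
for _ in range(3):
    grad_w = 2 * (w * x - y) * x   # analytic dL/dw for this particular loss
    w -= lr * grad_w               # step against the gradient
    history.append(loss(w))

print(history)
```

Only `w` is updated. The data `x` and target `y` never change, which is why their gradients play no role in the update.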
Value class: third version
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        # Gradient of the final output L with respect to this Value.
        self.grad = 0.0
        # Keep the local graph structure needed to traverse the expression backward.
        self._prev = set(_children)
        self._op = _op
        self.label = label
    def __repr__(self):
        return f"Value(data={self.data})"
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out
The role of `grad`
Each `Value` has a `grad` attribute that stores the derivative of the final output $L$ with respect to that specific `Value`.
For example, if a `Value` represents $a$, then `a.grad` will eventually hold:
$$\frac{\partial L}{\partial a}$$
It starts at `0.0` because, before backpropagation, no local sensitivity of $L$ with respect to this `Value` has been computed yet. Initializing every gradient to `0.0` means that, at the start, each node is treated as if a tiny change to that value does not affect the output $L$. This matches the derivative intuition from Derivatives Refresh: a derivative measures how much the output changes when the input is nudged by a very small amount.
Rebuilding the graph with grad
The previous code block is now executed again using the updated Value class.
a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label='e'
d = e + c; d.label='d'
f = Value(-2.0, label='f')
L = d * f; L.label='L'
L
Output:
Value(data=-8.0)
Why the code block is rerun
The original code is developed in a Jupyter notebook, where cells are executed sequentially. After redefining the `Value` class, the previous objects `a`, `b`, `c`, `d`, `e`, `f`, and `L` still refer to instances created from the older class definition. Re-executing the full block rebuilds the computation graph using the current version of `Value`, including the new `grad` attribute.
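The effect of redefining a class while old instances are still around can be reproduced in a few lines. The sketch below uses stripped-down stand-ins for the two class versions; the names `old_a` and `new_a` are illustrative.

```python
class Value:   # older definition, without grad
    def __init__(self, data):
        self.data = data

old_a = Value(2.0)   # instance created from the older class

class Value:   # redefinition, as when a notebook cell with new code is run
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

new_a = Value(2.0)   # instance created from the current class

print(hasattr(old_a, 'grad'))     # False: the old object never ran the new __init__
print(hasattr(new_a, 'grad'))     # True
print(isinstance(old_a, Value))   # False: Value now names a different class object
```

Rebinding the name `Value` does not touch existing objects, so code that reads `n.grad` would fail on nodes built before the redefinition.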
Redrawing the graph
Now that the graph has been rebuilt with the updated objects, it can be drawn again using a slightly extended visualization that also shows the new grad field for every node.
from graphviz import Digraph
def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges
def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={
        'rankdir': 'LR',
        'bgcolor': 'transparent',
        'fontname': 'sans-serif'
    })
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # The record label now also shows the gradient stored in the node.
        label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad)
        if n._op:
            dot.node(name=uid, label=label, shape='Mrecord',
                     fillcolor='#40C4FF', color='#00B0FF', style='filled', fontname='sans-serif')
            dot.node(name=uid + n._op, label=n._op, shape='circle',
                     fillcolor='#FFEA00', color='#FFD600', style='filled', fontname='sans-serif')
            dot.edge(uid + n._op, uid, color='#0033FF', penwidth='2.0')
        else:
            dot.node(name=uid, label=label, shape='Mrecord',
                     fillcolor='#00E676', color='#00C853', style='filled', fontname='sans-serif')
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op, color='#D500F9', penwidth='2.0')
    return dot
draw_dot(L)
What changes in the graph visualization
Now every object of the `Value` class has a `grad` attribute. This means that every node in the computation graph can store both its scalar value and its gradient. For each node, two pieces of information can now be displayed:
- `data`: the actual scalar value stored in the node
- `grad`: the gradient of the final output $L$ with respect to that node
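The same two fields can be listed as text, without Graphviz. The sketch below repeats the third-version class and rebuilds the example graph; the helper `collect_nodes` is a hypothetical, node-only variant of `trace` introduced for this illustration.

```python
class Value:
    """Third-version Value, repeated so the sketch runs on its own."""
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self.label = label
    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')
    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

def collect_nodes(root):
    # Hypothetical helper: gather every Value reachable from root via _prev.
    nodes = set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                build(child)
    build(root)
    return nodes

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a*b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'

graph_nodes = collect_nodes(L)
for n in sorted(graph_nodes, key=lambda n: n.label):
    print(f"{n.label}: data={n.data:.4f} grad={n.grad:.4f}")
```

Before any backward pass has run, every node reports a gradient of zero, while the `data` fields already hold the forward-pass values.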
Ready for the backward pass
Forward values first, gradients next
At this point, the forward pass gives the value of each node. The next step is to fill in the `grad` values by running the backward pass.
The graph is therefore ready for performing backpropagation.