Building an MLP in Micrograd

Read the Overview First

This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

The previous notes built the scalar autograd engine: the Value object can now build computation graphs, run backpropagation, and accumulate gradients correctly.

The next step is to build a tiny neural network library on top of that engine. The goal is to mirror the style of PyTorch neural network modules, but using the scalar Value objects developed in Micrograd.

From autograd engine to neural network modules

Micrograd first matched the core idea of PyTorch autograd on a scalar computation graph.

This note starts matching the neural-network-module side of PyTorch: a neuron, a layer, and an MLP are implemented as small callable Python objects.

Building an MLP

The construction starts from a single neuron. A similar neuron was already used earlier, but here it is packaged as a reusable object.

A neuron stores:

one weight for each input;
one bias;
a forward computation that takes an input vector, computes a weighted sum, and applies tanh.

Each weight and bias is a Value object, so gradients can be computed with respect to them during backpropagation.

The weights and bias are initialized randomly with random.uniform(-1, 1).
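The forward computation described above can be traced with plain Python floats before any Value objects are involved. This is only a sketch: the weights, bias, and input below are hypothetical fixed numbers chosen for illustration, whereas the Neuron class draws its parameters at random.

```python
import math

# Hypothetical fixed parameters; the Neuron class draws them at random instead.
w = [0.5, -0.25, 0.1]
b = 0.1
x = [2.0, 3.0, -1.0]

# Weighted sum w . x + b, starting the sum at the bias.
act = sum((wi * xi for wi, xi in zip(w, x)), b)

# tanh squashes the activation into the open interval (-1, 1).
out = math.tanh(act)

print(act, out)
```

The weighted sum here is 1.0 - 0.75 - 0.1 + 0.1 = 0.25 (up to floating-point rounding), and the neuron output is tanh of that value.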

Callable modules

The __call__ method is invoked automatically by Python when an object is called like a function.

This means a Neuron instance can be evaluated as neuron(x), a Layer instance as layer(x), and an MLP instance named n as n(x).

This pattern is close to the way PyTorch modules define a forward computation and are then called like functions.
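A minimal illustration of the callable-object pattern, independent of Micrograd. The class name Scale and its factor attribute are hypothetical and exist only for this example.

```python
class Scale:
    """Hypothetical example class: stores a factor and applies it when called."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        # Runs whenever the instance is used like a function.
        return self.factor * x


double = Scale(2.0)
result = double(3.0)         # equivalent to double.__call__(3.0)
same = double.__call__(3.0)  # the explicit form of the same call
print(result, same)
```

The object keeps its state (here, factor) between calls, which is the same reason a neuron can hold its weights and bias while still being called like a function.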

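import random

# Value is the scalar autograd class built in the previous notes.


class Neuron:

    def __init__(self, nin):
        # One weight per input and one bias, each drawn uniformly between -1 and 1.
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # Weighted sum w . x + b; starting sum() at self.b keeps everything a Value.
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        out = act.tanh()
        return out


class Layer:

    def __init__(self, nin, nout):
        # nout independent neurons, each reading the same nin inputs.
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [neuron(x) for neuron in self.neurons]
        # Unwrap a single-neuron layer so its output is a scalar Value.
        return outs[0] if len(outs) == 1 else outs


class MLP:

    def __init__(self, nin, nouts):
        # Full list of layer sizes; consecutive pairs define each layer.
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        # Feed the output of each layer into the next one.
        for layer in self.layers:
            x = layer(x)
        return x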

The Layer class contains a list of neurons. Each neuron receives nin inputs and produces one output, so a layer with nout neurons produces nout outputs.

If a layer has only one neuron, its output is returned directly as a scalar Value; otherwise, the list of neuron outputs is returned.

The MLP class receives:

nin, the number of input values;
nouts, a list containing the number of neurons in each layer.

Inside the constructor, these two pieces are combined with:

sz = [nin] + nouts

List concatenation
Here + is not doing numerical addition. Since both operands are lists, Python uses + to concatenate them. So nin is first wrapped in a one-element list, [nin], and then attached in front of nouts. This wrapping is necessary because nin is an integer, while nouts is a list.

For example:
nin = 3
nouts = [4, 4, 1]
sz = [nin] + nouts
gives:
sz = [3, 4, 4, 1]
This list represents the full sequence of layer dimensions:
3 -> 4 -> 4 -> 1

The MLP is then built by taking consecutive pairs from this list. This is what happens in:

self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

The consecutive pairs are:

3 -> 4, first layer: $3$ inputs, $4$ neurons;
4 -> 4, second layer: $4$ inputs, $4$ neurons;
4 -> 1, output layer: $4$ inputs, $1$ neuron.

Therefore, MLP(3, [4, 4, 1]) creates:

an input size of $3$;
a first hidden layer with $4$ neurons;
a second hidden layer with $4$ neurons;
an output layer with $1$ neuron.
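The pairing of consecutive sizes can be checked without building any Value objects. The sketch below uses zip(sz, sz[1:]) as an equivalent way to form the same pairs as the index-based comprehension, and also counts parameters under the rule stated earlier (one weight per input plus one bias per neuron). The names pairs, params_per_layer, and total_params are introduced here for illustration.

```python
nin = 3
nouts = [4, 4, 1]

sz = [nin] + nouts              # [3, 4, 4, 1]
pairs = list(zip(sz, sz[1:]))   # consecutive (inputs, neurons) pairs

# Each neuron holds n_in weights plus one bias.
params_per_layer = [(n_in + 1) * n_out for n_in, n_out in pairs]
total_params = sum(params_per_layer)

for (n_in, n_out), count in zip(pairs, params_per_layer):
    print(f"{n_in} -> {n_out}: {count} parameters")
print("total:", total_params)
```

For MLP(3, [4, 4, 1]) this gives 16, 20, and 5 parameters per layer, 41 in total.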

Visualizing one forward pass

The input:

x = [2.0, 3.0, -1.0]

is passed through an MLP with two hidden layers of width 4 and a single scalar output.

x = [2.0, 3.0, -1.0]
n = MLP(3, [4, 4, 1])
draw_dot(n(x))

Because the MLP is built entirely out of scalar Value operations, the resulting computation graph is large even for this tiny network.

Random initialization

The exact numerical output depends on the random weights and biases initialized by random.uniform(-1, 1).

The values shown below correspond to one particular run of the notebook.

Building a toy dataset

A small toy dataset is used to evaluate the MLP.

The dataset contains four samples. Each sample has three features, matching the input size of the MLP.

The target list ys contains the desired output for each input sample.

xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
 
ys = [1.0, -1.0, -1.0, 1.0]
 
ypred = [n(x) for x in xs]
ypred

[Value(data=-0.9016746945335403),
 Value(data=-0.9744021347961607),
 Value(data=-0.9758381637844795),
 Value(data=-0.9876951155314461)]

Each element of ypred is a Value object. This matters because each prediction remains connected to the computation graph of the MLP, so gradients can be computed with respect to the network parameters.

Computing the loss

The loss used here is derived from the mean squared error (MSE). In general, MSE compares each prediction with its target, squares the error, and averages these squared errors over the dataset or batch:

$$L_{MSE} = \frac{1}{n} \sum_{i} (y_{out, i} - y_{gt, i})^{2}$$

In this Micrograd example, the same idea is applied to the four training samples. Each prediction $y_{out}$ is compared with its target $y_{gt}$ (the subscript $gt$ stands for ground truth), the difference is squared, and the squared errors are added together:

$$L = \sum_{i} (y_{out, i} - y_{gt, i})^{2}$$

This is the MSE specialized to this tiny batch, but without the averaging factor $\frac{1}{n}$. In other words, the example uses the sum of squared errors rather than their mean. The role is the same: produce one scalar objective that measures how far the predictions are from the targets. Omitting the division only scales the loss and its gradients by the constant factor $n$ (here $n = 4$), which can be compensated for by the learning rate.

Most importantly for Micrograd, this expression creates a single Value object representing the scalar objective that the computation graph can backpropagate through.

loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
loss

Value(data=7.568537561169102)

The loss is a Value object
The result is not a plain Python number. It is a Value object:
Value(data=7.568537561169102)
This matters because loss is the final node of the computation graph. It stores the numerical value of the loss in loss.data, but it also keeps the graph structure needed to propagate gradients backward through all four forward passes and into the MLP parameters.
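As a sanity check, the reported loss can be recomputed with plain floats from the predictions printed in the run above, and dividing by the number of samples gives the corresponding MSE. This sketch carries no graph and no gradients; the names sse and mse are introduced here for illustration.

```python
ys = [1.0, -1.0, -1.0, 1.0]

# Predictions copied from the particular run shown earlier in this note.
ypred = [
    -0.9016746945335403,
    -0.9744021347961607,
    -0.9758381637844795,
    -0.9876951155314461,
]

# Sum of squared errors, the quantity used as the loss in this note.
sse = sum((yout - ygt) ** 2 for ygt, yout in zip(ys, ypred))

# Mean squared error: the same quantity divided by the number of samples.
mse = sse / len(ys)

print(sse)
print(mse)
```

The value of sse should agree with loss.data from that run up to floating-point rounding. Almost all of it comes from the first and last samples, whose targets are 1.0 while the predictions sit near -1.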

Fixing `sum()` with `__radd__`

While computing the squared-error loss with:

loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

Python raises the following error if Value does not yet define __radd__:

TypeError: unsupported operand type(s) for +: 'int' and 'Value'

The reason is the behavior of the built-in sum() function.

Python defines sum() as:

sum(iterable, start=0)

By default, the start value is the integer $0$. Therefore, the first addition attempted by sum() is:

0 + Value_Object

The standard int class does not know how to add itself to a custom Value object, so the operation fails.

Reflected addition

When the object on the left side of + does not know how to handle the operation, Python checks whether the object on the right side implements the reflected addition method __radd__.

Implementing __radd__ makes Value behave more like a native Python number in expressions such as 10 + a or sum(list_of_values).

The robust library-level fix is to implement __radd__ inside the Value class:

def __radd__(self, other):
    return self + other

An inline alternative is to tell sum() to start from a Value object instead of the default integer $0$ :

loss = sum(((yout - ygt)**2 for ygt, yout in zip(ys, ypred)), Value(0.0))

Both methods work, but implementing __radd__ is the better library design because it makes Value objects interact naturally with Python arithmetic.
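The mechanism can be reproduced in isolation. The Scalar class below is a hypothetical stand-in for Value: it only stores a number and has no autograd, which is enough to show how the reflected method is reached.

```python
class Scalar:
    """Hypothetical stand-in for Value: stores a number, no autograd."""

    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # Wrap plain numbers so that Scalar + 3 works.
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data + other.data)

    def __radd__(self, other):
        # Reached for other + self when other (e.g. the int 0) cannot handle it.
        return self + other


# int does not know how to add a Scalar, so it signals NotImplemented ...
print((0).__add__(Scalar(1.0)))

# ... and Python falls back to Scalar.__radd__, so sum() works from the default 0.
total = sum([Scalar(1.0), Scalar(2.5)])
mixed = 10 + Scalar(1.0)
print(total.data, mixed.data)
```

Removing __radd__ from this class makes both the sum() call and 10 + Scalar(1.0) fail with the TypeError shown above.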

Backpropagating through the loss

Calling loss.backward() computes the gradient of the loss with respect to every node in the computation graph, including all the weights and biases of the MLP.

Those gradients are what will later allow the MLP weights and biases to be updated with gradient descent.

loss.backward()
 
draw_dot(loss)
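The first step of this backward pass can be checked by hand. For the sum of squared errors, the derivative of the loss with respect to each prediction is 2 * (yout - ygt). The sketch below compares that formula with a finite-difference estimate using plain floats; it assumes the predictions from the run shown earlier, and the helper sse and the step size h are introduced here for illustration. Assuming Value stores gradients in a grad attribute as in the usual Micrograd design, these are the values each prediction node would be expected to hold after loss.backward().

```python
ys = [1.0, -1.0, -1.0, 1.0]
ypred = [
    -0.9016746945335403,
    -0.9744021347961607,
    -0.9758381637844795,
    -0.9876951155314461,
]


def sse(preds):
    # Sum of squared errors against the fixed targets ys.
    return sum((p - t) ** 2 for p, t in zip(preds, ys))


# Analytic derivative of the loss with respect to each prediction.
analytic = [2 * (p - t) for p, t in zip(ypred, ys)]

# Forward finite-difference estimate: nudge one prediction at a time.
h = 1e-6
numeric = []
for i in range(len(ypred)):
    bumped = list(ypred)
    bumped[i] += h
    numeric.append((sse(bumped) - sse(ypred)) / h)

for a, n in zip(analytic, numeric):
    print(a, n)
```

The first and last samples have large negative derivatives, meaning the loss decreases if those predictions move up toward their targets of 1.0.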
