Preview of a single optimization step

Read the Overview First

This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

Using gradients to change the output

After the manual backward pass, every node in the scalar graph stores a gradient. For a node such as a, the value stored in a.grad represents $\partial L / \partial a$.

This number is the local rate of change of the final scalar output $L$ with respect to a.data: for a small change in a.data, $L$ changes by approximately a.grad times that change.
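One way to see this interpretation is a finite-difference check on plain Python floats. The sketch below assumes the leaf values a = 2.0, b = -3.0, c = 10.0, f = -2.0 and the expression L = (a * b + c) * f; these values are not stated in this note, but they are consistent with $L = -8.0$ and with the rebuild code further down. The helper name forward is hypothetical.

```python
def forward(a, b, c, f):
    """Forward pass L = (a * b + c) * f on plain floats."""
    e = a * b
    d = e + c
    return d * f

# Assumed leaf values (consistent with L = -8.0 in this note).
a, b, c, f = 2.0, -3.0, 10.0, -2.0
h = 1e-6  # small nudge applied to a only

base = forward(a, b, c, f)
bumped = forward(a + h, b, c, f)

# Change in L divided by change in a approximates dL/da.
numeric_grad_a = (bumped - base) / h

print(base)
print(numeric_grad_a)  # close to 6.0 under these assumptions
```

Under these assumptions the ratio should agree with the a.grad value produced by the manual backward pass.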

The leaf nodes in the current graph are:

a, b, c, f

Their gradients can be used to nudge their .data values in a direction that changes $L$. Moving a value in the direction of its gradient performs a small gradient ascent step:

$\text{data} \leftarrow \text{data} + \text{step size} \cdot \text{grad}$

In this toy example, the goal is to make $L$ increase. Since the previous value was:

$L = -8.0$

an increase means that the new value of $L$ should become less negative.

Ascent vs descent

This preview moves in the direction of the gradient, so it performs gradient ascent on $L$.

When training a neural network with a loss function, the usual goal is to make the loss smaller. In that case, the update uses the opposite direction:

$\text{data} \leftarrow \text{data} - \text{step size} \cdot \text{grad}$
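The two update directions can be compared side by side on plain floats. This is a minimal sketch, assuming the leaf values a = 2.0, b = -3.0, c = 10.0, f = -2.0 and the gradients 6.0, -4.0, -2.0, 4.0 obtained by differentiating L = (a * b + c) * f by hand; the names forward and step are hypothetical.

```python
def forward(a, b, c, f):
    """Forward pass L = (a * b + c) * f on plain floats."""
    return (a * b + c) * f

# Assumed leaf values and hand-derived gradients of L with respect to each leaf.
leaves = {"a": 2.0, "b": -3.0, "c": 10.0, "f": -2.0}
grads = {"a": 6.0, "b": -4.0, "c": -2.0, "f": 4.0}
step_size = 0.01

def step(sign):
    """Move every leaf by sign * step_size * grad, then recompute L."""
    moved = {k: v + sign * step_size * grads[k] for k, v in leaves.items()}
    return forward(**moved)

L_before = forward(**leaves)
L_ascent = step(+1.0)   # data <- data + step_size * grad
L_descent = step(-1.0)  # data <- data - step_size * grad

print(L_before, L_ascent, L_descent)
```

With these assumed values, the ascent step should give a result above L_before and the descent step a result below it.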

Applying one small step

The step size used below is 0.01. Each leaf value is moved by 0.01 * grad.

a.data += 0.01 * a.grad
b.data += 0.01 * b.grad
c.data += 0.01 * c.grad
f.data += 0.01 * f.grad
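The effect of these four updates on $L$ can be estimated before recomputing anything. To first order, the change in $L$ is the sum of grad times the change in data over all leaves, and here each change in data is 0.01 * grad. The sketch below assumes the gradients 6.0, -4.0, -2.0, 4.0 for a, b, c, f, which are not listed in this note.

```python
# Assumed gradients from the manual backward pass (not listed in this note).
grads = {"a": 6.0, "b": -4.0, "c": -2.0, "f": 4.0}
step_size = 0.01
L_before = -8.0

# First-order estimate: delta_L ~= sum(grad * delta_data),
# where delta_data = step_size * grad for every leaf.
predicted_change = step_size * sum(g * g for g in grads.values())
predicted_L = L_before + predicted_change

print(predicted_change)
print(predicted_L)
```

Every term in the sum is a squared gradient, so the estimate cannot be negative; this is why a small step along the gradient is expected to raise $L$. The estimate is only approximate because $L$ is not a linear function of the leaves.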

After changing the leaf values, the downstream values must be recomputed. The old objects e, d, and L still contain the forward values from the previous graph. Rebuilding them creates a new forward pass using the updated leaf data.

e = a * b
d = e + c
L = d * f
 
print(L.data)

-7.286496

The new value is larger than the previous value $-8.0$, so the gradient-ascent step changed $L$ in the expected direction.
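The need to rebuild the downstream nodes can be checked directly. The following is a minimal sketch that uses a stripped-down stand-in for the series' Value class (it stores only data and grad, with no graph tracking or backward pass) and assumes the leaf values and gradients from the earlier notes.

```python
class Value:
    """Minimal stand-in for the series' Value class: data and grad only."""

    def __init__(self, data):
        self.data = data
        self.grad = 0.0

    def __add__(self, other):
        return Value(self.data + other.data)

    def __mul__(self, other):
        return Value(self.data * other.data)

# Assumed leaf values and manually computed gradients.
a, b, c, f = Value(2.0), Value(-3.0), Value(10.0), Value(-2.0)
a.grad, b.grad, c.grad, f.grad = 6.0, -4.0, -2.0, 4.0

L = (a * b + c) * f  # original forward pass

# One gradient-ascent step on every leaf.
for leaf in (a, b, c, f):
    leaf.data += 0.01 * leaf.grad

stale = L.data       # the old L object was not touched by the update
L = (a * b + c) * f  # rebuilding runs a fresh forward pass
fresh = L.data

print(stale)
print(fresh)
```

The old L object keeps its original value after the leaves change; only the rebuilt expression reflects the updated leaf data.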

What this previews

This is only a small preview of the optimization logic used later when training neural networks. In a real training loop, the values updated by this kind of rule are usually trainable parameters, such as weights and biases, while input data is normally kept fixed.
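As a rough illustration of that distinction, the sketch below repeats a gradient-descent step on two hypothetical parameters, w and b, while the input x and the target stay fixed. The model, the squared-error loss, the starting values, and the step size are all assumptions chosen for this example; they do not come from the series.

```python
# Hypothetical one-input linear model: prediction = w * x + b.
w, b = 0.5, 0.0        # trainable parameters (updated every step)
x, target = 2.0, 3.0   # input data and target (kept fixed)
step_size = 0.05

losses = []
for _ in range(5):
    # Forward pass: squared error between prediction and target.
    error = (w * x + b) - target
    losses.append(error ** 2)

    # Hand-derived gradients of the loss with respect to w and b.
    grad_w = 2.0 * error * x
    grad_b = 2.0 * error

    # Descent: move the parameters against their gradients.
    w -= step_size * grad_w
    b -= step_size * grad_b

print(losses)
```

Only w and b change inside the loop, and the minus sign in the update makes the recorded loss shrink from step to step in this setup.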
