Read the Overview First
This note is part of the Micrograd series. Before continuing, it is essential to read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.
Using gradients to change the output
After the manual backward pass, every node in the scalar graph stores a gradient. For a node such as a, the value stored in a.grad represents $\partial L / \partial a$, the derivative of the final output $L$ with respect to a.
This number describes how a small change in a.data affects the final scalar output $L$.
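The meaning of a.grad can be checked numerically with a finite difference. The sketch below uses plain floats instead of the series' node objects, and it assumes the leaf values a = 2.0, b = -3.0, c = 10.0, f = -2.0, which this note does not restate. The helper name forward is hypothetical.

```python
# Assumed leaf values (not restated in this note).
a, b, c, f = 2.0, -3.0, 10.0, -2.0

def forward(a, b, c, f):
    """Forward pass of the graph: e = a * b, d = e + c, L = d * f."""
    e = a * b
    d = e + c
    return d * f

h = 1e-6  # small nudge applied to a only

# Finite-difference estimate of dL/da: (L(a + h) - L(a)) / h
dL_da = (forward(a + h, b, c, f) - forward(a, b, c, f)) / h
print(round(dL_da, 4))  # close to b * f, the analytic derivative
```

Because L depends linearly on a in this graph, the estimate agrees with the analytic value b * f up to floating-point error.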
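The leaf nodes in the current graph are a, b, c, and f.
Their gradients can be used to nudge their .data values in a direction that changes $L$. Moving a value in the direction of its gradient performs a small gradient ascent step:
$$x \leftarrow x + \eta \, \frac{\partial L}{\partial x}$$
where $x$ stands for any leaf value and $\eta$ is a small positive step size.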
In this toy example, the goal is to make $L$ increase. Since the previous value of $L$ was negative, an increase means that the new value of $L$ should become less negative.
Ascent vs descent
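This preview moves in the direction of the gradient, so it performs gradient ascent on $L$.
When training a neural network with a loss function, the usual goal is to make the loss smaller. In that case, the update uses the opposite direction:
$$x \leftarrow x - \eta \, \frac{\partial L}{\partial x}$$
Here $x$ is a trainable parameter, $\eta$ a small positive step size, and $L$ the loss.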
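The only difference between the two rules is the sign of the step. A minimal sketch on a stand-in function makes this visible; the function g(x) = x² and the names g, g_grad, and lr are illustrative assumptions, not part of the series' code.

```python
def g(x):
    """Stand-in scalar function to optimize (an assumption for illustration)."""
    return x ** 2

def g_grad(x):
    """Analytic derivative of g."""
    return 2 * x

x = 3.0
lr = 0.01  # step size

x_ascent = x + lr * g_grad(x)   # move along the gradient: g goes up
x_descent = x - lr * g_grad(x)  # move against the gradient: g goes down

print(g(x_descent), g(x), g(x_ascent))
```

The three printed values are in increasing order: the descent step lowers g, and the ascent step raises it.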
Applying one small step
The step size used below is 0.01. Each leaf value is moved by 0.01 * grad.
a.data += 0.01 * a.grad
b.data += 0.01 * b.grad
c.data += 0.01 * c.grad
f.data += 0.01 * f.grad
After changing the leaf values, the downstream values must be recomputed. The old objects e, d, and L still contain the forward values from the previous graph. Rebuilding them creates a new forward pass using the updated leaf data.
e = a * b
d = e + c
L = d * f
print(L.data)
-7.286496
The new value of $L$ is larger than the previous value, so the gradient-ascent step changed $L$ in the expected direction.
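The whole step can be reproduced with plain floats, without the series' node class. This is a sketch under two assumptions: the leaf values a = 2.0, b = -3.0, c = 10.0, f = -2.0 (not restated in this note), and gradients written out by hand from the chain rule rather than read from .grad attributes.

```python
# Assumed leaf values (not restated in this note).
a, b, c, f = 2.0, -3.0, 10.0, -2.0

# Forward pass before the update.
e = a * b
d = e + c
old_L = d * f

# Gradients of L with respect to each leaf, from the chain rule.
a_grad = b * f  # dL/da = (dL/dd) * (dd/de) * (de/da) = f * 1 * b
b_grad = a * f  # dL/db = f * 1 * a
c_grad = f      # dL/dc = (dL/dd) * (dd/dc) = f * 1
f_grad = d      # dL/df = d

# One gradient-ascent step on every leaf.
step = 0.01
a += step * a_grad
b += step * b_grad
c += step * c_grad
f += step * f_grad

# Rebuild the forward pass with the updated leaves.
e = a * b
d = e + c
new_L = d * f

print(old_L, round(new_L, 6))
```

Under these assumed values, the rounded result agrees with the printed -7.286496, and new_L is larger than old_L.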
What this previews
This is only a small preview of the optimization logic used later when training neural networks. In a real training loop, the values updated by this kind of rule are usually trainable parameters, such as weights and biases, while input data is normally kept fixed.
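As a rough illustration of that split between parameters and data, the sketch below runs gradient descent on a one-input linear model. Everything in it is a hypothetical stand-in: the model w * x + b, the squared-error loss, and the chosen numbers are assumptions for illustration, not the training code developed later in the series.

```python
x, y = 2.0, 3.0  # input and target: fixed data, never updated
w, b = 0.5, 0.0  # trainable parameters
lr = 0.01        # step size

losses = []
for _ in range(20):
    # Forward pass.
    pred = w * x + b
    loss = (pred - y) ** 2
    losses.append(loss)

    # Backward pass by hand (chain rule).
    dloss_dpred = 2 * (pred - y)
    w_grad = dloss_dpred * x
    b_grad = dloss_dpred

    # Gradient descent: step against the gradient to reduce the loss.
    w -= lr * w_grad
    b -= lr * b_grad

print(losses[0], losses[-1])
```

Only w and b change across iterations; x and y keep their original values, and the recorded loss shrinks at every step.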