Read the Overview First

This note is part of the Micrograd series. Before continuing, it is essential to read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

Derivative of a simple function with one input

Before jumping into neural networks, it is necessary to build (or refresh) a solid intuition for what a derivative actually is. To achieve this, a simple, visual example provides a helpful starting point.

In this first step, a basic quadratic function is considered:

$$f(x) = 3x^2 - 4x + 5$$

The function is first evaluated at one specific input value, and then numpy and matplotlib are used to plot it over a range of values to clearly reveal its shape (a parabola).

Observing the curve helps illustrate how a small bump in the input ($x$) changes the output ($f(x)$), which forms the core idea behind backpropagation.

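import numpy as np
import matplotlib.pyplot as plt

# simple scalar function
def f(x):
    return 3*x**2 - 4*x + 5

f(3.0)  # evaluates to 20.0

xs = np.arange(-5, 5, 0.25)  # inputs from -5 up to (not including) 5, step 0.25
ys = f(xs)  # vectorized evaluation of f on the array xs
plt.plot(xs, ys)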

To measure the slope of the curve at a specific point $x$, we slightly perturb the input and observe how much the output changes. A scalar function $f$ is differentiable at $x$ if the following finite limit exists:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

In exact mathematics, the derivative is obtained by taking the limit as $h \to 0$. In code, however, the limit is approximated by choosing $h$ as a tiny positive number. This gives a finite difference approximation.

The finite difference ratio

$$\frac{f(x + h) - f(x)}{h}$$

measures how much the output changes per unit of input change. Geometrically, this is the familiar rise over run ratio: the change in output divided by the change in input. This gives the local slope of the function at the chosen point, and it is the same basic idea that later becomes the local sensitivity used in backpropagation.

For the function considered above,

$$f(x) = 3x^2 - 4x + 5,$$

the symbolic derivative is obtained by applying the usual differentiation rules:

$$f'(x) = 6x - 4$$

The symbolic result can be compared with a finite difference approximation at different points. In each case, the numerical check uses the same rise over run computation:

h = 0.000001
x = ...  # choose one of: 2/3, 3.0, -3.0
(f(x + h) - f(x))/h  # finite difference: rise over run

| Point $x$ | Symbolic derivative $f'(x)$ | Finite difference result | Interpretation |
|---|---|---|---|
| $2/3$ | $0$ | 2.999378523327323e-06 | Extremely close to zero. This is the vertex of the parabola, where the slope is exactly zero. |
| $3.0$ | $14$ | 14.000003002223593 | Very close to $14$, confirming a positive slope at this point. |
| $-3.0$ | $-22$ | -21.999997002808414 | Very close to $-22$, confirming a negative slope at this point. |
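
The comparison above can be reproduced in a few lines. The following is a minimal sketch, assuming the quadratic `f` from the earlier code; the helper names `f_prime` and `forward_difference` are introduced here only for illustration and are not part of the original notes.

```python
# Compare the symbolic derivative f'(x) = 6x - 4 with a forward difference.
def f(x):
    return 3 * x**2 - 4 * x + 5

def f_prime(x):
    # symbolic derivative of f (hypothetical helper name)
    return 6 * x - 4

def forward_difference(func, x, h=1e-6):
    # rise over run with a small positive step h (hypothetical helper name)
    return (func(x + h) - func(x)) / h

points = [2 / 3, 3.0, -3.0]
results = [(x, f_prime(x), forward_difference(f, x)) for x in points]

for x, exact, approx in results:
    print(f"x = {x:+.4f}  symbolic = {exact:+.4f}  finite difference = {approx:+.8f}")
```

Keeping the step `h` as a keyword argument makes it easy to rerun the same check with other step sizes.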

Note

The finite difference results are not exactly equal to the symbolic values because the derivative is being approximated with a finite value of $h$ and floating-point arithmetic. Still, the numerical checks agree with the symbolic derivative.
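
For this particular quadratic, the size of the gap can be worked out by hand. The expansion below is a sketch that assumes $f(x) = 3x^2 - 4x + 5$, as defined in the code above:

$$
\begin{aligned}
\frac{f(x+h) - f(x)}{h}
&= \frac{3(x+h)^2 - 4(x+h) + 5 - \left(3x^2 - 4x + 5\right)}{h} \\
&= \frac{6xh + 3h^2 - 4h}{h} \\
&= 6x - 4 + 3h
\end{aligned}
$$

The leftover term $3h$ vanishes as $h \to 0$, leaving $6x - 4$. With $h = 10^{-6}$ it contributes an error of about $3 \times 10^{-6}$, which is in line with the three finite difference results reported above; the remaining digits come from floating-point rounding.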

Finite Precision and $h$

In exact arithmetic, decreasing $h$ would make the finite difference approximation converge to the symbolic derivative. For example, at a fixed point $x$, the finite difference approximation should approach the symbolic value $f'(x)$ as $h \to 0$.

In floating-point arithmetic, however, choosing $h$ too small can produce an output that is exactly 0.0. This happens because numbers are stored with finite precision: if $h$ is below the resolution available around the current value of $x$, then x + h may be rounded back to x, making f(x + h) - f(x) equal to zero. That exact 0.0 is therefore a numerical precision artifact, not a more faithful computation of the limit.
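
The effect can be observed directly by sweeping the step size. This is a minimal sketch: the evaluation point `x = 3.0` and the list of step sizes are assumptions chosen for illustration, and `f` is the quadratic from the earlier code.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

x = 3.0  # assumed evaluation point; the symbolic slope here is 6*3 - 4 = 14

slopes = {}
for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-17]:
    absorbed = (x + h == x)  # True when h is too small to change x at all
    slopes[h] = (f(x + h) - f(x)) / h
    print(f"h = {h:.0e}   x + h == x: {absorbed}   slope = {slopes[h]}")
```

For the smallest step, `x + h` is rounded back to `x`, so the numerator is exactly zero and the computed slope collapses to 0.0 even though the true slope is 14.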

Derivative of a function with multiple inputs

As complexity increases, functions typically involve multiple independent variables. Understanding how each input contributes to the final output is the foundation of backpropagation.

Formally, given a scalar function of multiple variables

$$f(x_1, x_2, \dots, x_n),$$

the partial derivative with respect to the variable $x_i$ is defined as:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}$$

provided that the limit exists. In other words, only $x_i$ is perturbed, while all the other variables are kept constant. From the point of view of directional derivatives, this corresponds to moving along one coordinate axis at a time.
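
The definition translates almost literally into code. The sketch below is one possible way to write it: the helper `partial_derivative` and the example function `g` are hypothetical names introduced only for illustration, and the limit is replaced by a small fixed step `h`.

```python
def partial_derivative(func, inputs, index, h=1e-4):
    """Forward-difference estimate of the partial derivative of func
    with respect to inputs[index] (hypothetical helper)."""
    bumped = list(inputs)   # copy, so the caller's values are left untouched
    bumped[index] += h      # perturb only one coordinate
    return (func(*bumped) - func(*inputs)) / h

# Hypothetical example function, used only to exercise the helper
def g(x, y):
    return x**2 * y

point = [3.0, 2.0]
dg_dx = partial_derivative(g, point, 0)  # analytic value: 2*x*y = 12
dg_dy = partial_derivative(g, point, 1)  # analytic value: x**2 = 9
print(dg_dx, dg_dy)
```

Working on a copy of the inputs avoids having to reset variables by hand between checks.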

Consider a function $d$ defined by three scalar inputs $a$, $b$, and $c$:

$$d = a \cdot b + c$$

By performing a local sensitivity analysis, it is possible to quantify how a microscopic change in any single input affects the output $d$, while keeping all other variables constant. This is achieved through partial derivatives:

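| Perturbed input | Partial derivative | Interpretation |
|---|---|---|
| $a$ | $\frac{\partial d}{\partial a} = b$ | If $a$ is increased by a tiny amount $h$, the output changes by approximately $b \cdot h$. The sign and magnitude of the change depend on the current value of $b$. |
| $b$ | $\frac{\partial d}{\partial b} = a$ | If $b$ is increased by a tiny amount $h$, the output changes by approximately $a \cdot h$. The sign and magnitude of the change depend on the current value of $a$. |
| $c$ | $\frac{\partial d}{\partial c} = 1$ | Since $c$ is added directly, a tiny change in $c$ produces the same tiny change in $d$, regardless of the values of $a$ or $b$. |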
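
These three results follow directly from the limit definition. As a worked sketch, assuming $d = a \cdot b + c$ as in the code that follows:

$$
\begin{aligned}
\frac{\partial d}{\partial a} &= \lim_{h \to 0} \frac{(a + h)\,b + c - (a\,b + c)}{h} = \lim_{h \to 0} \frac{h\,b}{h} = b \\
\frac{\partial d}{\partial b} &= \lim_{h \to 0} \frac{a\,(b + h) + c - (a\,b + c)}{h} = \lim_{h \to 0} \frac{a\,h}{h} = a \\
\frac{\partial d}{\partial c} &= \lim_{h \to 0} \frac{a\,b + (c + h) - (a\,b + c)}{h} = \lim_{h \to 0} \frac{h}{h} = 1
\end{aligned}
$$

In each case $h$ cancels before the limit is taken, so for this function the difference quotient does not depend on the step size in exact arithmetic.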

To check these analytical slopes, the same “rise over run” method used for scalar functions can be applied to each variable individually:

h = 0.0001

# Evaluation point for the local sensitivity checks
a = 2.0
b = -3.0
c = 10.0

# Sensitivity analysis for 'a'
d1 = a * b + c
a += h
d2 = a * b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)  # approximates b (-3.0)
# Output:
# d1 4.0
# d2 3.999699999999999
# slope -3.000000000010772

# Sensitivity analysis for 'b'
a = 2.0  # reset to the initial state
d1 = a * b + c
b += h
d2 = a * b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)  # approximates a (2.0)
# Output:
# d1 4.0
# d2 4.0002
# slope 2.0000000000042206

# Sensitivity analysis for 'c'
b = -3.0  # reset to the initial state
d1 = a * b + c
c += h
d2 = a * b + c
print('d1', d1)
print('d2', d2)
print('slope', (d2 - d1)/h)  # approximates 1.0
# Output:
# d1 4.0
# d2 4.0001
# slope 0.9999999999976694
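
Because the expression `a * b + c` is linear in each input taken separately, the forward difference equals the partial derivative exactly in exact arithmetic; the trailing digits in the slopes printed above are floating-point rounding noise. One way to see this is to repeat the check with exact rational numbers. This is a minimal sketch using Python's standard `fractions.Fraction`; wrapping the expression in a function named `d` is an assumption made here for convenience.

```python
from fractions import Fraction

def d(a, b, c):
    # same expression as above, wrapped in a function (hypothetical name)
    return a * b + c

# Same evaluation point and step size, stored as exact rationals
a, b, c = Fraction(2), Fraction(-3), Fraction(10)
h = Fraction(1, 10000)

slope_a = (d(a + h, b, c) - d(a, b, c)) / h
slope_b = (d(a, b + h, c) - d(a, b, c)) / h
slope_c = (d(a, b, c + h) - d(a, b, c)) / h
print(slope_a, slope_b, slope_c)  # -3 2 1
```

Passing the perturbed value as an argument, instead of mutating `a`, `b`, or `c` in place, also removes the need for the manual resets between checks.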