Derivatives Refresh

Read the Overview First

This note is part of the Micrograd series. Before continuing, read the Micrograd overview: it explains how these Markdown notes are organized, how the code carries across notes, and how to read or reproduce the material step by step.

Derivative of a simple function with one input

Before jumping into neural networks, it is necessary to build (or refresh) a solid intuition for what a derivative actually is. A simple, visual example is a helpful starting point.

In this first step, a basic quadratic function is considered:

f (x) = 3 x^{2} - 4 x + 5

The function is first evaluated at one specific input value, and then NumPy and Matplotlib are used to plot it over a range of values to reveal its shape (a parabola).

Observing the curve helps illustrate how a small bump in the input ($x$) changes the output ($y$), which is the core idea behind backpropagation.

import numpy as np
import matplotlib.pyplot as plt

# simple scalar function
def f(x):
    return 3*x**2 - 4*x + 5

f(3.0)

20.0

xs = np.arange(-5, 5, 0.25)
ys = f(xs)  # vectorized evaluation of f on the array xs
plt.plot(xs, ys)

To measure the slope of the curve at a specific point $x_{0}$, we slightly perturb the input and observe how much the output changes. A scalar function $f (x)$ is differentiable at $x_{0}$ if the following limit exists and is finite:

f^{'} (x_{0}) = \lim_{h \to 0} \frac{f (x_{0} + h) - f (x_{0})}{h}

In exact mathematics, the derivative is obtained by taking the limit as $h \to 0$. In code, however, the limit is approximated by choosing $h$ as a tiny positive number. This gives a finite difference approximation.

The finite difference ratio

\frac{f ( x + h ) - f ( x )}{h}

measures how much the output changes per unit of input change. Geometrically, this is the familiar rise over run ratio: the change in output divided by the change in input. This gives the local slope of the function at the chosen point, and it is the same basic idea that later becomes the local sensitivity used in backpropagation.

For the function considered above,

f (x) = 3 x^{2} - 4 x + 5

the symbolic derivative is obtained by applying the usual differentiation rules:

f^{'} (x) = 6 x - 4
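Spelled out term by term, using linearity, the power rule, and the fact that a constant has zero derivative, the step reads:

```latex
\begin{aligned}
f^{'}(x) &= \frac{d}{dx}\left(3 x^{2}\right) - \frac{d}{dx}\left(4 x\right) + \frac{d}{dx}\left(5\right) \\
         &= 3 \cdot 2 x - 4 \cdot 1 + 0 \\
         &= 6 x - 4
\end{aligned}
```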

The symbolic result can be compared with a finite difference approximation at different points. In each case, the numerical check uses the same rise over run computation:

h = 0.000001
x = ...  # choose one of: 2/3, 3.0, -3.0
(f(x + h) - f(x))/h  # finite difference: rise over run

| Point $x$ | Symbolic derivative | Finite difference result ($h = 10^{-6}$) | Interpretation |
| --- | --- | --- | --- |
| $\frac{2}{3}$ | $f^{'} (\frac{2}{3}) = 6 \cdot \frac{2}{3} - 4 = 0$ | `2.999378523327323e-06` | Extremely close to zero. This is the vertex of the parabola, where the slope is exactly zero. |
| $3.0$ | $f^{'} (3) = 6 \cdot 3 - 4 = 14$ | `14.000003002223593` | Very close to $14$, confirming a positive slope at this point. |
| $-3.0$ | $f^{'} (-3) = 6 \cdot (-3) - 4 = -22$ | `-21.999997002808414` | Very close to $-22$, confirming a negative slope at this point. |
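The three checks in the table can also be run in a single loop. The following is a minimal sketch rather than the note's original code: the helper names `df_symbolic` and `forward_difference` are introduced here purely for illustration.

```python
# Illustrative sketch: compare the symbolic derivative with a forward difference.
# The helper names below are assumptions made for this example.

def f(x):
    return 3 * x**2 - 4 * x + 5

def df_symbolic(x):
    # Derivative obtained from the differentiation rules: f'(x) = 6x - 4
    return 6 * x - 4

def forward_difference(func, x, h=1e-6):
    # Rise over run with a small positive step h
    return (func(x + h) - func(x)) / h

results = []
for x in (2 / 3, 3.0, -3.0):
    exact = df_symbolic(x)
    approx = forward_difference(f, x)
    results.append((x, exact, approx))
    print(f"x = {x:+.4f}  symbolic = {exact:+.6f}  finite difference = {approx:+.6f}")
```

At each point the two columns should agree to several decimal places, which is the same agreement the table reports.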

Note

The finite difference results are not exactly equal to the symbolic values because the derivative is being approximated with a finite value of $h$ and floating-point arithmetic. Still, the numerical checks agree with the symbolic derivative.
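For this particular quadratic, the size of the approximation error can be worked out by hand. Expanding the finite difference ratio for $f (x) = 3 x^{2} - 4 x + 5$ gives:

```latex
\begin{aligned}
\frac{f(x + h) - f(x)}{h}
  &= \frac{3 (x + h)^{2} - 4 (x + h) + 5 - \left(3 x^{2} - 4 x + 5\right)}{h} \\
  &= \frac{6 x h + 3 h^{2} - 4 h}{h} \\
  &= 6 x - 4 + 3 h
\end{aligned}
```

In exact arithmetic the forward difference therefore overshoots the symbolic derivative by $3 h$. With $h = 10^{-6}$ that is an offset of about $3 \cdot 10^{-6}$, which matches the table values up to floating-point rounding.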

Finite Precision and $h$

In exact arithmetic, decreasing $h$ would make the finite difference approximation converge to the symbolic derivative. For example, at $x = 3$, the finite difference approximation should approach the symbolic value $f^{'} (3) = 14$ as $h \to 0$.

In floating-point arithmetic, however, choosing $h$ too small can produce an output that is exactly `0.0`. This happens because numbers are stored with finite precision: if $h$ is below the resolution available around the current value of $x$, then `x + h` may be rounded back to `x`, making `f(x + h) - f(x)` equal to zero. That exact `0.0` is therefore a numerical precision artifact, not a more faithful computation of the limit.
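The effect is easy to observe directly. The sketch below evaluates the same forward difference at $x = 3$ for a few shrinking step sizes; the specific values of `h` are chosen for this illustration only, and the behaviour described assumes standard 64-bit floats.

```python
# Illustrative sketch: the forward difference at x = 3 for shrinking step sizes.
# The specific values of h are chosen for this example only.

def f(x):
    return 3 * x**2 - 4 * x + 5

x = 3.0
for h in (1e-4, 1e-8, 1e-12, 1e-17):
    step_is_lost = (x + h == x)  # True when h is below the float resolution near x
    slope = (f(x + h) - f(x)) / h
    print(f"h = {h:.0e}  x + h == x: {step_is_lost}  slope = {slope}")
```

For the last step, `h = 1e-17` is smaller than the spacing between representable 64-bit floats near `3.0` (roughly `4.4e-16`), so `x + h == x` is `True` and the computed slope is exactly `0.0`.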

Derivative of a function with multiple inputs

As complexity increases, functions typically involve multiple independent variables. Understanding how each input contributes to the final output is the foundation of backpropagation.

Formally, given a scalar function of multiple variables

f (x_{1}, x_{2}, \dots, x_{n})

the partial derivative with respect to the variable $x_{i}$ is defined as:

\frac{\partial f}{\partial x_{i}} (x_{1}, x_{2}, \dots, x_{n}) = \lim_{h \to 0} \frac{f (x_{1}, \dots, x_{i} + h, \dots, x_{n}) - f (x_{1}, \dots, x_{i}, \dots, x_{n})}{h}

provided that the limit exists. In other words, only $x_{i}$ is perturbed, while all the other variables are kept constant. From the point of view of directional derivatives, this corresponds to moving along one coordinate axis at a time.

Consider a function $d$ of three scalar inputs $a$, $b$, and $c$:

d (a, b, c) = a \cdot b + c

By performing a local sensitivity analysis, it is possible to quantify how a microscopic change in any single input affects the output $d$, while keeping all other variables constant. This is achieved through partial derivatives:

\frac{\partial d}{\partial a}, \frac{\partial d}{\partial b}, \frac{\partial d}{\partial c}

| Perturbed input | Partial derivative | Interpretation |
| --- | --- | --- |
| $a$ | $\frac{\partial d}{\partial a} = b$ | If $a$ is increased by a tiny amount $h$, the output $d$ changes by approximately $b \cdot h$. The sign and magnitude of the change depend on the current value of $b$. |
| $b$ | $\frac{\partial d}{\partial b} = a$ | If $b$ is increased by a tiny amount $h$, the output $d$ changes by approximately $a \cdot h$. The sign and magnitude of the change depend on the current value of $a$. |
| $c$ | $\frac{\partial d}{\partial c} = 1$ | Since $c$ is added directly, a tiny change in $c$ produces the same tiny change in $d$, regardless of the values of $a$ or $b$. |
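These three results follow directly from the limit definition. Substituting $d (a, b, c) = a \cdot b + c$ into the difference quotient for each input in turn gives:

```latex
\begin{aligned}
\frac{d(a + h, b, c) - d(a, b, c)}{h} &= \frac{(a + h) b + c - (a b + c)}{h} = \frac{h b}{h} = b \\
\frac{d(a, b + h, c) - d(a, b, c)}{h} &= \frac{a (b + h) + c - (a b + c)}{h} = \frac{a h}{h} = a \\
\frac{d(a, b, c + h) - d(a, b, c)}{h} &= \frac{a b + (c + h) - (a b + c)}{h} = \frac{h}{h} = 1
\end{aligned}
```

Because $d$ is linear in each input separately, every quotient is independent of $h$, so the limit is immediate. In a numerical check of this function, any deviation from these values comes from floating-point rounding alone.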

To check these analytical slopes, the same “rise over run” method used for the single-input function can be applied to each variable individually:

h = 0.0001
 
# Evaluation point for the local sensitivity checks
a = 2.0
b = -3.0
c = 10.0

# Sensitivity analysis for 'a'
d1 = a * b + c
a += h
d2 = a * b + c
print('d1',d1)
print('d2',d2)
print('slope', (d2 - d1)/h)  # approximates b (-3.0)

d1 4.0 
d2 3.999699999999999 
slope -3.000000000010772

Intuition for the Negative Sign

For the check with respect to $a$, the second value is $d_{2} = (a + h) b + c$, while the original value is $d_{1} = ab + c$. Since in this example $b = -3.0$, increasing $a$ makes the product $(a + h) b$ more negative. In other words, a slightly larger quantity is subtracted from $c$, so $d_{2} - d_{1}$ is negative and the derivative with respect to $a$ has a negative sign.

# Sensitivity analysis for 'b'
a = 2.0 # reset to the initial state
d1 = a * b + c
b += h
d2 = a * b + c
print('d1',d1)
print('d2',d2)
print('slope', (d2 - d1)/h)  # approximates a (2.0)

d1 4.0 
d2 4.0002 
slope 2.0000000000042206

# Sensitivity analysis for 'c'
b = -3.0 # reset to the initial state
d1 = a * b + c
c += h
d2 = a * b + c
print('d1',d1)
print('d2',d2)
print('slope', (d2 - d1)/h)  # approximates 1.0

d1 4.0
d2 4.0001
slope 0.9999999999976694
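The three checks above share the same pattern: perturb one input, recompute, divide by $h$, then reset. As a sketch under stated assumptions, that pattern can be wrapped in a small helper. The name `numerical_partial` and its signature are hypothetical and are not part of the original notes.

```python
# Illustrative sketch: estimate every partial derivative of d without mutating inputs.
# `numerical_partial` is a hypothetical helper written for this example.

def d(a, b, c):
    return a * b + c

def numerical_partial(func, inputs, index, h=1e-4):
    # Perturb only the input at `index`; all others keep their original values.
    bumped = list(inputs)
    bumped[index] += h
    return (func(*bumped) - func(*inputs)) / h

point = (2.0, -3.0, 10.0)             # same evaluation point: a, b, c
expected = (point[1], point[0], 1.0)  # analytical partials: b, a, 1

slopes = [numerical_partial(d, point, i) for i in range(len(point))]
for name, slope, exact in zip("abc", slopes, expected):
    print(f"slope wrt {name}: numerical = {slope:.6f}, analytical = {exact:.6f}")
```

Because the helper perturbs a copy of the inputs, no manual reset of `a` or `b` is needed between checks.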

Deep Learning: Zero to Hero
