(Pure) Stochastic Gradient Descent

Full-batch gradient descent updates the parameters only after computing the gradient of the average loss over the entire dataset. Every single step therefore has to look at all $N$ training examples before it can move the parameters at all. When $N$ is large, this makes each step expensive and learning slow.

The trade-off SGD makes

Pure stochastic gradient descent replaces the exact full-dataset gradient with the gradient of one training example, drawn at random. The cost per step collapses from $O (N)$ to $O (1)$ , but the step no longer follows the true gradient: it follows a noisy, randomly chosen estimate of it. The rest of this note makes that estimate precise. It is unbiased, meaning that on average it equals the true gradient, and its very noise turns out to help on the nonconvex landscapes of deep learning.

The objective being minimised

Using the notation of the neural networks vocabulary, the training set is

D = \{(x^{(i)}, y^{(i)})\}_{i = 1}^{N},

and the objective is the empirical risk, the average of the per-example losses,

L (θ) = \frac{1}{N} \sum_{i = 1}^{N} L_{x^{(i)}} (θ),

where $L_{x^{(i)}} (θ)$ is the loss contributed by the single pair $(x^{(i)}, y^{(i)})$ .

A concrete per-example loss

For mean squared error, the loss of one example is
$L_{x^{(i)}} (θ) = \frac{1}{2} \left\| f_{θ} (x^{(i)}) - y^{(i)} \right\|^{2},$
where $f_{θ} (x^{(i)})$ is the network’s output on input $x^{(i)}$ .

By linearity of differentiation, the gradient that full-batch descent would use is the average of the per-example gradients,

\nabla_{θ} L (θ) = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{θ} L_{x^{(i)}} (θ) .

This is the exact quantity that SGD will replace by a single random term of the sum.
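The objective and its gradient can be made concrete with a small sketch. It assumes a scalar linear model $f_{θ} (x) = w x + b$ with $θ = (w, b)$ and a made-up four-example dataset; neither comes from this note, and the function names are illustrative only.

```python
# Toy dataset of N = 4 (x, y) pairs; the values are made up for illustration.
DATA = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]


def per_example_loss(theta, example):
    """Squared-error loss of one example for the assumed model f(x) = w*x + b."""
    w, b = theta
    x, y = example
    residual = w * x + b - y
    return 0.5 * residual ** 2


def per_example_grad(theta, example):
    """Gradient of the per-example loss with respect to (w, b)."""
    w, b = theta
    x, y = example
    residual = w * x + b - y
    return (residual * x, residual)


def empirical_risk(theta, data):
    """Average of the per-example losses over the whole dataset."""
    return sum(per_example_loss(theta, ex) for ex in data) / len(data)


def full_gradient(theta, data):
    """Average of the per-example gradients: the full-batch gradient."""
    grads = [per_example_grad(theta, ex) for ex in data]
    n = len(data)
    return tuple(sum(g[k] for g in grads) / n for k in range(2))
```

Every call to `full_gradient` touches all $N$ examples, which is the $O (N)$ cost per step that SGD avoids by calling `per_example_grad` once instead.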

How a step chooses its example

A single SGD step uses one training example, identified by its index. Let $i_{t} \in \{1, \dots, N\}$ denote the index used at step $t$. In pure SGD a step, an iteration, and a parameter update are the same event: one example in, one update out.

The index is supplied by reshuffling the dataset every epoch. An epoch is one full pass through the dataset. At the start of epoch $e$ a fresh random permutation $π_{e}$ of $\{1, \dots, N\}$ is drawn, and the epoch visits the examples in that order, $π_{e} (1), π_{e} (2), \dots, π_{e} (N)$. Each step then carries two numbers: its global index $t$, and its position $s$ inside the current epoch $e$. The two are related by

t = (e - 1) N + s, \quad s \in \{1, \dots, N\}, \quad \text{so that} \quad i_{t} = π_{e} (s) .

The permutation is drawn once per epoch; the $N$ steps that follow consume its indices in order, and after the $N$ -th step the epoch closes and a new permutation is drawn.
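The index stream described above can be sketched as a generator. This is a minimal illustration using Python's standard `random` module; the function name and the `seed` argument are assumptions made here, not part of the note.

```python
import random


def reshuffled_indices(n, num_epochs, seed=0):
    """Yield (t, e, s, i_t) under per-epoch reshuffling, 1-based as in the text."""
    rng = random.Random(seed)
    for e in range(1, num_epochs + 1):
        perm = list(range(1, n + 1))
        rng.shuffle(perm)  # fresh permutation pi_e, drawn once per epoch
        for s, i_t in enumerate(perm, start=1):
            t = (e - 1) * n + s  # global step counter
            yield t, e, s, i_t


steps = list(reshuffled_indices(n=4, num_epochs=3))
```

Each consecutive block of $N$ entries in `steps` contains every index exactly once, and the global counter `t` runs on without resetting between epochs.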

Two epochs over a four-example dataset

Take $N = 4$ , with examples indexed $1, 2, 3, 4$ , and suppose the first two permutations drawn are $π_{1} = (3, 1, 4, 2)$ and $π_{2} = (2, 4, 1, 3)$ . The global step counter then unrolls as follows.

step $t$	epoch $e$	position $s$	example $i_{t} = π_{e} (s)$
1	1	1	3
2	1	2	1
3	1	3	4
4	1	4	2
5	2	1	2
6	2	2	4
7	2	3	1
8	2	4	3

Two facts are visible at once:

within an epoch every index appears exactly once, so one epoch uses every example exactly once, in a random order;

$t = (e - 1) N + s$ is just the dictionary between a global step and its (epoch, position) pair: step $t = 6$ , for instance, is the $s = 2$ step of epoch $e = 2$ , so $i_{6} = π_{2} (2) = 4$ .

What happens between epochs is just as telling. The same four examples reappear in epoch 2, but under a new permutation, so an example is revisited across epochs, in a different order each time, and never twice within a single epoch. Training for several epochs is exactly this: the dataset reused again and again, reshuffled each time.
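The dictionary between a global step and its (epoch, position) pair is a single integer division. The sketch below hard-codes the two permutations of the worked example; the helper names `locate` and `example_index` are illustrative.

```python
def locate(t, n):
    """Map a 1-based global step t to its 1-based (epoch, position) pair."""
    e, s = divmod(t - 1, n)
    return e + 1, s + 1


# The two permutations assumed in the worked example above.
PERMS = {1: (3, 1, 4, 2), 2: (2, 4, 1, 3)}


def example_index(t, n, perms):
    """Return i_t = pi_e(s) for global step t."""
    e, s = locate(t, n)
    return perms[e][s - 1]


# Rebuild the rows (t, e, s, i_t) for the eight steps of the example.
table = [(t, *locate(t, 4), example_index(t, 4, PERMS)) for t in range(1, 9)]
```

Here `locate(6, 4)` gives `(2, 2)` and `example_index(6, 4, PERMS)` gives `4`, matching the row for step $t = 6$.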

Shuffling is far from a cosmetic detail.

Datasets are seldom stored in random order: examples tend to arrive grouped by class, by source, or by the time they were collected. Read in that fixed order, the consecutive single-example gradients stop pointing in scattered directions and start pointing the same way, because neighbouring examples resemble one another. The update then becomes a coherent drift that fits one region of the data and is partly undone when the next block arrives, and the trajectory ends up tracing the storage order of the dataset rather than descending its average loss. A fresh permutation each epoch removes this, and the benefit runs deeper than the order merely looking random. Decorrelating successive steps lets their fluctuations cancel over a short window instead of compounding in one direction, so the path follows the average loss rather than chasing whatever block is currently being read.

Reshuffling also severs a silent coupling: under a fixed order a given example is always visited at the same point of every epoch, and therefore always meets the same learning rate, the same schedule phase, the same neighbours. Reshuffling makes that treatment uniform across the dataset, so no example is systematically privileged or penalised over training.

A counterintuitive payoff: shuffling can beat independent sampling

Reshuffling without replacement is not merely a convenient stand-in for the independent draws assumed in the analysis below. In many settings it converges faster. Forcing every example to appear exactly once per epoch makes the gradient accumulated over a whole epoch lower in variance than under independent sampling, which may draw some examples twice and miss others across the same span. In the extreme case where the parameters are held fixed, the per-example gradients of one epoch average to the exact full-dataset gradient, with no sampling variance left at all; during training the parameters move between steps, so the cancellation is only approximate. The clean independent model is the easier one to analyse, yet the messier scheme that real training uses is frequently the better optimiser. This gap is the subject of the random reshuffling literature.
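The variance comparison can be checked exactly on a toy case. The sketch below holds $θ$ fixed, so it ignores the parameter movement that happens during real training, and it uses four made-up scalar per-example gradients. It enumerates every possible epoch under both sampling schemes and computes the variance of the epoch-averaged gradient with exact fractions.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical per-example gradients (one scalar coordinate) at a fixed theta.
GRADS = [Fraction(g) for g in (-2, -6, -15, -16)]
N = len(GRADS)
FULL = sum(GRADS) / N  # the full-dataset gradient


def variance(values):
    """Population variance of equally likely outcomes."""
    values = list(values)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def epoch_mean(indices):
    """Average of the gradients visited during one epoch of N steps."""
    return sum(GRADS[i] for i in indices) / N


# Without replacement: all N! orderings of the dataset are equally likely.
var_reshuffled = variance(epoch_mean(p) for p in permutations(range(N)))

# With replacement: all N**N index sequences are equally likely.
var_independent = variance(epoch_mean(d) for d in product(range(N), repeat=N))
```

At fixed parameters `var_reshuffled` is exactly zero, because every ordering averages the same four numbers, while `var_independent` is strictly positive.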

The update, and why its gradient is random

Once $i_{t}$ is fixed, the step feeds example $x^{(i_{t})}$ through the network, computes its per-example gradient by backpropagation, and steps downhill along it:

θ^{(t + 1)} = θ^{(t)} - η \nabla_{θ} L_{x^{(i_{t})}} (θ^{(t)}) .

Written coordinate by coordinate over the weights $w_{k}$ and biases $b_{ℓ}$ collected in $θ$ , the same rule reads

w_{k}^{(t + 1)} = w_{k}^{(t)} - η \frac{\partial L_{x^{(i_{t})}}}{\partial w_{k}}, \qquad b_{ℓ}^{(t + 1)} = b_{ℓ}^{(t)} - η \frac{\partial L_{x^{(i_{t})}}}{\partial b_{ℓ}} .

These are not two rules but one: the compact $θ$ -update written out per coordinate.

The decisive feature of this update is that its gradient is not a fixed vector. Because the index $i_{t}$ is drawn at random, $\nabla_{θ} L_{x^{(i_{t})}} (θ)$ is a random vector, and every property of SGD that follows is a statement about how that random vector relates to the true gradient.
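A minimal sketch of the update loop follows, assuming a scalar linear model $f_{θ} (x) = w x + b$ whose gradient is available in closed form, so no backpropagation is needed. The dataset is made up and lies exactly on the line $y = 2 x + 1$. The learning rate, epoch count, and function names are illustrative choices, not values from this note.

```python
import random

# Made-up, noise-free data on the line y = 2x + 1.
DATA = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def per_example_grad(theta, example):
    """Gradient of 0.5 * (w*x + b - y)**2 with respect to (w, b)."""
    w, b = theta
    x, y = example
    residual = w * x + b - y
    return (residual * x, residual)


def sgd_step(theta, example, lr):
    """One pure-SGD update: theta <- theta - lr * per-example gradient."""
    grad = per_example_grad(theta, example)
    return tuple(p - lr * g for p, g in zip(theta, grad))


def train(data, lr=0.05, num_epochs=1000, seed=0):
    """Run pure SGD with a fresh permutation at the start of every epoch."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    for _ in range(num_epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:  # one example in, one update out
            theta = sgd_step(theta, data[i], lr)
    return theta


w, b = train(DATA)
```

Because the toy data are fitted exactly by one line, every per-example gradient vanishes at the solution and the iterates settle near $w = 2$, $b = 1$. On noisy data a constant learning rate would instead leave the iterates fluctuating around the minimiser.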

Why the per-example gradient is a random vector

Fix the parameters at a value $θ$ and consider the map that sends an index to its gradient,
$φ_{θ} : \{1, \dots, N\} \to \mathbb{R}^{m}, \quad i \mapsto \nabla_{θ} L_{x^{(i)}} (θ) .$
This map is deterministic: each of the $N$ indices returns one specific gradient vector. The randomness enters only through which index is drawn. Since the sampling rule makes $i_{t}$ a random variable, the gradient actually used, $φ_{θ} (i_{t})$, is a fixed function evaluated at a random argument, and a function of a random variable is again a random variable. Because $φ_{θ}$ takes its values in $\mathbb{R}^{m}$, that random variable is vector-valued: a random vector. It is a vector because there is one component per parameter, and random because its argument $i_{t}$ is random; the two properties are independent.

The per-example gradient is unbiased

The link between the noisy single-example gradient and the true gradient is exact, and it is the central theoretical justification for SGD. It is cleanest to state under the idealised model of uniform sampling with replacement, in which every step draws

P (i_{t} = i) = \frac{1}{N}, \quad i = 1, \dots, N,

independently of the other steps.

Warning

This model is an idealisation, and one feature of it deserves to be named, because it is the first thing that looks wrong about it: since the draws are independent, the same example can be picked at two different steps while another is skipped in between, which real training deliberately avoids. It is adopted here for a single reason: independence makes the expectation below transparent. The later box on the practical sampling regimes shows that the conclusion is unchanged under the no-repeat sampling that actual training uses.

Collect the step’s gradient into the random vector

G_{t} (θ) = \nabla_{θ} L_{x^{(i_{t})}} (θ) .

Its expectation is taken over the random index $i_{t}$ , componentwise across the $m$ coordinates of $θ$ . By the definition of expectation for a discrete random variable, and then by uniformity,

E [G_{t} (θ)] = \sum_{i = 1}^{N} P (i_{t} = i) \nabla_{θ} L_{x^{(i)}} (θ) = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{θ} L_{x^{(i)}} (θ) .

The right-hand side is exactly the full-dataset gradient written down earlier, so

E [G_{t} (θ)] = \nabla_{θ} L (θ) .

The unbiasedness result

The single-example gradient is an unbiased estimator of the full-dataset gradient: averaged over the random index, it equals the true gradient exactly. This is what makes SGD a principled method rather than a cheap heuristic. In expectation, an SGD step taken from a given $θ$ is the full-batch step taken from the same $θ$, even though no individual step need match its full-batch counterpart.
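The identity can be verified exactly on a toy problem. The sketch below assumes a scalar linear model $f_{θ} (x) = w x + b$ with squared-error loss and a made-up integer dataset. It computes the expectation from its definition as a probability-weighted sum, and separately differentiates the averaged loss by hand, so the two sides are obtained by different routes. Exact fractions avoid any rounding.

```python
from fractions import Fraction

# Made-up integer dataset, so that all arithmetic stays exact.
DATA = [(1, 2), (2, 3), (3, 5), (4, 4)]


def per_example_grad(theta, example):
    """Gradient of 0.5 * (w*x + b - y)**2 with respect to (w, b)."""
    w, b = theta
    x, y = example
    residual = w * x + b - y
    return (residual * x, residual)


def expected_sgd_gradient(theta, data):
    """E[G_t] from the definition: sum over i of P(i_t = i) * grad_i."""
    p = Fraction(1, len(data))  # uniform sampling
    grads = [per_example_grad(theta, ex) for ex in data]
    return tuple(sum(p * g[k] for g in grads) for k in range(2))


def full_gradient_closed_form(theta, data):
    """Gradient of the averaged loss, differentiated by hand for this model."""
    w, b = theta
    n = len(data)
    mean_x = Fraction(sum(x for x, _ in data), n)
    mean_y = Fraction(sum(y for _, y in data), n)
    mean_xx = Fraction(sum(x * x for x, _ in data), n)
    mean_xy = Fraction(sum(x * y for x, y in data), n)
    return (w * mean_xx + b * mean_x - mean_xy, w * mean_x + b - mean_y)


theta = (Fraction(1, 2), Fraction(-1))
lhs = expected_sgd_gradient(theta, DATA)
rhs = full_gradient_closed_form(theta, DATA)
```

The two tuples `lhs` and `rhs` agree exactly, although each of the four individual per-example gradients at this `theta` differs from them.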

Unbiased is not the same as accurate

The equality holds only in expectation. For any single realised draw of $i_{t}$ , the one-example gradient can point well away from the true gradient; the agreement is recovered only as an average over many steps. Reading $E [G_{t}] = \nabla L$ as $G_{t} \approx \nabla L$ at each step is the most common early misconception about SGD.

Drawing the example: the practical regimes

The proof above assumed uniform sampling with replacement ( $i_{t}$ drawn i.i.d., $P (i_{t} = i) = 1/ N$ ). That model was chosen for one reason: independent draws make every step’s distribution identical and free of history, so the expectation of a single step is the whole argument. It is a convenience, not a requirement, and the regime used in practice inherits the same conclusion.

Without replacement (random reshuffling). Real training shuffles once per epoch, so within an epoch the examples are drawn without replacement. This does not weaken the result. Each position of a uniformly random permutation is itself marginally uniform: of the $N!$ equally likely orderings, the $(N - 1)!$ that fix example $i$ at position $s$ give $P (π_{e} (s) = i) = (N - 1)! / N! = 1/ N$, so the per-step expectation $E [G_{t}] = \nabla_{θ} L$ still holds at every step. Reshuffling only couples the steps inside an epoch, and that coupling is benign: a complete epoch touches every example exactly once, so at a fixed $θ$ the per-example gradients averaged over one epoch reproduce the true gradient exactly, not merely in expectation. This is also the regime motivated earlier, where a fresh permutation breaks the correlations of a fixed traversal order. The two benefits are independent: one keeps the estimate unbiased, the other avoids a deterministic ordering.
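The counting argument can be confirmed by brute force for small $N$. This sketch enumerates all $N!$ permutations and tabulates how often each example lands at each position; the function name is illustrative.

```python
from fractions import Fraction
from itertools import permutations


def marginal_probabilities(n):
    """P(pi(s) = i) for every position s and example i, by full enumeration."""
    perms = list(permutations(range(1, n + 1)))
    counts = {(s, i): 0 for s in range(1, n + 1) for i in range(1, n + 1)}
    for perm in perms:
        for s, i in enumerate(perm, start=1):
            counts[(s, i)] += 1
    return {key: Fraction(c, len(perms)) for key, c in counts.items()}


probs = marginal_probabilities(4)
```

For $N = 4$ each of the sixteen (position, example) pairs occurs in $3! = 6$ of the $24$ orderings, so every entry of `probs` equals $1/4$.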

Non-uniform sampling. The draw need not be uniform. If example $i$ is taken with probability $p_{i} > 0$, with $\sum_{i = 1}^{N} p_{i} = 1$, then the reweighted gradient $\frac{1}{N p_{i_{t}}} \nabla_{θ} L_{x^{(i_{t})}} (θ)$ is again unbiased for $\nabla_{θ} L (θ)$, since its expectation is
$\sum_{i = 1}^{N} p_{i} \cdot \frac{1}{N p_{i}} \nabla_{θ} L_{x^{(i)}} (θ) = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{θ} L_{x^{(i)}} (θ) .$
This is what importance sampling exploits: drawing examples with larger gradients more often and reweighting accordingly can reduce the variance of the estimate without disturbing its mean.
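Both claims, the unchanged mean and the reduced variance, can be checked exactly on a toy case. The sketch assumes four made-up scalar per-example gradients at a fixed $θ$ and compares uniform sampling with probabilities proportional to $| g_{i} |$. The scalar, same-sign setting is deliberately extreme; with vector gradients the variance would typically shrink rather than vanish.

```python
from fractions import Fraction

# Hypothetical per-example gradients (one scalar coordinate) at a fixed theta.
GRADS = [Fraction(g) for g in (-2, -6, -15, -16)]
N = len(GRADS)
FULL = sum(GRADS) / N  # the full-dataset gradient


def reweighted_moments(probs):
    """Exact mean and variance of g_i / (N * p_i) when i is drawn with prob p_i."""
    estimates = [g / (N * p) for g, p in zip(GRADS, probs)]
    mean = sum(p * e for p, e in zip(probs, estimates))
    var = sum(p * (e - mean) ** 2 for p, e in zip(probs, estimates))
    return mean, var


uniform = [Fraction(1, N)] * N
total = sum(abs(g) for g in GRADS)
proportional = [abs(g) / total for g in GRADS]  # p_i proportional to |g_i|

mean_u, var_u = reweighted_moments(uniform)
mean_p, var_p = reweighted_moments(proportional)
```

Both means equal `FULL`. Under the proportional scheme every reweighted estimate coincides with `FULL`, so `var_p` is zero, whereas `var_u` is the plain variance of the four gradients.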

Why the noise helps

Because one example need not represent the whole dataset, the gradient can swing substantially from step to step, and the SGD path is far less smooth than the full-batch one. Informally one writes

\nabla_{θ} L_{x^{(i_{t})}} (θ) \approx \nabla_{θ} L (θ),

but this is only a heuristic: the discrepancy at a single step can be large, and the approximation becomes exact only in expectation.

That roughness is not purely a defect.

Noise as a feature, not a bug

The sampling noise nudges the trajectory off any single deterministic path. On the nonconvex landscapes classified by the Hessian, this helps the iterates escape poor local minima and slip off saddle points, where a full-batch optimiser can stall because its gradient vanishes. The same noise that makes any one step inaccurate makes the whole trajectory explore more broadly. Beyond its lower cost per step, this is commonly cited as a reason SGD often generalises better than full-batch descent on deep networks.
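The saddle-point effect can be seen in a deliberately contrived two-example problem, sketched below. The two hypothetical per-example losses are $L_{\pm} (x, y) = \frac{1}{2} x^{2} - \frac{1}{2} y^{2} + \frac{1}{4} y^{4} \pm A y$; their average has a saddle at the origin and minima at $y = \pm 1$. The start point is chosen on the line $y = 0$, exactly where the full-batch gradient has no $y$ component. The constant `A`, the learning rate, and the step counts are arbitrary illustrative choices.

```python
import random

A = 0.5  # size of the opposite tilts on the two examples (made-up value)


def per_example_grad(theta, sign):
    """Gradient of L_sign(x, y) = x**2/2 - y**2/2 + y**4/4 + sign * A * y."""
    x, y = theta
    return (x, -y + y ** 3 + sign * A)


def full_grad(theta):
    """Average of the two per-example gradients; the tilts cancel."""
    g_plus = per_example_grad(theta, +1)
    g_minus = per_example_grad(theta, -1)
    return tuple((p + m) / 2 for p, m in zip(g_plus, g_minus))


def step(theta, grad, lr):
    return tuple(p - lr * g for p, g in zip(theta, grad))


def run_full_batch(theta, lr, num_steps):
    for _ in range(num_steps):
        theta = step(theta, full_grad(theta), lr)
    return theta


def run_sgd(theta, lr, num_epochs, seed=0):
    rng = random.Random(seed)
    for _ in range(num_epochs):
        signs = [+1, -1]
        rng.shuffle(signs)  # reshuffle the two examples every epoch
        for sign in signs:
            theta = step(theta, per_example_grad(theta, sign), lr)
    return theta


start = (1.0, 0.0)  # on the line y = 0, which leads straight into the saddle
gd_end = run_full_batch(start, lr=0.1, num_steps=4000)
sgd_end = run_sgd(start, lr=0.1, num_epochs=2000)
```

Full-batch descent keeps $y$ at exactly zero and slides into the saddle at the origin. The single-example steps push $y$ off that line, after which the curvature carries the iterate towards one of the minima near $y = \pm 1$. This is a toy construction and says nothing quantitative about real networks.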

"SGD" in everyday usage

In practice the term SGD is used loosely to include mini-batch SGD, which averages the gradient over a small batch of examples per step. Strictly, pure SGD is the case of exactly one example per update, the version analysed here.
