Mini-batch SGD

Full-batch gradient descent and pure SGD sit at opposite ends of a single dial:

Full-batch descent computes the gradient of the average loss over all $N$ examples before each update: the direction is exact, but every step costs $O (N)$ .
Pure SGD updates from one example at a time: each step costs $O (1)$ , but the direction is a noisy estimate of the true gradient.

Mini-batch SGD turns the dial to an intermediate setting, computing each update from a small subset of $B$ examples, called a mini-batch. It is the regime that essentially all modern deep-learning training uses.

Typical sizes are $B = 32$ , $64$ , or $128$ , though the best value depends on the model, the hardware, and the optimisation regime.

Why mini-batch is the practical default

Full-batch GD is unbiased but expensive: each update costs $O (N)$ . Pure SGD is unbiased but high-variance: each update costs $O (1)$ , yet its direction is very noisy. Mini-batch SGD sits between them, at cost $O (B)$ per update and a gradient variance that falls like $1/ B$ . Choosing $B$ trades gradient accuracy (favours large $B$ ) against per-step cost and exploration noise (favours small $B$ ). That trade-off, together with how well a batch maps onto parallel hardware, is why mini-batch SGD is the universal default rather than either extreme.

Why batch sizes are usually powers of two

Sizes such as $32$ , $64$ , $128$ are not special mathematically. They are favoured because they fit modern accelerators well.

On GPUs, parallel work is issued in small groups of threads. In NVIDIA terminology these groups are warps, and a warp holds $32$ threads. Batch sizes that are multiples of $32$ tend to keep the hardware fully occupied, leaving fewer partially idle thread groups.

There is also a memory-side reason: these sizes produce tensor shapes that align well with low-level kernels and high-throughput matrix units, which improves throughput.

This is a hardware-aware heuristic, not a law. The best batch size remains a compromise among optimisation behaviour, available memory, and hardware throughput, so values such as $96$ or $160$ can be entirely reasonable when they fit the model or the device better.

From a single example to a batch

Using the notation of the neural networks vocabulary, the training set and objective are as in pure SGD,

D = \{(x^{(i)}, y^{(i)})\}_{i = 1}^{N}, \qquad L (θ) = \frac{1}{N} \sum_{i = 1}^{N} L_{x^{(i)}} (θ) .

At step $t$ , rather than picking one index, a mini-batch step selects a whole index set

B_{t} \subseteq \{1, \dots, N\}, \qquad |B_{t}| = B, \qquad 1 < B < N .

The set $B_{t}$ simply names which examples this step uses. Stacking their inputs and targets row by row gives the batched tensors

X_{B_{t}} \in \mathbb{R}^{B \times d_{x}}, \qquad Y_{B_{t}} \in \mathbb{R}^{B \times d_{y}},

matching the batch notation of the neural networks vocabulary.

What a batch actually is

Take a dataset of $N = 6$ examples. A step with batch size $B = 3$ might draw the index set $B_{t} = \{2, 4, 5\}$ . This single symbol is just the list of rows this step will use: examples $2$ , $4$ , and $5$ . Their inputs are stacked into one matrix, their targets into another,
$X_{B_{t}} = \begin{bmatrix} x^{(2)} \\ x^{(4)} \\ x^{(5)} \end{bmatrix} \in \mathbb{R}^{3 \times d_{x}}, \qquad Y_{B_{t}} = \begin{bmatrix} y^{(2)} \\ y^{(4)} \\ y^{(5)} \end{bmatrix} \in \mathbb{R}^{3 \times d_{y}},$
where each $x^{(i)}$ occupies one row. A single forward pass now processes all three rows at once, which is exactly what makes a batch efficient on parallel hardware. The remaining examples $\{1, 3, 6\}$ are not used at this step; they appear in other batches.
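The stacking step can be sketched in plain Python. This is a minimal illustration, assuming a toy dataset with $d_{x} = 2$ and $d_{y} = 1$ whose values are invented for the example; `stack_batch` is a hypothetical helper, not a library function.

```python
# Toy dataset with N = 6 examples; the values are made up for illustration.
# Keys are the 1-based example indices used in the text.
inputs = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6],
          4: [0.7, 0.8], 5: [0.9, 1.0], 6: [1.1, 1.2]}
targets = {1: [0.0], 2: [1.0], 3: [0.0], 4: [1.0], 5: [1.0], 6: [0.0]}


def stack_batch(index_set, inputs, targets):
    """Stack the rows named by index_set into batched input and target matrices."""
    rows = sorted(index_set)
    X = [inputs[i] for i in rows]   # shape B x d_x, one example per row
    Y = [targets[i] for i in rows]  # shape B x d_y, one example per row
    return X, Y


batch = {2, 4, 5}
X_batch, Y_batch = stack_batch(batch, inputs, targets)
unused = sorted(set(inputs) - batch)
print(X_batch)  # three rows, one per selected example
print(unused)   # [1, 3, 6]
```

The index set only selects rows; the data itself is untouched, which is why the same dataset can be re-batched differently at every epoch.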

The mini-batch loss and update

The mini-batch loss is the average loss over the examples in the current batch, and its gradient is the matching average of the per-example gradients,

L_{B_{t}} (θ) = \frac{1}{B} \sum_{i \in B_{t}} L_{x^{(i)}} (θ), \qquad \nabla_{θ} L_{B_{t}} (θ) = \frac{1}{B} \sum_{i \in B_{t}} \nabla_{θ} L_{x^{(i)}} (θ) .

This batch gradient is used as a cheaper stand-in for the full gradient $\nabla_{θ} L (θ)$ , and the update is the familiar step along it,

θ^{(t + 1)} = θ^{(t)} - η \nabla_{θ} L_{B_{t}} (θ^{(t)}) = θ^{(t)} - η \frac{1}{B} \sum_{i \in B_{t}} \nabla_{θ} L_{x^{(i)}} (θ^{(t)}) .

Unpacking $θ$ into weights $w_{k}$ and biases $b_{ℓ}$ , the same rule reads, per coordinate,

w_{k}^{(t + 1)} = w_{k}^{(t)} - η \frac{1}{B} \sum_{i \in B_{t}} \frac{\partial L_{x^{(i)}}}{\partial w_{k}}, \qquad b_{ℓ}^{(t + 1)} = b_{ℓ}^{(t)} - η \frac{1}{B} \sum_{i \in B_{t}} \frac{\partial L_{x^{(i)}}}{\partial b_{ℓ}} .

As in pure SGD, the gradient that drives this update is a random vector, because the batch $B_{t}$ is drawn at random.
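A minimal sketch of the update rule, assuming a scalar linear model $w x + b$ with per-example loss $\frac{1}{2} (w x + b - y)^{2}$ and noiseless toy data. The model, the data, the learning rate, and the helper names are all assumptions made for illustration; indices are 0-based here, unlike the 1-based indices of the text.

```python
import random


def per_example_grad(w, b, x, y):
    """Gradient of the per-example loss 0.5 * (w*x + b - y)**2 w.r.t. (w, b)."""
    residual = w * x + b - y
    return residual * x, residual


def minibatch_step(w, b, data, batch_indices, lr):
    """One update: average the per-example gradients over the batch, then step."""
    B = len(batch_indices)
    grads = [per_example_grad(w, b, *data[i]) for i in batch_indices]
    grad_w = sum(g[0] for g in grads) / B
    grad_b = sum(g[1] for g in grads) / B
    return w - lr * grad_w, b - lr * grad_b


# Toy data generated from y = 2x + 1 (an assumption made for this sketch).
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]
rng = random.Random(0)
w, b = 0.0, 0.0
for t in range(2000):
    batch_indices = rng.sample(range(len(data)), 3)  # uniform size-3 subset
    w, b = minibatch_step(w, b, data, batch_indices, lr=0.05)
print(w, b)  # approaches 2 and 1 on this noiseless toy problem
```

Each step sees only three of the six examples, so individual updates differ from the full-gradient step, yet the iterates still settle on the generating parameters.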

Why the mini-batch gradient is a random vector

Fix the parameters at a value $θ$ and let $S_{B}$ be the family of all size- $B$ subsets of $\{1, \dots, N\}$ . The map that sends a subset $\mathcal{B}$ to its batch-averaged gradient,
$Φ_{θ} : S_{B} \to \mathbb{R}^{m}, \qquad \mathcal{B} \mapsto \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_{θ} L_{x^{(i)}} (θ),$
is deterministic: a given subset returns one specific vector. The randomness enters only through which subset is drawn. The batching rule makes $B_{t}$ a random element of $S_{B}$ , so the mini-batch gradient $Φ_{θ} (B_{t})$ is a fixed function evaluated at a random argument, and a function of a random variable is again a random variable. Since $Φ_{θ}$ takes its values in $\mathbb{R}^{m}$ , that variable is a random vector. This is the size- $B$ version of the single-example argument in pure SGD, recovered exactly when $B = 1$ .

Building an epoch from batches

On a finite dataset the batches of one epoch are produced by shuffling and slicing:

the indices $\{1, \dots, N\}$ are permuted,
the permuted list is partitioned into contiguous blocks of size $B$ ,
the optimiser steps once per block, feeding $X_{B_{t}}, Y_{B_{t}}$ through the network and applying the update above.

When every block has been processed, the epoch is complete and the next epoch reshuffles.
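The shuffle-and-slice procedure can be sketched as follows. `epoch_batches` is a hypothetical helper written for this illustration, and the seed is arbitrary.

```python
import random


def epoch_batches(N, B, rng):
    """Shuffle the indices 1..N and slice them into contiguous blocks of size B."""
    order = list(range(1, N + 1))
    rng.shuffle(order)  # fresh permutation for this epoch
    return [order[start:start + B] for start in range(0, N, B)]


rng = random.Random(0)
for epoch in range(2):
    batches = epoch_batches(N=10, B=3, rng=rng)
    print(epoch, batches)  # each epoch regroups the same ten indices
```

When $B$ does not divide $N$ the final slice is simply shorter than the others, because slicing past the end of a Python list truncates rather than raising an error.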

One epoch with $N = 10$ , $B = 3$

Shuffling the indices $\{1, \dots, 10\}$ might produce the order $(7, 2, 9, 1, 5, 10, 3, 8, 4, 6)$ . Partitioning this list into contiguous blocks of $B = 3$ gives the batches of one epoch:

iteration $t$	batch $B_{t}$
1	$\{7, 2, 9\}$
2	$\{1, 5, 10\}$
3	$\{3, 8, 4\}$
4	$\{6\}$

Ten examples do not divide evenly into threes, so the last block holds the single leftover example. Every example appears in exactly one batch, so one epoch still uses the whole dataset once, just $B$ examples at a time. The next epoch reshuffles and re-slices, producing a different grouping.

The leftover block is the reason the number of iterations per epoch depends on a convention: keeping the final smaller batch gives $⌈ N / B ⌉$ iterations, while dropping it gives $⌊ N / B ⌋$ . For the example above, that is $⌈ 10/3 ⌉ = 4$ against $⌊ 10/3 ⌋ = 3$ .
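The two conventions differ only in how the integer division is rounded. A small sketch, where `iterations_per_epoch` and its `drop_last` flag are hypothetical names chosen for this example:

```python
def iterations_per_epoch(N, B, drop_last=False):
    """Optimiser steps in one epoch of N examples with batch size B."""
    if drop_last:
        return N // B        # floor(N / B): the short final batch is discarded
    return (N + B - 1) // B  # ceil(N / B): the short final batch is kept


print(iterations_per_epoch(10, 3))                  # 4
print(iterations_per_epoch(10, 3, drop_last=True))  # 3
print(iterations_per_epoch(12, 3))                  # 4
```

The conventions agree whenever $B$ divides $N$ , as in the last line.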

Shuffling matters here for the same reason developed for pure SGD: without it, each batch inherits the storage order of the dataset, so batches grouped by class or source push the parameters in correlated directions epoch after epoch. A fresh permutation breaks that pattern and is also what makes the randomness analysed next genuinely random.

The mini-batch gradient is unbiased

The reassurance that justifies replacing the full gradient by a batch average is that, on average, nothing is lost. Treat $B_{t}$ as a uniformly chosen size- $B$ subset of $\{1, \dots, N\}$ , and collect the batch gradient into the random vector

G_{t} (θ) = \nabla_{θ} L_{B_{t}} (θ) = \frac{1}{B} \sum_{i \in B_{t}} \nabla_{θ} L_{x^{(i)}} (θ) .

The cleanest way to take the expectation is to turn the sum over the random set into a sum over all examples gated by an indicator,

G_{t} (θ) = \frac{1}{B} \sum_{i = 1}^{N} 1_{\{i \in B_{t}\}} \nabla_{θ} L_{x^{(i)}} (θ),

where $1_{\{i \in B_{t}\}}$ is $1$ when example $i$ lands in the current batch and $0$ otherwise. Expectation is componentwise over the random batch, and by linearity it falls on the indicators alone:

\mathbb{E} [G_{t} (θ)] = \frac{1}{B} \sum_{i = 1}^{N} P (i \in B_{t}) \nabla_{θ} L_{x^{(i)}} (θ) .

For a uniform size- $B$ subset, symmetry makes every example equally likely to be included, and since each batch fills $B$ of the $N$ places, the $N$ inclusion probabilities must sum to $B$ ; sharing that equally gives $P (i \in B_{t}) = B / N$ . Substituting, the $B$ cancels and the full-dataset gradient reappears:

\mathbb{E} [G_{t} (θ)] = \frac{1}{B} \sum_{i = 1}^{N} \frac{B}{N} \nabla_{θ} L_{x^{(i)}} (θ) = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{θ} L_{x^{(i)}} (θ) = \nabla_{θ} L (θ) .

The mini-batch gradient is unbiased

Averaged over the random choice of batch, the mini-batch gradient equals the full-dataset gradient exactly, for any batch size $B$ . Pure SGD is the case $B = 1$ and full-batch GD the case $B = N$ ; both are unbiased, and so is everything in between. As with pure SGD, this holds only in expectation: for any one realised batch the estimate can differ from the true gradient, and the agreement is recovered only on average.

When batches are formed by shuffling once per epoch and slicing, as in real training, the per-step expectation conditioned on the batches already used in the epoch is over the remaining examples rather than all $N$ . But one full epoch covers every example exactly once, so at a fixed $θ$ , and with equal-sized batches, the batch gradients averaged over a complete epoch reproduce the true gradient exactly. In training $θ$ moves between steps, so this identity is a guide rather than an exact statement, but the practical conclusion is unchanged.
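Unbiasedness under uniform subset sampling can be checked exhaustively on a small example. The sketch below assumes five made-up scalar per-example gradients and uses exact rational arithmetic, so the comparison is an equality rather than an approximation; `expected_batch_gradient` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical per-example gradients (one scalar coordinate), N = 5.
grads = [Fraction(v) for v in (3, -1, 4, 1, -5)]
N = len(grads)
full_gradient = sum(grads) / N


def expected_batch_gradient(grads, B):
    """Average the batch gradient over every size-B subset, each equally likely."""
    subsets = list(combinations(range(len(grads)), B))
    batch_means = [sum(grads[i] for i in s) / B for s in subsets]
    return sum(batch_means) / len(subsets)


for B in range(1, N + 1):
    # The expectation matches the full gradient for every batch size.
    assert expected_batch_gradient(grads, B) == full_gradient
print(full_gradient)  # 2/5
```

Individual subsets still give different values; only their average over all equally likely subsets coincides with the full gradient.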

How the noise shrinks with the batch size

Unbiasedness settles where the estimate is centred, not how far it scatters. The scatter is what the batch size controls, and it is the real reason larger batches give a cleaner signal. For a vector-valued gradient the right object is a covariance matrix. Write the per-example gradient and the full gradient as

g^{(i)} (θ) = \nabla_{θ} L_{x^{(i)}} (θ), \qquad \bar{g} (θ) = \frac{1}{N} \sum_{i = 1}^{N} g^{(i)} (θ),

and define the covariance of the per-example gradients across the dataset,

Σ_{pop} (θ) = \frac{1}{N} \sum_{i = 1}^{N} (g^{(i)} (θ) - \bar{g} (θ)) (g^{(i)} (θ) - \bar{g} (θ))^{⊤} .

This matrix records how widely the example-wise gradients spread around their mean: its diagonal holds the variance of each gradient coordinate, its off-diagonal entries how coordinates move together. For a batch drawn uniformly without replacement, the covariance of the batch gradient is

Cov [G_{t} (θ)] = \frac{N - B}{B ( N - 1 )} Σ_{pop} (θ) .

Where the factor $\frac{N - B}{B ( N - 1 )}$ comes from

Take a single gradient coordinate and write $a_{i}$ for its value on example $i$ , with population mean $\bar{a} = \frac{1}{N} \sum_{i} a_{i}$ and population variance $σ^{2} = \frac{1}{N} \sum_{i} (a_{i} - \bar{a})^{2}$ , which is the matching diagonal entry of $Σ_{pop}$ . The batch estimate of this coordinate is
$\hat{a} = \frac{1}{B} \sum_{i = 1}^{N} 1_{i} a_{i}, \qquad 1_{i} = 1_{\{i \in B_{t}\}} .$
For a uniform batch of size $B$ drawn without replacement, the indicators satisfy
$\mathbb{E} [1_{i}] = \frac{B}{N}, \qquad \mathbb{E} [1_{i} 1_{j}] = \frac{B}{N} \cdot \frac{B - 1}{N - 1} \quad (i \neq j),$
the second because once $i$ takes one of the $B$ slots, $j$ must take one of the remaining $B - 1$ out of $N - 1$ . These give
$\operatorname{Var} (1_{i}) = \frac{B ( N - B )}{N^{2}}, \qquad \operatorname{Cov} (1_{i}, 1_{j}) = - \frac{B ( N - B )}{N^{2} ( N - 1 )} \quad (i \neq j) .$
The negative covariance is the fingerprint of sampling without replacement: putting one example in the batch makes any other slightly less likely. Substituting into $\hat{a}$ , and using $\sum_{i \neq j} a_{i} a_{j} = (\sum_{i} a_{i})^{2} - \sum_{i} a_{i}^{2} = N^{2} \bar{a}^{2} - \sum_{i} a_{i}^{2}$ ,
$\operatorname{Var} (\hat{a}) = \frac{1}{B^{2}} \left[ \operatorname{Var} (1_{i}) \sum_{i} a_{i}^{2} + \operatorname{Cov} (1_{i}, 1_{j}) \sum_{i \neq j} a_{i} a_{j} \right] = \frac{N - B}{B ( N - 1 )} σ^{2} .$
This is the diagonal of the stated matrix; redoing the same steps with the outer product $(g^{(i)} - \bar{g}) (g^{(j)} - \bar{g})^{⊤}$ in place of the scalar $a_{i} a_{j}$ gives the full matrix form. The endpoints check out: $B = N$ yields variance $0$ (the full-batch gradient is exact), and $B = 1$ yields $σ^{2}$ (a lone example, maximal noise), the two extremes of the regimes table below.

The mean stays pinned at $\nabla_{θ} L$ ; only the prefactor changes with $B$ . When $B ≪ N$ the finite-population factor $\frac{N - B}{N - 1} \approx 1$ , and the expression collapses to the rule worth remembering,

Cov [G_{t} (θ)] \approx \frac{1}{B} Σ_{pop} (θ) .

So doubling the batch size roughly halves the variance of the gradient estimate: a batch of $64$ has about half the gradient variance of a batch of $32$ (its standard deviation is smaller by a factor of $\sqrt{2}$ ), at twice the per-step cost. This is the precise sense in which larger batches give a more accurate, lower-noise direction.
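The finite-population factor and the $1/ B$ rule of thumb can be compared by brute force on a small example. The sketch assumes six made-up scalar gradient values and enumerates every size- $B$ subset with exact rational arithmetic; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def population_variance(values):
    """Variance with the 1/N convention used for the population covariance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def batch_mean_variance(values, B):
    """Exact variance of the batch mean over all equally likely size-B subsets."""
    means = [sum(s) / B for s in combinations(values, B)]
    return population_variance(means)


values = [Fraction(v) for v in (3, -1, 4, 1, -5, 2)]
N = len(values)
sigma2 = population_variance(values)
for B in range(1, N + 1):
    exact = batch_mean_variance(values, B)
    predicted = Fraction(N - B, B * (N - 1)) * sigma2  # without-replacement formula
    approx = sigma2 / B                                # the B << N rule of thumb
    assert exact == predicted
    print(B, exact, approx)
```

With $N = 6$ the rule of thumb visibly overestimates the variance for larger $B$ , because the condition $B ≪ N$ does not hold; on realistic datasets the two values are nearly indistinguishable.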

Noise as implicit regularisation

A lower-noise gradient sounds strictly better, yet driving the noise to zero is not the goal. The fluctuation of mini-batch gradients is one of the most important sources of implicit regularisation in deep learning.

Because a batch is only a partial view of the data, its gradient jitters around the full-gradient direction. That jitter perturbs the trajectory and discourages the iterates from settling into sharp, narrow basins of the loss. The minima it tends to favour are flatter, and flatter minima are commonly associated with robustness to parameter perturbations and, in many settings, better generalisation.

Smaller batches inject more useful noise

Since the gradient noise scales like $1/ B$ , smaller batches perturb each step more, biasing the optimiser toward flatter regions that often generalise better. This is the main reason large-batch training, which suppresses the noise, can hurt test accuracy unless compensated by other means. The same “stochastic temperature” reading appears in cosine annealing: a smaller effective step acts as lower-temperature exploitation, a larger one as higher-temperature exploration.

Batch size and learning rate

The batch size $B$ and the learning rate $η$ should be tuned together, not in isolation. A larger $B$ estimates the update direction with lower variance, which usually allows a larger step before training destabilises. This is the intuition behind the linear scaling rule of large-batch training: multiplying the batch size by $k$ suggests multiplying the learning rate by roughly the same $k$ ,

B \mapsto k B ⟹ η \mapsto k η .

This is a practical heuristic, not a universal law. It depends on the optimiser, the model, and the training phase, and large-batch runs often need a warm-up phase before the scaled rate can be used safely.
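A minimal sketch of the scaling rule combined with a warm-up. The linear ramp, the step counts, and the base values are assumptions chosen for illustration; the text above only says that some warm-up phase is often needed, not what shape it takes.

```python
def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule: multiply the rate by the same factor k as the batch."""
    k = batch / base_batch
    return k * base_lr


def warmup_learning_rate(step, warmup_steps, base_lr, target_lr):
    """Ramp linearly from base_lr to target_lr over warmup_steps, then hold."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps


# Hypothetical setting: a rate tuned at batch 256 is reused at batch 1024 (k = 4).
target = scaled_learning_rate(base_lr=0.1, base_batch=256, batch=1024)
schedule = [warmup_learning_rate(s, 5, 0.1, target) for s in range(7)]
print(target)
print(schedule)  # rises from the base rate to the scaled rate, then stays there
```

The ramp starts at the rate that was stable for the smaller batch and only reaches the scaled rate once the early, fast-changing phase of training has passed.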

The critical batch size

Enlarging $B$ does not buy proportional speed-ups forever. Beyond a task-dependent critical batch size, the batch gradient is already so close to the full gradient that adding more examples barely reduces the number of optimisation steps to reach a target loss, while still costing more compute per step. Past that point, larger batches spend more compute than they save. The critical batch size is not a universal constant: it depends on the model, optimiser, dataset, and training phase, and must be found empirically.

The broader lesson is that $B$ changes the optimisation dynamics, not just the throughput, which is why batch size and learning rate are tuned jointly.

Three regimes, one formula

The same update covers all three regimes, distinguished only by the choice of $B$ :

Regime	Batch size $B$	Cost per update	Gradient variance	Typical role
Full-batch GD	$B = N$	$O (N)$	$0$ (exact gradient)	Small datasets; theoretical analysis; second-order methods.
Pure SGD	$B = 1$	$O (1)$	Maximum (one example)	Mainly conceptual and theoretical; rarely used directly.
Mini-batch SGD	$1 < B < N$	$O (B)$	Falls like $1/ B$	The universal default in modern training.

Across the three regimes the trend is monotone in both quantities that matter: as $B$ grows from $1$ to $N$ , the cost per update rises and the gradient variance falls. The right $B$ is the value that balances these opposing trends for the task at hand.

Summing instead of averaging

Some formulations sum the per-example gradients instead of averaging them, dropping the $\frac{1}{B}$ factor. This is not a different method: it rescales the effective learning rate by $B$ , since $\sum_{i \in B_{t}} \nabla_{θ} L_{x^{(i)}} = B \cdot \frac{1}{B} \sum_{i \in B_{t}} \nabla_{θ} L_{x^{(i)}}$ . What matters is that the convention is fixed once and used consistently, so that a reported learning rate is comparable across runs.
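The equivalence is easy to confirm numerically. In this sketch the gradient values and learning rate are made up, and both helper functions are hypothetical:

```python
def step_with_mean(theta, grads, lr):
    """Update using the batch-averaged gradient."""
    return theta - lr * sum(grads) / len(grads)


def step_with_sum(theta, grads, lr):
    """Update using the summed gradient (no 1/B factor)."""
    return theta - lr * sum(grads)


grads = [0.5, -1.5, 2.0, 1.0]  # hypothetical per-example gradients, B = 4
B = len(grads)
averaged = step_with_mean(1.0, grads, lr=0.1)
summed = step_with_sum(1.0, grads, lr=0.1 / B)  # rate divided by B to compensate
print(averaged, summed)  # the two updates coincide up to floating-point rounding
```

A learning rate reported under the summing convention therefore has to be multiplied by $B$ before it can be compared with one reported under the averaging convention.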

Deep Learning: Zero to Hero
