Full-batch gradient descent and pure SGD sit at opposite ends of a single dial:

  • Full-batch descent computes the gradient of the average loss over all $N$ examples before each update: the direction is exact, but every step costs $O(N)$;
  • Pure SGD updates from one example at a time: each step costs $O(1)$, but the direction is a noisy estimate of the true gradient.

Mini-batch SGD turns the dial to an intermediate setting, computing each update from a small subset of $B$ examples, called a mini-batch. It is the regime that essentially all modern deep-learning training actually uses.

Typical sizes are small powers of two such as $B = 32$, $64$, or $128$, though the best value depends on the model, the hardware, and the optimisation regime.

Why mini-batch is the practical default

Full-batch GD is unbiased but expensive: each update costs $O(N)$. Pure SGD is unbiased but high-variance: each update costs $O(1)$, yet its direction is very noisy. Mini-batch SGD sits between them, at cost $O(B)$ per update and a gradient variance that falls like $1/B$. Choosing $B$ trades gradient accuracy (favours large $B$) against per-step cost and exploration noise (favours small $B$). That trade-off, together with how well a batch maps onto parallel hardware, is why mini-batch SGD is the universal default rather than either extreme.

From a single example to a batch

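Using the notation of the neural networks vocabulary, the training set and objective are as in pure SGD,

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}, \qquad L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell_i(\theta).$$

At step $t$, rather than picking one index, a mini-batch step selects a whole index set

$$\mathcal{B}_t = \{i_1, \dots, i_B\} \subseteq \{1, \dots, N\}, \qquad |\mathcal{B}_t| = B.$$

The set $\mathcal{B}_t$ simply names which examples this step uses. Stacking their inputs and targets row by row gives the batched tensors

$$X_{\mathcal{B}_t} = \begin{bmatrix} x_{i_1}^\top \\ \vdots \\ x_{i_B}^\top \end{bmatrix}, \qquad Y_{\mathcal{B}_t} = \begin{bmatrix} y_{i_1}^\top \\ \vdots \\ y_{i_B}^\top \end{bmatrix},$$

matching the batch notation of the neural networks vocabulary.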

What a batch actually is

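Take a dataset of $N = 10$ examples. A step with batch size $B = 3$ might draw the index set $\mathcal{B}_t = \{2, 5, 9\}$. This single symbol is just the list of rows this step will use: examples $2$, $5$, and $9$. Their inputs are stacked into one matrix, their targets into another,

$$X_{\mathcal{B}_t} = \begin{bmatrix} x_2^\top \\ x_5^\top \\ x_9^\top \end{bmatrix}, \qquad Y_{\mathcal{B}_t} = \begin{bmatrix} y_2^\top \\ y_5^\top \\ y_9^\top \end{bmatrix},$$

where each example occupies one row. A single forward pass now processes all three rows at once, which is exactly what makes a batch efficient on parallel hardware. The remaining seven examples are not used at this step; they appear in other batches.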
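As a minimal sketch of the stacking step, the snippet below gathers the rows named by an index set into batch matrices. The toy dataset, the helper name `gather_batch`, and the index set $\{2, 5, 9\}$ are assumptions made for illustration only.

```python
# Hypothetical toy dataset: 10 examples, 2 input features, 1 target each.
inputs = [[float(i), float(i) ** 2] for i in range(1, 11)]   # row i-1 holds x_i
targets = [[2.0 * i] for i in range(1, 11)]                  # row i-1 holds y_i

def gather_batch(inputs, targets, index_set):
    """Stack the rows named by a 1-based index set into batch matrices."""
    rows = sorted(index_set)
    x_batch = [inputs[i - 1] for i in rows]
    y_batch = [targets[i - 1] for i in rows]
    return x_batch, y_batch

# Illustrative index set: this step uses examples 2, 5, and 9 only.
x_batch, y_batch = gather_batch(inputs, targets, {2, 5, 9})
print(x_batch)  # three rows, one per selected example
print(y_batch)
```

Each batch matrix has exactly as many rows as the index set has elements, so the batch size is read off the index set rather than stored separately.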

The mini-batch loss and update

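The mini-batch loss is the average loss over the examples in the current batch, and its gradient is the matching average of the per-example gradients,

$$L_{\mathcal{B}_t}(\theta) = \frac{1}{B}\sum_{i \in \mathcal{B}_t} \ell_i(\theta), \qquad \nabla_\theta L_{\mathcal{B}_t}(\theta) = \frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla_\theta \ell_i(\theta).$$

This batch gradient is used as a cheaper stand-in for the full gradient $\nabla_\theta L(\theta)$, and the update is the familiar step along it,

$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L_{\mathcal{B}_t}(\theta_t),$$

where $\eta$ is the learning rate. Unpacking $\theta$ into weights $W$ and biases $b$, the same rule reads, per coordinate,

$$W_{jk} \leftarrow W_{jk} - \eta\, \frac{\partial L_{\mathcal{B}_t}}{\partial W_{jk}}, \qquad b_j \leftarrow b_j - \eta\, \frac{\partial L_{\mathcal{B}_t}}{\partial b_j}.$$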

As in pure SGD, the gradient that drives this update is a random vector, because the batch is drawn at random.
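The update can be sketched on a deliberately small model. The example below assumes a one-dimensional linear model $wx + b$ with squared loss, noise-free synthetic data, and a fixed learning rate; none of these choices come from the text, and the function names are hypothetical.

```python
import random

def per_example_grad(w, b, x, y):
    """Gradient of the squared loss 0.5 * (w*x + b - y)**2 w.r.t. (w, b)."""
    residual = w * x + b - y
    return residual * x, residual

def minibatch_step(w, b, xs, ys, batch, lr):
    """One update: average the per-example gradients over the batch, then step."""
    grads = [per_example_grad(w, b, xs[i], ys[i]) for i in batch]
    grad_w = sum(g[0] for g in grads) / len(batch)
    grad_b = sum(g[1] for g in grads) / len(batch)
    return w - lr * grad_w, b - lr * grad_b

# Hypothetical data generated from y = 3x + 1 (no noise).
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]

rng = random.Random(0)
w, b = 0.0, 0.0
for step in range(5000):
    batch = rng.sample(range(len(xs)), 4)   # uniform size-4 subset, no replacement
    w, b = minibatch_step(w, b, xs, ys, batch, lr=0.1)

print(round(w, 3), round(b, 3))  # close to the generating values 3 and 1
```

Because the batch is redrawn at every step, two runs with different seeds follow different trajectories even though they share the same update rule.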

Building an epoch from batches

On a finite dataset the batches of one epoch are produced by shuffling and slicing:

  • the indices $1, \dots, N$ are permuted,
  • the permuted list is partitioned into contiguous blocks of size $B$, and
  • the optimiser steps once per block, feeding $(X_{\mathcal{B}_t}, Y_{\mathcal{B}_t})$ through the network and applying the update above.

When every block has been processed, the epoch is complete and the next epoch reshuffles.

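One epoch with $N = 10$, $B = 3$

Shuffling the indices $1, \dots, 10$ might produce the order $(7, 2, 9, 4, 1, 10, 5, 3, 8, 6)$. Partitioning this list into contiguous blocks of $3$ gives the batches of one epoch:

iteration   batch
1           $\{7, 2, 9\}$
2           $\{4, 1, 10\}$
3           $\{5, 3, 8\}$
4           $\{6\}$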

Ten examples do not divide evenly into threes, so the last block holds the single leftover example. Every example appears in exactly one batch, so one epoch still uses the whole dataset once, just $B$ examples at a time. The next epoch reshuffles and re-slices, producing a different grouping.

The leftover block is the reason the number of iterations per epoch depends on a convention: keeping the final smaller batch gives $\lceil N/B \rceil$ iterations, while dropping it gives $\lfloor N/B \rfloor$. For the example above, that is $4$ against $3$.
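The shuffle-and-slice procedure and the two conventions for the leftover block can be sketched as follows. The helper `epoch_batches` and its `drop_last` flag are hypothetical names chosen for this illustration, not part of any particular library's API.

```python
import math
import random

def epoch_batches(n, batch_size, rng, drop_last=False):
    """Shuffle indices 1..n and slice them into contiguous blocks."""
    order = list(range(1, n + 1))
    rng.shuffle(order)
    batches = [order[k:k + batch_size] for k in range(0, n, batch_size)]
    # Optionally discard a final block that is smaller than batch_size.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

rng = random.Random(0)
kept = epoch_batches(10, 3, rng)                      # keeps the leftover block
dropped = epoch_batches(10, 3, rng, drop_last=True)   # reshuffles, drops leftover

print([len(b) for b in kept])      # block sizes 3, 3, 3, 1
print(len(kept), len(dropped))     # ceil(10/3) = 4 against floor(10/3) = 3
```

Calling the helper again with the same generator yields a fresh permutation, which is the per-epoch reshuffle described above.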

Shuffling matters here for the same reason developed for pure SGD: without it, each batch inherits the storage order of the dataset, so batches grouped by class or source push the parameters in correlated directions epoch after epoch. A fresh permutation breaks that pattern and is also what makes the randomness analysed next genuinely random.

The mini-batch gradient is unbiased

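The reassurance that justifies replacing the full gradient by a batch average is that, on average, nothing is lost. Treat $\mathcal{B}_t$ as a uniformly chosen size-$B$ subset of $\{1, \dots, N\}$, and collect the batch gradient into the random vector

$$\hat{g}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla_\theta \ell_i(\theta_t).$$

The cleanest way to take the expectation is to turn the sum over the random set into a sum over all $N$ examples gated by an indicator,

$$\hat{g}_t = \frac{1}{B}\sum_{i=1}^{N} \mathbb{1}[i \in \mathcal{B}_t]\, \nabla_\theta \ell_i(\theta_t),$$

where $\mathbb{1}[i \in \mathcal{B}_t]$ is $1$ when example $i$ lands in the current batch and $0$ otherwise. Expectation is componentwise over the random batch, and by linearity it falls on the indicators alone:

$$\mathbb{E}[\hat{g}_t] = \frac{1}{B}\sum_{i=1}^{N} \Pr(i \in \mathcal{B}_t)\, \nabla_\theta \ell_i(\theta_t).$$

For a uniform size-$B$ subset, symmetry makes every example equally likely to be included, and since each batch fills $B$ of the $N$ places, the inclusion probabilities must sum to $B$; sharing that equally gives $\Pr(i \in \mathcal{B}_t) = B/N$. Substituting, the $B$ cancels and the full-dataset gradient reappears:

$$\mathbb{E}[\hat{g}_t] = \frac{1}{B}\sum_{i=1}^{N} \frac{B}{N}\, \nabla_\theta \ell_i(\theta_t) = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \ell_i(\theta_t) = \nabla_\theta L(\theta_t).$$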

The mini-batch gradient is unbiased

Averaged over the random choice of batch, the mini-batch gradient equals the full-dataset gradient exactly, for any batch size $B$. Pure SGD is the case $B = 1$ and full-batch GD the case $B = N$; both are unbiased, and so is everything in between. As with pure SGD, this holds only in expectation: for any one realised batch the estimate can differ from the true gradient, and the agreement is recovered only on average.
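Unbiasedness can be checked exactly on a small case by enumerating every size-$B$ subset. The sketch below assumes scalar per-example gradients with made-up values and uses exact rational arithmetic, so the equality is tested without rounding error.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical per-example gradients (scalars for simplicity), N = 6.
grads = [Fraction(g) for g in (4, -1, 7, 0, 3, -5)]
n = len(grads)
full_grad = sum(grads) / n

def expected_batch_grad(batch_size):
    """Average the batch gradient over every size-B subset, each equally likely."""
    subsets = list(combinations(range(n), batch_size))
    batch_means = [sum(grads[i] for i in s) / batch_size for s in subsets]
    return sum(batch_means) / len(subsets)

# The expectation matches the full gradient for every batch size from 1 to N.
for batch_size in range(1, n + 1):
    assert expected_batch_grad(batch_size) == full_grad

print(full_grad)  # 4/3
```

Individual subsets still give different batch means; only their average over all subsets coincides with the full gradient.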

When batches are formed by shuffling once per epoch and slicing, as in real training, the per-step expectation conditioned on the batches already used in the epoch is over the remaining examples rather than all $N$. But one full epoch covers every example exactly once, so at fixed parameters the batch gradients, weighted by batch size, average over a complete epoch to the true gradient exactly (a plain average suffices when all batches have the same size). The practical conclusion is unchanged.
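The epoch-level statement can be illustrated with a short check, under the assumption that the parameters are held fixed during the epoch so the per-example gradients do not change. The gradient values are made up, and the batch means are weighted by batch size because the last block is smaller.

```python
from fractions import Fraction
import random

# Hypothetical scalar per-example gradients at fixed parameters, N = 10.
grads = [Fraction(g) for g in (4, -1, 7, 0, 3, -5, 2, 6, -3, 1)]
full_grad = sum(grads) / len(grads)

# One epoch: shuffle once, slice into blocks of 3 (sizes 3, 3, 3, 1).
order = list(range(len(grads)))
random.Random(0).shuffle(order)
batches = [order[k:k + 3] for k in range(0, len(order), 3)]

batch_means = [sum(grads[i] for i in b) / len(b) for b in batches]

# Size-weighted average over the epoch recovers the full gradient exactly.
weighted = sum(len(b) * m for b, m in zip(batches, batch_means)) / len(grads)
# A plain average over-weights the single leftover example.
plain = sum(batch_means) / len(batch_means)

print(weighted == full_grad, plain == full_grad)
```

In actual training the parameters move between batches, so this identity describes the gradients at one fixed point rather than the sequence of updates taken during an epoch.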

How the noise shrinks with the batch size

Unbiasedness settles where the estimate is centred, not how far it scatters. The scatter is what the batch size controls, and it is the real reason larger batches give a cleaner signal. For a vector-valued gradient the right object is a covariance matrix. Write the per-example gradient and the full gradient as

$$g_i = \nabla_\theta \ell_i(\theta), \qquad g = \nabla_\theta L(\theta) = \frac{1}{N}\sum_{i=1}^{N} g_i,$$

and define the covariance of the per-example gradients across the dataset,

$$\Sigma = \frac{1}{N}\sum_{i=1}^{N} (g_i - g)(g_i - g)^\top.$$

This matrix records how widely the example-wise gradients spread around their mean: its diagonal holds the variance of each gradient coordinate, its off-diagonal entries how coordinates move together. For a batch drawn uniformly without replacement, the covariance of the batch gradient $\hat{g}$ is

$$\operatorname{Cov}(\hat{g}) = \frac{1}{B}\,\frac{N - B}{N - 1}\,\Sigma.$$

The mean stays pinned at $g$; only the prefactor changes with $B$. When $B \ll N$ the finite-population factor $\frac{N - B}{N - 1} \approx 1$, and the expression collapses to the rule worth remembering,

$$\operatorname{Cov}(\hat{g}) \approx \frac{1}{B}\,\Sigma.$$

So doubling the batch size roughly halves the variance of the gradient estimate: a batch of $64$ produces about half the gradient noise of a batch of $32$, at twice the per-step cost. This is the precise sense in which larger batches give a more accurate, lower-noise direction.
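For scalar gradients the covariance reduces to a variance, and the without-replacement formula $\frac{1}{B}\frac{N-B}{N-1}\sigma^2$ can be verified exactly by enumeration. The sketch below assumes made-up gradient values and a population variance $\sigma^2$ defined with a $1/N$ normalisation.

```python
from fractions import Fraction
from itertools import combinations

grads = [Fraction(g) for g in (4, -1, 7, 0, 3, -5)]   # hypothetical scalar gradients
n = len(grads)
mean = sum(grads) / n
sigma2 = sum((g - mean) ** 2 for g in grads) / n       # population variance (1/N)

def batch_variance(batch_size):
    """Exact variance of the batch mean over all equally likely size-B subsets."""
    subsets = list(combinations(grads, batch_size))
    means = [sum(s) / batch_size for s in subsets]
    return sum((m - mean) ** 2 for m in means) / len(subsets)

def predicted_variance(batch_size):
    """Closed form: (1/B) * (N - B)/(N - 1) * sigma^2."""
    return Fraction(n - batch_size, (n - 1) * batch_size) * sigma2

for batch_size in range(1, n + 1):
    assert batch_variance(batch_size) == predicted_variance(batch_size)

print(batch_variance(1) == sigma2, batch_variance(n) == 0)
```

With only six examples the finite-population factor is far from $1$, so the exact halving under doubling appears only in the regime where the batch is a small fraction of the dataset.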

Noise as implicit regularisation

A lower-noise gradient sounds strictly better, yet driving the noise to zero is not the goal. The fluctuation of mini-batch gradients is one of the most important sources of implicit regularisation in deep learning.

Because a batch is only a partial view of the data, its gradient jitters around the full-gradient direction. That jitter perturbs the trajectory and discourages the iterates from settling into sharp, narrow basins of the loss. The minima it tends to favour are flatter, and flatter minima are commonly associated with robustness to parameter perturbations and, in many settings, better generalisation.

Smaller batches inject more useful noise

Since the gradient noise scales like $1/B$, smaller batches perturb each step more, biasing the optimiser toward flatter regions that often generalise better. This is the main reason large-batch training, which suppresses the noise, can hurt test accuracy unless compensated by other means. The same “stochastic temperature” reading appears in cosine annealing: a smaller effective step acts as lower-temperature exploitation, a larger one as higher-temperature exploration.

Batch size and learning rate

The batch size and the learning rate should be tuned together, not in isolation. A larger $B$ estimates the update direction with lower variance, which usually allows a larger step before training destabilises. This is the intuition behind the linear scaling rule of large-batch training: multiplying the batch size by $k$ suggests multiplying the learning rate by roughly the same $k$,

$$B \to kB \quad\Longrightarrow\quad \eta \to k\eta.$$

This is a practical heuristic, not a universal law. It depends on the optimiser, the model, and the training phase, and large-batch runs often need a warm-up phase before the scaled rate can be used safely.
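One way the heuristic might be written down is sketched below. The function name, the linear shape of the warm-up ramp, and the reference values (a base rate of 0.1 at batch size 256) are all assumptions for illustration; the text prescribes none of them.

```python
def scaled_learning_rate(base_lr, base_batch, batch, step, warmup_steps):
    """Linear scaling rule with a linear warm-up (heuristic sketch).

    The target rate is base_lr * (batch / base_batch). During warm-up the
    rate ramps linearly from base_lr up to that target.
    """
    target = base_lr * batch / base_batch
    if warmup_steps <= 0 or step >= warmup_steps:
        return target
    fraction = step / warmup_steps
    return base_lr + fraction * (target - base_lr)

# Hypothetical setting: batch size grown 4x, from 256 to 1024.
start = scaled_learning_rate(0.1, 256, 1024, step=0, warmup_steps=500)
end = scaled_learning_rate(0.1, 256, 1024, step=500, warmup_steps=500)
print(start, end)  # ramps from the base rate to 4x the base rate
```

The scaled rate is a starting point for tuning, not a guarantee of stability, which is why the ramp begins at the rate already known to work at the smaller batch size.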

The critical batch size

Enlarging $B$ does not buy proportional speed-ups forever. Beyond a task-dependent critical batch size, the batch gradient is already so close to the full gradient that adding more examples barely reduces the number of optimisation steps to reach a target loss, while still costing more compute per step. Past that point, larger batches spend more compute than they save. The critical batch size is not a universal constant: it depends on the model, optimiser, dataset, and training phase, and must be found empirically.

The broader lesson is that $B$ changes the optimisation dynamics, not just the throughput, which is why batch size and learning rate are tuned jointly.

Three regimes, one formula

The same update covers all three regimes, distinguished only by the choice of $B$:

Regime           Batch size $B$   Cost per update   Gradient variance        Typical role
Full-batch GD    $B = N$          $O(N)$            $0$ (exact gradient)     Small datasets; theoretical analysis; second-order methods.
Pure SGD         $B = 1$          $O(1)$            Maximum (one example)    Mainly conceptual and theoretical; rarely used directly.
Mini-batch SGD   $1 < B < N$      $O(B)$            Falls like $1/B$         The universal default in modern training.

The trade-off is monotone in both quantities that matter: as $B$ grows from $1$ to $N$, the cost per update rises and the gradient variance falls. The right $B$ is the value that balances these opposing trends for the task at hand.

Summing instead of averaging

Some formulations sum the per-example gradients instead of averaging them, dropping the $1/B$ factor. This is not a different method: it rescales the effective learning rate by $B$, since $\eta \sum_{i \in \mathcal{B}_t} \nabla_\theta \ell_i = (\eta B) \cdot \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_\theta \ell_i$. What matters is that the convention is fixed once and used consistently, so that a reported learning rate is comparable across runs.
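The equivalence of the two conventions is easy to confirm numerically. In this sketch the scalar parameter, the gradient values, and both function names are hypothetical; the point is only that a summed update with rate $\eta / B$ lands where an averaged update with rate $\eta$ does.

```python
def step_with_mean(theta, batch_grads, lr):
    """Update using the average of the per-example gradients."""
    return theta - lr * sum(batch_grads) / len(batch_grads)

def step_with_sum(theta, batch_grads, lr):
    """Update using the plain sum of the per-example gradients."""
    return theta - lr * sum(batch_grads)

batch_grads = [0.5, -1.5, 2.0, 1.0]   # hypothetical per-example gradients, B = 4
lr = 0.1

averaged = step_with_mean(1.0, batch_grads, lr)
summed = step_with_sum(1.0, batch_grads, lr / len(batch_grads))
print(averaged, summed)  # the two conventions agree once the rate is rescaled
```

A learning rate quoted under the summing convention therefore needs to be multiplied by $B$ before it is compared with one quoted under the averaging convention.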