Mini-batch stochastic gradient descent is the standard compromise between the two previous regimes:
- full-batch gradient descent, which uses the entire dataset at each update;
- pure SGD, which uses exactly one training example at each update.
In mini-batch SGD, each update is computed from a small subset of training examples, called a mini-batch.
If the mini-batch size is denoted by $B$, then typical choices are $B = 32$, $64$, or $128$, although the best value depends on the model, the hardware, and the optimization regime.
On the Choice of Batch Size
Those values are not special from a purely mathematical point of view. In practice, they are often favored because they interact well with modern accelerator hardware.
On GPUs, parallel work is executed in small groups of threads. In the NVIDIA terminology these groups are called warps, and a warp typically contains 32 threads. Using batch sizes that are multiples of 32 often helps keep the hardware more fully occupied, reducing the amount of partially idle parallel work.
There is also a memory-side reason. Batch sizes such as 32, 64, and 128 often lead to tensor shapes that are friendlier to low-level kernels, memory alignment, and high-throughput matrix operations. In modern training stacks, this can improve throughput and make better use of matrix units or tensor-oriented compute paths.
However, this is a hardware-aware heuristic, not a universal law. The best batch size is still a compromise among:
- optimization behavior;
- available memory;
- hardware throughput.
For that reason, powers of two are common, but other values, including non-powers of two, may also be entirely reasonable when they better fit the model or the device.
Using the notation introduced in Neural Networks vocabulary, let

$$\mathcal{D} = \big\{(x^{(i)}, y^{(i)})\big\}_{i=1}^{N}$$

and suppose that the objective is

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x^{(i)}), y^{(i)}\big).$$

At iteration $t$, instead of selecting a single example, one selects a mini-batch index set

$$\mathcal{B}_t \subseteq \{1, \dots, N\}, \qquad |\mathcal{B}_t| = B.$$

The set $\mathcal{B}_t$ identifies the examples used at that optimizer step.

The corresponding batched inputs and targets are

$$X_{\mathcal{B}_t} = \big(x^{(i)}\big)_{i \in \mathcal{B}_t}, \qquad Y_{\mathcal{B}_t} = \big(y^{(i)}\big)_{i \in \mathcal{B}_t},$$

which matches the batch notation adopted in Neural Networks vocabulary.

The mini-batch loss is the average loss over the $B$ examples contained in the current batch:

$$\mathcal{L}_{\mathcal{B}_t}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \ell\big(f_\theta(x^{(i)}), y^{(i)}\big).$$

Its gradient is therefore

$$\nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big),$$

which is used as a computationally cheaper approximation to the full gradient

$$\nabla_\theta \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big).$$
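As a concrete illustration, the mini-batch gradient can be compared against the full gradient on a toy problem. The linear model with squared per-example loss below is an illustrative assumption, not part of the surrounding derivation:

```python
import numpy as np

# Toy setup (illustrative assumption): linear model with squared loss.
rng = np.random.default_rng(0)
N, d, B = 100, 3, 10
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
theta = rng.normal(size=d)

def per_example_grad(theta, x_i, y_i):
    # Gradient of 0.5 * (x_i @ theta - y_i)**2 with respect to theta.
    return (x_i @ theta - y_i) * x_i

def full_gradient(theta):
    # Average of all N per-example gradients.
    return np.mean([per_example_grad(theta, X[i], y[i]) for i in range(N)], axis=0)

def minibatch_gradient(theta, batch):
    # Average over only the examples in the current mini-batch.
    return np.mean([per_example_grad(theta, X[i], y[i]) for i in batch], axis=0)

batch = rng.choice(N, size=B, replace=False)   # a random mini-batch index set
g_batch = minibatch_gradient(theta, batch)     # cheap, noisy estimate
g_full = full_gradient(theta)                  # exact, but touches all N examples
```

The mini-batch estimate costs a factor of roughly $N/B$ less per step; when the batch is taken to be the whole dataset, the two quantities coincide.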
Why The Mini-Batch Gradient Is A Random Vector
Under random batching, the object $\mathcal{B}_t$ is random: strictly speaking, it is a random subset of $\{1, \dots, N\}$, or equivalently a set-valued random variable.

Consequently, the mini-batch gradient $\nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta)$ is random as well, because it is obtained by evaluating a vector-valued expression at a random mini-batch.
It is a random vector because its value has one coordinate for each parameter in $\theta$. Each coordinate is a scalar random variable, and together those coordinates form a vector.
The update rule becomes

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta_t).$$

If the compact parameter vector $\theta$ is unpacked into weights and biases, the same rule can be written componentwise as

$$W^{[l]} \leftarrow W^{[l]} - \eta \, \frac{\partial \mathcal{L}_{\mathcal{B}_t}}{\partial W^{[l]}}, \qquad b^{[l]} \leftarrow b^{[l]} - \eta \, \frac{\partial \mathcal{L}_{\mathcal{B}_t}}{\partial b^{[l]}}.$$

These are not different learning rules: they are simply the componentwise form of the same mini-batch update.
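A minimal numerical sketch of this equivalence, with an assumed (illustrative) packing of two weights and one bias into a single parameter vector:

```python
import numpy as np

# Assumed layout (illustrative): theta = [w1, w2, b].
eta = 0.1
theta = np.array([0.5, -0.3, 2.0])   # packed parameters
grad = np.array([0.2, -0.1, 0.05])   # mini-batch gradient, same layout

# Update on the packed vector.
theta_new = theta - eta * grad

# The same update written componentwise on the unpacked pieces.
w, b = theta[:2], theta[2]
gw, gb = grad[:2], grad[2]
w_new = w - eta * gw
b_new = b - eta * gb
```

The two updates produce identical parameters; only the bookkeeping differs.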
Why Mini-Batch SGD Is A Compromise
Mini-batch SGD does not use the whole dataset at once, as in full-batch gradient descent, and it does not use only one training example at a time, as in pure SGD.
Instead, it uses a fixed-size mini-batch containing $B$ training examples.
For this reason, it sits naturally between the two extremes:
- it is much cheaper per update than full-batch gradient descent;
- it is less noisy than pure SGD;
- it maps efficiently onto modern vectorized hardware such as GPUs.
Modern Viewpoint: Noise Can Be Beneficial
In deep learning, the stochasticity of mini-batch gradients is not merely a defect to be reduced as much as possible.
It can also act as a form of implicit regularization.
Because each mini-batch provides only a partial view of the whole dataset, the corresponding gradient typically fluctuates around the full-gradient direction. This noise perturbs the optimization trajectory and can help the iterates avoid becoming too tightly trapped in sharp basins of the loss landscape.
From a modern optimization viewpoint, this is one reason mini-batch training often favors flatter solutions over sharper ones. Flatter minima are frequently associated with better robustness to perturbations of the parameters and, in many practical settings, with better generalization. In this sense, the residual noise of mini-batch SGD is often a useful feature rather than only a source of error.
Batch Size And Learning Rate
The batch size $B$ should not be thought of independently of the learning rate $\eta$.
When $B$ increases, the mini-batch gradient usually becomes less noisy, so the update direction is estimated with lower variance. In that regime, one can often take larger optimization steps without immediately destabilizing training.
This is the intuition behind the linear scaling rule often used in large-batch training: if the batch size is multiplied by a factor $k$, one tries increasing the learning rate by approximately the same factor. In informal form,

$$B \to kB \quad \Rightarrow \quad \eta \to k\eta.$$
This should be interpreted as a practical heuristic, not as a universal law. Its usefulness depends on the optimizer, the model, the training phase, and the overall schedule. In particular, large-batch training often also requires additional stabilization, such as a warm-up phase, before the scaled learning rate can be used safely.
Increasing $B$ therefore tends to make each individual update more accurate, but it does not imply that the entire training run will become proportionally faster in wall-clock time. In practice, diminishing returns usually appear: beyond a hardware- and problem-dependent regime, enlarging the batch may keep reducing gradient noise while providing little additional reduction in the number of optimization steps required to reach a target loss.
This practical saturation point is often discussed under the name critical batch size. The expression should not be interpreted as a single universal threshold valid for every model and dataset; rather, it refers to the empirical observation that, past a certain scale, using still larger mini-batches often consumes more compute than it saves in optimization time.
The broader lesson is that increasing $B$ does not merely change computational throughput: it also changes the optimization dynamics. Batch size and learning rate must therefore be tuned jointly rather than treated as unrelated hyperparameters.
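As a sketch, the linear scaling heuristic with warm-up can be written as a small helper. The function name, the linear warm-up schedule, and the default of 500 warm-up steps are all illustrative choices, not a standard API:

```python
def scaled_learning_rate(base_lr, base_batch, batch, step, warmup_steps=500):
    """Linear scaling heuristic: scale base_lr by batch / base_batch,
    ramped up linearly from near zero over the first warmup_steps steps."""
    target = base_lr * (batch / base_batch)
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps
    return target
```

For example, going from batch 256 at learning rate 0.1 to batch 1024 targets a learning rate of 0.4 once the warm-up phase has completed.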
One Epoch In Mini-Batch Training
In the common finite-dataset implementation, a single epoch proceeds as follows:
Training pipeline with mini-batches
- the indices $\{1, \dots, N\}$ are shuffled;
- the shuffled list is split into mini-batches of size $B$;
- one mini-batch is selected;
- the corresponding batch is fed through the network;
- the mean mini-batch loss and the mean mini-batch gradient are computed;
- the update rule for $\theta$ is applied;
- the same steps are repeated for every mini-batch created from that shuffled pass through the dataset.
Once all those mini-batches have been processed, one epoch is completed.
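The steps above can be sketched end to end. The linear least-squares model and the specific sizes are placeholder assumptions chosen only to make the pipeline runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B, eta = 64, 4, 16, 0.1
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d)            # targets from a planted linear model
theta = np.zeros(d)

perm = rng.permutation(N)             # shuffle the indices
for start in range(0, N, B):          # split into mini-batches of size B
    idx = perm[start:start + B]       # select one mini-batch
    Xb, yb = X[idx], y[idx]           # feed the batch forward
    grad = Xb.T @ (Xb @ theta - yb) / len(idx)   # mean mini-batch gradient
    theta = theta - eta * grad        # apply the update rule
# All mini-batches processed: one epoch is complete.
```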
If the final smaller batch is kept, then the number of iterations per epoch is

$$\left\lceil \frac{N}{B} \right\rceil;$$

if the final smaller batch is dropped, it is

$$\left\lfloor \frac{N}{B} \right\rfloor.$$
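In code, the two counts differ only in the rounding direction:

```python
import math

def steps_per_epoch(N, B, drop_last=False):
    # ceil(N / B) keeps the final smaller batch; floor(N / B) drops it.
    return N // B if drop_last else math.ceil(N / B)
```

For N = 1000 and B = 32 this gives 32 steps per epoch if the last batch is kept and 31 if it is dropped; when B divides N exactly, the two conventions agree.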
Shuffling is important here for the same reason as in pure SGD: without it, the order of the examples inside successive mini-batches would inherit the storage order of the dataset, which can introduce persistent ordering effects across epochs. If no shuffling and no other random batching mechanism are used, then the sequence of mini-batches is deterministic, and the randomness discussed below is not generated by the batching process itself.
Expectation And Unbiasedness Of The Mini-Batch Gradient
In the idealized analysis, the mini-batch $\mathcal{B}$ is treated as a uniformly chosen subset of $\{1, \dots, N\}$ having cardinality $B$.
For each fixed parameter value $\theta$, define

$$g_{\mathcal{B}}(\theta) = \nabla_\theta \mathcal{L}_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big).$$

Since $g_{\mathcal{B}}(\theta)$ is a random vector, its expectation is taken componentwise. The expectation is with respect to the random choice of the mini-batch $\mathcal{B}$.

A convenient way to write the batch average is

$$g_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i=1}^{N} \mathbf{1}\{i \in \mathcal{B}\} \, \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big),$$

where $\mathbf{1}\{i \in \mathcal{B}\}$ is the indicator of the event that example $i$ belongs to the current mini-batch.

Taking expectations and using linearity,

$$\mathbb{E}\big[g_{\mathcal{B}}(\theta)\big] = \frac{1}{B} \sum_{i=1}^{N} \mathbb{P}(i \in \mathcal{B}) \, \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big).$$

Under uniform sampling of a batch of size $B$, each example has inclusion probability

$$\mathbb{P}(i \in \mathcal{B}) = \frac{B}{N}.$$

Therefore

$$\mathbb{E}\big[g_{\mathcal{B}}(\theta)\big] = \frac{1}{B} \cdot \frac{B}{N} \sum_{i=1}^{N} \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big).$$

But, by linearity of differentiation,

$$\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big) = \nabla_\theta \mathcal{L}(\theta).$$

Hence

$$\mathbb{E}\big[g_{\mathcal{B}}(\theta)\big] = \nabla_\theta \mathcal{L}(\theta).$$
This is the precise sense in which the mini-batch gradient is an unbiased estimator of the full gradient. It does not mean that the gradient computed from one specific mini-batch is itself equal to the full gradient; the equality holds only at the level of expectation over the random batch selection.
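Unbiasedness can be checked exactly on a tiny problem by enumerating every possible mini-batch: the average of the mini-batch gradients over all subsets of size B equals the full gradient. The quadratic per-example loss is an illustrative assumption:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N, d, B = 6, 2, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
theta = rng.normal(size=d)

def example_grad(i):
    # Gradient of 0.5 * (X[i] @ theta - y[i])**2 with respect to theta.
    return (X[i] @ theta - y[i]) * X[i]

full_grad = np.mean([example_grad(i) for i in range(N)], axis=0)

# Uniform sampling over all C(N, B) batches: average the batch gradients.
batch_grads = [np.mean([example_grad(i) for i in batch], axis=0)
               for batch in combinations(range(N), B)]
mean_over_batches = np.mean(batch_grads, axis=0)
# mean_over_batches matches full_grad up to floating-point error.
```

Each example appears in the same number of subsets, which is exactly why the average over batches recovers the average over examples.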
When mini-batches are instead formed by shuffling once per epoch and then partitioning the permuted dataset, the same intuition remains valid, but the exact conditional expectation at a given within-epoch iteration is the average over the examples that remain unused in that epoch rather than over the whole dataset.
Variance Reduction With Batch Size
Unbiasedness answers only one question: is the mini-batch gradient correct on average? The answer is yes. If one could average the gradients produced by many random mini-batches of the same size, that average would match the full gradient.
A different question is how much the gradient fluctuates from one random mini-batch to another. This is the role of variance. In scalar problems one speaks of variance; for vector-valued gradients, the corresponding object is the covariance matrix. To make this precise, let

$$g_i(\theta) = \nabla_\theta \ell\big(f_\theta(x^{(i)}), y^{(i)}\big), \qquad \bar{g}(\theta) = \frac{1}{N} \sum_{i=1}^{N} g_i(\theta) = \nabla_\theta \mathcal{L}(\theta),$$

so that $g_i(\theta)$ is the gradient contributed by example $i$, while $\bar{g}(\theta)$ is the full-dataset gradient.

Now define the covariance matrix of the per-example gradients:

$$\Sigma(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big(g_i(\theta) - \bar{g}(\theta)\big)\big(g_i(\theta) - \bar{g}(\theta)\big)^{\top}.$$

This matrix measures how much the example-wise gradients are scattered around their mean. Its diagonal entries are the variances of the individual gradient coordinates, while the off-diagonal entries describe how different coordinates fluctuate together.

If the mini-batch $\mathcal{B}$ is sampled uniformly without replacement, then the covariance of the mini-batch gradient is

$$\mathrm{Cov}\big(g_{\mathcal{B}}(\theta)\big) = \frac{1}{B} \cdot \frac{N - B}{N - 1} \, \Sigma(\theta).$$

The important point is the prefactor in front of $\Sigma(\theta)$. As $B$ grows, that prefactor shrinks roughly like $1/B$. So the mini-batch gradient stays centered at the same mean, but it becomes progressively less noisy.

In particular, if one looks at a single gradient coordinate $j$ and denotes by $\sigma_j^2$ its variance across training examples, then

$$\mathrm{Var}\big(g_{\mathcal{B}}(\theta)_j\big) = \frac{\sigma_j^2}{B} \cdot \frac{N - B}{N - 1}.$$

Hence each coordinate variance decreases on the order of $1/B$, with the additional finite-population correction factor $\frac{N - B}{N - 1}$. When the dataset is large and $B \ll N$, this is often summarized more simply as

$$\mathrm{Var}\big(g_{\mathcal{B}}(\theta)_j\big) \approx \frac{\sigma_j^2}{B}.$$
Equivalently, doubling the batch size roughly halves the variance of the gradient estimator. This is the mathematical reason why larger mini-batches usually produce a cleaner optimization signal: the estimator remains centered at the full gradient, but its fluctuations become smaller. That reduction in stochasticity is one reason larger learning rates can sometimes be used more safely, although the effect is optimizer- and regime-dependent rather than automatic.
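The without-replacement variance formula can likewise be verified exactly on a single scalar gradient coordinate by enumerating all batches; the synthetic per-example values are an illustrative assumption:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
N, B = 8, 3
g = rng.normal(size=N)          # one gradient coordinate per example
sigma2 = np.var(g)              # population variance across the N examples

# Exact distribution of the mini-batch mean: every size-B subset is
# equally likely under uniform sampling without replacement.
batch_means = [np.mean(g[list(batch)]) for batch in combinations(range(N), B)]
empirical_var = np.var(batch_means)
predicted_var = sigma2 / B * (N - B) / (N - 1)   # finite-population formula
```

The enumerated variance and the formula agree up to floating-point error, since averaging over all subsets realizes the sampling distribution exactly.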
Approximate Equality In Practice
For one realized mini-batch, one often writes informally

$$\nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta) \approx \nabla_\theta \mathcal{L}(\theta),$$

because the mini-batch average is usually more representative of the full dataset than a single-example gradient. However, this remains an approximation for the specific batch currently used. In general, the approximation tends to improve as the batch size increases, and it becomes exact when the batch coincides with the whole dataset.
Three Regimes, One Common Formula
The same update pattern covers the three standard regimes:
- full-batch gradient descent: $B = N$;
- pure SGD: $B = 1$;
- mini-batch SGD: $1 < B < N$.
Convention On Scaling
Some formulations omit the factor $1/B$ and sum gradients rather than averaging them.
This is not a different method in principle; it corresponds to a rescaling of the learning rate $\eta$.
What matters is that the convention be stated explicitly and used consistently.
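A quick numerical check of this equivalence: summing the per-example gradients with learning rate eta / B produces exactly the same step as averaging them with learning rate eta:

```python
import numpy as np

B, eta = 4, 0.1
grads = np.random.default_rng(3).normal(size=(B, 5))   # per-example gradients

step_mean = eta * grads.mean(axis=0)        # averaging convention, lr = eta
step_sum = (eta / B) * grads.sum(axis=0)    # summing convention, lr = eta / B
```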