Mini-batch stochastic gradient descent is the standard compromise between the two previous regimes:

  • full-batch gradient descent, which uses the entire dataset at each update;
  • pure SGD, which uses exactly one training example at each update.

In mini-batch SGD, each update is computed from a small subset of training examples, called a mini-batch.
If the mini-batch size is denoted by $B$, then typical choices are $B = 32$, $B = 64$, or $B = 128$, although the best value depends on the model, the hardware, and the optimization regime.

Using the notation introduced in Neural Networks vocabulary, let

$$\mathcal{D} = \big\{ (x^{(i)}, y^{(i)}) \big\}_{i=1}^{N},$$

and suppose that the objective is

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell_i(\theta), \qquad \ell_i(\theta) = \ell\big(f(x^{(i)}; \theta),\, y^{(i)}\big).$$

At iteration $t$, instead of selecting a single example, one selects a mini-batch index set

$$\mathcal{B}_t \subset \{1, \dots, N\}, \qquad |\mathcal{B}_t| = B.$$

The set $\mathcal{B}_t$ identifies the examples used at that optimizer step.
The corresponding batched inputs and targets are

$$X_t = \big(x^{(i)}\big)_{i \in \mathcal{B}_t}, \qquad Y_t = \big(y^{(i)}\big)_{i \in \mathcal{B}_t},$$

which matches the batch notation adopted in Neural Networks vocabulary.

The mini-batch loss is the average loss over the examples contained in the current batch:

$$L_{\mathcal{B}_t}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \ell_i(\theta).$$

Its gradient is therefore

$$\nabla_\theta L_{\mathcal{B}_t}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla_\theta \ell_i(\theta),$$

which is used as a computationally cheaper approximation to the full gradient

$$\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell_i(\theta).$$

The update rule becomes

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L_{\mathcal{B}_t}(\theta_t).$$

If the compact parameter vector $\theta$ is unpacked into weights and biases, the same rule can be written componentwise as

$$W^{(l)} \leftarrow W^{(l)} - \eta \, \frac{\partial L_{\mathcal{B}_t}}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \, \frac{\partial L_{\mathcal{B}_t}}{\partial b^{(l)}}.$$

These are not different learning rules: they are simply the componentwise form of the same mini-batch update.
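As a concrete illustration, here is a minimal NumPy sketch of this update on a toy least-squares problem. The names `minibatch_sgd_step` and `grad_fn` are ours, not from any established API; the loss and data are synthetic.

```python
import numpy as np

def minibatch_sgd_step(theta, grad_fn, X, Y, batch_idx, lr):
    """One step of theta <- theta - lr * (mean gradient over the mini-batch).

    grad_fn(theta, x, y) returns the gradient of the per-example loss.
    """
    grads = np.array([grad_fn(theta, X[i], Y[i]) for i in batch_idx])
    return theta - lr * grads.mean(axis=0)

# Toy problem: per-example loss 0.5 * (theta @ x - y)**2,
# whose gradient with respect to theta is (theta @ x - y) * x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta
grad_fn = lambda th, x, y: (th @ x - y) * x

theta = np.zeros(3)
for t in range(500):
    batch = rng.choice(100, size=16, replace=False)  # random mini-batch, B = 16
    theta = minibatch_sgd_step(theta, grad_fn, X, Y, batch, lr=0.1)
```

On this noiseless problem the iterates converge to `true_theta`; real training loops differ mainly in how the gradient is computed, not in the update itself.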

Why Mini-Batch SGD Is A Compromise

Mini-batch SGD does not use the whole dataset at once, as in full-batch gradient descent, and it does not use only one training example at a time, as in pure SGD.
Instead, it uses a fixed-size mini-batch containing $B$ training examples.
For this reason, it sits naturally between the two extremes:

  • it is much cheaper per update than full-batch gradient descent;
  • it is less noisy than pure SGD;
  • it maps efficiently onto modern vectorized hardware such as GPUs.

Modern Viewpoint: Noise Can Be Beneficial

In deep learning, the stochasticity of mini-batch gradients is not merely a defect to be reduced as much as possible.
It can also act as a form of implicit regularization.

Because each mini-batch provides only a partial view of the whole dataset, the corresponding gradient typically fluctuates around the full-gradient direction. This noise perturbs the optimization trajectory and can help the iterates avoid becoming too tightly trapped in sharp basins of the loss landscape.

From a modern optimization viewpoint, this is one reason mini-batch training often favors flatter solutions over sharper ones. Flatter minima are frequently associated with better robustness to perturbations of the parameters and, in many practical settings, with better generalization. In this sense, the residual noise of mini-batch SGD is often a useful feature rather than only a source of error.

Batch Size And Learning Rate

The batch size $B$ should not be thought of independently of the learning rate $\eta$.
When $B$ increases, the mini-batch gradient usually becomes less noisy, so the update direction is estimated with lower variance. In that regime, one can often take larger optimization steps without immediately destabilizing training.

This is the intuition behind the linear scaling rule often used in large-batch training: if the batch size is multiplied by a factor $k$, one tries increasing the learning rate by approximately the same factor. In informal form,

$$B \to kB \quad \Longrightarrow \quad \eta \to k\eta.$$
This should be interpreted as a practical heuristic, not as a universal law. Its usefulness depends on the optimizer, the model, the training phase, and the overall schedule. In particular, large-batch training often also requires additional stabilization, such as a warm-up phase, before the scaled learning rate can be used safely.
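The heuristic plus warm-up can be sketched in a few lines. The function name, signature, and default warm-up length below are illustrative choices of ours, not from any particular framework.

```python
def scaled_lr(base_lr, base_batch, batch, step, warmup_steps=500):
    """Linear-scaling heuristic with a linear warm-up phase.

    The target rate is base_lr * (batch / base_batch); during the first
    warmup_steps updates the rate ramps up linearly toward that target.
    """
    target = base_lr * (batch / base_batch)
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps
    return target
```

For example, moving from batch 256 to batch 1024 with a base rate of 0.1 targets a rate of 0.4, reached only after the warm-up phase ends.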

Increasing $B$ therefore tends to make each individual update more accurate, but it does not imply that the entire training run will become proportionally faster in wall-clock time. In practice, diminishing returns usually appear: beyond a hardware- and problem-dependent regime, enlarging the batch may keep reducing gradient noise while providing little additional reduction in the number of optimization steps required to reach a target loss.

This practical saturation point is often discussed under the name critical batch size. The expression should not be interpreted as a single universal threshold valid for every model and dataset; rather, it refers to the empirical observation that, past a certain scale, using still larger mini-batches often consumes more compute than it saves in optimization time.

The broader lesson is that increasing $B$ does not merely change computational throughput: it also changes the optimization dynamics. Batch size and learning rate must therefore be tuned jointly rather than treated as unrelated hyperparameters.

One Epoch In Mini-Batch Training

In the common finite-dataset implementation, a single epoch proceeds as follows:

Training pipeline with mini-batches

  1. the indices $1, \dots, N$ are shuffled;
  2. the shuffled list is split into mini-batches of size $B$;
  3. one mini-batch is selected;
  4. the corresponding batch is fed through the network;
  5. the mean mini-batch loss and the mean mini-batch gradient are computed;
  6. the update rule for $\theta$ is applied;
  7. the same steps are repeated for every mini-batch created from that shuffled pass through the dataset.

Once all those mini-batches have been processed, one epoch is completed.
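The seven steps above can be sketched as follows in NumPy; the function and argument names are ours, chosen to mirror the numbered list.

```python
import numpy as np

def run_one_epoch(theta, grad_fn, X, Y, batch_size, lr, rng, drop_last=False):
    """One shuffled pass over the dataset, updating theta once per mini-batch."""
    n = len(X)
    order = rng.permutation(n)                   # step 1: shuffle the indices
    stop = n - n % batch_size if drop_last else n
    for start in range(0, stop, batch_size):     # step 2: split into mini-batches
        batch = order[start:start + batch_size]  # step 3: select one mini-batch
        # steps 4-5: per-example gradients, then the mean mini-batch gradient
        grads = np.array([grad_fn(theta, X[i], Y[i]) for i in batch])
        theta = theta - lr * grads.mean(axis=0)  # step 6: apply the update rule
    return theta                                 # step 7: the loop covers every batch
```

Setting `drop_last=True` discards the final smaller batch, which matters for the iteration counts discussed next.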

If the final smaller batch is kept, then the number of iterations per epoch is

$$T_{\text{epoch}} = \left\lceil \frac{N}{B} \right\rceil;$$

if the final smaller batch is dropped, it is

$$T_{\text{epoch}} = \left\lfloor \frac{N}{B} \right\rfloor.$$
Shuffling is important here for the same reason as in pure SGD: without it, the order of the examples inside successive mini-batches would inherit the storage order of the dataset, which can introduce persistent ordering effects across epochs. If no shuffling and no other random batching mechanism are used, then the sequence of mini-batches is deterministic, and the randomness discussed below is not generated by the batching process itself.

Expectation And Unbiasedness Of The Mini-Batch Gradient

In the idealized analysis, $\mathcal{B}$ is treated as a uniformly chosen subset of $\{1, \dots, N\}$ having cardinality $B$.
For each fixed parameter value $\theta$, define

$$g_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_\theta \ell_i(\theta).$$

Since $g_{\mathcal{B}}(\theta)$ is a random vector, its expectation is taken componentwise. The expectation is with respect to the random choice of the mini-batch $\mathcal{B}$.

A convenient way to write the batch average is

$$g_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i=1}^{N} \mathbf{1}\{i \in \mathcal{B}\} \, \nabla_\theta \ell_i(\theta),$$

where $\mathbf{1}\{i \in \mathcal{B}\}$ is the indicator of the event that example $i$ belongs to the current mini-batch.

Taking expectations and using linearity,

$$\mathbb{E}\big[g_{\mathcal{B}}(\theta)\big] = \frac{1}{B} \sum_{i=1}^{N} \mathbb{P}(i \in \mathcal{B}) \, \nabla_\theta \ell_i(\theta).$$

Under uniform sampling of a batch of size $B$, each example has inclusion probability

$$\mathbb{P}(i \in \mathcal{B}) = \frac{B}{N}.$$

Therefore

$$\mathbb{E}\big[g_{\mathcal{B}}(\theta)\big] = \frac{1}{B} \sum_{i=1}^{N} \frac{B}{N} \, \nabla_\theta \ell_i(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell_i(\theta).$$

But, by linearity of differentiation,

$$\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell_i(\theta) = \nabla_\theta L(\theta).$$

Hence

$$\mathbb{E}\big[g_{\mathcal{B}}(\theta)\big] = \nabla_\theta L(\theta).$$
This is the precise sense in which the mini-batch gradient is an unbiased estimator of the full gradient. It does not mean that the gradient computed from one specific mini-batch is itself equal to the full gradient; the equality holds only at the level of expectation over the random batch selection.
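This can be checked numerically: fix a set of per-example gradient vectors, average the mini-batch gradient over many uniformly sampled batches, and the result approaches the full-dataset gradient. The arrays below are synthetic stand-ins for real per-example gradients at a fixed parameter value.

```python
import numpy as np

rng = np.random.default_rng(1)
N, B, d = 50, 8, 3
per_example_grads = rng.normal(size=(N, d))  # stand-ins for the per-example gradients
full_grad = per_example_grads.mean(axis=0)   # the full-dataset gradient

# Monte Carlo estimate of E[g_B] over uniform batches without replacement.
trials = 50_000
acc = np.zeros(d)
for _ in range(trials):
    batch = rng.choice(N, size=B, replace=False)
    acc += per_example_grads[batch].mean(axis=0)
mc_estimate = acc / trials
```

Any single batch gradient here still differs noticeably from `full_grad`; only the average over many batches matches it, exactly as the derivation predicts.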

When mini-batches are instead formed by shuffling once per epoch and then partitioning the permuted dataset, the same intuition remains valid, but the exact conditional expectation at a given within-epoch iteration is the average over the examples that remain unused in that epoch rather than over the whole dataset.

Variance Reduction With Batch Size

Unbiasedness answers only one question: is the mini-batch gradient correct on average? The answer is yes. If one averaged the gradients produced by more and more random mini-batches of the same size, that average would converge to the full gradient.

A different question is how much the gradient fluctuates from one random mini-batch to another. This is the role of variance. In scalar problems one speaks of variance; for vector-valued gradients, the corresponding object is the covariance matrix. To make this precise, let

$$g_i(\theta) = \nabla_\theta \ell_i(\theta), \qquad \bar{g}(\theta) = \frac{1}{N} \sum_{i=1}^{N} g_i(\theta) = \nabla_\theta L(\theta),$$

so that $g_i(\theta)$ is the gradient contributed by example $i$, while $\bar{g}(\theta)$ is the full-dataset gradient.

Now define the covariance matrix of the per-example gradients:

$$\Sigma(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big(g_i(\theta) - \bar{g}(\theta)\big)\big(g_i(\theta) - \bar{g}(\theta)\big)^{\top}.$$

This matrix measures how much the example-wise gradients are scattered around their mean. Its diagonal entries are the variances of the individual gradient coordinates, while the off-diagonal entries describe how different coordinates fluctuate together.

If the mini-batch $\mathcal{B}$ is sampled uniformly without replacement, then the covariance of the mini-batch gradient is

$$\mathrm{Cov}\big(g_{\mathcal{B}}(\theta)\big) = \frac{1}{B} \cdot \frac{N - B}{N - 1} \, \Sigma(\theta).$$

The important point is the prefactor in front of $\Sigma(\theta)$. As $B$ grows, that prefactor shrinks roughly like $1/B$. So the mini-batch gradient stays centered at the same mean, but it becomes progressively less noisy.

In particular, if one looks at a single gradient coordinate $j$ and denotes by $\sigma_j^2$ its variance across training examples, then

$$\mathrm{Var}\big(g_{\mathcal{B}}(\theta)_j\big) = \frac{\sigma_j^2}{B} \cdot \frac{N - B}{N - 1}.$$

Hence each coordinate variance decreases on the order of $1/B$, with the additional finite-population correction factor $\frac{N - B}{N - 1}$. When the dataset is large and $N \gg B$, this is often summarized more simply as

$$\mathrm{Var}\big(g_{\mathcal{B}}(\theta)_j\big) \approx \frac{\sigma_j^2}{B}.$$

Equivalently, doubling the batch size roughly halves the variance of the gradient estimator. This is the mathematical reason why larger mini-batches usually produce a cleaner optimization signal: the estimator remains centered at the full gradient, but its fluctuations become smaller. That reduction in stochasticity is one reason larger learning rates can sometimes be used more safely, although the effect is optimizer- and regime-dependent rather than automatic.
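The $1/B$ scaling, including the finite-population correction, can be verified empirically for a single gradient coordinate. The per-example values below are synthetic, and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 40
g = rng.normal(size=N)  # one gradient coordinate per training example
sigma2 = g.var()        # variance across the N examples (denominator N)

def empirical_batch_var(B, trials=50_000):
    """Variance of the batch mean over random batches sampled without replacement."""
    means = [g[rng.choice(N, size=B, replace=False)].mean() for _ in range(trials)]
    return np.var(means)

def predicted_batch_var(B):
    """(sigma^2 / B) * (N - B) / (N - 1), the finite-population formula."""
    return sigma2 / B * (N - B) / (N - 1)
```

Comparing `empirical_batch_var(B)` with `predicted_batch_var(B)` for a few batch sizes shows close agreement, and the predicted variance strictly decreases as `B` grows.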

Approximate Equality In Practice

For one realized mini-batch, one often writes informally

$$g_{\mathcal{B}}(\theta) \approx \nabla_\theta L(\theta),$$
because the mini-batch average is usually more representative of the full dataset than a single-example gradient. However, this remains an approximation for the specific batch currently used. In general, the approximation tends to improve as the batch size increases, and it becomes exact when the batch coincides with the whole dataset.

Three Regimes, One Common Formula

The same update pattern covers the three standard regimes:

  • full-batch gradient descent: $B = N$;
  • pure SGD: $B = 1$;
  • mini-batch SGD: $1 < B < N$.

Convention On Scaling

Some formulations omit the factor $\frac{1}{B}$ and sum gradients rather than averaging them.
This is not a different method in principle; it corresponds to a rescaling of the learning rate $\eta$ by the batch size.
What matters is that the convention be stated explicitly and used consistently.
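The equivalence is easy to confirm numerically: summing with learning rate $\eta / B$ produces the same step as averaging with learning rate $\eta$.

```python
import numpy as np

grads = np.random.default_rng(3).normal(size=(16, 4))  # per-example gradients, B = 16
lr = 0.1

step_mean = lr * grads.mean(axis=0)               # averaging convention, rate lr
step_sum = (lr / len(grads)) * grads.sum(axis=0)  # summing convention, rate lr / B
```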