Full-batch gradient descent updates the parameters only after computing the gradient of the average loss over the entire dataset. When the dataset is large, this makes each update expensive.

The underlying idea of pure stochastic gradient descent is to replace that full gradient by the gradient of a single training example.

Using the notation introduced in Neural Networks vocabulary, let

$$\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$$

denote the training set and let $\theta$ collect all weights and biases, and suppose that the training objective is

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L_i(\theta),$$

where $L_i(\theta)$ denotes the loss contribution of the single example $(x^{(i)}, y^{(i)})$.

Note

For example, if the mean squared error is used, the loss contribution of a single training example can be written as

$$L_i(\theta) = \left( \hat{y}^{(i)} - y^{(i)} \right)^2,$$

where $\hat{y}^{(i)}$ denotes the network output for the input $x^{(i)}$.
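To make the average-loss objective concrete, here is a minimal numerical sketch. The one-dimensional linear model, the tiny dataset, and all variable names are illustrative choices, not part of the text above:

```python
import numpy as np

# Toy 1D "network": y_hat = w * x + b, with parameters theta = (w, b).
# Dataset and parameter values are arbitrary, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=8)            # inputs x^(1), ..., x^(N)
y = 2.0 * x + 1.0                 # targets y^(1), ..., y^(N)
w, b = 0.5, 0.0                   # current parameters

def per_example_loss(i):
    """Squared-error contribution L_i(theta) of example i."""
    y_hat = w * x[i] + b          # network output for input x^(i)
    return (y_hat - y[i]) ** 2

# The training objective L(theta) is the average of the per-example losses.
L = np.mean([per_example_loss(i) for i in range(len(x))])
```

Full-batch gradient descent would differentiate `L`; pure SGD differentiates a single `per_example_loss(i)` per step.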

At iteration $t$, one index $i_t \in \{1, \dots, N\}$ is selected. The symbol $i_t$ denotes the index of the single training example used at optimizer step $t$. Therefore, in pure SGD, one iteration, one optimizer step, and one parameter update coincide.

In the common finite-dataset implementation, training is organized into epochs. At the beginning of epoch $e$, a random permutation $\pi_e$ of the index set $\{1, \dots, N\}$ is drawn, and the examples are processed in the order

$$\pi_e(1), \, \pi_e(2), \, \dots, \, \pi_e(N).$$

If $k \in \{1, \dots, N\}$ denotes the iteration number within epoch $e$, then the index used at the $k$-th iteration of that epoch is

$$i = \pi_e(k).$$

In other words, the shuffle fixes a random order for the entire epoch, and each iteration simply takes the next example in that order. Thus, the shuffle acts before the epoch starts, while each subsequent iteration still performs exactly one SGD update based on exactly one example. After the $N$-th iteration of that epoch, one epoch has been completed.
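The shuffle-once-per-epoch scheme can be sketched as follows; the dataset size, epoch count, and seed are arbitrary illustrative choices:

```python
import numpy as np

# Sketch of the shuffle-once-per-epoch loop: one permutation per epoch,
# one example per iteration. N and num_epochs are illustrative.
N, num_epochs = 5, 2
rng = np.random.default_rng(42)

visited = []
for epoch in range(num_epochs):
    perm = rng.permutation(N)     # pi_e: drawn once, before the epoch starts
    for k in range(N):            # k-th iteration within the epoch
        i = perm[k]               # index i = pi_e(k) of the single example used
        visited.append(i)
        # ... one SGD update on example i would happen here ...
```

Each epoch touches every example exactly once, in a fresh random order, which is exactly the "without replacement within an epoch" behavior described above.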

Once $i_t$ has been determined, stochastic gradient descent feeds the corresponding example through the network and computes the associated per-example gradient

$$\nabla_\theta L_{i_t}(\theta_t),$$

typically by means of backpropagation.

The parameter update is then

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L_{i_t}(\theta_t).$$

If the compact parameter vector $\theta$ is unpacked into weights $w$ and biases $b$, this becomes

$$w_{t+1} = w_t - \eta \, \frac{\partial L_{i_t}}{\partial w}(\theta_t), \qquad b_{t+1} = b_t - \eta \, \frac{\partial L_{i_t}}{\partial b}(\theta_t).$$

These are not different learning rules: they are simply the component-wise form of the same SGD update written in terms of $\theta$.
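As a sketch, the update rule can be run end to end on a toy model $\hat{y} = w x + b$ with squared-error loss. The dataset, learning rate, and hand-derived gradient formulas below are specific to this illustrative model, not taken from the text:

```python
import numpy as np

# Pure SGD on a toy model y_hat = w * x + b with per-example squared error
# L_i = (w * x_i + b - y_i)^2. Data, learning rate, and gradient formulas
# are illustrative choices.
x = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])
y = 3.0 * x - 0.5                      # exactly realizable targets
w, b, eta = 0.0, 0.0, 0.1
rng = np.random.default_rng(0)

for t in range(500):
    i = rng.integers(len(x))           # draw the single index i_t
    residual = (w * x[i] + b) - y[i]   # forward pass on example i_t only
    grad_w = 2.0 * residual * x[i]     # dL_i / dw
    grad_b = 2.0 * residual            # dL_i / db
    w -= eta * grad_w                  # w_{t+1} = w_t - eta * dL_i/dw
    b -= eta * grad_b                  # b_{t+1} = b_t - eta * dL_i/db
```

Because the targets here are exactly realizable, every per-example gradient vanishes at the optimum and the iterates settle near $(w, b) = (3, -0.5)$ despite the single-example noise.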

Why Shuffling Matters

If the dataset is traversed in the same fixed order at every epoch, then the sequence of updates inherits that same deterministic order. When the stored examples are grouped by class, source, difficulty, or time, many consecutive gradients may point in correlated directions. The parameters may then over-adapt to one portion of the dataset and later partially undo that adaptation when a different portion is encountered.

Shuffling removes this persistent ordering effect. No example is always first, no block of similar examples is always encountered at the same stage of training, and the stochastic updates are less systematically biased by the storage order of the dataset.

If no shuffling and no other random sampling mechanism are used, then the sequence of indices is deterministic. In that case, the randomness discussed below is not generated by the data-order process itself.

Expectation And Unbiasedness Under Uniform Sampling

In the idealized case of uniform sampling with replacement,

$$P(i_t = i) = \frac{1}{N} \quad \text{for every } i \in \{1, \dots, N\}.$$

Let

$$g_t = \nabla_\theta L_{i_t}(\theta).$$

Since $g_t$ is a random vector, its expectation is taken componentwise:

$$\mathbb{E}[g_t] = \big(\, \mathbb{E}[(g_t)_1], \; \mathbb{E}[(g_t)_2], \; \dots, \; \mathbb{E}[(g_t)_d] \,\big)^{\top},$$

where $d$ is the number of coordinates of $\theta$. The expectation here is taken with respect to the random choice of the index $i_t$; the vector $\theta$ simply specifies the point in parameter space at which the gradient is being evaluated.

Using the definition of expectation for a discrete random variable,

$$\mathbb{E}\big[\nabla_\theta L_{i_t}(\theta)\big] = \sum_{i=1}^{N} P(i_t = i) \, \nabla_\theta L_i(\theta).$$

Because the sampling is uniform,

$$\mathbb{E}\big[\nabla_\theta L_{i_t}(\theta)\big] = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta).$$

On the other hand, since

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L_i(\theta),$$

linearity of differentiation gives

$$\nabla_\theta L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta).$$

Therefore

$$\mathbb{E}\big[\nabla_\theta L_{i_t}(\theta)\big] = \nabla_\theta L(\theta).$$

This is the precise sense in which the single-example gradient is an unbiased estimator of the full gradient. It does not mean that the gradient computed from the particular sampled example is itself equal to the full gradient; the equality holds only at the level of expectation over the sampling randomness.
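The unbiasedness identity can be checked numerically on a toy one-parameter model $\hat{y} = w x$ (model, data, and names are illustrative): the uniform average of the per-example gradients matches a finite-difference estimate of the full-batch gradient.

```python
import numpy as np

# Toy setup: model y_hat = w * x, per-example loss L_i = (w * x_i - y_i)^2.
# Data values are arbitrary, for illustration only.
x = np.array([0.5, -1.0, 2.0, 1.5])
y = np.array([1.0, -2.0, 4.0, 3.0])
w = 0.7                                   # point at which gradients are evaluated

# E[grad] under uniform sampling P(i_t = i) = 1/N is the plain average
# of the per-example gradients dL_i/dw = 2 * (w * x_i - y_i) * x_i:
expected_grad = np.mean(2.0 * (w * x - y) * x)

# Full-batch gradient of L(w) = (1/N) * sum_i L_i(w), estimated by a
# central finite difference:
def L(w_):
    return np.mean((w_ * x - y) ** 2)

h = 1e-6
full_grad = (L(w + h) - L(w - h)) / (2 * h)
```

The two quantities agree to within finite-difference error, which is the content of the unbiasedness statement at this particular $w$.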

When training is implemented by shuffling once per epoch and then traversing the permuted dataset without replacement, the same intuition remains valid, but the exact conditional expectation at a given iteration within the epoch is the average over the remaining unused examples of that epoch rather than over the whole dataset.
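This within-epoch conditional expectation can be illustrated with a toy one-parameter model $\hat{y} = w x$ (all names and numbers are illustrative): conditioned on which examples are still unused, the next gradient averages over the remaining examples only, and mid-epoch this generally differs from the full-dataset average.

```python
import numpy as np

# Toy model y_hat = w * x with per-example gradients dL_i/dw.
# Data and the "already used" set are illustrative.
x = np.array([0.5, -1.0, 2.0, 1.5])
y = np.array([1.0, -2.0, 4.0, 3.0])
w = 0.7
grads = 2.0 * (w * x - y) * x          # per-example gradients dL_i/dw

used = {0, 2}                          # examples already consumed this epoch
remaining = [i for i in range(len(x)) if i not in used]

# Each remaining index is equally likely to be drawn next, so the
# conditional expectation averages over the remaining examples only:
cond_expectation = np.mean(grads[remaining])

full_gradient = grads.mean()           # average over the whole dataset
# Mid-epoch, cond_expectation generally differs from full_gradient.
```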

Because a single training example is not necessarily representative of the entire dataset, the corresponding gradient may vary substantially from one iteration to the next. As a result, the updates of the weights and biases exhibit stochastic fluctuations, and the optimization trajectory is typically less smooth than in full-batch gradient descent. At the same time, this variability can be beneficial: the sampling-induced noise may help the iterates avoid becoming trapped in poor local minima or saddle-type regions, thereby encouraging a broader exploration of the parameter space, or equivalently, of the loss landscape.

Approximate Equality In Practice

For one realized draw of the training example, one may write informally

$$\nabla_\theta L_{i_t}(\theta) \approx \nabla_\theta L(\theta),$$

but this is only a heuristic statement. For a single example, the discrepancy may be large, because one sample need not be representative of the whole dataset. The approximation becomes exact only in expectation, or accidentally at parameter values for which the sampled-example gradient happens to coincide with the dataset-average gradient.
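A quick numerical illustration (toy model $\hat{y} = w x$ with illustrative data) shows how far individual per-example gradients can stray from the dataset-average gradient even though their average matches it exactly:

```python
import numpy as np

# Per-example gradients dL_i/dw = 2 * (w * x_i - y_i) * x_i for a toy
# model y_hat = w * x (illustrative data), compared with their average.
x = np.array([0.5, -1.0, 2.0, 1.5])
y = np.array([1.0, -2.0, 4.0, 3.0])
w = 0.7

grads = 2.0 * (w * x - y) * x
full_grad = grads.mean()

# Individual gradients can deviate substantially from the full gradient:
deviations = np.abs(grads - full_grad)
```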

Tip

In much of deep-learning practice, the term “SGD” is often used loosely to include mini-batch SGD.
Strictly speaking, however, pure SGD corresponds to the case where exactly one training example is used per update.