Several challenges arise in the practical application of the gradient descent rule, many of which will be examined in detail later. At this stage, however, it is useful to highlight one key difficulty.
Consider the quadratic loss introduced earlier. This loss has the form

$$C = \frac{1}{n} \sum_x C_x,$$

where the individual contribution for a training example $x$ is defined as $C_x \equiv \frac{\|y(x) - a\|^2}{2}$; that is, $C$ is the MSE. In practice, computing the full gradient $\nabla C$ requires first evaluating the gradients $\nabla C_x$ for each individual training input $x$, and then averaging them:

$$\nabla C = \frac{1}{n} \sum_x \nabla C_x.$$
When the training set is very large, this computation becomes prohibitively time-consuming, and as a result the learning process proceeds very slowly.
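To make this averaging concrete, here is a small NumPy sketch for a hypothetical one-parameter linear model $a = wx + b$ with the quadratic per-example cost; the model, data, and variable names are illustrative assumptions, not part of the text.

```python
import numpy as np

# Illustrative toy setup: a linear model a = w*x + b with per-example
# quadratic cost C_x = (y(x) - a)^2 / 2.
rng = np.random.default_rng(0)
n = 1000
xs = rng.normal(size=n)
ys = 3.0 * xs + 1.0          # targets generated by a "true" model
w, b = 0.0, 0.0              # current parameters

# Per-example gradients: dC_x/dw = -(y - a) * x,  dC_x/db = -(y - a).
a = w * xs + b
grad_w_each = -(ys - a) * xs
grad_b_each = -(ys - a)

# The full gradient is the average of all n per-example gradients;
# every training input must be visited before a single update can be made.
grad_w = grad_w_each.mean()
grad_b = grad_b_each.mean()
```

Even in this toy case, one update touches all $n$ examples; for large datasets that is exactly the bottleneck described above.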
A widely used strategy for addressing this computational bottleneck is stochastic gradient descent (SGD).
The central idea is to approximate the full gradient by computing it on a small, randomly selected subset of the training data, rather than over the entire dataset.
Concretely, suppose a subset of $m$ training examples is sampled at random, denoted $X_1, X_2, \ldots, X_m$. This subset is referred to as a mini-batch.
For each example $X_j$, the corresponding gradient $\nabla C_{X_j}$ is computed.
If $m$ is sufficiently large, the average of these gradients provides a good approximation to the true gradient:

$$\frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j} \approx \frac{1}{n} \sum_x \nabla C_x = \nabla C.$$

Equivalently,

$$\nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j}.$$
Thus, by averaging over a randomly chosen mini-batch, an efficient and reasonably accurate estimate of the full gradient can be obtained, significantly accelerating the learning process.
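The quality of this approximation can be checked numerically. The sketch below reuses the illustrative linear model from before (all names and data are assumptions for demonstration) and compares the mini-batch gradient estimate with the full gradient.

```python
import numpy as np

# Illustrative setup: linear model a = w*x + b, quadratic per-example cost.
rng = np.random.default_rng(1)
n, m = 10_000, 100           # n training examples, mini-batch of size m
xs = rng.normal(size=n)
ys = 3.0 * xs + 1.0
w, b = 0.5, 0.0

def grad_w_each(x, y):
    a = w * x + b
    return -(y - a) * x      # dC_x/dw for the quadratic cost

# Full gradient: (1/n) * sum over all training inputs.
full_grad = grad_w_each(xs, ys).mean()

# Mini-batch estimate: (1/m) * sum over a random subset X_1, ..., X_m.
batch = rng.choice(n, size=m, replace=False)
mini_grad = grad_w_each(xs[batch], ys[batch]).mean()

# For sufficiently large m, mini_grad should lie close to full_grad,
# at a fraction of the cost (m gradient evaluations instead of n).
```

The mini-batch estimate is noisy, but its expected value equals the full gradient, which is why averaging over repeated updates still drives the loss downhill.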
To make the connection with neural networks explicit, let $w_k$ and $b_l$ denote the weights and biases, respectively.
Under stochastic gradient descent, training proceeds by selecting a randomly chosen mini-batch of inputs and updating the parameters according to

$$w_k \to w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k},$$

$$b_l \to b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l},$$

where the sums extend over all training examples $X_j$ in the current mini-batch.
Once this mini-batch has been processed, a new mini-batch is drawn at random, and the process is repeated.
After all training examples have been used once, an epoch of training is said to be complete, after which the cycle begins again.
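Putting the pieces together, the procedure above can be sketched as a minimal SGD loop for the same illustrative linear model; the hyperparameters and names are arbitrary choices for demonstration, not prescriptions.

```python
import numpy as np

# Illustrative SGD loop: linear model a = w*x + b, quadratic cost.
rng = np.random.default_rng(2)
n, m, eta, epochs = 2_000, 20, 0.1, 30
xs = rng.normal(size=n)
ys = 3.0 * xs + 1.0          # targets from the "true" parameters (3.0, 1.0)
w, b = 0.0, 0.0

for _ in range(epochs):
    order = rng.permutation(n)               # reshuffle once per epoch
    for start in range(0, n, m):
        batch = order[start:start + m]       # next mini-batch X_1, ..., X_m
        a = w * xs[batch] + b
        grad_w = (-(ys[batch] - a) * xs[batch]).mean()
        grad_b = (-(ys[batch] - a)).mean()
        w -= eta * grad_w                    # w -> w - (eta/m) * sum_j dC_Xj/dw
        b -= eta * grad_b                    # b -> b - (eta/m) * sum_j dC_Xj/db
# After training, (w, b) should be close to the true values (3.0, 1.0).
```

Each pass through the outer loop is one epoch: every training example is used exactly once before the data is reshuffled and the cycle repeats.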
Convention on Scaling
Conventions vary regarding the scaling of the loss function and the mini-batch updates.
- In the earlier quadratic loss definition, the cost function was scaled by a factor of $\frac{1}{n}$. In other contexts, the cost is written as a direct sum over training examples, omitting this factor. This convention is particularly useful when the total number of training examples is not fixed in advance, for instance when new data is generated in real time.
- Similarly, the update rules above include the factor $\frac{1}{m}$ to average over the mini-batch. Some formulations omit this term, which is conceptually equivalent to rescaling the learning rate $\eta$.
While these differences are not mathematically fundamental, they must be carefully noted when comparing results across different works or implementations.
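The equivalence of the two mini-batch conventions is easy to verify with a few lines of arithmetic: averaging the gradients with step size $\eta$ gives exactly the same parameter update as summing them with step size $\eta / m$. The numbers below are made up purely for illustration.

```python
# Hypothetical per-example gradients for one mini-batch (m = 4).
grads = [0.2, -0.1, 0.4, 0.3]
m, eta = len(grads), 0.5

# Convention A: average over the mini-batch, step size eta.
step_a = eta * sum(grads) / m

# Convention B: plain sum over the mini-batch, learning rate rescaled to eta/m.
step_b = (eta / m) * sum(grads)

# The two conventions produce identical updates.
assert step_a == step_b
```

This is why comparing learning rates across papers or codebases requires checking which convention each one uses: an $\eta$ under the summed convention corresponds to $m\,\eta$ under the averaged one.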
