Question
Why can backpropagation be considered an efficient algorithm?
To answer this question, it is helpful to compare backpropagation with an alternative approach to calculating the gradient.
One might imagine being at the dawn of neural network research in the 1950s or 60s, among the first to conceive of using gradient descent for learning. To make this idea practical, a method for calculating the gradient of the loss function is necessary. Recalling some calculus, one might attempt to employ the chain rule for this calculation; however, after some manipulation, the algebraic complexity would prove so great as to discourage further effort.
The Numerical Alternative: “Rise over Run”
Consequently, an alternative approach would be sought. The loss function $L$ would be considered a function of the weights alone (the treatment of biases will be addressed later). By numbering the weights $w_1, w_2, \ldots$, the objective becomes calculating $\partial L / \partial w_j$ for each weight $w_j$, which can be approximated as

$$\frac{\partial L}{\partial w_j} \approx \frac{L(w + \epsilon e_j) - L(w)}{\epsilon}$$

Where:
- $\epsilon$ is a small positive number.
- $e_j$ represents the unit vector in the $j$-th direction.

In other words, $\partial L / \partial w_j$ can be estimated by calculating the loss for two slightly different values of $w_j$, and then applying the rise over run equation. The same principle allows for the calculation of partial derivatives with respect to the biases.
This approach appears highly promising. It is conceptually simple and extremely easy to implement, requiring only a few lines of code, and it certainly seems more attractive than attempting to apply the chain rule directly to the gradient calculation.
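As a concrete illustration, here is a minimal sketch in Python of how a single partial derivative could be estimated with the rise over run formula. The `loss` function is a placeholder standing in for one forward pass over the training data; its name, the helper `estimate_partial`, and the toy quadratic loss are all illustrative assumptions, not part of the original text.

```python
import numpy as np

def estimate_partial(loss, w, j, eps=1e-5):
    """Estimate dL/dw_j with the rise over run formula.

    `loss` is assumed to run one forward pass over the training data
    and return the scalar loss for the weight vector `w`.
    """
    e_j = np.zeros_like(w)
    e_j[j] = 1.0                                   # unit vector in the j-th direction
    return (loss(w + eps * e_j) - loss(w)) / eps   # rise over run

# Toy example: L(w) = 0.5 * ||w||^2, so dL/dw_j = w_j exactly.
toy_loss = lambda w: 0.5 * np.dot(w, w)
w = np.array([0.3, -1.2, 2.0])
print(estimate_partial(toy_loss, w, j=1))          # approximately -1.2
```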
The Implementation Gap
However attractive this approach appears, it proves to be extremely slow when implemented.
To understand why, consider a neural network with one million weights. In such a case, for every single weight $w_j$, it is necessary to calculate $L(w + \epsilon e_j)$ to determine $\partial L / \partial w_j$. This means that to calculate the entire gradient, the loss function must be evaluated one million different times, requiring one million forward passes through the network (for every single training example). Since $L(w)$ must also be calculated, the total computation amounts to one million and one passes through the network.
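To make the counting argument explicit, the sketch below (again using a placeholder `loss(w)` that performs one forward pass per call; the function names are assumptions for illustration) loops over every weight. For a network with $N$ weights it calls the loss function $N + 1$ times, which is exactly the "one million and one" forward passes in the example above.

```python
import numpy as np

def numerical_gradient(loss, w, eps=1e-5):
    """Estimate the full gradient dL/dw by finite differences.

    Each call to `loss` corresponds to one forward pass through
    the network, so this costs len(w) + 1 forward passes in total.
    """
    base = loss(w)                    # 1 forward pass for L(w)
    grad = np.zeros_like(w)
    for j in range(len(w)):           # len(w) additional forward passes
        e_j = np.zeros_like(w)
        e_j[j] = 1.0
        grad[j] = (loss(w + eps * e_j) - base) / eps
    return grad                       # total: len(w) + 1 loss evaluations
```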
The Backpropagation Breakthrough
Backpropagation's Key Idea
The brilliance of backpropagation lies in the fact that it allows for the simultaneous calculation of all partial derivatives using a single forward pass through the network, followed by a single backward pass. Generally, the computational cost of the backward pass is nearly identical to that of the forward pass.
Note
This statement is plausible; however, formulating it rigorously requires a more careful analysis. Its plausibility derives from the fact that the dominant computational cost of the forward pass is multiplication by the weight matrices, while that of the backward pass is multiplication by their transposes, and these operations clearly have a similar computational cost.
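As an informal illustration of this cost argument, the sketch below shows a single fully connected layer with a sigmoid activation (all function names are illustrative assumptions). The forward step is dominated by the product `W @ x`, while the corresponding backward step is dominated by `W.T @ delta_z`, a matrix-vector product of exactly the same size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, x):
    """Forward step: dominated by the matrix-vector product W @ x."""
    z = W @ x + b
    return sigmoid(z), z

def layer_backward(W, z, delta_out):
    """Backward step: dominated by W.T @ delta_out, a matrix-vector
    product of the same size as the one in the forward step."""
    s = sigmoid(z)
    delta_z = delta_out * s * (1.0 - s)    # pass the gradient through the activation
    return W.T @ delta_z                   # gradient signal for the previous layer
```

Both `W @ x` and `W.T @ delta_z` touch every entry of the same weight matrix once, which is why the two passes do comparable amounts of work.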
Conclusion
Key Result
Therefore, the total cost of backpropagation is approximately equivalent to only two forward passes through the network. Comparing this cost to the one million and one forward passes required for the “rise over run” approach, it follows that although backpropagation appears superficially more complex, it is actually far faster.
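Expressed as a rough ratio using the counts above (one million and one forward passes versus roughly two), the speedup is on the order of

$$\frac{10^6 + 1 \ \text{forward passes}}{\approx 2 \ \text{forward passes}} \approx 5 \times 10^5.$$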