Two Mysteries of Backpropagation

The backpropagation algorithm, as commonly presented, raises two fundamental questions:

1. What is Backpropagation actually doing?

The first question concerns the real nature of the algorithm: what is it actually doing? The picture outlined so far is one in which an error is propagated backward from the network's output. However, it remains unclear what is truly occurring beyond a sequence of matrix and vector multiplications. Is it possible to develop a deeper, more concrete intuition of what these calculations represent?

2. How was Backpropagation discovered?

The second question concerns the very genesis of the algorithm: how could someone have originally discovered backpropagation? It is one thing to follow the steps of an algorithm or even to understand the formal proof of its correctness. It is quite another to understand the problem to the point of being able to independently discover the algorithm itself. Thus, a question arises:

Question

Does a plausible line of reasoning exist that could have led to the discovery of the backpropagation algorithm?

In this section, an attempt will be made to answer both of these questions.


The Heuristic of Perturbation

To improve the intuition regarding the algorithm’s functioning, the following scenario is considered:

Important

It is hypothesized that a small variation $\Delta w^l_{jk}$ has been applied to a specific weight $w^l_{jk}$ within the neural network (the weight connecting the $k$-th neuron of layer $l-1$ to the $j$-th neuron of layer $l$).

This variation of the weight implies a change in the output activation $a^l_j$ of the corresponding neuron.

This, in turn, will lead to a change in all activations in the following layer.

These changes will then cause variations in the next layer, and then the one after that, and so on, until they produce a cascading variation in the final layer and, ultimately, in the loss function $C$.

The variation in the loss, $\Delta C$, is related to the variation in the weight, $\Delta w^l_{jk}$, by the equation:

$$
\Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \, \Delta w^l_{jk} \tag{1}
$$

Note

This suggests that a possible approach to calculating $\partial C / \partial w^l_{jk}$ consists of meticulously tracing how a small variation in $w^l_{jk}$ propagates until it causes a small variation in $C$. If this attempt were successful, expressing every intermediate step in terms of easily calculable quantities, one should then be able to compute $\partial C / \partial w^l_{jk}$.
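
As a quick sanity check of Equation (1), the following sketch builds a toy network with one neuron per layer and a quadratic loss (all sizes, parameter values, and function names are illustrative assumptions, not taken from the text above), estimates $\partial C / \partial w$ numerically, perturbs the weight by a small $\Delta w$, and compares the actual change in the loss with the first-order prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, x, y):
    """Forward pass of a tiny 1-1-1 sigmoid network with a quadratic loss."""
    a1 = sigmoid(w1 * x)
    a2 = sigmoid(w2 * a1)
    return 0.5 * (a2 - y) ** 2

# Illustrative values (assumptions, not taken from the text).
w1, w2, x, y = 0.6, -1.3, 0.8, 0.25

# Estimate dC/dw1 with a very small central difference.
eps = 1e-7
dC_dw1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)

# Apply a larger perturbation and compare the actual change in the loss
# with the linear prediction of Equation (1).
dw = 1e-3
actual = loss(w1 + dw, w2, x, y) - loss(w1, w2, x, y)
predicted = dC_dw1 * dw

print(f"actual change in loss:   {actual:.10f}")
print(f"predicted (dC/dw1)*dw:   {predicted:.10f}")
```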


Tracing the Propagation Path

The process is executed as follows. The variation $\Delta w^l_{jk}$ causes a small variation $\Delta a^l_j$ in the activation of the $j$-th neuron in the $l$-th layer. This variation is given by:

$$
\Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \, \Delta w^l_{jk} \tag{2}
$$

The variation in activation $\Delta a^l_j$ will, in turn, cause variations in all the activations of the next layer, layer $l+1$.

Attention is now focused on how a single activation of that layer, for example $a^{l+1}_q$, is influenced.

Specifically, it will cause the following variation:

$$
\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \, \Delta a^l_j
$$

Substituting the expression from Equation (2), the following is obtained:

$$
\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \, \Delta w^l_{jk}
$$

Naturally, the variation $\Delta a^{l+1}_q$ will in turn cause changes in the activations of the subsequent layer. One can imagine a path traversing the entire network, from the weight $w^l_{jk}$ to the loss function $C$, where each activation variation determines a variation in the next, eventually producing a modification of the output loss.

If the path traverses activations $a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m$, the resulting expression is:

$$
\Delta C \approx \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \cdots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \, \Delta w^l_{jk}
$$

A term of the type $\partial a / \partial a$ has been introduced for each additional neuron crossed, in addition to the final $\partial C / \partial a^L_m$ term. This expression represents the variation of the loss function due to activation changes along this specific path in the network.
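
In a network with a single neuron per layer there is exactly one such path, so the product of the variation factors along it should account for the whole derivative. The sketch below reuses the toy 1-1-1 sigmoid network from the previous example (again, every value and name is an illustrative assumption), computes each factor $\partial a^{l+1} / \partial a^{l}$ and $\partial a^{l} / \partial w$ separately, multiplies them, and checks the product against a numerical derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative parameters of a 1-1-1 network (assumed, not from the text).
w1, w2, x, y = 0.6, -1.3, 0.8, 0.25

# Forward pass, keeping the intermediate quantities.
z1 = w1 * x;  a1 = sigmoid(z1)
z2 = w2 * a1; a2 = sigmoid(z2)

# Variation factors along the single path w1 -> a1 -> a2 -> C.
da1_dw1 = sigmoid_prime(z1) * x      # partial a1 / partial w1
da2_da1 = sigmoid_prime(z2) * w2     # partial a2 / partial a1
dC_da2  = (a2 - y)                   # partial C / partial a2, with C = (a2 - y)^2 / 2

path_product = dC_da2 * da2_da1 * da1_dw1

# Numerical derivative of the loss with respect to w1, for comparison.
def loss(w1_):
    a1_ = sigmoid(w1_ * x)
    a2_ = sigmoid(w2 * a1_)
    return 0.5 * (a2_ - y) ** 2

eps = 1e-6
numerical = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)

print(f"product of variation factors: {path_product:.8f}")
print(f"numerical dC/dw1:             {numerical:.8f}")
```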


Summing Over All Possible Paths

Naturally, numerous paths exist through which a variation in $w^l_{jk}$ can propagate to influence the loss. To calculate the total variation of $C$, it is plausible that one must sum over all possible paths connecting the weight to the final loss function:

$$
\Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \cdots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \, \Delta w^l_{jk}
$$

where the sum is performed over all possible choices of intermediate neurons along the path. Comparing this with Equation (1), it is observed that:

$$
\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \cdots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \tag{3}
$$

Note

Equation (3) may appear complex; however, it admits a clear intuitive interpretation. The variation of the loss function is being calculated with respect to a network weight. The equation shows that every connection (or edge) between two neurons in the network is associated with a variation factor, which corresponds to the partial derivative of one neuron’s activation with respect to the other’s.

The edge connecting the initial weight to the first neuron has $\partial a^l_j / \partial w^l_{jk}$ as its variation factor. The variation factor of an entire path is simply the product of the variation factors associated with its component edges. The total derivative $\partial C / \partial w^l_{jk}$ is therefore the sum of the variation factors of all paths connecting the initial weight to the final loss.

This procedure is illustrated below for a single path:
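
Going one step beyond a single path, the following sketch enumerates every path from one weight of the first layer to the loss in a small fully connected network (the layer sizes, the quadratic loss, and all variable names are illustrative assumptions), multiplies the variation factors along each path, sums the products, and compares the total with a numerical derivative of the loss with respect to that weight:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# A small 2-3-3-2 fully connected network (sizes chosen arbitrarily).
sizes = [2, 3, 3, 2]
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
bs = [rng.normal(size=sizes[i + 1]) for i in range(3)]
x = rng.normal(size=sizes[0])
y = rng.normal(size=sizes[-1])

def forward(weights):
    """Forward pass; returns the loss, pre-activations, and activations."""
    a, zs, acts = x, [], [x]
    for W, b in zip(weights, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)
    return 0.5 * np.sum((a - y) ** 2), zs, acts

C, zs, acts = forward(Ws)

# Focus on one weight of the first layer: W^1[j, k].
j, k = 1, 0

# Sum of variation factors over all paths j -> q -> m ending at the loss.
total = 0.0
for q in range(sizes[2]):          # neuron in the second hidden layer
    for m in range(sizes[3]):      # neuron in the output layer
        dC_dam  = acts[3][m] - y[m]                      # dC / da^3_m
        dam_daq = sigmoid_prime(zs[2][m]) * Ws[2][m, q]  # da^3_m / da^2_q
        daq_daj = sigmoid_prime(zs[1][q]) * Ws[1][q, j]  # da^2_q / da^1_j
        daj_dw  = sigmoid_prime(zs[0][j]) * x[k]         # da^1_j / dw^1_jk
        total += dC_dam * dam_daq * daq_daj * daj_dw

# Numerical derivative with respect to the same weight, for comparison.
eps = 1e-6
Ws_plus  = [W.copy() for W in Ws]; Ws_plus[0][j, k]  += eps
Ws_minus = [W.copy() for W in Ws]; Ws_minus[0][j, k] -= eps
numerical = (forward(Ws_plus)[0] - forward(Ws_minus)[0]) / (2 * eps)

print(f"sum over all paths: {total:.8f}")
print(f"numerical dC/dw:    {numerical:.8f}")
```

The agreement between the two printed values is exactly what Equation (3) asserts.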


Conclusion

Note

What has been presented thus far constitutes a heuristic argument: a way of interpreting what happens when a weight is perturbed within a neural network.

The following line of thought is useful for further developing this argument. First, explicit expressions for all partial derivatives in Equation (3) can be derived; this is a relatively simple operation requiring only basic differential calculus. Once these expressions are obtained, an attempt can be made to reformulate all sums over indices as matrix multiplications. This step is somewhat tedious and requires perseverance, but it does not involve extraordinary insights.

Upon completing this procedure and simplifying the result as much as possible, it is discovered that one arrives exactly at the backpropagation algorithm.
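
For reference, the matrix form that this procedure ultimately collapses into is the usual set of backpropagation equations as commonly stated in the literature, where $\delta^l$ denotes the error vector of layer $l$, $\sigma$ an elementwise activation, and $\odot$ the elementwise (Hadamard) product:

$$
\begin{aligned}
\delta^{L} &= \nabla_{a} C \odot \sigma'(z^{L}), \\
\delta^{l} &= \bigl( (w^{l+1})^{T} \delta^{l+1} \bigr) \odot \sigma'(z^{l}), \\
\frac{\partial C}{\partial b^{l}_{j}} &= \delta^{l}_{j}, \\
\frac{\partial C}{\partial w^{l}_{jk}} &= a^{l-1}_{k}\, \delta^{l}_{j}.
\end{aligned}
$$

Each sum over intermediate neuron indices in Equation (3) becomes one of the matrix-vector products above, which is precisely the bookkeeping the algorithm automates.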

Important

It is therefore possible to interpret the backpropagation algorithm as a method for calculating the sum of variation factors along all possible paths. In other words, the backpropagation algorithm represents an ingenious way to keep track of small perturbations introduced into weights (and biases), monitoring how they propagate through the network, reach the output, and finally influence the loss function.