Introduction

The Missing Link

Thus far, the discussion has covered how neural networks learn their weights and biases using the gradient descent algorithm. A significant gap remains, however: how the gradient of the cost function is actually calculated has not been discussed. This section introduces backpropagation, a fast algorithm for computing these gradients.

⏳ Historical Context and Popularization

Although backpropagation is the primary tool for neural network learning today, recognition of its importance came late.

Era	Development	Key Significance
1970s	Original Introduction	The algorithm was first introduced during this decade.
1986	Global Recognition	A famous article by Rumelhart, Hinton, and Williams showed that backpropagation works far faster than earlier approaches to learning.

Schmidhuber's Historical Note

As Jürgen Schmidhuber has frequently noted, backpropagation was invented well before the 1986 article popularized it. It was the speed demonstrated in that article, however, that made it practical to apply neural networks to problems previously considered insoluble.

🧠 Why Backpropagation is Worth Studying

The main reason to study this algorithm in detail is understanding: it shows how a network's cost responds to each of its weights and biases.

The Mathematical Heart

The core of backpropagation is an expression for the partial derivative of the cost function $L$ with respect to any weight $w$ (or bias $b$) in the network:
$\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}$
These expressions indicate the rate at which the cost changes as the weights and biases are modified. Although the formulation is somewhat complex, it has an inherent elegance: each of its elements has a natural, intuitive interpretation.
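
As a minimal sketch of what these two partial derivatives mean, the example below uses a single sigmoid neuron with a squared-error cost. The neuron, the cost, the function names, and the numeric values are all illustrative assumptions rather than anything specified in the text. The chain-rule result is compared with a finite-difference estimate, which measures the same rate of change by nudging each parameter directly.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared-error cost of a single sigmoid neuron (illustrative choice)."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2


def analytic_gradient(w, b, x, y):
    """Return (dL/dw, dL/db) obtained by applying the chain rule by hand."""
    a = sigmoid(w * x + b)
    # dL/da = (a - y); da/dz = a * (1 - a); dz/dw = x; dz/db = 1
    delta = (a - y) * a * (1.0 - a)
    return delta * x, delta


def numeric_gradient(w, b, x, y, eps=1e-6):
    """Central finite differences: nudge each parameter and watch the cost."""
    dw = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
    db = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return dw, db


# Hypothetical values, chosen only for illustration.
w, b, x, y = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = analytic_gradient(w, b, x, y)
num_w, num_b = numeric_gradient(w, b, x, y)
print(grad_w, num_w)  # the two estimates of dL/dw should agree closely
print(grad_b, num_b)  # likewise for dL/db
```

Both derivatives come out negative here because the neuron's output is below the target, so increasing either parameter lowers the cost.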

Behavioral Insights

Backpropagation is not merely an efficient algorithm for learning; it provides detailed insights into the network’s internal dynamics:

Parameter Influence: It reveals how the modification of weights and biases influences the overall behavior of the network.

Computational Efficiency: It is significantly faster than previous learning approaches, enabling the use of neural networks for complex tasks.

Explanatory Power: Because each element of its equations can be interpreted, it makes visible how a connectionist system adapts, which is why it merits thorough study.

Clarification on the Term Backpropagation

The term backpropagation is often misinterpreted as referring to the entire learning algorithm for neural networks.

Strictly speaking, backpropagation refers only to the method for calculating the gradient. The actual learning, that is, the update of the model’s parameters, is performed by a separate algorithm such as stochastic gradient descent (SGD), which uses the gradient supplied by backpropagation to adjust the weights and biases.
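
This division of labor can be sketched as follows, assuming a linear model with one weight and one bias and a squared-error cost. The hypothetical `compute_gradient` stands in for the job backpropagation does in a full network, while the hypothetical `sgd_step` is the learning rule. The model, the toy data, and the learning rate are assumptions made only for this example.

```python
import random


def compute_gradient(w, b, x, y):
    """Gradient of L = 0.5 * (w*x + b - y)**2 with respect to (w, b).

    In a multi-layer network this is the job backpropagation does.
    """
    error = (w * x + b) - y
    return error * x, error


def sgd_step(w, b, grad_w, grad_b, learning_rate):
    """The learning rule: it only consumes a gradient, however it was computed."""
    return w - learning_rate * grad_w, b - learning_rate * grad_b


def mean_loss(w, b, data):
    return sum(0.5 * (w * x + b - y) ** 2 for x, y in data) / len(data)


# Toy data drawn from y = 2x + 1 (an assumption made for this example).
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

rng = random.Random(0)
w, b = 0.0, 0.0
initial_loss = mean_loss(w, b, data)
for _ in range(2000):
    x, y = rng.choice(data)  # "stochastic": one sample per step
    grad_w, grad_b = compute_gradient(w, b, x, y)
    w, b = sgd_step(w, b, grad_w, grad_b, learning_rate=0.1)
final_loss = mean_loss(w, b, data)
print(initial_loss, final_loss)
print(w, b)
```

Swapping `sgd_step` for a different update rule would leave `compute_gradient` untouched, which is the sense in which the two are separate algorithms.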

Furthermore, backpropagation is often mistakenly considered a technique specific to neural networks. In principle, it can be used to calculate the derivatives of any differentiable function (for a function that is not differentiable everywhere, the output at the problematic points is that the derivative is undefined).
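
To illustrate that nothing here depends on neural networks, the following toy sketch propagates derivatives backward through an ordinary function, $f(x, y) = x y + \sin(x)$. The `Var` class and the `sin` and `backward` helpers are hypothetical names written for this example, not part of any existing library, and only addition, multiplication, and sine are supported.

```python
import math


class Var:
    """A scalar value that records how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))


def backward(output):
    """Propagate derivatives from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        # Post-order traversal: inputs are listed before the nodes using them.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output toward the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


# f(x, y) = x * y + sin(x): an ordinary function, not a neural network.
x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)
backward(f)
print(x.grad, y.grad)  # analytically: df/dx = y + cos(x), df/dy = x
```

The reversed traversal ensures that a value used in several places, such as `x` here, has received every contribution to its derivative before that derivative is passed further back.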