Skip connections

The Challenge of Depth in Neural Networks

As the architecture of a neural network grows in depth (i.e., more hidden layers), all the precautions discussed so far, such as the choice of activation function, loss function, and weight initialization, struggle to fully address a fundamental issue:

The deeper the network, the harder it becomes to faithfully propagate information across its layers.

This difficulty in faithfully propagating information can be better understood by considering the most elementary case:

What if the network were required to do nothing more than preserve its input unchanged? In other words, what if it had to learn the identity mapping?

The Identity Mapping Paradox

Consider a deep neural network tasked with the simplest possible objective: mapping each input $x$ to the same output $x$, that is, learning the identity mapping.

Despite the simplicity of the task, the random initialization of the parameters makes it highly unlikely that the network reproduces its input out of the box: the identity has to be learned like any other mapping.

For a deep neural network, learning the identity function can be as challenging as learning a complex task.
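To make this concrete, here is a minimal sketch, assuming a toy "network" that is just a chain of scalar linear layers with standard-normal weights. The depth, the seed, and the names `deep_linear_chain`, `random_weights`, and `identity_weights` are illustrative assumptions, not taken from the text. The identity is exactly representable in this toy (all weights equal to 1), yet a random initialization lands far from it.

```python
import random

def deep_linear_chain(x, weights):
    """Pass a scalar through a stack of linear layers h <- w * h."""
    h = x
    for w in weights:
        h = w * h
    return h

random.seed(0)  # fixed seed, chosen arbitrarily for reproducibility
depth = 20
random_weights = [random.gauss(0.0, 1.0) for _ in range(depth)]
identity_weights = [1.0] * depth  # the solution the network would have to find

x = 2.0
out_random = deep_linear_chain(x, random_weights)
out_identity = deep_linear_chain(x, identity_weights)

print("input:                  ", x)
print("random init output:     ", out_random)
print("identity weights output:", out_identity)
```

Every one of the 20 weights has to move to the right value before the chain behaves like the identity, which is the difficulty the section describes.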

💡 Kaiming He’s Insight: Skip Connections

💡Idea

Kaiming He and colleagues (2015) turned the perspective upside down:
it is no longer necessary to explicitly model the identity, as it is transmitted in parallel, intact.

Specifically:

Branching of the data flow:
At strategic points in the architecture, the network splits the data flow: one part passes through the layers (convolution, activation, etc.), while the other is carried to the output unchanged via a direct connection (skip connection).
Merge of direct and transformed paths:
At the layer’s output, the result of the transformation is added to the original input.
This addition takes place before the activation function is applied, so that the activation operates on the “complete” signal (input + residual).
Focus on the residual function:
The layers no longer need to learn the entire desired mapping, but only what must be added to the input to obtain the final output.
In other words, the network learns a residual function: the difference between what is available and what is required.

This simple trick, the core of Residual Networks (ResNet), makes it possible to successfully train networks hundreds of layers deep, mitigating vanishing gradients and optimization difficulties.
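The three steps (branch, merge, residual) can be sketched in a few lines. This is an illustrative toy, assuming a single fully connected transformation and a ReLU activation on plain Python lists; `residual_block`, `matvec`, and the sample values are hypothetical and not the ResNet reference implementation.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, v_i) for v_i in v]

def residual_block(x, W, b):
    # Transformed path: the layers compute F(x) = W x + b.
    transformed = [t + b_i for t, b_i in zip(matvec(W, x), b)]
    # Merge: add the untouched input carried by the skip connection.
    merged = [t + x_i for t, x_i in zip(transformed, x)]
    # The activation is applied after the addition, on input + residual.
    return relu(merged)

x = [1.0, 2.0, 3.0]
zero_W = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With a zero residual branch, the block passes the (non-negative) input through.
y_identity = residual_block(x, zero_W, zero_b)

# A small residual only nudges the input.
small_W = [[0.1, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, -0.1]]
y_nudged = residual_block(x, small_W, zero_b)

print(y_identity)
print(y_nudged)
```

Note that the pass-through holds here only because the sample input is non-negative: the ReLU applied after the sum would clip negative entries.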

By inserting strategic shortcuts, the network stops fighting depth and starts exploiting it.

💪 Advantages of Skip Connections

Skip connections represent a simple yet revolutionary insight that brings numerous benefits to the training of deep networks.

✅ 1. No need to learn the identity mapping

Thanks to skip connections, the network is no longer asked to “reconstruct” the input as output when there is nothing to change.
The identity is already present in the parallel flow of the skip connection: the network can concentrate on what it must modify, not on what it must leave unchanged.

👉 Corollary

Focus on the informative residual

An important corollary: the network concentrates on what to add to the identity in order to reach the desired result. To transform the image of a cat into that of a dog, it is easier to start from what the two share (eyes, nose, ears…) and add the differences than to build everything from scratch.

Important

The idea underlying skip connections is to learn a residual transformation. This approach is called residual learning: the network learns a function representing what must be added to the identity to reach the final goal. A complex task is thus decomposed into two parts:

Complex Task = Identity (do nothing) + Residual Transformation (do only what is needed in addition).

In this way, the network does not have to reinvent the input from scratch, but only learn the “difference” needed to reach the goal.

Note

Even “doing nothing” can be a hard task for a deep network.
With skip connections, “doing nothing” comes for free, so the network can concentrate on what really matters.
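As a worked illustration of the decomposition into identity plus residual, consider a toy scalar task whose target is only slightly different from the input ($y = 1.1x$). The sketch below uses assumed names (`plain_model`, `residual_model`) and an arbitrarily chosen learning rate and step count. It compares a plain model $wx$ with a residual model $x + wx$, both starting from $w = 0$.

```python
def plain_model(w, x):
    return w * x          # must learn the whole mapping

def residual_model(w, x):
    return x + w * x      # identity is given; w only models the residual

def squared_error(pred, target):
    return (pred - target) ** 2

x, target = 2.0, 2.2      # target = 1.1 * x, i.e. "almost identity"

# At initialization (w = 0) the residual model is already close to the target.
loss_plain_init = squared_error(plain_model(0.0, x), target)
loss_residual_init = squared_error(residual_model(0.0, x), target)

# Plain gradient descent on the residual model's single weight.
w, lr = 0.0, 0.1
for _ in range(30):
    grad = 2.0 * (residual_model(w, x) - target) * x  # d(loss)/dw
    w -= lr * grad

print("initial loss, plain model:   ", loss_plain_init)
print("initial loss, residual model:", loss_residual_init)
print("learned residual weight:     ", w)
```

The learned weight converges to about 0.1, which is exactly the "difference" between the input and the target; the identity part never had to be learned.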

✅ 2. Improved gradient flow in backpropagation

During backpropagation, the gradient propagates through the network following the chain rule.
In a very deep network, this gradient tends to gradually dilute, eventually almost vanishing (i.e., the vanishing gradient problem).

With skip connections:

Part of the gradient still travels along the canonical path through the intermediate layers, where it may attenuate and vanish;
Another part follows the skip connection, bypassing the layers and reaching the earlier layers almost intact.

The “fast lane” provided by skip connections shortens the chain of gradient multiplications, drastically reducing the risk of vanishing gradients.
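A back-of-the-envelope sketch of this effect, under strong simplifying assumptions: scalar layers, each with the same small local derivative $a = 0.01$ (in a real network these are Jacobian matrices and differ per layer). For a residual block $h_{l} = h_{l-1} + F(h_{l-1})$ the local derivative becomes $1 + a$ instead of $a$. The function names and the depth of 50 are illustrative.

```python
import math

def gradient_through_plain_chain(local_derivs):
    """Chain rule without skips: the gradient is a pure product."""
    return math.prod(local_derivs)

def gradient_through_residual_chain(local_derivs):
    """With identity skips each block contributes (1 + a) instead of a."""
    return math.prod(1.0 + a for a in local_derivs)

depth = 50
local_derivs = [0.01] * depth   # assumed small per-layer derivative

plain = gradient_through_plain_chain(local_derivs)
residual = gradient_through_residual_chain(local_derivs)

print(f"plain chain:    {plain:.3e}")
print(f"residual chain: {residual:.3f}")
```

The plain product collapses towards zero, while the residual product stays of order one because every factor contains the identity term.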

Important

Thanks to the skip connection, backpropagation has an alternative path along which the chain of products of partial derivatives produced by the chain rule is much shorter than on the canonical path without skip connections.

Crucial

Skip connections are therefore a way to establish alternative paths for the gradient during backpropagation.

Summary

Skip connections:

Simplify learning, by providing identity as a default behavior rather than making the network learn it;
Make the network focus on the residual information, i.e., on what truly matters;
Promote stable gradient propagation, by providing direct paths through which the gradient can flow, mitigating the vanishing gradient problem.

Remark

Skip connections are crucial for very deep neural networks. In a network with $1000$ layers and no skip connections, the gradient is at high risk of vanishing before it reaches the earliest layers.
With skip connections, a significant portion of the gradient reaches the earlier layers, making it possible to train very deep architectures.

[!note] 🔗 Backpropagation with Skip Connections: Full Derivation

Setup of the residual block (general case):
Let

$$ z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}, \qquad h^{(l)} = f(z^{(l)}) + S^{(l)} h^{(l-k)}, $$

where:

$f$ is a (differentiable) non-linearity applied element-wise,
$S^{(l)}$ represents the skip connection (the identity $I$ in the standard “ResNet-style” case, or a projection / 1x1 convolution / linear map),
$k \geq 1$ is the length of the skip (typically $k = 1$ for an identity skip from the previous layer).

We denote the loss by $L$ and define the “deltas”:

$$ \delta^{(l)} \triangleq \frac{\partial L}{\partial h^{(l)}}. $$

1) Sum node (gradient split)

Since $h^{(l)} = f(z^{(l)}) + S^{(l)} h^{(l-k)}$, the derivatives with respect to the two branches are:

$$ \frac{\partial L}{\partial f(z^{(l)})} = \delta^{(l)}, \qquad \frac{\partial L}{\partial (S^{(l)} h^{(l-k)})} = \delta^{(l)}. $$

The contribution flowing to the input of the skip is therefore:

$$ \delta^{(l-k)} \mathrel{+}= (S^{(l)})^{\top} \delta^{(l)}. $$

In the case of an identity skip ($S^{(l)} = I$):

$$ \delta^{(l-k)} \mathrel{+}= \delta^{(l)}. $$

2) Non-linearity (chain rule on the “main” branch)

Define the gradient with respect to the pre-activation:

$$ g^{(l)} \triangleq \frac{\partial L}{\partial z^{(l)}} = J_{f}(z^{(l)})^{\top} \delta^{(l)}. $$

For element-wise activations, $J_{f}$ is diagonal and therefore:

$$ g^{(l)} = f'(z^{(l)}) \odot \delta^{(l)}, $$

where $\odot$ is the element-wise product.

3) Gradient with respect to the layer parameters

Using $z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}$:

$$ \frac{\partial L}{\partial W^{(l)}} = g^{(l)} (h^{(l-1)})^{\top}, \qquad \frac{\partial L}{\partial b^{(l)}} = g^{(l)}. $$

4) Flow towards the inputs of the “main” layer

The contribution to $h^{(l-1)}$ from the main branch is:

$$ \delta^{(l-1)} \mathrel{+}= (W^{(l)})^{\top} g^{(l)}. $$

General accumulation rule.
In a computational graph with several paths leaving a node $x$, the gradient contributions add up:

$$ \frac{\partial L}{\partial x} = \sum_{\text{all branches } r} \left( \frac{\partial y_{r}}{\partial x} \right)^{\top} \frac{\partial L}{\partial y_{r}}. $$
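The accumulation rule can be checked numerically on a small scalar example. The functions below ($y_1 = x^2$, $y_2 = 3x$, $L = y_1 y_2$) are arbitrary choices made for this sketch and do not come from the derivation.

```python
def loss_from_x(x):
    y1 = x ** 2        # branch 1
    y2 = 3.0 * x       # branch 2
    return y1 * y2     # L depends on x only through y1 and y2

x = 1.5
y1, y2 = x ** 2, 3.0 * x

# Upstream gradients dL/dy_r for L = y1 * y2.
dL_dy1, dL_dy2 = y2, y1
# Local derivatives dy_r/dx.
dy1_dx, dy2_dx = 2.0 * x, 3.0

# Accumulation rule: sum the contributions of all branches.
grad_accumulated = dy1_dx * dL_dy1 + dy2_dx * dL_dy2

# Central finite difference as an independent check.
eps = 1e-6
grad_numeric = (loss_from_x(x + eps) - loss_from_x(x - eps)) / (2 * eps)

print("accumulated:", grad_accumulated)
print("numeric:    ", grad_numeric)
```

In a residual block with $k = 1$ the node $h^{(l-1)}$ is exactly such a branching point: it feeds both the main layer and the skip.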

5) ResNet-style case (identity skip between consecutive layers)

For the block

$$ h^{(l)} = f(W^{(l)} h^{(l-1)} + b^{(l)}) + h^{(l-1)}, $$

we have:

$$ \delta^{(l-1)} = \underbrace{(W^{(l)})^{\top} g^{(l)}}_{\text{main branch}} + \underbrace{\delta^{(l)}}_{\text{identity skip}}. $$

That is,

$$ \delta^{(l-1)} = \delta^{(l)} + (W^{(l)})^{\top} \left( f'(z^{(l)}) \odot \delta^{(l)} \right). $$
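The formula for $\delta^{(l-1)}$ can be sanity-checked on a scalar block. This sketch assumes $f = \tanh$ and a toy loss $L = \tfrac{1}{2} h^2$ (so that $\delta^{(l)} = h^{(l)}$); the numeric values and the names `block_forward` and `loss` are arbitrary.

```python
import math

def block_forward(h_prev, w, b):
    """h = tanh(w * h_prev + b) + h_prev (scalar residual block)."""
    z = w * h_prev + b
    return math.tanh(z) + h_prev, z

def loss(h):
    return 0.5 * h ** 2   # assumed toy loss, so that dL/dh = h

h_prev, w, b = 0.7, -1.3, 0.2
h, z = block_forward(h_prev, w, b)

delta = h                                   # delta^{(l)} = dL/dh^{(l)}
g = (1.0 - math.tanh(z) ** 2) * delta       # f'(z) * delta
delta_prev = delta + w * g                  # identity skip + main branch

# Central finite difference on the loss as an independent check.
eps = 1e-6
numeric = (loss(block_forward(h_prev + eps, w, b)[0])
           - loss(block_forward(h_prev - eps, w, b)[0])) / (2 * eps)

print("analytic:", delta_prev)
print("numeric: ", numeric)
```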

6) Why skip connections help with gradients (formal intuition)

In a chain of residual blocks with identity skips, the relation above can be written for each block in matrix form (biases are omitted for simplicity):

$$ \delta^{(l-1)} = \left[ I + (W^{(l)})^{\top} D^{(l)} \right] \delta^{(l)}, $$

where $D^{(l)} = \operatorname{diag}(f'(z^{(l)}))$.

Repeating this over several levels, the identity contributes additive terms that preserve the gradient signal and mitigate the effect of repeated matrix products (which can cause vanishing/exploding gradients). This is the core of the success of residual networks.
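To make the additive terms explicit, the per-block relation can be unrolled over a stack of $N$ blocks. Here $N$ and the shorthand $A^{(l)} = (W^{(l)})^{\top} D^{(l)}$ are notation introduced only for this sketch:

$$ \delta^{(0)} = \Big[ \prod_{l=1}^{N} \big( I + A^{(l)} \big) \Big] \delta^{(N)} = \Big[ I + \sum_{l=1}^{N} A^{(l)} + \sum_{l < m} A^{(l)} A^{(m)} + \dots + A^{(1)} A^{(2)} \cdots A^{(N)} \Big] \delta^{(N)}. $$

Without skip connections only the last term, the full product $A^{(1)} \cdots A^{(N)}$, would remain. With them, the leading $I$ contributes a copy of $\delta^{(N)}$ that does not pass through any weight matrix, and the shorter products give paths of every intermediate length.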

7) Variant with a projection skip (e.g. $S^{(l)} = W_{s}^{(l)}$)

If the skip is linear (projection / change of dimension):

$$ h^{(l)} = f(z^{(l)}) + W_{s}^{(l)} h^{(l-k)}, $$

then:

$$ \delta^{(l-k)} \mathrel{+}= (W_{s}^{(l)})^{\top} \delta^{(l)}, \qquad \frac{\partial L}{\partial W_{s}^{(l)}} = \delta^{(l)} (h^{(l-k)})^{\top}. $$
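A small numeric sketch of the projection case, assuming a $2 \times 3$ matrix $W_s$ that maps a 3-dimensional skip input to a 2-dimensional output. To isolate the skip branch, the main-branch output $f(z)$ is replaced by a fixed vector `main_out`; this placeholder, the toy loss $L = \tfrac{1}{2}\lVert h \rVert^2$, and all numeric values are assumptions of the example.

```python
def matvec(W, x):
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def outer(u, v):
    return [[u_i * v_j for v_j in v] for u_i in u]

def block_output(W_s, h_skip, main_out):
    """h = main_out + W_s h_skip, with main_out standing in for f(z)."""
    return [m + s for m, s in zip(main_out, matvec(W_s, h_skip))]

def loss(h):
    return 0.5 * sum(h_i ** 2 for h_i in h)   # toy loss: dL/dh = h

W_s = [[0.5, -1.0, 0.2], [0.3, 0.8, -0.6]]    # projects 3 -> 2 dimensions
h_skip = [1.0, -2.0, 0.5]
main_out = [0.1, -0.4]                         # held fixed in this check

delta = block_output(W_s, h_skip, main_out)    # delta^{(l)} = dL/dh^{(l)}
delta_skip = matvec(transpose(W_s), delta)     # contribution to delta^{(l-k)}
grad_W_s = outer(delta, h_skip)                # dL/dW_s

# Finite-difference check of one entry of dL/dW_s.
eps = 1e-6
W_plus = [row[:] for row in W_s]
W_plus[0][1] += eps
W_minus = [row[:] for row in W_s]
W_minus[0][1] -= eps
numeric = (loss(block_output(W_plus, h_skip, main_out))
           - loss(block_output(W_minus, h_skip, main_out))) / (2 * eps)

print("analytic dL/dW_s[0][1]:", grad_W_s[0][1])
print("numeric  dL/dW_s[0][1]:", numeric)
```

The transpose in $(W_s)^{\top} \delta$ is what maps the 2-dimensional gradient back to the 3-dimensional skip input.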

8) Operational summary (conceptual pseudocode for one block)

Input: $\delta^{(l)}$
$g^{(l)} \leftarrow f'(z^{(l)}) \odot \delta^{(l)}$
Accumulate the parameter gradients:
$\nabla W^{(l)} \leftarrow g^{(l)} (h^{(l-1)})^{\top}, \quad \nabla b^{(l)} \leftarrow g^{(l)}$
Towards $h^{(l-1)}$: $\delta^{(l-1)} \mathrel{+}= (W^{(l)})^{\top} g^{(l)}$
Towards $h^{(l-k)}$: $\delta^{(l-k)} \mathrel{+}= (S^{(l)})^{\top} \delta^{(l)}$ (or $\mathrel{+}= \delta^{(l)}$ for an identity skip)
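One possible runnable rendering of these steps, in plain Python lists, assuming $f = \tanh$. The function `backward_block`, its signature, the toy loss, and the sample values are hypothetical. The function returns the two contributions separately so that the caller can accumulate them into $\delta^{(l-1)}$ and $\delta^{(l-k)}$.

```python
import math

def matvec(W, x):
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def backward_block(delta, z, h_prev, W, S=None):
    """One backward step for h = tanh(z) + S h_skip, with z = W h_prev + b.

    Returns (grad_W, grad_b, to_prev, to_skip). The caller adds to_prev
    into delta^{(l-1)} and to_skip into delta^{(l-k)}.
    S=None means an identity skip.
    """
    g = [(1.0 - math.tanh(z_i) ** 2) * d_i for z_i, d_i in zip(z, delta)]
    grad_W = [[g_i * h_j for h_j in h_prev] for g_i in g]
    grad_b = list(g)
    to_prev = matvec(transpose(W), g)
    to_skip = list(delta) if S is None else matvec(transpose(S), delta)
    return grad_W, grad_b, to_prev, to_skip

def forward(h_prev, W, b):
    """ResNet-style block with k = 1: h = tanh(W h_prev + b) + h_prev."""
    z = [z_i + b_i for z_i, b_i in zip(matvec(W, h_prev), b)]
    h = [math.tanh(z_i) + h_i for z_i, h_i in zip(z, h_prev)]
    return h, z

def loss(h):
    return 0.5 * sum(h_i ** 2 for h_i in h)   # toy loss: dL/dh = h

W = [[0.4, -0.7], [1.1, 0.3]]
b = [0.05, -0.2]
h_prev = [0.6, -0.9]

h, z = forward(h_prev, W, b)
delta = list(h)
grad_W, grad_b, to_prev, to_skip = backward_block(delta, z, h_prev, W)

# With k = 1 both contributions land on the same node and are summed.
delta_prev = [p + s for p, s in zip(to_prev, to_skip)]
print("delta^{(l-1)}:", delta_prev)
```

Keeping `to_prev` and `to_skip` separate is a design choice of this sketch: for $k > 1$ they belong to different nodes, while for $k = 1$ they are simply added, as the accumulation rule prescribes.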

9) Common instances of $f$

ReLU: $f(x) = \max(0, x)$, $f'(x) = \mathbb{1}_{x > 0}$
tanh: $f'(x) = 1 - \tanh^{2}(x)$
Sigmoid: $f'(x) = \sigma(x)(1 - \sigma(x))$

Substituting $f'(z^{(l)})$ into the previous formulas immediately yields the specific cases.
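The three derivatives can be written down and checked against central finite differences. The helper names and the test point $x = 0.8$ are arbitrary choices for this sketch; the point is kept away from $0$, where ReLU is not differentiable.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_prime(x):
    return 1.0 if x > 0 else 0.0

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.8  # arbitrary test point away from the ReLU kink at 0
checks = {
    "relu": (relu_prime(x), numeric_derivative(relu, x)),
    "tanh": (tanh_prime(x), numeric_derivative(math.tanh, x)),
    "sigmoid": (sigmoid_prime(x), numeric_derivative(sigmoid, x)),
}
for name, (analytic, numeric) in checks.items():
    print(f"{name}: analytic={analytic:.6f}, numeric={numeric:.6f}")
```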
