Initialization, Activations, and Normalization
This database collects the papers that made deep networks train more reliably by controlling signal scale, gradient flow, and activation statistics.
Initialization and Activations
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2010 | Understanding the Difficulty of Training Deep Feedforward Neural Networks | Xavier (Glorot) initialization | Analyzes how activation and gradient variance propagate through layers and proposes a variance-preserving initialization. |
| 2011 | Deep Sparse Rectifier Neural Networks | ReLU | Helped popularize rectified activations for deep networks. |
| 2015 | Delving Deep into Rectifiers | He initialization / PReLU | Derives a variance-preserving initialization for ReLU-family networks and introduces PReLU, a ReLU with a learnable negative slope. |
| 2016 | Gaussian Error Linear Units | GELU | Smooth activation later used in many Transformer models. |
| 2017 | Searching for Activation Functions | Swish | Activation found by automated search, x · sigmoid(βx); used in efficient networks such as EfficientNet. |
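
As a rough illustration of the entries above, the sketch below writes out the Xavier and He weight scales and the ReLU, GELU, and Swish activations in plain Python, then tracks pre-activation scale through a small ReLU network. It is not taken from any of the listed papers' code. The function names (`xavier_std`, `he_std`, `forward_scales`) and the width and depth values are assumptions made for this example, and the Xavier variant shown is the Gaussian one with the same variance as the uniform form.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    # Xavier/Glorot scale: Var(W) = 2 / (fan_in + fan_out).
    # Gaussian variant; the uniform form uses the same variance.
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    # He scale for ReLU layers: Var(W) = 2 / fan_in.
    return math.sqrt(2.0 / fan_in)


def relu(x):
    return max(0.0, x)


def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x).
    return x / (1.0 + math.exp(-beta * x))


def forward_scales(std, width, depth, rng):
    """Return the mean squared pre-activation at each layer of a ReLU MLP.

    Hypothetical helper for this example: square layers, no biases,
    Gaussian weights with the given standard deviation.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    scales = []
    for _ in range(depth):
        w = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        z = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        scales.append(sum(v * v for v in z) / width)
        x = [relu(v) for v in z]
    return scales


if __name__ == "__main__":
    rng = random.Random(0)
    width, depth = 128, 10  # illustrative sizes, not from the papers
    he = forward_scales(he_std(width), width, depth, rng)
    xavier = forward_scales(xavier_std(width, width), width, depth, rng)
    print("He:     last/first scale =", he[-1] / he[0])
    print("Xavier: last/first scale =", xavier[-1] / xavier[0])
```

With ReLU layers, the He scale should keep the pre-activation magnitude roughly constant across depth, while the Xavier scale is expected to roughly halve it at every layer. That mismatch is what the 2015 paper sets out to correct.
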
Normalization
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2015 | Batch Normalization | BatchNorm | Stabilizes and accelerates training through batch statistics. |
| 2016 | Weight Normalization | WeightNorm | Reparameterizes each weight vector to decouple its norm from its direction. |
| 2016 | Layer Normalization | LayerNorm | Normalizes across the features of each sample, so it is independent of batch size; suits recurrent and Transformer-style models. |
| 2018 | Group Normalization | GroupNorm | Normalizes over groups of channels within each sample; independent of batch size and useful for small-batch vision training. |
| 2019 | Root Mean Square Layer Normalization | RMSNorm | Simplifies LayerNorm by dropping mean-centering and rescaling by the root mean square only. |
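
The methods in this table differ mainly in which values they compute statistics over. The sketch below is a minimal plain-Python illustration of that difference, not a reference implementation. It omits the learnable gain and bias and the running statistics BatchNorm uses at inference time, and it treats GroupNorm on a flat channel vector with no spatial dimensions. All function names and the `EPS` constant are assumptions for this example.

```python
import math

EPS = 1e-5  # small constant for numerical stability (assumed value)


def layer_norm(x, eps=EPS):
    """Normalize one sample's features to zero mean and unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=EPS):
    """Rescale one sample's features by their root mean square (no centering)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def group_norm(x, num_groups, eps=EPS):
    """Normalize contiguous channel groups of one sample independently."""
    if len(x) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        out.extend(layer_norm(x[g * size:(g + 1) * size], eps))
    return out


def batch_norm(batch, eps=EPS):
    """Normalize each feature across the batch (training-mode statistics)."""
    columns = [list(col) for col in zip(*batch)]
    normed = [layer_norm(col, eps) for col in columns]
    return [list(row) for row in zip(*normed)]


def weight_norm(v, g):
    """Reparameterize a weight vector as w = g * v / ||v||."""
    norm = math.sqrt(sum(a * a for a in v))
    return [g * a / norm for a in v]


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0]
    print("LayerNorm:", layer_norm(sample))
    print("RMSNorm:  ", rms_norm(sample))
    print("GroupNorm:", group_norm(sample, num_groups=2))
    print("BatchNorm:", batch_norm([[1.0, 10.0], [3.0, 30.0]]))
    print("WeightNorm:", weight_norm([3.0, 4.0], g=2.0))
```

In this sketch only `batch_norm` depends on the other samples in the batch. The per-sample functions give the same output for a sample regardless of batch size, which is why they suit sequence models and small-batch vision training.
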
Reading Path
| Step | Read |
|---|---|
| 1 | Xavier, ReLU, and He initialization. |
| 2 | BatchNorm, for the change it brought to training CNNs and other deep networks. |
| 3 | LayerNorm and RMSNorm for sequence models and LLMs. |
| 4 | GroupNorm for small-batch and dense-prediction vision settings. |