Initialization, Activations, and Normalization

This database collects the papers that made deep networks train more reliably by controlling signal scale, gradient flow, and activation statistics.

Initialization and Activations

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2010 | Understanding the Difficulty of Training Deep Feedforward Neural Networks | Xavier initialization | Links activation statistics, gradients, and trainability. |
| 2011 | Deep Sparse Rectifier Neural Networks | ReLU | Helped popularize rectified activations for deep networks. |
| 2015 | Delving Deep into Rectifiers | He initialization / PReLU | Initialization designed for ReLU-like networks. |
| 2016 | Gaussian Error Linear Units | GELU | Smooth activation later used in many Transformer models. |
| 2017 | Searching for Activation Functions | Swish | Learned/search-discovered activation used in efficient networks. |
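
The initialization rules and activations listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not code from any of the papers: the function names, the `seed` argument, and the list-of-lists weight layout are assumptions made for the example. It uses the uniform variant of Xavier initialization and the normal variant of He initialization, and it omits the learnable parts (PReLU's slope and Swish's `beta` are plain arguments here).

```python
import math
import random


def xavier_uniform_limit(fan_in, fan_out):
    # Xavier/Glorot uniform: sample from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    # He normal: sample from N(0, std^2) with std = sqrt(2 / fan_in), derived for ReLU layers.
    return math.sqrt(2.0 / fan_in)


def init_matrix(fan_in, fan_out, scheme="he", seed=0):
    # Returns a fan_out x fan_in weight matrix as a list of rows (hypothetical layout).
    rng = random.Random(seed)
    if scheme == "xavier":
        a = xavier_uniform_limit(fan_in, fan_out)
        return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]
    if scheme == "he":
        std = he_normal_std(fan_in)
        return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
    raise ValueError(f"unknown scheme: {scheme}")


def _sigmoid(z):
    # Numerically stable logistic function (avoids overflow for large |z|).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(x):
    return max(0.0, x)


def prelu(x, slope=0.25):
    # PReLU: identity for positive inputs, a (normally learned) slope for negative inputs.
    return x if x > 0 else slope * x


def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x).
    return x * _sigmoid(beta * x)
```

Both initializers scale the weight variance by the layer's fan-in (and, for Xavier, its fan-out), which is what keeps activation and gradient magnitudes roughly constant from layer to layer.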

Normalization

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2015 | Batch Normalization | BatchNorm | Stabilizes and accelerates training through batch statistics. |
| 2016 | Weight Normalization | WeightNorm | Reparameterizes weights to decouple length and direction. |
| 2016 | Layer Normalization | LayerNorm | Batch-size independent normalization for recurrent and Transformer-style models. |
| 2018 | Group Normalization | GroupNorm | Normalization independent of batch size, useful in vision. |
| 2019 | Root Mean Square Layer Normalization | RMSNorm | Simplifies LayerNorm by normalizing root-mean-square scale. |
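
The methods in this table differ mainly in which values share a set of statistics. The sketch below makes that concrete in plain Python, under several simplifying assumptions: inputs are lists of floats, the learnable gain and bias are omitted, BatchNorm is shown in training mode only (no running averages), and GroupNorm is applied to a flat channel vector without spatial dimensions. The function names and the `eps` default are illustrative choices, not taken from the papers.

```python
import math


def _standardize(values, eps=1e-5):
    # Subtract the mean and divide by the standard deviation of `values`.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def batch_norm(batch, eps=1e-5):
    # batch: list of samples, each a list of features.
    # Statistics are computed per feature, across the batch.
    columns = [list(col) for col in zip(*batch)]
    normed = [_standardize(col, eps) for col in columns]
    return [list(row) for row in zip(*normed)]


def layer_norm(x, eps=1e-5):
    # Statistics are computed per sample, across all of its features.
    return _standardize(x, eps)


def group_norm(x, num_groups, eps=1e-5):
    # Statistics are computed per sample, within each group of channels.
    if len(x) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        out.extend(_standardize(x[g * size:(g + 1) * size], eps))
    return out


def rms_norm(x, eps=1e-5):
    # Like LayerNorm, but rescales by the root mean square without centering.
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


def weight_norm(v, g):
    # Reparameterize a weight vector as w = g * v / ||v||:
    # g carries the length, v carries the direction.
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm for vi in v]
```

Only `batch_norm` depends on the other samples in the batch, which is why the remaining methods behave the same at batch size 1 and suit small-batch or sequence settings.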

Reading Path

| Step | Read |
|------|------|
| 1 | Xavier initialization, ReLU, and He initialization. |
| 2 | BatchNorm, which changed how CNNs and other deep networks are trained. |
| 3 | LayerNorm and RMSNorm for sequence models and LLMs. |
| 4 | GroupNorm for small-batch and dense-prediction vision settings. |