Initialization, Activations, and Normalization

This database collects the papers that made deep networks train more reliably by controlling signal scale, gradient flow, and activation statistics.

Initialization and Activations

Year	Paper	Topic	Note
2010	Understanding the Difficulty of Training Deep Feedforward Neural Networks	Xavier initialization	Links activation statistics, gradients, and trainability.
2011	Deep Sparse Rectifier Neural Networks	ReLU	Helped popularize rectified activations for deep networks.
2015	Delving Deep into Rectifiers	He initialization / PReLU	Initialization designed for ReLU-like networks.
2016	Gaussian Error Linear Units	GELU	Smooth activation later used in many Transformer models.
2017	Searching for Activation Functions	Swish	Activation found by automated search; used in efficient networks such as EfficientNet.
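
The entries above can be made concrete with a minimal sketch in plain Python. It assumes the commonly used normal-distribution variants of the two initializers, parameterized by their standard deviation, and the exact (erf-based) form of GELU. The helper names (`xavier_std`, `he_std`, `init_weights`) are illustrative and not taken from any of the papers or from a library.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out),
    # a compromise between forward and backward signal scale.
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He: Var(W) = 2 / fan_in, which compensates for ReLU
    # zeroing roughly half of a zero-mean input.
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, rng):
    # Hypothetical helper: a fan_in x fan_out matrix of zero-mean Gaussians.
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def relu(x):
    return max(0.0, x)

def sigmoid(z):
    # Branch on the sign so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gelu(x):
    # Exact form: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta = 1 is the variant often called SiLU.
    return x * sigmoid(beta * x)

rng = random.Random(0)
w = init_weights(256, 128, he_std(256), rng)  # weights for a ReLU layer
```

For a square layer the two rules differ by a factor of two in variance: `he_std(n)` equals `xavier_std(n, n)` multiplied by the square root of 2.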

Normalization

Year	Paper	Topic	Note
2015	Batch Normalization	BatchNorm	Stabilizes and accelerates training through batch statistics.
2016	Weight Normalization	WeightNorm	Reparameterizes weights to decouple length and direction.
2016	Layer Normalization	LayerNorm	Batch-size-independent normalization for recurrent and Transformer-style models.
2018	Group Normalization	GroupNorm	Normalization independent of batch size, useful in vision.
2019	Root Mean Square Layer Normalization	RMSNorm	Simplifies LayerNorm by dropping mean-centering and rescaling only by the root mean square.
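
The methods in this table differ mainly in which values the statistics are computed over. The sketch below is a simplified illustration in plain Python, under several assumptions: inputs are flat lists of floats, the learnable gain and bias are omitted, BatchNorm uses training-mode batch statistics with no running averages, and GroupNorm ignores spatial dimensions. All function names are illustrative.

```python
import math

def _standardize(values, eps):
    # Subtract the mean and divide by the standard deviation (biased variance).
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(x, eps=1e-5):
    # Statistics over the features of ONE sample.
    return _standardize(x, eps)

def rms_norm(x, eps=1e-5):
    # No mean-centering: divide by the root mean square only.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def group_norm(x, num_groups, eps=1e-5):
    # Statistics over each contiguous group of channels of ONE sample.
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        out.extend(_standardize(x[g * size:(g + 1) * size], eps))
    return out

def batch_norm(batch, eps=1e-5):
    # Statistics over the BATCH, separately for each feature.
    columns = [_standardize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def weight_norm(v, g):
    # Reparameterize a weight vector as w = g * v / ||v||.
    norm = math.sqrt(sum(c * c for c in v))
    return [g * c / norm for c in v]

print(layer_norm([1.0, 2.0, 3.0]))
print(batch_norm([[1.0, 10.0], [3.0, 30.0]]))
```

Only `batch_norm` depends on the other samples in the batch. The other three normalize each sample independently, and `weight_norm` acts on parameters rather than activations.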

Reading Path

Step	Read
1	Xavier, ReLU, and He initialization.
2	BatchNorm, for the shift it brought to training CNNs and other deep networks.
3	LayerNorm and RMSNorm for sequence models and LLMs.
4	GroupNorm for small-batch and dense-prediction vision settings.