Generalization and Scaling

This database collects papers that explain why modern networks generalize despite overparameterization, and how performance changes with model size, data, and compute.

Generalization and Overparameterization

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2016 | Understanding Deep Learning Requires Rethinking Generalization | Generalization puzzle | Shows large networks can fit random labels. |
| 2018 | The Lottery Ticket Hypothesis | Sparse subnetworks | Dense networks contain trainable sparse winning tickets. |
| 2018 | Neural Tangent Kernel | Infinite-width theory | Connects wide neural networks to kernel dynamics. |
| 2019 | Deep Double Descent | Double descent | Test error can improve again beyond interpolation. |
| 2020 | What Neural Networks Memorize and Why | Memorization | Studies memorization patterns in deep networks. |
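
The note on fitting random labels can be made concrete with a small sketch. This is not the experiment from the 2016 paper, which trains deep networks on image datasets; it is a toy stand-in that assumes a fixed random ReLU feature layer with a least-squares readout. The function name `fit_random_labels` and all sizes are hypothetical choices for illustration. With far more features than samples, the model reproduces labels that carry no signal, while accuracy on fresh random labels stays near chance.

```python
import numpy as np


def fit_random_labels(n_samples=50, n_inputs=10, n_features=500, seed=0):
    """Fit purely random labels with an overparameterized random-feature model.

    Illustrative assumption: a fixed random ReLU layer plus a linear readout
    solved by least squares, standing in for a trained deep network.
    """
    rng = np.random.default_rng(seed)

    # Inputs and labels are drawn independently, so there is nothing to learn.
    x_train = rng.standard_normal((n_samples, n_inputs))
    y_train = rng.choice([-1.0, 1.0], size=n_samples)

    # Fixed random ReLU features; n_features is much larger than n_samples.
    w = rng.standard_normal((n_inputs, n_features))
    phi_train = np.maximum(x_train @ w, 0.0)

    # For an underdetermined system, lstsq returns the minimum-norm solution.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_acc = float(np.mean(np.sign(phi_train @ coef) == y_train))

    # Fresh inputs with fresh random labels: no structure carries over.
    x_test = rng.standard_normal((n_samples, n_inputs))
    y_test = rng.choice([-1.0, 1.0], size=n_samples)
    phi_test = np.maximum(x_test @ w, 0.0)
    test_acc = float(np.mean(np.sign(phi_test @ coef) == y_test))

    return train_acc, test_acc


if __name__ == "__main__":
    train_acc, test_acc = fit_random_labels()
    print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Perfect training accuracy here says nothing about generalization, which is the puzzle the papers in this section try to explain.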

Scaling Laws

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2020 | Scaling Laws for Neural Language Models (OpenAI) | Scaling laws | Loss follows power laws in model size, dataset size, and compute. |
| 2022 | Training Compute-Optimal Large Language Models | Compute-optimal training (Chinchilla) | Compute-optimal balance between parameters and tokens. |
| 2020 | Scaling Laws for Autoregressive Generative Modeling | Generative scaling | Scaling behavior beyond language-only models. |
| 2022 | Scaling Laws for Reward Model Overoptimization | Alignment scaling | Studies how optimizing against a proxy reward model eventually degrades true performance. |
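
The compute-optimal balance between parameters and tokens can be illustrated with a short calculation. The sketch below rests on two assumptions that are not stated in this database: the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6ND), and the rule of thumb of roughly 20 training tokens per parameter that is often drawn from the Chinchilla results. Both are approximations rather than exact fitted values, and the function name `compute_optimal_allocation` is hypothetical.

```python
import math


def compute_optimal_allocation(flops, tokens_per_param=20.0,
                               flops_per_param_token=6.0):
    """Split a training FLOP budget into a parameter count and a token count.

    Assumes C = k * N * D with k ~ 6 and D = r * N with r ~ 20; both are
    rules of thumb, not exact values from the papers.
    """
    # Substituting D = r * N into C = k * N * D gives N = sqrt(C / (k * r)).
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # A budget of 6 * 70e9 * 1.4e12 FLOPs corresponds to a 70B-parameter
    # model trained on 1.4T tokens under the C = 6ND approximation.
    budget = 6 * 70e9 * 1.4e12
    n_params, n_tokens = compute_optimal_allocation(budget)
    print(f"parameters: {n_params:.3e}, tokens: {n_tokens:.3e}")
```

Under these assumptions, parameters and tokens each grow with the square root of the compute budget, so quadrupling compute doubles both.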

Reading Path

| Step | Read |
| --- | --- |
| 1 | Rethinking Generalization and Lottery Ticket for overparameterized networks. |
| 2 | NTK and Deep Double Descent for theory and interpolation behavior. |
| 3 | Scaling Laws and Chinchilla for modern large-model training. |
| 4 | Reward model overoptimization for scaling issues in alignment. |