Generalization and Scaling
This database collects papers that explain why modern networks generalize despite overparameterization, and how performance changes with model size, data, and compute.
Generalization and Overparameterization
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2016 | Understanding Deep Learning Requires Rethinking Generalization | Generalization puzzle | Shows that standard networks can fit random labels perfectly, so model capacity alone cannot explain generalization. |
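| 2018 | The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks | Sparse subnetworks | Dense networks contain sparse subnetworks ("winning tickets") that can be trained in isolation to match the accuracy of the full network. |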
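| 2018 | Neural Tangent Kernel: Convergence and Generalization in Neural Networks | Infinite-width theory | In the infinite-width limit, gradient descent training of a network behaves like kernel gradient descent with a fixed kernel. |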
| 2019 | Deep Double Descent: Where Bigger Models and More Data Hurt | Double descent | Test error peaks near the interpolation threshold, then falls again as model size or training time increases. |
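| 2020 | What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation | Memorization | Estimates per-example memorization and influence, and links memorizing rare long-tail examples to better test accuracy. |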
Scaling Laws
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2020 | Scaling Laws for Neural Language Models (OpenAI) | Scaling laws | Test loss follows power laws in model size, dataset size, and training compute. |
| 2020 | Scaling Laws for Autoregressive Generative Modeling | Generative scaling | Extends power-law scaling beyond language to image, video, multimodal, and math domains. |
| 2022 | Training Compute-Optimal Large Language Models | Compute-optimal training (Chinchilla) | For a fixed compute budget, parameters and training tokens should be scaled in roughly equal proportion. |
| 2022 | Scaling Laws for Reward Model Overoptimization | Alignment scaling | Measures how optimizing a policy against a learned reward model eventually lowers the true reward. |
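
The compute-optimal trade-off in the Chinchilla row can be made concrete with a small calculation. The sketch below is illustrative only and rests on two common approximations that are assumptions here, not values taken from this database: training compute of roughly `C ≈ 6 * N * D` FLOPs for `N` parameters and `D` tokens, and a rule of thumb of about 20 training tokens per parameter. The function name `compute_optimal_allocation` and both constants are hypothetical.

```python
import math

# Assumption: training FLOPs are approximated as C ≈ 6 * N * D.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: rule-of-thumb ratio of about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens).

    Solves 6 * N * D = C together with D = tokens_per_param * N,
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Sweep a few budgets to see how both quantities grow with compute.
    for budget in (1e21, 1e22, 1e23, 5.88e23):
        n, d = compute_optimal_allocation(budget)
        print(f"C={budget:.2e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Under these assumptions, a budget of about 5.88e23 FLOPs maps to roughly 7e10 parameters and 1.4e12 tokens. Because both quantities scale as the square root of compute, a 100x larger budget suggests about 10x more parameters and 10x more tokens, rather than spending the increase mostly on model size.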
Reading Path
| Step | Read |
|---|---|
| 1 | Rethinking Generalization and Lottery Ticket for overparameterized networks. |
| 2 | NTK and Deep Double Descent for theory and interpolation behavior. |
| 3 | Scaling Laws for Neural Language Models and Chinchilla for modern large-model training. |
| 4 | Reward model overoptimization for scaling issues in alignment. |