Optimization and Regularization
This database collects practical papers that changed how deep networks are trained and regularized.
Optimization
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2012 | On the Difficulty of Training Recurrent Neural Networks | Gradient clipping | Diagnoses exploding and vanishing gradients in RNNs and proposes gradient norm clipping for the exploding case. |
| 2014 | Adam: A Method for Stochastic Optimization | Adam | Adapts per-parameter step sizes using running estimates of the gradient's first and second moments. |
| 2016 | SGDR: Stochastic Gradient Descent with Warm Restarts | LR schedules | Cosine annealing and warm restarts. |
| 2017 | Decoupled Weight Decay Regularization | AdamW | Separates weight decay from adaptive gradient updates. |
| 2020 | Sharpness-Aware Minimization for Efficiently Improving Generalization | SAM optimizer | Seeks parameters whose whole neighborhood has low loss, rather than a single sharp minimum. |
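
The update rules named in the table above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from any of the papers: the function names (`clip_by_global_norm`, `cosine_lr`, `adamw_step`, `sam_step`) are hypothetical, parameters are plain floats or lists of floats, the restart period is held fixed, and the default hyperparameters are only illustrative.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def cosine_lr(step, period, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate that restarts every `period` steps.

    Assumption: a fixed period; the SGDR paper also allows growing periods.
    """
    t = step % period  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))


def adamw_step(w, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One Adam update with decoupled weight decay for a scalar weight (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * g        # running first moment
    v = beta2 * v + (1.0 - beta2) * g * g    # running second moment
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled decay: applied to the weight directly, not added to the gradient.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v


def sam_step(w, grad_fn, lr, rho=0.05):
    """One SAM-style step with plain gradient descent as the base optimizer."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # assumption: skip the step when the gradient vanishes
    # Move to the approximate worst-case point inside an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    # Update the original weights with the gradient taken at the perturbed point.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

In a real training loop these pieces are usually combined: gradients are clipped, the learning rate comes from the schedule, and the optimizer step consumes both. Framework implementations differ in details such as where epsilon is added, so treat this as a reading aid rather than a reference.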
Regularization and Data Mixing
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2014 | Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Dropout | Randomly drops units during training to reduce co-adaptation. |
| 2015 | Rethinking the Inception Architecture for Computer Vision | Label smoothing | Includes label smoothing as a practical regularization method. |
| 2017 | mixup: Beyond Empirical Risk Minimization | Mixup | Trains on convex combinations of examples and labels. |
| 2019 | AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty | Data augmentation | Improves robustness through stochastic augmentation mixtures. |
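
Three of the techniques in the table above are small enough to sketch directly. The following is a hedged illustration in plain Python, with hypothetical function names (`dropout`, `smooth_labels`, `mixup`) and lists of floats standing in for tensors. Note that `dropout` here is the "inverted" variant common in current frameworks, which rescales surviving units at training time; the original paper instead scales weights at test time.

```python
import random


def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def smooth_labels(one_hot, epsilon):
    """Blend a one-hot target with the uniform distribution over the classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


def mixup(x1, y1, x2, y2, alpha, rng):
    """Convex combination of two examples and their labels, lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Usage with a seeded generator so runs are repeatable.
rng = random.Random(0)
targets = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], alpha=0.2, rng=rng)
```

Both label smoothing and mixup keep the target a valid probability distribution, which is why they drop into a standard cross-entropy loss without other changes.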
Reading Path
| Step | Read |
|---|---|
| 1 | Gradient clipping and Adam for optimization basics. |
| 2 | Dropout and label smoothing for regularization. See Architecture and Trainability for stochastic depth. |
| 3 | AdamW and cosine schedules for modern training recipes. |
| 4 | Mixup, AugMix, and SAM for robustness and generalization-oriented training. |