Optimization and Regularization

This database collects practical papers that changed how deep networks are trained and regularized.

Optimization

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2012 | On the Difficulty of Training Recurrent Neural Networks | Gradient clipping | Diagnoses exploding and vanishing gradients in RNNs and proposes gradient norm clipping for the exploding case. |
| 2014 | Adam: A Method for Stochastic Optimization | Adam | Adaptive optimizer based on running estimates of the first and second moments of the gradient. |
| 2016 | SGDR: Stochastic Gradient Descent with Warm Restarts | LR schedules | Cosine annealing of the learning rate with periodic warm restarts. |
| 2017 | Decoupled Weight Decay Regularization | AdamW | Separates weight decay from the adaptive gradient update. |
| 2020 | Sharpness-Aware Minimization for Efficiently Improving Generalization | SAM optimizer | Seeks parameters whose whole neighborhood has low loss, favoring flatter minima. |
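
The optimization entries above describe small, concrete update rules, so a short sketch may help when reading the papers. The following is a minimal illustration in plain Python, not any library's implementation. The function names (`clip_by_global_norm`, `cosine_lr`, `adamw_step`) and default values are assumptions made for this example, and the schedule assumes fixed-length cycles, whereas SGDR also allows cycle lengths to grow.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def cosine_lr(step, period, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate that restarts every `period` steps.

    Assumes fixed-length cycles; this is a simplification for illustration.
    """
    t = step % period  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))


def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One Adam update on a scalar weight with decoupled weight decay.

    The decay term is added next to the adaptive step instead of being
    folded into the gradient g. The step counter t starts at 1.
    """
    m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

In a training loop one would typically clip the gradients first, look up the learning rate for the current step, and then apply the optimizer update. Note that with decoupled decay the weight still shrinks when the gradient is zero, which is the behavior that distinguishes it from L2 regularization inside an adaptive optimizer.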

Regularization and Data Mixing

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2014 | Dropout: A Simple Way to Prevent Neural Networks from Overfitting | Dropout | Randomly drops units during training to reduce co-adaptation. |
| 2015 | Rethinking the Inception Architecture for Computer Vision | Label smoothing | Introduces label smoothing as a practical regularizer alongside the architecture changes. |
| 2017 | mixup: Beyond Empirical Risk Minimization | Mixup | Trains on convex combinations of pairs of examples and their labels. |
| 2019 | AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty | Data augmentation | Improves robustness by mixing stochastically sampled augmentation chains. |
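
Several of the regularizers listed above amount to a few lines of arithmetic on inputs or targets. The sketch below illustrates three of them on plain Python lists. The function names and defaults are assumptions for this example, and the dropout shown is the commonly used "inverted" variant that rescales at training time, which differs from the test-time weight scaling described in the original paper.

```python
import random


def dropout(values, p=0.5, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def smooth_labels(one_hot, eps=0.1):
    """Blend a one-hot target with the uniform distribution over classes."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Return a convex combination of two examples and of their label vectors.

    The mixing weight lam is drawn from Beta(alpha, alpha).
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Passing a seeded `random.Random` instance as `rng` makes the stochastic parts reproducible. Both `smooth_labels` and `mixup` keep the target a valid probability vector, so they can be combined with a standard cross-entropy loss that accepts soft targets.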

Reading Path

| Step | Read |
| --- | --- |
| 1 | Gradient clipping and Adam for optimization basics. |
| 2 | Dropout and label smoothing for regularization. See Architecture and Trainability for stochastic depth. |
| 3 | AdamW and cosine schedules for modern training recipes. |
| 4 | Mixup, AugMix, and SAM for robustness and generalization-oriented training. |