Foundational Papers
These are the backbone papers behind modern deep learning. This page is the map: each focused database keeps the paper list compact, link-first, and easy to extend.
Focused Databases
| Database | Scope |
|---|---|
| Learning and Backpropagation | Backpropagation, early CNNs, LSTMs, deep belief nets, and early deep representation learning. |
| Architecture and Trainability | Highway networks, ResNets, skip connections, pre-activation residual units, dense connections, Transformers, and neural ODEs. |
| Initialization, Activations, and Normalization | Xavier, ReLU, He initialization, BatchNorm, LayerNorm, GroupNorm, RMSNorm, and modern activations. |
| Optimization and Regularization | Adam, AdamW, dropout, gradient clipping, label smoothing, mixup, learning-rate schedules, and sharpness-aware training. |
| Generalization and Scaling | Generalization puzzles, memorization, lottery tickets, NTK, double descent, scaling laws, and compute-optimal training. |
Milestone Map
| Stage | Key Papers | Primary Database |
|---|---|---|
| Learning foundations | Backpropagation, early CNNs, LSTM, deep belief nets, deep autoencoders | Learning and Backpropagation |
| Trainable depth | Highway Networks, ResNet, Identity Mappings, Stochastic Depth, DenseNet, Transformer, Neural ODEs | Architecture and Trainability |
| Stable signal flow | Xavier, ReLU, He initialization, GELU, BatchNorm, LayerNorm, GroupNorm, RMSNorm | Initialization, Activations, and Normalization |
| Training recipes | Gradient clipping, Adam, Dropout, label smoothing, AdamW, mixup, AugMix, SAM optimizer | Optimization and Regularization |
| Generalization and scale | Rethinking Generalization, Lottery Ticket, NTK, Double Descent, Scaling Laws, Chinchilla | Generalization and Scaling |
Suggested Paths
| Path | Read |
|---|---|
| Training mechanics | Backpropagation → Xavier → ReLU → He initialization → BatchNorm → ResNet → LayerNorm. |
| Deep architectures | Highway Networks → ResNet → Identity Mappings → DenseNet → Transformer → Neural ODEs. |
| Practical optimization | Gradient clipping → Dropout → Adam → label smoothing → AdamW → mixup → SAM. |
| Modern scaling | Rethinking Generalization → Lottery Ticket → NTK → Double Descent → Scaling Laws → Chinchilla. |