Optimizer theory and learning-rate scheduling for deep neural networks, from sparse gradients to AdamW and cosine annealing
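
As a concrete illustration of the pairing named in the title, here is a minimal sketch of AdamW combined with cosine annealing, assuming PyTorch; the toy model, synthetic data, and hyperparameters (lr=1e-3, weight_decay=1e-2, T_max=100) are illustrative placeholders, not taken from the source.

```python
import torch
from torch import nn

# Toy model and loss; names and sizes are illustrative assumptions.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# AdamW: Adam with decoupled weight decay (Loshchilov & Hutter, 2019).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Cosine annealing: decay the learning rate from lr down to eta_min over T_max steps.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for step in range(100):
    x = torch.randn(32, 10)   # synthetic input batch
    y = torch.randn(32, 1)    # synthetic targets
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()          # advance the cosine schedule once per optimizer step
```

Decoupling the weight decay from the gradient-based update (rather than folding it into the loss, as plain Adam with L2 regularization does) is the defining design choice of AdamW, and the scheduler multiplicatively reshapes the base learning rate without touching the optimizer's moment estimates.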