Begin with Simplicity

Empirical Guidelines

  1. Initial Phase: Start with the simplest and most established policies, such as Step Decay or Exponential Decay. These are often sufficient to achieve good results.
  2. Fallback Strategy: If the loss stagnates or training fails to converge with these policies, switch to a more dynamic scheduler. Cosine Annealing is the most common and effective next choice. (A minimal setup for all three is sketched below.)
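
As a minimal sketch of these baseline policies, using a placeholder model and illustrative hyperparameters (all three classes live in torch.optim.lr_scheduler):

import torch

model = torch.nn.Linear(10, 2)  # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step Decay: multiply the LR by gamma every step_size epochs
step_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Exponential Decay: multiply the LR by gamma every epoch
exp_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Cosine Annealing: decay the LR along a cosine curve over T_max epochs
cos_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

# In practice, attach only one of these to a given optimizer and call
# scheduler.step() once per epoch after optimizer.step().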

Advanced Strategy: Warm-up + Cosine Annealing

For rapid and effective convergence (often in fewer than 100 epochs), one of the most powerful pipelines is the combination of Linear Warm-up and Cosine Annealing with Restarts.

Strategy Analysis: Why Does It Work?

The effectiveness of this combination stems from how it addresses two critical phases of training: initial stabilization and exploration of the loss landscape.

1. Linear Warm-up: Stabilizing the Start

In the first epochs, the learning rate does not start at its base value: it starts from a much lower value and grows linearly until it reaches the base value. The purpose is to avoid issues caused by the high initial variance of adaptive optimizers.

Why avoid a high LR immediately with Adaptive Optimizers

Optimizers like Adam compute adaptive estimates of the gradient moments which, at the beginning of training, have very high variance. A high learning rate in this phase can produce excessively large updates, risking “shooting” the model’s weights into a sub-optimal region or into a poor local minimum from which it is difficult to escape.

Warm-up solves this problem: it dampens the initial updates, giving the moment estimates time to stabilize before the learning rate reaches its full value.
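
A minimal sketch of the warm-up phase in isolation, assuming a placeholder model and a base LR of 1e-3 (start_factor and total_iters are illustrative values):

import torch

model = torch.nn.Linear(10, 2)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# LR starts at 1e-3 * 1e-3 = 1e-6 and grows linearly to the base value 1e-3 over 5 epochs
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, end_factor=1.0, total_iters=5)

for epoch in range(5):
    # ... one epoch of training would go here ...
    warmup.step()
    print(epoch, optimizer.param_groups[0]["lr"])  # LR rises toward 1e-3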

2. Cosine Annealing with Restarts: Maximizing Exploration

Once the warm-up is finished, the learning rate enters a cyclic phase in which:

  1. It decays smoothly from the maximum learning rate η_max to the minimum η_min, following the cosine curve shown below.
  2. It is abruptly reset to η_max to begin a new cycle.
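
For reference, within each cycle the schedule follows the cosine curve from SGDR (Loshchilov & Hutter, 2017), where T_cur is the number of epochs since the last restart and T_i is the length of the current cycle:

  η_t = η_min + ½ · (η_max − η_min) · (1 + cos(π · T_cur / T_i))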

This approach is powerful because each cycle provides a new exploration → exploitation phase, drastically increasing the probability of finding better local minima compared to a single monotonic descent.
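
To observe the restart behavior in isolation, a minimal sketch with a placeholder model and illustrative cycle parameters:

import torch

model = torch.nn.Linear(10, 2)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# First cycle lasts T_0 = 10 epochs; each new cycle is T_mult = 2 times longer (10, 20, 40, ...)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-5)

for epoch in range(30):
    # ... one epoch of training would go here ...
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])  # LR jumps back to its maximum at each restart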

3. The Winning Combination

In summary, the two components work in perfect synergy to optimize the entire training process:

  • The Warm-up stabilizes optimization during the most delicate initial phases.
  • The Cosine Annealing cycles robustly explore multiple regions of the loss landscape.

Empirically, this pipeline is one of the most effective strategies for identifying a set of promising minima in a reduced number of epochs, as it balances initial stability with broad exploration of the loss landscape.


PyTorch Implementation

In PyTorch, combining multiple schedulers in sequence is easily achieved with SequentialLR.

import torch
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 2)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 1. Warm-up scheduler (first 5 epochs): LR grows linearly from 1e-6 to 1e-3
warmup_scheduler = LinearLR(optimizer, start_factor=1e-3, end_factor=1.0, total_iters=5)

# 2. Main scheduler for the rest of training: cosine cycles of 10, 20, 40, ... epochs
main_scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-5)

# 3. Combine the schedulers in sequence:
# SequentialLR uses warmup_scheduler for the first 5 epochs, then switches to main_scheduler
scheduler = SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, main_scheduler],
    milestones=[5]
)
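
A usage sketch, assuming one scheduler.step() call per epoch (so total_iters, T_0, and milestones above are measured in epochs); num_epochs, dataloader, and loss_fn are hypothetical placeholders:

for epoch in range(num_epochs):          # num_epochs: hypothetical epoch count
    for inputs, targets in dataloader:   # hypothetical DataLoader yielding (inputs, targets)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)  # hypothetical loss function
        loss.backward()
        optimizer.step()
    scheduler.step()  # advance the combined warm-up → cosine schedule once per epoch

If you prefer to step the scheduler per batch instead, total_iters, T_0, and milestones must be expressed in iterations rather than epochs.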