1. Intro

Definition

Cosine Annealing is a learning-rate schedule in which the learning rate is decreased according to a half-cosine curve, from a maximum $\eta_{\max}$ to a minimum $\eta_{\min}$. Unlike step decay, which introduces abrupt changes at predetermined milestones, cosine annealing reduces the learning rate smoothly and continuously.

Intuition behind the name

The terminology is inspired by annealing in metallurgy, where a material is cooled gradually so that it can settle into a lower-energy and more stable configuration. The optimization analogy:

  • training begins with a relatively large learning rate, favouring exploration;
  • the learning rate is then reduced progressively;
  • the late phase of training emphasizes stabilization and fine-grained convergence.

Primary source

Loshchilov, Ilya, and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR, 2017. The paper introduced cosine annealing together with the warm-restarts mechanism described in Section 5 of this note; modern usage often adopts only the cosine-decay part.

2. The cosine schedule

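Within a single cycle, the learning rate decays from $\eta_{\max}$ to $\eta_{\min}$ following the half-cosine

$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\left(\frac{\pi t}{T}\right)\right).$$

The four quantities mean:

  • $\eta_{\max}$, the starting learning rate at the beginning of the cycle;
  • $\eta_{\min}$, the final learning rate at the end of the cycle;
  • $t$, the current position inside the cycle, measured in scheduler steps;
  • $T$, the total length of the cycle, measured in scheduler steps.

The endpoints follow immediately: at $t = 0$, $\cos(0) = 1$ and $\eta_0 = \eta_{\max}$; at $t = T$, $\cos(\pi) = -1$ and $\eta_T = \eta_{\min}$.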
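The endpoint behaviour can be checked numerically. The following is a minimal pure-Python sketch, assuming the standard SGDR half-cosine $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T))$; the function name cosine_lr and the example values (0.1, 0.01, 100) are illustrative choices, not part of any library.

```python
import math


def cosine_lr(t, T, eta_max, eta_min):
    """Half-cosine learning rate at scheduler step t of a cycle of length T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


# Illustrative values (assumptions): eta_max = 0.1, eta_min = 0.01, T = 100.
start = cosine_lr(0, 100, 0.1, 0.01)     # cos(0) = 1     -> eta_max
middle = cosine_lr(50, 100, 0.1, 0.01)   # cos(pi/2) = 0  -> midpoint of the range
end = cosine_lr(100, 100, 0.1, 0.01)     # cos(pi) = -1   -> eta_min
```

With these values the schedule starts at 0.1, passes through the midpoint 0.055 at half the cycle, and ends at 0.01.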

Scheduler steps, not epochs

Throughout this note, $t$ and $T$ are scheduler time units, not epochs. Each call to scheduler.step() advances $t$ by one. Whether that unit is an epoch or a mini-batch iteration is determined entirely by how often step() is called, which is a choice of the training loop, not of the scheduler. This single detail is a frequent source of cosine-annealing implementation bugs; the practical implications are collected in Section 6.

2.1 Why the cosine shape: slow–fast–slow decay

The decay rate is not constant: it starts slowly, accelerates in the middle, and slows down again near the end.
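Treating $t$ as continuous and differentiating the half-cosine (assuming the standard form $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi t / T))$) makes this profile explicit:

$$\frac{d\eta_t}{dt} = -\frac{\pi}{2T}\,(\eta_{\max} - \eta_{\min})\,\sin\!\left(\frac{\pi t}{T}\right).$$

The sine factor vanishes at $t = 0$ and $t = T$ and peaks at $t = T/2$, so the curve is flat at both ends and steepest at mid-cycle, where it falls $\pi/2 \approx 1.57$ times faster than a linear decay between the same endpoints.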

The slow–fast–slow profile is one of the practical features that distinguishes cosine annealing from alternatives:

  • the long high-LR plateau at the start gives the optimizer a substantial exploration window before the learning rate falls noticeably;
  • the rapid mid-cycle drop transitions the optimizer firmly from exploration to exploitation;
  • the slow tail near $\eta_{\min}$ lets the late-stage optimization settle into the chosen basin without the abrupt jolts that step decay produces.

3. Why smooth decay is useful

Two complementary perspectives explain the empirical success of cosine annealing.

3.1 Exploration to exploitation, without discontinuities

A learning-rate schedule controls how aggressively the optimizer moves through parameter space. A high learning rate produces large steps that can leave a current basin and reach a new one; a low learning rate produces small steps that refine the position within whatever basin the optimizer currently occupies. The same trade-off, in classical reinforcement-learning language, is exploration vs exploitation.

Compared with step decay, which collapses both regimes into an abrupt transition at a chosen milestone, cosine annealing creates a continuous interpolation between them.

| Policy | Behaviour | Practical implication |
| --- | --- | --- |
| Step decay | Learning rate drops abruptly at predetermined milestones. | Optimization dynamics change suddenly; the timing of milestones requires manual tuning and can be wrong. |
| Cosine annealing | Learning rate decreases smoothly over the entire cycle. | Transition from exploration to exploitation is gradual; no milestone tuning is needed; the optimizer is rarely caught off-guard by a sudden regime change. |
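The contrast can be made concrete by comparing the largest change between two consecutive scheduler steps under each policy. This is a rough sketch in plain Python; the milestones (50, 75), the decay factor 0.1, and the learning-rate values are illustrative assumptions rather than recommended settings.

```python
import math


def step_decay_lr(t, eta_max, milestones, gamma):
    """Step decay: multiply the rate by gamma at each milestone already passed."""
    passed = sum(1 for m in milestones if t >= m)
    return eta_max * gamma ** passed


def cosine_lr(t, T, eta_max, eta_min):
    """Half-cosine decay from eta_max (t = 0) to eta_min (t = T)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


T = 100
step_schedule = [step_decay_lr(t, 0.1, milestones=(50, 75), gamma=0.1) for t in range(T + 1)]
cosine_schedule = [cosine_lr(t, T, 0.1, 0.001) for t in range(T + 1)]

# Largest change between two consecutive scheduler steps in each schedule.
max_step_jump = max(abs(a - b) for a, b in zip(step_schedule, step_schedule[1:]))
max_cosine_jump = max(abs(a - b) for a, b in zip(cosine_schedule, cosine_schedule[1:]))
```

With these settings the step schedule changes by 0.09 in a single step at the first milestone, while the cosine schedule never moves by more than 0.002 per step.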

Cosine is not universally better

The smoothness of cosine annealing is a different trade-off, not a universal improvement. Step decay imposes explicit, controlled regime changes that are sometimes exactly what is desired (e.g., when the optimal milestones are known from prior experiments). Cosine annealing trades that control for a continuous, milestone-free transition.

3.2 Stochastic-noise viewpoint

In mini-batch SGD, the learning rate also controls the scale of stochastic noise injected into the parameter trajectory: the per-step update is the mini-batch gradient (a noisy estimate of the full-batch gradient) multiplied by the learning rate. A larger $\eta$ amplifies the noise; a smaller $\eta$ damps it.
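Written out, assuming plain SGD without momentum and splitting the mini-batch gradient $\hat g_t$ into the full-batch gradient plus a zero-mean noise term $\xi_t$ (notation introduced here for illustration only), one update reads

$$\theta_{t+1} = \theta_t - \eta\,\hat g_t = \theta_t - \eta\,\nabla L(\theta_t) - \eta\,\xi_t, \qquad \hat g_t = \nabla L(\theta_t) + \xi_t .$$

The noise enters the trajectory only through the product $\eta\,\xi_t$, so its per-step standard deviation scales linearly with $\eta$.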

Read in this light, cosine annealing modulates not just step size but effective optimization temperature:

  • the early high-LR plateau corresponds to a high-temperature exploration phase, in which mini-batch noise lets the optimizer cross between basins;
  • the late low-LR tail corresponds to a low-temperature exploitation phase, in which the optimizer settles into whichever basin it last entered.

Implicit regularization via flat minima

Optimization theory and empirical work (Hochreiter and Schmidhuber, 1997; Keskar et al., 2017; Smith and Le, 2018) suggest that low-temperature dynamics preferentially converge to flat minima: regions of the loss landscape where the loss varies slowly with the parameters. Flat minima correlate empirically with better generalization, and the same intuition is developed further in "Should small weights be preferred?".

Cosine annealing does not guarantee convergence to a flat minimum; the connection is heuristic. But the schedule’s slow tail at small $\eta$ is structurally compatible with this preference, which is part of why cosine annealing tends to produce competitive validation accuracies even at a fixed compute budget.

Scope of the noise interpretation

The stochastic-noise reading above is cleanest for SGD-like optimizers, where mini-batch noise is the dominant source of randomness. For Adam and AdamW, the effective dynamics also depend on the optimizer’s adaptive denominator, and the “learning rate as temperature” identification is approximate at best. Cosine annealing is still widely used with adaptive optimizers, and works well empirically, but the interpretive intuition is less direct.

4. From one cycle to many: the limitation of a single decay

A single cosine cycle performs one exploration-to-exploitation trajectory. It ends in whatever basin the optimizer happened to enter near the end of the cycle, and the choice of that basin is essentially random.

One trajectory, one outcome

The final point of a single cosine cycle may correspond to a mediocre local basin, a plateau, or a region with poor generalization properties, depending on the random seed, the data ordering, and the precise dynamics of the early high-LR phase. A single cycle leaves no opportunity to revisit other parts of the landscape.

This motivates the restart-based extension introduced in the SGDR paper.

5. Cosine annealing with warm restarts

The restart-based variant repeats the cosine cycle multiple times. After each cycle ends at $\eta_{\min}$, the learning rate is reset to $\eta_{\max}$ and a new cycle begins.

Why "warm" restarts

The word warm indicates that the optimizer is not reinitialized. At each restart:

  • the model parameters are kept;
  • the optimizer state (momentum, Adam moments, etc.) is kept;
  • only the learning-rate schedule is reset to $\eta_{\max}$.

Training resumes from a partially trained state with renewed exploratory step sizes. The contrast is with a “cold” restart, which would re-initialize the model from scratch.

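Inside any given cycle, the schedule is the same half-cosine as Section 2, with the cycle length $T$ renamed $T_i$ (the length of the $i$-th cycle) and $t$ now measured from the most recent restart:

$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\left(\frac{\pi t}{T_i}\right)\right).$$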

5.1 Variable-length cycles

A common refinement: make the cycle length grow after each restart, controlled by a multiplier $T_{\text{mult}}$. With initial length $T_0$ and multiplier $T_{\text{mult}}$, the cycle lengths are

$$T_i = T_0 \cdot T_{\text{mult}}^{\,i}, \qquad i = 0, 1, 2, \dots$$

For $T_{\text{mult}} = 1$ all cycles are equally long; for $T_{\text{mult}} = 2$ the lengths double each restart. The intuition: early cycles are short to encourage repeated exploration, later cycles are long to let promising regions be exploited thoroughly.
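A small sketch of the restart logic in plain Python, assuming cycle $i$ has length $T_0 \cdot T_{\text{mult}}^{\,i}$ and that the restart happens at the step where the in-cycle position would reach the cycle length. Under that convention (an assumption of this sketch, not a statement about any particular library) the rate jumps back to $\eta_{\max}$ without ever sitting exactly at $\eta_{\min}$. The function name warm_restart_lr is hypothetical.

```python
import math


def warm_restart_lr(step, T_0, T_mult, eta_max, eta_min):
    """Learning rate at a global scheduler step under cosine warm restarts.

    Cycle i has length T_0 * T_mult**i; the position inside the current
    cycle is measured from the most recent restart.
    """
    cycle_len = T_0
    t = step
    # Walk forward through completed cycles until `t` falls inside one.
    while t >= cycle_len:
        t -= cycle_len
        cycle_len *= T_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / cycle_len))


# Cycle lengths for T_0 = 20, T_mult = 2: 20, 40, 80 (restarts at steps 20 and 60).
cycle_lengths = [20 * 2 ** i for i in range(3)]
lr_before_restart = warm_restart_lr(19, 20, 2, 0.1, 0.01)  # last step of cycle 0
lr_at_restart = warm_restart_lr(20, 20, 2, 0.1, 0.01)      # first step of cycle 1
```

Here lr_before_restart is just above the floor of 0.01 and lr_at_restart is back at 0.1, which is the sawtooth shape the restart variant is meant to produce.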

5.2 Snapshot ensembles: free model selection

A side effect of restart-based training is that each cycle naturally produces a candidate model at its low-LR endpoint.

Snapshot ensembles

A practical strategy enabled by warm restarts:

  1. Save the model at the end of each cycle (the low-LR exploitation point).
  2. Evaluate each checkpoint on a validation set.
  3. Either pick the best checkpoint, or average their predictions to form an ensemble.

Different cycles often converge to different regions of the loss landscape, so the resulting ensemble is typically more diverse than one built from checkpoints of a single monotone-decay run, while costing no more than one training run. This is the snapshot ensembles technique (Huang et al., 2017).
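The selection and averaging steps can be sketched without any framework. The probabilities below are made up to show the mechanics only and say nothing about real accuracy gains; average_predictions and accuracy are hypothetical helper names.

```python
def average_predictions(snapshot_probs):
    """Average class probabilities over snapshots.

    snapshot_probs[k][n][c] is the probability that snapshot k assigns
    to class c for validation example n.
    """
    num_snapshots = len(snapshot_probs)
    num_examples = len(snapshot_probs[0])
    num_classes = len(snapshot_probs[0][0])
    return [
        [sum(snap[n][c] for snap in snapshot_probs) / num_snapshots for c in range(num_classes)]
        for n in range(num_examples)
    ]


def accuracy(probs, labels):
    """Fraction of examples whose arg-max class equals the label."""
    predictions = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)


# Made-up probabilities from three snapshots on three validation examples.
snapshots = [
    [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]],
    [[0.6, 0.4], [0.7, 0.3], [0.3, 0.7]],
    [[0.4, 0.6], [0.3, 0.7], [0.2, 0.8]],
]
val_labels = [0, 1, 1]

# Option A: pick the single best snapshot on the validation set.
per_snapshot_acc = [accuracy(s, val_labels) for s in snapshots]
best_snapshot = max(range(len(snapshots)), key=per_snapshot_acc.__getitem__)

# Option B: average the snapshots' predictions and evaluate the ensemble.
ensemble_acc = accuracy(average_predictions(snapshots), val_labels)
```

In this toy data each snapshot gets a different example wrong, so every individual snapshot scores 2/3 while the averaged prediction is correct on all three examples.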

Validation set, not test set

Model selection across cycles must be performed on a validation set, never on the test set. The test set is reserved for the final evaluation of the chosen model.

5.3 The cost of restarts

Compute budget

Restart-based cosine annealing can be expensive. A useful single cycle already takes many epochs to produce a refined solution; running several full cycles multiplies that cost. In practice, the technique requires careful budget allocation, checkpointing, and often early stopping based on validation performance.

6. PyTorch: schedulers and stepping conventions

PyTorch exposes the two variants as two different classes in torch.optim.lr_scheduler:

  • CosineAnnealingLR: single cosine decay, no restarts;
  • CosineAnnealingWarmRestarts: SGDR-style cyclic decay with warm restarts.

In both, $\eta_{\max}$ is the optimizer’s initial lr, and the closed-form schedule is the one in Section 2 (or Section 5 for the warm-restart variant). Two practical conventions matter and are the source of most implementation bugs.

The two PyTorch conventions

1. Call order. scheduler.step() must be called after optimizer.step(), not before. Calling it before causes the first update to use the learning rate of the next scheduler step, $\eta_1$, rather than the intended $\eta_0 = \eta_{\max}$.

2. Stepping granularity. The scheduler does not know whether an “epoch” or an “iteration” is one unit of time. The unit is defined entirely by how often step() is called. If step() is called once per epoch, then T_max, T_0, $T_i$, and $t$ are all in epochs. If step() is called once per batch, they are in iterations.

The cosine schedule is mathematically correct in either regime; what matters is that the time scale of step() and the time scale of the scheduler parameters are consistent.

Common implementation mistakes

The most frequent mistakes, all variations on the conventions above:

  • calling scheduler.step() before optimizer.step();
  • stepping the scheduler once per batch while choosing T_max or T_0 as if they counted epochs (resulting in a schedule that finishes within the first epoch);
  • expecting CosineAnnealingLR to perform restarts (it does not; for restarts, use CosineAnnealingWarmRestarts);
  • selecting checkpoints from a restart-based run on the test set rather than on a held-out validation set.
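The unit-mismatch mistake is easy to see with a little arithmetic. The sketch below uses plain Python and the closed form of the half-cosine rather than the PyTorch classes; the run length (100 epochs) and the number of batches per epoch (500) are hypothetical values chosen for illustration.

```python
import math


def cosine_lr(t, T, eta_max, eta_min):
    """Half-cosine decay from eta_max (t = 0) to eta_min (t = T)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


epochs = 100            # hypothetical run length
iters_per_epoch = 500   # hypothetical number of mini-batches per epoch

# Scheduler stepped once per batch: the horizon must be counted in iterations.
t_max_correct = epochs * iters_per_epoch

# Mistake: stepping once per batch but leaving the horizon at the epoch count.
t_max_wrong = epochs

# Number of epochs after which each schedule has reached eta_min.
epochs_until_floor_correct = t_max_correct / iters_per_epoch
epochs_until_floor_wrong = t_max_wrong / iters_per_epoch

# Learning rate after the first 100 batches under each setting.
lr_correct = cosine_lr(100, t_max_correct, 0.1, 0.01)
lr_wrong = cosine_lr(100, t_max_wrong, 0.1, 0.01)
```

With the mis-set horizon the schedule is already at its floor of 0.01 after one fifth of the first epoch, whereas the correctly scaled schedule is still essentially at 0.1 at the same point.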

6.1 CosineAnnealingLR: single decay

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR
 
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)   # eta_max = 0.1
 
# step() will be called once per epoch, so T_max counts epochs
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.01)
 
for epoch in range(100):
    # ... training loop over mini-batches ...
    optimizer.step()      # update parameters
    scheduler.step()      # advance the schedule

If step() is called once per mini-batch instead, T_max must be expressed in iterations. For example, $E$ epochs of $N$ iterations each gives T_max = $E \cdot N$.

6.2 CosineAnnealingWarmRestarts: cyclic decay

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
 
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)   # eta_max = 0.1
 
scheduler = CosineAnnealingWarmRestarts(
    optimizer,
    T_0=20,          # length of the first cycle (in scheduler steps)
    T_mult=2,        # multiplier for subsequent cycle lengths: 20, 40, 80, ...
    eta_min=0.01,
)
 
for epoch in range(140):
    # ... training loop over mini-batches ...
    optimizer.step()
    scheduler.step()

The key parameters:

  • T_0 (int): length of the first cycle, in scheduler steps;
  • T_mult (int, $\geq 1$): factor by which the cycle length grows after each restart;
  • eta_min (float): the floor of the cosine.

CosineAnnealingWarmRestarts additionally supports fractional steps via scheduler.step(epoch + i / iters), which lets a smooth cosine be traced at sub-epoch resolution even when step() is conceptually tied to epoch boundaries. This is sometimes the cleanest pattern for training loops that want the cosine to look smooth in iteration space without giving up the epoch-level bookkeeping.

7. A modern wrinkle: linear warmup before cosine decay

Most modern training pipelines, especially for Transformer-scale models, do not start the learning rate at its maximum value. They begin with a linear warmup that ramps the learning rate from $0$ (or a very small value) up to $\eta_{\max}$ over the first few hundred or few thousand iterations, then switch to cosine decay for the remainder of training.

Why warmup matters with Adam-family optimizers

Adam and AdamW compute adaptive denominators from running gradient statistics. Early in training these statistics are noisy and biased, and starting at full $\eta_{\max}$ can produce unstable updates. The linear warmup gives the optimizer a few hundred steps to accumulate reliable moment estimates before the learning rate reaches its target value. The construction is now standard in the published training recipes for BERT, GPT, ViT, and most other large-scale architectures.

PyTorch supports this via torch.optim.lr_scheduler.SequentialLR, which chains a LinearLR warmup phase with a CosineAnnealingLR decay phase. The choice of where to split (typically only a small fraction of total iterations is devoted to warmup) is a hyperparameter, but the recipe itself is now a default rather than an experimental choice.
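The shape of the combined schedule can be written as a single function. This is a standalone sketch in plain Python, not the SequentialLR implementation, and it is not meant to reproduce that class's exact step-by-step values; the function name, the 10,000-step budget, the 500 warmup steps, and the peak rate 3e-4 are all illustrative assumptions.

```python
import math


def warmup_cosine_lr(step, warmup_steps, total_steps, eta_max, eta_min=0.0):
    """Linear warmup from 0 to eta_max, then half-cosine decay to eta_min."""
    if step < warmup_steps:
        # Linear ramp: 0 at step 0, reaching eta_max at step == warmup_steps.
        return eta_max * step / warmup_steps
    # Position inside the decay phase and its length, in scheduler steps.
    t = step - warmup_steps
    T = total_steps - warmup_steps
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


# Hypothetical budget: 10,000 iterations, of which 500 are warmup.
lr_start = warmup_cosine_lr(0, 500, 10_000, 3e-4)
lr_mid_warmup = warmup_cosine_lr(250, 500, 10_000, 3e-4)
lr_peak = warmup_cosine_lr(500, 500, 10_000, 3e-4)
lr_end = warmup_cosine_lr(10_000, 500, 10_000, 3e-4)
```

The rate starts at zero, reaches half the peak midway through warmup, hits the peak exactly when warmup ends, and decays to eta_min at the final step. Because both phases are expressed in the same scheduler steps, the function has to be evaluated once per iteration for these numbers to mean what they say.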

8. When to use what

When cosine annealing is a natural choice

  • A smooth monotone decay is preferred over abrupt milestones.
  • The training run is long enough for the slow tail near $\eta_{\min}$ to matter.
  • Manual milestone design is undesirable (no prior knowledge of the right epoch numbers).
  • Stable late-stage exploitation is important for final accuracy.

When warm restarts are worth considering

  • Compute budget is large enough to support multiple cycles.
  • Snapshot ensembles or repeated exploratory phases are desired.
  • The task is known empirically to benefit from cyclic schedules.

When other schedules may be preferable

  • The training horizon is short (a single cosine cycle wastes the slow tail on too few iterations).
  • The optimal milestone epochs are already known (step decay gives more direct control).
  • The optimizer is AdamW and the recommended recipe for the architecture explicitly uses a different schedule (e.g., one-cycle, inverse-square-root, or constant after warmup).

For a broader treatment of how to pick a learning-rate schedule among the available options, see Choosing an LR scheduler.

9. Summary

Cosine annealing is a smooth half-cosine learning-rate schedule that decays from $\eta_{\max}$ to $\eta_{\min}$ over a configurable number of scheduler steps. Its key appeals are the absence of milestone tuning, the slow tail near $\eta_{\min}$ that supports stable late-stage exploitation, and the natural “exploration to exploitation” interpretation. The restart-based extension (SGDR) repeats the cosine cycle multiple times, enabling snapshot ensembles and broader landscape exploration at the cost of increased compute.

The practical points that matter the most:

  • the closed-form schedule of Section 2 applies to both CosineAnnealingLR and CosineAnnealingWarmRestarts;
  • $t$, $T$, and $T_i$ count scheduler steps, not epochs intrinsically; the unit is set by how often scheduler.step() is called;
  • modern Transformer training combines cosine annealing with a linear warmup at the start;
  • snapshot-ensemble selection across restart cycles must be done on a validation set, never on the test set.

Final takeaway

Cosine annealing is widely used because it is simple, smooth, and empirically effective. Its real-world performance depends less on the formula and more on choosing the cycle length, the warmup duration, and the stepping granularity consistently with the actual training pipeline.