Cosine annealing

Intro

Definition

Cosine Annealing is a learning rate scheduling policy that adjusts the learning rate $η$ during training according to a cosine-shaped curve. Instead of decreasing it in discrete steps (as in Step Decay), it reduces the rate smoothly and continuously.

Info

This LR policy draws inspiration from the metallurgical process of annealing, in which a material is slowly cooled to reach a more stable, minimum-energy state. Similarly, in optimization, training begins with a high learning rate to encourage exploration, which is then gradually “cooled” to allow fine-grained convergence toward a favorable minimum (the exploitation phase).

Mechanism: Cosine Decay

The learning rate $η_{t}$ at each epoch $t$ follows a cosine-shaped curve (a half-period cosine) that decays from a maximum value $η_{max}$ down to a minimum value $η_{min}$ over a span of $T_{max}$ epochs.

This produces a nonlinear decay profile: it decreases slowly at the beginning, accelerates in the middle phase, and then slows down again as it approaches the minimum value.

The formula is:

$$η_{t} = η_{min} + \frac{1}{2} (η_{max} - η_{min}) \left(1 + \cos\left(\frac{T_{cur}}{T_{max}} π\right)\right)$$

The key parameters are:

$η_{max}$: the initial learning rate (highest value).
$η_{min}$: the minimum learning rate (final target value).
$T_{max}$: the number of epochs required to complete half a cosine cycle, i.e., the number of epochs to decay from $η_{max}$ to $η_{min}$.
$T_{cur}$: the current epoch, ranging from $0$ to $T_{max}$.

Thus, the learning rate is parameterized by a cosine function that smoothly decreases from $η_{max}$ to $η_{min}$.
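The closed-form schedule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the PyTorch implementation; the function name `cosine_annealing_lr` and the values 0.1, 0.01 and 100 are assumptions chosen only for the example.

```python
import math

def cosine_annealing_lr(t_cur, t_max, eta_max, eta_min=0.0):
    """Closed-form cosine annealing: eta_max at t_cur=0, eta_min at t_cur=t_max."""
    cos_term = math.cos(math.pi * t_cur / t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cos_term)

# Illustrative values only: eta_max=0.1, eta_min=0.01, T_max=100
schedule = [cosine_annealing_lr(t, 100, 0.1, 0.01) for t in range(101)]

for t in (0, 10, 50, 90, 100):
    print(f"epoch {t:3d}: lr = {schedule[t]:.5f}")

# Slow-fast-slow profile: compare the drop over 10 epochs at the start vs. the middle
print("drop over epochs 0-10: ", schedule[0] - schedule[10])
print("drop over epochs 45-55:", schedule[45] - schedule[55])
```

With these values the schedule starts at 0.1, passes through the midpoint 0.055 at epoch 50, and ends at 0.01. The two printed drops make the nonlinear profile visible: the decrease over the first ten epochs is much smaller than the decrease over ten epochs in the middle.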

Advantages

More Effective Exploration

The principal benefit of Cosine Annealing lies in its smoothness. By avoiding abrupt drops in the learning rate, the optimizer can continue making meaningful progress even when approaching a minimum, reducing the risk of settling prematurely on a plateau or “overshooting” a good minimum due to steps that are still too large. This often results in a more stable and reliable convergence.

Policy	Typical characteristic	Practical implication
Exponential Decay	The learning rate decays rapidly, forcing an almost immediate transition to the exploitation phase.	Risk of premature convergence into a sub-optimal local minimum, since the search is confined to the first region explored.
Cosine Annealing	Extended initial exploration followed by a smooth transition into the exploitation phase.	Broader coverage of the parameter space, increasing the likelihood of discovering more promising valleys in the loss landscape.

If the brief exploration phase typical of “decay” policies is insufficient, it becomes necessary to extend it, either by tuning step_size or $γ$, or by adopting Cosine Annealing directly.

Implicit Regularization Effect

Beyond its role as a learning rate decay mechanism, Cosine Annealing can also be interpreted through the lens of implicit regularization.
In stochastic optimization methods such as SGD, the learning rate directly influences the magnitude of the stochastic noise induced by mini-batch sampling. Consequently, a time-varying learning rate produces a corresponding modulation of the effective noise level during training.

During the early phase of the cosine schedule, the relatively large learning rate amplifies stochastic fluctuations, encouraging broader exploration of the parameter space. As the learning rate decreases smoothly, the effective noise level is progressively reduced, promoting stabilization and fine-grained convergence.
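One way to make this dependence explicit is to split the mini-batch gradient into the full gradient plus a zero-mean noise term. The symbols $g_t$, $ξ_t$, $θ_t$ and $L$ below are notation introduced here for illustration and do not appear elsewhere in this note.

```latex
g_t = \nabla L(\theta_t) + \xi_t, \qquad \mathbb{E}[\xi_t] = 0

\theta_{t+1} = \theta_t - \eta_t \, g_t
             = \theta_t - \eta_t \nabla L(\theta_t) - \eta_t \, \xi_t
```

The last term, $η_t ξ_t$, is the random displacement added at step $t$. Its size is proportional to $η_t$, so a cosine schedule that lowers $η_t$ smoothly also lowers the injected noise smoothly.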

This behavior is consistent with the flat minima hypothesis, which suggests that solutions located in wide, low-curvature basins tend to generalize better.

Insight

Cosine Annealing can therefore be viewed not only as a decay policy, but as a mechanism for controlling the stochastic dynamics of SGD, implicitly biasing the optimization process toward flatter regions of the loss landscape.

Evolution: Cosine Annealing with Restarts

Despite its advantages, Cosine Annealing also has an intrinsic limitation.

Limitation of a Single Cycle

A single exploration → exploitation trajectory may end in a sub-optimal local minimum or a saddle point, where the gradient vanishes without reaching a true minimum.

To overcome this limitation, a powerful extension of the classical Cosine Annealing has been introduced.

Solution: Cosine Annealing with Restarts

This LR policy extends Cosine Annealing by introducing cyclic restarts.

At the end of each cycle (i.e., after each $T_{max}$ epochs), the learning rate is abruptly reset to its maximum value $η_{max}$.

Each restart opens a new exploration phase, reducing the likelihood of getting stuck in saddle points or sub-optimal local minima.

The subsequent periods $T_{0}, T_{1}, \dots$ may remain constant or progressively increase, thereby allocating longer exploitation phases as training progresses. This adjustment reflects the intuition that early cycles should emphasize frequent exploration, while later cycles should allow more time for fine-grained convergence within promising regions of the loss landscape. Each cycle, however, still follows a cosine decay trajectory.
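A schedule with restarts and growing periods can be sketched as follows. This is a plain-Python illustration under the assumption that cycle $i$ lasts `t_0 * t_mult**i` epochs; the function name `warm_restart_lr` and the numeric values are hypothetical.

```python
import math

def warm_restart_lr(epoch, t_0, t_mult=1, eta_max=0.1, eta_min=0.0):
    """Cosine annealing with restarts; cycle i is assumed to last t_0 * t_mult**i epochs."""
    t_i, t_cur = t_0, epoch
    # Walk forward through completed cycles to find the current one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Within the cycle, apply the usual half-period cosine decay.
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

restart_lrs = [warm_restart_lr(e, t_0=20, t_mult=2, eta_max=0.1, eta_min=0.01)
               for e in range(140)]

# A restart is any epoch where the learning rate jumps back up.
restarts = [e for e in range(1, 140) if restart_lrs[e] > restart_lrs[e - 1]]
print(restarts)  # [20, 60]
```

With `t_mult=1` every cycle has the same length; with `t_mult=2` each cycle is twice as long as the previous one, which gives later cycles more time for fine-grained convergence.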

Practical Benefits

Combines the benefits of the cosine-shaped curve (smooth decay → effective fine-tuning) with multiple exploration cycles.

After each exploitation phase, if the model has converged to an “unsatisfactory” minimum, the LR “boost” introduced by the restart enables it to escape and explore a new region of the loss landscape.

This is a widely used LR policy in Deep Learning experiments, because it balances exploration and exploitation across multiple cycles.

The use of restarts enables a highly effective model selection strategy.

Best-model selection with Cosine Annealing + Restarts

Since each exploration → exploitation cycle brings the network to settle in a local minimum of the loss landscape, it becomes possible to:

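Save the model weights at the end of each exploitation phase and evaluate the model on a separate validation set, obtaining a performance metric (e.g., accuracy, F1 score, …). The test set should be kept aside for the final evaluation of the selected model, so that it does not influence the choice.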

Repeat this process across multiple cycles, thereby accumulating several candidate models (one for each local minimum reached).

Perform a best-model selection by choosing the model with the best metric among all candidates (“the best of the best”).
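The three steps above can be sketched as a small selection loop. Everything here is a hypothetical placeholder: `select_best_snapshot`, `train_one_epoch`, `evaluate` and `get_state` are names invented for this example, and the toy metrics exist only to make the sketch runnable without a real model.

```python
import copy

def select_best_snapshot(cycle_lengths, train_one_epoch, evaluate, get_state):
    """Train cycle by cycle, snapshot at each cycle end, return the best snapshot.

    The three callables are supplied by the caller (hypothetical placeholders):
      train_one_epoch() runs one epoch and steps the LR scheduler,
      evaluate() returns a validation metric where higher is better,
      get_state() returns the current model weights.
    """
    candidates = []
    for cycle, n_epochs in enumerate(cycle_lengths):
        for _ in range(n_epochs):
            train_one_epoch()
        # End of the exploitation phase: the LR is at its lowest, so snapshot here.
        candidates.append((evaluate(), cycle, copy.deepcopy(get_state())))
    best_metric, best_cycle, best_state = max(candidates, key=lambda c: c[0])
    return best_cycle, best_metric, best_state

# Toy stand-ins so the sketch runs without a real model (values are made up).
fake_metrics = iter([0.81, 0.86, 0.84])
state = {"epochs_trained": 0}

def fake_train():
    state["epochs_trained"] += 1

best_cycle, best_metric, best_state = select_best_snapshot(
    cycle_lengths=[20, 40, 80],
    train_one_epoch=fake_train,
    evaluate=lambda: next(fake_metrics),
    get_state=lambda: state,
)
print(best_cycle, best_metric, best_state)
```

With a PyTorch model, `get_state` would typically return `model.state_dict()`; the deep copy matters because later training would otherwise overwrite the saved weights in place.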

Broader Exploration of the Loss Landscape

Cosine Annealing with Restarts is an empirical method designed to address one of the key limitations of modern Deep Learning: the difficulty of thoroughly exploring a highly complex, high-dimensional loss landscape.

Step Decay, Exponential Decay, or classical Cosine Annealing explore the parameter space only once, producing a single candidate solution.

Cyclic restarts enable multiple explorations, thereby increasing the probability of discovering globally better weight configurations.

Practical Limitation of Cosine Annealing + Restarts

Cosine Annealing with Restarts can be highly demanding: for some models, completing even a single exploration → exploitation phase may require around 100 epochs. With multiple restarts, the total number of epochs (and thus the computational cost) increases rapidly, becoming prohibitive without adequate resources or an effective early-stopping strategy.
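The growth of the epoch budget is easy to quantify. The helper below is a minimal sketch assuming that cycle $i$ lasts `t_0 * t_mult**i` epochs; `total_epochs` is a name introduced for this example, and 100 is taken from the "around 100 epochs" figure above.

```python
def total_epochs(t_0, t_mult, n_cycles):
    """Total epochs needed to complete n_cycles when cycle i lasts t_0 * t_mult**i."""
    return sum(t_0 * t_mult**i for i in range(n_cycles))

# Budget for 1 to 5 cycles with a 100-epoch first cycle
for n in range(1, 6):
    print(f"{n} cycle(s): constant period -> {total_epochs(100, 1, n):5d} epochs, "
          f"doubling period -> {total_epochs(100, 2, n):5d} epochs")
```

With a constant period the cost grows linearly (500 epochs for five cycles), while doubling the period at each restart brings five cycles to 3100 epochs.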

PyTorch Implementation

In PyTorch, these schedulers are available in torch.optim.lr_scheduler, which offers two distinct classes for the two versions of the scheduler: CosineAnnealingLR and CosineAnnealingWarmRestarts.

Implementation Detail

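In PyTorch, scheduler.step() is typically called once at the end of each epoch, after optimizer.step(). If it is called after every batch instead, the schedule advances once per iteration rather than once per epoch, so a $T_{max}$ expressed in epochs is exhausted far sooner than intended. When per-iteration stepping is actually wanted, $T_{max}$ must be expressed in iterations.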
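The effect of the calling convention can be illustrated without PyTorch by counting how many times the schedule is advanced. This is a plain-Python sketch: `cosine_lr`, `lr_after_epochs` and the figure of 50 batches per epoch are assumptions, and the clamp at `t_max` is a simplification of this sketch, not a statement about what any library does past the end of the decay.

```python
import math

def cosine_lr(step, t_max, eta_max=0.1, eta_min=0.01):
    # Simplification: clamp at t_max so the sketch stays at eta_min afterwards.
    step = min(step, t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * step / t_max))

T_MAX = 100             # intended to mean 100 epochs
BATCHES_PER_EPOCH = 50  # hypothetical: dataset size / batch size

def lr_after_epochs(n_epochs, step_per_batch):
    """Learning rate after n_epochs, depending on how often the schedule is advanced."""
    calls = n_epochs * BATCHES_PER_EPOCH if step_per_batch else n_epochs
    return cosine_lr(calls, T_MAX)

print("stepping per epoch, after 2 epochs:", lr_after_epochs(2, step_per_batch=False))
print("stepping per batch, after 2 epochs:", lr_after_epochs(2, step_per_batch=True))
```

Stepping per epoch leaves the learning rate almost at its initial value after two epochs, whereas stepping per batch has already consumed all 100 steps and reached the minimum. Setting the period to `100 * BATCHES_PER_EPOCH` would restore the intended decay when stepping per batch.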

Cosine Annealing

This scheduler implements the standard Cosine Annealing without restarts. It performs a single, smooth decay from the initial learning rate down to a minimum value over a specified number of epochs ($T_{max}$).

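As shown in the official documentation, the learning rate is updated using a recursive formula derived from the closed-form schedule of the original Stochastic Gradient Descent with Warm Restarts (SGDR) paper (the formula introduced above). When the scheduler is the only component modifying the learning rate, the recursion yields the same values as the closed form, up to floating-point error.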

Recursive Formula (PyTorch Implementation):

$$η_{t+1} = η_{min} + (η_{t} - η_{min}) \cdot \frac{1 + \cos\left(\frac{(T_{cur} + 1) π}{T_{max}}\right)}{1 + \cos\left(\frac{T_{cur} π}{T_{max}}\right)}$$

where:

$η_{t}$ is the learning rate at step $t$
$T_{cur}$ is the number of epochs since the last restart
$T_{max}$ is the maximum number of epochs in a cycle

Note

Although SGDR includes periodic restarts, CosineAnnealingLR performs cosine annealing without restarts: $T_{cur} = t$, so the internal step counter simply increases with each call to step() and is never reset.
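The relationship between the recursive update and the closed-form schedule can be checked numerically. The sketch below re-implements both formulas from this note in plain Python; it is not PyTorch's source code, and the function names and the values 100, 0.1 and 0.01 are assumptions for the example.

```python
import math

def closed_form(t, t_max, eta_max, eta_min):
    """Closed-form cosine annealing value at step t."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / t_max))

def recursive_schedule(t_max, eta_max, eta_min):
    """Apply the recursive update t_max times, starting from eta_max."""
    values = [eta_max]
    for t_cur in range(t_max):
        numerator = 1.0 + math.cos(math.pi * (t_cur + 1) / t_max)
        denominator = 1.0 + math.cos(math.pi * t_cur / t_max)
        values.append(eta_min + (values[-1] - eta_min) * numerator / denominator)
    return values

rec = recursive_schedule(100, 0.1, 0.01)
max_gap = max(abs(r - closed_form(t, 100, 0.1, 0.01)) for t, r in enumerate(rec))
print(f"largest gap between recursion and closed form: {max_gap:.2e}")
```

The gap stays at the level of floating-point rounding, which is expected: substituting the closed form for $η_{t}$ into the recursion cancels the denominator and returns the closed form for $η_{t+1}$.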

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR
import matplotlib.pyplot as plt
 
# Dummy model and optimizer
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1) # eta_max = 0.1
 
# LR Scheduler: Decay from 0.1 down to 0.01 over 100 epochs
# T_max = 100, eta_min = 0.01
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.01)
 
# Simulate training and record LR
lrs = []
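for epoch in range(100):
    lrs.append(optimizer.param_groups[0]['lr'])
    # No gradients are computed in this demo; this call stands in for a real
    # update so that optimizer.step() precedes scheduler.step()
    optimizer.step()
    scheduler.step()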
 
# Plot the result to visualize the single decay cycle
plt.figure(figsize=(10, 6))
plt.plot(lrs, marker='o')
plt.title('CosineAnnealingLR: Single Decay Cycle')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.grid(True)
plt.show()

Cosine annealing with restarts

CosineAnnealingWarmRestarts implements Stochastic Gradient Descent with Warm Restarts (SGDR). It applies the same cosine decay, but in cycles: at the end of each cycle the learning rate is reset (“restarted”) to its initial value. The formula used within each cycle is:

$$η_{t} = η_{min} + \frac{1}{2} (η_{max} - η_{min}) \left(1 + \cos\left(\frac{T_{cur}}{T_{i}} π\right)\right)$$

where $T_{i}$ is the length of the current cycle and $T_{cur}$ is the number of epochs since the last restart.

Key Parameters

As defined in the PyTorch documentation, the key parameters are:

$T_{0}$ (int): The number of epochs for the first restart cycle.

$T_{mult}$ (int, optional): A factor by which the cycle duration ($T_{i}$) increases after each restart. Default: 1.

eta_min (float, optional): The minimum learning rate. Default: 0.

For example, with T_0=20 and T_mult=2, the first cycle lasts 20 epochs, the second 40, the third 80, and so on.

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
import matplotlib.pyplot as plt
 
# Dummy model and optimizer
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1) # Initial LR (eta_max)
 
# LR Scheduler: First cycle lasts 20 epochs, second 40, third 80...
# T_0 = 20, T_mult = 2, eta_min = 0.01
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=20, T_mult=2, eta_min=0.01)
 
# Simulate training for 140 epochs (20 + 40 + 80)
lrs = []
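for epoch in range(140):
    lrs.append(optimizer.param_groups[0]['lr'])
    # No gradients are computed in this demo; this call stands in for a real
    # update so that optimizer.step() precedes scheduler.step()
    optimizer.step()
    scheduler.step()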
 
# Plot the result to visualize the restarts
plt.figure(figsize=(10, 6))
plt.plot(lrs, marker='o')
plt.title('CosineAnnealingWarmRestarts: Cyclic Decay')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.grid(True)
plt.show()
