Intro

So far, the analysis has mainly focused on mechanisms that scale the learning rate, such as the adaptive estimates introduced by algorithms like Adam.
These methods adjust the learning rate per parameter based on gradient statistics, but they still rely on a base learning rate set globally for the whole network.
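
For reference, the standard Adam update makes this explicit: the adaptive estimates rescale each coordinate individually, but a single global step size η still multiplies the whole update, and it is precisely this η that scheduling acts on.

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$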

LR scheduling

Learning rate scheduling instead refers to modifying the value of this base learning rate over the course of training, following a predefined policy.
The goal is to balance exploration in the early stages with fine convergence in the later stages, adapting the optimizer’s step size to the different phases of learning.
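
As a minimal sketch (the function name and the specific policy are illustrative, not taken from these notes), a schedule is simply a function that maps the epoch index to a learning rate, evaluated before each epoch:

```python
def lr_schedule(epoch: int, base_lr: float = 0.1) -> float:
    """Illustrative policy: halve the base learning rate every 10 epochs."""
    return base_lr * 0.5 ** (epoch // 10)

for epoch in range(30):
    lr = lr_schedule(epoch)
    # ... run one epoch of training, taking gradient steps of size `lr` ...
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```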


Challenges

Avoid local minima and saddle points

Saddle points in the loss landscape

A critical issue is the presence of saddle points in the loss landscape. These are points where the gradient vanishes, but they are not minima, since the curvature is negative along some directions.

In high-dimensional loss landscapes, saddle points are far more common than true local minima. The main difficulty arises not from descending into bad local minima, but from getting stuck in wide, flat regions around saddles where some eigenvalues of the Hessian are close to zero.
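
In more formal terms, a saddle point θ* satisfies

$$
\nabla L(\theta^*) = 0, \qquad \lambda_{\min}\!\big(\nabla^2 L(\theta^*)\big) < 0 < \lambda_{\max}\!\big(\nabla^2 L(\theta^*)\big),
$$

while the flat plateau that surrounds it corresponds to many eigenvalues of the Hessian being close to zero, so the loss changes very little in most directions.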

In such regions, the learning process slows dramatically: gradients provide little information, and the optimizer can drift aimlessly on a plateau.
The learning rate plays a key role in these regions:

  • If it is too small, the optimizer may remain trapped near the saddle for a long time.
  • If it is appropriately chosen, it provides steps large enough to escape the plateau and continue making progress toward regions of lower loss.

Accelerate convergence

Another central motivation for scheduling the learning rate is to speed up convergence.

  • If the learning rate decays too slowly, it stays large for too long.
    The optimizer keeps bouncing around the valley of the loss landscape, taking longer to settle near a minimum.

  • If the learning rate decays too quickly, it shrinks to very small values too early.
    The optimizer loses its ability to move significantly in parameter space, and may stall prematurely, missing better minima located further away.

The challenge is to strike the right balance:

  • Decay slowly enough to let the optimizer explore the valleys of the loss landscape thoroughly.
  • Decay quickly enough that convergence does not take excessively many iterations (a numerical sketch follows this list).
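
As a rough numerical illustration (the decay factors are arbitrary, chosen only for this example), compare how fast the learning rate shrinks under a slow versus a fast exponential decay:

```python
base_lr = 0.1
slow, fast = 0.99, 0.80  # arbitrary decay factors, for illustration only

for epoch in (0, 10, 50, 100):
    lr_slow = base_lr * slow ** epoch  # stays large for a long time: keeps exploring (and bouncing)
    lr_fast = base_lr * fast ** epoch  # collapses quickly: risks stalling far from good minima
    print(f"epoch {epoch:3d}: slow decay lr = {lr_slow:.6f}, fast decay lr = {lr_fast:.6f}")
```

After 100 epochs the slow schedule still uses roughly a third of the initial learning rate, while the fast one has become numerically negligible.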

Practical takeaway

A well-designed LR schedule avoids both extremes: it neither wanders endlessly through wide regions because the LR stays too large, nor freezes too early because the LR shrinks too soon, thus ensuring faster and more reliable convergence.

Exploration-Exploitation trade-off

The desideratum is to start with a high learning rate and then let it progressively decay.
This allows managing the exploration–exploitation trade-off during training.

  • 🔍 Exploration: in the early stages it is useful to take large steps in order to explore the topology of the loss function.
    A high initial learning rate enables the optimizer to cover different regions of the parameter space.

  • 🧠 Exploitation: once a promising subspace of the loss landscape has been identified, the goal is to take finer steps to reach the minimum. At this stage, an excessively high learning rate can cause overshooting: the optimizer may bounce past the valley floor without ever settling (a toy example follows this list). In the worst cases, it may even escape the valley, leading to worse performance.
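
A toy example (not from these notes) makes overshooting concrete: on the one-dimensional quadratic L(θ) = ½θ², gradient descent reads

$$
\theta_{t+1} = \theta_t - \eta\, \nabla L(\theta_t) = (1 - \eta)\,\theta_t,
$$

so the iterates converge monotonically for 0 < η < 1, oscillate around the minimum (while still converging) for 1 < η < 2, and diverge for η > 2.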

Important

Learning rate scheduling is therefore a key tool for managing the different phases of optimization, ensuring both effective exploration in the early stages and accurate convergence in the later ones.


Common Scheduling Policies

Having established why learning rate scheduling is a critical component of modern model training, the next logical step is to explore how this is practically achieved. Over the years, several effective policies have been developed, each offering a different strategy for reducing the learning rate over time.

The following notes will delve into the mechanics and use cases of the most foundational and widely used schedulers, each of which is sketched briefly after the list below:

  • Step Decay: A policy that reduces the learning rate in discrete, “staircase-like” intervals.
  • Exponential Decay: A strategy that applies a smoother, continuous reduction at every epoch.
  • Cosine Annealing: An approach that follows a sinusoidal curve to cyclically adjust the learning rate, promoting both exploration and fine-grained convergence.
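
As a preview, here is a minimal sketch of the three policies written as plain functions of the epoch index (the parameter names and default values are illustrative choices, not a fixed convention):

```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.5, step_size=10):
    """Step decay: multiply the LR by `drop` every `step_size` epochs (staircase)."""
    return base_lr * drop ** (epoch // step_size)

def exponential_decay(epoch, base_lr=0.1, gamma=0.95):
    """Exponential decay: multiply the LR by `gamma` at every epoch (smooth)."""
    return base_lr * gamma ** epoch

def cosine_annealing(epoch, base_lr=0.1, min_lr=0.0, period=50):
    """Cosine annealing: follow half a cosine wave from base_lr down to min_lr
    over `period` epochs; restarting the cycle gives the cyclic variant."""
    t = (epoch % period) / period
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

for epoch in (0, 5, 25, 49):
    print(epoch,
          round(step_decay(epoch), 4),
          round(exponential_decay(epoch), 4),
          round(cosine_annealing(epoch), 4))
```

Deep learning frameworks ship equivalent utilities, e.g. PyTorch's torch.optim.lr_scheduler module with StepLR, ExponentialLR, and CosineAnnealingLR.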