1. Intro

Step Decay and Exponential Decay are the two foundational learning-rate schedules in Deep Learning. Both implement the same broad idea introduced in Learning rate scheduling:

  • begin with a learning rate large enough to support meaningful exploration;
  • reduce it over time;
  • so that late training proceeds with smaller, more controlled updates.

The essential difference is how the reduction is distributed over time:

  • Step Decay changes the learning rate at discrete milestones, producing a piecewise-constant staircase profile;
  • Exponential Decay reduces it at every step, producing a smooth monotone curve.

Same idea, different cadence

These are not competing mathematical curiosities. They are two answers to a single practical question: should the learning rate decrease abruptly at specific times, or continuously at every step? Both are multiplicative schedules; they differ only in the frequency at which the multiplicative reduction is applied. The smoother alternative that interpolates between the two regimes is Cosine Annealing, treated in its own note.

Scheduler steps are not intrinsically epochs

Throughout this note, all hyperparameters (step_size, $\gamma$, scheduler indices) are expressed in scheduler steps, i.e. in units of how often scheduler.step() is called. If step() runs once per epoch, the hyperparameters count epochs; if once per iteration, they count iterations. The general discussion of this convention is in Learning rate scheduling §3; the practical consequence for these two schedules is that stepping per batch while choosing step_size as if it counted epochs is the most common implementation bug.

2. Step Decay

2.1 Definition

Step Decay multiplies the learning rate by a fixed factor $\gamma \in (0, 1)$ every $s$ scheduler steps, where $s$ (step_size) is the interval between two consecutive drops. The result is a staircase profile: the learning rate stays constant on each plateau, then drops abruptly to the next plateau.

2.2 Closed-form schedule

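Let $\eta_0$ be the initial learning rate, $\gamma \in (0, 1)$ the multiplicative decay factor, $s$ the step interval, and $t$ the scheduler step index. The closed form is

\[ \eta_t = \eta_0 \, \gamma^{\lfloor t / s \rfloor}. \]

The floor expression $\lfloor t / s \rfloor$ counts how many drops have already occurred by step $t$:

  • for $t < s$ no drop has happened, the exponent is $0$, and $\eta_t = \eta_0$;
  • after the first crossing of a threshold ($s \le t < 2s$), the exponent becomes $1$ and $\eta_t = \eta_0 \gamma$;
  • after $k$ thresholds, the exponent is $k$ and $\eta_t = \eta_0 \gamma^k$.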
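
As a quick check of the closed form, the following plain-Python sketch evaluates it and compares it with the step-by-step rule (multiply by the factor whenever the step index reaches a multiple of the interval). The function and argument names (`step_decay_lr`, `lr0`, `gamma`, `step_size`) are illustrative choices, not part of any library.

```python
import math

def step_decay_lr(t, lr0, gamma, step_size):
    """Closed form: lr0 * gamma ** floor(t / step_size)."""
    return lr0 * gamma ** (t // step_size)

def step_decay_recursive(num_steps, lr0, gamma, step_size):
    """Same schedule built one scheduler step at a time."""
    lr = lr0
    schedule = [lr]                      # value at t = 0
    for t in range(1, num_steps):
        if t % step_size == 0:           # threshold crossed at step_size, 2 * step_size, ...
            lr *= gamma
        schedule.append(lr)
    return schedule

closed = [step_decay_lr(t, 0.1, 0.5, 10) for t in range(30)]
recursive = step_decay_recursive(30, 0.1, 0.5, 10)

# Both constructions agree at every scheduler step
assert all(math.isclose(a, b) for a, b in zip(closed, recursive))
print(closed[0], closed[9], closed[10], closed[20])   # 0.1 0.1 0.05 0.025
```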

2.3 What Step Decay is and is not

Step Decay strengths

Step Decay is often the right choice when:

  • a small number of explicit, controlled regime changes is desired;
  • prior experiments already suggest good milestone locations (the epoch numbers where the loss curve flattens);
  • manual interpretability of the schedule matters;
  • a simple baseline is needed before moving to smoother families.

Main weakness: abruptness and milestone sensitivity

At each milestone the learning rate changes discontinuously, and the optimization dynamics can change suddenly. Training therefore becomes sensitive to the exact placement of the milestones: if step_size is too small or $\gamma$ is too low, the learning rate collapses too early and learning slows down prematurely; if step_size is too large, the late-stage exploitation never begins. The right milestones are essentially a hyperparameter that has to be tuned per task.
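
To make the sensitivity concrete, this small sketch evaluates the Step Decay closed form at the middle and at the end of a hypothetical 100-step run for three interval choices. All numbers (100 steps, factor 0.1, initial rate 0.1) are made up for illustration, and `step_decay_lr` is an illustrative helper rather than a library function.

```python
def step_decay_lr(t, lr0, gamma, step_size):
    """Closed form: lr0 * gamma ** floor(t / step_size)."""
    return lr0 * gamma ** (t // step_size)

total_steps = 100          # assumed training horizon in scheduler steps
lr0, gamma = 0.1, 0.1      # assumed initial rate and decay factor

for step_size in (5, 30, 200):
    halfway = step_decay_lr(total_steps // 2, lr0, gamma, step_size)
    last = step_decay_lr(total_steps - 1, lr0, gamma, step_size)
    print(f"step_size={step_size:3d}  lr at t=50: {halfway:.1e}  lr at t=99: {last:.1e}")
```

With an interval of 5 the rate has already gone through ten drops by mid-run and is effectively zero; with an interval of 200 no drop happens within the run at all.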

Step Decay vs ReduceLROnPlateau

Step Decay is time-driven: drops happen at predetermined scheduler-step counts. PyTorch’s ReduceLROnPlateau is metric-driven: drops happen when a monitored quantity (typically validation loss) stops improving for a configured number of epochs. The two are different mechanisms and address different needs; this note covers only the time-driven family.

3. Exponential Decay

3.1 Definition

Exponential Decay multiplies the learning rate by the same factor $\gamma \in (0, 1)$ at every scheduler step. Unlike Step Decay, the reduction happens continuously rather than in discrete plateaus.

3.2 Closed-form schedule

The per-step recursion $\eta_{t+1} = \gamma \, \eta_t$ has the immediate closed form

\[ \eta_t = \eta_0 \, \gamma^{t}. \]

The same schedule can equivalently be written as a continuous exponential

\[ \eta_t = \eta_0 \, e^{-t / \tau}, \qquad \tau = -\frac{1}{\ln \gamma}. \]

The two forms are mathematically identical; the exponential parameterization is occasionally convenient because it makes the time constant $\tau$ explicit (the number of scheduler steps over which the learning rate decays by a factor of $e$).
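
The equivalence of the recursion, the power form, and the exponential form can be checked numerically. The sketch below is plain Python; the names `lr0`, `gamma`, and `tau` are illustrative, and `tau = -1 / ln(gamma)` is the time constant obtained by matching the two closed forms.

```python
import math

lr0, gamma = 0.1, 0.9
tau = -1.0 / math.log(gamma)   # time constant in scheduler steps

lr = lr0
for t in range(50):
    power_form = lr0 * gamma ** t
    exp_form = lr0 * math.exp(-t / tau)
    assert math.isclose(lr, power_form, rel_tol=1e-9)
    assert math.isclose(power_form, exp_form, rel_tol=1e-9)
    lr *= gamma                # per-step recursion

# After tau scheduler steps the rate has shrunk by a factor of e
assert math.isclose(lr0 * gamma ** tau, lr0 / math.e, rel_tol=1e-9)
print(f"tau = {tau:.2f} scheduler steps")
```

For a factor of 0.9 the time constant comes out at roughly 9.5 scheduler steps.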

3.3 What Exponential Decay is and is not

Exponential Decay strengths

Exponential Decay is often the right choice when:

  • a smooth schedule is preferred over discontinuous drops;
  • the learning rate should decrease steadily throughout training without explicit milestones;
  • a single-parameter decay mechanism (only $\gamma$ to tune) is desired.

Main weakness: less explicit control over regime changes

The reduction is always active, so the schedule may become too conservative if $\gamma$ is chosen too small: the optimizer enters the low-learning-rate exploitation regime gradually but starts giving up step magnitude immediately, with no extended high-LR exploration plateau. Step Decay can express the rule “stay at the high LR for $s$ steps, then drop”; Exponential Decay cannot.

4. The two schedules in one picture

Step Decay and Exponential Decay are closely related. Both are multiplicative; they differ in frequency:

  • Step Decay applies the factor $\gamma$ every $s$ steps: the same geometric decay is concentrated into explicit jumps;
  • Exponential Decay applies the factor $\gamma$ every step: the same geometric decay is spread continuously over time.

Step Decay as a coarsened Exponential Decay

A useful way to think about the difference: a Step Decay with interval $s$ and factor $\gamma$ has the same per-drop multiplier as an Exponential Decay with per-step factor $\gamma^{1/s}$, since $(\gamma^{1/s})^{s} = \gamma$, but the contraction is buffered at the drop boundary instead of being smeared across the $s$ steps. Step Decay changes the learning rate through regime shifts; Exponential Decay changes it through small continuous contractions.
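
A numerical sketch of this correspondence: if the Exponential Decay factor is taken to be the `step_size`-th root of the Step Decay factor, the two schedules coincide at every drop boundary and differ in between. Names and values are illustrative.

```python
import math

lr0, gamma_step, step_size = 0.1, 0.5, 10
# Per-step factor that yields the same total decay over step_size steps
gamma_exp = gamma_step ** (1.0 / step_size)

def step_decay_lr(t):
    return lr0 * gamma_step ** (t // step_size)

def exp_decay_lr(t):
    return lr0 * gamma_exp ** t

for t in range(0, 41, 5):
    print(f"t={t:2d}  step={step_decay_lr(t):.5f}  exp={exp_decay_lr(t):.5f}")

# Equal at multiples of step_size ...
assert all(math.isclose(step_decay_lr(t), exp_decay_lr(t)) for t in range(0, 41, step_size))
# ... while Step Decay stays strictly higher inside a plateau
assert all(step_decay_lr(t) > exp_decay_lr(t) for t in range(1, step_size))
```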

5. Tuning from a target final learning rate

In practice, the most natural way to choose $\gamma$ is not to pick it directly but to specify the desired endpoint of training and solve for $\gamma$.

If the final learning rate is known in advance, what value of $\gamma$ achieves it?

5.1 Exponential Decay

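Training runs for $T$ scheduler steps and the goal is to go from $\eta_0$ to $\eta_T$. Inverting $\eta_T = \eta_0 \, \gamma^{T}$:

\[ \gamma = \left( \frac{\eta_T}{\eta_0} \right)^{1/T}. \]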

5.2 Step Decay

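If $K$ decay events occur during training, $\eta_{\text{final}} = \eta_0 \, \gamma^{K}$, so

\[ \gamma = \left( \frac{\eta_{\text{final}}}{\eta_0} \right)^{1/K}. \]

Here $K$ counts drops, not scheduler steps. If training lasts $T$ scheduler steps with drops every $s$ steps, $K = \lfloor T / s \rfloor$.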

Target-final-LR thinking

This viewpoint is often more interpretable than tuning $\gamma$ directly by trial and error. It converts scheduler design into three explicit choices followed by one computation:

  1. the initial learning rate $\eta_0$;
  2. the desired final learning rate $\eta_T$ (or $\eta_{\text{final}}$);
  3. the training horizon $T$ (or the number of drops $K$);
  4. then solve for $\gamma$.
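
The recipe above can be sketched in a few lines of plain Python. The helper names are hypothetical; the formulas are the inversions of the two closed forms, with the number of Step Decay drops taken as the floor of the horizon divided by the interval.

```python
import math

def gamma_for_exponential(lr0, lr_final, total_steps):
    """Per-step factor so that lr0 * gamma ** total_steps == lr_final."""
    return (lr_final / lr0) ** (1.0 / total_steps)

def gamma_for_step_decay(lr0, lr_final, total_steps, step_size):
    """Per-drop factor so that lr0 * gamma ** num_drops == lr_final."""
    num_drops = total_steps // step_size
    if num_drops == 0:
        raise ValueError("step_size exceeds the training horizon: no drop ever occurs")
    return (lr_final / lr0) ** (1.0 / num_drops)

# Example: go from 0.1 to 0.001 over 100 scheduler steps
g_exp = gamma_for_exponential(0.1, 0.001, 100)
g_step = gamma_for_step_decay(0.1, 0.001, 100, step_size=25)   # 4 drops

print(f"exponential decay factor: {g_exp:.4f}")
print(f"step decay factor:        {g_step:.4f}")

# Both factors reach the target endpoint
assert math.isclose(0.1 * g_exp ** 100, 0.001)
assert math.isclose(0.1 * g_step ** 4, 0.001)
```

In this example the per-step factor is roughly 0.955 while the per-drop factor is roughly 0.316: the same total reduction, spread over 100 small contractions or concentrated into 4 jumps.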

6. PyTorch

Both schedules live in torch.optim.lr_scheduler and follow the same two PyTorch conventions that apply to every scheduler:

  • scheduler.step() must be called after optimizer.step();
  • all scheduler hyperparameters are in scheduler-step units (see §1).

6.1 StepLR

StepLR multiplies the learning rate of each parameter group by gamma every step_size scheduler steps.

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
 
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
 
# Halve the learning rate every 10 scheduler steps
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)
 
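for epoch in range(1, 31):
    # ... loop over mini-batches: forward pass, loss.backward(), optimizer.step() ...
    optimizer.step()   # placeholder for the per-batch parameter updates
    scheduler.step()   # once per epoch, after the optimizer updates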

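The resulting schedule, reading the learning rate after the scheduler.step() call of each epoch (so the epoch number equals the scheduler index $t$):

| epoch range | learning rate |
| --- | --- |
| 1 to 9 | 0.1 |
| 10 to 19 | 0.05 |
| 20 to 29 | 0.025 |

The 30th call drops the rate once more, to 0.0125. Training within an epoch uses the value set by the previous call, so the first drop takes effect in epoch 11.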

6.2 ExponentialLR

ExponentialLR multiplies the learning rate by gamma at every scheduler step.

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR
 
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
 
# Reduce LR by 10% each scheduler step
scheduler = ExponentialLR(optimizer, gamma=0.9)
 
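for epoch in range(1, 6):
    # ... loop over mini-batches: forward pass, loss.backward(), optimizer.step() ...
    optimizer.step()   # placeholder for the per-batch parameter updates
    scheduler.step()   # once per epoch, after the optimizer updates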

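The resulting schedule, reading the learning rate after the scheduler.step() call of each epoch ($0.1 \cdot 0.9^{t}$ with $t$ the epoch number, up to floating-point rounding):

| epoch | learning rate |
| --- | --- |
| 1 | 0.09 |
| 2 | 0.081 |
| 3 | 0.0729 |
| 4 | 0.06561 |
| 5 | 0.059049 |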

Both code blocks are epoch-based only because step() is called once per epoch. Calling it once per batch turns the same schedulers into iteration-based variants, with the hyperparameter values then needing to be expressed in iterations.
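
When moving from per-epoch to per-batch stepping, the hyperparameters can be converted rather than re-tuned. The following plain-Python sketch assumes a fixed number of batches per epoch (`batches_per_epoch`, a hypothetical value here); the helper names are illustrative, and the results would be passed as `step_size` to StepLR and as `gamma` to ExponentialLR.

```python
import math

def step_size_in_iterations(step_size_epochs, batches_per_epoch):
    """StepLR interval when scheduler.step() runs once per batch."""
    return step_size_epochs * batches_per_epoch

def gamma_per_iteration(gamma_per_epoch, batches_per_epoch):
    """ExponentialLR factor that reproduces gamma_per_epoch over one full epoch."""
    return gamma_per_epoch ** (1.0 / batches_per_epoch)

batches_per_epoch = 500    # assumed: dataset size divided by batch size

print(step_size_in_iterations(10, batches_per_epoch))   # 5000

g_iter = gamma_per_iteration(0.9, batches_per_epoch)
print(f"{g_iter:.6f}")

# Applying the per-iteration factor for one epoch recovers the per-epoch factor
assert math.isclose(g_iter ** batches_per_epoch, 0.9)
```

The StepLR factor itself needs no conversion, since it is applied per drop; only the interval changes. For ExponentialLR the factor must move much closer to 1, otherwise the rate collapses within the first epoch.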

7. Comparison and decision

| Aspect | Step Decay | Exponential Decay |
| --- | --- | --- |
| Decay profile | piecewise-constant staircase | smooth monotone curve |
| Update frequency | every step_size steps | every step |
| Optimization dynamics | stable plateaus, abrupt regime changes | gradual continuous contraction |
| Main hyperparameters | step_size, $\gamma$ | $\gamma$ |
| Main strength | explicit milestone-based exploitation control | smooth reduction without discontinuities |
| Main weakness | sensitivity to milestone placement | no extended high-LR plateau; can be too aggressive if $\gamma$ is small |

When neither is enough

If both schedules feel too rigid or too sensitive to their hyperparameters, the natural next step is Cosine Annealing, which provides a smooth nonlinear decay with a long high-LR plateau at the start and a slow tail near the minimum learning rate $\eta_{\min}$. For the full decision framework (when to add warm-up, when to consider restarts, when to keep things simple), see Choosing an LR scheduler.

8. Summary

Final takeaway

Step Decay and Exponential Decay embody the same fundamental idea of multiplicative learning-rate reduction. The real choice between them is between discrete milestone-based exploitation (Step Decay) and continuous smooth decay (Exponential Decay). The better schedule is the one whose transition pattern best matches the optimization dynamics of the training problem at hand; both are simple baselines against which any more elaborate scheduler should be compared.