1. Intro
Step Decay and Exponential Decay are the two foundational learning-rate schedules in Deep Learning. Both implement the same broad idea introduced in Learning rate scheduling:
- begin with a learning rate large enough to support meaningful exploration;
- reduce it over time;
- so that late training proceeds with smaller, more controlled updates.
The essential difference is how the reduction is distributed over time:
- Step Decay changes the learning rate at discrete milestones, producing a piecewise-constant staircase profile;
- Exponential Decay reduces it at every step, producing a smooth monotone curve.
Same idea, different cadence
These are not competing mathematical curiosities. They are two answers to a single practical question: should the learning rate decrease abruptly at specific times, or continuously at every step? Both are multiplicative schedules; they differ only in the frequency at which the multiplicative reduction is applied. The smoother alternative that interpolates between the two regimes is Cosine Annealing, treated in its own note.
Scheduler steps are not intrinsically epochs
Throughout this note, all time-like hyperparameters (`step_size`, the training horizon $T$, scheduler indices) are expressed in scheduler steps, i.e. in units of how often `scheduler.step()` is called. If `step()` runs once per epoch, the hyperparameters count epochs; if once per iteration, they count iterations. The general discussion of this convention is in Learning rate scheduling §3; the practical consequence for these two schedules is that stepping per batch while choosing `step_size` as if it counted epochs is the most common implementation bug.
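The unit convention can be made concrete with a minimal plain-Python sketch. The helper name `step_size_in_scheduler_steps` and its arguments are hypothetical, introduced here only for illustration; the numbers (10 epochs between drops, 500 mini-batches per epoch) are assumed values.

```python
def step_size_in_scheduler_steps(epochs_between_drops: int,
                                 batches_per_epoch: int,
                                 step_per_batch: bool) -> int:
    """Express an epoch-based drop interval in scheduler-step units."""
    if step_per_batch:
        # scheduler.step() runs once per mini-batch:
        # one epoch corresponds to batches_per_epoch scheduler steps
        return epochs_between_drops * batches_per_epoch
    # scheduler.step() runs once per epoch: one epoch is one scheduler step
    return epochs_between_drops


# Intended behaviour: one drop every 10 epochs, 500 mini-batches per epoch.
per_epoch = step_size_in_scheduler_steps(10, 500, step_per_batch=False)
per_batch = step_size_in_scheduler_steps(10, 500, step_per_batch=True)
print(per_epoch)  # 10
print(per_batch)  # 5000
```

Under these assumed numbers, passing `step_size=10` while stepping per batch would trigger a drop every 10 mini-batches, i.e. 50 drops within the first epoch.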
2. Step Decay
2.1 Definition
Step Decay multiplies the learning rate by a fixed factor $\gamma$ every $s$ scheduler steps, where $s$ is the interval between two consecutive drops. The result is a staircase profile: the learning rate stays constant on each plateau, then drops abruptly to the next plateau.
2.2 Closed-form schedule
Let $\eta_0$ be the initial learning rate, $\gamma \in (0, 1)$ the multiplicative decay factor, $s$ the step interval, and $t$ the scheduler step index. The closed form is
$$\eta_t = \eta_0 \, \gamma^{\lfloor t / s \rfloor}.$$
The floor expression counts how many drops have already occurred by step $t$:
- for $0 \le t < s$ no drop has happened, the exponent is $0$, and $\eta_t = \eta_0$;
- after the first crossing of a threshold, $s \le t < 2s$, the exponent becomes $1$ and $\eta_t = \eta_0 \gamma$;
- after $k$ thresholds, the exponent is $k$ and $\eta_t = \eta_0 \gamma^{k}$.
Equivalent step-index reformulation
Writing $k = \lfloor t / s \rfloor$ for the number of drops so far, the schedule simplifies to the pure geometric sequence
$$\eta^{(k)} = \eta_0 \, \gamma^{k},$$
which is immediate from $\eta^{(0)} = \eta_0$, $\eta^{(1)} = \gamma \, \eta^{(0)}$, …, $\eta^{(k)} = \gamma \, \eta^{(k-1)}$. The two formulas describe the same schedule at two different levels: the time-indexed form tells the learning rate at every scheduler step $t$, while the drop-indexed form tells the learning rate after $k$ drops have happened. The floor expression in the first form simply determines when the index $k$ should advance.
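The agreement between the time-indexed closed form and the drop-by-drop recursion can be checked numerically. The following is a minimal sketch in plain Python; the function names and the values (initial rate 0.1, factor 0.5, interval 10) are illustrative assumptions, not part of any library.

```python
import math


def step_decay_lr(eta0: float, gamma: float, step_size: int, t: int) -> float:
    """Closed form: eta0 * gamma ** floor(t / step_size)."""
    drops = t // step_size  # number of drops that have occurred by step t
    return eta0 * gamma ** drops


def step_decay_recursive(eta0: float, gamma: float,
                         step_size: int, num_steps: int) -> list:
    """Same schedule, built by multiplying by gamma at each milestone."""
    lr, schedule = eta0, []
    for t in range(num_steps):
        if t > 0 and t % step_size == 0:
            lr *= gamma  # a drop happens exactly when t crosses a multiple of step_size
        schedule.append(lr)
    return schedule


closed = [step_decay_lr(0.1, 0.5, 10, t) for t in range(30)]
recursive = step_decay_recursive(0.1, 0.5, 10, 30)
same = all(math.isclose(a, b) for a, b in zip(closed, recursive))
print(same)  # True
```

Both lists hold three plateaus of ten steps each, at 0.1, 0.05, and 0.025.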
2.3 What Step Decay is and is not
Step Decay strengths
Step Decay is often the right choice when:
- a small number of explicit, controlled regime changes is desired;
- prior experiments already suggest good milestone locations (the epoch numbers where the loss curve flattens);
- manual interpretability of the schedule matters;
- a simple baseline is needed before moving to smoother families.
Main weakness: abruptness and milestone sensitivity
At each milestone the learning rate changes discontinuously, and the optimization dynamics can change suddenly. Training therefore becomes sensitive to the exact placement of the milestones: if `step_size` is too small or $\gamma$ is too low, the learning rate collapses too early and learning slows down prematurely; if `step_size` is too large, the late-stage exploitation never begins. The right milestones are essentially a hyperparameter that has to be tuned per task.
Step Decay vs `ReduceLROnPlateau`
Step Decay is time-driven: drops happen at predetermined scheduler-step counts. PyTorch’s `ReduceLROnPlateau` is metric-driven: drops happen when a monitored quantity (typically validation loss) stops improving for a configured number of epochs. The two are different mechanisms and address different needs; this note covers only the time-driven family.
An indexing-convention subtlety
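Different texts and implementations write the Step Decay schedule with slightly different floor expressions, for example
$$\eta_t = \eta_0 \, \gamma^{\lfloor t / s \rfloor} \qquad \text{versus} \qquad \eta_t = \eta_0 \, \gamma^{\lfloor (t + 1) / s \rfloor}.$$
The difference is not conceptual: it concerns when the first drop is considered to happen. The left form keeps $\eta_0$ for $0 \le t < s$ and drops at $t = s$; the right form keeps $\eta_0$ only for $0 \le t < s - 1$ and drops at $t = s - 1$. Neither is “more correct”; they encode different choices of when the schedule’s first reduction lands.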
The practical point: the convention used in the mathematics must match the convention used in the implementation. Two implementations using the two formulas above with the same hyperparameters produce schedules that are off by exactly one drop event, and the resulting bug looks like a near-miss on the training curve rather than an obvious crash. PyTorch’s `StepLR` uses the convention $\lfloor t / s \rfloor$ (with $t$ the number of scheduler steps taken and $s$ equal to `step_size`), the first form above.
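How such a mismatch shows up can be sketched in plain Python. The first function uses the $\lfloor t/s \rfloor$ exponent attributed to `StepLR` in the text; the second uses $\lfloor (t+1)/s \rfloor$, which is assumed here purely as one illustrative alternative convention. Function names and numeric values are hypothetical.

```python
def lr_floor_t(eta0: float, gamma: float, s: int, t: int) -> float:
    """Convention A: exponent floor(t / s); the first drop lands at t = s."""
    return eta0 * gamma ** (t // s)


def lr_floor_t_plus_1(eta0: float, gamma: float, s: int, t: int) -> float:
    """Convention B (assumed alternative): exponent floor((t + 1) / s)."""
    return eta0 * gamma ** ((t + 1) // s)


eta0, gamma, s = 0.1, 0.5, 10

# Scheduler steps at which the two conventions report different learning rates
mismatches = [t for t in range(30)
              if lr_floor_t(eta0, gamma, s, t) != lr_floor_t_plus_1(eta0, gamma, s, t)]
print(mismatches)  # [9, 19, 29]
```

The two schedules agree almost everywhere and disagree only at the step just before each milestone, which is why the bug is easy to miss on a training curve.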
3. Exponential Decay
3.1 Definition
Exponential Decay multiplies the learning rate by the same factor $\gamma$ at every scheduler step. Unlike Step Decay, the reduction happens continuously rather than in discrete plateaus.
3.2 Closed-form schedule
The per-step recursion $\eta_{t+1} = \gamma \, \eta_t$ has the immediate closed form
$$\eta_t = \eta_0 \, \gamma^{t}.$$
The same schedule can equivalently be written as a continuous exponential
$$\eta_t = \eta_0 \, e^{-\lambda t}, \qquad \lambda = -\ln \gamma > 0.$$
The two forms are mathematically identical; the exponential parameterization is occasionally convenient because it makes the time constant $\tau = 1 / \lambda$ explicit (the number of scheduler steps over which the learning rate decays by a factor of $e$).
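The equivalence of the two parameterizations, and the meaning of the time constant, can be checked with a short plain-Python sketch. The names `lam` (decay rate, equal to minus the natural log of the per-step factor) and `tau` (time constant, its reciprocal) are notation assumed here for illustration, as are the numeric values.

```python
import math


def exp_decay_discrete(eta0: float, gamma: float, t: float) -> float:
    """Geometric form: eta0 * gamma ** t."""
    return eta0 * gamma ** t


def exp_decay_continuous(eta0: float, lam: float, t: float) -> float:
    """Exponential form: eta0 * exp(-lam * t)."""
    return eta0 * math.exp(-lam * t)


eta0, gamma = 0.1, 0.9
lam = -math.log(gamma)  # decay rate implied by gamma
tau = 1.0 / lam         # time constant, in scheduler steps

forms_agree = all(
    math.isclose(exp_decay_discrete(eta0, gamma, t),
                 exp_decay_continuous(eta0, lam, t), rel_tol=1e-9)
    for t in range(0, 50, 10)
)
print(forms_agree)     # True
print(round(tau, 2))   # 9.49
```

With a per-step factor of 0.9, the learning rate shrinks by a factor of $e$ roughly every 9.5 scheduler steps.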
3.3 What Exponential Decay is and is not
Exponential Decay strengths
Exponential Decay is often the right choice when:
- a smooth schedule is preferred over discontinuous drops;
- the learning rate should decrease steadily throughout training without explicit milestones;
- a single-parameter decay mechanism (only $\gamma$ to tune) is desired.
Main weakness: less explicit control over regime changes
The reduction is always active, so the schedule may decay too aggressively if $\gamma$ is chosen too small: the optimizer enters the low-learning-rate exploitation regime gradually but starts giving up step magnitude immediately, with no extended high-LR exploration plateau. Step Decay can express the rule “stay at the high LR for $s$ steps, then drop”; Exponential Decay cannot.
4. The two schedules in one picture
Step Decay and Exponential Decay are closely related. Both are multiplicative; they differ in frequency:
- Step Decay applies the factor $\gamma$ every $s$ steps: the same geometric decay is concentrated into explicit jumps;
- Exponential Decay applies the factor $\gamma$ every step: the same geometric decay is spread continuously over time.
Step Decay as a coarsened Exponential Decay
A useful way to think about the difference: a Step Decay with interval $s$ and factor $\gamma$ has the same per-drop multiplier as an Exponential Decay with per-step factor $\gamma^{1/s}$, but the contraction is buffered at the drop boundary instead of being smeared across the $s$ steps. Step Decay changes the learning rate through regime shifts; Exponential Decay changes it through small continuous contractions.
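This relationship can be illustrated with a small plain-Python sketch. The variable `gamma_exp`, the helper names, and the numeric values are assumptions made for illustration: `gamma_exp` is the per-step factor whose accumulated effect over one drop interval equals the Step Decay factor.

```python
import math

eta0, gamma, s = 0.1, 0.5, 10
gamma_exp = gamma ** (1.0 / s)  # per-step factor: s applications equal one drop


def step_lr(t: int) -> float:
    """Step Decay: one multiplication by gamma every s steps."""
    return eta0 * gamma ** (t // s)


def exp_lr(t: int) -> float:
    """Exponential Decay with the matched per-step factor."""
    return eta0 * gamma_exp ** t


# At the milestones the two schedules coincide
agree_at_milestones = all(
    math.isclose(step_lr(t), exp_lr(t), rel_tol=1e-9) for t in (0, 10, 20, 30)
)

# Between milestones the staircase stays above the smooth curve
staircase_above = all(step_lr(t) > exp_lr(t) for t in range(30) if t % s != 0)

print(agree_at_milestones, staircase_above)  # True True
```

The staircase holds the higher learning rate for the whole interval and pays the contraction at once, while the matched exponential pays it in equal multiplicative installments.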
5. Tuning from a target final learning rate
In practice, the most natural way to choose $\gamma$ is not to pick it directly but to specify the desired endpoint of training and solve for $\gamma$.
If the final learning rate is known in advance, what value of $\gamma$ achieves it?
5.1 Exponential Decay
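Training runs for $T$ scheduler steps and the goal is to go from $\eta_0$ to $\eta_T$. Inverting $\eta_T = \eta_0 \, \gamma^{T}$:
$$\gamma = \left( \frac{\eta_T}{\eta_0} \right)^{1/T}.$$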
5.2 Step Decay
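If $K$ decay events occur during training, $\eta^{(K)} = \eta_0 \, \gamma^{K}$, so
$$\gamma = \left( \frac{\eta^{(K)}}{\eta_0} \right)^{1/K}.$$
Here $K$ counts drops, not scheduler steps. If training lasts $T$ scheduler steps with drops every $s$ steps, $K = \lfloor T / s \rfloor$.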
Target-final-LR thinking
This viewpoint is often more interpretable than tuning $\gamma$ directly by trial and error. It converts scheduler design into four explicit choices:
- the initial learning rate $\eta_0$;
- the desired final learning rate $\eta_T$ (or $\eta^{(K)}$ for Step Decay);
- the training horizon $T$ (or the number of drops $K$);
- then solve for $\gamma$.
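The two inversions can be sketched in plain Python as follows. The function names, the error handling, and the numeric targets (0.1 down to 0.001) are illustrative assumptions; the drop count follows the floor rule stated in §5.2.

```python
import math


def gamma_for_exponential(eta0: float, eta_final: float, total_steps: int) -> float:
    """Per-step factor that takes eta0 to eta_final in total_steps scheduler steps."""
    return (eta_final / eta0) ** (1.0 / total_steps)


def gamma_for_step(eta0: float, eta_final: float,
                   total_steps: int, step_size: int) -> float:
    """Per-drop factor that takes eta0 to eta_final over floor(total_steps / step_size) drops."""
    num_drops = total_steps // step_size
    if num_drops == 0:
        raise ValueError("no drop occurs within the training horizon")
    return (eta_final / eta0) ** (1.0 / num_drops)


g_exp = gamma_for_exponential(0.1, 0.001, total_steps=100)
g_step = gamma_for_step(0.1, 0.001, total_steps=90, step_size=30)  # 3 drops

# Sanity check: applying the factors reproduces the target endpoint
print(math.isclose(0.1 * g_exp ** 100, 0.001, rel_tol=1e-9))  # True
print(math.isclose(0.1 * g_step ** 3, 0.001, rel_tol=1e-9))   # True
```

For the same endpoint ratio, the per-drop factor for Step Decay is much smaller than the per-step factor for Exponential Decay, because it is applied far fewer times.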
6. PyTorch
Both schedules live in `torch.optim.lr_scheduler` and follow the same two PyTorch conventions that apply to every scheduler:
- `scheduler.step()` must be called after `optimizer.step()`;
- all scheduler hyperparameters are in scheduler-step units (see §1).
6.1 StepLR
`StepLR` multiplies the learning rate of each parameter group by `gamma` every `step_size` scheduler steps.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 10 scheduler steps
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(1, 31):
    # ... training loop over mini-batches ...
    optimizer.step()   # placeholder for this epoch's parameter updates
    scheduler.step()   # one scheduler step per epoch, after optimizer.step()
```

The resulting schedule, read after `scheduler.step()` at the end of each epoch:

| epoch range | learning rate after `scheduler.step()` |
|---|---|
| 1 to 9 | $0.1$ |
| 10 to 19 | $0.05$ |
| 20 to 29 | $0.025$ |
| 30 | $0.0125$ |
6.2 ExponentialLR
`ExponentialLR` multiplies the learning rate by `gamma` at every scheduler step.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Reduce LR by 10% each scheduler step
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(1, 6):
    # ... training loop over mini-batches ...
    optimizer.step()   # placeholder for this epoch's parameter updates
    scheduler.step()   # one scheduler step per epoch, after optimizer.step()
```

The resulting schedule, read after `scheduler.step()` at the end of each epoch:

| epoch | learning rate after `scheduler.step()` |
|---|---|
| 1 | $0.09$ |
| 2 | $0.081$ |
| 3 | $0.0729$ |
| 4 | $0.06561$ |
| 5 | $0.059049$ |
Both code blocks are epoch-based only because `step()` is called once per epoch. Calling it once per batch turns the same schedulers into iteration-based variants, with the hyperparameter values then needing to be expressed in iterations.
7. Comparison and decision
| Aspect | Step Decay | Exponential Decay |
|---|---|---|
| Decay profile | piecewise-constant staircase | smooth monotone curve |
| Update frequency | every `step_size` steps | every step |
| Optimization dynamics | stable plateaus, abrupt regime changes | gradual continuous contraction |
| Main hyperparameters | `step_size`, $\gamma$ | $\gamma$ |
| Main strength | explicit milestone-based exploitation control | smooth reduction without discontinuities |
| Main weakness | sensitivity to milestone placement | no extended high-LR plateau; can be too aggressive if $\gamma$ is small |
When neither is enough
If both schedules feel too rigid or too sensitive to their hyperparameters, the natural next step is Cosine Annealing, which provides a smooth nonlinear decay with a long high-LR plateau at the start and a slow tail near the minimum learning rate. For the full decision framework (when to add warm-up, when to consider restarts, when to keep things simple), see Choosing an LR scheduler.
8. Summary
Final takeaway
Step Decay and Exponential Decay embody the same fundamental idea of multiplicative learning-rate reduction. The real choice between them is between discrete milestone-based exploitation (Step Decay) and continuous smooth decay (Exponential Decay). The better schedule is the one whose transition pattern best matches the optimization dynamics of the training problem at hand; both are simple baselines against which any more elaborate scheduler should be compared.