1. Introduction
Step Decay and Exponential Decay are two of the foundational learning-rate scheduling paradigms in Deep Learning.
Both implement the same broad idea:
- begin with a learning rate large enough to support meaningful exploration,
- then reduce it over time,
- so that later training proceeds with smaller, more controlled updates.
Their essential difference lies in how the reduction is distributed over time:
- Step Decay changes the learning rate at discrete milestones, producing a piecewise-constant profile;
- Exponential Decay reduces it continuously in scheduler time, producing a smooth monotone curve.
Common objective
The purpose of both schedules is the same:
- maintain sufficiently large updates during the early phase of training,
- then progressively reduce the step size,
- so that late-stage optimization emphasizes exploitation rather than coarse exploration.
Note
These two schedules are not competing mathematical curiosities. They represent two distinct ways of answering the same practical question: should the learning rate decrease abruptly at specific times, or continuously at every step?
2. A useful conceptual distinction
It is helpful to separate three ideas:
- the learning-rate value itself,
- the time unit used by the scheduler,
- the shape of the decay.
For both Step Decay and Exponential Decay:
- the time unit may be epochs or iterations, depending on how often `scheduler.step()` is called;
- the schedule is correct only if the scheduler parameters are interpreted in that same time unit.
Time-unit consistency
If a scheduler is stepped once per epoch, its hyperparameters count epochs. If it is stepped once per iteration, its hyperparameters count iterations. Many apparent implementation mistakes are actually time-scale mismatches.
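A plain-Python sketch makes the mismatch concrete: the same decay factor produces very different schedules depending on the stepping unit (the iterations-per-epoch count below is a made-up value for illustration).

```python
# Illustrative sketch in plain Python: the same gamma yields very different
# schedules depending on the unit in which the scheduler is stepped.
# The iterations-per-epoch count below is a made-up value for illustration.
gamma = 0.9
base_lr = 0.1
iters_per_epoch = 100
epochs = 5

# Stepped once per epoch: gamma is applied 5 times.
lr_per_epoch = base_lr * gamma ** epochs

# Stepped once per iteration: gamma is applied 500 times.
lr_per_iter = base_lr * gamma ** (epochs * iters_per_epoch)

print(f"stepped per epoch:     {lr_per_epoch:.6f}")
print(f"stepped per iteration: {lr_per_iter:.3e}")
```

A gamma tuned for per-epoch stepping is catastrophically aggressive when applied per iteration, which is exactly the kind of mismatch described above.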
3. Step Decay
3.1 Definition
Step Decay is a learning-rate schedule in which the learning rate is multiplied by a factor every fixed number of scheduler steps.
Underlying principle
Let $s$ denote the interval between two consecutive drops. Then Step Decay reduces the learning rate by a multiplicative factor $\gamma$ every $s$ scheduler steps.
This produces a staircase-like profile:
- the learning rate remains constant on each plateau,
- then drops abruptly,
- then remains constant again until the next decay event.
From the graph above:
- the learning rate starts from a base value $\eta_0$;
- after each interval of length `step_size`, it is multiplied by $\gamma$;
- the result is a sequence of plateaus separated by sharp drops.
3.2 Closed-form description
Let:
- $\eta_0$ be the initial learning rate,
- $\gamma$ be the multiplicative decay factor,
- $s$ be the step interval (`step_size`),
- $t$ be the scheduler step index.

Then a compact description is:

$$\eta_t = \eta_0 \cdot \gamma^{\left\lfloor t / s \right\rfloor}$$
This formula says:
- no decay occurs until $t$ reaches the first threshold $s$,
- after one threshold crossing, the exponent becomes $1$,
- after two threshold crossings, it becomes $2$,
- and so on.
3.3 Step-index formulation
An equivalent viewpoint is to count not scheduler time directly, but the number of decays already performed.
If

$$k = \left\lfloor \frac{t}{s} \right\rfloor,$$

then

$$\eta^{(k)} = \eta_0 \cdot \gamma^{k}.$$
This representation is often more intuitive: it tracks how many drops have occurred, not every scheduler step individually.
| Formulation | Equation | Meaning |
|---|---|---|
| Compact time-index form | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$ | Expresses the learning rate directly as a function of scheduler time |
| Step-index form | $\eta^{(k)} = \eta_0 \cdot \gamma^{k}$ | Expresses the learning rate as a function of the number of drops already performed |
Step-by-step derivation
The step-index form follows immediately from repeated multiplication:

$$\eta^{(k)} = \gamma \cdot \eta^{(k-1)} = \gamma^{2} \cdot \eta^{(k-2)} = \dots = \gamma^{k} \cdot \eta_0.$$

The floor term simply determines when the index $k$ should increase.
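The equivalence of the two viewpoints can be checked numerically with a minimal Python sketch (the values of $\eta_0$, $\gamma$, and $s$ are illustrative):

```python
# Minimal sketch (illustrative values): the closed form eta0 * gamma**(t // s)
# agrees with applying the factor sequentially at every s-th scheduler step.
eta0, gamma, s = 0.1, 0.5, 10

def step_decay(t):
    # Closed-form Step Decay: eta0 * gamma^floor(t / s)
    return eta0 * gamma ** (t // s)

lr = eta0
for t in range(1, 31):
    if t % s == 0:  # a decay event every s steps
        lr *= gamma
    assert abs(lr - step_decay(t)) < 1e-12

print([step_decay(t) for t in (0, 9, 10, 19, 20, 30)])
```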
3.4 Interpretation
Step Decay imposes explicit transitions between optimization regimes.
Before a drop:
- the optimizer runs with a fixed learning rate,
- preserving a stable exploration or descent regime.
After a drop:
- all subsequent updates become uniformly smaller,
- the optimization switches to a more conservative phase.
What Step Decay makes explicit
Step Decay is useful because it is simple and interpretable:
- long plateaus allow a stable optimization regime,
- each drop introduces a deliberate exploitation phase,
- the schedule is easy to visualize and tune.
3.5 Practical strengths
Step Decay strengths
Step Decay is often a strong choice when:
- a small number of explicit regime changes is desired,
- prior experiments already suggest good milestone locations,
- one wants a schedule that is easy to tune manually,
- abrupt but controlled reductions are acceptable.
3.6 Main weakness
Its main weakness is the abruptness of the transition.
At each milestone:
- the learning rate changes discontinuously,
- the optimization dynamics can change suddenly,
- training may become sensitive to the exact placement of the milestones.
Hyperparameter tuning challenge
The choice of `step_size` and $\gamma$ is critical. If `step_size` is too small, or if $\gamma$ is too low, the learning rate may collapse too early and learning can slow down prematurely.
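A small illustrative computation (made-up values, plain Python) makes the early-collapse risk concrete:

```python
# Hedged illustration (made-up values): an aggressive configuration
# (small step_size with a strong gamma) collapses the learning rate early.
eta0 = 0.1

def lr_at(t, step_size, gamma):
    # Closed-form Step Decay value at scheduler step t
    return eta0 * gamma ** (t // step_size)

# Moderate schedule (drop every 10 steps) vs aggressive (every 2 steps):
for t in (10, 20, 30):
    print(t, lr_at(t, 10, 0.5), f"{lr_at(t, 2, 0.5):.2e}")
```

After 30 steps the moderate schedule is still at a usable learning rate, while the aggressive one has decayed by roughly five orders of magnitude.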
3.7 Formula subtlety and library conventions
Different texts and implementations may write slightly different formulas, such as

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor} \quad \text{versus} \quad \eta_t = \eta_0 \cdot \gamma^{\lfloor (t-1) / s \rfloor}.$$

The difference is not conceptual. It concerns when the first drop is considered to happen.
Convention subtlety
The two formulas differ only by an indexing convention. The important point is not which floor formula is “philosophically correct”, but that the convention used in the mathematics matches the convention used in code.
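As a sketch, two common indexing conventions, $\lfloor t/s \rfloor$ and $\lfloor (t-1)/s \rfloor$, differ only in when the first drop fires (here $s = 10$, steps counted from $t = 1$):

```python
# Sketch of the indexing subtlety: floor(t/s) versus floor((t-1)/s) differ
# only in when the first drop fires (here s = 10, steps counted from t = 1).
s = 10

first_drop_a = next(t for t in range(1, 100) if t // s >= 1)
first_drop_b = next(t for t in range(1, 100) if (t - 1) // s >= 1)

print(first_drop_a, first_drop_b)  # the drop shifts by exactly one step
```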
Related but distinct idea
Step Decay should not be confused with `ReduceLROnPlateau`. The latter is metric-driven, not purely time-driven: it reduces the learning rate when validation performance stops improving, rather than at predetermined milestones.
4. Exponential Decay
4.1 Definition
Exponential Decay is a learning-rate schedule in which the learning rate is multiplied by the same factor at every scheduler step.
Underlying principle
At each scheduler step, the learning rate is updated as

$$\eta_{t+1} = \gamma \cdot \eta_t.$$
Unlike Step Decay, this produces a smooth monotone decrease rather than a staircase profile.
From the graph above:
- the learning rate starts from a base value $\eta_0$;
- at every scheduler step, it is multiplied by $\gamma$;
- the resulting curve decreases smoothly and monotonically.
4.2 Closed-form expression
If $\eta_0$ is the initial learning rate and $t$ is the scheduler step index, then

$$\eta_t = \eta_0 \cdot \gamma^{t}.$$
This is the direct closed form of repeated multiplicative decay.
4.3 Continuous exponential form
The same schedule can be written as

$$\eta_t = \eta_0 \cdot e^{-\lambda t},$$

where

$$\lambda = -\ln \gamma.$$
This representation is useful because it reveals the connection with continuous exponential decay processes.
| Formulation | Equation | Meaning |
|---|---|---|
| Discrete multiplicative form | $\eta_t = \eta_0 \cdot \gamma^{t}$ | Natural form for scheduler implementation |
| Continuous exponential form | $\eta_t = \eta_0 \cdot e^{-\lambda t}$, with $\lambda = -\ln \gamma$ | Highlights the relation with exponential decay and decay-rate parameterization |
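The agreement of the discrete and continuous forms can be verified numerically (illustrative values):

```python
import math

# Sketch (illustrative values): the discrete form eta0 * gamma**t and the
# continuous form eta0 * exp(-lambda * t), lambda = -ln(gamma), coincide.
eta0, gamma = 0.1, 0.9
lam = -math.log(gamma)

for t in range(6):
    discrete = eta0 * gamma ** t
    continuous = eta0 * math.exp(-lam * t)
    assert math.isclose(discrete, continuous)

print(f"lambda = {lam:.6f}")
```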
4.4 Interpretation
Exponential Decay changes the learning rate at every scheduler step. This means:
- no explicit plateaus,
- no abrupt milestones,
- a continuously shrinking update scale.
Why Exponential Decay is often useful
Exponential Decay is often preferred when a progressive and gradual reduction is desired. It avoids the discontinuities of Step Decay while remaining mathematically simple.
4.5 Practical strengths
Exponential Decay strengths
Exponential Decay is often useful when:
- a smooth schedule is preferred,
- abrupt drops are undesirable,
- the learning rate should decrease steadily throughout training,
- one wants a one-parameter decay mechanism controlled mainly by $\gamma$.
4.6 Main weakness
Its main weakness is that it offers less explicit control over regime changes.
With Step Decay, one can say:
- “keep the learning rate fixed until this milestone, then drop it.”
With Exponential Decay:
- the reduction is always active,
- so the schedule may become too conservative if $\gamma$ is chosen too small.
Hyperparameter tuning challenge
The decay factor $\gamma$ is crucial. If it is too small, the learning rate may decay too quickly and exploration may end too early.
5. Relationship between the two
Step Decay and Exponential Decay are closely related.
Both are multiplicative schedules; they differ in the frequency with which the multiplicative reduction is applied.
- In Step Decay, the factor $\gamma$ is applied only every $s$ steps.
- In Exponential Decay, the factor $\gamma$ is applied at every step.
This means that Step Decay can be understood as a coarsened multiplicative schedule: the same idea of geometric decay is present, but the decay is concentrated into explicit jumps rather than spread continuously over time.
Insight
A useful way to think about the difference is:
- Step Decay changes the learning rate in regime shifts,
- Exponential Decay changes it in small continuous contractions.
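The "coarsened multiplicative schedule" view can be checked numerically: an exponential schedule whose per-step factor is $\gamma^{1/s}$ coincides with Step Decay exactly at the drop boundaries (values below are illustrative).

```python
# Sketch of the "coarsened multiplicative schedule" view (illustrative
# values): an exponential schedule with per-step factor gamma**(1/s)
# matches Step Decay (factor gamma every s steps) at each drop boundary.
eta0, gamma, s = 0.1, 0.5, 10
per_step = gamma ** (1 / s)  # spreads one drop smoothly over s steps

for t in (0, 10, 20, 30):    # multiples of s, i.e. the drop boundaries
    step_lr = eta0 * gamma ** (t // s)
    expo_lr = eta0 * per_step ** t
    assert abs(step_lr - expo_lr) < 1e-9

print(f"per-step factor: {per_step:.6f}")
```

Between boundaries the two schedules differ: Step Decay holds a plateau while the exponential curve glides through the same total reduction.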
6. Hyperparameter tuning
One of the most useful practical questions is the following:
How can $\gamma$ be chosen if the desired final learning rate is known in advance?
6.1 Exponential Decay
Suppose training runs for $T$ scheduler steps and the goal is to move from $\eta_0$ to a target final value $\eta_T$. Since

$$\eta_T = \eta_0 \cdot \gamma^{T},$$

the appropriate decay factor is

$$\gamma = \left( \frac{\eta_T}{\eta_0} \right)^{1/T}.$$
This gives a direct way to calibrate Exponential Decay from the desired endpoint of training.
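As a short sketch with made-up target values:

```python
# Sketch with made-up target values: solve for gamma given the desired
# final learning rate eta_T after T scheduler steps.
eta0 = 0.1     # initial learning rate
eta_T = 0.001  # hypothetical target final learning rate
T = 100        # hypothetical total number of scheduler steps

gamma = (eta_T / eta0) ** (1 / T)  # gamma = (eta_T / eta0)^(1/T)
final_lr = eta0 * gamma ** T

print(f"gamma = {gamma:.6f}, final LR = {final_lr:.6f}")
```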
6.2 Step Decay
Suppose instead that $K$ decay events will occur during training. Since after $K$ drops

$$\eta_{\text{final}} = \eta_0 \cdot \gamma^{K},$$

the corresponding factor is

$$\gamma = \left( \frac{\eta_{\text{final}}}{\eta_0} \right)^{1/K}.$$

The difference is that $K$ now counts drop events, not every training step. In practice, this means that $K$ is determined by:
- the total training horizon,
- the chosen `step_size`,
- and the time unit used for scheduler stepping.
For example, if training lasts $T$ scheduler steps and Step Decay drops the learning rate every $s$ steps, then the number of decay events is approximately

$$K \approx \left\lfloor \frac{T}{s} \right\rfloor.$$
If `scheduler.step()` is called once per epoch, then $T$ and $s$ are epoch counts.
If it is called once per iteration, then both are measured in iterations.
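Putting the pieces together in a short sketch (horizon and target values are hypothetical):

```python
# Sketch (hypothetical horizon and target): calibrate Step Decay's gamma
# from the number of decay events K rather than from every step.
eta0 = 0.1
eta_final = 0.001  # hypothetical target final learning rate
T = 100            # total scheduler steps
s = 10             # step_size: one decay event every s steps

K = T // s                             # number of decay events
gamma = (eta_final / eta0) ** (1 / K)  # gamma = (eta_final / eta0)^(1/K)
final_lr = eta0 * gamma ** K

print(f"K = {K}, gamma = {gamma:.4f}, final LR = {final_lr:.6f}")
```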
Tip
This target-final-LR viewpoint is often easier to reason about than tuning $\gamma$ directly by trial and error. It converts scheduler design into a more interpretable problem:
- choose the initial LR,
- choose the desired final LR,
- choose the training horizon,
- then solve for $\gamma$.
7. PyTorch implementation
Both schedules are implemented in torch.optim.lr_scheduler.
Call order
In standard PyTorch usage, `scheduler.step()` should be called after `optimizer.step()`.
Time-unit interpretation
In PyTorch, `step_size` and the effective scheduler step index are measured in the unit in which `scheduler.step()` is called. If stepping is done once per epoch, they count epochs. If stepping is done once per iteration, they count iterations.
7.1 Step Decay (StepLR)
`StepLR` multiplies the learning rate of each parameter group by `gamma` every `step_size` scheduler steps.
```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Dummy model and optimizer
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 10 scheduler steps
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

print(f"Initial LR: {optimizer.param_groups[0]['lr']:.4f}")

for epoch in range(1, 31):
    # train(...)
    optimizer.step()
    scheduler.step()
    if epoch % 5 == 0:
        print(f"Epoch {epoch}: Current LR = {optimizer.param_groups[0]['lr']:.4f}")

# Expected pattern:
# Epoch 1-9:   0.1000
# Epoch 10-19: 0.0500
# Epoch 20-29: 0.0250
# Epoch 30+:   0.0125
```

7.2 Exponential Decay (ExponentialLR)
`ExponentialLR` multiplies the learning rate of each parameter group by `gamma` at every scheduler step.
```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

# Dummy model and optimizer
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Reduce LR by 10% each scheduler step
scheduler = ExponentialLR(optimizer, gamma=0.9)

print(f"Initial LR: {optimizer.param_groups[0]['lr']:.4f}")

for epoch in range(1, 6):
    # train(...)
    optimizer.step()
    scheduler.step()
    print(f"Epoch {epoch}: Current LR = {optimizer.param_groups[0]['lr']:.4f}")

# Expected pattern:
# Epoch 1: 0.0900
# Epoch 2: 0.0810
# Epoch 3: 0.0729
# Epoch 4: 0.0656
# Epoch 5: 0.0590
```

Note
The code above is epoch-based only because `scheduler.step()` is called once per epoch. If it were called once per batch, the same schedulers would become iteration-based.
8. Step vs Exponential Decay
| Aspect | Step Decay | Exponential Decay |
|---|---|---|
| Decay profile | Piecewise-constant, staircase-like | Smooth, monotonically decreasing |
| Update frequency | Every `step_size` scheduler steps | Every scheduler step |
| Optimization dynamics | Stable plateaus separated by abrupt regime changes | Gradual continuous contraction |
| Main hyperparameters | `step_size`, $\gamma$ | $\gamma$ |
| Main strength | Explicit control over milestone-based exploitation | Smooth reduction without discontinuities |
| Main weakness | Sensitivity to milestone placement and abrupt drops | Continuous decay may become too aggressive if poorly calibrated |
9. Practical guidance
When Step Decay is a good choice
Step Decay is often preferable when:
- good milestone locations are already known,
- explicit regime changes are desired,
- interpretability and manual control matter,
- a strong simple baseline is needed.
When Exponential Decay is a good choice
Exponential Decay is often preferable when:
- a smoother decay is desired,
- abrupt learning-rate drops are undesirable,
- one wants a continuously shrinking schedule,
- the training dynamics benefit from gradual rather than discrete changes.
When neither is enough
If both schedules feel too rigid or too sensitive, this often suggests moving to a more flexible family such as Cosine Annealing.
10. Conclusion
Final takeaway
Step Decay and Exponential Decay embody the same fundamental idea of multiplicative learning-rate reduction. The real choice between them is a choice between:
- discrete milestone-based exploitation, and
- continuous smooth decay.
The better schedule is the one whose transition pattern best matches the optimization dynamics of the training problem at hand.