Intro
Step Decay and Exponential Decay form two of the foundational paradigms upon which many modern learning rate scheduling policies are built.
Both rely on the principle of a systematic, multiplicative reduction of the learning rate. Their essential difference, however, lies in how this reduction unfolds over time.
- Step Decay enforces the decrease at fixed, discrete intervals, producing a piecewise-constant, “staircase-like” profile.
- Exponential Decay, by contrast, applies the reduction at every epoch, yielding a smooth, monotonically decreasing curve.
These contrasting dynamics embody alternative pathways to the same overarching objective:
Goal
Begin with a high learning rate to foster broad exploration of the parameter space, then gradually temper it to allow fine-grained exploitation and convergence without overshooting, following the principle of the exploration–exploitation trade-off.
Step Decay
It is a learning rate scheduling policy driven by the training epoch count.
Underlying principle
Specifically, it is a strategy that reduces the learning rate by a multiplicative factor $\gamma$ every `step_size` epochs.

From the graph above, it can be observed that:
- The learning rate starts from a base value $\eta_0$ (e.g., $\eta_0 = 0.1$).
- After each fixed interval (`step_size = 4` epochs in the figure), the learning rate is multiplied by the decay factor $\gamma$ (e.g., $\gamma = 0.5$).
- This produces a staircase-like profile, where each sharp drop is followed by a plateau of stability.
| Formulation | Equation | Variables & Notes |
|---|---|---|
| General (compact form) | $\eta_n = \eta_0 \cdot \gamma^{\lfloor n / r \rfloor}$ | $n$ = epoch (or iteration) index; $\eta_0$ = initial learning rate; $\gamma$ = decay factor ($0 < \gamma < 1$, e.g. $\gamma = 0.5$ halves the LR); $r$ = step size (number of epochs between updates); $\lfloor \cdot \rfloor$ = floor function (applies decay only at multiples of $r$). 👉 Directly links LR to the epoch index $n$, capturing both plateaus and drops. Example: if $\eta_0 = 0.1$, $\gamma = 0.5$, and $r = 4$, then at $n = 8$: $\eta_8 = 0.1 \cdot 0.5^{2} = 0.025$. |
| Equivalent step-wise | $\eta_k = \eta_0 \cdot \gamma^{k}$ | $k$ = index of the performed decay step (not the epoch number); $\gamma$ = multiplicative decay factor. 👉 More intuitive: tracks how many decay steps have occurred, not every epoch. Example: if $r = 4$ and epoch $n = 9$, then $k = \lfloor 9/4 \rfloor = 2$ → $\eta = \eta_0 \cdot \gamma^{2}$. |
Step-by-step derivation
The step-wise formulation can be derived by unfolding the recursive update $\eta_k = \gamma \cdot \eta_{k-1}$:

$$
\eta_1 = \gamma \, \eta_0, \qquad
\eta_2 = \gamma \, \eta_1 = \gamma^{2} \eta_0, \qquad
\dots, \qquad
\eta_k = \gamma \, \eta_{k-1} = \gamma^{k} \eta_0
$$

This shows how the recursive multiplication by $\gamma$ at each step naturally leads to the closed-form expression $\eta_k = \eta_0 \cdot \gamma^{k}$.
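As a quick sanity check, the closed-form expression can be evaluated directly in plain Python. The snippet below is a minimal sketch; the function name `step_decay_lr` and the values $\eta_0 = 0.1$, $\gamma = 0.5$, $r = 4$ are purely illustrative, not part of any library:

```python
import math

def step_decay_lr(eta0: float, gamma: float, step_size: int, epoch: int) -> float:
    """Closed-form Step Decay: eta_n = eta0 * gamma ** floor(n / r)."""
    return eta0 * gamma ** math.floor(epoch / step_size)

# Illustrative values: eta0 = 0.1, gamma = 0.5, step_size = 4
for n in range(12):
    print(f"epoch {n:2d} -> lr = {step_decay_lr(0.1, 0.5, 4, n):.4f}")

# Epochs 0-3 keep lr = 0.1000, epochs 4-7 drop to 0.0500,
# epochs 8-11 drop to 0.0250: the staircase profile.
```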
Info
This strategy is often combined with monitoring validation metrics, as in PyTorch's `ReduceLROnPlateau` scheduler, which reduces the learning rate only if performance does not improve.
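For completeness, here is a minimal sketch of how `ReduceLROnPlateau` is typically wired into a training loop. The dummy model, the placeholder validation loss, and the chosen `factor` and `patience` values are illustrative assumptions, not recommendations:

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Halve the LR if the monitored metric has not improved for 3 epochs
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=3)

for epoch in range(20):
    # ... training steps and optimizer.step() would go here ...
    val_loss = 1.0  # placeholder: replace with the real validation loss
    scheduler.step(val_loss)  # the scheduler reacts to the metric, not the epoch index
```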
A Note on Formulation Subtlety
It is worth noting that while the formula $\eta_k = \eta_0 \cdot \gamma^{k}$ is valid, some popular Deep Learning libraries (such as PyTorch's `StepLR`) use a slightly different convention, often equivalent to $\eta_n = \eta_0 \cdot \gamma^{\lfloor n/r \rfloor}$. The primary difference is the timing of the first decay: the `floor(n/r)` convention applies the first drop exactly when the epoch index $n$ reaches the step size $r$, so depending on whether epochs are counted from 0 or from 1 the drop may land one epoch earlier or later than expected. This distinction is important to keep in mind when translating theoretical models into practical code.
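To make the timing concrete, the exponent $\lfloor n/r \rfloor$ can simply be printed for the first few epoch indices. With a hypothetical $r = 4$, the exponent first becomes non-zero, and hence the first drop occurs, exactly when $n$ reaches $r$:

```python
r = 4  # hypothetical step size

for n in range(9):
    print(f"n = {n}: exponent floor(n/r) = {n // r}")

# n = 0..3 -> exponent 0 (no decay yet)
# n = 4    -> exponent 1 (first drop happens here)
# n = 8    -> exponent 2 (second drop)
```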
Hyperparameter Tuning Challenge
The choice of `step_size` ($r$) and `gamma` ($\gamma$) is a critical hyperparameter tuning task. A `step_size` that is too small or a `gamma` value that is too low can cause the learning rate to decay prematurely, potentially halting the learning process before the model converges to an optimal solution.
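The effect of a too-aggressive configuration is easy to quantify with the closed-form formula. The sketch below uses a hypothetical $\eta_0 = 0.1$, $\gamma = 0.5$ and two different step sizes, purely for illustration:

```python
eta0, gamma = 0.1, 0.5

for step_size in (2, 10):
    lr_at_50 = eta0 * gamma ** (50 // step_size)  # closed-form LR at epoch 50
    print(f"step_size = {step_size:2d}: lr at epoch 50 = {lr_at_50:.2e}")

# step_size = 10 -> lr ~ 3.1e-03 (still useful)
# step_size = 2  -> lr ~ 3.0e-09 (effectively zero: learning has stalled)
```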
Exponential Decay
It is a learning rate scheduling policy that decreases the learning rate at every epoch, yielding a smoother and more gradual decay compared to the staircase profile of Step Decay.
Underlying principle
The learning rate is multiplied by a factor $\gamma$ (with $0 < \gamma < 1$) at each epoch.

From the graph above, it can be observed that:
- The learning rate starts from a base value $\eta_0$ (e.g., $\eta_0 = 0.1$).
- At each epoch, the learning rate is multiplied by $\gamma$ (i.e., $\eta_{n+1} = \gamma \cdot \eta_n$).
- This produces a smooth, monotonically decreasing curve, unlike the staircase-like profile of Step Decay.
| Formulation | Equation | Variables & Notes |
|---|---|---|
| General (per-epoch form) | $\eta_n = \eta_0 \cdot \gamma^{n}$ | $n$ = epoch index; $\eta_0$ = initial learning rate; $\gamma$ = decay factor per epoch ($0 < \gamma < 1$, controls the speed of the decay). Most common in ML libraries. 👉 At each epoch, the LR is multiplied by $\gamma$. Example: if $\eta_0 = 0.1$ and $\gamma = 0.9$, then at $n = 2$: $\eta_2 = 0.1 \cdot 0.9^{2} = 0.081$. |
| Equivalent continuous form | $\eta(t) = \eta_0 \cdot e^{-k t}$ | $k$ = decay rate (a positive constant); $e$ = Euler's number ($\approx 2.718$). Equivalent to the discrete form if $\gamma = e^{-k}$ (or $k = -\ln \gamma$). 👉 Highlights the connection to natural decay processes. Example: if $\gamma = 0.9$, then $k = -\ln 0.9 \approx 0.105$, and at $t = 2$: $\eta(2) = 0.1 \cdot e^{-0.21} \approx 0.081$. |
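The equivalence between the two forms can be verified numerically. The snippet below is a small sketch using the example values $\eta_0 = 0.1$ and $\gamma = 0.9$ from the table above:

```python
import math

eta0, gamma = 0.1, 0.9
k = -math.log(gamma)  # decay rate of the continuous form, k = -ln(gamma) ~ 0.105

for n in range(1, 4):
    discrete = eta0 * gamma ** n          # eta_n = eta0 * gamma^n
    continuous = eta0 * math.exp(-k * n)  # eta(t) = eta0 * e^(-k t), evaluated at t = n
    print(f"n = {n}: discrete = {discrete:.4f}, continuous = {continuous:.4f}")

# Both columns agree (0.0900, 0.0810, 0.0729), since gamma = e^(-k).
```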
Note
Exponential Decay is particularly useful when a progressive and gradual reduction is preferred, avoiding the abrupt drops typical of Step Decay.
Hyperparameter Tuning Challenge
Tuning the decay factor `gamma` ($\gamma$) is crucial. An overly aggressive decay (a low `gamma` value) can diminish the learning rate too quickly, leaving updates too small for the model to make meaningful progress in the later stages of training.
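A quick back-of-the-envelope comparison makes this sensitivity visible; the base value $\eta_0 = 0.1$ and the horizon of 50 epochs below are hypothetical:

```python
eta0, epochs = 0.1, 50

for gamma in (0.99, 0.9, 0.5):
    print(f"gamma = {gamma}: lr at epoch {epochs} = {eta0 * gamma ** epochs:.2e}")

# gamma = 0.99 -> ~6.1e-02 (gentle decay)
# gamma = 0.9  -> ~5.2e-04 (moderate)
# gamma = 0.5  -> ~8.9e-17 (the LR has vanished long before training ends)
```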
Implementation with PyTorch
Here is how to implement these schedulers using PyTorch's `torch.optim.lr_scheduler` module.
Step Decay (StepLR)
This scheduler decays the learning rate of each parameter group by `gamma` every `step_size` epochs.
```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Dummy model and optimizer
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)  # Initial LR = 0.1

# Scheduler: halve the learning rate every 10 epochs
# gamma = 0.5, step_size = 10
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

print(f"Initial LR: {optimizer.param_groups[0]['lr']:.4f}")

# Simulate training loop for 30 epochs
for epoch in range(1, 31):
    # Training steps would go here
    # optimizer.step()

    # Update the learning rate
    scheduler.step()

    if epoch % 5 == 0:
        print(f"Epoch {epoch}: Current LR = {optimizer.param_groups[0]['lr']:.4f}")

# Expected output:
# Initial LR: 0.1000
# Epoch 5: Current LR = 0.1000
# Epoch 10: Current LR = 0.0500 <- First drop
# Epoch 15: Current LR = 0.0500
# Epoch 20: Current LR = 0.0250 <- Second drop
# Epoch 25: Current LR = 0.0250
# Epoch 30: Current LR = 0.0125 <- Third drop
```

Exponential Decay (ExponentialLR)
This scheduler decays the learning rate of each parameter group by `gamma` at every single epoch.
```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

# Dummy model and optimizer
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)  # Initial LR = 0.1

# Scheduler: reduce LR by 10% each epoch (gamma = 0.9)
scheduler = ExponentialLR(optimizer, gamma=0.9)

print(f"Initial LR: {optimizer.param_groups[0]['lr']:.4f}")

# Simulate training loop for 5 epochs
for epoch in range(1, 6):
    # Training steps...
    # optimizer.step()

    # Update the learning rate
    scheduler.step()

    print(f"Epoch {epoch}: Current LR = {optimizer.param_groups[0]['lr']:.4f}")

# Expected output:
# Initial LR: 0.1000
# Epoch 1: Current LR = 0.0900 (0.1 * 0.9)
# Epoch 2: Current LR = 0.0810 (0.09 * 0.9)
# Epoch 3: Current LR = 0.0729 (0.081 * 0.9)
# Epoch 4: Current LR = 0.0656 (0.0729 * 0.9)
# Epoch 5: Current LR = 0.0590 (0.0656 * 0.9)
```

Step vs Exponential Decay
| Aspect | Step Decay | Exponential Decay |
|---|---|---|
| Decay Profile | Piecewise-constant, staircase-like decay | Smooth, monotonically decreasing curve |
| Update Frequency | Applied every step_size epochs | Applied at every epoch |
| Decay Dynamics | More aggressive at the drop, yet stable between steps | More gentle, ideal for a progressive reduction without abrupt jumps |
| Key Parameters | `step_size` ($r$), `gamma` ($\gamma$) | `gamma` ($\gamma$), or the decay rate $k$ in the continuous form |