1. Introduction
Step Decay and Exponential Decay are two of the foundational learning-rate scheduling paradigms in Deep Learning.
Both implement the same broad idea:
- begin with a learning rate large enough to support meaningful exploration,
- then reduce it over time,
- so that later training proceeds with smaller, more controlled updates.
Their essential difference lies in how the reduction is distributed over time:
- Step Decay changes the learning rate at discrete milestones, producing a piecewise-constant profile;
- Exponential Decay reduces it continuously in scheduler time, producing a smooth monotone curve.
Common objective
The purpose of both schedules is the same:
- maintain sufficiently large updates during the early phase of training,
- then progressively reduce the step size,
- so that late-stage optimization emphasizes exploitation rather than coarse exploration.
Note
These two schedules are not competing mathematical curiosities. They represent two distinct ways of answering the same practical question: should the learning rate decrease abruptly at specific times, or continuously at every step?
2. A useful conceptual distinction
It is helpful to separate three ideas:
- the learning-rate value itself,
- the time unit used by the scheduler,
- the shape of the decay.
For both Step Decay and Exponential Decay:
- the time unit may be epochs or iterations, depending on how often `scheduler.step()` is called;
- the schedule is correct only if the scheduler parameters are interpreted in that same time unit.
Time-unit consistency
If a scheduler is stepped once per epoch, its hyperparameters count epochs. If it is stepped once per iteration, its hyperparameters count iterations. Many apparent implementation mistakes are actually time-scale mismatches.
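A plain-Python sketch makes the mismatch concrete: the same decay factor produces very different schedules depending on the stepping unit (the iterations-per-epoch count below is a made-up value for illustration).

```python
# Illustrative sketch in plain Python: the same gamma yields very different
# schedules depending on the unit in which the scheduler is stepped.
# The iterations-per-epoch count below is a made-up value for illustration.
gamma = 0.9
base_lr = 0.1
iters_per_epoch = 100
epochs = 5

# Stepped once per epoch: gamma is applied 5 times.
lr_per_epoch = base_lr * gamma ** epochs

# Stepped once per iteration: gamma is applied 500 times.
lr_per_iter = base_lr * gamma ** (epochs * iters_per_epoch)

print(f"stepped per epoch:     {lr_per_epoch:.6f}")
print(f"stepped per iteration: {lr_per_iter:.3e}")
```

A gamma tuned for per-epoch stepping is catastrophically aggressive when applied per iteration, which is exactly the kind of mismatch described above.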
3. Step Decay
3.1 Definition
Step Decay is a learning-rate schedule in which the learning rate is multiplied by a factor every fixed number of scheduler steps.
Underlying principle
Let $s$ denote the interval between two consecutive drops. Then Step Decay reduces the learning rate by a multiplicative factor $\gamma$ every $s$ scheduler steps.
This produces a staircase-like profile:
- the learning rate remains constant on each plateau,
- then drops abruptly,
- then remains constant again until the next decay event.
From the graph above:
- the learning rate starts from a base value $\eta_0$;
- after each interval of length `step_size`, it is multiplied by $\gamma$;
- the result is a sequence of plateaus separated by sharp drops.
3.2 Closed-form description
Let:
- $\eta_0$ be the initial learning rate,
- $\gamma$ be the multiplicative decay factor,
- $s$ be the step interval (`step_size`),
- $t$ be the scheduler step index.

Then a compact description is:

$$\eta_t = \eta_0 \cdot \gamma^{\left\lfloor t / s \right\rfloor}$$
This formula says:
- no decay occurs until $t$ reaches the first threshold $s$,
- after one threshold crossing, the exponent becomes $1$,
- after two threshold crossings, it becomes $2$,
- and so on.
3.3 Step-index formulation
An equivalent viewpoint is to count not scheduler time directly, but the number of decays already performed.
If

$$k = \left\lfloor \frac{t}{s} \right\rfloor,$$

then

$$\eta^{(k)} = \eta_0 \cdot \gamma^{k}.$$
This representation is often more intuitive: it tracks how many drops have occurred, not every scheduler step individually.
| Formulation | Equation | Meaning |
|---|---|---|
| Compact time-index form | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$ | Expresses the learning rate directly as a function of scheduler time |
| Step-index form | $\eta^{(k)} = \eta_0 \cdot \gamma^{k}$ | Expresses the learning rate as a function of the number of drops already performed |
Step-by-step derivation
The step-index form follows immediately from repeated multiplication:

$$\eta^{(k)} = \gamma \cdot \eta^{(k-1)} = \gamma^{2} \cdot \eta^{(k-2)} = \dots = \gamma^{k} \cdot \eta_0.$$

The floor term simply determines when the index $k$ should increase.
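The equivalence of the two viewpoints can be checked numerically with a minimal Python sketch (the values of $\eta_0$, $\gamma$, and $s$ are illustrative):

```python
# Minimal sketch (illustrative values): the closed form eta0 * gamma**(t // s)
# agrees with applying the factor sequentially at every s-th scheduler step.
eta0, gamma, s = 0.1, 0.5, 10

def step_decay(t):
    # Closed-form Step Decay: eta0 * gamma^floor(t / s)
    return eta0 * gamma ** (t // s)

lr = eta0
for t in range(1, 31):
    if t % s == 0:  # a decay event every s steps
        lr *= gamma
    assert abs(lr - step_decay(t)) < 1e-12

print([step_decay(t) for t in (0, 9, 10, 19, 20, 30)])
```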
3.4 Interpretation
Step Decay imposes explicit transitions between optimization regimes.
Before a drop:
- the optimizer runs with a fixed learning rate,
- preserving a stable exploration or descent regime.
After a drop:
- all subsequent updates become uniformly smaller,
- the optimization switches to a more conservative phase.
What Step Decay makes explicit
Step Decay is useful because it is simple and interpretable:
- long plateaus allow a stable optimization regime,
- each drop introduces a deliberate exploitation phase,
- the schedule is easy to visualize and tune.
3.5 Practical strengths
Step Decay strengths
Step Decay is often a strong choice when:
- a small number of explicit regime changes is desired,
- prior experiments already suggest good milestone locations,
- one wants a schedule that is easy to tune manually,
- abrupt but controlled reductions are acceptable.
3.6 Main weakness
Its main weakness is the abruptness of the transition.
At each milestone:
- the learning rate changes discontinuously,
- the optimization dynamics can change suddenly,
- training may become sensitive to the exact placement of the milestones.
Hyperparameter tuning challenge
The choice of `step_size` and $\gamma$ is critical. If `step_size` is too small, or if $\gamma$ is too low, the learning rate may collapse too early and learning can slow down prematurely.
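A small illustrative computation (made-up values, plain Python) makes the early-collapse risk concrete:

```python
# Hedged illustration (made-up values): an aggressive configuration
# (small step_size with a strong gamma) collapses the learning rate early.
eta0 = 0.1

def lr_at(t, step_size, gamma):
    # Closed-form Step Decay value at scheduler step t
    return eta0 * gamma ** (t // step_size)

# Moderate schedule (drop every 10 steps) vs aggressive (every 2 steps):
for t in (10, 20, 30):
    print(t, lr_at(t, 10, 0.5), f"{lr_at(t, 2, 0.5):.2e}")
```

After 30 steps the moderate schedule is still at a usable learning rate, while the aggressive one has decayed by roughly five orders of magnitude.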
3.7 Formula subtlety and library conventions
Different texts and implementations may write slightly different formulas, such as

$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / s \rfloor} \quad \text{versus} \quad \eta_t = \eta_0 \cdot \gamma^{\lfloor (t-1) / s \rfloor}.$$

The difference is not conceptual. It concerns when the first drop is considered to happen.
Convention subtlety
The two formulas differ only by an indexing convention. The important point is not which floor formula is “philosophically correct”, but that the convention used in the mathematics matches the convention used in code.
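As a sketch, two common indexing conventions, $\lfloor t/s \rfloor$ and $\lfloor (t-1)/s \rfloor$, differ only in when the first drop fires (here $s = 10$, steps counted from $t = 1$):

```python
# Sketch of the indexing subtlety: floor(t/s) versus floor((t-1)/s) differ
# only in when the first drop fires (here s = 10, steps counted from t = 1).
s = 10

first_drop_a = next(t for t in range(1, 100) if t // s >= 1)
first_drop_b = next(t for t in range(1, 100) if (t - 1) // s >= 1)

print(first_drop_a, first_drop_b)  # the drop shifts by exactly one step
```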
Related but distinct idea
Step Decay should not be confused with `ReduceLROnPlateau`. The latter is metric-driven, not purely time-driven: it reduces the learning rate when validation performance stops improving, rather than at predetermined milestones.
4. Exponential Decay
4.1 Definition
Exponential Decay is a learning-rate schedule in which the learning rate is multiplied by the same factor at every scheduler step.
Underlying principle
At each scheduler step, the learning rate is updated as

$$\eta_{t+1} = \gamma \cdot \eta_t.$$
Unlike Step Decay, this produces a smooth monotone decrease rather than a staircase profile.
From the graph above:
- the learning rate starts from a base value $\eta_0$;
- at every scheduler step, it is multiplied by $\gamma$;
- the resulting curve decreases smoothly and monotonically.
4.2 Closed-form expression
If $\eta_0$ is the initial learning rate and $t$ is the scheduler step index, then

$$\eta_t = \eta_0 \cdot \gamma^{t}.$$
This is the direct closed form of repeated multiplicative decay.
4.3 Continuous exponential form
The same schedule can be written as

$$\eta_t = \eta_0 \cdot e^{-\lambda t},$$

where

$$\lambda = -\ln \gamma.$$
This representation is useful because it reveals the connection with continuous exponential decay processes.
| Formulation | Equation | Meaning |
|---|---|---|
| Discrete multiplicative form | $\eta_t = \eta_0 \cdot \gamma^{t}$ | Natural form for scheduler implementation |
| Continuous exponential form | $\eta_t = \eta_0 \cdot e^{-\lambda t}$, with $\lambda = -\ln \gamma$ | Highlights the relation with exponential decay and decay-rate parameterization |
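The agreement of the discrete and continuous forms can be verified numerically (illustrative values):

```python
import math

# Sketch (illustrative values): the discrete form eta0 * gamma**t and the
# continuous form eta0 * exp(-lambda * t), lambda = -ln(gamma), coincide.
eta0, gamma = 0.1, 0.9
lam = -math.log(gamma)

for t in range(6):
    discrete = eta0 * gamma ** t
    continuous = eta0 * math.exp(-lam * t)
    assert math.isclose(discrete, continuous)

print(f"lambda = {lam:.6f}")
```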
4.4 Interpretation
Exponential Decay changes the learning rate at every scheduler step. This means:
- no explicit plateaus,
- no abrupt milestones,
- a continuously shrinking update scale.
Why Exponential Decay is often useful
Exponential Decay is often preferred when a progressive and gradual reduction is desired. It avoids the discontinuities of Step Decay while remaining mathematically simple.
4.5 Practical strengths
Exponential Decay strengths
Exponential Decay is often useful when:
- a smooth schedule is preferred,
- abrupt drops are undesirable,
- the learning rate should decrease steadily throughout training,
- one wants a one-parameter decay mechanism controlled mainly by $\gamma$.
4.6 Main weakness
Its main weakness is that it offers less explicit control over regime changes.
With Step Decay, one can say:
- “keep the learning rate fixed until this milestone, then drop it.”
With Exponential Decay:
- the reduction is always active,
- so the schedule may become too conservative if $\gamma$ is chosen too small.
Hyperparameter tuning challenge
The decay factor $\gamma$ is crucial. If it is too small, the learning rate may decay too quickly and exploration may end too early.
5. Relationship between the two
Step Decay and Exponential Decay are closely related.
Both are multiplicative schedules; they differ in the frequency with which the multiplicative reduction is applied.
- In Step Decay, the factor $\gamma$ is applied only every $s$ steps.
- In Exponential Decay, the factor $\gamma$ is applied at every step.
This means that Step Decay can be understood as a coarsened multiplicative schedule: the same idea of geometric decay is present, but the decay is concentrated into explicit jumps rather than spread continuously over time.
Insight
A useful way to think about the difference is:
- Step Decay changes the learning rate in regime shifts,
- Exponential Decay changes it in small continuous contractions.
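The "coarsened multiplicative schedule" view can be checked numerically: an exponential schedule whose per-step factor is $\gamma^{1/s}$ coincides with Step Decay exactly at the drop boundaries (values below are illustrative).

```python
# Sketch of the "coarsened multiplicative schedule" view (illustrative
# values): an exponential schedule with per-step factor gamma**(1/s)
# matches Step Decay (factor gamma every s steps) at each drop boundary.
eta0, gamma, s = 0.1, 0.5, 10
per_step = gamma ** (1 / s)  # spreads one drop smoothly over s steps

for t in (0, 10, 20, 30):    # multiples of s, i.e. the drop boundaries
    step_lr = eta0 * gamma ** (t // s)
    expo_lr = eta0 * per_step ** t
    assert abs(step_lr - expo_lr) < 1e-9

print(f"per-step factor: {per_step:.6f}")
```

Between boundaries the two schedules differ: Step Decay holds a plateau while the exponential curve glides through the same total reduction.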
6. Hyperparameter tuning
One of the most useful practical questions is the following:
How can $\gamma$ be chosen if the desired final learning rate is known in advance?
6.1 Exponential Decay
Suppose training runs for $T$ scheduler steps and the goal is to move from $\eta_0$ to a target final value $\eta_T$. Since

$$\eta_T = \eta_0 \cdot \gamma^{T},$$

the appropriate decay factor is

$$\gamma = \left( \frac{\eta_T}{\eta_0} \right)^{1/T}.$$
This gives a direct way to calibrate Exponential Decay from the desired endpoint of training.
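As a short sketch with made-up target values:

```python
# Sketch with made-up target values: solve for gamma given the desired
# final learning rate eta_T after T scheduler steps.
eta0 = 0.1     # initial learning rate
eta_T = 0.001  # hypothetical target final learning rate
T = 100        # hypothetical total number of scheduler steps

gamma = (eta_T / eta0) ** (1 / T)  # gamma = (eta_T / eta0)^(1/T)
final_lr = eta0 * gamma ** T

print(f"gamma = {gamma:.6f}, final LR = {final_lr:.6f}")
```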
6.2 Step Decay
Suppose instead that $K$ decay events will occur during training. Since after $K$ drops

$$\eta_{\text{final}} = \eta_0 \cdot \gamma^{K},$$

the corresponding factor is

$$\gamma = \left( \frac{\eta_{\text{final}}}{\eta_0} \right)^{1/K}.$$

The difference is that $K$ now counts drop events, not every training step. In practice, this means that $K$ is determined by:
- the total training horizon,
- the chosen `step_size`,
- and the time unit used for scheduler stepping.
For example, if training lasts $T$ scheduler steps and Step Decay drops the learning rate every $s$ steps, then the number of decay events is approximately

$$K \approx \left\lfloor \frac{T}{s} \right\rfloor.$$
If `scheduler.step()` is called once per epoch, then $T$ and $s$ are epoch counts.
If it is called once per iteration, then both are measured in iterations.
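Putting the pieces together in a short sketch (horizon and target values are hypothetical):

```python
# Sketch (hypothetical horizon and target): calibrate Step Decay's gamma
# from the number of decay events K rather than from every step.
eta0 = 0.1
eta_final = 0.001  # hypothetical target final learning rate
T = 100            # total scheduler steps
s = 10             # step_size: one decay event every s steps

K = T // s                             # number of decay events
gamma = (eta_final / eta0) ** (1 / K)  # gamma = (eta_final / eta0)^(1/K)
final_lr = eta0 * gamma ** K

print(f"K = {K}, gamma = {gamma:.4f}, final LR = {final_lr:.6f}")
```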
Tip
This target-final-LR viewpoint is often easier to reason about than tuning $\gamma$ directly by trial and error. It converts scheduler design into a more interpretable problem:
- choose the initial LR,
- choose the desired final LR,
- choose the training horizon,
- then solve for $\gamma$.
7. PyTorch implementation
Both schedules are implemented in torch.optim.lr_scheduler.
Call order
In standard PyTorch usage, `scheduler.step()` should be called after `optimizer.step()`.
Time-unit interpretation
In PyTorch, `step_size` and the effective scheduler step index are measured in the unit in which `scheduler.step()` is called. If stepping is done once per epoch, they count epochs. If stepping is done once per iteration, they count iterations.
7.1 Step Decay (StepLR)
`StepLR` multiplies the learning rate of each parameter group by `gamma` every `step_size` scheduler steps.
```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Dummy model and optimizer
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate every 10 scheduler steps
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

print(f"Initial LR: {optimizer.param_groups[0]['lr']:.4f}")

for epoch in range(1, 31):
    # train(...)
    optimizer.step()
    scheduler.step()
    if epoch % 5 == 0:
        print(f"Epoch {epoch}: Current LR = {optimizer.param_groups[0]['lr']:.4f}")

# Expected pattern:
# Epoch 1-9:   0.1000
# Epoch 10-19: 0.0500
# Epoch 20-29: 0.0250
# Epoch 30+:   0.0125
```

7.2 Exponential Decay (ExponentialLR)
`ExponentialLR` multiplies the learning rate of each parameter group by `gamma` at every scheduler step.
```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

# Dummy model and optimizer
model = torch.nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Reduce LR by 10% each scheduler step
scheduler = ExponentialLR(optimizer, gamma=0.9)

print(f"Initial LR: {optimizer.param_groups[0]['lr']:.4f}")

for epoch in range(1, 6):
    # train(...)
    optimizer.step()
    scheduler.step()
    print(f"Epoch {epoch}: Current LR = {optimizer.param_groups[0]['lr']:.4f}")

# Expected pattern:
# Epoch 1: 0.0900
# Epoch 2: 0.0810
# Epoch 3: 0.0729
# Epoch 4: 0.0656
# Epoch 5: 0.0590
```

Note
The code above is epoch-based only because `scheduler.step()` is called once per epoch. If it were called once per batch, the same schedulers would become iteration-based.
8. Step vs Exponential Decay
| Aspect | Step Decay | Exponential Decay |
|---|---|---|
| Decay profile | Piecewise-constant, staircase-like | Smooth, monotonically decreasing |
| Update frequency | Every `step_size` scheduler steps | Every scheduler step |
| Optimization dynamics | Stable plateaus separated by abrupt regime changes | Gradual continuous contraction |
| Main hyperparameters | `step_size`, $\gamma$ | $\gamma$ |
| Main strength | Explicit control over milestone-based exploitation | Smooth reduction without discontinuities |
| Main weakness | Sensitivity to milestone placement and abrupt drops | Continuous decay may become too aggressive if poorly calibrated |
9. Practical guidance
When Step Decay is a good choice
Step Decay is often preferable when:
- good milestone locations are already known,
- explicit regime changes are desired,
- interpretability and manual control matter,
- a strong simple baseline is needed.
When Exponential Decay is a good choice
Exponential Decay is often preferable when:
- a smoother decay is desired,
- abrupt learning-rate drops are undesirable,
- one wants a continuously shrinking schedule,
- the training dynamics benefit from gradual rather than discrete changes.
When neither is enough
If both schedules feel too rigid or too sensitive, this often suggests moving to a more flexible family such as Cosine Annealing.
10. Conclusion
Final takeaway
Step Decay and Exponential Decay embody the same fundamental idea of multiplicative learning-rate reduction. The real choice between them is a choice between:
- discrete milestone-based exploitation, and
- continuous smooth decay.
The better schedule is the one whose transition pattern best matches the optimization dynamics of the training problem at hand.