Introduction to Diffusion Models
Diffusion Models are generative models that model a target distribution by learning a denoising process at varying noise levels. This concept is inspired by nonequilibrium thermodynamics, in which a physical system starts from a structured, low-entropy state that is gradually “diffused” or driven toward a more disordered, high-entropy equilibrium state over time. In principle, the system can be steered back toward a more ordered configuration, although this typically requires precise control and information about the underlying dynamics. In diffusion-based generative models, we begin with real data and then apply a stochastic “diffusion” of noise step by step. Each step slightly corrupts the data by adding Gaussian noise, until we arrive at a highly noisy, nearly featureless distribution that is mathematically close to a pure Gaussian distribution.

Our goal is to understand the mathematics behind this “denoising” magic, starting with a core concept in generative modeling: the ELBO.
1.1. The Evidence Lower Bound (ELBO): A Quick Refresher, Starting from VAEs

In many generative models, like VAEs, we want to model the true data distribution $p(x)$.
The likelihood function $p_\theta(x)$ tells you how likely it is to observe your data, given some parameter values $\theta$.
Directly maximizing the likelihood can be tricky. VAEs introduce a latent variable $z$ and aim to maximize $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$. This is often intractable, so we maximize a lower bound instead, called the Evidence Lower Bound (ELBO).
Let’s start by rewriting the log-likelihood of the data $\log p_\theta(x)$:
We introduce an approximate posterior $q_\phi(z|x)$ (the “encoder”), and $p_\theta(x|z)$ becomes our “decoder”. The ELBO is derived as:

$$\log p_\theta(x) = \log \int p_\theta(x, z)\,dz = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$$

Using Jensen’s inequality ($\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$):

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] =: \text{ELBO}$$

This can be rewritten into a more familiar form:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right)$$
Here:
- $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is the reconstruction likelihood: how well can we reconstruct $x$ from a $z$ sampled from our encoder $q_\phi(z|x)$?
- $D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p(z))$ is a Kullback-Leibler (KL) divergence that acts as a regularizer, pushing the distribution of latent codes to be similar to a prior distribution $p(z)$ (often a standard Gaussian).
What is $D_{\mathrm{KL}}$?
It’s defined as:

$$D_{\mathrm{KL}}(q\,\|\,p) = \mathbb{E}_{q}\!\left[\log \frac{q(z)}{p(z)}\right]$$

It measures how much the approximate posterior $q_\phi(z|x)$ differs from the prior $p(z)$. It is always non-negative, and equals zero if the two distributions match exactly.
Why choose $p(z)$ as a standard Gaussian?
Choosing $p(z) = \mathcal{N}(0, I)$ simplifies the KL divergence term (it has a closed form) and ensures that the latent space is well-regularized. This choice also facilitates the reparameterization trick, which allows us to backpropagate through the sampling process. By reparameterizing $z$ as $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$, we can compute gradients with respect to $\mu$ and $\sigma$ directly, enabling efficient optimization.
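To make the reparameterization trick and the closed-form KL concrete, here is a minimal NumPy sketch. The values of `mu` and `log_var` are made up for illustration; in a real VAE they would be outputs of the encoder network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one data point (illustrative values only)
mu = np.array([0.5, -1.0])       # predicted mean of q(z|x)
log_var = np.array([0.1, -0.3])  # predicted log-variance of q(z|x)

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
# The randomness is isolated in eps, so gradients can flow to mu and sigma.
eps = rng.standard_normal(mu.shape)
sigma = np.exp(0.5 * log_var)
z = mu + sigma * eps

# Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian:
# 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

Note how `kl` needs no sampling at all: with a standard Gaussian prior, the regularizer is a deterministic function of the encoder outputs.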
1.2. ELBO for Sequential Latent Variables (Precursor to Diffusion)

Diffusion models can be seen as a type of latent variable model, but with a sequence of latent variables $x_1, \dots, x_T$.
- $x_0$ is our original clean data.
- $x_1, \dots, x_T$ are increasingly noisy versions of $x_0$.
- $x_T$ is ideally pure noise (e.g., a sample from $\mathcal{N}(0, I)$).
The goal is still to maximize $\log p_\theta(x_0)$. We can write a similar ELBO for this sequence. Let $x_{1:T}$ denote the sequence $x_1, \dots, x_T$. We can express the likelihood using the chain of latent states: $p_\theta(x_0) = \int p_\theta(x_{0:T})\,dx_{1:T}$, where $p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)$. The ELBO becomes:

$$\log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right]$$

Here, $q(x_{1:T}|x_0)$ is the (fixed) forward noising process, and $p_\theta(x_{t-1}|x_t)$ is the model we want to learn (the reverse denoising process).
1.3. The Two Markov Chains behind Diffusion Models
Diffusion models are characterized by two key processes:
a. Forward Process (Diffusion Process) ➡️
This process gradually adds Gaussian noise to an image over $T$ timesteps. It’s a fixed Markov chain, meaning it’s not learned. It’s defined as:

$$q(x_t|x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I\right)$$

Here, $\beta_1, \dots, \beta_T$ are small positive constants representing the noise schedule (variance). Let $\alpha_t = 1 - \beta_t$. Then:

$$q(x_t|x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{\alpha_t}\,x_{t-1},\, (1-\alpha_t) I\right)$$

A wonderful property of this process is that we can sample $x_t$ directly from $x_0$ at any timestep $t$, without iterating through all intermediate steps. Let $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:

$$q(x_t|x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t) I\right)$$

This means we can get $x_t$ by:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

This is super useful for training! We can pick any $x_0$ from our dataset, pick a random $t$, and generate a noisy $x_t$ in one shot.
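The one-shot property is easy to exercise in code. Here is a minimal NumPy sketch, assuming the linear schedule values from the DDPM paper; the toy data vector is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)       # alpha_bar_t = prod of alpha_s for s <= t

x0 = rng.standard_normal(8)         # toy "clean data" vector
t = 500                             # arbitrary timestep (1-indexed)
eps = rng.standard_normal(x0.shape)

# One-shot forward sampling: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
```

No loop over timesteps is needed: `x_t` is distributed exactly as $q(x_t|x_0)$ prescribes, which is what makes minibatch training with random timesteps cheap.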
b. Reverse Process (Denoising Process) ⬅️
This process learns to reverse the noising steps. It’s also a Markov chain, aiming to predict the slightly less noisy $x_{t-1}$ given the noisier $x_t$. This is where our neural network (parameterized by $\theta$) comes in:

$$p_\theta(x_{t-1}|x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)$$

The goal of training is to make $p_\theta(x_{t-1}|x_t)$ a good approximation of the true (but intractable) reverse conditional $q(x_{t-1}|x_t)$.
1.4. Decomposing the ELBO for Diffusion Models
Let’s expand the ELBO. The joint probability is given by the reverse process:

$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)$$

where $p(x_T)$ is a prior, usually $\mathcal{N}(0, I)$.
The forward process is given by:

$$q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1})$$

(where $x_0$ is given for $q$).
The ELBO can be rewritten and decomposed into several terms (after some algebra!). A convenient decomposition for minimization (we minimize $L = -\text{ELBO}$) looks like this:

$$L = \mathbb{E}_q\Bigg[\underbrace{-\log p_\theta(x_0|x_1)}_{L_0} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\!\left(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t)\right)}_{L_{t-1}} + \underbrace{D_{\mathrm{KL}}\!\left(q(x_T|x_0)\,\|\,p(x_T)\right)}_{L_T}\Bigg]$$
Let’s break down these terms that we want to minimize:
- $L_0 = \mathbb{E}_q\!\left[-\log p_\theta(x_0|x_1)\right]$: This is a reconstruction term. It measures how well the model can reconstruct the original data $x_0$ from the first noisy version $x_1$.
- $L_{t-1} = D_{\mathrm{KL}}\!\left(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t)\right)$ for $t = 2, \dots, T$: These are KL divergence terms. They measure the difference between the model’s reverse step $p_\theta(x_{t-1}|x_t)$ and the true posterior of the forward process, $q(x_{t-1}|x_t, x_0)$. This true posterior tells us what $x_{t-1}$ should look like given $x_t$ and the original clean image $x_0$.
- $L_T = D_{\mathrm{KL}}\!\left(q(x_T|x_0)\,\|\,p(x_T)\right)$: This term compares the distribution of the final noised sample (derived from $x_0$) with the prior $p(x_T)$ (e.g., $\mathcal{N}(0, I)$). Since the forward process is designed such that $q(x_T|x_0)$ is approximately $\mathcal{N}(0, I)$ for large $T$, and $p(x_T)$ is chosen as $\mathcal{N}(0, I)$, this term is small and doesn’t depend on $\theta$, so it’s usually ignored during training.
The core of the learning happens in the $L_{t-1}$ terms (and $L_0$, which can be seen as a special case).
Why is $p(x_T)$ chosen as a standard Gaussian?
As $T$ approaches infinity, the forward noising process adds so much noise that the original data $x_0$ is completely obliterated. By design of the noise schedule, the distribution $q(x_T|x_0)$ converges to a standard Gaussian. This ensures that $p(x_T)$, the prior for the reverse process, can be modeled as $\mathcal{N}(0, I)$, simplifying both training and sampling.
Step-by-Step Explanation:
Step 1: Analyze $\bar{\alpha}_T$ as $T \to \infty$:
Typically, the noise schedule is defined such that:
- Each $\beta_t$ is small and satisfies $0 < \beta_t < 1$.
- As $T \to \infty$, the cumulative product approaches zero:

$$\bar{\alpha}_T = \prod_{t=1}^{T} (1 - \beta_t) \to 0$$

Step 2: Evaluate the mean and covariance of $q(x_T|x_0)$ in this limit:
Given the distribution $q(x_T|x_0) = \mathcal{N}\!\left(x_T;\, \sqrt{\bar{\alpha}_T}\,x_0,\, (1-\bar{\alpha}_T) I\right)$:
- Mean: As $\bar{\alpha}_T \to 0$, $\sqrt{\bar{\alpha}_T}\,x_0 \to 0$.
- Covariance: As $\bar{\alpha}_T \to 0$, $(1-\bar{\alpha}_T) I \to I$.
Thus, $q(x_T|x_0) \to \mathcal{N}(0, I)$.
Final Result:
Therefore, the final limiting behavior is elegantly expressed as:

$$\lim_{T \to \infty} q(x_T|x_0) = \mathcal{N}(0, I)$$
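This limit is easy to check numerically for a concrete schedule. A minimal sketch, assuming the linear schedule with $T = 1000$ as in DDPM:

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)   # linear schedule, DDPM values
alpha_bar = np.cumprod(1.0 - beta)

# The signal coefficient sqrt(alpha_bar_T) collapses toward 0,
# while the noise variance (1 - alpha_bar_T) approaches 1:
signal = np.sqrt(alpha_bar[-1])     # essentially no signal left at t = T
noise_var = 1.0 - alpha_bar[-1]     # covariance is essentially the identity
```

Even at finite $T = 1000$, $\bar{\alpha}_T$ is already tiny, so treating $q(x_T|x_0)$ as $\mathcal{N}(0, I)$ is an excellent approximation in practice.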
1.5. Analyzing the KL Divergence Terms ($L_{t-1}$)
To minimize $L_{t-1}$, we need to characterize $q(x_{t-1}|x_t, x_0)$. Using Bayes’ theorem:

$$q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0)\,q(x_{t-1}|x_0)}{q(x_t|x_0)}$$

Since the forward process is Markovian, $q(x_t|x_{t-1}, x_0) = q(x_t|x_{t-1})$. We know $q(x_t|x_{t-1})$, $q(x_{t-1}|x_0)$, and $q(x_t|x_0)$ are all Gaussians. After some lovely math involving products of Gaussian PDFs and completing the square, we find that $q(x_{t-1}|x_t, x_0)$ is also a Gaussian:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\right)$$

Where the mean and variance are:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

Our model’s reverse step is $p_\theta(x_{t-1}|x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 I\right)$. The KL divergence between two Gaussians with the same variance simplifies nicely. If we fix the variance of our model (e.g., set $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$, as is common), the KL divergence term becomes proportional to the squared difference between the means:

$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C$$

So, we need our neural network to predict $\tilde{\mu}_t(x_t, x_0)$.
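The claim that the KL between two equal-variance Gaussians reduces to a scaled squared mean difference can be verified numerically. A sketch with made-up means and variance; the Monte Carlo estimate should agree with the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_q = np.array([1.0, -0.5])   # mean of q (stand-in for the true posterior mean)
mu_p = np.array([0.8, -0.2])   # mean of p (stand-in for the model's mean)
var = 0.25                     # shared isotropic variance sigma^2

# Closed form: KL(N(mu_q, var I) || N(mu_p, var I)) = ||mu_q - mu_p||^2 / (2 var)
kl_closed = np.sum((mu_q - mu_p) ** 2) / (2.0 * var)

# Monte Carlo estimate of E_q[log q(z) - log p(z)] for comparison
z = mu_q + np.sqrt(var) * rng.standard_normal((200_000, 2))
log_ratio = ((z - mu_p) ** 2 - (z - mu_q) ** 2).sum(axis=1) / (2.0 * var)
kl_mc = log_ratio.mean()
```

This is exactly why fixing $\sigma_t^2$ turns each $L_{t-1}$ into a plain squared error between $\tilde{\mu}_t$ and $\mu_\theta$.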
1.6. Parameterizing the Mean with Noise Prediction
This is where the magic simplification happens! The mean $\tilde{\mu}_t$ depends on $x_0$, which is not available during the reverse (generation) process. However, during training, we do have $x_0$. Recall the forward sampling equation: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. We can rearrange this to express $x_0$ in terms of $x_t$ and the noise $\epsilon$:

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right)$$

Now, substitute this expression for $x_0$ back into the equation for $\tilde{\mu}_t$. After some algebraic simplification, $\tilde{\mu}_t$ can be rewritten as:

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right)$$

Instead of making our neural network directly predict this complex mean, we parameterize it to predict the noise $\epsilon$ that was added at step $t$. Let $\epsilon_\theta(x_t, t)$ be the noise predicted by our neural network (usually a U-Net architecture). We define our model’s mean as:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

This is a key implementation point! Our network learns to predict noise.
Now, the squared difference term in $L_{t-1}$ becomes a squared error between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta(x_t, t)$. So, the loss term (for $t = 2, \dots, T$) becomes:

$$L_{t-1} = \mathbb{E}_{x_0,\,\epsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$

The $L_0$ term (reconstruction) can also be formulated in a similar way or handled by this noise prediction framework at $t = 1$.
1.7. The Simplified Loss Function (The Big Reveal!)
The full ELBO contains these weighted noise prediction terms. However, Ho et al. (2020) in their paper “Denoising Diffusion Probabilistic Models” (DDPM) found that a much simpler, unweighted version of this loss works remarkably well in practice. They propose to train the model by minimizing the following simple mean squared error between the true noise and the predicted noise:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right]$$
This is it! This is the objective function that most diffusion models are trained on. We simply:
- Pick a random training image $x_0$.
- Pick a random timestep $t \sim \mathrm{Uniform}\{1, \dots, T\}$.
- Sample a random noise vector $\epsilon \sim \mathcal{N}(0, I)$.
- Create the noised image $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$.
- Feed $x_t$ and $t$ to our neural network $\epsilon_\theta$.
- Ask the network to predict the original noise $\epsilon$ that was added.
- The loss is just the Mean Squared Error between the true $\epsilon$ and the predicted $\epsilon_\theta(x_t, t)$.
It’s beautifully simple and incredibly effective.
1.8. Why this Simplification Works and Implementation Details
- Why simplify? The weighting factors $\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}$ can be complex to tune, and empirically, the unweighted version performs very well and is more stable. It effectively re-weights the importance of different timesteps.
- Choice of $\sigma_t^2$: The variance of the reverse process is often set to $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$. The DDPM paper found that both choices work well.
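The two variance choices are closer than they might look. A quick numeric comparison, assuming the linear schedule (`tilde_beta` is defined for $t \geq 2$):

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)

# tilde_beta_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t, for t >= 2
t = np.arange(1, T)  # 0-based indices covering timesteps 2..T
tilde_beta = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * beta[t]

# Relative gap between the two candidate variances:
# large at the earliest timesteps, negligible for large t.
rel_gap = np.abs(tilde_beta - beta[t]) / beta[t]
```

Since the two choices nearly coincide over most of the trajectory, it is plausible that DDPM found both to work comparably well.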
1.9. Deriving the Reverse Sampling Formula
Let’s derive the formula for sampling during the reverse process step by step.
First, recall that the true posterior for the reverse process is:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\right)$$

Where the mean, derived using Bayes’ rule, is:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t$$

We can’t use this directly in practice, since we don’t know the true $x_0$ during sampling. However, given our noise prediction network, we can estimate $x_0$ from $x_t$:

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\right)$$

Substituting this estimate into the formula for $\tilde{\mu}_t$, and after algebraic simplification, we get the DDPM sampling equation:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$

Where $\sigma_t^2 = \tilde{\beta}_t$ is the variance term from the true posterior.
This formula is the heart of the DDPM sampling process. Notice how:
- The first term denoises the noisy sample $x_t$ using our predicted noise $\epsilon_\theta(x_t, t)$.
- The second term adds a controlled amount of new noise to maintain the stochastic nature of the process.
When we apply this formula step by step from $t = T$ down to $t = 1$, we gradually transform random noise into a coherent data sample.
Beta Schedules: Designing the Noise Trajectory
The choice of noise schedule significantly impacts sample quality and training dynamics:
- Linear schedule: A simple linear increase (original DDPM paper).
  - $\beta_t$ increases linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.
  - Easy to implement but not optimal for all data types.
- Cosine schedule: A smoother, more natural decay proposed by Nichol & Dhariwal.
  - Uses a cosine function to create a schedule that adds noise more gradually at first.
  - Better preserves data structure in early timesteps.
  - Improves sample quality.
- Learned schedule: The $\beta_t$ values are optimized jointly with the model weights.
  - More complex but can adapt to specific data characteristics.
  - Requires additional training complexity.
The optimal schedule balances between preserving low-frequency information early in the diffusion process and adding sufficient noise to cover the data distribution.
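Here is a minimal NumPy sketch of the first two schedules. The cosine formula follows Nichol & Dhariwal’s $\bar{\alpha}$-based definition; the offset `s = 0.008` and the clipping of extreme betas are conventions from their paper:

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear schedule used in the original DDPM paper."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule (Nichol & Dhariwal, 2021), defined via alpha_bar."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                        # normalize so alpha_bar_0 = 1
    beta = 1.0 - alpha_bar[1:] / alpha_bar[:-1] # recover per-step betas
    return np.clip(beta, 0.0, 0.999)            # avoid a degenerate final step

betas_lin = linear_beta_schedule(1000)
betas_cos = cosine_beta_schedule(1000)
```

Comparing `np.cumprod(1 - betas)` for the two schedules shows the cosine variant keeping $\bar{\alpha}_t$ (and hence the signal) higher in the early timesteps, which is exactly the "adds noise more gradually at first" behavior described above.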
1.10. Implementation Details
Pseudo-code for Training

```
# Training Algorithm
for each training step:
    x_0 = sample_from_data()       # Original image
    t = random_timestep(1, T)      # Random timestep
    epsilon = sample_noise()       # Noise ~ N(0, I)

    # Generate noised image in one shot
    x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * epsilon

    # Predict noise using the model
    epsilon_pred = model(x_t, t)

    # Compute loss and update model parameters
    loss = mse_loss(epsilon, epsilon_pred)
    optimizer.step(loss)
```

Pseudo-code for Sampling
```
# Sampling Algorithm
x_t = sample_noise()               # Start from pure noise x_T ~ N(0, I)
for t in range(T, 0, -1):
    epsilon_pred = model(x_t, t)   # Predict noise
    mu_t = (1 / sqrt(alpha[t])) * (x_t - beta[t] / sqrt(1 - alpha_bar[t]) * epsilon_pred)
    if t > 1:
        z = sample_noise()         # Fresh noise for intermediate steps
    else:
        z = 0                      # No noise at the last step
    x_t = mu_t + sigma[t] * z      # One reverse step: this is x_{t-1}
x_0 = x_t                          # Final generated sample
```

References
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2006.11239.
- Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
- Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv preprint arXiv:1503.03585.