In the previous sections, we explored how diffusion models can generate diverse samples by learning to reverse a noising process. However, we often want to control what the model generates, for example to produce an image of a specific object or in a particular style. This is known as conditional generation. Guidance techniques are methods to steer this generation process towards desired attributes.

The core idea behind many guidance techniques is to modify the sampling process, specifically how the model predicts the less noisy sample $x_{t-1}$ from $x_t$. Recall from DDPMs (Section 1.6) that the model $\epsilon_\theta(x_t, t)$ predicts the noise $\epsilon$ that was added to $x_0$ to get $x_t$. The reverse process then uses this predicted noise to estimate the mean of $p_\theta(x_{t-1} \mid x_t)$.

Guidance methods typically alter the effective noise prediction $\epsilon_\theta(x_t, t)$ or, equivalently, the score $\nabla_{x_t} \log p(x_t)$, to incorporate the desired condition $y$.

3.1. The Role of Score Matching and Noise Prediction

To fully appreciate why predicting the noise is so effective and how it relates to guiding the generation process, it’s helpful to briefly touch upon the concept of score-based generative models (also known as score matching).

The “score” of a data distribution at a point $x$ is defined as the gradient of the log-probability with respect to the data: $\nabla_x \log p(x)$. Score-based models aim to learn this score function for the data distribution at different noise levels.

The key intuition connecting this to noise prediction in DDPMs is that the score of the noised data distribution can be shown to be proportional to the negative of the noise $\epsilon$ that was added to $x_0$ to obtain $x_t$ (when conditioned on $x_0$). Specifically, for $x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, we have

$$\nabla_{x_t} \log p(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}.$$

So, a model $\epsilon_\theta(x_t, t)$ that is trained to predict the noise is implicitly learning a scaled version of the score $\nabla_{x_t} \log p(x_t)$. This is why the terms “noise prediction model” and “score-based model” are often used interchangeably in the context of diffusion models: they learn essentially the same underlying quantity. The equations for classifier guidance, which explicitly use the score $\nabla_{x_t} \log p(x_t)$, highlight this connection.
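To make this relation concrete, here is a small numerical check (a minimal sketch with a toy scalar “image”; the variable names are illustrative and not from any particular library). For a fixed $x_0$, the noised sample $x_t$ follows a Gaussian with mean $\sqrt{\bar{\alpha}_t} \, x_0$ and variance $1 - \bar{\alpha}_t$, so its log-density can be differentiated directly and compared against $-\epsilon / \sqrt{1 - \bar{\alpha}_t}$:

import torch

# Toy check of the score / noise relation for a single scalar "image"
alpha_bar_t = torch.tensor(0.7)   # example cumulative alpha-bar at step t
x0 = torch.tensor(0.5)            # clean sample
eps = torch.randn(())             # noise used in the forward process

# Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps
x_t = x_t.detach().requires_grad_(True)

# log q(x_t | x_0) for a Gaussian with mean sqrt(alpha_bar_t) * x_0 and
# variance (1 - alpha_bar_t), up to an additive constant
log_q = -0.5 * (x_t - alpha_bar_t.sqrt() * x0) ** 2 / (1 - alpha_bar_t)

score = torch.autograd.grad(log_q, x_t)[0]           # nabla_{x_t} log q(x_t | x_0)
score_from_noise = -eps / (1 - alpha_bar_t).sqrt()   # -eps / sqrt(1 - alpha_bar_t)
print(torch.allclose(score, score_from_noise))       # True (up to float error)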

While a deep dive into score-based models is beyond the scope of this note, understanding this connection provides a richer perspective on why diffusion models work and how guidance mechanisms are formulated. For those interested in a comprehensive exploration that covers VAEs, DDPMs, DDIMs, and Score-Based Models in detail, the following guide is an excellent resource:

3.2. Classifier Guidance

Classifier guidance, introduced by Dhariwal and Nichol (2021), uses a separate, pre-trained classifier $p_\phi(y \mid x_t)$ to guide the diffusion sampling process. The classifier is trained to predict the class $y$ of a noisy image $x_t$.

The Core Idea: Modifying the Score

The goal is to sample from the conditional distribution $p(x_t \mid y)$. Using Bayes’ theorem:

$$p(x_t \mid y) = \frac{p(y \mid x_t) \, p(x_t)}{p(y)}$$

Taking the logarithm and then the gradient with respect to $x_t$ (the $p(y)$ term drops out because it does not depend on $x_t$):

$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$$

Let’s break this down:

  • $\nabla_{x_t} \log p(x_t)$: This is the score of the unconditional distribution that our diffusion model has learned to approximate. If our model predicts the noise, this score is related by $\nabla_{x_t} \log p(x_t) \approx -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t}$.
  • $\nabla_{x_t} \log p(y \mid x_t)$: This is the gradient of the log-likelihood of the condition $y$ given $x_t$. This term is provided by the external classifier $p_\phi(y \mid x_t)$. It “points” in a direction that makes $x_t$ more recognizable as class $y$ by the classifier.

The guided score is then a combination, often with a guidance scale $s$ (also denoted $w$ or guidance_scale):

$$\nabla_{x_t} \log p(x_t \mid y) \approx \nabla_{x_t} \log p(x_t) + s \, \nabla_{x_t} \log p_\phi(y \mid x_t)$$

This modified score then leads to an adjusted noise prediction $\hat{\epsilon}$ that is used in the DDPM sampling step:

$$\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t) - s \, \sqrt{1 - \bar{\alpha}_t} \; \nabla_{x_t} \log p_\phi(y \mid x_t)$$

The $\sqrt{1 - \bar{\alpha}_t}$ term is often absorbed into a single guidance scale hyperparameter, and the sign depends on whether the gradient is added to the score or subtracted from the noise prediction.

How it Works in Practice

During each step of the reverse diffusion process:

  1. The diffusion model predicts the unconditional noise $\epsilon_\theta(x_t, t)$.
  2. The current noisy image $x_t$ and the target class $y$ are fed to the classifier $p_\phi$.
  3. The gradient of the log probability of class $y$ with respect to the input $x_t$, i.e., $\nabla_{x_t} \log p_\phi(y \mid x_t)$, is computed. This requires $x_t$ to have gradients enabled.
  4. The unconditional noise prediction is adjusted using this gradient and the guidance scale.
  5. The sampling step proceeds using this adjusted noise.

Pseudo-code for Classifier Guidance

# classifier_model: pre-trained image classification model p_phi(y|x_t)
# model: unconditional diffusion model epsilon_theta(x_t, t)
# scheduler: DDPM scheduler for timesteps and updates
# y: target class label
# guidance_scale: strength of the classifier guidance
#                 (absorbs the sqrt(1 - alpha_bar_t) factor from the equation above)

input = get_noise_from_standard_normal_distribution(...)  # initial noisy image x_T

for t in tqdm(scheduler.timesteps):  # iterate from T down to 0
    # 1. Predict the unconditional noise epsilon_theta(x_t, t)
    with torch.no_grad():
        noise_pred_uncond = model(input, t).sample

    # 2. Get the classifier log-probability log p_phi(y|x_t);
    #    the input must require gradients for the backward pass
    input_with_grad = input.detach().requires_grad_(True)
    log_prob_y = classifier_model(input_with_grad, t).log_softmax(dim=-1)[:, y].sum()

    # 3. Compute the guidance gradient nabla_{x_t} log p_phi(y|x_t)
    class_gradient = torch.autograd.grad(log_prob_y, input_with_grad)[0]

    # 4. Adjust the noise prediction: subtracting the gradient steers the
    #    sample towards class y under the noise-prediction convention
    noise_pred_guided = noise_pred_uncond - guidance_scale * class_gradient

    # 5. Perform the sampling step with the guided noise
    input = scheduler.step(noise_pred_guided, t, input).prev_sample

Note on Pseudo-code:

The exact implementation details, especially the sign and scaling of class_gradient, can vary between implementations. The key is that the classifier’s gradient $\nabla_{x_t} \log p_\phi(y \mid x_t)$ steers the sampling towards the target class; above it is computed directly with torch.autograd.grad and subtracted from the predicted noise, consistent with the score relation in Section 3.1.

Pros and Cons of Classifier Guidance

Pros:

  • Can guide any pre-trained unconditional diffusion model without retraining it.
  • Allows leveraging powerful, off-the-shelf classifiers.

Cons:

  • Requires a separate classifier model, which must be robust to noisy inputs (this typically means training the classifier on noisy data; see the sketch after this list).
  • The guidance is limited to the classes the classifier was trained on.
  • Can be computationally more expensive due to classifier forward/backward passes at each step.
  • Guidance can sometimes lead to adversarial examples for the classifier, resulting in artifacts if the guidance scale is too high.
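On the first point, below is a minimal sketch of how a noise-aware classifier might be trained: each image is noised to a randomly sampled timestep using the same forward process as the diffusion model, and the classifier receives the timestep as an extra input. The names classifier_model, scheduler, and data_loader are assumed placeholders, not a specific library API.

# Sketch: training a classifier p_phi(y|x_t) that is robust to noisy inputs
optimizer = torch.optim.Adam(classifier_model.parameters(), lr=1e-4)

for images, labels in data_loader:
    # Sample a random timestep per image and noise the images with the
    # same forward process the diffusion model uses
    t = torch.randint(0, scheduler.config.num_train_timesteps, (images.shape[0],))
    noise = torch.randn_like(images)
    noisy_images = scheduler.add_noise(images, noise, t)

    # The classifier is conditioned on the timestep so it can adapt to the noise level
    logits = classifier_model(noisy_images, t)
    loss = torch.nn.functional.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()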

3.3. Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG), proposed by Ho and Salimans (2022), offers a way to guide diffusion models without needing an external classifier. It has become a very popular and effective technique.

The Core Idea: Jointly Trained Conditional Model

The key idea is to train a single diffusion model $\epsilon_\theta(x_t, t, y)$ that is conditioned on $y$ (e.g., a class label or a text embedding). During training, this model is occasionally fed a null condition $\varnothing$ (e.g., a zero vector for class embeddings, or an empty-string embedding for text). This means the model learns both conditional generation $\epsilon_\theta(x_t, t, y)$ and unconditional generation $\epsilon_\theta(x_t, t, \varnothing)$ (when $y = \varnothing$).
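A minimal sketch of this condition-dropout step during training is shown below. The names model, scheduler, text_encoder, data_loader, and the dropout probability p_uncond are illustrative assumptions, following the same pseudo-code interface used elsewhere in this section.

# Sketch: randomly replacing the condition with the null condition during training
p_uncond = 0.1  # probability of dropping the condition (e.g. 10-20%)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for images, captions in data_loader:
    cond_emb = text_encoder.encode(captions)
    null_emb = text_encoder.encode([""] * len(captions))  # null condition

    # With probability p_uncond, swap each caption embedding for the null embedding
    drop = torch.rand(len(captions), 1, 1) < p_uncond
    cond_emb = torch.where(drop, null_emb, cond_emb)

    # Standard DDPM noise-prediction loss, now conditioned on cond_emb
    t = torch.randint(0, scheduler.config.num_train_timesteps, (images.shape[0],))
    noise = torch.randn_like(images)
    noisy_images = scheduler.add_noise(images, noise, t)
    noise_pred = model(noisy_images, t, encoder_hidden_states=cond_emb).sample
    loss = torch.nn.functional.mse_loss(noise_pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()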

At sampling time, the model makes two predictions:

  1. $\epsilon_\theta(x_t, t, y)$: The noise prediction conditioned on the desired $y$.
  2. $\epsilon_\theta(x_t, t, \varnothing)$: The noise prediction for unconditional generation.

The final noise prediction used for sampling is an extrapolation from the unconditional prediction in the direction of the conditional one:

$$\hat{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t, \varnothing) + w \left( \epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing) \right)$$

Here, $w$ is the guidance scale (often denoted $s$ or guidance_scale).

  • If $w = 0$, we get unconditional generation: $\hat{\epsilon}_\theta = \epsilon_\theta(x_t, t, \varnothing)$.
  • If $w = 1$, we get standard conditional generation: $\hat{\epsilon}_\theta = \epsilon_\theta(x_t, t, y)$.
  • If $w > 1$, the generation is pushed further in the direction of the condition $y$, often improving sample quality and adherence to the condition, at the cost of diversity.

Equivalently,

$$\hat{\epsilon}_\theta(x_t, t, y) = (1 - w) \, \epsilon_\theta(x_t, t, \varnothing) + w \, \epsilon_\theta(x_t, t, y).$$

How it Works in Practice

  1. Training – Train a conditional diffusion model $\epsilon_\theta(x_t, t, y)$. With some probability $p_\text{uncond}$ (e.g., 10–20%), replace the true condition $y$ with a null/empty condition $\varnothing$.
  2. Sampling – At each step $t$:
    • Compute $\epsilon_\theta(x_t, t, y)$ and $\epsilon_\theta(x_t, t, \varnothing)$.
    • Combine them: $\hat{\epsilon}_\theta = \epsilon_\theta(x_t, t, \varnothing) + w \left( \epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing) \right)$.
    • Use $\hat{\epsilon}_\theta$ in the DDPM sampling step.

Pseudo-code for Classifier-Free Guidance (Text-to-Image Example)

# model: conditional diffusion model epsilon_theta(x_t, t, y_embedding)
# scheduler: DDPM scheduler
# text_condition: e.g. "a photo of a cat"
# guidance_scale: CFG strength
# text_encoder: encodes text to embeddings
 
cond_emb = text_encoder.encode(text_condition)
uncond_emb = text_encoder.encode("")  # empty string
 
x = torch.randn(image_shape)  # x_T ~ N(0, I)
 
for t in scheduler.timesteps:
    with torch.no_grad():
        eps_uncond = model(x, t, encoder_hidden_states=uncond_emb).sample
        eps_cond   = model(x, t, encoder_hidden_states=cond_emb).sample
    eps_guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    x = scheduler.step(eps_guided, t, x).prev_sample
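
As a follow-up note: many implementations merge the two predictions into a single batched forward pass by concatenating the unconditional and conditional embeddings, so each step runs one model call on a doubled batch instead of two separate calls. Below is a minimal sketch of this, reusing the same hypothetical interface as the loop above.

# Sketch: batched CFG - one forward pass over [uncond, cond] instead of two
emb = torch.cat([uncond_emb, cond_emb], dim=0)   # stack null and real conditions

for t in scheduler.timesteps:
    x_in = torch.cat([x, x], dim=0)              # duplicate the latent for both branches
    with torch.no_grad():
        eps = model(x_in, t, encoder_hidden_states=emb).sample
    eps_uncond, eps_cond = eps.chunk(2)          # split back into the two predictions
    eps_guided = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    x = scheduler.step(eps_guided, t, x).prev_sample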