The Generative Ambition

Generative models are a cornerstone of modern machine learning, enabling systems that can create entirely new data such as images, audio, and text by learning from real-world distributions.

Text-to-image generation is one of the most striking manifestations of this broader ambition. It aims to create realistic and coherent visual representations from linguistic descriptions by relying on generative models.

Given the intrinsic complexity of both image generation and text understanding, text-to-image generation, which combines the two, introduces an additional challenge: transferring information from one representational domain to another, so that explicit textual relations between entities are converted into images consistent with the meaning of the text.

Moreover, a good model should be able to combine concepts and styles it has never seen before in order to generate novel images. For instance, there are no portraits of Kim Jong-Un in a bedroom holding teddy bears; yet, by composing concepts learned during training, a neural generative model can create such an image.



Prompt: “Pope Francis playing dj console, wearing dj headphones”



Prompt: “Kim Jong-Un wearing hello kitty pattern pajama in bedroom holding a big hello kitty peluche”


Prompt: “Trump escorted by police officers”


Prompt: “Trump in prison doing bodybuilding”

Furthermore, it would be desirable for the model to infer accurately the way in which the objects in the generated image relate to one another, based on the semantics of the textual message and on how words acquire meaning through context. For example, the image of “a person with a life buoy floating in the sea” should appear very different from that of “a person with a life buoy in a sea of people”.

This note clarifies what is meant by generative modeling, contrasts it with discriminative modeling, situates the principal model families within a coherent taxonomy, and explains why diffusion models have emerged as one of the most important paradigms in modern generative AI.

What It Means to Model a Data Distribution

Generative modeling definition

Generative modeling is a branch of machine learning that involves training a model to generate new, previously unseen data that is related to the original training dataset.

At the mathematical level, the central object is the unknown data distribution. If the true data distribution is denoted by $p_{\text{data}}(x)$, then the goal is to learn a model distribution $p_\theta(x)$ that approximates it closely enough to support both sampling and probabilistic reasoning.

Conceptually, the training problem may be written as

$$
\theta^* = \arg\min_{\theta} \, D\!\left(p_{\text{data}}, p_{\theta}\right),
$$

where $D$ denotes a discrepancy between probability distributions. In many likelihood-based models, this discrepancy is closely related to maximum likelihood estimation.
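A standard instance: choosing the Kullback-Leibler divergence as the discrepancy makes the problem equivalent to maximum likelihood estimation, since the entropy of $p_{\text{data}}$ does not depend on $\theta$:

```latex
\arg\min_{\theta} \, D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_{\theta}\right)
= \arg\max_{\theta} \, \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\theta}(x)\right]
```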

Consider, for example, a dataset containing images of guitars. A generative model can be trained on such a dataset in order to infer the rules governing the complex relationships among the pixels in guitar images. Once training has been completed, the model can be used to create new and previously unseen images of guitars that were not present in the original dataset.

Probabilistic essence of generative modeling

It is important to note that a generative model is intrinsically probabilistic. Indeed, the distinctive ability of a generative model to produce previously unseen data presupposes learning the unknown distribution of the training data, so that by sampling from such a distribution it becomes possible to generate new data instances. This necessarily involves a degree of variability and uncertainty.

If the model were devoid of any element of randomness, for example if it simply consisted of a fixed computation such as taking the mean value of each pixel in the dataset, then, precisely because of its deterministic nature, it would not be generative.

In this sense, generation is inseparable from distributional modeling: the model is not merely memorizing examples, but learning a probabilistic structure from which new examples may be sampled.
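As a toy illustration of this point, the following sketch (assuming nothing beyond the standard library; the 3×3 "images" are invented for illustration) contrasts a deterministic pixel-mean computation with sampling from a learned per-pixel Bernoulli distribution:

```python
import random

# Toy dataset of 3x3 binary "images" (flattened to 9 pixels each).
dataset = [
    [1, 1, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0, 0, 1, 0],
]

# A deterministic "model": the per-pixel mean. It always returns the
# same output, so it cannot produce previously unseen variations.
mean_image = [sum(px) / len(dataset) for px in zip(*dataset)]

# A minimal probabilistic model: treat each pixel as an independent
# Bernoulli variable whose parameter is the empirical frequency,
# then draw new images from the learned distribution.
def sample(rng):
    return [1 if rng.random() < p else 0 for p in mean_image]

rng = random.Random(0)
samples = [sample(rng) for _ in range(5)]
# Sampling can produce images absent from the training set, which is
# exactly what makes the model generative.
```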

Why this is difficult

A valid probability density must satisfy two conditions: it must be non-negative everywhere, and it must integrate to one over the whole space.

This apparently simple requirement is one of the central technical difficulties of generative modeling. In high-dimensional settings, directly constructing a flexible neural density that is also exactly normalized is often computationally intractable.

This difficulty helps explain why different families of generative models exist in the first place. Some make exact likelihood tractable, some optimize approximations or surrogates, some model an energy or a score field instead of a normalized density, and others bypass explicit density evaluation altogether.

Generative and Discriminative Perspectives

For a more accurate understanding of generative modeling, it is useful to compare it with its counterpart: discriminative modeling.

The following figure shows a discriminative model designed to determine whether a painting can or cannot be attributed to Van Gogh. Since this is a typical binary classification problem, the training dataset is divided into two groups: paintings by Van Gogh are labeled with $y = 1$, whereas paintings by other artists are labeled with $y = 0$. The model is then trained to discriminate between the two groups and returns the probability that an image not present in the training dataset has label $y = 1$, that is, that it is indeed a painting by Van Gogh.

Note

It should be noted that, in the context of discriminative modeling, each observation in the training dataset is accompanied by a label.

By contrast, labeling the training dataset is not an essential requirement in generative modeling, which concerns the creation of novel images rather than the correct assignment of a label to a given image.

Important

The foregoing may be formalized mathematically as follows:

  • discriminative models estimate the probability $p(y \mid x)$ that $y$ is the label corresponding to a given observation $x$;
  • generative models estimate the unknown distribution $p(x)$ of the data in the training dataset; the model then generates new data from the learned distribution.

Thus, whereas the goal of discriminative modeling is to classify data on the basis of their features, generative modeling concerns the creation of new data that display affinity with the training dataset.
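To make the contrast concrete, the following sketch (toy 1-D data and illustrative function names, not a reference implementation) fits class-conditional Gaussians, which is the generative route, and then recovers the discriminative quantity $p(y \mid x)$ via Bayes' rule:

```python
import math
import random

rng = random.Random(42)

# Toy 1-D data: class 0 centered at -2, class 1 centered at +2.
data = [(rng.gauss(-2, 1), 0) for _ in range(200)] + \
       [(rng.gauss(+2, 1), 1) for _ in range(200)]

# Generative approach: estimate p(x | y) for each class (here a
# Gaussian fitted by mean and standard deviation) plus the prior p(y).
def fit_gaussian(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

params = {y: fit_gaussian([x for x, label in data if label == y]) for y in (0, 1)}
prior = {y: sum(1 for _, label in data if label == y) / len(data) for y in (0, 1)}

def p_label_given_x(x):
    # Bayes' rule turns the generative components into p(y = 1 | x).
    joint = {y: gauss_pdf(x, *params[y]) * prior[y] for y in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

# The same fitted p(x | y) also supports generation: a new example of
# class 1 is simply a draw from its class-conditional Gaussian.
new_sample = rng.gauss(*params[1])
```

A purely discriminative model would learn $p(y \mid x)$ directly and would have no way to produce `new_sample`.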

Conditional Generative Modeling

Note

A generative model may also estimate the conditional probability $p(x \mid y)$ of observing data $x$ given label $y$.

For example, if the training dataset contained images of different kinds of trees, a generative model could be used to generate an image of a cypress tree. In that case, one speaks of conditional generative models.

This conditional perspective is precisely what makes systems such as text-to-image models possible. More generally, the conditioning variable need not be a class label. It may be a sentence, a segmentation mask, an audio signal, a molecular constraint, or another modality altogether.

Text-to-image generation is therefore best understood not as a separate problem from generative modeling, but as a particularly important instance of conditional generative modeling: the model learns to generate an image $x$ conditioned on a textual description $y$, that is, according to a distribution of the form $p(x \mid y)$.

Principal Families of Generative Models

Several major paradigms have shaped the modern history of generative modeling. The list below is not exhaustive, but it includes the families most relevant for understanding the path that leads to diffusion models.

Energy-Based Models (EBMs)

Energy-Based Models define probability distributions through an energy function $E_\theta(x)$ that assigns lower energy to more plausible samples and higher energy to less plausible ones. Formally, the model takes the form

$$
p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta},
$$

where $Z_\theta = \int e^{-E_\theta(x)} \, dx$ is the partition function [3].

Strengths

  • conceptually elegant and highly general;
  • closely connected to score-based learning, since gradients of log-density depend on energy differences rather than on absolute normalization;
  • foundational for understanding the score-based perspective on diffusion models.

Limitations

  • the partition function is generally intractable in high dimensions;
  • exact likelihood training is therefore difficult;
  • practical learning often requires alternative objectives or approximations.
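These points can be made concrete in one dimension, where the partition function is still computable by brute force (a sketch with an invented double-well energy; in high dimensions the integral below is precisely the intractable step):

```python
import math

# A hypothetical 1-D energy function: a double-well potential.
def energy(x):
    return (x ** 2 - 1) ** 2

# The partition function Z = integral of exp(-E(x)) dx, approximated
# here by brute-force numerical integration.
def partition(lo=-5.0, hi=5.0, n=10000):
    dx = (hi - lo) / n
    return sum(math.exp(-energy(lo + (i + 0.5) * dx)) * dx for i in range(n))

Z = partition()

def density(x):
    return math.exp(-energy(x)) / Z

# The score d/dx log p(x) = -E'(x) does not involve Z at all, which is
# why score-based methods can sidestep normalization entirely.
def score(x, eps=1e-5):
    return -(energy(x + eps) - energy(x - eps)) / (2 * eps)
```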

Generative Adversarial Networks (GANs)

Introduced by Goodfellow et al. [1], GANs formulate generation as a game between two neural networks: a generator that produces synthetic samples, and a discriminator that attempts to distinguish generated samples from real data.

Strengths

  • capable of producing very sharp and realistic images;
  • historically crucial in demonstrating the power of neural generative modeling at scale.

Limitations

  • training is often unstable;
  • mode collapse may occur;
  • standard GANs do not provide direct likelihood estimation.
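The adversarial game can be stated compactly as the minimax objective of the original formulation [1], where $G$ is the generator, $D$ the discriminator, and $p_z$ a simple prior over generator inputs:

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```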

Variational Autoencoders (VAEs)

Proposed by Kingma and Welling [2], VAEs combine latent-variable modeling with deep neural networks. Instead of learning only how to synthesize data, they also learn a structured latent space together with an approximate posterior distribution over latent variables.

Strengths

  • principled probabilistic framework;
  • tractable training through the Evidence Lower Bound (ELBO);
  • meaningful latent representations useful for interpolation and representation learning.

Limitations

  • generated samples are often blurrier than those of the strongest GANs or diffusion models;
  • the variational approximation may limit sample fidelity.
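The ELBO mentioned above can be written explicitly: with an approximate posterior (encoder) $q_\phi(z \mid x)$, a decoder $p_\theta(x \mid z)$, and a prior $p(z)$, the log-likelihood is lower-bounded as

```latex
\log p_\theta(x) \;\geq\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
- D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```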

Autoregressive Models

Autoregressive models generate data by factorizing the joint distribution into a product of conditional probabilities. For example, an image model over components $x_1, \dots, x_n$ may write

$$
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}),
$$

and then generate one component at a time according to that factorization. Representative examples include PixelRNN and PixelCNN-style models [4].

Strengths

  • exact likelihood evaluation;
  • conceptually clean probabilistic interpretation.

Limitations

  • sampling is inherently sequential and often slow;
  • scalability can become difficult for very high-dimensional outputs.
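The chain-rule factorization, and the sequential sampling it implies, can be sketched with a deliberately tiny model (a hypothetical first-order table over binary sequences, standing in for the neural conditionals of PixelRNN/PixelCNN):

```python
import math
import random

# A hypothetical autoregressive model over binary sequences of length 4.
# Each conditional p(x_i | x_{i-1}) is a first-order (Markov) table here;
# PixelRNN/PixelCNN-style models instead condition on all previous
# components with a neural network.
p_first = 0.5             # p(x_1 = 1)
cond = {0: 0.2, 1: 0.8}   # p(x_i = 1 | x_{i-1} = k)

def log_prob(seq):
    # Exact likelihood: the sum of log conditionals in the factorization.
    p = p_first if seq[0] == 1 else 1 - p_first
    total = math.log(p)
    for prev, cur in zip(seq, seq[1:]):
        p = cond[prev] if cur == 1 else 1 - cond[prev]
        total += math.log(p)
    return total

def sample(rng):
    # Sampling is inherently sequential: each component can be drawn
    # only after the previous ones are fixed.
    seq = [1 if rng.random() < p_first else 0]
    while len(seq) < 4:
        seq.append(1 if rng.random() < cond[seq[-1]] else 0)
    return seq

s = sample(random.Random(0))
```

The exact `log_prob` is what makes likelihood evaluation tractable; the loop in `sample` is what makes generation slow for high-dimensional outputs.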

Normalizing Flows

Normalizing flows model data through an invertible transformation of a simple base distribution. If $z \sim p_Z(z)$ and $x = f(z)$ with invertible $f$, then the data density can be computed exactly through the change-of-variables formula

$$
p_X(x) = p_Z\!\left(f^{-1}(x)\right) \left| \det \frac{\partial f^{-1}(x)}{\partial x} \right|.
$$

Representative examples include RealNVP and related flow-based models [5].

Strengths

  • exact likelihood;
  • exact latent-variable inference through invertibility;
  • elegant probabilistic structure.

Limitations

  • invertibility imposes architectural constraints;
  • achieving both flexibility and efficiency can be difficult in very large-scale settings.
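In one dimension the change-of-variables formula can be checked numerically (a sketch with an invented affine map; real flows such as RealNVP stack many learnable invertible layers):

```python
import math

# A minimal 1-D "flow": an invertible affine map x = f(z) = a*z + b,
# applied to a standard Gaussian base distribution.
a, b = 2.0, 1.0

def base_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def f(z):
    return a * z + b

def f_inv(x):
    return (x - b) / a

def flow_pdf(x):
    # Change of variables: p_X(x) = p_Z(f^{-1}(x)) * |d f^{-1}/dx|.
    return base_pdf(f_inv(x)) * abs(1.0 / a)

# Numerical sanity check: the transformed density integrates to 1,
# i.e. it is an exactly normalized likelihood.
total = sum(flow_pdf(-10 + 0.01 * i) * 0.01 for i in range(2000))
```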

Diffusion Models

Diffusion models, originally proposed in Sohl-Dickstein et al. [7] and later refined in Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. [8], generate data by learning to reverse a gradual noising process. Instead of synthesizing data in one step, they transform noise into structure through an iterative denoising trajectory.

Strengths

  • highly stable training;
  • strong sample quality;
  • natural compatibility with conditioning and guidance mechanisms.

Limitations

  • sampling is typically iterative and can therefore be computationally expensive;
  • the probabilistic formulation is richer, but often more technically demanding than simpler paradigms.
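The forward (noising) half of this process has a convenient closed form, sketched below in one dimension (a DDPM-style linear variance schedule; the constants are illustrative, and a real model would learn the reverse process with a neural network):

```python
import math
import random

rng = random.Random(0)

# Linear variance schedule beta_1..beta_T, as in DDPM-style setups.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
prod = 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bars.append(prod)  # cumulative product of (1 - beta_t)

def noise(x0, t):
    # Closed-form forward marginal:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    eps = rng.gauss(0, 1)
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1 - ab) * eps

# Early steps barely perturb the data; by the final step the signal
# coefficient sqrt(alpha_bar_T) is close to zero, so x_T is
# approximately pure Gaussian noise.
x0 = 3.0
x_early = noise(x0, 5)
x_late = noise(x0, T - 1)
```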

A Taxonomy of Density Modeling

It is useful to situate these model families within a broader taxonomy of how generative models represent, approximate, or avoid explicit density modeling.

Note

Strictly speaking, not all generative models rely on maximum likelihood. GANs, for instance, do not make use of the maximum likelihood method. Nevertheless, “by ignoring those models that do not use maximum likelihood, and by focusing on the maximum-likelihood version of models that do not usually employ this method (such as GANs), some of the most distracting differences among different models can be removed” (Goodfellow et al., 2016 [6]), thereby allowing for a more compact overall view of generative models.

Generative models that rely on maximum likelihood differ in the way they represent, or approximate, the probability distribution $p_\theta(x)$.

In the literature, three broad approaches to modeling may be identified:

  • explicitly modeling the probability distribution itself;
  • explicitly modeling a tractable approximation of the probability distribution;
  • implicitly modeling the data-generation mechanism without directly estimating the density.

At a higher level, this landscape can be summarized by the following compact map:

| Family type | Density access | Typical objective | Representative families |
| --- | --- | --- | --- |
| Explicit, tractable | Exact or directly computable likelihood | Maximum likelihood | Autoregressive Models, Normalizing Flows |
| Explicit, approximate | Likelihood bound or likelihood-based surrogate | ELBO, denoising or variational objectives | VAEs, Diffusion Models |
| Explicit, unnormalized | Density specified up to a partition function | Score matching or approximate likelihood methods | Energy-Based Models |
| Implicit | Sampling procedure without direct density evaluation | Adversarial or sample-based objectives | GANs |

The following figure provides a compact taxonomy:

This taxonomy is especially useful because it allows the main paradigms introduced above to be placed into a single conceptual map:

  • autoregressive models and normalizing flows are explicit density models with tractable likelihood;
  • VAEs and diffusion models are explicit density models trained through approximation, typically via variational objectives or closely related likelihood-based surrogates;
  • EBMs are explicit but unnormalized models, in which the difficulty lies in the intractability of the partition function;
  • GANs are implicit density models, since they focus on constructing a stochastic data-generation process rather than on direct density evaluation.

Implicit density models do not aim to estimate the probability density $p(x)$, but focus exclusively on the production of a stochastic process that directly generates the data.

Explicit density models may be further divided into:

  • those that directly optimize the probability distribution $p_\theta(x)$ (tractable models);
  • those that optimize an approximation of it.

Important

A thread running through all types of generative models is deep learning. Almost all of the most sophisticated generative models rely on neural networks.

The Emergence of Diffusion Models

Diffusion models have become central because they offer a compelling synthesis of properties that were historically difficult to obtain at the same time.

Compared with GANs, they avoid adversarial training and are therefore much more stable to optimize. Compared with VAEs, they generally achieve significantly stronger perceptual sample quality. Compared with autoregressive models, they do not require a one-component-at-a-time factorization of generation. Compared with flow-based models, they avoid the strong architectural constraints imposed by exact invertibility.

Diffusion models main advantages

Their modern success can be summarized through four major advantages:

  • Stability: no adversarial min-max game is required during training;
  • Likelihood-based training: the objective is grounded in a principled probabilistic formulation, typically through an ELBO or a closely related denoising objective;
  • Sample quality: the strongest diffusion models achieve visual quality comparable to or better than previous paradigms in many domains;
  • Scalability and conditioning: diffusion models adapt naturally to guidance, control signals, and multimodal conditioning.

Concretely, a diffusion model learns to denoise a sample iteratively, starting from pure noise, thereby approximating a reverse stochastic process that maps noise back to data [7][8].
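In its simplified form, the DDPM training objective of Ho et al. [8] is a denoising regression in which a network $\epsilon_\theta$ predicts the noise injected at step $t$ (with $\bar{\alpha}_t$ the cumulative signal coefficient of the forward process):

```latex
L_{\text{simple}}(\theta) =
\mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[
\left\| \epsilon - \epsilon_\theta\!\left(
\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t
\right) \right\|^2 \right]
```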

Diffusion models are especially important because they do not arise in isolation. They can be understood as inheriting and recombining several earlier lines of thought:

  • from VAEs, they inherit a variational and likelihood-based perspective;
  • from EBMs and score-based modeling, they inherit the idea that gradients of log-density can guide generation;
  • from flow-based models, they inherit the intuition that generation may be viewed as a structured transformation from a simple prior to the data distribution.

For this reason, diffusion models are not merely another entry in a list of generative architectures. They mark an important rebalancing of the field: away from the instability of adversarial training and toward a family of models that combines probabilistic rigor, practical robustness, and state-of-the-art empirical performance.

Applications and Scientific Impact

The practical impact of modern generative models, and especially of diffusion-based systems, now extends far beyond academic benchmarks.

| Domain | Use Case |
| --- | --- |
| Image generation | Stable Diffusion, Imagen, DALL·E 2 [9][10][11] |
| Inpainting/editing | Photoshop AI fill, restoration tools |
| Molecular design | Protein folding, molecule generation [12] |
| Medical imaging | MRI super-resolution, anomaly detection [13][14] |

These models are reshaping not only AI research, but also scientific practice, creative industries, design workflows, biomedical discovery, and the broader public understanding of what machine learning systems can produce.

📚 References

  1. Goodfellow et al., “Generative Adversarial Networks,” NeurIPS 2014.
  2. Kingma & Welling, “Auto-Encoding Variational Bayes,” ICLR 2014.
  3. LeCun, Chopra, Hadsell, Ranzato, and Huang, “A Tutorial on Energy-Based Learning,” 2006.
  4. van den Oord, Kalchbrenner, and Kavukcuoglu, “Pixel Recurrent Neural Networks,” ICML 2016.
  5. Dinh, Sohl-Dickstein, and Bengio, “Density Estimation Using Real NVP,” ICLR 2017.
  6. Goodfellow, Bengio, and Courville, Deep Learning, MIT Press, 2016.
  7. Sohl-Dickstein et al., “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” ICML 2015.
  8. Ho et al., “Denoising Diffusion Probabilistic Models,” NeurIPS 2020.
  9. Rombach et al., “High-Resolution Image Synthesis with Latent Diffusion Models,” CVPR 2022.
  10. Saharia et al., “Imagen: Photorealistic Text-to-Image Generation,” ICML 2022.
  11. Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents,” arXiv 2022.
  12. Hoogeboom et al., “Equivariant Diffusion for Molecule Generation in 3D,” ICML 2022.
  13. Wolleb et al., “Diffusion Models for Medical Anomaly Detection,” MICCAI 2022.
  14. Pinaya et al., “Brain Imaging Generation with Latent Diffusion Models,” NeuroImage 2022.