Generative AI
Generative AI studies models that learn to synthesize data: images, text, audio, speech, music, video, 3D structures, molecules, and other complex signals.
This page is the map: it outlines the model families and points to the focused database notes, where the detailed paper links live.
Boundary Note
Language-generation and LLM papers are kept in Transformers and State Space Models to avoid duplicating the NLP database. This section focuses on generative model families and media-generation systems.
Focused Databases
| Database | Scope |
|---|---|
| VAEs and Flows | Latent-variable models, variational inference, vector quantization, and exact-likelihood flows. |
| GANs | Adversarial objectives, stability, high-resolution image synthesis, and style-based generation. |
| Autoregressive and Tokenized Generation | Pixel models, visual tokenizers, VQGAN, DALL-E, MaskGIT, Parti, and masked/token-based image generation. |
| Diffusion Models | DDPM, score-based models, guidance, latent diffusion, diffusion transformers, and controllable diffusion. |
| Flow Matching and Fast Sampling | Rectified flow, flow matching, consistency models, distillation, and fast diffusion samplers. |
| Text-to-Image and Video Systems | GLIDE, DALL-E 2/3, Imagen, SDXL, Make-A-Video, Imagen Video, Stable Video Diffusion, Sora, and Sora 2. |
| Audio and Speech Generation | WaveNet, Jukebox, AudioLM, AudioGen, AudioLDM, MusicLM, Voicebox, and modern text-to-audio systems. |
Milestone Map
| Stage | Key Papers | Primary Database |
|---|---|---|
| Likelihood and latent variables | VAE, IWAE, VQ-VAE, NICE, Real NVP, Glow, FFJORD | VAEs and Flows |
| Adversarial image generation | GAN, DCGAN, WGAN, Progressive GAN, BigGAN, StyleGAN | GANs |
| Tokenized visual generation | PixelRNN, Image Transformer, VQGAN, DALL-E, MaskGIT, Parti | Autoregressive and Tokenized Generation |
| Diffusion era | DDPM, DDIM, Score SDE, Guided Diffusion, Classifier-Free Guidance, Latent Diffusion, DiT | Diffusion Models |
| Fast generative dynamics | DPM-Solver, Consistency Models, Rectified Flow, Flow Matching, Stable Diffusion 3 | Flow Matching and Fast Sampling |
| Multimodal systems | DALL-E 2/3, Imagen, SDXL, Make-A-Video, Imagen Video, Sora | Text-to-Image and Video Systems |
| Audio generation | WaveNet, Jukebox, AudioLM, AudioGen, MusicLM, AudioLDM, Voicebox | Audio and Speech Generation |
Suggested Paths
| Path | Read |
|---|---|
| Classical generative modeling | VAEs and Flows → GANs |
| Modern image generation | Autoregressive and Tokenized Generation → Diffusion Models → Flow Matching and Fast Sampling |
| Text-to-media systems | Text-to-Image and Video Systems → Audio and Speech Generation |
| Fast sampling | DDIM → DPM-Solver → Consistency Models → Rectified Flow → Flow Matching |
| Foundation-model view | CLIP-style conditioning → latent diffusion → diffusion transformers → video world simulators |
Reading Principle
For each family, separate four questions:
| Question | What to Track |
|---|---|
| Representation | Pixels, latents, discrete tokens, audio codes, video patches, or continuous states. |
| Objective | Likelihood, variational bound, adversarial game, denoising, score matching, flow matching, or next-token prediction. |
| Conditioning | Class labels, text, image prompts, masks, pose, depth, audio prompts, or video context. |
| Sampling | Autoregressive decoding, ancestral diffusion, ODE/SDE solvers, distillation, one-step generation, or iterative refinement. |
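One way to apply the four questions while reading is to keep a small structured record per model family and compare families along one axis at a time. The sketch below is only an illustration: the `FamilyProfile` class, the `group_by` helper, and the three example entries are assumptions made for this note, and the entries are coarse one-word summaries, not a substitute for the papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyProfile:
    """Reading notes for one model family (hypothetical structure)."""
    name: str
    representation: str
    objective: str
    conditioning: str
    sampling: str


# Illustrative, coarse summaries of the original unconditional formulations.
PROFILES = [
    FamilyProfile("GAN", "pixels", "adversarial game",
                  "none", "one-step generation"),
    FamilyProfile("PixelRNN", "pixels", "likelihood",
                  "none", "autoregressive decoding"),
    FamilyProfile("DDPM", "pixels", "denoising",
                  "none", "ancestral diffusion"),
]


def group_by(profiles, question):
    """Group family names by their answer to one of the four questions."""
    groups = {}
    for profile in profiles:
        answer = getattr(profile, question)
        groups.setdefault(answer, []).append(profile.name)
    return groups


if __name__ == "__main__":
    # Same representation, three different objectives and samplers.
    print(group_by(PROFILES, "representation"))
    print(group_by(PROFILES, "sampling"))
```

Grouping by one field at a time makes it easier to see that families sharing a representation can still differ in objective and sampling, which is the point of keeping the four questions separate.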