Generative AI

Generative AI studies models that learn to synthesize data: images, text, audio, speech, music, video, 3D structures, molecules, and other complex signals.

This page is the map. The detailed paper links live in focused database notes.

Boundary Note

Language-generation and LLM papers are kept in Transformers and State Space Models to avoid duplicating the NLP database. This section focuses on generative model families and media-generation systems.

Focused Databases

| Database | Scope |
| --- | --- |
| VAEs and Flows | Latent-variable models, variational inference, vector quantization, and exact-likelihood flows. |
| GANs | Adversarial objectives, stability, high-resolution image synthesis, and style-based generation. |
| Autoregressive and Tokenized Generation | Pixel models, visual tokenizers, VQGAN, DALL-E, MaskGIT, Parti, and masked/token-based image generation. |
| Diffusion Models | DDPM, score-based models, guidance, latent diffusion, diffusion transformers, and controllable diffusion. |
| Flow Matching and Fast Sampling | Rectified flow, flow matching, consistency models, distillation, and fast diffusion samplers. |
| Text-to-Image and Video Systems | GLIDE, DALL-E 2/3, Imagen, SDXL, Make-A-Video, Imagen Video, Stable Video Diffusion, Sora, and Sora 2. |
| Audio and Speech Generation | WaveNet, Jukebox, AudioLM, AudioGen, AudioLDM, MusicLM, Voicebox, and modern text-to-audio systems. |

Milestone Map

| Stage | Key Papers | Primary Database |
| --- | --- | --- |
| Likelihood and latent variables | VAE, IWAE, VQ-VAE, NICE, Real NVP, Glow, FFJORD | VAEs and Flows |
| Adversarial image generation | GAN, DCGAN, WGAN, Progressive GAN, BigGAN, StyleGAN | GANs |
| Tokenized visual generation | PixelRNN, Image Transformer, VQGAN, DALL-E, MaskGIT, Parti | Autoregressive and Tokenized Generation |
| Diffusion era | DDPM, DDIM, Score SDE, Guided Diffusion, Classifier-Free Guidance, Latent Diffusion, DiT | Diffusion Models |
| Fast generative dynamics | DPM-Solver, Consistency Models, Rectified Flow, Flow Matching, Stable Diffusion 3 | Flow Matching and Fast Sampling |
| Multimodal systems | DALL-E 2/3, Imagen, SDXL, Make-A-Video, Imagen Video, Sora | Text-to-Image and Video Systems |
| Audio generation | WaveNet, Jukebox, AudioLM, AudioGen, MusicLM, AudioLDM, Voicebox | Audio and Speech Generation |

Suggested Paths

| Path | Read |
| --- | --- |
| Classical generative modeling | VAEs and Flows, GANs |
| Modern image generation | Autoregressive and Tokenized Generation, Diffusion Models, Flow Matching and Fast Sampling |
| Text-to-media systems | Text-to-Image and Video Systems, Audio and Speech Generation |
| Fast sampling | DDIM, DPM-Solver, Consistency Models, Rectified Flow, Flow Matching |
| Foundation-model view | CLIP-style conditioning, latent diffusion, diffusion transformers, video world simulators |

Reading Principle

For each family, separate four questions:

| Question | What to Track |
| --- | --- |
| Representation | Pixels, latents, discrete tokens, audio codes, video patches, or continuous states. |
| Objective | Likelihood, variational bound, adversarial game, denoising, score matching, flow matching, or next-token prediction. |
| Conditioning | Class labels, text, image prompts, masks, pose, depth, audio prompts, or video context. |
| Sampling | Autoregressive decoding, ancestral diffusion, ODE/SDE solvers, distillation, one-step generation, or iterative refinement. |
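
One way to apply the four questions while reading is to keep a small structured note per paper and then group the notes by any one question. The sketch below is only an illustration: the `FamilyNote` class, the `group_by` helper, and the `NOTES` list are hypothetical names, not part of these notes. The three entries are coarse one-line summaries of the original GAN, DDPM, and DALL-E papers and should be checked against the papers themselves.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyNote:
    """Reading note for one paper: one short answer per question."""
    name: str
    representation: str
    objective: str
    conditioning: str
    sampling: str


# Illustrative entries only (assumed summaries, deliberately coarse).
NOTES = [
    FamilyNote("GAN", "pixels", "adversarial game",
               "none", "one-step generation"),
    FamilyNote("DDPM", "pixels", "denoising",
               "none", "ancestral diffusion"),
    FamilyNote("DALL-E", "discrete tokens", "next-token prediction",
               "text", "autoregressive decoding"),
]


def group_by(notes, question):
    """Map each answer to `question` onto the names of the papers giving it."""
    groups = defaultdict(list)
    for note in notes:
        groups[getattr(note, question)].append(note.name)
    return dict(groups)


if __name__ == "__main__":
    # Print one grouping per question to compare families side by side.
    for question in ("representation", "objective", "conditioning", "sampling"):
        print(question, group_by(NOTES, question))
```

Grouping by a single question makes the separation visible: in this sketch GAN and DDPM share a representation (pixels) while differing in both objective and sampling procedure.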