Generative AI

Generative AI studies models that learn to synthesize data: images, text, audio, speech, music, video, 3D structures, molecules, and other complex signals.

This page is the map. The detailed paper links live in focused database notes.

Boundary Note

Language-generation and LLM papers are filed under Transformers and State Space Models so that this section does not duplicate the NLP database. This section covers generative model families and media-generation systems.

Focused Databases

Database	Scope
VAEs and Flows	Latent-variable models, variational inference, vector quantization, and exact-likelihood flows.
GANs	Adversarial objectives, stability, high-resolution image synthesis, and style-based generation.
Autoregressive and Tokenized Generation	Pixel models, visual tokenizers, VQGAN, DALL-E, MaskGIT, Parti, and masked/token-based image generation.
Diffusion Models	DDPM, score-based models, guidance, latent diffusion, diffusion transformers, and controllable diffusion.
Flow Matching and Fast Sampling	Rectified flow, flow matching, consistency models, distillation, and fast diffusion samplers.
Text-to-Image and Video Systems	GLIDE, DALL-E 2/3, Imagen, SDXL, Make-A-Video, Imagen Video, Stable Video Diffusion, Sora, and Sora 2.
Audio and Speech Generation	WaveNet, Jukebox, AudioLM, AudioGen, AudioLDM, MusicLM, Voicebox, and modern text-to-audio systems.

Milestone Map

Stage	Key Papers	Primary Database
Likelihood and latent variables	VAE, IWAE, VQ-VAE, NICE, Real NVP, Glow, FFJORD	VAEs and Flows
Adversarial image generation	GAN, DCGAN, WGAN, Progressive GAN, BigGAN, StyleGAN	GANs
Tokenized visual generation	PixelRNN, Image Transformer, VQGAN, DALL-E, MaskGIT, Parti	Autoregressive and Tokenized Generation
Diffusion era	DDPM, DDIM, Score SDE, Guided Diffusion, Classifier-Free Guidance, Latent Diffusion, DiT	Diffusion Models
Fast generative dynamics	DPM-Solver, Consistency Models, Rectified Flow, Flow Matching, Stable Diffusion 3	Flow Matching and Fast Sampling
Multimodal systems	DALL-E 2/3, Imagen, SDXL, Make-A-Video, Imagen Video, Sora	Text-to-Image and Video Systems
Audio generation	WaveNet, Jukebox, AudioLM, AudioGen, MusicLM, AudioLDM, Voicebox	Audio and Speech Generation

Suggested Paths

Path	Read
Classical generative modeling	VAEs and Flows → GANs
Modern image generation	Autoregressive and Tokenized Generation → Diffusion Models → Flow Matching and Fast Sampling
Text-to-media systems	Text-to-Image and Video Systems → Audio and Speech Generation
Fast sampling	DDIM → DPM-Solver → Consistency Models → Rectified Flow → Flow Matching
Foundation-model view	CLIP-style conditioning → latent diffusion → diffusion transformers → video world simulators
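
To give the Fast sampling path some intuition, here is a minimal toy sketch of why straight transport paths (the idea behind rectified flow and linear-path flow matching) tolerate very few solver steps. Everything in it is an assumption made for illustration: the 1-D setting, the single `target` point, and the hand-written `velocity` function, which stands in for a learned network. It is not taken from any of the papers listed.

```python
# Toy illustration only: a hypothetical 1-D "model" whose velocity field
# transports any start point to a single target along a straight line.

def velocity(x: float, t: float, target: float = 2.0) -> float:
    """Velocity of the straight path x_t = (1 - t) * x0 + t * target."""
    return (target - x) / (1.0 - t)


def euler_sample(x0: float, num_steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        # Evaluate the velocity at the start of each step (t = k * dt < 1).
        x += dt * velocity(x, k * dt)
    return x


one_step = euler_sample(x0=-1.0, num_steps=1)
many_steps = euler_sample(x0=-1.0, num_steps=50)
```

Because the path in this toy is exactly straight, one Euler step and fifty Euler steps reach the same endpoint up to floating-point error. Real models learn curved trajectories, which is where the better solvers and distillation methods on this path come in.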

Reading Principle

For each family, separate four questions:

Question	What to Track
Representation	Pixels, latents, discrete tokens, audio codes, video patches, or continuous states.
Objective	Likelihood, variational bound, adversarial game, denoising, score matching, flow matching, or next-token prediction.
Conditioning	Class labels, text, image prompts, masks, pose, depth, audio prompts, or video context.
Sampling	Autoregressive decoding, ancestral diffusion, ODE/SDE solvers, distillation, one-step generation, or iterative refinement.
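
As a rough reference for the Objective row, several of these losses can be written in their common textbook forms. The notation is assumed here for illustration and is not taken from the focused databases: $q_\phi$ and $p_\theta$ are an encoder and decoder, $G$ and $D$ a generator and discriminator, $\bar\alpha_t$ a cumulative noise schedule, $\epsilon_\theta$ a noise-prediction network, and $v_\theta$ a velocity network.

```latex
\begin{align*}
\text{Variational bound (VAE):}\quad
  & \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
    - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right) \\
\text{Adversarial game (GAN):}\quad
  & \min_G \max_D \; \mathbb{E}_{x}\!\left[\log D(x)\right]
    + \mathbb{E}_{z}\!\left[\log\!\left(1 - D(G(z))\right)\right] \\
\text{Denoising (DDPM):}\quad
  & \mathbb{E}_{t, x_0, \epsilon}\,
    \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0
    + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t\right) \right\|^2 \\
\text{Flow matching (linear path):}\quad
  & \mathbb{E}_{t, x_0, x_1}\,
    \left\| v_\theta\!\left((1 - t)\, x_0 + t\, x_1,\; t\right) - (x_1 - x_0) \right\|^2 \\
\text{Next-token prediction:}\quad
  & -\sum_{i} \log p_\theta\!\left(x_i \mid x_{<i}\right)
\end{align*}
```

Note that the symbols follow each family's usual convention: in the denoising line $x_0$ is a data sample, while in the flow matching line $x_0$ is a noise sample and $x_1$ is the data sample.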
