Text-to-Image and Video Systems

This database covers large generative systems and how they combine text encoders, image and video latents, diffusion or autoregressive decoding, recaptioning, scaling, and multimodal conditioning.

Text-to-Image Systems

Year	Paper	Topic	Note
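2021	GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (PMLR)	Text-guided diffusion	Text-conditional diffusion with classifier-free guidance and text-driven inpainting for editing.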
2022	Hierarchical Text-Conditional Image Generation with CLIP Latents (OpenAI)	DALL-E 2 / unCLIP	A prior maps the caption to a CLIP image embedding; diffusion decoders turn that embedding into an image.
2022	Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding	Imagen	Frozen T5-XXL text encoder plus a cascade of base and super-resolution diffusion models.
2023	Improving Image Generation with Better Captions	DALL-E 3	Training on detailed synthetic captions (recaptioning) improves prompt following in text-to-image systems.
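2023	SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis	Stable Diffusion XL	Larger latent diffusion model with a second text encoder, size and crop conditioning, and a separate refinement model.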
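
Classifier-free guidance, which the GLIDE entry above names, can be illustrated with a minimal sketch. At each sampling step the model's unconditional noise prediction is pushed toward the text-conditional one by a guidance scale. The function name, the flat-list representation of the predictions, and the example values below are assumptions made for illustration; real systems apply the same formula elementwise to tensors.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine two noise predictions: eps_uncond + scale * (eps_cond - eps_uncond).

    eps_uncond: prediction with the text prompt dropped (hypothetical values).
    eps_cond:   prediction conditioned on the text prompt (hypothetical values).
    scale:      guidance scale; 1.0 returns eps_cond, larger values push
                further in the direction the text conditioning suggests.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions standing in for model outputs (assumed, not from any paper).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, scale=3.0)
print(guided)
```

A scale of 1.0 reduces to ordinary conditional sampling, and a scale of 0.0 ignores the prompt. Higher scales generally trade sample diversity for closer prompt adherence.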

Text-to-Video and Video Foundation Systems

Year	Paper	Topic	Note
2022	Video Diffusion Models	Video diffusion	Extends image diffusion architectures to video generation.
2022	Make-A-Video	Text-to-video	Learns appearance from paired text-image data and motion from unlabeled video data.
2022	Imagen Video	Video diffusion cascade	High-definition text-to-video through cascaded video diffusion models.
2023	Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets	Latent video diffusion	Scales video latent diffusion through curated data and staged training.
2024	Video Generation Models as World Simulators	Sora technical report	Spacetime latent patches and diffusion transformers for long video generation.
2025	Sora 2 is here	Sora 2 research release	Video and audio generation system with improved control and physical consistency.
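
The spacetime-patch idea in the Sora entry above can be made concrete with a little arithmetic. A latent video of shape frames x height x width is cut into non-overlapping blocks, and each block becomes one transformer token. The sketch below only counts tokens. The latent and patch sizes are hypothetical, since the Sora technical report does not publish them, and the function name is an assumption.

```python
def spacetime_patch_count(frames, height, width, patch_t, patch_h, patch_w):
    """Number of tokens when a latent video is split into spacetime patches.

    All sizes are in latent units (after any video compression), not pixels.
    Each dimension must divide evenly by its patch size in this sketch.
    """
    for size, patch in ((frames, patch_t), (height, patch_h), (width, patch_w)):
        if patch <= 0 or size % patch != 0:
            raise ValueError("each dimension must be a positive multiple of its patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


# Hypothetical latent video: 16 frames of 32 x 32, cut into 2 x 2 x 2 patches.
video_tokens = spacetime_patch_count(16, 32, 32, patch_t=2, patch_h=2, patch_w=2)

# A still image treated as a one-frame video, with a temporal patch size of 1.
image_tokens = spacetime_patch_count(1, 32, 32, patch_t=1, patch_h=2, patch_w=2)

print(video_tokens, image_tokens)
```

Because the token count scales with duration and resolution, the same transformer can in principle handle clips of different lengths and aspect ratios by processing sequences of different lengths.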

Cross-Database Pointers

Theme	Go To	Note
Core diffusion math	Diffusion Models	DDPM, score SDEs, guidance, latent diffusion, DiT, and ControlNet live there.
Rectified-flow systems	Flow Matching and Fast Sampling	Stable Diffusion 3 (MM-DiT) is filed there with rectified flow and flow matching.
Autoregressive and token-based generation	Autoregressive and Tokenized Generation	DALL-E, Parti, MaskGIT, Muse, and VQGAN-style token models live there.

Reading Path

Step	Read
1	GLIDE, DALL-E 2, and Imagen for text-to-image diffusion systems.
2	DALL-E 3 and SDXL for prompt fidelity, recaptioning, and high-resolution latent diffusion.
3	Video Diffusion Models, Make-A-Video, and Imagen Video for early text-to-video systems.
4	Stable Video Diffusion and Sora for scalable video generation.
5	Sora 2 for the video-plus-audio system direction.