Text-to-Image and Video Systems

This database focuses on large generative systems: how models combine text encoders, image/video latents, diffusion or autoregressive decoding, recaptioning, scaling, and multimodal conditioning.

Text-to-Image Systems

Year | Paper | Topic | Note
2021 | GLIDE (PMLR) | Text-guided diffusion | Text-conditional diffusion with classifier-free guidance and editing.
2022 | Hierarchical Text-Conditional Image Generation with CLIP Latents (OpenAI) | DALL-E 2 / unCLIP | Prior over CLIP image latents plus diffusion decoders.
2022 | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | Imagen | Frozen large language model encoder plus diffusion cascade.
2023 | Improving Image Generation with Better Captions | DALL-E 3 | Recaptioning improves prompt following in text-to-image systems.
2023 | SDXL | Stable Diffusion XL | Larger latent diffusion model with richer conditioning and refinement.
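
The classifier-free guidance mentioned in the GLIDE entry can be illustrated with a minimal sketch. It assumes the common formulation in which the sampler extrapolates from an unconditional noise prediction toward a text-conditional one; the function name, the toy vectors, and the scale value are hypothetical and not taken from any listed paper's code.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Blend two noise predictions elementwise.

    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional prediction, and scale > 1 pushes further toward the prompt.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins for the two model outputs at one denoising step (hypothetical values).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, 3.0)
print(guided)
```

In a real system both predictions would come from the same network, evaluated once with the prompt and once with an empty prompt; larger scales typically trade sample diversity for prompt adherence.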

Text-to-Video and Video Foundation Systems

Year | Paper | Topic | Note
2022 | Video Diffusion Models | Video diffusion | Extends image diffusion architectures to video generation.
2022 | Make-A-Video | Text-to-video | Learns appearance from text-image data and motion from video data.
2022 | Imagen Video | Video diffusion cascade | High-definition text-to-video through cascaded video diffusion models.
2023 | Stable Video Diffusion | Latent video diffusion | Scales video latent diffusion through curated data and staged training.
2024 | Video Generation Models as World Simulators | Sora technical report | Spacetime latent patches and diffusion transformers for long video generation.
2025 | Sora 2 is here | Sora 2 research release | Video and audio generation system with improved control and physical consistency.
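
To make the "spacetime latent patches" idea in the Sora entry concrete, here is a rough sketch of how a latent video might be cut into transformer tokens. The latent shape, patch sizes, and channel count below are illustrative assumptions only; the Sora technical report does not publish these values, and the function name is hypothetical.

```python
def count_spacetime_patches(frames, height, width, pt, ph, pw):
    """Number of non-overlapping (pt x ph x pw) patches in a latent video.

    Each patch spans pt latent frames and a ph x pw spatial window, and
    becomes one token in the transformer's input sequence.
    """
    if frames % pt or height % ph or width % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    return (frames // pt) * (height // ph) * (width // pw)


# Hypothetical latent video: 16 frames of 32 x 32 latents with 4 channels.
num_tokens = count_spacetime_patches(16, 32, 32, pt=2, ph=2, pw=2)
token_dim = 2 * 2 * 2 * 4  # values flattened into each token before projection

print(num_tokens, token_dim)
```

Under these assumptions the sequence length grows linearly with clip duration and with spatial area, which is one reason such systems compress video into a latent space before patching.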

Cross-Database Pointers

Theme | Go To | Note
Core diffusion math | Diffusion Models | DDPM, score SDEs, guidance, latent diffusion, DiT, and ControlNet live there.
Rectified-flow systems | Flow Matching and Fast Sampling | Stable Diffusion 3 / MM-DiT is kept with rectified flow and flow matching.
Autoregressive text-to-image | Autoregressive and Tokenized Generation | DALL-E, Parti, MaskGIT, Muse, and VQGAN-style token models live there.

Reading Path

Step | Read
1 | GLIDE, DALL-E 2, and Imagen for text-to-image diffusion systems.
2 | DALL-E 3 and SDXL for prompt fidelity, recaptioning, and high-resolution latent diffusion.
3 | Video Diffusion Models, Make-A-Video, and Imagen Video for early text-to-video systems.
4 | Stable Video Diffusion and Sora for scalable video generation.
5 | Sora 2 for the video-plus-audio system direction.