Text-to-Image and Video Systems
This database focuses on large generative systems: how models combine text encoders, image/video latents, diffusion or autoregressive decoding, recaptioning, scaling, and multimodal conditioning.
Text-to-Image Systems
Text-to-Video and Video Foundation Systems
Year | Paper | Topic | Note
2022 | Video Diffusion Models | Video diffusion | Extends image diffusion architectures to video generation.
2022 | Make-A-Video | Text-to-video | Learns appearance from text-image data and motion from video data.
2022 | Imagen Video | Video diffusion cascade | High-definition text-to-video through cascaded video diffusion models.
2023 | Stable Video Diffusion | Latent video diffusion | Scales video latent diffusion through curated data and staged training.
2024 | Video Generation Models as World Simulators | Sora technical report | Spacetime latent patches and diffusion transformers for long video generation.
2025 | Sora 2 is here | Sora 2 research release | Video and audio generation system with improved control and physical consistency.
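The "spacetime latent patches" note can be made concrete with a little arithmetic: a video is first compressed into a smaller latent grid, and that grid is then cut into patches that serve as transformer tokens. The sketch below only counts those tokens. The compression factors, patch sizes, and the names `PatchConfig` and `spacetime_token_count` are illustrative assumptions, not settings published for any system listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchConfig:
    # Every value here is a hypothetical choice for illustration only.
    temporal_downsample: int = 4  # input frames merged into one latent frame
    spatial_downsample: int = 8   # input pixels merged into one latent cell, per axis
    patch_t: int = 1              # latent frames covered by one patch
    patch_h: int = 2              # latent rows covered by one patch
    patch_w: int = 2              # latent columns covered by one patch


def spacetime_token_count(frames: int, height: int, width: int,
                          cfg: PatchConfig = PatchConfig()) -> int:
    """Count the tokens one video yields under the assumed configuration."""
    if (frames % cfg.temporal_downsample
            or height % cfg.spatial_downsample
            or width % cfg.spatial_downsample):
        raise ValueError("video size must be divisible by the downsample factors")

    # Step 1: size of the compressed latent grid.
    latent_t = frames // cfg.temporal_downsample
    latent_h = height // cfg.spatial_downsample
    latent_w = width // cfg.spatial_downsample

    if (latent_t % cfg.patch_t
            or latent_h % cfg.patch_h
            or latent_w % cfg.patch_w):
        raise ValueError("latent grid must be divisible by the patch size")

    # Step 2: number of non-overlapping spacetime patches in that grid.
    return ((latent_t // cfg.patch_t)
            * (latent_h // cfg.patch_h)
            * (latent_w // cfg.patch_w))


if __name__ == "__main__":
    print(spacetime_token_count(16, 256, 256))  # 4 * 16 * 16 = 1024 tokens
    print(spacetime_token_count(32, 256, 256))  # 8 * 16 * 16 = 2048 tokens
```

Under these assumed settings the token count grows linearly with clip length and with pixel area, which is one reason long, high-resolution video generation is costly for a transformer whose attention cost grows faster than linearly in the number of tokens.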
Cross-Database Pointers
Reading Path
Step | Read
1 | GLIDE, DALL-E 2, and Imagen for text-to-image diffusion systems.
2 | DALL-E 3 and SDXL for prompt fidelity, recaptioning, and high-resolution latent diffusion.
3 | Video Diffusion Models, Make-A-Video, and Imagen Video for early text-to-video systems.
4 | Stable Video Diffusion and Sora for scalable video generation.
5 | Sora 2 for the video-plus-audio system direction.