Autoregressive and Tokenized Generation
Before diffusion became dominant, many high-quality generative systems treated images or audio as sequences. This lineage matters because modern multimodal generation still often relies on tokenization, discrete codes, masked decoding, and Transformer-style sequence modeling.
Pixel and Token Autoregressive Models
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2016 | Pixel Recurrent Neural Networks | PixelRNN / PixelCNN | Autoregressive density model over image pixels. |
| 2017 | PixelCNN++ | Pixel likelihood | Improved PixelCNN with discretized logistic mixture likelihoods. |
| 2018 | Image Transformer | Image generation | Self-attention for autoregressive image modeling. |
| 2020 | Generative Pretraining from Pixels | iGPT | GPT-style autoregressive pretraining on image pixels. |
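
The models in this table share one idea: factor the joint distribution over pixels into a product of per-pixel conditionals and generate one pixel at a time. The following is a minimal sketch of that raster-order loop, not any paper's implementation. `sample_raster`, `cond_probs`, and `toy_cond_probs` are hypothetical names, and the toy conditional stands in for a trained network such as a PixelCNN or a Transformer.

```python
import numpy as np

def sample_raster(cond_probs, height, width, num_levels, rng):
    """Sample an image one pixel at a time in raster order.

    cond_probs(prefix) is a hypothetical stand-in for a trained model: it
    takes the 1-D array of pixels generated so far and returns a probability
    vector over the num_levels possible values of the next pixel.
    """
    pixels = np.zeros(height * width, dtype=np.int64)
    log_prob = 0.0
    for i in range(height * width):
        p = cond_probs(pixels[:i])
        pixels[i] = rng.choice(num_levels, p=p)
        # Chain rule: the joint log-likelihood is the sum of conditional log-probs.
        log_prob += np.log(p[pixels[i]])
    return pixels.reshape(height, width), log_prob

def toy_cond_probs(prefix, num_levels=4):
    """Toy conditional (assumption): favors repeating the previous pixel value."""
    p = np.full(num_levels, 0.1)
    prev = prefix[-1] if len(prefix) else 0
    p[prev] = 0.7
    return p

rng = np.random.default_rng(0)
image, log_prob = sample_raster(toy_cond_probs, 4, 4, 4, rng)
```

Because every pixel depends on all earlier ones, sampling needs one model call per pixel. That cost is part of what motivates the shorter token sequences used by the later models.
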
Discrete Visual Tokens and Text-to-Image Transformers
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2020 | Taming Transformers for High-Resolution Image Synthesis | VQGAN + Transformer | Learns a discrete visual codebook, then models the resulting token sequence with a Transformer. |
| 2021 | Zero-Shot Text-to-Image Generation | DALL-E | Autoregressive Transformer over text and image tokens. |
| 2022 | MaskGIT: Masked Generative Image Transformer | Masked token generation | Predicts masked tokens in parallel over a few refinement steps instead of decoding in raster order. |
| 2022 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | Parti | Treats text-to-image as sequence-to-sequence token generation. |
| 2023 | Muse: Text-To-Image Generation via Masked Generative Transformers | Masked visual tokens | Fast masked-token text-to-image generation. |
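
Two mechanisms recur in this table: mapping continuous features to discrete codebook indices, and filling in masked tokens over a few parallel steps. The sketch below illustrates both under simplifying assumptions. `quantize`, `masked_decode`, `predict_logits`, and `toy_logits` are hypothetical names. The decoder uses greedy argmax and a linear unmasking schedule for simplicity, which may differ from the sampling and schedules used in the published methods.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (n, d) array of encoder outputs; codebook: (k, d) array.
    """
    # Squared Euclidean distance between every feature and every code.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def masked_decode(predict_logits, num_tokens, num_steps):
    """Fill in all tokens over a few parallel refinement steps.

    predict_logits(tokens, known) is a hypothetical stand-in for a trained
    bidirectional Transformer; it returns (num_tokens, vocab) logits.
    """
    tokens = np.zeros(num_tokens, dtype=np.int64)
    known = np.zeros(num_tokens, dtype=bool)
    for step in range(1, num_steps + 1):
        logits = predict_logits(tokens, known)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        guess = probs.argmax(axis=1)   # greedy choice (assumption)
        conf = probs.max(axis=1)
        conf[known] = -np.inf          # rank only positions that are still masked
        # Linear schedule (assumption): reveal an equal share of tokens per step.
        target_known = int(np.ceil(num_tokens * step / num_steps))
        n_new = target_known - int(known.sum())
        pick = np.argsort(-conf)[:n_new]  # most confident masked positions
        tokens[pick] = guess[pick]
        known[pick] = True
    return tokens

def toy_logits(tokens, known, vocab=8):
    """Toy predictor (assumption): prefers token (position mod vocab)."""
    n = len(tokens)
    logits = np.zeros((n, vocab))
    logits[np.arange(n), np.arange(n) % vocab] = 2.0 + 0.1 * np.arange(n)
    return logits

codes = quantize(np.array([[0.1, -0.1], [0.9, 1.2]]),
                 np.array([[0.0, 0.0], [1.0, 1.0]]))
decoded = masked_decode(toy_logits, num_tokens=16, num_steps=4)
```

With 16 tokens and 4 steps, the loop calls the predictor 4 times rather than 16, which is the practical appeal of masked decoding over raster-order generation.
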
Reading Path
| Step | Read |
|---|---|
| 1 | PixelRNN/PixelCNN and PixelCNN++ for exact autoregressive image likelihoods. |
| 2 | Image Transformer and iGPT for Transformer-based image generation. |
| 3 | VQGAN and DALL-E for discrete tokenized text-to-image generation. |
| 4 | MaskGIT, Parti, and Muse for later token-based alternatives to pure diffusion. |