Autoregressive and Tokenized Generation

Before diffusion became dominant, many high-quality generative systems treated images or audio as sequences. This lineage matters because modern multimodal generation still often relies on tokenization, discrete codes, masked decoding, and Transformer-style sequence modeling.

Pixel and Token Autoregressive Models

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2016 | Pixel Recurrent Neural Networks | PixelRNN / PixelCNN | Autoregressive density model over image pixels. |
| 2017 | PixelCNN++ | Pixel likelihood | Improved PixelCNN with discretized logistic mixture likelihoods. |
| 2018 | Image Transformer | Image generation | Self-attention for autoregressive image modeling. |
| 2020 | Generative Pretraining from Pixels | iGPT | GPT-style autoregressive pretraining on image pixels. |
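
The models in this table share one idea: the joint distribution over pixels is factorized by the chain rule, p(x) = p(x_1) p(x_2 | x_1) ... p(x_n | x_1, ..., x_{n-1}), and an image is sampled one pixel at a time in a fixed order. The following is a minimal sketch of raster-order sampling, not any paper's implementation. The names `sample_raster` and `toy_cond_probs` are hypothetical, and the toy conditional merely stands in for a trained network such as a PixelCNN or a Transformer.

```python
import numpy as np


def sample_raster(cond_probs, height, width, num_levels, rng):
    """Sample an image pixel by pixel in raster order.

    cond_probs(prefix) is a hypothetical stand-in for a trained model: it
    receives the 1-D array of pixels sampled so far and returns a
    probability vector of length num_levels for the next pixel.
    Returns the sampled image and its exact log-likelihood under the model.
    """
    pixels = np.zeros(height * width, dtype=np.int64)
    log_prob = 0.0
    for i in range(height * width):
        p = cond_probs(pixels[:i])           # p(x_i | x_1, ..., x_{i-1})
        pixels[i] = rng.choice(num_levels, p=p)
        log_prob += np.log(p[pixels[i]])     # chain rule: sum of log-conditionals
    return pixels.reshape(height, width), log_prob


def toy_cond_probs(prefix, num_levels=4):
    """Toy conditional (assumption): favours repeating the previous pixel."""
    weights = np.ones(num_levels)
    prev = prefix[-1] if len(prefix) > 0 else 0
    weights[prev] += 6.0
    return weights / weights.sum()


image, log_prob = sample_raster(toy_cond_probs, 4, 4, 4, np.random.default_rng(0))
```

Because every conditional is normalized, the summed log-probabilities give an exact likelihood, which is what makes these models usable as density estimators. The cost is that sampling needs one model evaluation per pixel.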

Discrete Visual Tokens and Text-to-Image Transformers

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2020 | Taming Transformers for High-Resolution Image Synthesis | VQGAN + Transformer | Learns a visual codebook, then models the resulting tokens with a Transformer. |
| 2021 | Zero-Shot Text-to-Image Generation | DALL-E | Autoregressive Transformer over text and image tokens. |
| 2022 | MaskGIT: Masked Generative Image Transformer | Masked token generation | Parallel iterative token refinement instead of raster autoregression. |
| 2022 | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | Parti | Treats text-to-image as sequence-to-sequence token generation. |
| 2023 | Muse: Text-To-Image Generation via Masked Generative Transformers | Masked visual tokens | Fast masked-token text-to-image generation. |
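
To make the contrast between raster autoregression and masked decoding concrete, here is a minimal sketch of parallel iterative token refinement in the spirit of MaskGIT. It is a simplification under stated assumptions, not the paper's implementation: tokens are chosen greedily rather than sampled, the cosine unmasking schedule is written in one possible form, and `masked_decode`, `toy_predict_probs`, and `MASK` are hypothetical names. The toy predictor ignores its input and only stands in for a trained masked Transformer.

```python
import numpy as np

MASK = -1  # sentinel id for a token position that is not yet decoded


def masked_decode(predict_probs, num_tokens, num_steps):
    """Fill all token positions in at most num_steps parallel rounds.

    predict_probs(tokens) is a hypothetical stand-in for a trained model:
    given the current tokens (MASK where unknown) it returns an array of
    shape (num_tokens, vocab_size) with per-position probabilities.
    """
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(1, num_steps + 1):
        num_masked = int((tokens == MASK).sum())
        if num_masked == 0:
            break
        probs = predict_probs(tokens)
        best = probs.argmax(axis=1)        # greedy choice per position
        conf = probs.max(axis=1)           # confidence of that choice
        conf[tokens != MASK] = -np.inf     # never revisit committed tokens
        # Cosine schedule: how many positions remain masked after this step.
        keep_masked = int(np.floor(num_tokens * np.cos(np.pi / 2 * step / num_steps)))
        num_commit = max(1, num_masked - keep_masked)
        commit = np.argsort(-conf)[:num_commit]   # most confident first
        tokens[commit] = best[commit]
    return tokens


def toy_predict_probs(tokens, vocab_size=8):
    """Toy predictor (assumption): fixed random softmax, ignores its input."""
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(len(tokens), vocab_size))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


decoded = masked_decode(toy_predict_probs, num_tokens=16, num_steps=4)
```

With 16 tokens and 4 steps, this schedule commits 2, 3, 5, and then 6 positions, so the sequence is completed in 4 model evaluations instead of 16. In a full pipeline the decoded ids would be mapped back to pixels by the tokenizer's decoder, for example a VQGAN decoder.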

Reading Path

| Step | Read |
|------|------|
| 1 | PixelRNN/PixelCNN and PixelCNN++ for exact autoregressive image likelihoods. |
| 2 | Image Transformer and iGPT for Transformer-based image generation. |
| 3 | VQGAN and DALL-E for discrete tokenized text-to-image generation. |
| 4 | MaskGIT, Parti, and Muse for later token-based alternatives to pure diffusion. |