Transformers

This is the focused Transformer database for Literature. It is intentionally table-based: each row is a link to the original paper plus a small note. Longer explanations should live in separate conceptual notes.

Core Architecture

Year | Paper | Topic | Note
2017 | Attention Is All You Need | Transformer | Original self-attention architecture.
2019 | Transformer-XL | Recurrence / context | Extends context beyond fixed windows.
2019 | Fast Transformer Decoding: One Write-Head is All You Need | Multi-query attention | Reduces decoding memory bandwidth.
2020 | GLU Variants Improve Transformer | Feed-forward layers | Gated FFN variants used in many modern LLMs.

Encoder, Decoder, and Encoder-Decoder Pre-Training

Year | Paper | Topic | Note
2018 | Improving Language Understanding by Generative Pre-Training | GPT-1 | Autoregressive pre-training plus task adaptation.
2018 | BERT | Bidirectional encoder | Masked-language-model pre-training.
2019 | XLNet | Permutation LM | Autoregressive pre-training with bidirectional context.
2019 | RoBERTa | BERT optimization | Shows training recipe matters as much as architecture.
2019 | ALBERT | Parameter sharing | Lighter BERT-style pre-training.
2019 | T5 | Text-to-text | Unified text-to-text transfer learning.
2019 | BART | Denoising seq2seq | Combines bidirectional encoder and autoregressive decoder.
2020 | ELECTRA | Replaced-token detection | More sample-efficient encoder pre-training.
2020 | DeBERTa | Disentangled attention | Separates content and position attention.

Decoder-Only LLMs and Scaling

Year | Paper | Topic | Note
2019 | Language Models are Unsupervised Multitask Learners (code) | GPT-2 | Zero-shot behavior from next-token prediction.
2020 | Scaling Laws for Neural Language Models | Scaling laws | Loss vs model size, data, and compute.
2020 | Language Models are Few-Shot Learners | GPT-3 | In-context learning at large scale.
2022 | Training Compute-Optimal Large Language Models | Chinchilla | Compute-optimal balance between parameters and tokens.
2022 | PaLM | Large-scale LLM | 540B-parameter Transformer with Pathways.
2023 | LLaMA | Open foundation models | Efficient public-data LLM family.

Efficient Attention and Long Context

Year | Paper | Topic | Note
2020 | Reformer | Efficient attention | LSH attention and reversible residual layers.
2020 | Longformer | Long documents | Sliding-window plus global attention.
2020 | Linformer | Linear attention | Low-rank approximation of self-attention.
2020 | Big Bird | Sparse attention | Local, random, and global sparse attention.
2022 | FlashAttention | Efficient exact attention | IO-aware attention for speed and memory.
2023 | FlashAttention-2 | GPU attention kernel | Better parallelism and work partitioning.

Sparse and Mixture-of-Experts Transformers

Year | Paper | Topic | Note
2020 | GShard | MoE scaling | Conditional computation and automatic sharding.
2021 | Switch Transformers | Sparse MoE | Simple routing for trillion-parameter sparse models.

Adaptation, Prompting, and Alignment

Year | Paper | Topic | Note
2021 | Prefix-Tuning | Parameter-efficient tuning | Learns continuous prefixes while freezing the LM.
2021 | LoRA | Low-rank adaptation | Injects trainable low-rank matrices into Transformer layers.
2021 | Finetuned Language Models Are Zero-Shot Learners | Instruction tuning | FLAN-style instruction tuning for zero-shot behavior.
2022 | Training Language Models to Follow Instructions with Human Feedback | RLHF / InstructGPT | Human feedback for instruction-following behavior.

Cross-Domain Pointers

Direction | Go To | Note
Vision Transformers | Vision Transformers and Foundation Models | ViT, DeiT, Swin, CLIP, ALIGN, OWL-ViT, Grounding DINO, and related visual foundation models live in the Computer Vision database.
State-space alternatives | State Space Models | S4, Mamba, Mamba-2, and Mamba-3 live in the sequence-modeling branch.

Reading Path

Step | Read
1 | Attention Is All You Need.
2 | BERT, GPT-1, GPT-2, T5, and BART.
3 | Transformer-XL, Longformer, Big Bird, and FlashAttention.
4 | Scaling Laws, GPT-3, Chinchilla, PaLM, and LLaMA.
5 | Prefix-Tuning, LoRA, FLAN, and InstructGPT.
6 | Vision Transformers and Foundation Models for the cross-domain vision branch.