Transformers
This is the Transformer-focused database for Literature. It is intentionally table-based: each row links to the original paper and carries a short note. Longer explanations belong in separate conceptual notes.
Core Architecture
| Year | Paper | Topic | Note |
|---|---|---|---|
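| 2017 | Attention Is All You Need | Transformer | Introduces the Transformer, an encoder-decoder architecture built entirely on attention, with no recurrence or convolution. |
| 2019 | Transformer-XL | Recurrence / context | Segment-level recurrence and relative positional encodings extend context beyond a fixed-length window. |
| 2019 | Fast Transformer Decoding: One Write-Head is All You Need | Multi-query attention | Shares a single key/value head across all query heads, reducing memory bandwidth during incremental decoding. |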
| 2020 | GLU Variants Improve Transformer | Feed-forward layers | Gated FFN variants used in many modern LLMs. |
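
As a quick reference for the attention operation that the papers above build on, here is a minimal sketch of single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. It assumes plain Python lists and omits masking and batching; the function names are illustrative and not taken from any of the papers' code.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of lists.

    queries: n x d_k, keys: m x d_k, values: m x d_v.
    Returns an n x d_v list of attention-weighted value averages.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs
```

When all keys are identical, every score is equal, so the output reduces to the plain average of the value vectors. Multi-query attention, in this picture, would reuse one `keys`/`values` pair across several sets of `queries`.
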
Encoder and Encoder-Decoder Pre-Training
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2018 | Improving Language Understanding by Generative Pre-Training | GPT-1 | Autoregressive pre-training plus task adaptation. |
| 2018 | BERT | Bidirectional encoder | Masked-language-model pre-training. |
| 2019 | XLNet | Permutation LM | Autoregressive pre-training with bidirectional context. |
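| 2019 | RoBERTa | BERT optimization | Shows BERT was undertrained: more data, longer training, larger batches, and dynamic masking improve results without architectural changes. |
| 2019 | ALBERT | Parameter sharing | Cross-layer parameter sharing and factorized embeddings reduce BERT's parameter count. |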
| 2019 | T5 | Text-to-text | Unified text-to-text transfer learning. |
| 2019 | BART | Denoising seq2seq | Combines bidirectional encoder and autoregressive decoder. |
| 2020 | ELECTRA | Replaced-token detection | More sample-efficient encoder pre-training. |
| 2020 | DeBERTa | Disentangled attention | Separates content and position attention. |
Decoder-Only LLMs and Scaling
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2019 | Language Models are Unsupervised Multitask Learners (code) | GPT-2 | Zero-shot behavior from next-token prediction. |
| 2020 | Scaling Laws for Neural Language Models | Scaling laws | Power-law relationships between loss and model size, data, and compute. |
| 2020 | Language Models are Few-Shot Learners | GPT-3 | In-context learning at large scale. |
| 2022 | Training Compute-Optimal Large Language Models | Chinchilla | Compute-optimal balance between parameters and tokens. |
| 2022 | PaLM | Large-scale LLM | 540B-parameter dense decoder-only Transformer trained with the Pathways system. |
| 2023 | LLaMA | Open foundation models | Foundation-model family (7B to 65B parameters) trained only on publicly available data. |
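
The compute-optimal trade-off in the Chinchilla row can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes two widely used approximations that are not stated in this table: training cost of roughly 6 FLOPs per parameter per token, and a rule of thumb of about 20 training tokens per parameter. Both function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    # Assumed approximation: about 6 FLOPs per parameter per training token.
    return 6.0 * n_params * n_tokens


def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens).

    Solves 6 * N * (r * N) = C for N, where r is the assumed
    tokens-per-parameter ratio, then sets D = r * N.
    """
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens
```

For example, feeding in the budget of a 70B-parameter model trained on 1.4T tokens (`training_flops(70e9, 1.4e12)`) recovers that same split, since 1.4T / 70B is exactly the assumed ratio of 20.
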
Efficient Attention and Long Context
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2020 | Reformer | Efficient attention | LSH attention and reversible residual layers. |
| 2020 | Longformer | Long documents | Sliding-window plus global attention. |
| 2020 | Linformer | Linear-complexity attention | Low-rank projection of keys and values approximates self-attention in linear time and memory. |
| 2020 | Big Bird | Sparse attention | Local, random, and global sparse attention. |
| 2022 | FlashAttention | Efficient exact attention | IO-aware attention for speed and memory. |
| 2023 | FlashAttention-2 | GPU attention kernel | Better parallelism and work partitioning. |
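
The "sliding-window plus global attention" pattern noted for Longformer can be sketched as a boolean mask. This is a simplified illustration under stated assumptions: a symmetric window, a dense `seq_len x seq_len` mask (real implementations avoid materializing it), and a hypothetical function name.

```python
def sliding_window_mask(seq_len, window, global_positions=()):
    """Build a mask where mask[i][j] is True if position i may attend to j.

    window: total local window width; each token sees window // 2
    neighbors on each side, plus itself.
    global_positions: indices that attend to, and are attended by,
    every position.
    """
    half = window // 2
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            is_local = abs(i - j) <= half
            is_global = i in global_set or j in global_set
            row.append(is_local or is_global)
        mask.append(row)
    return mask
```

With a fixed window, the number of allowed pairs grows linearly in `seq_len` rather than quadratically, which is the source of the savings. Big Bird adds random connections on top of a similar local-plus-global layout.
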
Sparse and Mixture-of-Experts Transformers
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2020 | GShard | MoE scaling | Conditional computation and automatic sharding. |
| 2021 | Switch Transformers | Sparse MoE | Top-1 (single-expert) routing for trillion-parameter sparse models. |
Adaptation, Prompting, and Alignment
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2021 | Prefix-Tuning | Parameter-efficient tuning | Learns continuous prefixes while freezing the LM. |
| 2021 | LoRA | Low-rank adaptation | Injects trainable low-rank matrices into Transformer layers. |
| 2021 | Finetuned Language Models Are Zero-Shot Learners | Instruction tuning | FLAN-style instruction tuning for zero-shot behavior. |
| 2022 | Training Language Models to Follow Instructions with Human Feedback | RLHF / InstructGPT | Human feedback for instruction-following behavior. |
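
To make the LoRA row concrete, here is a minimal sketch of the low-rank update: the frozen weight W is combined with a trainable product B A, scaled by alpha / r. It uses plain Python lists, and the helper names are illustrative rather than taken from any library.

```python
def matmul(a, b):
    # Naive matrix product of a (n x k) and b (k x m).
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B A.

    w: d_out x d_in frozen weight.
    b: d_out x r and a: r x d_in are the trainable low-rank factors.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out, d_in, r):
    # Parameters in A (r x d_in) plus B (d_out x r).
    return r * (d_in + d_out)
```

If B starts at zero, as in the paper's initialization, the effective weight equals W at the start of training. For a 4096 x 4096 layer with r = 8, `trainable_params` gives 65,536 trainable values against roughly 16.8M frozen ones.
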
Cross-Domain Pointers
| Direction | Go To | Note |
|---|---|---|
| Vision Transformers | Vision Transformers and Foundation Models | ViT, DeiT, Swin, CLIP, ALIGN, OWL-ViT, Grounding DINO, and related visual foundation models live in the Computer Vision database. |
| State-space alternatives | State Space Models | S4, Mamba, Mamba-2, and Mamba-3 live in the sequence-modeling branch. |
Reading Path
| Step | Read |
|---|---|
| 1 | Attention Is All You Need. |
| 2 | BERT, GPT-1, GPT-2, T5, and BART. |
| 3 | Transformer-XL, Longformer, Big Bird, and FlashAttention. |
| 4 | Scaling Laws, GPT-3, Chinchilla, PaLM, and LLaMA. |
| 5 | Prefix-Tuning, LoRA, FLAN, and InstructGPT. |
| 6 | Vision Transformers and Foundation Models for the cross-domain vision branch. |