Transformers

This is the Transformer-focused database within Literature. It is intentionally table-based: each row links to the original paper and carries a short note. Longer explanations belong in separate conceptual notes.

Core Architecture

Year	Paper	Topic	Note
2017	Attention Is All You Need	Transformer	Original self-attention architecture.
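2019	Transformer-XL	Recurrence / context	Segment-level recurrence and relative positional encoding extend context beyond a fixed-length window.
2019	Fast Transformer Decoding: One Write-Head is All You Need	Multi-query attention	Shares one key/value head across all query heads to cut memory bandwidth during incremental decoding.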
2020	GLU Variants Improve Transformer	Feed-forward layers	Gated FFN variants used in many modern LLMs.

Pre-Training: Encoder, Encoder-Decoder, and Early Autoregressive Models

Year	Paper	Topic	Note
2018	Improving Language Understanding by Generative Pre-Training	GPT-1	Autoregressive pre-training plus task adaptation.
2018	BERT	Bidirectional encoder	Masked-language-model pre-training.
2019	XLNet	Permutation LM	Autoregressive pre-training with bidirectional context.
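2019	RoBERTa	BERT optimization	Shows BERT was undertrained: longer training, more data, larger batches, and dynamic masking yield large gains without architectural changes.
2019	ALBERT	Parameter sharing	Cross-layer parameter sharing and factorized embeddings cut BERT's parameter count.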
2019	T5	Text-to-text	Unified text-to-text transfer learning.
2019	BART	Denoising seq2seq	Combines bidirectional encoder and autoregressive decoder.
2020	ELECTRA	Replaced-token detection	More sample-efficient encoder pre-training.
2020	DeBERTa	Disentangled attention	Separates content and position attention.

Decoder-Only LLMs and Scaling

Year	Paper	Topic	Note
2019	Language Models are Unsupervised Multitask Learners (code)	GPT-2	Zero-shot behavior from next-token prediction.
2020	Scaling Laws for Neural Language Models	Scaling laws	Power-law relations between loss and model size, dataset size, and compute.
2020	Language Models are Few-Shot Learners	GPT-3	In-context learning at large scale.
2022	Training Compute-Optimal Large Language Models	Chinchilla	Compute-optimal balance between parameters and tokens.
2022	PaLM	Large-scale LLM	540B-parameter dense Transformer trained with the Pathways system.
2023	LLaMA	Open foundation models	7B to 65B model family trained only on publicly available data.

Efficient Attention and Long Context

Year	Paper	Topic	Note
2020	Reformer	Efficient attention	LSH attention and reversible residual layers.
2020	Longformer	Long documents	Sliding-window plus global attention.
2020	Linformer	Linear-complexity attention	Low-rank approximation of self-attention.
2020	Big Bird	Sparse attention	Local, random, and global sparse attention.
2022	FlashAttention	Efficient exact attention	IO-aware tiling computes exact attention without materializing the full attention matrix in GPU high-bandwidth memory.
2023	FlashAttention-2	GPU attention kernel	Better parallelism and work partitioning.

Sparse and Mixture-of-Experts Transformers

Year	Paper	Topic	Note
2020	GShard	MoE scaling	Conditional computation and automatic sharding.
2021	Switch Transformers	Sparse MoE	Top-1 expert routing for trillion-parameter sparse models.

Adaptation, Prompting, and Alignment

Year	Paper	Topic	Note
2021	Prefix-Tuning	Parameter-efficient tuning	Learns continuous prefixes while freezing the LM.
2021	LoRA	Low-rank adaptation	Injects trainable low-rank matrices into Transformer layers.
2021	Finetuned Language Models Are Zero-Shot Learners	Instruction tuning	FLAN-style instruction tuning for zero-shot behavior.
2022	Training Language Models to Follow Instructions with Human Feedback	RLHF / InstructGPT	Human feedback for instruction-following behavior.

Cross-Domain Pointers

Direction	Go To	Note
Vision Transformers	Vision Transformers and Foundation Models	ViT, DeiT, Swin, CLIP, ALIGN, OWL-ViT, Grounding DINO, and related visual foundation models live in the Computer Vision database.
State-space alternatives	State Space Models	S4, Mamba, Mamba-2, and Mamba-3 live in the sequence-modeling branch.

Reading Path

Step	Read
1	Attention Is All You Need.
2	BERT, GPT-1, GPT-2, T5, and BART.
3	Transformer-XL, Longformer, Big Bird, and FlashAttention.
4	Scaling Laws, GPT-3, Chinchilla, PaLM, and LLaMA.
5	Prefix-Tuning, LoRA, FLAN, and InstructGPT.
6	Vision Transformers and Foundation Models for the cross-domain vision branch.