Info
The discussion below focuses on historical motivations and tradeoffs, following the path from attention in neural machine translation to Transformer-based language models and later efficiency-oriented alternatives. Paper-level bibliography is kept in the Literature section:
- Representations and Sequence Models for word representations, RNNs, Seq2Seq, and early attention.
- Transformers for Transformer architecture, BERT/GPT/T5 lineages, scaling, efficient attention, and adaptation.
- State Space Models for S4, S5, Mamba, Mamba-2, and Mamba-3.
- Generalization and Scaling for scaling laws and compute-optimal training.
From Attention to Modern Sequence Models
This history is organized around the binding constraint at each stage. In early encoder-decoder translation, the constraint was compression into a fixed vector. In recurrent sequence modeling, it was sequential computation. In scaled language models, it became optimization, data/compute allocation, interface design, and eventually the memory pressure of long-context inference.
The opening timeline follows these shifts:
- Encoder-decoder RNNs compressed the whole source sequence into one vector.
- Attention let the decoder retrieve different parts of the source sequence when needed.
- The Transformer removed recurrence from the main computation, making sequence modeling much more parallelizable.
- Scaling laws, normalization choices, and data/compute allocation made Transformer-based language modeling a more predictable scaling regime.
- Instruction tuning and RLHF adapted large language models to assistant-style interaction.
- Long-context inference and KV-cache growth created a new efficiency problem, motivating systems work such as FlashAttention and architectural alternatives such as State Space Models.

Methodological Frame
Four levels are kept separate throughout the note:
| Level | Question | Examples |
|---|---|---|
| Architecture | What computation does the model perform? | Attention, self-attention, encoder-decoder structure, decoder-only Transformers, SSM layers |
| Optimization and scaling | Can the model be trained stably and predictably? | residual connections, LayerNorm placement, scaling laws, compute/data allocation |
| Systems and inference | Can the model run efficiently on real hardware? | KV cache, FlashAttention, memory bandwidth, long-context serving |
| Interface and deployment | How is the model exposed and used? | instruction tuning, RLHF, chat interfaces, assistant-style interaction |
These levels are analytic aids rather than strict historical phases. The 2017 Transformer paper changed the architecture of sequence modeling; ChatGPT changed the interface and deployment path for instruction-following language models; FlashAttention addressed a systems bottleneck; Mamba-style models respond to architectural and serving constraints around long-context computation.
Compressed Timeline
| Period | Main Development | Problem Being Addressed |
|---|---|---|
| 2014-2015 | Attention in neural machine translation | Fixed-vector compression in encoder-decoder RNNs |
| 2017 | Transformer architecture | Sequential computation in recurrent sequence models |
| 2018-2020 | Encoder-only, decoder-only, and encoder-decoder lineages | Different tasks require different training objectives and inference patterns |
| 2020-2022 | Scaling laws, GPT-3, Chinchilla, instruction tuning | How to allocate parameters, data, compute, and human feedback |
| 2022 | ChatGPT | How non-specialists interact with instruction-tuned language models |
| 2022-2026 | FlashAttention, long-context systems, Mamba/SSMs | Attention cost, memory movement, and KV-cache growth |
1. Before Attention: The Fixed-Context Bottleneck
Early neural machine translation systems based on encoder-decoder RNNs followed a simple pattern:
- The encoder read the whole source sentence.
- The encoder compressed that sentence into a fixed-length vector.
- The decoder generated the target sentence from that vector.
The design was compact, but it placed a strong burden on the final encoder state. A short sentence and a long sentence had to pass through the same-size representation before decoding. The limitation was architectural: all source-side information had to be compressed into a single channel before generation began.
The Key Bottleneck
The decoder needed access to different parts of the source sentence at different output steps. A single fixed vector made that difficult, especially for long or information-dense inputs.
Bahdanau, Cho, and Bengio addressed this in 2014 by allowing the decoder to compute a context vector that depends on the current decoding step. Instead of forcing the decoder to rely on a single summary of the source sentence, the model scores the encoder states against the current decoder state, turns those scores into attention weights, and builds a step-specific summary of the source.
Architecturally, the source sentence was no longer represented only by one vector; it became a set of hidden states that the decoder could read from selectively.
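The score, weight, and summarize steps described above can be sketched in a few lines. This is a minimal illustration, assuming an additive scoring function in the style of Bahdanau et al.; the parameter names (`W_dec`, `W_enc`, `v`), the dimensions, and the random values are hypothetical stand-ins for learned weights.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Return (context, weights) for one decoding step.

    decoder_state:  (d_dec,)    current decoder hidden state
    encoder_states: (T, d_enc)  one hidden state per source position
    W_dec, W_enc, v: parameters (random here, learned in a real model)
    """
    # Score every encoder state against the current decoder state.
    hidden = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec)  # (T, d_att)
    scores = hidden @ v                                               # (T,)
    # Turn scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Step-specific summary: weighted average of the encoder states.
    context = weights @ encoder_states                                # (d_enc,)
    return context, weights

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4
encoder_states = rng.normal(size=(T, d_enc))
W_enc = rng.normal(size=(d_enc, d_att))
W_dec = rng.normal(size=(d_dec, d_att))
v = rng.normal(size=d_att)

# Two different decoder states yield two different summaries of the same source.
ctx_a, w_a = additive_attention(rng.normal(size=d_dec), encoder_states, W_dec, W_enc, v)
ctx_b, w_b = additive_attention(rng.normal(size=d_dec), encoder_states, W_dec, W_enc, v)
```

The encoder states are computed once; only the weights change from step to step, which is what makes the context vector step-specific.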
Conceptual Interpretation
Attention turns the encoded sequence into a differentiable, content-addressable memory. The decoder can query the source representation as generation proceeds, instead of relying entirely on its recurrent state.
Luong, Pham, and Manning later systematized global and local attention variants. Attention was becoming a reusable design pattern for sequence modeling, beyond its first role in a specific translation architecture.
2. Attention Before the Transformer
Before the Transformer, attention was usually attached to recurrent encoder-decoder models. The recurrent network still processed tokens sequentially; attention provided a read mechanism over hidden states.
The architectural role of attention was different in the two settings:
| Attention in RNN Seq2Seq | Transformer Self-Attention |
|---|---|
| Attention is added to a recurrent encoder-decoder model. | Attention becomes the main sequence-processing operation. |
| Encoder and decoder still process sequences step by step. | Training can process positions in parallel within a layer. |
| Attention mainly solves the source-context bottleneck. | Self-attention also changes how information flows between all token positions. |
The transition to the Transformer concerns the role assigned to attention. In earlier sequence-to-sequence systems, attention was an interface between recurrent states. In the Transformer, self-attention became the main operation for exchanging information across positions, while recurrence and convolution were removed from the core sequence-transduction pipeline.
3. The Transformer: Removing Sequential Recurrence
Vaswani et al. introduced the Transformer as an encoder-decoder architecture for machine translation. Its central proposal was to process sequences with attention mechanisms instead of recurrence or convolution.
The target bottleneck differed from the one addressed by Bahdanau-style attention.
| Problem in RNNs | Transformer Response |
|---|---|
| Tokens must be processed in sequence. | Token representations can be updated in parallel during training. |
| Information from distant positions passes through many recurrent steps. | Self-attention gives shorter paths between positions. |
| Training is hard to parallelize across sequence length. | Attention is more compatible with GPU/TPU matrix operations. |
| Order is implicit in recurrence. | Order must be injected explicitly through positional information. |
A common shorthand describes the Transformer as “self-attention plus MLPs.” The original architecture is broader: multi-head attention, positional encodings, residual connections, layer normalization, embeddings, output projections, and a carefully chosen training recipe all matter.
Interpretive Caution
The original result has a precise interpretation: high-quality sequence transduction could be achieved without recurrence or convolution as the core sequence-processing mechanism.
Self-Attention and Path Length
In a recurrent model, token $i$ influences a later token $j$ through a chain of recurrent updates. In self-attention, token representations can interact more directly within the same layer.
The effective path between distant positions becomes shorter. Shorter paths do not by themselves guarantee long-range reasoning, but they give the model a more direct computational route for modeling long-distance dependencies.
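The shorter path can be made concrete with a single-head sketch. The code below is an illustration under simplifying assumptions (one head, no mask, random weights, hypothetical dimensions), not the full multi-head layer. It perturbs only the last token and checks that the first position's output changes after a single layer.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no mask).

    X: (n, d_model) token representations. Returns (output, weights).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_head = 6, 8, 4
X = rng.normal(size=(n, d_model))
# Scale the random projections so the softmax is not sharply peaked.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) / np.sqrt(d_model) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)

# Perturb only the last token and recompute. The first position's output
# changes after one layer, with no chain of intermediate recurrent steps.
X_perturbed = X.copy()
X_perturbed[-1] += 1.0
out_perturbed, _ = self_attention(X_perturbed, W_q, W_k, W_v)
first_position_changed = not np.allclose(out[0], out_perturbed[0])
```

In a unidirectional RNN the same influence would have to pass through every intermediate hidden state.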
Positional Encoding
Self-attention by itself is order-agnostic (strictly, permutation-equivariant: permuting the input tokens permutes the outputs in the same way). Without positional information, it has no built-in way to know that token $j$ comes after token $i$.
The original Transformer added sinusoidal positional encodings to token embeddings. Later models explored learned positional embeddings, relative position schemes, rotary position embeddings, ALiBi-style biases, and other mechanisms.
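A minimal sketch of the sinusoidal scheme, assuming the formulation from the original Transformer paper (sine on even dimensions, cosine on odd dimensions, base constant 10000) and an even `d_model`. The function name and the zero-valued placeholder embeddings are illustrative.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal position encodings; assumes d_model is even.

    Dimension pair (2i, 2i+1) uses sin and cos of pos / 10000**(2i / d_model),
    so wavelengths form a geometric progression across dimensions.
    """
    positions = np.arange(n_positions)[:, None]                      # (n, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))    # (d_model/2,)
    angles = positions * freqs                                       # (n, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(n_positions=50, d_model=16)
token_embeddings = np.zeros((50, 16))    # placeholder for learned embeddings
inputs = token_embeddings + pe           # order is injected by simple addition
```

Because the encoding is added to the embeddings rather than produced by the computation, two otherwise identical tokens at different positions enter the first layer with different representations.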
Methodological Note
Positional encoding records a design tradeoff. Once recurrence is removed, order is no longer supplied by the computation itself and must be represented separately.
4. After 2017: One Architecture, Several Lineages
The original Transformer was an encoder-decoder model. Later work separated and specialized its parts.
| Lineage | Typical Objective | Representative Models | Typical Use |
|---|---|---|---|
| Encoder-only | masked or denoising-style representation learning | BERT, RoBERTa | classification, retrieval, reranking, understanding tasks |
| Decoder-only | autoregressive next-token prediction | GPT family | generation, completion, chat-style systems, in-context learning |
| Encoder-decoder | conditional generation from input to output | T5, BART | translation, summarization, structured transformation |
The split reflects different assumptions about information flow:
- encoder-only models see context bidirectionally and are strong representation learners;
- decoder-only models generate left-to-right and are natural fits for continuation and interactive generation;
- encoder-decoder models separate input encoding from output generation and remain useful for conditional generation tasks.
Terminological Note
“Transformer” names an architectural family, not a single model type. Treating BERT, GPT, and T5 as interchangeable hides the role of objective, masking pattern, and inference procedure.
5. Scaling Changed the Meaning of the Architecture
The 2017 architecture was only one part of the later LLM trajectory. Several additional ingredients had to become reliable:
- training stability at depth;
- predictable scaling behavior;
- large and diverse pretraining corpora;
- efficient distributed training;
- task adaptation through prompting, instruction tuning, and human feedback.
Scaling Laws
Kaplan et al. studied empirical scaling laws for language models and found that loss follows approximate power-law trends with model size, dataset size, and compute. Training large models still required judgment and experimentation, but scaling became a more measurable engineering problem.
Hoffmann et al. later argued that many large language models were undertrained relative to their parameter count. The Chinchilla result shifted attention from “more parameters” toward better compute allocation between parameters and training tokens.
Scaling and Allocation
Scaling laws relate performance to a coupled allocation of compute, parameters, and data. A model can be large and still be inefficiently trained.
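The coupling can be illustrated with back-of-the-envelope arithmetic. The sketch below rests on two loudly labeled assumptions that are rules of thumb, not exact laws: training compute of roughly 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 training tokens per parameter, as commonly associated with the Chinchilla analysis. The function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    # Assumption: about 6 FLOPs per parameter per training token.
    return 6.0 * n_params * n_tokens

def compute_optimal_split(flop_budget, tokens_per_param=20.0):
    """Split a FLOP budget assuming D = tokens_per_param * N and C = 6 * N * D."""
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# A budget roughly at Chinchilla scale: 70B parameters on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal_split(budget)

# Spending the same budget on a 4x larger model leaves 4x fewer tokens.
big_n = 280e9
big_d = budget / (6.0 * big_n)
tokens_per_param_big = big_d / big_n
```

Under these assumptions the same budget supports either 70B parameters at 20 tokens per parameter or 280B parameters at 1.25 tokens per parameter, which is the sense in which a model can be large and still undertrained.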
Normalization and Stability
The original Transformer used Post-Norm blocks, where layer normalization is applied after the residual addition. Many later large language models use Pre-Norm-style blocks, often with RMSNorm.
The reason is practical: in very deep networks, clean residual paths help gradient flow. Pre-Norm is not universally superior in every setting, but it became a common choice for stable large-scale training.
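The two orderings differ only in where the normalization sits relative to the residual addition. The following is a sketch with simplifying assumptions: no learned gain or bias, a `tanh` stand-in for the attention or MLP sublayer, and hypothetical function names.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no learned gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm rescales by the root mean square without subtracting the mean.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Original Transformer ordering: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm=rms_norm):
    # Pre-Norm ordering: normalize the sublayer input, leave the residual path untouched.
    return x + sublayer(norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def sublayer(h):
    # Stand-in for attention or an MLP.
    return np.tanh(h @ W)

x = rng.normal(size=(4, 8))
h_post, h_pre = x, x
for _ in range(12):    # stack 12 identical blocks
    h_post = post_norm_block(h_post, sublayer)
    h_pre = pre_norm_block(h_pre, sublayer)
```

In the Pre-Norm form the input `x` reaches the output through additions only, which is the "clean residual path" referred to above. In the Post-Norm form every block renormalizes the running sum.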
In-Context Learning
GPT-3 made in-context learning highly visible. A decoder-only language model could often adapt its behavior from examples placed in the prompt, without parameter updates.
After GPT-3, prompting increasingly functioned as a temporary task specification rather than only an input string.
Caution
In-context learning differs from ordinary gradient-based learning. The model weights remain fixed during the prompt; adaptation occurs through the forward pass and the model’s learned ability to condition on context.
6. ChatGPT: Interface, Alignment, and Deployment
ChatGPT enters this history as a deployment and interaction event. By 2022, Transformer language models and instruction-following work were already established.
The shift lies in the combination of model capability, alignment methods, and interface design.
| Layer | What Changed |
|---|---|
| Model family | Large Transformer language models were already established. |
| Behavioral tuning | Instruction tuning and RLHF made outputs more aligned with user requests. |
| Interface | The chat format supported multi-turn clarification, correction, and iteration. |
| Adoption | Non-specialists could use the system without learning model-specific prompting conventions first. |
Technically, this follows the InstructGPT/RLHF line: supervised fine-tuning on demonstrations, reward modeling from human preferences, and further optimization using human feedback. ChatGPT packaged this assistant behavior in a public conversational product.
Architecture and Deployment
The years 2014-2015 and 2017 mark architectural transitions.
The year 2022 marks a transition in alignment practice, interface design, and public deployment.
Adoption figures for ChatGPT are often cited because they indicate how quickly the interface spread. Those figures document the public uptake of a conversational interface, not the arrival of a new model architecture.
7. The Cost of Attention: Context Length and the KV Cache
Transformer deployment at scale exposed a new bottleneck.
Self-attention gives each token direct access to other tokens, but full attention over a sequence has a quadratic attention matrix:
$$O(n^2)$$
where $n$ is the sequence length. As context length grows, the number of token-to-token comparisons grows quadratically. Doubling the context roughly quadruples the attention matrix.
During autoregressive generation, the model avoids recomputing all previous key and value projections by storing them in the KV cache. This is essential for efficient decoding, but it creates a memory problem.
For each new token:
- compute the new query, key, and value;
- append the new key and value to the cache;
- attend from the new query over cached keys and values;
- produce the next-token distribution.
The per-token attention work still grows with context length, and the cache grows with the number of layers, heads, head dimension, batch size, and sequence length.
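The first three steps of the loop above can be sketched for a single head of a single layer. This is an illustration under simplifying assumptions (random weights, no batching, no output projection or next-token distribution). `KVCache`, `decode_step`, and `full_causal_attention` are hypothetical names, and the last of these is included only as a reference to check the cached result against.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Stores keys and values for one attention head of one layer."""
    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))
        self.values = np.empty((0, d_head))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(x_new, cache, W_q, W_k, W_v):
    # 1. Compute query, key, and value for the new token only.
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v
    # 2. Append the new key and value to the cache.
    cache.append(k, v)
    # 3. Attend from the new query over all cached keys and values.
    #    The work here is linear in the current cache length.
    weights = softmax(cache.keys @ q / np.sqrt(q.shape[-1]))
    return weights @ cache.values

def full_causal_attention(X, W_q, W_k, W_v):
    # Reference: recompute everything at once with a causal mask.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d_model, d_head = 7, 8, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) / np.sqrt(d_model) for _ in range(3))

cache = KVCache(d_head)
cached_outputs = np.array([decode_step(x, cache, W_q, W_k, W_v) for x in X])
reference = full_causal_attention(X, W_q, W_k, W_v)
```

The cached loop reproduces the masked full computation row by row, while each step touches only one new token's projections. The cost is that `cache.keys` and `cache.values` gain one row per generated token.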
With a KV cache, decoding avoids the naive strategy of recomputing the whole previous sequence for every generated token. The attention work for a new token is still linear in the current context length:
$$O(n)$$
The resulting distinction is:
| Phase | Attention Cost | Memory Profile |
|---|---|---|
| Training / prefill | $O(n^2)$ over the full sequence | attention activations and intermediate states |
| Autoregressive decoding with KV cache | $O(n)$ per generated token | KV cache grows with context length |
The New Bottleneck
Long-context inference is not limited only by arithmetic. It is also limited by memory capacity and memory bandwidth. The KV cache turns context length into a serving problem.
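A rough size estimate shows why. The sketch below assumes a standard multi-head layout in which every layer stores one key tensor and one value tensor per head, with 16-bit values. The configuration is hypothetical and chosen only for round numbers; it does not describe any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    # Factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical configuration (not a specific model), 16-bit values.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128, batch_size=1)

for seq_len in (4096, 32768, 131072):
    gib = kv_cache_bytes(seq_len=seq_len, **config) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.0f} GiB")
```

For this configuration the cache takes 2 GiB at 4,096 tokens, 16 GiB at 32,768 tokens, and 64 GiB at 131,072 tokens per sequence, before counting model weights. The growth is linear in context length and in batch size, so long contexts and concurrent requests compete for the same memory.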
FlashAttention
FlashAttention keeps the attention operation exact while reducing memory movement between GPU high-bandwidth memory and on-chip SRAM.
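The arithmetic that lets attention stay exact without materializing the full score matrix is a running ("online") softmax over key/value blocks. The NumPy sketch below illustrates only that arithmetic. It is not the FlashAttention kernel, which is a fused GPU implementation that also tiles over queries and manages on-chip memory explicitly. Function names and the block size are hypothetical.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: builds the full (n, n) score matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block_size=4):
    """Exact attention computed one key/value block at a time."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n, 1))           # running softmax denominator
    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        scores = Q @ K_blk.T / np.sqrt(d)                      # (n, block)
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        rescale = np.exp(row_max - new_max)                    # correct earlier partial sums
        p = np.exp(scores - new_max)
        row_sum = row_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ V_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
exact = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=4)
```

Only one block of scores exists at any time, yet the result matches the reference up to floating-point error. That is the sense in which the operation stays exact while memory traffic is reduced.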
Its historical role is systems-level: once the Transformer became dominant, part of the bottleneck moved from architecture to hardware-aware implementation.
8. State Space Models: Recurrence Revisited
State Space Models enter the account by reopening a question the Transformer seemed to have settled:
Question
Can recurrence provide memory-efficient sequence modeling without reintroducing the weaknesses of classical RNNs?
Classical recurrence processes sequences through a hidden state. This is memory efficient at inference time, but older RNNs were hard to train at scale and did not match accelerator hardware as well as Transformer matrix operations.
Modern SSMs recover part of the appeal of recurrence while using more structured parameterizations and more parallelizable algorithms.
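Both properties can be seen in the simplest linear, time-invariant case. The sketch below assumes a discrete recurrence $h_t = A h_{t-1} + B x_t$, $y_t = C h_t$ with a scalar input channel and fixed, hand-picked parameters. Selective models such as Mamba make these parameters input-dependent, which this example does not attempt. Function names are hypothetical.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Recurrent form: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t      # state size is fixed, whatever the sequence length
        ys.append(C @ h)
    return np.array(ys), h

def ssm_convolution(x, A, B, C):
    """Same outputs from an explicit kernel k_j = C A^j B.

    Every output depends only on the inputs and a precomputed kernel,
    so positions can be computed independently of one another.
    """
    n = len(x)
    kernel = np.array([C @ np.linalg.matrix_power(A, j) @ B for j in range(n)])
    return np.array([kernel[:t + 1][::-1] @ x[:t + 1] for t in range(n)])

A = np.diag([0.9, 0.5, 0.1])     # stable diagonal state matrix (illustrative values)
B = np.ones(3)
C = np.array([1.0, -1.0, 0.5])

rng = np.random.default_rng(0)
x = rng.normal(size=20)

y_recurrent, final_state = ssm_scan(x, A, B, C)
y_convolution = ssm_convolution(x, A, B, C)
```

The recurrent form carries a three-element state no matter how long the sequence is, in contrast to a KV cache that grows with context. The convolutional form shows why training does not have to proceed one step at a time.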
| Line | Main Idea | Historical Role |
|---|---|---|
| S4 / S5 | structured state space layers for long-range sequence modeling | revived SSMs as practical deep learning layers |
| Mamba (2023) | selective, input-dependent state spaces | made recurrent state more content-aware |
| Mamba-2 (2024) | State Space Duality | clarified connections between SSMs and attention-like computation |
| Mamba-3 (2026) | more expressive state updates and inference-oriented refinements | targets state tracking, retrieval, and hardware utilization |
Mamba-style models are best read as responses to bottlenecks created by successful Transformer deployment, rather than as wholesale replacements for the Transformer: long contexts, KV-cache memory, and the need for efficient decoding.
Interpretive Caution
The historical pattern is not a sequence of clean replacements. Each architecture makes a different tradeoff between direct access to context, parallel training, state compression, memory use, and hardware efficiency.
The Causal Thread
The evolution can be summarized as a sequence of motivated responses:
| Stage | What Was Inadequate Before | What Changed | New Cost Introduced |
|---|---|---|---|
| Attention | fixed-vector compression | dynamic retrieval over encoder states | attention computation over source positions |
| Transformer | sequential recurrence | parallel token-to-token interaction | quadratic attention in sequence length |
| Scaling laws and GPT-3 | ad hoc scaling | empirical planning of model/data/compute | larger training and evaluation complexity |
| Instruction tuning and ChatGPT | capable models were hard to use directly | assistant-style interaction | alignment, reliability, and deployment risks |
| FlashAttention | attention was memory-traffic inefficient | exact attention with better IO behavior | systems complexity |
| Mamba/SSMs | KV-cache and attention costs grow with context | compact recurrent state | possible tradeoffs in retrieval and state tracking |
Main Historical Claims
- Attention was originally a solution to the fixed-context problem in encoder-decoder RNNs.
- The Transformer changed the role of attention. Attention moved from an auxiliary read mechanism in recurrent models to the main sequence-processing operation.
- Modern LLMs depend on more than architecture. Their behavior depends on objective choice, scale, data, normalization, optimization, instruction tuning, and deployment interface.
- ChatGPT marks an interface, alignment, and deployment event.
- Efficiency research follows from Transformer deployment at scale. Once models are deployed at long context and large scale, memory traffic, KV-cache growth, and inference cost become central research problems.
Tradeoff Pattern
Across these episodes, the recurring pattern is a tradeoff:
- compression versus retrieval,
- recurrence versus parallelism,
- direct context access versus memory cost,
- benchmark capability versus usable interaction,
- architectural elegance versus hardware reality.
The Transformer became central by offering an effective compromise among these pressures. Research continued because that compromise created new costs at training and inference scale.