State Space Models
State Space Models (SSMs) are sequence models built around a hidden state that is updated at every time step. In deep learning, they are studied as an alternative to attention-heavy Transformers, especially when long context, streaming inference, or memory efficiency matters.
Problem Map
| Problem | Why It Matters | What SSMs Try to Do |
|---|---|---|
| Transformer attention scales poorly with length | Full self-attention costs O(n^2) time in sequence length n, and the KV cache grows linearly with n during decoding. | Use recurrent state updates that scale linearly in n and keep a fixed-size state. |
| Long-context inference is expensive | Agentic workflows, codebases, books, audio, and genomics can require very long sequences. | Maintain a compressed history instead of attending over every token. |
| Earlier linear-time models lost quality | Many efficient models struggled with content-based reasoning on discrete data such as text. | Make state updates input-dependent so the model can selectively remember or forget. |
| Linear inference can underuse hardware | Theoretical O(n) does not automatically mean high GPU utilization. | Redesign the recurrence and kernels around practical inference throughput. |
| State tracking is hard | Some linear models fail tasks requiring precise updates to latent variables or symbolic state. | Use richer state dynamics, complex-valued updates, and MIMO formulations. |
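
To make the first two rows concrete, here is a back-of-the-envelope sketch of decoding memory. It assumes a standard KV cache (keys and values for every past token in every layer) and a recurrent state of shape `(d_model, d_state)` per layer. The function names and model sizes are hypothetical, and real SSM layers keep some extra per-layer buffers that are ignored here.

```python
def kv_cache_elements(seq_len: int, n_layers: int, n_heads: int, head_dim: int) -> int:
    """Elements held by a KV cache: one key and one value vector per past token."""
    return 2 * n_layers * n_heads * head_dim * seq_len


def ssm_state_elements(n_layers: int, d_model: int, d_state: int) -> int:
    """Elements held by a recurrent state of shape (d_model, d_state) per layer."""
    return n_layers * d_model * d_state


# Hypothetical model sizes, chosen only for illustration.
N_LAYERS, N_HEADS, HEAD_DIM, D_STATE = 24, 16, 64, 16
D_MODEL = N_HEADS * HEAD_DIM

for n in (1_000, 10_000, 100_000):
    kv = kv_cache_elements(n, N_LAYERS, N_HEADS, HEAD_DIM)
    ssm = ssm_state_elements(N_LAYERS, D_MODEL, D_STATE)
    print(f"n={n:>7}: KV cache {kv:>13,} elements, SSM state {ssm:,} elements")
```

Under these assumptions the KV cache grows in proportion to the sequence length, while the recurrent state stays the same size at every length.
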
Background SSMs
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2021 | Efficiently Modeling Long Sequences with Structured State Spaces | S4 | Structured SSM layer for long-range sequence modeling. |
| 2022 | Diagonal State Spaces are as Effective as Structured State Spaces | DSS | Simpler diagonal SSMs can match S4 on long-range tasks. |
| 2022 | Simplified State Space Layers for Sequence Modeling | S5 | Multi-input, multi-output SSM using efficient parallel scans. |
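
The layers in this table share a linear recurrence at their core. The sketch below is a minimal illustration, not any paper's implementation: it assumes an already-discretized, time-invariant system with a diagonal state matrix and a scalar input, and checks that the step-by-step recurrence and a causal convolution with kernel `K[k] = sum_i c_i * a_i**k * b_i` give the same outputs. All names and parameter values are made up for the example.

```python
def ssm_recurrent(a, b, c, xs):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = sum(c * h_t), elementwise in a, b, c."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def ssm_kernel(a, b, c, length):
    """Convolution kernel K[k] = sum_i c_i * a_i**k * b_i of the same system."""
    return [sum(c_i * (a_i ** k) * b_i for a_i, b_i, c_i in zip(a, b, c))
            for k in range(length)]


def ssm_convolutional(a, b, c, xs):
    """Compute the same outputs as a causal convolution y_t = sum_k K[k] * x_{t-k}."""
    kernel = ssm_kernel(a, b, c, len(xs))
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]


# Illustrative parameters: two state dimensions, four input steps.
a, b, c = [0.9, 0.5], [1.0, 1.0], [0.5, -0.25]
xs = [1.0, 0.0, 2.0, -1.0]
print(ssm_recurrent(a, b, c, xs))
print(ssm_convolutional(a, b, c, xs))
```

The recurrent form suits step-by-step inference, while the convolutional form lets a whole sequence be processed at once during training. S5 instead parallelizes the recurrence itself with a scan, as noted in the table.
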
Mamba Family
| Year | Paper | Topic | Problem Addressed |
|---|---|---|---|
| 2023 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces (code) | Mamba / selective SSM | Makes SSM parameters input-dependent to recover content-aware reasoning while keeping linear sequence scaling. |
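| 2024 | Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (PMLR) | Mamba-2 / state space duality | Connects SSMs and attention through structured semiseparable matrices; yields a faster layer (SSD) that can be computed largely with matrix multiplications. |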
| 2026 | Mamba-3: Improved Sequence Modeling using State Space Principles (project) | Mamba-3 / inference-first SSM | Targets the quality-efficiency gap: better state tracking, retrieval, and decode hardware utilization with expressive recurrence, complex state updates, and MIMO SSMs. |
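
The SSM-attention connection in the Mamba-2 row can be illustrated on a toy case. The sketch below assumes a scalar decay `a_t` per step and a scalar input channel, and checks that the input-dependent recurrence equals multiplication by a lower-triangular matrix with entries `M[t][j] = <C_t, B_j> * a_{j+1} * ... * a_t`. Function names and values are hypothetical, and the quadratic-time matrix form is written for clarity rather than speed.

```python
def selective_recurrence(a, B, C, xs):
    """h_t = a_t * h_{t-1} + B_t * x_t and y_t = <C_t, h_t>, with per-step a_t, B_t, C_t."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, xs):
        h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, B_t)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(C_t, h)))
    return ys


def matrix_form(a, B, C, xs):
    """y = M x with M[t][j] = <C_t, B_j> * a_{j+1} * ... * a_t for j <= t, else 0."""
    ys = []
    for t in range(len(xs)):
        total = 0.0
        for j in range(t + 1):
            decay = 1.0
            for k in range(j + 1, t + 1):
                decay *= a[k]          # cumulative decay between steps j and t
            score = sum(c_i * b_i for c_i, b_i in zip(C[t], B[j]))
            total += score * decay * xs[j]
        ys.append(total)
    return ys


# Illustrative values: four steps, state size two.
a = [0.9, 0.5, 0.8, 1.0]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
C = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
xs = [1.0, -1.0, 2.0, 0.5]
print(selective_recurrence(a, B, C, xs))
print(matrix_form(a, B, C, xs))
```

Read as attention, `C_t` plays the role of a query, `B_j` a key, and `x_j` a value, with the cumulative decay acting as a causal mask in place of softmax.
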
Key Ideas
| Idea | Meaning |
|---|---|
| State | A compact memory vector updated as the sequence is processed. |
| Selectivity | The model decides what to propagate or forget based on the current input. |
| Linear scaling | Compute grows as O(n) in sequence length n because no n-by-n attention matrix is formed. |
| Constant-memory decoding | In autoregressive inference, the model can update a fixed-size state instead of expanding a KV cache. |
| State Space Duality | Mamba-2's observation that certain SSMs and masked attention variants can be written as multiplication by the same structured (semiseparable) matrix. |
| MIMO SSM | Mamba-3 processes vector-valued inputs/outputs in the recurrence to improve expressivity and hardware utilization. |
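
The selectivity and constant-memory rows can be sketched together with a toy streaming update. This is a simplified illustration, not Mamba's actual parameterization: it assumes a scalar input, a diagonal state with fixed decay rates, and a step size `delta` computed from the current input through a softplus. The class name and the `w_delta` and `bias_delta` parameters are hypothetical.

```python
import math


def softplus(z: float) -> float:
    """Smooth positive map, used here to keep the step size positive."""
    return math.log1p(math.exp(z))


class ToySelectiveState:
    """Fixed-size recurrent state with an input-dependent step size (toy, scalar input)."""

    def __init__(self, decay_rates, w_delta: float, bias_delta: float):
        self.decay_rates = list(decay_rates)  # positive rates of the diagonal dynamics
        self.w_delta = w_delta                # hypothetical scalar projection for delta
        self.bias_delta = bias_delta
        self.h = [0.0] * len(self.decay_rates)

    def step(self, x: float) -> float:
        delta = softplus(self.w_delta * x + self.bias_delta)
        # Large delta: the old state decays strongly and the new input is written.
        # Small delta: the old state is mostly kept and the input is mostly ignored.
        self.h = [math.exp(-delta * r) * h_i + delta * x
                  for r, h_i in zip(self.decay_rates, self.h)]
        return sum(self.h)


state = ToySelectiveState(decay_rates=[0.5, 1.0, 2.0], w_delta=4.0, bias_delta=-2.0)
outputs = [state.step(x) for x in [1.0, 0.0, 0.0, 1.0, 0.0]]
print(outputs)
print(len(state.h))  # the state keeps 3 entries however many tokens are processed
```

Each call to `step` overwrites the same fixed-size list, so decoding memory does not depend on how many tokens came before, unlike a KV cache that stores every past token.
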
Reading Path
| Step | Read |
|---|---|
| 1 | S4 for the structured state-space foundation. |
| 2 | S5 for the simplified MIMO formulation computed with parallel scans. |
| 3 | Mamba for selective state spaces and language modeling. |
| 4 | Mamba-2 for the SSM-attention bridge and faster SSD layer. |
| 5 | Mamba-3 for inference-first design, richer state tracking, and MIMO recurrence. |