Vision Transformers and Foundation Models
This database tracks the shift from CNN-first computer vision to patch-based Transformers, large-scale image-text pretraining, open-vocabulary recognition, visual grounding, and promptable segmentation.
Vision Transformers
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2020 | An Image is Worth 16x16 Words | ViT | Applies a standard Transformer to image patches. |
| 2020 | Training Data-Efficient Image Transformers and Distillation through Attention | DeiT | Makes ViT training more data-efficient. |
| 2021 | Swin Transformer | Hierarchical ViT | Shifted-window attention for dense vision tasks. |
| 2021 | MLP-Mixer | Token mixing | Shows that an all-MLP patch model, with no attention or convolutions, can be competitive on image classification. |
| 2021 | BEiT | Masked image modeling | BERT-style pretraining for image patches. |
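
The "image patches" idea behind the ViT entry can be illustrated with a minimal sketch. It assumes a nested-list image in height x width x channel layout and a hypothetical helper named `patchify`. It covers only the step that turns an image into a sequence of flattened patch tokens. The learned linear projection, position embeddings, and the Transformer itself are omitted.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch, row by row, into a single vector (one token).
            token = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(token)
    return patches


# Toy 224 x 224 RGB image filled with zeros (illustrative input only).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

With 16 x 16 patches, a 224 x 224 RGB image becomes 14 x 14 = 196 tokens of 16 x 16 x 3 = 768 values each, which is the sequence a standard Transformer encoder would then consume.
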
Image-Text and Multimodal Pretraining
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2021 | Learning Transferable Visual Models From Natural Language Supervision (OpenAI) | CLIP | Contrastive image-text pretraining for zero-shot recognition. |
| 2021 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ALIGN | Large-scale noisy image-text representation learning. |
| 2021 | Florence: A New Foundation Model for Computer Vision | Vision foundation model | Image-text pretrained representation adapted across classification, retrieval, detection, VQA, and video tasks. |
| 2022 | CoCa: Contrastive Captioners are Image-Text Foundation Models | Contrastive captioning | Combines contrastive learning and captioning. |
| 2022 | Flamingo | Vision-language few-shot learning | Bridges frozen pretrained vision and language models so that interleaved image-text prompts support few-shot learning. |
| 2023 | Sigmoid Loss for Language Image Pre-Training | SigLIP | Replaces the batch-wise softmax contrastive loss with a pairwise sigmoid loss for image-text training. |
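
The difference between the softmax contrastive objective (CLIP entry) and the sigmoid objective (SigLIP entry) can be sketched on a small image-text logit matrix. This is a minimal illustration, not either paper's implementation. It assumes the matrix is square, that matching pairs sit on the diagonal, and that any temperature or bias has already been applied to the logits. The function names are hypothetical.

```python
import math


def softmax_contrastive_loss(logits):
    """Symmetric cross-entropy over the rows and columns of an N x N logit matrix."""
    n = len(logits)

    def row_loss(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            peak = max(row)  # subtract the max for numerical stability
            log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_norm - row[i]  # -log softmax of the matching pair
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (row_loss(logits) + row_loss(transposed))


def sigmoid_pairwise_loss(logits):
    """Independent binary loss per pair: label +1 on the diagonal, -1 elsewhere."""
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            z = label * logits[i][j]
            # -log sigmoid(z) = log(1 + exp(-z)), in a numerically stable form
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n


aligned = [[4.0, 0.0], [0.0, 4.0]]   # matching pairs score highest
shuffled = [[0.0, 4.0], [4.0, 0.0]]  # mismatched pairs score highest
for name, matrix in (("aligned", aligned), ("shuffled", shuffled)):
    print(name, softmax_contrastive_loss(matrix), sigmoid_pairwise_loss(matrix))
```

The softmax version normalizes each row and column over the whole batch, while the sigmoid version scores every image-text pair independently, so it needs no batch-wide normalization.
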
Open-Vocabulary and Grounding
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2022 | Simple Open-Vocabulary Object Detection with Vision Transformers (Google) | OWL-ViT | Transfers image-text pretraining to open-vocabulary detection. |
| 2023 | Grounding DINO | Grounded detection | Open-set object detection from category names or referring expressions. |
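
The general idea behind open-vocabulary detection, where the label set is supplied as text at inference time, can be sketched as matching region embeddings against text embeddings. This is a simplified illustration under stated assumptions, not the OWL-ViT or Grounding DINO architecture. The embeddings are hypothetical hand-written vectors, `label_regions` and its `threshold` are made-up names, and box prediction and cross-modal fusion are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_regions(region_embeddings, text_embeddings, threshold=0.5):
    """Assign each region its best-matching text query, or None below threshold."""
    results = []
    for region in region_embeddings:
        scores = {name: cosine(region, vec) for name, vec in text_embeddings.items()}
        best = max(scores, key=scores.get)
        results.append(best if scores[best] >= threshold else None)
    return results


# Hypothetical 3-d embeddings; real systems obtain these from image and text encoders.
text_embeddings = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
region_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
print(label_regions(region_embeddings, text_embeddings))  # ['cat', 'dog', None]
```

Because the categories are just dictionary keys with text-derived vectors, adding a new class means adding a new text query rather than retraining a fixed classifier head.
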
Cross-Database Pointers
| Theme | Go To | Note |
|---|---|---|
| Masked image modeling | Self-Supervised Vision | MAE and iBOT are kept with the self-supervised vision lineage. |
| Foundation visual features | Self-Supervised Vision | DINOv2 is kept with large-scale self-supervised representation learning. |
| Modern ConvNet comparison | Recognition and Backbones | ConvNeXt belongs primarily to backbone design. |
| Promptable segmentation | Detection and Segmentation | SAM and SAM 2 are kept with segmentation and dense prediction. |
Reading Path
| Step | Read |
|---|---|
| 1 | ViT, DeiT, Swin, and BEiT for the Transformer backbone lineage. |
| 2 | CLIP and ALIGN for image-text pretraining. |
| 3 | CoCa, Flamingo, and SigLIP for multimodal scaling and objectives. |
| 4 | OWL-ViT and Grounding DINO for open-vocabulary detection and grounding. |
| 5 | DINOv2 through Self-Supervised Vision, plus SAM and SAM 2 through Detection and Segmentation, for modern visual foundation models. |