Vision Transformers and Foundation Models

This database tracks the shift from CNN-first computer vision to patch-based Transformers, large-scale image-text pretraining, open-vocabulary recognition, visual grounding, and promptable segmentation.

Vision Transformers

Year	Paper	Topic	Note
2020	An Image is Worth 16x16 Words	ViT	Applies a standard Transformer to image patches.
2020	Training Data-Efficient Image Transformers & Distillation Through Attention	DeiT	Trains ViT competitively on ImageNet-1k alone, using strong augmentation and a distillation token.
2021	Swin Transformer	Hierarchical ViT	Shifted-window attention for dense vision tasks.
2021	MLP-Mixer	Token mixing	Shows that an all-MLP model over patches can be competitive without attention or convolutions.
2021	BEiT	Masked image modeling	BERT-style pretraining for image patches.
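
The "image patches" idea in the ViT row can be illustrated with a minimal sketch. It assumes a toy image stored as nested Python lists (height x width x channels) and a hypothetical helper named `patchify`; it is not code from any of the papers above, and it leaves out the learned linear projection and position embeddings that a real model applies to each flattened patch.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    # Walk the grid of patch origins in row-major order.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])  # append all channels of one pixel
            tokens.append(token)
    return tokens


# Toy 4 x 4 RGB image whose pixel value encodes its position (illustrative data).
image = [[[r * 4 + c] * 3 for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), len(tokens[0]))  # 4 patches, each holding 2 * 2 * 3 = 12 values
```

With the same arithmetic, a 224 x 224 image cut into 16 x 16 patches yields (224 / 16)^2 = 196 tokens.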

Image-Text and Multimodal Pretraining

Year	Paper	Topic	Note
2021	Learning Transferable Visual Models From Natural Language Supervision (OpenAI)	CLIP	Contrastive image-text pretraining for zero-shot recognition.
2021	Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision	ALIGN	Large-scale noisy image-text representation learning.
2021	Florence: A New Foundation Model for Computer Vision	Vision foundation model	Image-text pretrained model adapted to classification, retrieval, detection, and video tasks.
2022	CoCa: Contrastive Captioners are Image-Text Foundation Models	Contrastive captioning	Combines contrastive learning and captioning.
2022	Flamingo	Vision-language few-shot learning	Connects a frozen vision encoder to a frozen language model through gated cross-attention, enabling few-shot learning on interleaved image-text inputs.
2023	Sigmoid Loss for Language Image Pre-Training	SigLIP	Replaces softmax contrastive loss with sigmoid loss for image-text training.
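
The difference between the softmax contrastive objective (CLIP row) and the sigmoid objective (SigLIP row) can be sketched on a small logit matrix. This is a simplified illustration, not either paper's implementation: the function names are hypothetical, the logits are assumed to be already scaled image-text similarities with matched pairs on the diagonal, and the learnable temperature and bias are omitted.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_sum for v in row]


def softmax_contrastive_loss(logits):
    """Symmetric cross-entropy: each image must pick its text, and vice versa."""
    n = len(logits)
    columns = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return (image_to_text + text_to_image) / 2


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def sigmoid_pairwise_loss(logits):
    """Independent binary loss per pair: +1 on the diagonal, -1 elsewhere."""
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logits[i][j])
    return -total / n  # averaged over the batch size (a choice made for this sketch)


aligned = [[5.0, -5.0], [-5.0, 5.0]]   # matched pairs score high
shuffled = [[-5.0, 5.0], [5.0, -5.0]]  # matched pairs score low
for loss in (softmax_contrastive_loss, sigmoid_pairwise_loss):
    print(loss.__name__, loss(aligned) < loss(shuffled))
```

The softmax version normalizes across a whole row or column of the batch, while the sigmoid version scores every image-text pair independently, so it needs no batch-wide normalization.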

Open-Vocabulary and Grounding

Year	Paper	Topic	Note
2022	Simple Open-Vocabulary Object Detection with Vision Transformers (Google)	OWL-ViT	Transfers image-text pretraining to open-vocabulary detection.
2023	Grounding DINO	Grounded detection	Open-set object detection from category names or referring expressions.
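
Open-vocabulary detection replaces a fixed classifier head with a comparison between region features and text features. The sketch below shows only that scoring step, under loud assumptions: the embeddings are made-up 3-dimensional vectors, `classify_regions` and its `threshold` parameter are hypothetical, and box prediction is left out entirely. It is not the OWL-ViT or Grounding DINO implementation; Grounding DINO in particular fuses the two modalities inside the model rather than comparing them only at the end.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def classify_regions(region_embeddings, query_embeddings, threshold=0.5):
    """Give each region its best-matching text query, or None below the threshold."""
    queries = {name: normalize(vec) for name, vec in query_embeddings.items()}
    results = []
    for region in region_embeddings:
        region = normalize(region)
        scores = {
            name: sum(a * b for a, b in zip(region, query))
            for name, query in queries.items()
        }
        best = max(scores, key=scores.get)
        results.append(best if scores[best] >= threshold else None)
    return results


# Illustrative embeddings: the category list is supplied at query time, not at training time.
queries = {"cat": [1.0, 0.0, 0.0], "bicycle": [0.0, 1.0, 0.0]}
regions = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
labels = classify_regions(regions, queries)
print(labels)  # ['cat', 'bicycle', None]
```

Because the label set is just a dictionary of text embeddings, adding a new category means adding a new query rather than retraining a classifier head.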

Cross-Database Pointers

Theme	Go To	Note
Masked image modeling	Self-Supervised Vision	MAE and iBOT are kept with the self-supervised vision lineage.
Foundation visual features	Self-Supervised Vision	DINOv2 is kept with large-scale self-supervised representation learning.
Modern ConvNet comparison	Recognition and Backbones	ConvNeXt belongs primarily to backbone design.
Promptable segmentation	Detection and Segmentation	SAM and SAM 2 are kept with segmentation and dense prediction.

Reading Path

Step	Read
1	ViT, DeiT, Swin, and BEiT for the Transformer backbone lineage.
2	CLIP and ALIGN for image-text pretraining.
3	CoCa, Flamingo, and SigLIP for multimodal scaling and objectives.
4	OWL-ViT and Grounding DINO for open-vocabulary detection and grounding.
5	DINOv2 through Self-Supervised Vision, plus SAM and SAM 2 through Detection and Segmentation, for modern visual foundation models.