Vision Transformers and Foundation Models

This database tracks the shift from CNN-first computer vision to patch-based Transformers, large-scale image-text pretraining, open-vocabulary recognition, visual grounding, and promptable segmentation.

Vision Transformers

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2020 | An Image is Worth 16x16 Words | ViT | Applies a standard Transformer to image patches. |
| 2020 | Training Data-Efficient Image Transformers and Distillation through Attention | DeiT | Makes ViT training more data-efficient. |
| 2021 | Swin Transformer | Hierarchical ViT | Shifted-window attention for dense vision tasks. |
| 2021 | MLP-Mixer | Token mixing | Shows competitive patch models without attention or convolutions. |
| 2021 | BEiT | Masked image modeling | BERT-style pretraining for image patches. |
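
The ViT entry above rests on turning an image into a sequence of patch tokens. The following is a minimal sketch of that step only, assuming an image stored as nested Python lists in height x width x channel order. The function name `patchify` and the toy image are illustrative and are not taken from any of the listed papers.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch tokens.

    Returns (H // patch) * (W // patch) tokens in row-major order,
    each a flat list of patch * patch * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            # Flatten one patch, pixel by pixel, in row-major order.
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])
            tokens.append(token)
    return tokens


# Hypothetical toy input: a 4x4 RGB image whose pixel values encode position.
toy_image = [[[y, x, y + x] for x in range(4)] for y in range(4)]
toy_tokens = patchify(toy_image, patch=2)
print(len(toy_tokens), len(toy_tokens[0]))  # prints "4 12"
```

With a 224x224 RGB image and a patch size of 16, the same function yields 196 tokens of 768 values each. In ViT, each flattened patch is then passed through a learned linear projection and combined with a position embedding before entering the Transformer; those steps are omitted here.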

Image-Text and Multimodal Pretraining

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2021 | Learning Transferable Visual Models From Natural Language Supervision (OpenAI) | CLIP | Contrastive image-text pretraining for zero-shot recognition. |
| 2021 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ALIGN | Large-scale representation learning from noisy image-text pairs. |
| 2021 | Florence: A New Foundation Model for Computer Vision | Vision foundation model | Large-scale visual representation system for many tasks. |
| 2022 | CoCa: Contrastive Captioners are Image-Text Foundation Models | Contrastive captioning | Combines a contrastive objective with a captioning objective in one model. |
| 2022 | Flamingo | Vision-language few-shot learning | Few-shot visual language model that connects frozen pretrained vision and language models and accepts interleaved image-text prompts. |
| 2023 | Sigmoid Loss for Language Image Pre-Training | SigLIP | Replaces the softmax contrastive loss with a pairwise sigmoid loss for image-text training. |
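
The difference between the CLIP and SigLIP entries above comes down to the loss applied to the image-text similarity matrix. The sketch below contrasts the two under simplifying assumptions: `sim[i][j]` is the similarity between image `i` and text `j` from already-normalized embeddings, matched pairs sit on the diagonal, and the temperature and bias are passed in as fixed numbers even though both methods learn them during training. The function names and toy values are illustrative.

```python
import math


def softmax_contrastive_loss(sim, temperature):
    """Symmetric softmax contrastive loss over an n x n similarity matrix."""
    n = len(sim)
    logits = [[temperature * s for s in row] for row in sim]

    def cross_entropy(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_norm - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def sigmoid_pairwise_loss(sim, temperature, bias):
    """Pairwise sigmoid loss: each (image, text) pair is a binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            margin = label * (temperature * sim[i][j] + bias)
            # -log(sigmoid(margin)) written as a numerically stable softplus.
            total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / n


# Hypothetical toy batches: matched pairs on the diagonal vs. swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(softmax_contrastive_loss(aligned, 10.0) < softmax_contrastive_loss(swapped, 10.0))
print(sigmoid_pairwise_loss(aligned, 10.0, -5.0) < sigmoid_pairwise_loss(swapped, 10.0, -5.0))
```

The softmax version normalizes each row and column across the whole batch, while the sigmoid version scores every pair independently, so it needs no batch-wide normalization.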

Open-Vocabulary and Grounding

| Year | Paper | Topic | Note |
|------|-------|-------|------|
| 2022 | Simple Open-Vocabulary Object Detection with Vision Transformers (Google) | OWL-ViT | Transfers image-text pretraining to open-vocabulary detection. |
| 2023 | Grounding DINO | Grounded detection | Open-set object detection from category names or referring expressions. |
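
Open-vocabulary detection, as in the entries above, replaces a fixed classifier with a comparison between region features and text query features. The sketch below shows only that matching idea in a heavily simplified form. It assumes region and query embeddings already live in a shared space, and the names `cosine`, `label_regions`, the threshold, and the toy vectors are all hypothetical. It does not reproduce the actual prediction heads of OWL-ViT or Grounding DINO.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_regions(region_embeddings, query_embeddings, threshold):
    """Assign each region its best-matching text query, or None if too weak."""
    labels = []
    for region in region_embeddings:
        scores = {name: cosine(region, vec) for name, vec in query_embeddings.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else None)
    return labels


# Hypothetical toy embeddings; real models produce these with learned encoders.
queries = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
regions = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
print(label_regions(regions, queries, threshold=0.5))  # ['cat', 'dog', None]
```

Because the label set is just a dictionary of text embeddings, new categories can be added at query time without retraining, which is the property the open-vocabulary papers exploit.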

Cross-Database Pointers

| Theme | Go To | Note |
|-------|-------|------|
| Masked image modeling | Self-Supervised Vision | MAE and iBOT are kept with the self-supervised vision lineage. |
| Foundation visual features | Self-Supervised Vision | DINOv2 is kept with large-scale self-supervised representation learning. |
| Modern ConvNet comparison | Recognition and Backbones | ConvNeXt belongs primarily to backbone design. |
| Promptable segmentation | Detection and Segmentation | SAM and SAM 2 are kept with segmentation and dense prediction. |

Reading Path

| Step | Read |
|------|------|
| 1 | ViT, DeiT, Swin, and BEiT for the Transformer backbone lineage. |
| 2 | CLIP and ALIGN for image-text pretraining. |
| 3 | CoCa, Flamingo, and SigLIP for multimodal scaling and objectives. |
| 4 | OWL-ViT and Grounding DINO for open-vocabulary detection and grounding. |
| 5 | DINOv2 through Self-Supervised Vision, plus SAM and SAM 2 through Detection and Segmentation, for modern visual foundation models. |