Computer Vision

Computer vision is where many deep learning ideas first became visibly dominant: convolutional representations, large-scale supervised learning, residual connections, dense prediction, self-supervised representation learning, and, later, vision foundation models.

This page is the map. The detailed paper lists live in focused database notes.

Focused Databases

Database	Scope
Recognition and Backbones	LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet, ConvNeXt, and efficient CNN families.
Detection and Segmentation	R-CNN family, YOLO, SSD, RetinaNet, DETR, semantic segmentation, instance segmentation, SAM, and SAM 2.
Self-Supervised Vision	Pretext tasks, contrastive learning, clustering, BYOL, DINO, MAE, iBOT, and DINOv2.
Vision Transformers and Foundation Models	ViT, DeiT, Swin, BEiT, CLIP, ALIGN, open-vocabulary detection, visual grounding, and promptable segmentation.

Milestone Map

Stage	Key Papers	Primary Database
Recognition backbones	LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet, ConvNeXt	Recognition and Backbones
Dense prediction	FCN, U-Net, DeepLab, Mask R-CNN, Mask2Former, SAM, SAM 2	Detection and Segmentation
Detection	R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD, RetinaNet, DETR, Deformable DETR, Grounding DINO	Detection and Segmentation
Self-supervised vision	Context prediction, MoCo, SimCLR, BYOL, DINO, MAE, iBOT, DINOv2	Self-Supervised Vision
Vision foundation models	ViT, DeiT, Swin, CLIP, ALIGN, OWL-ViT, Grounding DINO, SAM	Vision Transformers and Foundation Models

Suggested Paths

Path	Read
CNN foundations	LeNet → AlexNet → VGG → Inception → ResNet → DenseNet → EfficientNet → ConvNeXt.
Dense prediction	FCN → U-Net → DeepLab → Mask R-CNN → Mask2Former → SAM → SAM 2.
Object detection	R-CNN → Fast R-CNN → Faster R-CNN → YOLO → SSD → RetinaNet → DETR → Deformable DETR → Grounding DINO.
Self-supervised vision	Context prediction → InstDisc → DeepCluster → MoCo → SimCLR → BYOL → DINO → MAE → DINOv2.
Vision foundation models	ViT → DeiT → Swin → CLIP → ALIGN → OWL-ViT → Grounding DINO → SAM.
