Computer Vision

Computer vision is where many deep learning ideas became visibly dominant: convolutional representations, large-scale supervised learning, residual connections, dense prediction, self-supervised representation learning, and later visual foundation models.

This page is the map. The detailed paper lists live in focused database notes.

Focused Databases

| Database | Scope |
| --- | --- |
| Recognition and Backbones | LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet, ConvNeXt, and efficient CNN families. |
| Detection and Segmentation | R-CNN family, YOLO, SSD, RetinaNet, DETR, semantic segmentation, instance segmentation, SAM, and SAM 2. |
| Self-Supervised Vision | Pretext tasks, contrastive learning, clustering, BYOL, DINO, MAE, iBOT, and DINOv2. |
| Vision Transformers and Foundation Models | ViT, DeiT, Swin, BEiT, CLIP, ALIGN, open-vocabulary detection, visual grounding, and promptable segmentation. |

Milestone Map

| Stage | Key Papers | Primary Database |
| --- | --- | --- |
| Recognition backbones | LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet, ConvNeXt | Recognition and Backbones |
| Dense prediction | FCN, U-Net, DeepLab, Mask R-CNN, Mask2Former, SAM, SAM 2 | Detection and Segmentation |
| Detection | R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD, RetinaNet, DETR, Deformable DETR, Grounding DINO | Detection and Segmentation |
| Self-supervised vision | Context prediction, MoCo, SimCLR, BYOL, DINO, MAE, iBOT, DINOv2 | Self-Supervised Vision |
| Vision foundation models | ViT, DeiT, Swin, CLIP, ALIGN, OWL-ViT, Grounding DINO, SAM | Vision Transformers and Foundation Models |

Suggested Paths

| Path | Read |
| --- | --- |
| CNN foundations | LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet, ConvNeXt. |
| Dense prediction | FCN, U-Net, DeepLab, Mask R-CNN, Mask2Former, SAM, SAM 2. |
| Object detection | R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD, RetinaNet, DETR, Deformable DETR, Grounding DINO. |
| Self-supervised vision | Context prediction, InstDisc, DeepCluster, MoCo, SimCLR, BYOL, DINO, MAE, DINOv2. |
| Visual foundation models | ViT, DeiT, Swin, CLIP, ALIGN, OWL-ViT, Grounding DINO, SAM. |