Computer Vision
Computer vision is the field where many core deep learning ideas rose to prominence: convolutional representations, large-scale supervised learning, residual connections, dense prediction, self-supervised representation learning, and, more recently, visual foundation models.
This page is the map. Detailed paper lists live in the focused database notes listed below.
Focused Databases
| Database | Scope |
|---|---|
| Recognition and Backbones | LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet, ConvNeXt, and efficient CNN families. |
| Detection and Segmentation | R-CNN family, YOLO, SSD, RetinaNet, DETR, semantic segmentation, instance segmentation, SAM, and SAM 2. |
| Self-Supervised Vision | Pretext tasks, contrastive learning, clustering, BYOL, DINO, MAE, iBOT, and DINOv2. |
| Vision Transformers and Foundation Models | ViT, DeiT, Swin, BEiT, CLIP, ALIGN, open-vocabulary detection, visual grounding, and promptable segmentation. |
Milestone Map
| Stage | Key Papers | Primary Database |
|---|---|---|
| Recognition backbones | LeNet, AlexNet, VGG, Inception, ResNet, DenseNet, EfficientNet, ConvNeXt | Recognition and Backbones |
| Dense prediction | FCN, U-Net, DeepLab, Mask R-CNN, Mask2Former, SAM, SAM 2 | Detection and Segmentation |
| Detection | R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD, RetinaNet, DETR, Deformable DETR, Grounding DINO | Detection and Segmentation |
| Self-supervised vision | Context prediction, MoCo, SimCLR, BYOL, DINO, MAE, iBOT, DINOv2 | Self-Supervised Vision |
| Vision foundation models | ViT, DeiT, Swin, CLIP, ALIGN, OWL-ViT, Grounding DINO, SAM | Vision Transformers and Foundation Models |
Suggested Paths
| Path | Read |
|---|---|
| CNN foundations | LeNet → AlexNet → VGG → Inception → ResNet → DenseNet → EfficientNet → ConvNeXt. |
| Dense prediction | FCN → U-Net → DeepLab → Mask R-CNN → Mask2Former → SAM → SAM 2. |
| Object detection | R-CNN → Fast R-CNN → Faster R-CNN → YOLO → SSD → RetinaNet → DETR → Deformable DETR → Grounding DINO. |
| Self-supervised vision | Context prediction → InstDisc → DeepCluster → MoCo → SimCLR → BYOL → DINO → MAE → DINOv2. |
| Visual foundation models | ViT → DeiT → Swin → CLIP → ALIGN → OWL-ViT → Grounding DINO → SAM. |