Detection and Segmentation
This database collects the major papers that turned image classification backbones into systems that localize, segment, and interactively select objects.
Object Detection
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2013 | Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation | R-CNN | CNN features plus region proposals. |
| 2014 | Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition | SPPnet | Spatial pyramid pooling produces fixed-length features from variable-size regions. |
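| 2015 | Fast R-CNN | Detection pipeline | Shares one convolutional feature map across all proposals via RoI pooling, making training and inference much faster than R-CNN. |
| 2015 | Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks | Region proposals | Replaces external proposal methods with a Region Proposal Network that shares features with the detector. |
| 2015 | You Only Look Once: Unified, Real-Time Object Detection | YOLO | Single-pass real-time detection. |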
| 2015 | SSD: Single Shot MultiBox Detector | SSD | Dense multi-scale one-stage detection. |
| 2016 | Feature Pyramid Networks for Object Detection | FPN | Multi-scale feature pyramids for detection. |
| 2017 | Focal Loss for Dense Object Detection | RetinaNet | Addresses foreground-background imbalance in one-stage detection. |
| 2020 | End-to-End Object Detection with Transformers | DETR | Detection as set prediction with Transformers. |
| 2020 | Deformable DETR: Deformable Transformers for End-to-End Object Detection | Efficient DETR | Sparse deformable attention for faster convergence and multi-scale features. |
| 2022 | DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection | DETR training | Strong DETR variant with denoising training and anchor box refinement. |
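
The RetinaNet row above refers to foreground-background imbalance: a dense one-stage detector scores a very large number of candidate locations, and most of them are easy background. As a minimal sketch, assuming the binary form of the focal loss with the commonly cited defaults `gamma=2` and `alpha=0.25`, the snippet below shows how confidently correct examples are down-weighted relative to plain cross-entropy. The function names and the sample probabilities are illustrative choices, not taken from any particular codebase.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction.

    p: predicted foreground probability in (0, 1)
    y: ground-truth label, 1 for foreground and 0 for background
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t)**gamma shrinks the loss of examples that are already
    # classified well, so easy background contributes very little.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative values: an easy background location and a hard foreground one.
easy_background = focal_loss(0.01, 0)
hard_foreground = focal_loss(0.10, 1)

if __name__ == "__main__":
    print("easy background: CE =", cross_entropy(0.01, 0), "FL =", easy_background)
    print("hard foreground: CE =", cross_entropy(0.10, 1), "FL =", hard_foreground)
```

With `gamma=0` the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which is a convenient sanity check when experimenting with the parameters.
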
Segmentation
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2014 | Fully Convolutional Networks for Semantic Segmentation | FCN | Converts classification CNNs into dense predictors. |
| 2015 | U-Net: Convolutional Networks for Biomedical Image Segmentation | Biomedical segmentation | Encoder-decoder with skip connections for precise localization. |
| 2016 | DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs | DeepLab | Atrous convolution and CRF refinement. |
| 2017 | Mask R-CNN | Instance segmentation | Adds a parallel mask prediction branch and RoIAlign to Faster R-CNN. |
| 2018 | Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation | DeepLabv3+ | Encoder-decoder refinement with atrous separable convolution. |
| 2021 | Masked-attention Mask Transformer for Universal Image Segmentation | Mask2Former | Unified semantic, instance, and panoptic segmentation. |
| 2023 | Segment Anything | SAM | Promptable segmentation foundation model. |
| 2024 | SAM 2: Segment Anything in Images and Videos | SAM 2 | Extends promptable segmentation to video with streaming memory. |
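
The DeepLab rows above rely on atrous (dilated) convolution, which spaces the kernel taps `rate` samples apart so the receptive field grows without adding weights. The following is a minimal 1-D sketch in plain Python, assuming no padding and the cross-correlation convention used by most deep learning libraries. The helper name `atrous_conv1d` and the toy signal are hypothetical and serve only as illustration.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """Valid (unpadded) 1-D atrous convolution, written as cross-correlation.

    Each output is sum_k kernel[k] * signal[i + k * rate], so rate=1 is an
    ordinary convolution and larger rates skip samples between taps.
    """
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    return [
        sum(w * signal[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


signal = [0, 1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # three weights in both cases

dense = atrous_conv1d(signal, kernel, rate=1)    # taps cover 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # taps cover 5 samples

if __name__ == "__main__":
    print(dense)
    print(dilated)
```

The same three weights see a window of 3 samples at `rate=1` and 5 samples at `rate=2`, which is the trade-off the DeepLab family uses to enlarge context while keeping feature maps dense.
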
Reading Path
| Step | Read |
|---|---|
| 1 | R-CNN, Fast R-CNN, and Faster R-CNN for the region-based lineage. |
| 2 | YOLO, SSD, and RetinaNet for one-stage detection. |
| 3 | FCN, U-Net, DeepLab, and Mask R-CNN for dense prediction. |
| 4 | DETR, Deformable DETR, and DINO for Transformer-based detection. |
| 5 | Mask2Former, SAM, and SAM 2 for modern unified and promptable segmentation. |