Detection and Segmentation

This database collects the major papers that turned image classification backbones into systems that localize, segment, and interactively select objects.

Object Detection

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2013 | Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation | R-CNN | CNN features plus region proposals. |
| 2014 | Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition | SPPnet | Handles variable-size regions with spatial pooling. |
| 2015 | Fast R-CNN | Detection pipeline | Faster region-based detector with shared computation. |
| 2015 | Faster R-CNN | Region proposals | Region Proposal Network inside the detector. |
| 2015 | You Only Look Once | YOLO | Single-pass real-time detection. |
| 2015 | SSD: Single Shot MultiBox Detector | SSD | Dense multi-scale one-stage detection. |
| 2016 | Feature Pyramid Networks for Object Detection | FPN | Multi-scale feature pyramids for detection. |
| 2017 | Focal Loss for Dense Object Detection | RetinaNet | Addresses foreground-background imbalance in one-stage detection. |
| 2020 | End-to-End Object Detection with Transformers | DETR | Detection as set prediction with Transformers. |
| 2020 | Deformable DETR | Efficient DETR | Sparse deformable attention for faster convergence and multi-scale features. |
| 2022 | DINO: DETR with Improved DeNoising Anchor Boxes | DETR training | Strong DETR variant with denoising and anchor refinement. |
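
The RetinaNet entry above notes that focal loss addresses foreground-background imbalance. As a rough illustration, the binary form of the loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), can be sketched as follows. The function names are illustrative and do not come from any library, and the defaults gamma=2 and alpha=0.25 are assumed from the values reported in the paper.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction (p = foreground probability)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted foreground probability, strictly between 0 and 1
    y: ground-truth label, 1 for foreground and 0 for background
    gamma, alpha: assumed defaults, see the note above
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background anchor contributes far less than a hard foreground one.
easy_bg = focal_loss(0.01, 0)
hard_fg = focal_loss(0.10, 1)
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which is one way to sanity-check an implementation.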

Segmentation

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2014 | Fully Convolutional Networks for Semantic Segmentation | FCN | Converts classification CNNs into dense predictors. |
| 2015 | U-Net | Biomedical segmentation | Encoder-decoder with skip connections for precise localization. |
| 2016 | DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs | DeepLab | Atrous convolution and CRF refinement. |
| 2017 | Mask R-CNN | Instance segmentation | Adds mask prediction to Faster R-CNN. |
| 2018 | Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation | DeepLabv3+ | Encoder-decoder refinement with atrous separable convolution. |
| 2021 | Masked-attention Mask Transformer for Universal Image Segmentation | Mask2Former | Unified semantic, instance, and panoptic segmentation. |
| 2023 | Segment Anything | SAM | Promptable segmentation foundation model. |
| 2024 | SAM 2: Segment Anything in Images and Videos | SAM 2 | Extends promptable segmentation to video with streaming memory. |
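
The DeepLab entries above rely on atrous (dilated) convolution, which samples the input with gaps so that a small kernel covers a wider context without adding weights. A minimal one-dimensional sketch, assuming no padding and a stride of 1, is shown below. The function name `atrous_conv1d` is hypothetical, and real models apply the same idea in two dimensions through a framework's dilation setting.

```python
def atrous_conv1d(x, w, rate=1):
    """1-D atrous convolution without padding: y[i] = sum_k x[i + rate*k] * w[k].

    As in most deep learning code, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    span = rate * (len(w) - 1) + 1  # input extent covered by the dilated kernel
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span + 1)
    ]


signal = [0, 1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = atrous_conv1d(signal, kernel, rate=1)    # kernel spans 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # same 3 weights span 5 samples
```

Raising the rate from 1 to 2 widens the receptive field from 3 to 5 input samples while the number of weights stays at 3.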

Reading Path

| Step | Read |
| --- | --- |
| 1 | R-CNN, Fast R-CNN, and Faster R-CNN for the region-based lineage. |
| 2 | YOLO, SSD, and RetinaNet for one-stage detection. |
| 3 | FCN, U-Net, DeepLab, and Mask R-CNN for dense prediction. |
| 4 | DETR, Deformable DETR, and DINO for Transformer-based detection. |
| 5 | Mask2Former, SAM, and SAM 2 for modern unified and promptable segmentation. |