Self-Supervised Vision

Self-supervised vision asks how to learn useful visual representations without human-annotated labels. Historically, the field moved from handcrafted pretext tasks through contrastive objectives, clustering, and bootstrap learning to masked image modeling and large-scale foundation features.

Pretext Tasks and Early Unsupervised Representation Learning

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2014 | Discriminative Unsupervised Feature Learning with Convolutional Neural Networks | Exemplar CNN | Learns invariances by classifying transformed image patches. |
| 2015 | Unsupervised Visual Representation Learning by Context Prediction | Context prediction | Predicts relative patch positions. |
| 2016 | Colorful Image Colorization | Colorization | Uses color prediction as a visual pretext task. |
| 2016 | Context Encoders | Inpainting | Learns representations by filling missing image regions. |
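
As an illustration of how a pretext task turns unlabeled images into a supervised problem, the sketch below samples training pairs in the spirit of context prediction: a centre patch, one of its eight neighbours, and the neighbour's position as the label. It is a minimal sketch, not the paper's pipeline. The function name `sample_patch_pair`, the patch size, the gap, and the representation of an image as a 2-D list are all assumptions made for this example.

```python
import random

# Offsets (row, col) of the 8 neighbours around a centre patch, in reading order.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def sample_patch_pair(image, patch=4, gap=1, rng=random):
    """Return (centre_patch, neighbour_patch, label) from a 2-D list `image`.

    `label` indexes NEIGHBOR_OFFSETS, so the pretext task is 8-way
    classification. `patch` and `gap` are illustrative values (assumptions).
    The image must be at least 3 * patch + 2 * gap pixels on each side.
    """
    height, width = len(image), len(image[0])
    stride = patch + gap  # distance between the origins of adjacent patches
    # Choose the centre's top-left corner so that all 8 neighbours fit.
    top = rng.randrange(stride, height - stride - patch + 1)
    left = rng.randrange(stride, width - stride - patch + 1)
    label = rng.randrange(len(NEIGHBOR_OFFSETS))
    d_row, d_col = NEIGHBOR_OFFSETS[label]

    def crop(r, c):
        # Cut a patch x patch window whose top-left corner is (r, c).
        return [row[c:c + patch] for row in image[r:r + patch]]

    centre = crop(top, left)
    neighbour = crop(top + d_row * stride, left + d_col * stride)
    return centre, neighbour, label
```

A network trained on such pairs would take both patches as input and predict `label`; the gap between patches is one simple way to discourage solutions that rely only on matching edges across the patch boundary.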

Contrastive, Clustering, and Bootstrap Learning

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2018 | Unsupervised Feature Learning via Non-Parametric Instance Discrimination | Instance discrimination | Treats each image instance as its own class. |
| 2018 | Deep Clustering for Unsupervised Learning of Visual Features | DeepCluster | Alternates clustering and representation learning. |
| 2018 | Representation Learning with Contrastive Predictive Coding | CPC | Predictive contrastive learning. |
| 2019 | Momentum Contrast for Unsupervised Visual Representation Learning | MoCo | Memory queue and momentum encoder for contrastive learning. |
| 2020 | A Simple Framework for Contrastive Learning of Visual Representations | SimCLR | Strong augmentation and large-batch contrastive learning. |
| 2020 | Bootstrap Your Own Latent | BYOL | Self-supervised learning without negative pairs. |
| 2020 | Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | SwAV | Online clustering with swapped assignments. |
| 2020 | Exploring Simple Siamese Representation Learning | SimSiam | Siamese self-supervision without negatives or momentum encoder. |
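
Two ideas recur across the papers above: a contrastive loss that pulls a query toward its positive and away from negatives, and a momentum (exponential moving average) update that keeps a slowly changing target encoder. The sketch below shows a generic InfoNCE-style loss and an EMA weight update in plain Python. It is not the exact formulation of any single paper. The function names, the cosine similarity choice, and the default `temperature` and `momentum` values are assumptions for illustration.

```python
import math


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one query, one positive, and a list of negatives.

    Vectors are plain lists of floats; similarity is cosine similarity.
    The temperature default is an illustrative assumption.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    # Index 0 holds the positive logit, the rest are negatives.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp, then cross-entropy with target index 0.
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return log_sum - logits[0]


def momentum_update(target, online, momentum=0.99):
    """Move target-encoder weights a small step toward the online weights."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]
```

In this framing, the methods differ mainly in where negatives come from (for example a memory queue or the other images in a batch) and in whether negatives are used at all; approaches without negative pairs replace the loss above with a prediction objective between two views.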

Masked Image Modeling and Foundation Features

| Year | Paper | Topic | Note |
| --- | --- | --- | --- |
| 2021 | Emerging Properties in Self-Supervised Vision Transformers | DINO | Self-distillation produces strong ViT features and object localization behavior. |
| 2021 | Masked Autoencoders Are Scalable Vision Learners | MAE | Reconstructs masked patches with asymmetric encoder-decoder training. |
| 2021 | iBOT: Image BERT Pre-Training with Online Tokenizer | iBOT | Combines masked image modeling with self-distillation. |
| 2023 | DINOv2: Learning Robust Visual Features without Supervision | DINOv2 | Large-scale self-supervised features for broad visual tasks. |
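
The masking step behind masked image modeling can be sketched in a few lines: split the patch indices of an image into a visible set that the encoder processes and a masked set that the decoder must reconstruct. This is a minimal sketch under stated assumptions, not a reference implementation. The name `random_masking`, the uniform random sampling, and the default `mask_ratio=0.75` are illustrative choices.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, rng=random):
    """Split patch indices into (visible, masked) lists, both sorted.

    `mask_ratio` is the fraction of patches hidden from the encoder;
    the default is an illustrative assumption.
    """
    num_keep = int(num_patches * (1.0 - mask_ratio))
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)  # uniform random permutation of patch indices
    visible = sorted(shuffled[:num_keep])
    masked = sorted(shuffled[num_keep:])
    return visible, masked
```

For a 14 x 14 grid of patches, `random_masking(196)` keeps 49 patches visible and masks 147. Because the encoder only sees the visible subset, its cost scales with the number of kept patches, which is one reason an asymmetric encoder-decoder design is attractive.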

Reading Path

| Step | Read |
| --- | --- |
| 1 | Context prediction, colorization, and context encoders for pretext-task intuition. |
| 2 | Instance discrimination, DeepCluster, CPC, MoCo, and SimCLR for instance-level, clustering-based, and contrastive learning. |
| 3 | BYOL, SwAV, and SimSiam for alternatives to standard negative-pair contrast. |
| 4 | DINO, MAE, iBOT, and DINOv2 for the modern ViT-era self-supervised lineage. |