Self-Supervised Vision

Self-supervised vision asks how to learn useful visual representations without human-annotated labels, using supervisory signals derived from the images themselves. The historical path moves from handcrafted pretext tasks to contrastive objectives, clustering, bootstrap learning, masked image modeling, and large-scale foundation features.

Pretext Tasks and Early Unsupervised Representation Learning

Year	Paper	Topic	Note
2014	Discriminative Unsupervised Feature Learning with Convolutional Neural Networks	Exemplar CNN	Learns invariances by classifying transformed image patches.
2015	Unsupervised Visual Representation Learning by Context Prediction	Context prediction	Predicts relative patch positions.
2016	Colorful Image Colorization	Colorization	Uses color prediction as a visual pretext task.
2016	Context Encoders: Feature Learning by Inpainting	Inpainting	Learns representations by filling in missing image regions.
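
To make the pretext-task idea concrete, here is a minimal sketch of how training pairs for relative patch position prediction could be generated. It is an illustration under stated assumptions, not the paper's pipeline: the image is a plain 2D list, and the names `NEIGHBOR_OFFSETS`, `crop`, `sample_patch_pair`, `patch_size`, and `gap` are hypothetical. The label is free supervision, since it comes from where the patches were cut rather than from an annotator.

```python
import random

# Offsets (row, col) of the 8 neighbours around a centre patch. The index in
# this list is the class label a network would be trained to predict.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def crop(image, top, left, size):
    """Return a size x size patch from a 2D list-of-lists image."""
    return [row[left:left + size] for row in image[top:top + size]]


def sample_patch_pair(image, patch_size, gap, rng):
    """Sample (centre_patch, neighbour_patch, label) for position prediction.

    `gap` leaves space between the two patches so the task is harder to
    solve by matching edges that continue across the patch boundary.
    Assumes the image is at least 3 * patch_size + 2 * gap on each side.
    """
    height, width = len(image), len(image[0])
    stride = patch_size + gap
    # Choose a centre location that leaves room for every possible neighbour.
    top = rng.randint(stride, height - stride - patch_size)
    left = rng.randint(stride, width - stride - patch_size)
    label = rng.randrange(len(NEIGHBOR_OFFSETS))
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    centre = crop(image, top, left, patch_size)
    neighbour = crop(image, top + d_row * stride, left + d_col * stride, patch_size)
    return centre, neighbour, label
```

An encoder would embed both patches and a small classifier would predict `label` from the pair; the encoder is what gets reused downstream.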

Contrastive, Clustering, and Bootstrap Learning

Year	Paper	Topic	Note
2018	Unsupervised Feature Learning via Non-Parametric Instance Discrimination	Instance discrimination	Treats each image instance as its own class.
2018	Deep Clustering for Unsupervised Learning of Visual Features	DeepCluster	Alternates clustering and representation learning.
2018	Representation Learning with Contrastive Predictive Coding	CPC	Predicts future latent representations with the InfoNCE contrastive loss.
2019	Momentum Contrast for Unsupervised Visual Representation Learning	MoCo	Memory queue and momentum encoder for contrastive learning.
2020	A Simple Framework for Contrastive Learning of Visual Representations	SimCLR	Strong augmentation and large-batch contrastive learning.
2020	Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning	BYOL	Online network predicts a momentum-averaged target network's output; no negative pairs.
2020	Unsupervised Learning of Visual Features by Contrasting Cluster Assignments	SwAV	Online clustering with swapped assignments.
2020	Exploring Simple Siamese Representation Learning	SimSiam	Siamese self-supervision without negatives or a momentum encoder; relies on a stop-gradient.
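
Several entries in this table (CPC, MoCo, SimCLR) share an InfoNCE-style objective: a query embedding should score higher against its positive key than against any negative key. The sketch below shows that loss for a single query in plain Python. It is a simplified illustration, not any paper's implementation: the function names, the default `temperature=0.1`, and the use of cosine similarity on list-based vectors are assumptions, and real systems compute this in batches on a GPU.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query.

    Equivalent to softmax cross-entropy over similarity logits, where the
    positive key (index 0) is the target class.
    """
    q = l2_normalize(query)
    keys = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    # Cosine similarity between the query and each key, sharpened by temperature.
    logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]
```

The methods differ mainly in where the negatives come from: other samples in a large batch for SimCLR, a queue of keys from a momentum encoder for MoCo. BYOL and SimSiam drop the negatives entirely and so do not use this loss.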

Masked Image Modeling and Foundation Features

Year	Paper	Topic	Note
2021	Emerging Properties in Self-Supervised Vision Transformers	DINO	Self-distillation produces strong ViT features and object localization behavior.
2021	Masked Autoencoders Are Scalable Vision Learners	MAE	Reconstructs masked patches with asymmetric encoder-decoder training.
2021	iBOT: Image BERT Pre-Training with Online Tokenizer	iBOT	Combines masked image modeling with self-distillation.
2023	DINOv2: Learning Robust Visual Features without Supervision	DINOv2	Large-scale self-supervised features for broad visual tasks.
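
The masking step behind MAE-style pretraining can be sketched in a few lines. This is a hedged illustration rather than the reference implementation: `random_patch_mask` and its parameters are hypothetical names, and it only selects patch indices without touching pixels. The MAE paper masks a high fraction of patches (75% by default), and its encoder processes only the visible ones, which is where the asymmetric encoder-decoder design gets its efficiency.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible, masked) lists.

    In an MAE-style setup the encoder would see only `visible`, and a
    lightweight decoder would reconstruct the pixels of `masked`.
    """
    num_masked = int(num_patches * mask_ratio)
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    masked = sorted(shuffled[:num_masked])
    visible = sorted(shuffled[num_masked:])
    return visible, masked
```

For a 224 x 224 image cut into 16 x 16 patches there are 14 * 14 = 196 patches, so a 0.75 mask ratio leaves 49 visible patches for the encoder and 147 for the decoder to reconstruct.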

Reading Path

Step	Read
1	Context prediction, colorization, and context encoders for pretext-task intuition.
2	Instance Discrimination, CPC, MoCo, and SimCLR for contrastive learning, with DeepCluster as the clustering-based counterpart.
3	BYOL, SwAV, and SimSiam for alternatives to standard negative-pair contrast.
4	DINO, MAE, iBOT, and DINOv2 for the modern ViT-era self-supervised lineage.