Self-Supervised Vision
Self-supervised vision asks how to learn useful visual representations from unlabeled images, with the training signal derived from the data itself rather than from human annotation. The historical path moves from handcrafted pretext tasks to contrastive objectives, clustering, bootstrap learning, masked image modeling, and large-scale foundation features.
Pretext Tasks and Early Unsupervised Representation Learning
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2014 | Discriminative Unsupervised Feature Learning with Convolutional Neural Networks | Exemplar CNN | Learns invariances by classifying transformed image patches. |
| 2015 | Unsupervised Visual Representation Learning by Context Prediction | Context prediction | Predicts relative patch positions. |
| 2016 | Colorful Image Colorization | Colorization | Uses color prediction as a visual pretext task. |
| 2016 | Context Encoders: Feature Learning by Inpainting | Inpainting | Learns representations by filling missing image regions. |
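
As a rough illustration of the context-prediction pretext task in the table above, the sketch below samples a center patch and one of its 8 neighbors on a patch grid, using the neighbor's position index as the classification label. The names `NEIGHBOR_OFFSETS` and `sample_patch_pair` are hypothetical, and the original method's details (such as gaps and jitter between patches) are left out.

```python
import random

# (row, col) offsets of the 8 neighbors around a center patch.
# The index into this list serves as the classification label.
NEIGHBOR_OFFSETS = [
    (-1, -1), (-1, 0), (-1, 1),
    (0, -1),           (0, 1),
    (1, -1),  (1, 0),  (1, 1),
]


def sample_patch_pair(grid_rows, grid_cols, rng):
    """Return (center, neighbor, label) as grid coordinates.

    Assumes the grid is at least 3x3 so every center has 8 neighbors.
    """
    # Pick a center away from the border so all neighbors are in bounds.
    row = rng.randrange(1, grid_rows - 1)
    col = rng.randrange(1, grid_cols - 1)
    label = rng.randrange(len(NEIGHBOR_OFFSETS))
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    return (row, col), (row + d_row, col + d_col), label


rng = random.Random(0)
center, neighbor, label = sample_patch_pair(5, 5, rng)
print(center, neighbor, label)
```

A network trained on such pairs sees both patches and predicts `label`; no human annotation is needed because the label comes from the image layout.
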
Contrastive, Clustering, and Bootstrap Learning
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2018 | Unsupervised Feature Learning via Non-Parametric Instance Discrimination | Instance discrimination | Treats each image instance as its own class. |
| 2018 | Deep Clustering for Unsupervised Learning of Visual Features | DeepCluster | Alternates clustering and representation learning. |
| 2018 | Representation Learning with Contrastive Predictive Coding | CPC | Predicts future latent representations and trains with the InfoNCE contrastive loss. |
| 2019 | Momentum Contrast for Unsupervised Visual Representation Learning | MoCo | Memory queue and momentum encoder for contrastive learning. |
| 2020 | A Simple Framework for Contrastive Learning of Visual Representations | SimCLR | Strong augmentation and large-batch contrastive learning. |
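| 2020 | Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning | BYOL | An online network predicts a momentum-updated target network's representation of another augmented view; no negative pairs. |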
| 2020 | Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | SwAV | Online clustering with swapped assignments. |
| 2020 | Exploring Simple Siamese Representation Learning | SimSiam | Siamese self-supervision without negatives or a momentum encoder, relying on a stop-gradient operation. |
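
Two ideas recur across the papers in the table above: an InfoNCE-style contrastive loss (CPC, MoCo, SimCLR) and a momentum-updated target encoder (MoCo, BYOL). The sketch below shows both on plain Python lists. The function names, the temperature of 0.1, and the momentum of 0.99 are illustrative assumptions, not values taken from any one paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: cross-entropy with the positive as the target."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[0]


def momentum_update(target, online, momentum=0.99):
    """Exponential moving average of online weights into target weights."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


anchor = [1.0, 0.0]
print(info_nce(anchor, [1.0, 0.0], [[0.0, 1.0]]))  # aligned positive: small loss
print(info_nce(anchor, [0.0, 1.0], [[1.0, 0.0]]))  # misaligned positive: large loss
```

In practice the anchor and positive are embeddings of two augmented views of the same image, and the negatives come from other images in the batch (SimCLR) or from a queue (MoCo).
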
Masked Image Modeling and Foundation Features
| Year | Paper | Topic | Note |
|---|---|---|---|
| 2021 | Emerging Properties in Self-Supervised Vision Transformers | DINO | Self-distillation produces strong ViT features and object localization behavior. |
| 2021 | Masked Autoencoders Are Scalable Vision Learners | MAE | Reconstructs masked patches with asymmetric encoder-decoder training. |
| 2021 | iBOT: Image BERT Pre-Training with Online Tokenizer | iBOT | Combines masked image modeling with self-distillation. |
| 2023 | DINOv2: Learning Robust Visual Features without Supervision | DINOv2 | Large-scale self-supervised features for broad visual tasks. |
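
To make the masking step in masked image modeling concrete, the sketch below splits patch indices into a visible set and a masked set. In MAE-style training the encoder processes only the visible patches and a lightweight decoder reconstructs the masked ones. The helper `random_patch_mask` is hypothetical; the default ratio of 0.75 follows the high masking ratio reported for MAE.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```
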
Reading Path
| Step | Read |
|---|---|
| 1 | Exemplar CNN, context prediction, colorization, and context encoders for pretext-task intuition. |
| 2 | Instance discrimination, DeepCluster, CPC, MoCo, and SimCLR for instance-level contrastive learning and early deep clustering. |
| 3 | BYOL, SwAV, and SimSiam for alternatives to standard negative-pair contrast. |
| 4 | DINO, MAE, iBOT, and DINOv2 for the modern ViT-era self-supervised lineage. |