From LeNet to ResNet: Major Milestones in the Rise of Modern Deep Learning
The modern history of deep learning in computer vision is not best understood as a smooth, continuous progression. It is more accurately described as a sequence of bottleneck-breaking advances. Some architectures made end-to-end visual learning practical, others made it scalable, and still others made extreme depth trainable. Four landmarks capture this trajectory particularly well: LeNet-5, AlexNet, VGGNet and ResNet.

Section Goal
This section revisits four decisive milestones in the rise of modern deep learning, emphasizing the historical problem, technical innovation, scientific impact, and lasting legacy of each contribution.
| Year | Researchers | Architecture / Breakthrough | Reference Domain |
|---|---|---|---|
| 1998 | Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner | LeNet-5 | Handwritten digit and document recognition |
| 2012 | Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton | AlexNet | Large-scale image classification |
| 2014 | Karen Simonyan & Andrew Zisserman | VGG & the 3x3 revolution | Architectural standardization |
| 2015 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun | ResNet | Optimization of very-deep CNNs |
Note
This is a selective history of the CNN lineage of deep learning. Earlier milestones such as the Perceptron and Backpropagation, and later architectural shifts such as Attention and Transformers, are discussed elsewhere.

1998 - LeNet-5
| Aspect | Details |
|---|---|
| Historical Problem | In the 1990s, document processing systems needed to recognize handwritten digits reliably for applications such as bank check reading and postal code recognition. The main challenge was to design a model that could learn visual features directly from raw pixels rather than depending on manually engineered preprocessing pipelines. |
| Architecture | LeNet-5 is a compact convolutional network organized as C1-S2-C3-S4-C5-F6-Output. Its core ideas are local receptive fields, weight sharing, and subsampling, which together allow the network to exploit the spatial structure of images while keeping the number of parameters manageable. The model used 5x5 convolutions, trainable subsampling layers, and tanh-like nonlinearities. |
| Training | The network was trained end-to-end with backpropagation and gradient-based optimization. In modern terms, its importance lies not only in its architecture but also in the fact that it demonstrated practical end-to-end representation learning for visual recognition. |
| Results | On the MNIST handwritten digit benchmark, LeNet-5 achieved roughly 0.95% test error, while related variants in the same line of work pushed the error even lower under additional training tricks and distortions. More importantly, the architecture was tied to real document-recognition systems deployed in practice. |
| Scientific Impact | LeNet-5 showed that hierarchical visual features could be learned directly from data rather than manually specified. It provided one of the first convincing demonstrations that convolutional neural networks were not only theoretically appealing, but also practically useful. |
| Historical Importance | LeNet-5 was not the first convolution-inspired model in history, but it was one of the first decisive, trainable, and practically deployed CNN architectures. It established the blueprint for modern convolutional vision systems. |
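The C1-S2-C3-S4-C5-F6 pipeline above can be traced numerically. The sketch below follows the spatial sizes from the 1998 paper (32x32 input, 5x5 "valid" convolutions, 2x2 subsampling); the helper functions are illustrative, not from any library.

```python
# Sketch: spatial sizes through LeNet-5 (layer widths from the 1998 paper).
def conv_out(size, kernel=5, stride=1):
    """Output side length of a 'valid' convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Output side length of non-overlapping 2x2 subsampling."""
    return size // window

s = 32                      # 32x32 input digit
s = conv_out(s)             # C1: 6 maps of 28x28
s = pool_out(s)             # S2: 6 maps of 14x14
s = conv_out(s)             # C3: 16 maps of 10x10
s = pool_out(s)             # S4: 16 maps of 5x5
s = conv_out(s)             # C5: 120 maps of 1x1 -- effectively a feature vector
print(s)                    # 1
```

By C5 the spatial extent has collapsed to 1x1, which is why the remaining layers (F6, output) behave like an ordinary classifier on top of learned features.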
What LeNet-5 Established
- Weight sharing drastically reduces parameter count and improves statistical efficiency.
- Local connectivity makes it possible to capture spatial primitives such as strokes, edges, and local shapes.
- Hierarchical feature learning allows increasingly abstract visual concepts to emerge across layers.
- The early promise of CNNs was already visible in 1998, even if widespread adoption had to wait for larger datasets and stronger hardware.
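The first bullet can be made concrete with a back-of-the-envelope comparison: a C1-style 5x5 convolution (1 input channel, 6 feature maps) versus a dense layer producing the same number of outputs from a 32x32 image. The figures below are illustrative arithmetic, not numbers from the paper.

```python
# Sketch: why weight sharing matters. A shared 5x5 kernel per feature map
# versus a fully connected layer producing the same 6 x 28 x 28 outputs.
conv_params = 6 * (5 * 5 * 1 + 1)          # 6 kernels + one bias each
dense_params = (32 * 32) * (6 * 28 * 28)   # every input wired to every output

print(conv_params)                  # 156
print(dense_params)                 # 4816896
print(dense_params // conv_params)  # roughly 30,000x more parameters
```

The shared-kernel version is smaller by four orders of magnitude, which is what "statistical efficiency" means here: far fewer parameters to estimate from the same data.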
Historical Precision
LeNet-5 should not be described as the absolute “first CNN.” Earlier precursors, most notably Fukushima’s Neocognitron (1980), already introduced convolution-like ideas. The historical importance of LeNet-5 is that it provided a successful gradient-trained CNN architecture with real practical impact.
2012 - AlexNet
| Aspect | Details |
|---|---|
| Historical Problem | Before 2012, CNNs were known to work on relatively small datasets, but it was still unclear whether they could dominate large-scale visual recognition. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) provided the first truly industrial-scale benchmark: about 1.2 million training images across 1,000 classes. |
| Architecture | AlexNet contains five convolutional layers followed by three fully connected layers, for a total of eight learned layers and about 60 million parameters. It used several techniques that became historically important: ReLU activations, dropout, data augmentation, overlapping max pooling, and Local Response Normalization (LRN). |
| Compute Innovation | A major practical breakthrough was the use of two NVIDIA GTX 580 GPUs to train a network that was too large for a single GPU. This was one of the clearest demonstrations that modern deep learning progress would depend not only on architectural ideas, but also on hardware-aware implementation. |
| Results | AlexNet won ILSVRC-2012 with a 15.3% top-5 test error, compared with 26.2% for the second-best entry. This margin was so large that it changed how the entire field interpreted the feasibility of deep neural networks for vision. |
| Scientific Impact | AlexNet proved that CNNs could scale effectively when three ingredients were combined: large labeled datasets, high-throughput GPU computation, and regularization strong enough to control overfitting. |
| Historical Importance | AlexNet is widely regarded as the model that triggered the modern deep learning boom in computer vision. It shifted CNNs from a promising niche approach to the dominant paradigm for visual recognition. |
Why AlexNet Was a Watershed
AlexNet did not introduce convolution from scratch. Its importance was that it showed, decisively and publicly, that deep CNNs could beat traditional computer vision pipelines on a benchmark large enough to matter. It validated a new research formula:
more data + more compute + deeper end-to-end models
Technical Detail Worth Remembering
The AlexNet paper reported that, in one of its comparisons, a CNN with ReLU reached a target training error on CIFAR-10 about six times faster than an equivalent network with tanh units. This was one of the most influential early demonstrations of why non-saturating activations mattered in practice.
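The saturation effect behind that speedup is visible directly in the derivatives: tanh's gradient collapses toward zero for large inputs, while ReLU's stays at 1 on the active side. A minimal sketch:

```python
import math

def tanh_grad(x):
    """d/dx tanh(x) = 1 - tanh(x)^2: saturates toward 0 for large |x|."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    """d/dx max(0, x): constant 1 on the active side, so it never saturates."""
    return 1.0 if x > 0 else 0.0

for x in (0.5, 2.0, 5.0):
    print(x, round(tanh_grad(x), 4), relu_grad(x))
# tanh's gradient shrinks from ~0.79 down to ~0.0002, while ReLU's stays 1.0
```

With saturating units, neurons pushed into the flat regions of tanh receive almost no learning signal; non-saturating ReLUs avoid this failure mode on the positive side.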
2014 - The Depth Bottleneck: VGGNet, GoogLeNet, and the Vanishing Gradient
Before reaching the extreme depth of ResNet, the field had to standardize how layers were built and confront the mathematical limits of backpropagation. The year 2014 provided the necessary stepping stones.
| Aspect | Details |
|---|---|
| Historical Problem | Following AlexNet, the prevailing intuition was simply that “deeper is better.” However, researchers quickly discovered that naively stacking dozens of layers led to networks that either refused to converge or were impossibly expensive to compute. |
| The 3x3 Revolution (VGGNet) | VGGNet demonstrated that large convolutional filters (like AlexNet’s 11x11 and 5x5) were inefficient. By stacking multiple 3x3 convolutions, VGGNet achieved the same effective receptive field with fewer parameters and more non-linearities (ReLUs). This standardized CNN design into clean, uniform blocks. |
| Dimensionality Reduction (GoogLeNet) | At the same time, GoogLeNet (Inception) introduced the heavy use of 1x1 convolutions to compress the depth of feature maps. This “bottleneck” design drastically reduced computational cost, a concept that ResNet would later inherit for its deepest models. |
| The Optimization Wall | As researchers pushed beyond 20 layers using VGG-like designs, a severe mathematical roadblock appeared: the vanishing gradient problem. During backpropagation, gradients multiplied through many layers tend to shrink exponentially toward zero, leaving the earliest layers untrained. |
| The Catalyst for ResNet | While the introduction of Batch Normalization (2015) helped stabilize the variance of activations (addressing Internal Covariate Shift), it did not solve the “degradation problem,” where deeper networks inexplicably exhibited higher training error than shallower ones. The failure of plain deep networks made a fundamental architectural bypass inevitable. |
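The parameter arithmetic behind the "3x3 revolution" is easy to verify: n stacked 3x3 layers have a (2n + 1)-sided effective receptive field, so two of them match a 5x5 and three match a 7x7, at lower cost. The sketch below assumes C input and output channels and ignores biases; the channel count 64 is illustrative.

```python
# Sketch: stacked 3x3 convolutions vs one large kernel, C in/out channels
# (biases ignored). Receptive field of n stacked 3x3 layers is 2n + 1.
def conv_weights(k, channels):
    return channels * channels * k * k

def stacked_3x3(n, channels):
    return n * conv_weights(3, channels)

C = 64
print(stacked_3x3(2, C), conv_weights(5, C))   # 73728 vs 102400
print(stacked_3x3(3, C), conv_weights(7, C))   # 110592 vs 200704
# Same effective receptive field, fewer weights, plus an extra ReLU per layer.
```

The stacked version is not only cheaper: each extra 3x3 layer inserts another non-linearity, which is part of why VGG-style blocks became the standard building unit.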
2015 - ResNet
| Aspect | Details |
|---|---|
| Historical Problem | By the mid-2010s, deeper CNNs were generally expected to perform better, but in practice very deep plain networks became difficult to optimize. Even when overfitting was not the issue, adding more layers could make training accuracy worse. This phenomenon was called degradation. |
| Core Idea | ResNet introduced residual learning through skip connections, formalized as y = F(x) + x. Instead of forcing stacked layers to learn a full mapping H(x) from scratch, the network learns a residual correction F(x) = H(x) - x relative to the input x. |
| Why It Works | Identity shortcuts make it easier for information and gradients to propagate through deep networks. If a block is not needed, it becomes easier for it to approximate an identity mapping rather than harm optimization. In this sense, residual learning does not remove all training difficulties, but it changes the optimization geometry in a highly favorable way. |
| Results | ResNet models achieved state-of-the-art results on ImageNet, and an ensemble of residual networks reached 3.57% top-5 error on the ILSVRC 2015 classification task. The paper also reported experiments with extremely deep networks on CIFAR-10, including a 1202-layer model. |
| Scientific Impact | ResNet established that depth itself could remain a productive source of performance gain if optimization was properly reparameterized. This was one of the most important structural insights in modern deep learning. |
| Historical Importance | Residual connections became a general architectural pattern, extending far beyond CNNs. They now appear in U-Nets, Transformers, diffusion backbones, AlphaFold-style systems, and many other deep architectures. |
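The formulation y = F(x) + x can be sketched on a single scalar feature, with a toy one-parameter "layer" standing in for real convolutions. The weights below are illustrative, not from the paper.

```python
# Minimal sketch of residual learning on a scalar feature. F is a tiny
# two-layer transform; the weights are hypothetical, for illustration only.
def relu(x):
    return max(0.0, x)

def residual_block(x, w1, w2):
    """y = F(x) + x, where F(x) = w2 * relu(w1 * x) is the residual branch."""
    f = w2 * relu(w1 * x)   # residual branch F(x)
    return f + x            # identity shortcut

# With zero weights, F(x) = 0 and the block defaults to the identity mapping:
print(residual_block(3.0, 0.0, 0.0))   # 3.0
# Nonzero weights learn only a small *correction* on top of the input:
print(residual_block(3.0, 0.5, 0.1))   # 3.0 + 0.1 * relu(1.5) = 3.15
```

This is the key asymmetry: a block that should do nothing only has to drive F toward zero, which is far easier than making a plain stack of layers approximate the identity.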
Why Skip Connections Changed Deep Learning
- They improve gradient flow by creating short identity pathways through very deep networks.
- They make it easier to learn incremental refinements rather than entirely new transformations at every depth.
- They turn “depth” from a liability into a usable design dimension for modern architectures.
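The first bullet can be quantified with a toy calculation. In a plain network the backpropagated gradient is a product of per-layer derivatives; with shortcuts each block's derivative becomes (1 + f'), so the expanded product always contains a pure-identity path of magnitude 1. The per-layer derivative 0.5 and depth 50 below are illustrative values.

```python
# Sketch: gradient magnitude after backpropagating through many layers.
depth = 50
f_prime = 0.5   # per-layer derivative of the learned transform (illustrative)

# Plain network: derivatives multiply through every layer and vanish.
plain_grad = f_prime ** depth
print(plain_grad)            # ~8.9e-16, effectively zero

# Residual network: each block contributes (1 + f'). Expanding the product
# gives a sum of paths, one of which skips every block and contributes 1,
# so the gradient cannot vanish with depth (for f' >= 0 here).
residual_grad = (1 + f_prime) ** depth
print(residual_grad >= 1.0)  # True: the identity path survives
```

The same identity path is what lets training signal reach the earliest layers of a 100+ layer network, which plain VGG-style stacks could not achieve.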
Historical Precision
The famous 1202-layer result was reported on CIFAR-10, not on ImageNet. On ImageNet, the headline residual architecture was ResNet-152. This distinction is often blurred in simplified retellings and is worth keeping accurate.
Timeline
```mermaid
timeline
    title Key Milestones in the CNN Lineage of Deep Learning
    1998 : LeNet-5 -> practical end-to-end CNNs for document recognition
    2012 : AlexNet -> deep CNNs + GPUs + ImageNet scale
    2014 : VGG & GoogLeNet -> 3x3 standardization, 1x1 bottlenecks, and the vanishing gradient wall
    2015 : ResNet -> residual learning and trainable very-deep networks
```
Connection to the Present
| Inherited Concept | Modern Examples |
|---|---|
| Convolution + weight sharing | EfficientNet, ConvNeXt |
| Non-saturating activations | ReLU, GELU, SiLU |
| Residual pathways | Transformers, Diffusion U-Nets, AlphaFold |
| Regularization and data augmentation | Dropout, MixUp, CutMix, RandAugment |
| Hardware-aware scaling | GPU clusters, TPUs, distributed foundation-model training |
Continuity Across Generations
Modern architectures may look very different from early CNNs, but many of their most important design principles are inherited rather than invented from scratch. Among the most durable are parameter sharing, hierarchical feature extraction, stable optimization through architectural design, and alignment with high-throughput hardware.
Final Take-away
LeNet-5, AlexNet, and ResNet represent three successive solutions to three different bottlenecks in deep learning, with the 2014 generation (VGGNet and GoogLeNet) standardizing the building blocks in between:
- LeNet-5 made learned visual feature extraction practical.
- AlexNet made deep convolutional learning scalable.
- ResNet made extreme depth trainable.
Together, they explain a large part of how deep learning moved from promising laboratory systems to the dominant paradigm in modern computer vision. The architectures themselves differ, but the broader pattern remains constant: each breakthrough resolved a concrete limitation of the previous generation and introduced ideas that remained structurally important long after the original model.