From LeNet to ResNet: Major Milestones in the Rise of Modern Deep Learning

The modern history of deep learning in computer vision is not best understood as a smooth, continuous progression. It is more accurately described as a sequence of bottleneck-breaking advances. Some architectures made end-to-end visual learning practical, others made it scalable, and still others made extreme depth trainable. Four landmarks capture this trajectory particularly well: LeNet-5, AlexNet, VGGNet, and ResNet.

Section Goal

This section revisits four decisive milestones in the rise of modern deep learning, emphasizing the historical problem, technical innovation, scientific impact, and lasting legacy of each contribution.

| Year | Researchers | Architecture / Breakthrough | Reference Domain |
|---|---|---|---|
| 1998 | Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner | LeNet-5 | Handwritten digit and document recognition |
| 2012 | Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton | AlexNet | Large-scale image classification |
| 2014 | Simonyan & Zisserman | VGG & the 3x3 revolution | Architectural standardization |
| 2015 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun | ResNet | Optimization of very-deep CNNs |

Note

This is a selective history of the CNN lineage of deep learning. Earlier milestones such as the Perceptron and Backpropagation, and later architectural shifts such as Attention and Transformers, are discussed elsewhere.


1998 - LeNet-5

| Aspect | Details |
|---|---|
| Historical Problem | In the 1990s, document processing systems needed to recognize handwritten digits reliably for applications such as bank check reading and postal code recognition. The main challenge was to design a model that could learn visual features directly from raw pixels rather than depending on manually engineered preprocessing pipelines. |
| Architecture | LeNet-5 is a compact convolutional network organized as C1-S2-C3-S4-C5-F6-Output. Its core ideas are local receptive fields, weight sharing, and subsampling, which together allow the network to exploit the spatial structure of images while keeping the number of parameters manageable. The model used 5x5 convolutions, trainable subsampling layers, and tanh-like nonlinearities. |
| Training | The network was trained end-to-end with backpropagation and gradient-based optimization. In modern terms, its importance lies not only in its architecture but also in the fact that it demonstrated practical end-to-end representation learning for visual recognition. |
| Results | On the MNIST handwritten digit benchmark, LeNet-5 achieved roughly 0.95% test error, while related variants in the same line of work pushed the error even lower under additional training tricks and distortions. More importantly, the architecture was tied to real document-recognition systems deployed in practice. |
| Scientific Impact | LeNet-5 showed that hierarchical visual features could be learned directly from data rather than manually specified. It provided one of the first convincing demonstrations that convolutional neural networks were not only theoretically appealing, but also practically useful. |
| Historical Importance | LeNet-5 was not the first convolution-inspired model in history, but it was one of the first decisive, trainable, and practically deployed CNN architectures. It established the blueprint for modern convolutional vision systems. |

What LeNet-5 Established

  1. Weight sharing drastically reduces parameter count and improves statistical efficiency.
  2. Local connectivity makes it possible to capture spatial primitives such as strokes, edges, and local shapes.
  3. Hierarchical feature learning allows increasingly abstract visual concepts to emerge across layers.
  4. The early promise of CNNs was already visible in 1998, even if widespread adoption had to wait for larger datasets and stronger hardware.
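The parameter savings from weight sharing (point 1 above) can be checked with simple arithmetic. The sketch below, in plain Python with no framework assumed, counts trainable parameters for a shared-weight convolutional layer versus a fully connected layer producing the same output shapes; the 156-parameter figure for LeNet-5's C1 layer (6 filters of 5x5 plus biases) matches the original paper.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Trainable parameters of a conv layer with weight sharing:
    each output channel owns one (in_channels x k x k) kernel plus a bias."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)

def dense_params(in_features: int, out_features: int) -> int:
    """Trainable parameters of a fully connected layer (weights + biases)."""
    return out_features * (in_features + 1)

# LeNet-5's C1 layer: 6 filters of 5x5 over a 1-channel 32x32 input.
c1 = conv_params(1, 6, 5)  # 6 * (25 + 1) = 156 parameters

# A dense layer mapping the same 32x32 input to six 28x28 feature maps
# (the output shapes C1 produces) would need millions of parameters.
dense_equiv = dense_params(32 * 32, 6 * 28 * 28)

print(c1)           # 156
print(dense_equiv)  # 4821600
```

Note that LeNet-5's C3 layer used a hand-designed partial connection table between S2 and C3 maps, so its true parameter count (1,516) is slightly below what the fully connected formula above would give.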

Historical Precision

LeNet-5 should not be described as the absolute “first CNN.” Earlier precursors, most notably Fukushima’s Neocognitron (1980), already introduced convolution-like ideas. The historical importance of LeNet-5 is that it provided a successful gradient-trained CNN architecture with real practical impact.


2012 - AlexNet

| Aspect | Details |
|---|---|
| Historical Problem | Before 2012, CNNs were known to work on relatively small datasets, but it was still unclear whether they could dominate large-scale visual recognition. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) provided the first truly industrial-scale benchmark: about 1.2 million training images across 1,000 classes. |
| Architecture | AlexNet contains five convolutional layers followed by three fully connected layers, for a total of eight learned layers and about 60 million parameters. It used several techniques that became historically important: ReLU activations, dropout, data augmentation, overlapping max pooling, and Local Response Normalization (LRN). |
| Compute Innovation | A major practical breakthrough was the use of two NVIDIA GTX 580 GPUs to train a network that was too large for a single GPU. This was one of the clearest demonstrations that modern deep learning progress would depend not only on architectural ideas, but also on hardware-aware implementation. |
| Results | AlexNet won ILSVRC-2012 with a 15.3% top-5 test error, compared with 26.2% for the second-best entry. This margin was so large that it changed how the entire field interpreted the feasibility of deep neural networks for vision. |
| Scientific Impact | AlexNet proved that CNNs could scale effectively when three ingredients were combined: large labeled datasets, high-throughput GPU computation, and regularization strong enough to control overfitting. |
| Historical Importance | AlexNet is widely regarded as the model that triggered the modern deep learning boom in computer vision. It shifted CNNs from a promising niche approach to the dominant paradigm for visual recognition. |
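Of the regularizers listed above, dropout is the easiest to sketch in a few lines. The plain-Python example below shows the now-standard "inverted" formulation (a common modern variant, not AlexNet's original one, which instead scaled activations at test time): units are zeroed with probability p during training and the survivors are scaled by 1/(1-p), so the expected activation is unchanged.

```python
import random

def inverted_dropout(activations, p, rng):
    """Zero each unit with probability p; scale survivors by 1/(1-p)
    so the expected value of each unit is preserved."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 10_000
y = inverted_dropout(x, p=0.5, rng=rng)

# Roughly half the units are dropped; survivors become 1.0 / 0.5 = 2.0,
# so the mean activation stays close to the original 1.0.
mean = sum(y) / len(y)
print(round(mean, 2))
```

Because each forward pass samples a different mask, dropout trains an implicit ensemble of thinned networks, which is how the AlexNet paper motivated it.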

Why AlexNet Was a Watershed

AlexNet did not introduce convolution from scratch. Its importance was that it showed, decisively and publicly, that deep CNNs could beat traditional computer vision pipelines on a benchmark large enough to matter. It validated a new research formula:

more data + more compute + deeper end-to-end models

Technical Detail Worth Remembering

The AlexNet paper reported that, in one of its comparisons, a CNN with ReLU reached a target training error on CIFAR-10 about six times faster than an equivalent network with tanh units. This was one of the most influential early demonstrations of why non-saturating activations mattered in practice.
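The speed-up is usually attributed to saturation: the derivative of tanh collapses toward zero for large inputs, while ReLU's derivative stays at exactly 1 for any positive input, so gradient signal survives multiplication through many layers. A quick plain-Python check of the two derivatives:

```python
import math

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, which saturates for large |x|."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (non-saturating)."""
    return 1.0 if x > 0 else 0.0

for x in (0.5, 3.0, 6.0):
    print(f"x={x}: tanh'={tanh_grad(x):.5f}, relu'={relu_grad(x):.0f}")
# tanh' at x=3 is already below 0.01, while relu' stays at 1.
# Backpropagation multiplies these factors layer by layer, so
# saturating units shrink the gradient exponentially with depth.
```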


2014 - The Depth Bottleneck: VGGNet, GoogLeNet, and the Vanishing Gradient

Before reaching the extreme depth of ResNet, the field had to standardize how layers were built and confront the mathematical limits of backpropagation. The year 2014 provided the necessary stepping stones.

| Aspect | Details |
|---|---|
| Historical Problem | Following AlexNet, the prevailing intuition was simply that "deeper is better." However, researchers quickly discovered that naively stacking dozens of layers led to networks that either refused to converge or were impossibly expensive to compute. |
| The 3x3 Revolution (VGGNet) | VGGNet demonstrated that large convolutional filters (like AlexNet's 11x11 and 5x5) were inefficient. By stacking multiple 3x3 convolutions, VGGNet achieved the same effective receptive field with fewer parameters and more non-linearities (ReLUs). This standardized CNN design into clean, uniform blocks. |
| Dimensionality Reduction (GoogLeNet) | At the same time, GoogLeNet (Inception) introduced the heavy use of 1x1 convolutions to compress the depth of feature maps. This "bottleneck" design drastically reduced computational cost, a concept that ResNet would later inherit for its deepest models. |
| The Optimization Wall | As researchers pushed beyond 20 layers using VGG-like designs, a severe mathematical roadblock appeared: the vanishing gradient problem. During backpropagation, gradients multiplied through many layers tend to shrink exponentially toward zero, leaving the earliest layers untrained. |
| The Catalyst for ResNet | While the introduction of Batch Normalization (2015) helped stabilize the variance of activations (addressing Internal Covariate Shift), it did not solve the "degradation problem," where deeper networks inexplicably exhibited higher training error than shallower ones. The failure of plain deep networks made a fundamental architectural bypass inevitable. |
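The arithmetic behind the 3x3 revolution is easy to verify: stacking n k x k convolutions at stride 1 gives an effective receptive field of n(k-1)+1, so two 3x3 layers see 5x5 and three see 7x7, with fewer weights and an extra ReLU per layer. A plain-Python check, ignoring biases as the VGG paper does in its comparison:

```python
def receptive_field(num_layers: int, kernel: int) -> int:
    """Effective receptive field of `num_layers` stacked stride-1 convolutions."""
    return num_layers * (kernel - 1) + 1

def stack_weights(num_layers: int, kernel: int, channels: int) -> int:
    """Weight count (biases ignored) for a stack of conv layers that each
    map `channels` input channels to `channels` output channels."""
    return num_layers * channels * channels * kernel * kernel

C = 64
# Two 3x3 layers match one 5x5 layer's receptive field...
assert receptive_field(2, 3) == receptive_field(1, 5) == 5
# ...with 18*C^2 weights instead of 25*C^2:
print(stack_weights(2, 3, C), stack_weights(1, 5, C))  # 73728 102400
# Three 3x3 layers match one 7x7 layer: 27*C^2 vs 49*C^2 weights.
assert receptive_field(3, 3) == receptive_field(1, 7) == 7
```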

2015 - ResNet

| Aspect | Details |
|---|---|
| Historical Problem | By the mid-2010s, deeper CNNs were generally expected to perform better, but in practice very deep plain networks became difficult to optimize. Even when overfitting was not the issue, adding more layers could make training accuracy worse. This phenomenon was called degradation. |
| Core Idea | ResNet introduced residual learning through skip connections, formalized as y = F(x) + x. Instead of forcing stacked layers to learn the full underlying mapping H(x) from scratch, the network learns a residual correction F(x) = H(x) - x relative to the input x. |
| Why It Works | Identity shortcuts make it easier for information and gradients to propagate through deep networks. If a block is not needed, it becomes easier for it to approximate an identity mapping rather than harm optimization. In this sense, residual learning does not remove all training difficulties, but it changes the optimization geometry in a highly favorable way. |
| Results | ResNet models achieved state-of-the-art results on ImageNet, and an ensemble of residual networks reached 3.57% top-5 error on the ILSVRC 2015 classification task. The paper also reported experiments with extremely deep networks on CIFAR-10, including a 1202-layer model. |
| Scientific Impact | ResNet established that depth itself could remain a productive source of performance gain if optimization was properly reparameterized. This was one of the most important structural insights in modern deep learning. |
| Historical Importance | Residual connections became a general architectural pattern, extending far beyond CNNs. They now appear in U-Nets, Transformers, diffusion backbones, AlphaFold-style systems, and many other deep architectures. |

Why Skip Connections Changed Deep Learning

  1. They improve gradient flow by creating short identity pathways through very deep networks.
  2. They make it easier to learn incremental refinements rather than entirely new transformations at every depth.
  3. They turn “depth” from a liability into a usable design dimension for modern architectures.
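The identity-shortcut argument can be made concrete with a toy scalar residual block, y = x + F(x). The sketch below (an illustrative plain-Python reduction, not the actual two-conv ResNet block) shows the two properties above: with zero residual weights the block is exactly the identity, and its derivative is 1 + F'(x), so the backward signal through the shortcut never vanishes.

```python
import math

def residual_block(x: float, w: float) -> float:
    """Toy scalar residual block: y = x + F(x), with F(x) = w * tanh(x)."""
    return x + w * math.tanh(x)

def residual_grad(x: float, w: float) -> float:
    """dy/dx = 1 + w * (1 - tanh(x)^2): the identity path contributes a
    constant 1, so the gradient cannot collapse to zero even if F saturates."""
    return 1.0 + w * (1.0 - math.tanh(x) ** 2)

# With zero residual weights the block is exactly the identity mapping...
assert residual_block(2.5, w=0.0) == 2.5
# ...and even where tanh saturates (x=3), the gradient stays near 1,
# whereas a plain stack of saturating layers would multiply tiny factors.
print(residual_grad(3.0, w=0.5))
```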

Historical Precision

The famous 1202-layer result was reported on CIFAR-10, not on ImageNet. On ImageNet, the headline residual architecture was ResNet-152. This distinction is often blurred in simplified retellings and is worth keeping accurate.


Timeline

```mermaid
timeline
    title Key Milestones in the CNN Lineage of Deep Learning
    1998 : LeNet-5 -> practical end-to-end CNNs for document recognition
    2012 : AlexNet -> deep CNNs + GPUs + ImageNet scale
    2014 : VGG & GoogLeNet -> 3x3 standardization, 1x1 bottlenecks, and the vanishing gradient wall
    2015 : ResNet -> residual learning and trainable very-deep networks
```

Connection to the Present

| Inherited Concept | Modern Examples |
|---|---|
| Convolution + weight sharing | EfficientNet, ConvNeXt |
| Non-saturating activations | ReLU, GELU, SiLU |
| Residual pathways | Transformers, Diffusion U-Nets, AlphaFold |
| Regularization and data augmentation | Dropout, MixUp, CutMix, RandAugment |
| Hardware-aware scaling | GPU clusters, TPUs, distributed foundation-model training |

Continuity Across Generations

Modern architectures may look very different from early CNNs, but many of their most important design principles are inherited rather than invented from scratch. Among the most durable are parameter sharing, hierarchical feature extraction, stable optimization through architectural design, and alignment with high-throughput hardware.

Final Take-away

LeNet-5, AlexNet, VGGNet, and ResNet represent successive solutions to distinct bottlenecks in deep learning:

  • LeNet-5 made learned visual feature extraction practical.
  • AlexNet made deep convolutional learning scalable.
  • VGGNet made deep architectural design standardized and efficient.
  • ResNet made extreme depth trainable.

Together, they explain a large part of how deep learning moved from promising laboratory systems to the dominant paradigm in modern computer vision. The architectures themselves differ, but the broader pattern remains constant: each breakthrough resolved a concrete limitation of the previous generation and introduced ideas that remained structurally important long after the original model.

Selected Primary Sources