Knowledge Distillation

Abstract

A large network can be accurate and still be unusable: too slow or too memory-hungry for a phone or a real-time system. Knowledge distillation, proposed by Hinton, Vinyals, and Dean (2015), attacks this from the training side rather than the architecture side. A large, well-performing teacher is trained first; a small student is then trained not only to fit the ground-truth labels but also to reproduce the teacher’s full output distribution, inheriting most of its behaviour at a fraction of its cost. This note develops the original recipe and its central distinction, between the hard label and the soft distribution a trained model produces, and then follows the idea to its inversion in Noisy Student, where a larger student trained on noisy, pseudo-labelled data uses the same scaffold not to compress a model but to push its accuracy past the teacher’s.

A slow teacher, a fast student

Modern networks owe much of their accuracy to their size, and large networks are slow and expensive to run, a cost felt most sharply exactly where deployment is most constrained. Distillation rests on an observation about that tension: the capacity a model needs to discover structure in data is not the capacity it needs to represent the solution once found. A cumbersome teacher may be necessary to extract knowledge from millions of examples, but the extracted knowledge often fits comfortably inside a far smaller model.

The recipe has two stages and at least two networks:

a teacher, large and accurate, trained first, ideally on a very large dataset;
a student, much smaller, trained afterwards, on a possibly smaller dataset, with a combined loss that mixes the ordinary supervised signal with a second term that pulls it toward the teacher.

Training the student against the teacher is the distillation: the teacher’s knowledge is concentrated into a model small enough to be fast, which then runs at a fraction of the teacher’s latency and memory while keeping most of its accuracy. It is the training-time counterpart of the architectural efficiency tricks collected in the rest of this section, reaching a small fast model by a different route.

Hard targets and soft targets

The whole idea rests on what a network outputs.

A hard target is the ground-truth label written as a one-hot distribution: probability $1$ on the correct class and $0$ on every other. It is “hard” in the sense of a categorical, all-or-nothing verdict. It asserts that an image is a “2” and says nothing more.

A soft target is the full probability vector a trained model assigns across all classes, with small but non-zero mass on classes other than the predicted one. It is “soft” because the probability is spread rather than concentrated, and that spread is not noise: the relative sizes of the small probabilities encode what the model has learned about how the classes resemble each other. A good classifier shown a handwritten “2” places most mass on “2”, a little on “3” and “7”, and almost none on “dog”.

Dark knowledge: the information a hard label throws away

The similarity structure hiding in the wrong-class probabilities is what Hinton called the dark knowledge of a trained model. A one-hot label states only the correct answer; the teacher’s soft output additionally reveals that a “2” is more confusable with a “3” than with a “dog”. This is a far richer training signal, because every example now constrains the student on all classes at once instead of one. Transferring it is the entire point of distillation, and it is why a student trained on soft targets often generalises better than the same student trained on hard labels alone. In Hinton’s experiments the effect was strong enough that a student could recognise a class for which it had seen no labelled examples at all, learning it purely from that class’s faint imprint on the soft targets of the others.

For this knowledge to be visible, the teacher’s distribution usually has to be softened on purpose. A confident teacher produces a softmax that is already close to one-hot, burying the small probabilities. The softmax is therefore evaluated with a temperature $T$,

$$q_{i}(T) = \frac{\exp(z_{i} / T)}{\sum_{j} \exp(z_{j} / T)},$$

where $z_{i}$ are the logits. At $T = 1$ this is the ordinary softmax. Raising $T$ above $1$ flattens the distribution and lifts the small inter-class probabilities into a range the student can actually learn from. The student is trained to match the teacher at this same raised temperature, then deployed back at $T = 1$.
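The effect of the temperature is easy to see numerically. The following is a minimal sketch in plain Python, not taken from the original paper: the function name `softmax_with_temperature` and the logit values are illustrative assumptions, chosen to mimic a confident teacher looking at a handwritten “2”.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes "2", "3", "7", "dog".
teacher_logits = [9.0, 4.0, 3.0, -4.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

for name, dist in (("T=1", sharp), ("T=4", soft)):
    print(name, [round(p, 4) for p in dist])
```

At $T = 1$ nearly all the mass sits on the first class; at $T = 4$ the ranking of the classes is unchanged, but the wrong-class probabilities become large enough to carry a usable signal.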

The combined loss

The student is fit to two targets at once: a hard term, the ordinary cross-entropy against the ground-truth one-hot label at $T = 1$, and a soft term, the distillation loss, which pulls the student’s softened output toward the teacher’s softened output. Writing $y$ for the one-hot label and $p_{T}, p_{S}$ for the teacher and student distributions,

$$L = (1 - \alpha)\, \underbrace{H\big(y, p_{S}^{(1)}\big)}_{\text{hard, cross-entropy}} + \alpha T^{2}\, \underbrace{D_{KL}\big(p_{T}^{(T)} \,\|\, p_{S}^{(T)}\big)}_{\text{soft, distillation}}$$

with $\alpha \in [0, 1]$ balancing the two. The superscripts mark the temperature: the hard term uses $T = 1$, while the distillation term softens both teacher and student at the same $T > 1$. The factor $T^{2}$ corrects a scaling artefact, since the gradients of the softened cross-entropy shrink like $1/T^{2}$; multiplying by $T^{2}$ keeps the two terms comparable in magnitude as $T$ changes.
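The combined loss can be written out for a single example in a few lines. This is a sketch under stated assumptions rather than a reference implementation: the helper names, the logits, and the defaults `alpha=0.5` and `temperature=4.0` are illustrative and do not come from the source. A real training loop would compute the same quantity over batches in a deep-learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # stabilise the exponentials
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, prediction):
    # H(target, prediction); terms with zero target mass contribute nothing.
    return -sum(t * math.log(p) for t, p in zip(target, prediction) if t > 0)

def kl_divergence(p, q):
    # D_KL(p || q); terms with zero mass in p contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=4.0):
    """(1 - alpha) * hard cross-entropy at T=1 + alpha * T^2 * KL at T."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits, 1.0))
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft

# Hypothetical logits for one training example whose true class is index 0.
teacher_logits = [9.0, 4.0, 3.0, -4.0]
student_logits = [5.0, 1.0, 2.0, -1.0]
loss = distillation_loss(student_logits, teacher_logits, label=0)
print(round(loss, 4))
```

Setting `alpha=0.0` recovers ordinary supervised training, and a student whose logits equal the teacher’s makes the soft term vanish, leaving only the hard term.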

Which distance, and why it carries two names

The two terms appear to use different yardsticks, cross-entropy for the hard target and KL divergence for the soft one, but they are the same quantity seen from two sides. The cross-entropy between a target $p$ and a prediction $q$ splits exactly as

$$H(p, q) = H(p) + D_{KL}(p \,\|\, q),$$

the target’s own entropy plus the KL divergence from prediction to target. Because $H (p)$ does not depend on the student, minimising cross-entropy and minimising KL are the same optimisation, differing only by a constant. The two loss terms are then this one identity specialised:

For the hard term the target is one-hot, whose entropy is $0$, so cross-entropy and KL coincide exactly. It is conventionally written as cross-entropy.
For the soft term the target is the teacher’s spread-out distribution, whose entropy is non-zero. Cross-entropy and KL now differ by that constant, and the convention is to write the KL divergence between the two soft distributions.
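Both cases of the identity can be checked numerically. The sketch below uses made-up four-class distributions (the values of `teacher`, `student`, and `one_hot` are illustrative assumptions) and measures the gap between cross-entropy and KL for a soft target and for a hard one.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over four classes.
teacher = [0.70, 0.20, 0.09, 0.01]   # soft target
one_hot = [1.0, 0.0, 0.0, 0.0]       # hard target
student = [0.50, 0.30, 0.15, 0.05]   # prediction

# Gap between cross-entropy and KL: equals the target's entropy.
gap_soft = cross_entropy(teacher, student) - kl_divergence(teacher, student)
gap_hard = cross_entropy(one_hot, student) - kl_divergence(one_hot, student)

print("soft gap:", gap_soft, "teacher entropy:", entropy(teacher))
print("hard gap:", gap_hard)
```

The soft-target gap matches the teacher’s entropy, a constant that does not involve the student, while the hard-target gap is zero up to floating-point error.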

KL is a divergence, not a distance

The distillation term is the Kullback-Leibler divergence $D_{KL}(p_{T} \,\|\, p_{S}) = \sum_{i} p_{T}(i) \log \frac{p_{T}(i)}{p_{S}(i)} \geq 0$, the expected log-ratio of the teacher’s probabilities to the student’s, which is non-negative and vanishes only when the two distributions agree, exactly the property that makes it worth minimising. The name is precise, though: KL is a divergence, not a metric. It is asymmetric, $D_{KL}(p_{T} \,\|\, p_{S}) \neq D_{KL}(p_{S} \,\|\, p_{T})$ in general, and it violates the triangle inequality, so it defines no geometry of distances. What it measures is directional, the information lost when the student’s distribution stands in for the teacher’s, and distillation drives that loss toward zero. (Note that $-\sum_{i} p_{T}(i) \log \frac{p_{T}(i)}{p_{S}(i)}$, with the leading minus sign, would be the negative of the divergence, hence non-positive; the divergence itself carries no such sign.)
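Both failures of the metric axioms show up on small examples. In the sketch below all distributions are illustrative assumptions: the first pair shows the asymmetry, and the three two-class distributions `a`, `b`, `c` give one instance where the direct divergence exceeds the sum of the two legs.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Asymmetry: swapping the arguments changes the value.
teacher = [0.70, 0.20, 0.09, 0.01]
student = [0.50, 0.30, 0.15, 0.05]
forward = kl_divergence(teacher, student)
reverse = kl_divergence(student, teacher)
print("forward:", round(forward, 4), "reverse:", round(reverse, 4))

# Triangle inequality: going a -> c directly "costs" more than a -> b -> c.
a = [0.50, 0.50]
b = [0.25, 0.75]
c = [0.10, 0.90]
direct = kl_divergence(a, c)
via_b = kl_divergence(a, b) + kl_divergence(b, c)
print("direct:", round(direct, 4), "via b:", round(via_b, 4))
```

A true distance would give `forward == reverse` and `direct <= via_b`; neither holds here.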

The figure shows a single round of the classic recipe: the teacher is trained on its own, and the student is then fit to two signals at once, the ground truth as the hard cross-entropy target and the teacher’s transmitted knowledge as the soft distillation target. At deployment, only the small student remains.

From compression to generalisation: Noisy Student

The teacher-student scaffold outlived its original purpose. A few years later the same idea was turned around in Noisy Student (Xie et al., 2020), where the student is no longer smaller than the teacher but deliberately equal or larger, and the goal is no longer to compress a model but to make it generalise better. The method’s slogan is exactly this inversion: distillation in the service of generalisation rather than of size.

It is a form of self-training, and it shifts the bottleneck away from labelled data, which is scarce and expensive, toward unlabelled data, which is abundant. One round runs as follows:

train a teacher on the available labelled data, for instance ImageNet;
use the teacher to infer pseudo-labels on a large pool of unlabelled images, hundreds of millions of them gathered from the web;
train a larger student on the union of labelled and pseudo-labelled data, with strong noise injected (data augmentation, dropout, stochastic depth);
promote the student to teacher, and repeat.
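The control flow of these four steps can be sketched on a toy problem. Everything here is a stand-in chosen only to keep the example runnable: the “model” is a one-dimensional nearest-centroid classifier rather than a network, Gaussian input jitter replaces augmentation, dropout, and stochastic depth, and the student is not larger than the teacher as it would be in the actual method. The names `fit_centroids`, `predict`, and `noisy_student` are hypothetical.

```python
import random

def fit_centroids(points, labels):
    """Toy 'model': one mean per class, standing in for a real network."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    # Assign x to the class with the nearest centroid.
    return min(model, key=lambda y: abs(x - model[y]))

def noisy_student(labelled_x, labelled_y, unlabelled_x,
                  rounds=3, noise_std=0.3, seed=0):
    rng = random.Random(seed)
    # Step 1: train the teacher on labelled data only.
    teacher = fit_centroids(labelled_x, labelled_y)
    for _ in range(rounds):
        # Step 2: the teacher pseudo-labels clean unlabelled inputs.
        pseudo_y = [predict(teacher, x) for x in unlabelled_x]
        # Step 3: the student trains on labelled + pseudo-labelled data,
        # seeing noised inputs (jitter stands in for the real noise sources).
        all_x = labelled_x + unlabelled_x
        all_y = labelled_y + pseudo_y
        noisy_x = [x + rng.gauss(0.0, noise_std) for x in all_x]
        student = fit_centroids(noisy_x, all_y)
        # Step 4: promote the student to teacher and repeat.
        teacher = student
    return teacher

# Tiny synthetic dataset: two classes centred near -1 and +1.
data_rng = random.Random(1)
labelled_x = [-1.2, -0.8, 0.9, 1.1]
labelled_y = [0, 0, 1, 1]
unlabelled_x = [data_rng.gauss(centre, 0.5)
                for centre in (-1.0, 1.0) for _ in range(100)]

final_model = noisy_student(labelled_x, labelled_y, unlabelled_x)
print(final_model)
```

The asymmetry the method relies on is visible in the loop: pseudo-labels are computed from clean inputs, while the student is fit to noised ones.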

Two reasons make the student larger rather than smaller:

The student now sees far more data, the labelled set plus hundreds of millions of pseudo-labelled images, and more data supports more capacity; growing the student across iterations is what lets the frontier keep advancing.
The student is trained under heavy regularisation: input noise through aggressive data augmentation, and model noise through dropout, which randomly zeroes activations, and stochastic depth, which randomly drops whole layers or blocks during training. This noise deliberately handicaps the network while it learns, lowering its effective capacity, so additional raw capacity is needed to compensate, which is to say a larger model.

The two reasons reinforce each other: the extra data justifies the size, and the heavy regularisation requires it.

The same noise is also the engine of generalisation, and the asymmetry is the point: the teacher that produced the pseudo-labels read clean images, while the student must reproduce those labels from heavily corrupted inputs. It cannot copy the teacher by rote; it is forced to learn features robust enough to survive the noise, and so it generalises past the teacher that taught it. This is why a student equal to or larger than its teacher can surpass it, something classic distillation, aiming only to approximate a fixed teacher, never sets out to do.

Applied to EfficientNet, Noisy Student lifted ImageNet top-1 accuracy at every scale, for example from $85.0\%$ to $86.4\%$ at EfficientNet-B7, and carried the whole EfficientNet-B0 to B7 family above the previous Pareto frontier of architectures such as SENet, AmoebaNet, NASNet, and ResNeXt: higher accuracy at lower parameter count and compute. It stands as the last purely convolutional approach to hold the ImageNet state of the art before Vision Transformers.

Conclusion

Both methods are built on the same scaffold, a teacher network that trains a student from its own outputs, and both exploit the same fact, that a trained model’s soft outputs carry more knowledge than any hard label. What differs is the destination. Classic distillation runs the scaffold downward, compressing the teacher’s knowledge into a smaller, faster student for deployment. Noisy Student runs it outward, using the teacher to label a flood of unlabelled data and training a larger, noised student that generalises beyond the teacher. One trades size for speed at fixed accuracy; the other trades unlabelled data for accuracy at growing size.

| | Classic distillation | Noisy Student |
| --- | --- | --- |
| Goal | compress a model | improve accuracy and generalisation |
| Student vs teacher | smaller | equal or larger |
| Data | a labelled transfer set | labelled plus a large unlabelled pool, pseudo-labelled |
| Noise on the student | little or none | heavy: augmentation, dropout, stochastic depth |
| Iteration | a single teacher-to-student round | repeated, the student becomes the next teacher |
| Outcome | a fast student that nearly matches the teacher | a student that surpasses the teacher |

The common thread is worth keeping in view: in both cases the value transferred is the soft structure of a network’s predictions, the dark knowledge introduced at the start. Whether the goal is a model small enough to run on a phone or one accurate enough to top a benchmark, the lever is the same.
