Building a new loss
Consider again the example of the single neuron. However, the task is no longer to map its unit input to zero.
🎯 Goal
The goal is to drop the $\sigma'(z)$ term from the derivatives of the loss function (which drive gradient descent), in order to avoid the vanishing-gradient problem in the saturation region of the sigmoid.
$\sigma'(z)$ is a critical multiplicative factor: when $\sigma'(z) \approx 0$, the gradient vanishes, even if the error is large.
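A minimal numerical sketch of this effect, assuming a single sigmoid neuron trained with the quadratic loss $\frac{1}{2}(a - y)^2$. The pre-activation values and the target are hypothetical, chosen only to walk the neuron into saturation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_wrt_z(z, y):
    """Derivative of 0.5 * (sigmoid(z) - y)**2 w.r.t. z: (a - y) * sigma'(z)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

y = 0.0  # hypothetical desired output
for z in (0.0, 2.0, 5.0, 10.0):  # hypothetical pre-activations, increasingly saturated
    a = sigmoid(z)
    print(f"z={z:5.1f}  error={a - y:.4f}  "
          f"sigma'(z)={a * (1 - a):.6f}  grad={mse_grad_wrt_z(z, y):.6f}")
```

The error grows toward 1 as the neuron saturates, while the gradient shrinks toward 0, because the factor $\sigma'(z)$ dominates the product.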
Problem setup
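From the perspective of functional analysis, the problem can be posed as follows:
identify a loss function $L$ whose derivatives with respect to $w$ and $b$ do not depend on the derivative of the activation function, $\sigma'(z)$.
Here $a = \sigma(z)$ is the neuron's output, $z = wx + b$ its weighted input, and $y$ the desired output.
For simplicity, let's focus on the partial derivative $\partial L / \partial b$.
By applying the chain rule
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial z}\,\frac{\partial z}{\partial b} = \frac{\partial L}{\partial a}\,\sigma'(z)$$
knowing that
$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) = a\,(1 - a)$$
and imposing the functional analysis constraint, namely, to find a loss function such that
$$\frac{\partial L}{\partial b} = a - y$$
we get:
$$\frac{\partial L}{\partial a} = \frac{a - y}{a\,(1 - a)}$$
Finally, integrating w.r.t. $a$ gives the new loss function:
$$L = -\bigl[\,y \ln a + (1 - y)\ln(1 - a)\,\bigr] + \text{const}$$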
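The result can be checked numerically. The sketch below assumes the integration yields the binary cross-entropy form $-[y \ln a + (1-y)\ln(1-a)]$ and compares central finite differences of that loss against the derivatives the constraint asks for, $a - y$ for the bias and $x\,(a - y)$ for the weight. The parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def new_loss(w, b, x, y):
    """Assumed form of the derived loss for one sigmoid neuron."""
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def numeric_grad(f, v, h=1e-6):
    """Central finite difference of a scalar function f at v."""
    return (f(v + h) - f(v - h)) / (2.0 * h)

w, b, x, y = 2.0, 2.0, 0.5, 0.0  # hypothetical values
a = sigmoid(w * x + b)

dL_db = numeric_grad(lambda bb: new_loss(w, bb, x, y), b)
dL_dw = numeric_grad(lambda ww: new_loss(ww, b, x, y), w)

print("dL/db numeric:", dL_db, " expected a - y:", a - y)
print("dL/dw numeric:", dL_dw, " expected x*(a - y):", x * (a - y))
```

Neither derivative contains $\sigma'(z)$: the $a(1-a)$ in the denominator of $\partial L / \partial a$ cancels it exactly.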
Proof of the Integration
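The calculation of $L$ from its partial derivative with respect to $a$ requires solving the following integral:
$$L = \int \frac{a - y}{a\,(1 - a)}\,da$$
The integral is solved using the method of partial fraction decomposition. The expression is set up as follows:
$$\frac{a - y}{a\,(1 - a)} = \frac{A}{a} + \frac{B}{1 - a}$$
The coefficients $A$ and $B$ are found by multiplying both sides by the common denominator $a\,(1 - a)$:
$$a - y = A\,(1 - a) + B\,a$$
The values of $A$ and $B$ are then determined by substituting strategic values for $a$:
- For $a = 0$: $-y = A$
- For $a = 1$: $1 - y = B$
Substituting the values of $A$ and $B$ back into the decomposed expression gives:
$$\frac{a - y}{a\,(1 - a)} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
The integration can now be performed term by term:
$$L = -y \int \frac{1}{a}\,da + (1 - y) \int \frac{1}{1 - a}\,da$$
Solving the two simple integrals yields:
- $\int \frac{1}{a}\,da = \ln|a|$
- $\int \frac{1}{1 - a}\,da = -\ln|1 - a|$ (found via u-substitution where $u = 1 - a$)
Substituting these results back into the equation, and dropping the absolute values because $0 < a < 1$, gives the final expression:
$$L = -y \ln a - (1 - y)\ln(1 - a) + \text{const} = -\bigl[\,y \ln a + (1 - y)\ln(1 - a)\,\bigr] + \text{const}$$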
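As a sanity check, differentiating the result with respect to the neuron output should return the integrand. A short verification, written with symbols assumed here ($a$ for the neuron output, $y$ for the desired output):

```latex
\frac{\partial}{\partial a}\Bigl[-y\ln a-(1-y)\ln(1-a)\Bigr]
  = -\frac{y}{a}+\frac{1-y}{1-a}
  = \frac{-y\,(1-a)+(1-y)\,a}{a\,(1-a)}
  = \frac{a-y}{a\,(1-a)}
```

The numerator simplifies because the $ya$ terms cancel, leaving exactly the expression that was integrated.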
Success
This new loss function has the advantage of eliminating the term $\sigma'(z)$ from the derivatives, thus making the gradient more stable and informative, even in the presence of saturated activations.
MSE vs Cross Entropy
| 🔴 MSE, unlucky configuration | 🟢 New loss, same configuration |
|---|---|
| Input: → Desired output: | Input: → Desired output: |
| Initial $w$: | Initial $w$: |
| Initial $b$: | Initial $b$: |
| Initial output: | Initial output: |
| Final output: | Final output: |
| Learning rate: | Learning rate: |
| Loss function: quadratic (MSE) | Loss function: cross-entropy |
| ⏳ Slow learning: plateau lasting epochs | ✅ Fast learning |
| ⚠️ Near-zero initial gradient | ✅ Substantial initial gradient |
| 💤 Initially flat curve, followed by a delayed descent | The loss function decreases immediately |
| $w$ and $b$ remain nearly constant for a considerable number of iterations | $w$ and $b$ are updated right from the start |
✅ The neuron learned quickly!
By using the cross-entropy loss, the neuron was able to learn quickly, even when starting from a disadvantageous parameter configuration, exactly as hoped.
A closer look at the above plots shows that the initial slope of the loss curve is much steeper than the flat initial plateau observed with the MSE loss. It is precisely this steeper slope that cross-entropy provides: it allows the neuron to avoid early stagnation, right at the stage when it should be learning the most, namely, when starting from a configuration very far from the correct target.
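The qualitative difference can be reproduced with a few lines of plain gradient descent. This is a sketch under stated assumptions, not the experiment behind the plots: the starting weight and bias, the input, the target, and the learning rate are all hypothetical, and the same learning rate is used for both losses so that only the loss differs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(loss_name, w=2.0, b=2.0, x=1.0, y=0.0, lr=0.15, epochs=400):
    """Gradient descent on one sigmoid neuron; returns the output before each update."""
    outputs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        outputs.append(a)
        if loss_name == "mse":
            delta = (a - y) * a * (1.0 - a)  # dL/dz for 0.5*(a-y)^2, carries sigma'(z)
        else:
            delta = a - y                    # dL/dz for cross-entropy, sigma'(z) cancelled
        w -= lr * delta * x
        b -= lr * delta
    return outputs

def epochs_until_below(outputs, threshold=0.5):
    """Index of the first epoch whose output is below threshold, or None."""
    return next((i for i, a in enumerate(outputs) if a < threshold), None)

mse_outputs = train("mse")
ce_outputs = train("ce")
print("MSE: epochs until output < 0.5:", epochs_until_below(mse_outputs))
print("CE : epochs until output < 0.5:", epochs_until_below(ce_outputs))
```

Under these assumptions the quadratic loss spends many epochs on the plateau before the output starts to move, while cross-entropy moves from the first update.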
About the Learning Rate…
In the case of the MSE loss shown on the left side of the table above, the learning rate used was:
Question
Should the same learning rate be used when switching to the new loss function (cross-entropy)?
Strictly speaking, it is not meaningful to talk about using the same learning rate when the loss function changes: it would be like comparing apples to oranges. In both cases, the learning rate was empirically determined (i.e., by experimenting with different values) so as to make the learning dynamics clearly observable. For the cross-entropy loss, the chosen value was .
The key point is not absolute speed
One might object that, since the learning rate was changed, the previous plots lose their significance.
After all, why should the speed at which the neuron learns matter if the learning rate was chosen arbitrarily? But such an objection misses the central point.
The key issue is not the absolute speed of learning, but rather how the learning speed changes.
- With the quadratic loss function, learning is slower when the neuron is making large errors, and only accelerates as it approaches the desired output.
- With cross-entropy, by contrast, learning is faster precisely when the neuron is far from the correct response.
These observations hold regardless of the specific learning rate chosen.
Cross Entropy loss: formal derivation
Prerequisite: cross-entropy in information theory
Cross-entropy is a general quantity from information theory, not something specific to neural networks. Its definition, the decomposition $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, and the proof that minimising it is exactly maximum likelihood are developed in the Cross Entropy note of the information-theory section. This paragraph adapts that quantity to a neural network.
A more rigorous way to arrive at the same loss function derived above, rather than reasoning heuristically on a single neuron, is to appeal to Claude Shannon's Information Theory, and in particular to the concept of cross-entropy $H(p, q)$.
The cross-entropy between two probability distributions $p$ and $q$ over the same outcomes is
$$H(p, q) = -\sum_{x} p(x)\,\log_2 q(x)$$
It measures, in bits, the cost of describing data that truly follow $p$ while assuming the distribution $q$: it is minimised when $q = p$, and grows as the two drift apart. Reading the target distribution as $p$ (the true labels) and the network's output as $q$ (its estimate), $H(p, q)$ becomes an information distance between what the data say and what the network predicts. For a sigmoid output $a_i$ with desired output $y_i$, the two distributions are $(y_i,\, 1 - y_i)$ and $(a_i,\, 1 - a_i)$. Adopted as a training objective, it is the cross-entropy loss (written with the natural logarithm, which only rescales it by a constant factor):
$$L_{CE} = -\sum_{i} \bigl[\,y_i \ln a_i + (1 - y_i)\ln(1 - a_i)\,\bigr]$$
To intuitively motivate the superiority of the Cross-Entropy loss over Mean Squared Error (MSE), it's useful to consider their fundamental differences. MSE is essentially a geometric ($L^2$) distance, which presupposes a space in which measuring an $L^2$ distance is meaningful. Its formula is:
$$L_{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - a_i)^2$$
In contrast, $L_{CE}$ measures the "distance" in terms of the bits of information between the desired outputs ($y$) and the actual outputs ($a$). This perspective is more consistent with the domain of neural networks, which are trained on data, that is, on information itself.
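The information-theoretic reading can be made concrete with a small computation in bits. The sketch below uses hypothetical three-outcome distributions and checks two standard facts: cross-entropy is smallest when the estimate equals the true distribution, and it splits into the entropy of the true distribution plus the Kullback-Leibler divergence.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]        # hypothetical true distribution
q_close = [0.4, 0.3, 0.3]    # hypothetical estimate near p
q_far = [0.1, 0.1, 0.8]      # hypothetical estimate far from p

for name, q in (("q = p", p), ("q close", q_close), ("q far", q_far)):
    print(f"{name:8s} H(p,q) = {cross_entropy(p, q):.4f} bits, "
          f"H(p) + KL = {entropy(p) + kl_divergence(p, q):.4f} bits")
```

The gap between $H(p, q)$ and $H(p)$ is the divergence, which is why minimising cross-entropy over the estimate pulls it toward the true distribution.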
Why cross-entropy became the default loss
Three properties explain its dominance:
- it is robust to neuron saturation and the learning slowdown of MSE, because its gradient drops the factor $\sigma'(z)$ (the derivation above);
- it has been the most widely used loss for training neural networks since around 2000;
- many other losses are specialisations of it: binary cross-entropy, the multilabel logistic loss and the negative log-likelihood loss are all $H(p, q)$ instantiated to a particular label structure.
For a modern theoretical treatment, see Mao et al., Cross-entropy loss functions: theoretical analysis and applications (ICML 2023).
The Loss Function That Shaped Deep Learning
Empirically, it is observed that using $L_{CE}$ helps neural networks converge much more effectively. This loss function was one of the key improvements that made Deep Learning practical, enabling networks to be trained faster by using a measure based on Shannon's Information Theory in place of the $L^2$ distance.
$L_{CE}$ is a measure that approaches zero as the actual outputs become more similar to the desired outputs.
Cross-entropy in PyTorch: what actually runs
The theoretical formula $-\sum_i y_i \log p_i$ is almost never evaluated as written. PyTorch's `nn.CrossEntropyLoss` fuses the softmax, the logarithm, and the negative log-likelihood into a single operation that consumes raw logits $z$, never the post-softmax probabilities $p$. The fusion is not a convenience: it is what keeps the computation finite.
Why the theoretical formula is never computed directly
Computing $p = \mathrm{softmax}(z)$ and then $\log p$ chains two operations that each break at the extremes: $e^{z}$ overflows to $+\infty$ for a logit as small as about $89$ in float32, and $\log p = -\infty$ when a probability saturates to $0$. The fused loss sidesteps both by staying in logits. For target class $c$ and $m = \max_j z_j$,
$$L = -z_c + \log \sum_j e^{z_j} = -z_c + m + \log \sum_j e^{z_j - m}$$
the log-sum-exp trick: subtracting the largest logit before exponentiating forces every $e^{z_j - m} \le 1$, so the sum stays between $1$ and the number of classes and nothing overflows. The softmax probabilities are produced implicitly and never stored. This is the reason a classifier's output layer must emit raw logits, with no softmax of its own.
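A minimal pure-Python sketch of the log-sum-exp trick, contrasted with the naive softmax-then-log route. It is an illustration of the idea, not PyTorch's implementation. Python floats are float64, so the overflow threshold here sits near a logit of 709 rather than 89; the logits below are hypothetical and chosen to exceed it.

```python
import math

def naive_cross_entropy(logits, target):
    """softmax followed by log: breaks on large or very negative logits."""
    exps = [math.exp(z) for z in logits]   # math.exp raises OverflowError when too large
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target])        # math.log raises ValueError on 0.0

def stable_cross_entropy(logits, target):
    """Same quantity computed in logit space with the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))  # every exponent <= 0
    return lse - logits[target]

extreme = [1000.0, 0.0, -5.0]  # hypothetical logits; true class is index 1
try:
    naive_cross_entropy(extreme, 1)
    naive_ok = True
except (OverflowError, ValueError):
    naive_ok = False

print("naive succeeded:", naive_ok)
print("stable loss    :", stable_cross_entropy(extreme, 1))

mild = [2.0, 1.0, 0.0]         # on ordinary logits both routes agree
print(naive_cross_entropy(mild, 0), stable_cross_entropy(mild, 0))
```

The stable version never forms a probability, so it has nothing that can round to zero before the logarithm is taken.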
Two silent bugs the fused design invites
Both produce wrong training with no error raised:
- Double softmax. Putting `softmax` inside the model and passing the result to `CrossEntropyLoss` normalises twice; the gradients flatten and learning quietly stalls. The model must output unnormalised logits. (An extra `log_softmax` is idempotent under the loss's own log-softmax, so it is redundant rather than harmful; the loss that expects log-probabilities is `NLLLoss`.)
- Probabilities instead of logits. Feeding already-normalised probabilities makes the loss take the log-softmax of a distribution, a different and incorrect objective.
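The effect of feeding probabilities to a loss that expects logits can be seen in a pure-Python sketch. The helper names are hypothetical and the code only mirrors the arithmetic of a fused log-softmax loss; the logits describe a confident, correct prediction.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(v):
    m = max(v)
    lse = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - lse for x in v]

def cross_entropy_from_logits(inputs, target):
    """What a fused loss computes: -log_softmax(inputs)[target]."""
    return -log_softmax(inputs)[target]

logits = [8.0, 0.0, 0.0]  # hypothetical logits, confident and correct
target = 0

correct = cross_entropy_from_logits(logits, target)
double = cross_entropy_from_logits(softmax(logits), target)           # softmax applied twice
via_log_softmax = cross_entropy_from_logits(log_softmax(logits), target)

print("logits in          :", correct)
print("probabilities in   :", double)
print("log-probabilities in:", via_log_softmax)
```

Because probabilities live in $[0, 1]$, the second softmax can never separate the classes by more than one unit, so the loss is stuck above $\ln(1 + (K-1)/e)$ for $K$ classes however good the model is. In this sketch the `log_softmax` variant leaves the loss unchanged, so the damage comes specifically from passing probabilities.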
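The stable form keeps intact the very property that made cross-entropy worth adopting: its gradient with respect to the logits is the clean residual
$$\frac{\partial L}{\partial z_j} = p_j - y_j$$
where $p = \mathrm{softmax}(z)$ and $y$ is the one-hot target, so a confidently wrong logit (large on the wrong class) still produces a gradient of order one, exactly where a naive $\log(\mathrm{softmax}(z))$ would have produced an infinity or a NaN.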
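A short numerical check of this residual, assuming a one-hot target and the softmax cross-entropy written in logit space. The logits are hypothetical and deliberately place almost all the mass on the wrong class.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, target):
    """Softmax cross-entropy via log-sum-exp."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[target]

def numeric_grad(logits, target, h=1e-5):
    """Central finite differences of the loss w.r.t. each logit."""
    grads = []
    for j in range(len(logits)):
        up = logits[:j] + [logits[j] + h] + logits[j + 1:]
        down = logits[:j] + [logits[j] - h] + logits[j + 1:]
        grads.append((loss(up, target) - loss(down, target)) / (2 * h))
    return grads

logits = [12.0, 0.0, 0.0]  # hypothetical: confident on class 0
target = 1                 # but the true class is 1

p = softmax(logits)
analytic = [p[j] - (1.0 if j == target else 0.0) for j in range(len(logits))]
numeric = numeric_grad(logits, target)

print("analytic p - y:", analytic)
print("numeric       :", numeric)
```

The two largest components have magnitude close to one, so the update is strongest precisely when the prediction is most wrong.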
Three options of `CrossEntropyLoss` solve problems that recur constantly and that introductions rarely mention.
`label_smoothing`: a regulariser hidden inside the loss
Setting `label_smoothing=ε` replaces the hard one-hot target with a mixture: about $1 - \varepsilon$ on the correct class and the remaining $\varepsilon$ spread uniformly over the classes. Cross-entropy against a strict one-hot target drives the correct logit toward $+\infty$, which breeds overconfident, badly calibrated predictions; the smoothed target caps that drive and leaves a finite optimal logit gap. The standard value $\varepsilon = 0.1$ is used in many modern vision and language models, and it often improves calibration, and sometimes accuracy, for the cost of a single argument.
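A pure-Python sketch of what the smoothed target looks like and why the optimal logit gap becomes finite. It assumes the mixing rule described in the PyTorch documentation, $(1 - \varepsilon)$ times the one-hot vector plus $\varepsilon / K$ on each of the $K$ classes; the class count and logits are hypothetical.

```python
import math

def smoothed_target(target, num_classes, eps):
    """(1 - eps) * one_hot + eps / K, i.e. a mix with the uniform distribution."""
    return [(1.0 - eps) * (1.0 if k == target else 0.0) + eps / num_classes
            for k in range(num_classes)]

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - lse) for t, z in zip(target_probs, logits))

K, eps, true_class = 5, 0.1, 2   # hypothetical setting
t = smoothed_target(true_class, K, eps)

# The loss is minimised when softmax(logits) equals t, which fixes the logit gap.
best_gap = math.log(t[true_class] / t[0])

def logits_with_gap(gap):
    return [gap if k == true_class else 0.0 for k in range(K)]

for gap in (best_gap - 2.0, best_gap, best_gap + 5.0):
    print(f"gap={gap:6.3f}  loss={soft_cross_entropy(logits_with_gap(gap), t):.4f}")
```

Pushing the correct logit beyond the optimal gap increases the loss again, which is the mechanism that discourages overconfidence.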
`ignore_index` and `weight`
`ignore_index` drops chosen target positions from the loss entirely, contributing no gradient. This is what makes cross-entropy usable on padded sequences: the padding token is ignored, so short and long sequences in a batch are scored fairly. It is the standard partner of BPTT on variable-length data.
`weight` rescales each class's contribution, the simplest remedy for class imbalance: up-weighting a rare class makes its errors count more, countering a model that would otherwise collapse onto the majority class.
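Both options amount to a change in how per-example losses are reduced. The pure-Python sketch below mirrors the documented `reduction='mean'` behaviour as understood here: ignored targets are skipped, and the weighted sum is divided by the sum of the weights of the targets that remain. The function name, batch, and weights are hypothetical; `-100` is PyTorch's documented default for `ignore_index`.

```python
import math

IGNORE_INDEX = -100  # PyTorch's documented default

def log_softmax(v):
    m = max(v)
    lse = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - lse for x in v]

def masked_weighted_ce(batch_logits, targets, class_weights=None,
                       ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over non-ignored targets, normalised by their weights."""
    total, norm = 0.0, 0.0
    for logits, t in zip(batch_logits, targets):
        if t == ignore_index:
            continue                      # padding: no loss, no gradient
        w = 1.0 if class_weights is None else class_weights[t]
        total += -w * log_softmax(logits)[t]
        norm += w
    return total / norm if norm > 0 else float("nan")

batch = [[2.0, 0.0], [2.0, 0.0], [9.0, -9.0]]   # last row stands for a padded position
targets = [0, 1, IGNORE_INDEX]

plain = masked_weighted_ce(batch, targets)
no_padding = masked_weighted_ce(batch[:2], targets[:2])
rare_upweighted = masked_weighted_ce(batch, targets, class_weights=[1.0, 3.0])

print("with padding row   :", plain)
print("without padding row:", no_padding)
print("class 1 up-weighted:", rare_upweighted)
```

The padded row has no influence on the result, and up-weighting class 1 makes the misclassified class-1 example dominate the average.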
Finally, the target of `CrossEntropyLoss` may be either class indices (hard labels) or a full probability vector per row (soft labels). The same loss therefore covers ordinary classification, knowledge distillation (the targets are a teacher's softened outputs), and mixup-style augmentation (the targets are convex combinations of one-hot vectors), with no change of interface.
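A short pure-Python sketch of the soft-label case, using a mixup-style target as the example. The helper names, logits, and mixing coefficient are hypothetical. It shows that a one-hot soft target reproduces the hard-label loss and that the loss is linear in the target.

```python
import math

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - lse) for t, z in zip(target_probs, logits))

def one_hot(index, num_classes):
    return [1.0 if k == index else 0.0 for k in range(num_classes)]

logits = [1.5, -0.3, 0.2]   # hypothetical model output
lam = 0.7                   # hypothetical mixup coefficient

loss_class0 = soft_cross_entropy(logits, one_hot(0, 3))   # same as the hard label 0
loss_class2 = soft_cross_entropy(logits, one_hot(2, 3))   # same as the hard label 2

mixed_target = [lam * a + (1 - lam) * b
                for a, b in zip(one_hot(0, 3), one_hot(2, 3))]
loss_mixed = soft_cross_entropy(logits, mixed_target)

print("hard label 0 :", loss_class0)
print("hard label 2 :", loss_class2)
print("mixup target :", loss_mixed)
```

Because of this linearity, a mixed target gives the same value as mixing the two hard-label losses with the same coefficient.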
When should cross-entropy be used instead of MSE?
Cross-entropy is almost always the best choice, provided that the output neurons are sigmoid neurons.
To understand why, keep in mind that during a network's initialization phase, the weights and biases are typically assigned randomly. It's possible for this initial choice to cause the model to be confidently wrong for some training inputs: for example, an output neuron might saturate near $1$ when it should actually return $0$ (or vice-versa).
If the quadratic cost function (MSE) is being used in this scenario, learning will slow down significantly. Learning won't stop entirely (because the network can still learn from other inputs), but the slowdown is nonetheless undesirable.