Building a new loss
Consider again the example of the single neuron, with the same task as before: map a unit input ($x = 1$) to a zero output ($y = 0$). What changes now is the loss function.
🎯 Goal
The goal is to eliminate the derivative of the activation, $\sigma'(z)$, from the derivatives of the loss function (which drive gradient descent), in order to avoid the vanishing-gradient problem in the saturation regions of the sigmoid.
With the quadratic loss, the chain rule gives

$$\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x \qquad \frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)$$

$\sigma'(z)$ is a critical multiplicative factor: when $\sigma'(z) \approx 0$, the gradient vanishes, even if the error $(a - y)$ is large.
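This effect is easy to see numerically. The sketch below uses a hypothetical saturated configuration ($x = 1$, $w = b = 2$, target $y = 0$, values chosen here for illustration): the error is large, yet the quadratic-loss gradient is tiny because it carries the $\sigma'(z)$ factor.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical saturated configuration: large positive z, target y = 0.
x, w, b, y = 1.0, 2.0, 2.0, 0.0
z = w * x + b              # z = 4.0, deep in the saturation region
a = sigmoid(z)             # output close to 1 -> error (a - y) is large
sigma_prime = a * (1 - a)  # sigma'(z): near zero when saturated

# Quadratic (MSE) loss C = (a - y)^2 / 2 gives, via the chain rule:
grad_w = (a - y) * sigma_prime * x
grad_b = (a - y) * sigma_prime

print(f"error (a - y) = {a - y:.4f}")        # close to 1
print(f"sigma'(z)     = {sigma_prime:.5f}")  # close to 0
print(f"dC/dw         = {grad_w:.5f}")       # tiny despite the large error
```

Despite an error of almost $1$, the gradient is two orders of magnitude smaller: gradient descent barely moves.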
Problem setup
From the perspective of functional analysis, the problem can be posed as follows:
identify a loss function whose derivatives with respect to $w$ and $b$ do not depend on the derivative $\sigma'(z)$ of the activation function.
For simplicity, let's focus on the partial derivative $\partial C / \partial b$.
By applying the chain rule

$$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}\,\frac{\partial a}{\partial z}\,\frac{\partial z}{\partial b} = \frac{\partial C}{\partial a}\,\sigma'(z)$$

knowing that

$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) = a\,(1 - a)$$

and imposing the functional analysis constraint, namely, to find a loss function such that

$$\frac{\partial C}{\partial b} = a - y$$

we get:

$$\frac{\partial C}{\partial a} = \frac{a - y}{a\,(1 - a)}$$
Finally, integrating w.r.t. $a$ gives the new loss function:

$$C = -\bigl[\, y \ln a + (1 - y) \ln(1 - a) \,\bigr] + \text{const}$$
Proof of the Integration
The calculation of $C$ from its partial derivative with respect to $a$ requires solving the following integral:

$$C = \int \frac{a - y}{a\,(1 - a)}\, da$$
The integral is solved using the method of partial fraction decomposition. The expression is set up as follows:

$$\frac{a - y}{a\,(1 - a)} = \frac{A}{a} + \frac{B}{1 - a}$$
The coefficients $A$ and $B$ are found by multiplying both sides by the common denominator $a\,(1 - a)$:

$$a - y = A\,(1 - a) + B\,a$$
The values of $A$ and $B$ are then determined by substituting strategic values for $a$:
- For $a = 0$: $\;-y = A \implies A = -y$
- For $a = 1$: $\;1 - y = B \implies B = 1 - y$
Substituting the values of $A$ and $B$ back into the decomposed expression gives:

$$\frac{a - y}{a\,(1 - a)} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
The integration can now be performed term by term:

$$C = -\int \frac{y}{a}\, da + \int \frac{1 - y}{1 - a}\, da$$
Solving the two simple integrals yields:
- $\displaystyle -\int \frac{y}{a}\, da = -y \ln a$
- $\displaystyle \int \frac{1 - y}{1 - a}\, da = -(1 - y) \ln(1 - a)$ (found via u-substitution where $u = 1 - a$, $du = -da$)
Substituting these results back into the equation gives the final expression:

$$C = -\bigl[\, y \ln a + (1 - y) \ln(1 - a) \,\bigr] + \text{const}$$
This new loss function has the advantage of eliminating the $\sigma'(z)$ term from the derivatives, thus making the gradient more stable and informative, even in the presence of saturated activations.
MSE vs Cross-Entropy
| 🔴 MSE, unlucky configuration | 🟢 New Loss, same configuration |
|---|---|
| Input: $x = 1$ → Desired output: $y = 0$ | Input: $x = 1$ → Desired output: $y = 0$ |
| Initial $w$: $2.0$ | Initial $w$: $2.0$ |
| Initial $b$: $2.0$ | Initial $b$: $2.0$ |
| Initial output: $\approx 0.98$ | Initial output: $\approx 0.98$ |
| Final output: $\approx 0.20$ | Final output: $\approx 0$ |
| Learning rate: $\eta = 0.15$ | Learning rate: $\eta = 0.005$ |
| Loss function: quadratic (MSE) | Loss function: Cross-Entropy |
| ⏳ Slow learning: long initial plateau | ⚡ Fast learning |
| ⚠️ Near-zero initial gradient | ✅ Substantial initial gradient |
| 😴 Initially flat curve, followed by a delayed descent | 📉 The loss function decreases immediately |
| 🐌 $w$ and $b$ remain nearly constant for a considerable number of iterations | 🚀 $w$ and $b$ are updated right from the start |
✅ The neuron learned quickly!
By using the cross-entropy loss, the neuron was able to learn quickly even when starting from a disadvantageous parameter configuration, exactly as hoped.
A closer look at the plots above shows that the initial slope of the loss curve is much steeper than the initial flat plateau observed with the MSE loss. It is precisely this steeper slope that cross-entropy provides: it allows the neuron to avoid early stagnation right at the stage when it should be learning the most, namely when starting from a configuration very far from the correct target.
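The dynamics described above can be reproduced in a few lines of Python. This is a sketch, not the exact experiment from the text: it assumes the unlucky start $w = b = 2.0$ with input $x = 1$ and target $y = 0$, and it uses the same learning rate for both losses purely to compare the shape of the two learning curves.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(loss_kind, eta, epochs=300):
    """Train one sigmoid neuron on (x=1 -> y=0) from an unlucky start."""
    x, y = 1.0, 0.0
    w, b = 2.0, 2.0          # unlucky start: initial output ~0.98
    outputs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        if loss_kind == "mse":
            # dC/db = (a - y) * sigma'(z): carries the vanishing factor
            grad = (a - y) * a * (1 - a)
        else:
            # cross-entropy: dC/db = (a - y), sigma'(z) cancelled out
            grad = (a - y)
        w -= eta * grad * x
        b -= eta * grad
        outputs.append(a)
    return outputs

mse_out = train("mse", eta=0.15)
ce_out = train("ce", eta=0.15)   # same eta, only for a qualitative comparison

# After a handful of epochs cross-entropy has already moved the output,
# while MSE is still stuck near its starting value.
print(f"after 10 epochs: MSE output = {mse_out[9]:.3f}, CE output = {ce_out[9]:.3f}")
```

The MSE run sits on its plateau while the cross-entropy run starts descending immediately, which is exactly the qualitative difference summarized in the table.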
About the Learning Rate…
In the case of the MSE loss shown on the left side of the table above, the learning rate used was $\eta = 0.15$.
Should the same learning rate be used when switching to the new loss function (cross-entropy)?
Strictly speaking, it is not meaningful to talk about using the same learning rate when the loss function changes: it would be like comparing apples to oranges. In both cases, the learning rate was determined empirically (i.e., by experimenting with different values) so as to make the learning dynamics clearly observable. For the cross-entropy loss, the chosen value was $\eta = 0.005$.
🎯 The key point is not absolute speed
One might object that, since the learning rate was changed, the previous plots lose their significance.
After all, why should the speed at which the neuron learns matter if the learning rate was chosen arbitrarily? But such an objection misses the central point.
The key issue is not the absolute speed of learning, but rather how the learning speed changes.
- With the quadratic loss function, learning is slower when the neuron is making large errors, and only accelerates as it approaches the desired output.
- With cross-entropy, by contrast, learning is faster precisely when the neuron is far from the correct response.
These observations hold regardless of the specific learning rate chosen.
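The two bullet points can be made concrete. For target $y = 0$, the gradient magnitudes with respect to $b$ are $(a - y)\,\sigma'(z) = a^2(1 - a)$ for the quadratic loss and simply $(a - y) = a$ for cross-entropy. Tabulating both over a range of outputs (a quick sketch) shows the quadratic gradient collapsing exactly where the error is largest:

```python
# Gradient magnitude |dC/db| for target y = 0, as a function of the output a.
# Quadratic (MSE):  (a - y) * sigma'(z) = a * a * (1 - a)
# Cross-entropy:    (a - y)             = a
for a in (0.1, 0.5, 0.9, 0.99):
    g_mse = a * a * (1 - a)   # shrinks toward 0 as a -> 1 (large error)
    g_ce = a                  # grows with the error
    print(f"a = {a:4}:  MSE grad = {g_mse:.4f}   CE grad = {g_ce:.2f}")
```

At $a = 0.99$ (a badly wrong answer) the quadratic gradient is about $0.0098$ while the cross-entropy gradient is $0.99$: a hundredfold difference in how strongly the worst mistakes push the parameters, independent of the learning rate.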
Cross-Entropy loss: formal derivation
A more rigorous way to arrive at the same loss function derived above, rather than reasoning heuristically on a single neuron, is to appeal to Claude Shannon's Information Theory, and in particular to the concept of cross-entropy.
By adapting the definition of cross-entropy from Information Theory to the neural network domain and interpreting it as a measure of error, the Cross-Entropy Loss is obtained:

$$L_{CE} = -\frac{1}{n} \sum_{x} \bigl[\, y \ln a + (1 - y) \ln(1 - a) \,\bigr]$$
To intuitively motivate the superiority of the Cross-Entropy loss over Mean Squared Error (MSE), it's useful to consider their fundamental differences. MSE is essentially a geometric ($L_2$) distance, which assumes we operate in a space where measuring an $L_2$ distance is meaningful. Its formula is:

$$L_{MSE} = \frac{1}{2n} \sum_{x} (y - a)^2$$
In contrast, $L_{CE}$ measures the "distance" in terms of the bits of information between the desired outputs ($y$) and the actual outputs ($a$). This perspective is more consistent with the domain of neural networks, which are trained on data, that is, on information itself.
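A toy comparison makes the difference tangible. For a single output with desired value $y = 1$, the sketch below (illustrative values, natural logarithm) shows how each loss scores increasingly confident wrong answers:

```python
import math

y = 1.0  # desired output
for a in (0.5, 0.1, 0.01):  # increasingly confident wrong answers
    mse = 0.5 * (y - a) ** 2
    ce = -(y * math.log(a) + (1 - y) * math.log(1 - a))
    print(f"a = {a}:  MSE = {mse:.3f}   CE = {ce:.3f}")
```

MSE saturates near $0.5$ no matter how confidently wrong the output is, while cross-entropy grows without bound, so confidently wrong answers receive a proportionally stronger correction signal.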

The Loss Function That Shaped Deep Learning
Empirically, it is observed that using $L_{CE}$ helps neural networks converge much more effectively. This loss function was one of the key improvements that made Deep Learning practical, enabling networks to be trained faster by using a measure based on Shannon's Information Theory in place of the $L_2$ distance.
$L_{CE}$ is a measure that approaches zero as the actual outputs become more similar to the desired outputs.
Cross-Entropy in Modern DL Frameworks
In Deep Learning frameworks like PyTorch, the general theoretical formula for Cross-Entropy ($L_{CE}$) is not typically implemented directly. Instead, specialized versions are provided. This is because the formula itself is a direct adaptation of Shannon's cross-entropy, from Information Theory, to the context of neural networks. The implementations found within these frameworks are practical, numerically stable specializations of that core concept.
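For instance, PyTorch's `BCEWithLogitsLoss` fuses the sigmoid with the cross-entropy and rewrites the computation on the logit $z$ itself, so that `log(sigmoid(z))` never underflows. The sketch below illustrates the standard stability rewrite $\max(z, 0) - zy + \ln(1 + e^{-|z|})$ in pure Python; it mirrors the usual formulation of this trick, not PyTorch's actual source code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(z, y):
    # Direct formula: breaks down when sigmoid(z) rounds to exactly 0 or 1
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def bce_stable(z, y):
    # Equivalent rewrite working on the logit z itself:
    #   max(z, 0) - z*y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# For moderate logits the two agree...
print(bce_naive(2.0, 0.0), bce_stable(2.0, 0.0))

# ...but for a large logit, sigmoid(z) rounds to exactly 1.0 in floating
# point, the naive version hits log(0), while the stable one still
# returns the correct finite loss.
try:
    bce_naive(40.0, 0.0)
except ValueError:
    print("naive version failed: log(0)")
print(bce_stable(40.0, 0.0))
```

This is why, in practice, the sigmoid and the loss are applied as one fused, logit-space operation rather than composing the "textbook" formulas.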
When should cross-entropy be used instead of MSE?
Cross-entropy is almost always the best choice, provided that the output neurons are sigmoid neurons.
To understand why, keep in mind that during a network's initialization phase, the weights and biases are typically assigned randomly. It's possible for this initial choice to cause the model to be confidently wrong for some training inputs: for example, an output neuron might saturate near $1$ when it should actually return $0$ (or vice versa).
If the quadratic cost function (MSE) is being used in this scenario, learning will slow down significantly. Learning wonβt stop entirely (because the network can still learn from other inputs), but the slowdown is nonetheless undesirable.