Building a new loss

Consider again the example of the single neuron. This time, however, the task is no longer to train it to output 0 given a unit input.

🎯 Goal

The goal is to remove the $\sigma'(z)$ factor from the derivatives of the loss function (the derivatives that drive gradient descent), in order to avoid the vanishing-gradient problem in the saturation regions of the sigmoid.

$\sigma'(z)$ is a critical multiplicative factor: when $\sigma'(z) \approx 0$ (i.e., when the neuron saturates), the gradient vanishes, even if the error is large.
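To make the effect concrete, here is a minimal numerical sketch (with made-up values, not taken from the experiments below): for a single sigmoid neuron the MSE gradient with respect to the bias is $(a - y)\,\sigma'(z)$, and saturation crushes it even when the error is large.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_b(z, y):
    # d/db of (a - y)^2 / 2, with a = sigmoid(z): the factor a*(1-a) is sigma'(z)
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

# Saturated neuron: output ~0.99 while the target is 0 -> large error,
# yet the gradient is tiny because sigma'(z) is nearly zero.
print(mse_grad_b(5.0, 0.0))   # ~0.0066
# A non-saturated neuron with a *smaller* error gets a *larger* gradient:
print(mse_grad_b(0.5, 0.0))   # ~0.146
```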

Problem setup

From the perspective of functional analysis, the problem can be posed as follows:
identify a loss function whose derivatives with respect to $w$ and $b$ do not depend on the derivative $\sigma'(z)$ of the activation function.

For simplicity, let’s focus on the partial derivative $\partial C / \partial b$.

By applying the chain rule

$$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = \frac{\partial C}{\partial a} \cdot \sigma'(z),$$

knowing that

$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) = a\,(1 - a),$$

and imposing the functional analysis constraint, namely, to find a loss function such that

$$\frac{\partial C}{\partial b} = a - y,$$

we get:

$$\frac{\partial C}{\partial a} = \frac{a - y}{a\,(1 - a)}.$$

Finally, integrating w.r.t. $a$ gives the new loss function:

$$C = -\bigl[\, y \ln a + (1 - y) \ln(1 - a) \,\bigr] + \text{const.}$$

This new loss function has the advantage of eliminating the $\sigma'(z)$ term from the derivatives, thus making the gradient more stable and informative, even in the presence of saturated activations.
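As a quick sanity check of the derivation (a sketch with assumed values $w = b = 2$, $x = 1$, $y = 0$, not figures from the text), a finite-difference estimate of $\partial C / \partial b$ for the new loss can be compared against $a - y$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_loss(w, b, x, y):
    # The loss derived above: C = -[y*ln(a) + (1-y)*ln(1-a)], a = sigmoid(w*x + b)
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

w, b, x, y = 2.0, 2.0, 1.0, 0.0
eps = 1e-6

# Central finite difference of C w.r.t. b ...
numeric = (ce_loss(w, b + eps, x, y) - ce_loss(w, b - eps, x, y)) / (2 * eps)
# ... should match a - y, with no sigma'(z) factor in sight:
analytic = sigmoid(w * x + b) - y
print(abs(numeric - analytic) < 1e-6)  # True
```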


MSE vs Cross Entropy

| 🔴 MSE, unlucky configuration | 🟢 New loss, same configuration |
| --- | --- |
| Input → desired output: | Input → desired output: |
| Initial $w$: | Initial $w$: |
| Initial $b$: | Initial $b$: |
| Initial output: | Initial output: |
| Final output: | Final output: |
| Learning rate: | Learning rate: |
| Loss function: quadratic (MSE) | Loss function: cross-entropy |
| ⏳ Slow learning: a long plateau | ✅ Fast learning |
| ⚠️ Near-zero initial gradient | ✅ Substantial initial gradient |
| 💀 Initially flat curve, followed by a delayed descent | 📉 The loss decreases immediately |
| 📌 $w$ and $b$ remain nearly constant for many iterations | 📌 $w$ and $b$ are updated right from the start |

✅ The neuron learned quickly!

By using the cross-entropy loss, the neuron was able to learn quickly, even when starting from a disadvantageous parameter configuration, exactly as hoped.
A closer look at the above plots shows that the initial slope of the cross-entropy loss curve is much steeper, whereas with the MSE loss the curve starts out with a flat plateau.

It is precisely this steeper slope that cross-entropy provides: it allows the neuron to avoid early stagnation, right at the stage when it should be learning the most, namely, when starting from a configuration very far from the correct target.
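The dynamics described above can be reproduced with a toy simulation (assumed values: $x = 1$, $y = 0$, $w = b = 2$, $\eta = 0.15$, 300 epochs; for simplicity the same learning rate is used for both losses, unlike the per-loss tuned values discussed in the next section):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(loss, w, b, x=1.0, y=0.0, eta=0.15, epochs=300):
    """Gradient descent on a single sigmoid neuron; returns the final output."""
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        if loss == "mse":
            delta = (a - y) * a * (1 - a)   # dC/dz carries the sigma'(z) factor
        else:
            delta = a - y                   # cross-entropy: dC/dz = a - y
        w -= eta * delta * x
        b -= eta * delta
    return sigmoid(w * x + b)

# "Unlucky" start: the neuron confidently outputs ~0.98 while the target is 0
print(train("mse", 2.0, 2.0))   # still far from the target after 300 epochs
print(train("ce",  2.0, 2.0))   # much closer to 0
```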


About the Learning Rate…

In the case of the MSE loss shown on the left side of the table above, the learning rate used was:

Should the same learning rate be used when switching to the new loss function (cross-entropy) ?

Strictly speaking, it is not meaningful to talk about using the same learning rate when the loss function changes: it would be like comparing apples to oranges. In both cases, the learning rate was empirically determined (i.e., by experimenting with different values) so as to make the learning dynamics clearly observable. For the cross-entropy loss, the chosen value was .

🎯 The key point is not absolute speed

One might object that, since the learning rate was changed, the previous plots lose their significance.
After all, why should the speed at which the neuron learns matter if the learning rate was chosen arbitrarily?

❗ But such an objection misses the central point.

The key issue is not the absolute speed of learning, but rather how the learning speed changes.

  • With the quadratic loss function, learning is slower when the neuron is making large errors, and only accelerates as it approaches the desired output.
  • With cross-entropy, by contrast, learning is faster precisely when the neuron is far from the correct response.

These observations hold regardless of the specific learning rate chosen.
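The two bullet points can be checked directly: with $y = 0$ the error is just the output $a$, and the bias gradients of the two losses behave oppositely as the error grows (a minimal sketch, not from the original experiments):

```python
# Gradient of each loss w.r.t. the bias b of a single sigmoid neuron,
# expressed in terms of the output a (with y = 0, the error is simply a).
def grad_mse(a):
    return a * a * (1 - a)   # (a - y) * sigma'(z), with sigma'(z) = a(1 - a)

def grad_ce(a):
    return a                 # a - y: proportional to the error itself

for a in (0.5, 0.9, 0.99):   # increasing error
    print(f"error={a:.2f}  |grad MSE|={grad_mse(a):.4f}  |grad CE|={grad_ce(a):.2f}")
```

The MSE gradient actually *shrinks* as the error approaches its maximum, while the cross-entropy gradient keeps growing with the error.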


Cross Entropy loss: formal derivation

A more rigorous way to arrive at the same loss function derived above, rather than reasoning heuristically on a single neuron, is to appeal to Claude Shannon’s Information Theory, and in particular to the concept of cross-entropy.

By adapting the definition of cross-entropy from Information Theory to the neural network domain and interpreting it as a measure of error, the Cross-Entropy Loss is obtained:

$$L_{CE} = -\sum_{i} \bigl[\, y_i \ln a_i + (1 - y_i) \ln(1 - a_i) \,\bigr]$$

To intuitively motivate the superiority of the Cross-Entropy loss over Mean Squared Error (MSE), it’s useful to consider their fundamental differences. MSE is essentially a geometric ($L_2$) distance, which assumes that one is operating in a space where measuring an $L_2$ distance is meaningful. Its formula is:

$$L_{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - a_i)^2$$

In contrast, $L_{CE}$ measures the “distance” in terms of the bits of information between the desired outputs ($y_i$) and the actual outputs ($a_i$). This perspective is more consistent with the domain of neural networks, which are trained on data, that is, on information itself.
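As an illustration, both measures can be written in a few lines (hypothetical helper names; natural logarithms are used here, so the cross-entropy is measured in nats rather than bits):

```python
import math

def mse(y, a):
    """Mean squared error between desired outputs y and actual outputs a."""
    return sum((yi - ai) ** 2 for yi, ai in zip(y, a)) / len(y)

def cross_entropy(y, a):
    """Cross-entropy between desired outputs y and actual outputs a in (0, 1)."""
    return -sum(yi * math.log(ai) + (1 - yi) * math.log(1 - ai)
                for yi, ai in zip(y, a))

y     = [1.0, 0.0, 1.0]   # desired outputs
close = [0.9, 0.1, 0.8]   # predictions close to the targets
far   = [0.6, 0.4, 0.5]   # predictions far from the targets

# Both measures shrink as the outputs approach the targets:
print(cross_entropy(y, close) < cross_entropy(y, far))  # True
print(mse(y, close) < mse(y, far))                      # True
```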

The Loss Function That Shaped Deep Learning

Empirically, it is observed that using $L_{CE}$ helps neural networks converge much more effectively. This loss function was one of the key improvements that made Deep Learning practical, enabling networks to be trained faster by using a measure based on Shannon’s Information Theory in place of the $L_2$ distance.

$L_{CE}$ is a measure that approaches zero as the actual outputs become more similar to the desired outputs.

Cross-Entropy in Modern DL Frameworks

In Deep Learning frameworks like PyTorch, the general theoretical formula for Cross-Entropy ($L_{CE}$) is not typically implemented directly. Instead, specialized versions are provided: the formula above is a direct adaptation of Shannon’s cross-entropy from Information Theory to the context of neural networks, while the implementations found within these frameworks are practical, numerically stable specializations of that core concept.
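As a sketch of the kind of numerical stabilization such specializations perform (along the lines of PyTorch’s `BCEWithLogitsLoss`, which works on the pre-sigmoid logit $z$), the binary cross-entropy can be rewritten as $\max(z, 0) - z y + \ln(1 + e^{-|z|})$, a form that never overflows:

```python
import math

def bce_with_logits(z, y):
    """Numerically stable binary cross-entropy computed from the logit z.
    Algebraically equal to -(y*ln(a) + (1-y)*ln(1-a)) with a = sigmoid(z),
    but exp() is only ever called on a non-positive argument."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_naive(z, y):
    """Naive version: sigmoid first, then the log formula."""
    a = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# The two agree where the naive version is well behaved...
print(abs(bce_with_logits(3.0, 1.0) - bce_naive(3.0, 1.0)) < 1e-9)  # True
# ...but only the stable version survives an extreme logit
# (the naive one computes a == 1.0 and then log(0), a domain error):
print(bce_with_logits(1000.0, 0.0))  # 1000.0
```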

When should cross-entropy be used instead of MSE?

Cross-entropy is almost always the best choice, provided that the output neurons are sigmoid neurons.

To understand why, keep in mind that during a network’s initialization phase, the weights and biases are typically assigned randomly. It’s possible for this initial choice to cause the model to be confidently wrong for some training inputs: for example, an output neuron might saturate near 1 when it should actually return 0 (or vice-versa).

If the quadratic cost function (MSE) is being used in this scenario, learning will slow down significantly. Learning won’t stop entirely (because the network can still learn from other inputs), but the slowdown is nonetheless undesirable.