Formal Derivation

A more rigorous way to arrive at the same loss function derived in the previous note, rather than reasoning heuristically about a single neuron, is to appeal to Claude Shannon’s Information Theory, and in particular to the concept of cross-entropy.

By adapting the definition of cross-entropy from Information Theory to the neural network domain and interpreting it as a measure of error, the Cross-Entropy Loss is obtained:

$$C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a_j + (1 - y_j) \ln (1 - a_j) \right]$$

where the outer sum runs over the $n$ training examples $x$, the inner sum runs over the output neurons $j$, $y_j$ is the desired output, and $a_j$ is the actual output of neuron $j$.

To intuitively motivate the superiority of the Cross-Entropy loss over Mean Squared Error (MSE), it’s useful to consider their fundamental differences. MSE is essentially a geometric (Euclidean, $\ell_2$) distance, which assumes we are operating in a space where measuring a Euclidean distance is meaningful. Its formula is:

$$C_{\text{MSE}} = \frac{1}{2n} \sum_x \lVert y(x) - a(x) \rVert^2$$

In contrast, cross-entropy measures the “distance” in terms of bits of information between the desired outputs ($y$) and the actual outputs ($a$). This perspective is more consistent with the domain of neural networks, which are trained on data, that is, on information itself.
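
To make the contrast concrete, here is a minimal NumPy sketch (with made-up numbers) that evaluates both formulas on a single training example: a prediction that is confidently wrong is penalized far more heavily by cross-entropy than by MSE.

```python
import numpy as np

def mse(y, a):
    # Quadratic cost for one example: half the squared Euclidean distance
    # between desired and actual outputs.
    return 0.5 * np.sum((y - a) ** 2)

def cross_entropy(y, a):
    # Binary cross-entropy summed over the output neurons.
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1.0, 0.0])           # desired outputs
a_good = np.array([0.9, 0.1])      # confident and correct
a_bad = np.array([0.1, 0.9])       # confident and wrong

print(mse(y, a_good), cross_entropy(y, a_good))  # ~0.01 and ~0.21: both small
print(mse(y, a_bad), cross_entropy(y, a_bad))    # ~0.81 vs ~4.6: CE penalizes far more
```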

The Loss Function That Shaped Deep Learning

Empirically, it is observed that using cross-entropy helps neural networks converge much more effectively. This loss function was one of the key improvements that made Deep Learning practical, enabling networks to be trained faster by using a measure based on Shannon’s Information Theory in place of the Euclidean distance.

Cross-entropy is a measure that approaches zero as the actual outputs become more similar to the desired outputs.
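
A quick numerical illustration of this property (a sketch with arbitrary values): as the actual outputs $a$ approach the desired outputs $y = (1, 0)$, the cross-entropy shrinks toward zero.

```python
import numpy as np

def cross_entropy(y, a):
    # Binary cross-entropy summed over the output neurons.
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1.0, 0.0])
for a in ([0.6, 0.4], [0.9, 0.1], [0.99, 0.01], [0.999, 0.001]):
    print(a, cross_entropy(y, np.array(a)))
# The value decreases roughly as 1.02 -> 0.21 -> 0.02 -> 0.002.
```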

Cross-Entropy in Modern DL Frameworks

In Deep Learning frameworks like PyTorch, the general theoretical formula for Cross-Entropy ($C$) is not typically implemented directly. Instead, specialized versions are provided. The formula itself is a direct adaptation of Shannon’s cross-entropy from Information Theory to the context of neural networks; the implementations found within these frameworks are the practical, numerically stable specializations of that core concept.
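
As an illustration, here is a minimal PyTorch sketch (the tensor values are arbitrary) of two of those specializations: `nn.BCEWithLogitsLoss` for sigmoid/binary outputs, which fuses the sigmoid and the logarithm into one numerically stable step, and `nn.CrossEntropyLoss` for multi-class outputs, which fuses `log_softmax` with the negative log-likelihood.

```python
import torch
import torch.nn as nn

# Binary case: the network produces raw scores (logits); BCEWithLogitsLoss
# applies the sigmoid and the log together, avoiding overflow for large logits.
logits = torch.tensor([2.5, -1.0, 0.3])   # raw outputs for 3 examples
targets = torch.tensor([1.0, 0.0, 1.0])   # desired outputs y
bce = nn.BCEWithLogitsLoss()
print(bce(logits, targets))

# Multi-class case: CrossEntropyLoss expects logits of shape (N, C) and
# integer class indices of shape (N,).
logits_mc = torch.tensor([[1.2, -0.4, 0.1],
                          [0.3,  2.0, -1.5]])  # 2 examples, 3 classes
labels = torch.tensor([0, 1])                  # desired class indices
ce = nn.CrossEntropyLoss()
print(ce(logits_mc, labels))
```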

When should cross-entropy be used instead of MSE?

Cross-entropy is almost always the best choice, provided that the output neurons are sigmoid neurons.

To understand why, keep in mind that during a network’s initialization phase, the weights and biases are typically assigned randomly. It’s possible for this initial choice to cause the model to be confidently wrong for some training inputs: for example, an output neuron might saturate near 1 when it should actually return 0 (or vice versa).

If the quadratic cost function (MSE) is being used in this scenario, learning will slow down significantly: its gradient carries a factor of $\sigma'(z)$, which is nearly zero for a saturated neuron, whereas cross-entropy cancels this factor and keeps the gradient proportional to the error $(a - y)$. Learning won’t stop entirely (because the network can still learn from other inputs), but the slowdown is nonetheless undesirable.
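
The slowdown can be seen in a minimal NumPy sketch of a single sigmoid neuron (weight, bias, and input are made-up numbers), using the standard single-neuron gradients $\partial C/\partial w = (a - y)\,\sigma'(z)\,x$ for the quadratic cost and $\partial C/\partial w = (a - y)\,x$ for cross-entropy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single sigmoid neuron that is "confidently wrong": the desired output is 0,
# but an unlucky initialization pushes it deep into saturation near 1.
x, y = 1.0, 0.0          # input and desired output
w, b = 4.0, 4.0          # unlucky initial weight and bias
z = w * x + b
a = sigmoid(z)            # ~0.9997, saturated at the wrong end

sigma_prime = a * (1 - a)              # sigmoid derivative, ~3e-4 here

grad_mse = (a - y) * sigma_prime * x   # quadratic cost: gradient carries sigma'(z)
grad_ce = (a - y) * x                  # cross-entropy: the sigma'(z) factor cancels

print(a, grad_mse, grad_ce)  # tiny MSE gradient (~3e-4) vs. a CE gradient near 1
```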