Building a new loss

Consider again the example of the single neuron. This time, however, the task is not to train it to produce a zero output for the unit input.

🎯 Goal

The goal is to eliminate the factor $\sigma'(z)$ from the derivatives of the loss function (the derivatives that drive gradient descent), in order to avoid the vanishing-gradient problem in the saturation region of the sigmoid.

In the derivatives of the quadratic loss, $\partial C/\partial w = (a - y)\,\sigma'(z)\,x$ and $\partial C/\partial b = (a - y)\,\sigma'(z)$ (where $z = wx + b$ is the weighted input, $a = \sigma(z)$ the output and $y$ the desired output), $\sigma'(z)$ is a critical multiplicative factor: when $\sigma'(z) \approx 0$, the gradient vanishes, even if the error $a - y$ is large.
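As a minimal numerical sketch of this effect (with illustrative values, input $x = 1$ and target $y = 0$, not necessarily those of the original example), the snippet below shows how $\sigma'(z)$, and with it the quadratic-loss gradient, collapses as $z$ grows, even though the error stays large:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative configurations (values made up for this sketch).
x, y = 1.0, 0.0
for z in (0.0, 2.0, 5.0, 10.0):
    a = sigmoid(z)
    grad_w_mse = (a - y) * sigmoid_prime(z) * x   # dC/dw for the quadratic loss
    print(f"z = {z:4.1f}   a = {a:.5f}   error = {a - y:.5f}   "
          f"sigma'(z) = {sigmoid_prime(z):.2e}   dC/dw = {grad_w_mse:.2e}")
```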

Problem setup

From the perspective of functional analysis, the problem can be posed as follows:
identify a loss function $C$ whose derivatives with respect to $w$ and $b$ do not depend on the derivative $\sigma'(z)$ of the activation function.

For simplicity, let’s focus on the partial derivative $\partial C / \partial b$.

By applying the chain rule

$$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = \frac{\partial C}{\partial a}\,\sigma'(z),$$

knowing that

$$\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) = a\,(1 - a),$$

and imposing the functional analysis constraint, namely, to find a loss function such that

$$\frac{\partial C}{\partial b} = a - y,$$

we get:

$$\frac{\partial C}{\partial a} = \frac{a - y}{a\,(1 - a)}.$$

Finally, integrating w.r.t. $a$ gives the new loss function (up to an additive constant):

$$C = -\bigl[\,y \ln a + (1 - y)\ln(1 - a)\,\bigr],$$

i.e., the cross-entropy.

This new loss function has the advantage of eliminating the $\sigma'(z)$ term from the derivatives, thus making the gradient more stable and informative, even in the presence of saturated activations.
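This can be checked symbolically. The following SymPy sketch (using the notation above: $z = wx + b$, $a = \sigma(z)$, target $y$) differentiates both losses with respect to $b$ and verifies that the cross-entropy derivative reduces to the bare error $a - y$, while the quadratic one keeps the $\sigma'(z) = a(1 - a)$ factor:

```python
import sympy as sp

w, b, x, y = sp.symbols('w b x y', real=True)
z = w * x + b
a = 1 / (1 + sp.exp(-z))                              # sigmoid output

C_mse = (y - a)**2 / 2                                # quadratic loss
C_ce = -(y * sp.log(a) + (1 - y) * sp.log(1 - a))     # cross-entropy loss

# The quadratic-loss derivative still carries sigma'(z) = a(1 - a):
print(sp.simplify(sp.diff(C_mse, b) - (a - y) * a * (1 - a)))  # should print 0

# The cross-entropy derivative reduces to the plain error a - y:
print(sp.simplify(sp.diff(C_ce, b) - (a - y)))                 # should print 0
```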


MSE vs Cross Entropy

| 🔴 MSE, unlucky configuration | 🟢 New loss, same configuration |
| --- | --- |
| Input: → desired output: | Input: → desired output: |
| Initial $w$: | Initial $w$: |
| Initial $b$: | Initial $b$: |
| Initial output: | Initial output: |
| Final output: | Final output: |
| Learning rate: | Learning rate: |
| Loss function: quadratic (MSE) | Loss function: cross-entropy |
| ⏳ Slow learning: plateau lasting many epochs | ✅ Fast learning |
| ⚠️ Near-zero initial gradient | ✅ Substantial initial gradient |
| 💀 Initially flat curve, followed by a delayed descent | 📉 The loss decreases immediately |
| 📌 $w$ and $b$ remain nearly constant for a considerable number of iterations | 📌 $w$ and $b$ are updated right from the start |

✅ The neuron learned quickly!

By using the cross-entropy loss, the neuron was able to learn quickly even when starting from a disadvantageous parameter configuration, exactly as hoped.
A closer look at the plots above shows that the initial slope of the loss curve is much steeper than the flat initial plateau observed with the MSE loss.

It is precisely this steeper slope that cross-entropy provides: it allows the neuron to avoid early stagnation, right at the stage when it should be learning the most, namely, when starting from a configuration very far from the correct target.
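As a rough reproduction of this behaviour, the sketch below runs plain gradient descent on a single sigmoid neuron under both losses, starting from a saturated, “unlucky” configuration. The starting values, target, and learning rates are assumptions chosen for the demo, not the ones used to produce the plots above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(loss, w, b, eta, epochs=300, x=1.0, y=0.0):
    """Plain gradient descent on a single sigmoid neuron; returns the loss history."""
    history = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        if loss == "mse":
            history.append(0.5 * (y - a) ** 2)
            delta = (a - y) * a * (1 - a)          # gradient keeps the sigma'(z) factor
        else:
            history.append(-(y * np.log(a) + (1 - y) * np.log(1 - a)))
            delta = a - y                          # sigma'(z) has cancelled out
        w -= eta * delta * x                       # dC/dw = delta * x
        b -= eta * delta                           # dC/db = delta
    return history

# Hypothetical "unlucky" start (saturated, far from the target y = 0); these
# numbers are assumptions, not the values behind the original plots.
mse_hist = train("mse", w=2.0, b=2.0, eta=0.15)
ce_hist = train("ce", w=2.0, b=2.0, eta=0.15)
for name, hist in (("MSE          ", mse_hist), ("cross-entropy", ce_hist)):
    print(name, [round(hist[t], 3) for t in (0, 10, 50, 100, 299)])
```

Printed this way, the quadratic-loss values should barely move over the first hundred or so epochs before finally dropping, while the cross-entropy values decrease from the very first updates.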


About the Learning Rate…

In the case of the MSE loss shown on the left side of the table above, the learning rate used was:

Should the same learning rate be used when switching to the new loss function (cross-entropy)?

Strictly speaking, it is not meaningful to talk about using the same learning rate when the loss function changes: it would be like comparing apples to oranges. In both cases, the learning rate was determined empirically (i.e., by experimenting with different values) so as to make the learning dynamics clearly observable; for the cross-entropy loss, a different value was chosen in the same way.

🎯 The key point is not absolute speed

One might object that, since the learning rate was changed, the previous plots lose their significance.
After all, why should the speed at which the neuron learns matter if the learning rate was chosen arbitrarily?

❗But such an objection misses the central point.

The key issue is not the absolute speed of learning, but rather how the learning speed changes.

  • With the quadratic loss function, learning is slower when the neuron is making large errors, and only accelerates as it approaches the desired output.
  • With cross-entropy, by contrast, learning is faster precisely when the neuron is far from the correct response.

These observations hold regardless of the specific learning rate chosen.
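A quick way to see this, independently of the learning rate (which only rescales the updates), is to compare the gradient magnitudes themselves. The sketch below sweeps the neuron’s output $a$ for an illustrative target $y = 0$ and prints $|\partial C/\partial b|$ under the two losses:

```python
# Gradient magnitude as a function of the current error, for a single sigmoid
# neuron with target y = 0 (illustrative choice; any learning rate would only
# rescale these numbers, not change their trend).
y = 0.0
for a in (0.5, 0.7, 0.9, 0.99):
    error = a - y
    grad_b_quadratic = error * a * (1 - a)   # |dC/db| under the quadratic loss
    grad_b_cross_entropy = error             # |dC/db| under cross-entropy
    print(f"a = {a:.2f}   error = {error:.2f}   "
          f"quadratic: {grad_b_quadratic:.4f}   cross-entropy: {grad_b_cross_entropy:.4f}")
```

The quadratic-loss gradient shrinks as the output saturates towards the (wrong) value $a \approx 1$, whereas the cross-entropy gradient grows with the error, which is exactly the behaviour described above.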