Building a new loss
Consider again the example of the single neuron. This time, however, the task is not simply to train it to map the unit input to zero: the aim is to design a better loss function.
🎯 Goal
The goal is to drop the $\sigma'(z)$ factor from the partial derivatives of the loss function (the quantities that drive gradient descent), in order to avoid the vanishing-gradient problem in the saturation region of the sigmoid.
$\sigma'(z)$ is a critical multiplicative factor: when $\sigma'(z) \approx 0$, the gradient vanishes, even if the error is large.
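As a quick numerical illustration of this point, the sketch below evaluates the MSE gradient $\frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)$ at a strongly saturated pre-activation; the specific values of $z$ and $y$ are arbitrary choices made here for illustration, not taken from the original experiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative values (assumptions): a saturated neuron that should output 0.
z = 4.0          # pre-activation deep in the saturation region
y = 0.0          # desired output
a = sigmoid(z)   # actual output, close to 1, so the error is large

mse_grad_b = (a - y) * sigmoid_prime(z)         # dC/db for the quadratic loss
print(f"error a - y = {a - y:.4f}")             # ~0.98: large error
print(f"sigma'(z)   = {sigmoid_prime(z):.4f}")  # ~0.018: tiny multiplicative factor
print(f"MSE dC/db   = {mse_grad_b:.4f}")        # ~0.017: the gradient has all but vanished
```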
Problem setup
From the perspective of functional analysis, the problem can be posed as follows:
identify a loss function $C$ whose derivatives with respect to $w$ and $b$ do not depend on the derivative $\sigma'(z)$ of the activation function.
For simplicity, let's focus on the partial derivative $\partial C / \partial b$.
By applying the chain rule

$$\frac{\partial C}{\partial b} = \frac{\partial C}{\partial a}\,\frac{\partial a}{\partial z}\,\frac{\partial z}{\partial b} = \frac{\partial C}{\partial a}\,\sigma'(z),$$

knowing that

$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) = a\,(1 - a),$$

and imposing the functional analysis constraint, namely, to find a loss function such that

$$\frac{\partial C}{\partial b} = a - y,$$

we get:

$$\frac{\partial C}{\partial a} = \frac{a - y}{a\,(1 - a)}.$$

Finally, integrating w.r.t. $a$ gives the new loss function:

$$C = -\bigl[\,y \ln a + (1 - y)\ln(1 - a)\,\bigr] + \text{const.}$$
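Before proving the integration formally, here is a quick numerical sanity check (a minimal sketch; the input $x$, target $y$, and parameter values are arbitrary assumptions for illustration). It compares the analytic gradient $\partial C/\partial b = a - y$ predicted by the derivation against a finite-difference estimate computed directly from the new loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def new_loss(w, b, x, y):
    """C = -[ y*ln(a) + (1 - y)*ln(1 - a) ] with a = sigmoid(w*x + b)."""
    a = sigmoid(w * x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Illustrative values (assumptions): a saturated starting point.
w, b, x, y = 2.0, 2.0, 1.0, 0.0
a = sigmoid(w * x + b)

analytic = a - y                          # gradient predicted by the derivation
eps = 1e-6                                # step for central finite differences
numeric = (new_loss(w, b + eps, x, y) - new_loss(w, b - eps, x, y)) / (2 * eps)

print(f"analytic dC/db = {analytic:.6f}")
print(f"numeric  dC/db = {numeric:.6f}")  # the two agree: no sigma'(z) factor appears
```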
Proof of the Integration
The calculation of $C$ from its partial derivative with respect to $a$ requires solving the following integral:

$$C = \int \frac{a - y}{a\,(1 - a)}\,da$$

The integral is solved using the method of partial fraction decomposition. The expression is set up as follows:

$$\frac{a - y}{a\,(1 - a)} = \frac{A}{a} + \frac{B}{1 - a}$$

The coefficients $A$ and $B$ are found by multiplying both sides by the common denominator $a\,(1 - a)$:

$$a - y = A\,(1 - a) + B\,a$$

The values of $A$ and $B$ are then determined by substituting strategic values for $a$:

- For $a = 0$: $\;-y = A \;\Rightarrow\; A = -y$
- For $a = 1$: $\;1 - y = B \;\Rightarrow\; B = 1 - y$

Substituting the values of $A$ and $B$ back into the decomposed expression gives:

$$\frac{a - y}{a\,(1 - a)} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$

The integration can now be performed term by term:

$$C = -y \int \frac{1}{a}\,da + (1 - y) \int \frac{1}{1 - a}\,da$$

Solving the two simple integrals yields:

- $\int \frac{1}{a}\,da = \ln a$
- $\int \frac{1}{1 - a}\,da = -\ln(1 - a)$ (found via u-substitution where $u = 1 - a$)

Substituting these results back into the equation gives the final expression:

$$C = -\bigl[\,y \ln a + (1 - y)\ln(1 - a)\,\bigr] + \text{const.}$$
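As an optional cross-check of this result, the short sympy sketch below (not part of the original derivation) verifies both the partial-fraction decomposition and the antiderivative symbolically.

```python
import sympy as sp

a, y = sp.symbols('a y')

# Integrand derived above: dC/da = (a - y) / (a * (1 - a))
integrand = (a - y) / (a * (1 - a))

# Candidate antiderivative obtained from the partial-fraction computation.
C = -(y * sp.log(a) + (1 - y) * sp.log(1 - a))

# Check 1: the decomposition -y/a + (1 - y)/(1 - a) equals the integrand.
decomposition = -y / a + (1 - y) / (1 - a)
print(sp.simplify(decomposition - integrand))   # prints 0

# Check 2: differentiating C w.r.t. a recovers the integrand.
print(sp.simplify(sp.diff(C, a) - integrand))   # prints 0
```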
This new loss function has the advantage of eliminating the $\sigma'(z)$ term from the derivatives, thus making the gradient more stable and informative, even in the presence of saturated activations.
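For completeness (this step is implied by the derivation above rather than spelled out in it), substituting the new loss back into the chain rule, with $\partial z/\partial w = x$ and $\partial z/\partial b = 1$, gives gradients in which $\sigma'(z)$ has cancelled out:

$$\frac{\partial C}{\partial w} = x\,(a - y), \qquad \frac{\partial C}{\partial b} = a - y.$$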
MSE vs Cross Entropy
| 🔴 MSE, unlucky configuration | 🟢 New Loss, same configuration |
|---|---|
| Input: 1 → Desired output: 0 | Input: 1 → Desired output: 0 |
| Initial $w$: | Initial $w$: |
| Initial $b$: | Initial $b$: |
| Initial output: | Initial output: |
| Final output: | Final output: |
| Learning rate: | Learning rate: |
| Loss function: quadratic (MSE) | Loss function: Cross-Entropy |
| ⏳ Slow learning: long-lasting initial plateau | ✅ Fast learning |
| ⚠️ Near-zero initial gradient | ✅ Substantial initial gradient |
| Initially flat curve, followed by a delayed descent | The loss function decreases immediately |
| $w$ and $b$ remain nearly constant for a considerable number of iterations | $w$ and $b$ are updated right from the start |
✅ The neuron learned quickly!
By using the cross-entropy loss, the neuron was able to learn quickly even when starting from a disadvantageous parameter configuration, exactly as hoped.
A closer look at the plots above shows that the initial slope of the cross-entropy loss curve is much steeper than the initial flat plateau observed with the MSE loss. It is precisely this steeper slope that cross-entropy provides: it allows the neuron to avoid early stagnation, right at the stage when it should be learning the most, namely, when it starts from a configuration very far from the correct target.
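The comparison can be reproduced with the self-contained sketch below. The starting weights, learning rate, and number of epochs are arbitrary illustrative assumptions (not the values used for the plots above, and, unlike there, the same learning rate is used for both losses); the point is only the qualitative behaviour, with the quadratic loss stalling at first while the cross-entropy loss makes progress immediately.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(loss, w, b, x, y, lr, epochs):
    """Gradient descent on a single sigmoid neuron a = sigmoid(w*x + b)."""
    history = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        if loss == "mse":
            delta = (a - y) * a * (1.0 - a)  # gradient keeps the sigma'(z) = a(1-a) factor
        else:                                # cross-entropy
            delta = a - y                    # sigma'(z) has cancelled out
        w -= lr * delta * x
        b -= lr * delta
        history.append(a)
    return history

# Illustrative "unlucky" setup (assumed values): map x = 1 to y = 0,
# starting from a saturated configuration, same learning rate for both losses.
x, y, lr, epochs = 1.0, 0.0, 0.15, 300
out_mse = train("mse", w=2.0, b=2.0, x=x, y=y, lr=lr, epochs=epochs)
out_ce  = train("ce",  w=2.0, b=2.0, x=x, y=y, lr=lr, epochs=epochs)

print(f"MSE: start {out_mse[0]:.3f}, after 100 epochs {out_mse[99]:.3f}, after 300 epochs {out_mse[-1]:.3f}")
print(f"CE : start {out_ce[0]:.3f}, after 100 epochs {out_ce[99]:.3f}, after 300 epochs {out_ce[-1]:.3f}")
```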
About the Learning Rate…
In the case of the MSE loss shown on the left-hand side of the table above, a particular learning rate was used.
Should the same learning rate be used when switching to the new loss function (cross-entropy)?
Strictly speaking, it is not meaningful to talk about using the same learning rate when the loss function changes: it would be like comparing apples to oranges. In both cases, the learning rate was determined empirically (i.e., by experimenting with different values) so as to make the learning dynamics clearly observable, and a different value was chosen for the cross-entropy loss.
🎯 The key point is not absolute speed
One might object that, since the learning rate was changed, the previous plots lose their significance.
After all, why should the speed at which the neuron learns matter if the learning rate was chosen arbitrarily? But such an objection misses the central point.
The key issue is not the absolute speed of learning, but rather how the learning speed changes.
- With the quadratic loss function, learning is slower when the neuron is making large errors, and only accelerates as it approaches the desired output.
- With cross-entropy, by contrast, learning is faster precisely when the neuron is far from the correct response.
These observations hold regardless of the specific learning rate chosen.
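A small numerical table makes these two bullet points concrete. The sketch below (illustrative only; the target value $y = 0$ is an assumption) tabulates the magnitude of $\partial C/\partial b$ for both losses at several output values, using $\sigma'(z) = a(1 - a)$: the quadratic-loss gradient is tiny when the output saturates far from the target, while the cross-entropy gradient is simply proportional to the error.

```python
# Magnitude of dC/db for the quadratic and cross-entropy losses at various outputs.
# Assumption for illustration: the desired output is y = 0.
y = 0.0
print(f"{'output a':>9} {'error a-y':>10} {'|dC/db| MSE':>12} {'|dC/db| CE':>11}")
for a in (0.99, 0.9, 0.5, 0.1):
    sigma_prime = a * (1.0 - a)       # sigma'(z) expressed through the output a
    grad_mse = (a - y) * sigma_prime  # quadratic loss: error times sigma'(z)
    grad_ce = a - y                   # cross-entropy: error alone
    print(f"{a:>9.2f} {a - y:>10.2f} {abs(grad_mse):>12.4f} {abs(grad_ce):>11.4f}")
```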