You learn by making mistakes
Author's anecdote
It’s mistakes that trigger change. For example, during a Cryptography class exercise, I gave the wrong answer to a computation. I was nervous and froze until the professor corrected me. It was very embarrassing. And yet, as frustrating as it was, it’s precisely those clear-cut mistakes that make you learn faster: the next time, I got the computation right. Conversely, failing to learn from one’s mistakes undermines the learning process.
Question
Ideally, one would expect neural networks to learn quickly from their mistakes as well. But what happens in practice?
To answer that, consider the following example.
Toy Example: Sigmoid Neuron with a Single Input
Goal
The goal is to train this neuron to perform a trivial task: mapping a given input $x$ to a desired output $y$.
Of course, this is such a simple problem that one could easily determine suitable values for the weight $w$ and the bias $b$ by hand, without relying on any learning algorithm.
Nevertheless, it is instructive to observe how the neuron learns through gradient descent.
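For reference in what follows, here are the neuron’s forward pass and the quadratic loss on a single training pair $(x, y)$, written with the symbols used throughout this section:

$$a = \sigma(z), \qquad z = wx + b, \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad C = \frac{1}{2}\,(a - y)^2.$$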

Let’s examine the learning process in more detail (a runnable sketch that reproduces both behaviors follows the table).
| 🟢MSE, lucky configuration | 🔴 MSE, unlucky configuration |
|---|---|
| Input: → Desired output: | Input: → Desired output: |
| Initial $w$: | Initial $w$: |
| Initial $b$: | Initial $b$: |
| Initial output: | Initial output: |
| Final output: | Final output: |
| Learning rate: | Learning rate: |
| Loss function: quadratic (MSE) | Loss function: quadratic (MSE) |
| ✅ “Fast” learning (though the number of epochs needed is hardly impressive) | ⏳ Slow learning: plateau lasting ~150 epochs |
| ✅ Substantial initial gradient | ⚠️ Near-zero initial gradient |
| 📉 The loss function decreases immediately | 💤 Initially flat curve, followed by a delayed descent. |
| 📌 $w$ and $b$ are updated right from the start | 📌 $w$ and $b$ remain nearly constant for a considerable number of iterations |
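The plateau in the “unlucky” configuration can be reproduced with a few lines of NumPy. The sketch below is purely illustrative: the input, target, initial parameters, learning rate, and number of epochs are placeholder values chosen to produce one healthy start and one saturated start; they are not necessarily the exact figures from the table above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(w, b, x=1.0, y=0.0, eta=0.15, epochs=300):
    """Gradient descent on the quadratic loss C = (a - y)^2 / 2
    for a single-input sigmoid neuron a = sigmoid(w*x + b)."""
    losses = []
    for _ in range(epochs):
        z = w * x + b
        a = sigmoid(z)
        losses.append(0.5 * (a - y) ** 2)
        # MSE gradient: note the sigma'(z) = a * (1 - a) factor in both components.
        grad_common = (a - y) * a * (1.0 - a)
        w -= eta * grad_common * x
        b -= eta * grad_common
    return np.array(losses)

# Hypothetical initializations: a mild start vs. a saturated start.
lucky = train(w=0.6, b=0.9)    # small net input -> sizeable gradient
unlucky = train(w=2.0, b=2.0)  # large net input -> saturated sigmoid

print("loss after 50 epochs  (lucky / unlucky):", lucky[49], unlucky[49])
print("loss after 300 epochs (lucky / unlucky):", lucky[-1], unlucky[-1])
```

Plotting `lucky` and `unlucky` against the epoch index reproduces the two loss curves: an immediate descent in the first case, and an initial plateau followed by a delayed descent in the second.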
Why is this behavior unusual?
This behavior is unusual when compared to the human learning process.
As noted at the beginning of this section, human beings often learn more quickly when they make big mistakes.
Yet, as just shown, the artificial neuron struggles far more to learn when its error is large than when it is small.
Not an isolated case
This behavior is not limited to this simple example; it also appears in more complex neural networks.
What can be done?
- Why is the learning process so slow?
- Is it possible to find a way to mitigate this slowdown in learning?
🔍 1. Why is the learning process so slow?
Neuron saturation (due to the sigmoid) + gradient structure with MSE
In the case of a single neuron that must map the input $x$ to the desired output $y$,
the slowdown in learning is observed with the “unlucky” initial values of $w$ and $b$ shown in the table above.
In this configuration, the net input

$$z = wx + b$$

is large, which places the neuron in the saturation region (see figure below): the sigmoid function is nearly flat there, and its derivative $\sigma'(z)$ is close to zero.

🧠 The crucial point is that, when using the MSE loss function, neuron saturation directly implies a vanishing gradient, because the derivative of the sigmoid (which is close to zero in the saturation region) appears as a multiplicative factor in all components of the gradient.

Specifically, as illustrated below, the partial derivatives of the single-sample loss $C = \frac{1}{2}(a - y)^2$, with $a = \sigma(z)$, with respect to the parameters $w$ and $b$ are:

$$\frac{\partial C}{\partial w} = (a - y)\,\sigma'(z)\,x, \qquad \frac{\partial C}{\partial b} = (a - y)\,\sigma'(z).$$
❗ In both gradient components, the derivative $\sigma'(z)$ appears as a multiplicative factor.
Whenever $\sigma'(z) \approx 0$, the entire gradient vanishes, even if the error $(a - y)$ is large (i.e., the neuron is making a big mistake). ❗

⚠️ Therefore, the slowdown in learning is caused by the combination of two factors (illustrated numerically in the sketch after this list):
- Neuron saturation: $\sigma'(z) \approx 0$
- The gradient structure induced by MSE, where $\sigma'(z)$ multiplies every component
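To see both factors at once, the snippet below evaluates the two gradient components at a saturated configuration; the parameter and data values are illustrative placeholders, not the table’s figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradients(w, b, x, y):
    """Partial derivatives of C = (a - y)^2 / 2 for a = sigmoid(w*x + b)."""
    z = w * x + b
    a = sigmoid(z)
    sigma_prime = a * (1.0 - a)         # sigma'(z): close to zero when the neuron saturates
    dC_dw = (a - y) * sigma_prime * x   # sigma'(z) multiplies both components
    dC_db = (a - y) * sigma_prime
    return a, sigma_prime, dC_dw, dC_db

# A saturated neuron that is badly wrong: large error, yet tiny gradients.
a, sp, gw, gb = mse_gradients(w=2.0, b=2.0, x=1.0, y=0.0)
print(f"output a = {a:.3f}, error a - y = {a - 0.0:.3f}")
print(f"sigma'(z) = {sp:.4f}")
print(f"dC/dw = {gw:.4f}, dC/db = {gb:.4f}")
```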

🔍 2. Possible solution
Cross-entropy loss
Based on the issues discussed above, one possible solution is to replace the MSE with a different loss function, known as the cross-entropy loss.
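As a brief preview of why this helps (a sketch, not the full derivation): with the binary cross-entropy loss $C = -\,[\,y \ln a + (1 - y)\ln(1 - a)\,]$ and a sigmoid output, the $\sigma'(z)$ factor cancels out and the gradient becomes proportional to the error $(a - y)$ itself. The comparison below uses the same illustrative, saturated configuration as before.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_dw(w, b, x, y):
    """dC/dw under MSE vs. binary cross-entropy for a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    sigma_prime = a * (1.0 - a)
    grad_mse = (a - y) * sigma_prime * x  # vanishes when the neuron saturates
    grad_ce = (a - y) * x                 # sigma'(z) cancels: driven by the error alone
    return grad_mse, grad_ce

# Same saturated, badly wrong neuron as before (placeholder values).
g_mse, g_ce = gradients_dw(w=2.0, b=2.0, x=1.0, y=0.0)
print(f"MSE gradient dC/dw:           {g_mse:.4f}")
print(f"Cross-entropy gradient dC/dw: {g_ce:.4f}")
```

Even though the output is equally wrong in both cases, only the cross-entropy gradient stays large, so the neuron keeps learning quickly from its big mistake.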