MSE loss function

The mean squared error (MSE) is the default loss for regression: tasks whose target is a real-valued quantity (a price, a temperature, a coordinate) rather than a discrete label. It measures the squared Euclidean gap between the network’s prediction and the target, and it is the loss against which every other loss in this section is contrasted.

Definition

For a single example with network output $a \in \mathbb{R}^{n}$ and target $y \in \mathbb{R}^{n}$, the per-example MSE loss is

L_{MSE} = \frac{1}{n} \sum_{j = 1}^{n} (y_{j} - a_{j})^{2} = \frac{1}{n} ∥ y - a ∥_{2}^{2}

Over a dataset of $N$ examples the objective is, as always, the empirical risk, the average of the per-example losses,

L_{MSE} (θ) = \frac{1}{N} \sum_{i = 1}^{N} \frac{1}{n} ∥ y^{(i)} - a^{(i)} ∥_{2}^{2} .
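As a minimal sketch of the two definitions above, the following pure-Python snippet computes the per-example loss and the dataset average. The helper names `mse_example` and `mse_dataset` and the numbers are illustrative assumptions, not part of the note.

```python
def mse_example(y, a):
    """Per-example MSE: (1/n) * sum_j (y_j - a_j)^2."""
    n = len(y)
    return sum((yj - aj) ** 2 for yj, aj in zip(y, a)) / n


def mse_dataset(targets, outputs):
    """Empirical risk: average of the per-example losses over N examples."""
    N = len(targets)
    return sum(mse_example(y, a) for y, a in zip(targets, outputs)) / N


# Illustrative data (assumed): N = 2 examples with n = 2 outputs each.
targets = [[1.0, 2.0], [0.0, -1.0]]
outputs = [[1.5, 2.0], [0.0, 1.0]]

per_example = [mse_example(y, a) for y, a in zip(targets, outputs)]
total = mse_dataset(targets, outputs)
print(per_example, total)
```

The second example has a single residual of size 2, and it dominates the average: the squaring is already visible at this scale.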

The factor in front is a convention

The leading constant ( $\frac{1}{n}$ , or $\frac{1}{2}$ , or $1$ ) does not change where the minimum sits; it only rescales the gradient, which is absorbed into the learning rate. Many derivations use $\frac{1}{2} ∥ y - a ∥^{2}$ for a single example, because the $2$ from differentiating the square then cancels and the gradient is exactly $a - y$ . This note uses whichever constant keeps a given formula cleanest.
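The effect of the constant can be checked directly. The sketch below uses the fact that the gradient of $c ∥ y - a ∥^{2}$ with respect to $a$ is $2 c (a - y)$; the function name and the sample vectors are assumptions chosen for illustration.

```python
def grad_scaled_sq_error(y, a, c):
    """Gradient of c * ||y - a||^2 with respect to a, i.e. 2c * (a - y)."""
    return [2.0 * c * (aj - yj) for yj, aj in zip(y, a)]


# Illustrative vectors (assumed).
y = [1.0, -2.0, 0.5]
a = [0.0, 1.0, 0.5]
n = len(y)

g_half = grad_scaled_sq_error(y, a, 0.5)      # 1/2 convention: exactly a - y
g_mean = grad_scaled_sq_error(y, a, 1.0 / n)  # 1/n convention
g_one = grad_scaled_sq_error(y, a, 1.0)       # no leading constant

print(g_half, g_mean, g_one)
```

All three gradients are multiples of the same vector $a - y$, so they point in the same direction and vanish at the same point $a = y$; only the step length differs.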

The simplest reading of MSE is geometric: it is the squared $L^{2}$ distance between prediction and target. Driving it to zero forces $a$ to coincide with $y$ , and because the penalty grows with the square of the gap, large errors are punished disproportionately more than small ones.

Why squared, and not the absolute error

Squaring has three consequences that make MSE the default:

It is smooth everywhere (unlike $∣ y - a ∣$, which has a kink where the residual is zero), so its gradient is well defined at every point during optimisation.

It penalises large errors heavily, which is desirable when big misses are much worse than small ones.

It has a clean probabilistic meaning, derived next, that fixes the otherwise arbitrary choice of exponent.

When robustness to outliers matters more than smoothness, the absolute error (L1) or the Huber loss are the usual alternatives, precisely because squaring makes MSE sensitive to a few extreme residuals.
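The sensitivity to outliers shows up most clearly in the gradient with respect to the residual $r = a - y$. The sketch below compares the three losses on a small set of residuals; the function names, the Huber threshold `delta = 1.0` and the residual values are assumptions for illustration.

```python
def grad_squared(r):
    """d/dr of 0.5 * r^2: grows linearly with the residual."""
    return r


def grad_absolute(r):
    """d/dr of |r|: constant magnitude (taking 0 at the kink r = 0)."""
    return (r > 0) - (r < 0)


def grad_huber(r, delta=1.0):
    """d/dr of the Huber loss: equal to r inside [-delta, delta], clipped outside."""
    return max(-delta, min(delta, r))


# Three small residuals and one outlier (assumed values).
residuals = [0.1, -0.2, 0.3, 10.0]

pulls = {
    "squared": [grad_squared(r) for r in residuals],
    "absolute": [grad_absolute(r) for r in residuals],
    "huber": [grad_huber(r) for r in residuals],
}
for name, grads in pulls.items():
    print(name, grads)
```

Under the squared error the outlier pulls on the parameters with weight 10, against at most 0.3 for the other points. The absolute and Huber losses cap that pull at 1.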

Where MSE comes from: Gaussian maximum likelihood

MSE is not an arbitrary geometric choice. It is the negative log-likelihood of the data under the assumption that the target is the network’s prediction corrupted by Gaussian noise.

MSE is maximum likelihood under Gaussian noise

Model the target as $y = a + ε$ with $ε \sim \mathcal{N} (0, σ^{2} I)$, so the prediction $a = f_{θ} (x)$ is the conditional mean and the residual is Gaussian. The likelihood of one observation is
$p (y ∣ x, θ) = \frac{1}{(2 π σ^{2})^{n / 2}} \exp \left( - \frac{∥ y - a ∥^{2}}{2 σ^{2}} \right) .$
Taking the negative logarithm and collecting the terms that do not depend on $θ$ into a constant gives
$- \log p (y ∣ x, θ) = \frac{1}{2 σ^{2}} ∥ y - a ∥^{2} + \text{const} .$
For a fixed noise variance $σ^{2}$, minimising the squared error is therefore exactly maximising the Gaussian likelihood of the data. This is the regression counterpart of the cross-entropy story for classification: both losses are the negative log-likelihood of a chosen output distribution, Gaussian for regression and categorical for classification.

This single fact explains both when MSE is the right loss and when it is not. It is right when the target is a continuous quantity whose errors are roughly symmetric and bell-shaped. It is the wrong loss for classification, where the target is a discrete label and the output is a probability: there the correct likelihood is Bernoulli or categorical, and its negative log gives cross-entropy, not MSE.
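The equivalence can be checked numerically. The sketch below evaluates the Gaussian negative log-likelihood and the scaled squared error for a few candidate predictions and confirms that they differ by the same constant each time. The function names, the noise level `sigma = 0.5` and the candidate values are assumptions for illustration.

```python
import math


def gaussian_nll(y, a, sigma):
    """-log p(y | x, theta) for y = a + eps, eps ~ N(0, sigma^2 I)."""
    n = len(y)
    sq = sum((yj - aj) ** 2 for yj, aj in zip(y, a))
    return sq / (2.0 * sigma ** 2) + 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)


def scaled_sq_error(y, a, sigma):
    """The theta-dependent part of the NLL: ||y - a||^2 / (2 sigma^2)."""
    return sum((yj - aj) ** 2 for yj, aj in zip(y, a)) / (2.0 * sigma ** 2)


y = [1.0, 2.0]
sigma = 0.5  # assumed fixed noise level
candidates = [[0.0, 0.0], [1.0, 1.5], [1.0, 2.0]]  # assumed predictions

# The difference should be the same constant for every candidate prediction.
offsets = [gaussian_nll(y, a, sigma) - scaled_sq_error(y, a, sigma) for a in candidates]

best_by_nll = min(candidates, key=lambda a: gaussian_nll(y, a, sigma))
best_by_sq = min(candidates, key=lambda a: scaled_sq_error(y, a, sigma))
print(offsets, best_by_nll, best_by_sq)
```

Because the offset does not depend on the prediction, ranking candidates by likelihood and ranking them by squared error give the same answer.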

Gradient

With the $\frac{1}{2}$ convention on a single example, the gradient of the loss with respect to the output is simply the signed residual,

\frac{\partial L _{MSE}}{\partial a} = a - y,

which is the cleanest possible learning signal: its negative, the direction of a gradient-descent step, points from the prediction straight toward the target, with a magnitude equal to the size of the miss. The trouble, as the rest of this note shows, begins only when this residual is multiplied by the slope of a saturating output activation.
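A finite-difference check is a quick way to confirm the formula. The sketch below compares $a - y$ with a central-difference estimate of the gradient of $\frac{1}{2} ∥ y - a ∥^{2}$, then takes one gradient-descent step directly on $a$. The helper names, the step size `eta = 0.5` and the vectors are illustrative assumptions.

```python
def half_sq_error(y, a):
    """0.5 * ||y - a||^2 for a single example."""
    return 0.5 * sum((yj - aj) ** 2 for yj, aj in zip(y, a))


def numerical_grad(y, a, eps=1e-6):
    """Central-difference estimate of the gradient with respect to a."""
    grad = []
    for j in range(len(a)):
        plus, minus = list(a), list(a)
        plus[j] += eps
        minus[j] -= eps
        grad.append((half_sq_error(y, plus) - half_sq_error(y, minus)) / (2.0 * eps))
    return grad


# Illustrative vectors (assumed).
y = [1.0, -2.0, 0.5]
a = [0.0, 1.0, 0.5]

analytic = [aj - yj for yj, aj in zip(y, a)]
numeric = numerical_grad(y, a)

# One descent step on the prediction itself: a moves toward y.
eta = 0.5  # assumed step size
a_next = [aj - eta * g for aj, g in zip(a, analytic)]
print(analytic, numeric, a_next)
```

With `eta = 0.5` the step halves every residual, which is the behaviour one would hope for when nothing sits between the loss and the prediction.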

In PyTorch

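import torch
import torch.nn as nn

prediction = torch.tensor([[2.5, 0.0], [1.0, -1.0]])  # example raw network outputs (assumed values)
target = torch.tensor([[3.0, 0.0], [1.0, 1.0]])       # same shape as prediction

criterion = nn.MSELoss()                    # mean over all elements (the default)
# criterion = nn.MSELoss(reduction='sum')   # sum instead of mean
loss = criterion(prediction, target)        # scalar tensor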

nn.MSELoss applies no activation of its own: it compares the two tensors element by element, so for regression the prediction should be the network's raw real-valued output. This is the correct default for regression. Pairing MSE with a sigmoid or softmax output, the setup analysed below, is where the learning slowdown appears.

The problem with MSE: slow learning from large mistakes

Author's anecdote

Mistakes are what trigger change. During a cryptography exercise I once gave the wrong answer for $5 mod 3$ ; I froze, the professor corrected me, and the embarrassment was sharp. Yet it is exactly those clear-cut mistakes that make learning fast: the next time, $7 mod 5$ came out right without hesitation. Failing to learn from mistakes is what undermines the whole process.

Question

One would expect a neural network to learn quickly from its mistakes too: the larger the mistake, the faster the correction. Does it actually behave that way? When MSE is paired with a saturating output, the answer is no.

Toy example: a sigmoid neuron with one input

Goal

Train a single sigmoid neuron on a trivial task: map the input $1$ to the output $0$ .

The values of the weight $w$ and bias $b$ that solve this could be written down by hand without any optimisation. The point is not the solution but the dynamics: how gradient descent approaches it, and how that depends on where it starts.

Across random initialisations the neuron usually reaches the target in under $100$ epochs, but for some unlucky configurations it takes more than $200$ . What decides the difference is entirely where the initial pre-activation $z = w x + b$ lands: in the responsive central region of the sigmoid, or deep in its saturated tail. The two runs below, trained with MSE from two such initialisations, make the gap visible.

🟢 MSE, lucky configuration $w, b$	🔴 MSE, unlucky configuration $w, b$

Input: $1.0$ → Desired output: $0.0$	Input: $1.0$ → Desired output: $0.0$
Initial $w$ : $0.6$	Initial $w$ : $2.0$
Initial $b$ : $0.9$	Initial $b$ : $2.0$
Pre-activation $z = w x + b = 0.6 \cdot 1 + 0.9 = 1.5$	Pre-activation $z = w x + b = 2.0 \cdot 1 + 2.0 = 4.0$
Initial output: $0.82$	Initial output: $0.98$
Final output: $\sim 0.09$	Final output: $\sim 0.20$
Learning rate: $η = 0.15$	Learning rate: $η = 0.15$
Loss: quadratic (MSE)	Loss: quadratic (MSE)
✅ Fast learning from the start	⏳ Plateau lasting $\sim 150$ epochs
✅ Substantial initial gradient	⚠️ Near-zero initial gradient
📉 Loss decreases immediately	💤 Flat curve, then a delayed descent
📌 $w, b$ update right away	📌 $w, b$ stay nearly constant for many iterations
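The two runs in the table can be reproduced approximately with a few lines of plain Python. This is a sketch under stated assumptions: it uses the $\frac{1}{2} (y - a)^{2}$ convention, a run length of 300 epochs (the note does not state one), and one gradient-descent update per epoch on the single training example. The function name `train_mse_neuron` is hypothetical, and the exact numbers behind the original plots may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_mse_neuron(w, b, x=1.0, y=0.0, eta=0.15, epochs=300):
    """Gradient descent on L = 0.5 * (y - a)^2 with a = sigmoid(w * x + b).

    Returns the output a before each update, followed by the final output.
    """
    outputs = []
    for _ in range(epochs):
        a = sigmoid(w * x + b)
        outputs.append(a)
        slope = a * (1.0 - a)         # sigma'(z)
        grad_w = (a - y) * slope * x  # dL/dw
        grad_b = (a - y) * slope      # dL/db
        w -= eta * grad_w
        b -= eta * grad_b
    outputs.append(sigmoid(w * x + b))
    return outputs


lucky = train_mse_neuron(w=0.6, b=0.9)
unlucky = train_mse_neuron(w=2.0, b=2.0)

for epoch in (0, 50, 100, 150, 200, 300):
    print(f"epoch {epoch:3d}  lucky a = {lucky[epoch]:.3f}  unlucky a = {unlucky[epoch]:.3f}")
```

Printing the output at a few checkpoints is enough to see the qualitative difference: the lucky run starts falling at once, while the unlucky run stays close to its initial value for a long stretch before it begins to move.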

Why is this behaviour backwards?

Compared with human learning it is inverted. A person tends to correct faster after a large, obvious mistake. The neuron does the opposite: the further its output starts from the target (initial output $0.98$ against a target of $0$, close to the largest error a sigmoid output can make), the slower it learns. A large error produces almost no initial gradient, and the parameters sit on a plateau for over a hundred epochs before anything happens.

This is not an artefact of the toy: the same slowdown appears in deep networks whenever a saturating output activation meets the MSE loss.

Why the learning is slow

The unlucky run combines a saturated neuron with the particular shape of the MSE gradient.

The neuron is saturated: its pre-activation is $z = w x + b = 2 \cdot 1 + 2 = 4$ , out in the flat tail of the sigmoid, where $σ (z) \approx 1$ and the slope $σ^{'} (z) = σ (z) (1 - σ (z))$ is near zero. The shaded band on the right of the plot is this region.

The MSE gradient carries that slope as a factor. Differentiating the per-example loss $L_{x} = \frac{1}{2} (y - a)^{2}$ by the chain rule, with $a = σ (z)$ and $z = w x + b$ , and specialising to the toy task ( $x = 1$ , $y = 0$ ):

\frac{\partial L_{x}}{\partial w} = \frac{\partial L_{x}}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w} = (a - y) \, σ^{'} (z) \, x \overset{x = 1, \, y = 0}{=} a \, σ^{'} (z),

\frac{\partial L_{x}}{\partial b} = \frac{\partial L_{x}}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial b} = (a - y) \, σ^{'} (z) \overset{y = 0}{=} a \, σ^{'} (z) .

In both, $σ^{'} (z)$ multiplies the residual $(a - y)$ .

Both gradient components vanish when the neuron saturates

With $σ^{'} (z) \approx 0$ , both $\partial L_{x} / \partial w$ and $\partial L_{x} / \partial b$ are near zero even when the error $(a - y)$ is large: the neuron is at its most wrong and corrects the least. This is the $\sim 150$ -epoch plateau of the unlucky run.
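Plugging the two initial pre-activations of the toy example into $a \, σ^{'} (z)$ makes the inversion concrete. The sketch below assumes the toy task ($x = 1$, $y = 0$) and the $\frac{1}{2}$ convention; the helper name `mse_grad` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_grad(z, y=0.0):
    """dL/dw = dL/db = (a - y) * sigma'(z) for x = 1 and L = 0.5 * (y - a)^2."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)


z_lucky, z_unlucky = 1.5, 4.0  # initial pre-activations of the two runs

err_lucky = sigmoid(z_lucky) - 0.0
err_unlucky = sigmoid(z_unlucky) - 0.0
grad_lucky = mse_grad(z_lucky)
grad_unlucky = mse_grad(z_unlucky)

print(f"lucky:   error = {err_lucky:.3f}, gradient = {grad_lucky:.4f}")
print(f"unlucky: error = {err_unlucky:.3f}, gradient = {grad_unlucky:.4f}")
```

The unlucky start has the larger error and the gradient that is several times smaller, because the factor $σ^{'} (z)$ shrinks faster than the residual grows.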

The same effect in a full network

In an MLP the output-layer error signal is $δ^{L} = \nabla_{a} L ⊙ σ^{'} (z^{L})$ , the first backpropagation equation: the residual is gated coordinate by coordinate by $σ^{'} (z^{L})$ . The output-layer gradient is sensitive to the saturation of the output neurons exactly as in the toy, so the same MSE pathology appears whenever a saturating output meets the squared-error loss. It is the saturation mechanism of the vanishing gradient, at the output layer.
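The coordinate-wise gating can be illustrated with a three-unit sigmoid output layer. The pre-activations and targets below are assumed values, chosen so that two units are saturated and badly wrong while one is in the responsive region; `output_delta_mse` is a hypothetical helper using the $\frac{1}{2}$ convention.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_delta_mse(z, y):
    """delta^L = (a - y) * sigma'(z^L), computed element by element."""
    a = [sigmoid(zj) for zj in z]
    return [(aj - yj) * aj * (1.0 - aj) for aj, yj in zip(a, y)]


z_L = [0.5, 6.0, -6.0]  # assumed: one responsive unit, two saturated units
y = [0.0, 0.0, 1.0]     # assumed targets: the saturated units are almost maximally wrong

residual = [sigmoid(zj) - yj for zj, yj in zip(z_L, y)]
delta = output_delta_mse(z_L, y)

for j, (r, d) in enumerate(zip(residual, delta)):
    print(f"unit {j}: residual = {r:+.4f}, delta = {d:+.4f}")
```

The two saturated units carry the largest residuals but the smallest error signals, so the layer's update is dominated by the unit that was least wrong.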

The fix

Toward cross-entropy

The cure is to keep the sigmoid output but change the loss so that its gradient no longer carries the $σ^{'} (z)$ factor. The loss that achieves exactly this is the cross-entropy loss, whose gradient with respect to the pre-activation is the bare residual $a - y$ , with no saturating slope to kill it. The construction, and the head-to-head comparison with the run above, is the subject of the next note.
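The claim about the gradient can be verified numerically before the full construction. The sketch below assumes the standard binary cross-entropy $- [ y \log a + (1 - y) \log (1 - a) ]$ with $a = σ (z)$, estimates its derivative with respect to $z$ by central differences at the unlucky starting point $z = 4$, $y = 0$, and compares it with the MSE gradient at the same point. The helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(z, y):
    """Binary cross-entropy of a sigmoid output, as a function of the pre-activation."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))


def numeric_dz(loss, z, y, eps=1e-6):
    """Central-difference estimate of d loss / d z."""
    return (loss(z + eps, y) - loss(z - eps, y)) / (2.0 * eps)


z, y = 4.0, 0.0  # the saturated start of the unlucky run
a = sigmoid(z)

ce_grad = numeric_dz(bce, z, y)    # should match the bare residual a - y
mse_grad = (a - y) * a * (1.0 - a)  # residual times sigma'(z)

print(f"a - y = {a - y:.4f}, cross-entropy dL/dz = {ce_grad:.4f}, MSE dL/dz = {mse_grad:.4f}")
```

At the same saturated point the cross-entropy gradient stays close to 1 while the MSE gradient is close to 0.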
