In a binary classification problem the output belongs to one of two categories, for example dog and cat. To handle them numerically the two classes are encoded dichotomously, as $0$ and $1$.
One output neuron is enough for binary classification
A single output neuron suffices, because its value $\hat{y}$ can be read as the probability of belonging to one of the two classes. In a two-class problem the probability of the other class is just the complement,

$$P(\text{class } 0) = 1 - P(\text{class } 1) = 1 - \hat{y},$$

so knowing one probability fixes the other. A sigmoid output is the natural choice: it maps any real pre-activation into a valid probability.
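As a small illustration of the sigmoid-plus-complement idea, the sketch below maps a few pre-activations to a pair of class probabilities. The names `sigmoid`, `p_one` and `p_zero`, and the convention that the neuron reports the probability of class 1, are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Logistic function, branched so that exp() never sees a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

for z in (-4.0, 0.0, 4.0):
    p_one = sigmoid(z)       # probability of class 1, read off the single neuron
    p_zero = 1.0 - p_one     # probability of class 0, obtained as the complement
    print(f"z={z:+.1f}  P(class 1)={p_one:.4f}  P(class 0)={p_zero:.4f}")
```

Whatever the value of `z`, the two numbers lie in $[0, 1]$ and sum to one, which is why a second output neuron would add no information.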
Specialising cross-entropy to two classes
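The general cross-entropy loss over a set of output units indexed by $k$, with targets $y_k$ and predicted probabilities $\hat{y}_k$, is

$$L = -\sum_{k} y_k \log \hat{y}_k.$$

In binary classification there is only one output neuron, producing $\hat{y}$, yet the explicit loss still contains two terms, one for each class. The reason is the complement.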
Why two terms from a single neuron
There is one neuron, with output $\hat{y}$, but two classes to account for:
- the probability of class $1$ is the output itself, $\hat{y}_1 = \hat{y}$;
- there is no separate neuron for class $0$; its probability is the complement, $\hat{y}_0 = 1 - \hat{y}$;
- the same holds for the targets, $y_1 = y$ and $y_0 = 1 - y$.
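The bookkeeping in this list can be sketched in a few lines. The example assumes the general loss has the form $-\sum_k y_k \log \hat{y}_k$; the helper names `two_class_view` and `cross_entropy` are hypothetical and exist only for this illustration.

```python
import math

def two_class_view(y, y_hat):
    """Expand one target and one output into (targets, probabilities) over classes 0 and 1."""
    targets = [1 - y, y]
    probs = [1.0 - y_hat, y_hat]
    return targets, probs

def cross_entropy(targets, probs):
    """General cross-entropy -sum_k t_k * log(p_k); terms with t_k = 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)

targets, probs = two_class_view(y=1, y_hat=0.8)
loss_two_units = cross_entropy(targets, probs)

# The same quantity written directly in terms of the single neuron's output.
loss_one_neuron = -(1 * math.log(0.8) + (1 - 1) * math.log(1 - 0.8))

print(loss_two_units, loss_one_neuron)
```

Both numbers agree, which is the point of the construction: the second "unit" is fully determined by the first.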
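Substituting $\hat{y}_1 = \hat{y}$, $\hat{y}_0 = 1 - \hat{y}$, $y_1 = y$, $y_0 = 1 - y$ into the general cross-entropy $-\sum_k y_k \log \hat{y}_k$ collapses the sum over two units into a single expression in $y$ and $\hat{y}$.
The result is the binary cross-entropy (BCE) loss:

$$\boxed{L_{\text{BCE}}(y, \hat{y}) = -\bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr]}$$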
Read it one label at a time and it is transparent. The target $y$ is either $0$ or $1$, so exactly one term survives:

$$L_{\text{BCE}} = \begin{cases} -\log \hat{y} & \text{if } y = 1, \\ -\log(1 - \hat{y}) & \text{if } y = 0. \end{cases}$$

When the true class is $y = 1$, the loss is $-\log \hat{y}$: zero when the model is fully confident and correct ($\hat{y} = 1$), and growing without bound as the model becomes confidently wrong ($\hat{y} \to 0$). The case $y = 0$ is the mirror image. The loss therefore penalises confident mistakes far more harshly than hesitant ones, which is exactly the behaviour a classifier should be trained on.
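To make the asymmetry concrete, the sketch below evaluates the surviving term for a true label of 1 at a few predicted probabilities. The function name `bce` and the chosen probabilities are illustrative assumptions, and the natural logarithm is used.

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for a hard label y in {0, 1}; only the surviving term is evaluated."""
    return -math.log(y_hat) if y == 1 else -math.log(1.0 - y_hat)

# True class is 1; sweep the prediction from confidently right to confidently wrong.
for y_hat in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"y=1  y_hat={y_hat:<4}  loss={bce(1, y_hat):.3f}")
```

The loss rises from about 0.01 at $\hat{y} = 0.99$ to about 4.6 at $\hat{y} = 0.01$, while a hesitant $\hat{y} = 0.5$ costs about 0.69.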
Two roads to the same loss
The BCE expression can be reached from two directions, and seeing both is worthwhile.
| Road 1: functional analysis on one neuron | Road 2: Shannon cross-entropy, instantiated |
|---|---|
| Start from a single sigmoid neuron whose output is read as a probability. Demand a loss whose parameter gradients do not contain the saturating factor $\sigma'(z)$, and integrate the resulting condition. This is the derivation carried out in full in Cross-entropy loss; it yields the boxed BCE directly. | Start from the general cross-entropy between two distributions. Instantiate it on the two-class case, with the single estimated probability $\hat{y}$ and its complement $1 - \hat{y}$, and the same boxed BCE falls out. |
Both roads end at the identical formula: the first explains why this loss cures the MSE slowdown, the second explains what it measures (the cross-entropy between the true label distribution and the predicted one).
The gradient is the bare residual
The defining virtue of BCE, the reason it replaces MSE for classification, is what happens to its gradient. With $\hat{y} = \sigma(z)$, the per-example gradient with respect to the pre-activation $z$ is

$$\frac{\partial L_{\text{BCE}}}{\partial z} = \hat{y} - y,$$

with no $\sigma'(z)$ factor. The saturating slope $\sigma'(z) = \hat{y}(1 - \hat{y})$ that crippled MSE cancels exactly against the $1 / \bigl(\hat{y}(1 - \hat{y})\bigr)$ that BCE contributes, leaving the clean signed residual. A confidently wrong neuron ($\hat{y} \approx 0$ deep in saturation, while $y = 1$) now receives a gradient of magnitude $\approx 1$, the largest possible, instead of the vanishing gradient MSE would have given it. The cancellation is derived step by step in Cross-entropy loss.
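The residual form of the gradient can be checked numerically with a central finite difference. This is a sketch under stated assumptions: the helper names are hypothetical, the MSE convention is taken to be $\tfrac{1}{2}(\hat{y} - y)^2$, and the test point $z = -3$ with $y = 1$ is chosen to represent a confidently wrong neuron.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """BCE as a function of the pre-activation z."""
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def mse_from_logit(z, y):
    """Assumed convention: half squared error on the sigmoid output."""
    return 0.5 * (sigmoid(z) - y) ** 2

def central_difference(f, z, y, eps=1e-6):
    """Numerical derivative of f with respect to z."""
    return (f(z + eps, y) - f(z - eps, y)) / (2.0 * eps)

z, y = -3.0, 1                       # sigmoid(-3) is about 0.047, but the label is 1
residual = sigmoid(z) - y            # the claimed BCE gradient
bce_grad = central_difference(bce_from_logit, z, y)
mse_grad = central_difference(mse_from_logit, z, y)

print(f"residual={residual:.4f}  dBCE/dz={bce_grad:.4f}  dMSE/dz={mse_grad:.4f}")
```

At this point the BCE gradient matches the residual (about $-0.95$), whereas the MSE gradient is the same residual multiplied by $\hat{y}(1 - \hat{y}) \approx 0.045$, so it is more than twenty times smaller.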
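Over a dataset of $N$ examples the objective is the average,

$$J = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[\, y^{(i)} \log \hat{y}^{(i)} + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - \hat{y}^{(i)}\bigr) \Bigr].$$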
What BCE measures, in information terms
BCE is the cross-entropy between two Bernoulli distributions: the true one, which puts all its mass on the correct label, and the predicted one, $\mathrm{Bernoulli}(\hat{y})$. Minimising it is equivalent to minimising the Kullback-Leibler divergence $D_{\mathrm{KL}}(p \,\|\, q)$ between the true distribution $p$ and the predicted distribution $q$, since the two differ only by the entropy of the labels, which does not depend on the model (and is zero for hard $0/1$ labels). In maximum-likelihood terms, minimising BCE is maximising the Bernoulli likelihood of the observed labels, the classification counterpart of the Gaussian likelihood behind MSE.
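Both statements can be written out in two lines. The notation here is an assumption made for this sketch: $p = (1 - y,\, y)$ is the true label distribution, $q = (1 - \hat{y},\, \hat{y})$ the predicted one, and $H$ denotes entropy.

$$
\begin{aligned}
H(p, q) &= -\sum_{k \in \{0, 1\}} p_k \log q_k = H(p) + D_{\mathrm{KL}}(p \,\|\, q), \\
-\log P(y \mid \hat{y}) &= -\log\bigl[\hat{y}^{\,y} (1 - \hat{y})^{1 - y}\bigr] = -\bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr].
\end{aligned}
$$

The first line shows that, with $H(p)$ fixed by the data, lowering the cross-entropy can only lower the divergence. The second shows that the negative log of the Bernoulli likelihood of a single label is the BCE term for that label.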
In PyTorch
PyTorch exposes two interfaces, and choosing the right one matters for numerical stability.
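```python
import torch
import torch.nn as nn

# Example inputs (illustrative values): raw logits and float targets in {0, 1}
logits = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.0, 0.0, 1.0])

# 1) BCELoss: expects probabilities (apply the sigmoid yourself)
criterion = nn.BCELoss()
probs = torch.sigmoid(logits)
loss = criterion(probs, target)   # target in {0, 1}, stored as floats

# 2) BCEWithLogitsLoss: expects raw logits, applies the sigmoid internally
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, target)  # numerically stable, preferred
```

Prefer `BCEWithLogitsLoss` over `sigmoid` + `BCELoss`

Computing the sigmoid and then the logarithm separately is numerically fragile: $\log \hat{y}$ and $\log(1 - \hat{y})$ diverge to $-\infty$ when $\hat{y}$ saturates to exactly $0$ or $1$ in floating point, which in a naive implementation produces infinite losses and `NaN` gradients. (`nn.BCELoss` guards against this by clamping its log outputs to at least $-100$, at the cost of a loss that no longer tracks the true error in that regime.) `BCEWithLogitsLoss` fuses the sigmoid and the log into a single expression (the log-sum-exp trick) that is stable across the whole range of logits. Feeding it raw logits, rather than post-sigmoid probabilities, is the recommended pattern.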
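The idea behind the fused form can be illustrated without PyTorch. The sketch below is not PyTorch's implementation; it uses one common stable rearrangement, $\max(z, 0) - z\,y + \log\bigl(1 + e^{-|z|}\bigr)$, and the function names `naive_bce` and `stable_bce_with_logits` are hypothetical.

```python
import math

def naive_bce(z, y):
    """Sigmoid first, then log: breaks once the sigmoid rounds to exactly 0 or 1."""
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def stable_bce_with_logits(z, y):
    """Fused form: max(z, 0) - z*y + log(1 + exp(-|z|)); exp() only sees non-positive inputs."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Moderate logit: both versions agree.
print(naive_bce(0.5, 1), stable_bce_with_logits(0.5, 1))

# Large logit with the wrong label: the sigmoid rounds to 1.0, so log(1 - y_hat) is log(0).
try:
    naive_bce(40.0, 0)
except ValueError as err:
    print("naive version failed:", err)

print(stable_bce_with_logits(40.0, 0))  # finite, close to 40
```

In plain Python the naive version raises an error at `log(0)`; with tensors the same situation surfaces as an infinite value instead. The fused form never takes the log of a saturated probability, which is why it stays finite.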
Where this leads
Binary cross-entropy is the building block for the next two settings. When a single example can carry several independent labels at once, the answer is one BCE per output neuron, summed: the multilabel logistic loss. When the classes are instead mutually exclusive, the sigmoid-per-neuron picture is replaced by a single softmax over all classes, giving the negative log-likelihood loss.