This note specialises the cross-entropy loss to multilabel classification: a multi-class problem in which the classes are not mutually exclusive, so a single example can carry several labels at once.
One item, several labels
On an e-commerce platform a single product can be tagged simultaneously as electronics, laptop, and gaming. The labels do not compete: assigning one does not forbid the others.
The model has one output neuron per class, and each output $\hat{y}_k$ estimates, independently, the probability that the example belongs to class $k$:
- each $\hat{y}_k$ is an independent probability for class $k$;
- there is no normalisation constraint, so $\sum_k \hat{y}_k \neq 1$ in general;
- an example may belong to zero, one, or several classes at the same time.
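The output layer described by these bullets can be sketched in plain Python. This is only an illustration: the class names, the pre-activations `z`, and the 0.5 decision threshold are assumptions made for the example, not values prescribed by this note.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real pre-activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical class names and pre-activations, for illustration only.
classes = ["electronics", "laptop", "gaming", "kitchen"]
z = [2.0, 1.5, 0.3, -3.0]

# One independent probability per class; nothing forces them to sum to 1.
probs = [sigmoid(v) for v in z]

# Assumed decision rule: assign a label when its probability exceeds 0.5.
predicted = [c for c, p in zip(classes, probs) if p > 0.5]

print([round(p, 3) for p in probs])
print(predicted)    # several labels can be selected at once
print(sum(probs))   # generally not equal to 1
```

With these values three of the four labels are selected at once, which a single-winner output could not express.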
Why a sigmoid on each output neuron
The sigmoid $\sigma(z) = 1 / (1 + e^{-z})$ maps any real pre-activation $z$ into $(0, 1)$, so each output can be read as a probability. While the sigmoid is rarely used in hidden layers today (replaced by ReLU and its family), it remains the natural choice for an output neuron whose value must be a standalone probability. Multilabel classification is the textbook case: each sigmoid output $\hat{y}_k$ independently estimates whether class $k$ is present.
Independent sigmoids do not form a distribution
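A layer of independent sigmoid outputs does not produce a probability distribution, because the activations need not sum to one. A concrete check with two output neurons: taking the illustrative pre-activations $z_1 = 2$ and $z_2 = 1$ gives $\sigma(2) \approx 0.881$ and $\sigma(1) \approx 0.731$, so the outputs sum to about $1.612$, not $1$.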
Sigmoid outputs are independent gates, not a distribution
Each sigmoid acts on its own neuron, with no coupling between outputs, so the vector $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_K)$ is a set of independent per-class probabilities rather than a single distribution over classes. A genuine distribution over mutually exclusive classes requires a softmax layer, which couples the outputs so they sum to one. The choice between the two is the choice between multilabel (this note) and multiclass-exclusive (NLL) classification.
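The difference between independent sigmoids and a softmax can be checked numerically. The sketch below uses plain Python and two illustrative pre-activations chosen for the example; the helper names `sigmoid` and `softmax` are defined here and are not taken from any library.

```python
import math

def sigmoid(z):
    """Logistic function applied to a single pre-activation."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax over a list of pre-activations."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0]  # illustrative pre-activations for two output neurons

sig = [sigmoid(v) for v in z]   # each output computed on its own
soft = softmax(z)               # outputs coupled through the shared denominator

print(sig, sum(sig))    # independent probabilities; the sum is about 1.61
print(soft, sum(soft))  # coupled probabilities; the sum is 1 up to rounding
```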
Cross-entropy for the multilabel case
Multilabel classification = many independent binary problems
A multilabel problem is a collection of independent binary classification problems, one per class, solved in parallel by the same network.
For each class $k$ the output neuron estimates $\hat{y}_k$, the probability that the example belongs to class $k$; the complement $1 - \hat{y}_k$ is the probability that it does not (a one-versus-all split, with all other classes lumped into “not $k$”). Each neuron therefore carries its own binary cross-entropy, and the total loss is their sum over the $K$ output neurons:

$$\mathcal{L}_{\text{multilabel}} = -\sum_{k=1}^{K} \left[ y_k \log \hat{y}_k + (1 - y_k) \log (1 - \hat{y}_k) \right]$$

where $y_k \in \{0, 1\}$ is the target for class $k$. This is the multilabel logistic loss, $\mathcal{L}_{\text{multilabel}}$. Each term is a complete BCE on one class: it rewards the model for raising $\hat{y}_k$ toward $1$ on present labels and lowering it toward $0$ on absent ones, independently for every class.
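As a rough illustration, the sum of per-class binary cross-entropies can be evaluated by hand for one example. The function name `multilabel_bce` and the target and probability values are assumptions made for this sketch, and the probabilities are assumed to lie strictly inside $(0, 1)$ so that every logarithm is defined.

```python
import math

def multilabel_bce(targets, probs):
    """Sum of per-class binary cross-entropies for a single example."""
    total = 0.0
    for y, p in zip(targets, probs):
        # One complete BCE term per class.
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Illustrative values: three classes, the first two labels present.
y = [1, 1, 0]
good = [0.9, 0.8, 0.1]  # high on present labels, low on the absent one
bad = [0.2, 0.8, 0.7]   # misses the first label, fires on the absent one

loss_good = multilabel_bce(y, good)
loss_bad = multilabel_bce(y, bad)

print(loss_good)
print(loss_bad)  # larger: two of the three per-class terms are penalised heavily
```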
| Symbol | Meaning |
|---|---|
| $\hat{y}_k$ | probability that the example belongs to class $k$ |
| $1 - \hat{y}_k$ | probability that it does not belong to class $k$ |
| $\mathcal{L}_{\text{multilabel}}$ | sum of the per-class binary cross-entropies |
Why "logistic"
The name refers to the activation: each $\hat{y}_k$ is produced by a sigmoid, also called the logistic function in statistics. For this reason the loss is exposed in frameworks such as PyTorch as `BCEWithLogitsLoss`: the loss takes the raw logits $z_k$ directly and applies the sigmoid internally, which is both more numerically stable and more efficient than applying the sigmoid as a separate step.
Numerical instability of log-based losses
Losses built on logarithms, binary and multilabel cross-entropy alike, can become unstable at the extremes of their inputs. When $\hat{y}_k$ reaches exactly $0$ or $1$, a $\log 0$ term appears and, propagated through backpropagation, produces `NaN` gradients that derail training. Two standard remedies:
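- clamp with an $\varepsilon$: replace $\log \hat{y}_k$ by $\log(\hat{y}_k + \varepsilon)$ and $\log(1 - \hat{y}_k)$ by $\log(1 - \hat{y}_k + \varepsilon)$, with a small $\varepsilon$ (values around $10^{-7}$ are typical);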
- better, use the logits form (`BCEWithLogitsLoss`), which fuses the sigmoid and the logarithm into a single stable expression and never materialises the dangerous intermediate $\hat{y}_k$.
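The two remedies can be compared on a saturated logit. The sketch below is plain Python, not PyTorch's implementation: the function names are hypothetical, the value `eps=1e-7` is an assumed choice, and the stable form uses the standard identity $-[y \log \sigma(z) + (1 - y) \log(1 - \sigma(z))] = \max(z, 0) - z y + \log(1 + e^{-|z|})$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(z, y):
    """Sigmoid first, then logarithms: breaks when the sigmoid saturates."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_clamped(z, y, eps=1e-7):
    """Epsilon remedy: finite, but the loss is capped near -log(eps)."""
    p = sigmoid(z)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def bce_from_logits(z, y):
    """Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z, y = 40.0, 0  # confident "present" prediction for an absent label

try:
    naive = bce_naive(z, y)
except ValueError:
    # sigmoid(40.0) rounds to exactly 1.0, so log(1 - p) is log(0.0).
    # Pure Python raises here; tensor libraries typically return inf instead.
    naive = None

clamped = bce_clamped(z, y)
stable = bce_from_logits(z, y)

print(naive)    # None: the naive form failed
print(clamped)  # finite, but far below the true loss of about 40
print(stable)   # about 40
```

Away from saturation the naive and logits forms agree, so the fused version changes only the numerics, not the loss being minimised.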
Multilabel versus multiclass-exclusive
It is worth setting the two sister losses side by side, because the choice of loss encodes a modelling assumption about the labels.
| Property | Multilabel (this note) | Multiclass-exclusive (NLL) |
|---|---|---|
| Output activation | sigmoid per neuron | single softmax |
| Outputs sum to one? | no, independent | yes, coupled |
| Labels per example | zero, one, or many | exactly one |
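| Loss | $\mathcal{L}_{\text{multilabel}} = -\sum_k \left[ y_k \log \hat{y}_k + (1 - y_k) \log(1 - \hat{y}_k) \right]$ | $\mathcal{L}_{\text{NLL}} = -\sum_k y_k \log \hat{y}_k$ |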
| Target | binary vector (any number of $1$s) | one-hot vector |
Choosing the multilabel loss when the classes are genuinely exclusive wastes the strong prior that exactly one class is correct; choosing softmax+NLL when several labels can co-occur forces a single winner where there should be many. The loss is part of the model.
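The "single winner" effect can be seen on a small example in which two classes are equally and strongly supported. The logits are illustrative values chosen for this sketch, and the helpers are defined locally in plain Python.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits: the first two classes are both strongly supported.
z = [5.0, 5.0, -5.0]

sig = [sigmoid(v) for v in z]
soft = softmax(z)

print([round(p, 3) for p in sig])   # both of the first two classes are close to 1
print([round(p, 3) for p in soft])  # the first two classes split the mass, each below 0.5
```

Under softmax the two co-occurring labels can never both approach probability one, because they compete for the same unit of mass.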
In PyTorch
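    import torch
    import torch.nn as nn

    # Example tensors with illustrative values only.
    logits = torch.randn(4, 3)                    # raw network outputs, shape (batch, n_classes)
    target = torch.randint(0, 2, (4, 3)).float()  # multi-hot float tensor, same shape, 1 for every present label

    criterion = nn.BCEWithLogitsLoss()  # fused sigmoid + BCE; default reduction="mean" averages over batch and classes
    loss = criterion(logits, target)

`BCEWithLogitsLoss` applied to a vector of logits computes the multilabel loss $\mathcal{L}_{\text{multilabel}}$ up to the choice of reduction: an independent binary cross-entropy per class, computed in a numerically stable way. With the default `reduction="mean"` the per-class terms are averaged over batch and classes; `reduction="sum"` returns their sum instead. The corresponding loss for the mutually-exclusive case is the negative log-likelihood loss, paired with a softmax output.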
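As a sanity check, the fused loss can be compared against the explicit formula on random data. This is a sketch under stated assumptions: the tensor shapes, the seed, and the tolerance `atol=1e-5` are arbitrary choices for the example, and the logits are small enough that the explicit sigmoid does not saturate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)                    # (batch, n_classes), illustrative values
target = torch.randint(0, 2, (4, 3)).float()  # multi-hot targets

# Explicit formula: sigmoid first, then one BCE term per element.
p = torch.sigmoid(logits)
manual = -(target * torch.log(p) + (1 - target) * torch.log(1 - p))

summed = nn.BCEWithLogitsLoss(reduction="sum")(logits, target)
mean = nn.BCEWithLogitsLoss()(logits, target)  # default reduction="mean"

sum_matches = torch.allclose(summed, manual.sum(), atol=1e-5)
mean_matches = torch.allclose(mean, manual.mean(), atol=1e-5)

print(sum_matches)   # fused loss with reduction="sum" vs. summed explicit terms
print(mean_matches)  # default fused loss vs. mean of explicit terms
```

To sum over classes but average over the batch, one option is `reduction="none"` followed by `.sum(dim=1).mean()`.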