This note specialises the cross-entropy loss to multilabel classification: a multi-class problem in which the classes are not mutually exclusive, so a single example can carry several labels at once.

One item, several labels

On an e-commerce platform a single product can be tagged simultaneously as electronics, laptop, and gaming. The labels do not compete: assigning one does not forbid the others.

The model produces one output neuron per class, and each output $\hat{y}_k$ estimates, independently, the probability that the example belongs to class $k$:

  • each $\hat{y}_k \in [0, 1]$ is an independent probability for class $k$;
  • there is no normalisation constraint, so $\sum_k \hat{y}_k \neq 1$ in general;
  • an example may belong to zero, one, or several classes at the same time.

Why a sigmoid on each output neuron

The sigmoid $\sigma(z) = 1 / (1 + e^{-z})$ maps any real pre-activation $z$ into $(0, 1)$, so each output can be read as a probability. While the sigmoid is rarely used in hidden layers today (replaced by ReLU and its family), it remains the natural choice for an output neuron whose value must be a standalone probability. Multilabel classification is the textbook case: each sigmoid output $\hat{y}_k = \sigma(z_k)$ independently estimates whether class $k$ is present.
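As a minimal sketch of this per-class reading, the snippet below applies a sigmoid to three pre-activations, one per label of the product example. The logit values and the helper name sigmoid are illustrative assumptions, not taken from any trained model or library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, branched on the sign of z to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical pre-activations for three classes (illustrative values only).
logits = {"electronics": 3.0, "laptop": 0.5, "gaming": -2.0}

# Each probability depends only on its own logit; no other class is involved.
probs = {label: sigmoid(z) for label, z in logits.items()}

for label, p in probs.items():
    print(f"{label}: {p:.3f}")
```

Changing one logit changes only the matching probability, which is what makes each output a standalone estimate.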

Independent sigmoids do not form a distribution

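A layer of independent sigmoid outputs does not produce a probability distribution, because the activations need not sum to one. A concrete check with two output neurons: for pre-activations $z_1 = 2$ and $z_2 = 1$, the outputs are $\sigma(2) \approx 0.881$ and $\sigma(1) \approx 0.731$, which sum to about $1.612$ rather than $1$.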

Sigmoid outputs are independent gates, not a distribution

Each sigmoid acts on its own neuron, with no coupling between outputs, so the vector $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_K)$ is a set of independent per-class probabilities rather than a single distribution over classes. A genuine distribution over mutually exclusive classes requires a softmax layer, which couples the outputs so they sum to one. The choice between the two is the choice between multilabel (this note) and multiclass-exclusive (NLL) classification.
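The contrast between the two output layers can be checked numerically. The sketch below feeds the same two logits through independent sigmoids and through a softmax; the logit values and helper names are assumptions chosen for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax with the maximum subtracted first, for numerical safety."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0]  # illustrative pre-activations for two output neurons

sig = [sigmoid(z) for z in logits]  # each output computed on its own
soft = softmax(logits)              # outputs coupled through the shared denominator

sig_sum = sum(sig)
soft_sum = sum(soft)
print(f"sigmoid outputs {sig} sum to {sig_sum:.3f}")
print(f"softmax outputs {soft} sum to {soft_sum:.3f}")
```

Only the softmax sum is pinned to one; the sigmoid sum can land anywhere between zero and the number of classes.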

Cross-entropy for the multilabel case

Multilabel classification = many independent binary problems

A multilabel problem is a collection of independent binary classification problems, one per class, solved in parallel by the same network.

For each class $k$ the output neuron estimates $\hat{y}_k$, the probability that the example belongs to class $k$; the complement $1 - \hat{y}_k$ is the probability that it does not (a one-versus-all split, with all other classes lumped into “not $k$”). Each neuron therefore carries its own binary cross-entropy, and the total loss is their sum over the $K$ output neurons:

$$\mathcal{L}_{\text{multilabel}} = -\sum_{k=1}^{K} \left[ y_k \log \hat{y}_k + (1 - y_k) \log(1 - \hat{y}_k) \right]$$

where $y_k \in \{0, 1\}$ is the target for class $k$. This is the multilabel logistic loss, $\mathcal{L}_{\text{multilabel}}$. Each term is a complete BCE on one class: it rewards the model for raising $\hat{y}_k$ toward $1$ on present labels and lowering it toward $0$ on absent ones, independently for every class.

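| Symbol | Meaning |
| --- | --- |
| $\hat{y}_k$ | probability that the example belongs to class $k$ |
| $1 - \hat{y}_k$ | probability that it does not belong to class $k$ |
| $\mathcal{L}_{\text{multilabel}}$ | sum of the per-class binary cross-entropies |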
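The sum of per-class binary cross-entropies can be written out directly in plain Python. This is a sketch for moderate logits only: it applies no clamping, and the logits, targets, and the helper names sigmoid and multilabel_loss are hypothetical choices made for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_loss(logits, targets):
    """Sum over classes of the binary cross-entropy between sigmoid(z_k) and y_k."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)  # predicted probability that class k is present
        # One complete BCE term for this class (no clamping: moderate logits only).
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

logits = [3.0, 0.5, -2.0]  # illustrative pre-activations for three classes
targets = [1, 1, 0]        # multi-hot target: the first two labels are present

loss = multilabel_loss(logits, targets)
flipped = multilabel_loss(logits, [0, 0, 1])  # same logits, opposite labels
print(f"loss = {loss:.4f}, loss with flipped targets = {flipped:.4f}")
```

The loss is small when confident outputs agree with the labels and grows sharply when they disagree, class by class.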

Why "logistic"

The name refers to the activation: each $\hat{y}_k$ is produced by a sigmoid, also called the logistic function in statistics. For this reason the loss is exposed in frameworks such as PyTorch as BCEWithLogitsLoss: the network outputs the raw logits $z_k$, and the loss applies the sigmoid internally, which is both more numerically stable and more efficient than applying the sigmoid as a separate step.

Numerical instability of log-based losses

Losses built on logarithms, binary and multilabel cross-entropy alike, can become unstable at the extremes of their inputs. When $\hat{y}_k$ reaches exactly $0$ or $1$ in floating point, a $\log 0$ term appears and, propagated through backpropagation, produces infinite or NaN gradients that derail training. Two standard remedies:

  • clamp with an $\epsilon$: replace $\log \hat{y}_k$ by $\log(\hat{y}_k + \epsilon)$ and $\log(1 - \hat{y}_k)$ by $\log(1 - \hat{y}_k + \epsilon)$, with a small $\epsilon > 0$;
  • better, use the logits form (BCEWithLogitsLoss), which fuses the sigmoid and the logarithm into a single stable expression and never materialises the dangerous intermediate $\hat{y}_k = \sigma(z_k)$.
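The difference between the two routes shows up on a single large logit. The sketch below compares a naive sigmoid-then-log computation with one common stable rewriting, $\max(z, 0) - z y + \log(1 + e^{-|z|})$, which is algebraically the same binary cross-entropy. The function names are hypothetical, and the formula is offered as an illustration of the idea, not as the exact code inside any framework.

```python
import math

def bce_naive(z: float, y: float) -> float:
    """Sigmoid first, logarithm second: breaks when the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(z: float, y: float) -> float:
    """Same loss computed from the logit: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z, y = 40.0, 0  # confident prediction of "present" for an absent label

stable = bce_from_logit(z, y)  # finite, close to 40

try:
    naive = bce_naive(z, y)
except ValueError:
    # sigmoid(40) rounds to exactly 1.0 in float64, so log(1 - p) is log(0);
    # Python's math.log raises here, where tensor libraries would return -inf.
    naive = None

print("stable:", stable, "naive:", naive)
```

For moderate logits the two functions agree to rounding error; they only part ways once the sigmoid saturates.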

Multilabel versus multiclass-exclusive

It is worth setting the two sister losses side by side, because the choice of loss encodes a modelling assumption about the labels.

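| Property | Multilabel (this note) | Multiclass-exclusive (NLL) |
| --- | --- | --- |
| Output activation | sigmoid per neuron | single softmax |
| Outputs sum to one? | no, independent | yes, coupled |
| Labels per example | zero, one, or many | exactly one |
| Loss | $-\sum_k \left[ y_k \log \hat{y}_k + (1 - y_k) \log(1 - \hat{y}_k) \right]$ | $-\log \hat{y}_c$, with $c$ the true class |
| Target | binary vector (any number of $1$s) | one-hot vector |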

Choosing the multilabel loss when the classes are genuinely exclusive wastes the strong prior that exactly one class is correct; choosing softmax+NLL when several labels can co-occur forces a single winner where there should be many. The loss is part of the model.
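The difference in targets can be made concrete with a small encoding sketch. The class list below extends the product example with a hypothetical fourth class, kitchen, and the helper names multi_hot and one_hot are assumptions for illustration.

```python
# Hypothetical class vocabulary; "kitchen" is added only to show an absent label.
classes = ["electronics", "laptop", "gaming", "kitchen"]

def multi_hot(labels, classes):
    """Multilabel target: a 1.0 for every present label, any number of them."""
    present = set(labels)
    return [1.0 if c in present else 0.0 for c in classes]

def one_hot(label, classes):
    """Multiclass-exclusive target: exactly one 1.0."""
    return [1.0 if c == label else 0.0 for c in classes]

tagged = multi_hot(["electronics", "laptop", "gaming"], classes)
untagged = multi_hot([], classes)      # zero labels is a valid multilabel target
exclusive = one_hot("laptop", classes)

print(tagged, untagged, exclusive)
```

A multi-hot vector can hold the all-zero case and the several-ones case, neither of which a one-hot vector can express.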

In PyTorch

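import torch
import torch.nn as nn

# logits: raw network outputs, shape (batch, n_classes); random values stand in for a model here
logits = torch.randn(2, 3)
# target: multi-hot float tensor, same shape, with a 1 for every present label
target = torch.tensor([[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
# Fused sigmoid + BCE; the default reduction="mean" averages over batch and classes
# (pass reduction="sum" to add the per-class terms instead)
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, target)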

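BCEWithLogitsLoss applied to a vector of logits computes an independent binary cross-entropy per class in a numerically stable way. With reduction="sum" on a single example it is exactly $\mathcal{L}_{\text{multilabel}}$; the default reduction="mean" averages over every batch element and class instead, which gives the same loss up to a constant factor. The corresponding loss for the mutually-exclusive case is the negative log-likelihood loss, paired with a softmax output.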