Multiclass exclusive classification is the most common classification setting in practice: each example belongs to exactly one of $K$ categories. Digit recognition (one digit per image), object classification (one label per crop), and language modelling (one next token) are all of this form.
Binary problems are often an idealisation
Strictly, many “binary” problems are a coarse approximation of a graded reality. In medical imaging, for instance, there is a whole spectrum of nuance between the extremes benign and malignant. Multiclass exclusive classification is the natural generalisation when the categories are several but still mutually exclusive.
The network has one output neuron per category, and the desired output for a given input is a one-hot vector: a $1$ in the position of the correct class, $0$ everywhere else.
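As a small illustration of the target format, the sketch below builds a one-hot vector from a class index. The helper name `one_hot` and the ten-class digit example are assumptions chosen for illustration, not part of any library.

```python
def one_hot(class_index, num_classes):
    """Return a list with 1 at class_index and 0 everywhere else."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    return [1 if k == class_index else 0 for k in range(num_classes)]


# Hypothetical example: the digit "3" among ten digit classes.
target = one_hot(3, num_classes=10)
print(target)  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```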
The output must be a distribution: softmax
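Because the classes are mutually exclusive, the outputs should form a genuine probability distribution: non-negative and summing to one. Independent sigmoids cannot do this (their outputs need not sum to one, as shown for the multilabel case). The construction that does is the softmax, which couples the outputs,
$$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K,$$
where $z_k$ is the logit (pre-activation) of class $k$ and $\hat{y}_k$ is the probability assigned to it.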
The softmax makes the classes compete for a single unit of probability: raising one output lowers the others. Its properties are developed in Softmax and Cross-Entropy and Softmax properties.
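The competition can be seen numerically. The following is a minimal sketch in plain Python; the logit values are arbitrary, and the `softmax` and `sigmoid` helpers are written here for illustration rather than taken from a library.

```python
import math


def softmax(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logits = [2.0, 1.0, 0.1]                     # arbitrary logits for 3 classes
probs = softmax(logits)                      # coupled outputs
independent = [sigmoid(z) for z in logits]   # uncoupled outputs

print(sum(probs))        # one, up to floating-point rounding
print(sum(independent))  # not constrained to be one

# Raise only the first logit: its probability grows, all others shrink.
raised = softmax([3.0, 1.0, 0.1])
print(probs, raised)
```

Subtracting the maximum logit does not change the result, because the common factor cancels between numerator and denominator.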
From cross-entropy to negative log-likelihood
The loss for this setting is the cross-entropy between the one-hot target $y$ and the softmax output $\hat{y}$,
$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k.$$
Here the one-hot structure of $y$ does something clean: every $y_k$ is zero except for the correct class $c$, where $y_c = 1$. The whole sum therefore collapses to a single term,
$$L = -\log \hat{y}_c,$$
the negative log-likelihood of the correct class. The loss depends only on the probability the model assigned to the right answer: it ignores how the remaining mass is distributed among the wrong classes, and cares only that this probability, $\hat{y}_c$, be pushed toward $1$.
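The collapse of the sum can be checked directly. This is a sketch with made-up probability vectors: both place the same mass on the correct class but spread the remainder differently, and the full cross-entropy sum comes out the same for each.

```python
import math


def cross_entropy(one_hot, probs):
    """Full sum over classes; terms with a zero target contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))


one_hot = [1, 0, 0]          # correct class is index 0
c = 0

probs_a = [0.7, 0.2, 0.1]    # illustrative prediction
probs_b = [0.7, 0.05, 0.25]  # same mass on the correct class, different spread

full_a = cross_entropy(one_hot, probs_a)
full_b = cross_entropy(one_hot, probs_b)
single = -math.log(probs_a[c])   # the single surviving term

print(full_a, full_b, single)    # all three agree
```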
Why this is "negative log-likelihood"
The softmax output $\hat{y}_c$ is the probability the model assigns to the observed label. For a dataset of $N$ independent examples, with $c_i$ the correct class of example $i$, the likelihood is $\prod_{i=1}^{N} \hat{y}^{(i)}_{c_i}$; taking the negative logarithm turns the product into the sum $-\sum_{i=1}^{N} \log \hat{y}^{(i)}_{c_i}$. Minimising the NLL loss is therefore exactly maximising the likelihood of the correct labels under the model’s categorical distribution. This is the multiclass counterpart of the Bernoulli likelihood behind BCE and the Gaussian likelihood behind MSE: all three losses are negative log-likelihoods, differing only in the assumed output distribution.
Reading the single term confirms the behaviour wanted of a classification loss: $-\log \hat{y}_c$ is $0$ when the model is certain and correct ($\hat{y}_c = 1$) and grows without bound as it becomes confidently wrong ($\hat{y}_c \to 0$).
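Over a dataset of $N$ examples the objective is the average,
$$J = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{y}^{(i)}_{c_i},$$
where $\hat{y}^{(i)}_{c_i}$ is the probability the model assigns to the correct class $c_i$ of example $i$.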
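The relation between the dataset likelihood, the summed loss, and the averaged objective can be sketched in a few lines. The probabilities below are hypothetical values standing in for what a model assigned to the correct class of each of four examples.

```python
import math

# Hypothetical probabilities assigned to the correct class, one per example.
p_correct = [0.9, 0.6, 0.99, 0.2]

likelihood = math.prod(p_correct)                 # product over the dataset
nll_sum = -sum(math.log(p) for p in p_correct)    # sum of per-example losses
nll_mean = nll_sum / len(p_correct)               # the averaged objective

# The negative log of the product equals the sum of the per-example terms.
print(-math.log(likelihood), nll_sum, nll_mean)

# Behaviour of a single term: zero when certain and correct,
# large when the correct class receives almost no probability.
print(-math.log(1.0), -math.log(1e-9))
```

Working with the sum of logarithms rather than the raw product also avoids underflow when the dataset is large, since a product of many numbers below one shrinks toward zero quickly.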
The gradient: the clean residual again
Pairing softmax with this loss reproduces the cancellation that made BCE work. The gradient of the loss with respect to the logits is the signed residual between the predicted distribution and the one-hot target,
$$\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k,$$
with no softmax Jacobian left to attenuate it: confidently wrong predictions receive the largest gradient. The derivation, in which the softmax derivative and the log derivative cancel, is carried out in Softmax and Cross-Entropy and reused in Backpropagation Through Time.
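The residual form of the gradient can be verified numerically without any autograd library. The sketch below compares the analytic expression (softmax output minus one-hot target) against central finite differences of the loss; the logits, the correct-class index, and the step size `eps` are arbitrary illustrative choices.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits, c):
    """Loss for one example: minus the log of the softmax probability of class c."""
    return -math.log(softmax(logits)[c])


logits = [1.5, -0.3, 0.8]   # illustrative values
c = 1                       # index of the correct class
one_hot = [1.0 if k == c else 0.0 for k in range(len(logits))]

# Analytic gradient: predicted distribution minus one-hot target.
analytic = [p - y for p, y in zip(softmax(logits), one_hot)]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for k in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[k] += eps
    down[k] -= eps
    numeric.append((nll(up, c) - nll(down, c)) / (2 * eps))

print(analytic)
print(numeric)
```

The entries of the gradient sum to zero, since both the prediction and the target sum to one; only the correct class receives a negative entry, so a gradient-descent step raises its logit and lowers the others.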
In PyTorch: a naming subtlety worth knowing
PyTorch splits this loss across two objects, and confusing them is a common source of silent bugs.
```python
import torch
import torch.nn as nn

logits = torch.randn(8, 5)           # example batch: 8 samples, 5 classes
target = torch.randint(0, 5, (8,))   # class indices, not one-hot

# Option A: log-softmax in the model, NLLLoss as the criterion
model_tail = nn.LogSoftmax(dim=1)    # outputs log-probabilities
criterion = nn.NLLLoss()             # expects log-probabilities + integer class index
loss = criterion(model_tail(logits), target)

# Option B (preferred): raw logits straight into CrossEntropyLoss
criterion = nn.CrossEntropyLoss()    # = LogSoftmax + NLLLoss, fused and stable
loss = criterion(logits, target)     # target: class indices
```
`NLLLoss` expects log-probabilities, not probabilities
Despite its name, `nn.NLLLoss` does not apply a logarithm. It expects inputs that are already log-probabilities (the output of `LogSoftmax`) and simply selects and negates the entry of the true class. Feeding it raw probabilities, or raw logits, gives a wrong loss with no error raised. The robust pattern is `nn.CrossEntropyLoss` fed raw logits: it fuses `LogSoftmax` and `NLLLoss` into one numerically stable operation (via the log-sum-exp trick), so the softmax is never materialised and `log(0)` cannot occur. The target in both cases is a tensor of class indices, not a one-hot matrix.
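The silent failure is easy to reproduce. The sketch below assumes a random batch of four examples and three classes (shapes, seed, and targets are arbitrary). It checks that the two correct routes agree with a manual computation, and shows that passing probabilities to `nn.NLLLoss` runs without error but returns a different, negative number.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)             # arbitrary batch: 4 examples, 3 classes
target = torch.tensor([0, 2, 1, 2])    # class indices, not one-hot

# Two equivalent correct routes.
ce = nn.CrossEntropyLoss()(logits, target)
nll_ok = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

# Manual version: pick the log-probability of the true class and average.
manual = -F.log_softmax(logits, dim=1)[torch.arange(4), target].mean()

# The mistake: probabilities instead of log-probabilities. No error is raised.
nll_wrong = nn.NLLLoss()(F.softmax(logits, dim=1), target)

print(ce.item(), nll_ok.item(), manual.item())  # these three agree
print(nll_wrong.item())                         # negative, so not a valid NLL
```

Because `NLLLoss` only selects and negates, the wrong call returns minus the mean probability of the true class, which lies between $-1$ and $0$. A negative classification loss during training is therefore a useful warning sign of this mix-up.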
The family at a glance
The four classification-and-regression losses of this section are one idea, the negative log-likelihood of a chosen output distribution, specialised to four answer shapes.
| Loss | Output activation | Target | Output distribution |
|---|---|---|---|
| MSE | none (real value) | real number | Gaussian |
| BCE | one sigmoid | binary label, $y \in \{0, 1\}$ | Bernoulli |
| Multilabel logistic | sigmoid per class | multi-hot | product of Bernoullis |
| NLL | softmax | one-hot | categorical |
The same principle, minimise the negative log-likelihood of the data, generates all four; the choice among them is a statement about what kind of quantity the network is predicting.