Multiclass exclusive classification is the most common classification setting in practice: each example belongs to exactly one of $K$ categories. Digit recognition (one digit per image), object classification (one label per crop), and language modelling (one next token) are all of this form.

Binary problems are often an idealisation

Strictly, many “binary” problems are a coarse approximation of a graded reality. In medical imaging, for instance, there is a whole spectrum of nuance between the extremes benign and malignant. Multiclass exclusive classification is the natural generalisation when the categories are several but still mutually exclusive.

The network has one output neuron per category, and the desired output for a given input is a one-hot vector $y$: a $1$ in the position of the correct class, $0$ everywhere else.

The output must be a distribution: softmax

Because the classes are mutually exclusive, the outputs should form a genuine probability distribution: non-negative and summing to one. Independent sigmoids cannot do this (their outputs need not sum to one, as shown for the multilabel case). The construction that does is the softmax, which couples the outputs through a shared normaliser: for logits $z_1, \dots, z_K$,

$$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K.$$

The softmax makes the classes compete for a single unit of probability: raising one output lowers the others. Its properties are developed in Softmax and Cross-Entropy and Softmax properties.
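
As a minimal sketch of the coupling described above, the following pure-Python snippet implements a softmax (with the usual max-subtraction for numerical stability) and compares it with independent sigmoids. The logit values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Map a list of real-valued logits to a probability distribution."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)                         # shared normaliser couples the outputs
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, 1.0, 0.1]                      # hypothetical logits for three classes
probs = softmax(logits)
independent = [sigmoid(z) for z in logits]    # one sigmoid per output, no coupling

print("softmax sum: ", sum(probs))            # one, up to rounding
print("sigmoid sum: ", sum(independent))      # not constrained to one

# Competition: raising only the first logit lowers every other probability.
raised = softmax([3.0, 1.0, 0.1])
print("before:", probs)
print("after: ", raised)
```

Only the first logit changes between the two calls, yet all three probabilities move, which is the competition for a single unit of probability.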

From cross-entropy to negative log-likelihood

The loss for this setting is the cross-entropy between the one-hot target $y$ and the softmax output $\hat{y}$,

$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k.$$

Here the one-hot structure of $y$ does something clean: every $y_k$ is zero except for the correct class $c$, where $y_c = 1$. The whole sum therefore collapses to a single term,

$$L = -\log \hat{y}_c,$$

the negative log-likelihood of the correct class. The loss depends only on the probability the model assigned to the right answer: it ignores how the remaining mass is distributed among the wrong classes, and cares only that $\hat{y}_c$ be pushed toward $1$.
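
The collapse of the sum can be checked numerically. The sketch below uses hypothetical probability vectors and a three-class one-hot target; it is an illustration of the identity, not a training-ready loss.

```python
import math

probs = [0.7, 0.2, 0.1]        # hypothetical softmax output for three classes
target = [0, 1, 0]             # one-hot target: the correct class has index 1
c = target.index(1)

# Full cross-entropy sum: every term with a zero target contributes nothing.
cross_entropy = -sum(y * math.log(p) for y, p in zip(target, probs))

# Collapsed form: negative log-probability of the correct class only.
nll = -math.log(probs[c])

# Redistributing mass among the wrong classes leaves the loss unchanged,
# as long as the probability of the correct class stays the same.
other = [0.4, 0.2, 0.4]
nll_other = -math.log(other[c])

print(cross_entropy, nll, nll_other)
```

All three printed values agree, because each depends only on the probability assigned to the correct class.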

Why this is "negative log-likelihood"

The softmax output $\hat{y}_c$ is the probability the model assigns to the observed label $c$. Over a dataset of $N$ independent examples the likelihood is $\prod_{i=1}^{N} \hat{y}^{(i)}_{c^{(i)}}$; taking the negative logarithm turns the product into the sum $-\sum_{i=1}^{N} \log \hat{y}^{(i)}_{c^{(i)}}$. Minimising the NLL loss is therefore exactly maximising the likelihood of the correct labels under the model’s categorical distribution. This is the multiclass counterpart of the Bernoulli likelihood behind BCE and the Gaussian likelihood behind MSE: all three losses are negative log-likelihoods, differing only in the assumed output distribution.
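
The equivalence between maximising the likelihood and minimising the summed loss can be sketched in a few lines. The probabilities below are hypothetical values a model might assign to the observed label of four independent examples.

```python
import math

# Hypothetical probability assigned to the observed label, one entry per example.
p_correct = [0.9, 0.6, 0.8, 0.3]

likelihood = math.prod(p_correct)                  # product over examples
nll_sum = -sum(math.log(p) for p in p_correct)     # sum of per-example losses

# The negative log of the product equals the sum of the negative logs.
print(-math.log(likelihood), nll_sum)

# A second hypothetical model that is more confident in every observed label
# has a higher likelihood and, equivalently, a lower summed loss.
better = [0.95, 0.7, 0.9, 0.5]
better_likelihood = math.prod(better)
better_nll_sum = -sum(math.log(p) for p in better)
print(better_likelihood > likelihood, better_nll_sum < nll_sum)
```

Working with the sum rather than the product also avoids numerical underflow when the dataset is large, since a product of many values below one shrinks quickly toward zero.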

Reading the single term confirms the behaviour wanted of a classification loss: $-\log \hat{y}_c$ is $0$ when the model is certain and correct ($\hat{y}_c = 1$) and grows without bound as it becomes confidently wrong ($\hat{y}_c \to 0$).
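
A short tabulation makes this behaviour concrete. The probability values are arbitrary sample points chosen for illustration.

```python
import math

def nll(p_correct):
    """Loss for one example, given the probability assigned to the correct class."""
    return -math.log(p_correct)

# Walk from certain-and-correct toward confidently wrong.
for p in [1.0, 0.9, 0.5, 0.1, 1e-3, 1e-6]:
    print(f"p_correct = {p:<8g} loss = {nll(p):.4f}")
```

The loss is zero at a probability of one and increases steadily as the probability of the correct class shrinks, with no upper bound.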

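Over a dataset of $N$ examples the objective is the average of the per-example losses,

$$J = \frac{1}{N} \sum_{i=1}^{N} L^{(i)} = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{y}^{(i)}_{c^{(i)}}.$$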

The gradient: the clean residual again

Pairing softmax with this loss reproduces the cancellation that made BCE work. The gradient of the loss with respect to the logits is the signed residual between the predicted distribution and the one-hot target,

$$\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k,$$

with no softmax Jacobian left to attenuate it: confidently wrong predictions receive the largest gradient. The derivation, in which the softmax derivative and the log derivative cancel, is carried out in Softmax and Cross-Entropy and reused in Backpropagation Through Time.
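
The residual form of the gradient can be verified numerically without repeating the derivation. The sketch below compares the analytic expression (softmax output minus one-hot target) with a central finite-difference estimate; the logits, the class index, and the step size are hypothetical choices.

```python
import math

def softmax(logits):
    m = max(logits)                               # stabilise the exponentials
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, c):
    """Negative log-probability of the correct class c."""
    return -math.log(softmax(logits)[c])

z = [1.5, -0.3, 0.8]      # hypothetical logits
c = 2                     # hypothetical index of the correct class

# Analytic gradient: predicted distribution minus one-hot target.
probs = softmax(z)
one_hot = [1.0 if k == c else 0.0 for k in range(len(z))]
analytic = [p - y for p, y in zip(probs, one_hot)]

# Numerical gradient by central differences, one logit at a time.
eps = 1e-6
numeric = []
for k in range(len(z)):
    plus, minus = list(z), list(z)
    plus[k] += eps
    minus[k] -= eps
    numeric.append((loss(plus, c) - loss(minus, c)) / (2 * eps))

for a, n in zip(analytic, numeric):
    print(f"analytic = {a:+.6f}   numeric = {n:+.6f}")
```

The entry for the correct class is negative (its logit should rise) and the entries for the wrong classes are positive, and the components sum to zero because both the prediction and the target sum to one.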

In PyTorch: a naming subtlety worth knowing

PyTorch splits this loss across two objects, and confusing them is a common source of silent bugs.

import torch
import torch.nn as nn

# Example inputs (illustrative values): a batch of 4 examples and 3 classes
logits = torch.randn(4, 3)               # raw, unnormalised scores
target = torch.tensor([0, 2, 1, 2])      # class indices, not one-hot

# Option A: log-softmax in the model, NLLLoss as the criterion
model_tail = nn.LogSoftmax(dim=1)        # outputs log-probabilities
criterion  = nn.NLLLoss()                # expects log-probabilities + integer class index
loss = criterion(model_tail(logits), target)

# Option B (preferred): raw logits straight into CrossEntropyLoss
criterion = nn.CrossEntropyLoss()        # = LogSoftmax + NLLLoss, fused and stable
loss = criterion(logits, target)

NLLLoss expects log-probabilities, not probabilities

Despite its name, nn.NLLLoss does not apply a logarithm. It expects inputs that are already log-probabilities (the output of LogSoftmax) and simply selects and negates the entry of the true class. Feeding it raw probabilities, or raw logits, gives a wrong loss with no error raised. The robust pattern is nn.CrossEntropyLoss fed raw logits: it fuses LogSoftmax and NLLLoss into one numerically stable operation (via the log-sum-exp trick), so the softmax is never materialised and log(0) cannot occur. The target in both cases is a tensor of class indices, not a one-hot matrix.
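
The silent failure described above is easy to reproduce. The following sketch assumes PyTorch is installed and uses randomly generated logits with hypothetical class indices; the exact numbers are not meaningful, only the comparison between them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                 # hypothetical batch: 4 examples, 3 classes
target = torch.tensor([0, 2, 1, 2])        # class indices

# Two correct, equivalent routes.
correct = nn.CrossEntropyLoss()(logits, target)
also_correct = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

# Bug: probabilities instead of log-probabilities. No error is raised,
# but NLLLoss just negates and averages the selected probabilities.
wrong = nn.NLLLoss()(F.softmax(logits, dim=1), target)

print("CrossEntropyLoss on logits:     ", correct.item())
print("NLLLoss on log-probabilities:   ", also_correct.item())
print("NLLLoss on probabilities (bug): ", wrong.item())
```

A negative value from this loss is a useful warning sign: the true negative log-likelihood of a categorical distribution is never below zero, so a negative result usually means the input was not a log-probability.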

The family at a glance

The four classification-and-regression losses of this section are one idea, the negative log-likelihood of a chosen output distribution, specialised to four answer shapes.

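| Loss | Output activation | Target | Output distribution |
| --- | --- | --- | --- |
| MSE | none (real value) | real number | Gaussian |
| BCE | one sigmoid | $y \in \{0, 1\}$ | Bernoulli |
| Multilabel logistic | sigmoid per class | multi-hot | product of Bernoullis |
| NLL | softmax | one-hot | categorical |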

The same principle, minimise the negative log-likelihood of the data, generates all four; the choice among them is a statement about what kind of quantity the network is predicting.
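
The shared structure can be sketched by writing each loss directly as the negative log of its assumed distribution. The function names below are hypothetical, and the Gaussian case assumes a fixed unit variance, under which the negative log-likelihood equals half the squared error plus a constant.

```python
import math

def gaussian_nll(y, mu, sigma=1.0):
    """-log N(y | mu, sigma^2); with fixed sigma this is squared error up to scale and constant."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def bernoulli_nll(y, p):
    """-log Bernoulli(y | p): binary cross-entropy for one output."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def multilabel_nll(ys, ps):
    """-log of a product of independent Bernoullis: a sum of per-class BCE terms."""
    return sum(bernoulli_nll(y, p) for y, p in zip(ys, ps))

def categorical_nll(c, probs):
    """-log Categorical(c | probs): the softmax NLL loss."""
    return -math.log(probs[c])

# Gaussian NLL minus half the squared error is the same constant for any input.
const_a = gaussian_nll(2.0, 0.5) - 0.5 * (2.0 - 0.5) ** 2
const_b = gaussian_nll(-1.0, 3.0) - 0.5 * (-1.0 - 3.0) ** 2
print(const_a, const_b)

# With two classes, the categorical loss coincides with the Bernoulli loss.
print(categorical_nll(1, [0.2, 0.8]), bernoulli_nll(1, 0.8))

# Multilabel: three independent yes/no decisions scored together.
print(multilabel_nll([1, 0, 1], [0.9, 0.2, 0.6]))
```

Each function takes a target and the parameters of a distribution and returns the negative log-probability of that target, which is the sense in which the four losses are one idea.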