The cross-entropy sits exactly between the two quantities already developed in this chapter. The entropy is the cost of describing a source under its own distribution; the Kullback-Leibler divergence is the penalty for using the wrong distribution. Cross-entropy is the total cost of that second situation: describing data generated by $p$ with a code built for $q$.

Definition

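Let $X$ be a discrete random variable over the alphabet $\mathcal{X}$, with true probability mass function $p$ and a second distribution $q$ on the same alphabet. The cross-entropy of $q$ relative to $p$ is

\[
H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x) = \mathbb{E}_{p}\left[-\log q(X)\right].
\]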

Conventions, and a notational caution

  • The expectation is taken under the true distribution $p$, while the logarithm scores the assumed distribution $q$. This asymmetric pairing of $p$ and $q$ is the whole content of the quantity.
  • As for entropy, the convention $0 \log 0 = 0$ handles outcomes with $p(x) = 0$. If instead $q(x) = 0$ while $p(x) > 0$, the term $-p(x) \log q(x)$ is $+\infty$, so $H(p, q) = +\infty$: a code built for $q$ cannot describe an outcome that $q$ rules out but $p$ deems possible. This is the same absolute-continuity requirement that appears in the KL divergence.
  • $H(p, q)$ takes two distributions as its arguments and denotes the cross-entropy. It must not be confused with the joint entropy $H(X, Y)$, which takes two random variables. The kind of argument disambiguates the two notations.
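The definition and its two conventions can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `cross_entropy`, the choice of base-2 logarithms, and the example distributions are assumptions made here, not part of the text above.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; p and q are sequences over the same alphabet."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same alphabet")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue          # convention: an outcome with p(x) = 0 contributes nothing
        if qx == 0.0:
            return math.inf   # p allows an outcome that q rules out
        total -= px * math.log2(qx)
    return total

# Illustrative distributions (chosen so the logarithms are exact).
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(cross_entropy(p, q))                                # 1.75
print(cross_entropy(p, p))                                # 1.5, the entropy of p
print(cross_entropy([0.5, 0.5, 0.0], [0.5, 0.0, 0.5]))    # inf
```

The last call shows the absolute-continuity failure: the second outcome has positive probability under the first argument but zero probability under the second.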

Interpretation: paying for the wrong codebook

The operational meaning follows from Shannon’s source coding. The shortest expected description of a source $p$ uses about $-\log_2 p(x)$ bits for the outcome $x$, and its expected length is the entropy $H(p)$. Suppose instead the code is designed for a different distribution $q$, assigning length $-\log_2 q(x)$ to outcome $x$. If the data really follow $p$, the expected length of this mismatched code is

\[
\sum_{x} p(x) \bigl(-\log_2 q(x)\bigr) = H(p, q).
\]

Cross-entropy is therefore the expected number of bits actually spent when the encoder believes the distribution is $q$ but reality is $p$. When the belief is correct, $q = p$, this collapses to the entropy $H(p)$, the best achievable. Any mismatch makes it strictly larger, and the next section says by exactly how much.
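The codebook picture can be made concrete with two small prefix codes. The sketch below is illustrative only: the three-symbol alphabet, the probabilities, and the codewords are assumptions chosen so that every codeword length equals $-\log_2$ of the probability the code was designed for.

```python
# Source distribution p and the distribution q that the encoder wrongly assumes.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

# Prefix codes whose codeword lengths are -log2 of the design probabilities.
code_for_p = {"a": "0", "b": "10", "c": "11"}
code_for_q = {"a": "10", "b": "11", "c": "0"}

def expected_length(source, code):
    """Average number of bits per symbol when `source` is encoded with `code`."""
    return sum(prob * len(code[sym]) for sym, prob in source.items())

matched = expected_length(p, code_for_p)      # the entropy H(p)
mismatched = expected_length(p, code_for_q)   # the cross-entropy H(p, q)

print(matched)                # 1.5
print(mismatched)             # 1.75
print(mismatched - matched)   # 0.25 extra bits per symbol for the wrong codebook
```

Here the frequent symbol "a" receives a two-bit codeword under the mismatched code, which is where the extra quarter of a bit per symbol comes from.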

The fundamental identity

The single most useful fact about cross-entropy is that it splits cleanly into the two quantities it stands between:

\[
H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q).
\]

The total cost of the wrong code is an irreducible part, the entropy that even the optimal code must pay, plus an avoidable surcharge, the KL divergence charged for using $q$ in place of $p$.
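The split into entropy plus divergence follows from the definition by adding and subtracting $\log p(x)$ inside the sum. A short derivation, assuming the KL divergence is written $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ and that $q(x) > 0$ wherever $p(x) > 0$:

```latex
\[
\begin{aligned}
H(p, q) &= -\sum_{x} p(x) \log q(x) \\
        &= -\sum_{x} p(x) \log p(x) + \sum_{x} p(x) \log \frac{p(x)}{q(x)} \\
        &= H(p) + D_{\mathrm{KL}}(p \,\|\, q).
\end{aligned}
\]
```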

Consequences

Cross-entropy is minimised by the truth

Since $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$ by Gibbs’ inequality, with equality if and only if $q = p$, the decomposition gives

\[
H(p, q) \ge H(p),
\]

again with equality if and only if $q = p$. Over all distributions $q$, the cross-entropy $H(p, q)$ is therefore smallest when $q$ matches the true distribution $p$, and its minimum value is the entropy $H(p)$. Pushing $q$ toward $p$ in KL divergence and minimising $H(p, q)$ are the same task: the two objectives differ only by the constant $H(p)$.
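A quick numerical check of the minimisation property, restricted to the Bernoulli case for simplicity. The true success probability 0.3, the grid of candidate values, and the helper name `bernoulli_cross_entropy` are illustrative assumptions.

```python
import math

def bernoulli_cross_entropy(p1, q1):
    """H(p, q) in bits for Bernoulli distributions with success probabilities p1 and q1."""
    return -(p1 * math.log2(q1) + (1.0 - p1) * math.log2(1.0 - q1))

p1 = 0.3                                    # true success probability (illustrative)
grid = [k / 100 for k in range(1, 100)]     # candidate q1 values, avoiding 0 and 1

# The candidate with the smallest cross-entropy should be the truth itself.
best_q1 = min(grid, key=lambda q1: bernoulli_cross_entropy(p1, q1))
entropy = bernoulli_cross_entropy(p1, p1)   # H(p), the value at the minimum

print(best_q1)
print(entropy)
```

Every other grid point gives a strictly larger value, and the gap at each point is the KL divergence from the true distribution to that candidate.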

Two further properties follow directly from the definition.

  • Asymmetry. In general $H(p, q) \ne H(q, p)$, since the roles of the averaging distribution and the scored distribution are not interchangeable. Like the KL divergence it contains, cross-entropy is a directed quantity.
  • Self-consistency. Setting $q = p$ gives $H(p, p) = H(p)$: encoding a source with its own optimal code costs exactly its entropy, and the surcharge is $D_{\mathrm{KL}}(p \,\|\, p) = 0$.
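Both properties are easy to see on a two-outcome example. The distributions below are illustrative assumptions, and the helper assumes strictly positive probabilities so that no special cases arise.

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) in bits, assuming every probability in p and q is strictly positive."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q))

p = [0.5, 0.5]
q = [0.25, 0.75]

forward = cross_entropy_bits(p, q)    # average under p, score q
backward = cross_entropy_bits(q, p)   # average under q, score p
self_cost = cross_entropy_bits(p, p)  # equals the entropy of p, here 1 bit

print(forward, backward, self_cost)
```

Swapping the arguments changes the value: scoring the uniform $p$ always costs exactly one bit whatever distribution does the averaging, whereas scoring $q$ under a uniform average costs more than one bit.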

The continuous case

For a continuous random variable with densities $p$ and $q$, the sum is replaced by an integral,

\[
H(p, q) = -\int p(x) \log q(x) \, dx,
\]

the cross-entropy counterpart of differential entropy. The decomposition holds unchanged.
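As an illustration of the integral form, the sketch below compares a numerical approximation with the closed form for two univariate Gaussians, $\tfrac{1}{2}\log(2\pi\sigma_q^2) + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2}$ in nats. The Gaussian example, the parameter values, and the midpoint-rule integration settings are assumptions chosen for this sketch.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_log_pdf(x, mu, sigma):
    # Computed directly in log space so that far tails never hit log(0).
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def cross_entropy_numeric(mu_p, sigma_p, mu_q, sigma_q, lo=-20.0, hi=20.0, steps=40_000):
    """Midpoint-rule approximation of -integral p(x) log q(x) dx, in nats."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total -= gaussian_pdf(x, mu_p, sigma_p) * gaussian_log_pdf(x, mu_q, sigma_q) * dx
    return total

def cross_entropy_closed_form(mu_p, sigma_p, mu_q, sigma_q):
    """Closed form for two univariate Gaussians, in nats."""
    return (0.5 * math.log(2.0 * math.pi * sigma_q ** 2)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2))

# Illustrative parameters: p = N(0, 1), q = N(1, 4).
numeric = cross_entropy_numeric(0.0, 1.0, 1.0, 2.0)
closed = cross_entropy_closed_form(0.0, 1.0, 1.0, 2.0)
print(numeric, closed)
```

The two values should agree to many decimal places, since the integrand is smooth and the mass of $p$ outside the integration window is negligible for these parameters.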

Why this is the loss function of classification

The minimisation property above is the reason cross-entropy is the standard training objective for classification.

Minimising cross-entropy is maximum likelihood

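In supervised classification the true distribution $p$ is the (one-hot) label of an example, and $q_\theta$ is the network’s predicted distribution, depending on the parameters $\theta$. Training minimises $H(p, q_\theta)$ over $\theta$. By the decomposition,

\[
\arg\min_{\theta} H(p, q_\theta) = \arg\min_{\theta} \bigl[ H(p) + D_{\mathrm{KL}}(p \,\|\, q_\theta) \bigr] = \arg\min_{\theta} D_{\mathrm{KL}}(p \,\|\, q_\theta),
\]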

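because $H(p)$ does not depend on the model. Minimising the cross-entropy loss is therefore identical to minimising the KL divergence between the data and the model. For a one-hot label on class $y$, the cross-entropy reduces to $H(p, q_\theta) = -\log q_\theta(y)$, the negative log-likelihood of the observed label, so this minimisation is in turn exactly maximum likelihood. The machine-learning side of this story, including the gradient that makes it preferable to squared error, is developed in the discussion of the cross-entropy loss.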
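The reduction to a negative log-likelihood can be checked on a single example. This is a sketch under stated assumptions: the logits are made-up numbers standing in for a network's outputs, a softmax is assumed as the way the model produces its predicted distribution, and natural logarithms are used.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q):
    """H(p, q) in nats; outcomes with p(x) = 0 contribute nothing."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0.0)

logits = [2.0, 0.5, -1.0]   # hypothetical network outputs for one example
label = 0                   # index of the true class (illustrative)

one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]
q_theta = softmax(logits)

loss = cross_entropy(one_hot, q_theta)   # cross-entropy against the one-hot label
nll = -math.log(q_theta[label])          # negative log-likelihood of the true class

print(loss, nll)   # the two numbers coincide
```

Because the one-hot label has zero entropy, the loss on a single example is also the KL divergence from the label to the prediction; summing it over a dataset gives the negative log-likelihood that maximum likelihood minimises.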