Cross Entropy

The cross-entropy sits exactly between the two quantities already developed in this chapter. The entropy $H(p)$ is the cost of describing a source under its own distribution; the Kullback-Leibler divergence $D_{KL}(p ∥ q)$ is the penalty for using the wrong distribution. Cross-entropy is the total cost of that second situation: describing data generated by $p$ with a code built for $q$.

Definition

Let $X$ be a discrete random variable over the alphabet $A_{X}$, with true probability mass function $p(x)$ and a second distribution $q(x)$ on the same alphabet. The cross-entropy of $q$ relative to $p$ is

$H(p, q) = E_{p}\left[\log_{2} \frac{1}{q(X)}\right] = -\sum_{x \in A_{X}} p(x) \log_{2} q(x) \quad [\text{bits}].$

Conventions, and a notational caution

The expectation is taken under the true distribution $p$, while the logarithm scores the assumed distribution $q$. This asymmetric pairing of $p$ and $q$ is the whole content of the quantity.

As for entropy, the convention $0 \log_{2} 0 = 0$ handles outcomes with $p(x) = 0$. If instead $q(x) = 0$ while $p(x) > 0$, the term $-p(x) \log_{2} q(x)$ is $+\infty$, so $H(p, q) = +\infty$: a code built for $q$ cannot describe an outcome that $q$ rules out but $p$ deems possible. This is the same absolute-continuity requirement $p ≪ q$ that appears in the KL divergence.

$H(p, q)$ takes two distributions as its arguments and denotes the cross-entropy. It must not be confused with the joint entropy $H(X, Y)$, which takes two random variables. The kind of argument disambiguates the two notations.
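The definition and its two conventions can be sketched in a few lines of Python. This is a minimal illustration assuming NumPy is available; the function name `cross_entropy_bits` and the example distributions are hypothetical choices made for this sketch, not part of the text above.

```python
import numpy as np

def cross_entropy_bits(p, q):
    """Cross-entropy H(p, q) in bits for two pmfs on the same alphabet."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("p and q must be defined on the same alphabet")
    support = p > 0                  # outcomes with p(x) = 0 contribute 0 by convention
    if np.any(q[support] == 0):      # q rules out an outcome that p deems possible
        return float("inf")
    return float(-np.sum(p[support] * np.log2(q[support])))

p = [0.5, 0.5, 0.0]                               # illustrative "true" distribution
print(cross_entropy_bits(p, [0.25, 0.25, 0.5]))   # 2.0
print(cross_entropy_bits(p, [1.0, 0.0, 0.0]))     # inf
```

Restricting the sum to the support of $p$ implements the $0 \log_{2} 0 = 0$ convention, and the early return covers the case where $p ≪ q$ fails.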

Interpretation: paying for the wrong codebook

The operational meaning follows from Shannon’s source coding theorem. The shortest expected description of a source $X \sim p$ uses about $\log_{2} \frac{1}{p(x)}$ bits for the outcome $x$, and its expected length is the entropy $H(p)$. Suppose instead the code is designed for a different distribution $q$, assigning length $\log_{2} \frac{1}{q(x)}$ to outcome $x$. If the data really follow $p$, the expected length of this mismatched code is

$E_{p}\left[\log_{2} \frac{1}{q(X)}\right] = H(p, q).$

Cross-entropy is therefore the expected number of bits actually spent when the encoder believes the distribution is $q$ but reality is $p$. When the belief is correct, $q = p$, this collapses to the entropy $H(p)$, the best achievable. Any mismatch makes it strictly larger, and the next section says by exactly how much.
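A small worked example may make the codebook picture concrete. The sketch below assumes a hypothetical four-symbol source with dyadic probabilities, so that the ideal code lengths are whole numbers, and a code designed for the uniform distribution; both distributions are illustrative choices.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution (illustrative)
q = np.array([0.25, 0.25, 0.25, 0.25])    # distribution the code was designed for

len_matched = np.log2(1.0 / p)      # ideal lengths under p: 1, 2, 3, 3 bits
len_mismatched = np.log2(1.0 / q)   # ideal lengths under q: 2, 2, 2, 2 bits

entropy = float(np.sum(p * len_matched))            # H(p): expected length of the matched code
cross_entropy = float(np.sum(p * len_mismatched))   # H(p, q): expected length of the mismatched code

print(f"H(p) = {entropy:.2f} bits, H(p, q) = {cross_entropy:.2f} bits")
# H(p) = 1.75 bits, H(p, q) = 2.00 bits
```

In this example the uniform code spends 0.25 bits per symbol more than the code matched to $p$.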

The fundamental identity

The single most useful fact about cross-entropy is that it splits cleanly into the two quantities it stands between:

$H(p, q) = H(p) + D_{KL}(p ∥ q).$

The total cost of the wrong code is an irreducible part, the entropy $H(p)$ that even the optimal code must pay, plus an avoidable surcharge, the KL divergence $D_{KL}(p ∥ q)$ charged for using $q$ in place of $p$.

Proof of the decomposition

Starting from the definition and adding and subtracting $\sum_{x} p(x) \log_{2} p(x)$,
$H(p, q) = -\sum_{x \in A_{X}} p(x) \log_{2} q(x) = -\sum_{x} p(x) \log_{2} p(x) + \sum_{x} p(x) \log_{2} p(x) - \sum_{x} p(x) \log_{2} q(x) = \underbrace{-\sum_{x} p(x) \log_{2} p(x)}_{H(p)} + \underbrace{\sum_{x} p(x) \log_{2} \frac{p(x)}{q(x)}}_{D_{KL}(p ∥ q)} = H(p) + D_{KL}(p ∥ q).$
The first group is the entropy of $p$; the second is, by definition, the KL divergence $D_{KL}(p ∥ q)$.
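The decomposition is easy to check numerically. The following sketch assumes NumPy and two strictly positive, arbitrarily chosen distributions on a three-letter alphabet; the helper names are hypothetical and the zero-probability conventions are deliberately left out to keep it short.

```python
import numpy as np

def entropy_bits(p):
    # H(p), assuming every p(x) > 0
    return float(-np.sum(p * np.log2(p)))

def kl_bits(p, q):
    # D_KL(p || q), assuming every p(x) > 0 and q(x) > 0
    return float(np.sum(p * np.log2(p / q)))

def cross_entropy_bits(p, q):
    # H(p, q), assuming every q(x) > 0
    return float(-np.sum(p * np.log2(q)))

p = np.array([0.7, 0.2, 0.1])   # illustrative true distribution
q = np.array([0.5, 0.3, 0.2])   # illustrative assumed distribution

lhs = cross_entropy_bits(p, q)
rhs = entropy_bits(p) + kl_bits(p, q)
print(np.isclose(lhs, rhs))     # True
```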

Consequences

Cross-entropy is minimised by the truth

Since $D_{KL}(p ∥ q) \geq 0$ by Gibbs’ inequality, with equality if and only if $q = p$, the decomposition gives
$H(p, q) \geq H(p),$
again with equality if and only if $q = p$. Over all distributions $q$, the cross-entropy $H(p, q)$ is therefore smallest when $q$ matches the true distribution $p$, and its minimum value is the entropy $H(p)$. Pushing $q$ toward $p$ and minimising $H(p, q)$ are the same task, differing only by the constant $H(p)$.

Two further properties follow directly from the definition.

Asymmetry. In general $H(p, q) \neq H(q, p)$, since the roles of the averaging distribution and the scored distribution are not interchangeable. Like the KL divergence it contains, cross-entropy is a directed quantity.
Self-consistency. Setting $q = p$ gives $H(p, p) = H(p)$: encoding a source with its own optimal code costs exactly its entropy, and the surcharge $D_{KL}(p ∥ p) = 0$.
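These consequences can be observed on a two-outcome source. The sketch below assumes a hypothetical Bernoulli source with $p = (0.8, 0.2)$, sweeps candidate distributions $q = (a, 1 - a)$ over a grid, and then compares the two orderings of the arguments; the grid and the distributions are illustrative choices.

```python
import numpy as np

def cross_entropy_bits(p, q):
    # H(p, q) in bits, assuming every q(x) > 0
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log2(q)))

p = np.array([0.8, 0.2])                 # illustrative true distribution

# Sweep q = (a, 1 - a) and locate the minimiser of H(p, q).
grid = np.linspace(0.01, 0.99, 99)       # a = 0.01, 0.02, ..., 0.99
values = [cross_entropy_bits(p, [a, 1.0 - a]) for a in grid]
best_a = float(grid[int(np.argmin(values))])
print(round(best_a, 2))                  # 0.8, i.e. q = p

# Asymmetry: swapping the arguments changes the value.
q = np.array([0.5, 0.5])
h_pq = cross_entropy_bits(p, q)          # 1.0 bit
h_qp = cross_entropy_bits(q, p)          # about 1.32 bits
print(h_pq != h_qp)                      # True
```

The smallest value on the grid is attained at $a = 0.8$, where it equals $H(p, p) = H(p)$.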

The continuous case

For a continuous random variable with densities $p$ and $q$, the sum is replaced by an integral,
$H(p, q) = -\int p(x) \log_{2} q(x) \, dx,$
the cross-entropy counterpart of differential entropy. The decomposition $H(p, q) = H(p) + D_{KL}(p ∥ q)$ holds unchanged.
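As an illustration of the integral form, the sketch below takes two univariate Gaussians, $p = N(\mu_p, \sigma_p^2)$ and $q = N(\mu_q, \sigma_q^2)$, which are an assumed example not discussed in the text. For this pair the integral has the standard closed form $H(p, q) = \left[\tfrac{1}{2} \ln(2\pi\sigma_q^2) + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2}\right] \log_{2} e$ bits, and the code compares it with a plain Riemann sum on a wide grid. The parameter values and function names are hypothetical.

```python
import numpy as np

LOG2_E = np.log2(np.e)   # converts nats to bits

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def gaussian_cross_entropy_bits(mu_p, sigma_p, mu_q, sigma_q):
    """Closed form of H(p, q) in bits for p = N(mu_p, sigma_p^2), q = N(mu_q, sigma_q^2)."""
    nats = (0.5 * np.log(2.0 * np.pi * sigma_q ** 2)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2))
    return float(nats * LOG2_E)

mu_p, sigma_p = 0.0, 1.0   # illustrative true density
mu_q, sigma_q = 1.0, 2.0   # illustrative assumed density

# Riemann-sum approximation of -∫ p(x) log2 q(x) dx on a grid wide enough
# that the tails of p are negligible.
x = np.linspace(-12.0, 12.0, 200_001)
dx = x[1] - x[0]
p = gaussian_pdf(x, mu_p, sigma_p)
q = gaussian_pdf(x, mu_q, sigma_q)
numeric = float(-np.sum(p * np.log2(q)) * dx)

closed = gaussian_cross_entropy_bits(mu_p, sigma_p, mu_q, sigma_q)
print(np.isclose(numeric, closed))   # True
```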

Why this is the loss function of classification

The minimisation property above is the reason cross-entropy is the standard training objective for classification.

Minimising cross-entropy is maximum likelihood

In supervised classification the true distribution $p$ is the (one-hot) label of an example, and $q = q_{θ}$ is the network’s predicted distribution, depending on the parameters $θ$. Training minimises $H(p, q_{θ})$ over $θ$. By the decomposition,
$\arg\min_{θ} H(p, q_{θ}) = \arg\min_{θ} \Big[\underbrace{H(p)}_{\text{constant in } θ} + D_{KL}(p ∥ q_{θ})\Big] = \arg\min_{θ} D_{KL}(p ∥ q_{θ}),$
because $H(p)$ does not depend on the model. Minimising the cross-entropy loss is therefore identical to minimising the KL divergence between the data and the model, which in turn is exactly maximum likelihood: for a one-hot label on class $y$, $H(p, q_{θ}) = -\log_{2} q_{θ}(y)$, the negative log-likelihood of the observed label. The machine-learning side of this story, including the gradient that makes it preferable to squared error, is developed in the note on the cross-entropy loss.
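The link to maximum likelihood can be checked on a toy batch. The sketch below assumes hypothetical logits for three examples and three classes, and uses natural logarithms (nats) as is common in training code; switching to base 2 only rescales the loss and leaves the minimiser unchanged. The `softmax` helper and all numbers are illustrative.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.0, 1.0]])   # hypothetical model outputs
labels = np.array([0, 2, 1])           # hypothetical true classes

q = softmax(logits)                    # predicted distributions q_theta
one_hot = np.eye(3)[labels]            # true distributions p, one per example

# Cross-entropy H(p, q_theta) per example, in nats.
ce = -np.sum(one_hot * np.log(q), axis=1)

# Negative log-likelihood of the observed class per example.
nll = -np.log(q[np.arange(len(labels)), labels])

print(np.allclose(ce, nll))            # True
```

Because a one-hot $p$ has $H(p) = 0$, the per-example cross-entropy here also equals $D_{KL}(p ∥ q_{θ})$.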
