Let $P$ and $Q$ be two probability distributions of a random variable $X$. Formally, let $X \sim P$ and $X \sim Q$, or by common abuse of notation, $x \sim P(x)$ and $x \sim Q(x)$. The Kullback-Leibler (KL) divergence, also known as relative entropy, measures the statistical discrepancy between these two distributions.

Definition

The KL divergence from $Q$ to $P$ is defined as:

$$
D_{\mathrm{KL}}(P \,\|\, Q) =
\begin{cases}
\displaystyle\sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} & \text{if } P \ll Q, \\[2ex]
+\infty & \text{otherwise.}
\end{cases}
$$

Absolute Continuity ($P \ll Q$)

The condition $P \ll Q$ signifies that the distribution $P$ is absolutely continuous with respect to $Q$, meaning $Q$ dominates $P$. Formally, for every $x \in \mathcal{X}$:

$$Q(x) = 0 \implies P(x) = 0.$$

This ensures the Radon-Nikodym derivative $\frac{dP}{dQ}$ exists and the ratio $\frac{P(x)}{Q(x)}$ is well-defined. If $P(x) = 0$ and $Q(x) > 0$, the contribution of that term to the expectation is conventionally taken as $0$, since $\lim_{t \to 0^+} t \log t = 0$.
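A minimal sketch in Python of the discrete definition above, including the $0 \cdot \log 0 = 0$ convention (the function name `kl_divergence` and the list-based inputs are illustrative choices, not a fixed API):

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    p, q: sequences of probabilities over the same finite alphabet.
    Terms with P(x) = 0 contribute 0; any x with P(x) > 0 but
    Q(x) = 0 violates absolute continuity and yields +infinity.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue            # convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf     # P is not absolutely continuous w.r.t. Q
        total += px * math.log(px / qx)
    return total

# Example where P << Q, so the sum is finite:
P = [0.5, 0.5, 0.0]
Q = [0.25, 0.5, 0.25]
print(kl_divergence(P, Q))  # 0.5 * ln(2) ≈ 0.3466 nats
```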

The Singularity Case ("Otherwise")

The “otherwise” condition occurs when absolute continuity is violated. This happens if there exists at least one $x$ such that:

$$P(x) > 0 \quad \text{and} \quad Q(x) = 0.$$

In this scenario, the support of $P$ is not contained within the support of $Q$. The “surprise” of observing $x$ under $Q$ when it has a positive probability under $P$ is infinite, hence $D_{\mathrm{KL}}(P \,\|\, Q) = +\infty$.
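Continuing the illustrative `kl_divergence` sketch above, a support mismatch triggers the infinite branch:

```python
# P places mass on an outcome that Q assigns probability 0,
# so absolute continuity fails and the divergence is +infinity.
P = [0.5, 0.5]
Q = [1.0, 0.0]
print(kl_divergence(P, Q))  # inf
```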


Properties

While $D_{\mathrm{KL}}(P \,\|\, Q)$ is often used to quantify the “distance” between distributions, it is strictly a divergence and not a metric (distance) in the sense of metric spaces.

1. Asymmetry

The KL divergence is non-symmetric. In general:

$$D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P).$$

The information loss incurred when using $Q$ to model $P$ is not necessarily equal to the loss incurred when using $P$ to model $Q$. While specific cases exist where equality holds (e.g., if $P = Q$, then both equal $0$), it remains a directed measure.
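A small numerical illustration of the asymmetry, using the `kl_divergence` sketch above with two Bernoulli distributions (the specific parameters are arbitrary):

```python
# Bernoulli(0.5) vs Bernoulli(0.9): the two directions differ.
P = [0.5, 0.5]
Q = [0.1, 0.9]
print(kl_divergence(P, Q))  # ≈ 0.5108 nats
print(kl_divergence(Q, P))  # ≈ 0.3681 nats
```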

2. Violation of the Triangle Inequality

A true distance must satisfy the triangle inequality. For any three distributions $P$, $Q$, and $R$, the following does not necessarily hold:

$$D_{\mathrm{KL}}(P \,\|\, R) \leq D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, R).$$

Because $D_{\mathrm{KL}}$ does not respect this geometric constraint, it cannot define a standard metric space.
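One concrete counterexample, again using the `kl_divergence` sketch, with three Bernoulli distributions chosen for illustration:

```python
# P = Bernoulli(0.1), Q = Bernoulli(0.5), R = Bernoulli(0.9),
# each written as [P(0), P(1)].
P = [0.9, 0.1]
Q = [0.5, 0.5]
R = [0.1, 0.9]
lhs = kl_divergence(P, R)                        # ≈ 1.758 nats
rhs = kl_divergence(P, Q) + kl_divergence(Q, R)  # ≈ 0.368 + 0.511 ≈ 0.879 nats
print(lhs <= rhs)  # False: the triangle inequality fails here
```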

3. Non-negativity (Gibbs’ Inequality)

Despite not being a distance, $D_{\mathrm{KL}}(P \,\|\, Q)$ is always non-negative:

$$D_{\mathrm{KL}}(P \,\|\, Q) \geq 0,$$

with $D_{\mathrm{KL}}(P \,\|\, Q) = 0$ if and only if $P = Q$ almost everywhere. This property stems from Jensen’s Inequality applied to the logarithmic function.
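A short sketch of the standard argument, restricting the sum to the support of $P$ and using the concavity of $\log$:

$$
-D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x:\,P(x)>0} P(x) \log \frac{Q(x)}{P(x)} \;\leq\; \log \sum_{x:\,P(x)>0} P(x)\,\frac{Q(x)}{P(x)} \;=\; \log \sum_{x:\,P(x)>0} Q(x) \;\leq\; \log 1 = 0,
$$

so $D_{\mathrm{KL}}(P \,\|\, Q) \geq 0$, with equality only when the ratio $Q(x)/P(x)$ is constant on the support of $P$ and that support carries all of $Q$’s mass, i.e., $P = Q$ almost everywhere.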