Let $P$ and $Q$ be two probability distributions of a random variable $X$. Formally, let $X \sim P(X)$ and $X \sim Q(X)$, or by common abuse of notation, $X \sim P$ and $X \sim Q$. The Kullback-Leibler (KL) divergence, also known as relative entropy, measures the statistical discrepancy between these two distributions.
Definition
The KL divergence from $Q$ to $P$ is defined as:

$$
D_{\mathrm{KL}}(P \,\|\, Q) =
\begin{cases}
\displaystyle \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right] & \text{if } P \ll Q \\[1.5ex]
+\infty & \text{otherwise}
\end{cases}
$$
Absolute Continuity ($P \ll Q$)
The condition $P \ll Q$ signifies that the distribution $P$ is absolutely continuous with respect to $Q$, meaning $Q$ dominates $P$. Formally, for every $x \in \mathcal{X}$:

$$Q(x) = 0 \implies P(x) = 0$$

This ensures the Radon-Nikodym derivative $\frac{dP}{dQ}$ exists and the ratio $\frac{P(x)}{Q(x)}$ is well-defined. If $P(x) = 0$ and $Q(x) > 0$, the contribution to the expectation is conventionally taken as $0$ (consistent with the limit $\lim_{t \to 0^+} t \log t = 0$).
The Singularity Case ("Otherwise")
The “otherwise” condition occurs when absolute continuity is violated. This happens if there exists at least one $x \in \mathcal{X}$ such that:

$$P(x) > 0 \quad \text{and} \quad Q(x) = 0$$

In this scenario, the support of $P$ is not contained within the support of $Q$. The “surprise” of observing $x$ under $Q$ when it has a positive probability under $P$ is infinite, hence $D_{\mathrm{KL}}(P \,\|\, Q) = +\infty$.
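To make these conventions concrete, here is a minimal Python sketch of the discrete definition; the function name `kl_divergence` and the dictionary representation of the distributions are illustrative choices, not part of the definition itself.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as {outcome: probability} dicts."""
    divergence = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue              # convention: terms with P(x) = 0 contribute 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf       # absolute continuity violated: P(x) > 0 but Q(x) = 0
        divergence += px * math.log(px / qx)
    return divergence

# P is supported where Q is, so the divergence is finite...
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, q))  # finite and strictly positive, since P != Q

# ...but here the second distribution puts mass on "c", which Q does not cover.
r = {"a": 0.5, "c": 0.5}
print(kl_divergence(r, q))  # inf
```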
Properties
While $D_{\mathrm{KL}}(P \,\|\, Q)$ is often used to quantify the “distance” between distributions, it is strictly a divergence and not a metric (a true distance) in the mathematical sense.
1. Asymmetry
The KL divergence is non-symmetric. In general:

$$D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$$

The information loss incurred when using $Q$ to model $P$ is not necessarily equal to the loss incurred when using $P$ to model $Q$. While specific cases exist where equality holds (e.g., if $P = Q$, then both equal $0$), it remains a directed measure.
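A quick numerical illustration of the asymmetry, using two Bernoulli-style distributions chosen purely for this example:

```python
import math

def kl(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # heavily biased coin

print(kl(p, q))  # ~0.511 nats
print(kl(q, p))  # ~0.368 nats, so D_KL(P || Q) != D_KL(Q || P)
```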

2. Violation of the Triangle Inequality
A true distance must satisfy the triangle inequality. For any three distributions $P$, $Q$, and $R$, the following does not necessarily hold:

$$D_{\mathrm{KL}}(P \,\|\, R) \leq D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, R)$$

Because $D_{\mathrm{KL}}$ does not respect this geometric constraint, it cannot define a standard metric space.
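A counterexample is easy to construct numerically. The sketch below uses three Bernoulli distributions chosen purely for illustration, with $Q$ lying “between” $P$ and $R$; the direct divergence $D_{\mathrm{KL}}(P \,\|\, R)$ comes out larger than the sum of the two legs.

```python
import math

def kl(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Three Bernoulli distributions, with Q lying "between" P and R.
p = [0.5, 0.5]
q = [0.6, 0.4]
r = [0.7, 0.3]

direct = kl(p, r)             # ~0.087 nats
via_q  = kl(p, q) + kl(q, r)  # ~0.020 + ~0.023 = ~0.043 nats
print(direct <= via_q)        # False: the triangle inequality fails here
```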
3. Non-negativity (Gibbs’ Inequality)
Despite not being a distance, $D_{\mathrm{KL}}(P \,\|\, Q)$ is always non-negative:

$$D_{\mathrm{KL}}(P \,\|\, Q) \geq 0$$

with $D_{\mathrm{KL}}(P \,\|\, Q) = 0$ if and only if $P = Q$ almost everywhere. This property stems from Jensen’s Inequality applied to the logarithmic function.
Proof of $D_{\mathrm{KL}}(P \,\|\, Q) \geq 0$
Let $X$ be a random variable with support $\mathcal{X}$. We wish to show that $D_{\mathrm{KL}}(P \,\|\, Q) \geq 0$.
Starting from the definition:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]$$

Expanding the expectation for the discrete case (the continuous case follows analogously via integration), and noting that $\log \frac{P(x)}{Q(x)} = -\log \frac{Q(x)}{P(x)}$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} = -\sum_{x \in \mathcal{X}} P(x) \log \frac{Q(x)}{P(x)}$$

Since the logarithm function is strictly concave, we can apply Jensen’s Inequality, which states that for a concave function $f$ and a random variable $Y$:

$$\mathbb{E}[f(Y)] \leq f\!\left(\mathbb{E}[Y]\right)$$

Substituting $f = \log$ and $Y = \frac{Q(X)}{P(X)}$, with the expectation taken under $P$:

$$\sum_{x \in \mathcal{X}} P(x) \log \frac{Q(x)}{P(x)} \leq \log\!\left(\sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)}\right)$$

The term inside the logarithm simplifies to the total sum of the probability distribution $Q$:

$$\sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)} = \sum_{x \in \mathcal{X}} Q(x) = 1$$

Thus, we obtain:

$$\sum_{x \in \mathcal{X}} P(x) \log \frac{Q(x)}{P(x)} \leq \log(1) = 0$$

Multiplying by $-1$ reverses the inequality:

$$D_{\mathrm{KL}}(P \,\|\, Q) = -\sum_{x \in \mathcal{X}} P(x) \log \frac{Q(x)}{P(x)} \geq 0$$

Equality Condition: Since $\log$ is strictly concave, equality in Jensen’s Inequality holds if and only if the ratio $\frac{Q(x)}{P(x)}$ is constant for all $x$ in the support of $P$. Given that both $P$ and $Q$ are normalized probability distributions, this constant must be $1$, implying $P(x) = Q(x)$ almost everywhere.
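As a numerical sanity check on the result rather than a substitute for the proof, the sketch below evaluates the divergence for a few randomly generated distribution pairs; the `random_distribution` helper and the choice of five trials are purely illustrative.

```python
import math
import random

def kl(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def random_distribution(n):
    """A random probability vector of length n (illustrative helper)."""
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

random.seed(0)
for _ in range(5):
    p = random_distribution(4)
    q = random_distribution(4)
    assert kl(p, q) >= 0.0          # Gibbs' inequality: always non-negative
    assert abs(kl(p, p)) < 1e-12    # equality condition: D_KL(P || P) = 0
print("non-negativity and the equality condition hold on these samples")
```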