Let $P$ and $Q$ be two probability distributions of a random variable $X$. Formally, let $X \sim P$ and $X \sim Q$, or by common abuse of notation, $x \sim P(x)$ and $x \sim Q(x)$. The Kullback-Leibler (KL) divergence, also known as relative entropy, measures the statistical discrepancy between these two distributions.

Definition

The KL divergence from $Q$ to $P$ is defined as:

$$
D_{\mathrm{KL}}(P \,\|\, Q) =
\begin{cases}
\displaystyle\sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} & \text{if } P \ll Q, \\[2ex]
+\infty & \text{otherwise.}
\end{cases}
$$

Absolute Continuity ($P \ll Q$)

The condition $P \ll Q$ signifies that the distribution $P$ is absolutely continuous with respect to $Q$, meaning $Q$ dominates $P$. Formally, for every $x \in \mathcal{X}$:

$$Q(x) = 0 \implies P(x) = 0.$$

This ensures the Radon-Nikodym derivative $\frac{dP}{dQ}$ exists and the ratio $\frac{P(x)}{Q(x)}$ is well-defined. If $P(x) = 0$ and $Q(x) > 0$, the contribution of that term to the expectation is conventionally taken as $0$, since $\lim_{t \to 0^+} t \log t = 0$.
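A minimal sketch in Python of the discrete definition above, including the $0 \cdot \log 0 = 0$ convention (the function name `kl_divergence` and the list-based inputs are illustrative choices, not a fixed API):

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    p, q: sequences of probabilities over the same finite alphabet.
    Terms with P(x) = 0 contribute 0; any x with P(x) > 0 but
    Q(x) = 0 violates absolute continuity and yields +infinity.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue            # convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf     # P is not absolutely continuous w.r.t. Q
        total += px * math.log(px / qx)
    return total

# Example where P << Q, so the sum is finite:
P = [0.5, 0.5, 0.0]
Q = [0.25, 0.5, 0.25]
print(kl_divergence(P, Q))  # 0.5 * ln(2) ≈ 0.3466 nats
```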

The Singularity Case ("Otherwise")

The “otherwise” condition occurs when absolute continuity is violated. This happens if there exists at least one $x$ such that:

$$P(x) > 0 \quad \text{and} \quad Q(x) = 0.$$

In this scenario, the support of $P$ is not contained within the support of $Q$. The “surprise” of observing $x$ under $Q$ when it has a positive probability under $P$ is infinite, hence $D_{\mathrm{KL}}(P \,\|\, Q) = +\infty$.
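Continuing the illustrative `kl_divergence` sketch above, a support mismatch triggers the infinite branch:

```python
# P places mass on an outcome that Q assigns probability 0,
# so absolute continuity fails and the divergence is +infinity.
P = [0.5, 0.5]
Q = [1.0, 0.0]
print(kl_divergence(P, Q))  # inf
```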


Properties

While $D_{\mathrm{KL}}(P \,\|\, Q)$ is often used to quantify the “distance” between distributions, it is strictly a divergence and not a metric (distance) in the sense of metric spaces.

1. Asymmetry

The KL divergence is non-symmetric. In general:

$$D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P).$$

The information loss incurred when using $Q$ to model $P$ is not necessarily equal to the loss incurred when using $P$ to model $Q$. While specific cases exist where equality holds (e.g., if $P = Q$, then both equal $0$), it remains a directed measure.
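A small numerical illustration of the asymmetry, using the `kl_divergence` sketch above with two Bernoulli distributions (the specific parameters are arbitrary):

```python
# Bernoulli(0.5) vs Bernoulli(0.9): the two directions differ.
P = [0.5, 0.5]
Q = [0.1, 0.9]
print(kl_divergence(P, Q))  # ≈ 0.5108 nats
print(kl_divergence(Q, P))  # ≈ 0.3681 nats
```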

2. Violation of the Triangle Inequality

A true distance must satisfy the triangle inequality. For any three distributions $P$, $Q$, and $R$, the following does not necessarily hold:

$$D_{\mathrm{KL}}(P \,\|\, R) \leq D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, R).$$

Because $D_{\mathrm{KL}}$ does not respect this geometric constraint, it cannot define a standard metric space.
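One concrete counterexample, again using the `kl_divergence` sketch, with three Bernoulli distributions chosen for illustration:

```python
# P = Bernoulli(0.1), Q = Bernoulli(0.5), R = Bernoulli(0.9),
# each written as [P(0), P(1)].
P = [0.9, 0.1]
Q = [0.5, 0.5]
R = [0.1, 0.9]
lhs = kl_divergence(P, R)                        # ≈ 1.758 nats
rhs = kl_divergence(P, Q) + kl_divergence(Q, R)  # ≈ 0.368 + 0.511 ≈ 0.879 nats
print(lhs <= rhs)  # False: the triangle inequality fails here
```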

3. Non-negativity (Gibbs’ Inequality)

Despite not being a distance, $D_{\mathrm{KL}}(P \,\|\, Q)$ is always non-negative:

$$D_{\mathrm{KL}}(P \,\|\, Q) \geq 0,$$

with $D_{\mathrm{KL}}(P \,\|\, Q) = 0$ if and only if $P = Q$ almost everywhere. This property stems from Jensen’s Inequality applied to the logarithmic function.
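A short sketch of the standard argument, restricting the sum to the support of $P$ and using the concavity of $\log$:

$$
-D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x:\,P(x)>0} P(x) \log \frac{Q(x)}{P(x)} \;\leq\; \log \sum_{x:\,P(x)>0} P(x)\,\frac{Q(x)}{P(x)} \;=\; \log \sum_{x:\,P(x)>0} Q(x) \;\leq\; \log 1 = 0,
$$

so $D_{\mathrm{KL}}(P \,\|\, Q) \geq 0$, with equality only when the ratio $Q(x)/P(x)$ is constant on the support of $P$ and that support carries all of $Q$’s mass, i.e., $P = Q$ almost everywhere.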