A fundamental property of the joint KL divergence is its recursive decomposition, which is analogous to the chain rule for joint entropy:

$$
D_{\mathrm{KL}}\big(p(x, y) \,\|\, q(x, y)\big) = D_{\mathrm{KL}}\big(p(x) \,\|\, q(x)\big) + D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)
$$

where $D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)$ defines the conditional KL divergence, expressed as:

$$
D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \mathbb{E}_{p(x)}\!\left[ D_{\mathrm{KL}}\big(p(y \mid X) \,\|\, q(y \mid X)\big) \right] = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}
$$
Information Theoretic Interpretation
The total divergence between two joint models is the sum of the divergence between their marginal distributions and the expected divergence between their conditional distributions, where the expectation is taken under the first model's marginal $p(x)$.
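To make the decomposition concrete, the short Python/NumPy sketch below numerically checks the chain rule on a pair of 2x2 joint distributions. The specific probability values are illustrative assumptions, not taken from the text; the point is that the joint divergence equals the marginal divergence plus the expected conditional divergence.

```python
import numpy as np

# Hypothetical joint distributions p(x, y) and q(x, y) over binary X and Y
# (illustrative values; any strictly positive joints work).
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
q_xy = np.array([[0.25, 0.25],
                 [0.10, 0.40]])

def kl(p, q):
    """KL divergence between two discrete distributions given as arrays."""
    p, q = p.ravel(), q.ravel()
    return np.sum(p * np.log(p / q))

# Left-hand side: divergence between the joint distributions.
joint_kl = kl(p_xy, q_xy)

# Marginal term: divergence between the marginals p(x) and q(x).
p_x, q_x = p_xy.sum(axis=1), q_xy.sum(axis=1)
marginal_kl = kl(p_x, q_x)

# Conditional term: expectation under p(x) of the divergence between
# the conditionals p(y | x) and q(y | x).
p_y_given_x = p_xy / p_x[:, None]
q_y_given_x = q_xy / q_x[:, None]
conditional_kl = sum(
    p_x[i] * kl(p_y_given_x[i], q_y_given_x[i]) for i in range(len(p_x))
)

print(f"D_KL(p(x,y) || q(x,y))         = {joint_kl:.6f}")
print(f"D_KL(p(x) || q(x))             = {marginal_kl:.6f}")
print(f"E_p(x)[D_KL(p(y|x) || q(y|x))] = {conditional_kl:.6f}")
print(f"marginal + conditional         = {marginal_kl + conditional_kl:.6f}")
# The last line matches the joint divergence, confirming the chain rule.
```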