A fundamental property of the joint KL divergence is its recursive decomposition, which is analogous to the chain rule for joint entropy:

$$
D_{\mathrm{KL}}\big(p(x, y) \,\|\, q(x, y)\big) = D_{\mathrm{KL}}\big(p(x) \,\|\, q(x)\big) + D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)
$$

where $D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)$ defines the conditional KL divergence, expressed as:

$$
D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \mathbb{E}_{p(x)}\!\left[ D_{\mathrm{KL}}\big(p(y \mid X) \,\|\, q(y \mid X)\big) \right] = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}
$$
Information Theoretic Interpretation
The total divergence between two joint models is the sum of the divergence between their marginal distributions and the expected divergence between their conditional distributions, where the expectation is taken under the first model's marginal $p(x)$.
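To make the decomposition concrete, the short Python/NumPy sketch below numerically checks the chain rule on a pair of 2x2 joint distributions. The specific probability values are illustrative assumptions, not taken from the text; the point is that the joint divergence equals the marginal divergence plus the expected conditional divergence.

```python
import numpy as np

# Hypothetical joint distributions p(x, y) and q(x, y) over binary X and Y
# (illustrative values; any strictly positive joints work).
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
q_xy = np.array([[0.25, 0.25],
                 [0.10, 0.40]])

def kl(p, q):
    """KL divergence between two discrete distributions given as arrays."""
    p, q = p.ravel(), q.ravel()
    return np.sum(p * np.log(p / q))

# Left-hand side: divergence between the joint distributions.
joint_kl = kl(p_xy, q_xy)

# Marginal term: divergence between the marginals p(x) and q(x).
p_x, q_x = p_xy.sum(axis=1), q_xy.sum(axis=1)
marginal_kl = kl(p_x, q_x)

# Conditional term: expectation under p(x) of the divergence between
# the conditionals p(y | x) and q(y | x).
p_y_given_x = p_xy / p_x[:, None]
q_y_given_x = q_xy / q_x[:, None]
conditional_kl = sum(
    p_x[i] * kl(p_y_given_x[i], q_y_given_x[i]) for i in range(len(p_x))
)

print(f"D_KL(p(x,y) || q(x,y))         = {joint_kl:.6f}")
print(f"D_KL(p(x) || q(x))             = {marginal_kl:.6f}")
print(f"E_p(x)[D_KL(p(y|x) || q(y|x))] = {conditional_kl:.6f}")
print(f"marginal + conditional         = {marginal_kl + conditional_kl:.6f}")
# The last line matches the joint divergence, confirming the chain rule.
```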