A fundamental property of the joint KL divergence is its recursive decomposition, analogous to the chain rule for joint entropy:

$$
D_{\mathrm{KL}}\big(p(x, y) \,\|\, q(x, y)\big) = D_{\mathrm{KL}}\big(p(x) \,\|\, q(x)\big) + D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)
$$

where $D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)$ denotes the conditional KL divergence: the divergence between the conditional distributions, averaged over the marginal $p(x)$ of the first model:

$$
D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)}
$$

Information Theoretic Interpretation

The total divergence between two joint models is the sum of the divergence between their marginal distributions and the expected divergence between their conditional distributions, where the expectation is taken under the first model's marginal $p(x)$. A sketch verifying this numerically follows below.
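
The following is a minimal sketch of this decomposition for discrete distributions, assuming the joints are represented as NumPy arrays indexed by $(x, y)$; the specific arrays and the helper `kl` are illustrative, not taken from the text. It computes the joint, marginal, and conditional KL terms separately and confirms that the first equals the sum of the other two.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions of the same shape."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Illustrative joint distributions p(x, y) and q(x, y) on a 2 x 3 grid
# (rows index x, columns index y); both sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
q_xy = np.array([[0.10, 0.20, 0.20],
                 [0.20, 0.10, 0.20]])

# Marginals over x and the conditionals p(y|x), q(y|x).
p_x, q_x = p_xy.sum(axis=1), q_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]
q_y_given_x = q_xy / q_x[:, None]

# Chain rule: joint KL = marginal KL + conditional KL averaged under p(x).
joint_kl = kl(p_xy, q_xy)
marginal_kl = kl(p_x, q_x)
conditional_kl = sum(p_x[i] * kl(p_y_given_x[i], q_y_given_x[i])
                     for i in range(len(p_x)))

print(f"joint               : {joint_kl:.6f}")
print(f"marginal + cond.    : {marginal_kl + conditional_kl:.6f}")  # matches the joint KL
```

Running the sketch prints the same value on both lines, which is exactly the decomposition stated above: the divergence of the joints splits into a marginal term and an expected conditional term.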