Statement
Let $X$, $Y$, and $Z$ be random variables that form a Markov chain, denoted as $X \to Y \to Z$. The Data Processing Inequality states that:

$$I(X; Y) \geq I(X; Z)$$
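As a sanity check, the inequality can be verified numerically on a small discrete chain. The sketch below is purely illustrative (the alphabet sizes, the Dirichlet-sampled distributions, and the helper name `mutual_information` are arbitrary choices, not part of the statement): it builds a joint table $p(x, y, z) = p(x)\,p(y \mid x)\,p(z \mid y)$ and compares $I(X; Y)$ with $I(X; Z)$.

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in nats, computed by direct summation over a joint table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)

# Illustrative Markov chain X -> Y -> Z over alphabets of size 3:
px   = rng.dirichlet(np.ones(3))            # p(x)
py_x = rng.dirichlet(np.ones(3), size=3)    # py_x[x, y] = p(y|x)
pz_y = rng.dirichlet(np.ones(3), size=3)    # pz_y[y, z] = p(z|y)

# Joint p(x, y, z) = p(x) p(y|x) p(z|y).
pxyz = px[:, None, None] * py_x[:, :, None] * pz_y[None, :, :]

I_xy = mutual_information(pxyz.sum(axis=2))  # marginalize out z -> p(x,y)
I_xz = mutual_information(pxyz.sum(axis=1))  # marginalize out y -> p(x,z)

print(f"I(X;Y) = {I_xy:.4f} nats, I(X;Z) = {I_xz:.4f} nats")
assert I_xy >= I_xz - 1e-12  # Data Processing Inequality
```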
Proof
By exploiting the chain rule for mutual information, the joint mutual information $I(X; Y, Z)$ can be expanded in two distinct ways, yielding the following system:

$$\begin{aligned}
I(X; Y, Z) &= I(X; Z) + I(X; Y \mid Z) \\
I(X; Y, Z) &= I(X; Y) + I(X; Z \mid Y)
\end{aligned}$$

By equating the right-hand sides of both expansions:

$$I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$$

Since, as shown below:

$$I(X; Z \mid Y) = 0$$

it follows that:

$$I(X; Y) = I(X; Z) + I(X; Y \mid Z)$$

Therefore, by the non-negativity of $I(X; Y \mid Z)$, also shown below:

$$I(X; Y) \geq I(X; Z)$$
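The individual steps of the proof can also be checked numerically. The following sketch (same illustrative setup as above; the helper names `mi` and `cond_mi` are hypothetical) verifies that both chain-rule expansions of $I(X; Y, Z)$ agree, that $I(X; Z \mid Y) = 0$ under the Markov construction, and that $I(X; Y \mid Z) \geq 0$:

```python
import numpy as np

def mi(pab):
    """I(A;B) in nats from a joint table pab[a, b]."""
    pa, pb = pab.sum(1, keepdims=True), pab.sum(0, keepdims=True)
    m = pab > 0
    return float(np.sum(pab[m] * np.log(pab[m] / (pa @ pb)[m])))

def cond_mi(pabc):
    """I(A;B|C) = sum_c p(c) * I(A;B | C=c), conditioning on the last axis."""
    pc = pabc.sum(axis=(0, 1))
    total = 0.0
    for c in range(pabc.shape[2]):
        if pc[c] > 0:
            total += pc[c] * mi(pabc[:, :, c] / pc[c])
    return total

rng = np.random.default_rng(1)
px   = rng.dirichlet(np.ones(3))
py_x = rng.dirichlet(np.ones(3), size=3)
pz_y = rng.dirichlet(np.ones(3), size=3)
pxyz = px[:, None, None] * py_x[:, :, None] * pz_y[None, :, :]  # p(x,y,z)

I_xz   = mi(pxyz.sum(axis=1))              # I(X;Z)
I_xy   = mi(pxyz.sum(axis=2))              # I(X;Y)
I_xy_z = cond_mi(pxyz)                     # I(X;Y|Z): axes (x, y, z)
I_xz_y = cond_mi(np.moveaxis(pxyz, 1, 2))  # I(X;Z|Y): axes (x, z, y)

# Both chain-rule expansions of I(X;Y,Z) agree:
assert np.isclose(I_xz + I_xy_z, I_xy + I_xz_y)
assert np.isclose(I_xz_y, 0.0)   # Markov property: I(X;Z|Y) = 0
assert I_xy_z >= 0.0             # non-negativity of conditional MI
```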
Non-negativity of $I(X; Y \mid Z)$
By definition, conditional mutual information is the expected Kullback-Leibler divergence, averaged over $z$, between
- the joint conditional probability distribution $p(x, y \mid z)$,
- the product of the conditional marginal distributions $p(x \mid z)$ and $p(y \mid z)$:

$$I(X; Y \mid Z) = \mathbb{E}_{z \sim p(z)}\left[ D_{\mathrm{KL}}\big( p(x, y \mid z) \,\|\, p(x \mid z)\, p(y \mid z) \big) \right]$$

Since the KL divergence measures the discrepancy between two distributions and is always non-negative, it follows that:

$$I(X; Y \mid Z) \geq 0$$
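Concretely, $I(X; Y \mid Z)$ can be evaluated as the $p(z)$-weighted sum of per-$z$ KL divergences, each of which is individually non-negative. A minimal sketch, assuming an arbitrary illustrative joint table:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for distribution tables of equal shape."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rng = np.random.default_rng(2)
pxyz = rng.dirichlet(np.ones(27)).reshape(3, 3, 3)  # an arbitrary joint p(x,y,z)

pz = pxyz.sum(axis=(0, 1))
I_xy_given_z = 0.0
for z in range(3):
    pxy_z = pxyz[:, :, z] / pz[z]               # p(x,y|z)
    px_z = pxy_z.sum(axis=1, keepdims=True)     # p(x|z)
    py_z = pxy_z.sum(axis=0, keepdims=True)     # p(y|z)
    d = kl(pxy_z, px_z @ py_z)                  # D_KL(p(x,y|z) || p(x|z)p(y|z))
    assert d >= 0.0                             # each slice is non-negative
    I_xy_given_z += pz[z] * d

print(f"I(X;Y|Z) = {I_xy_given_z:.4f} nats (>= 0)")
```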
The term $I(X; Z \mid Y)$ quantifies the residual dependence between $X$ and $Z$ when $Y$ is known. Since $X \to Y \to Z$ forms a Markov chain, $X$ and $Z$ are conditionally independent given $Y$.
To prove this conditional independence mathematically, consider the joint probability of $X$ and $Z$ given $Y$:

$$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x)\, p(y \mid x)\, p(z \mid x, y)}{p(y)} = \frac{p(x)\, p(y \mid x)}{p(y)}\, p(z \mid y) = p(x \mid y)\, p(z \mid y)$$
where the following observations have been exploited:
- by the chain rule of probability, $p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid x, y)$;
- due to the Markov assumption $X \to Y \to Z$, the future state $Z$ depends only on the present state $Y$ and is conditionally independent of the past state $X$; therefore, $p(z \mid x, y) = p(z \mid y)$;
- by the definition of conditional probability, $\frac{p(x)\, p(y \mid x)}{p(y)} = \frac{p(x, y)}{p(y)} = p(x \mid y)$.
Since the joint conditional distribution factorizes into the product of the conditional marginal distributions, $X$ and $Z$ are proven to be conditionally independent given $Y$. Consequently, the mutual information between them given $Y$ is zero:

$$I(X; Z \mid Y) = 0$$
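This factorization can be confirmed numerically as well: for a joint distribution constructed as $p(x)\,p(y \mid x)\,p(z \mid y)$, the conditional $p(x, z \mid y)$ coincides with the product $p(x \mid y)\,p(z \mid y)$ for every $y$. A brief sketch under the same illustrative assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(3)
px   = rng.dirichlet(np.ones(3))
py_x = rng.dirichlet(np.ones(3), size=3)   # p(y|x)
pz_y = rng.dirichlet(np.ones(3), size=3)   # p(z|y)

pxyz = px[:, None, None] * py_x[:, :, None] * pz_y[None, :, :]  # p(x,y,z)
py = pxyz.sum(axis=(0, 2))                                      # p(y)

for y in range(3):
    pxz_y = pxyz[:, y, :] / py[y]                  # p(x,z|y)
    px_y = pxz_y.sum(axis=1, keepdims=True)        # p(x|y)
    pz_given_y = pxz_y.sum(axis=0, keepdims=True)  # p(z|y)
    # Markov chain => the joint conditional factorizes into the product:
    assert np.allclose(pxz_y, px_y @ pz_given_y)
```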
Interpretation
Important
By processing the output $Y$, the information about $X$ can only be reduced: in particular, for any function $g$, the chain $X \to Y \to g(Y)$ holds, hence $I(X; g(Y)) \leq I(X; Y)$.
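For instance, when $Z = g(Y)$ is a deterministic coarse-graining that merges symbols of $Y$, the chain $X \to Y \to Z$ holds automatically and the inequality applies. A hypothetical sketch (the map $g$ and the distributions are arbitrary illustrations):

```python
import numpy as np

def mi(pab):
    """I(A;B) in nats from a joint table pab[a, b]."""
    pa, pb = pab.sum(1, keepdims=True), pab.sum(0, keepdims=True)
    m = pab > 0
    return float(np.sum(pab[m] * np.log(pab[m] / (pa @ pb)[m])))

rng = np.random.default_rng(4)
px   = rng.dirichlet(np.ones(3))
py_x = rng.dirichlet(np.ones(4), size=3)   # p(y|x), |Y| = 4
pxy  = px[:, None] * py_x                  # joint p(x,y)

g = np.array([0, 0, 1, 1])                 # g merges y in {0,1} and in {2,3}
pxz = np.zeros((3, 2))                     # joint p(x, z) with z = g(y)
for y in range(4):
    pxz[:, g[y]] += pxy[:, y]

print(f"I(X;Y) = {mi(pxy):.4f} >= I(X;g(Y)) = {mi(pxz):.4f}")
assert mi(pxy) >= mi(pxz) - 1e-12          # processing cannot gain information
```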