Statement
Let $X$, $Y$, and $Z$ be random variables that form a Markov chain, denoted as $X \to Y \to Z$. The Data Processing Inequality states that:

$$I(X; Y) \geq I(X; Z)$$
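As a sanity check, the inequality can be verified numerically on a small discrete chain. The sketch below is purely illustrative (the alphabet sizes, the Dirichlet-sampled distributions, and the helper name `mutual_information` are arbitrary choices, not part of the statement): it builds a joint table $p(x, y, z) = p(x)\,p(y \mid x)\,p(z \mid y)$ and compares $I(X; Y)$ with $I(X; Z)$.

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in nats, computed by direct summation over a joint table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)

# Illustrative Markov chain X -> Y -> Z over alphabets of size 3:
px   = rng.dirichlet(np.ones(3))            # p(x)
py_x = rng.dirichlet(np.ones(3), size=3)    # py_x[x, y] = p(y|x)
pz_y = rng.dirichlet(np.ones(3), size=3)    # pz_y[y, z] = p(z|y)

# Joint p(x, y, z) = p(x) p(y|x) p(z|y).
pxyz = px[:, None, None] * py_x[:, :, None] * pz_y[None, :, :]

I_xy = mutual_information(pxyz.sum(axis=2))  # marginalize out z -> p(x,y)
I_xz = mutual_information(pxyz.sum(axis=1))  # marginalize out y -> p(x,z)

print(f"I(X;Y) = {I_xy:.4f} nats, I(X;Z) = {I_xz:.4f} nats")
assert I_xy >= I_xz - 1e-12  # Data Processing Inequality
```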
Proof
By exploiting the chain rule for mutual information, the joint mutual information $I(X; Y, Z)$ can be expanded in two distinct ways, yielding the following system:

$$\begin{aligned}
I(X; Y, Z) &= I(X; Z) + I(X; Y \mid Z) \\
I(X; Y, Z) &= I(X; Y) + I(X; Z \mid Y)
\end{aligned}$$

By equating the right-hand sides of both expansions:

$$I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$$

Since, as shown below:

$$I(X; Z \mid Y) = 0$$

it follows that:

$$I(X; Y) = I(X; Z) + I(X; Y \mid Z)$$

Therefore, by the non-negativity of $I(X; Y \mid Z)$, also shown below:

$$I(X; Y) \geq I(X; Z)$$
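The individual steps of the proof can also be checked numerically. The following sketch (same illustrative setup as above; the helper names `mi` and `cond_mi` are hypothetical) verifies that both chain-rule expansions of $I(X; Y, Z)$ agree, that $I(X; Z \mid Y) = 0$ under the Markov construction, and that $I(X; Y \mid Z) \geq 0$:

```python
import numpy as np

def mi(pab):
    """I(A;B) in nats from a joint table pab[a, b]."""
    pa, pb = pab.sum(1, keepdims=True), pab.sum(0, keepdims=True)
    m = pab > 0
    return float(np.sum(pab[m] * np.log(pab[m] / (pa @ pb)[m])))

def cond_mi(pabc):
    """I(A;B|C) = sum_c p(c) * I(A;B | C=c), conditioning on the last axis."""
    pc = pabc.sum(axis=(0, 1))
    total = 0.0
    for c in range(pabc.shape[2]):
        if pc[c] > 0:
            total += pc[c] * mi(pabc[:, :, c] / pc[c])
    return total

rng = np.random.default_rng(1)
px   = rng.dirichlet(np.ones(3))
py_x = rng.dirichlet(np.ones(3), size=3)
pz_y = rng.dirichlet(np.ones(3), size=3)
pxyz = px[:, None, None] * py_x[:, :, None] * pz_y[None, :, :]  # p(x,y,z)

I_xz   = mi(pxyz.sum(axis=1))              # I(X;Z)
I_xy   = mi(pxyz.sum(axis=2))              # I(X;Y)
I_xy_z = cond_mi(pxyz)                     # I(X;Y|Z): axes (x, y, z)
I_xz_y = cond_mi(np.moveaxis(pxyz, 1, 2))  # I(X;Z|Y): axes (x, z, y)

# Both chain-rule expansions of I(X;Y,Z) agree:
assert np.isclose(I_xz + I_xy_z, I_xy + I_xz_y)
assert np.isclose(I_xz_y, 0.0)   # Markov property: I(X;Z|Y) = 0
assert I_xy_z >= 0.0             # non-negativity of conditional MI
```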
Non-negativity of $I(X; Y \mid Z)$
By definition, conditional mutual information is the expected Kullback-Leibler divergence, averaged over $z$, between
- the joint conditional probability distribution $p(x, y \mid z)$,
- the product of the conditional marginal distributions $p(x \mid z)$ and $p(y \mid z)$:

$$I(X; Y \mid Z) = \mathbb{E}_{z \sim p(z)}\left[ D_{\mathrm{KL}}\big( p(x, y \mid z) \,\|\, p(x \mid z)\, p(y \mid z) \big) \right]$$

Since the KL divergence measures the discrepancy between two distributions and is always non-negative, it follows that:

$$I(X; Y \mid Z) \geq 0$$
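Concretely, $I(X; Y \mid Z)$ can be evaluated as the $p(z)$-weighted sum of per-$z$ KL divergences, each of which is individually non-negative. A minimal sketch, assuming an arbitrary illustrative joint table:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for distribution tables of equal shape."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rng = np.random.default_rng(2)
pxyz = rng.dirichlet(np.ones(27)).reshape(3, 3, 3)  # an arbitrary joint p(x,y,z)

pz = pxyz.sum(axis=(0, 1))
I_xy_given_z = 0.0
for z in range(3):
    pxy_z = pxyz[:, :, z] / pz[z]               # p(x,y|z)
    px_z = pxy_z.sum(axis=1, keepdims=True)     # p(x|z)
    py_z = pxy_z.sum(axis=0, keepdims=True)     # p(y|z)
    d = kl(pxy_z, px_z @ py_z)                  # D_KL(p(x,y|z) || p(x|z)p(y|z))
    assert d >= 0.0                             # each slice is non-negative
    I_xy_given_z += pz[z] * d

print(f"I(X;Y|Z) = {I_xy_given_z:.4f} nats (>= 0)")
```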
The term $I(X; Z \mid Y)$ quantifies the residual dependence between $X$ and $Z$ when $Y$ is known. Since $X \to Y \to Z$ forms a Markov chain, $X$ and $Z$ are conditionally independent given $Y$.
To prove this conditional independence mathematically, consider the joint probability of $X$ and $Z$ given $Y$:

$$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x)\, p(y \mid x)\, p(z \mid x, y)}{p(y)} = \frac{p(x)\, p(y \mid x)}{p(y)}\, p(z \mid y) = p(x \mid y)\, p(z \mid y)$$
where the following observations have been exploited:
- by the chain rule of probability, $p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid x, y)$;
- due to the Markov assumption $X \to Y \to Z$, the future state $Z$ depends only on the present state $Y$ and is conditionally independent of the past state $X$; therefore, $p(z \mid x, y) = p(z \mid y)$;
- by the definition of conditional probability, $\frac{p(x)\, p(y \mid x)}{p(y)} = \frac{p(x, y)}{p(y)} = p(x \mid y)$.
Since the joint conditional distribution factorizes into the product of the conditional marginal distributions, $X$ and $Z$ are proven to be conditionally independent given $Y$. Consequently, the mutual information between them given $Y$ is zero:

$$I(X; Z \mid Y) = 0$$
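This factorization can be confirmed numerically as well: for a joint distribution constructed as $p(x)\,p(y \mid x)\,p(z \mid y)$, the conditional $p(x, z \mid y)$ coincides with the product $p(x \mid y)\,p(z \mid y)$ for every $y$. A brief sketch under the same illustrative assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(3)
px   = rng.dirichlet(np.ones(3))
py_x = rng.dirichlet(np.ones(3), size=3)   # p(y|x)
pz_y = rng.dirichlet(np.ones(3), size=3)   # p(z|y)

pxyz = px[:, None, None] * py_x[:, :, None] * pz_y[None, :, :]  # p(x,y,z)
py = pxyz.sum(axis=(0, 2))                                      # p(y)

for y in range(3):
    pxz_y = pxyz[:, y, :] / py[y]                  # p(x,z|y)
    px_y = pxz_y.sum(axis=1, keepdims=True)        # p(x|y)
    pz_given_y = pxz_y.sum(axis=0, keepdims=True)  # p(z|y)
    # Markov chain => the joint conditional factorizes into the product:
    assert np.allclose(pxz_y, px_y @ pz_given_y)
```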
Interpretation
Important
By processing the output $Y$, the information about $X$ can only be reduced: in particular, for any function $g$, the chain $X \to Y \to g(Y)$ holds, hence $I(X; g(Y)) \leq I(X; Y)$.
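For instance, when $Z = g(Y)$ is a deterministic coarse-graining that merges symbols of $Y$, the chain $X \to Y \to Z$ holds automatically and the inequality applies. A hypothetical sketch (the map $g$ and the distributions are arbitrary illustrations):

```python
import numpy as np

def mi(pab):
    """I(A;B) in nats from a joint table pab[a, b]."""
    pa, pb = pab.sum(1, keepdims=True), pab.sum(0, keepdims=True)
    m = pab > 0
    return float(np.sum(pab[m] * np.log(pab[m] / (pa @ pb)[m])))

rng = np.random.default_rng(4)
px   = rng.dirichlet(np.ones(3))
py_x = rng.dirichlet(np.ones(4), size=3)   # p(y|x), |Y| = 4
pxy  = px[:, None] * py_x                  # joint p(x,y)

g = np.array([0, 0, 1, 1])                 # g merges y in {0,1} and in {2,3}
pxz = np.zeros((3, 2))                     # joint p(x, z) with z = g(y)
for y in range(4):
    pxz[:, g[y]] += pxy[:, y]

print(f"I(X;Y) = {mi(pxy):.4f} >= I(X;g(Y)) = {mi(pxz):.4f}")
assert mi(pxy) >= mi(pxz) - 1e-12          # processing cannot gain information
```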