Jensen's inequality

Statement

Let:

$X$ be a random vector in $\mathbb{R}^{n}$ with alphabet $A_{X}$ (i.e. the set of values that $X$ can take)

$C \subseteq \mathbb{R}^{n}$ be a convex set such that $A_{X} \subseteq C$

$f : C \to (- \infty, + \infty]$ be a convex function

Assuming that $E [X]$ exists and that $E [f (X)]$ is well defined (possibly equal to $+ \infty$ ), Jensen’s Inequality states that:

$f (E [X]) \leq E [f (X)]$
If, moreover, $f$ is strictly convex, then equality holds if and only if $X$ is almost surely constant, equivalently if and only if $X = E [X]$ almost surely.

In the remainder, only the scalar case $n = 1$ is considered, so $X$ is a real-valued random variable.
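
As a quick numerical illustration of the statement in the scalar case, the following minimal sketch compares $f (E [X])$ with $E [f (X)]$ for a small discrete distribution. The three-point distribution, the helper `expectation`, and the two convex functions (the square and the exponential) are assumptions chosen only for illustration; they do not come from the text above.

```python
import math

def expectation(values, probs, g=lambda x: x):
    """Return E[g(X)] for a discrete X with the given values and probabilities."""
    return sum(p * g(x) for x, p in zip(values, probs))

# Hypothetical three-point distribution, chosen only for illustration.
values = [-1.0, 0.5, 2.0]
probs = [0.2, 0.5, 0.3]

mean = expectation(values, probs)  # E[X]

for name, f in [("square", lambda x: x * x), ("exp", math.exp)]:
    lhs = f(mean)                         # f(E[X])
    rhs = expectation(values, probs, f)   # E[f(X)]
    print(f"{name}: f(E[X]) = {lhs:.4f}, E[f(X)] = {rhs:.4f}, holds: {lhs <= rhs}")
```

Both functions are strictly convex and the distribution is not constant, so the inequality should be strict in each printed row.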

Note

Although the statement holds for both discrete and continuous random variables, the proof below treats only the discrete case.

Proof (Discrete case)

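Let $M = card (A_{X})$ denote the number of distinct values that the random variable $X$ can assume, taken here to be finite. If $M = 1$ , then $X$ is constant and both sides of the inequality equal $f (X)$ , so there is nothing to prove. For $M \geq 2$ , the proof proceeds by mathematical induction on $M$ .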

1. Base Case ( $M = 2$ )

Assume that $X$ takes only two values, $x$ and $y$ in $C$ , with probabilities $p$ and $1 - p$ respectively.

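By definition, a convex function satisfies $f (\lambda x + (1 - \lambda) y) \leq \lambda f (x) + (1 - \lambda) f (y)$ for all $x, y \in C$ and all $\lambda \in [0, 1]$ : the image of a convex combination is dominated by the convex combination of the images. Taking $\lambda = p$ gives: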

$f (E [X]) = f (p x + (1 - p) y) \leq p f (x) + (1 - p) f (y) = E [f (X)]$

The base case is verified.
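
The base case can also be checked numerically. The sketch below assumes the convex function $f (x) = - \log x$ on $(0, + \infty)$ and two hypothetical values `x` and `y`; it sweeps the probability `p` over a grid and records the gap $E [f (X)] - f (E [X])$ , which should never be negative.

```python
import math

def f(x):
    """A convex function on (0, +inf), used here only as an example."""
    return -math.log(x)

x, y = 0.5, 4.0  # hypothetical values taken by X

gaps = []
for step in range(1, 10):
    p = step / 10                      # P(X = x); P(X = y) = 1 - p
    lhs = f(p * x + (1 - p) * y)       # f(E[X])
    rhs = p * f(x) + (1 - p) * f(y)    # E[f(X)]
    gaps.append(rhs - lhs)

print(all(gap >= 0 for gap in gaps))
```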

2. Induction Hypothesis ( $M = k$ )

Assume the inequality holds true for any discrete random variable taking exactly $M = k$ distinct values.

3. Inductive Step ( $M = k + 1$ )

It must be shown that if the proposition holds for $k$ , it necessarily holds for $k + 1$ . Consider a random variable $X$ defined by the following probability distribution:

Point in $C$	Probability
$x_{1}$	$p_{1}$
$x_{2}$	$p_{2}$
$\dots$	$\dots$
$x_{k}$	$p_{k}$
$x_{k + 1}$	$p_{k + 1}$

The expected value $E [f (X)]$ is given by the sum over all $k + 1$ states:
$E [f (X)] = \sum_{i = 1}^{k + 1} p_{i} f (x_{i})$
Isolate the $(k + 1)$ -th term from the rest of the summation:
$E [f (X)] = \sum_{i = 1}^{k} p_{i} f (x_{i}) + p_{k + 1} f (x_{k + 1})$
To manipulate the summation into a form where the induction hypothesis can be applied, multiply and divide the first term by $(1 - p_{k + 1})$ , bringing the denominator inside the summation:
$E [f (X)] = (1 - p_{k + 1}) \sum_{i = 1}^{k} \frac{p_{i}}{1 - p_{k + 1}} f (x_{i}) + p_{k + 1} f (x_{k + 1})$

Proof of legitimate convex combination coefficients

Before proceeding, it must be verified that the newly formed terms $\frac{p_{i}}{1 - p_{k + 1}}$ constitute a valid probability distribution. They must be non-negative and sum to $1$ to act as legitimate convex combination coefficients.

Non-negativity: Since $p_{i} \geq 0$ and $p_{k + 1} < 1$ (the other $k$ values carry positive total probability), the denominator $1 - p_{k + 1}$ is strictly positive, so each ratio $\frac{p_{i}}{1 - p_{k + 1}}$ is non-negative, and strictly positive whenever $p_{i} > 0$ .

Summation to 1: Summing these coefficients yields:

$\sum_{i = 1}^{k} \frac{p_{i}}{1 - p_{k + 1}} = \frac{1}{1 - p_{k + 1}} \sum_{i = 1}^{k} p_{i}$
Knowing that the total probability must equal 1 ( $\sum_{i = 1}^{k + 1} p_{i} = 1$ ), it follows that the partial sum is $\sum_{i = 1}^{k} p_{i} = 1 - p_{k + 1}$ . Substituting this back provides:
$\frac{1}{1 - p _{k + 1}} (1 - p_{k + 1}) = 1$
Thus, these coefficients represent a valid probability distribution over $k$ elements.
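
The renormalization above can be reproduced with exact rational arithmetic. This is a minimal sketch: the four-point distribution and the names `probs`, `p_last` and `weights` are assumptions made for illustration only.

```python
from fractions import Fraction

# Hypothetical distribution over k + 1 = 4 points, with exact rational probabilities.
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
p_last = probs[-1]  # plays the role of p_{k+1}

# Renormalized weights p_i / (1 - p_{k+1}) for i = 1, ..., k.
weights = [p / (1 - p_last) for p in probs[:-1]]

print(weights)
print(all(w >= 0 for w in weights), sum(weights) == 1)
```

Using `Fraction` rather than floating-point numbers keeps the sum exactly equal to 1, so the check does not depend on rounding.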

Applying the Induction and Concluding

Since the terms inside the summation represent a valid convex combination of $k$ elements, the induction hypothesis ( $M = k$ ) can now be applied to the sum:
$\sum_{i = 1}^{k} \frac{p_{i}}{1 - p_{k + 1}} f (x_{i}) \geq f \left( \sum_{i = 1}^{k} \frac{p_{i}}{1 - p_{k + 1}} x_{i} \right)$
Substituting this inequality back into the main equation yields:
$E [f (X)] \geq (1 - p_{k + 1}) f \left( \sum_{i = 1}^{k} \frac{p_{i}}{1 - p_{k + 1}} x_{i} \right) + p_{k + 1} f (x_{k + 1})$
The resulting expression is a convex combination of exactly two images, weighted by $(1 - p_{k + 1})$ and $p_{k + 1}$ . This reduces the problem back to the base case ( $M = 2$ ). Applying the definition of convexity once more:
$E [f (X)] \geq f \left( (1 - p_{k + 1}) \left[ \sum_{i = 1}^{k} \frac{p_{i}}{1 - p_{k + 1}} x_{i} \right] + p_{k + 1} x_{k + 1} \right)$
Simplifying the term $(1 - p_{k + 1})$ inside the function’s argument gives:
$E [f (X)] \geq f \left( \sum_{i = 1}^{k} p_{i} x_{i} + p_{k + 1} x_{k + 1} \right)$
Recombining the terms inside the argument restores the full expected value $E [X]$ :
$E [f (X)] \geq f \left( \sum_{i = 1}^{k + 1} p_{i} x_{i} \right) = f (E [X])$
This completes the proof by induction.
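
To make the chain of inequalities in the inductive step concrete, the sketch below evaluates each stage for one example: $E [f (X)]$ , the bound obtained after applying the induction hypothesis, and $f (E [X])$ . The choice $f = \exp$ , the four-point distribution, and all variable names are assumptions for illustration, not part of the proof.

```python
import math

f = math.exp  # a convex function, chosen only as an example

# Hypothetical distribution over k + 1 = 4 points.
xs = [-1.0, 0.0, 1.0, 3.0]
ps = [0.1, 0.2, 0.3, 0.4]

p_last, x_last = ps[-1], xs[-1]
qs = [p / (1 - p_last) for p in ps[:-1]]          # renormalized weights

e_fx = sum(p * f(x) for p, x in zip(ps, xs))      # E[f(X)]
inner_mean = sum(q * x for q, x in zip(qs, xs))   # mean of the first k points under qs

# Bound after the induction hypothesis: (1 - p_{k+1}) f(inner_mean) + p_{k+1} f(x_{k+1})
after_hypothesis = (1 - p_last) * f(inner_mean) + p_last * f(x_last)

f_mean = f(sum(p * x for p, x in zip(ps, xs)))    # f(E[X])

print(e_fx >= after_hypothesis >= f_mean)
```

The middle quantity sits between the two ends, mirroring the two uses of convexity in the proof: first the induction hypothesis on $k$ points, then the two-point definition.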
