Entropy and uncertainty
Let $X$ be a random variable and $H(X)$ its entropy.
The larger the entropy, the more uncertain the random variable.
Example
Guessing the color of a ball drawn from an urn
Consider an urn containing 8 balls with the following distribution of colors. A single ball is drawn, and its color, modeled as a random variable $X$, must be guessed with the smallest number of yes/no questions.
| Color ($X$) | Probability |
|---|---|
| Red | $1/2$ |
| Green | $1/4$ |
| Blue | $1/8$ |
| Yellow | $1/8$ |
Technical detail
Strictly speaking, the mapping used in this example (where outcomes are “Red”, “Green”, etc.) is a categorical variable rather than a formal random variable.
In probability theory, a random variable is defined as a function that maps elements of the sample space $\Omega$ to the real numbers $\mathbb{R}$. Since “colors” are symbolic labels, treating $X$ as a formal r.v. requires a relabeling, where each color is assigned a real number (e.g., Red = 1, Green = 2, and so on).
The entropy of the random variable $X$ is by definition:
$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
Substituting the probabilities from the table:
$$H(X) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{1}{8}\log_2\tfrac{1}{8}\right) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{3}{8} + \tfrac{3}{8} = 1.75 \text{ bits}$$
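As a quick numerical check, here is a minimal Python sketch (assuming the dyadic probabilities reconstructed above) that evaluates the entropy formula directly:

```python
import math

# Urn distribution: 8 balls -> 4 red, 2 green, 1 blue, 1 yellow
p = {"Red": 1/2, "Green": 1/4, "Blue": 1/8, "Yellow": 1/8}

# H(X) = -sum_x p(x) * log2(p(x))
H = -sum(px * math.log2(px) for px in p.values())
print(H)  # 1.75 (bits)
```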
Strategy
To minimize the average number of yes/no questions, the prior information regarding the probabilities of drawing a ball of a certain color should be exploited.
- Question 1: “Is the ball Red?” (Matches 50% of outcomes).
- Question 2: If not Red, “Is it Green?” (Matches 25% of outcomes).
- Question 3: If neither Red nor Green, “Is it Blue?” (Resolves the remaining 25%).
By tailoring the sequence of questions to the distribution, outcomes with higher probabilities are prioritized, thereby shortening the average decision path and minimizing the expected number of questions required.
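A minimal sketch of this strategy as a decision path; the helper `questions_asked` is a hypothetical name and simply counts how many questions the order listed above needs for each color:

```python
# Hypothetical helper: number of yes/no questions the strategy above asks for each color
def questions_asked(color):
    for n, guess in enumerate(["Red", "Green", "Blue"], start=1):
        if color == guess:
            return n          # a "yes" answer identifies the color after n questions
    return 3                  # three "no" answers leave only "Yellow"

for color in ["Red", "Green", "Blue", "Yellow"]:
    print(color, questions_asked(color))  # Red 1, Green 2, Blue 3, Yellow 3
```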

Let $N$ denote the random variable representing the number of yes/no questions asked.
Why is $N$ a r.v.?
$N$ is a random variable because it is a function of the r.v. $X$, the color of the ball drawn from the urn (i.e., the specific number of questions is uniquely determined by the outcome of the draw). Therefore $N$ represents a deterministic mapping from the sample space to the set of real numbers, and so it is a random variable.
| # of asked questions ($N$) | Probability |
|---|---|
| 1 (if red) | $1/2$ |
| 2 (if green) | $1/4$ |
| 3 (if blue or yellow) | $1/4$ |
The average number of questions required to identify the color is calculated by applying the definition of expectation of the discrete random variable $N$:
$$E[N] = \sum_{n} n \, P(N = n) = 1 \cdot \tfrac{1}{2} + 2 \cdot \tfrac{1}{4} + 3 \cdot \tfrac{1}{4} = 1.75$$
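The same expectation, computed numerically under the distribution of $N$ given in the table:

```python
# Distribution of N (number of questions) from the table above
p_N = {1: 1/2, 2: 1/4, 3: 1/4}

# E[N] = sum_n n * P(N = n)
E_N = sum(n * p for n, p in p_N.items())
print(E_N)  # 1.75, equal to H(X) for this dyadic distribution
```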
Important
In general, the entropy of a r.v. $X$ is approximately equal to the average number of binary questions (yes/no questions; that is why the entropy is measured in bits) necessary to guess it. Therefore:
$$H(X) \approx \text{average number of yes/no questions needed to guess } X$$
Note
It should be noted that, while in this specific example $H(X)$ is exactly equal to the average number of questions needed to guess $X$, in the general case it can be proven that the entropy is the theoretical lower bound for this value.
Entropy and information
Let $X$ be a random variable and $H(X)$ its entropy.
The larger the entropy, the more informative the random variable.
Example
Storing the daily weather report
The daily weather report on a mountain must be stored on a device; only sunny, cloudy, rainy, and snowy are of interest. From previous measurements the weather is sunny $1/2$ of the time, cloudy $1/4$, rainy $1/8$ and snowy $1/8$. The goal is to use, on average, the smallest number of bits to store this information.
Let $X$ be the daily weather situation.
Technical detail
Strictly speaking, $X$ is not a random variable, since outcomes like “Sunny” or “Cloudy” are not numerical. However, it can be treated as such through a simple relabeling into real numbers (e.g., Sunny = 1, Cloudy = 2, and so on).
| Daily weather ($X$) | Probability |
|---|---|
| Sunny | $1/2$ |
| Cloudy | $1/4$ |
| Rainy | $1/8$ |
| Snowy | $1/8$ |
The entropy of the random variable $X$ is:
$$H(X) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{1}{8}\log_2\tfrac{1}{8}\right) = 1.75 \text{ bits}$$
It can be proven that the best binary encoding is:
| Value ($X$) | Codeword |
|---|---|
| sunny | 0 |
| cloudy | 10 |
| rainy | 110 |
| snowy | 111 |
Why is this encoding strategy optimal?
The strategy employs Huffman coding, which is proven to be optimal: it solves a constrained optimization problem, minimizing the average codeword length (equivalent to the expected number of questions), subject to the constraint that the code is prefix-free, i.e., no codeword is the prefix of another. In this framework, the length of each binary codeword corresponds exactly to the number of yes/no questions required to identify the outcome along the decision path.
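A minimal Huffman construction sketch using Python's `heapq` (the function name `huffman` and the bookkeeping by symbol lists are illustrative choices, not a library API); for the assumed weather probabilities it reproduces the codeword lengths above, possibly with a different but equally optimal 0/1 labeling:

```python
import heapq
from itertools import count

def huffman(probs):
    """Return a prefix-free code (symbol -> codeword) built by Huffman's algorithm."""
    tiebreak = count()  # unique counter so equal probabilities never compare the symbol lists
    heap = [(p, next(tiebreak), [sym]) for sym, p in probs.items()]
    codes = {sym: "" for sym in probs}
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, syms0 = heapq.heappop(heap)  # merge the two least probable groups
        p1, _, syms1 = heapq.heappop(heap)
        for s in syms0:
            codes[s] = "0" + codes[s]
        for s in syms1:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (p0 + p1, next(tiebreak), syms0 + syms1))
    return codes

weather = {"sunny": 1/2, "cloudy": 1/4, "rainy": 1/8, "snowy": 1/8}
print(huffman(weather))  # e.g. {'sunny': '0', 'cloudy': '10', 'rainy': '110', 'snowy': '111'}
```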
Let $L$ denote the random variable representing the number of bits used to encode the daily weather situation.
| # of used bits ($L$) | Probability |
|---|---|
| 1 (if it’s sunny) | $1/2$ |
| 2 (if it’s cloudy) | $1/4$ |
| 3 (if it’s rainy or snowy) | $1/4$ |
The average number of bits used to store this information is calculated by applying the definition of expectation of the discrete random variable $L$:
$$E[L] = \sum_{\ell} \ell \, P(L = \ell) = 1 \cdot \tfrac{1}{2} + 2 \cdot \tfrac{1}{4} + 3 \cdot \tfrac{1}{4} = 1.75 \text{ bits}$$
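The same average, computed from the codeword lengths and probabilities in the table:

```python
# Codeword length and probability for each weather outcome (sunny, cloudy, rainy, snowy)
length_and_prob = [(1, 1/2), (2, 1/4), (3, 1/8), (3, 1/8)]

# Average number of bits per stored report
avg_bits = sum(length * p for length, p in length_and_prob)
print(avg_bits)  # 1.75 bits, equal to H(X) for this dyadic distribution
```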
Important
In general, the entropy of a r.v. $X$ is approximately equal to the average number of bits (that is why it is measured in bits) necessary to describe/represent it. Therefore:
$$H(X) \approx \text{average number of bits needed to represent } X$$
Note
It should be noted that, while in this specific example $H(X)$ is exactly equal to the average number of bits needed to represent $X$, in the general case it can be proven that the entropy is the theoretical lower bound for this value.
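To see the gap, here is a small hypothetical example with non-dyadic probabilities $0.5, 0.3, 0.2$, for which Huffman coding yields codeword lengths $1, 2, 2$:

```python
import math

p = [0.5, 0.3, 0.2]     # non-dyadic probabilities (hypothetical example)
lengths = [1, 2, 2]     # optimal (Huffman) codeword lengths for this distribution

H = -sum(px * math.log2(px) for px in p)            # ~1.485 bits
avg_len = sum(l * px for l, px in zip(lengths, p))  # 1.5 bits

print(H, avg_len)  # H(X) <= average length, with equality only for dyadic probabilities
```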
Entropy-uncertainty-information
Let $X$ be a random variable and $H(X)$ its entropy.
It follows that:
$$\text{uncertainty about } X \;\approx\; \text{information carried by } X \;\approx\; H(X)$$
Important