Definition
Let $X$ be a discrete random variable defined over the alphabet $\mathcal{X}$ and distributed according to the probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$.
The entropy $H(X)$ is defined as:

$$H(X) = -\sum_{\substack{x \in \mathcal{X} \\ p(x) > 0}} p(x) \log p(x) = \mathbb{E}\left[\log \frac{1}{p(X)}\right]$$
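The definition translates directly into a short computation. Below is a minimal Python sketch; the function name `entropy` and the sequence-of-probabilities interface are illustrative choices, not notation from the text.

```python
import math

def entropy(pmf, base=2):
    """Shannon entropy of a PMF given as a sequence of probabilities.
    Terms with p = 0 are skipped, matching the convention discussed below."""
    return sum(-p * math.log(p, base) for p in pmf if p > 0)

# A fair four-sided die carries 2 bits of entropy.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
```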
Mathematical Nuances and Notation
To ensure rigor while maintaining readability, this definition relies on specific conventions:
The Support and Singularity: The condition $p(x) > 0$ implicitly restricts the summation to the support of the distribution, defined as $\operatorname{supp}(p_X) = \{x \in \mathcal{X} : p(x) > 0\}$. This restriction circumvents the mathematical singularity of $\log p(x)$ for impossible events. Alternatively, the summation may cover the entire alphabet $\mathcal{X}$ by adopting the standard information-theoretic convention that $0 \log 0 = 0$ (justified by continuity, as $t \log t \to 0$ when $t \to 0^{+}$).
Dependency on the Distribution: Writing $H(X)$ is technically a convenient abuse of notation. Entropy is not determined by the specific values (labels) of $X$, but exclusively by its probability mass function (PMF). Strictly speaking, entropy is a functional of the distribution itself, making $H(p)$ or $H(p_X)$ the more rigorous notation (this label-independence is illustrated in the sketch following these remarks).
Explicit PMF Notation: The term $p(x)$ is used for brevity. Formally, this should be denoted as $p_X(x)$ to explicitly attribute the probability to the random variable $X$. This distinction becomes crucial when analyzing multiple random variables simultaneously (e.g., in joint entropy).
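These conventions can be checked concretely. The sketch below (with a hypothetical helper `entropy_bits` that accepts an iterable of probabilities) restricts the sum to the support, so an impossible outcome contributes nothing, and shows that relabeling the outcomes leaves the entropy unchanged.

```python
import math

def entropy_bits(probs):
    # Sum restricted to the support (p > 0): terms with p = 0 contribute
    # nothing, which is exactly the 0 log 0 = 0 convention.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Two variables with different labels but the same PMF have the same
# entropy: H depends on the probabilities alone, never on the outcomes.
p_X = {"heads": 0.5, "tails": 0.5}
p_Y = {-1: 0.5, +1: 0.5}
print(entropy_bits(p_X.values()), entropy_bits(p_Y.values()))  # 1.0 1.0

# An impossible outcome in the alphabet does not change the entropy.
p_Z = {"a": 0.5, "b": 0.5, "c": 0.0}
print(entropy_bits(p_Z.values()))  # 1.0
```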
Change of Base and Units
The choice of the logarithmic base $b$ establishes the unit of information (e.g., bits for $b = 2$, nats for $b = e$, bans for $b = 10$). Conversion between these distinct units is facilitated by the logarithmic change-of-base identity:

$$\log_b x = \frac{\log_a x}{\log_a b} = (\log_b a)\,\log_a x$$

By substituting this identity into the definition of entropy and invoking the linearity of the expectation operator, the conversion formula is derived as follows:

$$H_b(X) = \mathbb{E}\left[\log_b \frac{1}{p(X)}\right] = (\log_b a)\,\mathbb{E}\left[\log_a \frac{1}{p(X)}\right] = (\log_b a)\,H_a(X)$$

Thus, entropy expressed in one unit is strictly a scalar multiple of entropy expressed in another (e.g., $H_e(X) = (\ln 2)\,H_2(X)$).
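The scalar relationship between units can be verified numerically. The sketch below, using illustrative helpers `entropy_nats` and `entropy_bits`, computes the same entropy in both units and confirms that they differ only by the factor $\ln 2$.

```python
import math

def entropy_nats(pmf):
    return sum(-p * math.log(p) for p in pmf if p > 0)

def entropy_bits(pmf):
    return sum(-p * math.log2(p) for p in pmf if p > 0)

pmf = [0.5, 0.25, 0.25]
h_nats = entropy_nats(pmf)
h_bits = entropy_bits(pmf)

# H in bits and H in nats differ only by the constant factor ln 2.
print(h_bits)                # 1.5
print(h_nats / math.log(2))  # 1.5 (up to floating-point rounding)
```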
Convention: In accordance with standard information-theoretic practice, the binary logarithm ($\log \equiv \log_2$) is adopted as the default for all subsequent analysis. Therefore, unless explicitly stated otherwise, entropy is quantified in bits.
Non-negativity of Entropy
The entropy of a discrete random variable is always non-negative:

$$H(X) \ge 0$$

This lower bound is established by analyzing the components of the definition:
- Bounded Probabilities: By definition, the probability mass function satisfies $0 < p(x) \le 1$ for all $x$ in the support of $X$.
- Positivity of Self-Information: Consequently, the argument of the logarithm, $\frac{1}{p(x)}$, lies within the interval $[1, \infty)$. Since the logarithm is a monotonically increasing function with $\log 1 = 0$, the variable representing self-information is non-negative: $\log \frac{1}{p(X)} \ge 0$.
- Monotonicity of Expectation: Finally, invoking the property that the expectation of a non-negative random variable is itself non-negative (i.e., if $Y \ge 0$, then $\mathbb{E}[Y] \ge 0$), the result is derived: $H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \ge 0$.
Note
Equality (i.e., $H(X) = 0$) holds if and only if the random variable is deterministic (i.e., there exists an outcome $x_0 \in \mathcal{X}$ such that $p(x_0) = 1$).
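A quick numerical sanity check of both the bound and the equality condition, again using an illustrative `entropy_bits` helper:

```python
import math

def entropy_bits(pmf):
    return sum(-p * math.log2(p) for p in pmf if p > 0)

# Non-negativity: every PMF yields H(X) >= 0.
for pmf in ([0.5, 0.5], [0.9, 0.1], [0.7, 0.2, 0.1]):
    assert entropy_bits(pmf) >= 0

# Equality holds exactly for a deterministic variable.
print(entropy_bits([1.0, 0.0, 0.0]))  # 0.0
print(entropy_bits([0.99, 0.01]))     # ~0.0808 > 0
```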
Example: Entropy of a Bernoulli r.v.
Let $X \sim \mathrm{Bern}(p)$, i.e., $X$ is a Bernoulli random variable with $\Pr\{X = 1\} = p$ and $\Pr\{X = 0\} = 1 - p$. Its entropy is

$$H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

which, viewed as a function of $p$, is the binary entropy function.
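The resulting binary entropy function is easy to tabulate; the sketch below, with an illustrative helper `binary_entropy`, evaluates it at a few values of $p$.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"p = {p:.2f}  H = {binary_entropy(p):.4f} bits")
# Entropy peaks at 1 bit for the fair coin (p = 0.5) and vanishes for
# the deterministic cases p = 0 and p = 1.
```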
