Information Theory Textbook
Meryem Benammar
5 November 2021
Abstract
In this course, we present the various measures of information, defined by C. E. Shannon, which allow us to describe random variables and their possible interactions. This course is a comprehensive study of these distinct measures as well as the theorems entailed by what Shannon called a "mathematical theory of communication". The proofs are only sketched by means of guidelines and are otherwise left to the discretion of the learner.
2 Mutual information
2.1 Discrete scalar case
2.2 Mutual information and bit-rate
2.3 Joint and conditional mutual information
2.4 Continuous case : continuous mutual information
2.5 Vector mutual information
1 Entropy
Let X be a random variable with probability support set X (all values of the support
set are possible). Depending on whether the support set X is discrete or continuous, we can
define two types of entropy : discrete entropy H(X) and differential entropy h(X). These
two entropies share some common characteristics, but differ in many aspects. We start in
the following with the discrete entropy, whilst the differential entropy will be introduced
later in the textbook.
2. Let X be a constant random variable, i.e., P_X(x_0) = 1 for some value x_0 ∈ X. Then X carries no uncertainty and its entropy is

H(X) = 0. (2)
3. Let X be a binary Bernoulli random variable Bern(p), where p = P_X(1). The entropy of X is given by

H(X) = H_2(p) ≜ − p log_2(p) − (1 − p) log_2(1 − p).

H_2(p) is commonly known as the binary entropy function. It is maximal when p = 0.5 (equal to 1 bit), minimal when p = 0 or p = 1 (equal to 0 bits), and symmetric around p = 0.5.
Exercise 1 : Compute the entropy of the three previous examples. Plot the binary entropy function H_2(p) as a function of p (you can use Matlab, for instance).
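As a hint for the plotting part of this exercise, here is a minimal Python sketch (the course suggests Matlab; numpy and matplotlib are used here purely for illustration, and the function name binary_entropy is our own choice). It evaluates H_2(p) on a grid, plots it, and checks the extreme values mentioned above.

import numpy as np
import matplotlib.pyplot as plt

def binary_entropy(p):
    """Binary entropy H2(p) in bits, with the convention 0 * log2(0) = 0."""
    p = np.asarray(p, dtype=float)
    h = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    h[mask] = -p[mask] * np.log2(p[mask]) - (1 - p[mask]) * np.log2(1 - p[mask])
    return h

p = np.linspace(0, 1, 501)
plt.plot(p, binary_entropy(p))
plt.xlabel("p")
plt.ylabel("H2(p) [bits]")
plt.title("Binary entropy function")
plt.show()

# Sanity checks matching the text: maximal at p = 0.5 (1 bit), zero at p = 0 or p = 1.
print(binary_entropy([0.0, 0.5, 1.0]))  # -> [0., 1., 0.]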
It is only natural that, being a measure of uncertainty, the entropy is minimal for variables with little uncertainty (a constant variable in the extreme case), and maximal for very uncertain variables (a uniform variable in the extreme case).
Definition 2 (Joint entropy) Let X and Y be two random variables with respective finite supports X and Y, assumed jointly distributed following P_{X,Y}. The joint entropy of (X, Y) is defined by

H(X, Y) ≜ − Σ_{(x,y) ∈ X × Y} P_{X,Y}(x, y) log_2 (P_{X,Y}(x, y)) .   (6)
The joint entropy describes, similarly to the scalar entropy, the amount of randomness contained in the pair of random variables (X, Y ). It verifies the following properties.
P_{X,Y}(x, y) = P_X(x) P_Y(y) = 1/(|X| |Y|)   for all (x, y) ∈ X × Y   (8)
The fact that the joint entropy is always smaller than the sum of the individual entropies is due to the fact that, when entangled, two random variables exhibit less randomness than they do on their own. This intuition can also be encountered in the thermodynamic notion of entropy.
In the following, we define another measure of information between two random variables, namely, conditional entropy.
Definition 3 (Conditional entropy) Let X and Y be two random variables with finite support sets X and Y and joint pmf P_{X,Y}. The conditional entropy of X knowing Y is defined by :

H(X|Y) ≜ − Σ_{(x,y) ∈ X × Y} P_{X,Y}(x, y) log_2 P_{X|Y}(x|y)   (9)
       = Σ_{y ∈ Y} P_Y(y) H(X|Y = y)   (10)

where H(X|Y = y) is the entropy of the conditional pmf P_{X|Y=y}, and can be written as

H(X|Y = y) ≜ − Σ_{x ∈ X} P_{X|Y}(x|y) log_2 P_{X|Y}(x|y) .   (11)
P_{X|Y}(x|y) = P_X(x) = 1/|X|   for all (x, y) ∈ X × Y   (13)
The conditional entropy H(X|Y) is always smaller than the individual entropy, since, having observed a variable Y, possibly correlated with X, the uncertainty about X cannot be greater than when nothing is observed. We thus say that conditioning decreases entropy, and hence uncertainty.
Properties 4 The individual entropies H(X) and H(Y), the joint entropy H(X, Y), and the conditional entropies H(X|Y) and H(Y|X) can be related as follows

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
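To make definitions (9)-(10) and these relations concrete, here is a short Python sketch (ours, with an arbitrary, illustrative joint pmf). It computes H(X), H(Y), H(X, Y) and H(X|Y), checks that expressions (9) and (10) agree, and verifies that conditioning decreases entropy, the chain rule, and subadditivity.

import numpy as np

# Arbitrary joint pmf P_{X,Y} on a 2 x 3 alphabet (rows: x, columns: y); values are illustrative.
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

def H(p):
    """Entropy in bits of a pmf given as an array with positive entries."""
    return -np.sum(p * np.log2(p))

P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)
H_x, H_y, H_xy = H(P_x), H(P_y), H(P_xy)        # H(X), H(Y), and the joint entropy (6)

# Conditional entropy via (9): direct sum over the joint support.
P_x_given_y = P_xy / P_y                         # columns give P_{X|Y=y}
H_x_given_y = -np.sum(P_xy * np.log2(P_x_given_y))

# Conditional entropy via (10): average of the per-y entropies H(X|Y=y).
H_x_given_y_bis = np.sum(P_y * (-np.sum(P_x_given_y * np.log2(P_x_given_y), axis=0)))

print(np.isclose(H_x_given_y, H_x_given_y_bis))  # (9) and (10) agree
print(H_x_given_y <= H_x)                        # conditioning decreases entropy
print(np.isclose(H_xy, H_y + H_x_given_y))       # chain rule H(X,Y) = H(Y) + H(X|Y)
print(H_xy <= H_x + H_y)                         # joint entropy below sum of individual entropies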
Similarly to the joint entropy of a pair of random variables, the vector entropy verifies a
number of properties, listed hereafter.
In practice, we rarely compute the vector entropy by direct calculation. Rather, we use the following property.
Exercise 7 : Prove this property. Hint : use the chain rule of probabilities
The two ways of writing the vector entropy are called the causal and anti-causal expansions of the vector entropy. Since the joint entropy is invariant to permutations, the causal and anti-causal expressions are just two specific cases of possible joint entropy expansions. Resorting to other permutations, along with the chain rule, would give many more expansions.
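As a concrete illustration (written out here for n = 3; it is not part of the original text), the causal and anti-causal expansions read

H(X_1, X_2, X_3) = H(X_1) + H(X_2 | X_1) + H(X_3 | X_1, X_2)   (causal)
                 = H(X_3) + H(X_2 | X_3) + H(X_1 | X_2, X_3)   (anti-causal)

and both follow from the chain rule of probabilities applied to the orderings (X_1, X_2, X_3) and (X_3, X_2, X_1).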
We say that H(X^n) admits a single-letter expression, and this property is key in Shannon's results, in that it is much easier to compute a scalar entropy H(X) than a vector entropy H(X^n). The quantity H(X^n)/n is often called the entropy rate.
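For instance, assuming the X_i are i.i.d. with common pmf P_X (an assumption made here purely for illustration), the chain rule gives

H(X^n) = Σ_{i=1}^{n} H(X_i | X_1, ..., X_{i−1}) = Σ_{i=1}^{n} H(X_i) = n H(X),

so that the entropy rate H(X^n)/n reduces to the single-letter quantity H(X).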
Similarly to the discrete entropy, the differential entropy measures the amount of randomness and uncertainty pertaining to a random variable. Yet, the properties of these two measures differ considerably.
Properties 7 Let X be a continuous random variable with finite variance V(X). The differential entropy of X, h(X), satisfies the following properties.
Differential joint and conditional entropies, under the assumption that the involved integrals are finite, are defined in the exact same manner. The relationships between these different entropies are maintained, as well as the definitions in the vector case and the causal/anti-causal expansions. The following properties are thus satisfied.
2. Let (X_1, ..., X_n) be n continuous random variables. The vector differential entropy satisfies

h(X^n) = h(X_1, ..., X_n) = Σ_{i=1}^{n} h(X_i | X_1, ..., X_{i−1})   (31)
                          = Σ_{i=1}^{n} h(X_i | X_{i+1}, ..., X_n)   (32)
3. Let X^n = (X_1, ..., X_n) be a Gaussian random vector with covariance matrix K_{X^n}. Its differential entropy is given by :

h(X^n) = (1/2) log_2 ((2πe)^n |K_{X^n}|)   (33)

where |K_{X^n}| is the determinant of the covariance matrix K_{X^n}. This formula is widely known under the name log-det formula.
When the variables X_i are independent, the covariance matrix is diagonal, hence

K_{X^n} = diag(σ_1², ..., σ_n²)   (34)

and we recover

h(X^n) = Σ_{i=1}^{n} (1/2) log_2 (2πe σ_i²) = Σ_{i=1}^{n} h(X_i)   (35)
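To illustrate the log-det formula, here is a small Python sketch (ours, with arbitrary covariance matrices; the function name gaussian_vector_entropy is our own). It evaluates (33) and checks that, for a diagonal covariance, the result matches the sum of the scalar Gaussian entropies (1/2) log_2(2πe σ_i²), as in (35).

import numpy as np

def gaussian_vector_entropy(K):
    """Differential entropy (in bits) of a Gaussian vector with covariance K, via the log-det formula (33)."""
    n = K.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

# Diagonal case: independent components.
sigmas2 = np.array([1.0, 0.5, 2.0])
h_vector = gaussian_vector_entropy(np.diag(sigmas2))
h_sum = np.sum(0.5 * np.log2(2 * np.pi * np.e * sigmas2))
print(h_vector, h_sum)   # identical, as in (35)

# Correlated case: the vector entropy is strictly smaller than the sum of the marginal entropies.
K_corr = np.array([[1.0, 0.8],
                   [0.8, 1.0]])
print(gaussian_vector_entropy(K_corr),
      2 * 0.5 * np.log2(2 * np.pi * np.e * 1.0))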
2 Mutual information
In this section, we define a measure of information which is crucial to measure the
quantity of information exchanged between two or more random variables, namely, mutual
information.
Mutual information can be readily seen to correspond to the Kullback-Leibler (KL) divergence between P_{X,Y} and the product of the marginal distributions P_X P_Y, i.e.,

I(X; Y) = D_KL (P_{X,Y} || P_X P_Y) ,   (37)

and as such, it exhibits the following characteristics.
Properties 9 The mutual information between X and Y satisfies the following :
— Symmetry : I(X; Y ) = I(Y ; X) for all joint pmf PX,Y
— Minimum : I(X; Y) ≥ 0 with equality iff X and Y are independent, i.e., P_{X,Y} = P_X P_Y
— Maximum : the mutual information is upper bounded by the individual entropies of X and Y, i.e., I(X; Y) ≤ min(H(X), H(Y)), with equality iff X = f(Y) and f is a one-to-one function.
Mutual information can be interpreted as the difference between the initial uncertainty about X and the uncertainty about X which remains after we observe Y. As such, it measures the amount of information inherently shared by X and Y. It can also be seen as a measure of dependence, since it is minimal (equal to 0) when the two variables are independent, and maximal when they are fully dependent (X = f(Y)).
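As an illustration of (37) and of these properties, the following Python sketch (ours, with illustrative pmfs; the function name mutual_information is our own) computes I(X; Y) as the KL divergence between a joint pmf and the product of its marginals, and checks the two extreme cases.

import numpy as np

def mutual_information(P_xy):
    """I(X;Y) in bits, computed as D_KL(P_XY || P_X P_Y) with the convention 0 log 0 = 0."""
    P_x = P_xy.sum(axis=1, keepdims=True)
    P_y = P_xy.sum(axis=0, keepdims=True)
    mask = P_xy > 0
    return np.sum(P_xy[mask] * np.log2(P_xy[mask] / (P_x @ P_y)[mask]))

# Generic joint pmf.
P = np.array([[0.10, 0.20, 0.10],
              [0.25, 0.05, 0.30]])
print(mutual_information(P))

# Independent case: the joint pmf is a product, so I(X;Y) = 0.
P_x = np.array([[0.3], [0.7]])
P_y = np.array([[0.2, 0.5, 0.3]])
print(mutual_information(P_x @ P_y))          # ~0

# One-to-one case Y = f(X): I(X;Y) = H(X).
P_diag = np.diag([0.3, 0.7])
H_x = -(0.3 * np.log2(0.3) + 0.7 * np.log2(0.7))
print(mutual_information(P_diag), H_x)        # equal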
As such, the conditional distribution PY |X of the channel output Y knowing the input
X is given by
where we have used the fact that the random processes N and X are independent.
Having this definition of a channel as a random process, and noticing that the noisier a channel, the less information can transit through it, Shannon sought to characterize the maximum bit-rate (measured in bits/sec/Hz) which can be transmitted through a channel, knowing its probability distribution.
Accordingly, Shannon proved that the maximum bit-rate which can be transmitted through a channel is given by the mutual information between its input X and its output Y, I(X; Y), and corresponds to the number of bits/sec which can be transmitted per unit of frequency.
This mutual information describes the interaction between X and Y knowing that we have observed Z, which is correlated with both.
Mutual information can be expressed in terms of conditional entropies as follows :
We can also define another type of information measure, which describes the quantity of information between X and the pair (Y, Z), as follows.
Definition 8 (Joint mutual information) The mutual information between X and the
pair (Y, Z) is defined by :
I(X; (Y, Z)) = Σ_{(x,y,z) ∈ X × Y × Z} P_{X,Y,Z}(x, y, z) log_2 ( P_{X,Y,Z}(x, y, z) / (P_X(x) P_{Y,Z}(y, z)) )   (48)
This definition follows along the lines of the definition of mutual information, except that it treats the pair of random variables (Y, Z) as one joint variable. A common abuse of notation consists in writing I(X; Y Z) or I(X; Y, Z).
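To illustrate the remark that (Y, Z) is simply treated as one joint variable, here is a short Python sketch (ours, with a randomly drawn joint pmf) that computes I(X; (Y, Z)) from definition (48) by flattening the pair (Y, Z) into a single alphabet of size |Y||Z|.

import numpy as np

# Random joint pmf P_{X,Y,Z} on a 2 x 2 x 3 alphabet (values are illustrative).
P_xyz = np.random.default_rng(0).dirichlet(np.ones(12)).reshape(2, 2, 3)

# Treat the pair (Y, Z) as a single variable with 2 * 3 = 6 outcomes.
P_x_yz = P_xyz.reshape(2, 6)
P_x = P_x_yz.sum(axis=1, keepdims=True)
P_yz = P_x_yz.sum(axis=0, keepdims=True)

# Definition (48), applied to the flattened pair.
I_x_yz = np.sum(P_x_yz * np.log2(P_x_yz / (P_x @ P_yz)))
print(I_x_yz)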
— Maximum : the maximum is no longer given by the individual entropies, since the conditional differential entropy can be negative.
— Undefined in the degenerate case, i.e., I(X; g(X)) is undefined for all deterministic functions g.
However, all other definitions of mutual information, joint and conditional, remain valid, as do the relationships between them. As such, we will not make a distinction between the two types of mutual information (discrete and continuous) and will denote both by I(X; Y).
Y = X + W. (52)
This model describes the so-called Additive White Gaussian Noise (AWGN) channel with input signal X, additive noise W, and output signal Y.
The mutual information between X and Y is given by

I(X; Y) = (1/2) log_2 (1 + P/σ²) .   (53)
This formula is widely known as Shannon's formula for AWGN channels, or the log_2(1 + SNR) formula, and describes the maximum spectral efficiency (measured in bits/sec/Hz) which can be transmitted over such a channel.
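As a quick numerical illustration of (53) (our own sketch; the function name awgn_spectral_efficiency is our choice), the spectral efficiency can be evaluated for a few signal-to-noise ratios; for instance, an SNR of 3 (about 4.8 dB) yields exactly 1 bit/sec/Hz.

import numpy as np

def awgn_spectral_efficiency(P, sigma2):
    """Shannon's formula (53): 0.5 * log2(1 + P / sigma^2), in bits/sec/Hz."""
    return 0.5 * np.log2(1.0 + P / sigma2)

for snr in [0.1, 1.0, 3.0, 10.0, 100.0]:
    print(f"SNR = {snr:6.1f}  ->  {awgn_spectral_efficiency(snr, 1.0):.3f} bits/sec/Hz")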
The chain rule is of crucial importance since it allows us to compute the quantity I(X; Y^n) without having to perform a large marginalization over the vector (X, Y^n).
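For reference, the chain rule invoked here is the standard expansion (restated here, not quoted from the omitted equations)

I(X; Y^n) = Σ_{i=1}^{n} I(X; Y_i | Y_1, ..., Y_{i−1}),

which reduces the vector quantity to a sum of scalar conditional mutual informations.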
Properties 12 (Case of iid processes) Let X^n = (X_1, ..., X_n) and Y^n = (Y_1, ..., Y_n) be two random vectors such that the pairs (X_i, Y_i) are independent of one another, i.e.,

P_{X^n,Y^n}(x^n, y^n) = Π_{i=1}^{n} P_{X_i,Y_i}(x_i, y_i).   (55)
If, moreover, the pairs (X_i, Y_i) are i.i.d. and all follow the same pmf/pdf P_{X,Y}, then

I(X^n; Y^n) = n I(X; Y).

Whether these observations are iid or not, the fraction (1/n) I(X^n; Y^n) is called the information rate.
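A brief sketch of why this single-letter expression holds (our restatement of a standard argument): under the independence assumption (55), the mutual information splits across the pairs, and identical distributions make all terms equal,

I(X^n; Y^n) = Σ_{i=1}^{n} I(X_i; Y_i) = n I(X; Y),   so that   (1/n) I(X^n; Y^n) = I(X; Y).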
Hereafter, we write a simple example based on this property.