
Introduction to Information Theory

Meryem Benammar

5 November 2021

Abstract
In this course, we present the various measures of information, defined by C.E.
Shannon, which allow us to describe random variables and their possible interactions.
This course is a comprehensive study of these distinct measures, as well as the theorems
that follow from what Shannon called a "mathematical theory of communication". The
proofs are not given in full : guidelines are provided, and the details are left to the
discretion of the learner.

Contents


1 Entropy
  1.1 Discrete scalar case : discrete entropy
  1.2 Entropy and lossless compression
  1.3 Joint and conditional entropy
  1.4 Vector case : vector entropy
  1.5 Continuous case : differential entropy

2 Mutual information
  2.1 Discrete scalar case
  2.2 Mutual information and bit-rate
  2.3 Joint and conditional mutual information
  2.4 Continuous case : continuous mutual information
  2.5 Vector mutual information

1 Entropy
Let X be a random variable with probability support set X (all values of the support
set are possible). Depending on whether the support set X is discrete or continuous, we can
define two types of entropy : discrete entropy H(X) and differential entropy h(X). These
two entropies share some common characteristics, but differ in many aspects. We start in
the following with the discrete entropy, whilst the differential entropy will be introduced
later in the textbook.

1.1 Discrete scalar case : discrete entropy


Let us assume that the random variable X is discrete-valued, and that its support set
X is finite with cardinality |X |. Let the probability mass function (pmf) of X be denoted
by PX (·).

Definition 1 (Discrete entropy) The entropy of X is defined by :
\[
H(X) \triangleq - \sum_{x \in \mathcal{X}} P_X(x) \log_2\left(P_X(x)\right) \tag{1}
\]
where log2 is the logarithm in base 2.


The entropy of a variable X is a measure related only to the values of the pmf PX and
not to the values of X itself. We thus say that the entropy does not carry any semantic
information about X, but measures only the quantity of information contained in X. This
information is due to the randomness, or equivalently the uncertainty, of the random
variable X. When defined with the base-2 logarithm, the entropy is measured in bits ;
when defined with the natural logarithm (ln), it is measured in nats.

Examples 1 Here are a few special cases of entropy :


1. The entropy of a constant random variable, X = c with probability 1, is given by

H(X) = 0. (2)

2. The entropy of a random variable X uniformly distributed over X , i.e., PX (x) = 1/|X |
for all x ∈ X , is given by

H(X) = log2 (|X |). (3)

3. Let X be a binary Bernoulli random variable Bern(p) where p = PX (1). The entropy
of X is given by

H(X) = H2 (p) ≜ −p log2 (p) − (1 − p) log2 (1 − p). (4)

H2 (p) is commonly known as the binary entropy function. It is maximal when p = 0.5
(it equals 1 bit), and minimal when p = 0 or p = 1 (it equals 0 bits). It is symmetric
around p = 0.5.

Exercise 1 : Compute the entropy of the three previous examples. Plot the
binary entropy function H2 (p) as a function of p (you can use Matlab for
instance)
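As a companion to Exercise 1, here is a minimal Python sketch (the exercise suggests Matlab ; any numerical tool works) that computes the entropies of the three examples above and plots the binary entropy function. The pmfs are those of the examples ; the helper name entropy is ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def entropy(pmf):
    """Discrete entropy in bits; 0 * log2(0) is taken as 0 by convention."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Example 1: constant random variable (no uncertainty)
print(entropy([1.0]))                  # 0 bits

# Example 2: uniform distribution over an alphabet of size 8
print(entropy(np.ones(8) / 8))         # log2(8) = 3.0 bits

# Example 3: binary entropy function H2(p)
p = np.linspace(1e-6, 1 - 1e-6, 500)
H2 = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
plt.plot(p, H2)
plt.xlabel("p"); plt.ylabel("H2(p) [bits]")
plt.title("Binary entropy function")
plt.show()
```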

In the following, we list some basic properties of discrete entropy.

Properties 1 Discrete entropy H(X) satisfies the following properties :


1. Minimum : H(X) ≥ 0 for all discrete distributions PX , and the minimum is achieved
for a degenerate distribution, i.e., X is a constant.
2. Maximum : H(X) ≤ log2 (|X |) for all discrete distributions PX , and the maximum
is achieved for the uniform distribution over X .
3. Data Processing Inequality (DPI) : the entropy of a function f of X is no greater
than the entropy of X, i.e.,

H(f (X)) ≤ H(X) (5)

with equality iff f is a one-to-one function.

Exercise 2 : Prove the second property (maximal value of entropy). Hint :
use a Lagrangian in order to optimize over PX , or use Jensen's inequality
(concavity of the logarithm).

It is only natural that, being a measure of uncertainty, the entropy is minimal for those
variables with little uncertainty (a constant variable in the extreme case), and maximal
for very uncertain variables (the uniform one in the extreme case).

1.2 Entropy and lossless compression


Let us consider a random source (random vector) which produces symbols xn =
(x1 , . . . , xn ) all drawn iid following a given probability distribution PX . The symbols could
belong to any alphabet X : bits X = {0, 1}, DNA nucleotides X = {A, T, G, C}, text
characters X = {a, b, c, d, ...}, pixels X = [0 : 255]. In order for these symbols to be stored
in a storage device, they each need to be mapped to a binary word.
If no compression is applied, then, in order to store every possible realization of the
alphabet X , one would need log2 (|X |) bits for each symbol. However, by doing this, we
do not exploit our prior knowledge of these symbols : some symbols are more frequent
than others, and hence, it would be clever to use binary words of small length for the more
frequent symbols. Conversely, for symbols with very low probability of occurrence, one can
afford to use longer binary words. The resulting average binary word length would then
hopefully be smaller than log2 (|X |) bits/symbol. This is the basic idea behind lossless
compression. It is called lossless since it does not destroy information in the source, and
only uses the word lengths more efficiently.
Without having to explicitly derive a lossless compression scheme, Shannon proved
that one can use as few as H(X) bits/symbol to losslessly compress a source, and that if
one goes below this number, then there will be losses in the information.
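To make the gain concrete, here is a minimal Python sketch comparing the fixed-length cost log2 (|X |) with the entropy limit H(X) for a hypothetical skewed DNA source ; the pmf values are illustrative assumptions, not taken from the course.

```python
import numpy as np

def entropy(pmf):
    """Discrete entropy in bits."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical skewed source over the DNA alphabet {A, T, G, C}
pmf = [0.5, 0.25, 0.15, 0.10]

fixed_length = np.log2(len(pmf))   # 2 bits/symbol without compression
print(f"Fixed-length cost : {fixed_length:.3f} bits/symbol")
print(f"Entropy H(X)      : {entropy(pmf):.3f} bits/symbol")  # ~1.74, the lossless limit
```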

1.3 Joint and conditional entropy


Now that we have defined the entropy of a random variable, we will introduce the joint
entropy of a pair of random variables.

Definition 2 (Joint entropy) Let X and Y be two random variables with respective
finite supports X and Y, jointly distributed following PX,Y . The joint entropy of (X, Y )
is defined by
\[
H(X, Y) \triangleq - \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \log_2\left(P_{X,Y}(x, y)\right). \tag{6}
\]

The joint entropy describes, similarly to the scalar entropy, the amount of randomness
contained in the pair of random variables (X, Y ). It satisfies the following properties.

Properties 2 The joint entropy H(X, Y ) satisfies the following :


1. Symmetry : H(X, Y ) = H(Y, X) for all joint distributions PX,Y
2. Minimum : H(X, Y ) ≥ 0 for all PX,Y , and the minimum is achieved when (X, Y ) is
a pair of constants.
3. Upper bound :

H(X, Y ) ≤ H(X) + H(Y ) (7)

with equality iff X and Y are independent, i.e., PX,Y = PX PY .
4. Maximum : H(X, Y ) ≤ log2 (|X |) + log2 (|Y|) for all PX,Y , and the maximum is
achieved for a pair of independent uniform random variables (X, Y ), i.e.,
\[
P_{X,Y}(x, y) = P_X(x) P_Y(y) = \frac{1}{|\mathcal{X}|} \, \frac{1}{|\mathcal{Y}|} \quad \text{for all } (x, y) \in \mathcal{X} \times \mathcal{Y} \tag{8}
\]

Exercise 3 : Prove property 4 (maximum of the joint entropy). Hint :
use property 3 to upper bound the joint entropy.

The fact that the joint entropy is no greater than the sum of the individual entropies
is due to the fact that, when dependent, two random variables exhibit less randomness
together than they do on their own. A similar intuition can be encountered in the
thermodynamic notion of entropy.
In the following, we define another measure of information induced between two random
variables, namely, conditional entropy.

Definition 3 (Conditional entropy) Let X and Y be two random variables with finite
support sets X and Y and joint pmf PX,Y . The conditional entropy of X knowing Y
is defined by :
\begin{align}
H(X|Y) &\triangleq - \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \log_2\left(P_{X|Y}(x|y)\right) \tag{9}\\
&= \sum_{y \in \mathcal{Y}} P_Y(y) \, H(X|Y = y) \tag{10}
\end{align}
where H(X|Y = y) is the entropy of the conditional pmf PX|Y =y , and can be written as
\[
H(X|Y = y) \triangleq - \sum_{x \in \mathcal{X}} P_{X|Y}(x|y) \log_2\left(P_{X|Y}(x|y)\right). \tag{11}
\]

Conditional entropy H(X|Y ) measures the amount of uncertainty on X remaining
after having observed Y . It is thus a first measure of the correlation between X and Y .

Properties 3 Conditional entropy H(X|Y ) satisfies the following properties


1. Asymmetry : H(X|Y ) ̸= H(Y |X) in general
2. Minimum : H(X|Y ) ≥ 0 for all PX,Y , and the minimum is achieved when PX|Y is
degenerate, i.e., X is a function of Y .
3. Upper bound : conditional entropy is never greater than the individual entropy

H(X|Y ) ≤ H(X) (12)

with equality iff X and Y are independent : PX,Y = PX PY .

4. Maximum : H(X|Y ) ≤ log2 (|X |) for all PX,Y , and the maximum is achieved when
X and Y are independent, and X is uniform, i.e.,
\[
P_{X|Y}(x|y) = P_X(x) = \frac{1}{|\mathcal{X}|} \quad \text{for all } (x, y) \in \mathcal{X} \times \mathcal{Y} \tag{13}
\]

Exercise 4 : Prove properties 1, 2, and 4. Prove that if X and Y are
independent, then H(X|Y ) = H(X).

Conditional entropy H(X|Y ) is never greater than the individual entropy since, having
observed a variable Y , possibly correlated with X, the uncertainty about X cannot be
greater than when nothing is observed. We thus say that conditioning decreases entropy,
and hence uncertainty.

Exercise 5 : Compute the conditional entropy H(Y |X) where Y = X ⊕ W ,
X follows a Bern(1/2) distribution, independent from W which follows a
Bern(p), and ⊕ is the binary XOR operation. (Hint : derive PX,Y and PY |X
from the definitions of X and W .)

Let us now relate the different measures of information introduced previously.

Properties 4 The individual entropies H(X) and H(Y ), the joint entropy H(X, Y ), and
the conditional entropies H(X|Y ) and H(Y |X) can be related as follows

H(X, Y ) = H(X) + H(Y |X) (14)
         = H(Y ) + H(X|Y ) (15)

Exercise 6 : Prove the relationships listed above.
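As a sanity check of relations (14)-(15) and of Definition 3, here is a minimal Python sketch on a small hypothetical joint pmf (the 2x2 table below is an illustrative assumption, not taken from the course).

```python
import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array; zero entries are ignored by convention."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint pmf P_{X,Y}: rows indexed by x, columns by y
P_XY = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
P_X = P_XY.sum(axis=1)   # marginal of X
P_Y = P_XY.sum(axis=0)   # marginal of Y

# Conditional entropies from eq. (10): H(X|Y) = sum_y P_Y(y) H(X|Y=y), and symmetrically
H_X_given_Y = sum(P_Y[y] * H(P_XY[:, y] / P_Y[y]) for y in range(2))
H_Y_given_X = sum(P_X[x] * H(P_XY[x, :] / P_X[x]) for x in range(2))

# Chain rule (14)-(15): both expansions equal the joint entropy
print(np.isclose(H(P_XY), H(P_X) + H_Y_given_X))
print(np.isclose(H(P_XY), H(P_Y) + H_X_given_Y))
print(H_X_given_Y <= H(P_X))   # conditioning decreases entropy, eq. (12)
```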



1.4 Vector case : vector entropy


In communication systems, and consequently in Shannon's theory as well, the random
variables we are dealing with are random processes (vectors of random variables) in which
the dimension of time or frequency is taken into account. To this end, we need to define
information measures in this vector case.

Definition 4 (Vector entropy) Let X n = (X1 , ..., Xn ) be a collection of random
variables with support set X n = X1 × ... × Xn and joint pmf PX n = PX1 ,...,Xn . The
vector joint entropy is defined as
\begin{align}
H(X^n) = H(X_1, \ldots, X_n) &= - \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n) \log_2\left(P_{X^n}(x^n)\right) \tag{16}\\
&= - \sum_{(x_1, \ldots, x_n) \in \mathcal{X}^n} P_{X_1,\ldots,X_n}(x_1, \ldots, x_n) \log_2\left(P_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\right) \tag{17}
\end{align}

Similarly to the joint entropy of a pair of random variables, the vector entropy satisfies a
number of properties, listed hereafter.

Properties 5 The vector entropy satisfies the following :

— Symmetry : H(X n ) = H(Π(X n )) for all permutations Π(·) of the indices [1 : n]
— Minimum : H(X n ) ≥ 0 for all joint pmfs PX n , and the minimum is achieved when
(X1 , ..., Xn ) is a vector of constants.
— Upper bound : the joint entropy is no greater than the sum of the individual entropies
\[
H(X^n) \leq \sum_{i=1}^{n} H(X_i) \tag{18}
\]
with equality iff all the Xi are independent.

In practice, we seldom compute the vector entropy by direct calculation. Rather, we
use the following property.

Properties 6 The vector entropy H(X n ) can be written as
\begin{align}
H(X^n) = H(X_1, \ldots, X_n) &= \sum_{i=1}^{n} H(X_i | X_1, \ldots, X_{i-1}) \tag{19}\\
&= \sum_{i=1}^{n} H(X_i | X_{i+1}, \ldots, X_n). \tag{20}
\end{align}

Exercise 7 : Prove this property. Hint : use the chain rule of probabilities

The two ways of writing the vector entropy are called the causal and anti-causal expansions
of the vector entropy. Since the joint entropy is invariant to permutations, the causal and
anti-causal expressions are just two specific cases of possible joint entropy expansions.
Resorting to other permutations, along with the chain rule, yields many more expansions.

Example 1 (Vector of iid random variables) Let X n be a vector of n iid random
variables (X1 , ..., Xn ), all with the same support set X . We have that :

H(X n ) = nH(X). (21)

We say that H(X n ) admits a single-letter expression, and this property is key in Shannon's
results, in that it is much easier to compute a scalar entropy H(X) than a vector entropy
H(X n ). The quantity H(X n )/n is often called the entropy rate.
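A minimal Python sketch checking the single-letter expression (21) on a small hypothetical pmf, by building the product pmf of n iid copies explicitly (the pmf and block length are illustrative assumptions).

```python
import numpy as np
from itertools import product

def H(p):
    """Discrete entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical scalar pmf and a small block length n
pmf = np.array([0.7, 0.2, 0.1])
n = 3

# Product pmf of n iid copies: P(x1, ..., xn) = P(x1) * ... * P(xn)
joint = np.array([np.prod(pmf[list(idx)]) for idx in product(range(len(pmf)), repeat=n)])

# H(X^n) = n H(X), so the entropy rate H(X^n)/n equals H(X)
print(H(joint), n * H(pmf))
```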

1.5 Continuous case : differential entropy


Let us assume in the following that the random variable X is continuous with support
X (often an interval in R or a convex region in C). Let fX (·) be the pdf of X. Let us define
the differential entropy of X.

Definition 5 (Differential entropy) The differential entropy of X is given by
\[
h(X) = - \int_{x \in \mathcal{X}} f_X(x) \log_2\left(f_X(x)\right) \, dx \tag{22}
\]
assuming that the integral exists.

The differential entropy is denoted by h(X), contrary to the discrete entropy
which is denoted H(X). This is to highlight their intrinsic differences.

Similarly to the discrete entropy, differential entropy measures the amount of randomness
and uncertainty pertaining to a random variable. Yet, the properties of these two measures
differ considerably.

Example 2 Hereafter, we give a few examples of differential entropy, prior to discussing
its properties.
1. The differential entropy of a real-valued Gaussian random variable XG ∼ N (µ, σ 2 )
is given by
\[
h(X_G) = \frac{1}{2} \log_2\left(2\pi e \sigma^2\right) \tag{23}
\]
2. The entropy of a circular complex-valued Gaussian variable XCG ∼ CN (µ, σ 2 ) is
given by :
\[
h(X_{CG}) = \log_2\left(2\pi e \sigma^2\right) \tag{24}
\]

Exercise 8 : Prove that the differential entropy of a real-valued Gaussian
variable is as stated in the first item of Example 2.
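A minimal Python sketch, assuming a variance chosen for illustration, that checks formula (23) by Monte Carlo : since h(X) = E[−log2 fX (X)], a sample average of −log2 fX over draws of X should approach the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0                                   # illustrative variance
x = rng.normal(0.0, np.sqrt(sigma2), size=1_000_000)

# log2 of the Gaussian pdf evaluated at the samples
log2_pdf = -0.5 * np.log2(2 * np.pi * sigma2) - (x**2 / (2 * sigma2)) * np.log2(np.e)

h_monte_carlo = np.mean(-log2_pdf)             # estimate of h(X) = E[-log2 f_X(X)]
h_closed_form = 0.5 * np.log2(2 * np.pi * np.e * sigma2)   # eq. (23)
print(h_monte_carlo, h_closed_form)            # the two agree to a few decimals
```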

Properties 7 Let X be a continuous random variable with finite variance V(X). The
differential entropy of X, h(X), satisfies the following properties.

— Maximum : it is maximal for a Gaussian distribution with the same variance as X,
i.e., if X is real-valued,
\[
h(X) \leq \frac{1}{2} \log_2\left(2\pi e \, V(X)\right) \tag{25}
\]
and if X is complex-valued with covariance matrix KX ,
\[
h(X) \leq \frac{1}{2} \log_2\left((2\pi e)^2 |K_X|\right) \tag{26}
\]
where |KX | is the determinant of KX .
— Differential entropy is not necessarily positive : for instance, if the variance satisfies
V(X) ≤ 1/(2πe), then h(X) ≤ 0.
— The data processing inequality does not always hold. For example, if X is Gaussian
with variance σ 2 , then 2X has variance 4σ 2 , and hence h(2X) ≥ h(X).

Differential joint and conditional entropies, under the assumption that the integrals are
finite, are defined in the exact same manner. The relationships between these different
entropies are maintained, as are the definitions in the vector case and the causal/anti-causal
expansions. The following properties are thus satisfied.

Properties 8 1. Let (X, Y ) be a pair of continuous random variables. Let h(X, Y ) be
their joint differential entropy, and h(X|Y ) and h(Y |X) their conditional differential
entropies. We have that

h(X, Y ) = h(X) + h(Y |X) = h(Y ) + h(X|Y ) (27)

h(Y |X) ≤ h(Y ) (28)
h(X|Y ) ≤ h(X) (29)
h(X, Y ) ≤ h(X) + h(Y ) (30)

2. Let (X1 , ..., Xn ) be n continuous random variables. The vector differential entropy
satisfies
\begin{align}
h(X^n) = h(X_1, \ldots, X_n) &= \sum_{i=1}^{n} h(X_i | X_1, \ldots, X_{i-1}) \tag{31}\\
&= \sum_{i=1}^{n} h(X_i | X_{i+1}, \ldots, X_n) \tag{32}
\end{align}

Hereafter, we give an example of a differential entropy calculation that is very common
in communication systems.

Example 3 (log-det formula) Let X n = (X1 , ..., Xn ) be n jointly Gaussian continuous
random variables with covariance matrix KX n . The differential entropy of X n is given by :
\[
h(X^n) = \frac{1}{2} \log_2\left((2\pi e)^n |K_{X^n}|\right) \tag{33}
\]
where |KX n | is the determinant of the covariance matrix KX n . This formula is widely
known under the name of the log-det formula.

When the variables Xi are independent, the covariance matrix is diagonal, and hence
\[
K_{X^n} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2) \tag{34}
\]
and we recover
\[
h(X^n) = \sum_{i=1}^{n} \frac{1}{2} \log_2\left(2\pi e \, \sigma_i^2\right) = \sum_{i=1}^{n} h(X_i) \tag{35}
\]
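A minimal Python sketch of the log-det formula (33), with an illustrative 2x2 covariance matrix, and a check that a diagonal covariance recovers the sum of scalar entropies as in (35). The helper name logdet_entropy is ours.

```python
import numpy as np

def logdet_entropy(K):
    """Differential entropy in bits of a real jointly Gaussian vector with covariance K, eq. (33)."""
    n = K.shape[0]
    sign, logdet = np.linalg.slogdet(K)     # numerically stable log-determinant (natural log)
    return 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))

# Illustrative 2x2 covariance matrix with correlation
K = np.array([[2.0, 0.8],
              [0.8, 1.0]])
print(logdet_entropy(K))

# Independent case: a diagonal covariance recovers the sum of scalar entropies, eq. (35)
sig2 = np.array([2.0, 1.0])
print(logdet_entropy(np.diag(sig2)), np.sum(0.5 * np.log2(2 * np.pi * np.e * sig2)))
```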

Conclusions 1 At the end of this section, you should be able to :


• Define and list the properties of discrete entropy
• Define joint entropy, and conditional entropy
• Link entropy, joint entropy, and conditional entropy
• Compute entropy for simple examples
• List the difference between discrete and differential entropy
• Apply the chain rule to vector entropies

2 Mutual information
In this section, we define a measure of information which is crucial to measure the
quantity of information exchanged between two or more random variables, namely, mutual
information.

2.1 Discrete scalar case


Let X and Y be two discrete random variables with joint pmf PX,Y .

Definition 6 (Mutual information) The mutual information between X and Y is
defined by
\[
I(X; Y) \triangleq \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \log_2\left(\frac{P_{X,Y}(x, y)}{P_X(x) P_Y(y)}\right) \tag{36}
\]
where PX and PY are the marginal pmfs associated with PX,Y .

Mutual information can be readily seen to correspond to the Kullback-Leibler (KL)
divergence between PX,Y and the product of the marginal distributions PX PY , i.e.,

I(X; Y ) = DKL (PX,Y ||PX PY ) , (37)

and as such, it inherits several of its characteristics.
Properties 9 The mutual information between X and Y satisfies the following :
— Symmetry : I(X; Y ) = I(Y ; X) for all joint pmfs PX,Y
— Minimum : I(X; Y ) ≥ 0 with equality iff X and Y are independent, i.e., PX,Y =
PX PY
— Maximum : mutual information is upper bounded by the individual entropies of X and
Y , i.e., I(X; Y ) ≤ min(H(X), H(Y )), with equality iff X = f (Y ) and f is a one-to-one
function.

Exercise 9 : Prove that mutual information is non-negative. Hint : the proof
is similar to the proof of non-negativity of the KL divergence.
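A minimal Python sketch computing I(X; Y ) from a small hypothetical joint pmf, both directly from the KL form (36)-(37) and via the entropy identity I(X; Y ) = H(X) + H(Y ) − H(X, Y ) stated just below (eq. 40) ; the table values are illustrative assumptions.

```python
import numpy as np

def H(p):
    """Discrete entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint pmf P_{X,Y} (rows: x, columns: y), all entries positive
P_XY = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)

# Definition (36)-(37): I(X;Y) = D_KL(P_XY || P_X P_Y)
I_kl = np.sum(P_XY * np.log2(P_XY / np.outer(P_X, P_Y)))

# Entropy identity (40): I(X;Y) = H(X) + H(Y) - H(X,Y)
I_ent = H(P_X) + H(P_Y) - H(P_XY)

print(I_kl, I_ent)   # both coincide and are non-negative
```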

Mutual information can be related in different ways to entropy measures, as follows :

I(X; Y ) = H(X) − H(X|Y ) (38)
         = H(Y ) − H(Y |X) (39)
         = H(X) + H(Y ) − H(X, Y ). (40)

Exercise 10 : Prove the different identities of mutual information.

Mutual information can be interpreted as the difference between the initial uncertainty
about X and the uncertainty about X which remains after we observe Y . As such, it
measures the amount of information inherently shared by X and Y . It is also, in a sense,
a measure of dependence : if two variables are independent, it is minimal (equal to 0),
and if they are fully correlated (X = f (Y )), it is maximal.

2.2 Mutual information and bit-rate


When transmitting information (an image, sound, text, ...) over a physical medium,
the signal is affected by possible distortions : non-linearities, delays, Doppler shifts, and
most importantly, noise.
Noise is generally due to an underlying random physical process, which can be modeled
by a probability distribution PN . We will often assume that the noise is additive, and
independent of the useful signal, say X. Thus, a random signal with distribution PX
transiting through an additive channel with noise distribution PN will yield the channel
output

Y = X + N. (41)

As such, the conditional distribution PY |X of the channel output Y knowing the input
X is given by

PY |X (y|x) = PY −X|X (y − x|x) = PN |X (y − x|x) = PN (y − x) (42)

where we have used the fact that the random processes N and X are independent.
Having this definition of a channel as a random process, and noticing that the noisier
a channel, the less information can transit through it, Shannon sought to characterize
the maximum bit-rate (measured in bits/sec/Hz) which can be transmitted through a
channel, knowing its probability distribution.
Accordingly, Shannon proved that the maximum bit-rate which can be transmitted
through a channel is given by the mutual information I(X; Y ) between its input X and
its output Y , and corresponds to the number of bits/sec which can be transmitted per
unit of frequency.
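As a discrete illustration of this channel model (and of Exercise 5), here is a minimal Python sketch of a binary additive-noise channel Y = X ⊕ N with X ∼ Bern(1/2) and N ∼ Bern(p) : the joint pmf is built following eq. (42), and the resulting I(X; Y ) equals 1 − H2 (p). The parameter value is an illustrative assumption.

```python
import numpy as np

def H(p):
    """Discrete entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Binary additive-noise channel: Y = X xor N, with X ~ Bern(1/2) and N ~ Bern(p) independent
p = 0.1
P_X = np.array([0.5, 0.5])
P_N = np.array([1 - p, p])

# Joint pmf P_{X,Y}(x, y) = P_X(x) * P_N(y xor x), following eq. (42)
P_XY = np.array([[P_X[x] * P_N[y ^ x] for y in (0, 1)] for x in (0, 1)])
P_Y = P_XY.sum(axis=0)

I = H(P_X) + H(P_Y) - H(P_XY)               # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(I, 1 - H(np.array([1 - p, p])))        # both equal 1 - H2(p) bits per channel use
```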

2.3 Joint and conditional mutual information


Let X, Y and Z be three random variables with joint pmf PX,Y,Z . We can define
different types of measures of information, and relate them in the following.

Definition 7 (Conditional mutual information) The mutual information between X
and Y conditionally on Z is defined by
\begin{align}
I(X; Y | Z) &\triangleq \sum_{(x,y,z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}} P_{X,Y,Z}(x, y, z) \log_2\left(\frac{P_{X,Y|Z}(x, y|z)}{P_{X|Z}(x|z) \, P_{Y|Z}(y|z)}\right) \tag{43}\\
&= \sum_{z \in \mathcal{Z}} P_Z(z) \, I(X; Y | Z = z). \tag{44}
\end{align}

This mutual information describes the interaction between X and Y knowing that we have
observed Z which is correlated to both.
Conditional mutual information can be expressed in terms of conditional entropies as
follows :

I(X; Y |Z) = H(X|Z) − H(X|(Y, Z)) (45)
           = H(Y |Z) − H(Y |(X, Z)) (46)
           = H(X|Z) + H(Y |Z) − H((X, Y )|Z) (47)

Exercise 11 : Prove the equalities listed above.

We can also define another type of information measure, which describes the quantity
of information between X and the pair (Y, Z), as follows.

Definition 8 (Joint mutual information) The mutual information between X and the
pair (Y, Z) is defined by :
\[
I(X; (Y, Z)) = \sum_{(x,y,z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}} P_{X,Y,Z}(x, y, z) \log_2\left(\frac{P_{X,Y,Z}(x, y, z)}{P_X(x) P_{Y,Z}(y, z)}\right) \tag{48}
\]
This definition follows the definition of mutual information, except that it treats the pair
of random variables (Y, Z) as one joint variable. A common abuse of notation consists in
writing I(X; Y Z) or I(X; Y, Z).

Joint, conditional, and scalar mutual information can be related as follows :

I(X; (Y, Z)) = I(X; Z) + I(X; Y |Z) (49)
             = I(X; Y ) + I(X; Z|Y ). (50)

Exercise 12 : Prove the equalities listed above.
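A minimal Python sketch checking the chain rule (49) on a randomly generated hypothetical joint pmf P_{X,Y,Z}, using the entropy expressions (45)-(47) for the conditional term.

```python
import numpy as np

def H(p):
    """Discrete entropy in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint pmf P_{X,Y,Z}, axes ordered (x, y, z), normalized to sum to 1
rng = np.random.default_rng(1)
P = rng.random((2, 2, 2))
P /= P.sum()

# I(X;(Y,Z)) = H(X) + H(Y,Z) - H(X,Y,Z), treating (Y,Z) as a single variable
I_X_YZ = H(P.sum(axis=(1, 2))) + H(P.sum(axis=0)) - H(P)

# I(X;Z) = H(X) + H(Z) - H(X,Z)
I_X_Z = H(P.sum(axis=(1, 2))) + H(P.sum(axis=(0, 1))) - H(P.sum(axis=1))

# I(X;Y|Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z), from eq. (47)
I_XY_given_Z = H(P.sum(axis=1)) + H(P.sum(axis=0)) - H(P.sum(axis=(0, 1))) - H(P)

print(np.isclose(I_X_YZ, I_X_Z + I_XY_given_Z))   # chain rule (49) holds
```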

2.4 Continuous case : continuous mutual information


We have previously seen that differential entropy and discrete entropy differ in a number
of properties, among which positivity, the Data Processing Inequality (DPI), and other
inequalities. For mutual information, we will see that there are far fewer differences between
the discrete and continuous cases, to the extent that we will scarcely make the distinction
later in the course.

Definition 9 (Continuous mutual information) Let X and Y be two continuous
random variables with joint pdf fX,Y . Continuous mutual information is defined by
\[
I(X; Y) \triangleq \int_{x,y} f_{X,Y}(x, y) \log_2\left(\frac{f_{X,Y}(x, y)}{f_X(x) f_Y(y)}\right) dx \, dy \tag{51}
\]
where fX and fY are the marginal pdfs associated with fX,Y .

Continuous mutual information measures the quantity of information shared by X and Y .

Properties 10 Continuous mutual information satisfies the following properties

— Symmetry : I(X; Y ) = I(Y ; X)
— Minimum : I(X; Y ) ≥ 0 with equality iff X and Y are independent

— Maximum : the maximum is not always given by the individual entropies, since the
conditional differential entropy can be negative.
— Undefined in the degenerate case, i.e., I(X; g(X)) is undefined for all deterministic
functions g.

However, all other definitions of mutual information, joint and conditional, remain valid,
and so do their relationships. As such, we will not distinguish between the two types of
mutual information (discrete and continuous) and will simply write I(X; Y ).

Example 4 (Shannon formula log2 (1 + SNR)) Let X ∼ N (0, P ) and W ∼ N (0, σ 2 )
be two independent Gaussian random variables. Assume that we observe Y given by

Y = X + W. (52)

This model describes the so-called Additive White Gaussian Noise (AWGN) channel with
input signal X, additive noise W , and output signal Y .
The mutual information between X and Y is given by
\[
I(X; Y) = \frac{1}{2} \log_2\left(1 + \frac{P}{\sigma^2}\right). \tag{53}
\]
This formula is widely known as Shannon's formula for AWGN channels, or the
log2 (1 + SNR) formula, and describes the maximum spectral efficiency (measured in
bits/sec/Hz) which can be transmitted over such a channel.

Exercise 13 : Prove the Shannon formula for the AWGN.
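A minimal Python sketch of the Shannon formula (53), assuming illustrative values of the signal power P and noise variance σ² : since Y ∼ N (0, P + σ²) and h(Y |X) = h(W ), the identity I(X; Y ) = h(Y ) − h(W ) together with the Gaussian entropy (23) recovers ½ log2 (1 + P/σ²).

```python
import numpy as np

P, sigma2 = 4.0, 1.0   # hypothetical signal power and noise variance

# I(X;Y) = h(Y) - h(Y|X) = h(Y) - h(W), with Y ~ N(0, P + sigma2) and W ~ N(0, sigma2)
h_Y = 0.5 * np.log2(2 * np.pi * np.e * (P + sigma2))
h_W = 0.5 * np.log2(2 * np.pi * np.e * sigma2)

print(h_Y - h_W)                          # difference of Gaussian entropies
print(0.5 * np.log2(1 + P / sigma2))      # Shannon formula (53): same value, ~1.16 bits
```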

2.5 Vector mutual information


Similarly to the vector entropy, we define in this section the notion of vector mutual
information. As stated previously, in communication systems, what is of interest to us is
rather the interaction between random processes, and not only scalar realizations of these
processes.

Properties 11 (Chain-rule) Let X be a random variable and let Y n = (Y1 , ..., Yn ) be a
random vector. The mutual information between X and Y n is given by :
\[
I(X; Y^n) = \sum_{i=1}^{n} I(X; Y_i | Y_1, \ldots, Y_{i-1}) = \sum_{i=1}^{n} I(X; Y_i | Y_{i+1}, \ldots, Y_n) \tag{54}
\]
where we have used the chain rule of entropy.

The chain-rule is of crucial importance since it allows us to compute the quantity I(X; Y n )
without having to perform a large marginalization over the vector (X, Y n ).

Properties 12 (Case of iid processes) Let X n = (X1 , ..., Xn ) and Y n = (Y1 , ..., Yn )
be two random vectors such that the pairs (Xi , Yi ) are mutually independent, i.e.,
\[
P_{X^n, Y^n}(x^n, y^n) = \prod_{i=1}^{n} P_{X_i, Y_i}(x_i, y_i). \tag{55}
\]
In this case the mutual information can be written as
\[
I(X^n; Y^n) = \sum_{i=1}^{n} I(X_i; Y_i). \tag{56}
\]
If, moreover, the pairs (Xi , Yi ) are i.i.d and all follow the same pmf/pdf PX,Y , then

I(X n ; Y n ) = nI(X; Y ). (57)

Whether these observations are iid or not, the fraction I(X n ; Y n )/n is called the information
rate.
Hereafter, we give a simple example based on this property.

Example 5 (Memoryless AWGN) Let X n = (X1 , ..., Xn ) be n i.i.d Gaussian variables
following N (0, P ), and let W n = (W1 , ..., Wn ) be n i.i.d Gaussian variables following
N (0, σ 2 ), independent of X n . Assume that we observe the random process Y n = X n + W n .
This model describes the so-called memoryless AWGN channel. The mutual information
between the input of the channel X n and its output Y n is given by
\[
I(X^n; Y^n) = n I(X; Y) = \frac{n}{2} \log_2\left(1 + \frac{P}{\sigma^2}\right) \tag{58}
\]

Conclusions 2 At the end of this section, you should be able to :


• Define and list the properties of mutual information
• Define joint and conditional mutual information
• Link entropy, joint entropy, and conditional entropy, to mutual information
• Compute mutual information for simple examples
• Apply the chain rule to vector mutual information
