0% found this document useful (0 votes)

74 views

Information Theory Textbook

This document introduces information theory concepts including entropy, joint entropy, conditional entropy, and mutual information. It provides definitions and properties of these measures as defined by Claude Shannon in his mathematical theory of communication. The document outlines the key concepts that will be covered in further detail in the information theory course.

Uploaded by

Shankh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

74 views

Information Theory Textbook

Uploaded by

Shankh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

Introduction to Information Theory

Meryem Benammar

5 novembre 2021

Résumé
In this course, we present the various measures of information, defined by C.E
Shannon, which allow to describe random variables and their possible interactions.
This course is a comprehensive study of these distinct measures as well as the theorems
entailed from what Shannon called a ”mathematical theory of communications”. The
proofs are given by means of guidelines and are not given since left to the discretion
of the learner.

Table des matières

1 Entropy 2
1.1 Discrete scalar case : discrete entropy . . . . . . . . . . . . . . . . . . . . . 2
1.2 Entropy and lossless compression . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Joint and conditionnal entropy . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Vector case : vector entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Continuous case : differential entropy . . . . . . . . . . . . . . . . . . . . . . 7

2 Mutual information 10
2.1 Discrete scalar case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Mutual information and bit-rate . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Joint and conditional mutual information . . . . . . . . . . . . . . . . . . . 11
2.4 Continuous case : continuous mutual information . . . . . . . . . . . . . . . 12
2.5 Vector mutual information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Section 1 Entropy

1 Entropy
Let X be a random variable with probability support set X (all values of the support
set are possible). Depending on whether the support set X is discrete or continuous, we can
define two types of entropy : discrete entropy H(X) and differential entropy h(X). These
two entropies share some common characteristics, but differ in many aspects. We start in
the following with the discrete entropy, whilst the differential entropy will be introduced
later in the textbook.

1.1 Discrete scalar case : discrete entropy

Let us assume that the random variable X is discrete-valued, and that its support set
X is finite with cardinality |X |. Let the probability mass function (pmf) of X be denoted
˙
by PX ().
Definition 1 (Discrete entropy) The entropy of X is defined by :
X
H(X) ≜ − PX (x) log2 (PX (x)) . (1)
x∈X

where log2 is the logarithm in base 2.

The entropy of a variable X is a measure related only to the values of the pmf PX and
not to the values of X itself. We thus say that the entropy does not carry any semantic
information on X, but measures only the quantity of information contained in X. This
information is due to the randomness, or equivalently, the uncertainty, on the random
variable X. When defined with base-2 logarithms, the entropy is measures in bits, when
defined with natural logarithm (ln), the entropy is measured in nats.

Examples 1 Here are a few special cases of entropy :

1. The entropy of a constant random variables, X = c with probability 1, is given by

H(X) = 0. (2)

2. The entropy of a random variable X uniformly distributed over X , c.à.d, PX (x) =

1
for all x ∈ X , is given by
|X |
H(X) = log2 (|X |). (3)

3. Let X be a binary Bernoulli random variable Bern(p) where p = PX (1). The entropy
of X is given by

H(X) = H2 (p) ≜ −p log2 (p) − (1 − p) log2 (1 − p) = H2 (p). (4)

H2 (p) is commonly known as the binary entropy function. It is maximal when p = 0.5
(equals 1 bit), and is minimal when p = 0 or p = 1, (equals 0 bits). It is symmetric
around p = 0.5.

Exercise 1 : Compute the entropy of the three previous examples. Plot the
binary entropy function H2 (p) as a function of p (you can use Matlab for
instance)
1.2 Entropy and lossless compression 3

In the following, we list some basic properties of discrete entropy.

Properties 1 Discrete entropy H(X) satisfies the following properties :

1. Minimum : H(X) ≥ 0 for all discrete distributions PX ,and the minimum is achieved
for a degenerate distribution, i.e., X is a constant.
2. Maximum : H(X) ≤ log2 (|X |) for all discrete distributions PX , and the maximum
is achieved for the uniform distribution over X .
3. Data Processing Inequality (DPI) : the entropy of a function f of X, is no greater
than the entropy of X, i.e.,
H(f (X)) ≤ H(X) (5)

with equality iif f is a one-to-one function.

Exercise 2 : Prove the second property (maximal value of entropy). Hint :

use a Lagrangian in order to optimize over PX , or use Jensen’s inequality
(concavity of the logarithm).

It is only natural that, being a measure of uncertainty, the entropy be minimal for those
variables with little uncertainty , constant one in the extreme, and maximal for those very
uncertain variables, the uniform one in the extreme.

1.2 Entropy and lossless compression

Let us consider a random source (random vector) which produces symbols xn =
(x1 , . . . , xn ) all drawn iid following a given probability distribution PX . The symbols could
belong to any alphabet X : bits X = {0, 1}, DNA nucleotid X = {A, T, G, C}, text cha-
racters X = {a, b, c, d, ...}, pixels X = [0 : 255]. In order for these symbols to be stored in
a storage device, they need to be mapped each to a binary word.
If no compression is applied, then, in order to store every possible realization of the
alhpabet X , one would need log2 (|X |) bits for each symbol. However, by doing this, we
do not exploit our prior on these symbols : some symbols are more frequent than others,
and hence, it would be clever to use binary words of small length for the more frequency
symbols. Inversely, for symbols with very low probability of occurrence, one could afford
using binary words which are longer. The resulting average binary word length would then
hopefully smaller than the log2 (|X |) bits/symbol. This is the basic idea behind lossless
compression. It is called lossless since it does not destroy information in the source, and
only uses more efficiently the word lengths.
Without having to explicitely derive a lossless compression scheme, Shannon proved
that, one can use as few as H(X) bits/symbol to compress losslessly a source, and that if
one goes below this number, then there will be losses in the information.

1.3 Joint and conditionnal entropy

Now that we have defined the entropy of a random variable, we will introduce the joint
entropy of a pair of random variables.
4 Section 1 Entropy

Definition 2 (Joint entropy) Let X and Y be two random variables with respective
finite supports X and mathcalY assumed jointly distributed following PX,Y . The joint
entropy of (X, Y ) is defined by
X
H(X, Y ) ≜ − PX,Y (x, y) log2 (PX,Y (x, y)) . (6)
(x,y)∈X ×Y

The joint entropy describes, similarly to the scalar entropy, the amount of randomness
contained in the random pair of variables (X, Y ). It verifies the following properties.

Properties 2 The joint entropy H(X, Y ) satisfies the following :

1. Symmetry : H(X, Y ) = H(Y, X) for all joint distribution PX,Y
2. Minimum : H(X, Y ) ≥ 0 for all PX,Y , and the minimum is achieved when (X, Y ) is
a pair of constants.
3. Upper bound :
H(X, Y ) ≤ H(X) + H(Y ) (7)
with equality iif X et Y are independent, i.e., PX,Y = PX PY .
4. Maximum : H(X, Y ) ≤ log2 (|X |) + log2 (|Y|) for all PX,Y , and the maximum is
achieved for a pair of independent uniform random variables (X, Y ), i.e.,

1 1
PX,Y (x, y) = PX (x)PY (y) = pout tout (x, y) ∈ X × Y (8)
|X | |Y|

Exercise 3 : Prove the property 4 (maximum of the joint entropy). Hint :

use property 3 to upper bound the joint entropy.

The fact that the joint entropy is always smaller that the sum of the individual entropy
is due to the fact that, when entangled, two random variables exhibit less randomness than
on their own. This intuition can be encountered as well in the thermodynamic entropy.
In the following, we define another measure of information induced between two ran-
dom variables, namely, conditional entropy.

Definition 3 (Conditional entropy) Let X and Y be two random variables with finite
support sets X and mathcalY joint pmf PX,Y . The conditional entropy of X knowing Y
is defined by :
X
H(X|Y ) ≜ − PX,Y (x, y) log2 PX|Y (x|y) (9)
(x,y)∈X ×Y
X
= PY (y)H(X|Y = y) (10)
y∈Y

Conditional entropy H(X|Y ) measures the amount of uncertainty on X remaining

after having observed Y . It consists thus in the first measure of correlation between X
and Y .

Properties 3 Conditional entropy H(X|Y ) satisfies the following properties

1. Asymmetry : H(X|Y ) ̸= H(Y |X)
2. Minimum : H(X|Y ) ≥ 0 for all PX,Y , and the minimum is achieved when PX|Y is
degenrate, i.e., X is a function of Y .
3. Upper bound : conditional entropy is always smaller than the individual entropy

H(X|Y ) ≤ H(X) (12)

with equality iif X and Y are independent : PX,Y = PX PY .

4. Maximum : H(X|Y ) ≤ log2 (|X |) for all PX,Y and the maximum is achieved when
X and Y are independent, and X is uniform, i.e.,

1
PX|Y (x|y) = PX (x) pout tout (x, y) ∈ X × Y (13)
|X |

Exercise 4 : Prove the properties 1, 2, and 4. Prove that if X and Y are

independent, then, H(X|Y ) = H(X).

Conditional entropy H(X|Y ) is always smaller than the individual entropy, since, ha-
ving observed a variable Y , possibly correlated to X, the uncertainty about X cannot be
grater than when nothing is observed. We say thus that conditioning decreases entropy,
hence, uncertainty.

Exercise 5 : Compute the conditional entropy H(Y |X) where Y = X ⊕ W ,

where X follows a Bern(1/2), independent from W which follows a Bern(p)
and ⊕ is the binary XOR operation. (Hint : PX,Y and PY |X were given
previously in example ?? )

Let us now relate the different measures of information introduced previously.

Properties 4 The individual entropies H(X) and H(Y ), the joint entropy H(X, Y ) and
the conditional entropies H(X|Y ) et H(Y |X) can be related as follows

H(X, Y ) = H(X) + H(Y |X) (14)

= H(Y ) + H(X|Y ) (15)

Exercise 6 : Prove the relationships listed herebefore.

6 Section 1 Entropy

1.4 Vector case : vector entropy

In communication systems, and consequently in Shannon’s theory as well, the random
variables we are dealing with consist in random processes (vectors of random variables)
where the dimension of time or frequency is taken into account. To this end, we will need
to define information measures in this vector case. D

Definition 4 (Vector entropy) Let X n = (X1 , ..., Xn ) be a collection of random va-

riables with support set X n = X1 × ... × Xn and joint pmf PX n = PX1 ,...,Xn . The vector
joint entropy is defined as
X
H(X n ) = H(X1 , ..., Xn ) = PX n (xn ) log2 (PX n (xn )) (16)
xn ∈X n
X
= PX1 ,...,Xn (x1 , ..., xn ) log2 (PX1 ,...,Xn (x1 , ..., xn(17)
))
(x1 ,...,xn )∈X n

Similarly to the joint entropy of a pair of random variables, the vector entropy verifies a
number of properties, listed hereafter.

Properties 5 The vector entropy satisfies the following :

— Symmetry : H(X n ) = H(Π(X n )) for all permutation of indices Π() over [1 : n]
— Minimum : H(X n ) ≥ 0 for all joint pmf PX n , and the minimum is achieved when
(X1 , ..., Xn ) is a vector of constants.
— Upper bound : the joint entropy is no greater than the sum of individual entropies
n
X
H(X n ) ≤ H(Xi ) (18)
i=1

with equality iif all Xi are independent.

In practice, we scarcely compute the vector entropy with direct calculations. Rather, we
use the following property.

Properties 6 The vector entropy H(X n ) writes as

n
X
n
H(X ) = H(X1 , ..., Xn ) = H(Xi |X1 , ..., Xi−1 ) (19)
i=1
Xn
= H(Xi |Xi+1 , ..., Xn ). (20)
i=1

Exercise 7 : Prove this property. Hint : use the chain rule of probabilities

The two ways of writing the vector entropy, are called causal and anti-causal expansions of
the vector entropy. Since the joint entropy is invariant to permutations, the causal and anti-
causal expressions are just two specific cases of possible joint entropy expansions. Resorting
to other permutations, along with the chain rule, could give many more expansions.
1.5 Continuous case : differential entropy 7

Example 1 (Vector of iid random variables) Let X n be a vector of n iid random

variables (X1 , ..., Xn ), with same support set X . We have that :

H(X n ) = nH(X). (21)

We say that H(X n ) admits a single letter expression, and this property is key in Shannon’s
results, in that it is way much easier to compute a scalar entropy rather H(X) than a vector
H(X n )
entropy H(X n ). The quantity is often called entropy rate.
n

1.5 Continuous case : differential entropy

Let us assume in the following that the random variable X is a continuous with support
X (often an interval in R or a convex are in C). Let fX ()˙ be the pdf of X. Let us define
the differential entropy of X.

Definition 5 (Differential entropy) The differential entropy of X is given by

Z
h(X) = fX (x) log2 (fX (x)) dx (22)
x∈X

assuming that the integral does exist.

The differential entropy is denoted by h(X), contrary to the discrete entropy

which is denoted H(X). This is to highlight their intrinsic differences.

Similarly to the discrete entropy, differential entropy computes the amount of ran-
domness and uncertainty pertaining a random variable. Yet, the properties of both these
measures differ considerably.

Example 2 Hereafter, we give a few examples of differential entropy, prior to discussing

its properties.
1. The differential entropy of a real-valued Gaussian random variable XG ∼ N (µ, σ 2 )
is given by
1
h(XG ) = log2 (2πeσ 2 ) (23)
2
2. The entropy of a circular complex valued Gaussian variable XCG ∼ CN (µ, σ 2 ) is
given by :
h(XCG ) = log2 (2πeσ 2 ) (24)

Exercise 8 : Prove that the entropy of a real-valued Gaussian variable is

as stated in property 1.

Properties 7 Let X be a continuous random variables with finite variance V(X). The
differential entropy of X, h(X) , satisfies the following properties.
8 Section 1 Entropy

— Maximum : it is maximum for a Gaussian distribution with the same variance as X

, i.e., if X is real-valued,
1
h(X) ≤ log2 (2πeV(X)) (25)
2
and if X is complex valued with covariance matrix KX
1
log2 (2πe)2 |KX |

h(X) ≤ (26)
2
where|KX | is the determinant of KX .
— Differential entropy is not compulsorily positive, for instance, if the variance V(X) ≤
1
, then h(X) ≤ 0
2πe
— Data processing inequality does not always hold. Example, if X Gaussian, with de
variance σ 2 , then 2X has a variance 4σ 2 . Hence, h(2X) ≥ h(X).

Differential joint and conditional entropies, under the assumption that integrals are
finite, are defined in the exact same manner. The relationships between these different
entropies are maintained, as well the definitions in the vector case, and the causal/anti-
causal expansions. The following properties are thus satisfied.

Properties 8 1. LEt (X, Y ) be a pair of continuous random variables. Let h(X, Y ) be

their joint differential entropy, and h(X|Y ) et h(Y |X) their conditional differential
entropies. We have that

h(X, Y ) = h(X) + H(Y |X) = h(Y ) + h(Y |X) (27)

h(Y |X) ≤ h(Y ) (28)
h(X|Y ) ≤ h(X) (29)
h(X, Y ) ≤ h(X) + h(Y ) (30)

2. Let (X1 , ..., Xn ) be n continuous random variables. The vector differential entropy
satisfies
n
X
h(X n ) = h(X1 , ..., Xn ) = h(Xi |X1 , ..., Xi−1 ) (31)
i=1
Xn
= h(Xi |Xi+1 , ..., Xn ) (32)
i=1

Hereafter, we give an example of differential entropy calculations very common in

communication systems.

Example 3 (log-det formula) Let X n = (X1 , ..., Xn ) be n continuous random variables,

with Gaussian distributions, with covariance matrix KX n . The differential entropy of X n

is given by :
1
h(X n ) = log2 ((2πe)n |KX n
|) (33)
2
where |KX n | is the determinant of the covariance matrix K n . This formula is widely known
X
under the name log-det formula.
1.5 Continuous case : differential entropy 9

When the variables Xi are independent, the covariance matrix is diagonal, and hence
n
KX = diag(σ12 , ..., σn2 ) (34)

and we recover,
n n
X 1 X
h(X n ) = log2 (2πe)σi2 = h(Xi ) (35)
2
i=1 i=1

Conclusions 1 At the end of this section, you should be able to :

• Define and list the properties of discrete entropy
• Define joint entropy, and conditional entropy
• Link entropy, joint entropy, and conditional entropy
• Compute entropy for simple examples
• List the difference between discrete and differential entropy
• Apply the chain rule to vector entropies
10 Section 2 Mutual information

2 Mutual information
In this section, we define a measure of information which is crucial to measure the
quantity of information exchanged between two or more random variables, namely, mutual
information.

2.1 Discrete scalar case

Let X and Y two discrete random variable with joint pmf PX,Y .

Definition 6 (Mutual information) Mutual information between X and Y is defined

by
X PX,Y (x, y)
I(X; Y ) ≜ PX,Y (x, y) log2 (36)
PX (x)PY (y)
(x,y)∈X ×Y

where PX and PY are the marginal pmf associated with PX,Y .

Mutual information can be readily seen to correspond to the Kullback-Leibler (KL) diver-
gence between PX,Y and the product of the marginal laws PX PY , i.e.,
I(X; Y ) = DKL (PX,Y ||PX PY ) , (37)
and as such, it exhibits some characteristics.
Properties 9 The mutual information between X and Y satisfies the following :
— Symmetry : I(X; Y ) = I(Y ; X) for all joint pmf PX,Y
— Minimum : I(X; Y ) ≥ 0 with equality iif X and Y are independent, i.e., PX,Y =
PX PY
— Maximum : Mutual information is upper bounded by the individual entropies X et Y ,
c.a.d, I(X; Y ) ≤ min(H(X), H(y)), with equality iif X = f (Y ) and f is a one-to-one
function.

Exercise 9 : Prove that mutual information is positive. Hint : proof is

similar to the proof of positivity of the KL divergence.

Mutual information can be related in different ways to entropy measures, as follows :

I(X; Y ) = H(X) − H(Y |X) (38)
= H(X) − H(Y |X) (39)
= H(X) + H(Y ) − H(X; Y ). (40)

Exercise 10 : Prove the different identities of mutual information.

Mutual information can be interpreted as the difference between the initial uncertainty
over X, and the uncertainty on X which remains after we observe Y . As such, it measures
the amount of information inherently shared by X and Y . It is somehow also a measure
of independence, since, if two variables are independent, then it is minimal ( equal to 0),
and if they are fully correlated ( X = f (Y )), it is maximal.
2.2 Mutual information and bit-rate 11

2.2 Mutual information and bit-rate

When transmitting an information (image, sound, text, ...) over a physical medium,
this image in affected with possible distortions : non-linearities, delays, Doppler shifts, and
most importantly, noise.
Noise is generally due to an underlying random physical process, which we can be
modeled as a probability distribution PN . We will often assume that the noise is additive,
and independent from the useful signal, say X. Thus, a random signal with distribution
PX transiting through an additive channel with distribution PN , will yield an output from
this channel
Y = X + N. (41)

As such, the conditional distribution PY |X of the channel output Y knowing the input
X is given by

PY |X (y|x) = PY −X|X (y − x|x) = PN |X (y − x|x) = PN (y − x) (42)

where we have used the fact that the random processes N and X are independent.
Having this definition of a channel as a random process, and noticing that the noisier
a channel, the less information could transit through it, Shannon sought to characterize
the maximum bit-rate (measured in bits/sec/Hz) which could be transmitted through a
channel, knowing its probability distribution.
Accordingly, Shannon proved that the maximum bit-rate which could be transmitted
through a channel is equal to the mutual information between its input X and its output
Y , I(X; Y ) and corresponds to the number of bits/sec which can be transmitted per unit
frequency.

2.3 Joint and conditional mutual information

Let X, Y and Z be three random variables with joint pmf PX,Y,Z . We can define
different types of measures of information, and relate them in the following.

Definition 7 (Conditional mutual information) The mutual information between X

and Y conditionally to Z is defined by

PX,Y |Z (x, y|z)

X
I(X; Y |Z) ≜ PX,Y,Z (x, y, z) log2 (43)
PX|Z (x|z)PY |Z (y|z)
(x,y,z)∈X ×Y×Z
X
= PZ (z) I(X; Y |Z = z). (44)
z∈Z

This mutual information describes the interaction between X and Y knowing that we have
observed Z which is correlated to both.
Mutual information can be expressed in terms of conditional entropies as follows :

I(X; Y |Z) = H(X|Z) − H(X|(Y, Z)) (45)

= H(Y |Z) − H(Y |(X, Z)) (46)
= H(X|Z) + H(Y |Z) − H((X, Y )|Z)) (47)
12 Section 2 Mutual information

Exercise 11 : Prove the equalities listed here above.

We can also define another type of information measures, which describes the quantity
of information between X and the pair (Y, Z) as follows.

Definition 8 (Joint mutual information) The mutual information between X and the
pair (Y, Z) is defined by :

X PX,Y,Z (x, y, z)
I(X; (Y, Z)) = PX,Y,Z (x, y, z) log2 (48)
PX (x)PY,Z (y, z)
(x,y,z)∈X ×Y×Z

This definition follows the line of the definition of the mutual information except that
it treats the pair of random variables (Y, Z) as one joint variable. A common abuse of
notation consists in writing I(X; Y Z) or I(X; Y, Z).

Joint conditional and scalar mutual information can be related as follows :

I(X; (Y, Z)) = I(X; Z) + I(X; Y |Z) (49)

= I(X; Y ) + I(X; Z|Y ). (50)

Exercise 12 : Prove the equalities listed here above.

2.4 Continuous case : continuous mutual information

We have previously seen that differential entropy and discrete entropy differ in a given
number of properties, among which positivity, Data Processing Inequality (DPI), and other
inequalities. For mutual information, we will see that there much fewer differences between
the discrete and continuous case, to the extent that we will scarcely make the distinction
later in the course.

Definition 9 (Continuous mutual information) Let X and Y be two continuous ran-

dom variable with joint pdf fX,Y . Continuous mutual information is defined by
Z
fX,Y (x, y)
I(X; Y ) ≜ fX,Y (x, y) log2 dx dy (51)
x,y fX (x)fY (y)

where fX and fY are the marginal pdf associated with fX,Y .

Continuous mutual information measures the quantity of information shared by X and Y .

Properties 10 Continuous mutual information satisfies the following properties

— Symmetry ; I(X; Y ) = I(Y ; X)
— Minimum : I(X; Y ) ≥ 0 with equality iif X and Y are independent
2.5 Vector mutual information 13

— Maximum : is not always given by individual entropies since the conditional diffe-
rential entropy can be negative.
— Undefined in degenerate case, i.e., I(X; g(X)) is undefined for all deterministic func-
tions g.

However, all other definitions of mutual informations, joint and conditional, are still valid
and their relationships as well. As such, we will not make a distinction of the two types
of mutual information (discrete and continuous) and note only I(X; Y ).

Example 4 (Shannon formula log2 (1 + SN R)) Let X ∼ N (0, P ) and W ∼ N (0, σ 2 )

be two independent Gaussian random variables. Assume that we observe Y given by

Y = X + W. (52)

This model describes the so called Additive White Gaussian Noise (AWGN) channel with
inpu signal X, additive noise W , and output signal Y .
The mutual information between X and Y is given by

1 P
I(X; Y ) = log2 1 + 2 . (53)
2 σ

This formula is widely known as Shannon’s formula for AWGN channels, or the
log2 (1 + SN R) formula, and describes the maximum spectral efficienty ( measured in
bits/sec/Hz) which can be transmitted over a channel.

Exercise 13 : Prove the Shannon formula for the AWGN.

2.5 Vector mutual information

Similarly to the vector entropy, we will define in this section of the notion of vector
mutual information. As stated previously, in communication systems, what is of interest
to us is rather the interaction between random processes, and not only scalar realizations
of these processes.

Properties 11 (Chain-rule) Let X be a random variable and let Y n = (Y1 , ..., Yn ) be a

random vector. The mutual information between X and Y n is given by :
n
X n
X
n
I(X; Y ) = I(X; Yi |Y1 , ..., Yi−1 ) = I(X; Yi |Yi+1 , ..., Yn ) (54)
i=1 i=1

where we have used the chain rule of entropy.

The chain-rule is of crucial importance since it allows to compute the quantity I(X; Y n )
with having to perform a large marginalization on the vector (X, Y n ).
14 Section 2 Mutual information

Properties 12 (Case of iid processes) Let X n = (X1 , ..., Xn ) and Y n = (Y1 , ..., Yn )
be two random vectors such that the pairs (Xi , Yi ) are pairwise independent, i.e.,
n
Y
PX n ,Y n (xn , y n ) = PXi ,Yi (xi , yi ). (55)
i=1

In this case the mutual information writes as

n
X
I(X n ; Y n ) = I(Xi ; Yi ). (56)
i=1

If moreover, the pairs (Xi , Yi ) are i.i.d and all follow the same pmf/pdf PX,Y , then

I(X n ; Y n ) = nI(X; Y ). (57)

1
Whether these observations are iif or not, the fraction I(X n ; Y n ) is called information
n
rate.
Hereafter, we write a simple example based on this property.

Example 5 (Memoryless AWGN) Let X n = (X1 , ..., Xn ) n i.i.d Gaussian variables

following N (0, P ), and let W n = (W1 , ..., Wn ) n ri.i.d Gaussian variables following N (0, σ 2 ),
independent of X n . Assume that we observe the random process Y n = X n + W n .
This model describes the so-called memoryless AWGN channel. The mutual informa-
tion between the input of the channel X n and its output Y n is given by

n n n P
I(X ; Y ) = nI(X; Y ) = log2 1 + 2 (58)
2 σ

Conclusions 2 At the end of this section, you should be able to :

• Define and list the properties of mutual information
• Define joint and conditional mutual information
• Link entropy, joint entropy, and conditional entropy, to mutual information
• Compute mutual information for simple examples
• Apply the chain rule to vector mutual information

(397 P. COMPLETE SOLUTIONS) Elements of Information Theory 2nd Edition - COMPLETE Solutions Manual (Chapters 1-17)
85% (54)
(397 P. COMPLETE SOLUTIONS) Elements of Information Theory 2nd Edition - COMPLETE Solutions Manual (Chapters 1-17)
397 pages
Cortex XDR: Safeguard Your Entire Organization With The Industry's First Extended Detection and Response Platform
No ratings yet
Cortex XDR: Safeguard Your Entire Organization With The Industry's First Extended Detection and Response Platform
8 pages
Power System Analysis
100% (8)
Power System Analysis
105 pages
1.1 Shannon's Information Measures: Lecture 1 - January 26
No ratings yet
1.1 Shannon's Information Measures: Lecture 1 - January 26
5 pages
Entropie Eng PDF
No ratings yet
Entropie Eng PDF
6 pages
CoverThomas Ch2 PDF
No ratings yet
CoverThomas Ch2 PDF
38 pages
Entropy, Relative Entropy and Mutual Information
No ratings yet
Entropy, Relative Entropy and Mutual Information
38 pages
IT-CO-1-EN
No ratings yet
IT-CO-1-EN
26 pages
Slide 04
No ratings yet
Slide 04
16 pages
MIT16 36s09 Lec03
No ratings yet
MIT16 36s09 Lec03
10 pages
Lec7 InformationTheory
No ratings yet
Lec7 InformationTheory
41 pages
Lecture 1: Introduction, Entropy and ML Estimation
No ratings yet
Lecture 1: Introduction, Entropy and ML Estimation
5 pages
Entropy Handbook Definitions, Theorems, M-Files
No ratings yet
Entropy Handbook Definitions, Theorems, M-Files
22 pages
Communication Theory and Coding: Basics
No ratings yet
Communication Theory and Coding: Basics
17 pages
Information Theory and Coding (Lecture 2) : Dr. Farman Ullah
No ratings yet
Information Theory and Coding (Lecture 2) : Dr. Farman Ullah
36 pages
Info Theory Exercise Solutions
No ratings yet
Info Theory Exercise Solutions
16 pages
ECE4007 Information Theory and Coding: DR - Sangeetha R.G
No ratings yet
ECE4007 Information Theory and Coding: DR - Sangeetha R.G
44 pages
L01
No ratings yet
L01
5 pages
Lecture 1
No ratings yet
Lecture 1
211 pages
Probabilistic Methods in Information Theory
No ratings yet
Probabilistic Methods in Information Theory
48 pages
Lecture 3 - Entropy
No ratings yet
Lecture 3 - Entropy
35 pages
Exercise Problems: Information Theory and Coding
No ratings yet
Exercise Problems: Information Theory and Coding
6 pages
Mutual Information
No ratings yet
Mutual Information
48 pages
Elements of Information Theory 2006 Thomas M. Cover and Joy A. Thomas
No ratings yet
Elements of Information Theory 2006 Thomas M. Cover and Joy A. Thomas
16 pages
Lecture 3: Entropy, Relative Entropy, and Mutual Information
No ratings yet
Lecture 3: Entropy, Relative Entropy, and Mutual Information
5 pages
ITC Module2 1
No ratings yet
ITC Module2 1
34 pages
Lecture 1: Entropy and Mutual Information: 2.1 Example
No ratings yet
Lecture 1: Entropy and Mutual Information: 2.1 Example
8 pages
Tema 1 Awp
No ratings yet
Tema 1 Awp
32 pages
Information Theory: Info Rmatio N Types
No ratings yet
Information Theory: Info Rmatio N Types
52 pages
Entropy Post
No ratings yet
Entropy Post
27 pages
Lecture 5
No ratings yet
Lecture 5
42 pages
Presentation Math7952
No ratings yet
Presentation Math7952
29 pages
Entropy
No ratings yet
Entropy
21 pages
Lecture 3: Entropy, Relative Entropy, and Mutual Information
No ratings yet
Lecture 3: Entropy, Relative Entropy, and Mutual Information
5 pages
Lecture 2: Entropy and Mutual Information: 2.1 Example
No ratings yet
Lecture 2: Entropy and Mutual Information: 2.1 Example
8 pages
Lec38 - 210108071 - AKSHAY KUMAR JHA
No ratings yet
Lec38 - 210108071 - AKSHAY KUMAR JHA
12 pages
LECTURE 1: Introduction
No ratings yet
LECTURE 1: Introduction
16 pages
Entropy and Mutual Information
No ratings yet
Entropy and Mutual Information
63 pages
Elements of Information Theory-Chapter1-2
No ratings yet
Elements of Information Theory-Chapter1-2
63 pages
BEC503-DC-M3-Information Theory
No ratings yet
BEC503-DC-M3-Information Theory
100 pages
Information Theory
No ratings yet
Information Theory
26 pages
Lecture 8: Channel Capacity, Continuous Random Variables: 1.1 Examples
No ratings yet
Lecture 8: Channel Capacity, Continuous Random Variables: 1.1 Examples
6 pages
15359-2009-lecture25
No ratings yet
15359-2009-lecture25
11 pages
Chapter2 PDF
No ratings yet
Chapter2 PDF
22 pages
session2
No ratings yet
session2
60 pages
Notes It
No ratings yet
Notes It
46 pages
Problem Set 1
No ratings yet
Problem Set 1
3 pages
MIT18 440S14 Lecture34
No ratings yet
MIT18 440S14 Lecture34
19 pages
ITC Module - I
No ratings yet
ITC Module - I
98 pages
Entr5
No ratings yet
Entr5
2 pages
Three Tutorial Lectures
No ratings yet
Three Tutorial Lectures
36 pages
Lec 2,3
No ratings yet
Lec 2,3
23 pages
Entropy: Scott Sheffield
No ratings yet
Entropy: Scott Sheffield
57 pages
Entropy: Low Entropy High Entropy
No ratings yet
Entropy: Low Entropy High Entropy
11 pages
lời giải
No ratings yet
lời giải
52 pages
MIT18 600F19 Lec33
No ratings yet
MIT18 600F19 Lec33
58 pages
The Gamma Function
From Everand
The Gamma Function
Emil Artin
No ratings yet
Mathematical Foundations of Information Theory
From Everand
Mathematical Foundations of Information Theory
A. Ya. Khinchin
3.5/5 (9)
Differential Forms
From Everand
Differential Forms
Henri Cartan
5/5 (2)
Introduction to Minimax
From Everand
Introduction to Minimax
V. F. Dem’yanov
No ratings yet
Theory of Approximation
From Everand
Theory of Approximation
N. I. Achieser
No ratings yet
Integration, Measure and Probability
From Everand
Integration, Measure and Probability
H. R. Pitt
No ratings yet
Unit 1A Test Review
No ratings yet
Unit 1A Test Review
5 pages
A.vamshi Mini Project 2
No ratings yet
A.vamshi Mini Project 2
34 pages
Barangay Final Inventory and Turnover
100% (3)
Barangay Final Inventory and Turnover
1 page
Unix Commands
No ratings yet
Unix Commands
6 pages
Chapter 10: Scratch Assignment Solutions
100% (3)
Chapter 10: Scratch Assignment Solutions
3 pages
写作作品集
100% (1)
写作作品集
6 pages
SMA Format For Microproject
No ratings yet
SMA Format For Microproject
5 pages
NA DS-3811-0819-NA-Cloud-Vol-ONTAP - DS
No ratings yet
NA DS-3811-0819-NA-Cloud-Vol-ONTAP - DS
4 pages
The Fun and Future of CTF: Andy Davis, Tim Leek, Michael Zhivich, Kyle Gwinnup, and William Leonard
No ratings yet
The Fun and Future of CTF: Andy Davis, Tim Leek, Michael Zhivich, Kyle Gwinnup, and William Leonard
9 pages
Introduction To Logistics Management
No ratings yet
Introduction To Logistics Management
29 pages
MEC5888 - Assignment 1 Design Stand Alone Power Supply Systems For A House in Off-Grid Area
No ratings yet
MEC5888 - Assignment 1 Design Stand Alone Power Supply Systems For A House in Off-Grid Area
2 pages
Ai - Unit-3
No ratings yet
Ai - Unit-3
16 pages
Stress for GRP Lines
No ratings yet
Stress for GRP Lines
9 pages
MICE FEASIBILITY-STUDY - Squidpay App Exhibition 2021
No ratings yet
MICE FEASIBILITY-STUDY - Squidpay App Exhibition 2021
7 pages
BSSPAR1: Chapter 4 Radio Resource Management: 1 © Nokia Siemens Networks
No ratings yet
BSSPAR1: Chapter 4 Radio Resource Management: 1 © Nokia Siemens Networks
30 pages
3D Printing - The Next Industrial Revolution
No ratings yet
3D Printing - The Next Industrial Revolution
198 pages
12 MiCOM General
No ratings yet
12 MiCOM General
10 pages
Low Power VLSI Circuits & Systems Complete Notes
100% (1)
Low Power VLSI Circuits & Systems Complete Notes
66 pages
Abrasive Jet Machining (Ajm) : Dept. of ME, ACE
No ratings yet
Abrasive Jet Machining (Ajm) : Dept. of ME, ACE
8 pages
Technical Manual for Air-Cooled Modular Chiller - (FCH01-2022,23) (2)
No ratings yet
Technical Manual for Air-Cooled Modular Chiller - (FCH01-2022,23) (2)
39 pages
Chapter Ii - Ais
No ratings yet
Chapter Ii - Ais
29 pages
Adam Implementation Guide v1.0
No ratings yet
Adam Implementation Guide v1.0
81 pages
Risk Perception of The E-Payment Systems: A Young Adult Perspective
No ratings yet
Risk Perception of The E-Payment Systems: A Young Adult Perspective
1 page
Cell Sites
No ratings yet
Cell Sites
10 pages
SQL Decorrelation and Window Functions - in Data Engineering
No ratings yet
SQL Decorrelation and Window Functions - in Data Engineering
75 pages
1650445861890-JPO For Prevention of Open Door of Waon Hitting With Railway Structure
No ratings yet
1650445861890-JPO For Prevention of Open Door of Waon Hitting With Railway Structure
9 pages
NFSen Navigation Guide
100% (1)
NFSen Navigation Guide
7 pages
Jmu 3001 - e 1
No ratings yet
Jmu 3001 - e 1
1 page

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Information Theory Textbook

Uploaded by

Information Theory Textbook

Uploaded by

Introduction to Information Theory

Table des matières

1.1 Discrete scalar case : discrete entropy

where log2 is the logarithm in base 2.

Examples 1 Here are a few special cases of entropy :

2. The entropy of a random variable X uniformly distributed over X , c.à.d, PX (x) =

H(X) = H2 (p) ≜ −p log2 (p) − (1 − p) log2 (1 − p) = H2 (p). (4)

In the following, we list some basic properties of discrete entropy.

Properties 1 Discrete entropy H(X) satisfies the following properties :

with equality iif f is a one-to-one function.

Exercise 2 : Prove the second property (maximal value of entropy). Hint :

1.2 Entropy and lossless compression

1.3 Joint and conditionnal entropy

Properties 2 The joint entropy H(X, Y ) satisfies the following :

Exercise 3 : Prove the property 4 (maximum of the joint entropy). Hint :

Conditional entropy H(X|Y ) measures the amount of uncertainty on X remaining

Properties 3 Conditional entropy H(X|Y ) satisfies the following properties

H(X|Y ) ≤ H(X) (12)

with equality iif X and Y are independent : PX,Y = PX PY .

Exercise 4 : Prove the properties 1, 2, and 4. Prove that if X and Y are

Exercise 5 : Compute the conditional entropy H(Y |X) where Y = X ⊕ W ,

Let us now relate the different measures of information introduced previously.

H(X, Y ) = H(X) + H(Y |X) (14)

Exercise 6 : Prove the relationships listed herebefore.

1.4 Vector case : vector entropy

Definition 4 (Vector entropy) Let X n = (X1 , ..., Xn ) be a collection of random va-

Properties 5 The vector entropy satisfies the following :

with equality iif all Xi are independent.

Properties 6 The vector entropy H(X n ) writes as

Example 1 (Vector of iid random variables) Let X n be a vector of n iid random

H(X n ) = nH(X). (21)

1.5 Continuous case : differential entropy

Definition 5 (Differential entropy) The differential entropy of X is given by

assuming that the integral does exist.

The differential entropy is denoted by h(X), contrary to the discrete entropy

Example 2 Hereafter, we give a few examples of differential entropy, prior to discussing

Exercise 8 : Prove that the entropy of a real-valued Gaussian variable is

— Maximum : it is maximum for a Gaussian distribution with the same variance as X

Properties 8 1. LEt (X, Y ) be a pair of continuous random variables. Let h(X, Y ) be

h(X, Y ) = h(X) + H(Y |X) = h(Y ) + h(Y |X) (27)

Hereafter, we give an example of differential entropy calculations very common in

Example 3 (log-det formula) Let X n = (X1 , ..., Xn ) be n continuous random variables,

Conclusions 1 At the end of this section, you should be able to :

2.1 Discrete scalar case

Definition 6 (Mutual information) Mutual information between X and Y is defined

where PX and PY are the marginal pmf associated with PX,Y .

Exercise 9 : Prove that mutual information is positive. Hint : proof is

Mutual information can be related in different ways to entropy measures, as follows :

Exercise 10 : Prove the different identities of mutual information.

2.2 Mutual information and bit-rate

PY |X (y|x) = PY −X|X (y − x|x) = PN |X (y − x|x) = PN (y − x) (42)

2.3 Joint and conditional mutual information

Definition 7 (Conditional mutual information) The mutual information between X

PX,Y |Z (x, y|z)

I(X; Y |Z) = H(X|Z) − H(X|(Y, Z)) (45)

Exercise 11 : Prove the equalities listed here above.

Joint conditional and scalar mutual information can be related as follows :

I(X; (Y, Z)) = I(X; Z) + I(X; Y |Z) (49)

Exercise 12 : Prove the equalities listed here above.

2.4 Continuous case : continuous mutual information

Definition 9 (Continuous mutual information) Let X and Y be two continuous ran-

where fX and fY are the marginal pdf associated with fX,Y .

Continuous mutual information measures the quantity of information shared by X and Y .

Properties 10 Continuous mutual information satisfies the following properties

Example 4 (Shannon formula log2 (1 + SN R)) Let X ∼ N (0, P ) and W ∼ N (0, σ 2 )

Exercise 13 : Prove the Shannon formula for the AWGN.

2.5 Vector mutual information

Properties 11 (Chain-rule) Let X be a random variable and let Y n = (Y1 , ..., Yn ) be a

where we have used the chain rule of entropy.

In this case the mutual information writes as

I(X n ; Y n ) = nI(X; Y ). (57)

Example 5 (Memoryless AWGN) Let X n = (X1 , ..., Xn ) n i.i.d Gaussian variables

Conclusions 2 At the end of this section, you should be able to :

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.