session3

The document discusses various properties of information measures in information theory, including the chain rule for entropy, mutual information, and relative entropy. It provides definitions, proofs, and examples to illustrate these concepts, such as Markov chains and the data processing inequality. The content is structured as a lecture by Richard Combes from CentraleSupelec, focusing on the mathematical foundations of information theory.

Lecture 3: Properties of Information Measures

Information Theory, Richard Combes, CentraleSupelec, 2024


Chain rule for entropy

Property
For any X1 , . . . , Xn we have:
H(X_1, \dots, X_n) = \sum_{i=1}^n H(X_i \mid X_{i-1}, \dots, X_1)

Proof: By definition of conditional entropy:

H(X1 , ..., Xn ) = H(Xn |X1 , ..., Xn−1 ) + H(X1 , ..., Xn−1 )

The result follows by induction over n.


Note The chain rule is useful when X1 , . . . , Xn are defined recursively.
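A quick numerical sanity check of the chain rule for n = 2, as a short Python sketch; the joint pmf below is an arbitrary illustration, not taken from the lecture.

```python
import math

# Hand-picked joint pmf of (X1, X2); the numbers are illustrative.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(dist):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

# Marginal of X1 and the conditional entropy H(X2 | X1).
p1 = {x1: p[(x1, 0)] + p[(x1, 1)] for x1 in (0, 1)}
H_cond = sum(p1[x1] * entropy({x2: p[(x1, x2)] / p1[x1] for x2 in (0, 1)})
             for x1 in (0, 1))

# Chain rule: H(X1, X2) = H(X1) + H(X2 | X1).
assert abs(entropy(p) - (entropy(p1) + H_cond)) < 1e-12
```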


Chain rule for entropy: example

▶ Consider U_1, ..., U_n i.i.d. Bernoulli(a) and

X_i = f(X_{i-1}, U_i)

with f(x, 0) ≠ f(x, 1) for all x.

▶ Given X_{i-1}, ..., X_1, X_i ∈ {f(X_{i-1}, 0), f(X_{i-1}, 1)} w.p. (a, 1 − a).
▶ The conditional entropy reads

H(X_i | X_{i-1}, ..., X_1) = h_2(a)

▶ Applying the chain rule:

H(X_1, ..., X_n) = \sum_{i=1}^n H(X_i | X_{i-1}, ..., X_1) = (n − 1) h_2(a) + H(X_1).
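This can be checked exactly on a small instance. The sketch below picks one concrete f satisfying f(x, 0) ≠ f(x, 1), namely XOR, and takes X_1 uniform on {0, 1} (both choices are illustrative assumptions); it then computes H(X_1, ..., X_n) by exhaustive enumeration and compares with (n − 1) h_2(a) + H(X_1).

```python
import itertools
import math

def h2(a):
    """Binary entropy in bits."""
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)

# One concrete choice with f(x, 0) != f(x, 1): XOR (illustrative assumption).
f = lambda x, u: x ^ u
a, n = 0.3, 5

# Exact joint pmf of (X1, ..., Xn): X1 ~ Bernoulli(1/2), Ui ~ Bernoulli(a).
pmf = {}
for x1 in (0, 1):
    for us in itertools.product((0, 1), repeat=n - 1):
        xs = [x1]
        for u in us:
            xs.append(f(xs[-1], u))
        prob = 0.5 * math.prod(a if u else 1 - a for u in us)
        pmf[tuple(xs)] = pmf.get(tuple(xs), 0.0) + prob

H_joint = -sum(p * math.log2(p) for p in pmf.values() if p > 0)
# Chain rule prediction: (n - 1) h2(a) + H(X1), with H(X1) = 1 bit here.
assert abs(H_joint - ((n - 1) * h2(a) + 1.0)) < 1e-9
```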



Chain rule for mutual information

Property
For any X1 , . . . , Xn we have:
I(X_1, \dots, X_n; Y) = \sum_{i=1}^n I(X_i; Y \mid X_{i-1}, \dots, X_1)

Proof Chain rule for entropy + definition of mutual information:

I(X_1, \dots, X_n; Y) = H(X_1, \dots, X_n) − H(X_1, \dots, X_n \mid Y)
= \sum_{i=1}^n H(X_i \mid X_{i-1}, \dots, X_1) − \sum_{i=1}^n H(X_i \mid X_{i-1}, \dots, X_1, Y)
= \sum_{i=1}^n I(X_i; Y \mid X_{i-1}, \dots, X_1).
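The identity can be verified numerically for n = 2 by expanding every mutual information term into entropies. The sketch below uses a random joint pmf over (X1, X2, Y) on {0, 1}^3; the distribution is purely illustrative.

```python
import itertools
import math
import random

random.seed(0)
# A random joint pmf over (x1, x2, y) in {0,1}^3 (illustrative).
keys = list(itertools.product((0, 1), repeat=3))
w = [random.random() for _ in keys]
p = {k: v / sum(w) for k, v in zip(keys, w)}

def H(idx):
    """Entropy in bits of the marginal over the given coordinate indices."""
    m = {}
    for k, q in p.items():
        kk = tuple(k[i] for i in idx)
        m[kk] = m.get(kk, 0.0) + q
    return -sum(q * math.log2(q) for q in m.values() if q > 0)

# I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y)
lhs = H((0, 1)) + H((2,)) - H((0, 1, 2))
# I(X1; Y) and I(X2; Y | X1), each expanded via entropies
i1 = H((0,)) + H((2,)) - H((0, 2))
i2 = (H((0, 1)) - H((0,))) - (H((0, 1, 2)) - H((0, 2)))
assert abs(lhs - (i1 + i2)) < 1e-12
```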



Chain rule for relative entropy

Property
Consider X, Y discrete random variables with joint distribution pX,Y
and marginal distributions pX , pY respectively. We have:

D(pX,Y ||qX,Y ) = D(pX ||qX ) + D(pY|X ||qY|X )

Proof Using the factorization p_{X,Y}(x, y) = p_X(x) p_{Y|X}(y|x) (and likewise for q):

D(p_{X,Y} || q_{X,Y}) = E[ \log_2 ( p_{X,Y}(X, Y) / q_{X,Y}(X, Y) ) ]
= E[ \log_2 ( p_X(X) / q_X(X) ) ] + E[ \log_2 ( p_{Y|X}(Y|X) / q_{Y|X}(Y|X) ) ]
= D(p_X || q_X) + D(p_{Y|X} || q_{Y|X})

proving the result.
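A numerical check of this decomposition on random joint distributions over {0, 1} × {0, 1}; the pmfs are generated for illustration only.

```python
import math
import random

random.seed(1)

def rand_pmf(n):
    """A random strictly positive pmf of size n (illustrative)."""
    w = [random.random() + 0.05 for _ in range(n)]
    return [v / sum(w) for v in w]

keys = [(x, y) for x in (0, 1) for y in (0, 1)]
p = dict(zip(keys, rand_pmf(4)))
q = dict(zip(keys, rand_pmf(4)))

def D(a, b):
    """Relative entropy in bits between two aligned pmf vectors."""
    return sum(pa * math.log2(pa / pb) for pa, pb in zip(a, b) if pa > 0)

d_joint = D([p[k] for k in keys], [q[k] for k in keys])
pX = [p[(x, 0)] + p[(x, 1)] for x in (0, 1)]
qX = [q[(x, 0)] + q[(x, 1)] for x in (0, 1)]
# Conditional relative entropy D(p_{Y|X} || q_{Y|X}), averaged under p_X.
d_cond = sum(pX[x] * D([p[(x, y)] / pX[x] for y in (0, 1)],
                       [q[(x, y)] / qX[x] for y in (0, 1)])
             for x in (0, 1))
assert abs(d_joint - (D(pX, qX) + d_cond)) < 1e-12
```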


Log-sum inequality
Proposition
For any (ai )i , (bi )i positive
\sum_{i=1}^n a_i \log_2 \frac{a_i}{b_i} ≥ \Big( \sum_{i=1}^n a_i \Big) \log_2 \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}

with equality iff a_i / b_i = c for all i.

Proof The function f(x) = x \log_2 x is strictly convex since f''(x) = 1/(x \ln 2) > 0.
Using Jensen's inequality with weights α_i = b_i / \sum_{j=1}^n b_j:

\sum_{i=1}^n a_i \log_2 \frac{a_i}{b_i} = \Big( \sum_{j=1}^n b_j \Big) \sum_{i=1}^n α_i f\Big(\frac{a_i}{b_i}\Big) ≥ \Big( \sum_{j=1}^n b_j \Big) f\Big( \sum_{i=1}^n α_i \frac{a_i}{b_i} \Big) = \Big( \sum_{i=1}^n a_i \Big) \log_2 \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}.
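A numerical check of the inequality and of its equality case (random positive sequences, illustrative values):

```python
import math
import random

random.seed(2)
# Random positive sequences (a_i), (b_i).
a = [random.uniform(0.1, 5.0) for _ in range(10)]
b = [random.uniform(0.1, 5.0) for _ in range(10)]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
assert lhs >= rhs - 1e-12          # log-sum inequality

# Equality iff a_i / b_i is constant: take a_i = c * b_i.
c = 2.5
ac = [c * bi for bi in b]
lhs_eq = sum(ai * math.log2(ai / bi) for ai, bi in zip(ac, b))
rhs_eq = sum(ac) * math.log2(sum(ac) / sum(b))
assert abs(lhs_eq - rhs_eq) < 1e-9
```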



Markov chains

Definition
X → Y → Z is a Markov chain iff X and Z are independent given Y.
Equivalently we have (X, Y, Z) ∼ pX,Y,Z (x, y, z) with

pX,Y,Z (x, y, z) = pX (x)pY|X (y|x)pZ|Y,X (z|y, x) = pX (x)pY|X (y|x)pZ|Y (z|y).

▶ Interpretation: Z depends on X only through Y


▶ Particular case: if Z = g(Y) then X → Y → Z
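The equivalence between the factorization and conditional independence can be checked directly: build p(x, y, z) = p(x) p(y|x) p(z|y) and verify p(x, z | y) = p(x | y) p(z | y) for every y. The transition probabilities below are an arbitrary illustration.

```python
import itertools

# A Markov chain X -> Y -> Z on {0, 1} (illustrative parameters).
pX = [0.6, 0.4]
pYgX = [[0.7, 0.3], [0.2, 0.8]]   # pYgX[x][y] = p(y | x)
pZgY = [[0.9, 0.1], [0.4, 0.6]]   # pZgY[y][z] = p(z | y)

p = {(x, y, z): pX[x] * pYgX[x][y] * pZgY[y][z]
     for x, y, z in itertools.product((0, 1), repeat=3)}

# Conditional independence: p(x, z | y) = p(x | y) p(z | y) for every y.
for y in (0, 1):
    py = sum(p[(x, y, z)] for x in (0, 1) for z in (0, 1))
    for x, z in itertools.product((0, 1), repeat=2):
        pxz_y = p[(x, y, z)] / py
        px_y = sum(p[(x, y, zz)] for zz in (0, 1)) / py
        pz_y = sum(p[(xx, y, z)] for xx in (0, 1)) / py
        assert abs(pxz_y - px_y * pz_y) < 1e-12
```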



Data processing inequality

Property
If X → Y → Z then I(X; Y) ≥ I(X; Z).

Proof We have:

I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y)

where I(X; Y|Z) ≥ 0 and I(X; Z|Y) = 0 by the Markov property, hence I(X; Y) ≥ I(X; Z).
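The inequality can be observed numerically on a cascade of two binary symmetric channels X → Y → Z; the crossover probabilities are illustrative.

```python
import itertools
import math

# X -> Y -> Z: two binary symmetric channels in cascade (illustrative parameters).
alpha, beta = 0.1, 0.2
pX = [0.5, 0.5]
p = {(x, y, z): pX[x]
     * (1 - alpha if y == x else alpha)
     * (1 - beta if z == y else beta)
     for x, y, z in itertools.product((0, 1), repeat=3)}

def I(i, j):
    """Mutual information in bits between coordinates i and j of (X, Y, Z)."""
    m, mi, mj = {}, {}, {}
    for k, q in p.items():
        m[(k[i], k[j])] = m.get((k[i], k[j]), 0.0) + q
        mi[k[i]] = mi.get(k[i], 0.0) + q
        mj[k[j]] = mj.get(k[j], 0.0) + q
    return sum(q * math.log2(q / (mi[a] * mj[b]))
               for (a, b), q in m.items() if q > 0)

assert I(0, 1) >= I(0, 2) - 1e-12   # I(X;Y) >= I(X;Z)
```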



Data processing inequality: illustration

[Figure: binary channel with crossover probability α mapping input X ∈ {0, 1} to output Y ∈ {0, 1}]

▶ We wish to communicate over a channel with input/output X, Y
▶ A helper informs the receiver of g(Y)
▶ Since X → Y → g(Y) we have: I(X; g(Y)) ≤ I(X; Y)

Takeaway Even the smartest processing of the output of a channel never increases its capacity.
Fano’s inequality

Proposition
If X → Y → X̂ then:

h_2(P(X ≠ X̂)) + P(X ≠ X̂) \log_2 |X| ≥ H(X|Y)

with h_2(a) = a \log_2 \frac{1}{a} + (1 − a) \log_2 \frac{1}{1 − a} the binary entropy.

▶ If H(X|Y) is large, it is impossible to estimate X using Y


▶ Fano’s inequality holds for any estimation procedure
▶ Useful to prove impossibility results
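The bound can be checked on a simple instance: X uniform on {0, 1}, Y a noisy copy of X through a binary symmetric channel, and the identity estimator X̂ = Y (all choices illustrative).

```python
import itertools
import math

def h2(a):
    """Binary entropy in bits."""
    return 0.0 if a in (0.0, 1.0) else -a * math.log2(a) - (1 - a) * math.log2(1 - a)

# X uniform, Y = X through a BSC with crossover 0.15, Xhat = Y (illustrative).
alpha = 0.15
p = {(x, y): 0.5 * (1 - alpha if y == x else alpha)
     for x, y in itertools.product((0, 1), repeat=2)}

# H(X|Y) = H(X, Y) - H(Y)
HXY = -sum(q * math.log2(q) for q in p.values())
pY = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}
HY = -sum(q * math.log2(q) for q in pY.values())
H_X_given_Y = HXY - HY

Pe = sum(q for (x, y), q in p.items() if x != y)   # P(X != Xhat) with Xhat = Y
# Fano: h2(Pe) + Pe * log2 |X| >= H(X|Y), here |X| = 2.
assert h2(Pe) + Pe * math.log2(2) >= H_X_given_Y - 1e-12
```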



Fano’s inequality: Proof

Since X → Y → X̂ is a Markov chain

H(X) − H(X|X̂) = I(X; X̂) ≤ I(X; Y) = H(X) − H(X|Y)

so that
H(X|Y) ≤ H(X|X̂)
Define E = 1{X̂ ̸= X}, using the chain rule in both directions:

H(E|X̂) + H(X|E, X̂) = H(X, E|X̂) = H(X|X̂) + H(E|X, X̂)

Now H(E|X, X̂) = 0 because E is a deterministic function of (X, X̂), which proves:

H(X|X̂) = H(E|X̂) + H(X|E, X̂)



Fano’s inequality: Proof

We have

H(X|E, X̂) ≤ P(E = 1) \log_2(|X| − 1) + P(E = 0) \log_2(1)

because if E = 0 then X = X̂ has one possible value, and if E = 1 then X ≠ X̂ has |X| − 1 possible values.
Finally, since conditioning reduces entropy:

H(E|X̂) ≤ H(E) = h2 (P(E = 1))

which concludes the proof.



Asymptotic equipartition property
Proposition
Consider (Xi )i=1,...,n i.i.d. with common distribution pX :
\frac{1}{n} \sum_{i=1}^n \log_2 \frac{1}{p_X(X_i)} \to H(X) in probability as n → ∞.

Consider (X_i, Y_i)_{i=1,...,n} i.i.d. with common joint distribution p_{X,Y}:

\frac{1}{n} \sum_{i=1}^n \log_2 \frac{1}{p_{X,Y}(X_i, Y_i)} \to H(X, Y) in probability.

\frac{1}{n} \sum_{i=1}^n \log_2 \frac{1}{p_{X|Y}(X_i | Y_i)} \to H(X|Y) in probability.

\frac{1}{n} \sum_{i=1}^n \log_2 \frac{p_{X,Y}(X_i, Y_i)}{p_X(X_i) p_Y(Y_i)} \to I(X; Y) in probability.
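The first convergence can be illustrated by simulation: for X ~ Bernoulli(a) the normalized log-likelihood of a long i.i.d. sample concentrates around H(X). Sample size and parameter below are illustrative.

```python
import math
import random

random.seed(4)
# Empirical check of the AEP for X ~ Bernoulli(0.2).
a, n = 0.2, 200000
H = -a * math.log2(a) - (1 - a) * math.log2(1 - a)

xs = [1 if random.random() < a else 0 for _ in range(n)]
# (1/n) sum_i log2(1 / p_X(X_i))
sample_entropy = sum(math.log2(1 / (a if x else 1 - a)) for x in xs) / n

# For large n the normalized log-likelihood is close to H(X).
assert abs(sample_entropy - H) < 0.02
```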
The typical set

Proposition
Consider X1 , . . . , Xn i.i.d. with common distribution pX .
Given ϵ > 0 define the typical set:
A^n_ϵ = \Big\{ (x_1, ..., x_n) ∈ X^n : \Big| \frac{1}{n} \sum_{i=1}^n \log_2 \frac{1}{p_X(x_i)} − H(X) \Big| ≤ ϵ \Big\}.

Then:
(i) |A^n_ϵ| ≤ 2^{n(H(X)+ϵ)} for all n
(ii) |A^n_ϵ| ≥ (1 − ϵ) 2^{n(H(X)−ϵ)} for n large enough
(iii) P((X_1, . . . , X_n) ∈ A^n_ϵ) ≥ 1 − ϵ for n large enough

Takeaway The typical set is a high-probability set of size ≈ 2^{nH(X)}.
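For small n the typical set can be enumerated exhaustively. The sketch below does so for X ~ Bernoulli(a) and checks bound (i) together with the sandwich bound used in the proof; the parameters are illustrative.

```python
import itertools
import math

# Exhaustive check of the typical-set bounds for X ~ Bernoulli(a); n is kept
# small so that all 2^n sequences can be enumerated (illustrative parameters).
a, n, eps = 0.3, 12, 0.1
H = -a * math.log2(a) - (1 - a) * math.log2(1 - a)

def log2p(xs):
    """log2 of the probability of the sequence xs."""
    return sum(math.log2(a if x else 1 - a) for x in xs)

A = [xs for xs in itertools.product((0, 1), repeat=n)
     if abs(-log2p(xs) / n - H) <= eps]
P_A = sum(2 ** log2p(xs) for xs in A)

# (i) |A| <= 2^{n(H+eps)} holds for every n (since P_A <= 1).
assert len(A) <= 2 ** (n * (H + eps))
# Sandwich bound from the proof.
assert len(A) * 2 ** (-n * (H + eps)) <= P_A + 1e-12
assert P_A <= len(A) * 2 ** (-n * (H - eps)) + 1e-12
```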



The typical set: proof

▶ By definition, (x_1, ..., x_n) ∈ A^n_ϵ if and only if

2^{−n(H(X)+ϵ)} ≤ p_X(x_1) ⋯ p_X(x_n) ≤ 2^{−n(H(X)−ϵ)}

▶ Computing the probability of the typical set:

P((X_1, . . . , X_n) ∈ A^n_ϵ) = \sum_{(x_1,...,x_n) ∈ A^n_ϵ} p_X(x_1) ⋯ p_X(x_n)

▶ Which we bound as

|A^n_ϵ| 2^{−n(H(X)+ϵ)} ≤ P((X_1, . . . , X_n) ∈ A^n_ϵ) ≤ |A^n_ϵ| 2^{−n(H(X)−ϵ)}



The typical set: proof

▶ From asymptotic equipartition the typical set is a high probability set, and for n large enough

1 − ϵ ≤ P((X_1, . . . , X_n) ∈ A^n_ϵ) ≤ 1.

▶ The size of the typical set is bounded as

|A^n_ϵ| ≤ 2^{n(H(X)+ϵ)}, since |A^n_ϵ| 2^{−n(H(X)+ϵ)} ≤ P((X_1, . . . , X_n) ∈ A^n_ϵ) ≤ 1

|A^n_ϵ| ≥ P((X_1, . . . , X_n) ∈ A^n_ϵ) 2^{n(H(X)−ϵ)} ≥ (1 − ϵ) 2^{n(H(X)−ϵ)}

▶ This concludes the proof.



Illustration of typicality

[Figure: the set X^n, of size 2^{n \log_2 |X|}, containing the typical set A^n_ϵ of size ≈ 2^{nH(X)}]

Compression With high probability, X_1, ..., X_n can be represented by about nH(X) bits instead of n \log_2 |X| bits!
