
Kraft’s inequality

An instantaneous code (prefix code, tree code) with the codeword lengths $l_1, \ldots, l_N$ exists if and only if
$$\sum_{i=1}^{N} 2^{-l_i} \le 1$$

Proof: Suppose that we have a tree code. Let $l_{\max} = \max\{l_1, \ldots, l_N\}$.

Expand the tree so that all branches have the depth $l_{\max}$. A codeword at depth $l_i$ has $2^{l_{\max}-l_i}$ leaves underneath itself at depth $l_{\max}$. The sets of leaves under codewords are disjoint. The total number of leaves under codewords is less than or equal to $2^{l_{\max}}$. Thus we have
$$\sum_{i=1}^{N} 2^{l_{\max}-l_i} \le 2^{l_{\max}} \quad\Rightarrow\quad \sum_{i=1}^{N} 2^{-l_i} \le 1$$
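As a quick numerical companion to the inequality, here is a minimal Python sketch (the function names are my own, not from the lecture) that computes the Kraft sum for a list of codeword lengths and checks whether an instantaneous code with those lengths can exist.

```python
def kraft_sum(lengths):
    """Return the Kraft sum: sum of 2**(-l) over the codeword lengths."""
    return sum(2.0 ** (-l) for l in lengths)

def prefix_code_exists(lengths):
    """True if an instantaneous (prefix) code with these lengths can exist."""
    return kraft_sum(lengths) <= 1.0 + 1e-12   # small tolerance for float rounding

# Lengths 1, 2, 3, 3 give 1/2 + 1/4 + 1/8 + 1/8 = 1, so a prefix code exists;
# lengths 1, 2, 2, 2 give 1/2 + 3/4 = 1.25 > 1, so no prefix code is possible.
print(kraft_sum([1, 2, 3, 3]), prefix_code_exists([1, 2, 3, 3]))   # 1.0 True
print(kraft_sum([1, 2, 2, 2]), prefix_code_exists([1, 2, 2, 2]))   # 1.25 False
```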
Kraft’s inequality, cont.

Conversely, given codeword lengths $l_1, \ldots, l_N$ that fulfill Kraft's inequality, we can always construct a tree code.

Start with a complete binary tree where all the leaves are at depth $l_{\max}$.

Assume, without loss of generality, that the codeword lengths are sorted in increasing order.

Choose a free node at depth $l_1$ for the first codeword and remove all its descendants. Do the same for $l_2$ and codeword 2, etc., until we have placed all codewords.
Kraft’s inequality, cont.

Obviously we can place a codeword at depth $l_1$.

In order for the algorithm to work, there must in every step $i$ be free leaves at the maximum depth $l_{\max}$. The number of remaining leaves is
$$2^{l_{\max}} - \sum_{j=1}^{i-1} 2^{l_{\max}-l_j} = 2^{l_{\max}}\Big(1 - \sum_{j=1}^{i-1} 2^{-l_j}\Big) > 2^{l_{\max}}\Big(1 - \sum_{j=1}^{N} 2^{-l_j}\Big) \ge 0$$
where we used the fact that Kraft's inequality is fulfilled.

This shows that there are free leaves in every step. Thus, we can construct a tree code with the given codeword lengths.
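The construction described above translates directly into code. Below is a minimal Python sketch (my own helper, not part of the lecture) that follows the same idea: handle the lengths in increasing order and give each codeword the next free node at its depth, which removes all of that node's descendants from further use.

```python
def construct_prefix_code(lengths):
    """Build binary codewords with the given lengths (must satisfy Kraft's inequality).

    Mirrors the tree argument: lengths are processed in increasing order and each
    codeword occupies the next free node at its depth, blocking its whole subtree.
    """
    if sum(2.0 ** (-l) for l in lengths) > 1.0 + 1e-12:
        raise ValueError("lengths violate Kraft's inequality")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    next_node = 0      # next free node at the current depth, as an integer
    prev_len = 0
    for i in order:
        l = lengths[i]
        next_node <<= (l - prev_len)              # descend to depth l
        codewords[i] = format(next_node, "0{}b".format(l))
        next_node += 1                            # skip this codeword's subtree
        prev_len = l
    return codewords

print(construct_prefix_code([2, 1, 3, 3]))   # ['10', '0', '110', '111']
```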
Kraft-McMillan’s inequality

Kraft's inequality can be shown to be fulfilled for all uniquely decodable codes, not just prefix codes. It is then called Kraft-McMillan's inequality: A uniquely decodable code with the codeword lengths $l_1, \ldots, l_N$ exists if and only if
$$\sum_{i=1}^{N} 2^{-l_i} \le 1$$

Consider $\big(\sum_{i=1}^{N} 2^{-l_i}\big)^n$, where $n$ is an arbitrary positive integer.
$$\Big(\sum_{i=1}^{N} 2^{-l_i}\Big)^n = \sum_{i_1=1}^{N} \cdots \sum_{i_n=1}^{N} 2^{-(l_{i_1} + \cdots + l_{i_n})}$$
Kraft-McMillan’s inequality, cont.

$l_{i_1} + \cdots + l_{i_n}$ is the length of $n$ codewords from the code. The smallest value this exponent can take is $n$, which would happen if all codewords had the length 1. The largest value the exponent can take is $nl$, where $l$ is the maximal codeword length. The summation can then be written as
$$\Big(\sum_{i=1}^{N} 2^{-l_i}\Big)^n = \sum_{k=n}^{nl} A_k\, 2^{-k}$$
where $A_k$ is the number of combinations of $n$ codewords that have the combined length $k$. The number of possible binary sequences of length $k$ is $2^k$. Since the code is uniquely decodable, we must have
$$A_k \le 2^k$$
in order to be able to decode.


Kraft-McMillan’s inequality, cont.

We have
$$\Big(\sum_{i=1}^{N} 2^{-l_i}\Big)^n \le \sum_{k=n}^{nl} 2^k\, 2^{-k} = nl - n + 1$$
which gives us
$$\sum_{i=1}^{N} 2^{-l_i} \le (n(l-1)+1)^{1/n}$$
This is true for all $n$, including when we let $n$ tend to infinity, which finally gives us
$$\sum_{i=1}^{N} 2^{-l_i} \le 1$$

The converse of the inequality has already been proven, since we know
that we can construct a prefix code with the given codeword lengths if
they fulfill Kraft’s inequality, and all prefix codes are uniquely decodable.
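To see numerically why letting $n$ tend to infinity forces the sum down to 1: the right-hand side $(n(l-1)+1)^{1/n}$ grows only polynomially in $n$ before the $n$-th root, so it approaches 1, whereas any Kraft sum strictly greater than 1 would grow exponentially when raised to the power $n$. A tiny check (with $l = 8$ chosen just for illustration):

```python
# The bound (n(l-1)+1)**(1/n) tends to 1 as n grows.
l = 8   # assumed maximal codeword length, picked only for this illustration
for n in (1, 10, 100, 1000, 10000):
    print(n, (n * (l - 1) + 1) ** (1.0 / n))
# n=1 gives 8.0, n=100 about 1.068, n=10000 about 1.0011 -- approaching 1.
```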
Instantaneous codes

One consequence of Kraft-McMillan's inequality is that there is nothing to gain by using codes that are uniquely decodable but not instantaneous.

Given a uniquely decodable but not instantaneous code with codeword lengths $l_1, \ldots, l_N$, Kraft's inequality tells us that we can always construct an instantaneous code with exactly the same codeword lengths. This new code will have the same rate as the old code, but it will be easier to decode.
Performance measure

We measure how good a code is with the mean data rate $R$ (more often just data rate or rate).
$$R = \frac{\text{average number of bits per codeword}}{\text{average number of symbols per codeword}} \quad [\text{bits/symbol}]$$
Since we're doing data compression, we want the rate to be as low as possible.

If we initially assume that we have a memoryless source $X_j$ and code one symbol at a time using a tree code, then $R$ is given by
$$R = \bar{l} = \sum_{i=1}^{L} p_i \cdot l_i \quad [\text{bits/symbol}]$$
where $L$ is the alphabet size and $p_i$ the probability of symbol $i$.

$\bar{l}$ is the mean codeword length [bits/codeword].
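For a concrete feel for the formula, here is a small Python sketch (the probabilities and lengths are made up for illustration) that evaluates $R = \bar{l}$ for a 4-symbol memoryless source:

```python
def mean_codeword_length(probs, lengths):
    """Mean codeword length l-bar = sum_i p_i * l_i, in bits per symbol."""
    assert len(probs) == len(lengths)
    return sum(p * l for p, l in zip(probs, lengths))

probs = [0.5, 0.25, 0.125, 0.125]   # made-up symbol probabilities
lengths = [1, 2, 3, 3]              # codeword lengths of a prefix code
print(mean_codeword_length(probs, lengths))   # 1.75 bits/symbol
```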
Theoretical lower bound

Suppose that we have a memoryless source $X_j$ and that we code one symbol at a time with a prefix code. Then the mean codeword length $\bar{l}$ (which is equal to the rate) is bounded by
$$\bar{l} \ge -\sum_{i=1}^{L} p_i \cdot \log_2 p_i = H(X_j)$$

$H(X_j)$ is called the entropy of the source.


Theoretical lower bound, cont.

Consider the difference between the entropy and the mean codeword length
$$\begin{aligned}
H(X_j) - \bar{l} &= -\sum_{i=1}^{L} p_i \log p_i - \sum_{i=1}^{L} p_i\, l_i = \sum_{i=1}^{L} p_i \Big(\log \frac{1}{p_i} - l_i\Big) \\
&= \sum_{i=1}^{L} p_i \Big(\log \frac{1}{p_i} - \log 2^{l_i}\Big) = \sum_{i=1}^{L} p_i \log \frac{2^{-l_i}}{p_i} \\
&\le \frac{1}{\ln 2} \sum_{i=1}^{L} p_i \Big(\frac{2^{-l_i}}{p_i} - 1\Big) = \frac{1}{\ln 2} \Big(\sum_{i=1}^{L} 2^{-l_i} - \sum_{i=1}^{L} p_i\Big) \\
&\le \frac{1}{\ln 2}(1 - 1) = 0
\end{aligned}$$
where we used the fact that $\ln x \le x - 1$ and Kraft's inequality.
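The bound is met with equality exactly when every probability is a negative power of two, so that $l_i = -\log_2 p_i$ is an integer and Kraft's inequality holds with equality. A quick numerical check of the bound (a sketch with the same made-up source as before):

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_i p_i log2 p_i, skipping zero-probability symbols."""
    return -sum(p * log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]   # dyadic probabilities
lengths = [1, 2, 3, 3]              # l_i = -log2 p_i
l_bar = sum(p * l for p, l in zip(probs, lengths))
print(entropy(probs), l_bar)        # 1.75 1.75 -- here l_bar equals H(X)
```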
Shannon’s information measure

We want a measure $I$ of information that is connected to the probabilities of events.

Some desired properties:
- The information $I(A)$ of an event $A$ should only depend on the probability $P(A)$ of the event.
- The lower the probability of the event, the larger the information should be.
- If the probability of an event is 1, the information should be 0.
- Information should be a continuous function of the probability.
- If the independent events $A$ and $B$ both happen, the information should be the sum of the informations, $I(A) + I(B)$.
This implies that information should be a logarithmic measure.
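A one-line numerical check of the additivity property (with made-up probabilities), showing why the logarithm does the job: for independent events the joint probability is a product, and the logarithm turns that product into a sum.

```python
from math import log2

p_a, p_b = 0.5, 0.25        # made-up probabilities of two independent events
i_a = -log2(p_a)            # 1 bit
i_b = -log2(p_b)            # 2 bits
i_ab = -log2(p_a * p_b)     # 3 bits = i_a + i_b, as the last property requires
print(i_a, i_b, i_ab)
```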
Information Theory

The information $I(A;B)$ that is given about an event $A$ when event $B$ happens is defined as
$$I(A;B) \triangleq \log_b \frac{P(A|B)}{P(A)}$$
where we assume that $P(A) \ne 0$ and $P(B) \ne 0$. In the future we assume, unless otherwise specified, that $b = 2$. The unit of information is then called bit. (If $b = e$ the unit is called nat.)

$I(A;B)$ is symmetric in $A$ and $B$:
$$I(A;B) = \log \frac{P(A|B)}{P(A)} = \log \frac{P(AB)}{P(A)P(B)} = \log \frac{P(B|A)}{P(B)} = I(B;A)$$
Therefore the information is also called mutual information.
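To make the symmetry concrete, here is a tiny sketch (the probabilities are invented for illustration) that evaluates $I(A;B)$ both ways:

```python
from math import log2

p_a, p_b, p_ab = 0.4, 0.5, 0.3            # P(A), P(B), P(A and B), made up

i_ab = log2(p_ab / (p_a * p_b))           # log P(AB) / (P(A)P(B))
i_ba = log2((p_ab / p_a) / p_b)           # log P(B|A) / P(B) -- same value
print(i_ab, i_ba)                         # both about 0.585 bits
```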
Information Theory, cont.

We further have that
$$-\infty \le I(A;B) \le -\log P(A)$$
with "equality" to the left if $P(A|B) = 0$ and equality to the right if $P(A|B) = 1$.

$I(A;B) = 0$ means that the events $A$ and $B$ are independent.

$-\log P(A)$ is the amount of information that needs to be given in order for us to determine that event $A$ has happened.
$$I(A;A) = \log \frac{P(A|A)}{P(A)} = -\log P(A)$$
We define the self-information of the event $A$ as
$$I(A) \triangleq -\log P(A)$$
Information Theory, cont.

If we apply the definitions to the events $\{X = x\}$ and $\{Y = y\}$ we get
$$I(X = x) = -\log p_X(x)$$
and
$$I(X = x; Y = y) = \log \frac{p_{X|Y}(x|y)}{p_X(x)}$$
These are real functions of the random variable $X$ and the random variable $(X, Y)$, so the mean values are well defined.
$$H(X) \triangleq E\{I(X = x)\} = -\sum_{i=1}^{L} p_X(x_i) \log p_X(x_i)$$
is called the entropy (or uncertainty) of the random variable $X$.
Information Theory, cont.

$$I(X;Y) \triangleq E\{I(X = x; Y = y)\} = \sum_{i=1}^{L} \sum_{j=1}^{M} p_{XY}(x_i, y_j) \log \frac{p_{X|Y}(x_i|y_j)}{p_X(x_i)}$$
is called the mutual information between the random variables $X$ and $Y$.


Information Theory, cont.

If $(X, Y)$ is viewed as one random variable we get
$$H(X, Y) = -\sum_{i=1}^{L} \sum_{j=1}^{M} p_{XY}(x_i, y_j) \log p_{XY}(x_i, y_j)$$
This is called the joint entropy of $X$ and $Y$.

It then follows that the mutual information can be written as
$$\begin{aligned}
I(X;Y) &= E\Big\{\log \frac{p_{X|Y}}{p_X}\Big\} = E\Big\{\log \frac{p_{XY}}{p_X\, p_Y}\Big\} \\
&= E\{\log p_{XY} - \log p_X - \log p_Y\} \\
&= E\{\log p_{XY}\} - E\{\log p_X\} - E\{\log p_Y\} \\
&= H(X) + H(Y) - H(X, Y)
\end{aligned}$$
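The identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$ is easy to evaluate numerically. A small Python sketch with an invented $2 \times 2$ joint distribution:

```python
from math import log2

def H(probs):
    """Entropy of a list of probabilities, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A made-up joint pmf p_XY(x_i, y_j); rows are values of X, columns values of Y.
p_xy = [[0.3, 0.2],
        [0.1, 0.4]]
p_x = [sum(row) for row in p_xy]               # marginal of X
p_y = [sum(col) for col in zip(*p_xy)]         # marginal of Y
h_xy = H([p for row in p_xy for p in row])     # joint entropy H(X,Y)
mutual_info = H(p_x) + H(p_y) - h_xy           # I(X;Y), about 0.125 bits here
print(H(p_x), H(p_y), h_xy, mutual_info)
```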
Information Theory, cont.

Useful inequality:
$$\log r \le (r - 1) \log e$$
with equality if and only if $r = 1$. This can also be written
$$\ln r \le r - 1$$
If $X$ takes values in $\{x_1, x_2, \ldots, x_L\}$ we have that
$$0 \le H(X) \le \log L$$
with equality to the left if and only if there is an $i$ such that $p_X(x_i) = 1$, and with equality to the right if and only if $p_X(x_i) = 1/L$ for all $i = 1, \ldots, L$.
Information Theory, cont.

Proof (left inequality):
$$-p_X(x_i) \cdot \log p_X(x_i) \begin{cases} = 0, & p_X(x_i) = 0 \\ > 0, & 0 < p_X(x_i) < 1 \\ = 0, & p_X(x_i) = 1 \end{cases}$$
Thus we have that $H(X) \ge 0$ with equality if and only if $p_X(x_i)$ is either 0 or 1 for each $i$, but this means that $p_X(x_i) = 1$ for exactly one $i$.
Information Theory, cont.

Proof (right inequality):
$$\begin{aligned}
H(X) - \log L &= -\sum_{i=1}^{L} p_X(x_i) \log p_X(x_i) - \log L \\
&= \sum_{i=1}^{L} p_X(x_i) \log \frac{1}{L \cdot p_X(x_i)} \\
&\le \sum_{i=1}^{L} p_X(x_i)\Big(\frac{1}{L \cdot p_X(x_i)} - 1\Big) \log e \\
&= \Big(\sum_{i=1}^{L} \frac{1}{L} - \sum_{i=1}^{L} p_X(x_i)\Big) \log e \\
&= (1 - 1) \log e = 0
\end{aligned}$$
with equality if and only if $p_X(x_i) = \frac{1}{L}$ for all $i = 1, \ldots, L$.
Information Theory, cont.
The conditional entropy of $X$ given the event $Y = y_j$ is
$$H(X|Y = y_j) \triangleq -\sum_{i=1}^{L} p_{X|Y}(x_i|y_j) \log p_{X|Y}(x_i|y_j)$$
We have that
$$0 \le H(X|Y = y_j) \le \log L$$
The conditional entropy of $X$ given $Y$ is defined as
$$H(X|Y) \triangleq E\{-\log p_{X|Y}\} = -\sum_{i=1}^{L} \sum_{j=1}^{M} p_{XY}(x_i, y_j) \log p_{X|Y}(x_i|y_j)$$
We have that
$$0 \le H(X|Y) \le \log L$$
Information Theory, cont.

We also have that
$$\begin{aligned}
H(X|Y) &= -\sum_{i=1}^{L} \sum_{j=1}^{M} p_{XY}(x_i, y_j) \log p_{X|Y}(x_i|y_j) \\
&= -\sum_{i=1}^{L} \sum_{j=1}^{M} p_Y(y_j)\, p_{X|Y}(x_i|y_j) \log p_{X|Y}(x_i|y_j) \\
&= -\sum_{j=1}^{M} p_Y(y_j) \sum_{i=1}^{L} p_{X|Y}(x_i|y_j) \log p_{X|Y}(x_i|y_j) \\
&= \sum_{j=1}^{M} p_Y(y_j)\, H(X|Y = y_j)
\end{aligned}$$
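A numerical check of this identity (reusing the invented $2 \times 2$ joint pmf from the earlier sketch), comparing the weighted average with the double-sum definition:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

p_xy = [[0.3, 0.2],                     # made-up joint pmf, rows x_i, columns y_j
        [0.1, 0.4]]
p_y = [sum(col) for col in zip(*p_xy)]  # marginal of Y

# H(X|Y) as the weighted average sum_j p_Y(y_j) H(X | Y = y_j).
h_weighted = 0.0
for j, py in enumerate(p_y):
    cond = [p_xy[i][j] / py for i in range(len(p_xy))]   # p_X|Y(. | y_j)
    h_weighted += py * H(cond)

# H(X|Y) directly from the double-sum definition on the previous slide.
h_direct = -sum(p_xy[i][j] * log2(p_xy[i][j] / p_y[j])
                for i in range(len(p_xy)) for j in range(len(p_y))
                if p_xy[i][j] > 0)

print(h_weighted, h_direct)   # both about 0.875 bits
```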
Information Theory, cont.

We have
$$p_{X_1 X_2 \ldots X_N} = p_{X_1} \cdot p_{X_2|X_1} \cdots p_{X_N|X_1 \ldots X_{N-1}}$$
which leads to the chain rule
$$H(X_1 X_2 \ldots X_N) = H(X_1) + H(X_2|X_1) + \cdots + H(X_N|X_1 \ldots X_{N-1})$$

We also have that
$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$
Information Theory, cont.

Other interesting inequalities:
$$H(X|Y) \le H(X)$$
with equality if and only if $X$ and $Y$ are independent.
$$I(X;Y) \ge 0$$
with equality if and only if $X$ and $Y$ are independent.

If $f(X)$ is a function of $X$, we have
$$H(f(X)) \le H(X)$$
$$H(f(X)|X) = 0$$
$$H(X, f(X)) = H(X)$$
Entropy for sources

The entropy, or entropy rate, of a stationary random source $X_n$ is defined as
$$\lim_{n \to \infty} \frac{1}{n} H(X_1 \ldots X_n) = \lim_{n \to \infty} \frac{1}{n}\big(H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1 \ldots X_{n-1})\big) = \lim_{n \to \infty} H(X_n|X_1 \ldots X_{n-1})$$

For a memoryless source, the entropy rate is equal to $H(X_n)$.


Entropy for Markov sources

The entropy rate of a stationary Markov source $X_n$ of order $k$ is given by
$$H(X_n|X_{n-1} \ldots X_{n-k})$$
The entropy rate of the state sequence $S_n$ is the same as the entropy rate of the source
$$H(S_n|S_{n-1} S_{n-2} \ldots) = H(S_n|S_{n-1}) = H(X_n \ldots X_{n-k+1}|X_{n-1} \ldots X_{n-k}) = H(X_n|X_{n-1} \ldots X_{n-k})$$
and thus the entropy rate can also be calculated by
$$H(S_n|S_{n-1}) = \sum_{j=1}^{L^k} w_j \cdot H(S_n|S_{n-1} = s_j)$$
i.e., a weighted average of the entropies of the outgoing probabilities of each state, where $w_j$ is the stationary probability of state $s_j$.
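For a first-order source ($k = 1$) the states are simply the symbols, and the weighted-average formula is easy to evaluate. A minimal sketch with an invented 2-state transition matrix:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Made-up transition matrix of a 2-state Markov source; row j holds the
# outgoing probabilities P(next state | current state j).
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution w (solves w P = w); closed form for a 2-state chain.
w1 = P[1][0] / (P[0][1] + P[1][0])
w = [w1, 1.0 - w1]                    # here [0.8, 0.2]

entropy_rate = sum(wj * H(row) for wj, row in zip(w, P))
print(w, entropy_rate)                # about 0.57 bits/symbol
```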
