
Chapter one

1- Random Variables

A random variable, usually written X, is a variable whose possible values are numerical
outcomes of a random phenomenon. There are two types of random variables, discrete
and continuous. All random variables have a cumulative distribution function. It is a
function giving the probability that the random variable X is less than or equal to x, for
every value x.

1-1 Discrete Random Variables

A discrete random variable is one which may take on only a countable number of distinct
values such as 0,1,2,3,4,........ If a random variable can take only a finite number of
distinct values, then it must be discrete. Examples of discrete random variables include
the number of children in a family, the number of defective light bulbs in a box of ten.
The probability distribution of a discrete random variable is a list of probabilities
associated with each of its possible values. It is also sometimes called the probability
function or the probability mass function.

When the sample space Ω has a finite number of equally likely outcomes, the discrete uniform probability law applies and the probability of any event A is given by:

P(A) = (number of elements of A) / (number of elements of Ω)

This distribution may also be described by the probability histogram. Suppose a random
variable X may take k different values, with the probability that X = 𝑥𝑖 defined to be
P(X = 𝑥𝑖 ) =𝑃𝑖 . The probabilities 𝑃𝑖 must satisfy the following:

1- 0 < P_i < 1 for each i
2- P_1 + P_2 + ⋯ + P_k = 1, or equivalently Σ_{i=1}^{k} P_i = 1

Example
Suppose a variable X can take the values 1, 2, 3, or 4. The probabilities associated
with each outcome are described by the following table:

Outcome 1 2 3 4
Probability 0.1 0.3 0.4 0.2

Figure 1: probability distribution

The cumulative distribution function for the above probability distribution is


calculated as follows:
The probability that X is less than or equal to 1 is 0.1,
the probability that X is less than or equal to 2 is 0.1+0.3 = 0.4,
the probability that X is less than or equal to 3 is 0.1+0.3+0.4 = 0.8, and
the probability that X is less than or equal to 4 is 0.1+0.3+0.4+0.2 = 1.
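As an added illustration (not part of the original notes), the cumulative distribution above can be reproduced with a short Python sketch that simply accumulates the probabilities from the table:

```python
# Probability table of the example above
outcomes = [1, 2, 3, 4]
pmf = [0.1, 0.3, 0.4, 0.2]

# Cumulative distribution: F(x) = P(X <= x)
running = 0.0
for x, p in zip(outcomes, pmf):
    running += p
    print(f"P(X <= {x}) = {running:.1f}")   # 0.1, 0.4, 0.8, 1.0
```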



H.W: Given the text (ABCAABDCAA), calculate the probability of each letter and plot the probability distribution and the cumulative distribution.

Figure 2: cumulative distribution

1-2 Continuous Random Variables

A continuous random variable is one which takes an infinite number of possible values.
Continuous random variables are usually measurements. Examples include height,
weight and the amount of sugar in an orange. A continuous random variable is not
defined at specific values. Instead, it is defined over an interval of values, and is
represented by the area under a curve. The curve, which represents a function p(x),
must satisfy the following:

1: The curve has no negative values (p(x) ≥ 0 for all x)


2: The total area under the curve is equal to 1.

A curve meeting these requirements is known as a density curve. If every interval of equal width has the same probability, the curve describing the distribution is a rectangle with constant height across the interval and zero height elsewhere; such curves are known as uniform distributions.


Figure 3: Uniform distribution

Another type of distribution is the normal distribution having a bell-shaped density


curve described by its mean 𝜇 and standard deviation 𝜎. The height of a normal
density curve at a given point x is given by:

h = (1 / (σ√(2π))) · e^{−0.5((x−μ)/σ)²}

Figure 4: The Standard normal curve

2- Joint Probability:

Joint probability is the probability of event Y occurring at the same time as event X. Its notation is P(X ∩ Y) or P(X, Y), read as "the joint probability of X and Y". For independent events,

P(X, Y) = P(X) × P(Y)


If X and Y are discrete random variables, then 𝑓(𝑥, 𝑦) must satisfy:
0 ≤ 𝑓(𝑥, 𝑦) ≤ 1 and,

Σ_x Σ_y f(x, y) = 1

If X and Y are continuous random variables, then 𝑓(𝑥, 𝑦) must satisfy:


𝑓(𝑥, 𝑦) ≥ 0 and,
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

Example:
For discrete random variable, if the probability of rolling a four on one die is
𝑃(𝑋) and if the probability of rolling a four on second die is 𝑃(𝑌). Find 𝑃(𝑋, 𝑌).
Solution:
We have 𝑃(𝑋) = 𝑃(𝑌) = 1/6
P(X, Y) = P(X) × P(Y) = (1/6) × (1/6) = 1/36 ≈ 0.0278 = 2.8%

3- Conditional Probabilities:

Conditional probability arises when there are dependent events. We use the symbol "|" to mean "given":

- P(B|A) means "Event B given Event A has occurred".

- P(B|A) is also called the "Conditional Probability" of B given A has occurred .

- And we write it as

P(A | B) = (number of elements of A and B) / (number of elements of B)


Or

P(A | B) = P(A ∩ B) / P(B)

Where 𝑃(𝐵) > 0

Example: A box contains 5 green pencils and 7 yellow pencils. Two pencils are chosen
at random from the box without replacement. What is the probability they are different
colors?

Solution: Using a tree diagram, P(different colors) = P(green then yellow) + P(yellow then green) = (5/12)(7/11) + (7/12)(5/11) = 70/132 = 35/66 ≈ 0.53.

4- Bayes’ Theorem

Bayes’ theorem: an equation that allows us to manipulate conditional probabilities.


For two events, A and B, Bayes’ theorem lets us go from P(B|A) to P(A|B).

P(A | B) = P(A ∩ B) / P(B),   P(B) ≠ 0
P(B | A) = P(A ∩ B) / P(A),   P(A) ≠ 0
P(A ∩ B) = P(A | B) × P(B) = P(B | A) × P(A)

P(A | B) = P(B | A) × P(A) / P(B),   P(B) ≠ 0
Example:

If P(X = 0) = 0.2, P(X = 1) = 0.3, P(X = 2) = 0.5, P(Y = 0) = 0.4 and P(Y = 1) = 0.6, determine P(X = 0 | Y = 0) and P(X = 1 | Y = 0).

5- Independence of Two Variables:

The concept of independent random variables is very similar to independent events.


If two events A and B are independent, we have P(A,B)=P(A)P(B)=P(A∩B). For
example, let’s say you wanted to know the average weight of a bag of sugar so you
randomly sample 50 bags from various grocery stores. You wouldn’t expect the
weight of one bag to affect another, so the variables are independent.

6- Venn's Diagram:

A Venn diagram is a diagram that shows all possible logical relations between a finite collection of different sets. These diagrams depict elements as points in the plane, and sets as regions inside closed curves. A Venn diagram consists of multiple overlapping closed curves, usually circles, each representing a set. The points inside a curve labelled S represent elements of the set S, while points outside the boundary represent elements not in the set S. Fig. 5 shows the sets A = {1, 2, 3}, B = {4, 5} and U = {1, 2, 3, 4, 5, 6}.


Figure 5: An example of Venn's Diagrams



Example:

From the adjoining Venn diagram of Fig. 6, find the following sets:

A, B, C, X, A', B', C-A, B-C, A ∪ B, A ∩ B

(𝐵 ∪ 𝐶)′….

Solution:

𝐴 = {1,3, 4, 5}, 𝐵 = {2, 4, 5, 6}, 𝐶 = {1, 5, 6, 7, 10}

𝑋 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

A′ = {2, 6, 7, 8, 9, 10},

Figure 6: Venn's Diagram

𝐵′ = {1, 3, 7, 8, 9, 10},

(C − A) = {6, 7, 10},

(B − C) = {2, 4},

(𝐴 ∪ 𝐵) = {1, 2, 3, 4, 5, 6},

(𝐴 ∩ 𝐵) = {4, 5},

(B ∪ C)′ = {3, 8, 9}

7- Model of information transmission system

Transmitting a message from a transmitter to a receiver can be sketched as in Fig. 7:

The components of information system as described by Shannon are:

1. An information source is a device which randomly delivers symbols from an


alphabet. As an example, a PC (Personal Computer) connected to internet is an
information source which produces binary digits from the binary alphabet {0, 1}.
2. A source encoder allows one to represent the data source more compactly by
eliminating redundancy: it aims to reduce the data rate.



3. A channel encoder adds redundancy to protect the transmitted signal against
transmission errors.

Figure 7: Shannon paradigm


4. A channel is a system which links a transmitter to a receiver. It includes the signaling equipment and a pair of copper wires, a coaxial cable or an optical fiber, among other possibilities.
5. The remaining blocks form the receiver end; each block performs the inverse processing of the corresponding transmitter block.

8- Self- information:

In information theory, self-information is a measure of the information


content associated with the outcome of a random variable. It is expressed in a unit of
information, for example bits, nats, or hartleys, depending on the base of the logarithm
used in its calculation.



A bit is the basic unit of information in computing and digital communications. A bit
can have only one of two values, and may therefore be physically implemented with a
two-state device. These values are most commonly represented as 0 and 1.

The nat, the natural unit of information (sometimes also called nit or nepit), is a unit of information or entropy based on natural logarithms and powers of e, rather than the powers of 2 and base-2 logarithms which define the bit. This unit is also known by its unit symbol, the nat.

The hartley (symbol Hart) is a unit of information defined by International


Standard IEC 80000-13 of the International Electrotechnical Commission. One hartley
is the information content of an event if the probability of that event occurring is 1/10. It
is therefore equal to the information contained in one decimal digit (or dit).

1 Hart ≈ 3.322 Sh ≈ 2.303 nat.


The amount of self-information contained in a probabilistic event depends only on
the probability of that event: the smaller its probability, the larger the self-information
associated with receiving the information that the event indeed occurred as shown in
Fig.8.
i- Information is zero if P(x_i) = 1 (certain event)
ii- Information increases as P(x_i) decreases toward zero
iii- Information is a positive quantity
The log function satisfies all three points, hence:
I(x_i) = log_a (1 / P(x_i)) = −log_a P(x_i)
where I(x_i) is the self-information of x_i, and:
i- If a = 2, then I(x_i) has the unit of bits
ii- If a = e = 2.71828, then I(x_i) has the unit of nats
iii- If a = 10, then I(x_i) has the unit of hartleys
Recall that log_a x = ln x / ln a.


Example 1:

A fair die is thrown, find the amount of information gained if you are told that 4 will
appear.

Solution:

P(1) = P(2) = ⋯ = P(6) = 1/6
I(4) = −log_2(1/6) = −ln(1/6) / ln 2 = 2.585 bits

Example 2:

A biased coin has P(Head)=0.3. Find the amount of information gained if you are told
that a tail will appear.

Solution:

𝑃(𝑡𝑎𝑖𝑙) = 1 − 𝑃(𝐻𝑒𝑎𝑑) = 1 − 0.3 = 0.7

I(tail) = −log_2(0.7) = −ln(0.7) / ln 2 = 0.5146 bits
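The self-information values of Examples 1 and 2 can be checked with the following Python sketch (an illustrative addition, not part of the original notes); the base argument selects the unit (2 for bits, e for nats, 10 for hartleys):

```python
import math

def self_information(p, base=2):
    """I(x) = -log_base P(x)."""
    return -math.log(p, base)

print(self_information(1 / 6))              # fair die, '4' appears: ~2.585 bits
print(self_information(0.7))                # biased coin, tail: ~0.515 bits
print(self_information(1 / 6, math.e))      # the same event expressed in nats
```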

HW
A communication system source emits the following symbols with the corresponding probabilities: A = 1/2, B = 1/4, C = 1/8. Calculate the information conveyed by each source output.


Figure 8: Relation between probability and self-information

9- Average information (entropy):

In information theory, entropy is the average amount of information contained in


each message received. Here, message stands for an event, sample or character
drawn from a distribution or data stream. Entropy thus characterizes our uncertainty
about our source of information.

9-1 Source Entropy:

If the source produces messages that are not equiprobable, then I(x_i), i = 1, 2, …, n, are different. The statistical average of I(x_i) over i gives the average amount of uncertainty associated with the source X. This average is called the source entropy, denoted H(X), and is given by:
H(X) = Σ_{i=1}^{n} P(x_i) I(x_i)
∴ H(X) = −Σ_{i=1}^{n} P(x_i) log_a P(x_i)
Example:
Find the entropy of the source producing the following messages:

P(x_1) = 0.25, P(x_2) = 0.1, P(x_3) = 0.15, and P(x_4) = 0.5

Solution:
H(X) = −Σ_{i=1}^{n} P(x_i) log_a P(x_i)
H(X) = −[0.25 ln 0.25 + 0.1 ln 0.1 + 0.15 ln 0.15 + 0.5 ln 0.5] / ln 2
H(X) = 1.7427 bits/symbol
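A minimal Python sketch (added here for illustration, not part of the original notes) reproduces the source entropy of this example:

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum P(x_i) log_base P(x_i); zero-probability terms are skipped."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.25, 0.1, 0.15, 0.5]))      # ~1.7427 bits/symbol
```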
9-2 Binary Source entropy:

In information theory, the binary entropy function, denoted H(X) or H_b(X), is defined as the entropy of a Bernoulli process with probability p of one of its two values.
Mathematically, the Bernoulli trial is modelled as a random variable X that can take on
only two values: 0 and 1:

𝑃(0 𝑇 ) + 𝑃(1 𝑇 ) = 1 → 𝑃(1 𝑇 ) = 1 − 𝑃(0 𝑇 )


We have:
H(X) = −Σ_{i=1}^{n} P(x_i) log_a P(x_i)
H_b(X) = −Σ_{i=1}^{2} P(x_i) log_a P(x_i)

Then:
H_b(X) = −[P(0_T) log_2 P(0_T) + (1 − P(0_T)) log_2 (1 − P(0_T))]  bits/symbol
If P(0_T) = 0.2, then P(1_T) = 1 − 0.2 = 0.8, and substituting in the above equation gives
H_b(X) = −[0.2 log_2(0.2) + 0.8 log_2(0.8)] = 0.722 bits/symbol

9-3 Maximum Source Entropy:

For a binary source, if P(0_T) = P(1_T) = 0.5, then the entropy is:

H_b(X) = −[0.5 log_2(0.5) + 0.5 log_2(0.5)] = −log_2(1/2) = log_2(2) = 1 bit

Note that H_b(X) is maximum (equal to 1 bit) when P(0_T) = P(1_T) = 0.5; the entropy of a binary source, or of any source having only two values, varies with P(0_T) as shown in Fig. 9.

Figure 9: Entropy of a binary source

For any non-binary source, if all messages are equiprobable, then P(x_i) = 1/n, so that:
H(X) = H(X)_max = −[(1/n) log_a(1/n)] × n = −log_a(1/n) = log_a n  bits/symbol,
which is the maximum value of the source entropy. Also, H(X) = 0 if one of the messages has the probability of a certain event, i.e. p(x) = 1.

9-4 Source Entropy Rate:


It is the average amount of information produced per second:
R(X) = H(X) × (rate of producing the symbols)  bits/sec (bps)
The unit of H(X) is bits/symbol and the rate of producing the symbols is in symbols/sec, so the unit of R(X) is bits/sec.
Sometimes
R(X) = H(X) / τ̄,
where
τ̄ = Σ_{i=1}^{n} τ_i P(x_i)
τ̄ is the average time duration of the symbols and τ_i is the time duration of the symbol x_i.

Example 1:
A source produces dots '.' and dashes '-' with P(dot) = 0.65. The time duration of a dot is 200 ms and that of a dash is 800 ms. Find the average source entropy rate.
Solution:
P(dash) = 1 − P(dot) = 1 − 0.65 = 0.35
H(X) = −[0.65 log_2(0.65) + 0.35 log_2(0.35)] = 0.934 bits/symbol
τ̄ = 0.2 × 0.65 + 0.8 × 0.35 = 0.41 sec
R(X) = H(X) / τ̄ = 0.934 / 0.41 = 2.278 bps
Example 2:
A discrete source emits one of five symbols once every millisecond. The symbol
probabilities are 1/2, 1/4, 1/8, 1/16 and 1/16 respectively. Calculate the information rate.

Solution:
H = Σ_{i=1}^{5} P_i log_2(1/P_i)
H = (1/2) log_2 2 + (1/4) log_2 4 + (1/8) log_2 8 + (1/16) log_2 16 + (1/16) log_2 16
H = 0.5 + 0.5 + 0.375 + 0.25 + 0.25 = 1.875 bits/symbol

R = H / τ = 1.875 / 10⁻³ = 1875 bits/sec = 1.875 kbps
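Both entropy-rate examples can be verified with the sketch below (an added illustration, assuming the symbol probabilities and durations given above):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example 1: dots and dashes with different durations
p_dot_dash = [0.65, 0.35]
tau = [0.2, 0.8]                                        # seconds per symbol
H1 = entropy(p_dot_dash)                                # ~0.934 bits/symbol
tau_avg = sum(p * t for p, t in zip(p_dot_dash, tau))   # 0.41 s
print(H1 / tau_avg)                                     # ~2.278 bps

# Example 2: five symbols, one emitted every millisecond
H2 = entropy([1/2, 1/4, 1/8, 1/16, 1/16])               # 1.875 bits/symbol
print(H2 / 1e-3)                                        # 1875 bps
```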



HW:

A source produces dots and dashes; the probability of the dot is twice the probability
of the dash. The duration of the dot is 10msec and the duration of the dash is set to
three times the duration of the dot. Calculate the source entropy rate.

10- Mutual information for noisy channel:


Consider the set of symbols x_1, x_2, …, x_n that the transmitter Tx may produce. The receiver Rx may receive y_1, y_2, …, y_m. Theoretically, if noise and jamming are neglected, then the set X = the set Y. However, due to noise and jamming, there will be a conditional probability P(y_j | x_i). Define:
1- P(x_i) to be what is called the a priori probability of the symbol x_i, which is the probability of selecting x_i for transmission.
2- P(x_i | y_j) to be what is called the a posteriori probability of the symbol x_i after the reception of y_j.
The amount of information that y_j provides about x_i is called the mutual information between x_i and y_j. This is given by:
I(x_i, y_j) = log_2(a posteriori prob / a priori prob) = log_2(P(x_i | y_j) / P(x_i))

Properties of 𝑰(𝒙𝒊 , 𝒚𝒋 ):

1- It is symmetric, 𝐼(𝑥𝑖 , 𝑦𝑗 ) = 𝐼(𝑦𝑗 , 𝑥𝑖 ).


2- I(x_i, y_j) > 0 if the a posteriori probability > the a priori probability; y_j provides positive information about x_i.
3- I(x_i, y_j) = 0 if the a posteriori probability = the a priori probability, which is the case of statistical independence, when y_j provides no information about x_i.
4- I(x_i, y_j) < 0 if the a posteriori probability < the a priori probability; y_j provides negative information about x_i, or y_j adds ambiguity.
Also, I(x_i, y_j) = log_2(P(y_j | x_i) / P(y_j))

Example:
Show that I(X, Y) is zero for extremely noisy channel.
Solution:
For an extremely noisy channel, y_j gives no information about x_i; the receiver cannot decide anything about x_i. It is as if we transmit a deterministic signal x_i but the receiver receives a noise-like signal y_j that has no correlation with x_i. Then x_i and y_j are statistically independent, so that P(x_i | y_j) = P(x_i) and P(y_j | x_i) = P(y_j) for all i and j, then:
I(x_i, y_j) = log_2 1 = 0 for all i & j, hence I(X, Y) = 0

10.1 Joint entropy:

In information theory, joint entropy is a measure of the uncertainty associated with a


set of variables.
H(X, Y) = H(XY) = −Σ_{j=1}^{m} Σ_{i=1}^{n} P(x_i, y_j) log_2 P(x_i, y_j)  bits/symbol

10.2 Conditional entropy:

In information theory, the conditional entropy quantifies the amount of information


needed to describe the outcome of a random variable Y given that the value of another
random variable X is known.

H(Y ∣ X) = −Σ_{j=1}^{m} Σ_{i=1}^{n} P(x_i, y_j) log_2 P(y_j ∣ x_i)  bits/symbol

10.3 Marginal Entropies:

Marginal entropy is a term usually used to denote both the source entropy H(X), defined as before, and the receiver entropy H(Y), given by:

H(Y) = −Σ_{j=1}^{m} P(y_j) log_2 P(y_j)  bits/symbol

10.4 Transinformation (average mutual information):


It is the statistical average of all pairs I(x_i, y_j), i = 1, 2, …, n, j = 1, 2, …, m. This is denoted by I(X, Y) and is given by:
I(X, Y) = Σ_{i=1}^{n} Σ_{j=1}^{m} I(x_i, y_j) P(x_i, y_j)
I(X, Y) = Σ_{i=1}^{n} Σ_{j=1}^{m} P(x_i, y_j) log_2(P(y_j ∣ x_i) / P(y_j))  bits/symbol
or
I(X, Y) = Σ_{i=1}^{n} Σ_{j=1}^{m} P(x_i, y_j) log_2(P(x_i ∣ y_j) / P(x_i))  bits/symbol
10.5 Relationship between joint, conditional and transinformation:

𝐻( 𝑌 ∣ 𝑋 ) = 𝐻(𝑋, 𝑌) − 𝐻(𝑋)
𝐻( 𝑋 ∣ 𝑌 ) = 𝐻(𝑋, 𝑌) − 𝐻(𝑌)
where H(X ∣ Y) is the losses (equivocation) entropy.
Also we have:
𝐼(𝑋, 𝑌) = 𝐻(𝑋) − 𝐻(𝑋 ∣ 𝑌)
𝐼(𝑋, 𝑌) = 𝐻(𝑌) − 𝐻(𝑌 ∣ 𝑋)



Example:
The joint probability matrix of a system is given by (rows x_1, x_2, x_3; columns y_1, y_2):
P(X, Y) = [ 0.5     0.25
            0       0.125
            0.0625  0.0625 ]
Find:
1- Marginal entropies.
2- Joint entropy
3- Conditional entropies.
4- The transinformation.
1- P(X) = [0.75  0.125  0.125] (for x_1, x_2, x_3),  P(Y) = [0.5625  0.4375] (for y_1, y_2)
H(X) = −[0.75 ln 0.75 + 2 × 0.125 ln 0.125] / ln 2 = 1.06127 bits/symbol
H(Y) = −[0.5625 ln 0.5625 + 0.4375 ln 0.4375] / ln 2 = 0.9887 bits/symbol

2- H(X, Y) = −Σ_{j=1}^{m} Σ_{i=1}^{n} P(x_i, y_j) log_2 P(x_i, y_j)
H(X, Y) = −[0.5 ln 0.5 + 0.25 ln 0.25 + 0.125 ln 0.125 + 2 × 0.0625 ln 0.0625] / ln 2 = 1.875 bits/symbol

3- H(Y ∣ X) = H(X, Y) − H(X) = 1.875 − 1.06127 = 0.8137 bits/symbol
H(X ∣ Y) = H(X, Y) − H(Y) = 1.875 − 0.9887 = 0.8863 bits/symbol

4- I(X, Y) = H(X) − H(X ∣ Y) = 1.06127 − 0.8863 = 0.17497 bits/symbol.
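For completeness, the following Python sketch (an illustrative addition, not part of the original notes) recomputes all of the quantities of this example directly from the joint probability matrix:

```python
import math

# Joint probability matrix P(X, Y): rows x1..x3, columns y1, y2
Pxy = [[0.5,    0.25],
       [0.0,    0.125],
       [0.0625, 0.0625]]

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

Px = [sum(row) for row in Pxy]                # marginal P(X)
Py = [sum(col) for col in zip(*Pxy)]          # marginal P(Y)
Hxy = H([p for row in Pxy for p in row])      # joint entropy H(X, Y)

print(H(Px), H(Py))                           # ~1.0613, ~0.9887
print(Hxy)                                    # 1.875
print(Hxy - H(Px))                            # H(Y|X) ~ 0.814
print(Hxy - H(Py))                            # H(X|Y) ~ 0.886
print(H(Px) - (Hxy - H(Py)))                  # I(X, Y) ~ 0.175
```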



Chapter Two

1- Channel:
In telecommunications and computer networking, a communication channel
or channel, refers either to a physical transmission medium such as a wire, or to
a logical connection over a multiplexed medium such as a radio channel. A channel is
used to convey an information signal, for example a digital bit stream, from one or
several senders (or transmitters) to one or several receivers. A channel has a certain
capacity for transmitting information, often measured by its bandwidth in Hz or its data
rate in bits per second.
2- Symmetric channel:

The symmetric channel has the following conditions:

a- Equal number of symbols in X and Y, i.e. P(Y∣X) is a square matrix.

b- Any row in the P(Y∣X) matrix is a permutation of the other rows.

For example, the following conditional probabilities illustrate the various channel types:

a- P(Y ∣ X) = [0.9  0.1; 0.1  0.9] is a BSC, because it is a square matrix and the 1st row is a permutation of the 2nd row.

b- P(Y ∣ X) = [0.9  0.05  0.05; 0.05  0.9  0.05; 0.05  0.05  0.9] is a TSC, because it is a square matrix and each row is a permutation of the others.

c- P(Y ∣ X) = [0.8  0.1  0.1; 0.1  0.8  0.1] is non-symmetric since it is not square, although each row is a permutation of the others.

d- P(Y ∣ X) = [0.8  0.1  0.1; 0.1  0.7  0.2; 0.1  0.1  0.8] is non-symmetric although it is square, since the 2nd row is not a permutation of the other rows.


2.1- Binary symmetric channel (BSC)
It is a common communications channel model used in coding theory and information
theory. In this model, a transmitter wishes to send a bit (a zero or a one), and the receiver
receives a bit. It is assumed that the bit is usually transmitted correctly, but that it will
be "flipped" with a small probability (the "crossover probability").
[BSC transition diagram: 0→0 and 1→1 with probability 1−P; 0→1 and 1→0 with probability P]

A binary symmetric channel with crossover probability p denoted by BSCp, is a


channel with binary input and binary output and probability of error p; that is, if X is the
transmitted random variable and Y the received variable, then the channel is
characterized by the conditional probabilities:

Pr( 𝑌 = 0 ∣ 𝑋 = 0 ) = 1 − 𝑃

Pr( 𝑌 = 0 ∣ 𝑋 = 1 ) = 𝑃

Pr( 𝑌 = 1 ∣ 𝑋 = 0 ) = 𝑃

Pr( 𝑌 = 1 ∣ 𝑋 = 1 ) = 1 − 𝑃

2.2- Ternary symmetric channel (TSC):

The transitional probability of TSC is:

P(Y ∣ X) =
         y1        y2        y3
x1 [ 1 − 2Pe      Pe        Pe
x2      Pe      1 − 2Pe     Pe
x3      Pe        Pe      1 − 2Pe ]

The TSC is symmetric but not very practical since practically 𝑥1 and 𝑥3 are not affected
so much as 𝑥2 . In fact the interference between 𝑥1 and 𝑥3 is much less than the
interference between 𝑥1 and 𝑥2 or 𝑥2 and 𝑥3 .

[TSC transition diagram: each x_i goes to y_i with probability 1 − 2Pe and to each of the other two outputs with probability Pe]

Hence the more practical but nonsymmetric channel has the transition probability:
P(Y ∣ X) =
         y1        y2       y3
x1 [ 1 − Pe       Pe        0
x2      Pe      1 − 2Pe     Pe
x3      0         Pe      1 − Pe ]
where x1 interferes with x2 exactly as x2 interferes with x3, but x1 and x3 do not interfere.

[Transition diagram: x1→y1 and x3→y3 with probability 1−Pe, x2→y2 with probability 1−2Pe, and adjacent-symbol transitions with probability Pe]

3- Discrete Memoryless Channel:

The Discrete Memoryless Channel (DMC) has an input X and an output Y. At any given time t, the channel output Y = y depends only on the input X = x at that time t; it does not depend on the past history of the input. A DMC is represented by the conditional probability of the output Y = y given the input X = x, i.e. P(Y∣X).

[X → Channel P(Y∣X) → Y]


4- Binary Erasure Channel (BEC):

The Binary Erasure Channel (BEC) model is widely used to represent channels or links that "lose" data. Prime examples of such channels are Internet links and routes. A BEC channel has a binary input X and a ternary output Y.

[BEC transition diagram: x1→y1 and x2→y2 with probability 1−P; x1→erasure and x2→erasure with probability P]

Note that for the BEC, the probability of “bit error” is zero. In other words, the
following conditional probabilities hold for any BEC model:

Pr( 𝑌 = "𝑒𝑟𝑎𝑠𝑢𝑟𝑒" ∣ 𝑋 = 0 ) = 𝑃

Pr( 𝑌 = "𝑒𝑟𝑎𝑠𝑢𝑟𝑒" ∣ 𝑋 = 1 ) = 𝑃

Pr( 𝑌 = 0 ∣ 𝑋 = 0 ) = 1 − 𝑃

Pr( 𝑌 = 1 ∣ 𝑋 = 1 ) = 1 − 𝑃

Pr( 𝑌 = 0 ∣ 𝑋 = 1 ) = 0

Pr( 𝑌 = 1 ∣ 𝑋 = 0 ) = 0

5- Special Channels:

a- Lossless channel: It has only one nonzero element in each column of the
transitional matrix P(Y∣X).
P(Y ∣ X) =
       y1    y2    y3    y4    y5
x1 [ 3/4   1/4    0     0     0
x2    0     0    1/3   2/3    0
x3    0     0     0     0     1 ]


This channel has H(X∣Y)=0 and I(X, Y)=H(X) with zero losses entropy.
b- Deterministic channel: It has only one nonzero element in each row, the
transitional matrix P(Y∣X), as an example:
P(Y ∣ X) =
      y1   y2   y3
x1 [  1    0    0
x2    1    0    0
x3    0    0    1
x4    0    1    0
x5    0    1    0  ]
This channel has H(Y∣X)=0 and I(Y, X)=H(Y) with zero noisy entropy.
c- Ideal channel: It has only one nonzero element in each row and column, the
transitional matrix P(Y∣X), i.e. it is an identity matrix, as an example:
P(Y ∣ X) =
      y1   y2   y3
x1 [  1    0    0
x2    0    1    0
x3    0    0    1  ]
This channel has H(Y∣X)= H(X∣Y)=0 and I(Y, X)=H(Y)=H(X).
d- Noisy channel: No relation between input and output:
H(X ∣ Y) = H(X),  H(Y ∣ X) = H(Y)
𝐼(𝑋, 𝑌) = 0, 𝐶 = 0
𝐻(𝑋, 𝑌) = 𝐻(𝑋) + 𝐻(𝑌)

6- Shannon’s theorem:

a- A given communication system has a maximum rate of information C known as


the channel capacity.
b- If the information rate R is less than C, then one can approach arbitrarily small
error probabilities by using intelligent coding techniques.
c- To get lower error probabilities, the encoder has to work on longer blocks of
signal data. This entails longer delays and higher computational requirements.



Thus, if R ≤ C then transmission may be accomplished without error in the presence
of noise. The negation of this theorem is also true: if R > C, then errors cannot be
avoided regardless of the coding technique used.

Consider a bandlimited Gaussian channel operating in the presence of additive


Gaussian noise:

The Shannon-Hartley theorem states that the channel capacity is given by:

C = B log_2(1 + S/N)

Where C is the capacity in bits per second, B is the bandwidth of the channel in Hertz,
and S/N is the signal-to-noise ratio.

7- Channel Capacity (Discrete channel)

This is defined as the maximum of I(X,Y):

𝐶 = 𝑐ℎ𝑎𝑛𝑛𝑒𝑙 𝑐𝑎𝑝𝑎𝑐𝑖𝑡𝑦 = max[𝐼(𝑋, 𝑌)] 𝑏𝑖𝑡𝑠/𝑠𝑦𝑚𝑏𝑜𝑙

Physically it is the maximum amount of information each symbol can carry to the
receiver. Sometimes this capacity is also expressed in bits/sec if related to the rate of
producing symbols r:

𝑅(𝑋, 𝑌) = 𝑟 × 𝐼(𝑋, 𝑌) 𝑏𝑖𝑡𝑠/ sec 𝑜𝑟 𝑅(𝑋, 𝑌) = 𝐼(𝑋, 𝑌)/ 𝜏̅

a- Channel capacity of Symmetric channels:


The channel capacity is defined as max[𝐼(𝑋, 𝑌)]:
𝐼(𝑋, 𝑌) = 𝐻(𝑌) − 𝐻( 𝑌 ∣ 𝑋 )

I(X, Y) = H(Y) + Σ_{j=1}^{m} Σ_{i=1}^{n} P(x_i, y_j) log_2 P(y_j ∣ x_i)

But we have P(x_i, y_j) = P(x_i) P(y_j ∣ x_i); substituting in the above equation yields:

I(X, Y) = H(Y) + Σ_{j=1}^{m} Σ_{i=1}^{n} P(x_i) P(y_j ∣ x_i) log_2 P(y_j ∣ x_i)

If the channel is symmetric, the quantity

Σ_{j=1}^{m} P(y_j ∣ x_i) log_2 P(y_j ∣ x_i) = K

where K is a constant independent of the row number i, so that the equation becomes:

I(X, Y) = H(Y) + K Σ_{i=1}^{n} P(x_i)

Hence 𝐼(𝑋, 𝑌) = 𝐻(𝑌) + 𝐾 for symmetric channels


Max of 𝐼(𝑋, 𝑌) = max[𝐻(𝑌) + 𝐾] = max[𝐻(𝑌)] + 𝐾
When Y has equiprobable symbols then max[𝐻(𝑌)] = 𝑙𝑜𝑔2 𝑚
Then
𝐼(𝑋, 𝑌) = 𝑙𝑜𝑔2 𝑚 + 𝐾
Or
𝐶 = 𝑙𝑜𝑔2 𝑚 + 𝐾
Example 9:
For the BSC shown:
[BSC diagram: x1→y1 and x2→y2 with probability 0.7, crossovers with probability 0.3]


Find the channel capacity and the efficiency if I(x_1) = 2 bits.
Solution:
P(Y ∣ X) = [0.7  0.3; 0.3  0.7]
Since the channel is symmetric, C = log_2 m + K and n = m,
where n and m are the numbers of rows and columns respectively.
K = 0.7 log_2 0.7 + 0.3 log_2 0.3 = −0.88129
C = 1 − 0.88129 = 0.1187 bits/symbol
The channel efficiency is η = I(X, Y) / C.
I(x_1) = −log_2 P(x_1) = 2
P(x_1) = 2⁻² = 0.25, so P(X) = [0.25  0.75]ᵀ
And we have P(x_i, y_j) = P(x_i) P(y_j ∣ x_i), so that
P(X, Y) = [0.7 × 0.25   0.3 × 0.25; 0.3 × 0.75   0.7 × 0.75] = [0.175  0.075; 0.225  0.525]
P(Y) = [0.4  0.6] → H(Y) = 0.97095 bits/symbol
I(X, Y) = H(Y) + K = 0.97095 − 0.88129 = 0.0896 bits/symbol
Then η = 0.0896 / 0.1187 = 75.5%

To find the channel redundancy:
R = 1 − η = 1 − 0.755 = 0.245, or 24.5%
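The numbers of Example 9 can be reproduced with the short Python sketch below (an added illustration; it uses the symmetric-channel result C = log_2 m + K derived above):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

PyGx = [[0.7, 0.3],                 # transition matrix P(Y|X) of the BSC
        [0.3, 0.7]]
Px = [0.25, 0.75]                   # from I(x1) = 2 bits -> P(x1) = 2**-2

K = sum(p * math.log2(p) for p in PyGx[0])     # same for every row (symmetric)
C = math.log2(len(PyGx[0])) + K                # channel capacity ~0.1187

Pxy = [[Px[i] * PyGx[i][j] for j in range(2)] for i in range(2)]
Py = [sum(col) for col in zip(*Pxy)]           # [0.4, 0.6]
I = H(Py) + K                                  # transinformation ~0.0896

print(C, I, I / C)                             # efficiency ~0.755
```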
b- Channel capacity of nonsymmetric channels:
We can find the channel capacity of a nonsymmetric channel by the following steps:
a- Find I(X, Y) as a function of input prob:
𝐼(𝑋, 𝑌) = 𝑓(𝑃(𝑥1 ), 𝑃(𝑥2 ) … … … , 𝑃(𝑥𝑛 ))
And use the constraint to reduce the number of variable by 1.
b- Partial differentiate I(X, Y) with respect to (n-1) input prob., then equate
these partial derivatives to zero.



c- Solve the (n-1) equations simultaneously then find
𝑃(𝑥1 ), 𝑃(𝑥2 ) … . . . . , 𝑃(𝑥𝑛 ) that gives maximum I(X, Y).
d- Put resulted values of input prob. in the function given in step 1 to find 𝐶 =
max[𝐼(𝑋, 𝑌)].

Example 10:

Find the channel capacity for the channel having the following transition:

P(Y ∣ X) = [0.7  0.3; 0.1  0.9]

Assume that 𝑃(𝑥1 ) = 𝑃 ≅ 0.862

Solution: First note that the channel is not symmetric, since the 1st row is not a permutation of the 2nd row.

a- Let 𝑃(𝑥1 ) = 0.862 , 𝑡ℎ𝑒𝑛 𝑃(𝑥2 ) = 1 − 𝑃 = 0.138.

𝑃(𝑋, 𝑌) = 𝑃(𝑋) × 𝑃( 𝑌 ∣ 𝑋 )

∴ P(X, Y) = [0.7 × 0.862   0.3 × 0.862; 0.1 × (1 − 0.862)   0.9 × (1 − 0.862)] = [0.6034  0.2586; 0.0138  0.1242]

From the above results, P(Y) = [0.6172  0.3828]

H(Y) = −Σ_{j=1}^{m} P(y_j) log_2 P(y_j)

∴ 𝐻(𝑌) = 0.960 𝑏𝑖𝑡𝑠/𝑠𝑦𝑚𝑏𝑜𝑙

We have
H(Y ∣ X) = −Σ_{j=1}^{m} Σ_{i=1}^{n} P(x_i, y_j) log_2 P(y_j ∣ x_i)
H(Y ∣ X) = −[0.6034 ln 0.7 + 0.2586 ln 0.3 + 0.0138 ln 0.1 + 0.1242 ln 0.9] / ln 2
H(Y ∣ X) = 0.8244 bits/symbol

C = max[I(X, Y)] = H(Y) − H(Y ∣ X) = 0.960 − 0.8244 = 0.1358 bits/symbol.

Review questions:
A binary source sends x_1 with a probability of 0.4 and x_2 with a probability of 0.6 through a channel with error probabilities of 0.1 for x_1 and 0.2 for x_2. Determine:
1- Source entropy.
2- Marginal entropy.
3- Joint entropy.
4- Conditional entropy H(Y∣X).
5- Losses entropy H(X∣Y).
6- Transinformation.
Solution:
1- The channel diagram:
[x1 (P = 0.4) → y1 with 0.9 and → y2 with 0.1;  x2 (P = 0.6) → y2 with 0.8 and → y1 with 0.2]
Or P(Y∣X) = [0.9  0.1; 0.2  0.8]
H(X) = −Σ_{i=1}^{n} p(x_i) log_2 p(x_i)
H(X) = −[0.4 ln(0.4) + 0.6 ln(0.6)] / ln 2 = 0.971 bits/symbol
2- P(X, Y) = P(Y∣X) × P(X)
∴ P(X, Y) = [0.9 × 0.4   0.1 × 0.4; 0.2 × 0.6   0.8 × 0.6] = [0.36  0.04; 0.12  0.48]
∴ P(Y) = [0.48  0.52]

H(Y) = −Σ_{j=1}^{m} p(y_j) log_2 p(y_j)
H(Y) = −[0.48 ln(0.48) + 0.52 ln(0.52)] / ln 2 = 0.999 bits/symbol

3- H(X, Y) = −Σ_{j=1}^{m} Σ_{i=1}^{n} P(x_i, y_j) log_2 P(x_i, y_j)
H(X, Y) = −[0.36 ln(0.36) + 0.04 ln(0.04) + 0.12 ln(0.12) + 0.48 ln(0.48)] / ln 2 = 1.592 bits/symbol
4- H(Y∣X) = −Σ_{j=1}^{m} Σ_{i=1}^{n} P(x_i, y_j) log_2 P(y_j ∣ x_i)
H(Y∣X) = −[0.36 ln(0.9) + 0.12 ln(0.2) + 0.04 ln(0.1) + 0.48 ln(0.8)] / ln 2 = 0.621 bits/symbol
Or H(Y∣X) = H(X, Y) − H(X) = 1.592 − 0.971 = 0.621 bits/symbol
5- H(X∣Y) = H(X, Y) − H(Y) = 1.592 − 0.999 = 0.593 bits/symbol
6- I(X, Y) = H(X) − H(X∣Y) = 0.971 − 0.593 = 0.378 bits/symbol

2- Cascading of Channels
If two channels are cascaded, then the overall transition matrix is the product of the two
transition matrices.
P(z/x) = P(y/x) · P(z/y)
(n × k)   (n × m)   (m × k)
matrix    matrix    matrix

[Diagram: source symbols 1…n → Channel 1 → symbols 1…m → Channel 2 → symbols 1…k]


For the series (cascaded) information channel, the overall mutual information cannot exceed that of any individual channel:
I(X, Z) ≤ I(X, Y)  and  I(X, Z) ≤ I(Y, Z)

Example:
Find the transition matrix p( z / x) for the cascaded channel shown.
[Cascade diagram: a 2-input/3-output channel followed by a 3-input/2-output channel]

p(y/x) = [0.8  0.2  0; 0.3  0  0.7],   p(z/y) = [0.7  0.3; 1  0; 1  0]

p(z/x) = p(y/x) · p(z/y) = [0.76  0.24; 0.91  0.09]

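The matrix product of the cascading example can be checked with this small sketch (an added illustration using plain Python lists rather than a matrix library):

```python
# P(z|x) = P(y|x) . P(z|y) for two cascaded channels
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

PyGx = [[0.8, 0.2, 0.0],        # channel 1: 2 inputs, 3 outputs
        [0.3, 0.0, 0.7]]
PzGy = [[0.7, 0.3],             # channel 2: 3 inputs, 2 outputs
        [1.0, 0.0],
        [1.0, 0.0]]

print(matmul(PyGx, PzGy))       # approximately [[0.76, 0.24], [0.91, 0.09]]
```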


Chapter Three
Source Coding

1- Sampling theorem:

Sampling of signals is a fundamental operation in digital communication. A continuous-time signal is first converted to a discrete-time signal by the sampling process. It should also be possible to recover, or reconstruct, the signal completely from its samples.

The sampling theorem states that:

i- A band limited signal of finite energy, which has no frequency components higher
than W Hz, is completely described by specifying the values of the signal at instant
of time separated by 1/2W second and
ii- A band limited signal of finite energy, which has no frequency components higher
than W Hz, may be completely recovered from the knowledge of its samples taken
at the rate of 2W samples per second.

When the sampling rate is chosen as f_s = 2f_m, each spectral replicate is separated from each of its neighbors by a frequency band exactly equal to f_s hertz, and the analog waveform can theoretically be completely recovered from the samples by the use of filtering. It should be clear that if f_s > 2f_m, the replicates move farther apart in frequency, making it easier to perform the filtering operation.

When the sampling rate is reduced, such that f_s < 2f_m, the replicates overlap, as shown in the figure below, and some information is lost. This phenomenon is called aliasing.



Sampled spectrum 𝑓𝑠 > 2𝑓𝑚

Sampled spectrum 𝑓𝑠 < 2𝑓𝑚

A bandlimited signal having no spectral components above f_m hertz can be determined uniquely by values sampled at uniform intervals of T_s ≤ 1/(2f_m) sec.
The sampling rate is f_s = 1/T_s, so that f_s ≥ 2f_m. The sampling rate f_s = 2f_m is called the Nyquist rate.

Example: Find the Nyquist rate and Nyquist interval for the following signals.

i- m(t) = sin(500πt) / (πt)
ii- m(t) = (1/2π) cos(4000πt) cos(1000πt)

Solution:

i- ωt = 500πt,  ∴ 2πf = 500π → f = 250 Hz
Nyquist interval = 1/(2f_max) = 1/(2 × 250) = 2 msec.
Nyquist rate = 2f_max = 2 × 250 = 500 Hz

ii- m(t) = (1/2π)[(1/2){cos(4000πt − 1000πt) + cos(4000πt + 1000πt)}]
        = (1/4π){cos(3000πt) + cos(5000πt)}
Then the highest frequency is 2500 Hz.
Nyquist interval = 1/(2f_max) = 1/(2 × 2500) = 0.2 msec.
Nyquist rate = 2f_max = 2 × 2500 = 5000 Hz

H. W:

Find the Nyquist interval and Nyquist rate for the following:
1
i- cos(400𝜋𝑡) . cos(200𝜋𝑡)
2𝜋
1
ii- 𝑠𝑖𝑛𝜋𝑡
𝜋

Example:

A waveform [20 + 20 sin(500t + 30°)] is to be sampled periodically and reproduced from these sample values. Find the maximum allowable time interval between sample values. How many sample values need to be stored in order to reproduce 1 sec of this waveform?
Solution:
x(t) = 20 + 20 sin(500t + 30°)
ω = 500 → 2πf = 500 → f = 79.58 Hz
The minimum sampling rate will be twice the signal frequency:
f_s(min) = 2 × 79.58 = 159.15 Hz
T_s(max) = 1 / f_s(min) = 1 / 159.15 = 6.283 msec.
Number of samples in 1 sec = 1 / 6.283 msec = 159.16 ≈ 160 samples

2- Source coding:
An important problem in communications is the efficient representation of data
generated by a discrete source. The process by which this representation is
accomplished is called source encoding. An efficient source encoder must satisfy two functional requirements:

i- The code words produced by the encoder are in binary form.


ii- The source code is uniquely decodable, so that the original source sequence can
be reconstructed perfectly from the encoded binary sequence.

The entropy for a source with statistically independent symbols is:

H(Y) = −Σ_{j=1}^{m} P(y_j) log_2 P(y_j)

We have:
max[H(Y)] = log_2 m
A code efficiency can therefore be defined as:
η = H(Y) / max[H(Y)] × 100
The overall code length, L, can be defined as the average code word length:
L = Σ_{j=1}^{m} P(x_j) l_j   bits/symbol

The code efficiency can then be found from:
η = H(Y) / L × 100
Note that max[H(Y)] in bits/symbol corresponds to L in bits/codeword.

i- Fixed- Length Code Words:


If the alphabet X consists of the 7 symbols {a, b, c, d, e, f, g}, then the following
fixed-length code of block length L = 3 could be used.
C(a) = 000
C(b) = 001
C(c) = 010
C(d) = 011
C(e) = 100
C(f) = 101
C(g) = 110.
The encoded output contains L bits per source symbol. For the above example
the source sequence bad... would be encoded into 001000011... . Note that the
output bits are simply run together (or, more technically, concatenated). This
method is nonprobabilistic; it takes no account of whether some symbols occur
more frequently than others, and it works robustly regardless of the symbol
frequencies.
This is used when the source produces almost equiprobable messages, p(x_1) ≈ p(x_2) ≈ p(x_3) ≈ ... ≈ p(x_n); then l_1 = l_2 = l_3 = ... = l_n = L_C, and for binary coding:
1- L_C = log_2 n bits/message if n = 2^r (n = 2, 4, 8, 16, …, with r an integer), which gives η = 100%.
2- L_C = Int[log_2 n] + 1 bits/message if n ≠ 2^r, which gives less efficiency.

Example
For ten equiprobable messages coded in a fixed-length code:
p(x_i) = 1/10 and L_C = Int[log_2 10] + 1 = 4 bits
η = H(X)/L_C × 100% = (log_2 10)/4 × 100% = 83.048%
Example: For eight equiprobable messages coded in a fixed-length code:
p(x_i) = 1/8 and L_C = log_2 8 = 3 bits, and η = 3/3 × 100% = 100%
Example: Find the efficiency of a fixed length code used to encode messages obtained
from throwing a fair die (a) once, (b) twice, (c) 3 times.
Solution
a- For a fair die, the messages obtained from it are equiprobable with a probability of p(x_i) = 1/6 and n = 6.
L_C = Int[log_2 6] + 1 = 3 bits/message
η = H(X)/L_C × 100% = (log_2 6)/3 × 100% = 86.165%

b- For two throws, the possible messages are n = 6 × 6 = 36 messages with equal probabilities.
L_C = Int[log_2 36] + 1 = 6 bits/message = 6 bits/2-symbols
while H(X) = log_2 6 bits/symbol, so η = 2H(X)/L_C × 100% = 86.165%

c- For three throws, the possible messages are n = 6 × 6 × 6 = 216 with equal probabilities.
L_C = Int[log_2 216] + 1 = 8 bits/message = 8 bits/3-symbols
while H(X) = log_2 6 bits/symbol, so η = 3H(X)/L_C × 100% = 96.936%
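The three die-throw cases (and the earlier ten-message example) can be checked with the sketch below, an added illustration of the fixed-length coding rule for equiprobable messages:

```python
import math

def fixed_length_efficiency(n):
    """Efficiency of a fixed-length binary code for n equiprobable messages."""
    Lc = math.ceil(math.log2(n))      # equals Int[log2 n] + 1 whenever n != 2**r
    return math.log2(n) / Lc * 100    # H = log2 n for equiprobable messages

print(fixed_length_efficiency(6))      # one throw    -> ~86.2 %
print(fixed_length_efficiency(6**2))   # two throws   -> ~86.2 %
print(fixed_length_efficiency(6**3))   # three throws -> ~96.9 %
print(fixed_length_efficiency(10))     # ten messages -> ~83.0 %
```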



ii- Variable-Length Code Words
When the source symbols are not equally probable, a more efficient encoding method
is to use variable-length code words. For example, a variable-length code for the
alphabet X = {a, b, c} and its lengths might be given by

C(a)= 0 l(a)=1
C(b)= 10 l(b)=2
C(c)= 11 l(c)=2
The major property that is usually required from any variable-length code is unique decodability. For example, the above code C for the alphabet X = {a, b, c} is easily shown to be uniquely decodable. However, a code such as {C(a) = 0, C(b) = 1, C(c) = 01} is not uniquely decodable, even though the codewords are all different: if the source decoder observes 01, it cannot determine whether the source emitted (a b) or (c).

Prefix-free codes: A prefix code is a type of code system (typically a variable-


length code) distinguished by its possession of the "prefix property", which requires
that there is no code word in the system that is a prefix (initial segment) of any other
code word in the system. For example:

{𝑎 = 0, 𝑏 = 110, 𝑐 = 10, 𝑑 = 111}⁡𝑖𝑠⁡𝑎⁡𝑝𝑟𝑒𝑓𝑖𝑥⁡𝑐𝑜𝑑𝑒.

When message probabilities are not equal, then we use variable length codes. The
following properties need to be considered when attempting to use variable length
codes:

1) Unique decoding:
Example
Consider a 4-symbol alphabet with symbols represented by binary digits as follows:
A → 0
B → 01
C → 11
D → 00
If we receive the code word 0011, it is not known whether the transmission was DC or AAC. This code is therefore not uniquely decodable.
2) Instantaneous decoding:
Example
Consider a 4-symbol alphabet with symbols represented by binary digits as follows:
A → 0
B → 10
C → 110
D → 111
This code can be instantaneously decoded since no complete codeword is a prefix of a
larger codeword. This is in contrast to the previous example where A is a prefix of both
B and D . This example is also a ‘comma code’ as the symbol zero indicates the end
of a codeword except for the all ones word whose length is known.

Example
Consider a 4-symbol alphabet with symbols represented by binary digits as follows:
A → 0
B → 01
C → 011
D → 111
The code is identical to the previous example but the bits are time reversed. It is still
uniquely decodable but no longer instantaneous, since early codewords are now prefixes
of later ones.

Shannon Code
For messages x1 , x2 , x3 ,… xn with probabilities p( x1 ) , p ( x2 ) , p( x3 ) ,… p( xn ) then:

1) l_i = −log_2 p(x_i)           if p(x_i) = (1/2)^r, i.e. p(x_i) ∈ {1/2, 1/4, 1/8, …}
2) l_i = Int[−log_2 p(x_i)] + 1   if p(x_i) ≠ (1/2)^r

Also define F_i = Σ_{k=1}^{i−1} p(x_k), with F_1 = 0;

then the codeword of x_i is the binary equivalent of F_i consisting of l_i bits:

C_i ≈ F_i × 2^{l_i}

where Ci is the binary equivalent of Fi up to li bits. In encoding, messages must be


arranged in a decreasing order of probabilities.

Example
Develop the Shannon code for the following set of messages,
p(x) = [0.3  0.2  0.15  0.12  0.1  0.08  0.05]

then find:
(a) Code efficiency,
(b) p(0) at the encoder output.
Solution
x_i      p(x_i)    l_i    F_i     C_i      number of 0's
x1 0.3 2 0 00 2

x2 0.2 3 0.3 010 2

x3 0.15 3 0.5 100 2

x4 0.12 4 0.65 1010 2

x5 0.10 4 0.77 1100 2

x6 0.08 4 0.87 1101 1

x7 0.05 5 0.95 11110 1

[Working: binary expansions of each F_i to l_i bits, giving the codewords C_i above]

(a) To find the code efficiency, we have
L_C = Σ_{i=1}^{7} l_i p(x_i) = 3.1 bits/message
H(X) = −Σ_{i=1}^{7} p(x_i) log_2 p(x_i) = 2.6029 bits/message
η = H(X)/L_C × 100% = 83.965%

(b) p(0) at the encoder output is
p(0) = [Σ_{i=1}^{7} 0_i p(x_i)] / L_C = (0.6 + 0.4 + 0.3 + 0.24 + 0.2 + 0.08 + 0.05) / 3.1
p(0) = 0.603
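A compact Python sketch of the binary Shannon coding procedure is given below (an added illustration, not part of the original notes; it assumes the message probabilities are already sorted in decreasing order, as required):

```python
import math

def shannon_code(probs):
    """Binary Shannon code for probabilities sorted in decreasing order."""
    F, codes = 0.0, []
    for p in probs:
        l = math.ceil(-math.log2(p))                      # code length l_i
        codes.append(format(int(F * 2 ** l), f"0{l}b"))   # F_i in binary, l_i bits
        F += p                                            # F_{i+1} = F_i + p(x_i)
    return codes

p = [0.3, 0.2, 0.15, 0.12, 0.1, 0.08, 0.05]
codes = shannon_code(p)
print(codes)        # ['00', '010', '100', '1010', '1100', '1101', '11110']
Lc = sum(pi * len(c) for pi, c in zip(p, codes))
H = -sum(pi * math.log2(pi) for pi in p)
print(Lc, H, H / Lc * 100)                                # 3.1, ~2.603, ~84 %
```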
Example
Repeat the previous example using ternary coding.
Solution
1) l_i = −log_3 p(x_i)           if p(x_i) = (1/3)^r, i.e. p(x_i) ∈ {1/3, 1/9, 1/27, …}
2) l_i = Int[−log_3 p(x_i)] + 1   if p(x_i) ≠ (1/3)^r,  and  C_i ≈ F_i × 3^{l_i}

x_i      p(x_i)    l_i    F_i     C_i      number of 0's
x1 0.3 2 0 00 2

x2 0.2 2 0.3 02 1

x3 0.15 2 0.5 11 0

x4 0.12 2 0.65 12 0

x5 0.10 3 0.77 202 1

x6 0.08 3 0.87 212 0

x7 0.05 3 0.95 221 0

[Working: ternary expansions of each F_i to l_i digits, giving the codewords C_i above]

(a) To find the code efficiency, we have
L_C = Σ_{i=1}^{7} l_i p(x_i) = 2.23 ternary units/message
H(X) = −Σ_{i=1}^{7} p(x_i) log_3 p(x_i) = 1.642 ternary units/message
η = H(X)/L_C × 100% = 73.632%

(b) p(0) at the encoder output is
p(0) = [Σ_{i=1}^{7} 0_i p(x_i)] / L_C = (0.6 + 0.2 + 0.1) / 2.23
p(0) = 0.404
Shannon- Fano Code:

In Shannon–Fano coding, the symbols are arranged in order from most probable to
least probable, and then divided into two sets whose total probabilities are as close
as possible to being equal. All symbols then have the first digits of their codes
assigned; symbols in the first set receive "0" and symbols in the second set receive
"1". As long as any sets with more than one member remain, the same process is
repeated on those sets, to determine successive digits of their codes.

Example:

For the five symbols which have the following frequencies and probabilities, design a suitable Shannon–Fano binary code. Calculate the average code length, source entropy and efficiency.

Symbol    Count    Probability    Binary code    Length
A         15       0.385          00             2
B         7        0.1795         01             2
C         6        0.154          10             2
D         6        0.154          110            3
E         5        0.128          111            3

The average code word length:
L = Σ_{j=1}^{m} P(x_j) l_j
L = 2 × 0.385 + 2 × 0.1795 + 2 × 0.154 + 3 × 0.154 + 3 × 0.128 = 2.28 bits/symbol

The source entropy is:


H(Y) = −Σ_{j=1}^{m} P(y_j) log_2 P(y_j)
H(Y) = −[0.385 ln 0.385 + 0.1795 ln 0.1795 + 2 × 0.154 ln 0.154 + 0.128 ln 0.128] / ln 2
H(Y) = 2.18567 bits/symbol

The code efficiency:

η = H(Y)/L × 100 = 2.18567/2.28 × 100 = 95.86%
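The table above can also be generated programmatically; the following recursive Python sketch (an added illustration, not part of the original notes) splits the symbol set into halves of nearly equal total probability at each step:

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs, sorted by decreasing probability.
    Returns a dict mapping each symbol to its binary code string."""
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) < 2:
            return
        total = sum(p for _, p in group)
        best_k, best_diff = 1, float("inf")
        for k in range(1, len(group)):
            first_half = sum(p for _, p in group[:k])
            diff = abs(2 * first_half - total)   # how far from an equal split
            if diff < best_diff:
                best_diff, best_k = diff, k
        for s, _ in group[:best_k]:
            codes[s] += "0"                      # first set receives "0"
        for s, _ in group[best_k:]:
            codes[s] += "1"                      # second set receives "1"
        split(group[:best_k])
        split(group[best_k:])

    split(list(symbols))
    return codes

probs = [("A", 0.385), ("B", 0.1795), ("C", 0.154), ("D", 0.154), ("E", 0.128)]
print(shannon_fano(probs))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
```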

Example
Develop the Shannon - Fano code for the following set of messages,
p(x) = [0.35  0.2  0.15  0.12  0.1  0.08], then find the code efficiency.
Solution
xi p( xi ) Code li
x1 0.35 0 0 2

x2 0.2 0 1 2

x3 0.15 1 0 0 3

x4 0.12 1 0 1 3

x5 0.10 1 1 0 3

x6 0.08 1 1 1 3



L_C = Σ_{i=1}^{6} l_i p(x_i) = 2.45 bits/symbol
H(X) = −Σ_{i=1}^{6} p(x_i) log_2 p(x_i) = 2.396 bits/symbol
η = H(X)/L_C × 100% = 97.796%

Example
Repeat the previous example using r = 3 (ternary coding).
Solution

xi p( xi ) Code li

x1 0.35 0 1

x2 0.2 1 0 2

x3 0.15 1 1 2

x4 0.12 2 0 2

x5 0.10 2 1 2

x6 0.08 2 2 2

L_C = Σ_{i=1}^{6} l_i p(x_i) = 1.65 ternary units/symbol
H(X) = −Σ_{i=1}^{6} p(x_i) log_3 p(x_i) = 1.512 ternary units/symbol
η = H(X)/L_C × 100% = 91.636%


Huffman Code

The Huffman coding algorithm comprises two steps, reduction and splitting. These
steps can be summarized as follows:

1) Reduction
a) List the symbols in descending order of probability.

b) Reduce the r least probable symbols to one symbol with a probability


equal to their combined probability.

c) Reorder in descending order of probability at each stage.

d) Repeat the reduction step until only two symbols remain.

2) Splitting

a) Assign 0,1,...r to the r final symbols and work backwards.

b) Expand or lengthen the code to cope with each successive split.

Example: Design Huffman codes for 𝐴 = {𝑎1 , 𝑎2 , … … . 𝑎5 },⁡having the probabilities


{0.2, 0.4, 0.2, 0.1, 0.1}.



The average code word length:

𝐿 = 0.4 × 1 + 0.2 × 2 + 0.2 × 3 + 0.1 × 4 + 0.1 × 4 = 2.2⁡𝑏𝑖𝑡𝑠/𝑠𝑦𝑚𝑏𝑜𝑙

The source entropy:

𝐻(𝑌) = −[0.4𝑙𝑛0.4 + 2 × 0.2𝑙𝑛0.2 + 2 × 0.1𝑙𝑛0.1]/𝑙𝑛2 = 2.12193 bits/symbol

The code efficiency:

η = 2.12193/2.2 × 100 = 96.45%
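A heap-based Python sketch of the Huffman procedure is shown below (added for illustration, not part of the original notes); the particular 0/1 assignment may differ from the notes, but the average code word length is the same 2.2 bits/symbol:

```python
import heapq
import itertools

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> binary code."""
    tie = itertools.count()          # tie-breaker so dicts are never compared
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)      # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tie), merged))
    return heap[0][2]

p = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
codes = huffman(p)
print(codes)
print(sum(p[s] * len(c) for s, c in codes.items()))   # 2.2 bits/symbol
```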

It is also possible to design Huffman codes with minimum variance:

The average code word length is still 2.2 bits/symbol. But variances are different!

Example

Develop the Huffman code for the following set of symbols

Symbol A B C D E F G H

Probability 0.1 0.18 0.4 0.05 0.06 0.1 0.07 0.04



Solution (Huffman reduction: at each stage the two least probable entries are combined, the upper branch being labelled 0 and the lower branch 1):

C  0.40  0.40  0.40  0.40  0.40  0.40  0.60  1.0
B  0.18  0.18  0.18  0.19  0.23  0.37  0.40
A  0.10  0.10  0.13  0.18  0.19  0.23
F  0.10  0.10  0.10  0.13  0.18
G  0.07  0.09  0.10  0.10
E  0.06  0.07  0.09
D  0.05  0.06
H  0.04

So we obtain the following codes


Symbol A B C D E F G H
Probability 0.1 0.18 0.4 0.05 0.06 0.1 0.07 0.04
Codeword 011 001 1 00010 0101 0000 0100 00011

li 3 3 1 5 4 4 4 5
H(X) = −Σ_{i=1}^{8} p(x_i) log_2 p(x_i) = 2.552 bits/symbol
L_C = Σ_{i=1}^{8} l_i p(x_i) = 2.61 bits/symbol
η = H(X)/L_C × 100% = 97.778%

Data Compression:
In computer science and information theory, data compression, source coding, or bit-
rate reduction involves encoding information using fewer bits than the original
representation. Compression can be either lossy or lossless.

Lossless data compression algorithms usually exploit statistical redundancy to


represent data more concisely without losing information, so that the process is
reversible. Lossless compression is possible because most real-world data has statistical
redundancy. For example, an image may have areas of color that do not change over
several pixels.

Lossy data compression is the converse of lossless data compression. In these


schemes, some loss of information is acceptable. Dropping nonessential detail from the
data source can save storage space. There is a corresponding trade-off between
preserving information and reducing size.

Run-Length Encoding (RLE):


Run-Length Encoding is a very simple lossless data compression technique that
replaces runs of two or more of the same character with a number which represents the
length of the run, followed by the original character; single characters are coded as
runs of 1. RLE is useful for highly-redundant data, indexed images with many pixels
of the same color in a row.
Example:
Input: AAABBCCCCDEEEEEEAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAA
Output: 3A2B4C1D6E38A

The input to the RLE encoder (a run of repeated symbols) is of variable length while the output code word is a fixed (count, character) pair, unlike the Huffman code, where the input symbol is fixed while the output code word length varies.
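A minimal Python sketch of RLE (an added illustration, not part of the original notes) is shown below; it writes each run as a decimal count followed by the character, matching the example above:

```python
from itertools import groupby

def rle_encode(text):
    """Run-length encode: each run of a repeated character becomes '<count><char>'."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

def rle_decode(encoded):
    """Inverse operation; assumes a decimal count is written before each character."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

msg = "AAABBCCCCDEEEEEE" + "A" * 38
print(rle_encode(msg))                       # 3A2B4C1D6E38A
print(rle_decode(rle_encode(msg)) == msg)    # True
```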



Example : Consider these repeated pixels values in an image … 0 0 0 0 0 0 0 0 0 0 0 0
5 5 5 5 0 0 0 0 0 0 0 0 We could represent them more efficiently as (12, 0)(4, 5)(8, 0)
24 bytes reduced to 6 which gives a compression ratio of 24/6 = 4:1.
Example :Original Sequence (1 Row): 111122233333311112222 can be encoded as:
(4,1),(3,2),(6,3),(4,1),(4,2). 21 bytes reduced to 10 gives a compression ratio of 21/10 =
21:10.
Example : Original Sequence (1 Row): – HHHHHHHUFFFFFFFFFFFFFF can be
encoded as: (7,H),(1,U),(14,F) . 22 bytes reduced to 6 gives a compression ratio of 22/6
= 11:3 .
Savings Ratio : the savings ratio is related to the compression ratio and is a measure of
the amount of redundancy between two representations (compressed and
uncompressed). Let:
N1 = the total number of bytes required to store an uncompressed (raw) source image.
N2 = the total number of bytes required to store the compressed data.
The compression ratio Cr is then defined as:
Cr = N1 / N2
- Larger compression ratios indicate more effective compression.
- Smaller compression ratios indicate less effective compression.
- Compression ratios less than one indicate that the uncompressed representation has a high degree of irregularity.
The saving ratio Sr is then defined as:
Sr = (N1 − N2) / N1
- A higher saving ratio indicates more effective compression; negative ratios are possible and indicate that the compressed image has a larger memory size than the original.

Example: a 5 Megabyte image is compressed into a 1 Megabyte image, the savings


ratio is defined as (5-1)/5 or 4/5 or 80%.
This ratio indicates that 80% of the uncompressed data has been eliminated in the
compressed encoding.



Chapter Four
Channel coding

1- Error detection and correction codes:

The idea of error detection and/or correction is to add extra bits to the digital
message that enable the receiver to detect or correct errors with limited
capabilities. These extra bits are called parity bits. If we have k bits, r parity bits
are added, then the transmitted digits are:

𝑛 = 𝑟+𝑘

Here n is the code word length, and the code is denoted (n, k). The efficiency or code rate is equal to k/n.

Two basic approaches to error correction are available, which are:

a- Automatic-repeat-request (ARQ): Discard those frames in which errors are


detected.
- For frames in which no error was detected, the receiver returns a positive
acknowledgment to the sender.
- For the frame in which errors have been detected, the receiver returns
negative acknowledgement to the sender.
b- Forward error correction (FEC):

Ideally, FEC codes can be used to generate encoding symbols that are transmitted
in packets in such a way that each received packet is fully useful to a receiver to
reassemble the object regardless of previous packet reception patterns. The main applications of FEC are: Compact Disc (CD) players, digital audio and video, the Global System for Mobile communications (GSM), and mobile communications in general.


2- Basic definitions:
- Systematic and nonsystematic codes: If the information bits (the a's) are unchanged in their values and positions in the transmitted code word, then the code is said to be systematic (also called a block code), where:
Input data 𝐷 = [𝑎1 , 𝑎2 , 𝑎3 , … … … … 𝑎𝑘 ], The output systematic code word (n,
k) is:
𝐶 = [𝑎1 , 𝑎2 , 𝑎3 , … … … … 𝑎𝑘 , 𝑐1, 𝑐2, 𝑐3, … … … … 𝑐𝑟 ]
However if the data bits are spread or changed at the output code word then,
the code is said to be nonsystematic. The output of nonsystematic code word of
(n, k):
𝐶 = [𝑐2 , 𝑎1 , 𝑐1, 𝑎3 , 𝑎2 , 𝑐3, … … … … … ]
- Hamming Distance (HD): It is an important parameter for measuring the ability of error detection. It is the number of bits that differ between any two codewords C_i and C_j, denoted by d_ij. For a binary (n, k) code with 2^k possible codewords, the minimum HD is d_min = min(d_ij), where n ≥ d_ij ≥ 0.
For any code word, the possible error detection is:
2t = d_min − 1
For example, if d_min = 4, then it is possible to detect 3 errors or fewer. The possible error correction is:
t = (d_min − 1) / 2
so that for d_min = 4, it is possible to correct only one bit.



Example (2): Find the minimum HD between the following codewords. Also determine the possible number of detectable errors and the number of correctable error bits.

C1 = [100110011], C2 = [111101100] and C3 = [101100101]

Solution: Here d12 = 7, d13 = 4 and d23 = 3, hence d_min = 3.

The possible error detection is 2t = d_min − 1 = 2.

The possible error correction is t = (d_min − 1)/2 = 1.

- Hamming Weight: It is the number of 1's in a non-zero codeword C_i, denoted by w_i. For example, the codewords C1 = [1011100] and C2 = [1011001] have w1 = 4 and w2 = 4, respectively. If the two valid codewords are the all-ones and all-zeros words, then HD = w_i (the weight of the all-ones codeword).

3- Parity check codes (Error detection):

It is a linear block code (a systematic code). In this code, an extra bit is added for every k information bits, and hence the code rate (efficiency) is k/(k + 1). At the receiver, if the parity check fails (e.g. an odd number of 1's under even parity), an error is detected. The minimum Hamming distance for this category is d_min = 2, which means that the simple parity code is a single-bit error-detecting code; it cannot correct any error. There are two categories of this type: even parity (ensures that a code word has an even number of 1's) and odd parity (ensures that a code word has an odd number of 1's).



Example: an even parity-check code of (5, 4) which mean that, k =4 and n =5.

Data word Code word Data word Code word


0010 00101 0110 01100
1010 10100 1000 10001

The above table can be repeated with odd parity-check code of (5, 4) as follow:
Data word Code word Data word Code word
0010 00100 0110 01101
1010 10101 1000 10000

Note:
Error detection was used in early ARQ (Automatic Repeat on Request) systems.
If the receiver detects an error, it asks the transmitter (through another backward
channel) to retransmit.

The sender calculates the parity bit to be added to the data word to form a code word. At the receiver, a syndrome is calculated and passed to the decision logic analyzer. If the syndrome is 0, there is no error in the received codeword and the data portion of the received codeword is accepted as the data word; if the syndrome is 1, the data portion of the received codeword is discarded and the data word is not created, as shown in the figure below.


4- Repetition codes:

The repetition code is one of the most basic error-correcting codes. The idea of the repetition code is simply to repeat the message several times; the encoder is a simple device that repeats each bit r times.

For example, if we have a (3, 1) repetition code, then encoding the signal m = 101001 yields the code c = 111000111000000111.

Suppose we received a (3, 1) repetition code and we are decoding the signal c = 110001111; the decoded message (by majority vote in each block) is m = 101. An (r, 1) repetition code has an error-correcting capacity of (r − 1)/2, i.e. it will correct up to (r − 1)/2 errors in any code word. In other words d_min = r, so the correction capability increases with r. Although this code is very simple, it is also inefficient and wasteful: even the (2, 1) repetition code doubles the required bandwidth, which doubles the cost.


5- Linear Block Codes:

Linear block codes extend the parity check code by using a larger number of parity bits to either detect more than one error or correct one or more errors. An (n, k) binary block code selects 2^k codewords from the 2^n possibilities to form the code, such that each k-bit information block is uniquely mapped to one of these 2^k codewords. In linear codes the sum of any two codewords is a codeword: the code is linear if, and only if, V_i (+) V_j is also a code vector, where V_i and V_j are codeword vectors and (+) represents modulo-2 addition.

5-1 Hamming Codes

Hamming codes are a family of linear error-correcting codes that generalize the Hamming (7, 4) code, and were invented by Richard Hamming in 1950. Hamming codes can detect up to two-bit errors or correct one-bit errors without detection of uncorrected errors. Hamming codes are perfect codes, that is, they achieve the highest possible rate for codes with their block length and minimum distance of three.

In the codeword, there are k data bits and 𝑟 = 𝑛 − 𝑘 redundant (check) bits,
giving a total of n codeword bits. 𝑛 = 𝑘 + 𝑟

Hamming Code Algorithm:

General algorithm for hamming code is as follows:

1. r parity bits are added to a k-bit data word, forming a code word of n bits.
2. The bit positions are numbered in sequence from 1 to n.
3. The positions numbered with powers of two are reserved for the parity bits; the remaining bits are the data bits.
4. Parity bits are calculated by an XOR operation on some combination of data bits, as shown below.
5. It is characterized by (n, k) = (2^m − 1, 2^m − 1 − m), where m = 2, 3, 4, …

5-2 Exampl: Hamming(7,4)

This table describes which parity bits cover which transmitted bits in the encoded word. For example, p2 provides an even parity for bits 2, 3, 6, and 7. It also shows, by reading the columns, which transmitted bit is covered by which parity bits. For example, d1 is covered by p1 and p2 but not p3. This table has a striking resemblance to the parity-check matrix (H).

Alternatively, the parity bits can be calculated from the following equations:

p1 = d1 ⊕ d2 ⊕ d4
p2 = d1 ⊕ d3 ⊕ d4
p3 = d2 ⊕ d3 ⊕ d4

The parity-bit generating circuit is as follows:



At the receiver, the first step in error correction is to calculate the syndrome bits, which indicate whether there is an error or not. The value of the syndrome also determines the error position, using the syndrome bits CBA. The equations for generating the syndrome that will be used to detect the position of the error are:

A = p1 ⊕ d1 ⊕ d2 ⊕ d4
B = p2 ⊕ d1 ⊕ d3 ⊕ d4
C = p3 ⊕ d2 ⊕ d3 ⊕ d4

Example:

Suppose we want to transmit the data 1011 over a noisy communication channel. Determine the Hamming code word.

Solution:

The first step is to calculate the parity bit values and put them in the corresponding positions, as follows:

p1 = d1 ⊕ d2 ⊕ d4 = 1 ⊕ 0 ⊕ 1 = 0
p2 = d1 ⊕ d3 ⊕ d4 = 1 ⊕ 1 ⊕ 1 = 1
p3 = d2 ⊕ d3 ⊕ d4 = 0 ⊕ 1 ⊕ 1 = 0


Bit position   1    2    3    4    5    6    7
Bit name       p1   p2   d1   p3   d2   d3   d4
Bit value      0    1    1    0    0    1    1

Then the codeword is c = 0110011.

Suppose the following noise is added to the code word; the received code then becomes:

The noise: n = 0000100

The received code word: c_r = n ⊕ c = 0000100 ⊕ 0110011 = 0110111

Now calculate the syndrome:

A = p1 ⊕ d1 ⊕ d2 ⊕ d4 = 0 ⊕ 1 ⊕ 1 ⊕ 1 = 1
B = p2 ⊕ d1 ⊕ d3 ⊕ d4 = 1 ⊕ 1 ⊕ 1 ⊕ 1 = 0
C = p3 ⊕ d2 ⊕ d3 ⊕ d4 = 0 ⊕ 1 ⊕ 1 ⊕ 1 = 1

So CBA = 101, which indicates an error in the fifth bit.
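The whole encode/detect/correct cycle of this example can be reproduced with the following Python sketch (an added illustration that follows the parity and syndrome equations above; the bit ordering p1 p2 d1 p3 d2 d3 d4 is assumed):

```python
def hamming74_encode(d1, d2, d3, d4):
    """Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4, as in the table above."""
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(r):
    """Returns (corrected codeword, error position); position 0 means no error."""
    p1, p2, d1, p3, d2, d3, d4 = r
    A = p1 ^ d1 ^ d2 ^ d4
    B = p2 ^ d1 ^ d3 ^ d4
    C = p3 ^ d2 ^ d3 ^ d4
    pos = (C << 2) | (B << 1) | A          # syndrome CBA read as a binary number
    if pos:
        r[pos - 1] ^= 1                    # flip the erroneous bit
    return r, pos

c = hamming74_encode(1, 0, 1, 1)
print(c)                                   # [0, 1, 1, 0, 0, 1, 1]
received = [b ^ n for b, n in zip(c, [0, 0, 0, 0, 1, 0, 0])]   # error on bit 5
corrected, pos = hamming74_decode(received)
print(pos, corrected)                      # 5 [0, 1, 1, 0, 0, 1, 1]
```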

Hamming matrices:

Hamming codes can be computed in linear algebra terms through matrices


because Hamming codes are linear codes. For the purposes of Hamming codes,
two Hamming matrices can be defined: the code generator matrix G and the
parity-check matrix H



Example:
Suppose we want to transmit the data 1011 over a noisy communication channel.

This means that 0110011 would be transmitted instead of transmitting 1011. If no error occurs during transmission, then the received codeword r is identical to the transmitted codeword x: r = x. The receiver multiplies H and r to obtain the syndrome vector z, which indicates whether an error has occurred, and if so, for which codeword bit.

Suppose we have introduced a bit error on bit 5.
