CHAPTER 2
Information, Entropy, and the Motivation for Source Codes
The theory of information developed by Claude Shannon (SM EE '40, PhD Math '40) in
the late 1940s is one of the most impactful ideas of the past century, and has changed the
theory and practice of many fields of technology. The development of communication
systems and networks has benefited greatly from Shannon's work. In this chapter, we
will first develop the intuition behind information, formally define it as a mathematical
quantity, and connect it to another property of data sources, entropy.
We will then show how these notions guide us to efficiently compress and decompress a
data source before communicating (or storing) it without distorting the quality of informa-
tion being received. A key underlying idea here is coding, or more precisely, source coding,
which takes each message (or symbol) being produced by any source of data and associates
each message (symbol) with a codeword, while achieving several desirable properties.
This mapping between input messages (symbols) and codewords is called a code. Our fo-
cus will be on lossless compression (source coding) techniques, where the recipient of any
uncorrupted message can recover the original message exactly (we deal with corrupted
bits in later chapters).
2.1 Information and Entropy

To build intuition, consider a sender who must convey one of two possible messages about how the British will arrive:

    1 if by land.
    2 if by sea.
(Had the sender been from Course VI, it would've almost certainly been "0 if by land"
and "1 if by sea"!)
Let's say we have no prior knowledge of how the British might come, so each of these
choices (messages) is equally probable. In this case, the amount of information conveyed
by the sender specifying the choice is 1 bit. Intuitively, that bit, which can take on one
of two values, can be used to encode the particular choice. If we have to communicate a
sequence of such independent events, say 1000 such events, we can encode the outcome
using 1000 bits of information, each of which specifies the outcome of an associated event.
On the other hand, suppose we somehow knew that the British were far more likely
to come by land than by sea (say, because there is a severe storm forecast). Then, if the
message in fact says that the British are coming by sea, much more information is being
conveyed than if the message said that they were coming by land. To take another example,
far more information is conveyed by my telling you that the temperature in Boston
on a January day is 75°F than if I told you that the temperature is 32°F!
The conclusion you should draw from these examples is that any quantification of in-
formation about an event should depend on the probability of the event. The greater the
probability of an event, the smaller the information associated with knowing that the event
has occurred.
Shannon's definition of the amount of information captures this intuition: learning that an event of probability p has occurred conveys

$$I = \log(1/p) \text{ bits}. \qquad (2.1)$$

This definition satisfies the basic requirement that it is a decreasing function of p. But
so do an infinite number of functions, so what is the intuition behind using the logarithm
to define information? And what is the base of the logarithm?
The second question is easy to address: you can use any base, because log_a(1/p) =
log_b(1/p)/log_b(a), for any two bases a and b. Following Shannon's convention, we will use
base 2,¹ in which case the unit of information is called a bit.²
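To make this concrete, here is a minimal sketch in Python (the function name is ours) that computes the information conveyed by an event of probability p and checks that changing the base of the logarithm only rescales the answer:

    import math

    def information_bits(p):
        """Information, in bits, conveyed by an event of probability p (0 < p <= 1)."""
        return math.log2(1 / p)

    # An event with probability 1/2 conveys exactly 1 bit.
    print(information_bits(0.5))      # 1.0

    # Changing the base only rescales: log_a(1/p) = log_b(1/p) / log_b(a).
    p = 0.125
    nats = math.log(1 / p)            # natural log (base e)
    print(nats / math.log(2))         # 3.0 -- same as computing directly in base 2
    print(information_bits(p))        # 3.0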
The answer to the first question, why the logarithmic function, is that the resulting
definition has several elegant properties, and it is the simplest function that provides
these properties. One of these properties is additivity. If you have two independent events
(i.e., events that have nothing to do with each other), then the probability that they both
occur is equal to the product of the probabilities with which they each occur. What we
would like is for the corresponding information to add up. For instance, the event that it
rained in Seattle yesterday and the event that the number of students enrolled in 6.02 ex-
ceeds 150 are independent, and if I am told something about both events, the amount of
information I now have should be the sum of the information in being told individually of
the occurrence of the two events.
The logarithmic definition provides us with the desired additivity because, given two
independent events A and B occurring with probabilities p_A and p_B,

$$I_A + I_B = \log(1/p_A) + \log(1/p_B) = \log\frac{1}{p_A\,p_B} = \log\frac{1}{P(A \text{ and } B)}.$$

¹And we won't mention the base; if you see a log in this chapter, it will be to base 2 unless we mention otherwise.
²If we were to use base 10, the unit would be Hartleys, and if we were to use the natural log, base e, it would be nats, but no one uses those units in practice.
2.1.2 Examples
Suppose that we're faced with N equally probable choices. What is the information received
when I tell you which of the N choices occurred?
Because the probability of each choice is 1/N, the information is log(1/(1/N)) = log N
bits.
Now suppose there are initially N equally probable choices, and I tell you something
that narrows the possibilities down to one of M equally probable choices. How much
information have I given you about the choice?
We can answer this question by observing that narrowing the possibilities down from N
equi-probable choices to M equi-probable ones corresponds to an event of probability
M/N. Hence, the information you have received is log(1/(M/N)) = log(N/M) bits. (Note
that when M = 1, we get the expected answer of log N bits.)
We can therefore write a convenient rule: if learning something narrows the number of
equi-probable choices from N down to M, you have received log(N/M) bits of information.
For example, suppose a decimal digit is chosen uniformly at random from 0 through 9.
Learning that the digit is even conveys log2(10/5) = 1 bit; learning that it is ≥ 5 conveys
another log2(10/5) = 1 bit; learning that it is a multiple of 3 conveys log2(10/4) = 1.322 bits;
and learning the digit itself conveys log2(10/1) = 3.322 bits. Note that this last quantity is
the same as the sum of the previous three examples: information is cumulative if the joint
probability of the events revealed to us factors into the product of the individual probabilities.
In this example, we can calculate the probability that they all occur together, and
compare that answer with the product of the probabilities of each of them occurring
individually. Let event A be "the digit is even", event B be "the digit is ≥ 5", and event
C be "the digit is a multiple of 3". Then, P(A and B and C) = 1/10 because there is
only one digit, 6, that satisfies all three conditions. P(A) · P(B) · P(C) = 1/2 × 1/2 ×
4/10 = 1/10 as well. The reason the information adds up is that

$$\log\frac{1}{P(A \text{ and } B \text{ and } C)} = \log\frac{1}{P(A)} + \log\frac{1}{P(B)} + \log\frac{1}{P(C)}.$$

Note that pairwise independence between events is actually not necessary for information
from three (or more) events to add up. In this example, P(A and B) = P(A) ·
P(B|A) = 1/2 × 2/5 = 1/5, while P(A) · P(B) = 1/2 × 1/2 = 1/4.
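Here is a small Python sketch (written just for this example) that enumerates the ten digits and verifies both claims: the joint probability factors into the product of the individual probabilities, so the information adds up, even though A and B are not pairwise independent:

    import math
    from fractions import Fraction

    digits = range(10)
    A = {d for d in digits if d % 2 == 0}   # "the digit is even"
    B = {d for d in digits if d >= 5}       # "the digit is >= 5"
    C = {d for d in digits if d % 3 == 0}   # "the digit is a multiple of 3"

    def prob(event):
        return Fraction(len(event), 10)

    def info(p):                            # information in bits
        return math.log2(1 / float(p))

    print(prob(A & B & C), prob(A) * prob(B) * prob(C))   # 1/10 and 1/10: the joint factors
    print(info(prob(A & B & C)))                          # 3.32 bits
    print(info(prob(A)) + info(prob(B)) + info(prob(C)))  # 1 + 1 + 1.32 = 3.32 bits

    # Yet A and B are not pairwise independent:
    print(prob(A & B), prob(A) * prob(B))                 # 1/5 versus 1/4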
2.1.3 Entropy
Now that we know how to measure the information contained in a given event, we can
quantify the expected information in a set of possible outcomes. Specifically, if an event i
occurs with probability p_i, 1 ≤ i ≤ N, out of a set of N events, then the average or expected
information is given by

$$H(p_1, p_2, \ldots, p_N) = \sum_{i=1}^{N} p_i \log(1/p_i). \qquad (2.2)$$
H is also called the entropy (or Shannon entropy) of the probability distribution. Like
information, it is also measured in bits. It is simply the sum of several terms, each of which
is the information of a given event weighted by the probability of that event occurring. It
is often useful to think of the entropy as the average or expected uncertainty associated with
this set of events.
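Equation (2.2) is easy to compute directly. The following is a minimal Python sketch (the function name and the convention for zero-probability terms are ours):

    import math

    def entropy(probs):
        """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
        assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
        # Zero-probability terms contribute nothing, since p*log(1/p) -> 0 as p -> 0.
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin
    print(entropy([1.0]))         # 0.0 bits: no uncertainty at all
    print(entropy([0.25] * 4))    # 2.0 bits: four equi-probable outcomes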
In the important special case of two mutually exclusive events (i.e., exactly one of the two
events can occur), occurring with probabilities p and 1 − p respectively, the entropy is

$$H(p, 1-p) = p \log(1/p) + (1-p) \log(1/(1-p)). \qquad (2.3)$$

We will be lazy and refer to this special case, H(p, 1 − p), as simply H(p).
This entropy as a function of p is plotted in Figure 2-1. It is symmetric about p = 1/2,
with its maximum value of 1 bit occurring when p = 1/2. Note that H(0) = H(1) = 0;
although log(1/p) → ∞ as p → 0, p log(1/p) → 0 as p → 0.
It is easy to verify that the expression for H from Equation (2.2) is always non-negative.
Moreover, H(p_1, p_2, ..., p_N) ≤ log N always.
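A quick numerical check of these properties, reusing the entropy sketch above (repeated here so the snippet stands alone):

    import math

    def entropy(probs):
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    def H(p):                         # the binary entropy H(p, 1 - p)
        return entropy([p, 1 - p])

    print(H(0.1), H(0.9))             # symmetric about 1/2: both about 0.469 bits
    print(H(0.5))                     # maximum value: 1.0 bit
    print(H(0.0), H(1.0))             # no uncertainty: 0.0 bits

    # H(p1, ..., pN) <= log2(N), with equality for the uniform distribution.
    N = 8
    print(entropy([1 / N] * N), math.log2(N))   # both 3.0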
2.2 Source Codes

Suppose we want to transmit or store a text file in which each character is encoded in
ASCII as a fixed-length 8-bit byte. One advantage of such a fixed-length encoding is that
each character can be easily manipulated independently. For example, to find the 42nd character in the file, one
just looks at the 42nd byte and interprets those 8 bits as an ASCII character. A text file
containing 1000 characters takes 8000 bits to store. If the text file were HTML to be sent
over the network in response to an HTTP request, it would be natural to send the 1000
bytes (8000 bits) exactly as they appear in the file.
But let's think about how we might compress the file and send fewer than 8000 bits. If
the file contained English text, we'd expect that the letter 'e' would occur more frequently
than, say, the letter 'x'. This observation suggests that if we encoded 'e' for transmission
using fewer than 8 bits (and, as a trade-off, had to encode less common characters, like 'x',
using more than 8 bits), we'd expect the encoded message to be shorter on average than the
original method. So, for example, we might choose the bit sequence 00 to represent 'e' and
the code 100111100 to represent 'x'.
This intuition is consistent with the definition of the amount of information: commonly
occurring symbols have a higher pi and thus convey less information, so we need fewer
bits to encode such symbols. Similarly, infrequently occurring symbols like x have a lower
pi and thus convey more information, so we'll use more bits when encoding such sym-
bols. This intuition helps meet our goal of matching the size of the transmitted data to the
information content of the message.
The mapping of information we wish to transmit or store into bit sequences is referred
to as a code. Two examples of codes (fixed-length and variable-length) are shown in Fig-
ure 2-2, mapping different grades to bit sequences in one-to-one fashion. The fixed-length
code is straightforward, but the variable-length code is not arbitrary, but has been carefully
designed, as we will soon learn. Each bit sequence in the code is called a codeword.
When the mapping is performed at the source of the data, generally for the purpose
of compressing the data (ideally, to match the expected number of bits to the underlying
entropy), the resulting mapping is called a source code. Source codes are distinct from
channel codes we will study in later chapters. Source codes remove redundancy and com-
press the data, while channel codes add redundancy in a controlled way to improve the error
resilience of the data in the face of bit errors and erasures caused by imperfect communi-
cation channels. This chapter and the next are about source codes.
We can generalize this insight about encoding common symbols (such as the letter e)
more succinctly than uncommon symbols into a strategy for variable-length codes:
Send commonly occurring symbols using shorter codewords (fewer bits) and
infrequently occurring symbols using longer codewords (more bits).
We'd expect that, on average, encoding the message with a variable-length code would
take fewer bits than the original fixed-length encoding. Of course, if the message were all
x's, the variable-length encoding would be longer, but our encoding scheme is designed to
optimize the expected case, not the worst case.
Here's a simple example: suppose we had to design a system to send messages containing
1000 6.02 grades of A, B, C, and D (MIT students rarely, if ever, get an F in 6.02).
Examining past messages, we find that each of the four grades occurs with the probabilities
shown in Figure 2-2.
With four possible choices for each grade, if we use the fixed-length encoding, we need
2 bits to encode a grade, for a total transmission length of 2000 bits when sending 1000
grades.
Figure 2-2: Possible grades shown with probabilities, fixed- and variable-length encodings
With a fixed-length code, the size of the transmission doesn't depend on the actual
message: sending 1000 grades always takes exactly 2000 bits.
Decoding a message sent with the fixed-length code is straightforward: take each pair
of received bits and look them up in the table above to determine the corresponding grade.
Note that it's possible to determine, say, the 42nd grade without decoding any other of the
grades: just look at the 42nd pair of bits.
Using the variable-length code, the number of bits needed for transmitting 1000 grades
depends on the grades.
If the grades were all B's, the transmission would take only 1000 bits; if they were all C's and
D's, the transmission would take 3000 bits. But we can use the grade probabilities given
in Figure 2-2 to compute the expected length of a transmission as
$$1000\left[\left(\tfrac{1}{3}\right)(2) + \left(\tfrac{1}{2}\right)(1) + \left(\tfrac{1}{12}\right)(3) + \left(\tfrac{1}{12}\right)(3)\right] = 1000\left[1\tfrac{2}{3}\right] = 1666.7 \text{ bits}.$$
So, on average, using the variable-length code would shorten the transmission of 1000
grades by 333 bits, a savings of about 17%. Note that to determine, say, the 42nd grade, we
would need to decode the first 41 grades to determine where in the encoded message the
42nd grade appears.
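Here is a short Python sketch of the expected-length calculation, using the grade probabilities from Figure 2-2; the specific codewords shown are illustrative choices consistent with the codeword lengths used above (only the lengths matter for this calculation):

    from fractions import Fraction

    # Grade probabilities from Figure 2-2.
    prob = {'A': Fraction(1, 3), 'B': Fraction(1, 2),
            'C': Fraction(1, 12), 'D': Fraction(1, 12)}

    # Illustrative codewords with lengths 2, 1, 3, and 3 bits (a prefix-free choice).
    codeword = {'A': '10', 'B': '0', 'C': '110', 'D': '111'}

    bits_per_grade = sum(prob[g] * len(codeword[g]) for g in prob)
    print(bits_per_grade)                  # 5/3 bits per grade, on average
    print(float(1000 * bits_per_grade))    # about 1666.7 bits for 1000 grades
    # The fixed-length code always needs 2 bits per grade: 2000 bits for 1000 grades.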
Using variable-length codes looks like a good approach if we want to send fewer bits
on average, but preserve all the information in the original message. On the downside,
we give up the ability to access an arbitrary message symbol without first decoding the
message up to that point.
One obvious question to ask about a particular variable-length code: is it the best en-
coding possible? Might there be a different variable-length code that could do a better job,
i.e., produce even shorter messages on the average? How short can the messages be on the
average? We turn to this question next.
2.3 How Much Compression Is Possible?

Specifically, the entropy, defined by Equation (2.2), tells us the expected amount of in-
formation in a message, when the message is drawn from a set of possible messages, each
occurring with some probability. The entropy is a lower bound on the amount of informa-
tion that must be sent, on average, when transmitting data about a particular choice.
What happens if we violate this lower bound, i.e., we send fewer bits on the average
than called for by Equation (2.2)? In this case the receiver will not have sufficient informa-
tion and there will be some remaining ambiguity; exactly what ambiguity depends on the
encoding, but to construct a code of fewer than the required number of bits, some of the
choices must have been mapped into the same encoding. Thus, when the recipient receives
one of the overloaded encodings, it will not have enough information to unambiguously
determine which of the choices actually occurred.
Equation (2.2) answers our question about how much compression is possible by giving
us a lower bound on the number of bits that must be sent to resolve all ambiguities at the
recipient. Reprising the example from Figure 2-2, we can update the figure using Equation
(2.1).
Figure 2-3: Possible grades shown with probabilities and information content.
Using Equation (2.2), we can compute the expected information content when learning a particular
grade:
$$\sum_{i=1}^{N} p_i \log_2(1/p_i) = \left(\tfrac{1}{3}\right)(1.58) + \left(\tfrac{1}{2}\right)(1) + \left(\tfrac{1}{12}\right)(3.58) + \left(\tfrac{1}{12}\right)(3.58) = 1.626 \text{ bits}.$$
So encoding a sequence of 1000 grades requires transmitting 1626 bits on the average. The
variable-length code given in Figure 2-2 encodes 1000 grades using 1667 bits on the average,
and so doesn't achieve the maximum possible compression. It turns out the example
code does as well as possible when encoding one grade at a time. To get closer to the lower
bound, we would need to encode sequences of grades; more on this idea below.
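A short Python sketch of this comparison, using the grade probabilities from Figure 2-2 and the codeword lengths of the variable-length code above:

    import math
    from fractions import Fraction

    # Grade probabilities from Figure 2-2 and the codeword lengths used above.
    prob = {'A': Fraction(1, 3), 'B': Fraction(1, 2),
            'C': Fraction(1, 12), 'D': Fraction(1, 12)}
    code_length = {'A': 2, 'B': 1, 'C': 3, 'D': 3}

    entropy_per_grade = sum(p * math.log2(1 / p) for p in prob.values())
    avg_code_length = sum(float(prob[g]) * code_length[g] for g in prob)

    print(1000 * entropy_per_grade)    # about 1626 bits: the lower bound for 1000 grades
    print(1000 * avg_code_length)      # about 1667 bits: what this code actually averages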
Finding a good code, one where the length of the encoded message matches the
information content (i.e., the entropy), is challenging, and one often has to think outside
the box. For example, consider transmitting the results of 1000 flips of an unfair coin
where the probability of heads is given by p_H. The information content in an unfair coin flip
can be computed using Equation (2.3):

$$H(p_H) = p_H \log(1/p_H) + (1 - p_H) \log(1/(1 - p_H)).$$

For p_H = 0.999, this entropy evaluates to 0.0114 bits. Can you think of a way to encode 1000
unfair coin flips using, on average, just 11.4 bits? The recipient of the encoded message
must be able to tell for each of the 1000 flips which were heads and which were tails. Hint:
with a budget of just 11 bits, one obviously can't encode each flip separately!
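To see where the 11.4 comes from, here is a quick Python sketch that evaluates Equation (2.3) at p_H = 0.999:

    import math

    def binary_entropy(p):
        """H(p, 1 - p) in bits, with the convention that 0*log(1/0) = 0."""
        return sum(q * math.log2(1 / q) for q in (p, 1 - p) if q > 0)

    p_H = 0.999
    print(binary_entropy(p_H))           # about 0.0114 bits per flip
    print(1000 * binary_entropy(p_H))    # about 11.4 bits for 1000 flips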
In fact, some effective codes leverage the context in which the encoded message is be-
ing sent. For example, if the recipient is expecting to receive a Shakespeare sonnet, then
it's possible to encode the message using just 8 bits if one knows that there are only 154
Shakespeare sonnets. That is, if the sender and receiver both know the sonnets, and the
sender just wishes to tell the receiver which sonnet to read or listen to, he can do that using
a very small number of bits, just log 154 bits if all the sonnets are equi-probable!
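As a quick check of the arithmetic, a one-line Python computation (the variable name is ours) shows that log2(154) is a little over 7 bits, so an 8-bit index is enough to identify any one sonnet:

    import math

    num_sonnets = 154
    print(math.log2(num_sonnets))             # about 7.27 bits, if all sonnets are equi-probable
    print(math.ceil(math.log2(num_sonnets)))  # 8 bits are enough for a whole-number index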
2.4 Why Compression?

Compressing the data being sent is useful for several reasons:

• Shorter messages take less time to transmit and so the complete message arrives
more quickly at the recipient. This is good for both the sender and recipient since
it frees up their network capacity for other purposes and reduces their network
charges. For high-volume senders of data (such as Google, say), the impact of sending
half as many bytes is economically significant.
• Using network resources sparingly is good for all the users who must share the
internal resources (packet queues and links) of the network. Fewer resources per
message means more messages can be accommodated within the network's resource
constraints.
• Over error-prone links with non-negligible bit error rates, compressing messages before
they are channel-coded using error-correcting codes can help improve throughput,
because all the redundancy in the message can be designed in to improve error
resilience, after removing any other redundancies in the original message. It is better
to design in redundancy with the explicit goal of correcting bit errors, rather than
rely on whatever sub-optimal redundancies happen to exist in the original message.
Exercises
1. Several people at a party are trying to guess a 3-bit binary number. Alice is told that
the number is odd; Bob is told that it is not a multiple of 3 (i.e., not 0, 3, or 6); Charlie
is told that the number contains exactly two 1s; and Deb is given all three of these
clues. How much information (in bits) did each player get about the number?
2. After careful data collection, Alyssa P. Hacker observes that the probability of
HIGH or LOW traffic on Storrow Drive is given by the following table:
                                   HIGH traffic level          LOW traffic level
    If the Red Sox are playing     P(HIGH traffic) = 0.999     P(LOW traffic) = 0.001
    If the Red Sox are not playing P(HIGH traffic) = 0.25      P(LOW traffic) = 0.75
(a) If it is known that the Red Sox are playing, then how much information in bits
is conveyed by the statement that the traffic level is LOW? Give your answer as
a mathematical expression.
(b) Suppose it is known that the Red Sox are not playing. What is the entropy
of the corresponding probability distribution of traffic? Give your answer as a
mathematical expression.
3. X is an unknown 4-bit binary number picked uniformly at random from the set of all
possible 4-bit numbers. You are given another 4-bit binary number, Y, and told that
the Hamming distance between X (the unknown number) and Y (the number you
know) is two. How many bits of information about X have you been given?
4. In Blackjack the dealer starts by dealing 2 cards each to himself and his opponent:
one face down, one face up. After you look at your face-down card, you know a total
of three cards. Assuming this was the first hand played from a new deck, how many
bits of information do you have about the dealer's face-down card after having seen
three cards?
5. The following table shows the undergraduate and MEng enrollments for the School
of Engineering.
(a) When you learn a randomly chosen engineering student's department you get
some number of bits of information. For which department do you get
the least amount of information?
(b) After studying Huffman codes in the next chapter, design a Huffman code to
encode the departments of randomly chosen groups of students. Show your
Huffman tree and give the code for each course.
(c) If your code is used to send messages containing only the encodings of the de-
partments for each student in groups of 100 randomly chosen students, what is
the average length of such messages?
6. You're playing an online card game that uses a deck of 100 cards containing 3 Aces,
7 Kings, 25 Queens, 31 Jacks and 34 Tens. In each round of the game the cards are
shuffled, you make a bet about what type of card will be drawn, then a single card
is drawn and the winners are paid off. The drawn card is reinserted into the deck
before the next round begins.
(a) How much information do you receive when told that a Queen has been drawn
during the current round?
(b) Give a numeric expression for the information content received when learning
about the outcome of a round.
(c) After you learn about Huffman codes in the next chapter, construct a variable-
length Huffman encoding that minimizes the length of messages that report the
outcome of a sequence of rounds. The outcome of a single round is encoded as
A (ace), K (king), Q (queen), J (jack) or X (ten). Specify your encoding for each
of A, K, Q, J and X.
(d) Again, after studying Huffman codes, use your code from part (c) to calculate
the expected length of a message reporting the outcome of 1000 rounds (i.e., a
message that contains 1000 symbols)?
(e) The Nevada Gaming Commission regularly receives messages in which the out-
come for each round is encoded using the symbols A, K, Q, J, and X. They dis-
cover that a large number of messages describing the outcome of 1000 rounds
(i.e., messages with 1000 symbols) can be compressed by the LZW algorithm
into files each containing 43 bytes in total. They decide to issue an indictment
for running a crooked game. Why did the Commission issue the indictment?
7. Consider messages made up entirely of vowels (A, E, I, O, U). Here's a table of prob-
abilities for each of the vowels:
    Vowel    p_l     log2(1/p_l)    p_l · log2(1/p_l)
    A        0.22    2.18           0.48
    E        0.34    1.55           0.53
    I        0.17    2.57           0.43
    O        0.19    2.40           0.46
    U        0.08    3.64           0.29
    Totals   1.00    12.34          2.19
(a) Give an expression for the number of bits of information you receive when
learning that a particular vowel is either I or U.
(b) After studying Huffman codes in the next chapter, use Huffmans algorithm
to construct a variable-length code assuming that each vowel is encoded indi-
vidually. Draw a diagram of the Huffman tree and give the encoding for each
of the vowels.
(c) Using your code from part (b) above, give an expression for the expected length
in bits of an encoded message transmitting 100 vowels.
(d) Ben Bitdiddle spends all night working on a more complicated encoding algo-
rithm and sends you email claiming that using his code the expected length in
bits of an encoded message transmitting 100 vowels is 197 bits. Would you pay
good money for his implementation?