University of California Lecture Notes: Probability Theory
Jean Walrand
Department of Electrical Engineering and Computer Sciences
University of California
Berkeley, CA 94720
Table of Contents

Abstract
Introduction
1 Modelling Uncertainty
1.1 Models and Physical Reality
1.2 Concepts and Calculations
1.3 Function of Hidden Variable
1.4 A Look Back
1.5 References
2 Probability Space
2.1 Choosing At Random
2.2 Events
2.3 Countable Additivity
2.4 Probability Space
2.5 Examples
2.5.1 Choosing uniformly in {1, 2, . . . , N}
2.5.2 Choosing uniformly in [0, 1]
2.5.3 Choosing uniformly in [0, 1]²
2.6 Summary
2.6.1 Stars and Bars Method
2.7 Solved Problems
3 Conditional Probability and Independence
3.4 Independence
3.4.1 Example 1
3.4.2 Example 2
3.4.3 Definition
3.4.4 General Definition
3.5 Summary
3.6 Solved Problems
4 Random Variable
4.1 Measurability
4.2 Distribution
4.3 Examples of Random Variable
4.4 Generating Random Variables
4.5 Expectation
4.6 Function of Random Variable
4.7 Moments of Random Variable
4.8 Inequalities
4.9 Summary
4.10 Solved Problems
5 Random Variables
5.1 Examples
5.2 Joint Statistics
5.3 Independence
5.4 Summary
5.5 Solved Problems
6 Conditional Expectation
6.1 Examples
6.1.1 Example 1
6.1.2 Example 2
6.1.3 Example 3
6.2 MMSE
6.3 Two Pictures
6.4 Properties of Conditional Expectation
6.5 Gambling System
6.6 Summary
6.7 Solved Problems
9 Estimation
9.1 Properties
9.2 Linear Least Squares Estimator: LLSE
9.3 Recursive LLSE
9.4 Sufficient Statistics
9.5 Summary
9.5.1 LLSE
9.6 Solved Problems
16 Applications
16.1 Optical Communication Link
16.2 Digital Wireless Communication Link
16.3 M/M/1 Queue
16.4 Speech Recognition
16.5 A Simple Game
16.6 Decisions
B Functions
Bibliography
Abstract
These notes are derived from lectures and office-hour conversations in a junior/senior-level
course on probability and random processes in the Department of Electrical Engineering
and Computer Sciences at the University of California, Berkeley.
The notes do not replace a textbook. Rather, they provide a guide through the material.
The style is casual, with no attempt at mathematical rigor. The goal is to help the student figure out the meaning of various concepts and to illustrate them with examples.
When choosing a textbook for this course, we always face a dilemma. On the one hand,
there are many excellent books on probability theory and random processes. However, we
find that these texts are too demanding for the level of the course. On the other hand,
books written for engineering students tend to be fuzzy in their attempt to avoid subtle
mathematical concepts. As a result, we always end up having to complement the textbook
we select. If we select a math book, we need to help the student understand the meaning of
the results and to provide many illustrations. If we select a book for engineers, we need to
provide a more complete conceptual picture. These notes grew out of these efforts at filling
the gaps.
You will notice that we are not trying to be comprehensive. All the details are available
in textbooks. There is no need to repeat the obvious.
The author wants to thank the many inquisitive students he has had in that class and
the very good teaching assistants, in particular Teresa Tung, Mubaraq Misra, and Eric Chi,
who helped him over the years; they contributed many of the problems.
Happy reading and keep testing hypotheses!
Introduction
Engineering systems are designed to operate well in the face of uncertainty in the characteristics of the components and systems we design. Communication systems are designed to compensate for noise. Internet routers are built to absorb traffic fluctuations. Buildings must resist unpredictable vibrations. Circuit manufacturing steps are subject to unpredictable variations. Understanding how to model uncertainty and how to analyze its effects is, or should be, an essential part of an engineer's education.
What should you understand about probability? It is a complex subject that has been
constructed over decades by pure and applied mathematicians. Thousands of books explore
various aspects of the theory. How much do you really need to know and where do you
start?
The first key concept is how to model uncertainty (see Chapters 2 and 3). What do we mean by a “random experiment?” Once you understand that concept, the notion of a random variable should become transparent (see Chapters 4 and 5). You may be surprised to learn that a random variable does not vary! Terms may be confusing. Once you appreciate the notion of randomness, you should get some understanding of the idea of expectation (Section 4.5) and how observations modify it (Chapter 6). A special class of random variables (Gaussian) is particularly useful in many applications (Chapter 7). After you master these key notions,
you are ready to look at detection (Chapter 8) and estimation problems (Chapter 9). These
are representative examples of how one can process observations to reduce uncertainty. That
is, how one learns. Many systems are subject to the cumulative effect of many sources of
randomness. We study such effects in Chapter 11 after having provided some background
in Chapter 10. The final set of important notions concerns random processes: uncertain evolution over time. We look at particularly useful models of such processes in the last chapters of the notes.
The concepts are difficult, but the math is not (Appendix ?? reviews what you should
know). The trick is to know what we are trying to compute. Look at examples and invent
new ones to reinforce your understanding of ideas. Don’t get discouraged if some ideas seem
obscure at first, but do not let the obscurity persist! This stuff is not that hard; it is only new.
Chapter 1

Modelling Uncertainty

In this chapter, we stress the importance of the concepts that justify the structure of the theory. We comment on the notion of a hidden variable. We conclude the chapter with a very brief historical look at how the theory developed.

1.1 Models and Physical Reality

There is a difference between the physical world and the models of Probability Theory. That difference is similar to that between laws of theoretical physics and the real world: even though mathematicians view the theory as standing on its own, when engineers use it, they see it as a model of the physical world.
Consider flipping a fair coin repeatedly. Designate by 0 and 1 the two possible outcomes
of a coin flip (say 0 for head and 1 for tail). This experiment takes place in the physical
world. The outcomes are uncertain. In this chapter, we try to appreciate the relationship between such physical experiments and their probability models.

1.2 Concepts and Calculations
In our many years of teaching probability models, we have always found that what is
most subtle is the interpretation of the models, not the calculations. In particular, this
introductory course uses mostly elementary algebra and some simple calculus. However,
understanding the meaning of the models, what one is trying to calculate, requires becoming familiar with some rather subtle ideas. To perform the calculations, little more than that elementary mathematics is needed; but to develop some intuition about the theory, to be able to anticipate theorems and results, to relate these developments to the physical reality, it is important to have some interpretation of the definitions and of the basic axioms of the theory. We will attempt to provide such interpretations throughout these notes.

1.3 Function of Hidden Variable
One idea is that the uncertainty in the world is fully contained in the selection of some
hidden variable. (This model does not apply to quantum mechanics, which we do not
consider here.) If this variable were known, then nothing would be uncertain anymore.
Think of this variable as being picked by nature at the big bang. Many choices were
possible, but one particular choice was made and everything derives from it. [In most cases,
it is easier to think of nature’s choice only as it affects a specific experiment, but we worry
about this type of detail later.] In other words, everything that is uncertain is a function of
that hidden variable. By function, we mean that if we know the hidden variable, then we can determine everything that is uncertain.
Let us denote the hidden variable by ω. Take one uncertain thing, such as the outcome
of the fifth coin flip. This outcome is a function of ω. If we designate the outcome of
the fifth coin flip by X, then we conclude that X is a function of ω. We can denote that
function by X(ω). Another uncertain thing could be the outcome of the twelfth coin flip.
We can denote it by Y(ω). The key point here is that X and Y are functions of the same hidden variable ω.
Summing up, everything that is random is some function X of some hidden variable ω.
This is a model. To make this model more precise, we need to explain how ω is selected
and what these functions X(ω) are like. These ideas will keep us busy for a while!
1.4 A Look Back

The theory was developed by a number of inquiring minds, and we briefly review some of their contributions. (We condense this historical account from the very nice book by S. M. Stigler [9]. For ease of exposition, we simplify the examples and the notation.)

Least Squares
Say that an amplifier has some gain A that we would like to measure. We observe the
input X and the output Y and we know that Y = AX. If we could measure X and Y
precisely, then we could determine A by a simple division. However, assume that we cannot
measure these quantities precisely. Instead, we make two sets of measurements: (X, Y) and (X′, Y′). Assume that (X, Y) = (2, 5) and (X′, Y′) = (4, 7). No value of A works exactly for both sets of measurements. The problem is that we did not measure the input and the output accurately enough.
One approach is to average the measurements, say by taking the arithmetic means ((X + X′)/2, (Y + Y′)/2) = (3, 6), and to find the gain A so that 6 = A × 3, so that A = 2.
A second approach is to solve for A for each pair of measurements: for (X, Y), we find A = 2.5 and for (X′, Y′), we find A = 1.75. We can average these values and decide that A = (2.5 + 1.75)/2 = 2.125.

A third approach is to choose the value of A that minimizes the sum of the squares of the errors between Y and AX and between Y′ and AX′. That is, we look for A that minimizes

(5 − 2A)² + (7 − 4A)² = 74 − 76A + 20A².

Setting the derivative with respect to A equal to 0, we find −76 + 40A = 0, or A = 1.9. This is the solution proposed by Legendre in 1805.
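As a quick numerical check of this least squares calculation, here is a short Python sketch (the variable names are ours):

import numpy as np

# The two measurements from the example: (X, Y) = (2, 5) and (X', Y') = (4, 7).
x = np.array([2.0, 4.0])
y = np.array([5.0, 7.0])

# Least squares: minimize sum_i (y_i - A * x_i)^2 over A.
# Setting the derivative to zero gives A = sum(x_i * y_i) / sum(x_i^2).
A = np.sum(x * y) / np.sum(x ** 2)
print(A)  # 38 / 20 = 1.9, the value found by Legendre's method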
The method of least squares is one that produces the “best” prediction of the output based on the input, under rather general conditions. However, to understand this notion, we first need some concepts from probability theory.
If an urn contains 5 red balls and 7 blue balls, then the odds of picking “at random” a
red ball from the urn are 5 out of 12. One can view the likelihood of a complex event as
being the ratio of the number of favorable cases divided by the total number of “equally
likely” cases. This is a somewhat circular definition, but not completely: from symmetry
considerations, one may postulate the existence of equally likely events. However, in most
situations, one cannot determine – let alone count – the equally likely cases nor the favorable
cases. (Consider for instance the odds of having a sunny Memorial Day in Berkeley.)
Jacob Bernoulli (one of twelve Bernoullis who contributed to Mathematics, Physics, and
Probability) showed the following result. If we pick a ball from an urn with r red balls and
b blue balls a large number N of times (always replacing the ball before the next attempt),
then the fraction of times that we pick a red ball approaches r/(r + b). More precisely, he
showed that the probability that this fraction differs from r/(r + b) by more than any given
ε > 0 goes to 0 as N increases. We will learn this result as the weak law of large numbers.

De Moivre refined this result of Bernoulli. When N is large and ε small, he derived the normal approximation to the probability discussed earlier. This is the first mention of this distribution and an early example of a central limit result.
Looking again at Bernoulli’s and de Moivre’s problem, we see that they assumed p =
r/(r + b) known and worried about the probability that the fraction of N balls selected from
the urn differs from p by more than a fixed ε > 0. Bernoulli showed that this probability goes to zero (he also got some conservative estimates of the value of N needed for that probability to be smaller than a given amount).
Simpson (a heavy drinker) worried about the “reverse” question. Assume we do not
know p and that we observe the fraction q of a large number N of balls being red. We
believe that p should be close to q, but how close can we be confident that it is? Simpson
proposed a naïve answer by making arbitrary assumptions on the likelihood of the values
of p.
Bayes understood Simpson’s error. To appreciate Bayes’ argument, assume that q = 0.6
and that we have made 100 experiments. What are the odds that p ∈ [0.55, 0.65]? If you are
told that p = 0.5, then these odds are 0. However, if you are told that the urn was chosen
such that p = 0.5 or p = 1, with equal probabilities, then the odds that p ∈ [0.55, 0.65] are
now close to 1.
Bayes understood how to include systematically the information about the prior distribution in the calculation of the posterior distribution. He discovered what we know today as Bayes' rule. Later, Laplace proved general versions of the central limit theorem and derived various approximation results for integrals (based on asymptotic expansions).
Gauss developed the systematic theory of least squares estimation when the errors are
Gaussian. We explain in the notes the remarkable fact that the best estimate is linear in
the observations.
Markov Chains
A sequence of coin flips produces results that are independent. Many physical systems
exhibit a more complex behavior that requires a new class of models. Markov introduced
a class of such models that make it possible to capture dependencies over time. His models, called Markov chains, are studied in later chapters.
Kolmogorov

Kolmogorov was one of the most prolific mathematicians of the 20th century. He made fundamental contributions to many branches of mathematics, including functional analysis, the theory of probability and mathematical statistics, and the analysis of dynamical systems. He formulated the axiomatic framework of modern probability theory and established some essential properties such as the extension theorem and many other fundamental results.
1.5 References
There are many good books on probability theory and random processes. For the level of
this course, we recommend Ross [7], Hoel et al. [4], Pitman [5], and Bremaud [2]. The
books by Feller [3] are always inspiring. For a deeper look at probability theory, Breiman [1] is a good start. For cute problems, we recommend Sevastyanov et al. [8].
Chapter 2
Probability Space
In this chapter we describe the probability model of “choosing an object at random.” Examples will help us come up with a good definition. We explain that the key idea is to assign probabilities to sets of possible outcomes. These sets are events. The description of the events and of their probabilities specifies the random experiment.

2.1 Choosing At Random

First consider picking a card out of a 52-card deck. We could say that the odds of picking
any particular card are the same as that of picking any other card, assuming that the deck
has been well shuffled. We then decide to assign a “probability” of 1/52 to each card. That
probability represents the odds that a given card is picked. One interpretation is that if we
repeat the experiment “choosing a card from the deck” a large number N of times (replacing
the card previously picked every time and re-shuffling the deck before the next selection),
then a given card, say the ace of diamonds, is selected approximately N/52 times. Note that
this is only an interpretation. There is nothing that tells us that this is indeed the case;
moreover, if it is the case, then there is certainly nothing yet in our theory that allows us to
expect that result. Indeed, so far, we have simply assigned the number 1/52 to each card
in the deck. Our interpretation comes from what we expect from the physical experiment. That interpretation also suggests deeper properties of the sequences of successive cards picked from a deck. We will come back
to these deeper properties when we study independence. You may object that the definition
of probability involves implicitly that of “equally likely events.” That is correct as far as
the interpretation goes. The mathematical definition does not require such a notion.
Consider now throwing a dart at a dartboard. The probability of hitting a specific point on the board, measured with pinpoint accuracy, is essentially zero.
Accordingly, in contrast with the previous example, we cannot assign numbers to individual
outcomes of the experiment. The way to proceed is to assign numbers to sets of possible
outcomes. Thus, one can look at a subset of the dartboard and assign some probability
that represents the odds that the dart will land in that set. It is not simple to assign the
numbers to all the sets in a way that these numbers really correspond to the odds of a given
dart player. Even if we forget about trying to model an actual player, it is not that simple
to assign numbers to all the subsets of the dartboard. At the very least, to be meaningful,
the numbers assigned to the different subsets must obey some basic consistency rules. For
instance, if A and B are two subsets of the dartboard such that A ⊂ B, then the number
P (B) assigned to B must be at least as large as the number P (A) assigned to A. Also, if A
and B are disjoint, then P (A ∪ B) = P (A) + P (B). Finally, P (Ω) = 1, if Ω designates the
set of all possible outcomes (the dartboard, possibly extended to cover all bases). This is the
basic story: probability is defined on sets of possible outcomes and it is additive. [However,
it turns out that one more property is required: countable additivity (see below).]
Note that we can lump our two examples into one. Indeed, the first case can be viewed
as a particular case of the second where we would define P (A) = |A|/52, where A is any
subset of the deck of cards and |A| is the number of cards in A. This definition is
certainly additive and it assigns the probability 1/52 to any one card.
Some care is required when defining what we mean by a random choice. See Bertrand's paradox for a famous example.
2.2 Events
The sets of outcomes to which one assigns a probability are called events. It is not necessary
(and often not possible, as we may explain later) for every set of outcomes to be an event.
For instance, assume that we are only interested in whether the card that we pick is
black or red. In that case, it suffices to define P (A) = 0.5 = P (Ac ) where A is the set of all
the black cards and Ac is the complement of that set, i.e., the set of all the red cards. Of
course, we know that P(Ω) = 1 where Ω is the set of all the cards and P(∅) = 0, where ∅ is the empty set. Combinations of events should be events also. Indeed, if we want to define the probability that the outcome is in A and the probability that it is in B, it is reasonable to ask that we can also define the probability that it is in both A and B, or in at least one of the two. By extension, set operations that are performed on a finite collection of events should always produce an event. For instance, if A, B, C, D are events, then [(A \ B) ∩ C] ∪ D should also
be an event. We say that the set of events is closed under finite set operations. [We explain
below that we need to extend this property to countable operations.] With these properties,
it makes sense to write for disjoint events A and B that P(A ∪ B) = P(A) + P(B). Indeed, A ∪ B is then itself an event, so that its probability is defined. If you want to see why, for uncountable sample spaces, all sets of outcomes generally cannot be events, see Appendix C.

2.3 Countable Additivity
This topic is the first serious hurdle that you face when studying probability theory. If
you understand this section, you increase considerably your appreciation of the theory.
We want to be able to say that if the events An for n = 1, 2, . . ., are such that An ⊂ An+1
for all n and if A := ∪n An, then P(An) ↑ P(A) as n → ∞. Why is this useful? This property, called σ-additivity, is the key to being able to approximate events. The property specifies that the probability is continuous: if we approximate an event, then we also approximate its probability.
This strategy of “filling the gaps” by taking limits is central in mathematics. You
remember that real numbers are defined as limits of rational numbers. Similarly, integrals
are defined as limits of sums. The key idea is that different approximations should give the
same result. For this to work, we need the continuity property above.
Accordingly, we require that A := ∪n An be an event whenever the events An for n = 1, 2, . . . are such that An ⊂ An+1. More generally, we require that the collection of events be closed under countable set operations.
For instance, if we define P([0, x]) = x for x ∈ [0, 1], then we can define P([0, a)) = a because if ε is small enough, then An := [0, a − ε/n] is such that An ⊂ An+1 and [0, a) = ∪n An, so that P([0, a)) = lim_n P(An) = lim_n (a − ε/n) = a.
You may wish to review the meaning of countability (see Appendix ??).
2.4 Probability Space

Putting together the observations of the sections above, we have defined a probability space as follows. A probability space is a triple {Ω, F, P} where

• Ω is a nonempty set, called the sample space, whose elements are the possible outcomes;

• F is a σ-field of Ω, i.e., a collection of subsets of Ω that contains Ω and is closed under countable set operations, whose elements are the events;

• P is a countably additive function from F into [0, 1] such that P(Ω) = 1, called a probability measure.
Examples will clarify this definition. The main point is that one defines the probability
of sets of outcomes (the events). The probability should be countably additive (to be
continuous). Accordingly (to be able to write down this property), and also quite intuitively, the collection of events must be closed under countable set operations.
2.5 Examples
Throughout the course, we will make use of simple examples of probability spaces. We review a few of them here.

2.5.1 Choosing uniformly in {1, 2, . . . , N}

We say that we pick a value ω uniformly in {1, 2, . . . , N} when the N values are equally
likely to be selected. In this case, the sample space Ω is Ω = {1, 2, . . . , N }. For any subset
A ⊂ Ω, one defines P(A) = |A|/N where |A| is the number of elements in A. For instance, P({1, 2}) = 2/N.

2.5.2 Choosing uniformly in [0, 1]

Here, Ω = [0, 1] and one has, for example, P([0, 0.3]) = 0.3 and P([0.2, 0.7]) = 0.5. That
is, P(A) is the “length” of the set A. Thus, if ω is picked uniformly in [0, 1], then the probability that ω falls in a set A is the length of A.
It turns out that one cannot define the length of every subset of [0, 1], as we explain
in Appendix C. The collection of sets whose length is defined is the smallest σ-field that
contains the intervals. This collection is called the Borel σ-field of [0, 1]. More generally, the
smallest σ-field of ℜ that contains the intervals is the Borel σ-field of ℜ, usually designated by B.

2.5.3 Choosing uniformly in [0, 1]²

Here, Ω = [0, 1]² and one has, for example, P([0.1, 0.4] × [0.2, 0.8]) = 0.3 × 0.6 = 0.18. That
is, P(A) is the “area” of the set A. Similarly, in that case, if B = {ω = (ω1, ω2) ∈ Ω | ω1 ≤ ω2} and C = {ω ∈ Ω | ω1² + ω2² ≤ 1}, then

P(B) = 1/2 and P(C) = π/4.
As in one dimension, one cannot define the area of every subset of [0, 1]2 . The proper
σ-field is the smallest that contains the rectangles. It is called the Borel σ-field of [0, 1]2 .
More generally, the smallest σ-field of ℜ² that contains the rectangles is the Borel σ-field of ℜ².
2.6 Summary
To describe a random experiment, one specifies a probability space {Ω, F, P}: a sample space Ω, a σ-field F of Ω, i.e., a collection of subsets of Ω that is closed under countable set operations, and a countably additive probability measure P on F with P(Ω) = 1.

The idea is to specify the likelihood of various outcomes (elements of Ω). If one can
specify the probability of individual outcomes (e.g., when Ω is countable), then one can
choose F = 2Ω , so that all sets of outcomes are events. However, this is generally not
possible as the example of the uniform distribution on [0, 1] shows. (See Appendix C.)
2.6.1 Stars and Bars Method

In many problems, we use a method for counting the number of ordered groupings of
identical objects. This method is called the stars and bars method. Suppose we are given
identical objects we call stars. Any ordered grouping of these stars can be obtained by
separating them by bars. For example, || ∗ ∗ ∗ |∗ separates four stars into four groups of sizes
0, 0, 3, and 1.
Suppose that we want to arrange the N identical stars into M groups. The number of orderings is the number of ways of placing the N identical stars and M − 1 identical bars into N + M − 1 spaces, which is the binomial coefficient C(N + M − 1, M − 1).

Creating compound objects of stars and bars is useful when there are bounds on the sizes of the groups, as in Example 2.7.8 below; the sketch that follows checks the basic count.
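Here is a small Python sanity check of this formula against a brute-force enumeration (the function name is ours):

from math import comb

def stars_and_bars(n_stars: int, m_groups: int) -> int:
    """Ordered groupings of n identical stars into m groups of size >= 0:
    place the n stars and m - 1 bars into n + m - 1 spaces."""
    return comb(n_stars + m_groups - 1, m_groups - 1)

# Four stars into four groups, as in the example ||***|* above.
print(stars_and_bars(4, 4))  # 35

# Brute force: solutions of x1 + x2 + x3 + x4 = 4 with xi >= 0.
brute = sum(1 for x1 in range(5) for x2 in range(5) for x3 in range(5)
            if x1 + x2 + x3 <= 4)  # x4 = 4 - x1 - x2 - x3 is then determined
print(brute)  # 35 as well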
2.7 Solved Problems

Example 2.7.1. Describe the probability space {Ω, F, P} that corresponds to the random
experiment “picking five cards without replacement from a perfectly shuffled 52-card deck.”
1. One can choose Ω to be all the permutations of A := {1, 2, . . . , 52}. The interpretation
of ω ∈ Ω is then the shuffled deck. Each permutation is equally likely, so that pω = 1/(52!)
for ω ∈ Ω. When we pick the five cards, these cards are (ω1 , ω2 , . . . , ω5 ), the top 5 cards of
the deck.
2. One can also choose Ω to be all the subsets of A with five elements. In this case, each subset is equally likely and, since there are N := C(52, 5) such subsets, one defines pω = 1/N for ω ∈ Ω.
3. One can also choose Ω to be the set of ordered lists of five distinct cards of A, i.e., Ω = {(ω1, . . . , ω5) | ωi ∈ A and ωi ≠ ωj for i ≠ j, i, j ∈ {1, 2, . . . , 5}}. In this case, the outcome specifies the order in which we pick the cards. Since there are M := 52!/(47!) such ordered lists of five cards without replacement, we define pω = 1/M for ω ∈ Ω.

As this example shows, there are multiple ways of describing a random experiment.
What matters is that Ω is large enough to specify completely the outcome of the experiment.
Example 2.7.2. Pick three balls without replacement from an urn with fifteen balls that
are identical except that ten are red and five are blue. Specify the probability space.
One possibility is to specify the color of the three balls in the order they are picked.
Then

Ω = {R, B}³, F = 2^Ω, P({RRR}) = (10/15)(9/14)(8/13), . . . , P({BBB}) = (5/15)(4/14)(3/13).
Example 2.7.3. You flip a fair coin until you get three consecutive ‘heads’. Specify the
probability space.
One possible choice is Ω = {H, T }∗ , the set of finite sequences of H and T . That is,
{H, T}* = ∪_{n=1}^{∞} {H, T}^n.

This is another example of a probability space that is bigger than necessary, but easier to describe than the smallest possible one.
Example 2.7.4. Let Ω = {0, 1, 2, . . .}. Let F be the collection of subsets of Ω that are either finite or whose complement is finite. Is F closed under countable set operations?

No, F is not closed under countable set operations. For instance, {2n} ∈ F for each n ≥ 0, but

A := ∪_{n=0}^{∞} {2n},

the set of even numbers, is not in F: neither A nor its complement is finite.
Example 2.7.5. In a class with 24 students, what is the probability that no two students have the same birthday? (Assume that the 365 possible birthdays are equally likely.)

Let N = 365 and n = 24. The probability α that the n birthdays are all different is

α := (N/N) × ((N − 1)/N) × ((N − 2)/N) × · · · × ((N − n + 1)/N).

With n = 24 and N = 365 we find that α ≈ 0.46.
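A minimal Python computation of this product (the function name is ours):

def prob_all_distinct(n: int, N: int = 365) -> float:
    """Probability that n uniform, independent birthdays among N days are all distinct."""
    p = 1.0
    for k in range(n):
        p *= (N - k) / N
    return p

print(prob_all_distinct(24))  # ~0.462
print(prob_all_distinct(23))  # ~0.493: the probability is already below 1/2 at 23 students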
Example 2.7.6. Let A, B, C be three events. Assume that P(A) = 0.6, P(B) = 0.6, P(C) = . . ., together with the probabilities of the pairwise intersections and of the union. Find P(A ∩ B ∩ C).

Applying the inclusion-exclusion formula of Example 2.7.9 below to the given values and solving for the triple intersection, one finds that

P(A ∩ B ∩ C) = 0.2.
Example 2.7.7. Let Ω = {1, 2, 3, 4} and let F = 2^Ω be the collection of all the subsets of Ω. Find a collection of events A ⊂ F and two probability measures P1 and P2 on F such that (i) P1 and P2 agree on the events in A; (ii) the σ-field generated by A is F (this means that F is the smallest σ-field of Ω that contains A); and (iii) P1 ≠ P2.

Choose the point masses so that P1 and P2 are not the same, thus satisfying (iii). To check (i), note for instance that

P1({1, 2}) = P1({1}) + P1({2}) = 1/8 + 1/8 = 1/4,
P2({1, 2}) = P2({1}) + P2({2}) = 1/12 + 2/12 = 1/4.

To check (ii), we only need to check that for every k ∈ Ω, {k} can be formed by set operations on sets in A ∪ {∅, Ω}. Then any other set in F can be formed by set operations on the sets {k}.
Example 2.7.8. Choose a number randomly between 1 and 999999 inclusive, all choices being equally likely. What is the probability that the digits sum up to 23? For example, the number 7646 is between 1 and 999999 and its digits sum up to 23 (7+6+4+6=23).

Numbers between 1 and 999999 inclusive have 6 digits x1, . . . , x6, each with a value in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. We are interested in counting the solutions of x1 + x2 + x3 + x4 + x5 + x6 = 23 with these constraints.

First consider all nonnegative xi where each digit can range from 0 to 23. The number of ways to distribute 23 amongst the xi's is C(28, 5).
But we need to restrict the digits to xi < 10, so we need to subtract the number of ways in which some digit is 10 or more. Writing xk = 10 + yk, the number of ways to arrange 23 amongst the xi when some xk ≥ 10 is the same as the number of ways to arrange the yi so that y1 + · · · + y6 = 23 − 10, which is C(18, 5). There are 6 possible choices of the digit xk ≥ 10, so there are a total of 6 C(18, 5) ways for some digit to be greater than or equal to 10, as we can see by treating a block of ten stars as a single compound object.
However, the above counts some configurations multiple times. For instance, x1 = x2 = 10 is counted both when x1 ≥ 10 and when x2 ≥ 10. We need to account for these configurations that are counted multiple times. Consider the case when two digits are greater than or equal to 10, say xj ≥ 10 and xk ≥ 10 with j ≠ k. The number of ways to distribute 23 amongst the xi when two of them are greater than or equal to 10 is equivalent to the number of ways to distribute the yi when y1 + · · · + y6 = 23 − 10 − 10 = 3. There are C(8, 5) ways to distribute these yi and there are C(6, 2) ways to choose the two digits that are 10 or more.

Since the sum of the xi is 23, at most two of the xi can be 10 or more, so the inclusion-exclusion stops here. The probability that a randomly chosen number has digits that sum up to
23 is

[C(28, 5) − 6 C(18, 5) + C(6, 2) C(8, 5)] / 999999.
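The inclusion-exclusion count can be verified by brute force in a few lines of Python:

from math import comb

# Inclusion-exclusion count of 6-digit strings whose digits sum to 23.
count = comb(28, 5) - 6 * comb(18, 5) + comb(6, 2) * comb(8, 5)
print(count, count / 999999)  # 47712, ~0.0477

# Brute-force check over 1..999999, padding each number to 6 digits.
brute = sum(1 for n in range(1, 1000000)
            if sum(int(d) for d in f"{n:06d}") == 23)
print(brute)  # 47712 as well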
Example 2.7.9. Let A1, A2, . . . , An, n ≥ 2, be events. Prove that

P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An).

We argue by induction. First consider the base case n = 2: P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2). Assume the result holds true for n; we prove it for n + 1. We have

P(∪_{i=1}^{n+1} Ai) = P(∪_{i=1}^n Ai) + P(An+1) − P((∪_{i=1}^n Ai) ∩ An+1).

Expanding the first term by the induction hypothesis, and the last term, which equals P(∪_{i=1}^n (Ai ∩ An+1)), by the induction hypothesis applied to the events Ai ∩ An+1, and collecting terms yields the formula for n + 1.
Example 2.7.10. Let A1, A2, . . . be events such that Σn P(An) < ∞. Show that the probability that infinitely many of those events occur is zero. This result is known as the Borel-Cantelli Lemma.

To prove this result we must write the event “infinitely many of the events An occur” in terms of set operations on the An. That event is

A = ∩_{m=1}^{∞} ∪_{n=m}^{∞} An.
To see this, note that ω is in infinitely many An if and only if for all m ≥ 1 there is some
n ≥ m such that ω ∈ An .
Chapter 3

Conditional Probability and Independence

The theme of this chapter is how to use observations. The key idea is that observations modify our belief about the likelihood of events. The mathematical notion of conditional probability captures that idea.

3.1 Conditional Probability

Assume that we know that the outcome is in B ⊂ Ω. Given that information, what is the
probability that the outcome is in A ⊂ Ω? This probability is written P [A|B] and is read
“the conditional probability of A given B,” or “the probability of A given B”, for short.
For instance, one picks a card at random from a 52-card deck. One knows that the card
is black. What is the probability that it is the ace of clubs? The sensible answer is that
if one only knows that the card is black, then that card is equally likely to be any one of
the 26 black cards. Therefore, the probability that it is the ace of clubs is 1/26. Similarly,
given that the card is black, the probability that it is an ace is 2/26, because there are 2 black aces.

We can formulate that calculation as follows. Let A be the set of aces (4 cards) and B the set of black cards (26 cards). Then, P[A|B] = P(A ∩ B)/P(B) = (2/52)/(26/52) = 2/26.
Also, given that the outcome is in B, the probabilities of the outcomes in B should be rescaled by dividing them by P(B). This division does not modify the relative likelihood of the various outcomes in B. Accordingly, one defines

P[A|B] = P(A ∩ B)/P(B).
This definition of conditional probability makes sense if P (B) > 0. If P (B) = 0, we define
P [A|B] = 0. This definition is somewhat arbitrary but it makes the formulas valid in all
cases.
Note that
P (A ∩ B) = P [A | B]P (B).
3.2 Remark
Define P′(A) = P[A|B] for any event A. Then P′(·) is a new probability measure. In
particular, the usual formulas apply. For instance, P′(A ∩ C) = P′[A|C]P′(C), i.e.,

P[A ∩ C | B] = P[A | B ∩ C] P[C | B],

which you can verify by using the definition of P[·|B]. After a while, you should be able to write such identities directly.
3.3 Bayes' Rule

Let B1 and B2 be disjoint events whose union is Ω. Let also A be another event. We can write

P(A) = P(A ∩ B1) + P(A ∩ B2) = P[A|B1]P(B1) + P[A|B2]P(B2).

Hence,

P[B1|A] = P[A|B1]P(B1) / (P[A|B1]P(B1) + P[A|B2]P(B2)).
This formula extends to a finite number of events Bn that partition Ω. The result is
known as Bayes' rule. Think of the Bn as possible “causes” of some effect A. You know the
prior probabilities P (Bn ) of the causes and also the probability that each cause provokes
the effect A. The formula tells you how to calculate the probability that a given cause
provoked the observed effect. Applications abound, as we will see in detection theory. For
instance, your alarm can sound if there is a burglar but also if there is no burglar (a false
alarm). Given that the alarm sounds, what is the probability that it is a false alarm?
3.4 Independence
It may happen that knowing that an event occurs does not change the probability of another
event. In that case, we say that the events are independent. Let us look at an example
first.
3.4.1 Example 1
We roll two dice and we designate the pair of results by ω = (ω1, ω2). Then Ω has 36 elements, and we assume that each outcome ω has probability 1/36. Let A = {ω ∈ Ω | ω1 ∈ {1, 3, 4}} and B = {ω ∈ Ω | ω2 ∈ {3, 5}}.

Using the conditional probability formula, we find P[A|B] = P(A ∩ B)/P(B) = (6/36)/(12/36) = 1/2. Note also that P(A) = 18/36 = 1/2. Thus, in this example, P[A|B] = P(A).

The interpretation is that if we know the outcome of the second roll, we don't learn anything about the likelihood of the outcomes of the first roll.
3.4.2 Example 2
We pick two points independently and uniformly in [0, 1]. In this case, the outcome ω =
(ω1 , ω2 ) of the experiment (the pair of points chosen) belongs to the set Ω = [0, 1]2 . That
point ω is picked uniformly in [0, 1]2 . Let A = [0.2, 0.5] × [0, 1] and B = [0, 1] × [0.2, 0.8].
The interpretation of A is that the first point is picked in [0.2, 0.5]; that of B is that the
second point is picked in [0.2, 0.8]. Note that P (A) = 0.3 and P (B) = 0.6. Moreover, since
A ∩ B = [0.2, 0.5] × [0.2, 0.8], one finds that P(A ∩ B) = 0.3 × 0.6 = P(A)P(B). Thus, A and B are independent.
3.4.3 Definition
Motivated by the discussion above, we say that two events A and B are independent if
P (A ∩ B) = P (A)P (B).
Do not confuse “independent” and “disjoint.” If two events A and B are disjoint, then
they are independent only if at least one of them has probability 0. Indeed, if they are disjoint, then P(A ∩ B) = P(∅) = 0, so that independence requires P(A)P(B) = 0, i.e., P(A) = 0 or P(B) = 0. Intuitively, if A and B are disjoint, then knowing that A occurs implies that B
does not, which is some new information about B unless B is impossible in the first place.
Generally, we say that a collection of events {Ai, i ∈ I} are mutually independent if for any finite set J ⊂ I one has

P(∩_{i∈J} Ai) = Π_{i∈J} P(Ai).

Subtlety
The definition seems innocuous, but one has to be a bit careful. For instance, look at the following example. The sample space Ω = {1, 2, 3, 4} has four points that have a probability 1/4 each. The events A, B, C are defined as A = {1, 2}, B = {1, 3}, C = {2, 3}. We can verify that A and B are independent, that A and C are independent, and so are B and C. However, the events {A, B, C} are not mutually independent: P(A ∩ B ∩ C) = P(∅) = 0 ≠ P(A)P(B)P(C) = 1/8.
The point of the example is the following. Knowing that A has occurred tells us some-
thing about outcome ω of the random experiment. This knowledge, by itself, is not sufficient
to affect our estimate of the probability that C has occurred. The same is true if we know
that B has occurred. However, if we know that both A and B have occurred, then we
know that C cannot have occurred. Thus, it is not correct to think that “A does not tell
us anything about C, B does not tell us anything about C, therefore A and B do not tell
us anything about C.” I encourage you to think about this example carefully.
3.5 Summary

In this chapter we defined the conditional probability P[A|B] = P(A ∩ B)/P(B), derived Bayes' rule, and defined the independence and mutual independence of events.

3.6 Solved Problems

Example 3.6.1. That identity is false. Here is one counterexample. Let Ω = {1, 2, 3, 4} and pω = 1/4 for each ω ∈ Ω.
Example 3.6.2. There are two coins. The first coin is fair. The second coin is such that P(H) = 0.6 = 1 − P(T). You are given one of the two coins, each with probability 1/2. You flip the coin four times and three of the four outcomes are H. What is the probability that your coin is the fair one?

Let A designate the event “your coin is fair.” Let also B designate the event “three of the four flips show H.” Then

P[A|B] = P(A ∩ B)/P(B) = P[B|A]P(A) / (P[B|A]P(A) + P[B|A^c]P(A^c))
= C(4, 3)(1/2)^4 / (C(4, 3)(1/2)^4 + C(4, 3)(0.6)^3(0.4)) = 2^{−4} / (2^{−4} + (0.6)^3 × 0.4) ≈ 0.42.
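A short Python sketch of this Bayes' rule computation (the variable names are ours):

from math import comb

# Likelihood of seeing three heads in four flips, for each coin.
p_heads_fair = comb(4, 3) * 0.5 ** 3 * 0.5      # P[B | A]
p_heads_biased = comb(4, 3) * 0.6 ** 3 * 0.4    # P[B | A^c]

# The equal priors P(A) = P(A^c) = 1/2 cancel in the ratio.
posterior_fair = p_heads_fair / (p_heads_fair + p_heads_biased)
print(posterior_fair)  # ~0.42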
Example 3.6.3. Choose two numbers uniformly but without replacement in {0, 1, . . . , 10}. What is the probability that the sum is less than or equal to 10 given that the smallest is less than or equal to 5?

Draw a picture of the pairs of numbers in the events A (sum at most 10) and B (smallest at most 5). Since the pairs are equally likely and A ⊂ B,

P[A|B] = |A ∩ B| / |B| = |A| / |B|.
Your picture shows that |A| = 10 + 9 + 8 + · · · + 1 = 55 and that |B| = 10 × 5 + 4 × 5 = 70. Hence P[A|B] = 55/70 = 11/14.
Example 3.6.4. You flip a fair coin repeatedly. What is the probability that you have to flip it exactly ten times to see two ‘heads’?

There must be exactly one head among the first nine flips and the last flip must be a head. Hence, the probability is

9 × (1/2)^9 × (1/2) = 9/2^{10}.
Example 3.6.5. a. Let A and B be independent events. Show that A^c and B are independent.
b. Let A and B be two events. If the occurrence of event B makes A more likely, then
does the occurrence of the event A make B more likely? Justify your answer.
a. We have

P(A^c ∩ B) = P(B) − P(A ∩ B) = P(B) − P(A)P(B) = (1 − P(A))P(B) = P(A^c)P(B),

so that A^c and B are independent.
b. The occurrence of event B makes A more likely can be interpreted as P [A|B] > P (A).
Now,

P[A|B] = P(A ∩ B)/P(B) = P[B|A]P(A)/P(B) > P(A).

Hence P[B|A]/P(B) > 1, so that P[B|A] > P(B). Thus the occurrence of event A makes B more
likely.
Example 3.6.6. A man has 5 coins in his pocket. Two are double-headed, one is double-
tailed, and two are normal. The coins cannot be distinguished unless one looks at them.
a. The man shuts his eyes, chooses a coin at random, and tosses it. What is the probability that the lower face of the coin is a head?

b. He opens his eyes and sees that the upper face of the coin is a head. What is the probability that the lower face is a head?

c. He shuts his eyes again, picks up the same coin, and tosses it again. What is the probability that the lower face is a head?

d. He opens his eyes and sees that the upper face is a head. What is the probability that the lower face is a head?
Let D denote the event that he picks a double-headed coin, N denote the event that he
picks a normal coin, and Z be the event that he picks the double-tailed coin. Let HLi (and
HUi ) denote the event that the lower face (and the upper face) of the coin on the ith toss
is a head.
a. One has

P(HL1) = P[HL1|D]P(D) + P[HL1|N]P(N) + P[HL1|Z]P(Z) = (1)(2/5) + (1/2)(2/5) + (0)(1/5) = 3/5.
b. We find

P[HL1|HU1] = P(HL1 ∩ HU1)/P(HU1) = (2/5)/(3/5) = 2/3.
c. We write

P[HL2|HU1] = P(HL2 ∩ HU1)/P(HU1)
= (P[HL2 ∩ HU1|D]P(D) + P[HL2 ∩ HU1|N]P(N) + P[HL2 ∩ HU1|Z]P(Z))/P(HU1)
= ((1)(2/5) + (1/4)(2/5) + (0)(1/5))/(3/5) = 5/6.
d. Similarly, conditioning on D, N, and Z as in part c, one finds

P[HL2 | HU1 ∩ HU2] = ((1)(2/5) + (1/8)(2/5)) / ((1)(2/5) + (1/4)(2/5)) = (9/20)/(1/2) = 9/10.

Chapter 4

Random Variable
In this chapter we define a random variable and we illustrate the definition with examples.
We then define the expectation and moments of a random variable. We conclude the chapter with some useful inequalities and a collection of solved problems.

A random variable takes real values. The definition is: “a random variable is a measurable function from Ω into (−∞, +∞).” If the outcome of the random experiment is ω, then the value of the random variable is X(ω).
Physical examples: noise voltage at a given time and place, temperature at a given time
and place, height of the next person to enter the room, and so on. The color of a randomly
picked apple is not a random variable since its value is not a real number.
4.1 Measurability
For instance, let Ω = [0, 1] and A = [0, 0.5]. Assume that the events are [0, 1], [0, 0.5], (0.5, 1],
and ∅. Assume that we have defined P([0, 0.5]) = 0.73 and that this is all we
know. Consider the function X(ω) that takes the value 0 when ω is in [0, 0.3] and the
value 1 when ω is in (0.3, 1]. This function is not a random variable: we cannot determine P(X = 0) because the probability of [0, 0.3] is not defined. This is what we mean by measurability. Thus, measurability is not a subtle notion. It is a first-order idea: it asks what are the functions whose statistics are defined by the probabilities we have specified. (Recall that the collection of events is only required to be closed under countable set operations.)

Definition 4.1.1. A random variable on {Ω, F, P} is a function X : Ω → ℜ such that

X^{−1}(B) ∈ F, ∀B ∈ B,
where B is the Borel σ-field of ℜ, i.e., the smallest σ-field that contains the intervals.
4.2 Distribution
If X takes only countably many values x1, x2, . . ., it is a discrete random variable and the collection of probabilities {P(X = xn), n ≥ 1} is then called the Probability Mass Function (pmf) of the random variable X.

More generally, the function FX(x) := P(X ≤ x), x ∈ ℜ - called the cumulative distribution function (cdf) of X - completely characterizes the “statistics” of X. For short, we also call FX the distribution of X.

A function F(·) : ℜ → ℜ is the cdf of some random variable if and only if it is nondecreasing, right-continuous, and satisfies

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.   (4.2.1)

These properties follow from the continuity of probability: if An ↓ A, then P(An) ↓ P(A). For instance, since (−∞, x] ↓ ∅ as x ↓ −∞, it follows that FX(x) ↓ 0 as x ↓ −∞. The fact that a function with these properties is the cdf of some random variable can be seen from the construction of Section 4.4 below.

We say that X is a continuous random variable if

P(a < X ≤ b) = ∫_a^b fX(x) dx

for all real numbers a < b. In this expression, fX(·) is a nonnegative function called the probability density function (pdf) of X. The pdf fX(·) is the derivative of the cdf FX(·).

Obviously, a discrete random variable is not continuous. Also, a random variable may be neither discrete nor continuous. In that case, one can write its pdf formally as

fX(x) = g(x) + Σn pn δ(x − xn),

where g(·) is the derivative of FX(·) where it is differentiable and δ(x − xn) is a Dirac impulse at xn. The impulse is defined formally by the property that ∫ g(x)δ(x − x0)dx = g(x0) whenever g(·) is a function that is continuous at x0. With this formal definition, you can write the pdf of random variables that are neither discrete nor continuous, as we do in some examples below.

4.3 Examples of Random Variable
We say that the random variable X has a Bernoulli distribution with parameter p ∈ [0, 1] if

P(X = 1) = p and P(X = 0) = 1 − p. (4.3.1)

The random variable X has a binomial distribution with parameters n ∈ {1, 2, . . .} and p ∈ [0, 1], and we write X =D B(n, p), if

P(X = k) = C(n, k) p^k (1 − p)^{n−k}, for k = 0, 1, . . . , n. (4.3.2)

The random variable X has a geometric distribution with parameter p ∈ (0, 1], and we write X =D G(p), if

P(X = n) = (1 − p)^n p, for n ≥ 0. (4.3.3)

A random variable X with a geometric distribution has the remarkable property of being memoryless:

P[X ≥ n + m | X ≥ n] = P(X ≥ m), ∀ m, n ≥ 0.

Indeed, since P(X ≥ n) = (1 − p)^n,

P[X ≥ n + m | X ≥ n] = P(X ≥ n + m)/P(X ≥ n) = (1 − p)^{n+m}/(1 − p)^n = (1 − p)^m = P(X ≥ m). (4.3.4)
The interpretation is that if X is the lifetime of a light bulb (in years, say), then the residual
lifetime X − n of that light bulb, if it is still alive after n years, has the same distribution
as that of a new light bulb. Thus, if light bulbs had a geometrically distributed lifetime, there would be no advantage in replacing a working old bulb with a new one.
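A quick Monte Carlo check of the memoryless property, using the convention P(X = n) = (1 − p)^n p of (4.3.3) (all names are ours):

import random

def geometric(p: float) -> int:
    """Number of failures before the first success: P(X = n) = (1 - p)**n * p."""
    n = 0
    while random.random() > p:
        n += 1
    return n

random.seed(0)
p, n0, m = 0.3, 2, 3
samples = [geometric(p) for _ in range(200_000)]
survivors = [x for x in samples if x >= n0]
lhs = sum(1 for x in survivors if x >= n0 + m) / len(survivors)  # P[X >= n+m | X >= n]
rhs = sum(1 for x in samples if x >= m) / len(samples)           # P(X >= m)
print(lhs, rhs, (1 - p) ** m)  # all three are close to 0.343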
The random variable X has a Poisson distribution with parameter λ > 0, and we write
X =D P(λ), if

P(X = n) = (λ^n / n!) e^{−λ}, for n ≥ 0. (4.3.5)
The random variable X is uniformly distributed in the interval [a, b] where a < b, and we write X =D U[a, b], if

fX(x) = 1/(b − a) if x ∈ [a, b], and fX(x) = 0 otherwise. (4.3.6)
The random variable X is exponentially distributed with rate λ > 0, and we write X =D Exd(λ), if

fX(x) = λe^{−λx} if x > 0, and fX(x) = 0 otherwise. (4.3.7)

4.4 Generating Random Variables
Methods to generate a random variable X with a given distribution from uniform random
variables are useful in simulations and provide a good insight into the meaning of the cdf and of the pdf.
The first method is to generate a random variable Z uniform in [0, 1] (using the random
number generator of your computer) and to define X(Z) = min{a|F (a) ≥ Z}. Then, for
any real number b, X ≤ b if Z ≤ F (b), which occurs with probability F (b), as desired.
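For a cdf with a closed-form inverse, this first method is one line. A sketch for the exponential distribution (names ours):

import math
import random

def exponential_by_inversion(lam: float) -> float:
    """X = min{a : F(a) >= Z} with F(a) = 1 - exp(-lam * a) and Z uniform
    on [0, 1], which gives X = -ln(1 - Z) / lam."""
    z = random.random()
    return -math.log(1.0 - z) / lam

random.seed(1)
xs = [exponential_by_inversion(2.0) for _ in range(100_000)]
print(sum(xs) / len(xs))  # ~0.5 = 1/lam, as expected for Exd(2)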
The second method uses the pdf. Assume that X is continuous with pdf f(x) and that P(a < X < b) = 1 and f(x) ≤ c. Pick a point (X, Y) uniformly in [a, b] × [0, c] (by generating two uniform random variables). If the point falls under the curve f(·), i.e., if Y ≤ f(X), then keep the value X; otherwise, repeat. To see why this works, let B be the event that the point is accepted. Then, for small ε > 0, P(a < X < a + ε) = P[A|B] where A := {(x, y) | a < x < a + ε, y < f(x)} ≈ [a, a + ε] × [0, f(a)]. Then, P[A|B] = P(A)/P(B) with P(A) ≈ f(a)ε/[(b − a)c] and P(B) = 1/[(b − a)c]. Hence P[A|B] ≈ f(a)ε, as desired. (The factor 1/[(b − a)c] normalizes our uniform distribution on [a, b] × [0, c], and the 1 in the numerator of P(B) comes from the fact that the pdf f(·) integrates to 1.)
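Here is a sketch of this second method for the density f(x) = 6x(1 − x) on [0, 1], whose maximum value is 1.5 (names ours):

import random

def rejection_sample(f, a: float, b: float, c: float) -> float:
    """Draw (X, Y) uniformly in [a, b] x [0, c]; keep X when Y <= f(X)."""
    while True:
        x = random.uniform(a, b)
        y = random.uniform(0.0, c)
        if y <= f(x):
            return x

random.seed(2)
f = lambda x: 6.0 * x * (1.0 - x)  # a pdf on [0, 1] with peak value 1.5 at x = 1/2
xs = [rejection_sample(f, 0.0, 1.0, 1.5) for _ in range(50_000)]
print(sum(xs) / len(xs))  # ~0.5, the mean of this density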
4.5 Expectation
Imagine that you play a game of chance a large number K of times. Each time you play, you win the amount xn with probability pn, for n = 1, 2, . . .. If our interpretation of probability is correct, you expect to win xn approximately Kpn times out of K. Accordingly, your total earnings should be approximately equal to Σn K xn pn. Thus, your earnings should average Σn xn pn per instance of the game. That value, the average earnings per experiment, is the interpretation of the expected value of the random variable X that represents your earnings. One defines

E(X) := Σn xn pn.   (4.5.1)

There are some potential problems. The sums could yield ∞ − ∞. In that case, we say that the expectation is not defined.

4.6 Function of Random Variable
Let X be a random variable and h : ℜ → ℜ be a function. Since X is some function from Ω to ℜ, so is h(X). Is h(X) a random variable? Well, there is that measurability question.
We must check that (h(X))^{−1}((−∞, a]) ∈ F for every a ∈ ℜ. That property holds if h(·) is Borel-measurable, i.e., if

h^{−1}(B) ∈ B, ∀B ∈ B.

All the functions from ℜ to ℜ we will encounter are Borel-measurable.
Using Definition 4.1.1 we see that if h is Borel-measurable, then, for all B ∈ B, one
has A = h−1 (B) ∈ B, so that (h(X))−1 (B) = X −1 (A) ∈ F, which proves that h(X) is a
random variable.
In some cases, if X has a pdf, Y = h(X) also has one. For instance, if X has pdf fX(·) and Y = aX + b with a > 0, then

P(y < Y < y + dy) = P((y − b)/a < X < (y − b)/a + dy/a) = fX((y − b)/a)dy/a,
so that the pdf of Y , say fY (y), is fY (y) = fX ((y − b)/a)/a. We highlight that useful result
below:
If Y = aX + b, then fY(y) = (1/|a|) fX((y − b)/a). (4.6.1)
By adapting the above argument, you can check that if h(·) is differentiable and one-to-one, then

fY(y) = fX(x)/|h′(x)|, evaluated at x such that h(x) = y. (4.6.2)
For instance, if X =D U[0, 1] and Y = (3 + X)², then

fY(y) = (1/(2(3 + x))) 1{x ∈ [0, 1]} at (3 + x)² = y, i.e., fY(y) = (1/(2√y)) 1{y ∈ [9, 16]}.
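A Monte Carlo check of this density, under the reading Y ∈ [9, 16] above (names ours):

import random

random.seed(3)
n = 200_000
ys = [(3.0 + random.random()) ** 2 for _ in range(n)]  # Y = (3 + X)^2, X ~ U[0, 1]

# Empirical P(11 < Y <= 13) versus the integral of 1/(2*sqrt(y)) over [11, 13].
emp = sum(1 for y in ys if 11.0 < y <= 13.0) / n
exact = 13.0 ** 0.5 - 11.0 ** 0.5  # sqrt(13) - sqrt(11)
print(emp, exact)  # both ~0.289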
Assume now that X is a discrete random variable. One finds that

E(h(X)) = Σn h(xn) pn, where pn = P(X = xn).
If you think about it, it looks a bit like magic. However, it is not hard to understand what is going on. This expression is useful because you do not have to calculate the probability distribution of Y = h(X). Similarly, if X is a continuous random variable with pdf fX(·), then

E(h(X)) = ∫ h(x) fX(x) dx.

Again, you do not have to calculate the pdf of Y. For instance, with X =D U[0, 1] and Y = (3 + X)² as above, E(Y) = ∫_0^1 (3 + x)² dx = 37/3. If we do the change of variables y = (3 + x)², so that dy = 2(3 + x)dx = 2√y dx, then we find ∫_9^{16} y (1/(2√y)) dy = 37/3, the same value computed with the pdf of Y.

4.7 Moments of Random Variable
The nth moment of X is defined as E(X^n). The variance of X is

var(X) := E((X − E(X))²) = E(X²) − (E(X))².
The variance measures the “spread” of the distribution around the mean. A random
variable with a zero variance is constant. The larger the variance, the more “uncertain”
the random variable is, in the mean square sense. Note that I say “uncertain” and not
“variable” since you know by now that a random variable does not vary.
4.8 Inequalities
Inequalities are often useful to estimate some expected values. Here are a few particularly
useful ones.
Exponential bound:
1 + x ≤ exp{x}.
Chebychev:

P(|X − E(X)| ≥ a) ≤ var(X)/a², for all a > 0.
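A quick empirical look at Chebychev's inequality for X =D U[0, 1], where E(X) = 1/2 and var(X) = 1/12 (a sketch, names ours):

import random

random.seed(4)
n, a = 200_000, 0.4
xs = [random.random() for _ in range(n)]

freq = sum(1 for x in xs if abs(x - 0.5) >= a) / n
bound = (1.0 / 12.0) / a ** 2
print(freq, bound)  # ~0.2 <= ~0.52: the bound holds, though loosely here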
4.9 Summary
A random variable is a function X : Ω → ℜ such that X^{−1}((−∞, x]) ∈ F for all x ∈ ℜ. We
We also defined the pmf and pdf that summarize the distribution of the random variable.
We introduced the moments and the variance and we stated a few useful inequalities.
You should become familiar with the distributions we introduced. We put a table that summarizes them at the end of the notes.

4.10 Solved Problems

Example 4.10.1. Let X be a continuous random variable with pdf fX(x) = cx(1 − x) for x ∈ [0, 1] and fX(x) = 0 otherwise.
a. Find c;
b. Find P(1/2 < X ≤ 3/4);
c. Find the cdf FX(·) of X;
d. Find E(X) and var(X).
a. We need ∫_{−∞}^{∞} fX(x)dx = 1. Now,

∫_{−∞}^{∞} fX(x)dx = ∫_0^1 cx(1 − x)dx = c [x²/2 − x³/3]_0^1 = c(1/2 − 1/3) = c/6.

Thus c = 6.
b. We find

P(1/2 < X ≤ 3/4) = ∫_{1/2}^{3/4} 6x(1 − x)dx = 6 [x²/2 − x³/3]_{1/2}^{3/4} = 6(9/32 − 9/64 − 1/8 + 1/24) = 11/32.
c. The cdf is FX(x) = P(X ≤ x). For x < 0, FX(x) = 0. For 0 ≤ x < 1, FX(x) = ∫_0^x 6y(1 − y)dy = 3x² − 2x³. For x ≥ 1, FX(x) = 1. Hence the cdf is

FX(x) = 0 if x < 0; 3x² − 2x³ if 0 ≤ x < 1; 1 if x ≥ 1.
d. We find

E(X) = ∫_{−∞}^{∞} x fX(x)dx = ∫_0^1 6x²(1 − x)dx = [2x³]_0^1 − [(3/2)x⁴]_0^1 = 2 − 3/2 = 0.5.

Also,

E(X²) = ∫_0^1 6x³(1 − x)dx = 6(1/4 − 1/5) = 3/10.

Hence,

var(X) = E(X²) − (E(X))² = 3/10 − (1/2)² = 1/20.
Example 4.10.2. Give an example of a probability space {Ω, F, P} and of a function X : Ω → ℜ that is not a random variable on that space.

Let Ω = {0, 1, 2}, F = {∅, {0}, {1, 2}, Ω}, P({0}) = 1/2 = P({1, 2}), and X(ω) = ω for ω ∈ Ω. Then X is not a random variable on {Ω, F, P} because X^{−1}((−∞, 1]) = {0, 1} ∉ F.

The meaning of all this is that the probability space is not rich enough to specify P(X ≤ 1).
Example 4.10.3. Define the random variable X as follows. You throw a dart uniformly
in a circle with radius 5. The random variable X is equal to 2 minus the distance between
the dart and the center of the circle if this distance is less than or equal to one. Otherwise,
X is equal to 0.
a. Explain why X is a random variable from Ω into (−∞, +∞).
b. Give the mathematical expression for the probability density function f(x) of X for x ∈ (−∞, +∞).

a. The events {X ≤ x} correspond to disks or rings of the dartboard, whose probabilities are defined; thus X is a random variable.

b. Let Y be the distance between the dart and the center of the circle. Then P(Y ≤ y) = y²/25 for 0 ≤ y ≤ 5, so that Y has pdf fY(y) = 2y/25 on [0, 5]. Now, X = 2 − Y if Y ≤ 1. Also, X = 0 if Y > 1, which occurs with probability (25 − 1)/25 = 24/25. These observations show that

f(x) = (24/25)δ(x) + ((4 − 2x)/25) 1{1 < x < 2}.
Example 4.10.4. Express the cdf of the following random variables in terms of FX(·):
a. X⁺ := max{0, X};
b. −X;
c. X⁻ := max{0, −X};
d. |X|.

a. One has

P(X⁺ ≤ x) = 0 if x < 0, and P(X⁺ ≤ x) = P(X ≤ x) = FX(x) if x ≥ 0.

b. P(−X ≤ x) = P(X ≥ −x) = 1 − P(X < −x) = 1 − FX((−x)−), where FX(y−) denotes the left limit of FX at y.

c. Note that X⁻ = (−X)⁺, so its cdf follows from parts a and b above. Hence,

P(X⁻ ≤ x) = 0 if x < 0, and P(X⁻ ≤ x) = 1 − FX((−x)−) if x ≥ 0.

d. Since {|X| ≤ x} = {X ≤ x} \ {X < −x} for x ≥ 0, we find

P(|X| ≤ x) = 0 if x < 0, and P(|X| ≤ x) = FX(x) − FX((−x)−) if x ≥ 0.
Example 4.10.5. A dart is flung at a circular dartboard of radius 3. Suppose that the probability that the dart lands in some region A of the dartboard is proportional to the area |A| of A. Find the cdf and the expected value of the score X in the following cases.
a. The board is divided by concentric circles of radii 1, 2, and 3 into three rings, and the score is X = i if the dart lands in the ith ring, counting inward from the outer ring, for i = 1, 2, 3.
b. X = 3 − Z where Z is the distance between the dart and the center of the board.
c. Assume now that the player has some probability 0.3 of missing the target altogether. If he does not miss, he hits an area A with a probability proportional to |A|. The score X is as in part b, with X = 0 if the player misses the board.

a.i. The areas of the three rings are 5π, 3π, and π, so P(X = 1) = 5/9, P(X = 2) = 3/9, and P(X = 3) = 1/9. Hence,

FX(x) = 0 if x < 1; 5/9 if 1 ≤ x < 2; 8/9 if 2 ≤ x < 3; 1 if x ≥ 3.

a.ii. Accordingly,

E(X) = 1 × (5/9) + 2 × (1/3) + 3 × (1/9) = 14/9.

b.i. One has P(Z ≤ z) = z²/9 for 0 ≤ z ≤ 3, so that

FX(x) = P(3 − Z ≤ x) = P(Z ≥ 3 − x) = 1 − (3 − x)²/9, for 0 ≤ x ≤ 3.

b.ii. The pdf is fX(x) = 2(3 − x)/9 on [0, 3], so E(X) = ∫_0^3 x (2(3 − x)/9) dx = 1.

c.i. Let Y be the score given that the player does not miss the target. Then Y has the cdf that we derived in part b. The score X of the player who misses the target with probability 0.3 is equal to 0 with probability 0.3 and to Y with probability 0.7. Hence, FX(x) = 0.3 × 1{x ≥ 0} + 0.7 FY(x). That is,

FX(x) = 0 if x < 0; 1 − 0.7(3 − x)²/9 if 0 ≤ x < 3; 1 if x ≥ 3.

c.ii. From the definition of X in terms of Y we see that E(X) = 0.3 × 0 + 0.7 × E(Y) = 0.7 × 1 = 0.7.
Example 4.10.6. Suppose you put m balls randomly in n boxes. Each box can hold an arbitrarily large number of balls. What is the expected number of empty boxes?

Designate by p the probability that the first box is empty. Let Xk be equal to 1 when box k is empty and to zero otherwise, for k = 1, . . . , n. The number of empty boxes is X = X1 + · · · + Xn, so that, by symmetry, E(X) = nE(X1) = np. Now,

p = ((n − 1)/n)^m.

Indeed, p is the probability of the intersection of the independent events Ak for k = 1, . . . , m, where Ak is the event that the kth ball misses the first box and P(Ak) = (n − 1)/n. Hence E(X) = n((n − 1)/n)^m.
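A simulation sketch confirming E(X) = n((n − 1)/n)^m (names ours):

import random

def empty_boxes(m: int, n: int) -> int:
    """Throw m balls uniformly at random into n boxes; count the empty boxes."""
    occupied = set(random.randrange(n) for _ in range(m))
    return n - len(occupied)

random.seed(5)
m, n, trials = 10, 5, 50_000
avg = sum(empty_boxes(m, n) for _ in range(trials)) / trials
print(avg, n * ((n - 1) / n) ** m)  # both ~0.54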
Example 4.10.7. A cereal company is running a promotion for which it is giving a toy in
every box of cereal. There are n different toys and each box is equally likely to contain any
one of the n toys. What is the expected number of boxes of cereal you have to purchase to collect all n toys?
Assume that you have just collected m distinct toys, for some m = 0, . . . , n − 1. Designate by Xm the random number of boxes you have to purchase until you collect another different toy. With probability (n − m)/n, the next box contains a new toy. Otherwise, with probability m/n, it contains a toy you already have and you are back in the same situation; in that case, write Xm = 1 + Ym, where Ym is distributed like Xm and designates the additional number of boxes that you purchase until you get another different toy. Hence,

E(Xm) = ((n − m)/n) × 1 + (m/n) × E(1 + Ym) = 1 + (m/n)E(Ym) = 1 + (m/n)E(Xm).

Solving, we find E(Xm) = n/(n − m). Finally, the expected number of boxes we have to purchase is

E(X0 + X1 + · · · + X_{n−1}) = Σ_{m=0}^{n−1} E(Xm) = Σ_{m=0}^{n−1} n/(n − m) = n(1 + 1/2 + 1/3 + · · · + 1/n).
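A simulation sketch of this coupon collector's result (names ours):

import random

def boxes_until_complete(n: int) -> int:
    """Buy cereal boxes until all n distinct toys have been collected."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        boxes += 1
    return boxes

random.seed(6)
n, trials = 10, 20_000
avg = sum(boxes_until_complete(n) for _ in range(trials)) / trials
print(avg, n * sum(1.0 / k for k in range(1, n + 1)))  # both ~29.3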
Example 4.10.8. You pick a point P with a uniform distribution in [0, 1]². Let Θ denote the angle made between the x-axis and the line segment that joins (0, 0) to the point P. Find E(Θ).

Since P is chosen uniformly on the square, the probability that P lies within some region of the square is the area of that region. In particular, for 0 ≤ θ ≤ π/4, P(Θ ≤ θ) = tan(θ)/2, the area of the triangle under the segment at angle θ. Carrying out the integration for E(Θ) (using this cdf and the symmetric expression for θ ∈ [π/4, π/2]), the logarithmic terms cancel:

E(Θ) = (1/2)[ln(√2) + π/4 − (ln(√2) − π/4)] = π/4.

This also follows by symmetry: Θ and π/2 − Θ have the same distribution.
Example 4.10.9. Consider the random variable X whose cdf FX(·) is shown in Figure 4.1(a).
a. Show that FX(·) is a valid cdf;
e. Find fX(x);
g. Calculate E(X).

a. Figure 4.1(a) shows the cdf FX(x). To show that FX(x) is indeed a cdf we must verify the properties (4.2.1). The figure shows that FX(·) satisfies these properties.
Figure 4.1: (a) The cdf FX(x); (b) the pdf fX(x).
g. According to (4.5.1),

E[X] = 0 × 0.3 + 2 × 0.4 + 3 × 0.1 + ∫_2^3 x × 0.2 dx = 0.8 + 0.3 + 0.2 × [x²/2]_2^3 = 1.1 + 0.2 × 2.5 = 1.6.
Example 4.10.10. Let X and Y be independent random variables with common cdf F(·) and pdf f(·).
a. Show that V = max{X, Y} has distribution function FV(v) = F(v)² and density fV(v) = 2F(v)f(v).
b. Show that U = min{X, Y} has distribution function FU(u) = 1 − (1 − F(u))² and density fU(u) = 2f(u)(1 − F(u)).
Now let X and Y be independent random variables each having the uniform distribution on [0, 1].
c. Find E(U).
d. Find cov(U, V).
Finally, let X and Y be independent and exponentially distributed with rate 1.
e. Identify the distribution of U.
f. Find var(V).

a. We find

FV(v) = P(V ≤ v) = P(max{X, Y} ≤ v) = P(X ≤ v)P(Y ≤ v) = F(v)².

Differentiate the cdf to get the pdf by using the chain rule: fV(v) = (d/dv)FV(v) = 2F(v)f(v).

b. Similarly, FU(u) = P(U ≤ u) = 1 − P(X > u, Y > u) = 1 − (1 − F(u))². Differentiate the cdf to get the pdf by using the chain rule: fU(u) = (d/du)FU(u) = 2f(u)(1 − F(u)).

c. For the uniform distribution, F(u) = u on [0, 1], so fU(u) = 2(1 − u) and E(U) = ∫_0^1 2u(1 − u)du = 1/3. Similarly, fV(v) = 2v and E(V) = 2/3.

d. Since UV = XY,

cov[U, V] = E[(U − E[U])(V − E[V])] = E[UV] − E[U]E[V] = E[XY] − E[U]E[V]
= ∫_0^1 x dx × ∫_0^1 y dy − (1/3)(2/3) = 1/4 − 2/9 = 1/36.

e. With F(u) = 1 − e^{−u} and f(u) = e^{−u}, one finds fU(u) = 2f(u)(1 − F(u)) = 2e^{−2u}. Thus U is an exponential random variable with mean 1/2.

f. Here fV(v) = 2(e^{−v} − e^{−2v}) for v ≥ 0, so, integrating by parts,

E[V²] = ∫_0^∞ 2v²(e^{−v} − e^{−2v})dv = 2 × 2 − 2 × (1/4) = 7/2.

Also, E[V] = E[X] + E[Y] − E[U] = 1 + 1 − 1/2 = 3/2. Hence

var[V] = E[V²] − E[V]² = 7/2 − (3/2)² = 5/4.
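A simulation sketch of parts c and d for uniform X and Y (names ours):

import random

random.seed(7)
n = 200_000
pairs = [(random.random(), random.random()) for _ in range(n)]
us = [min(x, y) for x, y in pairs]  # U = min{X, Y}
vs = [max(x, y) for x, y in pairs]  # V = max{X, Y}

eu = sum(us) / n
ev = sum(vs) / n
cov_uv = sum(u * v for u, v in zip(us, vs)) / n - eu * ev
print(eu, ev, cov_uv)  # ~1/3, ~2/3, ~1/36 = 0.0278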
Example 4.10.11. Choose X in [0, 1] as follows. With probability 0.2, X = 0.3; with
probability 0.3, X = 0.7; otherwise, X is uniformly distributed in [0.2, 0.5] ∪ [0.6, 0.8]. (a).
Plot the c.d.f. of X; (b) Find E(X); (c) Find var(X); (d) Calculate P [X ≤ 0.3 | X ≤ 0.7].
a. Figure 4.2 shows the p.d.f. and the c.d.f. of X. Note that the value of the density is 1 on [0.2, 0.5] ∪ [0.6, 0.8], since the remaining probability 1 − 0.2 − 0.3 = 0.5 is spread uniformly over a set of total length 0.5.

Figure 4.2: The pdf and the cdf of X.

b. Consequently,

E(X) = 0.2 × 0.3 + 0.3 × 0.7 + ∫_{0.2}^{0.5} x dx + ∫_{0.6}^{0.8} x dx = 0.06 + 0.21 + 0.105 + 0.14 = 0.515.

c. Similarly, E(X²) = 0.2 × 0.09 + 0.3 × 0.49 + ∫_{0.2}^{0.5} x² dx + ∫_{0.6}^{0.8} x² dx ≈ 0.3027, so that var(X) = E(X²) − (E(X))² ≈ 0.3027 − 0.2652 = 0.0375.
d. Finally,

P[X ≤ 0.3 | X ≤ 0.7] = P(X ≤ 0.3)/P(X ≤ 0.7) = 0.3/0.9 = 1/3.
Example 4.10.12. Let X be uniformly distributed in [0, 10]. Find the cdf of the following random variables:
a. Y := max{2, min{4, X}};
b. Z := 2 + X²;
c. V := |X − 4|;
d. W := sin(2πX).

a. FY(y) = P(Y ≤ y) = P(max{2, min{4, X}} ≤ y). Note that Y ∈ [2, 4], so that FY(y) = 0 for y < 2 and FY(y) = 1 for y ≥ 4.
Let y ∈ [2, 4). We see that Y ≤ y if and only if X ≤ y, which occurs with probability y/10.

Figure: The cdf FY(y) and the pdf fY(y).

Hence,

FY(y) = 0 if y < 2; y/10 if 2 ≤ y < 4; 1 if y ≥ 4.
Accordingly,

fY(y) = 0.2 δ(y − 2) + (1/10) 1{2 < y < 4} + 0.6 δ(y − 4).

b. FZ(z) = P(Z ≤ z) = P(2 + X² ≤ z) = P(X ≤ √(z − 2)). Consequently,

FZ(z) = 0 if z < 2; √(z − 2)/10 if 2 ≤ z < 102; 1 if z ≥ 102.

Also,

fZ(z) = 1/(20√(z − 2)) if 2 < z < 102, and fZ(z) = 0 otherwise.
c. FV(v) = P(V ≤ v) = P(|X − 4| ≤ v) = P(4 − v ≤ X ≤ 4 + v). Hence,

FV(v) = 0 if v < 0; 0.2v if 0 ≤ v < 4; 0.1v + 0.4 if 4 ≤ v < 6; 1 if v ≥ 6.

Also,

fV(v) = 0.2 if 0 < v ≤ 4; 0.1 if 4 < v ≤ 6; and fV(v) = 0 otherwise.
d. Note that W ∈ [−1, 1], so FW(−1−) = 0 and FW(1) = 1. The interesting case is w ∈ (−1, 1). A picture of the sine curve shows that

FW(w) = 0 if w < −1; 0.5 + (1/π) sin^{−1}(w) if −1 ≤ w < 1; 1 if w ≥ 1.
Example 4.10.13. Assume that a dart flung at a target hits a point ω uniformly distributed in [0, 1]². The random variables X(ω), Y(ω), Z(ω) are defined as follows. X(ω) is the minimum distance between ω and the sides of the square. Y(ω) is the maximum distance between ω and the sides of the square. Z(ω) is the distance between ω and a fixed vertex of the square. Find the cdf and pdf of each of these random variables.
Figure: The events A, B, C1, and C2 in the unit square.
A = {ω | X ≥ x};
B = {ω | Y ≤ x};
C1 = {ω | Z ≤ x1 } when x1 ≤ 1;
C2 = {ω | Z ≤ x2 } when x2 > 1.
Note the difference in labels on the axes for the events A and B. For C1 and C2 , the
Accordingly,
0, if x < 0
fX (x) = 4(1 − 2x), if 0 ≤ x ≤ 0.5
0, if x ≥ 0.5.
b. Similarly,
62 CHAPTER 4. RANDOM VARIABLE
0, if x < 0.5
FY (x) = (2x − 1)2 , if 0.5 ≤ x ≤ 1
1, if x ≥ 1.5.
Accordingly,
0, if x < 0
fY (x) = 4(2x − 1), if 0.5 ≤ x ≤ 1
0, if x > 1.
c. The area of C1 is πx21 /4. That of C2 consists of a rectangle [0, v] × [0, 1] plus the
p
integral over uin[v, 1] of x22 − u2 . One finds
1
2 πz 0≤z<1
1
√
fZ (z) = 2 πz − 2zcos−1 ( z1 ) 1 ≤ z < 2
√
0 z≥ 2
Example 4.10.14. A circle of unit radius is thrown on an infinite sheet of graph paper
that is grid-ruled with a square grid with squares of unit side. Assume that the center of the
circle is uniformly distributed in the square in which it falls. Find the expected number of
There is a very difficult way to solve the problem and a very easy way. The difficult
way is as follows. Let X be the number of vertex points that fall in the circle. We find
P (X = k) for k = 1, 2, . . . and we compute the expectation. This is very hard because the
sets of possible locations of the center of the circle for these various events are complicated
intersections of circles.
The easy way is as follows. We consider the four vertices of the square in which the
center of the circle lies. For each of these vertices, there is some probability p that it is in
(2,1)
(0,0) (0,1)
The key observation here is that the average value of a sum of random variables is the sum
of their average values, even when these random variables are not independent.
It remains to calculate p. To do that, note that the set of possible locations of the center
of the circle in a given square such that one vertex is in the circle is a quarter-circle with
Example 4.10.15. Ten numbers are selected from {1, 2, 3, . . . , 30} uniformly and without
replacement. Find the expected value of the sum of the selected numbers.
Let X1 , . . . , X10 be the ten numbers you pick in {1, 2, . . . , 30} uniformly and without
replacement. Then E(X1 + · · · + X10 ) = E(X1 ) + · · · + E(X10 ). Consider any Xk for some
trick is to avoid looking at the joint distribution of the Xi , as in the previous example.
Find the value of a that minimizes the average value of the square distance between the point
Let the random variable Z be the squared distance between (X, 0) and (a, 1). That is,
Z = (X − a)2 + (0 − 1)2 .
d 2
(a − 2aE[X] − E[X 2 ] + a).
da
We find that the value of a for which this expression is equal to zero is a = E(X).
d2
The value of da2
(a2 − 2aE[X] − E[X 2 ] + a) for a = E(X) is equal to 2. Since this is
The idea is that X is a lifetime of an item whose residual lifetime X gets longer as it
gets older. An example would be an item whose lifetime is either Exd(1) or Exd(2), each
with probability 0.5, say. As the item gets older, it becomes more likely that its lifetime
is Exd(1) (i.e., with mean 1) instead of Exd(2) (with mean 1/2). Let’s do the math to
Hence,
Z ∞
P (X > a) = fX (x)dx = 0.5e−a + 0.5e−2a , a ≥ 0,
a
so that
0.5e−(a+b) + 0.5e−2(a+b)
P [X > a + b | X > a] = .
0.5e−a + 0.5e−2a
4.10. SOLVED PROBLEMS 65
Here, we can choose a pdf that decays faster than exponentially. Say that the lifetime
has a density
fX (x) = A exp{−x2 }
R∞
where A is such that 0 fX (x)dx = 1. The property we are trying to verify is equivalent
to
Z ∞ Z ∞ Z ∞
fX (x)dx < fX (x)dx fX (x)dx,
a+b a b
or
Z ∞ Z ∞ Z ∞ Z ∞
fX (x)dx fX (x)dx < fX (x)dx fX (x)dx.
0 a+b a b
That is,
where
Z Z
2 +y 2 )
φ(D) = e−(x dxdy
D
for a set D ⊂ <2 and A = [0, b]×[a+b, ∞), B = [b, ∞)×[a+b, ∞), and C = [b, ∞)×[a, a+b].
To show φ(A) < φ(C), we note that each point (x, a + b + y) in A corresponds to a point
(b + y, a + x) in C and
2 +(a+b+y)2 ) 2 +(a+x)2 )
e−(x < e−((b+y) ,
by convexity of g(z) = z 2 .
66 CHAPTER 4. RANDOM VARIABLE
Example 4.10.19. Suppose that the number of telephone calls made in a day is a Poisson
a. What is the probability that more than 1142 calls are made in a day?
Let N denote the number of telephone calls made in a day. N is Poisson with mean 1000
e−1000 1000n
so the pmf is// P (N = n) = n! and V ar[N ] = 1000.
P∞ 1000n
a. P (N > 1142) = e−1000 n=1143 n! .
E[N ] 1000
b. P (N > 1142) = P (N ≥ 1143) ≤ 1143 = 1143 .
V ar[N ] 1000
c. P (N > 1142) = P (N ≥ 1143) ≤ P (|N − E[N ]| ≥ 143) ≤ 1432
= 20449 .
Chapter 5
Random Variables
A collection of random variables is a collection of functions of the outcome of the same ran-
dom experiment. We explain how one characterizes the statistics of these random variables.
We have looked at one random variable. The idea somehow was that we made one
numerical observation of one random experiment. Here we extend the idea to multiple
numerical observations about the same random experiment. Since there is one random
are all functions of the same ω. That is, one models these observations as X(ω), Y (ω), and
Z(ω).
As you may expect, these values are related in some way. Thus, observing X(ω) provides
some information about Y (ω). In fact, one of the interesting questions is how one can use
the information that some observations contain about some other random variables that we
5.1 Examples
We pick a ball randomly from a bag and we note its weight X and its diameter Y .
67
68 CHAPTER 5. RANDOM VARIABLES
We track the evolution over time of the value of Cisco shares and we want to forecast
future values.
A transmitter sends some signal and the receiver observes the signal it receives and tries
The joint distribution of {X(ω), Y (ω)} is specified by the joint cumulative distribution
for a nonnegative function fX,Y (x, y) that is called the joint pdf (jpdf) of the random
variables.
This joint distribution contains more information than the two individual distributions.
For instance, let {X(ω), Y (ω)} be the coordinates of a point chosen uniformly in [0, 1]2 .
Define also Z(ω) = X(ω). Observe that the individual distributions of the each of the
random variables in the pairs {X(ω), Y (ω)} and {X(ω), Z(ω)} are the same. The tight
The random variables are positively (resp. negatively, un-) correlated if cov(X, Y ) > 0
(resp. < 0, = 0). The covariance is a measure of dependence. The idea is that if E(XY ) is
larger than E(X)E(Y ), then X and Y tend to be large or small together more than if they
were independent. In our example above, E(XZ) = E(X 2 ) = 1/3 > E(X)E(Z) = 1/4.
Figures 5.1 illustrates the meaning of correlation. Each of the figures shows the possible
values of a pair (X, Y ) of random variables; all the values are equally likely. In the left-most
figure, X and Y tend to be large or small together. These random variables are positively
correlated. Indeed, the product XY is larger on average than it would be if a larger value of
X did not imply a larger than average value of Y . The other two figures can be understood
similarly.
If h : <2 → < is nice (Borel-measurable - once again, all the functions from <2 to <
that we encounter have that property), then h(X, Y ) is a random variable. One can show,
It is sometimes convenient to use vector notation. To do that, one defines the expected
value of a random vector to be the vector of expected values. Similarly, the expected value
of a matrix is the matrix of expected values. Let X be a column vector whose n elements
X1 , . . . , Xn are random variables. That is, X = (X1 , . . . , Xn )T where (·)T indicates the
W whose entry (i, j) is the random variable Wi,j for i = 1, . . . , m and j = 1, . . . , n, we define
W ) to be the matrix whose entry (i, j) is E(Wi,j ). Recall that if A and B are matrices of
E(W
X , Y ) := E((X
ΣX,Y := cov(X X − E(X
X ))(Y Y ))T ) = E(XY
Y − E(Y XY T ) − E(X Y T)
X )E(Y
and
X − E(X
ΣX := E((X X ))(X X ))T ) = E(X
X − E(X X X T ) − E(X X T ).
X )E(X
AX
cov(AX
AX, BY ) = A cov(X BT .
X , Y )B (5.2.1)
Similarly,
X T Y ) = E(tr(X
E(X X Y T )) = trE(X
X Y T ). (5.2.2)
5.3 Independence
for all subsets A and B of the real line (... Borel sets, to be precise).
if the probability that any finite subcollection of them belongs to any given subsets is the
We have seen examples before: flipping coins, tossing dice, picking (X, Y ) uniformly in
Theorem 5.3.1. a. The random variables X, Y are independent if and only if the joint
cdf FX,Y (x, y) is equal to FX (x)FY (y), for all x, y. A collection of random variables are
mutually independent if the jcdf of any finite subcollection is the product of the cdf.
b. If the random variables X, Y have a joint pdf fX,Y (x, y), they are independent if and
only if fX,Y (x, y) = fX (x)fY (y), for all x, y. A collection of random variables with a jpdf
are mutually independent if the jpdf of any finite subcollection is the product of the pdf.
d. If X and Y are independent, then E(XY ) = E(X)E(Y ). [The converse is not true!]
f. The variance of the sum of pairwise independent random variables is the sum of their
variances.
The expression to the right of the identity is called the convolution of fX and fY . Hence,
the pdf of the sum of two independent random variables is the convolution of their pdf.
Proof:
72 CHAPTER 5. RANDOM VARIABLES
We provide sketches of the proof of these important results. The derivation should help
Conversely, assume that the identity above holds. It is easy to see that
P (X ∈ (a, b] and Y ∈ (c, d]) = FX,Y (b, d) − FX,Y (a, d) − FX,Y (b, c) + FX,Y (a, c).
Using FX,Y (x, y) = FX (x)FY (y) in this expression, we find after some simple algebra that
Since the probability is countably additive, the expression above implies that
P (X ∈ A and Y ∈ B) = P (X ∈ A)P (Y ∈ B)
for a collection of sets A and B that is closed under countable operations and that contains
the intervals. Consequently, the identity above holds for all A, B ∈ B where B is the Borel
The same argument proves the corresponding result for mutual independence of random
variables.
so that
c. Assume X and Y are independent. Note that g(X) ∈ A if and only if X ∈ g −1 (A),
which shows that g(X) and h(Y ) are independent. The derivation of the mutual indepen-
d. Assume that X and Y are independent and that they are continuous. Then
The same derivation holds in the discrete case. The hybrid case is similar.
Note that the converse is not true. For instance, assume that (X, Y ) is equally likely to
take the four values {(−1, 0), (0, −1), (1, 0), (0, 1)}. We find that E(XY ) = 0 = E(X)E(Y ).
0) = 1/2.
74 CHAPTER 5. RANDOM VARIABLES
e. We can prove this result by induction by noticing that if {Xn , n ≥ 1} are mutually
In this calculation, we used the fact that E(Xi Xj ) = E(Xi )E(Xj ) for i 6= j because the
g. Note that
Z ∞
P (X + Y ≤ x) = P (X ≤ x − u and Y ∈ (u, u + du))
Z−∞
∞
= P (X ≤ x − u)fY (u)du.
−∞
5.4 Summary
We explained that multiple random variables are defined on the same probability space.
X )). In particular,
We discussed the joint distribution. We showed how to calculate E(h(X
we defined the variance, covariance, k-th moment. The vector notation has few secrets for
you.
5.5. SOLVED PROBLEMS 75
You also know the definition (and meaning) of independence and mutual independence
and you know that the mean value of a product of independent random variables is the
product of their mean values. You can also prove that functions of independent random
variables are independent. We also showed that the variance of the sum of pairwise inde-
This follows from the definitions by computing the pmf of the sum.
Example 5.5.2. Let X1 and X2 be independent and such that Xi is Exd(λi ) for i = 1, 2.
Calculate
We find
where
and
Hence,
λ1
P [X1 ≤ X2 |X1 ∧ X2 = x] = .
λ1 + λ2
Example 5.5.3. Let {Xn , n ≥ 1} be i.i.d. U [0, 1]. Calculate var(X1 + 2X2 + X32 ).
76 CHAPTER 5. RANDOM VARIABLES
Consequently,
1 4 5 1 1
var(X1 + 2X2 + X32 ) = + + E(X38 ) − (E(X34 ))2 = + − ( )2 ≈ 0.49.
12 12 12 9 5
Example 5.5.4. Let X, Y be i.i.d. U [0, 1]. Compute and plot the pdf of X + Y .
We use (5.3.2):
Z ∞
fX+Y (x) = fX (u)fY (x − u)du.
−∞
For a give value of x, the convolution is the integral of the product of fY (u) and fX (x − u).
The latter function is obtained by flipping fX (u) around the vertical axis and dragging it
Example 5.5.5. Let X = (X1 , X2 )T be a vector of two i.i.d. U [0, 1] random variables. Let
that j.p.d.f. when it exists? If it does not exist, how do you characterize the distribution of
Y.
Assume that the two rows of A are proportional to each other. Then so are Y1 and Y2 .
fX + Y (x)
fY (u) fX + Y (x)
fX(x - u) .
1 1
u
x
0 1 x 0 1 2
it cannot have a density, for the integral in the plane of any function that is nonzero only
on a line is equal to zero, which violates the requirement that the density must integrate
by writing that Y2 = αY1 and Y1 is a linear combination of i.i.d. U [0, 1] random variables.
1
fY (yy ) = A−1y ),
fX (A
|A|
g(X1 , . . . , Xm ) and h(Xm+1 , . . . , Xn ) are independent random variables for any functions
and similarly,
Hence,
which proves the independence. (The next-to-last line follows from the mutual independence
of the Xi .)
Example 5.5.7. Let X, Y be two points picked independently and uniformly on the circum-
By symmetry we can assume that the point X has coordinates (1, 0). The point Y then
has coordinates (cos(θ), sin(θ)) where θ is uniformly distributed in [0, 2π]. Consequently,
g(θ).
We now use the basic results on the density of a function of a random variable. To
Accordingly,
g(θ) ∈ (z, z + δ)
if and only if
δ
θ ∈ ((θn , θn + )
g0(θn )
for some θn such that g(θn ) = z. It follows that, if Z = g(θ), then
X 1
fZ (z) = fθ (θn ).
n
|g0(θn )|
5.5. SOLVED PROBLEMS 79
In this expression, the sum is over all the θn such that g(θn ) = z.
Coming back to our example, g(θ) = z if 2(1 − cos(θ)) = z. In that case, |g0(θ)| =
p
2| sin(θ)| = 2 1 − (1 − z2 )2 . Note that there are two values of θ such that g(θ) = z whenever
1 1 1
fZ (z) = 2 × p z 2
× = q , for z ∈ (0, 4).
2 1 − (1 − 2 ) 2π 2π z − z2
4
Example 5.5.8. The two random vectors X and Y are selected independently and uniformly
by symmetry.
Now,
Also,
Z 1
1 x3 1
E(X12 ) = x2 dx = [ ]1−1 = .
−1 2 6 3
4
X − Y ||2 ) = 4E(X12 ) = .
E(||X
3
Example 5.5.9. Let {Xn , n ≥ 1} be i.i.d. B(p). Assume that g, h : <n → < have the
X ) and h(X
The intuition is that g(X X ) are large together and small together.
80 CHAPTER 5. RANDOM VARIABLES
g̃(x) = g(x) − g(0) and h̃(x) = h(x) − h(0), we see that it is equivalent to show that
cov(g̃(X1 ), h̃(X1 )) ≥ 0. In other words, we can assume without loss of generality that
which is seen to be satisfied since g(1) and h(1) are nonnegative and p ≤ 1.
Assume that the result is true for n. Let X = (X1 , . . . , Xn ) and V = Xn+1 . We must
show that
X , V )h(X
E(g(X X , V )) ≥ E(g(X
X , V )E(h(X
X , V )).
X , i)h(X
E(g(X X , i)) ≥ E(g(X
X , i)E(h(X
X , i)), for i = 0, 1.
X , 0)) = 0.
E(g(X
and
X , V )) ≤ E(h(X
E(h(X X , 1)),
so that
X , V ))E(h(X
E(g(X X , V )) = pE(g(X
X , 1))E(h(X
X , V )) ≤ pE(g(X
X , 1))E(h(X
X , 1))
X , 1)h(X
≤ pE(g(X X , 1)) ≤ pE(g(X
X , 1)h(X
X , 1)) + (1 − p)E(g(X
X , 0)h(X
X , 0))
X , V )h(X
= E(g(X X , V )),
Example 5.5.10. Let X be uniformly distributed in [0, 2π] and Y = sin(X). Calculate the
p.d.f. fY of Y .
X 1
fY (y) = fX (xn )
|g 0 (x n )|
For each y ∈ (−1, 1), there are two values of xn in [0, 2π] such that g(xn ) = sin(xn ) = y.
q p
|g 0 (xn )| = | cos(xn )| = 1 − sin2 (xn ) = 1 − y2,
and
1
fX (xn ) = .
2π
Hence,
1 1 1
fY (y) = 2 p = p .
1 − y 2π
2 π 1 − y2
Example 5.5.11. Let {X, Y } be independent random variables with X exponentially dis-
tributed with mean 1 and Y uniformly distributed in [0, 1]. Calculate E(max{X, Y }).
P (Z ≤ z) = P (X ≤ z, Y ≤ z) = P (X ≤ z)P (Y ≤ z)
z(1 − e−z ), for z ∈ [0, 1]
=
1 − e−z , for z ≥ 1.
Hence,
1 − e−z + ze−z , for z ∈ [0, 1]
fZ (z) =
e−z , for z ≥ 1.
82 CHAPTER 5. RANDOM VARIABLES
Accordingly,
Z ∞ Z 1 Z ∞
−z −z
E(Z) = zfZ (z)dz = z(1 − e + ze )dz + ze−z dz
0 0 1
Z 1 Z 1 Z 1
ze−z dz = − zde−z = −[ze−z ]10 + e−z dz
0 0 0
−1
= −e − [e−z ]10 = 1 − 2e −1
.
Z 1 Z 1 Z 1
z 2 e−z dz = − z 2 de−z = −[z 2 e−z ]10 + 2ze−z dz
0 0 0
−1 −1 −1
= −e + 2(1 − 2e ) = 2 − 5e .
Z ∞ Z 1
−z
ze dz = 1 − ze−z dz = 2e−1 .
1 0
1
E(Z) = − (1 − 2e−1 ) + (2 − 5e−1 ) + 2e−1 = 3 − 5e−1 ≈ 1.16.
2
Example 5.5.12. Let {Xn , n ≥ 1} be i.i.d. with E(Xn ) = µ and var(Xn ) = σ 2 . Use
X1 + · · · + Xn
α := P (| − µ| ≥ ²).
n
1 X1 + · · · + Xn 1 nvar(X1 ) σ2
α≤ 2
var( )= 2 2
= 2.
² n ² n n²
This calculation shows that the sample mean gets closer and closer to the mean: the
Example 5.5.13. Let X =D P (λ). You pick X white balls. You color the balls indepen-
dently, each red with probability p and blue with probability 1 − p. Let Y be the number
of red balls and Z the number of blue balls. Show that Y and Z are independent and that
We find
µ ¶
m+n m
P (Y = m, Z = n) = P (X = m + n) p (1 − p)n
m
µ ¶
λm+n m+n m λm+n (m + n)! m
= p (1 − p)n = × p (1 − p)n
(m + n)! m (m + n)! m!n!
(λp)m −λp (λ(1 − p))n −λ(1−p)
= [ e ]×[ e ],
m! n!
‘
Chapter 6
Conditional Expectation
Conditional expectation tells us how to use the observation of a random variable Y (ω)
to estimate another random variable X(ω). This conditional expectation is the best guess
about X(ω) given Y (ω) if we want to minimize the mean squared error. Of course, the value
6.1 Examples
6.1.1 Example 1
Assume that the pair of random variables (X, Y ) is discrete and takes values in {x1 , . . . , xm }×
X X
P (Y = yj ) = P (X = xi , Y = yj ) = p(i, j).
i i
define
X
E[X|Y = yj ] = xi P [X = xi |Y = yj ].
i
85
86 CHAPTER 6. CONDITIONAL EXPECTATION
X
E[X|Y ] = E[X|Y = yj ]1{Y = yj }.
j
For instance, your guess about the temperature in San Francisco certainly depends on
the temperature you observe in Berkeley. Since the latter is random, so is your guess about
the former.
Although this definition is sensible, it is not obvious in what sense this is the best guess
6.1.2 Example 2
Consider the case where (X, Y ) have a joint density f (x, y) and marginal densities fX (x)
and fY (y). One can then define the conditional density of X given that Y = y as follows.
We see that
As δ goes down to zero, we see that fX|Y [x|y] is the conditional density of X given
Z ∞
E[X|Y = y] = xfX|Y [x|y]dx. (6.1.1)
−∞
6.1.3 Example 3
The ideas of Examples 1 and 2 extend to hybrid cases. For instance, consider the situation
The figure shows the joint distribution of (X, Y ). With probability 0.4, (X, Y ) =
(0.75, 0.25). Otherwise (with probability 0.6), the pair (X, Y ) is picked uniformly in the
6.2. MMSE 87
square [0, 1]2 . You see that E[X|Y = y] = 0.5 if y 6= 0.25. Also, if Y = 0.25, then X = 0.75,
Thus E[X|Y ] = g(Y ) where g(0.25) = 0.75 and g(y) = 0.5 for y 6= 0.25.
In this case, E[X|Y ] is a random variable such that E[X|Y ] = 0.5 w.p. 0.6 and E[X|Y ] =
(Note that the expected value of E[X|Y ] is 0.5 × 0.6 + 0.75 × 0.4 = 0.6 and you can
observe that E(X) = 0.5 × 0.6 + 0.75 × 0.4 = 0.6. That is, E(E[X|Y ]) = E(X) and we will
6.2 MMSE
The examples that we have explored led us to define E[X|Y ] as the expected value of X
when it has its conditional distribution given the value of Y . In this section, we explain
that E[X|Y ] can be defined as the function g(Y ) of the observed value Y that minimizes
E((X − g(Y )2 ). That is, E[X|Y ] is the best guess about X that is based on Y , where best
= E((X − E[X|Y ])2 ) + E((E[X|Y ] − g(Y ))2 ) + 2E((X − E[X|Y ])(E[X|Y ] − g(Y )))
= E((X − E[X|Y ])2 ) + E((E[X|Y ] − g(Y ))2 ) + 2E((X − E[X|Y ])h(Y )) (6.2.1)
The second step is to show that the last term in (6.2.1) is equal to zero. To show that,
we calculate
Z Z Z
E(h(Y )E[X|Y ]) = h(y)E[X|Y = y]fY (y)dy = h(y){ xfX|Y [x|y]dx} fY (y)dy
Z Z
= xh(y)fX,Y (x, y)dxdy = E(h(Y )X). (6.2.2)
The next-to-last identity uses the fact that fX|Y [x|y]fY (y) = fX,Y (x, y), by definition of
The final step is to observe that (6.2.1) with the last term equal to zero implies that
E((X −g(Y ))2 ) = E((X −E[X|Y ])2 )+E((E[X|Y ]−g(Y ))2 ) ≥ E((X −E[X|Y ])2 ). (6.2.3)
This is the story when joint densities exist. The derivation can be adapted to the case when
The left-hand part of Figure 6.2 shows that E[X|Y ] is the average value of X on sets that
correspond to a constant value of Y . The figure also highlights the fact that E[X|Y ] is a
random variable.
6.3. TWO PICTURES 89
X
X
E[X|Y]
E[X|Y]
g(Y)
0 1
The right-hand part of Figure 6.2 depicts random variables as points in some vector
space. The figure shows that E[X|Y ] is the function of Y that is closest to X. The
metric in the space is d(V, W ) = (E(V − W )2 )1/2 . That figure illustrates the relations
(6.2.3). These relations are a statement of Pythagora’s theorem: the square of the length
of the hypothenuse d2 (X, g(Y )) is the sum of the squares of the sides of the right triangle
d2 (X, E[X|Y ]) + d2 (E[X|Y ], g(Y )). This figure shows that E[X|Y ] is the projection of X
onto the hyperplane {k(Y ) | k(·) is a function }. The figure also shows that for E[X|Y ] to
be that projection, the vector X − E[X|Y ] must be orthogonal to every function k(Y ), and
To give you a concrete feel for this vector space, imagine that Ω = {ω1 , . . . , ωN } and that
pk is the probability that ω is equal to ωk . In that case, the random variable X corresponds
to the vector (X(ω1 )(p1 )1/2 , . . . , X(ωN )(pN )1/2 ) in <N . For a general Ω, the random variable
X is a function of ω and it belongs to a function space. This space is a vector space since
linear combinations of functions are also functions. If we restrict our attention to random
variables X with E(X 2 ) < ∞, that space, with the metric that one defines, turns out to
be closed under limits of convergent sequences. (Such a space is called a Hilbert space.)
This property is very useful because it implies that as one chooses functions gn (Y ) whose
90 CHAPTER 6. CONDITIONAL EXPECTATION
g(Y ). This argument implies the existence of conditional expectation. The uniqueness is
intuitively clear: if two random variables g(Y ) and g 0 (Y ) achieve the minimum distance to
Thus, we can define E[X|Y ] as the function g(Y ) that minimizes E((X − g(Y ))2 ). This
definition does not assume the existence of a conditional density fX|Y [·|·], nor of a joint
pmf.
One often calculates the conditional expectation by using the properties of that operator.
We derive these properties in this section. We highlight the key observation we made in the
We proved in (6.2.2) that E[X|Y ] satisfies that property. To show that a function that
satisfies (6.4.1) is the conditional expectation, one observes that if both g(Y ) and g 0 (Y )
satisfy that condition, they must be equal. To see that, note that
Using (6.4.1), we see that the last term is equal to zero. Hence,
But we assume that E((X −g(Y ))2 ) = E((X −g 0 (Y ))2 ). Consequently, E((g(Y )−g 0 (Y ))2 ) =
a. Linearity:
b. Known Factor:
c. Averaging:
e. Smoothing:
Proof:
The derivation of these identities is a simple exercise, but going through it should help
a. Linearity is fairly clear from our original definition (6.1.1), when the conditional
density exists. In the general case, we can use the lemma as follows. We do this derivation
To show that the function g(Y ) := a1 E[X1 | Y ] + a2 E[X2 | Y ] is in fact E[X|Y ] with
E((ai E[Xi | Y ])h(Y )) = E(E[Xi | Y ](ai h(Y ))) = E(Xi (ai h(Y ))) = E(ai Xi h(Y )). (6.4.8)
92 CHAPTER 6. CONDITIONAL EXPECTATION
Indeed, the third identity follows from the property (6.4.1) applied to E[Xi | Y ] with h(Y )
b. To show that g(Y ) := k(Y )E[X|Y ] is equal to E[Xk(Y )|Y ] we prove that it satisfies
This identity follows from the property (6.4.1) of E[X|Y ] where one replaces h(Y ) by
k(Y )h(Y ).
d. We use the lemma. Let g(Y ) = E[X|Y ]. We show that for any h(Y ) one has
E(g(Y )h(Y )) = E(E[X|Y, Z]h(Y )). Now, E(E[X|Y, Z]h(Y )) = E(Xh(Y )) by (6.4.1) and
E(g(Y )h(Y )) = E(E[X|Y ]h(Y )) = E(Xh(Y )), also by (6.4.1). This completes the proof.
i.i.d. random variables with P (Xn = −1) = P (Xn = 1) = 0.5. The random variable Xn
represents your gain at the n-th game of roulette, playing black or red and assuming that
there is no house advantage (no 0 nor double-zero). Say that you have played n times and
you gamble on the next game. You earn Yn Xn+1 on that next game. After a number of
Z = Y0 X1 + Y1 X2 + · · · + Yn Xn+1 .
(Here, Y0 is some arbitrary initial bet.) Assume that the random variables Yn are
bounded (which is not unreasonable since there may be a table limit), then you find that
6.6 Summary
The setup is that (X, Y ) are random variables on some common probability space, i.e., with
The minimum mean squares estimator of X given Y is defined as the function g(Y )
that minimizes E((X − g(Y ))2 ). We know that the answer is g(Y ) = E[X | Y ]. How do we
calculate it?
94 CHAPTER 6. CONDITIONAL EXPECTATION
Direct Calculation
The direct calculation uses (6.1.1) or the discrete version. We look at hybrid cases in the
examples.
Symmetry
m
E[X1 + · · · + Xm | X1 + · · · + Xn ] = × (X1 + · · · + Xn ) for 1 ≤ m ≤ n.
n
Note also that
E[Xi | X1 + · · · + Xn ] = Y, for i = 1, . . . , n
where Y is some random variable. Second, by (6.4.2), if we add up these identities for
i = 1, . . . , n, we find
nY = E[X1 + · · · + Xn | X1 + · · · + Xn ] = X1 + · · · + Xn .
Hence,
Y = (X1 + · · · + Xn )/n,
so that
Using these identities we can now derive the two properties stated above.
Properties
Often one can use the properties of conditional expectation states in Theorem 6.4.2 to
calculate E[X | Y ].
6.7. SOLVED PROBLEMS 95
Example 6.7.1. Let (X, Y ) be a point picked uniformly in the quarter circle {(x, y) | x ≥
p
Given Y = y, X is uniformly distributed in [0, 1 − y 2 ]. Hence
1p
E[X | Y ] = 1 − Y 2.
2
b. Find E[T ].
c. Find V ar[T ].
Example 6.7.3. The random variables Xi are i.i.d. and such that E[Xi ] = µ and var(Xi ) =
values. Let S = X1 + X2 + . . . + XN .
a. Find E(S).
b. Find var(S).
Then,
var(S) = E(S 2 ) − (E(S))2 = E(N )σ 2 + E(N 2 )µ2 − µ2 (E(N ))2 = E(N )σ 2 + var(N )µ2 .
Example 6.7.4. Let X, Y be independent and uniform in [0, 1]. Calculate E[X 2 | X + Y ].
Z 1
1 1 x3 1 1 − (z − 1)3
E[X 2 | X + Y = z] = x2 dx = [ ]z−1 = .
z−1 2−z 2−z 3 3(2 − z)
Similarly, if z < 1, then
Z z
1 1 x3 z2
E[X 2 | X + Y = z] = x2 dx = [ ]z0 = .
0 z z 3 3
Example 6.7.5. Let (X, Y ) be the coordinates of a point chosen uniformly in [0, 1]2 . Cal-
culate E[X | XY ].
This is an example where we use the straightforward approach, based on the definition.
The problem is interesting because is illustrates that approach in a tractable but nontrivial
example. Let Z = XY .
Z 1
E[X | Z = z] = xf[X|Z] [x | z]dx.
0
6.7. SOLVED PROBLEMS 97
Now,
fX,Z (x, z)
f[X|Z] [x | z] = .
fZ (z)
Also,
Hence,
1
x, if x ∈ [0, 1] and z ∈ [0, x]
fX,Z (x, z) =
0, otherwise.
Consequently,
Z 1 Z 1
1
fZ (z) = fX,Z (x, z)dx = dx = −ln(z), 0 ≤ z ≤ 1.
0 z x
Finally,
1
f[X|Z] [x | z] = − , for x ∈ [0, 1] and z ∈ [0, x],
xln(z)
and
Z 1
1 z−1
E[X | Z = z] = x(− )dx = ,
z xln(z) ln(z)
so that
XY − 1
E[X | XY ] = .
ln(XY )
Examples of values:
Example 6.7.6. Let X, Y be independent and exponentially distributed with mean 1. Find
E[cos(X + Y ) | X].
98 CHAPTER 6. CONDITIONAL EXPECTATION
We have
Z ∞ Z ∞
−y
E[cos(X + Y ) | X = x] = cos(x + y)e dy = Re{ ei(x+y)−y dy}
0 0
eix cos(x) − sin(x)
= Re{ }= .
1−i 2
E[X1 | Y ].
Intuition suggests, and it is not too hard to justify, that if Y = y, then X1 = y with prob-
ability 1/n, and with probability (n − 1)/n the random variable X1 is uniformly distributed
Example 6.7.8. Let X, Y, Z be independent and uniform in [0, 1]. Calculate E[(X + 2Y +
Z)2 | X].
Example 6.7.9. Let X, Y, Z be three random variables defined on the same probability
and
Hence,
E((X −X1 )2 ) = E((X −X2 +X2 −X1 )2 ) = E((X −X2 )2 )+E((X2 −X1 )2 ) ≥ E((X −X2 )2 ).
Example 6.7.10. Pick the point (X, Y ) uniformly in the triangle {(x, y) | 0 ≤ x ≤
1 and 0 ≤ y ≤ x}.
a. Calculate E[X | Y ].
1+Y
E[X | Y ] = .
2
X
E[Y | X] = .
2
X2
E[(X − Y )2 | X] = .
3
Example 6.7.11. Assume that the two random variables X and Y are such that E[X |
We show that E((X − Y )2 ) = 0. This will prove that X − Y = 0 with probability one.
Note that
Now,
Similarly, one finds that E(XY ) = E(Y 2 ). Putting together the pieces, we get E((X −
Y )2 ) = 0.
Example 6.7.12. Let X, Y be independent random variables uniformly distributed in [0, 1].
Drawing a unit square, we see that given {X < Y }, the pair (X, Y ) is uniformly dis-
tributed in the triangle left of the diagonal from the upper left corner to the bottom right
corner of that square. Accordingly, the p.d.f. f (x) of X is given by f (x) = 2(1 − x). Hence,
Z 1
1
E[X|X < Y ] = x × 2(1 − x)dx = .
0 3
Chapter 7
Gaussian random variables show up frequently. (This is because of the central limit theorem
that we discuss later in the class.) Here are a few essential properties that we explain in
the chapter.
• If random variables are jointly Gaussian, then the conditional expectation is linear.
7.1 Gaussian
Definition 7.1.1. We say that X is a standard Gaussian (or standard Normal) random
1
fX (x) = √ exp{−x2 /2}, for x ∈ <.
2π
101
102 CHAPTER 7. GAUSSIAN RANDOM VARIABLES
To see that fX (·) is a proper density we should verify that it integrates to one. We do
Z Z 2π 2π
1
= exp{−(x2 + y 2 )/2}dxdy.
2π
and y = r sin(θ). Then, dxdy = rdrdθ and x2 + y 2 = r2 . We then rewrite the integral above
as follows:
Z ∞ Z 2π Z ∞
1 −r2 /2 2 /2
A2 = re drdθ = re−r dr
0 0 2π 0
Z ∞
2 /2 2 /2
=− de−r = [e−r ]∞
0 = 1,
0
so that
Z
0 d 1
φ (u) := φ(u) = ixeiux √ exp{−x2 /2}dx
du 2πZ
Z
1 iux −x2 /2 1 2
= −i √ e de = i √ e−x /2 deiux
Z 2π 2π
1
= −u eiux √ exp{−x2 /2}dx = −uφ(u).
2π
7.1. GAUSSIAN 103
claimed.
2 /2
We have shown that if X =D N (0, 1), then E(eiuX ) = e−u . To see the converse, one
observes that
Z ∞
iuX
E(e )= eiux fX (x)dx,
−∞
so that E(eiuX ) is the Fourier transform of fX (·). It can be shown that the Fourier transform
specifies fX (·) uniquely. That is, if two random variables X and Y are such that E(eiuX ) =
2 /2
E(eiuY ), then fX (x) = fY (x) for x ∈ <. Accordingly, if E(eiuX ) = e−u , it must be that
X =D N (0, 1).
Since
1 1 1
E(exp{iuX}) = E(1 + iuX − u2 X 2 − iu3 X 3 + · · · + (iuX)n + · · · )
2 3! n!
2 1 2 1 4 1 6 1
= exp{−u /2} = 1 − u + u − u + · · · + (−u2 /2)m + · · · ,
2 8 48 m!
we see that E(X m ) = 0 for m odd and
1 1
(iu)2m E(X 2m ) = (−u2 /2)m ,
(2m)! m!
(2m)!
so that E(X 2m ) = 2m m! .
The cdf of X does not admit a closed form expression. Its values have been tabulated
and many software packages provides that function. Table 7.1 shows sample values of the
7.1.2 N (µ, σ 2 )
exp{iuµ − u2 σ 2 /2}.
1 (x − µ)2
√ exp{− }, for x ∈ <.
2πσ 2 2σ 2
So far we have considered a single Gaussian random variable. In this section we discuss
collections of random variables that have a Gaussian joint distribution. Such random vari-
7.2.1 N (00, I )
Random variables X are said to be jointly Gaussian if u T X is Gaussian for any vector u .
7.2. JOINTLY GAUSSIAN 105
and var(Y ) = u T Σu
u (see (7.5.7)), we see that
Now,
Z ∞ Z ∞
uT X Tx
E(e iu
)= ··· eiuu x)dx1 . . . dxn
fX (x (7.2.2)
−∞ −∞
out that this Fourier transform completely determines the joint density.
Using these preliminary calculations we can derive the following very useful result.
Jointly Gaussian random variables are independent if and only if they are uncorrelated.
Proof:
We know from Theorem 5.3.1 that the random variables X = (X1 , . . . , Xn )T are inde-
pendent if and only if their joint density is the product of their individual densities. Now,
That is, random variables are mutually independent if and only if their joint characteris-
TX
tic function E(eiuu ) is the product of their individual characteristic functions E(eium Xm ).
To complete the proof, we use the specific form (7.2.1) of the joint characteristic function of
jointly Gaussian random variables and we note that it factorizes if and only if Σ is diagonal,
¤
106 CHAPTER 7. GAUSSIAN RANDOM VARIABLES
µ, AA T ). Assume that A is
Note also that if X = N (00, I ), then Y = µ + AX = N (µ
nonsingular. Then
1
fY (yy ) = A−1y ),
fX (A
|A|
in view of the change of variables. Hence,
1 1
fY (yy ) = n/2
exp{− (yy − µ )T Σ −1 (yy − µ )
Σ|)
(2π|Σ 2
where Σ = AA T . We used the fact that the determinant of the product of square matrices
Σ| = |A
is the product of their determinants, so that |Σ AT | = |A
A| × |A A|2 and |A Σ|1/2 .
A| = |Σ
We explain how to calculate the conditional expectation of jointly Gaussian random vari-
ables. The result is remarkably simple: the conditional expectation is linear in the obser-
vations!
Theorem 7.3.1. Let (X, Y ) be two jointly Gaussian random variables. Then
a. One has
cov(X, Y )
E[X | Y ] = E(X) + (Y − E(Y )). (7.3.1)
var(Y )
X , Y ) be jointly Gaussian.
Let (X
b. If ΣY is invertible, then
X |Y
E[X X ) + ΣX ,YY ΣY−1 (Y
Y ] = E(X Y − E(Y
Y )). (7.3.2)
X |Y
E[X X ) + ΣX ,YY ΣY† (Y
Y ] = E(X Y − E(Y
Y )) (7.3.3)
Proof:
Then we can look for a vector a and a matrix B of compatible dimensions so that
X |Y
E[X Y ] = E[X
X − a − BY + a + BY |Y
Y]
X − a − BY |Y
= E[X Y ] + E[a
a + BY |Y
Y ], by (6.4.2)
X − a − BY ) + a + BY , by (6.4.5)
= E(X
= a + BY .
X − a − BY ] = 0, or
To find the desired a and B , we solve E[X
X ) − B E(Y
a = E(X Y)
Y T ] = 0, or
X − a − BY )Y
and E[(X
ΣX ,YY = B ΣY . (7.3.4)
B = ΣX ,YY ΣY−1 ,
so that
X |Y
E[X X ) + ΣX ,YY ΣY−1 (Y
Y ] = E(X Y − E(Y
Y )).
If ΣY ,YY is not invertible is not, we choose a pseudo-inverse ΣY† that solves (7.3.4), i.e., is
such that
We then find
X |Y
E[X X ) + ΣX ,YY ΣY† (Y
Y ] = E(X Y − E(Y
Y )).
7.4 Summary
If X, Y are jointly Gaussian, then E[X | Y ] = E(X) + cov(X, Y )var(Y )−1 (Y − E(Y )).
Example 7.5.1. The noise voltage X in an electric circuit can be modelled as a Gaussian
a. What is the probability that it exceeds 10−4 ? What is the probability that it exceeds
2 × 10−4 ? What is the probability that its value is between −2 × 10−4 and 10−4 ?
b. Given that the noise value is positive, what is the probability that it exceeds 10−4 ?
Let Z = 104 X, then Z =D N (0, 1) and we can reformulate the questions in terms of Z.
a. Using (7.1) we find P (Z > 1) = 0.159 and P (Z > 2) = 0.023. Indeed, P (Z > d) =
P (−2 < Z < 1) = P (Z < 1)−P (Z ≤ −2) = 1−P (Z > 1)−P (Z > 2) = 1−0.159−0.023 = 0.818.
b. We have
P (Z > 1)
P [Z > 1 | Z > 0] = = 2P (Z > 1) = 0.318.
P (Z > 0)
7.5. SOLVED PROBLEMS 109
Hence,
r
−4 2
E(|X|) = 10 .
π
random variables. A low-pass filter takes the sequence U and produces the output sequence
a. Find the joint pdf of Xn and Xn−1 and find the joint pdf of Xn and Xn+m for m > 1.
b. Find the joint pdf of Yn and Yn−1 and find the joint pdf of Yn and Yn+m for m > 1.
We start with some preliminary observations. First, since the Ui are independent, they
are jointly Gaussian. Second, Xn and Yn are linear combinations of the Ui and thus are
also jointly Gaussian. Third, the jpdf of jointly gaussian random variables Z is
1 1
fZ (zz ) = p exp[− (zz − m )C −1 (zz − m )]
(2π)n det(C) 2
matrix
Z − m )(Z
E[(Z Z − m )T ]. Finally, we need some basic facts
from algebra. If C =
a b d −b
, then det(C) = ad − bc and C −1 = 1 . We are now ready to
det(C)
c d −c a
answer the questions.
U.
a. Express in the form X = AU
Un−1
1 1
Xn 0
= 2 2 Un
1 1
Xn−1 2 2 0
Un+1
110 CHAPTER 7. GAUSSIAN RANDOM VARIABLES
X ] = AE[U
Then E[X U ] = 0.
1 0 0 0 1
1 1 2 1 1
0
U U T ]AT =
X X T ] = AE[U
C = E[X 2 2 0 1 0 1 1 = 2 4
1 1 2 2 1 1
2 2 0 1 4 2
0 0 1 2 0
1 1 3
Then det(C) = 4 − 16 = 16 and
16 12 − 14
C −1 =
3 − 14 1
2
2
fXn Xn−1 (xn , xn−1 ) = √
π 3
exp[− 43 (x2n − xn xn−1 + x2n−1 )]
For m > 1,
Un
Xn 1 1
0 0 Un+1
= 2 2
Xn+m 0 0 1 1 Un+m
2 2
Un+m+1
X ] = AE[U
Then E[X U ] = 0.
1
1 0 0 0 0
2
1 1
0 0 0 1 0 0 1
0 1
0
U U T ]AT =
X X T ] = AE[U
C = E[X 2 2
2 2
=
0 0 1 1 0 0 1 0 0 1
0 1
2 2 2 2
1
0 0 0 1 0 2
1
Then det(C) = 4 and
2 0
C −1 =
0 2
1
fXn Xn+m (xn , xn+m ) = π exp[− 14 (x2n + x2n+m )]
b.
Un−2
Yn 0 − 12 1
= 2 Un−1
Yn−1 − 12 1
2 0
Un
Y ] = AE[U
Then E[Y U ] = 0.
1 0 0 0 − 21
0 − 12 1 1
− 14
U U T ]AT =
Y Y T ] = AE[U
C = E[Y 2 0 1 0 −1 1 = 2
2 2
− 12 1
2 0 1
− 14 1
2
0 0 1 2 0
7.5. SOLVED PROBLEMS 111
1 1 3
Then det(C) = 4 − 16 = 16 and
1 1
16 2 4
C −1 =
3 1 1
4 2
2
fYn Yn−1 (yn , yn−1 ) = √
π 3
exp[− 43 (yn2 + yn yn−1 + yn−1
2 )]
For m > 1,
Un−1
Yn − 21 1
0 0 Un
= 2
Yn+m 0 0 − 21 1 Un+m−1
2
Un+m
Y ] = AE[U
Then E[Y U ] = 0.
1 0 0 0 − 12 0
− 12 1
0 0 0 1 0 0 1 0 1
0
U U T ]AT =
Y Y T ] = AE[U
C = E[Y 2
2
2
=
0 0 − 12 1 0 0 1 0 0 − 12 0 1
2 2
1
0 0 0 1 0 2
1
Then det(C) = and
4
2 0
C −1 =
0 2
1
fYn Yn+m (yn , yn+m ) = π exp[− 14 (yn2 + yn+m
2 )]
1 1 3
Then det(C) = 4 − 14 = 16 and
1
16 − 14
C −1 = 2
3 − 14 1
2
2
fXn Yn (xn , yn ) = √
π 3
exp[− 43 (x2n − xn yn + yn2 )]
1
Then det(C) = 4 and
2 0
C −1 =
0 2
1
fXn Yn+1 (xn , yn+1 ) = π exp[− 14 (x2n + yn+1
2 )]
We have
3X + Z
E[X + 2Y |3X + Z, 4Y + 2V ] = a Σ−1
4Y + 2V
where
and
var(3X + Z) E((3X + Z)(4Y + 2V )) 10 0
Σ= = .
E((3X + Z)(4Y + 2V )) var(4Y + 2V ) 0 20
Hence,
10−1 0 3X + Z
E[X+2Y |3X+Z, 4Y +2V ] = [3, 8] = 3 (3X+Z)+ 4 (4Y +2V ).
0 20−1 4Y + 2V 10 10
Example 7.5.4. Assume that {X, Yn , n ≥ 1} are mutually independent random variables
σ 2
Thus we know that X − X̂n = N (0, n+σ 2 ). Accordingly,
σ2 0.1
P (|X − X̂n | > 0.1) = P (|N (0, 2
)| > 0.1) = P (|N (0, 1)| > )
n+σ αn
q
σ2
where αn = n+σ 2
. For this probability to be at most 5% we need
r
0.1 σ2 0.1
= 2, i.e., αn = 2
= ,
αn n+σ 2
so that
n = 19σ 2 .
The result is intuitively pleasing: If the observations are more noisy (σ 2 large), we need
Example 7.5.5. Assume that X, Y are i.i.d. N (0, 1). Calculate E[(X + Y )4 | X − Y ].
Note that X + Y and X − Y are independent because they are jointly Gaussian and
uncorrelated. Hence,
X 2 + Y 2 =D Exd(1/2). That is, the sum of the squares of two i.i.d. zero-mean Gaussian
Z ∞ Z ∞
iuW 2 +y 2 ) 1 −(x2 +y2 )/2
E(e ) = eiu(x e dxdy
−∞ −∞ 2π
Z 2π Z ∞
2 1 −r2 /2
= eiur e rdrdθ
2π
Z0 ∞ 0
2 2 /2
= eiur e−r rdr
0
Z ∞
1 2 2 1
= d[eiur −r /2 ] = .
0 2iu − 1 1 − 2iu
7.5. SOLVED PROBLEMS 115
Example 7.5.7. Let {Xn , n ≥ 0} be Gaussian N (0, 1) random variables. Assume that
Yn+1 = aYn + Xn for n ≥ 0 where Y0 is a Gaussian random variable with mean zero and
a. We see that
αn+1 = a2 αn + 1 and α0 = σ 2 .
1 − a2n
var(Yn ) = αn = a2n σ 2 + , for n ≥ 0.
1 − a2
1
var(Yn ) → γ 2 := as n → ∞.
1 − a2
a.Calculate
E[X1 + X2 + X3 | X1 + X2 , X2 + X3 , X3 + X4 ].
b. Calculate
E[X1 + X2 + X3 | X1 + X2 + X3 + X4 + X5 ].
where the coefficients a, b, c must be such that the estimation error is orthogonal to the
= E((X1 + X2 + X3 ) − Y )(X3 + X4 )) = 0.
2 − a − (a + b) = 2 − (a + b) − (b + c) = 1 − (b + c) − c = 0,
Yk = E[Xk | X1 + X2 + X3 + X4 + X5 ].
Y1 +Y2 +Y3 +Y4 +Y5 = E[X1 +X2 +X3 +X4 +X5 | X1 +X2 +X3 +X4 +X5 ] = X1 +X2 +X3 +X4 +X5 .
3
E[X1 + X2 + X3 | X1 + X2 + X3 + X4 + X5 ] = Y1 + Y2 + Y3 = (X1 + X2 + X3 + X4 + X5 ).
5
Example 7.5.9. Let the Xn ’s be as in Example 7.5.7. Find the jpdf of (X1 + 2X2 +
These random variables are jointly Gaussian, zero mean, and with covariance matrix Σ
given by
14 11 11
Σ=
11 14 11 .
11 11 14
Indeed, Σ is the matrix of covariances. For instance, its entry (2, 3) is given by
1 1
x) =
fX (x exp{− x T Σ−1x }.
(2π)3/2 |Σ|1/2 2
Y ] where
E[X1 + 3X2 |Y
X1
1 2 3
Y =
X2
3 2 1
X3
By now, this should be familiar. The solution is Y := a(X1 + 2X2 + 3X3 ) + b(3X1 +
and
Example 7.5.11. Find the jpdf of (2X1 + X2 , X1 + 3X2 ) where X1 and X2 are independent
These random variables are jointly Gaussian, zero-mean, with covariance Σ given by
5 5
Σ= .
5 10
Hence,
1 1
x) =
fX (x 1/2
exp{− x T Σ−1x }
2π|Σ| 2
1 1 T −1
= exp{− x Σ x }
10π 2
where
1 10 −5
Σ−1 = .
25 −5 5
Example 7.5.12. The random variable X is N (µ, 1). Find an approximate value of µ so
that
Example 7.5.13. Let X be a N (0, 1) random variable. Calculate the mean and the variance
2 /2
E(eiuX ) = e−u and eiθ = cos(θ) + i sin(θ).
Therefore,
2 /2
E(cos(uX) + i sin(uX)) = e−u ,
7.5. SOLVED PROBLEMS 119
so that
2 /2
E(cos(uX)) = e−u and E(sin(uX)) = 0.
1 1 1
E(cos2 (X)) = E( (1 + cos(2X))) = + E(cos(2X)).
2 2 2
2 /2
E(cos(2X)) = e−2 = e−2 ,
1 1 −2 1 1
var(cos(X)) = E(cos2 (X)) − (E(cos(uX)))2 = + e − (e−1/2 )2 = + e−2 − e−1 .
2 2 2 2
Similarly, we find
1 1 −2
E(sin2 (X)) = E(1 − cos2 (X)) = − e = var(sin(X)).
2 2
a. Calculate
E[3X + 5Y | 2X − Y, X + Z].
E[3X + 5Y | V ] = a Σ−1
V V
where
V T ) = [1, 3]
a = E((3X + 5Y )V
and
5 2
ΣV = .
2 2
Hence,
−1
5 2 1 2 −2
E[3X + 5Y | V ] = [1, 3] V = [1, 3] V
2 2 6 −2 5
1 2 13
V = − (2X − Y ) + (X + Z).
= [−4, 13]V
6 3 6
b. Now,
1
E[3X + 5Y | V ] = E(3X + 5Y ) + a Σ−1 V − E(V
V (V V − [1, 2]T )
V )) = 8 + [−4, 13](V
6
26 2 13
= − (2X − Y ) + (X + Z).
6 3 6
Example 7.5.16. Let (X, Y ) be jointly Gaussian. Show that X − E[X | Y ] is Gaussian
We know that
cov(X, Y )
E[X | Y ] = E(X) + (Y − E(Y )).
var(Y )
Consequently,
cov(X, Y )
X − E[X | Y ] = X − E(X) − (Y − E(Y ))
var(Y )
and is certainly Gaussian. This difference is zero-mean. Its variance is
The detection problem is roughly as follows. We want to guess which of finitely many
possible causes produced an observed effect. For instance, you have a fever (observed effect);
do you think you have the flu or a cold or the malaria? As another example, you observe
some strange shape on an X-ray; is it a cancer or some infection of the tissues? A receiver
gets a particular waveform; did the transmitter send the bit 0 or the bit 1? (Hypothesis
testing is similar.) As you can see, these problems are prevalent in applications.
There are two basic formulations: either we know the prior probabilities of the possible
causes (Bayesian) or we do not (non-Bayesian). When we do not, we can look for the
8.1 Bayesian
Assume that X takes values in a finite set {0, . . . , M }. We know the conditional density or
We want to choose Z in {0, . . . , N } on the basis of Y to minimize E(c(X, Z)) where c(·, ·)
Since E(c(X, Z)) = E(E[c(X, Z)|Y ]), we should choose Z = g(Y ) where
121
122 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
the maximum a posteriori estimate of X given {Y = y}, the most likely value of X given
{Y = y}.
P (X = x)fY |X [y|x]
P [X = x|Y = y] = ,
fY (y)
so that
The common criticism of this formulation is that in many cases the prior distribution
of X is not known at all. For instance, consider designing a burglar alarm for your house.
What prior probability should you use? You suspect a garbage in, garbage out effect here
Instead of choosing the value of X that is most likely given the observation, one can
choose the value of X that makes the observation most likely. That is, one can choose
arg maxx P [Y = y|X = x]. This estimator is called the maximum likelihood estimator of X
Identity (8.1.2) shows that M LE[X|Y = y] = M AP [X|Y = y] when the prior distri-
bution of X is uniform, i.e., when P (X = x) has the same value for all x. Note also that
deeper property of the MLE is that under weak assumptions it tends to be a good estimator
(asymptotically efficient).
8.3. HYPOTHESIS TESTING PROBLEM 123
Consider the problem of designing a fire alarm system. You want to make the alarm as
sensitive as possible as long as it does not generate too many false alarms. We formulate
We consider the case of a simple hypothesis. We define the problem and state the solution
in the form of a theorem. We then examine some examples. We conclude the section by
There are two possible hypotheses H0: X = 0 or H1: X = 1. Should one reject H0 on
One is given the distribution of the observation Y given X. The problem is to choose
to a bound on the probability of false alarm: P [Z = 1|X = 0] ≤ β, for a given β ∈ (0, 1).
the solution of that problem by Z = HT [X|Y ], which means that Z is the solution of the
Discrete Case
Given X, Y has a known p.m.f. P [Y = y|X]. Let L(y) = P [Y = y|X = 1]/P [Y = y|X = 0]
It is not too difficult to show that there is a choice of λ and γ for which [Z = 1|X =
Continuous Case
Given X, Y has a known p.d.f. fY |X [y|x]. Let L(y) = fY |X [y|1]/fY |X [y|0] (the likelihood
ratio).
The solution is
1, if L(y) > λ
Z=
0, if L(y) ≤ λ.
8.3.2 Examples
Example 1
If X = k, Y is exponentially distributed with mean µ(k), for k = 0, 1 where 0 < µ(0) <
µ(1). Here, fY |X [y|x] = κ(x) exp{−κ(x)y} where κ(x) = µ(x)−1 for x = 0, 1. Thus Z =
1{Y > y0 } where y0 is such that P [Z = 1|X = 0] = β4. That is, exp{−κ(0)y0 } = β, or
Example 2
If X = k, Y is Gaussian with mean µ(k) and variance 1, for k = 0, 1 where 0 < µ(0) <
√
µ(1). Here, fY |X [y|x] = K exp{−(x − µ(k))2 /2} where K = 1/ 2π. Accordingly, L(y) =
B exp{x(µ(1) − µ(0))} where B = exp{(µ(0)2 − µ(1)2 )/2}. Thus Z = 1{Y > y0 } where y0
is such that P [Z = 1|X = 0] = β. That is, P (N (µ(0), 1) > y0 ) = β, and one finds the value
Example 3
You flip a coin 100 times and count the number Y of heads. You must decide whether the
coin is fair or biased, say with P (H) = 0.6. The goal is to minimize the probability of
Solution: You can verify that L(y) in increasing in y. Thus, the best decision is Z = 1
Figure 8.1 illustrates P [Z = 1|f air]. For β = 0.001, one finds (using a calculator)
One finds also that P [Y ≥ 58|f air] = 0.066 and P [Y ≥ 59|f air] = 0.043; accordingly, if
β = 0.05 one should decide Z = 1 w.p. 1 if Y >= 59 and Z = 1 w.p. 0.3 if Y = 58. Indeed,
in that case, P [Z = 1|f air] = P [Y ≥ 59|f air] + 0.3P [Y = 58|f air] = 0.043 + 0.3(0.066 −
0.043) = 0.05.
Before we give a formal proof, we discuss an analogy that might help you understand the
structure of the result. Imagine that you have a finite budget with which to buy food items
from a given set. Your objective is to maximize the total number of calories of the items
you buy. Intuitively, the best strategy is to rank the items in decreasing order of calories
per dollar and to buy the items in that order until you run out of money. When you do
that, it might be that you still have some money left after purchasing item n − 1 but not
quite enough to buy item n. In that case, if you could, you would buy a fraction of item n.
8.3. HYPOTHESIS TESTING PROBLEM 127
If you cannot, and if we care only about the expected amount of money you spend, then you
could buy the next item with some probability between 0 and 1 chosen so that you spend all
you money, on average. Now imagine that the items are values of the observation Y when
you decide to sound the alarm. Each item y has a cost P [Y = y|X = 0] in terms of false
of correct detection (the caloric content of the item in our previous example). According
to our intuition, we rank the items y in decreasing order of the reward/cost ratio which is
precisely the likelihood ratio. Consequently, you sound the alarm when the likelihood ratio
exceeds to value λ and you may have to randomize at some item to spend you total budget,
on average.
Let Z be as specified by the theorem and let V be another random variable based on Y ,
possibly with randomization, and such that P [V = 1|X = 0] ≤ β. We want to show that
For the next step, we need the fact that if W is a function of Y , then E[W L|X = 0] =
E[W |Z = 1]. We show this fact in the continuous case. The other cases are similar. We
find
Z Z
fY |X [y|1]
E[W L|X = 0] = W (y)L(y)fY |X [y|0]dy = W (y) f [y|0]dy
fY |X [y|0] Y |X
Z
= W (y)fY |X [y|1]dy = E[W |X = 1],
as we wanted to show.
so that
P [Z = 1 | X = 1] − P [V = 1 | X = 1] ≥ λ(P [Z = 1 | X = 0] − P [V = 1 | X = 0]) ≥ 0
128 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
P [Z = 1 | X = 1] ≥ P [V = 1 | X = 1],
So far we have learned how to decide between two hypotheses that specify the distribution
of the observations. In this section we consider composite hypotheses. Each of the two
alternatives corresponds to a set of possible distributions and we want to decide which set
8.4.1 Example 1
Consider once again Examples 1 and 2 in Section 8.3.2. Note that the optimal decision Z
does not depend on the value of µ(1). Consequently, the optimal decision would be the
H0: µ = µ(0)
The hypothesis H1 is called a composite hypothesis because it does not specify a unique
8.4.2 Example 2
Once again, consider Examples 1 and 2 in Section 8.3.2 but with the hypotheses
H0: µ ≤ µ(0)
Both hypotheses H0 and H1 are composite. We claim that the optimal decision Z is
still the same as in the original simple hypotheses case. To see that, observe that P [Z =
1|µ] = P [Y > y0 |µ] ≤ P [Y < y0 |µ(0)] = β, so that our decision meets the condition that
8.4.3 Example 3
Both examples 8.4.1 and 8.4.2 consider one-sided tests where the values of the parameter µ
under H1 are all larger than those permitted under H0. What about a two-sided test with
H0: µ = µ(0)
H1: µ 6= µ(0).
H0: µ ∈ A
H1: µ ∈ B
In general, optimal tests for such situations do not exist and one resorts to approxima-
tions. We saw earlier that the optimal decisions for a simple hypothesis test is based on
the value of the likelihood ratio L(y), which is the ratio of the densities of Y under the
two hypotheses X = 1 and X = 0, respectively. One might then try to extend this test by
replacing L(y) by the ratio of the two densities under H1 and H0, respectively. How do we
define the density under the hypothesis H1 “µ ∈ B”? One idea is to calculate the MLE of
µ given Y and H1, and similarly for H0. This approach works well under some situations.
However, the details would carry us a bit to far. Interested students will find expositions
of these methods in any good statistics book. Look for the keywords “likelihood ratio test,
goodness-of-fit test”.
130 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
8.5 Summary
The detection story is that X, Y are random variables, X ∈ {0, 1} (we could consider more
values), and we want to guess the value of X based on Y . We call this guess X̂. There are
a few possible formulations. In all cases, we assume that f[Y |X] [y | x] is known. It tells us
how Y is related to X.
8.5.1 MAP
8.5.2 MLE
The MLE is the MAP when the values of X are equally likely prior to any observation.
1, if L(Y ) > λ
X̂ = HT [X | Y ] := 0, if L(Y ) < λ
1 with probability γ, if L(Y ) = λ.
Here,
f[Y |X] [y | 1]
L(y) =
f[Y |X] [y | 0]
8.6. SOLVED PROBLEMS 131
is the likelihood ratio and we have to choose λ ≥ 0 and γ ∈ [0, 1] so that P [X̂ = 1 | X =
0] = β.
If L(Y ) is a continuous random variable, then we don’t have to bother with the case L(Y ) =
1, if L(Y ) > λ
X̂ =
0, if L(Y ) < λ
In some cases, L(Y ) is a continuous random variable that is strictly increasing in Y . Then
the decision is
X̂ = 1{Y ≥ y0 }
P [Y ≥ y0 | X = 0] = α.
Example 8.6.1. Given X, Y = N (0.1 + X, 0.1 + X), for X ∈ {0, 1}. (Model inspired from
a. We find
1 (y − 1.1)2
f[Y |X] [y | 1] = √ exp{− }
2.2π 2.2
132 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
and
1 (y − 0.1)2
f[Y |X] [y | 0] = √ exp{− }.
0.2π 0.2
Hence, X̂ = 1 if f[Y |X] [y | 1]π(1) ≥ f[Y |X] [y | 0]π(0). After some algebra, one finds that
X̂ = 1{y ∈
/ (−0.53, 0.53)}.
Intuitively, if X = 0, then Y =D N (0.1, 0.1) and is likely to be close to zero. On the other
hand, if X = 1, then Y =D N (1.1, 1.1) and is more likely to take large positive value or
negative values.
= 0.4P (|N (0.1, 0.1)| > 0.53) + 0.6P (|N (1.1, 1.1)| < 0.53)
= 0.8P (N (0.1, 0.1) < −0.53) + 1.2P (N (1.1, 1.1) < −0.53)
= 0.8P (N (0, 0.1) < −0.63) + 1.2P (N (0, 1.1) < −1.63)
√ √
= 0.8P (N (0, 1) < −0.63/ 0.1) + 1.2P (N (0, 1) < −1.63/ 1.1)
We find, for i = 0, 1,
so that X̂ = 1 if
i.e., if
1 π(1)λ1
y≤ ln{ } =: y0 .
λ1 − λ0 π(0)λ0
Consequently,
HT [X | Y ] with β = 10−2 .
We have
1 y2 1 (y − 1)2
L(y) = [ √ exp{− }]/[ √ exp{− }].
2π2 4 2π 2
We see that L(y) is a strictly increasing function of y 2 − 4y (not of y!). Thus, L(y) ≥ λ is
X̂ = 1{y 2 − 4y ≥ τ }
P [Y 2 − 4Y ≥ τ |X = 0] = β.
How do we find the value of τ for β = 10−2 ? A brute force approach consists in calculating
We have
λ1 e−λ1 y
L(y) = .
λ0 e−λ0 y
X̂ = 1{Y > y0 }
This decision rule does not depend on the value of λ1 < 1. Accordingly, X̂ solves the
Here,
1{0 ≤ y ≤ 2}
L(y) = .
1{−1 ≤ y ≤ 1}
Thus, L(y) = 0 for y ∈ [−1, 0); L(y) = 1 for y ∈ [0, 1]; and L(y) = ∞ for y ∈ (1, 2]. The
decision is then
1, if Y > 1
X̂ = HT [X | Y ] := 0, if Y < 0
1 with probability γ, if Y ∈ [0, 1].
8.6. SOLVED PROBLEMS 135
We choose γ so that
β = 0.2 = P [X̂ = 1 | X = 0]
1
= γP [Y ∈ [0, 1] | X = 0] + P [Y > 1 | X = 0] = γ ,
2
Example 8.6.6. Pick the point (X, Y ) uniformly in the triangle {(x, y) | 0 ≤ x ≤ 1 and 0 ≤
y ≤ x}.
a. Find the function g : [0, 1] → {0, 0.5, 1} that minimizes E((X − g(Y ))2 ).
b. Find the function g : [0, 1] → < that minimizes E(h(X − g(Y ))) where h(·) is a
function whose primitive integral H(·) is anti-symmetric strictly convex over [0, ∞). For
For each y, we should choose the value g(y) := v ∈ {0, 0.5, 1} that minimizes E[(X − v)2 |
We expect that the minimizing value g(y) of v is nondecreasing in y. That is, we expect
that
0, if y ∈ [0, a)
g(y) = 0.5, if y ∈ [a, b)
1, if y ∈ [b, 1].
136 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
The “critical” values a < b are such that the choices are indifferent. That is,
Substituting the expression for the conditional expectation, these equations become
b. As in previous part,
so that, for each given y, we should choose g(y) to be the value v that minimizes E[h(X −v) |
Y = y]. Now,
Z 1
1 1
E[h(X − v) | Y = y] = h(x − v)dx = [H(1 − v) − H(y − v)].
1−y y 1−y
Now, we claim that the minimizing value of v is v ∗ = (1 + y)/2. To see that, note that for
(1 − v) + (v − y)
H(1−v)−H(y −v) = H(1−v)+H(v −y) > 2H( ) = H(1−v ∗ )+H(v ∗ −y),
2
0) = p ∈ [0, 1].
a. By definition,
Consequently,
M LE[X | Y ] = Y.
b. We know that
Therefore,
M AP [X | Y = 0]
= arg maxx {g(x = 0) = P (X = 0)P (0, 0) = 0.7(1 − p), g(x = 1) = P (X = 1)P (1, 0) = 0.4p}
and
M AP [X | Y = 1]
= arg maxx {h(x = 0) = P (X = 0)P (0, 1) = 0.3(1 − p), h(x = 1) = P (X = 1)P (1, 1) = 0.6p}.
Consequently,
0, if y = 0 and p < 7/11;
1, if y = 0 and p ≥ 7/11;
M AP [X|Y = y] =
0, if y = 1 and p < 1/3;
1, if y = 1 and p ≥ 1/3.
c. We know that
1, if L(y) = P (1, y)/P (0, y) > λ;
X̂ = 1 w.p. γ, if L(y) = P (1, y)/P (0, y) = λ;
0, if L(y) = P (1, y)/P (0, y) < λ.
138 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
We must find λ and γ so that P [X̂ = 1 | X = 0] = β. Note that L(1) = 2 and L(0) = 4/7.
Accordingly,
0, if λ > 2;
1 w.p. P (0, 1)γ = 0.3γ if λ = 2;
P [X̂ = 1 | X = 0] =
1 w.p. P (0, 1) = 0.3 if λ ∈ (4/7, 2);
1 w.p. P (0, 0)γ + P (0, 1) = 0.7γ + 0.3 if λ = 4/7; 1, if λ < 4/7.
To see this, observe that if λ > 2, then L(0) < L(1) < λ, so that X̂ = 0. Also, if λ = 2, then
see that P [X̂ = 1 | X = 0] = P (0, 1)γ. The other cases are similar.
It follows that
λ = 2, γ = β/0.3, if β ≤ 0.3;
Example 8.6.8. Given X, the random variables {Yn , n ≥ 1} are exponentially distributed
fY n |X=2 (y n |X = 2) 1 n 1 Pni=1 yi
L(y n ) = = ( ) e2 .
fY n |X=1 (y n |X = 1) 2
to determine the suitable value of ρ. After Chapter 11 we will be able to use a Gaussian
0) = P (X = +1) = 1/3 and Y is N (0, σ 2 ). Find the function g : < → {−1, 0, +1} that
Let Z = X + Y . Since the prior distribution of X is uniform, we know that the solution
Now,
1 (z − x)2
fZ|X [z|x] = √ exp{− }.
2πσ 2 2σ 2
Hence,
Consequently,
−1, if z ≤ −0.5;
X̂ = 0, if − 0.5 < z < 0.5;
1, if z ≥ 0.5.
140 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
Example 8.6.10. Assume that X is uniformly distributed in the set {1, 2, 3, 4}. When
1 1
f [yy |X = i] := fY |X [yy |i] = exp[− ||yy − vi ||2 ].
2π 2
Y ) = M LE[X|Y
Also, because the prior of X is uniform, we now that g(Y Y ]. That is,
That is, the MAP of X given Y corresponds to the vector vi that is closest to the received
vector Y . This is fairly intuitive, given the shape of the Gaussian density.
where the last inequality follows from the result in Example 7.5.6.
Hence,
P [X̂ = i | X = i] ≥ 1 − exp{−0.5d2i }.
Consequently,
4
X
P [X̂ = X] = P [X̂ = i | X = i]P (X = i)
i=1
4
1X
≥ (1 − exp{−0.5d2i }).
4
i=1
8.6. SOLVED PROBLEMS 141
p p
For instance, assume that the vectors v i are the four corners of the square [− ρ/2, ρ/2]2 .
p
In that case, di = ρ/2 for all i and we find that
P [X̂ = X] ≥ 1 − exp{−0.25ρ}.
Note also that ||vv i ||2 = ρ, so that ρ is the power that the transmitter sends. As ρ increases,
so does our lower bound on the probability of decoding the signal correctly.
This type of simple bound is commonly used in the evaluation of communication systems.
Example 8.6.11. A machine produces steel balls for ball bearings. When the machine
operates properly, the radii of the balls are i.i.d. and N (100, 4). When the machine is
a. You measure n balls produced by the machine and you must raise an alarm if you
believe that the machine is defective. However, you want to limit the probability of false
b. Compute the probability of missed detection that you obtain in part (a). This prob-
ability depends on the number n of balls, so you cannot get an explicit answer. Select the
a. This is a hypothesis test. Let X = 0 when the machine operates properly and X = 1
1, if Pn Y < λ
k=1 k
X̂ =
0, if Pn Yk > λ
k=1
142 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
Xn
1% = P [ Yk < λ|X = 0] = P (N (n × 100, n × 4) < λ)
k=1
λ − 100n
= P (N (0, 4n) < λ − 100n) = P (N (0, 1) < √ ).
2 n
λ − 100n √
√ = −2.3, i.e., λ = 100n − 4.6 n.
2 n
b. We have
Xn
P [X̂ = 0|X = 1] = P [ Yk > λ|X = 1] = P (N (n × 98, n × 4) > λ)
k=1
√
λ − 98n 2n − 4.6 n
= P (N (0, 1) > √ ) = P (N (0, 1) > √ )
2 n 2 n
√
= P (N (0, 1) > n − 2.3).
√
n − 2.3 = 3.1, i.e., n = (2.3 + 3.1)2 ≈ 29.16.
Estimation
The estimation problem is similar to the detection problem except that the unobserved
random variable X does not take values in a finite set. That is, one observes Y ∈ < and
one must compute an estimate of X ∈ < based on Y that is close to X in some sense.
Once again, one has Bayesian and non-Bayesian formulations. The non-Bayesian case
9.1 Properties
E(X), i.e., if its mean is the same as that of X. Recall that E[X | Y ] is unbiased.
If we make more and more observations, we look at the estimator X̂n of X given
(Y1 , . . . , Yn ) : X̂n = gn (Y1 , . . . , Yn ). We say that X̂n is asymptotically unbiased if lim E(X̂n ) =
E(X).
In this section we study a class of estimators that are linear in the observations. They have
143
144 CHAPTER 9. ESTIMATION
Let (X, Y ) be a pair of random variables on some probability space. The linear least
X − a − BY ||2 ).
minimizes E(||X
One has
cov(X, Y )
L[X | Y ] = E(X) + (Y − E(Y )). (9.2.1)
var(Y )
Also, if ΣY is invertible,
Proof:
We provide the proof in the scalar case. The vector case is very similar and we leave the
details to the reader. The key observation is that Z = LLSE[X|Y ] if and only if Z = a+bY
To see why that is the case, assume that Z has those two properties and let V = c + dY
But
so that Z = LLSE[X | Y ].
by c = a and d = b. Setting the derivative of φ(c, d) with respect to c to zero and similarly
∂φ(c, d)
0= |c=a,d=b = 2E(X − a − bY ) = 2E(X − Z)
∂c
and
cov(X, Y )
Z = E(X) + (Y − E(Y )).
var(Y )
Note that these conditions say that the estimation error X − Z should be orthogonal
There are many cases where one keeps on making observations. How do we update the
X̂ = LLSE[X|Y ] = aY ? The answer lies in the observations captured in Figure 9.1 (we
assume all the random variables are zero-mean to simplify the notation).
These ideas lead to the Kalman filter which is a recursive estimator linear in the obser-
vations.
Assume that the joint pdf of X depends on some parameter Θ. To estimate Θ given X
X ). In such a
case, there is no useful information in X about Θ that is not already in T (X
following form:
x|θ] = h(x
fX |Θ [x x)g(T (x
x); θ).
For instance, if given {Θ = θ} the random variables X1 , . . . , Xn are i.i.d. N (θ, σ 2 ), then
one can show that X1 + . . . + Xn is a sufficient statistic for Θ. Knowing a sufficient statistic
X ], M AP [Θ|X
M LE[Θ|X X ], HT [Θ|X
X ], and E[Θ|X
X ] are all functions of T (X
X ). The correspond-
X ].
ing result does not hold for LLSE[Θ|X
9.5 Summary
9.5.1 LSSE
We discussed the linear least squares estimator of X given Y . The formulas are given in
The formulas are the same as those for the conditional expectation when the random
variables are jointly Gaussian. In the non-Gaussian case, the conditional expectation is not
linear and the LLSE is not as close to X as the conditional expectation. That is, in general,
and the inequality is strict unless the conditional expectation happens to be linear, as in
in the observations X that is relevant for estimating some parameter Θ. The necessary and
Example 9.6.1. Let X, Y be a pair of random variables. Find the value of a that minimizes
the variance of X − aY .
Note that
We also know that the values of a and b that minimize E((X − aY − b)2 ) are such that
Example 9.6.2. The random variable X is uniformly distributed in {1, 2, . . . , 100}. You
are presented a bag with X blue balls and 100 − X red balls. You pick 10 balls from the
bag and get b blue balls and r red balls. Explain how to calculate the maximum a posteriori
estimate of X.
Your intuition probably suggests that if we get b blue balls out of 10, there should be
about 10b blue balls out of 100. Thus, we expect the answer to be x = 10b. We verify that
intuition.
Designate by A the event “we got b blue balls and r = 10 − b red balls.” Since the prior
p.m.f. of X is uniform,
Now,
¡x¢¡100−x¢
b
P [A | X = x] = ¡100r¢ .
10
9.6. SOLVED PROBLEMS 149
Hence,
µ ¶µ ¶
x 100 − x x!r!(100 − x − r)!
M AP [X | A] = argmaxx = argmaxx .
b r b!(x − b)!(100 − x)!
x!(90 + b − x)!
α(x) := .
(x − b)!(100 − x)!
Note that
α(x) x(90 + b − x)
= .
α(x − 1) (x − b)(100 − x)
Hence,
i.e.,
Thus, as x increases from b to 100 − r = 90 + b, we see that α(x) increases as long as x ≤ 10b
M AP [X | A] = 10b.
Example 9.6.3. Let X, Y, Z be i.i.d. and with a B(n, p) distribution (that is, Binomial
a. What is var(X)?
c. Calculate L[XY | X + Y ].
a. Since X is the sum of n i.i.d. B(p) random variables, we find var(X) = np(1 − p).
150 CHAPTER 9. ESTIMATION
where
5np(1 − p) np(1 − p) 5 1 1 10 −1
Σ= = np(1−p) , so that Σ−1 = .
np(1 − p) 10np(1 − p) 1 10 np(1 − p) −1 5
c. Similarly,
cov(XY, X + Y )
L[XY | X + Y ] = (np)2 + (X + Y − 2np).
var(X + Y )
Now,
and
Hence,
Finally
2(np)2 (1 − p)
L[XY | X + Y ] = (np)2 + (X + Y − 2np) = np(X + Y ) − (np)2 .
2np(1 − p)
x | θ]) = h(x
fX |Θ ([x x)g(T (x
x); θ). (9.6.1)
9.6. SOLVED PROBLEMS 151
X ) does
a. Show that the identity (9.6.1) holds if and only if the density of X given T (X
not depend on θ.
b. Show that if given θ the random variables {Xn , n ≥ 1} are i.i.d. N (θ, σ 2 ), then
c. Show that if given θ the random variables {Xn , n ≥ 1} are i.i.d. Poisson with mean
X ≈ x , T ≈ t|θ]
P [X h(x x)g(t; θ)dx
x h(x)dx x
X ≈ x | T ≈ t, θ] ≈
P [X ≈R 0 0
≈R 0 0
.
P [T ≈ t|θ] x0 )=t} h(x )g(t; θ)dx
x0 |T (x
{x {x0 |T (x0 )=t} h(x )dx
X ≈ x | T = t, θ] ≈ h(x
P [X xdx
x).
Then,
X ≈ x | θ] = P [X
P [X X ≈ x | T = t, θ]P [T ≈ t | θ] ≈ h(x
x)dx
xP [T ≈ t | θ] = h(x
x)g(T (x
x); θ)dx
x.
X n
1
x | θ] =
fX |θ [x 2 n/2
exp{− (xi − θ)2 /2σ 2 }
(2πσ ) i=1
Xn Xn
1 2
=[ exp{− xi }] × [exp{(−2 xi + nθ)2 /2σ 2 }] = h(x
x)g(T (x
x); θ)
(2πσ 2 )n/2 i=1 i=1
Pn
x) =
where T (x i=1 xi ,
X n
1
x) =
h(x exp{− x2i }, and
(2πσ 2 )n/2 i=1
x) + nθ)2 /2σ 2 }.
x); θ) = exp{(−2T (x
g(T (x
152 CHAPTER 9. ESTIMATION
P
θ xi θ i xi
x | θ] =
fX |θ [x Πni=1 [ exp{−θ}] = x)g(T (x
exp{−nθ} = h(x x); θ)
xi ! Πi xi !
Pn
x) =
where T (x i=1 xi ,
(d) Assume that, given θ, X1 and X2 are i.i.d. N (θ, θ). Then, with X = (X1 , X2 ) and
x) = x1 + x2 ,
T (x
x)g(T (x
This function cannot be factorized in the form h(x x); θ).
Example 9.6.5. Assume that {Xn , n ≥ 1} are independent and uniformly distributed in
where
5
a = E[V1 ] =
2
and,
−1 −1
31 8 1
b V ar(V2 ) cov(V2 , V3 ) cov(V1 , V2 ) 0.0533
= = 180 3 6 =
8 5 1
c cov(V2 , V3 ) V ar(V3 ) cov(V1 , V3 ) 3 12 6 0.0591
5
Hence, L[V1 |V2 , V3 ] = 2 + 0.0533(V2 − 56 ) + 0.0591(V3 − 32 ).
9.6. SOLVED PROBLEMS 153
Example 9.6.6. Let the point (X, Y ) be picked randomly in the quarter circle {(x, y) ∈
<2+ | x2 + y 2 ≤ 1}.
a. Find L[X | Y ].
b. Find L[X 2 | Y ].
Z √1−y2 p
4 4 1 − y2
fY (y) = dx =
0 π π
Z 1
p Z π/2
2 24 1 − y2 4
E(Y ) = y dx = cos2 (θ)sin2 (θ)dθ
0 π 0 π
Z π/2 Z π/2
1 1
= sin2 (2θ)dθ = (1 − cos(4θ))dθ
0 π 0 2π
1 sin(4θ) π/2 1
= [ (x − )]0 = .
2π 4 4
Z 1
p Z 0
1 − y2 4y −4 2
E(Y ) = dy = z dz
0 π 1 π
−4z 3 0 4
= [ ] = .
3π 1 3π
1 16 4
Hence var(Y ) = 4 − 9π 2
= 0.0699. Similarly, E[X] = 3π .
Z √
1Z Z 11−x2
4 4
cov(XY ) = xy dydx = x(1 − x2 )dx
0 0 π 0 2π
4 x2 x4 1 1
= [ ( − )]0 = = 0.1592.
2π 2 4 2π
154 CHAPTER 9. ESTIMATION
Z √
1Z Z 1
1−x2
2 4 2 4 2
cov[X Y ] = x y dydx = x (1 − x2 )dx
0 0 π 0 2π
4 x3 x5 1 4
= [ ( − )]0 = = 0.0849.
2π 3 5 15π
cov(XY ) 0.1592
L[X|Y ] = (Y −E[Y ])+E[X] = (Y −0.4244)+0.4244 = 2.2775(Y −0.4244)+0.4244.
var(Y ) 0.0699
(b) Similarly,
cov(X 2 Y ) 0.0849
L[X 2 |Y ] = (Y −E[Y ])+E[X 2 ] = (Y −0.4244)+0.25 = 1.2146(Y −0.4244)+0.25.
var(Y ) 0.0699
Example 9.6.7. Let X, Y be independent random variables uniformly distributed in [0, 1].
Calculate L[Y 2 | 2X + Y ].
One has
Example 9.6.8. Let {Xn , n ≥ 1} be independent N (0, 1) random variables. Define Yn+1 =
0}. Calculate
E[Yn+m |Y0 , Y1 , . . . , Yn ]
for m, n ≥ 0.
Hint: First argue that observing {Y0 , Y1 , . . . , Yn } is the same as observing {Y0 , X1 , . . . , Xn }.
Second, get an expression for Yn+m in terms of Y0 , X1 , . . . , Xn+m . Finally, use the inde-
One has
...
Hence,
E[Yn+m | Y0 , Y1 , . . . , Yn ] = am Yn .
Example 9.6.9. Given {Θ = θ}, the random variables {Xn , n ≥ 1} are i.i.d. U [0, θ].
1
fX|Θ [x | θ]fΘ (θ) = 1{xk ≤ θ, k = 1, . . . , n}λe−λθ .
θn
Hence,
θ̂n = max{X1 , . . . , Xn }.
Consequently, by symmetry,
1
E[Θ − θ̂n |θ] = θ.
n+1
Finally,
1
E(|Θ − θ̂n |) = E(E[Θ − θ̂n |θ])) = .
λ(n + 1)
A few words about the symmetry argument. Consider a circle with a circumference
By symmetry, the average distance between two points is 1/(n + 1). Pick any one point and
open the circle at that point, calling one end 0 and the other end 1. The other n points are
distributed independently and uniformly on [0, 1]. So, the average distance between 1 and
Example 9.6.10. Let X, Y be independent random variables uniformly distributed in [0, 1].
a. Once again, we draw a unit square. Let R2 = X 2 + Y 2 . Given {R = r}, the pair
(X, Y ) is uniformly distributed on the intersection of the circumference of the circle with
If r < 1, then this intersection is the quarter of the circumference and we must calculate
Z π/2
2 2 π/2 2r
E[X|R = r] = E(r cos(θ)) = r cos(x) dx = [r sin(x) ]0 = .
0 π π π
If r > 1, then θ is uniformly distributed in [θ1 , θ2 ] where cos(θ1 ) = 1/r and sin(θ2 ) = 1/r.
Hence,
Z θ2
1
E[X|R = r] = E(r cos(θ)) = r cos(x) dx
θ1 θ2 − θ1
1 sin(θ2 ) − sin(θ1 )
= [r sin(x) ]θ2 =r
θ2 − θ1 θ1 θ2 − θ1
√
1 − r2 − 1
= .
sin−1 (1/r) − cos−1 (1/r)
b. Let V = X 2 + Y 2 . Then,
cov(X, V )
L[X|V ] = E(X) + (V − E(V )).
var(V)
9.6. SOLVED PROBLEMS 157
In addition,
8
var(V) = E(V 2 ) − (E(V ))2 = E(X 4 + 2X 2 Y 2 + Y 4 ) − (2/3)2 =
45
and
2
E(V ) = .
3
Consequently,
1 1/12 2 3 15
L[X|V ] = + (V − ) = + (X 2 + Y 2 ).
2 8/45 3 16 32
Yi = si X + Wi , i = 1, . . . , n
where W1 , . . . , Wn are independent N (0, 1) and X takes the values +1 and −1 with equal
known deterministic signal. Determine the MAP rule for deciding on X based on Y =
(Y1 , . . . , Yn ).
fY |X [yy | + 1]
L(yy ) = .
fY |X [yy | − 1]
Now,
1 1
fY |X [yy |X = x] = ΠN
i=1 [ √ exp{− (yi − si x)2 }].
2π 2
158 CHAPTER 9. ESTIMATION
Accordingly,
P
exp{− 12 ni=1 (yi − si )2 }
L(yy ) = P
exp{− 12 ni=1 (yi + si )2 }
n n
1X 2 1X
= exp{− (yi − si ) + (yi + si )2 }}
2 2
i=1 i=1
n
X
= exp{2 si yi } = exp{2yy ¦ ss}
i=1
where
n
X
y ¦ s := si yi .
i=1
Consequently,
+1, if y ¦ s > 0;
X̂ =
−1, if y ¦ s ≤ 0.
Y = gX + W
where X and W are independent zero-mean Gaussian random variables with respective
2 and σ 2 . Find L[X | Y ] and the resulting mean square error.
variances σX W
Yi = gi X + Wi , i = 1, 2
L[X | Y1 , Y2 ].
a. We know that
cov(X, Y ) cov(X, Y )
X̂ := L[X | Y ] = E(X) + (Y − E(Y )) = Y.
var(Y ) var(Y )
and
Hence,
2
gσX
L[X | Y ] = 2 + σ 2 Y.
g 2 σX W
cov(X, Y )
E((X̂ − X)2 ) = E(( Y − X)2 )
var(Y )
cov2 (X, Y ) cov(X, Y )
= var(Y ) − 2 cov(X, Y ) + var(X)
var2 (Y ) var(Y )
cov2 (X, Y )
= var(X) − .
var(Y )
Hence,
g 2 σX
4 2 σ2
σX
E((X̂ − X)2 ) = σX
2
− 2 + σ2 = W
2 + σ2 .
g 2 σX W g 2 σX W
where
Y T ) = E(X(Y1 , Y2 )) = σX
ΣXYY = E(XY 2
(g1 , g2 )
and
Y12 Y1 Y2 g12 σX
2 + σ2
W
2
g1 g2 σX
YY T) = E
ΣY = E(Y = .
Y2 Y1 Y22 2
g1 g2 σX g22 σX
2 + σ2
W
Hence,
−1
g12 σX
2+ σW 2 2
g1 g2 σX
2
X̂ = σX (g1 , g2 ) Y
2
g1 g2 σX g22 σX
2 + σ2
W
1 g22 σX
2 + σ2
W
2
−g1 g2 σX
2
= σX (g1 , g2 ) Y
g12 σX
2 σ2 + g2σ2 σ2 + σ4
W 2 X W W
2
−g1 g2 σX g12 σX
2 + σ2
W
σX2
= Y
2 + g 2 σ 2 + σ 2 (g1 , g2 )Y .
g12 σX 2 X W
160 CHAPTER 9. ESTIMATION
Thus,
σX2
X̂ = 2 + g 2 σ 2 + σ 2 (g1 Y1 + g2 Y2 ).
g12 σX 2 X W
Example 9.6.13. Suppose we observe Yi = 3Xi +Wi where W1 , W2 are independent N (0, 1)
1 1
X = x 1 := (1, 0)) = , P (X
P (X X = x 2 := (0, 1)) = ,
2 6
1 1
X = x 3 := (−1, 0)) = , and P (X
P (X X = x 4 := (0, −1)) = .
12 4
We know that
X | Y = y ] = arg maxP (X
M AP [X X = x )fY |X y | x ].
X [y
Now,
1 1
fY |X y | x] =
X [y exp{− [(y1 − 3x1 )2 + (y2 − 3x2 )2 ]}.
2π 2
Hence,
M AP [X x||2 − 2 ln(P (X
X | Y = y ] = arg min{||yy − 3x X = x )} =: arg min c(yy , x ).
Note that
y ¦ z := y1 z1 + y2 z2
and
3 i 2 1
X = x i ), for i = 1, 2, 3, 4.
x || − ln(P (X
αi = ||x
2 3
9.6. SOLVED PROBLEMS 161
y2
^ c1 < c 2
X = (0, 1) c1 < c 3
c4 < c 3
^ ^
X = (1, 0)
X^ = (-1, 0)
c2 < c 4
y1
= 0.1
c2 < c 3
^ = (0, -1)
X c1 < c 4
Thus, for every i, j ∈ {1, 2, 3, 4} with i 6= j, there is a line that separates the points y where
c(yy , x i ) < c(yy , x j ) from those where c(yy , x i ) > c(yy , x j ). These lines are the following:
The figure allows us to identify the regions of <2 that correspond to each of the values
162 CHAPTER 9. ESTIMATION
Random behaviors often become more tractable as some parameter of the system, such as
the size or speed, increases. This increased tractability comes from statistical regularity.
When many sources of uncertainty combine their effects, their individual fluctuations may
compensate one another and the combined result may become more predictable. For in-
stance, if you flip a single coin, the outcome is very unpredictable. However, if you flip a
large number of them, the proportion of heads is less variable. To make precise sense of
these limiting behaviors one needs to define the limit of random variables.
and X are functions. Thus, it takes some care to define the convergence of functions. For
the same reason, the meaning of “Xn is close to X” requires some careful definition.
We explain that there are a number of different notions of convergence of random vari-
ables, thus a number of ways of defining that Xn approaches X. The differences between
these notions may seem subtle at first, however they are significant and correspond to very
with convergence in distribution, then explain how to use transform methods to prove that
type of convergence. We then discuss almost sure convergence and convergence in proba-
bility and in L2 . We conclude with a discussion of the relations between these difference
163
164 CHAPTER 10. LIMITS OF RANDOM VARIABLES
Looking ahead, the strong law of large numbers is an almost sure convergence result;
the weak law is a convergence in probability result; the central limit theorem is about
Intuitively we can say that the random variables X and Y are similar if their cdf are about
the same, i.e., if P (X ≤ x) ≈ P (Y ≤ x) for all x ∈ <. For example, we could say that X is
almost a standard Gaussian random variable if this approximation holds when Y = N (0, 1).
It is in this sense that one can show that many random variables that occur in physical
The random variables Xn are said to converge in distribution to the random variable X,
and we write Xn →D X, if
lim FXn (x) = FX (x), for all x ∈ < such that FX (x) = FX (x−). (10.1.1)
n→∞
The restriction FX (x) = FX (x−) requires some elaboration. Consider the random
be able to say that Xn approaches X in distribution. You see that FXn (x) → FX (x) for all
x 6= 1. However, FXn (1) = 0 for all n and FX (1) = 1. The restriction in (10.1.1) takes care
of such discontinuity.
With this definition, you can check that if Xn and X are discrete with P (Xn = xm,n ) =
and limn→∞ xm,n = xm for n ≥ 1. In other words, the definition is conform to our intuition:
10.2 Transforms
here. (See the Central Limit Theorem in Section 11.3 for another example.) For n ≥ 1 and
p > 0, let X(n, p) be binomial with the parameters (n, p). That is,
µ ¶
n m
P (X(n, p) = m) = p (1 − p)n−m , m = 0, 1, . . . , n.
m
with mean λ. We do this by showing that E(z X(n,p) ) → E(z X ) for all complex numbers
z. These expected values are the z-transforms of the probability mass functions of the
random variables. One then invokes a theorem that says that if the z-transforms converge,
then so do the probability mass functions. In these notes, we do the calculation and we
accept the theorem. Note that X(n, p) is the sum of n i.i.d. random variables that are 1
with probability p and zero otherwise. If we designate one such generic random variable by
V (p), we have
E(z X(n,p) ) = (E(z V (p) ))n = ((1 − p) + pz)n → (1 + λ(z − 1)/n)n → exp{λ(z − 1)},
by (??). Also,
X (λ)n
E(z X ) = zn exp{−λ} = exp{λ(z − 1)}.
n
n!
166 CHAPTER 10. LIMITS OF RANDOM VARIABLES
A strong form of convergence is when the real numbers Xn (ω) approach the real number
X(ω) for all ω. In that case we say that the random variables Xn converge to the random
The expression “for almost all ω” means for all ω except possibly for a set of ω with
probability zero.
Anticipating future results, look at the fraction Xn of heads as you flip a fair coin n times.
You expect Xn to approach 1/2 for every possible realization of this random experiment.
However, the outcome where you always gets heads is such that the fraction of heads does
not go to 1/2. In fact, there are many conceivable sequences of coin flips where Xn does not
approach 1/2. However, all these sequences together have probability zero. This example
shows that it would be silly to insist that Xn (ω) → X(ω) for all ω. This condition would
Almost sure convergence is a very strong result. It states that the sequence of real
numbers Xn (ω) approaches the real number X(ω) for every possible outcome of the random
experiment. By possible, we mean here “outside of a set of probability zero.” Thus, as you
perform the random experiment, you find that the numbers Xn (ω) approach the number
X(ω).
The following observation is very useful in applications. Assume that Xn →a.s. X and
that g : < → < is a continuous function. Note that Xn (ω) → X(ω) implies by continuity
10.3. ALMOST SURE CONVERGENCE 167
that g(Xn (ω)) → g(X(ω)). Accordingly, we find that the random variables g(Xn ) converge
10.3.1 Example
A common technique to prove almost sure convergence is to use the Borel-Cantelli Lemma
To prove that fact, fix ² > 0. Note that, by Chebyshev’s inequality (4.8.1),
var(Xn ) 1
P (|Xn | ≥ ²) ≤ 2
≤ 2 2 , for n ≥ 1.
² n ²
Consequently,
∞
X ∞
X 1
P (|Xn | ≥ ²) ≤ < ∞.
n ²2
2
n=1 n=1
This almost shows that Xn →a.s. 0. Now, to be technically precise, we should show
such that |Xn (ω)| ≤ ² whenever n ≥ 0. What is missing in our derivation is that the set A
may depend on ². To fix this problem, let An be the set A that corresponds to ² = 1/n and
P
choose A = ∪n An . Then P (A) ≤ n P (An ) = 0 and that set has the required property.
168 CHAPTER 10. LIMITS OF RANDOM VARIABLES
It may be that Xn and X are more and more likely to be close as n increases. In that case,
we say that the random variables Xn converge in probability to the random variable X.
if
and almost sure convergence. Let us try to clarify the difference. If Xn →P X, there is no
Here is an example that illustrates the point. Let {Ω, F, P } be the interval [0, 1) with
the uniform distribution. The random variables {X1 , X2 , . . .} are equal to zero everywhere
except that they are equal to one on the intervals [0, 1), [0, 1/2), [1/2, 1), [0, 1/4), [1/4, 2/4),
[2/4, 3/4), [3/4, 1), [0, 1/8), [1/8, 2/8), [2/8, 3/8), [3/8, 4/8), [4/8, 5/8), . . ., respectively.
Thus, X1 = 1, X2 (ω) = 1{ω ∈ [0, 1/2)}, X9 (ω) = 1{ω ∈ [1/8, 2/8)}, and so on. From this
definition we see that P (Xn 6= 0) → 0, so that Xn →P 0. Moreover, for every ω < 1 the
sequence {Xn (ω), n ≥ 1} contains infinitely many ones. Accordingly, Xn does not converge
Thus it is possible for the probability of An := {|Xn − X| > ²} to go to zero but for
those sets to keep on “scanning” Ω and, consequently, for |Xn (ω) − X(ω)| to be larger than
Imagine that you simulate the sequence {Xn , n ≥ 1} and X. If the sequence that you
observe from your simulation run is such that Xn (ω) 9 X(ω), then you can conclude that
10.5. CONVERGENCE IN L2 169
10.5 Convergence in L2
Another way to say that X and Y are close to each other is when E(|X − Y |2 ) is small.
definition.
This definition has an intuitive meaning: the error becomes small in the mean squares
sense. If you recall our discussion of estimators and the interpretation of MMSE and LLSE
in a space with the metric based on the mean squared error, then convergence in L2 is
precisely convergence in that metric. Not surprisingly, this notion of convergence is well
10.6 Relationships
All these convergence notions make sense. How do they relate to one another? Figure 10.1
provided a summary.
The discussion below of these implications should help you clarify your understanding
We explained above that convergence in probability does not imply almost sure conver-
gence. We now prove that the opposite implication is true. That is, we prove the following
theorem.
170 CHAPTER 10. LIMITS OF RANDOM VARIABLES
Theorem 10.6.1. Let Xn , n ≥ 1 and X be random variable defined on the same probability
Proof:
Assume that Xn →a.s. X. We show that for all ² > 0, P (An ) → 0 where An := {ω |
To do this, we define Bn = ∪∞
m=n Am . Note that
Bn ↓ B := ∩∞ ∞ ∞
n=1 Bn = ∩n=1 ∪m=n An = {ω | ω ∈ An for infinitely many values of n}.
Recapitulating, the idea of the proof is that if Xn (ω) → X(ω), then this ω is not in An
for all n large enough. Therefore, the only ω’s that are in B must be those where Xn does
not converge to X, and these ω’s have probability zero. The consideration of Bn instead of
An is needed because the sets An may not decrease even though they are contained in the
Theorem 10.6.2. Let Xn , n ≥ 1 and X be random variable defined on the same probability
Proof:
E(|Xn − X|2 )
P (|Xn − X| > ²) ≤ .
²2
This inequality shows that if Xn →L2 X, then P |Xn − X| > ²) → 0 for any given ² > 0,
so that Xn →P X. ¤
It is useful to have a simple example in mind that shows that convergence in probability
does not imply convergence in L2 . Such an example is as follows. Let {Ω, F, P } be [0, 1] with
the uniform probability. Define Xn (ω) = n × 1{ω ≤ 1/n}. You can see that Xn →a.s. 0.
This example also shows that Xn →a.s. X does not necessarily imply that E(Xn ) →
We explain in the next section that some simple sufficient condition guarantee the equality
We turn our attention to the last implication in Figure 10.1. We show the following
result.
Proof:
x + ²). If ² is small enough, both side of this inequality are close to P (X ≤ x), so that we
Fix α > 0. First, find ² > 0 so that P (X ∈ [x − ², x + ²]) ≤ α/2. This is possible because
possible because Xn →P X.
Observe that
Also,
n ≥ n0 ,
so that
P (Xn ≤ x) ≤ FX (x) + α.
so that
These two inequalities imply that |FX (x) − P (Xn ≤ x)| ≤ α for n ≥ n0 . Hence Xn →D X.
Assume Xn →as X. We gave an example that show that it is generally not the case that
E(Xn ) → E(X). (See 10.6.1).) However, two simple sets of sufficient conditions are known.
10.7. CONVERGENCE OF EXPECTATION 173
b. Assume Xn →as X and |Xn | ≤ Y for all n with E(Y ) < ∞. Then E(Xn ) → E(X).
We refer the reader to probability textbooks for a proof of this result that we use in
examples below.
174 CHAPTER 10. LIMITS OF RANDOM VARIABLES
Chapter 11
We started the course by saying that, in the long term, about half of the flips of a fair
coin yield tail. This is our intuitive understanding of probability. The law of large number
explains that our model of uncertain events conforms to that property. The central limit
theorem tells us how fast this convergence happens. We discuss these results in this chapter.
Let {Xn , n ≥ 1} be i.i.d. random variables with mean µ and finite variance σ 2 . Let also
Yn = (X1 + · · · + Xn )/n be the sample mean of the first n random variables Then
Yn →P µ as n → ∞.
Proof:
X1 + · · · + Xn 1 X1 + · · · + Xn
P (| − µ| ≥ ²) ≤ 2 E(( − µ)2 ).
n ² n
175
176 CHAPTER 11. LAW OF LARGE NUMBERS & CENTRAL LIMIT THEOREM
Now, E( X1 +···+X
n
n
− µ) = 0, so that
X1 + · · · + Xn X1 + · · · + Xn
E(( − µ)2 ) = var( ).
n n
We know that the variance of a sum of independent random variables is the sum of their
X1 + · · · + Xn 1 1 σ2
var( ) = 2 var(X1 + · · · + Xn ) = 2 × nσ 2 = .
n n n n
X1 + · · · + Xn σ2
P (| − µ| ≥ ²) ≤ → 0 as n → ∞.
n n
Hence,
X1 + · · · + Xn
→P 0 as n → ∞.
n
Theorem 11.2.1. Strong Law of Large Numbers Let {Xn , n ≥ 1} be i.i.d. random variables
with mean µ and finite variance σ 2 . Let also Yn = (X1 + · · · + Xn )/n be the sample mean
Yn →a.s. µ as n → ∞.
The proof of this result is a bit too technical for this course.
11.3. CENTRAL LIMIT THEOREM 177
The next result estimates the speed of convergence of the sample mean to the expected
value.
Theorem 11.3.1. CLT Let {Xn , n ≥ 1} be i.i.d. random variables with mean µ and finite
variance σ 2 . Then
√
Zn := (Yn µ) n →D N (0, σ 2 ) as n → ∞.
√
This result says (roughly) that the error Yn µ is of order σ/ n when n is large. Thus,
if one makes four times more observations, the error on the mean estimate is reduced by a
factor of 2.
Equivalently,
√
n
(Yn µ) →D N (0, 1) as n → ∞.
σ
Proof:
2 /2
Here is a rough sketch of the proof. We show that E(eiuZn ) → e−u and we then
invoke a theorem that says that if the Fourier transform of fZn converge to that of fV , then
Zn →D V .
We find that
1
E(eiuZn ) = E(exp{iu √ ((X1 − µ) + · · · + (Xn − µ))})
σ n
1
= [E(exp{iu √ (X1 − µ)})]n
σ n
1 1 1
≈ [1 + iu √ E(X1 − µ) + (iu √ )2 E((X1 − µ)2 )]n
σ n 2 σ n
2
u n 2
≈ [1 − ] ≈ e−u /2 ,
2n
by (??).
The formal proof must justify the approximations. In particular, one must show that one
can ignore all the terms of order higher than two in the Taylor expansion of the exponential.
¤
178 CHAPTER 11. LAW OF LARGE NUMBERS & CENTRAL LIMIT THEOREM
In the CLT, one must know the variance of the random variables to estimate the convergence
rate of the sample mean to the mean. In practice it is quite common that one does not
Theorem 11.4.1. Approximate CLT Let {Xn , n ≥ 1} be i.i.d. random variables with mean
µ and such that E(Xnα ) < ∞ for some α > 2. Let Yn = (X1 + · · · + Xn )/n, as before. Then
√
n
(Yn µ) →D N (0, 1) as n → ∞
σn
where
(X1 − Yn )2 + · · · + (Xn − Yn )2
σn2 = .
n
That result, which we do not prove here, says that one can replace the variance of the
Using the approximate CLT, we can construct a confidence interval about our sample mean
estimate. Indeed, we can say that when n gets large, (see Table 7.1)
√
n
P (|(Yn − µ) | > 2) ≈ 5%,
σn
so that
σn σn
P (µ ∈ [Yn − 2 √ , Yn + 2 √ ]) ≈ 95%.
n n
We say that
σn σn
[Yn − 2 √ , Yn + 2 √ ]
n n
Similarly,
σn σn
[Yn − 2.6 √ , Yn + 2.6 √ ]
n n
1.6σn 1.6σn
[µn − √ , µn + √ ] is the 90% confidence interval for µ.
n n
11.6 Summary
The Strong Law of Large Numbers (SLLN) and the Central Limit Theorem (CLT) are very
The SLLN says that the sample mean Yn of n i.i.d. random variables converges to their
√
expected value. The CLT shows that the error multiplied by n is approximately Gaussian.
σn σn
[Yn − α √ , Yn − α √ ]
n n
where σn is the sample estimate of the standard deviation. The probability that the mean
is in that interval is approximately P (|N (0, 1)| ≤ α). Using Table 7.1 one can select the
Example 11.7.1. Let {Xn , n ≥ 1} be independent and uniformly distributed in [0, 1].
sin(X1 )+···+sin(Xn )
a. Calculate limn→∞ n .
Remember that the Strong Law of Large Numbers says that that if Xn , n ≥ 1 are i.i.d
c. Note that
sin(X1 ) + · · · + sin(Xn )
| | ≤ 1.
n
Consequently, the result of part (a) and Theorem 10.7.1 imply that
sin(X1 ) + · · · + sin(Xn )
lim E( ) = 0.4597.
n→∞ n
X1 + · · · + Xn
lim E(sin( )) = 0.5.
n→∞ n
Example 11.7.2. Let {Xn , n ≥ 1} be i.i.d. N (0, 1). What can you say about
as n → ∞?
Example 11.7.3. Let {Xn , n ≥ 1} be i.i.d. with mean µ and finite variance σ 2 . Let
X1 + · · · + Xn (X1 − µn )2 + · · · + (Xn − µn )2
µn := and σn2 = . (11.7.1)
n n
Show that
σn2 →a.s. σ 2 as n → ∞.
By the SLLN, µn →a.s. µ. You can then show that σn2 − s2n →a.s. 0 where
Example 11.7.4. We want to poll a population to estimate the fraction p of people who
will vote for Bush in the next presidential election. We want to to find the smallest number
of people we need to poll to estimate p with a margin of error of plus or minus 3% with 95%
confidence.
For simplicity it is assumed that the decisions Xn of the different voters are i.i.d. B(p).
2σn 2σn
[µn − √ , µn + √ ] is the 95% confidence interval for µ.
n n
We also know that the variance of the random variables is bounded by 1/4. Indeed,
1 1
[µn − √ , µn + √ ] contains the 95% confidence interval for µ.
n n
182 CHAPTER 11. LAW OF LARGE NUMBERS & CENTRAL LIMIT THEOREM
1
√ ≤ 3%,
n
i.e.,
1 2
n≥( ) = 1112.
0.03
and p1 = 0.55. Let α = 0.5% and X̂ = HT [X | Y ]. Find the value of n needed so that
P [X̂ = 0 | X = 1] = α. Use the CLT to estimate the probabilities. (Note that X̂ is such
√ Y1 + · · · + Yn
Zn := n[ − 0.5] ≈ N (0, 1/4).
n
Then,
√ y0
P (Y1 + · · · + Yn > y0 ) = P (Zn > n( − 0.5))
n
1 √ y0 √ y0
≈ P (N (0, ) > n( − 0.5))) = P (N (0, 1) > 2 n( − 0.5)).
4 n n
We need
√ y0
2 n( − 0.5)) ≥ 2.6
n
for this probability to be 0.5%. Thus,
√
y0 = 0.5n + 1.3 n.
Then
By CLT,
√ W1 + · · · + Wn
Un := n[ − 0.55] ≈ N (0, 1/4).
n
(We used the fact that 0.55(1 − 0.55) ≈ 1/4.)
Thus,
√ y0
P (W1 + · · · + Wn < y0 ) = P (Un < n[ − 0.55])
n
1 √ y0 √ y0
≈ P (N (0, ) < n[ − 0.55]) = P (N (0, 1) < 2 n[ − 0.55])
4 n √ n
√ 0.5n + 1.3 n
= P (N (0, 1) < 2 n[ − 0.55]).
n
Example 11.7.6. Let {Xn , n ≥ 1} be i.i.d. random variables that are uniformly distributed
√ X1 + · · · + Xn 1
lim P ( n| − | > ²)
n→∞ n 2
√ X1 + · · · + Xn 1
n| − | →D N (0, σ 2 )
n 2
where
1 1 1
σ 2 = var(X1 ) = E(X12 ) − (E(X1 ))2 = − = .
3 4 12
184 CHAPTER 11. LAW OF LARGE NUMBERS & CENTRAL LIMIT THEOREM
Hence,
√ X1 + · · · + Xn 1 1
P ( n| − | > ²) ≈ P (|N (0, )| > ²).
n 2 12
Now,
1 1
P (|N (0, )| > ²) = P (| √ N (0, 1)| > ²)
12 12
√ √ √
= P (|N (0, 1)| > ² 12) = 2P (N (0, 1) > ² 12) = 2Q(² 12).
Consequently,
√ X1 + · · · + Xn 1 √
lim P ( n| − | > ²) = 2Q(² 12).
n→∞ n 2
Example 11.7.7. Given X, the random variables {Yn , n ≥ 1} are exponentially distributed
We choose ρ so that
Xn
P[ Yi > ρ | X = 1] = β.
i=1
√
Let us write ρ = n + α n and find α so that
Xn
√
P[ Yi > n + α n | X = 1] = β.
i=1
Equivalently,
Xn
√
P [( Yi − n)/ n > α | X = 1] = β.
i=1
11.7. SOLVED PROBLEMS 185
with mean p ∈ (0, 1). We construct an estimate p̂n of p from {X1 , . . . , Xn }. We know that
|p̂n − p|
P( ≤ 5%) ≥ 95%.
p
For Bernoulli Random Variables we have E[Xn ] = p and var[Xn ] = p(1 − p). Let
p̂n = (X1 + · · · + Xn )/n. We know that p̂n →as p. Moreover, from the CLT,
√ p̂n − p
np →D N (0, 1).
p(1 − p)
Now,
√
|p̂n − p| √ p̂n − p np
≤ 0.05 ⇔ | n p | ≤ 0.05 √ .
p p(1 − p) 1−p
Hence, for n large,
√
|p̂n − p| 0.05 × np
P( ≤ 0.05) ≈ P (|N (0, 1)| ≤ √ ).
p 1−p
Using (7.1) we find P (|N (0, 1)| ≤ 2) ≥ 0.95. Hence, P ( |p̂np−p| ≤ 0.05) if
√
0.05 × np
√ ≥ 2,
1−p
i.e.,
1−p
n ≥ 1600 =: n0 .
p
Since we know that p ∈ [0.4, 0.6], the above condition implies 1067 ≤ n0 ≤ 2400. Hence
Example 11.7.9. Given X, the random variables {Yn , n ≥ 1} are independent and Bernoulli(X).
the estimator that minimizes P [X̂ = 0.5|X = 0.6] subject to P [X̂ = 0.6|X = 0.5] ≤ 5%.
Using the CLT, estimate the smallest value of n so that P [X̂ = 0.5|X = 0.6] ≤ 5%.
H0 : X = 0.5
H1 : X = 0.6.
Pn
where k = i=1 yi is the number of heads in n Bernoulli trials. Note that L(yy ) is an
Xn
0.05 = P [X̂ = 0.6|X = 0.5] = P ( Yi ≥ k0 ).
i=1
Using (7.1) we find P (N (0, 1) ≥ 1.7) ≈ 0.05. Hence, we need to find k0 such that
r
n 1
( k0 − 0.5) ≈ 1.7. (11.7.2)
0.25 n
Hence, for n large, using (7.1) we find P [X̂ = 0.5|X = 0.6] ≤ 5% is equivalent to
r
n 1
( k0 − 0.6) ≈ −1.7. (11.7.3)
0.24 n
11.7. SOLVED PROBLEMS 187
Ignoring the difference between 0.24 and 0.25, we find that the two equations (11.7.2)-
(11.7.3) imply
1 1
k0 − 0.5 ≈ −( k0 − 0.6),
n n
1
which implies n k0 ≈ 0.55. Substituting this estimate in (11.7.2), we find
r
n
(0.55 − 0.5) ≈ 1.7,
0.25
Example 11.7.10. The number of days that a certain type of component functions before
Once the component fails it is immediately replaced by another one of the same type. If
P
we designate by Xi the lifetime of the ith component to be put to use, then Sn = ni=1 Xi
n
r = lim .
n→∞ Sn
b. Use the CLT to determine how many components one would need to have to be
n Sn Sn −1 2
r = lim = lim [ ]−1 = [ lim ] = [ ]−1 = 1.5 failures per day.
n→∞ Sn n→∞ n n→∞ n 3
(Note: In the third identity we can interchange the function g(x) = x−1 and the limit
b. Since there are about 1.5 failures per day we will need a few more than 365×1, 5 = 548
components. We use the CLT to estimate how many more we need. You can verify that
var(X) = 1/18. Let Z =D N (0, 1). We want to find k so that P (Sk > 365) ≥ 0.95. Now,
365 − k(2/3)
p ≤ −1.7,
k/18
or equivalently,
2 p 2 1
k − 365 ≥ 1.7 k/18, i.e., (k − 365)2 ≥ (1.7)2 k ,
3 3 18
or
4 2 4 1 4 4 1
k − × 365k + (365)2 ≥ (1.7)2 k , i.e., k 2 − [ × 365 + (1.7)2 ]k + (365)2 ≥ 0.
9 3 18 9 3 18
That is,
This implies
p
486.6 + (486.6)2 − 4 × 0.444 × 133, 225
k≥ ≈ 563.
2 × 0.444
Chapter 12
interested in the evolution in time of random variables. For instance, one watches on an
oscilloscope the noise across two terminals. One may observe packets that arrive at an
Internet router, or cosmic rays hitting a detector. If ω designates the outcome of the
random experiment (as usual) and t the time, then one is interested in a collection of
random variables X = {X(t, ω), t ∈ T } where T designates the set of times. Typically,
[0, ∞) or T = (−∞, ∞). When T is countable, one says that X is a discrete-time random
random process.
Similarly, a random process is characterized by the joint cdf of any finite collection of
the random variables. These joint cdf are called the finite dimensional distributions of
the random process. For instance, to specify a random process {Xt , t ≥ 0} one must
specify the joint cdf of {Xt1 , Xt2 , . . . , Xtn } for any value of n ≥ 1 and any 0 < t1 < · · · <
189
190 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
variables that have well-defined joint cdf. In some cases however, one specifies the finite
That is, the finite dimensional distribution specifies the joint cdf of a set of random variables
must have marginal distributions for subsets of them that agree with the finite dimensional
distribution of that subset. For instance, the marginal distribution of Xt1 obtained from
the joint distribution of (Xt1 , Xt2 ) should be the same as the distribution specified for Xt1 .
Remarkably, these obviously necessary conditions are also sufficient! This result is known
In this section, we look at two simple random processes: the Bernoulli process and the
Poisson process. We consider two other important examples in the next two chapters.
The discrete-time random process is called the Bernoulli process. This process models
flipping a coin.
Assume we have watched the first n coin flips. How long do we wait for the next 1? That
is, let
We want to calculate
P [τ > m|X1 , X2 , . . . , Xn ].
12.1. BERNOULLI PROCESS 191
Assume that we have been flipping the coin forever. How has it been since the last 1? Let
0, . . . , Xn−m−1 = 0, Xn−m = 1 and the probability of that event is p(1 − p)m . Thus,
There are two ways of looking at the time between two successive 1s.
The first way is to choose some time n and to look at the time since the last 1 and the
time until the next 1. We know that these times have mean (1/p) − 1 and 1/p, respectively.
In particular, the average time between these two 1s around some time n is (2/p) − 1. Note
The second way is to start at some time n, wait until the next 1 and then count the
time until the next 1. But in this way, we find that the time between two consecutive 1s is
The two different answers that we just described are called the Saint Petersburg paradox.
The way to explain it is to notice that when we pick some time n and we look at the previous
and next 1s, we are more likely to have picked an n that falls in a large gap between two
1s than an n that falls in a small gap. Accordingly, we expect the average duration of the
192 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
interval in which we have picked n to be larger than the typical interval between consecutive
mean 1/p, then if we know that τ is larger than n, then τ −n is still geometrically distributed
Intuitively, this result is immediate if we think of τ as the time until the next 1 for a
Bernoulli process. Knowing that we have already flipped the coin n times and got 0s does
not change how many more times we still have to flip it to get the next 1. (Remember that
with probability p and loses 1 otherwise at each step of a game; the steps are all independent.
Two typical evolutions of Yn are shown in Figure 12.1 when Y0 = 0. The top graph
corresponds to p = 0.54, the bottom one to p = 0.46. The sequence Yn is called a random
Figure 12.1: Random Walk with p = 0.54 (top) and p = 0.46 (bottom).
Assume the gambler plays the above game and starts with an initial fortune Y0 = A > 0.
What is the probability that her fortune will reach B > A before it reaches 0 and she is
a similar way. We want to compute α(A) = P [TB < T0 |Y0 = A]. A moment (or two) of
Note that these equations are derived by conditioning of the first step of the process.
The boundary conditions of these ordinary difference equations are α(0) = 0 and α(B) =
194 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
Figure 12.2: Reflected Random Walk with p = 0.43 (top) and p = 0.3 (bottom).
For instance, with p = 0.48, A = 100, and B = 1000, one finds α(A) = 2 × 10−35 .
Assume now that you have a rich uncle who gives you $1.00 every time you go bankrupt,
so that you can keep on playing the game forever. To define the game precisely, say that
when you hit 0, you can still play the game. If you lose again, you are back at 0, otherwise
to have 1, and so on. The resulting process is called a reflected random walk. Figure 12.2
shows a typical evolution of your fortune. The top graph corresponds to p = 0.43 and the
other one to p = 0.3. Not surprisingly, with p = 0.3 you are always poor.
One interesting question is how much money you have, on average. For instance, looking
at the lower graph, we can see that a good fraction of the time your fortune is 0, some other
fraction of the time it is 1, and a very small fraction of the time it is larger than 2. How
van we calculate these fractions? One way to answer this question is as follows. Assume
12.1. BERNOULLI PROCESS 195
that π(k) = P (Yn = k) for k ≥ 1 and all n ≥ 0. That is, we assume that the distribution
of Yn does not depend on n. Such a distribution π is said to be invariant. Then, for k > 0,
one has
so that
so that
= pπ(0) + (1 − p)π(1).
The above identities are called the balance equations. You can verify that the solution
of these equations is
p
π(k) = Aρk , k ≥ 0 where ρ = .
1−p
P
Since k π(k) = 1, we see that a solution is possible if ρ < 1, i.e.. if p < 0.5. The
p
π(k) = (1 − ρ)ρk , k ≥ 0 where ρ = when p < 0.5.
1−p
When p ≥ 0.5, there is not solution where P (Yn = k) does not depend on n. To
understand what happens in that case, lets look at the case p = 0.6. The evolution of the
The graph shows that, as time evolves, you get richer and richer. The case p = 0.5 is
In this case, the fortune fluctuates and makes very long excursions. One can show that
the fortune comes back to 0 infinitely often but that it takes a very long time to come back.
So long in fact that the fraction of time that the fortune is zero is negligible. In fact, the
fraction of the time that the fortune is any given value is zero!
How do we show all this? Let p = 0.5. We know that P [TB < T0 |Y0 = A] = A/B. Thus,
P [T0 < TB |Y0 = A] = (B − A)/B. As B increases to infinity, we see that TB also increases
to infinity, to that P [T0 < TB |Y0 = A] increases to P [T0 < ∞|Y0 = A] (by continuity of
to infinity). This shows that the fortune comes back to 0 infinitely often. We can calculate
how longs it takes to come back to 0. To do this, define β(A) = E[min{T0 , TB }|Y0 = A].
Arguing as we have for the gamblers ruin problem, you can justify that
Also, β(0) = 0 = β(B). You can verify that the solution is β(A) = A(B − A). Now,
min{T0 , TB } increases to T0 . It follows from Theorem 10.7.1 that E[T0 |Y0 = A] = ∞ for all
A > 0.
Going back to the case p > 0.5, we see that the fortune appears to be increasing at an
approximately constant rate. In fact, using the SLLN we can see that
We can use that observation to say that Yn grows at rate 2p − 1. We can make this more
precise by defining the scaled process Zk (t) := Y[kt] /k where [kt] is the smallest integer
larger than or equal to kt. We can then say that Zk (t) → (2p − 1)t as k → ∞. The
convergence is almost sure for any given t. However, we would like to say that this is true
for the trajectories of the process. To see what we mean by this convergence, lets look as a
These three scaled versions show that the fluctuations gets smaller as we scale and that
Of course, if we look very closely at the last graph above, we can still see some fluctuations.
One way to see these well is to blow up the y-axis. We show a portion of this graph in
Figure 12.6.
12.1. BERNOULLI PROCESS 199
The fluctuations are still there, obviously. One way to analyze these fluctuations is to
use the central limit theorem. Using the CLT, we find that
where σ 2 = p(1 − p). This shows that, properly scaled, the increments of the fortune look
Gaussian. The case p = 0.5 is particularly interesting, because then we do not have to
Yn+m − Yn
√ ≈ N (0, 1/4).
m
√
We can then scale the process differently and look at Zk (t) = Y[kt] / k. We find that as
k becomes very large, the increments of Zk (t) become independent and Gaussian. In fact,
W (t) with independent increments such that W (t + u) − W (t) = N (0, u). Such a process
number of jumps in [0, t]. The jump times are {Tn , n ≥ 1} where {T1 , T2 − T1 , T3 − T2 , T4 −
T3 , . . .} are i.i.d. and exponentially distributed with mean 1/λ where λ > 0. Thus, the
times between jumps are exponentially distributed with mean 1/λ and are all independent.
Recall that the exponential distribution is memoryless. This implies that, for any t > 0,
given {X(s), 0 ≤ s ≤ t}, the process {X(s) − X(t), s ≥ t} is again Poisson with rate λ.
The number of jumps in [0, t], X(t), is a Poisson random variable with mean λt. That is,
Indeed, the jump times {T1 , T2 , . . .} are separated by i.i.d. Exp(λ) random variables.
= (λ²)n exp{−λt}.
Hence, the probability that there are n jumps in [0, t] is the integral of the above density
on the set S = {t1 , . . . , tn |0 < t1 < . . . < tn < t}, i.e., |S|λn exp{−λt} where |S| designates
the volume of the set S. This set is the fraction 1/n! of the cube [0, t]n . Hence |S| = tn /n!
and
12.2. POISSON PROCESS 201
We know, from the strong law of large numbers, that X(nt)/n → λt almost surely as
The claim is that if p → 0 and k → ∞ in a way that kp → λ, then the process {Wk (t) =
Ykt , t ≥ 0} approaches a Poisson process with rate λ. This property follows from the
12.2.5 Sampling
Imagine customers that arrive as a Poisson process with rate λ. With probability p, a
customer is male, independently of the other customers and of the arrival times. The claim
is that the processes {X(t), t ≥ 0} and {Y (t), t ≥ 0} that count the arrivals of male and
female customers, respectively, are independent Poisson processes with rates λp and λ(1−p).
Using Example 5.5.13, we know that X(t) and Y (t) are independent random variables with
means λpt and λ(1 − p)t, respectively. To conclude the proof, one needs to show that the
increments of X(t) and Y (t) are independent. We leave you the details.
202 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
The Saint Petersburg paradox holds for Poisson processes. In fact, the Poisson process seen
from an arrival time is the Poisson process plus one point at that time.
12.2.7 Stationarity
Let X = {X(t), t ∈ <} be a random process. We say that it is stationary if {X(t+u), t ∈ <}
has the same distribution for all u ∈ <. In other words, the statistics do not change over
time. We cannot use this process to measure time. This would be the case of the weather
if there were no long-term trend such as global warming or even seasonal effects. As an
exercise, you can show that our reflected fortune process with p < 0.5 and started with
Let X = {X(t), t ∈ <} be a random process. We say that it is time-reversible if {X(t), t ∈ <}
and {X(u−t), t ∈ <} have the same distribution for all u ∈ <. In other words, the statistics
Also, you should show that our reflected fortune process with p < 0.5 and started with
12.2.9 Ergodicity
Roughly, a stochastic process is ergodic if statistics that do not depend on the initial phase
of the process are constant. That is, such statistics do not depend on the realization of the
process. For instance, if you simulate an ergodic process, you need only one simulation run;
Let {X(t), t ∈ <} be a random process. We compute some statistic Z(ω, u) = φ(X(ω, t+
u), t ∈ <). That is, we perform calculations on the process starting at time u. We are
interested in calculations such that Z(ω, u) = Z(ω, 0) for all u. (We give examples shortly;
don’t worry.) Let us call such calculations “invariant” random variables of the process X.
The process X is ergodic if all its invariant random variables are constant.
RT
As an example, let Z(ω, u) = limT →∞ (1/T ) 0 X(u + t)dt. This random variable is
invariant. If the process is ergodic, then Z is the same for all ω: it is constant.
You should show that our reflected fortune process with p ≤ 0.5 is ergodic. The trick
is that this process must eventually go back to 0. One can then couple two versions of
the process that start off independently and merge the first time they meet. Since they
remained glued forever, the long-term statistics are the same. (See Example 14.8.14 for an
12.2.10 Markov
A random process {X(t), t ∈ <} is Markov if, given X(t), the past {X(s), s < t} and the
future {X(s), s > t} are independent. Markov chains are examples of Markov process.
A process with independent increments is Markov. For instance, the Poisson process
For a simple example of a process that is not Markov, let Yn be the reflected fortune
It is not Markov because if you see that Wn−1 = 0 and Wn = 1, then you know that
P [W2 = 0|W1 = 1, W0 = 0] = p.
However,
Note that this example shows that a function of a Markov process may not be a Markov
process.
c. Calculate L[N2 | N1 , N3 ].
λn −λ
a. We know that N1 =d P (λ), so that P (N1 = n) = n! e .
b. Same.
L[U + V | U, U + V + W ]
where the random variables U, V, W are i.i.d. P (λ). We could use the straightforward
approach, with the general formula. However, a symmetry argument turns out to be easier.
X := L[U + V | U, U + V + W ] = L[U + W | U, U + V + W ].
Consequently,
2X = L[(U + V ) + (U + W ) | U, U + V + W ] = U + (U + V + W ).
Hence,
1 1
X = L[N2 | N1 , N3 ] = (U + (U + V + W )) = (N1 + N3 ).
2 2
d. We are given the three i.i.d. P (λ) random variables U, V, W defined earlier. Then
λu+v+w −3λ
P [U = u, V = v, W = w | λ] = e .
u!v!w!
12.2. POISSON PROCESS 205
λn e−3λ ,
n
λ= .
3
Hence,
N3
M LE[λ | N1 , N2 , N3 ] = .
3
Example 12.2.2. Let N = {Nt , t ≥ 0} be a counting process. That is, Nt+s − Ns ∈
{0, 1, 2, . . .} for all s, t ≥ 0. Assume that the process has independent and stationary in-
crements. That is, assume that Nt+s − Ns is independent of {Nu , u ≤ s} for all s, t ≥ 0
and has a p.m.f. that does not depend on s. Assume further that P (Nt > 1 ) = o(t). Show
that N is a Poisson process. (Hint: Show that the times between jumps are exponentially
Let
a(s) = P (Nt+s = Nt ).
Then
a0 (s + u) = a0 (s)a(u).
At s = 0, this gives
a0 (u) = a0 (0)a(u).
The solution is a(u) = a(0) exp{a0 (0)u}, which shows that the distribution of the first jump
that the times between successive jumps are i.i.d. and exponentially distributed.
206 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
Example 12.2.3. Construct a counting process N such that for all t > 0 the random
variable Nt is Poisson with mean λt but the process is not a Poisson process.
Assume that At is a Poisson process with rate λ. Let F (t; x) = P (Nt ≤ x) for x, t ≥ 0.
Let also G(t; u) be the inverse function of F (t; .). That is, G(t; u) = min{x | F (t; x) ≥ u}
for u ∈ [0, 1]. Then we know that if we pick ω uniformly distributed in [0, 1], the random
variable G(t; ω) has p.d.f. F (t; x), i.e., is Poisson with mean λt. Consider the process
N = {Nt (ω) := G(t; ω), t ≥ 0}. It is such that Nt is Poisson with mean λt for all t.
However, if you know the first jump time of Nt , then you know ω and all the other jump
Example 12.2.4. Let N be a Poisson process with rate λ and define Xt = X0 (−1)Nt for
c. Assume that p = 0.5, so that, by symmetry, P (Xt = 1) = 0.5 for all t ≥ 0. Calculate
E(Xt+s Xs ) for s, t ≥ 0.
a. Yes, the process has independent increments. This can be seen by considering the
= P (N (t, s + t] is even)
Similarly,
= P (N (t, s + t] is odd)
12.2. POISSON PROCESS 207
Similarly, we can show independent increments for the case when X(t) = −1
∞
X (λt)2i e−λt
P (N (0, t] is even) =
2i!
i=0
X∞ ∞
e−λt (λt)i X (−λt)i
= ( + )
2 i! i!
i=0 i=0
e−λt + eλt
= e−λt ( )
2
1 + e−2λt
=
2
1−e−2λt
Similarly we get P (N (0, t] is odd) = 2
c. Assume that p = 0.5, so that, by symmetry, P (Xt = 1) = 0.5 for all t ≥ 0. Calculate
E(Xt+s Xs ) for s, t ≥ 0.
Example 12.2.5.
208 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
Problem 5. Let N be a Poisson process with rate λ. At each jump time Tn of N , a random
number Xn of customers arrive at a cashier waiting line. The random variables {Xn , n ≥ 1}
are i.i.d. with mean µ and variance σ 2 . Let At be the number of customers who arrived by
time t, for t ≥ 0.
At − µλt
√
t
as t → ∞?
P
a. We have, E(At |Nt = n) = E( ni=1 Xi ) = nµ and
∞
X
E(At ) = E(At |Nt = n)P (Nt = n)
i=1
∞
X
= µ nP (Nt = n)
i=1
= µλt
Xn
E(A2t |Nt = n) = E( Xi )2
i=1
n
X n
X
= µ Xi2 +2 Xi Xj
i=1 i=1,0<j<i
∞
X
E(A2t ) = E(A2t |Nt = n)P (Nt = n)
i=1
X∞
= (n(µ2 + σ 2 ) + n(n − 1)µ2 )P (Nt = n)
i=1
∞
X ∞
X
= (µ2 + σ 2 ) n P (Nt = n) + µ2 n(n − 1) P (Nt = n)
i=1 i=1
= (µ2 + σ 2 )λt + µ2 (λt + (λt)2 − λt)
Hence,
= (µ2 + σ 2 )λt
Hence,
At
limt→∞ =µλ
t
c. Define the RV Y1 as the number of customers added in the first 1 sec. Similarly,
define RV Yi as the number of customers added in the ith second. Then (Yn , n ≥ 1) are iid
Let tc = dte. Define ∆c = tc − t and let Zc denote the number of customers arriving in
time interval (t, tc ]. Then E(Zc ) = µλ∆c and var(Zc ) = (µ2 + σ 2 )λ∆c . Hence,
210 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
Ptc
At − µλt i=1 Yi
− µλtc − Zc + µλ∆
limt→∞ √ = limtc →∞ √
t tc − ∆c
Ptc
Yi − µλtc Zc − µλ∆
≥ limtc →∞ i=1 √ − √
tc tc
The last term in the expression above goes to zero as t → ∞. And using CLT we can
Ptc
Yi −µλtc d
show that limtc →∞ i=1√
tc
→ N (0, λ(µ2 + σ 2 ))
Hence, limt→∞ At √
−µλt
t
→ N (0, λ(µ2 + σ 2 )).
Chapter 13
Filtering Noise
When your FM radio is tuned between stations it produces a noise that covers a wide range
of frequencies. By changing the settings of the bass or treble control, you can make that
noise sound harsher or softer. The change in the noise is called filtering. More generally,
In this chapter, we define the power that a random process has at different frequencies
and we describe how one can design filters that modify that power distribution across
frequencies. We also explain how one can use a filter to compute the LLSE of the current
We discuss the results in discrete time. That is, we consider sequences of values instead
of functions of a continuous time. Digital filters operate on discrete sets of values. For
instance, a piece of music is sampled before being processes, so that continuous time is
made discrete. Moreover, for simplicity, we consider that time takes only a finite number of
values. Physically, we can take a set of time that is large enough to represent an arbitrarily
211
212 CHAPTER 13. FILTERING NOISE
The terminology is that a system transforms its input into an output. The input and output
are sequences of values describing, for instance, a voltage as a function of time. Accordingly,
Such sequences represent signals that, for instance, encode voice or audio. For mathe-
matical convenience, we extend these sequences periodically over all the integers in Z :=
so that x(n) = x(n − N ) for all n ∈ Z. We will consider systems whose input are the
extension of the original finite sequence. For instance, imagine that we want to study how a
particular song is processed. We consider that the song is played over and over and we look
at the result of the processing. Intuitively, this artifice does not change the nature of the
result, except possibly at the very beginning or end of the song. However, mathematically,
13.1.1 Definition
A system is linear if the output that corresponds to a sum of inputs is the sum of their
respective outputs. A system is time-invariant if delaying the input only delays the output
N
X −1
y(n) = h(m)x(n − m), n ∈ Z (13.1.1)
n=0
Thus, the output is the convolution of the (extended) input with the impulse response.
Roughly, the impulse response is the output that corresponds to the input {1, 0, . . . , 0}.
P
We say roughly, because the extension of this input is x(n) = ∞
k=−∞ 1{n = kN }, so that
n ∈ {0, 1, . . . , N − 1}. In that case, the impulse response is the output that corresponds to
the impulse input {1, 0, . . . , 0}. If h(·) is nonzero over a long period of time that exceeds
the period N of the periodic repetition of the input, then the outputs of these different
repetitions superpose each other and the result is complicated. In applications, one chooses
N large enough to exceed the duration of the impulse response. Assuming that h(n) = 0
for n < 0 means that the system does not produce an output before the input is applied.
Such a system is said to be causal. Obviously, all physical systems are causal.
Consider a system that delays the input by k time units. Such a system has impulse
response h(n) = 1{n − k}. Similarly, a system which averages k successive input values
and produces y(n) = (x(n) + x(n − 1) + · · · + x(n − k + 1))/k for some k < N has impulse
response
1
h(n) = 1{0 ≤ n ≤ k − 1}, n = 0, . . . , N − 1.
k
We assume m < N − 1. This system is LTI because the identities are linear and their
coefficients do not depend on n. To find the impulse response, assume that x(n) = 1{n = 0}
and designate by h(n) the corresponding output y(n) when y(n) = 0 for n < 0. For instance,
214 CHAPTER 13. FILTERING NOISE
h(0) = b0 ;
and so on.
We want to analyze the effect of such LTI systems on random processes. Before doing
this, we introduce some tools to study the systems in the frequency domain.
LTI systems are easier to describe in the “frequency domain” than in the “time domain.”
To understand these ideas, we need to review (or introduce) the notion of frequencies.
then so is the output. Using complex-valued signals is a mathematical artefact that simpli-
fies the analysis. In a physical system, the quantities are real. For instance, h(n) is real. If
This gives
N
X −1 N
X −1
h(m) Im{x(n − m)} = h(m) sin(2πu(n − m)/N )
n=0 n=0
= Im{ei2πnu/N |H(u)|e iθ(u)
} = |H(u)| sin(2πnu/N + θ(u)).
13.1. LINEAR TIME-INVARIANT SYSTEMS 215
This expression shows that if the input is the sine wave sin(2πun/T ), then so is the output,
except that its amplitude is multiplied by some gain |H(u)| and that it is delayed. Note
that the analysis is easier with complex functions. One can then take the imaginary part
with β := e−i2π/N .
It turns out that one can recover x(·) from X(·) by computing the Inverse Discrete
Fourier Transform:
N −1
1 X
x(n) = X(u)β −nu , n = 0, 1, . . . , N − 1. (13.1.4)
N
u=0
Indeed,
N −1 N −1 N −1 N −1 N −1
1 X −nu 1 X X mu −nu
X 1 X u(m−n)
X(u)β = [ x(m)β ]β = x(m)[ β ].
N N N
u=0 u=0 m=0 m=0 u=0
But,
N −1
1 X u(m−n)
β = 1, for m = n
N
u=0
and, for m 6= n,
N −1
1 X u(m−n) 1 β N (m−n) − 1
β = = 0,
N N β m−n − 1
u=0
because βN = e−2iπ = 1. Hence,
N −1 N −1
1 X X
X(u)β −nu = x(m)1{m = n} = x(n),
N
u=0 m=0
as claimed.
The following result shows that LTI systems are easy to analyze in the frequency domain.
216 CHAPTER 13. FILTERING NOISE
Theorem 13.1.1. Let x and y be the input and output of an LTI system, as in (13.1.1).
Then
The theorem shows that the convolution (13.1.1) in the time domain becomes a multi-
Proof:
We find
N
X −1 N
X −1 N
X −1
nu
Y (u) = y(n)β = h(m)x(n − m)β nu
n=0 n=0 m=0
N
X NX
−1 −1 N
X −1 N
X −1
= h(m)x(n − m)β (n−m)u β mu = h(m)β mu [ x(n − m)β (n−m)u ]
n=0 m=0 m=0 n=0
N
X −1
= h(m)β mu X(u) = H(u)X(u).
m=0
PN −1
The next-to-last identity follows from the observation that n=0 x(n − m)β (n−m)u does
not depend on m because both x(n) and β nu are periodic with period N . ¤
As an example, consider the LTI system (13.1.2). Assume that the input is x(n) =
ei2πun = β −nu for n ∈ Z. We know that y(n) = H(u)β −nu , n ∈ Z for some H(u) that we
H(u)β −nu +a1 H(u)β −(n−1)u +· · ·+ak H(u)β −(n−k)u = b0 β −nu +b1 β −(n−1)u +· · ·+bm β −(n−m)u .
H(u)[1 + a1 β u + · · · + ak β ku ) = b0 + b1 β u + · · · + β mu .
Consequently,
b0 + b1 β u + · · · + β mu
H(u) = . (13.1.6)
1 + a1 β u + · · · + ak β ku
13.2. WIDE SENSE STATIONARY PROCESSES 217
1
y(n) = (x(n) + x(n − 1) + · · · + x(n − m + 1)).
m
1 1 1 − β mu 1 1 − e−i2πmu/N
H(u) = (1 + β u + · · · + β (m−1)u ) = = . (13.1.7)
m m 1 − βu m 1 − e−i2πu/N
13.2 Wide Sense Stationary Processes
We are interested in analyzing systems when their inputs are random processes. One could
argue that most real systems are better modelled that way than by assuming that the input
is purely deterministic. Moreover, we are concerned by the power of the noise, as we see
below.
The power of a random process has to do with the average value of its squared magnitude.
This should not be surprising. The power of an electrical signal is the product of its voltage
by its current. For a given load, the current is proportional to the voltage, so that
the power is proportional to the square of the voltage. If the signal fluctuates, the power
is proportional to the long term average value of the square of the voltage. If the law of
large numbers applies (say, if the process is ergodic), then this long term average value is
the expected value. This discussion motivates the definition of power as the expected value
of the square.
We introduce a few key ideas on a simple model. Assume that the input of the system
at time n is the random variable X(n) and that the output is Y(n) = X(n) + X(n − 1). The
power of the input is E(X(n)²). For this quantity to be well-defined, we assume that
it does not depend on n. The average power of the output is E(Y(n)²) = E((X(n) +
X(n − 1))²) = E(X(n)²) + 2E(X(n)X(n − 1)) + E(X(n − 1)²). For this quantity to be
well-defined, we assume that E(X(n)X(n − 1)) does not depend on n. More generally, this
example shows that the definitions become much simpler if we consider processes such that
E(X(n)X(n − m)) does not depend on n for m ≥ 0. These considerations motivate the
following definition: the process {X(n), n ∈ Z} is wide sense stationary (wss) if
E(X(n)) = µ, n ∈ Z (13.2.1)
and
E(X(n)X*(n − m)) = R_X(m), n, m ∈ Z. (13.2.2)
In this chapter we consider finite sequences. One can extend such a sequence
{X(n), n = 0, . . . , N − 1} into a sequence defined for n ∈ Z by repeating it with period N. For the extended sequence to be
wss, the original sequence must satisfy a related condition: that E(X(n)X*(n − m)) = R_X(m)
for n, m ∈ {0, . . . , N − 1}, with the convention that n − m is replaced by n − m + N if n − m < 0.
When this condition holds, we say that the sequence {X(0), . . . , X(N − 1)} is wss. Note
also in that case that R_X(n) is periodic with period N. Although this condition may seem
restrictive, the examples below show that it is natural.
A first example is when {X(n), n = 0, . . . , N − 1} are i.i.d. with mean µ and variance σ².
The following result explains how one can generate many examples of wss processes.
Theorem 13.2.1. Assume that the input X of an LTI system with impulse response {h(n), n =
0, . . . , N − 1} is wss. Then the output Y is also wss.
Proof:
First, E(Y(n)) = Σ_{k=0}^{N−1} h(k) E(X(n − k)) = µ Σ_{k=0}^{N−1} h(k), which does not depend on n.
Second, we calculate
E(Y(n)Y*(n − m)) = E( Σ_{k=0}^{N−1} h(k) X(n − k) Σ_{k′=0}^{N−1} h(k′) X*(n − m − k′) )
= Σ_{k=0}^{N−1} Σ_{k′=0}^{N−1} h(k) h(k′) R_X(m + k′ − k),
which shows that the result does not depend on n. We designate that result as RY (m).
Note also that RY (m) is periodic in m with period N , since RX has that property.
We take note of the result of the calculation above, for further reference:
R_Y(m) = Σ_{k=0}^{N−1} Σ_{k′=0}^{N−1} h(k) h(k′) R_X(m + k′ − k). (13.2.3)
The result of the calculation in (13.2.3) is cumbersome and hard to interpret. We introduce
the notion of power spectrum which will give us a simpler interpretation of the result.
13.3 Power Spectrum
Definition 13.3.1. Let X be a wss process. We define the power spectrum of the process
as
S_X(u) := Σ_{n=0}^{N−1} R_X(n) β^{nu}, u = 0, 1, . . . , N − 1,
with β := e^{−i2π/N}.
As a first example, consider a process that is a sine wave with frequency u0 and a random
phase. (The unit of frequency is 1/N.) Without the random phase, the process would
certainly not be wss since its mean would depend on n. With the random phase, one
computes the covariance R_X and finds the power spectrum S_X given in (13.3.2): it is
concentrated on the frequency u0.
Indeed, by the identity (13.1.4) for the inverse DFT, we see that
R_X(m) = (1/N) Σ_{u=0}^{N−1} S_X(u) β^{−mu}, m = 0, 1, . . . , N − 1, (13.3.3)
which agrees with the expression (13.3.2) for S_X. This expression (13.3.2) says that the
power spectrum of the process is concentrated on frequency u0, which is consistent with the
fact that the process is a pure sine wave at that frequency.
Second, we look at the process with i.i.d. random variables {X(n), n = 0, . . . , N − 1}. Here,
S_X(u) = Σ_{n=0}^{N−1} R_X(n) β^{nu} = Σ_{n=0}^{N−1} (µ² + σ²) 1{n = 0} β^{nu} = µ² + σ², u = 0, 1, . . . , N − 1.
This process has a constant power spectrum. A process with that property is said to be a
white noise. In a sense, it is the opposite of a pure sine wave with a random phase.
13.4 LTI Systems and Spectrum
Theorem 13.4.1. Let X be a wss input of an LTI system with impulse response h(·) and
output Y. Then Y is wss and
S_Y(u) = |H(u)|² S_X(u), u = 0, 1, . . . , N − 1.
Proof:
Taking the DFT of (13.2.3),
S_Y(u) = Σ_{m=0}^{N−1} R_Y(m) β^{mu} = Σ_{k=0}^{N−1} Σ_{k′=0}^{N−1} h(k) β^{ku} h(k′) β^{−k′u} [ Σ_{m=0}^{N−1} R_X(m + k′ − k) β^{(m+k′−k)u} ].
Now,
Σ_{m=0}^{N−1} R_X(m + k′ − k) β^{(m+k′−k)u} = S_X(u)
because both R_X(m + k′ − k) and β^{(m+k′−k)u} are periodic in m with period N, so that we can
shift the summation index. Hence S_Y(u) = H(u) H*(u) S_X(u) = |H(u)|² S_X(u),
as claimed. □
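The theorem is easy to test numerically. The sketch below (assuming numpy; the filter h and the white-noise covariance are made up) computes R_Y from (13.2.3) with indices taken modulo N and verifies that its DFT equals |H(u)|² S_X(u):

```python
import numpy as np

N = 16
rng = np.random.default_rng(1)
h = rng.standard_normal(N)              # an arbitrary real impulse response

sigma2 = 2.0
R_X = np.zeros(N)
R_X[0] = sigma2                         # white noise: R_X(n) = sigma^2 1{n = 0}

R_Y = np.zeros(N)                       # R_Y(m) per (13.2.3), indices mod N
for m in range(N):
    for k in range(N):
        for kp in range(N):
            R_Y[m] += h[k] * h[kp] * R_X[(m + kp - k) % N]

beta = np.exp(-2j * np.pi / N)
dft = lambda r: np.array([np.sum(r * beta ** (np.arange(N) * u)) for u in range(N)])
S_X, S_Y, H = dft(R_X), dft(R_Y), dft(h)

assert np.allclose(S_Y, np.abs(H) ** 2 * S_X)
```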
The theorem provides a way to interpret the meaning of the power spectrum. Imagine
a wss random process X. We want to understand the meaning of S_X(u0). Assume that we
build an ideal LTI system with transfer function H(u) = √N 1{u = u0}. This filter lets
only the frequency u0 go through. The process X goes through the LTI system. The power
of the output is then, by the theorem and (13.3.3),
E(|Y(n)|²) = R_Y(0) = (1/N) Σ_{u=0}^{N−1} S_Y(u) = (1/N) Σ_{u=0}^{N−1} |H(u)|² S_X(u) = S_X(u0).
Thus, S_X(u0) measures the power of X at the frequency u0.
13.5 Solved Problems
Example 13.5.1. Averaging a process over time should make it smoother and reduce its
rapid changes. Accordingly, we expect that such processing should cut down the high fre-
quencies of its power spectrum. Consider the averaging system with the transfer function
H(u) given by (13.1.7).
Figure 13.1 plots the value of |H(u)|². The figure shows the “low-pass” filtering effect:
the low frequencies go through but the high frequencies are greatly attenuated. This plot is
the power spectrum of the output when the input is a white noise such as an uncorrelated
sequence. That output is a “colored” noise with most of the power in the low frequencies.
Example 13.5.2. By calculating the differences between successive values of a process one
expects to amplify its rapid variations, i.e., its high frequencies. The system y(n) = x(n) − x(n − 1)
has transfer function
H(u) = 1 − β^u = 1 − e^{−i2πu/N}.
[Figure 13.1: |H(u)|² for the averaging filter, plotted for m = 10, 20, 40.]
Consequently,
|H(u)|² = |1 − e^{−i2πu/N}|² = 2 − 2 cos(2πu/N).
Figure 13.2 plots that expression and shows that the filter boosts the high frequencies, as
expected. You will note that this system acts as a high-pass filter for frequencies up to 64
(in this example, N = 128). In practice, one should choose N large enough so that the filter
covers most of the range of frequencies of the input process. Thus, if the power spectrum
of the input process is limited to K, one can choose N = 2K and this system will act as a
high-pass filter.
Chapter 14
Markov Chains - Discrete Time
A Markov chain models the random motion in time of an object in a countable set. The
key feature of that motion is that the object has no memory of its past and does not carry a
watch. That is, the future motion depends only on the current location. Consequently, the
law of motion is specified by the one-step transition probabilities from any given location.
Markov chains are an important class of models because they are fairly general and good
models of many practical systems.
14.1 Definition
A Markov chain on a countable set S is a sequence {Xn, n ≥ 0} of random variables with
values in S such that
P[Xn+1 = j | X0 = i0, . . . , Xn−1 = in−1, Xn = i] = P(i, j), for all n ≥ 0 and i0, . . . , in−1, i, j ∈ S.
The matrix P = [P(i, j), i, j ∈ S] is the transition probability matrix of the Markov
chain. The matrix P is any nonnegative matrix whose rows sum to 1. Such a matrix is
called a stochastic matrix.
If π0 designates the distribution of X0, then
P(X0 = i0, X1 = i1, . . . , Xn = in) = π0(i0) P(i0, i1) · · · P(in−1, in).
In particular, the distribution πn of Xn is given by πn = π0 P^n,
where P^n is the n-th power of the matrix P and π0 is the row vector with entries π0(j).
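For instance, the following sketch (assuming numpy; the two-state transition matrix is made up) computes πn = π0 P^n and shows it settling down:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                    # rows sum to 1
pi0 = np.array([1.0, 0.0])                    # start in state 0

pi_n = pi0 @ np.linalg.matrix_power(P, 50)    # distribution of X_50
print(pi_n)                                   # close to (0.8, 0.2)
```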
A state transition diagram can represent the transition probability matrix. Such a dia-
gram shows the states, and the probabilities are represented by numbers on arrows between
states. By convention, no arrow between two states means that the corresponding transition
probability is zero.
14.2 Examples
Diagram 14.1 represents a Markov chain with S = {0, 1}. Its transition probability
matrix can be read off the arrows of the diagram.
The following state transition diagram 14.2 corresponds to a Markov chain with S =
{0, 1, 2, . . .}.
This is the state transition diagram of the sequence of fortunes with the rich uncle.
Also, for future reference, we introduce a few other examples. Markov chain 14.3 has
two sets of states that do not communicate. (Recall that the absence of an arrow means
that the corresponding transition probability is zero.)
Markov chain 14.4 has one state, state 4, that cannot be exited from. Such a state is
said to be absorbing.
Markov chains 14.5 and 14.6 have characteristics that we will discuss later.
Consider the following “non-Markov” chain. The possible values are 0, 1, 2. The initial
value is picked randomly. Also, with probability 1/2, the sequence increases (modulo 3)
forever, and with probability 1/2 it decreases (modulo 3) forever. For instance, starting from
0, with probability 1/2 the sequence is {0, 1, 2, 0, 1, 2, 0, . . .}
and with probability 1/2 it is {0, 2, 1, 0, 2, 1, 0, . . .}. Similarly for the other two possible
starting values. This is not Markov since by looking at the previous two values you can
predict exactly the next one, which you cannot do if you only see the current value. Note
that you can “complete the state” by considering the pair of two successive values. This
pair is a Markov chain. More generally, if a sequence has a finite memory of duration k,
then the vector of k successive values forms a Markov chain.
As another example, let X0 be picked at random in {0, 1, 2} and let Xn+1 = Xn + 1 (modulo 3),
so that if X0 = 0, then (Xn, n ≥ 0) = (0, 1, 2, 0, 1, 2, . . .); if X0 = 1, then
(Xn, n ≥ 0) = (1, 2, 0, 1, 2, . . .), and similarly if X0 = 2. Let g(0) = g(1) = 5 and g(2) = 6.
Then Yn = g(Xn) is not a Markov chain: observing Yn = 5 does not tell us whether Xn = 0
or Xn = 1, and these two cases lead to different values of Yn+1.
14.3 Classification
The properties of Markov chains are determined largely (completely for finite Markov
chains) by the “topology” of their state transition diagram. We need some terminology.
A Markov chain (or its probability transition matrix) is said to be irreducible if it can
reach every state from every other state (not necessarily in one step). For instance, the
Markov chains in Figures 14.1, 14.2, 14.5, and 14.6 are irreducible but those in Figures 14.3
and 14.4 are not.
The period d(i) of a state i is defined by d(i) := g.c.d.{n ≥ 1 | P[Xn = i | X0 = i] > 0}.
Here, g.c.d. means the greatest common divisor of the integers in the set. For instance,
g.c.d.{6, 9, 15} = 3 and g.c.d.{12, 15, 25} = 1. As an example, for the Markov chain in Figure
14.5, every state has period 3.
If the Markov chain is irreducible, then it can be shown that d(i) has the same value
for all i ∈ S. If this common value d is equal to 1, then the Markov chain is said to be
aperiodic. Otherwise, the Markov chain is said to be periodic with period d. Accordingly,
the Markov chains in Figures 14.1, 14.2, and 14.6 are aperiodic and the Markov chain in
Figure 14.5 is periodic with period 3.
Define Ti = min{n ≥ 1 | Xn = i}. If the Markov chain is irreducible, then one can show
that P[Ti < ∞ | X0 = i] has the same value for all i ∈ S. Moreover, that value is either 1 or
strictly less than 1. In the first case, the Markov chain is said to be recurrent; in the second, it is
said to be transient.
Moreover, if the irreducible Markov chain is recurrent, then E[Ti |X0 = i] is either finite
for all i ∈ S or infinite for all i ∈ S. If E[Ti |X0 = i] is finite, then the Markov chain is
said to be positive recurrent. Every finite irreducible Markov chain is positive recurrent. If
E[Ti | X0 = i] is infinite, then the Markov chain is null recurrent. Also, one can show that
if the Markov chain is transient or null recurrent, then
lim_{N→∞} (1/N)(1{X1 = j} + 1{X2 = j} + · · · + 1{XN = j}) = 0, for all j ∈ S,
whereas if it is positive recurrent, then
lim_{N→∞} (1/N)(1{X1 = j} + 1{X2 = j} + · · · + 1{XN = j}) =: π(j) > 0, for all j ∈ S.
Finally, if the Markov chain is irreducible, aperiodic, and positive recurrent, then
P(Xn = j) → π(j) as n → ∞, for all j ∈ S.
The Markov chain in Figure 14.2 is transient when p > 0.5, null recurrent when p = 0.5,
and positive recurrent when p < 0.5.
14.4 Invariant Distribution
If P(Xn = i) = π(i) for i ∈ S (i.e., does not depend on n), the distribution π is said to be
invariant. Since
P(Xn+1 = i) = Σ_j P(Xn = j, Xn+1 = i) = Σ_j P[Xn+1 = i | Xn = j] P(Xn = j) = Σ_j P(Xn = j) P(j, i),
if π is invariant, then
π(i) = Σ_j π(j) P(j, i), for i ∈ S.
These identities are called the balance equations. Thus, a distribution is invariant if and
only if it satisfies the balance equations.
An irreducible Markov chain has at most one invariant distribution. It has one if and
only if it is positive recurrent. In that case, the Markov chain is ergodic and asymptotically
stationary.
The following theorem summarizes the discussion of the previous two sections.
Theorem 14.4.1. Consider an irreducible Markov chain. It is either transient, null recur-
rent, or positive recurrent. Only the last case is possible for a finite-state Markov chain.
If the Markov chain is transient or null recurrent, then it has no invariant distribution;
the fraction of time it is in state j converges to zero for all j, and the probability that it is
in state j also converges to zero for all j.
If the Markov chain is positive recurrent, it has a unique invariant distribution π. The
fraction of time that the Markov chain is in state j converges to π(j) for all j. If the Markov
chain is aperiodic, then the probability that it is in state j converges to π(j) for all j.
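Numerically, the invariant distribution of a finite irreducible chain can be found by solving the balance equations together with the normalization Σ_i π(i) = 1. A sketch, assuming numpy and reusing the made-up matrix from the earlier example:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
n = P.shape[0]

# Stack (P^T - I) pi = 0 with the normalization sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                                     # [0.8, 0.2]
```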
14.5 First Step Equations
We can extend the example of the gambler's ruin to a general Markov chain. For instance,
let A ⊂ S be a given subset, let T = min{n ≥ 0 | Xn ∈ A}, and define β(i) = E[T | X0 = i]. Then
one finds that
β(i) = 1 + Σ_j P(i, j) β(j), for i ∉ A.
Of course, β(i) = 0 for i ∈ A. In finite cases, these equations suffice to determine β(i).
In infinite cases, one may have to introduce a boundary as we did in the case of the reflected
fortune process. In many cases, no simple solution can be found. These are the first step
equations (FSE); a numerical sketch follows.
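The first step equations are linear, so for a finite chain they can be solved directly. Here is a sketch (assuming numpy) on a made-up fair gambler's ruin on {0, 1, 2, 3, 4} with A = {0, 4}, where the known answer is β(i) = i(4 − i):

```python
import numpy as np

P = np.zeros((5, 5))
P[0, 0] = P[4, 4] = 1.0
for i in (1, 2, 3):
    P[i, i - 1] = P[i, i + 1] = 0.5           # fair coin: up or down
A = [0, 4]
free = [i for i in range(5) if i not in A]

# beta = 1 + P beta, restricted to the states outside A (beta = 0 on A).
Q = P[np.ix_(free, free)]
beta = np.linalg.solve(np.eye(len(free)) - Q, np.ones(len(free)))
print(beta)                                   # [3. 4. 3.]
```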
14.6 Time Reversal
Assume that X is a stationary irreducible Markov chain with invariant distribution π and
transition probability matrix P. What does the reversed process X′ = {X′n :=
XN−n, 0 ≤ n ≤ N} look like? It turns out that X′ is also a stationary Markov chain with the
same invariant distribution (obviously) and with transition probability matrix P′ given by
P′(i, j) = π(j) P(j, i)/π(i), i, j ∈ S.
In some cases, P = P′. In those cases, X and X′ have the same finite dimensional
distributions; the Markov chain is then said to be time-reversible. Note that P = P′ if and
only if
π(i) P(i, j) = π(j) P(j, i), for all i, j ∈ S.
These equations are called the detailed balance equations. You can use these equations
to look for an invariant distribution: a distribution that satisfies them also satisfies the
balance equations.
14.7 Summary
The key point of this definition is that, given the present value of Xn , the future {Xm , m ≥
n + 1} and the past {Xm , m ≤ n − 1} are independent. That is, the evolution of X starts
afresh from Xn. In other words, the state Xn contains all the information that is useful for
predicting the future of the process.
The First Step Equations are difference equations about some statistics of a Markov
chain {Xn, n ≥ 0} that are derived by considering the different possible values of the first
step X1.
Finally, a stationary Markov chain reversed in time is again a Markov chain, generally
with a different transition probability matrix, unless the detailed balance equations hold,
in which case the process is time-reversible.
14.8 Solved Problems
Example 14.8.1. We flip a coin repeatedly until we get three successive heads. What is
the average number of flips?
We can model the problem as a Discrete Time Markov chain where the states denote
the number of successive heads obtained so far. Figure 14.7 shows the transition diagram
of this Markov chain.
If we are in state 0, we jump back to state 0 if we receive a Tail else we jump to state 1.
The probability of each of these transitions is 1/2. Similarly, if we are in state 2, we jump back
to state 0 if we receive a Tail, else we jump to state 3, and similarly for the other transitions.
[Figure 14.7: Transition diagram on the states 0, 1, 2, 3; each state i < 3 jumps to state i + 1 or back to state 0, each with probability 1/2.]
We are interested in finding the mean number of coin flips until we get 3 successive
heads. This is the mean number of steps taken to reach state 3 from state 0. Let N =
min{n ≥ 0, Xn = 3}, i.e., N is a random variable which specifies the first time we hit state
3. Let β(i) = E[N | X0 = i]. The first step equations are
β(0) = (1/2)β(0) + (1/2)β(1) + 1;
β(1) = (1/2)β(2) + (1/2)β(0) + 1;
β(2) = (1/2)β(3) + (1/2)β(0) + 1;
β(3) = 0.
Solving these equations, we get β(0) = 14, β(1) = 12, β(2) = 8.
Hence the expected number of tosses until we get 3 successive heads is 14.
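As a check, the same numbers come out of solving the three equations as a linear system (a sketch assuming numpy):

```python
import numpy as np

# Rewriting the FSE with beta(3) = 0:
A = np.array([[ 0.5, -0.5,  0.0],   # beta(0) - 0.5 beta(0) - 0.5 beta(1) = 1
              [-0.5,  1.0, -0.5],   # beta(1) - 0.5 beta(0) - 0.5 beta(2) = 1
              [-0.5,  0.0,  1.0]])  # beta(2) - 0.5 beta(0)               = 1
print(np.linalg.solve(A, np.ones(3)))   # [14. 12.  8.]
```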
the smallest σ-field that contains all the events of the form
{ω | ω0 = i0, . . . , ωn = in};
Example 14.8.3. We flip a biased coin forever. Let X1 = 0 and, for n ≥ 2, let Xn = 1
if the outcomes of the n-th and (n − 1)-st coin flips are identical and Xn = 0 otherwise. Is
X = {Xn, n ≥ 1} a Markov chain?
Algebra shows that the expressions are equal if and only if p = 0.5. Thus, if X is a Markov
chain, p = 0.5. Conversely, if p = 0.5, then we see that the random variables {Xn, n ≥ 2}
are i.i.d., so that X is indeed a Markov chain.
Example 14.8.4. A man tries to climb a ladder with n rungs. At each step, he climbs
up one rung with probability p, otherwise he falls back to the ground. What is the average
time until he reaches the top of the ladder?
Let β(m) be the average time to reach the n-th rung, starting from the m-th one, for
m ∈ {0, 1, . . . , n}. The first step equations are
β(m) = 1 + pβ(m + 1) + (1 − p)β(0), for m = 0, 1, . . . , n − 1;
β(n) = 0.
The first equation is of the form β(m + 1) = aβ(m) + b with a = 1/p and b = −1/p − ((1 −
p)/p)β(0). Consequently,
β(m) = a^m β(0) + b (1 − a^m)/(1 − a), m = 0, 1, . . . , n.
Setting β(n) = 0 and solving for β(0) gives
β(0) = (1 − p^n)/(p^n − p^{n+1}).
For instance, with p = 0.8 and n = 10, one finds β(0) = 41.5.
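A one-line check of the closed form (plain Python):

```python
p, n = 0.8, 10
print((1 - p ** n) / (p ** n - p ** (n + 1)))   # about 41.5 (41.57 more precisely)
```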
Example 14.8.5. You toss a fair coin repeatedly with results Y0, Y1, Y2, . . . that are 0 or 1
with probability 1/2 each. Define Xn = Yn−1 + Yn for n ≥ 1. Is X = {Xn, n ≥ 1} a Markov
chain?
No, because
P[Xn+1 = 0 | Xn = 1, Xn−1 = 2] = 1/2 ≠ P[Xn+1 = 0 | Xn = 1] = 1/4.
Example 14.8.6. Consider a small deck of three cards 1, 2, 3. At each step, you take the
middle card and you place it first with probability 1/2 or last with probability 1/2. What is
the average time until the cards are in the reversed order 3, 2, 1?
The possible states are the six permutations {123, 132, 312, 321, 231, 213}. The state
transition diagram consists of these six states placed around a circle (in the order indicated),
and each step moves the state to one of its two neighbors on the circle, each with probability
1/2. Denoting the states 1, 2, . . . , 6 for simplicity, with 1 = 321 and 4 = 123, we write the FSE for
the average time β(i) to reach state 1 from state i:
β(i) = 1 + (1/2)β(i − 1) + (1/2)β(i + 1), for i = 2, 3, . . . , 6;
β(1) = 0.
In these equations, the conventions are that 6 + 1 = 1 and 1 - 1 = 6. Solving the equations
gives β(1) = 0, β(2) = β(6) = 5, β(3) = β(5) = 8, β(4) = 9. Accordingly, the answer to our
problem is that it takes an average of 9 steps to reverse the order of the cards.
Example 14.8.7. For the same Markov chain as in the previous example, what is the
probability F(n) that it takes at most n steps to reverse the order of the cards?
Let F(n; i) be the probability that it takes at most n steps to reach state 1 from state
i. The first step equations are
F(n; 1) = 1, n ≥ 0;
F(n; i) = (1/2)F(n − 1; i − 1) + (1/2)F(n − 1; i + 1), for i ≠ 1 and n ≥ 1,
with F(0; i) = 0 for i ≠ 1.
Again we adopt the conventions that 6 + 1 = 1 and 1 − 1 = 6. We can solve the equations
numerically and plot the values of F(n) = F(n; 4). The graph is shown in Figure 14.8.
Example 14.8.8. Is the Markov chain of the previous example periodic?
Yes: it takes 2, 4, 6, . . . steps to go from state i back to itself. Thus, the Markov chain is
periodic with period 2. Recall that this implies that the probability of being in state i does
not converge to the invariant distribution (1/6, 1/6, . . . , 1/6). The graph in Figure 14.9
[Figure 14.8: F(n) as a function of n. Figure 14.9: P^n(4, 4) as a function of n.]
shows the probability of being in state 4 at time n given that X0 = 4. This is derived by
calculating P^n(4, 4). Since P^{n+1}(4, j) = Σ_i P^n(4, i)P(i, j) = 0.5 P^n(4, j − 1) + 0.5 P^n(4, j + 1),
one can compute recursively by iterating a vector with 6 elements instead of a matrix
with 36.
Example 14.8.9. We flip a fair coin repeatedly until we get either the pattern HHH or
the pattern HTH. What is the average number of flips?
Let Xn be the last two outcomes. After two flips, we start with X0 that is equally likely
to be any of the four pairs in {H, T}². Look at the transition diagram of Figure 14.10.
The FSE for the average time to hit one of the two states HTH or HHH from the other
states are as follows.
[Figure 14.10: Transition diagram on the pairs TT, TH, HT, HH, with transitions of probability 1/2 and the target patterns HTH and HHH.]
β(TT) = 1 + 0.5 β(TH) + 0.5 β(TT)
β(TH) = 1 + 0.5 β(HH) + 0.5 β(HT)
β(HT) = 1 + 0.5 β(TT)
β(HH) = 1 + 0.5 β(HT).
Solving, we find
(β(TT), β(HT), β(HH), β(TH)) = (1/5)(34, 22, 16, 24).
Since the first two flips lead to the four states with equal probabilities, the average number
of flips is
2 + (1/4)(β(TT) + β(TH) + β(HT) + β(HH)) = 34/5 = 6.8.
Example 14.8.10. Give an example of a discrete time irreducible Markov chain with period
3 that is transient.
The figure below shows the state diagram of an irreducible Markov chain with period 3.
Indeed, the Markov chain can go from 0 to 0 in 3, 6, 9, . . . steps. The probability of motion
to the right is much larger than to the left. Accordingly, one can expect that the Markov
chain is transient, and indeed it is.
[State transition diagram: an infinite irreducible chain with period 3; the transition probabilities to the right (0.9 or 1) are much larger than those to the left, so the chain drifts to the right.]
Example 14.8.11. For n = 1, 2, . . ., during year n, barring any catastrophe, your company
makes a profit equal to Xn , where the random variables Xn are i.i.d. and uniformly dis-
tributed in {0, 1, 2}. Unfortunately, during year n, there is also a probability of a catastrophe
that sets back your company’s total profit to 0. Such a catastrophe occurs with probability
5% independently each year. Explain precisely how to calculate the average time until your
company’s total profit reaches 100. (The company does not invest its money; the total profit
is the sum of the profits since the last catastrophe.) Do not perform the calculations but pro-
vide the equations to be solved and explain a complete algorithm that I could use to perform
the calculations.
Let β(i) be the average time until the profits reach 100 starting from i.
Then
β(100) = β(101) = 0
β(i) = 1 + 0.05 β(0) + (0.95/3)[β(i) + β(i + 1) + β(i + 2)], for i = 0, 1, . . . , 99.
Fix β(0) and β(1) arbitrarily. Use the second equation to find β(2), β(3), . . . , β(100), β(101),
in that order. Use the first two equations to determine the two unknowns β(0), β(1).
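Instead of the two-unknowns recursion described above, one can also solve the equations directly as one linear system; a sketch assuming numpy:

```python
import numpy as np

n = 102                      # unknowns beta(0), ..., beta(101)
A = np.eye(n)                # rows 100 and 101 encode beta(100) = beta(101) = 0
b = np.zeros(n)
for i in range(100):
    # beta(i) = 1 + 0.05 beta(0) + (0.95/3)[beta(i) + beta(i+1) + beta(i+2)]
    A[i, 0] -= 0.05
    for j in (i, i + 1, i + 2):
        A[i, j] -= 0.95 / 3
    b[i] = 1.0
beta = np.linalg.solve(A, b)
print(beta[0])               # average number of years to reach a total profit of 100
```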
Example 14.8.12. Consider the discrete time Markov chain on {0, 1, 2, 3, 4} with transition
probabilities P(i, i + 1) = P(i, i + 2) = 0.5, where the addition is modulo 5 (so that, e.g.,
P(4, 0) = P(4, 1) = 0.5). If you picture the states {0, 1, . . . , 4} as the vertices of a pentagon whose
labels increase clockwise, then the Markov chain makes one or two steps clockwise with
equal probabilities.
a. Is the Markov chain aperiodic?
b. What is its invariant distribution?
c. How many invariant distributions does the Markov chain have?
d. Write the first step equations to calculate the average time for the Markov chain to
go from state 0 back to state 0. Can you guess the answer from the result of part (b)?
a. The Markov chain is aperiodic. For instance, it can go from state 0 to itself in 4 or
5 steps.
b. By symmetry (each column of P also sums to 1), the invariant distribution is uniform:
π(i) = 1/5 for all i.
c. A finite irreducible Markov chain always has exactly one invariant distribution.
d. Let β(i) be the average time to reach state 0 from state i. The FSE are
β(0) = 0,
β(i) = 1 + 0.5 β(i + 1) + 0.5 β(i + 2), for i ≠ 0 (addition modulo 5),
and the average time to go from state 0 back to state 0 is 1 + 0.5 β(1) + 0.5 β(2). From
part (b), one can guess that this average return time is 1/π(0) = 5.
Example 14.8.13. Let {Xn , n ≥ 0} be i.i.d. Bernoulli with mean p. Define Y0 = X0 . For
The sequence {Yn , n ≥ 0} is not a Markov chain unless p = 1 or p = 0. To see this, note
that
P [Y3 = 1 | Y2 = 1, Y1 = 0] = P [X0 = 0, X1 = 0, X2 = 1 | X0 = 0, X1 = 0, X2 = 1] = 1.
The average time from state 0 to itself can be guessed from the invariant distribution as
follows. Imagine that once the Markov chain reaches state 0 it takes on average β steps for
it to return to 0. Then, the Markov chain spends one step out of every 1 + β steps in state 0,
on average. Hence, the probability of being in state 0 should be equal to 1/(1 + β). Thus,
1 + β = 1/π(0), which confirms the guess in part (d) above.
Example 14.8.14. Consider the Markov chain {Yn, n ≥ 0} on {0, 1, . . . , N} such that, if Yn = k, then
Yn+1 = max{0, min{k + Xn+1, N}}, where the {Xn, n ≥ 1} are i.i.d. with P(Xn = +1) =
p = 1 − P(Xn = −1). The random variable Y0 is independent of {Xn, n ≥ 1}. Assume that
0 < p < 1.
a. Find the invariant distribution of the Markov chain.
b. Show that the distribution of Yn converges to the invariant distribution, i.e., that the
Markov chain is asymptotically stationary.
c. Write the first step equations for the average time until Yn = 0 given Y0 = k.
a. The balance equations give
π(0) = (1 − p)π(0) + (1 − p)π(1) ⇒ π(1) = ρπ(0), where ρ := p/(1 − p);
π(1) = pπ(0) + (1 − p)π(2) = (1 − p)π(1) + (1 − p)π(2) ⇒ π(2) = ρ²π(0);
and, continuing in the same way,
π(n) = ρ^n π(0), n = 0, 1, . . . , N.
Normalizing so that the probabilities add up to one, we find
π(n) = ρ^n (1 − ρ)/(1 − ρ^{N+1}), n = 0, 1, . . . , N.
b. Let {Zn , n ≥ 0} be a stationary version of the Markov chain. That is, its initial value
Z0 is selected with the invariant distribution and if Zn = k, then Zn+1 = max{0, min{k +
Xn+1 , N }}. Define the sequence {Yn , n ≥ 0} so that Y0 is arbitrary and if Yn = k, then
Yn+1 = max{0, min{k + Xn+1 , N }}. The random variables Xn are the same for the two
sequences. Note that Zn and Yn will be equal after some finite random time τ . For instance,
we can choose τ to be the first time that N successive Xn ’s are equal to −1. Indeed, at
that time, both Y and Z must be zero and they remain equal thereafter. Now,
|P(Yn = k) − P(Zn = k)| ≤ P(τ > n) → 0 as n → ∞.
Since P(Zn = k) = π(k) for all n, this shows that P(Yn = k) → π(k).
c. Let β(k) be the average time until Yn = 0 given Y0 = k. Then
β(k) = 1 + p β(min{k + 1, N}) + (1 − p) β(k − 1), for k = 1, . . . , N;
β(0) = 0.
Chapter 15
Markov Chains - Continuous Time
We limit our discussion to a simple case: the regular Markov chains. Such a Markov chain
visits states in a countable set. When it reaches a state, it stays there for an exponentially
distributed random time (called the state holding time) with a mean that depends only on
the state. The Markov chain then jumps out of that state to another state with transition
probabilities that depend only on the current state. Given the current state, the state
holding time, the next state, and the evolution of the Markov chain prior to hitting the
current state are independent. The Markov chain is regular if jumps do not accumulate,
i.e., if it makes only finitely many jumps in finite time. We explain this construction and
we give some examples. We then state some results about the stationary distribution.
15.1 Definition
The definition involves a matrix Q = [q(i, j), i, j ∈ S] with nonnegative off-diagonal
elements and row sums equal to zero. Such a matrix is called a rate matrix or generator.
The formula says (if you can read between the symbols) that given X(t), the future and
the past are independent.
One represents the rate matrix by a state transition diagram that shows the states; an
arrow from i to j 6= i marked with q(i, j) shows the transition rate between these states.
Let Q be a rate matrix. Define the process X = {X(t), t ≥ 0} as follows. Start by choosing
X(0) according to some distribution in S. When X reaches a state i (and when it starts
in that state), it stays there for some exponentially distributed time with mean 1/q(i)
where q(i) = −q(i, i). When it leaves state i, the process jumps to state j 6= i with
probability q(i, j)/q(i). The evolution of X then continues as before. Figure 15.1 illustrates
this construction.
This construction defines a process on the positive real line if the jumps do not accumu-
late, which we assume here. A simple argument shows that this construction corresponds
to the definition. Essentially, the memoryless property of the exponential distribution and
of the jump mechanism imply the Markov property.
15.3 Examples
The example of Figure 15.2 corresponds to a Markov chain with two states. The example
of Figure 15.3 corresponds to a Poisson process with rate λ. The example of Figure 15.4
corresponds to the number of jobs in an M/M/1 queue with arrival rate λ and service rate µ.
15.4 Classification and Invariant Distribution
The classification results are similar to the discrete time case. An irreducible Markov chain
(can reach every state from every other state) is either null recurrent, positive recurrent, or
transient (defined as in discrete time). The positive recurrent Markov chains have a unique
invariant distribution; the others do not have any. Also, a distribution π is invariant if and
only if
πQ = 0. (15.4.1)
For instance, the Markov chain in Figure 15.4 is transient if ρ := λ/µ > 1; it is null recurrent
if ρ = 1; and it is positive recurrent if ρ < 1, in which case its invariant distribution is
π(n) = (1 − ρ)ρ^n, n = 0, 1, 2, . . . .
15.5 Time-Reversibility
A stationary irreducible Markov chain is time-reversible if and only if π satisfies the detailed
balance equations:
π(i) q(i, j) = π(j) q(j, i), for all i ≠ j. (15.5.1)
15.6 Summary
The random process X = {Xt, t ≥ 0} taking values in the countable set S is a Markov chain
with rate matrix Q if
P[Xt+h = j | Xt = i, Xu, 0 ≤ u ≤ t] = q(i, j)h + o(h), for j ≠ i.
The definition specifies the Markov property that given Xt the past and the future are
independent. Recall that the Markov chain stays in state i for an exponentially distributed
time with rate q(i), then jumps to state j with probability q(i, j)/q(i) for j ≠ i, and the
evolution then starts afresh.
A stationary Markov chain is time-reversible if the detailed balance equations (15.5.1) are
satisfied.
15.7 Solved Problems
Example 15.7.1. Consider n light bulbs that have independent lifetimes exponentially dis-
tributed with mean 1. What is the average time until the last bulb dies?
Let Xt be the number of bulbs still alive at time t ≥ 0. Because of the memoryless
property of the exponential distribution, {Xt, t ≥ 0} is a Markov chain. Also, the rate
of leaving state m is m, since the first of the m surviving bulbs dies at rate m.
The average time in state m is 1/m and the Markov chain goes from state n to n − 1 to
n − 2, and so on, down to state 0. Hence, the average time until the last bulb dies is
1/n + 1/(n − 1) + · · · + 1/3 + 1/2 + 1.
To fix ideas, one finds the average time to be about 3.6 when n = 20.
Example 15.7.2. In the previous example, assume that the janitor replaces a burned out
bulb after an exponentially distributed time with mean 0.1. What is the average time until
all the bulbs are dead at the same time?
The rate matrix now corresponds to the state diagram shown in Figure 15.5. Defining
[Figure 15.5: State diagram on {0, 1, . . . , n}; state m jumps to m − 1 with rate m and to m + 1 with rate 10.]
β(m) as the average time from state m to state 0, for m ∈ {0, 1, . . . , n}, we can write the
FSE as
β(m) = 1/(m + 10) + [m/(m + 10)] β(m − 1) + [10/(m + 10)] β(m + 1), for m ∈ {1, 2, . . . , n − 1};
β(n) = 1/n + β(n − 1);
β(0) = 0.
If we knew β(n − 1), we could solve recursively for all values of β(m). We could then check
that β(0) = 0. Choosing n = 20 and adjusting β(19) so that β(0) = 0, we find the answer
numerically.
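One way to carry out this computation is to treat the FSE as one linear system in β(1), . . . , β(n); a sketch assuming numpy:

```python
import numpy as np

n = 20
A = np.zeros((n, n))         # unknowns beta(1), ..., beta(n); beta(0) = 0
b = np.zeros(n)
for m in range(1, n):        # (m + 10) beta(m) - m beta(m-1) - 10 beta(m+1) = 1
    A[m - 1, m - 1] = m + 10
    if m >= 2:
        A[m - 1, m - 2] = -m
    A[m - 1, m] = -10
    b[m - 1] = 1.0
A[n - 1, n - 1] = 1.0        # beta(n) - beta(n-1) = 1/n
A[n - 1, n - 2] = -1.0
b[n - 1] = 1.0 / n
beta = np.linalg.solve(A, b)
print(beta[n - 1])           # average time from "all n bulbs alive" to "all dead"
```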
Example 15.7.3. Let A = {At, t ≥ 0} and D = {Dt, t ≥ 0} be two independent Poisson
processes with rates λ and µ, respectively. Let also X0 be a random variable independent
of A and D, and define Xt = X0 + At − Dt. Show that X is a Markov chain. What is its rate
matrix? Show that it is irreducible (unless λ = µ = 0). For what values of λ and µ is the
Markov chain recurrent or transient?
The process X is a Markov chain on the integers with q(i, i + 1) = λ and q(i, i − 1) = µ. For
instance,
P[Xt+h = i + 1 | Xt = i, X(u), 0 ≤ u ≤ t] = P(one jump of A and no jump of D in (t, t + h])
= λh + o(h).
To study the recurrence or transience of the Markov chain, consider the jump chain with
Pij = λ/(λ + µ) if j = i + 1; Pij = µ/(λ + µ) if j = i − 1; Pij = 0 otherwise.
We know that this DTMC is transient if λ/(λ + µ) ≠ µ/(λ + µ), i.e., if λ ≠ µ, and is null
recurrent if λ = µ. The same is then true for X.
Example 15.7.4. Let Q be a rate matrix on a finite state space S . For each pair of states
(i, j) ∈ S 2 with i 6= j, let N (i, j) = {Nt (i, j), t ≥ 0} be a Poisson process with rate q(i, j).
Assume that these Poisson processes are all mutually independent. Construct the process
If Xt = i, let s be the first jump time after time t of one of the Poisson processes N (i, j)
for j 6= i. If s is a jump time of N (i, j), then let Xu = i for u ∈ [t, s) and let Xs = j.
Continue the construction using the same procedure. Show that X is a Markov chain with
rate matrix Q.
First note the following. If X =_D Exd(λ) and Y =_D Exd(µ) are independent, then
min{X, Y} =_D Exd(λ + µ) and P(X < Y) = λ/(λ + µ). Now suppose the constructed process is in state i
at time t. The exponential clocks of the processes N(i, j), j ≠ i, each with rate q(i, j), are
running independently. The probability that the clock of N(i, j) expires first is
q(i, j)/Σ_{k≠i} q(i, k), and the distribution of the time spent in state i is
exponential with rate Σ_{j≠i} q(i, j).
Next, we must look at the infinitesimal rate of jumping from state i to state j. By
construction of X,
P(Xt+h = j | Xt = i, X(u), 0 ≤ u ≤ t) = q(i, j)h + o(h), for j ≠ i.
Also,
P(Xt+h = i | Xt = i, X(u), 0 ≤ u ≤ t) = 1 − Σ_{j≠i} q(i, j)h + o(h)
as h ↓ 0. Hence X is a Markov chain with rate matrix Q.
Example 15.7.5. Let X be a Markov chain with rate matrix Q on a finite state space S .
Let λ be such that λ ≥ −q(i, i) for all i ∈ S . Define the process Y as follows. Choose
Y0 = X0 . Let {Nt , t ≥ 0} be a Poisson process with rate λ and let T be its first jump time.
If Y0 = i, then let Yu = i for u ∈ [0, T ). Let also YT = j with probability q(i, j)/λ for j 6= i
and YT = i with probability 1 + q(i, i)/λ. Continue the construction in the same way. Show
that Y is also a Markov chain with rate matrix Q.
Consider the infinitesimal rate of jumping from state i to state j. In the original chain
we had:
P(Xt+h = j | Xt = i, X(u), 0 ≤ u ≤ t) = q(i, j)h + o(h), for j ≠ i.
In the new chain Y, when in state i, we start an exponentially distributed clock of rate
λ instead of rate −q(i, i) as in the original chain. When the clock expires we jump to state
j with probability q(i, j)/λ. (In chain X this probability was q(i, j)/(−q(i, i)).) The probability
of a jump from i to j during an interval of length h is the probability that the exponential
clock expires and that we actually jump:
P(Yt+h = j | Yt = i, Y(u), 0 ≤ u ≤ t) = (q(i, j)/λ)(λh + o(h))
= q(i, j)h + o(h), for j ≠ i.
In the new chain Y, the probability of staying in the current state is equal to the
probability that the exponential clock does not expire in the infinitesimal time interval h, or
that it expires but the chain jumps back to i:
P(Yt+h = i | Yt = i, Y(u), 0 ≤ u ≤ t) = 1 − λh + o(h) + (λh + o(h))(1 + q(i, i)/λ)
= 1 + q(i, i)h + o(h).
Hence the new chain Y exhibits the same rate matrix as chain X.
Example 15.7.6. Let X be a Markov chain on {1, 2, 3, 4} with the rate matrix Q given by
Q = [ −2  1  0  1
       0  0  0  0
       1  1 −2  0
       1  0  1 −2 ].
Let T2 := min{t ≥ 0 | Xt = 2}.
a. Calculate E[T2 | X0 = i]. b. Calculate E[e^{iuT2} | X0 = 1].
Define β(i) = E[T2 |X0 = i]. The general form of the first step equations for the CTMC
is:
β(i) = 1/(−q(i, i)) + Σ_{j≠i} [q(i, j)/(−q(i, i))] β(j).
β(1) = 1/2 + (1/2)β(2) + (1/2)β(4)
β(2) = 0
β(3) = 1/2 + (1/2)β(1) + (1/2)β(2)
β(4) = 1/2 + (1/2)β(1) + (1/2)β(3).
Solving these equations we get: β(1) = 1.4, β(3) = 1.2, β(4) = 1.8.
b. We can find E[eiuT2 |X0 = 1] by writing the first step equations for the characteristic
function. For instance, note that the time T (1) to hit 2 starting from 1 is equal to an
exponential time with rate 2, say τ , plus, with probability 1/2, the time T (4) to hit 2 from
state 4. Hence,
v1(u) := E(e^{iuT(1)}) = (1/2) E(e^{iuτ}) + (1/2) E(e^{iu(τ + T(4))})
= (1/2)(2/(2 − iu)) + (1/2)(2/(2 − iu)) v4(u)
= 1/(2 − iu) + (1/(2 − iu)) v4(u).
In this derivation, we used the fact that E(eiuτ ) = 2/(2 − iu) and we defined v4 (u) =
E(eiuT (4) ).
Similarly, we find
v3(u) := E(e^{iuT(3)}) = 1/(2 − iu) + (1/(2 − iu)) v1(u),
v4(u) = (1/(2 − iu)) v1(u) + (1/(2 − iu)) v3(u).
Solving these equations, we find
E[e^{iuT2} | X0 = 1] = ((2 − iu)³ + (2 − iu)) / ((2 − iu)⁴ − (2 − iu)² − (2 − iu)).
Chapter 16
Applications
Of course, one week of lectures on applications of this theory is vastly inadequate. Many
fields build on it: communication systems, computer
networks, information theory and coding, and many others. In this brief chapter we take a
brief look at a few representative examples.
16.1 Optical Communication Link
Consider the following model of an optical communication link: [Transmitter laser] →
[Photodetector] → [Receiver].
To send a bit “1” we turn the laser on for T seconds; to send a bit “0” we turn it
off for T seconds. We agree that we start by turning the laser on for T seconds before
the first bit and that we send always groups of N bits. [More sophisticated systems exist
that can send a variable number of bits.] When the laser is on, it produces light that
reaches the photodetector with some intensity λ1 . This light is “seen” by the photodetector
that converts it into electricity. When the laser is off, no light reaches the photodetector.
Unfortunately, the electronic circuitry adds some thermal noise. As a result, the receiver
sees light with an intensity λ0 + λ1 when the laser is on and with an intensity λ0 when
the laser is off. (That is, λ0 is the equivalent intensity of light that corresponds to the
noise current.) By “light with intensity λ” we mean a Poisson process of photons that has
intensity λ. Indeed, a laser does not produce photons at precise times. Rather, it produces
a Poisson stream of photons. The brighter the laser, the larger the intensity of the Poisson
process.
The problem of the receiver is to decide whether it has received a bit 0 or a bit 1 during
a given time interval of T seconds. For simplicity, we assume that the boundary between
successive bit intervals is known to the receiver.
Let Y be the number of photons that the receiver sees during the time interval. Let
X = 0 if the bit is 0 and X = 1 if the bit is 1. Then fY |X [y|0] is Poisson with mean λ0 T
and fY|X[y|1] is Poisson with mean (λ0 + λ1)T. The problem can then be formulated as a
hypothesis testing problem.
Assume that the bits 0 and 1 are equally likely. To minimize the probability of detection
error we should detect using the MLE (which is the same as the MAP in this case). That
is, we decide 1 if P[Y = y | X = 1] > P[Y = y | X = 0]. To simplify the math, we approximate
a Poisson random variable with mean µ by a N(µ, µ) random variable. With this approximation,
fY|X[y|0] = (1/√(2πλ0 T)) exp{−(y − λ0 T)²/(2λ0 T)}
and
fY|X[y|1] = (1/√(2π(λ0 + λ1)T)) exp{−(y − (λ0 + λ1)T)²/(2(λ0 + λ1)T)}.
Figure 16.1 shows these densities when λ0 T = 10 and (λ0 + λ1 )T = 25.
Using the graphs, one sees that the receiver decides that it got a bit 1 whenever Y > 16.3.
(In fact, I used the actual values of the densities to identify this threshold.) You also see
that P[Decide "1" | X = 0] = P(N(10, 10) > 16.3) = 0.023 and P[Decide "0" | X = 1] =
P(N(25, 25) < 16.3) = 0.041. To find the numerical values, you use a calculator or a
table of the c.d.f. of N(0, 1) after you write that P(N(10, 10) > 16.3) = P(N(0, 1) >
(16.3 − 10)/√10) ≈ 0.023.
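A short sketch of the numerical step (assuming scipy is available):

```python
from math import sqrt
from scipy.stats import norm

t = 16.3                                   # decision threshold
print(1 - norm.cdf((t - 10) / sqrt(10)))   # P(N(10,10) > 16.3), about 0.023
print(norm.cdf((t - 25) / sqrt(25)))       # P(N(25,25) < 16.3), about 0.041
```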
One interesting question is to figure out how to design the link so that the probability
of errors are acceptably small. A typical target for an optical link is a probability of error
of the order of 10−12 , which is orders of magnitude smaller than what we have achieved so
far. To reduce the probability of error, one must reduce the amount of “overlap” of the two
densities shown in the figure. If the values of λ0 and λ1 are given (λ0 depends on the noise
and λ1 depends on the received light power from the transmitter laser), one solution is to
increase T , i.e., to transmit the bits more slowly by spending more time for each bit. Note
that the graphs will separate if one multiplies the means and variances by some constant
larger than 1.
16.2 Wireless Communication Link
Consider the following model of a wireless communication link: [Transmitter and antenna]
→ [Receiver antenna and receiver].
For simplicity, assume a discrete time model of the system. To transmit bit 0 the
transmitter sends a signal a := {an, n = 1, 2, . . . , N}; to transmit bit 1, it
sends a signal b := {bn, n = 1, 2, . . . , N}. The actual values are selected based on the
efficiency of the antenna at transmitting such signals and on some other reasons. In any
case, it seems quite intuitive that the two signals should be quite different if the receiver
must be able to distinguish a 0 from a 1. We try to understand how the receiver makes its
choice. As always, the difficulty is that the receiver gets the transmitted signal corrupted
by noise: it observes
Yn = an + Zn, n = 1, 2, . . . , N,
when the transmitter sends a bit 0, and
Yn = bn + Zn, n = 1, 2, . . . , N,
when the transmitter sends a bit 1, where {Zn, n = 1, 2, . . . , N} are i.i.d., N(0, σ²).
The MLE decides 0 if ‖Y − a‖² := Σ_n (Yn − an)² < Σ_n (Yn − bn)² =: ‖Y − b‖², and
decides 1 otherwise. That is, the MLE decides 0 if the received signal Y is closer to the
signal a than to the signal b. Nothing counterintuitive; of course the key is to measure
“closeness” correctly. One can then study the probability of making an error and one finds
that it depends on the energy of the signals a and b ; again this is not too surprising. The
problem can be extended to more than 2 signals so that we can send more than one bit in
N steps. The problem of designing the best signals (that minimize the probability of error)
is then an interesting design problem.
16.3 The M/M/1 Queue
Consider jobs that arrive at a queue according to a Poisson process with rate λ. The jobs are
served by a single server and they require i.i.d. Exd(µ) service times. This queue is called
the M/M/1 queue.
Because of the memoryless property of the exponential distribution, the number X(t)
of jobs in the queue at time t is a Markov chain with a birth-death state transition diagram.
The balance equations are
λπ(0) = µπ(1);
(λ + µ)π(n) = λπ(n − 1) + µπ(n + 1), n ≥ 1.
You can check that the solution is π(n) = ρ^n (1 − ρ) for n = 0, 1, . . ., with ρ = λ/µ,
provided that λ < µ.
Consider a job that arrives at the queue and the queue is in steady-state (i.e., X(·) has
its invariant distribution). What is the probability that it finds n other jobs already in the
queue? Say that the job arrives during (t, t + ε); then we want
P[X(t) = n | X(t + ε) − X(t) = 1] = P(X(t) = n) = π(n),
because X(t + ε) − X(t) and X(t) are independent. (The memoryless property of the Poisson
process.) This result is known as PASTA: Poisson arrivals see time averages.
How long will the job spend in the queue? To do the calculation, first recall that a
random variable W is Exd(α) if and only if E(exp{−uW }) = α/(u+α). Next, observe that
with probability π(n), the job has to wait for n + 1 i.i.d. Exd(µ) service times Z1, . . . , Zn+1.
Hence, the total time V that the job spends in the queue satisfies
E(exp{−uV}) = Σ_n π(n) E(exp{−u(Z1 + · · · + Zn+1)})
= Σ_n ρ^n (1 − ρ)(µ/(u + µ))^{n+1} = · · · = (µ − λ)/(u + µ − λ).
This calculation shows that the job spends an exponential time in the queue with mean
1/(µ − λ).
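A Monte Carlo check of this result, via Lindley's recursion for the waiting times (a sketch assuming numpy; λ = 0.5 and µ = 1 are made-up values):

```python
import numpy as np

lam, mu, K = 0.5, 1.0, 200_000
rng = np.random.default_rng(2)
inter = rng.exponential(1 / lam, K)     # interarrival times
serv = rng.exponential(1 / mu, K)       # service times

w, sojourn = 0.0, np.empty(K)
for k in range(K):
    sojourn[k] = w + serv[k]            # waiting time plus own service
    w = max(0.0, w + serv[k] - inter[k])
print(sojourn.mean())                   # close to 1/(mu - lam) = 2.0
```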
16.4 Speech Recognition
We explain a simplified model of speech recognition and the algorithm that computes the
MAP.
The string X = (X1, . . . , Xn) of words that the speaker pronounces is modelled as a Markov
chain:
P(X = x) = π(x1)P(x1, x2) · · · P(xn−1, xn)
for x = (x1, . . . , xn) ∈ S^n. Here, π(·) and P(·, ·) are supposed to model the language. The listener
hears sounds Y = (Y1, . . . , Yn); given X = x, the sounds are independent with
P[Yk = yk | Xk = xk] = Q(xk, yk).
This model is called a hidden Markov chain model: The string that the speaker pro-
nounces is a Markov chain that is hidden from the listener who cannot read her mind but
instead only hears the sounds. We want to calculate M AP [X|Y ]. Note that
P[X = x | Y = y] = P(X = x) P[Y = y | X = x] / P(Y = y)
= π(x1)P(x1, x2) · · · P(xn−1, xn) Q(x1, y1) · · · Q(xn, yn)/P(Y = y).
Consequently,
MAP[X | Y = y] = arg max_x π(x1)P(x1, x2) · · · P(xn−1, xn) Q(x1, y1) · · · Q(xn, yn),
where the maximization is over x ∈ S^n. To maximize this product, we minimize the negative
of its logarithm. Define
d1(0, x) = − log(π(x)Q(x, y1))
and
dk(x, x′) = − log(P(x, x′)Q(x′, yk)), for k = 2, . . . , n.
Minimizing d1(0, x1) + d2(x1, x2) + · · · + dn(xn−1, xn) is equivalent to finding the shortest
path in a graph whose nodes at stage k are the possible values of xk.
A shortest path algorithm is due to Bellman-Ford. Let A(x, k) be the length of the
shortest path up to stage k that ends in x. Then
A(x, k + 1) = min_{x′} {A(x′, k) + dk+1(x′, x)},
where the minimum is over all x′ in S. This algorithm, applied to calculate the MAP, is
known as the Viterbi algorithm.
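A sketch of this shortest-path computation in Python (assuming numpy; the values of π, P, Q and the observed string y are made-up illustrative values):

```python
import numpy as np

pi = np.array([0.6, 0.4])                  # initial distribution on S = {0, 1}
P = np.array([[0.7, 0.3], [0.2, 0.8]])     # transition matrix of the word chain
Q = np.array([[0.9, 0.1], [0.3, 0.7]])     # Q(x, y) = P[hear y | pronounce x]
y = [0, 1, 1, 0]
S, n = len(pi), len(y)

A = np.zeros((S, n))                       # A[x, k]: shortest length ending at x
back = np.zeros((S, n), dtype=int)
A[:, 0] = -np.log(pi * Q[:, y[0]])         # d1(0, x) = -log(pi(x) Q(x, y1))
for k in range(1, n):
    for x in range(S):
        cand = A[:, k - 1] - np.log(P[:, x] * Q[x, y[k]])   # d_k(x', x)
        back[x, k] = int(np.argmin(cand))
        A[x, k] = cand[back[x, k]]

x_hat = [int(np.argmin(A[:, n - 1]))]      # backtrack the shortest path
for k in range(n - 1, 0, -1):
    x_hat.append(int(back[x_hat[-1], k]))
print(x_hat[::-1])                         # the MAP word string
```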
16.5 A Simple Game
Consider the following "matching pennies" game. Alice and Bob both have a penny and
they select which face they show. If they both show the same face, Alice wins $1.00 from
Bob. If they show different faces, Bob wins $1.00 from Alice. Intuitively it is quite clear
that the best way to play the game is for both players to choose randomly and with equal
probabilities which face to show. This strategy constitutes an “equilibrium” in the sense
that no player has an incentive to deviate unilaterally from it. Indeed, if Bob plays randomly
(50/50), then the average reward of Alice is 0 no matter how she plays, so she might as well
play randomly (50/50). It is also not hard to see that this is the only equilibrium. Such an
equilibrium is called a Nash equilibrium.
This example is a particular case of a general type of games that can be described as
follows. If Alice chooses the action a ∈ A and Bob the action b ∈ B, then Alice gets the
payoff A(a, b) and Bob the payoff B(a, b). If Alice and Bob choose a and b randomly and
independently in A and B, then they get the corresponding expected rewards. Nash proved
the remarkable result that if A and B are finite, then there must be at least one such
equilibrium in random (mixed) strategies.
16.6 Decisions
One is given a perfectly shuffled 52-card deck. The cards will be turned over one at a time.
You must try to guess when an ace is about to be turned over. If you guess correctly, you
win $1.00, otherwise you lose. One strategy might be to wait until a number of cards are
turned over; if you are lucky, the fraction of aces left in the deck will get larger than 4/52
and this will increase your odds of guessing right. Unfortunately, things might go the other
way and you might see a number of aces being turned over quickly, thus reducing your odds
of winning.
A simple argument shows that it does not really matter how you play. To see this,
designate by V (n, m) your maximum expected reward given that there remain m aces and
a total of n cards in the deck. By maximum, we mean the expected reward you can get by
playing the game in the best possible way. The key to the analysis is to observe that you
have two choices as the game starts with n cards and m aces: either you gamble on the next
card or you don’t. If you do, that next card is an ace with probability m/n and you win $1.00
with that probability. If you don’t, then, after the next card is turned over, with probability
m/n you find yourself with a deck of n − 1 cards with m − 1 aces; with probability 1 − m/n,
you face a deck of n − 1 cards and m aces. Accordingly, if you play the game optimally
after the next card, your expected reward is (m/n)V (n − 1, m − 1) + (1 − m/n)V (n − 1, m).
Hence, the maximum reward V(n, m) should be either m/n (if you gamble on the next
card) or (m/n)V(n − 1, m − 1) + (1 − m/n)V(n − 1, m) (if you do not), whichever is larger:
V(n, m) = max{m/n, (m/n)V(n − 1, m − 1) + (1 − m/n)V(n − 1, m)}. (16.6.1)
You can verify that the solution of the above equations is V (n, m) = m/n whenever
n > 0. Thus, both alternatives (gamble on the next card or don’t) yield the same expected
reward.
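Equation (16.6.1) is easy to solve by memoized recursion; the sketch below (plain Python) confirms numerically that both alternatives give V(n, m) = m/n:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V(n, m):
    """Maximum expected reward with n cards left, m of them aces."""
    if m == 0:
        return 0.0                          # no ace left: you cannot win
    gamble = m / n                          # bet that the next card is an ace
    wait = (m / n) * V(n - 1, m - 1)        # see one card, then play on
    if m < n:
        wait += (1 - m / n) * V(n - 1, m)
    return max(gamble, wait)

assert abs(V(52, 4) - 4 / 52) < 1e-9        # both strategies give m/n
```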
The equations (16.6.1) are called the Dynamic Programming Equations (DPE). They
express the maximum expected reward by comparing the consequences of the different
decisions that can be made at a stage of the game. The trick to write down that equation
is to identify correctly the “state” of the game (here, the pair (n, m)). By solving the DPE
and choosing at each stage the decision that corresponds to the maximum, one derives the
optimal strategy. These ideas are due to Richard Bellman. (I borrowed this simple example
Appendix A
Mathematics Review
A.1 Numbers
You are familiar with whole, rational, real, and complex numbers. You know how to perform
operations on complex numbers and how to convert them to and from the r × eiθ notation.
Recall that |a + ib| = √(a² + b²). For instance, you can verify that
(3 + i)/(1 − i) = 1 + 2i
and that
2 + i = √5 e^{iθ} with θ = arctan(1/2).
Let A be a set of real numbers. An upper bound of A is a finite number b such that b ≥ a
for all a in A. If there is an upper bound of A that is in A, it is called the maximal element
of A and is designated by max{A}. If A has an upper bound, it has a lowest upper bound
that is designated by sup{A}. One defines a lower bound, the minimal element min{A},
and the greatest lower bound inf{A} similarly.
For instance, let A = (2, 5]. Then 6 is an upper bound of A, 1 is a lower bound,
max{A} = sup{A} = 5, inf{A} = 2, and A has no minimal element.
For any real number x one defines x⁺ = max{x, 0} and x⁻ = (−x)⁺. Note that
|x| = x⁺ + x⁻. We also use the notation x ∧ y = min{x, y} and x ∨ y = max{x, y}. For
instance, 3 ∧ 5 = 3 and 3 ∨ 5 = 5.
A.2 Summations
You should be comfortable with sums of se-
quences {xn, n ≥ 1}. For instance, you remember and you can prove that if a ≠ 1, then
Σ_{n=0}^{N} a^n = (1 − a^{N+1})/(1 − a).
n=0
By taking the derivative of the above expression with respect to a, you find that, when
|a| < 1,
Σ_{n=0}^{∞} n a^{n−1} = 1/(1 − a)².
By taking the derivative one more time, we get
Σ_{n=0}^{∞} n(n − 1) a^{n−2} = 2/(1 − a)³.
You can also exchange the order of summation:
Σ_{n=0}^{N} Σ_{m=0}^{n} x_{m,n} = Σ_{m=0}^{N} Σ_{n=m}^{N} x_{m,n}.
A.3 Combinatorics
A.3.1 Permutations
There are n! different orderings of n distinct objects, where
n! = 1 × 2 × 3 × · · · × n.
By convention, 0! = 1.
For instance, there are 120 ways of seating 5 people at a table with 5 chairs.
A.3.2 Combinations
There are C(N, n) (read "N choose n") distinct groups of n objects selected without
replacement from a set of N objects, where
C(N, n) = N!/((N − n)! n!).
For instance, there are about 2.6 × 106 distinct sets of five cards picked from a 52-card
deck.
Recall also the binomial theorem:
(a + b)^N = Σ_{n=0}^{N} C(N, n) a^n b^{N−n}.
A.3.3 Variations
You should be able to apply these ideas and their variations. For instance, you can count
the number of ways of choosing objects with or without replacement, and with or without
ordering.
A.4 Calculus
You know the meaning of the integral
∫_a^b f(x) dx.
In particular, you know how to calculate some simple integrals. You recall that, for
n = 0, 1, 2, . . .,
∫_0^1 x^n dx = 1/(n + 1).
Also,
∫_1^A (1/x) dx = ln A.
You know the integration by parts formula and you can calculate
∫_0^y x^n e^x dx.
You remember that (1 + a/n)^n → e^a as n → ∞.
Also,
e^x = 1 + x + x²/2! + x³/3! + · · ·
and you know similar power series expansions.
A.5 Sets
A set is a well-defined collection of elements. That is, for every element one can determine
whether it is in the set or not. Recall the notation x ∈ A meaning that x is an element of
It is usual to characterize a set by a proposition that its elements satisfy. For instance
A ∩ B = {x | x ∈ A and x ∈ B}.
Similarly, you know how to define A ∪ B, A \ B, and A∆B. You also know the meaning
You are not confused by the notation and you would never write [1, 2] ∈ [0, 3] because
you know that [1, 2] ⊂ [0, 3]. Similarly, you would never write 1 ⊂ [0, 3] but you would write
1 ∈ [0, 3] or {1} ⊂ [0, 3]. Along the same lines, you know that 0 ∈ [0, 3] but you would never
write 0 ∈ (0, 3).
A.6 Countability
A set A is countable if it is finite or if one can enumerate its elements as A = {a1 , a2 , a3 , . . .}.
A subset of a countable set is countable. If the sets An are countable for n ≥ 1, then so is
their union
A = ∪_{n=1}^{∞} An := {a | a ∈ An for some n ≥ 1}.
The set [0, 1] is not countable. To see that, imagine that one can enumerate its elements
and write them as decimal expansions, for instance as {0.23145..., 0.42156.., 0.13798..., . . .}.
This list does not contain the number 0.135... selected in a way that the first digit in the
expansion differs from the first digit of the first element in the list, its second digit differs
from that of the second element, and so on. This diagonal argument shows that there is no
such enumeration.
A.7 Logic
Let p and q be two propositions. We say "if p then q" if the proposition q is true whenever
p is. For instance, if p means “it rains” and q means “the roof gets wet,” then we can
You know that if the statement “if p then q” is true, then so is the statement “if not q
then not p.” However, the statement “if not p then not q” may not be true.
Therefore, if we know that the statement "if p then q" is true, a method for proving
that p is false is to show that q is false: a proof by contradiction. For instance, to show
that √2 is not rational, assume that √2 = a/b where the integers a
and b are not both multiples of 2. Taking the square, we get 2 = a²/b². This implies that
a² = 2b² is even, which implies that a is even and that b is not (since a and b are not both
multiples of 2). But then a = 2c and a² = 4c² = 2b², which shows that b² = 2c² is even, so
that b is even, a contradiction.
Assume that for n ≥ 1, p(n) designates a proposition. The induction method for proving
that p(n) is true for all finite n ≥ 1 consists in showing first that p(1) is true and second
that if p(n) is true, then so is p(n + 1). The second step is called the induction step.
For instance, let us prove that a + a² + · · · + a^n = (a − a^{n+1})/(1 − a) for a ≠ 1. This
identity is certainly true for n = 1. Assume that it is true for some N. Then
a + a² + · · · + a^N + a^{N+1} = (a − a^{N+1})/(1 − a) + a^{N+1} = (a − a^{N+2})/(1 − a),
so that the identity also holds for N + 1, which completes the induction.
If p(n) is true for all finite n, this does not imply that p(∞) is true, even if p(∞) is
well-defined. For instance, the set {1, 2, . . . , n} is finite for all finite n, but {1, 2, . . .} is
infinite.
A.8 Sample Problems
Problem A.8.1. Express (1 + 3i)/(2 + i) in the form a + bi and in the form r × e^{iθ}.
Problem A.8.2. Prove by induction that
Σ_{k=1}^{n} k³ = (Σ_{k=1}^{n} k)².
Note: We want a proof by induction, not a direct proof. You may use the fact that
Σ_{k=1}^{n} k = n(n + 1)/2.
Problem A.8.3. Give an example of a bounded function f(x) defined on [0, 1] such that
the function f(x) does not have a maximum on [0, 1].
Problem A.8.4. Calculate ∫_0^1 (x + 1)/(x + 2) dx.
Problem A.8.5. Explain whether the following statements are correct:
1. 0 ∈ (0, 1);
2. 0 ⊂ (−1, 3).
Problem A.8.6. Calculate ∫_0^∞ x² e^{−x} dx.
Problem A.8.7. Let A = (1, 5), B = [0, 3), and C = (2, 4). What is A \ (B∆C)?
Problem A.8.10. Let A be a set of numbers and define B = {−a|a ∈ A}. Show that
inf{A} = − sup{B}.
Problem A.8.12. How many distinct sets of five cards with three red cards can one draw
from a 52-card deck?
Problem A.8.13. Let A be a set of real numbers with an upper bound b. Show that sup{A}
exists.
Problem A.8.14. Derive the expression for Σ_{n=0}^{N} a^n.
Problem A.8.15. Let {xn, n ≥ 1} be real numbers such that xn ≤ xn+1 and xn ≤ a < ∞
for all n. Show that xn converges to a finite limit.
Problem A.8.16. Show that the set of finite sentences in English is countable.
Appendix B
Functions
A function f(·) is a mapping from a set D into another set V. To each point x of D, the
function associates a value f(x) in V.
Appendix C
Nonmeasurable Set
C.1 Overview
We defined events as being some sets of outcomes. The collection of events is closed under
countable set operations. When the sample space is countable, we can define the probability
of every set of outcomes. However, in general this is not possible. For instance, one cannot
define the length of every subset of the real line. We explain that fact in this note. These
ideas are a bit subtle. We explain them only because some students always ask for a justification.
C.2 Outline
We construct a set S of real numbers between 0 and 1 with the following properties. Define
Sx = {y + x|y ∈ S}. That is, Sx is the set S shifted by x. Then there is a countable
collection C of numbers such that the union A of Sx for x in C is such that [1/3, 2/3] is a
subset of A and A is a subset of [0, 1]. Moreover, Sx and Sy are disjoint whenever x and y
are distinct in C.
Assume that the “length” of S is L. The length of Sx is also L for all x. (Indeed, the
length is first defined for intervals and is shift-invariant.) The length of A must be the sum
of the length of the sets Sx for x in C since it is a countable union of these disjoint sets. If
L > 0, then the length of A is infinite, which is not possible since A is contained in [0, 1].
If L = 0, then the length of A must be 0, which is not possible since A contains [1/3, 2/3].
C.3 Constructing S
We start by defining x and y in [0, 1/3] to be equivalent if they differ by a rational num-
ber. For instance, x = 2^{0.5}/8 and x + 0.12 are equivalent. We can then look at all the
equivalence classes of [0, 1/3], i.e., all the sets of equivalent numbers. Two different equiv-
alence classes must be disjoint. We then form a set S by picking one element from each
equivalence class. [Some philosophers will object to this selection, arguing that it is not
reasonable. They refuse the axiom of choice that postulates that such as set is well defined.]
Note that all the numbers in S are in [0, 1/3] and that any two numbers in S cannot be
equivalent since they were picked in different equivalence classes. That is, any two distinct
numbers in S differ by an irrational number. Recall also that S is a subset of
[0, 1/3].
Next, let C be all the rational numbers in [0, 2/3]. For x in C, Sx is a subset of [0, 1].
Also, for any two distinct x, y in C the sets Sx and Sy must be disjoint. Otherwise, they
would have a common element u + x = v + y with u, v in S,
but this implies that x − y = v − u is rational, which is not possible. It remains only to
show that the union of the sets Sx for x in C contains [1/3, 2/3]. Pick any number w in
[1/3, 2/3]. Note that w − 1/3 is in [0, 1/3] and must be equivalent to some s in S. That
is, w − 1/3 = s + r for some rational r, so that w = s + x with x := r + 1/3 rational. Since
w ∈ [1/3, 2/3] and s ∈ [0, 1/3], we see that x = w − s ∈ [0, 2/3], so that w ∈ Sx
for some x in C.
Appendix D
Key Results
We cover a number of important results in this course. In the table below we list these
results. We indicate the main reference and their applications. In the table, Ex. refers to
a solved example.
Appendix E
Bertrand's Paradox
The point of this note is that one has to be careful about the meaning of “choosing at
random.”
Consider the following question: What is the probability that a chord selected at random
in a circle is larger than the side of an inscribed equilateral triangle? There are three
plausible answers to this question: 1/2, 1/3, and 1/4. Of course, the answer depends on
what one means by "selecting a chord at random."
Answer 1: 1/3
The first choice is shown in the left-most part of Figure E.1. To choose the chord, we fix
a point A on the circle; it will be one of the ends of the chord. We then choose another
point X at random, uniformly on the circle. If X falls on the arc BC opposite to A
(where ABC is equilateral), then AX is longer than the sides of ABC. That arc covers one
third of the circle, so the requested
probability is 1/3.
[Figure E.1: The three ways of choosing a chord at random: a random second endpoint X (left), a random midpoint X inside the circle (middle), and a random midpoint X on a radius OA (right).]
Answer 2: 1/4
The second choice is illustrated in the middle part of Figure E.1. We choose the chord by
choosing its midpoint (e.g., X) at random inside the circle. The chord is longer than the
side of the inscribed equilateral triangle if and only if X falls inside the circle with half the
radius and the same center. The area of that smaller circle is one fourth of that of the
original circle, so the requested probability is 1/4.
Answer 3: 1/2
The third choice is illustrated in the right-most part of Figure E.1. We choose the chord by
choosing its midpoint (e.g., X) at random on a given radius OA of the circle. The chord is
longer than the side of the inscribed triangle if and only if the point is closer to the center
than half the radius. This happens with probability 1/2.
Appendix F
Simpson's Paradox
The point of this note is that proportions do not add up and that one has to be careful
with statistics.
Consider a university where 80% of the male applicants are accepted but only 51% of
the female applicants are accepted. You will be tempted to conclude that the university
discriminates against female applicants. However, a closer look at this university shows
that it has only two colleges with the admission records shown in the table.
Note that each college admits a larger fraction of female applicants than of male appli-
cants, so that the university cannot be accused of discrimination against the female students.
Appendix G
Familiar Distributions
We collect here the few distributions that we encounter repeatedly in the text.
G.1 Table
G.2 Examples
Here are typical random experiments that give rise to these distributions. We also comment
• Geometric: Number of flips until the first H. Memoryless. Holding time of a state of
a discrete-time Markov chain.
• Poisson: Number of photons that hit a given area in a given time interval. Limit of
B(n, p) as np = λ and n → ∞. The sum of independent P(λi) random variables is
P(Σ_i λi).
• Exponential: Time until the next photon hits. Memoryless. Holding time of a state of
a continuous-time Markov chain.
• Gaussian: Thermal noise. Sum of many small independent random variables (CLT).