University of California Lecture Notes: Probability Theory
Jean Walrand
Department of Electrical Engineering and Computer Sciences
University of California
Berkeley, CA 94720
Table of Contents

Abstract
Introduction
1 Modelling Uncertainty
1.1 Models and Physical Reality
1.2 Concepts and Calculations
1.3 Function of Hidden Variable
1.4 A Look Back
1.5 References
2 Probability Space
2.1 Choosing At Random
2.2 Events
2.3 Countable Additivity
2.4 Probability Space
2.5 Examples
2.5.1 Choosing uniformly in {1, 2, . . . , N}
2.5.2 Choosing uniformly in [0, 1]
2.5.3 Choosing uniformly in [0, 1]²
2.6 Summary
2.6.1 Stars and Bars Method
2.7 Solved Problems
3 Conditional Probability and Independence
3.4 Independence
3.4.1 Example 1
3.4.2 Example 2
3.4.3 Definition
3.4.4 General Definition
3.5 Summary
3.6 Solved Problems
4 Random Variable
4.1 Measurability
4.2 Distribution
4.3 Examples of Random Variable
4.4 Generating Random Variables
4.5 Expectation
4.6 Function of Random Variable
4.7 Moments of Random Variable
4.8 Inequalities
4.9 Summary
4.10 Solved Problems
5 Random Variables
5.1 Examples
5.2 Joint Statistics
5.3 Independence
5.4 Summary
5.5 Solved Problems
6 Conditional Expectation
6.1 Examples
6.1.1 Example 1
6.1.2 Example 2
6.1.3 Example 3
6.2 MMSE
6.3 Two Pictures
6.4 Properties of Conditional Expectation
6.5 Gambling System
6.6 Summary
6.7 Solved Problems
9 Estimation
9.1 Properties
9.2 Linear Least Squares Estimator: LLSE
9.3 Recursive LLSE
9.4 Sufficient Statistics
9.5 Summary
9.5.1 LLSE
9.6 Solved Problems
16 Applications
16.1 Optical Communication Link
16.2 Digital Wireless Communication Link
16.3 M/M/1 Queue
16.4 Speech Recognition
16.5 A Simple Game
16.6 Decisions
B Functions
Bibliography
Abstract
These notes are derived from lectures and office-hour conversations in a junior/senior-level
course on probability and random processes in the Department of Electrical Engineering
and Computer Sciences at the University of California, Berkeley.
The notes do not replace a textbook. Rather, they provide a guide through the material.
The style is casual, with no attempt at mathematical rigor. The goal is to help the student figure out the meaning of various concepts and to illustrate them with examples.
When choosing a textbook for this course, we always face a dilemma. On the one hand,
there are many excellent books on probability theory and random processes. However, we
find that these texts are too demanding for the level of the course. On the other hand,
books written for engineering students tend to be fuzzy in their attempt to avoid subtle
mathematical concepts. As a result, we always end up having to complement the textbook
we select. If we select a math book, we need to help the student understand the meaning of
the results and to provide many illustrations. If we select a book for engineers, we need to
provide a more complete conceptual picture. These notes grew out of these efforts at filling
the gaps.
You will notice that we are not trying to be comprehensive. All the details are available
in textbooks. There is no need to repeat the obvious.
The author wants to thank the many inquisitive students he has had in that class and
the very good teaching assistants, in particular Teresa Tung, Mubaraq Misra, and Eric Chi,
who helped him over the years; they contributed many of the problems.
Happy reading and keep testing hypotheses!
Introduction
Engineering systems are designed to operate well in the face of uncertainty in the characteristics of the components and systems we design. Communication systems are designed to compensate for noise. Internet routers are built to absorb traffic fluctuations. Buildings must resist unpredictable vibrations. Circuit manufacturing steps are subject to unpredictable variations. Understanding how to model uncertainty and how to analyze its effects is, or should be, an essential part of an engineer's education.
What should you understand about probability? It is a complex subject that has been
constructed over decades by pure and applied mathematicians. Thousands of books explore
various aspects of the theory. How much do you really need to know and where do you
start?
The first key concept is how to model uncertainty (see Chapters 2 and 3). What do we mean by a “random experiment?” Once you understand that concept, the notion of a random variable should become transparent (see Chapters 4 and 5). You may be surprised to learn that a random variable does not vary! Terms may be confusing. Once you appreciate the notion of randomness, you should get some understanding of the idea of expectation (Section 4.5) and how observations modify it (Chapter 6). A special class of random variables (Gaussian) is particularly useful in many applications (Chapter 7). After you master these key notions,
you are ready to look at detection (Chapter 8) and estimation problems (Chapter 9). These
are representative examples of how one can process observations to reduce uncertainty. That
is, how one learns. Many systems are subject to the cumulative effect of many sources of
randomness. We study such effects in Chapter 11 after having provided some background
in Chapter 10. The final set of important notions concerns random processes: uncertain evolution over time. We look at particularly useful models of such processes in the last chapters of the notes.
The concepts are difficult, but the math is not (Appendix ?? reviews what you should
know). The trick is to know what we are trying to compute. Look at examples and invent
new ones to reinforce your understanding of ideas. Don’t get discouraged if some ideas seem
obscure at first, but do not let the obscurity persist! This stuff is not that hard; it is only new.
Chapter 1

Modelling Uncertainty

In this chapter, we stress the importance of the concepts that justify the structure of the theory. We comment on the notion of a hidden variable. We conclude the chapter with a very brief historical look at how the theory developed.

1.1 Models and Physical Reality

There is a difference between the physical world and the models of Probability Theory. That difference is similar to that between laws of theoretical physics and the real world: even though mathematicians view the theory as standing on its own, when engineers use it, they see it as a model of the physical world.
Consider flipping a fair coin repeatedly. Designate by 0 and 1 the two possible outcomes
of a coin flip (say 0 for head and 1 for tail). This experiment takes place in the physical
world. The outcomes are uncertain. In this chapter, we try to appreciate the relationship between such physical experiments and their probability models.

1.2 Concepts and Calculations
In our many years of teaching probability models, we have always found that what is
most subtle is the interpretation of the models, not the calculations. In particular, this
introductory course uses mostly elementary algebra and some simple calculus. However,
understanding the meaning of the models, what one is trying to calculate, requires becoming familiar with some rather subtle ideas. To perform the calculations, little more than that elementary mathematics is needed; but to develop some intuition about the theory, to be able to anticipate theorems and results, to relate these developments to the physical reality, it is important to have some interpretation of the definitions and of the basic axioms of the theory. We will attempt to provide such interpretations throughout these notes.

1.3 Function of Hidden Variable
One idea is that the uncertainty in the world is fully contained in the selection of some
hidden variable. (This model does not apply to quantum mechanics, which we do not
consider here.) If this variable were known, then nothing would be uncertain anymore.
Think of this variable as being picked by nature at the big bang. Many choices were
possible, but one particular choice was made and everything derives from it. [In most cases,
it is easier to think of nature’s choice only as it affects a specific experiment, but we worry
about this type of detail later.] In other words, everything that is uncertain is a function of
that hidden variable. By function, we mean that if we know the hidden variable, then we can determine everything that is uncertain.
Let us denote the hidden variable by ω. Take one uncertain thing, such as the outcome
of the fifth coin flip. This outcome is a function of ω. If we designate the outcome of
the fifth coin flip by X, then we conclude that X is a function of ω. We can denote that
function by X(ω). Another uncertain thing could be the outcome of the twelfth coin flip.
We can denote it by Y(ω). The key point here is that X and Y are functions of the same hidden variable ω.
Summing up, everything that is random is some function X of some hidden variable ω.
This is a model. To make this model more precise, we need to explain how ω is selected
and what these functions X(ω) are like. These ideas will keep us busy for a while!
1.4 A Look Back

The theory was developed by a number of inquiring minds, and we briefly review some of their contributions. (We condense this historical account from the very nice book by S. M. Stigler [9]. For ease of exposition, we simplify the examples and the notation.)

Least Squares
Say that an amplifier has some gain A that we would like to measure. We observe the
input X and the output Y and we know that Y = AX. If we could measure X and Y
precisely, then we could determine A by a simple division. However, assume that we cannot
measure these quantities precisely. Instead, we make two sets of measurements: (X, Y) and (X′, Y′). Assume that (X, Y) = (2, 5) and (X′, Y′) = (4, 7). No value of A works exactly for both sets of measurements. The problem is that we did not measure the input and the output accurately enough.
One approach is to average the measurements, say by taking the arithmetic means ((X + X′)/2, (Y + Y′)/2) = (3, 6), and to find the gain A so that 6 = A × 3, so that A = 2.
A second approach is to solve for A for each pair of measurements: for (X, Y), we find A = 2.5 and for (X′, Y′), we find A = 1.75. We can average these values and decide that A = (2.5 + 1.75)/2 = 2.125.

A third approach is to choose the value of A that minimizes the sum of the squares of the errors between Y and AX and between Y′ and AX′. That is, we look for A that minimizes

(5 − 2A)² + (7 − 4A)² = 74 − 76A + 20A².

Setting the derivative with respect to A equal to 0, we find −76 + 40A = 0, or A = 1.9. This is the solution proposed by Legendre in 1805.
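As a quick numerical check of this least squares calculation, here is a short Python sketch (the variable names are ours):

import numpy as np

# The two measurements from the example: (X, Y) = (2, 5) and (X', Y') = (4, 7).
x = np.array([2.0, 4.0])
y = np.array([5.0, 7.0])

# Least squares: minimize sum_i (y_i - A * x_i)^2 over A.
# Setting the derivative to zero gives A = sum(x_i * y_i) / sum(x_i^2).
A = np.sum(x * y) / np.sum(x ** 2)
print(A)  # 38 / 20 = 1.9, the value found by Legendre's method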
The method of least squares is one that produces the “best” prediction of the output based on the input, under rather general conditions. However, to understand this notion, we first need some concepts from probability theory.
If an urn contains 5 red balls and 7 blue balls, then the odds of picking “at random” a
red ball from the urn are 5 out of 12. One can view the likelihood of a complex event as
being the ratio of the number of favorable cases divided by the total number of “equally
likely” cases. This is a somewhat circular definition, but not completely: from symmetry
considerations, one may postulate the existence of equally likely events. However, in most
situations, one cannot determine – let alone count – the equally likely cases nor the favorable
cases. (Consider for instance the odds of having a sunny Memorial Day in Berkeley.)
Jacob Bernoulli (one of twelve Bernoullis who contributed to Mathematics, Physics, and
Probability) showed the following result. If we pick a ball from an urn with r red balls and
b blue balls a large number N of times (always replacing the ball before the next attempt),
then the fraction of times that we pick a red ball approaches r/(r + b). More precisely, he
showed that the probability that this fraction differs from r/(r + b) by more than any given
ε > 0 goes to 0 as N increases. We will learn this result as the weak law of large numbers.

De Moivre refined this result of Bernoulli. When N is large and ε small, he derived the normal approximation to the probability discussed earlier. This is the first mention of this distribution and an early example of a central limit result.
Looking again at Bernoulli’s and de Moivre’s problem, we see that they assumed p =
r/(r + b) known and worried about the probability that the fraction of N balls selected from
the urn differs from p by more than a fixed ε > 0. Bernoulli showed that this probability goes to zero (he also got some conservative estimates of the value of N needed for that probability to be smaller than a given amount).
Simpson (a heavy drinker) worried about the “reverse” question. Assume we do not
know p and that we observe the fraction q of a large number N of balls being red. We
believe that p should be close to q, but how close can we be confident that it is? Simpson
proposed a naïve answer by making arbitrary assumptions on the likelihood of the values
of p.
Bayes understood Simpson’s error. To appreciate Bayes’ argument, assume that q = 0.6
and that we have made 100 experiments. What are the odds that p ∈ [0.55, 0.65]? If you are
told that p = 0.5, then these odds are 0. However, if you are told that the urn was chosen
such that p = 0.5 or p = 1, with equal probabilities, then the odds that p ∈ [0.55, 0.65] are
now close to 1.
Bayes understood how to include systematically the information about the prior distribution in the calculation of the posterior distribution. He discovered what we know today as Bayes' rule. Later, Laplace proved general versions of the central limit theorem and derived various approximation results for integrals (based on asymptotic expansions).
Gauss developed the systematic theory of least squares estimation when the errors are
Gaussian. We explain in the notes the remarkable fact that the best estimate is linear in
the observations.
Markov Chains
A sequence of coin flips produces results that are independent. Many physical systems
exhibit a more complex behavior that requires a new class of models. Markov introduced
a class of such models that make it possible to capture dependencies over time. His models, called Markov chains, are studied in later chapters.
Kolmogorov

Kolmogorov was one of the most prolific mathematicians of the 20th century. He made fundamental contributions to many branches of mathematics, including functional analysis, the theory of probability and mathematical statistics, and the analysis of dynamical systems. He formulated the axiomatic framework of modern probability theory and established some essential properties such as the extension theorem and many other fundamental results.
1.5 References
There are many good books on probability theory and random processes. For the level of
this course, we recommend Ross [7], Hoel et al. [4], Pitman [5], and Bremaud [2]. The
books by Feller [3] are always inspiring. For a deeper look at probability theory, Breiman [1] is a good start. For cute problems, we recommend Sevastyanov et al. [8].
Chapter 2
Probability Space
In this chapter we describe the probability model of “choosing an object at random.” Examples will help us come up with a good definition. We explain that the key idea is to assign probabilities to sets of possible outcomes. These sets are events. The description of the events and of their probabilities specifies the random experiment.

2.1 Choosing At Random

First consider picking a card out of a 52-card deck. We could say that the odds of picking
any particular card are the same as that of picking any other card, assuming that the deck
has been well shuffled. We then decide to assign a “probability” of 1/52 to each card. That
probability represents the odds that a given card is picked. One interpretation is that if we
repeat the experiment “choosing a card from the deck” a large number N of times (replacing
the card previously picked every time and re-shuffling the deck before the next selection),
then a given card, say the ace of diamonds, is selected approximately N/52 times. Note that
this is only an interpretation. There is nothing that tells us that this is indeed the case;
moreover, if it is the case, then there is certainly nothing yet in our theory that allows us to
expect that result. Indeed, so far, we have simply assigned the number 1/52 to each card
in the deck. Our interpretation comes from what we expect from the physical experiment. That interpretation also suggests deeper properties of the sequences of successive cards picked from a deck. We will come back
to these deeper properties when we study independence. You may object that the definition
of probability involves implicitly that of “equally likely events.” That is correct as far as
the interpretation goes. The mathematical definition does not require such a notion.
Consider now throwing a dart at a dartboard. The probability of hitting a specific point on the board, measured with pinpoint accuracy, is essentially zero.
Accordingly, in contrast with the previous example, we cannot assign numbers to individual
outcomes of the experiment. The way to proceed is to assign numbers to sets of possible
outcomes. Thus, one can look at a subset of the dartboard and assign some probability
that represents the odds that the dart will land in that set. It is not simple to assign the
numbers to all the sets in a way that these numbers really correspond to the odds of a given
dart player. Even if we forget about trying to model an actual player, it is not that simple
to assign numbers to all the subsets of the dartboard. At the very least, to be meaningful,
the numbers assigned to the different subsets must obey some basic consistency rules. For
instance, if A and B are two subsets of the dartboard such that A ⊂ B, then the number
P (B) assigned to B must be at least as large as the number P (A) assigned to A. Also, if A
and B are disjoint, then P (A ∪ B) = P (A) + P (B). Finally, P (Ω) = 1, if Ω designates the
set of all possible outcomes (the dartboard, possibly extended to cover all bases). This is the
basic story: probability is defined on sets of possible outcomes and it is additive. [However,
it turns out that one more property is required: countable additivity (see below).]
Note that we can lump our two examples into one. Indeed, the first case can be viewed
as a particular case of the second where we would define P (A) = |A|/52, where A is any
subset of the deck of cards and |A| is the number of cards in A. This definition is
certainly additive and it assigns the probability 1/52 to any one card.
Some care is required when defining what we mean by a random choice. See Bertrand's paradox for a famous example.
2.2 Events
The sets of outcomes to which one assigns a probability are called events. It is not necessary
(and often not possible, as we may explain later) for every set of outcomes to be an event.
For instance, assume that we are only interested in whether the card that we pick is
black or red. In that case, it suffices to define P (A) = 0.5 = P (Ac ) where A is the set of all
the black cards and Ac is the complement of that set, i.e., the set of all the red cards. Of
course, we know that P(Ω) = 1 where Ω is the set of all the cards and P(∅) = 0, where ∅ is the empty set. Combinations of events should be events also. Indeed, if we want to define the probability that the outcome is in A and the probability that it is in B, it is reasonable to ask that we can also define the probability that it is in both A and B, or in at least one of the two. By extension, set operations that are performed on a finite collection of events should always produce an event. For instance, if A, B, C, D are events, then [(A \ B) ∩ C] ∪ D should also
be an event. We say that the set of events is closed under finite set operations. [We explain
below that we need to extend this property to countable operations.] With these properties,
it makes sense to write for disjoint events A and B that P(A ∪ B) = P(A) + P(B). Indeed, A ∪ B is then itself an event, so that its probability is defined. If you want to see why, for uncountable sample spaces, all sets of outcomes generally cannot be events, see Appendix C.

2.3 Countable Additivity
This topic is the first serious hurdle that you face when studying probability theory. If
you understand this section, you increase considerably your appreciation of the theory.
We want to be able to say that if the events An for n = 1, 2, . . ., are such that An ⊂ An+1
for all n and if A := ∪n An, then P(An) ↑ P(A) as n → ∞. Why is this useful? This property, called σ-additivity, is the key to being able to approximate events. The property specifies that the probability is continuous: if we approximate an event, then we also approximate its probability.
This strategy of “filling the gaps” by taking limits is central in mathematics. You
remember that real numbers are defined as limits of rational numbers. Similarly, integrals
are defined as limits of sums. The key idea is that different approximations should give the
same result. For this to work, we need the continuity property above.
Accordingly, we require that A := ∪n An be an event whenever the events An for n = 1, 2, . . . are such that An ⊂ An+1. More generally, we require that the collection of events be closed under countable set operations.
For instance, if we define P([0, x]) = x for x ∈ [0, 1], then we can define P([0, a)) = a because if ε is small enough, then An := [0, a − ε/n] is such that An ⊂ An+1 and [0, a) = ∪n An, so that P([0, a)) = lim_n P(An) = lim_n (a − ε/n) = a.
You may wish to review the meaning of countability (see Appendix ??).
2.4 Probability Space

Putting together the observations of the sections above, we have defined a probability space as follows. A probability space is a triple {Ω, F, P} where

• Ω is a nonempty set, called the sample space, whose elements are the possible outcomes;

• F is a σ-field of Ω, i.e., a collection of subsets of Ω that contains Ω and is closed under countable set operations, whose elements are the events;

• P is a countably additive function from F into [0, 1] such that P(Ω) = 1, called a probability measure.
Examples will clarify this definition. The main point is that one defines the probability
of sets of outcomes (the events). The probability should be countably additive (to be
continuous). Accordingly (to be able to write down this property), and also quite intuitively, the collection of events must be closed under countable set operations.
2.5 Examples
Throughout the course, we will make use of simple examples of probability spaces. We review a few of them here.

2.5.1 Choosing uniformly in {1, 2, . . . , N}

We say that we pick a value ω uniformly in {1, 2, . . . , N} when the N values are equally
likely to be selected. In this case, the sample space Ω is Ω = {1, 2, . . . , N }. For any subset
A ⊂ Ω, one defines P(A) = |A|/N where |A| is the number of elements in A. For instance, P({1, 2}) = 2/N.

2.5.2 Choosing uniformly in [0, 1]

Here, Ω = [0, 1] and one has, for example, P([0, 0.3]) = 0.3 and P([0.2, 0.7]) = 0.5. That
is, P(A) is the “length” of the set A. Thus, if ω is picked uniformly in [0, 1], then the probability that ω falls in a set A is the length of A.
It turns out that one cannot define the length of every subset of [0, 1], as we explain
in Appendix C. The collection of sets whose length is defined is the smallest σ-field that
contains the intervals. This collection is called the Borel σ-field of [0, 1]. More generally, the
smallest σ-field of ℜ that contains the intervals is the Borel σ-field of ℜ, usually designated by B.

2.5.3 Choosing uniformly in [0, 1]²

Here, Ω = [0, 1]² and one has, for example, P([0.1, 0.4] × [0.2, 0.8]) = 0.3 × 0.6 = 0.18. That
is, P(A) is the “area” of the set A. Similarly, in that case, if B = {ω = (ω1, ω2) ∈ Ω | ω1 ≤ ω2} and C = {ω ∈ Ω | ω1² + ω2² ≤ 1}, then

P(B) = 1/2 and P(C) = π/4.
As in one dimension, one cannot define the area of every subset of [0, 1]2 . The proper
σ-field is the smallest that contains the rectangles. It is called the Borel σ-field of [0, 1]2 .
More generally, the smallest σ-field of ℜ² that contains the rectangles is the Borel σ-field of ℜ².
2.6 Summary
To describe a random experiment, one specifies a probability space {Ω, F, P}: a sample space Ω, a σ-field F of Ω, i.e., a collection of subsets of Ω that is closed under countable set operations, and a countably additive probability measure P on F with P(Ω) = 1.

The idea is to specify the likelihood of various outcomes (elements of Ω). If one can
specify the probability of individual outcomes (e.g., when Ω is countable), then one can
choose F = 2Ω , so that all sets of outcomes are events. However, this is generally not
possible as the example of the uniform distribution on [0, 1] shows. (See Appendix C.)
2.6.1 Stars and Bars Method

In many problems, we use a method for counting the number of ordered groupings of
identical objects. This method is called the stars and bars method. Suppose we are given
identical objects we call stars. Any ordered grouping of these stars can be obtained by
separating them by bars. For example, || ∗ ∗ ∗ |∗ separates four stars into four groups of sizes
0, 0, 3, and 1.
Suppose that we want to arrange the N identical stars into M groups. The number of orderings is the number of ways of placing the N identical stars and M − 1 identical bars into N + M − 1 spaces, which is the binomial coefficient C(N + M − 1, M − 1).

Creating compound objects of stars and bars is useful when there are bounds on the sizes of the groups, as in Example 2.7.8 below; the sketch that follows checks the basic count.
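Here is a small Python sanity check of this formula against a brute-force enumeration (the function name is ours):

from math import comb

def stars_and_bars(n_stars: int, m_groups: int) -> int:
    """Ordered groupings of n identical stars into m groups of size >= 0:
    place the n stars and m - 1 bars into n + m - 1 spaces."""
    return comb(n_stars + m_groups - 1, m_groups - 1)

# Four stars into four groups, as in the example ||***|* above.
print(stars_and_bars(4, 4))  # 35

# Brute force: solutions of x1 + x2 + x3 + x4 = 4 with xi >= 0.
brute = sum(1 for x1 in range(5) for x2 in range(5) for x3 in range(5)
            if x1 + x2 + x3 <= 4)  # x4 = 4 - x1 - x2 - x3 is then determined
print(brute)  # 35 as well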
2.7 Solved Problems

Example 2.7.1. Describe the probability space {Ω, F, P} that corresponds to the random
experiment “picking five cards without replacement from a perfectly shuffled 52-card deck.”
1. One can choose Ω to be all the permutations of A := {1, 2, . . . , 52}. The interpretation
of ω ∈ Ω is then the shuffled deck. Each permutation is equally likely, so that pω = 1/(52!)
for ω ∈ Ω. When we pick the five cards, these cards are (ω1 , ω2 , . . . , ω5 ), the top 5 cards of
the deck.
2. One can also choose Ω to be all the subsets of A with five elements. In this case, each subset is equally likely and, since there are N := C(52, 5) such subsets, one defines pω = 1/N for ω ∈ Ω.
3. One can also choose Ω to be the set of ordered lists of five distinct cards of A, i.e., Ω = {(ω1, . . . , ω5) | ωi ∈ A and ωi ≠ ωj for i ≠ j, i, j ∈ {1, 2, . . . , 5}}. In this case, the outcome specifies the order in which we pick the cards. Since there are M := 52!/(47!) such ordered lists of five cards without replacement, we define pω = 1/M for ω ∈ Ω.

As this example shows, there are multiple ways of describing a random experiment.
What matters is that Ω is large enough to specify completely the outcome of the experiment.
Example 2.7.2. Pick three balls without replacement from an urn with fifteen balls that
are identical except that ten are red and five are blue. Specify the probability space.
One possibility is to specify the color of the three balls in the order they are picked.
Then

Ω = {R, B}³, F = 2^Ω, P({RRR}) = (10/15)(9/14)(8/13), . . . , P({BBB}) = (5/15)(4/14)(3/13).
Example 2.7.3. You flip a fair coin until you get three consecutive ‘heads’. Specify the
probability space.
One possible choice is Ω = {H, T }∗ , the set of finite sequences of H and T . That is,
{H, T}* = ∪_{n=1}^{∞} {H, T}^n.

This is another example of a probability space that is bigger than necessary, but easier to describe than the smallest possible one.
Example 2.7.4. Let Ω = {0, 1, 2, . . .}. Let F be the collection of subsets of Ω that are either finite or whose complement is finite. Is F closed under countable set operations?

No, F is not closed under countable set operations. For instance, {2n} ∈ F for each n ≥ 0, but

A := ∪_{n=0}^{∞} {2n},

the set of even numbers, is not in F: neither A nor its complement is finite.
Example 2.7.5. In a class with 24 students, what is the probability that no two students have the same birthday? (Assume that the 365 possible birthdays are equally likely.)

Let N = 365 and n = 24. The probability α that the n birthdays are all different is

α := (N/N) × ((N − 1)/N) × ((N − 2)/N) × · · · × ((N − n + 1)/N).

With n = 24 and N = 365 we find that α ≈ 0.46.
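A minimal Python computation of this product (the function name is ours):

def prob_all_distinct(n: int, N: int = 365) -> float:
    """Probability that n uniform, independent birthdays among N days are all distinct."""
    p = 1.0
    for k in range(n):
        p *= (N - k) / N
    return p

print(prob_all_distinct(24))  # ~0.462
print(prob_all_distinct(23))  # ~0.493: the probability is already below 1/2 at 23 students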
Example 2.7.6. Let A, B, C be three events. Assume that P(A) = 0.6, P(B) = 0.6, P(C) = . . ., together with the probabilities of the pairwise intersections and of the union. Find P(A ∩ B ∩ C).

Applying the inclusion-exclusion formula of Example 2.7.9 below to the given values and solving for the triple intersection, one finds that

P(A ∩ B ∩ C) = 0.2.
Example 2.7.7. Let Ω = {1, 2, 3, 4} and let F = 2^Ω be the collection of all the subsets of Ω. Find a collection of events A ⊂ F and two probability measures P1 and P2 on F such that (i) P1 and P2 agree on the events in A; (ii) the σ-field generated by A is F (this means that F is the smallest σ-field of Ω that contains A); and (iii) P1 ≠ P2.

Choose the point masses so that P1 and P2 are not the same, thus satisfying (iii). To check (i), note for instance that

P1({1, 2}) = P1({1}) + P1({2}) = 1/8 + 1/8 = 1/4,
P2({1, 2}) = P2({1}) + P2({2}) = 1/12 + 2/12 = 1/4.

To check (ii), we only need to check that for every k ∈ Ω, {k} can be formed by set operations on sets in A ∪ {∅, Ω}. Then any other set in F can be formed by set operations on the sets {k}.
Example 2.7.8. Choose a number randomly between 1 and 999999 inclusive, all choices being equally likely. What is the probability that the digits sum up to 23? For example, the number 7646 is between 1 and 999999 and its digits sum up to 23 (7+6+4+6=23).

Numbers between 1 and 999999 inclusive have 6 digits x1, . . . , x6, each with a value in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. We are interested in counting the solutions of x1 + x2 + x3 + x4 + x5 + x6 = 23 with these constraints.

First consider all nonnegative xi where each digit can range from 0 to 23. The number of ways to distribute 23 amongst the xi's is C(28, 5).
But we need to restrict the digits to xi < 10, so we need to subtract the number of ways in which some digit is 10 or more. Writing xk = 10 + yk, the number of ways to arrange 23 amongst the xi when some xk ≥ 10 is the same as the number of ways to arrange the yi so that y1 + · · · + y6 = 23 − 10, which is C(18, 5). There are 6 possible choices of the digit xk ≥ 10, so there are a total of 6 C(18, 5) ways for some digit to be greater than or equal to 10, as we can see by treating a block of ten stars as a single compound object.
However, the above counts some configurations multiple times. For instance, x1 = x2 = 10 is counted both when x1 ≥ 10 and when x2 ≥ 10. We need to account for these configurations that are counted multiple times. Consider the case when two digits are greater than or equal to 10, say xj ≥ 10 and xk ≥ 10 with j ≠ k. The number of ways to distribute 23 amongst the xi when two of them are greater than or equal to 10 is equivalent to the number of ways to distribute the yi when y1 + · · · + y6 = 23 − 10 − 10 = 3. There are C(8, 5) ways to distribute these yi and there are C(6, 2) ways to choose the two digits that are 10 or more.

Since the sum of the xi is 23, at most two of the xi can be 10 or more, so the inclusion-exclusion stops here. The probability that a randomly chosen number has digits that sum up to
23 is

[C(28, 5) − 6 C(18, 5) + C(6, 2) C(8, 5)] / 999999.
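The inclusion-exclusion count can be verified by brute force in a few lines of Python:

from math import comb

# Inclusion-exclusion count of 6-digit strings whose digits sum to 23.
count = comb(28, 5) - 6 * comb(18, 5) + comb(6, 2) * comb(8, 5)
print(count, count / 999999)  # 47712, ~0.0477

# Brute-force check over 1..999999, padding each number to 6 digits.
brute = sum(1 for n in range(1, 1000000)
            if sum(int(d) for d in f"{n:06d}") == 23)
print(brute)  # 47712 as well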
Example 2.7.9. Let A1, A2, . . . , An, n ≥ 2, be events. Prove that

P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An).

We argue by induction. First consider the base case n = 2: P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2). Assume the result holds true for n; we prove it for n + 1. We have

P(∪_{i=1}^{n+1} Ai) = P(∪_{i=1}^n Ai) + P(An+1) − P((∪_{i=1}^n Ai) ∩ An+1).

Expanding the first term by the induction hypothesis, and the last term, which equals P(∪_{i=1}^n (Ai ∩ An+1)), by the induction hypothesis applied to the events Ai ∩ An+1, and collecting terms yields the formula for n + 1.
Example 2.7.10. Let A1, A2, . . . be events such that Σn P(An) < ∞. Show that the probability that infinitely many of those events occur is zero. This result is known as the Borel-Cantelli Lemma.

To prove this result we must write the event “infinitely many of the events An occur” in terms of set operations on the An. That event is

A = ∩_{m=1}^{∞} ∪_{n=m}^{∞} An.
To see this, note that ω is in infinitely many An if and only if for all m ≥ 1 there is some
n ≥ m such that ω ∈ An .
Chapter 3

Conditional Probability and Independence

The theme of this chapter is how to use observations. The key idea is that observations modify our belief about the likelihood of events. The mathematical notion of conditional probability captures that idea.

3.1 Conditional Probability

Assume that we know that the outcome is in B ⊂ Ω. Given that information, what is the
probability that the outcome is in A ⊂ Ω? This probability is written P [A|B] and is read
“the conditional probability of A given B,” or “the probability of A given B”, for short.
For instance, one picks a card at random from a 52-card deck. One knows that the card
is black. What is the probability that it is the ace of clubs? The sensible answer is that
if one only knows that the card is black, then that card is equally likely to be any one of
the 26 black cards. Therefore, the probability that it is the ace of clubs is 1/26. Similarly,
given that the card is black, the probability that it is an ace is 2/26, because there are 2 black aces.

We can formulate that calculation as follows. Let A be the set of aces (4 cards) and B the set of black cards (26 cards). Then, P[A|B] = P(A ∩ B)/P(B) = (2/52)/(26/52) = 2/26.
Also, given that the outcome is in B, the probabilities of the outcomes in B should be rescaled by dividing them by P(B). This division does not modify the relative likelihood of the various outcomes in B. Accordingly, one defines

P[A|B] = P(A ∩ B)/P(B).
This definition of conditional probability makes sense if P (B) > 0. If P (B) = 0, we define
P [A|B] = 0. This definition is somewhat arbitrary but it makes the formulas valid in all
cases.
Note that
P (A ∩ B) = P [A | B]P (B).
3.2 Remark
Define P′(A) = P[A|B] for any event A. Then P′(·) is a new probability measure. In
particular, the usual formulas apply. For instance, P′(A ∩ C) = P′[A|C]P′(C), i.e.,

P[A ∩ C | B] = P[A | B ∩ C] P[C | B],

which you can verify by using the definition of P[·|B]. After a while, you should be able to write such identities directly.
3.3 Bayes' Rule

Let B1 and B2 be disjoint events whose union is Ω. Let also A be another event. We can write

P(A) = P(A ∩ B1) + P(A ∩ B2) = P[A|B1]P(B1) + P[A|B2]P(B2).

Hence,

P[B1|A] = P[A|B1]P(B1) / (P[A|B1]P(B1) + P[A|B2]P(B2)).
This formula extends to a finite number of events Bn that partition Ω. The result is
known as Bayes' rule. Think of the Bn as possible “causes” of some effect A. You know the
prior probabilities P (Bn ) of the causes and also the probability that each cause provokes
the effect A. The formula tells you how to calculate the probability that a given cause
provoked the observed effect. Applications abound, as we will see in detection theory. For
instance, your alarm can sound if there is a burglar but also if there is no burglar (a false
alarm). Given that the alarm sounds, what is the probability that it is a false alarm?
3.4 Independence
It may happen that knowing that an event occurs does not change the probability of another
event. In that case, we say that the events are independent. Let us look at an example
first.
3.4.1 Example 1
We roll two dice and we designate the pair of results by ω = (ω1, ω2). Then Ω has 36 elements, and we assume that each outcome ω has probability 1/36. Let A = {ω ∈ Ω | ω1 ∈ {1, 3, 4}} and B = {ω ∈ Ω | ω2 ∈ {3, 5}}.

Using the conditional probability formula, we find P[A|B] = P(A ∩ B)/P(B) = (6/36)/(12/36) = 1/2. Note also that P(A) = 18/36 = 1/2. Thus, in this example, P[A|B] = P(A).

The interpretation is that if we know the outcome of the second roll, we don't learn anything about the likelihood of the outcomes of the first roll.
3.4.2 Example 2
We pick two points independently and uniformly in [0, 1]. In this case, the outcome ω =
(ω1 , ω2 ) of the experiment (the pair of points chosen) belongs to the set Ω = [0, 1]2 . That
point ω is picked uniformly in [0, 1]2 . Let A = [0.2, 0.5] × [0, 1] and B = [0, 1] × [0.2, 0.8].
The interpretation of A is that the first point is picked in [0.2, 0.5]; that of B is that the
second point is picked in [0.2, 0.8]. Note that P (A) = 0.3 and P (B) = 0.6. Moreover, since
A ∩ B = [0.2, 0.5] × [0.2, 0.8], one finds that P(A ∩ B) = 0.3 × 0.6 = P(A)P(B). Thus, A and B are independent.
3.4.3 Definition
Motivated by the discussion above, we say that two events A and B are independent if
P (A ∩ B) = P (A)P (B).
Do not confuse “independent” and “disjoint.” If two events A and B are disjoint, then
they are independent only if at least one of them has probability 0. Indeed, if they are disjoint, then P(A ∩ B) = P(∅) = 0, so that independence requires P(A)P(B) = 0, i.e., P(A) = 0 or P(B) = 0. Intuitively, if A and B are disjoint, then knowing that A occurs implies that B
does not, which is some new information about B unless B is impossible in the first place.
Generally, we say that a collection of events {Ai, i ∈ I} are mutually independent if for any finite set J ⊂ I one has

P(∩_{i∈J} Ai) = Π_{i∈J} P(Ai).

Subtlety
The definition seems innocuous, but one has to be a bit careful. For instance, look at the following example. The sample space Ω = {1, 2, 3, 4} has four points that have a probability 1/4 each. The events A, B, C are defined as A = {1, 2}, B = {1, 3}, C = {2, 3}. We can verify that A and B are independent, that A and C are independent, and so are B and C. However, the events {A, B, C} are not mutually independent: P(A ∩ B ∩ C) = P(∅) = 0 ≠ P(A)P(B)P(C) = 1/8.
The point of the example is the following. Knowing that A has occurred tells us some-
thing about outcome ω of the random experiment. This knowledge, by itself, is not sufficient
to affect our estimate of the probability that C has occurred. The same is true if we know
that B has occurred. However, if we know that both A and B have occurred, then we
know that C cannot have occurred. Thus, it is not correct to think that “A does not tell
us anything about C, B does not tell us anything about C, therefore A and B do not tell
us anything about C.” I encourage you to think about this example carefully.
3.5 Summary

In this chapter we defined the conditional probability P[A|B] = P(A ∩ B)/P(B), derived Bayes' rule, and defined the independence and mutual independence of events.

3.6 Solved Problems

Example 3.6.1. That identity is false. Here is one counterexample. Let Ω = {1, 2, 3, 4} and pω = 1/4 for each ω ∈ Ω.
Example 3.6.2. There are two coins. The first coin is fair. The second coin is such that P(H) = 0.6 = 1 − P(T). You are given one of the two coins, each with probability 1/2. You flip the coin four times and three of the four outcomes are H. What is the probability that your coin is the fair one?

Let A designate the event “your coin is fair.” Let also B designate the event “three of the four flips show H.” Then

P[A|B] = P(A ∩ B)/P(B) = P[B|A]P(A) / (P[B|A]P(A) + P[B|A^c]P(A^c))
= C(4, 3)(1/2)^4 / (C(4, 3)(1/2)^4 + C(4, 3)(0.6)^3(0.4)) = 2^{−4} / (2^{−4} + (0.6)^3 × 0.4) ≈ 0.42.
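A short Python sketch of this Bayes' rule computation (the variable names are ours):

from math import comb

# Likelihood of seeing three heads in four flips, for each coin.
p_heads_fair = comb(4, 3) * 0.5 ** 3 * 0.5      # P[B | A]
p_heads_biased = comb(4, 3) * 0.6 ** 3 * 0.4    # P[B | A^c]

# The equal priors P(A) = P(A^c) = 1/2 cancel in the ratio.
posterior_fair = p_heads_fair / (p_heads_fair + p_heads_biased)
print(posterior_fair)  # ~0.42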
Example 3.6.3. Choose two numbers uniformly but without replacement in {0, 1, . . . , 10}. What is the probability that the sum is less than or equal to 10 given that the smallest is less than or equal to 5?

Draw a picture of the pairs of numbers in the events A (sum at most 10) and B (smallest at most 5). Since the pairs are equally likely and A ⊂ B,

P[A|B] = |A ∩ B| / |B| = |A| / |B|.
Your picture shows that |A| = 10 + 9 + 8 + · · · + 1 = 55 and that |B| = 10 × 5 + 4 × 5 = 70. Hence P[A|B] = 55/70 = 11/14.
Example 3.6.4. You flip a fair coin repeatedly. What is the probability that you have to flip it exactly ten times to see two ‘heads’?

There must be exactly one head among the first nine flips and the last flip must be a head. Hence, the probability is

9 × (1/2)^9 × (1/2) = 9/2^{10}.
Example 3.6.5. a. Let A and B be independent events. Show that A^c and B are independent.
b. Let A and B be two events. If the occurrence of event B makes A more likely, then
does the occurrence of the event A make B more likely? Justify your answer.
a. We have

P(A^c ∩ B) = P(B) − P(A ∩ B) = P(B) − P(A)P(B) = (1 − P(A))P(B) = P(A^c)P(B),

so that A^c and B are independent.
b. The occurrence of event B makes A more likely can be interpreted as P [A|B] > P (A).
Now,

P[A|B] = P(A ∩ B)/P(B) = P[B|A]P(A)/P(B) > P(A).

Hence P[B|A]/P(B) > 1, so that P[B|A] > P(B). Thus the occurrence of event A makes B more
likely.
Example 3.6.6. A man has 5 coins in his pocket. Two are double-headed, one is double-
tailed, and two are normal. The coins cannot be distinguished unless one looks at them.
a. The man shuts his eyes, chooses a coin at random, and tosses it. What is the probability that the lower face of the coin is a head?

b. He opens his eyes and sees that the upper face of the coin is a head. What is the probability that the lower face is a head?

c. He shuts his eyes again, picks up the same coin, and tosses it again. What is the probability that the lower face is a head?

d. He opens his eyes and sees that the upper face is a head. What is the probability that the lower face is a head?
Let D denote the event that he picks a double-headed coin, N denote the event that he
picks a normal coin, and Z be the event that he picks the double-tailed coin. Let HLi (and
HUi ) denote the event that the lower face (and the upper face) of the coin on the ith toss
is a head.
a. One has

P(HL1) = P[HL1|D]P(D) + P[HL1|N]P(N) + P[HL1|Z]P(Z) = (1)(2/5) + (1/2)(2/5) + (0)(1/5) = 3/5.
b. We find

P[HL1|HU1] = P(HL1 ∩ HU1)/P(HU1) = (2/5)/(3/5) = 2/3.
c. We write

P[HL2|HU1] = P(HL2 ∩ HU1)/P(HU1)
= (P[HL2 ∩ HU1|D]P(D) + P[HL2 ∩ HU1|N]P(N) + P[HL2 ∩ HU1|Z]P(Z))/P(HU1)
= ((1)(2/5) + (1/4)(2/5) + (0)(1/5))/(3/5) = 5/6.
d. Similarly, conditioning on D, N, and Z as in part c, one finds

P[HL2 | HU1 ∩ HU2] = ((1)(2/5) + (1/8)(2/5)) / ((1)(2/5) + (1/4)(2/5)) = (9/20)/(1/2) = 9/10.

Chapter 4

Random Variable
In this chapter we define a random variable and we illustrate the definition with examples.
We then define the expectation and moments of a random variable. We conclude the chapter with some useful inequalities and a collection of solved problems.

A random variable takes real values. The definition is: “a random variable is a measurable function from Ω into (−∞, +∞).” If the outcome of the random experiment is ω, then the value of the random variable is X(ω).
Physical examples: noise voltage at a given time and place, temperature at a given time
and place, height of the next person to enter the room, and so on. The color of a randomly
picked apple is not a random variable since its value is not a real number.
4.1 Measurability
For instance, let Ω = [0, 1] and A = [0, 0.5]. Assume that the events are [0, 1], [0, 0.5], (0.5, 1],
and ∅. Assume that we have defined P([0, 0.5]) = 0.73 and that this is all we
know. Consider the function X(ω) that takes the value 0 when ω is in [0, 0.3] and the
value 1 when ω is in (0.3, 1]. This function is not a random variable: we cannot determine P(X = 0) because the probability of [0, 0.3] is not defined. This is what we mean by measurability. Thus, measurability is not a subtle notion. It is a first-order idea: it asks what are the functions whose statistics are defined by the probabilities we have specified. (Recall that the collection of events is only required to be closed under countable set operations.)

Definition 4.1.1. A random variable on {Ω, F, P} is a function X : Ω → ℜ such that

X^{−1}(B) ∈ F, ∀B ∈ B,
where B is the Borel σ-field of ℜ, i.e., the smallest σ-field that contains the intervals.
4.2 Distribution
If X takes only countably many values x1, x2, . . ., it is a discrete random variable and the collection of probabilities {P(X = xn), n ≥ 1} is then called the Probability Mass Function (pmf) of the random variable X.

More generally, the function FX(x) := P(X ≤ x), x ∈ ℜ - called the cumulative distribution function (cdf) of X - completely characterizes the “statistics” of X. For short, we also call FX the distribution of X.

A function F(·) : ℜ → ℜ is the cdf of some random variable if and only if it is nondecreasing, right-continuous, and satisfies

lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.   (4.2.1)

These properties follow from the continuity of probability: if An ↓ A, then P(An) ↓ P(A). For instance, since (−∞, x] ↓ ∅ as x ↓ −∞, it follows that FX(x) ↓ 0 as x ↓ −∞. The fact that a function with these properties is the cdf of some random variable can be seen from the construction of Section 4.4 below.

We say that X is a continuous random variable if

P(a < X ≤ b) = ∫_a^b fX(x) dx

for all real numbers a < b. In this expression, fX(·) is a nonnegative function called the probability density function (pdf) of X. The pdf fX(·) is the derivative of the cdf FX(·).

Obviously, a discrete random variable is not continuous. Also, a random variable may be neither discrete nor continuous. In that case, one can write its pdf formally as

fX(x) = g(x) + Σn pn δ(x − xn),

where g(·) is the derivative of FX(·) where it is differentiable and δ(x − xn) is a Dirac impulse at xn. The impulse is defined formally by the property that ∫ g(x)δ(x − x0)dx = g(x0) whenever g(·) is a function that is continuous at x0. With this formal definition, you can write the pdf of random variables that are neither discrete nor continuous, as we do in some examples below.

4.3 Examples of Random Variable
We say that the random variable X has a Bernoulli distribution with parameter p ∈ [0, 1] if

P(X = 1) = p and P(X = 0) = 1 − p. (4.3.1)

The random variable X has a binomial distribution with parameters n ∈ {1, 2, . . .} and p ∈ [0, 1], and we write X =D B(n, p), if

P(X = k) = C(n, k) p^k (1 − p)^{n−k}, for k = 0, 1, . . . , n. (4.3.2)

The random variable X has a geometric distribution with parameter p ∈ (0, 1], and we write X =D G(p), if

P(X = n) = (1 − p)^n p, for n ≥ 0. (4.3.3)

A random variable X with a geometric distribution has the remarkable property of being memoryless:

P[X ≥ n + m | X ≥ n] = P(X ≥ m), ∀ m, n ≥ 0.

Indeed, since P(X ≥ n) = (1 − p)^n,

P[X ≥ n + m | X ≥ n] = P(X ≥ n + m)/P(X ≥ n) = (1 − p)^{n+m}/(1 − p)^n = (1 − p)^m = P(X ≥ m). (4.3.4)
The interpretation is that if X is the lifetime of a light bulb (in years, say), then the residual
lifetime X − n of that light bulb, if it is still alive after n years, has the same distribution
as that of a new light bulb. Thus, if light bulbs had a geometrically distributed lifetime, there would be no advantage in replacing a working old bulb with a new one.
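A quick Monte Carlo check of the memoryless property, using the convention P(X = n) = (1 − p)^n p of (4.3.3) (all names are ours):

import random

def geometric(p: float) -> int:
    """Number of failures before the first success: P(X = n) = (1 - p)**n * p."""
    n = 0
    while random.random() > p:
        n += 1
    return n

random.seed(0)
p, n0, m = 0.3, 2, 3
samples = [geometric(p) for _ in range(200_000)]
survivors = [x for x in samples if x >= n0]
lhs = sum(1 for x in survivors if x >= n0 + m) / len(survivors)  # P[X >= n+m | X >= n]
rhs = sum(1 for x in samples if x >= m) / len(samples)           # P(X >= m)
print(lhs, rhs, (1 - p) ** m)  # all three are close to 0.343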
The random variable X has a Poisson distribution with parameter λ > 0, and we write
X =D P(λ), if

P(X = n) = (λ^n / n!) e^{−λ}, for n ≥ 0. (4.3.5)
The random variable X is uniformly distributed in the interval [a, b] where a < b, and we write X =D U[a, b], if

fX(x) = 1/(b − a) if x ∈ [a, b], and fX(x) = 0 otherwise. (4.3.6)
The random variable X is exponentially distributed with rate λ > 0, and we write X =D Exd(λ), if

fX(x) = λe^{−λx} if x > 0, and fX(x) = 0 otherwise. (4.3.7)

4.4 Generating Random Variables
Methods to generate a random variable X with a given distribution from uniform random
variables are useful in simulations and provide a good insight into the meaning of the cdf and of the pdf.
The first method is to generate a random variable Z uniform in [0, 1] (using the random
number generator of your computer) and to define X(Z) = min{a|F (a) ≥ Z}. Then, for
any real number b, X ≤ b if Z ≤ F (b), which occurs with probability F (b), as desired.
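For a cdf with a closed-form inverse, this first method is one line. A sketch for the exponential distribution (names ours):

import math
import random

def exponential_by_inversion(lam: float) -> float:
    """X = min{a : F(a) >= Z} with F(a) = 1 - exp(-lam * a) and Z uniform
    on [0, 1], which gives X = -ln(1 - Z) / lam."""
    z = random.random()
    return -math.log(1.0 - z) / lam

random.seed(1)
xs = [exponential_by_inversion(2.0) for _ in range(100_000)]
print(sum(xs) / len(xs))  # ~0.5 = 1/lam, as expected for Exd(2)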
The second method uses the pdf. Assume that X is continuous with pdf f(x) and that P(a < X < b) = 1 and f(x) ≤ c. Pick a point (X, Y) uniformly in [a, b] × [0, c] (by generating two uniform random variables). If the point falls under the curve f(·), i.e., if Y ≤ f(X), then keep the value X; otherwise, repeat. To see why this works, let B be the event that the point is accepted. Then, for small ε > 0, P(a < X < a + ε) = P[A|B] where A := {(x, y) | a < x < a + ε, y < f(x)} ≈ [a, a + ε] × [0, f(a)]. Then, P[A|B] = P(A)/P(B) with P(A) ≈ f(a)ε/[(b − a)c] and P(B) = 1/[(b − a)c]. Hence P[A|B] ≈ f(a)ε, as desired. (The factor 1/[(b − a)c] normalizes our uniform distribution on [a, b] × [0, c], and the 1 in the numerator of P(B) comes from the fact that the pdf f(·) integrates to 1.)
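Here is a sketch of this second method for the density f(x) = 6x(1 − x) on [0, 1], whose maximum value is 1.5 (names ours):

import random

def rejection_sample(f, a: float, b: float, c: float) -> float:
    """Draw (X, Y) uniformly in [a, b] x [0, c]; keep X when Y <= f(X)."""
    while True:
        x = random.uniform(a, b)
        y = random.uniform(0.0, c)
        if y <= f(x):
            return x

random.seed(2)
f = lambda x: 6.0 * x * (1.0 - x)  # a pdf on [0, 1] with peak value 1.5 at x = 1/2
xs = [rejection_sample(f, 0.0, 1.0, 1.5) for _ in range(50_000)]
print(sum(xs) / len(xs))  # ~0.5, the mean of this density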
4.5 Expectation
Imagine that you play a game of chance a large number K of times. Each time you play, you win the amount xn with probability pn, for n = 1, 2, . . .. If our interpretation of probability is correct, you expect to win xn approximately Kpn times out of K. Accordingly, your total earnings should be approximately equal to Σn K xn pn. Thus, your earnings should average Σn xn pn per instance of the game. That value, the average earnings per experiment, is the interpretation of the expected value of the random variable X that represents your earnings. One defines

E(X) := Σn xn pn.   (4.5.1)

There are some potential problems. The sums could yield ∞ − ∞. In that case, we say that the expectation is not defined.

4.6 Function of Random Variable
Let X be a random variable and h : ℜ → ℜ be a function. Since X is some function from Ω to ℜ, so is h(X). Is h(X) a random variable? Well, there is that measurability question.
We must check that (h(X))^{−1}((−∞, a]) ∈ F for every a ∈ ℜ. That property holds if h(·) is Borel-measurable, i.e., if

h^{−1}(B) ∈ B, ∀B ∈ B.

All the functions from ℜ to ℜ we will encounter are Borel-measurable.
Using Definition 4.1.1 we see that if h is Borel-measurable, then, for all B ∈ B, one
has A = h−1 (B) ∈ B, so that (h(X))−1 (B) = X −1 (A) ∈ F, which proves that h(X) is a
random variable.
In some cases, if X has a pdf, Y = h(X) also has one. For instance, if X has pdf fX(·) and Y = aX + b with a > 0, then

P(y < Y < y + dy) = P((y − b)/a < X < (y − b)/a + dy/a) = fX((y − b)/a)dy/a,
so that the pdf of Y , say fY (y), is fY (y) = fX ((y − b)/a)/a. We highlight that useful result
below:
If Y = aX + b, then fY(y) = (1/|a|) fX((y − b)/a). (4.6.1)
By adapting the above argument, you can check that if h(·) is differentiable and one-to-one, then

fY(y) = fX(x)/|h′(x)|, evaluated at x such that h(x) = y. (4.6.2)
For instance, if X =D U[0, 1] and Y = (3 + X)², then

fY(y) = (1/(2(3 + x))) 1{x ∈ [0, 1]} at (3 + x)² = y, i.e., fY(y) = (1/(2√y)) 1{y ∈ [9, 16]}.
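A Monte Carlo check of this density, under the reading Y ∈ [9, 16] above (names ours):

import random

random.seed(3)
n = 200_000
ys = [(3.0 + random.random()) ** 2 for _ in range(n)]  # Y = (3 + X)^2, X ~ U[0, 1]

# Empirical P(11 < Y <= 13) versus the integral of 1/(2*sqrt(y)) over [11, 13].
emp = sum(1 for y in ys if 11.0 < y <= 13.0) / n
exact = 13.0 ** 0.5 - 11.0 ** 0.5  # sqrt(13) - sqrt(11)
print(emp, exact)  # both ~0.289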
Assume now that X is a discrete random variable. One finds that

E(h(X)) = Σn h(xn) pn, where pn = P(X = xn).
If you think about it, it looks a bit like magic. However, it is not hard to understand what is going on. This expression is useful because you do not have to calculate the probability distribution of Y = h(X). Similarly, if X is a continuous random variable with pdf fX(·), then

E(h(X)) = ∫ h(x) fX(x) dx.

Again, you do not have to calculate the pdf of Y. For instance, with X =D U[0, 1] and Y = (3 + X)² as above, E(Y) = ∫_0^1 (3 + x)² dx = 37/3. If we do the change of variables y = (3 + x)², so that dy = 2(3 + x)dx = 2√y dx, then we find ∫_9^{16} y (1/(2√y)) dy = 37/3, the same value computed with the pdf of Y.

4.7 Moments of Random Variable
The nth moment of X is defined as E(X^n). The variance of X is

var(X) := E((X − E(X))²) = E(X²) − (E(X))².
The variance measures the “spread” of the distribution around the mean. A random
variable with a zero variance is constant. The larger the variance, the more “uncertain”
the random variable is, in the mean square sense. Note that I say “uncertain” and not
“variable” since you know by now that a random variable does not vary.
4.8 Inequalities
Inequalities are often useful to estimate some expected values. Here are a few particularly
useful ones.
Exponential bound:
1 + x ≤ exp{x}.
Chebychev:

P(|X − E(X)| ≥ a) ≤ var(X)/a², for all a > 0.
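A quick empirical look at Chebychev's inequality for X =D U[0, 1], where E(X) = 1/2 and var(X) = 1/12 (a sketch, names ours):

import random

random.seed(4)
n, a = 200_000, 0.4
xs = [random.random() for _ in range(n)]

freq = sum(1 for x in xs if abs(x - 0.5) >= a) / n
bound = (1.0 / 12.0) / a ** 2
print(freq, bound)  # ~0.2 <= ~0.52: the bound holds, though loosely here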
4.9 Summary
A random variable is a function X : Ω → ℜ such that X^{−1}((−∞, x]) ∈ F for all x ∈ ℜ. We
We also defined the pmf and pdf that summarize the distribution of the random variable.
We introduced the moments and the variance and we stated a few useful inequalities.
You should become familiar with the distributions we introduced. We put a table that summarizes them at the end of the notes.

4.10 Solved Problems

Example 4.10.1. Let X be a continuous random variable with pdf fX(x) = cx(1 − x) for x ∈ [0, 1] and fX(x) = 0 otherwise.
a. Find c;
b. Find P(1/2 < X ≤ 3/4);
c. Find the cdf FX(·) of X;
d. Find E(X) and var(X).
a. We need ∫_{−∞}^{∞} fX(x)dx = 1. Now,

∫_{−∞}^{∞} fX(x)dx = ∫_0^1 cx(1 − x)dx = c [x²/2 − x³/3]_0^1 = c(1/2 − 1/3) = c/6.

Thus c = 6.
b. We find

P(1/2 < X ≤ 3/4) = ∫_{1/2}^{3/4} 6x(1 − x)dx = 6 [x²/2 − x³/3]_{1/2}^{3/4} = 6(9/32 − 9/64 − 1/8 + 1/24) = 11/32.
c. The cdf is FX(x) = P(X ≤ x). For x < 0, FX(x) = 0. For 0 ≤ x < 1, FX(x) = ∫_0^x 6y(1 − y)dy = 3x² − 2x³. For x ≥ 1, FX(x) = 1. Hence the cdf is

FX(x) = 0 if x < 0; 3x² − 2x³ if 0 ≤ x < 1; 1 if x ≥ 1.
d. We find

E(X) = ∫_{−∞}^{∞} x fX(x)dx = ∫_0^1 6x²(1 − x)dx = [2x³]_0^1 − [(3/2)x⁴]_0^1 = 2 − 3/2 = 0.5.

Also,

E(X²) = ∫_0^1 6x³(1 − x)dx = 6(1/4 − 1/5) = 3/10.

Hence,

var(X) = E(X²) − (E(X))² = 3/10 − (1/2)² = 1/20.
Example 4.10.2. Give an example of a probability space {Ω, F, P} and of a function X : Ω → ℜ that is not a random variable on that space.

Let Ω = {0, 1, 2}, F = {∅, {0}, {1, 2}, Ω}, P({0}) = 1/2 = P({1, 2}), and X(ω) = ω for ω ∈ Ω. Then X is not a random variable on {Ω, F, P} because X^{−1}((−∞, 1]) = {0, 1} ∉ F.

The meaning of all this is that the probability space is not rich enough to specify P(X ≤ 1).
Example 4.10.3. Define the random variable X as follows. You throw a dart uniformly
in a circle with radius 5. The random variable X is equal to 2 minus the distance between
the dart and the center of the circle if this distance is less than or equal to one. Otherwise,
X is equal to 0.
a. Explain why X is a random variable from Ω into (−∞, +∞).
b. Give the mathematical expression for the probability density function f(x) of X for x ∈ (−∞, +∞).

a. The events {X ≤ x} correspond to disks or rings of the dartboard, whose probabilities are defined; thus X is a random variable.

b. Let Y be the distance between the dart and the center of the circle. Then P(Y ≤ y) = y²/25 for 0 ≤ y ≤ 5, so that Y has pdf fY(y) = 2y/25 on [0, 5]. Now, X = 2 − Y if Y ≤ 1. Also, X = 0 if Y > 1, which occurs with probability (25 − 1)/25 = 24/25. These observations show that

f(x) = (24/25)δ(x) + ((4 − 2x)/25) 1{1 < x < 2}.
Example 4.10.4. Express the cdf of the following random variables in terms of FX(·):
a. X⁺ := max{0, X};
b. −X;
c. X⁻ := max{0, −X};
d. |X|.

a. One has

P(X⁺ ≤ x) = 0 if x < 0, and P(X⁺ ≤ x) = P(X ≤ x) = FX(x) if x ≥ 0.

b. P(−X ≤ x) = P(X ≥ −x) = 1 − P(X < −x) = 1 − FX((−x)−), where FX(y−) denotes the left limit of FX at y.

c. Note that X⁻ = (−X)⁺, so its cdf follows from parts a and b above. Hence,

P(X⁻ ≤ x) = 0 if x < 0, and P(X⁻ ≤ x) = 1 − FX((−x)−) if x ≥ 0.

d. Since {|X| ≤ x} = {X ≤ x} \ {X < −x} for x ≥ 0, we find

P(|X| ≤ x) = 0 if x < 0, and P(|X| ≤ x) = FX(x) − FX((−x)−) if x ≥ 0.
Example 4.10.5. A dart is flung at a circular dartboard of radius 3. Suppose that the probability that the dart lands in some region A of the dartboard is proportional to the area |A| of A. Find the cdf and the expected value of the score X in the following cases.
a. The board is divided by concentric circles of radii 1, 2, and 3 into three rings, and the score is X = i if the dart lands in the ith ring, counting inward from the outer ring, for i = 1, 2, 3.
b. X = 3 − Z where Z is the distance between the dart and the center of the board.
c. Assume now that the player has some probability 0.3 of missing the target altogether. If he does not miss, he hits an area A with a probability proportional to |A|. The score X is as in part b, with X = 0 if the player misses the board.

a.i. The areas of the three rings are 5π, 3π, and π, so P(X = 1) = 5/9, P(X = 2) = 3/9, and P(X = 3) = 1/9. Hence,

FX(x) = 0 if x < 1; 5/9 if 1 ≤ x < 2; 8/9 if 2 ≤ x < 3; 1 if x ≥ 3.

a.ii. Accordingly,

E(X) = 1 × (5/9) + 2 × (1/3) + 3 × (1/9) = 14/9.

b.i. One has P(Z ≤ z) = z²/9 for 0 ≤ z ≤ 3, so that

FX(x) = P(3 − Z ≤ x) = P(Z ≥ 3 − x) = 1 − (3 − x)²/9, for 0 ≤ x ≤ 3.

b.ii. The pdf is fX(x) = 2(3 − x)/9 on [0, 3], so E(X) = ∫_0^3 x (2(3 − x)/9) dx = 1.

c.i. Let Y be the score given that the player does not miss the target. Then Y has the cdf that we derived in part b. The score X of the player who misses the target with probability 0.3 is equal to 0 with probability 0.3 and to Y with probability 0.7. Hence, FX(x) = 0.3 × 1{x ≥ 0} + 0.7 FY(x). That is,

FX(x) = 0 if x < 0; 1 − 0.7(3 − x)²/9 if 0 ≤ x < 3; 1 if x ≥ 3.

c.ii. From the definition of X in terms of Y we see that E(X) = 0.3 × 0 + 0.7 × E(Y) = 0.7 × 1 = 0.7.
Example 4.10.6. Suppose you put m balls randomly in n boxes. Each box can hold an arbitrarily large number of balls. What is the expected number of empty boxes?

Designate by p the probability that the first box is empty. Let Xk be equal to 1 when box k is empty and to zero otherwise, for k = 1, . . . , n. The number of empty boxes is X = X1 + · · · + Xn, so that, by symmetry, E(X) = nE(X1) = np. Now,

p = ((n − 1)/n)^m.

Indeed, p is the probability of the intersection of the independent events Ak for k = 1, . . . , m, where Ak is the event that the kth ball misses the first box and P(Ak) = (n − 1)/n. Hence E(X) = n((n − 1)/n)^m.
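A simulation sketch confirming E(X) = n((n − 1)/n)^m (names ours):

import random

def empty_boxes(m: int, n: int) -> int:
    """Throw m balls uniformly at random into n boxes; count the empty boxes."""
    occupied = set(random.randrange(n) for _ in range(m))
    return n - len(occupied)

random.seed(5)
m, n, trials = 10, 5, 50_000
avg = sum(empty_boxes(m, n) for _ in range(trials)) / trials
print(avg, n * ((n - 1) / n) ** m)  # both ~0.54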
Example 4.10.7. A cereal company is running a promotion for which it is giving a toy in
every box of cereal. There are n different toys and each box is equally likely to contain any
one of the n toys. What is the expected number of boxes of cereal you have to purchase to collect all n toys?
Assume that you have just collected m distinct toys, for some m = 0, . . . , n − 1. Designate by Xm the random number of boxes you have to purchase until you collect another different toy. With probability (n − m)/n, the next box contains a new toy. Otherwise, with probability m/n, it contains a toy you already have and you are back in the same situation; in that case, write Xm = 1 + Ym, where Ym is distributed like Xm and designates the additional number of boxes that you purchase until you get another different toy. Hence,

E(Xm) = ((n − m)/n) × 1 + (m/n) × E(1 + Ym) = 1 + (m/n)E(Ym) = 1 + (m/n)E(Xm).

Solving, we find E(Xm) = n/(n − m). Finally, the expected number of boxes we have to purchase is

E(X0 + X1 + · · · + X_{n−1}) = Σ_{m=0}^{n−1} E(Xm) = Σ_{m=0}^{n−1} n/(n − m) = n(1 + 1/2 + 1/3 + · · · + 1/n).
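A simulation sketch of this coupon collector's result (names ours):

import random

def boxes_until_complete(n: int) -> int:
    """Buy cereal boxes until all n distinct toys have been collected."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        boxes += 1
    return boxes

random.seed(6)
n, trials = 10, 20_000
avg = sum(boxes_until_complete(n) for _ in range(trials)) / trials
print(avg, n * sum(1.0 / k for k in range(1, n + 1)))  # both ~29.3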
Example 4.10.8. You pick a point P with a uniform distribution in [0, 1]². Let Θ denote the angle made between the x-axis and the line segment that joins (0, 0) to the point P. Find E(Θ).

Since P is chosen uniformly on the square, the probability that P lies within some region of the square is the area of that region. In particular, for 0 ≤ θ ≤ π/4, P(Θ ≤ θ) = tan(θ)/2, the area of the triangle under the segment at angle θ. Carrying out the integration for E(Θ) (using this cdf and the symmetric expression for θ ∈ [π/4, π/2]), the logarithmic terms cancel:

E(Θ) = (1/2)[ln(√2) + π/4 − (ln(√2) − π/4)] = π/4.

This also follows by symmetry: Θ and π/2 − Θ have the same distribution.
Example 4.10.9. Consider the random variable X whose cdf FX(·) is shown in Figure 4.1(a).
a. Show that FX(·) is a valid cdf;
e. Find fX(x);
g. Calculate E(X).

a. Figure 4.1(a) shows the cdf FX(x). To show that FX(x) is indeed a cdf we must verify the properties (4.2.1). The figure shows that FX(·) satisfies these properties.
Figure 4.1: (a) The cdf FX(x); (b) the pdf fX(x).
g. According to (4.5.1),

E[X] = 0 × 0.3 + 2 × 0.4 + 3 × 0.1 + ∫_2^3 x × 0.2 dx = 0.8 + 0.3 + 0.2 × [x²/2]_2^3 = 1.1 + 0.2 × 2.5 = 1.6.
Example 4.10.10. Let X and Y be independent random variables with common cdf F(·) and pdf f(·).
a. Show that V = max{X, Y} has distribution function FV(v) = F(v)² and density fV(v) = 2F(v)f(v).
b. Show that U = min{X, Y} has distribution function FU(u) = 1 − (1 − F(u))² and density fU(u) = 2f(u)(1 − F(u)).
Now let X and Y be independent random variables each having the uniform distribution on [0, 1].
c. Find E(U).
d. Find cov(U, V).
Finally, let X and Y be independent and exponentially distributed with rate 1.
e. Identify the distribution of U.
f. Find var(V).

a. We find

FV(v) = P(V ≤ v) = P(max{X, Y} ≤ v) = P(X ≤ v)P(Y ≤ v) = F(v)².

Differentiate the cdf to get the pdf by using the chain rule: fV(v) = (d/dv)FV(v) = 2F(v)f(v).

b. Similarly, FU(u) = P(U ≤ u) = 1 − P(X > u, Y > u) = 1 − (1 − F(u))². Differentiate the cdf to get the pdf by using the chain rule: fU(u) = (d/du)FU(u) = 2f(u)(1 − F(u)).

c. For the uniform distribution, F(u) = u on [0, 1], so fU(u) = 2(1 − u) and E(U) = ∫_0^1 2u(1 − u)du = 1/3. Similarly, fV(v) = 2v and E(V) = 2/3.

d. Since UV = XY,

cov[U, V] = E[(U − E[U])(V − E[V])] = E[UV] − E[U]E[V] = E[XY] − E[U]E[V]
= ∫_0^1 x dx × ∫_0^1 y dy − (1/3)(2/3) = 1/4 − 2/9 = 1/36.

e. With F(u) = 1 − e^{−u} and f(u) = e^{−u}, one finds fU(u) = 2f(u)(1 − F(u)) = 2e^{−2u}. Thus U is an exponential random variable with mean 1/2.

f. Here fV(v) = 2(e^{−v} − e^{−2v}) for v ≥ 0, so, integrating by parts,

E[V²] = ∫_0^∞ 2v²(e^{−v} − e^{−2v})dv = 2 × 2 − 2 × (1/4) = 7/2.

Also, E[V] = E[X] + E[Y] − E[U] = 1 + 1 − 1/2 = 3/2. Hence

var[V] = E[V²] − E[V]² = 7/2 − (3/2)² = 5/4.
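A simulation sketch of parts c and d for uniform X and Y (names ours):

import random

random.seed(7)
n = 200_000
pairs = [(random.random(), random.random()) for _ in range(n)]
us = [min(x, y) for x, y in pairs]  # U = min{X, Y}
vs = [max(x, y) for x, y in pairs]  # V = max{X, Y}

eu = sum(us) / n
ev = sum(vs) / n
cov_uv = sum(u * v for u, v in zip(us, vs)) / n - eu * ev
print(eu, ev, cov_uv)  # ~1/3, ~2/3, ~1/36 = 0.0278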
Example 4.10.11. Choose X in [0, 1] as follows. With probability 0.2, X = 0.3; with
probability 0.3, X = 0.7; otherwise, X is uniformly distributed in [0.2, 0.5] ∪ [0.6, 0.8]. (a).
Plot the c.d.f. of X; (b) Find E(X); (c) Find var(X); (d) Calculate P [X ≤ 0.3 | X ≤ 0.7].
a. Figure 4.2 shows the p.d.f. and the c.d.f. of X. Note that the value of the density is 1 on [0.2, 0.5] ∪ [0.6, 0.8], since the remaining probability 1 − 0.2 − 0.3 = 0.5 is spread uniformly over a set of total length 0.5.

Figure 4.2: The pdf and the cdf of X.

b. Consequently,

E(X) = 0.2 × 0.3 + 0.3 × 0.7 + ∫_{0.2}^{0.5} x dx + ∫_{0.6}^{0.8} x dx = 0.06 + 0.21 + 0.105 + 0.14 = 0.515.

c. Similarly, E(X²) = 0.2 × 0.09 + 0.3 × 0.49 + ∫_{0.2}^{0.5} x² dx + ∫_{0.6}^{0.8} x² dx ≈ 0.3027, so that var(X) = E(X²) − (E(X))² ≈ 0.3027 − 0.2652 = 0.0375.
d. Finally,

P[X ≤ 0.3 | X ≤ 0.7] = P(X ≤ 0.3)/P(X ≤ 0.7) = 0.3/0.9 = 1/3.
Example 4.10.12. Let X be uniformly distributed in [0, 10]. Find the cdf of the following random variables:
a. Y := max{2, min{4, X}};
b. Z := 2 + X²;
c. V := |X − 4|;
d. W := sin(2πX).

a. FY(y) = P(Y ≤ y) = P(max{2, min{4, X}} ≤ y). Note that Y ∈ [2, 4], so that FY(y) = 0 for y < 2 and FY(y) = 1 for y ≥ 4.
Let y ∈ [2, 4). We see that Y ≤ y if and only if X ≤ y, which occurs with probability y/10.

Figure: The cdf FY(y) and the pdf fY(y).

Hence,

FY(y) = 0 if y < 2; y/10 if 2 ≤ y < 4; 1 if y ≥ 4.
Accordingly,

fY(y) = 0.2 δ(y − 2) + (1/10) 1{2 < y < 4} + 0.6 δ(y − 4).

b. FZ(z) = P(Z ≤ z) = P(2 + X² ≤ z) = P(X ≤ √(z − 2)). Consequently,

FZ(z) = 0 if z < 2; √(z − 2)/10 if 2 ≤ z < 102; 1 if z ≥ 102.

Also,

fZ(z) = 1/(20√(z − 2)) if 2 < z < 102, and fZ(z) = 0 otherwise.
c. FV(v) = P(V ≤ v) = P(|X − 4| ≤ v) = P(4 − v ≤ X ≤ 4 + v). Hence,

FV(v) = 0 if v < 0; 0.2v if 0 ≤ v < 4; 0.1v + 0.4 if 4 ≤ v < 6; 1 if v ≥ 6.

Also,

fV(v) = 0.2 if 0 < v ≤ 4; 0.1 if 4 < v ≤ 6; and fV(v) = 0 otherwise.
d. Note that W ∈ [−1, 1], so FW(−1−) = 0 and FW(1) = 1. The interesting case is w ∈ (−1, 1). A picture of the sine curve shows that

FW(w) = 0 if w < −1; 0.5 + (1/π) sin^{−1}(w) if −1 ≤ w < 1; 1 if w ≥ 1.
Example 4.10.13. Assume that a dart flung at a target hits a point ω uniformly distributed in [0, 1]². The random variables X(ω), Y(ω), Z(ω) are defined as follows. X(ω) is the minimum distance between ω and the sides of the square. Y(ω) is the maximum distance between ω and the sides of the square. Z(ω) is the distance between ω and a fixed vertex of the square. Find the cdf and pdf of each of these random variables.
Figure: The events A, B, C1, and C2 in the unit square.
A = {ω | X ≥ x};
B = {ω | Y ≤ x};
C1 = {ω | Z ≤ x1 } when x1 ≤ 1;
C2 = {ω | Z ≤ x2 } when x2 > 1.
Note the difference in labels on the axes for the events A and B. For C1 and C2 , the
Accordingly,
0, if x < 0
fX (x) = 4(1 − 2x), if 0 ≤ x ≤ 0.5
0, if x ≥ 0.5.
b. Similarly,
62 CHAPTER 4. RANDOM VARIABLE
0, if x < 0.5
FY (x) = (2x − 1)2 , if 0.5 ≤ x ≤ 1
1, if x ≥ 1.5.
Accordingly,
0, if x < 0
fY (x) = 4(2x − 1), if 0.5 ≤ x ≤ 1
0, if x > 1.
c. The area of C1 is πx21 /4. That of C2 consists of a rectangle [0, v] × [0, 1] plus the
p
integral over uin[v, 1] of x22 − u2 . One finds
1
2 πz 0≤z<1
1
√
fZ (z) = 2 πz − 2zcos−1 ( z1 ) 1 ≤ z < 2
√
0 z≥ 2
Example 4.10.14. A circle of unit radius is thrown on an infinite sheet of graph paper
that is grid-ruled with a square grid with squares of unit side. Assume that the center of the
circle is uniformly distributed in the square in which it falls. Find the expected number of
There is a very difficult way to solve the problem and a very easy way. The difficult
way is as follows. Let X be the number of vertex points that fall in the circle. We find
P (X = k) for k = 1, 2, . . . and we compute the expectation. This is very hard because the
sets of possible locations of the center of the circle for these various events are complicated
intersections of circles.
The easy way is as follows. We consider the four vertices of the square in which the
center of the circle lies. For each of these vertices, there is some probability p that it is in
(2,1)
(0,0) (0,1)
The key observation here is that the average value of a sum of random variables is the sum
of their average values, even when these random variables are not independent.
It remains to calculate p. To do that, note that the set of possible locations of the center
of the circle in a given square such that one vertex is in the circle is a quarter-circle with
Example 4.10.15. Ten numbers are selected from {1, 2, 3, . . . , 30} uniformly and without
replacement. Find the expected value of the sum of the selected numbers.
Let X1 , . . . , X10 be the ten numbers you pick in {1, 2, . . . , 30} uniformly and without
replacement. Then E(X1 + · · · + X10 ) = E(X1 ) + · · · + E(X10 ). Consider any Xk for some
trick is to avoid looking at the joint distribution of the Xi , as in the previous example.
Find the value of a that minimizes the average value of the square distance between the point
Let the random variable Z be the squared distance between (X, 0) and (a, 1). That is,
Z = (X − a)2 + (0 − 1)2 .
d 2
(a − 2aE[X] − E[X 2 ] + a).
da
We find that the value of a for which this expression is equal to zero is a = E(X).
d2
The value of da2
(a2 − 2aE[X] − E[X 2 ] + a) for a = E(X) is equal to 2. Since this is
The idea is that X is a lifetime of an item whose residual lifetime X gets longer as it
gets older. An example would be an item whose lifetime is either Exd(1) or Exd(2), each
with probability 0.5, say. As the item gets older, it becomes more likely that its lifetime
is Exd(1) (i.e., with mean 1) instead of Exd(2) (with mean 1/2). Let’s do the math to
Hence,
Z ∞
P (X > a) = fX (x)dx = 0.5e−a + 0.5e−2a , a ≥ 0,
a
so that
0.5e−(a+b) + 0.5e−2(a+b)
P [X > a + b | X > a] = .
0.5e−a + 0.5e−2a
4.10. SOLVED PROBLEMS 65
Here, we can choose a pdf that decays faster than exponentially. Say that the lifetime
has a density
fX (x) = A exp{−x2 }
R∞
where A is such that 0 fX (x)dx = 1. The property we are trying to verify is equivalent
to
Z ∞ Z ∞ Z ∞
fX (x)dx < fX (x)dx fX (x)dx,
a+b a b
or
Z ∞ Z ∞ Z ∞ Z ∞
fX (x)dx fX (x)dx < fX (x)dx fX (x)dx.
0 a+b a b
That is,
where
Z Z
2 +y 2 )
φ(D) = e−(x dxdy
D
for a set D ⊂ <2 and A = [0, b]×[a+b, ∞), B = [b, ∞)×[a+b, ∞), and C = [b, ∞)×[a, a+b].
To show φ(A) < φ(C), we note that each point (x, a + b + y) in A corresponds to a point
(b + y, a + x) in C and
2 +(a+b+y)2 ) 2 +(a+x)2 )
e−(x < e−((b+y) ,
by convexity of g(z) = z 2 .
66 CHAPTER 4. RANDOM VARIABLE
Example 4.10.19. Suppose that the number of telephone calls made in a day is a Poisson
a. What is the probability that more than 1142 calls are made in a day?
Let N denote the number of telephone calls made in a day. N is Poisson with mean 1000
e−1000 1000n
so the pmf is// P (N = n) = n! and V ar[N ] = 1000.
P∞ 1000n
a. P (N > 1142) = e−1000 n=1143 n! .
E[N ] 1000
b. P (N > 1142) = P (N ≥ 1143) ≤ 1143 = 1143 .
V ar[N ] 1000
c. P (N > 1142) = P (N ≥ 1143) ≤ P (|N − E[N ]| ≥ 143) ≤ 1432
= 20449 .
Chapter 5
Random Variables
A collection of random variables is a collection of functions of the outcome of the same ran-
dom experiment. We explain how one characterizes the statistics of these random variables.
We have looked at one random variable. The idea somehow was that we made one
numerical observation of one random experiment. Here we extend the idea to multiple
numerical observations about the same random experiment. Since there is one random
are all functions of the same ω. That is, one models these observations as X(ω), Y (ω), and
Z(ω).
As you may expect, these values are related in some way. Thus, observing X(ω) provides
some information about Y (ω). In fact, one of the interesting questions is how one can use
the information that some observations contain about some other random variables that we
5.1 Examples
We pick a ball randomly from a bag and we note its weight X and its diameter Y .
67
68 CHAPTER 5. RANDOM VARIABLES
We track the evolution over time of the value of Cisco shares and we want to forecast
future values.
A transmitter sends some signal and the receiver observes the signal it receives and tries
The joint distribution of {X(ω), Y (ω)} is specified by the joint cumulative distribution
for a nonnegative function fX,Y (x, y) that is called the joint pdf (jpdf) of the random
variables.
This joint distribution contains more information than the two individual distributions.
For instance, let {X(ω), Y (ω)} be the coordinates of a point chosen uniformly in [0, 1]2 .
Define also Z(ω) = X(ω). Observe that the individual distributions of the each of the
random variables in the pairs {X(ω), Y (ω)} and {X(ω), Z(ω)} are the same. The tight
The random variables are positively (resp. negatively, un-) correlated if cov(X, Y ) > 0
(resp. < 0, = 0). The covariance is a measure of dependence. The idea is that if E(XY ) is
larger than E(X)E(Y ), then X and Y tend to be large or small together more than if they
were independent. In our example above, E(XZ) = E(X 2 ) = 1/3 > E(X)E(Z) = 1/4.
Figures 5.1 illustrates the meaning of correlation. Each of the figures shows the possible
values of a pair (X, Y ) of random variables; all the values are equally likely. In the left-most
figure, X and Y tend to be large or small together. These random variables are positively
correlated. Indeed, the product XY is larger on average than it would be if a larger value of
X did not imply a larger than average value of Y . The other two figures can be understood
similarly.
If h : <2 → < is nice (Borel-measurable - once again, all the functions from <2 to <
that we encounter have that property), then h(X, Y ) is a random variable. One can show,
It is sometimes convenient to use vector notation. To do that, one defines the expected
value of a random vector to be the vector of expected values. Similarly, the expected value
of a matrix is the matrix of expected values. Let X be a column vector whose n elements
X1 , . . . , Xn are random variables. That is, X = (X1 , . . . , Xn )T where (·)T indicates the
W whose entry (i, j) is the random variable Wi,j for i = 1, . . . , m and j = 1, . . . , n, we define
W ) to be the matrix whose entry (i, j) is E(Wi,j ). Recall that if A and B are matrices of
E(W
X , Y ) := E((X
ΣX,Y := cov(X X − E(X
X ))(Y Y ))T ) = E(XY
Y − E(Y XY T ) − E(X Y T)
X )E(Y
and
X − E(X
ΣX := E((X X ))(X X ))T ) = E(X
X − E(X X X T ) − E(X X T ).
X )E(X
AX
cov(AX
AX, BY ) = A cov(X BT .
X , Y )B (5.2.1)
Similarly,
X T Y ) = E(tr(X
E(X X Y T )) = trE(X
X Y T ). (5.2.2)
5.3 Independence
for all subsets A and B of the real line (... Borel sets, to be precise).
if the probability that any finite subcollection of them belongs to any given subsets is the
We have seen examples before: flipping coins, tossing dice, picking (X, Y ) uniformly in
Theorem 5.3.1. a. The random variables X, Y are independent if and only if the joint
cdf FX,Y (x, y) is equal to FX (x)FY (y), for all x, y. A collection of random variables are
mutually independent if the jcdf of any finite subcollection is the product of the cdf.
b. If the random variables X, Y have a joint pdf fX,Y (x, y), they are independent if and
only if fX,Y (x, y) = fX (x)fY (y), for all x, y. A collection of random variables with a jpdf
are mutually independent if the jpdf of any finite subcollection is the product of the pdf.
d. If X and Y are independent, then E(XY ) = E(X)E(Y ). [The converse is not true!]
f. The variance of the sum of pairwise independent random variables is the sum of their
variances.
The expression to the right of the identity is called the convolution of fX and fY . Hence,
the pdf of the sum of two independent random variables is the convolution of their pdf.
Proof:
72 CHAPTER 5. RANDOM VARIABLES
We provide sketches of the proof of these important results. The derivation should help
Conversely, assume that the identity above holds. It is easy to see that
P (X ∈ (a, b] and Y ∈ (c, d]) = FX,Y (b, d) − FX,Y (a, d) − FX,Y (b, c) + FX,Y (a, c).
Using FX,Y (x, y) = FX (x)FY (y) in this expression, we find after some simple algebra that
Since the probability is countably additive, the expression above implies that
P (X ∈ A and Y ∈ B) = P (X ∈ A)P (Y ∈ B)
for a collection of sets A and B that is closed under countable operations and that contains
the intervals. Consequently, the identity above holds for all A, B ∈ B where B is the Borel
The same argument proves the corresponding result for mutual independence of random
variables.
so that
c. Assume X and Y are independent. Note that g(X) ∈ A if and only if X ∈ g −1 (A),
which shows that g(X) and h(Y ) are independent. The derivation of the mutual indepen-
d. Assume that X and Y are independent and that they are continuous. Then
The same derivation holds in the discrete case. The hybrid case is similar.
Note that the converse is not true. For instance, assume that (X, Y ) is equally likely to
take the four values {(−1, 0), (0, −1), (1, 0), (0, 1)}. We find that E(XY ) = 0 = E(X)E(Y ).
0) = 1/2.
74 CHAPTER 5. RANDOM VARIABLES
e. We can prove this result by induction by noticing that if {Xn , n ≥ 1} are mutually
In this calculation, we used the fact that E(Xi Xj ) = E(Xi )E(Xj ) for i 6= j because the
g. Note that
Z ∞
P (X + Y ≤ x) = P (X ≤ x − u and Y ∈ (u, u + du))
Z−∞
∞
= P (X ≤ x − u)fY (u)du.
−∞
5.4 Summary
We explained that multiple random variables are defined on the same probability space.
X )). In particular,
We discussed the joint distribution. We showed how to calculate E(h(X
we defined the variance, covariance, k-th moment. The vector notation has few secrets for
you.
5.5. SOLVED PROBLEMS 75
You also know the definition (and meaning) of independence and mutual independence
and you know that the mean value of a product of independent random variables is the
product of their mean values. You can also prove that functions of independent random
variables are independent. We also showed that the variance of the sum of pairwise inde-
This follows from the definitions by computing the pmf of the sum.
Example 5.5.2. Let X1 and X2 be independent and such that Xi is Exd(λi ) for i = 1, 2.
Calculate
We find
where
and
Hence,
λ1
P [X1 ≤ X2 |X1 ∧ X2 = x] = .
λ1 + λ2
Example 5.5.3. Let {Xn , n ≥ 1} be i.i.d. U [0, 1]. Calculate var(X1 + 2X2 + X32 ).
76 CHAPTER 5. RANDOM VARIABLES
Consequently,
1 4 5 1 1
var(X1 + 2X2 + X32 ) = + + E(X38 ) − (E(X34 ))2 = + − ( )2 ≈ 0.49.
12 12 12 9 5
Example 5.5.4. Let X, Y be i.i.d. U [0, 1]. Compute and plot the pdf of X + Y .
We use (5.3.2):
Z ∞
fX+Y (x) = fX (u)fY (x − u)du.
−∞
For a give value of x, the convolution is the integral of the product of fY (u) and fX (x − u).
The latter function is obtained by flipping fX (u) around the vertical axis and dragging it
Example 5.5.5. Let X = (X1 , X2 )T be a vector of two i.i.d. U [0, 1] random variables. Let
that j.p.d.f. when it exists? If it does not exist, how do you characterize the distribution of
Y.
Assume that the two rows of A are proportional to each other. Then so are Y1 and Y2 .
fX + Y (x)
fY (u) fX + Y (x)
fX(x - u) .
1 1
u
x
0 1 x 0 1 2
it cannot have a density, for the integral in the plane of any function that is nonzero only
on a line is equal to zero, which violates the requirement that the density must integrate
by writing that Y2 = αY1 and Y1 is a linear combination of i.i.d. U [0, 1] random variables.
1
fY (yy ) = A−1y ),
fX (A
|A|
g(X1 , . . . , Xm ) and h(Xm+1 , . . . , Xn ) are independent random variables for any functions
and similarly,
Hence,
which proves the independence. (The next-to-last line follows from the mutual independence
of the Xi .)
Example 5.5.7. Let X, Y be two points picked independently and uniformly on the circum-
By symmetry we can assume that the point X has coordinates (1, 0). The point Y then
has coordinates (cos(θ), sin(θ)) where θ is uniformly distributed in [0, 2π]. Consequently,
g(θ).
We now use the basic results on the density of a function of a random variable. To
Accordingly,
g(θ) ∈ (z, z + δ)
if and only if
δ
θ ∈ ((θn , θn + )
g0(θn )
for some θn such that g(θn ) = z. It follows that, if Z = g(θ), then
X 1
fZ (z) = fθ (θn ).
n
|g0(θn )|
5.5. SOLVED PROBLEMS 79
In this expression, the sum is over all the θn such that g(θn ) = z.
Coming back to our example, g(θ) = z if 2(1 − cos(θ)) = z. In that case, |g0(θ)| =
p
2| sin(θ)| = 2 1 − (1 − z2 )2 . Note that there are two values of θ such that g(θ) = z whenever
1 1 1
fZ (z) = 2 × p z 2
× = q , for z ∈ (0, 4).
2 1 − (1 − 2 ) 2π 2π z − z2
4
Example 5.5.8. The two random vectors X and Y are selected independently and uniformly
by symmetry.
Now,
Also,
Z 1
1 x3 1
E(X12 ) = x2 dx = [ ]1−1 = .
−1 2 6 3
4
X − Y ||2 ) = 4E(X12 ) = .
E(||X
3
Example 5.5.9. Let {Xn , n ≥ 1} be i.i.d. B(p). Assume that g, h : <n → < have the
X ) and h(X
The intuition is that g(X X ) are large together and small together.
80 CHAPTER 5. RANDOM VARIABLES
g̃(x) = g(x) − g(0) and h̃(x) = h(x) − h(0), we see that it is equivalent to show that
cov(g̃(X1 ), h̃(X1 )) ≥ 0. In other words, we can assume without loss of generality that
which is seen to be satisfied since g(1) and h(1) are nonnegative and p ≤ 1.
Assume that the result is true for n. Let X = (X1 , . . . , Xn ) and V = Xn+1 . We must
show that
X , V )h(X
E(g(X X , V )) ≥ E(g(X
X , V )E(h(X
X , V )).
X , i)h(X
E(g(X X , i)) ≥ E(g(X
X , i)E(h(X
X , i)), for i = 0, 1.
X , 0)) = 0.
E(g(X
and
X , V )) ≤ E(h(X
E(h(X X , 1)),
so that
X , V ))E(h(X
E(g(X X , V )) = pE(g(X
X , 1))E(h(X
X , V )) ≤ pE(g(X
X , 1))E(h(X
X , 1))
X , 1)h(X
≤ pE(g(X X , 1)) ≤ pE(g(X
X , 1)h(X
X , 1)) + (1 − p)E(g(X
X , 0)h(X
X , 0))
X , V )h(X
= E(g(X X , V )),
Example 5.5.10. Let X be uniformly distributed in [0, 2π] and Y = sin(X). Calculate the
p.d.f. fY of Y .
X 1
fY (y) = fX (xn )
|g 0 (x n )|
For each y ∈ (−1, 1), there are two values of xn in [0, 2π] such that g(xn ) = sin(xn ) = y.
q p
|g 0 (xn )| = | cos(xn )| = 1 − sin2 (xn ) = 1 − y2,
and
1
fX (xn ) = .
2π
Hence,
1 1 1
fY (y) = 2 p = p .
1 − y 2π
2 π 1 − y2
Example 5.5.11. Let {X, Y } be independent random variables with X exponentially dis-
tributed with mean 1 and Y uniformly distributed in [0, 1]. Calculate E(max{X, Y }).
P (Z ≤ z) = P (X ≤ z, Y ≤ z) = P (X ≤ z)P (Y ≤ z)
z(1 − e−z ), for z ∈ [0, 1]
=
1 − e−z , for z ≥ 1.
Hence,
1 − e−z + ze−z , for z ∈ [0, 1]
fZ (z) =
e−z , for z ≥ 1.
82 CHAPTER 5. RANDOM VARIABLES
Accordingly,
Z ∞ Z 1 Z ∞
−z −z
E(Z) = zfZ (z)dz = z(1 − e + ze )dz + ze−z dz
0 0 1
Z 1 Z 1 Z 1
ze−z dz = − zde−z = −[ze−z ]10 + e−z dz
0 0 0
−1
= −e − [e−z ]10 = 1 − 2e −1
.
Z 1 Z 1 Z 1
z 2 e−z dz = − z 2 de−z = −[z 2 e−z ]10 + 2ze−z dz
0 0 0
−1 −1 −1
= −e + 2(1 − 2e ) = 2 − 5e .
Z ∞ Z 1
−z
ze dz = 1 − ze−z dz = 2e−1 .
1 0
1
E(Z) = − (1 − 2e−1 ) + (2 − 5e−1 ) + 2e−1 = 3 − 5e−1 ≈ 1.16.
2
Example 5.5.12. Let {Xn , n ≥ 1} be i.i.d. with E(Xn ) = µ and var(Xn ) = σ 2 . Use
X1 + · · · + Xn
α := P (| − µ| ≥ ²).
n
1 X1 + · · · + Xn 1 nvar(X1 ) σ2
α≤ 2
var( )= 2 2
= 2.
² n ² n n²
This calculation shows that the sample mean gets closer and closer to the mean: the
Example 5.5.13. Let X =D P (λ). You pick X white balls. You color the balls indepen-
dently, each red with probability p and blue with probability 1 − p. Let Y be the number
of red balls and Z the number of blue balls. Show that Y and Z are independent and that
We find
µ ¶
m+n m
P (Y = m, Z = n) = P (X = m + n) p (1 − p)n
m
µ ¶
λm+n m+n m λm+n (m + n)! m
= p (1 − p)n = × p (1 − p)n
(m + n)! m (m + n)! m!n!
(λp)m −λp (λ(1 − p))n −λ(1−p)
= [ e ]×[ e ],
m! n!
‘
Chapter 6
Conditional Expectation
Conditional expectation tells us how to use the observation of a random variable Y (ω)
to estimate another random variable X(ω). This conditional expectation is the best guess
about X(ω) given Y (ω) if we want to minimize the mean squared error. Of course, the value
6.1 Examples
6.1.1 Example 1
Assume that the pair of random variables (X, Y ) is discrete and takes values in {x1 , . . . , xm }×
X X
P (Y = yj ) = P (X = xi , Y = yj ) = p(i, j).
i i
define
X
E[X|Y = yj ] = xi P [X = xi |Y = yj ].
i
85
86 CHAPTER 6. CONDITIONAL EXPECTATION
X
E[X|Y ] = E[X|Y = yj ]1{Y = yj }.
j
For instance, your guess about the temperature in San Francisco certainly depends on
the temperature you observe in Berkeley. Since the latter is random, so is your guess about
the former.
Although this definition is sensible, it is not obvious in what sense this is the best guess
6.1.2 Example 2
Consider the case where (X, Y ) have a joint density f (x, y) and marginal densities fX (x)
and fY (y). One can then define the conditional density of X given that Y = y as follows.
We see that
As δ goes down to zero, we see that fX|Y [x|y] is the conditional density of X given
Z ∞
E[X|Y = y] = xfX|Y [x|y]dx. (6.1.1)
−∞
6.1.3 Example 3
The ideas of Examples 1 and 2 extend to hybrid cases. For instance, consider the situation
The figure shows the joint distribution of (X, Y ). With probability 0.4, (X, Y ) =
(0.75, 0.25). Otherwise (with probability 0.6), the pair (X, Y ) is picked uniformly in the
6.2. MMSE 87
square [0, 1]2 . You see that E[X|Y = y] = 0.5 if y 6= 0.25. Also, if Y = 0.25, then X = 0.75,
Thus E[X|Y ] = g(Y ) where g(0.25) = 0.75 and g(y) = 0.5 for y 6= 0.25.
In this case, E[X|Y ] is a random variable such that E[X|Y ] = 0.5 w.p. 0.6 and E[X|Y ] =
(Note that the expected value of E[X|Y ] is 0.5 × 0.6 + 0.75 × 0.4 = 0.6 and you can
observe that E(X) = 0.5 × 0.6 + 0.75 × 0.4 = 0.6. That is, E(E[X|Y ]) = E(X) and we will
6.2 MMSE
The examples that we have explored led us to define E[X|Y ] as the expected value of X
when it has its conditional distribution given the value of Y . In this section, we explain
that E[X|Y ] can be defined as the function g(Y ) of the observed value Y that minimizes
E((X − g(Y )2 ). That is, E[X|Y ] is the best guess about X that is based on Y , where best
= E((X − E[X|Y ])2 ) + E((E[X|Y ] − g(Y ))2 ) + 2E((X − E[X|Y ])(E[X|Y ] − g(Y )))
= E((X − E[X|Y ])2 ) + E((E[X|Y ] − g(Y ))2 ) + 2E((X − E[X|Y ])h(Y )) (6.2.1)
The second step is to show that the last term in (6.2.1) is equal to zero. To show that,
we calculate
Z Z Z
E(h(Y )E[X|Y ]) = h(y)E[X|Y = y]fY (y)dy = h(y){ xfX|Y [x|y]dx} fY (y)dy
Z Z
= xh(y)fX,Y (x, y)dxdy = E(h(Y )X). (6.2.2)
The next-to-last identity uses the fact that fX|Y [x|y]fY (y) = fX,Y (x, y), by definition of
The final step is to observe that (6.2.1) with the last term equal to zero implies that
E((X −g(Y ))2 ) = E((X −E[X|Y ])2 )+E((E[X|Y ]−g(Y ))2 ) ≥ E((X −E[X|Y ])2 ). (6.2.3)
This is the story when joint densities exist. The derivation can be adapted to the case when
The left-hand part of Figure 6.2 shows that E[X|Y ] is the average value of X on sets that
correspond to a constant value of Y . The figure also highlights the fact that E[X|Y ] is a
random variable.
6.3. TWO PICTURES 89
X
X
E[X|Y]
E[X|Y]
g(Y)
0 1
The right-hand part of Figure 6.2 depicts random variables as points in some vector
space. The figure shows that E[X|Y ] is the function of Y that is closest to X. The
metric in the space is d(V, W ) = (E(V − W )2 )1/2 . That figure illustrates the relations
(6.2.3). These relations are a statement of Pythagora’s theorem: the square of the length
of the hypothenuse d2 (X, g(Y )) is the sum of the squares of the sides of the right triangle
d2 (X, E[X|Y ]) + d2 (E[X|Y ], g(Y )). This figure shows that E[X|Y ] is the projection of X
onto the hyperplane {k(Y ) | k(·) is a function }. The figure also shows that for E[X|Y ] to
be that projection, the vector X − E[X|Y ] must be orthogonal to every function k(Y ), and
To give you a concrete feel for this vector space, imagine that Ω = {ω1 , . . . , ωN } and that
pk is the probability that ω is equal to ωk . In that case, the random variable X corresponds
to the vector (X(ω1 )(p1 )1/2 , . . . , X(ωN )(pN )1/2 ) in <N . For a general Ω, the random variable
X is a function of ω and it belongs to a function space. This space is a vector space since
linear combinations of functions are also functions. If we restrict our attention to random
variables X with E(X 2 ) < ∞, that space, with the metric that one defines, turns out to
be closed under limits of convergent sequences. (Such a space is called a Hilbert space.)
This property is very useful because it implies that as one chooses functions gn (Y ) whose
90 CHAPTER 6. CONDITIONAL EXPECTATION
g(Y ). This argument implies the existence of conditional expectation. The uniqueness is
intuitively clear: if two random variables g(Y ) and g 0 (Y ) achieve the minimum distance to
Thus, we can define E[X|Y ] as the function g(Y ) that minimizes E((X − g(Y ))2 ). This
definition does not assume the existence of a conditional density fX|Y [·|·], nor of a joint
pmf.
One often calculates the conditional expectation by using the properties of that operator.
We derive these properties in this section. We highlight the key observation we made in the
We proved in (6.2.2) that E[X|Y ] satisfies that property. To show that a function that
satisfies (6.4.1) is the conditional expectation, one observes that if both g(Y ) and g 0 (Y )
satisfy that condition, they must be equal. To see that, note that
Using (6.4.1), we see that the last term is equal to zero. Hence,
But we assume that E((X −g(Y ))2 ) = E((X −g 0 (Y ))2 ). Consequently, E((g(Y )−g 0 (Y ))2 ) =
a. Linearity:
b. Known Factor:
c. Averaging:
e. Smoothing:
Proof:
The derivation of these identities is a simple exercise, but going through it should help
a. Linearity is fairly clear from our original definition (6.1.1), when the conditional
density exists. In the general case, we can use the lemma as follows. We do this derivation
To show that the function g(Y ) := a1 E[X1 | Y ] + a2 E[X2 | Y ] is in fact E[X|Y ] with
E((ai E[Xi | Y ])h(Y )) = E(E[Xi | Y ](ai h(Y ))) = E(Xi (ai h(Y ))) = E(ai Xi h(Y )). (6.4.8)
92 CHAPTER 6. CONDITIONAL EXPECTATION
Indeed, the third identity follows from the property (6.4.1) applied to E[Xi | Y ] with h(Y )
b. To show that g(Y ) := k(Y )E[X|Y ] is equal to E[Xk(Y )|Y ] we prove that it satisfies
This identity follows from the property (6.4.1) of E[X|Y ] where one replaces h(Y ) by
k(Y )h(Y ).
d. We use the lemma. Let g(Y ) = E[X|Y ]. We show that for any h(Y ) one has
E(g(Y )h(Y )) = E(E[X|Y, Z]h(Y )). Now, E(E[X|Y, Z]h(Y )) = E(Xh(Y )) by (6.4.1) and
E(g(Y )h(Y )) = E(E[X|Y ]h(Y )) = E(Xh(Y )), also by (6.4.1). This completes the proof.
i.i.d. random variables with P (Xn = −1) = P (Xn = 1) = 0.5. The random variable Xn
represents your gain at the n-th game of roulette, playing black or red and assuming that
there is no house advantage (no 0 nor double-zero). Say that you have played n times and
you gamble on the next game. You earn Yn Xn+1 on that next game. After a number of
Z = Y0 X1 + Y1 X2 + · · · + Yn Xn+1 .
(Here, Y0 is some arbitrary initial bet.) Assume that the random variables Yn are
bounded (which is not unreasonable since there may be a table limit), then you find that
6.6 Summary
The setup is that (X, Y ) are random variables on some common probability space, i.e., with
The minimum mean squares estimator of X given Y is defined as the function g(Y )
that minimizes E((X − g(Y ))2 ). We know that the answer is g(Y ) = E[X | Y ]. How do we
calculate it?
94 CHAPTER 6. CONDITIONAL EXPECTATION
Direct Calculation
The direct calculation uses (6.1.1) or the discrete version. We look at hybrid cases in the
examples.
Symmetry
m
E[X1 + · · · + Xm | X1 + · · · + Xn ] = × (X1 + · · · + Xn ) for 1 ≤ m ≤ n.
n
Note also that
E[Xi | X1 + · · · + Xn ] = Y, for i = 1, . . . , n
where Y is some random variable. Second, by (6.4.2), if we add up these identities for
i = 1, . . . , n, we find
nY = E[X1 + · · · + Xn | X1 + · · · + Xn ] = X1 + · · · + Xn .
Hence,
Y = (X1 + · · · + Xn )/n,
so that
Using these identities we can now derive the two properties stated above.
Properties
Often one can use the properties of conditional expectation states in Theorem 6.4.2 to
calculate E[X | Y ].
6.7. SOLVED PROBLEMS 95
Example 6.7.1. Let (X, Y ) be a point picked uniformly in the quarter circle {(x, y) | x ≥
p
Given Y = y, X is uniformly distributed in [0, 1 − y 2 ]. Hence
1p
E[X | Y ] = 1 − Y 2.
2
b. Find E[T ].
c. Find V ar[T ].
Example 6.7.3. The random variables Xi are i.i.d. and such that E[Xi ] = µ and var(Xi ) =
values. Let S = X1 + X2 + . . . + XN .
a. Find E(S).
b. Find var(S).
Then,
var(S) = E(S 2 ) − (E(S))2 = E(N )σ 2 + E(N 2 )µ2 − µ2 (E(N ))2 = E(N )σ 2 + var(N )µ2 .
Example 6.7.4. Let X, Y be independent and uniform in [0, 1]. Calculate E[X 2 | X + Y ].
Z 1
1 1 x3 1 1 − (z − 1)3
E[X 2 | X + Y = z] = x2 dx = [ ]z−1 = .
z−1 2−z 2−z 3 3(2 − z)
Similarly, if z < 1, then
Z z
1 1 x3 z2
E[X 2 | X + Y = z] = x2 dx = [ ]z0 = .
0 z z 3 3
Example 6.7.5. Let (X, Y ) be the coordinates of a point chosen uniformly in [0, 1]2 . Cal-
culate E[X | XY ].
This is an example where we use the straightforward approach, based on the definition.
The problem is interesting because is illustrates that approach in a tractable but nontrivial
example. Let Z = XY .
Z 1
E[X | Z = z] = xf[X|Z] [x | z]dx.
0
6.7. SOLVED PROBLEMS 97
Now,
fX,Z (x, z)
f[X|Z] [x | z] = .
fZ (z)
Also,
Hence,
1
x, if x ∈ [0, 1] and z ∈ [0, x]
fX,Z (x, z) =
0, otherwise.
Consequently,
Z 1 Z 1
1
fZ (z) = fX,Z (x, z)dx = dx = −ln(z), 0 ≤ z ≤ 1.
0 z x
Finally,
1
f[X|Z] [x | z] = − , for x ∈ [0, 1] and z ∈ [0, x],
xln(z)
and
Z 1
1 z−1
E[X | Z = z] = x(− )dx = ,
z xln(z) ln(z)
so that
XY − 1
E[X | XY ] = .
ln(XY )
Examples of values:
Example 6.7.6. Let X, Y be independent and exponentially distributed with mean 1. Find
E[cos(X + Y ) | X].
98 CHAPTER 6. CONDITIONAL EXPECTATION
We have
Z ∞ Z ∞
−y
E[cos(X + Y ) | X = x] = cos(x + y)e dy = Re{ ei(x+y)−y dy}
0 0
eix cos(x) − sin(x)
= Re{ }= .
1−i 2
E[X1 | Y ].
Intuition suggests, and it is not too hard to justify, that if Y = y, then X1 = y with prob-
ability 1/n, and with probability (n − 1)/n the random variable X1 is uniformly distributed
Example 6.7.8. Let X, Y, Z be independent and uniform in [0, 1]. Calculate E[(X + 2Y +
Z)2 | X].
Example 6.7.9. Let X, Y, Z be three random variables defined on the same probability
and
Hence,
E((X −X1 )2 ) = E((X −X2 +X2 −X1 )2 ) = E((X −X2 )2 )+E((X2 −X1 )2 ) ≥ E((X −X2 )2 ).
Example 6.7.10. Pick the point (X, Y ) uniformly in the triangle {(x, y) | 0 ≤ x ≤
1 and 0 ≤ y ≤ x}.
a. Calculate E[X | Y ].
1+Y
E[X | Y ] = .
2
X
E[Y | X] = .
2
X2
E[(X − Y )2 | X] = .
3
Example 6.7.11. Assume that the two random variables X and Y are such that E[X |
We show that E((X − Y )2 ) = 0. This will prove that X − Y = 0 with probability one.
Note that
Now,
Similarly, one finds that E(XY ) = E(Y 2 ). Putting together the pieces, we get E((X −
Y )2 ) = 0.
Example 6.7.12. Let X, Y be independent random variables uniformly distributed in [0, 1].
Drawing a unit square, we see that given {X < Y }, the pair (X, Y ) is uniformly dis-
tributed in the triangle left of the diagonal from the upper left corner to the bottom right
corner of that square. Accordingly, the p.d.f. f (x) of X is given by f (x) = 2(1 − x). Hence,
Z 1
1
E[X|X < Y ] = x × 2(1 − x)dx = .
0 3
Chapter 7
Gaussian random variables show up frequently. (This is because of the central limit theorem
that we discuss later in the class.) Here are a few essential properties that we explain in
the chapter.
• If random variables are jointly Gaussian, then the conditional expectation is linear.
7.1 Gaussian
Definition 7.1.1. We say that X is a standard Gaussian (or standard Normal) random
1
fX (x) = √ exp{−x2 /2}, for x ∈ <.
2π
101
102 CHAPTER 7. GAUSSIAN RANDOM VARIABLES
To see that fX (·) is a proper density we should verify that it integrates to one. We do
Z Z 2π 2π
1
= exp{−(x2 + y 2 )/2}dxdy.
2π
and y = r sin(θ). Then, dxdy = rdrdθ and x2 + y 2 = r2 . We then rewrite the integral above
as follows:
Z ∞ Z 2π Z ∞
1 −r2 /2 2 /2
A2 = re drdθ = re−r dr
0 0 2π 0
Z ∞
2 /2 2 /2
=− de−r = [e−r ]∞
0 = 1,
0
so that
Z
0 d 1
φ (u) := φ(u) = ixeiux √ exp{−x2 /2}dx
du 2πZ
Z
1 iux −x2 /2 1 2
= −i √ e de = i √ e−x /2 deiux
Z 2π 2π
1
= −u eiux √ exp{−x2 /2}dx = −uφ(u).
2π
7.1. GAUSSIAN 103
claimed.
2 /2
We have shown that if X =D N (0, 1), then E(eiuX ) = e−u . To see the converse, one
observes that
Z ∞
iuX
E(e )= eiux fX (x)dx,
−∞
so that E(eiuX ) is the Fourier transform of fX (·). It can be shown that the Fourier transform
specifies fX (·) uniquely. That is, if two random variables X and Y are such that E(eiuX ) =
2 /2
E(eiuY ), then fX (x) = fY (x) for x ∈ <. Accordingly, if E(eiuX ) = e−u , it must be that
X =D N (0, 1).
Since
1 1 1
E(exp{iuX}) = E(1 + iuX − u2 X 2 − iu3 X 3 + · · · + (iuX)n + · · · )
2 3! n!
2 1 2 1 4 1 6 1
= exp{−u /2} = 1 − u + u − u + · · · + (−u2 /2)m + · · · ,
2 8 48 m!
we see that E(X m ) = 0 for m odd and
1 1
(iu)2m E(X 2m ) = (−u2 /2)m ,
(2m)! m!
(2m)!
so that E(X 2m ) = 2m m! .
The cdf of X does not admit a closed form expression. Its values have been tabulated
and many software packages provides that function. Table 7.1 shows sample values of the
7.1.2 N (µ, σ 2 )
exp{iuµ − u2 σ 2 /2}.
1 (x − µ)2
√ exp{− }, for x ∈ <.
2πσ 2 2σ 2
So far we have considered a single Gaussian random variable. In this section we discuss
collections of random variables that have a Gaussian joint distribution. Such random vari-
7.2.1 N (00, I )
Random variables X are said to be jointly Gaussian if u T X is Gaussian for any vector u .
7.2. JOINTLY GAUSSIAN 105
and var(Y ) = u T Σu
u (see (7.5.7)), we see that
Now,
Z ∞ Z ∞
uT X Tx
E(e iu
)= ··· eiuu x)dx1 . . . dxn
fX (x (7.2.2)
−∞ −∞
out that this Fourier transform completely determines the joint density.
Using these preliminary calculations we can derive the following very useful result.
Jointly Gaussian random variables are independent if and only if they are uncorrelated.
Proof:
We know from Theorem 5.3.1 that the random variables X = (X1 , . . . , Xn )T are inde-
pendent if and only if their joint density is the product of their individual densities. Now,
That is, random variables are mutually independent if and only if their joint characteris-
TX
tic function E(eiuu ) is the product of their individual characteristic functions E(eium Xm ).
To complete the proof, we use the specific form (7.2.1) of the joint characteristic function of
jointly Gaussian random variables and we note that it factorizes if and only if Σ is diagonal,
¤
106 CHAPTER 7. GAUSSIAN RANDOM VARIABLES
µ, AA T ). Assume that A is
Note also that if X = N (00, I ), then Y = µ + AX = N (µ
nonsingular. Then
1
fY (yy ) = A−1y ),
fX (A
|A|
in view of the change of variables. Hence,
1 1
fY (yy ) = n/2
exp{− (yy − µ )T Σ −1 (yy − µ )
Σ|)
(2π|Σ 2
where Σ = AA T . We used the fact that the determinant of the product of square matrices
Σ| = |A
is the product of their determinants, so that |Σ AT | = |A
A| × |A A|2 and |A Σ|1/2 .
A| = |Σ
We explain how to calculate the conditional expectation of jointly Gaussian random vari-
ables. The result is remarkably simple: the conditional expectation is linear in the obser-
vations!
Theorem 7.3.1. Let (X, Y ) be two jointly Gaussian random variables. Then
a. One has
cov(X, Y )
E[X | Y ] = E(X) + (Y − E(Y )). (7.3.1)
var(Y )
X , Y ) be jointly Gaussian.
Let (X
b. If ΣY is invertible, then
X |Y
E[X X ) + ΣX ,YY ΣY−1 (Y
Y ] = E(X Y − E(Y
Y )). (7.3.2)
X |Y
E[X X ) + ΣX ,YY ΣY† (Y
Y ] = E(X Y − E(Y
Y )) (7.3.3)
Proof:
Then we can look for a vector a and a matrix B of compatible dimensions so that
X |Y
E[X Y ] = E[X
X − a − BY + a + BY |Y
Y]
X − a − BY |Y
= E[X Y ] + E[a
a + BY |Y
Y ], by (6.4.2)
X − a − BY ) + a + BY , by (6.4.5)
= E(X
= a + BY .
X − a − BY ] = 0, or
To find the desired a and B , we solve E[X
X ) − B E(Y
a = E(X Y)
Y T ] = 0, or
X − a − BY )Y
and E[(X
ΣX ,YY = B ΣY . (7.3.4)
B = ΣX ,YY ΣY−1 ,
so that
X |Y
E[X X ) + ΣX ,YY ΣY−1 (Y
Y ] = E(X Y − E(Y
Y )).
If ΣY ,YY is not invertible is not, we choose a pseudo-inverse ΣY† that solves (7.3.4), i.e., is
such that
We then find
X |Y
E[X X ) + ΣX ,YY ΣY† (Y
Y ] = E(X Y − E(Y
Y )).
7.4 Summary
If X, Y are jointly Gaussian, then E[X | Y ] = E(X) + cov(X, Y )var(Y )−1 (Y − E(Y )).
Example 7.5.1. The noise voltage X in an electric circuit can be modelled as a Gaussian
a. What is the probability that it exceeds 10−4 ? What is the probability that it exceeds
2 × 10−4 ? What is the probability that its value is between −2 × 10−4 and 10−4 ?
b. Given that the noise value is positive, what is the probability that it exceeds 10−4 ?
Let Z = 104 X, then Z =D N (0, 1) and we can reformulate the questions in terms of Z.
a. Using (7.1) we find P (Z > 1) = 0.159 and P (Z > 2) = 0.023. Indeed, P (Z > d) =
P (−2 < Z < 1) = P (Z < 1)−P (Z ≤ −2) = 1−P (Z > 1)−P (Z > 2) = 1−0.159−0.023 = 0.818.
b. We have
P (Z > 1)
P [Z > 1 | Z > 0] = = 2P (Z > 1) = 0.318.
P (Z > 0)
7.5. SOLVED PROBLEMS 109
Hence,
r
−4 2
E(|X|) = 10 .
π
random variables. A low-pass filter takes the sequence U and produces the output sequence
a. Find the joint pdf of Xn and Xn−1 and find the joint pdf of Xn and Xn+m for m > 1.
b. Find the joint pdf of Yn and Yn−1 and find the joint pdf of Yn and Yn+m for m > 1.
We start with some preliminary observations. First, since the Ui are independent, they
are jointly Gaussian. Second, Xn and Yn are linear combinations of the Ui and thus are
also jointly Gaussian. Third, the jpdf of jointly gaussian random variables Z is
1 1
fZ (zz ) = p exp[− (zz − m )C −1 (zz − m )]
(2π)n det(C) 2
matrix
Z − m )(Z
E[(Z Z − m )T ]. Finally, we need some basic facts
from algebra. If C =
a b d −b
, then det(C) = ad − bc and C −1 = 1 . We are now ready to
det(C)
c d −c a
answer the questions.
U.
a. Express in the form X = AU
Un−1
1 1
Xn 0
= 2 2 Un
1 1
Xn−1 2 2 0
Un+1
110 CHAPTER 7. GAUSSIAN RANDOM VARIABLES
X ] = AE[U
Then E[X U ] = 0.
1 0 0 0 1
1 1 2 1 1
0
U U T ]AT =
X X T ] = AE[U
C = E[X 2 2 0 1 0 1 1 = 2 4
1 1 2 2 1 1
2 2 0 1 4 2
0 0 1 2 0
1 1 3
Then det(C) = 4 − 16 = 16 and
16 12 − 14
C −1 =
3 − 14 1
2
2
fXn Xn−1 (xn , xn−1 ) = √
π 3
exp[− 43 (x2n − xn xn−1 + x2n−1 )]
For m > 1,
Un
Xn 1 1
0 0 Un+1
= 2 2
Xn+m 0 0 1 1 Un+m
2 2
Un+m+1
X ] = AE[U
Then E[X U ] = 0.
1
1 0 0 0 0
2
1 1
0 0 0 1 0 0 1
0 1
0
U U T ]AT =
X X T ] = AE[U
C = E[X 2 2
2 2
=
0 0 1 1 0 0 1 0 0 1
0 1
2 2 2 2
1
0 0 0 1 0 2
1
Then det(C) = 4 and
2 0
C −1 =
0 2
1
fXn Xn+m (xn , xn+m ) = π exp[− 14 (x2n + x2n+m )]
b.
Un−2
Yn 0 − 12 1
= 2 Un−1
Yn−1 − 12 1
2 0
Un
Y ] = AE[U
Then E[Y U ] = 0.
1 0 0 0 − 21
0 − 12 1 1
− 14
U U T ]AT =
Y Y T ] = AE[U
C = E[Y 2 0 1 0 −1 1 = 2
2 2
− 12 1
2 0 1
− 14 1
2
0 0 1 2 0
7.5. SOLVED PROBLEMS 111
1 1 3
Then det(C) = 4 − 16 = 16 and
1 1
16 2 4
C −1 =
3 1 1
4 2
2
fYn Yn−1 (yn , yn−1 ) = √
π 3
exp[− 43 (yn2 + yn yn−1 + yn−1
2 )]
For m > 1,
Un−1
Yn − 21 1
0 0 Un
= 2
Yn+m 0 0 − 21 1 Un+m−1
2
Un+m
Y ] = AE[U
Then E[Y U ] = 0.
1 0 0 0 − 12 0
− 12 1
0 0 0 1 0 0 1 0 1
0
U U T ]AT =
Y Y T ] = AE[U
C = E[Y 2
2
2
=
0 0 − 12 1 0 0 1 0 0 − 12 0 1
2 2
1
0 0 0 1 0 2
1
Then det(C) = and
4
2 0
C −1 =
0 2
1
fYn Yn+m (yn , yn+m ) = π exp[− 14 (yn2 + yn+m
2 )]
1 1 3
Then det(C) = 4 − 14 = 16 and
1
16 − 14
C −1 = 2
3 − 14 1
2
2
fXn Yn (xn , yn ) = √
π 3
exp[− 43 (x2n − xn yn + yn2 )]
1
Then det(C) = 4 and
2 0
C −1 =
0 2
1
fXn Yn+1 (xn , yn+1 ) = π exp[− 14 (x2n + yn+1
2 )]
We have
3X + Z
E[X + 2Y |3X + Z, 4Y + 2V ] = a Σ−1
4Y + 2V
where
and
var(3X + Z) E((3X + Z)(4Y + 2V )) 10 0
Σ= = .
E((3X + Z)(4Y + 2V )) var(4Y + 2V ) 0 20
Hence,
10−1 0 3X + Z
E[X+2Y |3X+Z, 4Y +2V ] = [3, 8] = 3 (3X+Z)+ 4 (4Y +2V ).
0 20−1 4Y + 2V 10 10
Example 7.5.4. Assume that {X, Yn , n ≥ 1} are mutually independent random variables
σ 2
Thus we know that X − X̂n = N (0, n+σ 2 ). Accordingly,
σ2 0.1
P (|X − X̂n | > 0.1) = P (|N (0, 2
)| > 0.1) = P (|N (0, 1)| > )
n+σ αn
q
σ2
where αn = n+σ 2
. For this probability to be at most 5% we need
r
0.1 σ2 0.1
= 2, i.e., αn = 2
= ,
αn n+σ 2
so that
n = 19σ 2 .
The result is intuitively pleasing: If the observations are more noisy (σ 2 large), we need
Example 7.5.5. Assume that X, Y are i.i.d. N (0, 1). Calculate E[(X + Y )4 | X − Y ].
Note that X + Y and X − Y are independent because they are jointly Gaussian and
uncorrelated. Hence,
X 2 + Y 2 =D Exd(1/2). That is, the sum of the squares of two i.i.d. zero-mean Gaussian
Z ∞ Z ∞
iuW 2 +y 2 ) 1 −(x2 +y2 )/2
E(e ) = eiu(x e dxdy
−∞ −∞ 2π
Z 2π Z ∞
2 1 −r2 /2
= eiur e rdrdθ
2π
Z0 ∞ 0
2 2 /2
= eiur e−r rdr
0
Z ∞
1 2 2 1
= d[eiur −r /2 ] = .
0 2iu − 1 1 − 2iu
7.5. SOLVED PROBLEMS 115
Example 7.5.7. Let {Xn , n ≥ 0} be Gaussian N (0, 1) random variables. Assume that
Yn+1 = aYn + Xn for n ≥ 0 where Y0 is a Gaussian random variable with mean zero and
a. We see that
αn+1 = a2 αn + 1 and α0 = σ 2 .
1 − a2n
var(Yn ) = αn = a2n σ 2 + , for n ≥ 0.
1 − a2
1
var(Yn ) → γ 2 := as n → ∞.
1 − a2
a.Calculate
E[X1 + X2 + X3 | X1 + X2 , X2 + X3 , X3 + X4 ].
b. Calculate
E[X1 + X2 + X3 | X1 + X2 + X3 + X4 + X5 ].
where the coefficients a, b, c must be such that the estimation error is orthogonal to the
= E((X1 + X2 + X3 ) − Y )(X3 + X4 )) = 0.
2 − a − (a + b) = 2 − (a + b) − (b + c) = 1 − (b + c) − c = 0,
Yk = E[Xk | X1 + X2 + X3 + X4 + X5 ].
Y1 +Y2 +Y3 +Y4 +Y5 = E[X1 +X2 +X3 +X4 +X5 | X1 +X2 +X3 +X4 +X5 ] = X1 +X2 +X3 +X4 +X5 .
3
E[X1 + X2 + X3 | X1 + X2 + X3 + X4 + X5 ] = Y1 + Y2 + Y3 = (X1 + X2 + X3 + X4 + X5 ).
5
Example 7.5.9. Let the Xn ’s be as in Example 7.5.7. Find the jpdf of (X1 + 2X2 +
These random variables are jointly Gaussian, zero mean, and with covariance matrix Σ
given by
14 11 11
Σ=
11 14 11 .
11 11 14
Indeed, Σ is the matrix of covariances. For instance, its entry (2, 3) is given by
1 1
x) =
fX (x exp{− x T Σ−1x }.
(2π)3/2 |Σ|1/2 2
Y ] where
E[X1 + 3X2 |Y
X1
1 2 3
Y =
X2
3 2 1
X3
By now, this should be familiar. The solution is Y := a(X1 + 2X2 + 3X3 ) + b(3X1 +
and
Example 7.5.11. Find the jpdf of (2X1 + X2 , X1 + 3X2 ) where X1 and X2 are independent
These random variables are jointly Gaussian, zero-mean, with covariance Σ given by
5 5
Σ= .
5 10
Hence,
1 1
x) =
fX (x 1/2
exp{− x T Σ−1x }
2π|Σ| 2
1 1 T −1
= exp{− x Σ x }
10π 2
where
1 10 −5
Σ−1 = .
25 −5 5
Example 7.5.12. The random variable X is N (µ, 1). Find an approximate value of µ so
that
Example 7.5.13. Let X be a N (0, 1) random variable. Calculate the mean and the variance
2 /2
E(eiuX ) = e−u and eiθ = cos(θ) + i sin(θ).
Therefore,
2 /2
E(cos(uX) + i sin(uX)) = e−u ,
7.5. SOLVED PROBLEMS 119
so that
2 /2
E(cos(uX)) = e−u and E(sin(uX)) = 0.
1 1 1
E(cos2 (X)) = E( (1 + cos(2X))) = + E(cos(2X)).
2 2 2
2 /2
E(cos(2X)) = e−2 = e−2 ,
1 1 −2 1 1
var(cos(X)) = E(cos2 (X)) − (E(cos(uX)))2 = + e − (e−1/2 )2 = + e−2 − e−1 .
2 2 2 2
Similarly, we find
1 1 −2
E(sin2 (X)) = E(1 − cos2 (X)) = − e = var(sin(X)).
2 2
a. Calculate
E[3X + 5Y | 2X − Y, X + Z].
E[3X + 5Y | V ] = a Σ−1
V V
where
V T ) = [1, 3]
a = E((3X + 5Y )V
and
5 2
ΣV = .
2 2
Hence,
−1
5 2 1 2 −2
E[3X + 5Y | V ] = [1, 3] V = [1, 3] V
2 2 6 −2 5
1 2 13
V = − (2X − Y ) + (X + Z).
= [−4, 13]V
6 3 6
b. Now,
1
E[3X + 5Y | V ] = E(3X + 5Y ) + a Σ−1 V − E(V
V (V V − [1, 2]T )
V )) = 8 + [−4, 13](V
6
26 2 13
= − (2X − Y ) + (X + Z).
6 3 6
Example 7.5.16. Let (X, Y ) be jointly Gaussian. Show that X − E[X | Y ] is Gaussian
We know that
cov(X, Y )
E[X | Y ] = E(X) + (Y − E(Y )).
var(Y )
Consequently,
cov(X, Y )
X − E[X | Y ] = X − E(X) − (Y − E(Y ))
var(Y )
and is certainly Gaussian. This difference is zero-mean. Its variance is
The detection problem is roughly as follows. We want to guess which of finitely many
possible causes produced an observed effect. For instance, you have a fever (observed effect);
do you think you have the flu or a cold or the malaria? As another example, you observe
some strange shape on an X-ray; is it a cancer or some infection of the tissues? A receiver
gets a particular waveform; did the transmitter send the bit 0 or the bit 1? (Hypothesis
testing is similar.) As you can see, these problems are prevalent in applications.
There are two basic formulations: either we know the prior probabilities of the possible
causes (Bayesian) or we do not (non-Bayesian). When we do not, we can look for the
8.1 Bayesian
Assume that X takes values in a finite set {0, . . . , M }. We know the conditional density or
We want to choose Z in {0, . . . , N } on the basis of Y to minimize E(c(X, Z)) where c(·, ·)
Since E(c(X, Z)) = E(E[c(X, Z)|Y ]), we should choose Z = g(Y ) where
121
122 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
the maximum a posteriori estimate of X given {Y = y}, the most likely value of X given
{Y = y}.
P (X = x)fY |X [y|x]
P [X = x|Y = y] = ,
fY (y)
so that
The common criticism of this formulation is that in many cases the prior distribution
of X is not known at all. For instance, consider designing a burglar alarm for your house.
What prior probability should you use? You suspect a garbage in, garbage out effect here
Instead of choosing the value of X that is most likely given the observation, one can
choose the value of X that makes the observation most likely. That is, one can choose
arg maxx P [Y = y|X = x]. This estimator is called the maximum likelihood estimator of X
Identity (8.1.2) shows that M LE[X|Y = y] = M AP [X|Y = y] when the prior distri-
bution of X is uniform, i.e., when P (X = x) has the same value for all x. Note also that
deeper property of the MLE is that under weak assumptions it tends to be a good estimator
(asymptotically efficient).
8.3. HYPOTHESIS TESTING PROBLEM 123
Consider the problem of designing a fire alarm system. You want to make the alarm as
sensitive as possible as long as it does not generate too many false alarms. We formulate
We consider the case of a simple hypothesis. We define the problem and state the solution
in the form of a theorem. We then examine some examples. We conclude the section by
There are two possible hypotheses H0: X = 0 or H1: X = 1. Should one reject H0 on
One is given the distribution of the observation Y given X. The problem is to choose
to a bound on the probability of false alarm: P [Z = 1|X = 0] ≤ β, for a given β ∈ (0, 1).
the solution of that problem by Z = HT [X|Y ], which means that Z is the solution of the
Discrete Case
Given X, Y has a known p.m.f. P [Y = y|X]. Let L(y) = P [Y = y|X = 1]/P [Y = y|X = 0]
It is not too difficult to show that there is a choice of λ and γ for which [Z = 1|X =
Continuous Case
Given X, Y has a known p.d.f. fY |X [y|x]. Let L(y) = fY |X [y|1]/fY |X [y|0] (the likelihood
ratio).
The solution is
1, if L(y) > λ
Z=
0, if L(y) ≤ λ.
8.3.2 Examples
Example 1
If X = k, Y is exponentially distributed with mean µ(k), for k = 0, 1 where 0 < µ(0) <
µ(1). Here, fY |X [y|x] = κ(x) exp{−κ(x)y} where κ(x) = µ(x)−1 for x = 0, 1. Thus Z =
1{Y > y0 } where y0 is such that P [Z = 1|X = 0] = β4. That is, exp{−κ(0)y0 } = β, or
Example 2
If X = k, Y is Gaussian with mean µ(k) and variance 1, for k = 0, 1 where 0 < µ(0) <
√
µ(1). Here, fY |X [y|x] = K exp{−(x − µ(k))2 /2} where K = 1/ 2π. Accordingly, L(y) =
B exp{x(µ(1) − µ(0))} where B = exp{(µ(0)2 − µ(1)2 )/2}. Thus Z = 1{Y > y0 } where y0
is such that P [Z = 1|X = 0] = β. That is, P (N (µ(0), 1) > y0 ) = β, and one finds the value
Example 3
You flip a coin 100 times and count the number Y of heads. You must decide whether the
coin is fair or biased, say with P (H) = 0.6. The goal is to minimize the probability of
Solution: You can verify that L(y) in increasing in y. Thus, the best decision is Z = 1
Figure 8.1 illustrates P [Z = 1|f air]. For β = 0.001, one finds (using a calculator)
One finds also that P [Y ≥ 58|f air] = 0.066 and P [Y ≥ 59|f air] = 0.043; accordingly, if
β = 0.05 one should decide Z = 1 w.p. 1 if Y >= 59 and Z = 1 w.p. 0.3 if Y = 58. Indeed,
in that case, P [Z = 1|f air] = P [Y ≥ 59|f air] + 0.3P [Y = 58|f air] = 0.043 + 0.3(0.066 −
0.043) = 0.05.
Before we give a formal proof, we discuss an analogy that might help you understand the
structure of the result. Imagine that you have a finite budget with which to buy food items
from a given set. Your objective is to maximize the total number of calories of the items
you buy. Intuitively, the best strategy is to rank the items in decreasing order of calories
per dollar and to buy the items in that order until you run out of money. When you do
that, it might be that you still have some money left after purchasing item n − 1 but not
quite enough to buy item n. In that case, if you could, you would buy a fraction of item n.
8.3. HYPOTHESIS TESTING PROBLEM 127
If you cannot, and if we care only about the expected amount of money you spend, then you
could buy the next item with some probability between 0 and 1 chosen so that you spend all
you money, on average. Now imagine that the items are values of the observation Y when
you decide to sound the alarm. Each item y has a cost P [Y = y|X = 0] in terms of false
of correct detection (the caloric content of the item in our previous example). According
to our intuition, we rank the items y in decreasing order of the reward/cost ratio which is
precisely the likelihood ratio. Consequently, you sound the alarm when the likelihood ratio
exceeds to value λ and you may have to randomize at some item to spend you total budget,
on average.
Let Z be as specified by the theorem and let V be another random variable based on Y ,
possibly with randomization, and such that P [V = 1|X = 0] ≤ β. We want to show that
For the next step, we need the fact that if W is a function of Y , then E[W L|X = 0] =
E[W |Z = 1]. We show this fact in the continuous case. The other cases are similar. We
find
Z Z
fY |X [y|1]
E[W L|X = 0] = W (y)L(y)fY |X [y|0]dy = W (y) f [y|0]dy
fY |X [y|0] Y |X
Z
= W (y)fY |X [y|1]dy = E[W |X = 1],
as we wanted to show.
so that
P [Z = 1 | X = 1] − P [V = 1 | X = 1] ≥ λ(P [Z = 1 | X = 0] − P [V = 1 | X = 0]) ≥ 0
128 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
P [Z = 1 | X = 1] ≥ P [V = 1 | X = 1],
So far we have learned how to decide between two hypotheses that specify the distribution
of the observations. In this section we consider composite hypotheses. Each of the two
alternatives corresponds to a set of possible distributions and we want to decide which set
8.4.1 Example 1
Consider once again Examples 1 and 2 in Section 8.3.2. Note that the optimal decision Z
does not depend on the value of µ(1). Consequently, the optimal decision would be the
H0: µ = µ(0)
The hypothesis H1 is called a composite hypothesis because it does not specify a unique
8.4.2 Example 2
Once again, consider Examples 1 and 2 in Section 8.3.2 but with the hypotheses
H0: µ ≤ µ(0)
Both hypotheses H0 and H1 are composite. We claim that the optimal decision Z is
still the same as in the original simple hypotheses case. To see that, observe that P [Z =
1|µ] = P [Y > y0 |µ] ≤ P [Y < y0 |µ(0)] = β, so that our decision meets the condition that
8.4.3 Example 3
Both examples 8.4.1 and 8.4.2 consider one-sided tests where the values of the parameter µ
under H1 are all larger than those permitted under H0. What about a two-sided test with
H0: µ = µ(0)
H1: µ 6= µ(0).
H0: µ ∈ A
H1: µ ∈ B
In general, optimal tests for such situations do not exist and one resorts to approxima-
tions. We saw earlier that the optimal decisions for a simple hypothesis test is based on
the value of the likelihood ratio L(y), which is the ratio of the densities of Y under the
two hypotheses X = 1 and X = 0, respectively. One might then try to extend this test by
replacing L(y) by the ratio of the two densities under H1 and H0, respectively. How do we
define the density under the hypothesis H1 “µ ∈ B”? One idea is to calculate the MLE of
µ given Y and H1, and similarly for H0. This approach works well under some situations.
However, the details would carry us a bit to far. Interested students will find expositions
of these methods in any good statistics book. Look for the keywords “likelihood ratio test,
goodness-of-fit test”.
130 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
8.5 Summary
The detection story is that X, Y are random variables, X ∈ {0, 1} (we could consider more
values), and we want to guess the value of X based on Y . We call this guess X̂. There are
a few possible formulations. In all cases, we assume that f[Y |X] [y | x] is known. It tells us
how Y is related to X.
8.5.1 MAP
8.5.2 MLE
The MLE is the MAP when the values of X are equally likely prior to any observation.
1, if L(Y ) > λ
X̂ = HT [X | Y ] := 0, if L(Y ) < λ
1 with probability γ, if L(Y ) = λ.
Here,
f[Y |X] [y | 1]
L(y) =
f[Y |X] [y | 0]
8.6. SOLVED PROBLEMS 131
is the likelihood ratio and we have to choose λ ≥ 0 and γ ∈ [0, 1] so that P [X̂ = 1 | X =
0] = β.
If L(Y ) is a continuous random variable, then we don’t have to bother with the case L(Y ) =
1, if L(Y ) > λ
X̂ =
0, if L(Y ) < λ
In some cases, L(Y ) is a continuous random variable that is strictly increasing in Y . Then
the decision is
X̂ = 1{Y ≥ y0 }
P [Y ≥ y0 | X = 0] = α.
Example 8.6.1. Given X, Y = N (0.1 + X, 0.1 + X), for X ∈ {0, 1}. (Model inspired from
a. We find
1 (y − 1.1)2
f[Y |X] [y | 1] = √ exp{− }
2.2π 2.2
132 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
and
1 (y − 0.1)2
f[Y |X] [y | 0] = √ exp{− }.
0.2π 0.2
Hence, X̂ = 1 if f[Y |X] [y | 1]π(1) ≥ f[Y |X] [y | 0]π(0). After some algebra, one finds that
X̂ = 1{y ∈
/ (−0.53, 0.53)}.
Intuitively, if X = 0, then Y =D N (0.1, 0.1) and is likely to be close to zero. On the other
hand, if X = 1, then Y =D N (1.1, 1.1) and is more likely to take large positive value or
negative values.
= 0.4P (|N (0.1, 0.1)| > 0.53) + 0.6P (|N (1.1, 1.1)| < 0.53)
= 0.8P (N (0.1, 0.1) < −0.53) + 1.2P (N (1.1, 1.1) < −0.53)
= 0.8P (N (0, 0.1) < −0.63) + 1.2P (N (0, 1.1) < −1.63)
√ √
= 0.8P (N (0, 1) < −0.63/ 0.1) + 1.2P (N (0, 1) < −1.63/ 1.1)
We find, for i = 0, 1,
so that X̂ = 1 if
i.e., if
1 π(1)λ1
y≤ ln{ } =: y0 .
λ1 − λ0 π(0)λ0
Consequently,
HT [X | Y ] with β = 10−2 .
We have
1 y2 1 (y − 1)2
L(y) = [ √ exp{− }]/[ √ exp{− }].
2π2 4 2π 2
We see that L(y) is a strictly increasing function of y 2 − 4y (not of y!). Thus, L(y) ≥ λ is
X̂ = 1{y 2 − 4y ≥ τ }
P [Y 2 − 4Y ≥ τ |X = 0] = β.
How do we find the value of τ for β = 10−2 ? A brute force approach consists in calculating
We have
λ1 e−λ1 y
L(y) = .
λ0 e−λ0 y
X̂ = 1{Y > y0 }
This decision rule does not depend on the value of λ1 < 1. Accordingly, X̂ solves the
Here,
1{0 ≤ y ≤ 2}
L(y) = .
1{−1 ≤ y ≤ 1}
Thus, L(y) = 0 for y ∈ [−1, 0); L(y) = 1 for y ∈ [0, 1]; and L(y) = ∞ for y ∈ (1, 2]. The
decision is then
1, if Y > 1
X̂ = HT [X | Y ] := 0, if Y < 0
1 with probability γ, if Y ∈ [0, 1].
8.6. SOLVED PROBLEMS 135
We choose γ so that
β = 0.2 = P [X̂ = 1 | X = 0]
1
= γP [Y ∈ [0, 1] | X = 0] + P [Y > 1 | X = 0] = γ ,
2
Example 8.6.6. Pick the point (X, Y ) uniformly in the triangle {(x, y) | 0 ≤ x ≤ 1 and 0 ≤
y ≤ x}.
a. Find the function g : [0, 1] → {0, 0.5, 1} that minimizes E((X − g(Y ))2 ).
b. Find the function g : [0, 1] → < that minimizes E(h(X − g(Y ))) where h(·) is a
function whose primitive integral H(·) is anti-symmetric strictly convex over [0, ∞). For
For each y, we should choose the value g(y) := v ∈ {0, 0.5, 1} that minimizes E[(X − v)2 |
We expect that the minimizing value g(y) of v is nondecreasing in y. That is, we expect
that
0, if y ∈ [0, a)
g(y) = 0.5, if y ∈ [a, b)
1, if y ∈ [b, 1].
136 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
The “critical” values a < b are such that the choices are indifferent. That is,
Substituting the expression for the conditional expectation, these equations become
b. As in previous part,
so that, for each given y, we should choose g(y) to be the value v that minimizes E[h(X −v) |
Y = y]. Now,
Z 1
1 1
E[h(X − v) | Y = y] = h(x − v)dx = [H(1 − v) − H(y − v)].
1−y y 1−y
Now, we claim that the minimizing value of v is v ∗ = (1 + y)/2. To see that, note that for
(1 − v) + (v − y)
H(1−v)−H(y −v) = H(1−v)+H(v −y) > 2H( ) = H(1−v ∗ )+H(v ∗ −y),
2
0) = p ∈ [0, 1].
a. By definition,
Consequently,
M LE[X | Y ] = Y.
b. We know that
Therefore,
M AP [X | Y = 0]
= arg maxx {g(x = 0) = P (X = 0)P (0, 0) = 0.7(1 − p), g(x = 1) = P (X = 1)P (1, 0) = 0.4p}
and
M AP [X | Y = 1]
= arg maxx {h(x = 0) = P (X = 0)P (0, 1) = 0.3(1 − p), h(x = 1) = P (X = 1)P (1, 1) = 0.6p}.
Consequently,
0, if y = 0 and p < 7/11;
1, if y = 0 and p ≥ 7/11;
M AP [X|Y = y] =
0, if y = 1 and p < 1/3;
1, if y = 1 and p ≥ 1/3.
c. We know that
1, if L(y) = P (1, y)/P (0, y) > λ;
X̂ = 1 w.p. γ, if L(y) = P (1, y)/P (0, y) = λ;
0, if L(y) = P (1, y)/P (0, y) < λ.
138 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
We must find λ and γ so that P [X̂ = 1 | X = 0] = β. Note that L(1) = 2 and L(0) = 4/7.
Accordingly,
0, if λ > 2;
1 w.p. P (0, 1)γ = 0.3γ if λ = 2;
P [X̂ = 1 | X = 0] =
1 w.p. P (0, 1) = 0.3 if λ ∈ (4/7, 2);
1 w.p. P (0, 0)γ + P (0, 1) = 0.7γ + 0.3 if λ = 4/7; 1, if λ < 4/7.
To see this, observe that if λ > 2, then L(0) < L(1) < λ, so that X̂ = 0. Also, if λ = 2, then
see that P [X̂ = 1 | X = 0] = P (0, 1)γ. The other cases are similar.
It follows that
λ = 2, γ = β/0.3, if β ≤ 0.3;
Example 8.6.8. Given X, the random variables {Yn , n ≥ 1} are exponentially distributed
fY n |X=2 (y n |X = 2) 1 n 1 Pni=1 yi
L(y n ) = = ( ) e2 .
fY n |X=1 (y n |X = 1) 2
to determine the suitable value of ρ. After Chapter 11 we will be able to use a Gaussian
0) = P (X = +1) = 1/3 and Y is N (0, σ 2 ). Find the function g : < → {−1, 0, +1} that
Let Z = X + Y . Since the prior distribution of X is uniform, we know that the solution
Now,
1 (z − x)2
fZ|X [z|x] = √ exp{− }.
2πσ 2 2σ 2
Hence,
Consequently,
−1, if z ≤ −0.5;
X̂ = 0, if − 0.5 < z < 0.5;
1, if z ≥ 0.5.
140 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
Example 8.6.10. Assume that X is uniformly distributed in the set {1, 2, 3, 4}. When
1 1
f [yy |X = i] := fY |X [yy |i] = exp[− ||yy − vi ||2 ].
2π 2
Y ) = M LE[X|Y
Also, because the prior of X is uniform, we now that g(Y Y ]. That is,
That is, the MAP of X given Y corresponds to the vector vi that is closest to the received
vector Y . This is fairly intuitive, given the shape of the Gaussian density.
where the last inequality follows from the result in Example 7.5.6.
Hence,
P [X̂ = i | X = i] ≥ 1 − exp{−0.5d2i }.
Consequently,
4
X
P [X̂ = X] = P [X̂ = i | X = i]P (X = i)
i=1
4
1X
≥ (1 − exp{−0.5d2i }).
4
i=1
8.6. SOLVED PROBLEMS 141
p p
For instance, assume that the vectors v i are the four corners of the square [− ρ/2, ρ/2]2 .
p
In that case, di = ρ/2 for all i and we find that
P [X̂ = X] ≥ 1 − exp{−0.25ρ}.
Note also that ||vv i ||2 = ρ, so that ρ is the power that the transmitter sends. As ρ increases,
so does our lower bound on the probability of decoding the signal correctly.
This type of simple bound is commonly used in the evaluation of communication systems.
Example 8.6.11. A machine produces steel balls for ball bearings. When the machine
operates properly, the radii of the balls are i.i.d. and N (100, 4). When the machine is
a. You measure n balls produced by the machine and you must raise an alarm if you
believe that the machine is defective. However, you want to limit the probability of false
b. Compute the probability of missed detection that you obtain in part (a). This prob-
ability depends on the number n of balls, so you cannot get an explicit answer. Select the
a. This is a hypothesis test. Let X = 0 when the machine operates properly and X = 1
1, if Pn Y < λ
k=1 k
X̂ =
0, if Pn Yk > λ
k=1
142 CHAPTER 8. DETECTION AND HYPOTHESIS TESTING
Xn
1% = P [ Yk < λ|X = 0] = P (N (n × 100, n × 4) < λ)
k=1
λ − 100n
= P (N (0, 4n) < λ − 100n) = P (N (0, 1) < √ ).
2 n
λ − 100n √
√ = −2.3, i.e., λ = 100n − 4.6 n.
2 n
b. We have
Xn
P [X̂ = 0|X = 1] = P [ Yk > λ|X = 1] = P (N (n × 98, n × 4) > λ)
k=1
√
λ − 98n 2n − 4.6 n
= P (N (0, 1) > √ ) = P (N (0, 1) > √ )
2 n 2 n
√
= P (N (0, 1) > n − 2.3).
√
n − 2.3 = 3.1, i.e., n = (2.3 + 3.1)2 ≈ 29.16.
Estimation
The estimation problem is similar to the detection problem except that the unobserved
random variable X does not take values in a finite set. That is, one observes Y ∈ < and
one must compute an estimate of X ∈ < based on Y that is close to X in some sense.
Once again, one has Bayesian and non-Bayesian formulations. The non-Bayesian case
9.1 Properties
E(X), i.e., if its mean is the same as that of X. Recall that E[X | Y ] is unbiased.
If we make more and more observations, we look at the estimator X̂n of X given
(Y1 , . . . , Yn ) : X̂n = gn (Y1 , . . . , Yn ). We say that X̂n is asymptotically unbiased if lim E(X̂n ) =
E(X).
In this section we study a class of estimators that are linear in the observations. They have
143
144 CHAPTER 9. ESTIMATION
Let (X, Y ) be a pair of random variables on some probability space. The linear least
X − a − BY ||2 ).
minimizes E(||X
One has
cov(X, Y )
L[X | Y ] = E(X) + (Y − E(Y )). (9.2.1)
var(Y )
Also, if ΣY is invertible,
Proof:
We provide the proof in the scalar case. The vector case is very similar and we leave the
details to the reader. The key observation is that Z = LLSE[X|Y ] if and only if Z = a+bY
To see why that is the case, assume that Z has those two properties and let V = c + dY
But
so that Z = LLSE[X | Y ].
by c = a and d = b. Setting the derivative of φ(c, d) with respect to c to zero and similarly
∂φ(c, d)
0= |c=a,d=b = 2E(X − a − bY ) = 2E(X − Z)
∂c
and
cov(X, Y )
Z = E(X) + (Y − E(Y )).
var(Y )
Note that these conditions say that the estimation error X − Z should be orthogonal
There are many cases where one keeps on making observations. How do we update the
X̂ = LLSE[X|Y ] = aY ? The answer lies in the observations captured in Figure 9.1 (we
assume all the random variables are zero-mean to simplify the notation).
These ideas lead to the Kalman filter which is a recursive estimator linear in the obser-
vations.
Assume that the joint pdf of X depends on some parameter Θ. To estimate Θ given X
X ). In such a
case, there is no useful information in X about Θ that is not already in T (X
following form:
x|θ] = h(x
fX |Θ [x x)g(T (x
x); θ).
For instance, if given {Θ = θ} the random variables X1 , . . . , Xn are i.i.d. N (θ, σ 2 ), then
one can show that X1 + . . . + Xn is a sufficient statistic for Θ. Knowing a sufficient statistic
X ], M AP [Θ|X
M LE[Θ|X X ], HT [Θ|X
X ], and E[Θ|X
X ] are all functions of T (X
X ). The correspond-
X ].
ing result does not hold for LLSE[Θ|X
9.5 Summary
9.5.1 LSSE
We discussed the linear least squares estimator of X given Y . The formulas are given in
The formulas are the same as those for the conditional expectation when the random
variables are jointly Gaussian. In the non-Gaussian case, the conditional expectation is not
linear and the LLSE is not as close to X as the conditional expectation. That is, in general,
and the inequality is strict unless the conditional expectation happens to be linear, as in
in the observations X that is relevant for estimating some parameter Θ. The necessary and
Example 9.6.1. Let X, Y be a pair of random variables. Find the value of a that minimizes
the variance of X − aY .
Note that
We also know that the values of a and b that minimize E((X − aY − b)2 ) are such that
Example 9.6.2. The random variable X is uniformly distributed in {1, 2, . . . , 100}. You
are presented a bag with X blue balls and 100 − X red balls. You pick 10 balls from the
bag and get b blue balls and r red balls. Explain how to calculate the maximum a posteriori
estimate of X.
Your intuition probably suggests that if we get b blue balls out of 10, there should be
about 10b blue balls out of 100. Thus, we expect the answer to be x = 10b. We verify that
intuition.
Designate by A the event “we got b blue balls and r = 10 − b red balls.” Since the prior
p.m.f. of X is uniform,
Now,
¡x¢¡100−x¢
b
P [A | X = x] = ¡100r¢ .
10
9.6. SOLVED PROBLEMS 149
Hence,
µ ¶µ ¶
x 100 − x x!r!(100 − x − r)!
M AP [X | A] = argmaxx = argmaxx .
b r b!(x − b)!(100 − x)!
x!(90 + b − x)!
α(x) := .
(x − b)!(100 − x)!
Note that
α(x) x(90 + b − x)
= .
α(x − 1) (x − b)(100 − x)
Hence,
i.e.,
Thus, as x increases from b to 100 − r = 90 + b, we see that α(x) increases as long as x ≤ 10b
M AP [X | A] = 10b.
Example 9.6.3. Let X, Y, Z be i.i.d. and with a B(n, p) distribution (that is, Binomial
a. What is var(X)?
c. Calculate L[XY | X + Y ].
a. Since X is the sum of n i.i.d. B(p) random variables, we find var(X) = np(1 − p).
150 CHAPTER 9. ESTIMATION
where
5np(1 − p) np(1 − p) 5 1 1 10 −1
Σ= = np(1−p) , so that Σ−1 = .
np(1 − p) 10np(1 − p) 1 10 np(1 − p) −1 5
c. Similarly,
cov(XY, X + Y )
L[XY | X + Y ] = (np)2 + (X + Y − 2np).
var(X + Y )
Now,
and
Hence,
Finally
2(np)2 (1 − p)
L[XY | X + Y ] = (np)2 + (X + Y − 2np) = np(X + Y ) − (np)2 .
2np(1 − p)
x | θ]) = h(x
fX |Θ ([x x)g(T (x
x); θ). (9.6.1)
9.6. SOLVED PROBLEMS 151
X ) does
a. Show that the identity (9.6.1) holds if and only if the density of X given T (X
not depend on θ.
b. Show that if given θ the random variables {Xn , n ≥ 1} are i.i.d. N (θ, σ 2 ), then
c. Show that if given θ the random variables {Xn , n ≥ 1} are i.i.d. Poisson with mean
X ≈ x , T ≈ t|θ]
P [X h(x x)g(t; θ)dx
x h(x)dx x
X ≈ x | T ≈ t, θ] ≈
P [X ≈R 0 0
≈R 0 0
.
P [T ≈ t|θ] x0 )=t} h(x )g(t; θ)dx
x0 |T (x
{x {x0 |T (x0 )=t} h(x )dx
X ≈ x | T = t, θ] ≈ h(x
P [X xdx
x).
Then,
X ≈ x | θ] = P [X
P [X X ≈ x | T = t, θ]P [T ≈ t | θ] ≈ h(x
x)dx
xP [T ≈ t | θ] = h(x
x)g(T (x
x); θ)dx
x.
X n
1
x | θ] =
fX |θ [x 2 n/2
exp{− (xi − θ)2 /2σ 2 }
(2πσ ) i=1
Xn Xn
1 2
=[ exp{− xi }] × [exp{(−2 xi + nθ)2 /2σ 2 }] = h(x
x)g(T (x
x); θ)
(2πσ 2 )n/2 i=1 i=1
Pn
x) =
where T (x i=1 xi ,
X n
1
x) =
h(x exp{− x2i }, and
(2πσ 2 )n/2 i=1
x) + nθ)2 /2σ 2 }.
x); θ) = exp{(−2T (x
g(T (x
152 CHAPTER 9. ESTIMATION
P
θ xi θ i xi
x | θ] =
fX |θ [x Πni=1 [ exp{−θ}] = x)g(T (x
exp{−nθ} = h(x x); θ)
xi ! Πi xi !
Pn
x) =
where T (x i=1 xi ,
(d) Assume that, given θ, X1 and X2 are i.i.d. N (θ, θ). Then, with X = (X1 , X2 ) and
x) = x1 + x2 ,
T (x
x)g(T (x
This function cannot be factorized in the form h(x x); θ).
Example 9.6.5. Assume that {Xn , n ≥ 1} are independent and uniformly distributed in
where
5
a = E[V1 ] =
2
and,
−1 −1
31 8 1
b V ar(V2 ) cov(V2 , V3 ) cov(V1 , V2 ) 0.0533
= = 180 3 6 =
8 5 1
c cov(V2 , V3 ) V ar(V3 ) cov(V1 , V3 ) 3 12 6 0.0591
5
Hence, L[V1 |V2 , V3 ] = 2 + 0.0533(V2 − 56 ) + 0.0591(V3 − 32 ).
9.6. SOLVED PROBLEMS 153
Example 9.6.6. Let the point (X, Y ) be picked randomly in the quarter circle {(x, y) ∈
<2+ | x2 + y 2 ≤ 1}.
a. Find L[X | Y ].
b. Find L[X 2 | Y ].
Z √1−y2 p
4 4 1 − y2
fY (y) = dx =
0 π π
Z 1
p Z π/2
2 24 1 − y2 4
E(Y ) = y dx = cos2 (θ)sin2 (θ)dθ
0 π 0 π
Z π/2 Z π/2
1 1
= sin2 (2θ)dθ = (1 − cos(4θ))dθ
0 π 0 2π
1 sin(4θ) π/2 1
= [ (x − )]0 = .
2π 4 4
Z 1
p Z 0
1 − y2 4y −4 2
E(Y ) = dy = z dz
0 π 1 π
−4z 3 0 4
= [ ] = .
3π 1 3π
1 16 4
Hence var(Y ) = 4 − 9π 2
= 0.0699. Similarly, E[X] = 3π .
Z √
1Z Z 11−x2
4 4
cov(XY ) = xy dydx = x(1 − x2 )dx
0 0 π 0 2π
4 x2 x4 1 1
= [ ( − )]0 = = 0.1592.
2π 2 4 2π
154 CHAPTER 9. ESTIMATION
Z √
1Z Z 1
1−x2
2 4 2 4 2
cov[X Y ] = x y dydx = x (1 − x2 )dx
0 0 π 0 2π
4 x3 x5 1 4
= [ ( − )]0 = = 0.0849.
2π 3 5 15π
cov(XY ) 0.1592
L[X|Y ] = (Y −E[Y ])+E[X] = (Y −0.4244)+0.4244 = 2.2775(Y −0.4244)+0.4244.
var(Y ) 0.0699
(b) Similarly,
cov(X 2 Y ) 0.0849
L[X 2 |Y ] = (Y −E[Y ])+E[X 2 ] = (Y −0.4244)+0.25 = 1.2146(Y −0.4244)+0.25.
var(Y ) 0.0699
Example 9.6.7. Let X, Y be independent random variables uniformly distributed in [0, 1].
Calculate L[Y 2 | 2X + Y ].
One has
Example 9.6.8. Let {Xn , n ≥ 1} be independent N (0, 1) random variables. Define Yn+1 =
0}. Calculate
E[Yn+m |Y0 , Y1 , . . . , Yn ]
for m, n ≥ 0.
Hint: First argue that observing {Y0 , Y1 , . . . , Yn } is the same as observing {Y0 , X1 , . . . , Xn }.
Second, get an expression for Yn+m in terms of Y0 , X1 , . . . , Xn+m . Finally, use the inde-
One has
...
Hence,
E[Yn+m | Y0 , Y1 , . . . , Yn ] = am Yn .
Example 9.6.9. Given {Θ = θ}, the random variables {Xn , n ≥ 1} are i.i.d. U [0, θ].
1
fX|Θ [x | θ]fΘ (θ) = 1{xk ≤ θ, k = 1, . . . , n}λe−λθ .
θn
Hence,
θ̂n = max{X1 , . . . , Xn }.
Consequently, by symmetry,
1
E[Θ − θ̂n |θ] = θ.
n+1
Finally,
1
E(|Θ − θ̂n |) = E(E[Θ − θ̂n |θ])) = .
λ(n + 1)
A few words about the symmetry argument. Consider a circle with a circumference
By symmetry, the average distance between two points is 1/(n + 1). Pick any one point and
open the circle at that point, calling one end 0 and the other end 1. The other n points are
distributed independently and uniformly on [0, 1]. So, the average distance between 1 and
Example 9.6.10. Let X, Y be independent random variables uniformly distributed in [0, 1].
a. Once again, we draw a unit square. Let R2 = X 2 + Y 2 . Given {R = r}, the pair
(X, Y ) is uniformly distributed on the intersection of the circumference of the circle with
If r < 1, then this intersection is the quarter of the circumference and we must calculate
Z π/2
2 2 π/2 2r
E[X|R = r] = E(r cos(θ)) = r cos(x) dx = [r sin(x) ]0 = .
0 π π π
If r > 1, then θ is uniformly distributed in [θ1 , θ2 ] where cos(θ1 ) = 1/r and sin(θ2 ) = 1/r.
Hence,
Z θ2
1
E[X|R = r] = E(r cos(θ)) = r cos(x) dx
θ1 θ2 − θ1
1 sin(θ2 ) − sin(θ1 )
= [r sin(x) ]θ2 =r
θ2 − θ1 θ1 θ2 − θ1
√
1 − r2 − 1
= .
sin−1 (1/r) − cos−1 (1/r)
b. Let V = X 2 + Y 2 . Then,
cov(X, V )
L[X|V ] = E(X) + (V − E(V )).
var(V)
9.6. SOLVED PROBLEMS 157
In addition,
8
var(V) = E(V 2 ) − (E(V ))2 = E(X 4 + 2X 2 Y 2 + Y 4 ) − (2/3)2 =
45
and
2
E(V ) = .
3
Consequently,
1 1/12 2 3 15
L[X|V ] = + (V − ) = + (X 2 + Y 2 ).
2 8/45 3 16 32
Yi = si X + Wi , i = 1, . . . , n
where W1 , . . . , Wn are independent N (0, 1) and X takes the values +1 and −1 with equal
known deterministic signal. Determine the MAP rule for deciding on X based on Y =
(Y1 , . . . , Yn ).
fY |X [yy | + 1]
L(yy ) = .
fY |X [yy | − 1]
Now,
1 1
fY |X [yy |X = x] = ΠN
i=1 [ √ exp{− (yi − si x)2 }].
2π 2
158 CHAPTER 9. ESTIMATION
Accordingly,
P
exp{− 12 ni=1 (yi − si )2 }
L(yy ) = P
exp{− 12 ni=1 (yi + si )2 }
n n
1X 2 1X
= exp{− (yi − si ) + (yi + si )2 }}
2 2
i=1 i=1
n
X
= exp{2 si yi } = exp{2yy ¦ ss}
i=1
where
n
X
y ¦ s := si yi .
i=1
Consequently,
+1, if y ¦ s > 0;
X̂ =
−1, if y ¦ s ≤ 0.
Y = gX + W
where X and W are independent zero-mean Gaussian random variables with respective
2 and σ 2 . Find L[X | Y ] and the resulting mean square error.
variances σX W
Yi = gi X + Wi , i = 1, 2
L[X | Y1 , Y2 ].
a. We know that
cov(X, Y ) cov(X, Y )
X̂ := L[X | Y ] = E(X) + (Y − E(Y )) = Y.
var(Y ) var(Y )
and
Hence,
2
gσX
L[X | Y ] = 2 + σ 2 Y.
g 2 σX W
cov(X, Y )
E((X̂ − X)2 ) = E(( Y − X)2 )
var(Y )
cov2 (X, Y ) cov(X, Y )
= var(Y ) − 2 cov(X, Y ) + var(X)
var2 (Y ) var(Y )
cov2 (X, Y )
= var(X) − .
var(Y )
Hence,
g 2 σX
4 2 σ2
σX
E((X̂ − X)2 ) = σX
2
− 2 + σ2 = W
2 + σ2 .
g 2 σX W g 2 σX W
where
Y T ) = E(X(Y1 , Y2 )) = σX
ΣXYY = E(XY 2
(g1 , g2 )
and
Y12 Y1 Y2 g12 σX
2 + σ2
W
2
g1 g2 σX
YY T) = E
ΣY = E(Y = .
Y2 Y1 Y22 2
g1 g2 σX g22 σX
2 + σ2
W
Hence,
−1
g12 σX
2+ σW 2 2
g1 g2 σX
2
X̂ = σX (g1 , g2 ) Y
2
g1 g2 σX g22 σX
2 + σ2
W
1 g22 σX
2 + σ2
W
2
−g1 g2 σX
2
= σX (g1 , g2 ) Y
g12 σX
2 σ2 + g2σ2 σ2 + σ4
W 2 X W W
2
−g1 g2 σX g12 σX
2 + σ2
W
σX2
= Y
2 + g 2 σ 2 + σ 2 (g1 , g2 )Y .
g12 σX 2 X W
160 CHAPTER 9. ESTIMATION
Thus,
σX2
X̂ = 2 + g 2 σ 2 + σ 2 (g1 Y1 + g2 Y2 ).
g12 σX 2 X W
Example 9.6.13. Suppose we observe Yi = 3Xi +Wi where W1 , W2 are independent N (0, 1)
1 1
X = x 1 := (1, 0)) = , P (X
P (X X = x 2 := (0, 1)) = ,
2 6
1 1
X = x 3 := (−1, 0)) = , and P (X
P (X X = x 4 := (0, −1)) = .
12 4
We know that
X | Y = y ] = arg maxP (X
M AP [X X = x )fY |X y | x ].
X [y
Now,
1 1
fY |X y | x] =
X [y exp{− [(y1 − 3x1 )2 + (y2 − 3x2 )2 ]}.
2π 2
Hence,
M AP [X x||2 − 2 ln(P (X
X | Y = y ] = arg min{||yy − 3x X = x )} =: arg min c(yy , x ).
Note that
y ¦ z := y1 z1 + y2 z2
and
3 i 2 1
X = x i ), for i = 1, 2, 3, 4.
x || − ln(P (X
αi = ||x
2 3
9.6. SOLVED PROBLEMS 161
y2
^ c1 < c 2
X = (0, 1) c1 < c 3
c4 < c 3
^ ^
X = (1, 0)
X^ = (-1, 0)
c2 < c 4
y1
= 0.1
c2 < c 3
^ = (0, -1)
X c1 < c 4
Thus, for every i, j ∈ {1, 2, 3, 4} with i 6= j, there is a line that separates the points y where
c(yy , x i ) < c(yy , x j ) from those where c(yy , x i ) > c(yy , x j ). These lines are the following:
The figure allows us to identify the regions of <2 that correspond to each of the values
162 CHAPTER 9. ESTIMATION
Random behaviors often become more tractable as some parameter of the system, such as
the size or speed, increases. This increased tractability comes from statistical regularity.
When many sources of uncertainty combine their effects, their individual fluctuations may
compensate one another and the combined result may become more predictable. For in-
stance, if you flip a single coin, the outcome is very unpredictable. However, if you flip a
large number of them, the proportion of heads is less variable. To make precise sense of
these limiting behaviors one needs to define the limit of random variables.
and X are functions. Thus, it takes some care to define the convergence of functions. For
the same reason, the meaning of “Xn is close to X” requires some careful definition.
We explain that there are a number of different notions of convergence of random vari-
ables, thus a number of ways of defining that Xn approaches X. The differences between
these notions may seem subtle at first, however they are significant and correspond to very
with convergence in distribution, then explain how to use transform methods to prove that
type of convergence. We then discuss almost sure convergence and convergence in proba-
bility and in L2 . We conclude with a discussion of the relations between these difference
163
164 CHAPTER 10. LIMITS OF RANDOM VARIABLES
Looking ahead, the strong law of large numbers is an almost sure convergence result;
the weak law is a convergence in probability result; the central limit theorem is about
Intuitively we can say that the random variables X and Y are similar if their cdf are about
the same, i.e., if P (X ≤ x) ≈ P (Y ≤ x) for all x ∈ <. For example, we could say that X is
almost a standard Gaussian random variable if this approximation holds when Y = N (0, 1).
It is in this sense that one can show that many random variables that occur in physical
The random variables Xn are said to converge in distribution to the random variable X,
and we write Xn →D X, if
lim FXn (x) = FX (x), for all x ∈ < such that FX (x) = FX (x−). (10.1.1)
n→∞
The restriction FX (x) = FX (x−) requires some elaboration. Consider the random
be able to say that Xn approaches X in distribution. You see that FXn (x) → FX (x) for all
x 6= 1. However, FXn (1) = 0 for all n and FX (1) = 1. The restriction in (10.1.1) takes care
of such discontinuity.
With this definition, you can check that if Xn and X are discrete with P (Xn = xm,n ) =
and limn→∞ xm,n = xm for n ≥ 1. In other words, the definition is conform to our intuition:
10.2 Transforms
here. (See the Central Limit Theorem in Section 11.3 for another example.) For n ≥ 1 and
p > 0, let X(n, p) be binomial with the parameters (n, p). That is,
µ ¶
n m
P (X(n, p) = m) = p (1 − p)n−m , m = 0, 1, . . . , n.
m
with mean λ. We do this by showing that E(z X(n,p) ) → E(z X ) for all complex numbers
z. These expected values are the z-transforms of the probability mass functions of the
random variables. One then invokes a theorem that says that if the z-transforms converge,
then so do the probability mass functions. In these notes, we do the calculation and we
accept the theorem. Note that X(n, p) is the sum of n i.i.d. random variables that are 1
with probability p and zero otherwise. If we designate one such generic random variable by
V (p), we have
E(z X(n,p) ) = (E(z V (p) ))n = ((1 − p) + pz)n → (1 + λ(z − 1)/n)n → exp{λ(z − 1)},
by (??). Also,
X (λ)n
E(z X ) = zn exp{−λ} = exp{λ(z − 1)}.
n
n!
166 CHAPTER 10. LIMITS OF RANDOM VARIABLES
A strong form of convergence is when the real numbers Xn (ω) approach the real number
X(ω) for all ω. In that case we say that the random variables Xn converge to the random
The expression “for almost all ω” means for all ω except possibly for a set of ω with
probability zero.
Anticipating future results, look at the fraction Xn of heads as you flip a fair coin n times.
You expect Xn to approach 1/2 for every possible realization of this random experiment.
However, the outcome where you always gets heads is such that the fraction of heads does
not go to 1/2. In fact, there are many conceivable sequences of coin flips where Xn does not
approach 1/2. However, all these sequences together have probability zero. This example
shows that it would be silly to insist that Xn (ω) → X(ω) for all ω. This condition would
Almost sure convergence is a very strong result. It states that the sequence of real
numbers Xn (ω) approaches the real number X(ω) for every possible outcome of the random
experiment. By possible, we mean here “outside of a set of probability zero.” Thus, as you
perform the random experiment, you find that the numbers Xn (ω) approach the number
X(ω).
The following observation is very useful in applications. Assume that Xn →a.s. X and
that g : < → < is a continuous function. Note that Xn (ω) → X(ω) implies by continuity
10.3. ALMOST SURE CONVERGENCE 167
that g(Xn (ω)) → g(X(ω)). Accordingly, we find that the random variables g(Xn ) converge
10.3.1 Example
A common technique to prove almost sure convergence is to use the Borel-Cantelli Lemma
To prove that fact, fix ² > 0. Note that, by Chebyshev’s inequality (4.8.1),
var(Xn ) 1
P (|Xn | ≥ ²) ≤ 2
≤ 2 2 , for n ≥ 1.
² n ²
Consequently,
∞
X ∞
X 1
P (|Xn | ≥ ²) ≤ < ∞.
n ²2
2
n=1 n=1
This almost shows that Xn →a.s. 0. Now, to be technically precise, we should show
such that |Xn (ω)| ≤ ² whenever n ≥ 0. What is missing in our derivation is that the set A
may depend on ². To fix this problem, let An be the set A that corresponds to ² = 1/n and
P
choose A = ∪n An . Then P (A) ≤ n P (An ) = 0 and that set has the required property.
168 CHAPTER 10. LIMITS OF RANDOM VARIABLES
It may be that Xn and X are more and more likely to be close as n increases. In that case,
we say that the random variables Xn converge in probability to the random variable X.
if
and almost sure convergence. Let us try to clarify the difference. If Xn →P X, there is no
Here is an example that illustrates the point. Let {Ω, F, P } be the interval [0, 1) with
the uniform distribution. The random variables {X1 , X2 , . . .} are equal to zero everywhere
except that they are equal to one on the intervals [0, 1), [0, 1/2), [1/2, 1), [0, 1/4), [1/4, 2/4),
[2/4, 3/4), [3/4, 1), [0, 1/8), [1/8, 2/8), [2/8, 3/8), [3/8, 4/8), [4/8, 5/8), . . ., respectively.
Thus, X1 = 1, X2 (ω) = 1{ω ∈ [0, 1/2)}, X9 (ω) = 1{ω ∈ [1/8, 2/8)}, and so on. From this
definition we see that P (Xn 6= 0) → 0, so that Xn →P 0. Moreover, for every ω < 1 the
sequence {Xn (ω), n ≥ 1} contains infinitely many ones. Accordingly, Xn does not converge
Thus it is possible for the probability of An := {|Xn − X| > ²} to go to zero but for
those sets to keep on “scanning” Ω and, consequently, for |Xn (ω) − X(ω)| to be larger than
Imagine that you simulate the sequence {Xn , n ≥ 1} and X. If the sequence that you
observe from your simulation run is such that Xn (ω) 9 X(ω), then you can conclude that
10.5. CONVERGENCE IN L2 169
10.5 Convergence in L2
Another way to say that X and Y are close to each other is when E(|X − Y |2 ) is small.
definition.
This definition has an intuitive meaning: the error becomes small in the mean squares
sense. If you recall our discussion of estimators and the interpretation of MMSE and LLSE
in a space with the metric based on the mean squared error, then convergence in L2 is
precisely convergence in that metric. Not surprisingly, this notion of convergence is well
10.6 Relationships
All these convergence notions make sense. How do they relate to one another? Figure 10.1
provided a summary.
The discussion below of these implications should help you clarify your understanding
We explained above that convergence in probability does not imply almost sure conver-
gence. We now prove that the opposite implication is true. That is, we prove the following
theorem.
170 CHAPTER 10. LIMITS OF RANDOM VARIABLES
Theorem 10.6.1. Let Xn , n ≥ 1 and X be random variable defined on the same probability
Proof:
Assume that Xn →a.s. X. We show that for all ² > 0, P (An ) → 0 where An := {ω |
To do this, we define Bn = ∪∞
m=n Am . Note that
Bn ↓ B := ∩∞ ∞ ∞
n=1 Bn = ∩n=1 ∪m=n An = {ω | ω ∈ An for infinitely many values of n}.
Recapitulating, the idea of the proof is that if Xn (ω) → X(ω), then this ω is not in An
for all n large enough. Therefore, the only ω’s that are in B must be those where Xn does
not converge to X, and these ω’s have probability zero. The consideration of Bn instead of
An is needed because the sets An may not decrease even though they are contained in the
Theorem 10.6.2. Let Xn , n ≥ 1 and X be random variable defined on the same probability
Proof:
E(|Xn − X|2 )
P (|Xn − X| > ²) ≤ .
²2
This inequality shows that if Xn →L2 X, then P |Xn − X| > ²) → 0 for any given ² > 0,
so that Xn →P X. ¤
It is useful to have a simple example in mind that shows that convergence in probability
does not imply convergence in L2 . Such an example is as follows. Let {Ω, F, P } be [0, 1] with
the uniform probability. Define Xn (ω) = n × 1{ω ≤ 1/n}. You can see that Xn →a.s. 0.
This example also shows that Xn →a.s. X does not necessarily imply that E(Xn ) →
We explain in the next section that some simple sufficient condition guarantee the equality
We turn our attention to the last implication in Figure 10.1. We show the following
result.
Proof:
x + ²). If ² is small enough, both side of this inequality are close to P (X ≤ x), so that we
Fix α > 0. First, find ² > 0 so that P (X ∈ [x − ², x + ²]) ≤ α/2. This is possible because
possible because Xn →P X.
Observe that
Also,
n ≥ n0 ,
so that
P (Xn ≤ x) ≤ FX (x) + α.
so that
These two inequalities imply that |FX (x) − P (Xn ≤ x)| ≤ α for n ≥ n0 . Hence Xn →D X.
Assume Xn →as X. We gave an example that show that it is generally not the case that
E(Xn ) → E(X). (See 10.6.1).) However, two simple sets of sufficient conditions are known.
10.7. CONVERGENCE OF EXPECTATION 173
b. Assume Xn →as X and |Xn | ≤ Y for all n with E(Y ) < ∞. Then E(Xn ) → E(X).
We refer the reader to probability textbooks for a proof of this result that we use in
examples below.
174 CHAPTER 10. LIMITS OF RANDOM VARIABLES
Chapter 11
We started the course by saying that, in the long term, about half of the flips of a fair
coin yield tail. This is our intuitive understanding of probability. The law of large number
explains that our model of uncertain events conforms to that property. The central limit
theorem tells us how fast this convergence happens. We discuss these results in this chapter.
Let {Xn , n ≥ 1} be i.i.d. random variables with mean µ and finite variance σ 2 . Let also
Yn = (X1 + · · · + Xn )/n be the sample mean of the first n random variables Then
Yn →P µ as n → ∞.
Proof:
X1 + · · · + Xn 1 X1 + · · · + Xn
P (| − µ| ≥ ²) ≤ 2 E(( − µ)2 ).
n ² n
175
176 CHAPTER 11. LAW OF LARGE NUMBERS & CENTRAL LIMIT THEOREM
Now, E( X1 +···+X
n
n
− µ) = 0, so that
X1 + · · · + Xn X1 + · · · + Xn
E(( − µ)2 ) = var( ).
n n
We know that the variance of a sum of independent random variables is the sum of their
X1 + · · · + Xn 1 1 σ2
var( ) = 2 var(X1 + · · · + Xn ) = 2 × nσ 2 = .
n n n n
X1 + · · · + Xn σ2
P (| − µ| ≥ ²) ≤ → 0 as n → ∞.
n n
Hence,
X1 + · · · + Xn
→P 0 as n → ∞.
n
Theorem 11.2.1. Strong Law of Large Numbers Let {Xn , n ≥ 1} be i.i.d. random variables
with mean µ and finite variance σ 2 . Let also Yn = (X1 + · · · + Xn )/n be the sample mean
Yn →a.s. µ as n → ∞.
The proof of this result is a bit too technical for this course.
11.3. CENTRAL LIMIT THEOREM 177
The next result estimates the speed of convergence of the sample mean to the expected
value.
Theorem 11.3.1. CLT Let {Xn , n ≥ 1} be i.i.d. random variables with mean µ and finite
variance σ 2 . Then
√
Zn := (Yn µ) n →D N (0, σ 2 ) as n → ∞.
√
This result says (roughly) that the error Yn µ is of order σ/ n when n is large. Thus,
if one makes four times more observations, the error on the mean estimate is reduced by a
factor of 2.
Equivalently,
√
n
(Yn µ) →D N (0, 1) as n → ∞.
σ
Proof:
2 /2
Here is a rough sketch of the proof. We show that E(eiuZn ) → e−u and we then
invoke a theorem that says that if the Fourier transform of fZn converge to that of fV , then
Zn →D V .
We find that
1
E(eiuZn ) = E(exp{iu √ ((X1 − µ) + · · · + (Xn − µ))})
σ n
1
= [E(exp{iu √ (X1 − µ)})]n
σ n
1 1 1
≈ [1 + iu √ E(X1 − µ) + (iu √ )2 E((X1 − µ)2 )]n
σ n 2 σ n
2
u n 2
≈ [1 − ] ≈ e−u /2 ,
2n
by (??).
The formal proof must justify the approximations. In particular, one must show that one
can ignore all the terms of order higher than two in the Taylor expansion of the exponential.
¤
178 CHAPTER 11. LAW OF LARGE NUMBERS & CENTRAL LIMIT THEOREM
In the CLT, one must know the variance of the random variables to estimate the convergence
rate of the sample mean to the mean. In practice it is quite common that one does not
Theorem 11.4.1. Approximate CLT Let {Xn , n ≥ 1} be i.i.d. random variables with mean
µ and such that E(Xnα ) < ∞ for some α > 2. Let Yn = (X1 + · · · + Xn )/n, as before. Then
√
n
(Yn µ) →D N (0, 1) as n → ∞
σn
where
(X1 − Yn )2 + · · · + (Xn − Yn )2
σn2 = .
n
That result, which we do not prove here, says that one can replace the variance of the
Using the approximate CLT, we can construct a confidence interval about our sample mean
estimate. Indeed, we can say that when n gets large, (see Table 7.1)
√
n
P (|(Yn − µ) | > 2) ≈ 5%,
σn
so that
σn σn
P (µ ∈ [Yn − 2 √ , Yn + 2 √ ]) ≈ 95%.
n n
We say that
σn σn
[Yn − 2 √ , Yn + 2 √ ]
n n
Similarly,
σn σn
[Yn − 2.6 √ , Yn + 2.6 √ ]
n n
1.6σn 1.6σn
[µn − √ , µn + √ ] is the 90% confidence interval for µ.
n n
11.6 Summary
The Strong Law of Large Numbers (SLLN) and the Central Limit Theorem (CLT) are very
The SLLN says that the sample mean Yn of n i.i.d. random variables converges to their
√
expected value. The CLT shows that the error multiplied by n is approximately Gaussian.
σn σn
[Yn − α √ , Yn − α √ ]
n n
where σn is the sample estimate of the standard deviation. The probability that the mean
is in that interval is approximately P (|N (0, 1)| ≤ α). Using Table 7.1 one can select the
Example 11.7.1. Let {Xn , n ≥ 1} be independent and uniformly distributed in [0, 1].
sin(X1 )+···+sin(Xn )
a. Calculate limn→∞ n .
Remember that the Strong Law of Large Numbers says that that if Xn , n ≥ 1 are i.i.d
c. Note that
sin(X1 ) + · · · + sin(Xn )
| | ≤ 1.
n
Consequently, the result of part (a) and Theorem 10.7.1 imply that
sin(X1 ) + · · · + sin(Xn )
lim E( ) = 0.4597.
n→∞ n
X1 + · · · + Xn
lim E(sin( )) = 0.5.
n→∞ n
Example 11.7.2. Let {Xn , n ≥ 1} be i.i.d. N (0, 1). What can you say about
as n → ∞?
Example 11.7.3. Let {Xn , n ≥ 1} be i.i.d. with mean µ and finite variance σ 2 . Let
X1 + · · · + Xn (X1 − µn )2 + · · · + (Xn − µn )2
µn := and σn2 = . (11.7.1)
n n
Show that
σn2 →a.s. σ 2 as n → ∞.
By the SLLN, µn →a.s. µ. You can then show that σn2 − s2n →a.s. 0 where
Example 11.7.4. We want to poll a population to estimate the fraction p of people who
will vote for Bush in the next presidential election. We want to to find the smallest number
of people we need to poll to estimate p with a margin of error of plus or minus 3% with 95%
confidence.
For simplicity it is assumed that the decisions Xn of the different voters are i.i.d. B(p).
2σn 2σn
[µn − √ , µn + √ ] is the 95% confidence interval for µ.
n n
We also know that the variance of the random variables is bounded by 1/4. Indeed,
1 1
[µn − √ , µn + √ ] contains the 95% confidence interval for µ.
n n
182 CHAPTER 11. LAW OF LARGE NUMBERS & CENTRAL LIMIT THEOREM
1
√ ≤ 3%,
n
i.e.,
1 2
n≥( ) = 1112.
0.03
and p1 = 0.55. Let α = 0.5% and X̂ = HT [X | Y ]. Find the value of n needed so that
P [X̂ = 0 | X = 1] = α. Use the CLT to estimate the probabilities. (Note that X̂ is such
√ Y1 + · · · + Yn
Zn := n[ − 0.5] ≈ N (0, 1/4).
n
Then,
√ y0
P (Y1 + · · · + Yn > y0 ) = P (Zn > n( − 0.5))
n
1 √ y0 √ y0
≈ P (N (0, ) > n( − 0.5))) = P (N (0, 1) > 2 n( − 0.5)).
4 n n
We need
√ y0
2 n( − 0.5)) ≥ 2.6
n
for this probability to be 0.5%. Thus,
√
y0 = 0.5n + 1.3 n.
Then
By CLT,
√ W1 + · · · + Wn
Un := n[ − 0.55] ≈ N (0, 1/4).
n
(We used the fact that 0.55(1 − 0.55) ≈ 1/4.)
Thus,
√ y0
P (W1 + · · · + Wn < y0 ) = P (Un < n[ − 0.55])
n
1 √ y0 √ y0
≈ P (N (0, ) < n[ − 0.55]) = P (N (0, 1) < 2 n[ − 0.55])
4 n √ n
√ 0.5n + 1.3 n
= P (N (0, 1) < 2 n[ − 0.55]).
n
Example 11.7.6. Let {Xn , n ≥ 1} be i.i.d. random variables that are uniformly distributed
√ X1 + · · · + Xn 1
lim P ( n| − | > ²)
n→∞ n 2
√ X1 + · · · + Xn 1
n| − | →D N (0, σ 2 )
n 2
where
1 1 1
σ 2 = var(X1 ) = E(X12 ) − (E(X1 ))2 = − = .
3 4 12
184 CHAPTER 11. LAW OF LARGE NUMBERS & CENTRAL LIMIT THEOREM
Hence,
√ X1 + · · · + Xn 1 1
P ( n| − | > ²) ≈ P (|N (0, )| > ²).
n 2 12
Now,
1 1
P (|N (0, )| > ²) = P (| √ N (0, 1)| > ²)
12 12
√ √ √
= P (|N (0, 1)| > ² 12) = 2P (N (0, 1) > ² 12) = 2Q(² 12).
Consequently,
√ X1 + · · · + Xn 1 √
lim P ( n| − | > ²) = 2Q(² 12).
n→∞ n 2
Example 11.7.7. Given X, the random variables {Yn , n ≥ 1} are exponentially distributed
We choose ρ so that
Xn
P[ Yi > ρ | X = 1] = β.
i=1
√
Let us write ρ = n + α n and find α so that
Xn
√
P[ Yi > n + α n | X = 1] = β.
i=1
Equivalently,
Xn
√
P [( Yi − n)/ n > α | X = 1] = β.
i=1
11.7. SOLVED PROBLEMS 185
with mean p ∈ (0, 1). We construct an estimate p̂n of p from {X1 , . . . , Xn }. We know that
|p̂n − p|
P( ≤ 5%) ≥ 95%.
p
For Bernoulli Random Variables we have E[Xn ] = p and var[Xn ] = p(1 − p). Let
p̂n = (X1 + · · · + Xn )/n. We know that p̂n →as p. Moreover, from the CLT,
√ p̂n − p
np →D N (0, 1).
p(1 − p)
Now,
√
|p̂n − p| √ p̂n − p np
≤ 0.05 ⇔ | n p | ≤ 0.05 √ .
p p(1 − p) 1−p
Hence, for n large,
√
|p̂n − p| 0.05 × np
P( ≤ 0.05) ≈ P (|N (0, 1)| ≤ √ ).
p 1−p
Using (7.1) we find P (|N (0, 1)| ≤ 2) ≥ 0.95. Hence, P ( |p̂np−p| ≤ 0.05) if
√
0.05 × np
√ ≥ 2,
1−p
i.e.,
1−p
n ≥ 1600 =: n0 .
p
Since we know that p ∈ [0.4, 0.6], the above condition implies 1067 ≤ n0 ≤ 2400. Hence
Example 11.7.9. Given X, the random variables {Yn , n ≥ 1} are independent and Bernoulli(X).
the estimator that minimizes P [X̂ = 0.5|X = 0.6] subject to P [X̂ = 0.6|X = 0.5] ≤ 5%.
Using the CLT, estimate the smallest value of n so that P [X̂ = 0.5|X = 0.6] ≤ 5%.
H0 : X = 0.5
H1 : X = 0.6.
Pn
where k = i=1 yi is the number of heads in n Bernoulli trials. Note that L(yy ) is an
Xn
0.05 = P [X̂ = 0.6|X = 0.5] = P ( Yi ≥ k0 ).
i=1
Using (7.1) we find P (N (0, 1) ≥ 1.7) ≈ 0.05. Hence, we need to find k0 such that
r
n 1
( k0 − 0.5) ≈ 1.7. (11.7.2)
0.25 n
Hence, for n large, using (7.1) we find P [X̂ = 0.5|X = 0.6] ≤ 5% is equivalent to
r
n 1
( k0 − 0.6) ≈ −1.7. (11.7.3)
0.24 n
11.7. SOLVED PROBLEMS 187
Ignoring the difference between 0.24 and 0.25, we find that the two equations (11.7.2)-
(11.7.3) imply
1 1
k0 − 0.5 ≈ −( k0 − 0.6),
n n
1
which implies n k0 ≈ 0.55. Substituting this estimate in (11.7.2), we find
r
n
(0.55 − 0.5) ≈ 1.7,
0.25
Example 11.7.10. The number of days that a certain type of component functions before
Once the component fails it is immediately replaced by another one of the same type. If
P
we designate by Xi the lifetime of the ith component to be put to use, then Sn = ni=1 Xi
n
r = lim .
n→∞ Sn
b. Use the CLT to determine how many components one would need to have to be
n Sn Sn −1 2
r = lim = lim [ ]−1 = [ lim ] = [ ]−1 = 1.5 failures per day.
n→∞ Sn n→∞ n n→∞ n 3
(Note: In the third identity we can interchange the function g(x) = x−1 and the limit
b. Since there are about 1.5 failures per day we will need a few more than 365×1, 5 = 548
components. We use the CLT to estimate how many more we need. You can verify that
var(X) = 1/18. Let Z =D N (0, 1). We want to find k so that P (Sk > 365) ≥ 0.95. Now,
365 − k(2/3)
p ≤ −1.7,
k/18
or equivalently,
2 p 2 1
k − 365 ≥ 1.7 k/18, i.e., (k − 365)2 ≥ (1.7)2 k ,
3 3 18
or
4 2 4 1 4 4 1
k − × 365k + (365)2 ≥ (1.7)2 k , i.e., k 2 − [ × 365 + (1.7)2 ]k + (365)2 ≥ 0.
9 3 18 9 3 18
That is,
This implies
p
486.6 + (486.6)2 − 4 × 0.444 × 133, 225
k≥ ≈ 563.
2 × 0.444
Chapter 12
interested in the evolution in time of random variables. For instance, one watches on an
oscilloscope the noise across two terminals. One may observe packets that arrive at an
Internet router, or cosmic rays hitting a detector. If ω designates the outcome of the
random experiment (as usual) and t the time, then one is interested in a collection of
random variables X = {X(t, ω), t ∈ T } where T designates the set of times. Typically,
[0, ∞) or T = (−∞, ∞). When T is countable, one says that X is a discrete-time random
random process.
Similarly, a random process is characterized by the joint cdf of any finite collection of
the random variables. These joint cdf are called the finite dimensional distributions of
the random process. For instance, to specify a random process {Xt , t ≥ 0} one must
specify the joint cdf of {Xt1 , Xt2 , . . . , Xtn } for any value of n ≥ 1 and any 0 < t1 < · · · <
189
190 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
variables that have well-defined joint cdf. In some cases however, one specifies the finite
That is, the finite dimensional distribution specifies the joint cdf of a set of random variables
must have marginal distributions for subsets of them that agree with the finite dimensional
distribution of that subset. For instance, the marginal distribution of Xt1 obtained from
the joint distribution of (Xt1 , Xt2 ) should be the same as the distribution specified for Xt1 .
Remarkably, these obviously necessary conditions are also sufficient! This result is known
In this section, we look at two simple random processes: the Bernoulli process and the
Poisson process. We consider two other important examples in the next two chapters.
The discrete-time random process is called the Bernoulli process. This process models
flipping a coin.
Assume we have watched the first n coin flips. How long do we wait for the next 1? That
is, let
We want to calculate
P [τ > m|X1 , X2 , . . . , Xn ].
12.1. BERNOULLI PROCESS 191
Assume that we have been flipping the coin forever. How has it been since the last 1? Let
0, . . . , Xn−m−1 = 0, Xn−m = 1 and the probability of that event is p(1 − p)m . Thus,
There are two ways of looking at the time between two successive 1s.
The first way is to choose some time n and to look at the time since the last 1 and the
time until the next 1. We know that these times have mean (1/p) − 1 and 1/p, respectively.
In particular, the average time between these two 1s around some time n is (2/p) − 1. Note
The second way is to start at some time n, wait until the next 1 and then count the
time until the next 1. But in this way, we find that the time between two consecutive 1s is
The two different answers that we just described are called the Saint Petersburg paradox.
The way to explain it is to notice that when we pick some time n and we look at the previous
and next 1s, we are more likely to have picked an n that falls in a large gap between two
1s than an n that falls in a small gap. Accordingly, we expect the average duration of the
192 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
interval in which we have picked n to be larger than the typical interval between consecutive
mean 1/p, then if we know that τ is larger than n, then τ −n is still geometrically distributed
Intuitively, this result is immediate if we think of τ as the time until the next 1 for a
Bernoulli process. Knowing that we have already flipped the coin n times and got 0s does
not change how many more times we still have to flip it to get the next 1. (Remember that
with probability p and loses 1 otherwise at each step of a game; the steps are all independent.
Two typical evolutions of Yn are shown in Figure 12.1 when Y0 = 0. The top graph
corresponds to p = 0.54, the bottom one to p = 0.46. The sequence Yn is called a random
Figure 12.1: Random Walk with p = 0.54 (top) and p = 0.46 (bottom).
Assume the gambler plays the above game and starts with an initial fortune Y0 = A > 0.
What is the probability that her fortune will reach B > A before it reaches 0 and she is
a similar way. We want to compute α(A) = P [TB < T0 |Y0 = A]. A moment (or two) of
Note that these equations are derived by conditioning of the first step of the process.
The boundary conditions of these ordinary difference equations are α(0) = 0 and α(B) =
194 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
Figure 12.2: Reflected Random Walk with p = 0.43 (top) and p = 0.3 (bottom).
For instance, with p = 0.48, A = 100, and B = 1000, one finds α(A) = 2 × 10−35 .
Assume now that you have a rich uncle who gives you $1.00 every time you go bankrupt,
so that you can keep on playing the game forever. To define the game precisely, say that
when you hit 0, you can still play the game. If you lose again, you are back at 0, otherwise
to have 1, and so on. The resulting process is called a reflected random walk. Figure 12.2
shows a typical evolution of your fortune. The top graph corresponds to p = 0.43 and the
other one to p = 0.3. Not surprisingly, with p = 0.3 you are always poor.
One interesting question is how much money you have, on average. For instance, looking
at the lower graph, we can see that a good fraction of the time your fortune is 0, some other
fraction of the time it is 1, and a very small fraction of the time it is larger than 2. How
van we calculate these fractions? One way to answer this question is as follows. Assume
12.1. BERNOULLI PROCESS 195
that π(k) = P (Yn = k) for k ≥ 1 and all n ≥ 0. That is, we assume that the distribution
of Yn does not depend on n. Such a distribution π is said to be invariant. Then, for k > 0,
one has
so that
so that
= pπ(0) + (1 − p)π(1).
The above identities are called the balance equations. You can verify that the solution
of these equations is
p
π(k) = Aρk , k ≥ 0 where ρ = .
1−p
P
Since k π(k) = 1, we see that a solution is possible if ρ < 1, i.e.. if p < 0.5. The
p
π(k) = (1 − ρ)ρk , k ≥ 0 where ρ = when p < 0.5.
1−p
When p ≥ 0.5, there is not solution where P (Yn = k) does not depend on n. To
understand what happens in that case, lets look at the case p = 0.6. The evolution of the
The graph shows that, as time evolves, you get richer and richer. The case p = 0.5 is
In this case, the fortune fluctuates and makes very long excursions. One can show that
the fortune comes back to 0 infinitely often but that it takes a very long time to come back.
So long in fact that the fraction of time that the fortune is zero is negligible. In fact, the
fraction of the time that the fortune is any given value is zero!
How do we show all this? Let p = 0.5. We know that P [TB < T0 |Y0 = A] = A/B. Thus,
P [T0 < TB |Y0 = A] = (B − A)/B. As B increases to infinity, we see that TB also increases
to infinity, to that P [T0 < TB |Y0 = A] increases to P [T0 < ∞|Y0 = A] (by continuity of
to infinity). This shows that the fortune comes back to 0 infinitely often. We can calculate
how longs it takes to come back to 0. To do this, define β(A) = E[min{T0 , TB }|Y0 = A].
Arguing as we have for the gamblers ruin problem, you can justify that
Also, β(0) = 0 = β(B). You can verify that the solution is β(A) = A(B − A). Now,
min{T0 , TB } increases to T0 . It follows from Theorem 10.7.1 that E[T0 |Y0 = A] = ∞ for all
A > 0.
Going back to the case p > 0.5, we see that the fortune appears to be increasing at an
approximately constant rate. In fact, using the SLLN we can see that
We can use that observation to say that Yn grows at rate 2p − 1. We can make this more
precise by defining the scaled process Zk (t) := Y[kt] /k where [kt] is the smallest integer
larger than or equal to kt. We can then say that Zk (t) → (2p − 1)t as k → ∞. The
convergence is almost sure for any given t. However, we would like to say that this is true
for the trajectories of the process. To see what we mean by this convergence, lets look as a
These three scaled versions show that the fluctuations gets smaller as we scale and that
Of course, if we look very closely at the last graph above, we can still see some fluctuations.
One way to see these well is to blow up the y-axis. We show a portion of this graph in
Figure 12.6.
12.1. BERNOULLI PROCESS 199
The fluctuations are still there, obviously. One way to analyze these fluctuations is to
use the central limit theorem. Using the CLT, we find that
where σ 2 = p(1 − p). This shows that, properly scaled, the increments of the fortune look
Gaussian. The case p = 0.5 is particularly interesting, because then we do not have to
Yn+m − Yn
√ ≈ N (0, 1/4).
m
√
We can then scale the process differently and look at Zk (t) = Y[kt] / k. We find that as
k becomes very large, the increments of Zk (t) become independent and Gaussian. In fact,
W (t) with independent increments such that W (t + u) − W (t) = N (0, u). Such a process
number of jumps in [0, t]. The jump times are {Tn , n ≥ 1} where {T1 , T2 − T1 , T3 − T2 , T4 −
T3 , . . .} are i.i.d. and exponentially distributed with mean 1/λ where λ > 0. Thus, the
times between jumps are exponentially distributed with mean 1/λ and are all independent.
Recall that the exponential distribution is memoryless. This implies that, for any t > 0,
given {X(s), 0 ≤ s ≤ t}, the process {X(s) − X(t), s ≥ t} is again Poisson with rate λ.
The number of jumps in [0, t], X(t), is a Poisson random variable with mean λt. That is,
Indeed, the jump times {T1 , T2 , . . .} are separated by i.i.d. Exp(λ) random variables.
= (λ²)n exp{−λt}.
Hence, the probability that there are n jumps in [0, t] is the integral of the above density
on the set S = {t1 , . . . , tn |0 < t1 < . . . < tn < t}, i.e., |S|λn exp{−λt} where |S| designates
the volume of the set S. This set is the fraction 1/n! of the cube [0, t]n . Hence |S| = tn /n!
and
12.2. POISSON PROCESS 201
We know, from the strong law of large numbers, that X(nt)/n → λt almost surely as
The claim is that if p → 0 and k → ∞ in a way that kp → λ, then the process {Wk (t) =
Ykt , t ≥ 0} approaches a Poisson process with rate λ. This property follows from the
12.2.5 Sampling
Imagine customers that arrive as a Poisson process with rate λ. With probability p, a
customer is male, independently of the other customers and of the arrival times. The claim
is that the processes {X(t), t ≥ 0} and {Y (t), t ≥ 0} that count the arrivals of male and
female customers, respectively, are independent Poisson processes with rates λp and λ(1−p).
Using Example 5.5.13, we know that X(t) and Y (t) are independent random variables with
means λpt and λ(1 − p)t, respectively. To conclude the proof, one needs to show that the
increments of X(t) and Y (t) are independent. We leave you the details.
202 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
The Saint Petersburg paradox holds for Poisson processes. In fact, the Poisson process seen
from an arrival time is the Poisson process plus one point at that time.
12.2.7 Stationarity
Let X = {X(t), t ∈ <} be a random process. We say that it is stationary if {X(t+u), t ∈ <}
has the same distribution for all u ∈ <. In other words, the statistics do not change over
time. We cannot use this process to measure time. This would be the case of the weather
if there were no long-term trend such as global warming or even seasonal effects. As an
exercise, you can show that our reflected fortune process with p < 0.5 and started with
Let X = {X(t), t ∈ <} be a random process. We say that it is time-reversible if {X(t), t ∈ <}
and {X(u−t), t ∈ <} have the same distribution for all u ∈ <. In other words, the statistics
Also, you should show that our reflected fortune process with p < 0.5 and started with
12.2.9 Ergodicity
Roughly, a stochastic process is ergodic if statistics that do not depend on the initial phase
of the process are constant. That is, such statistics do not depend on the realization of the
process. For instance, if you simulate an ergodic process, you need only one simulation run;
Let {X(t), t ∈ <} be a random process. We compute some statistic Z(ω, u) = φ(X(ω, t+
u), t ∈ <). That is, we perform calculations on the process starting at time u. We are
interested in calculations such that Z(ω, u) = Z(ω, 0) for all u. (We give examples shortly;
don’t worry.) Let us call such calculations “invariant” random variables of the process X.
The process X is ergodic if all its invariant random variables are constant.
RT
As an example, let Z(ω, u) = limT →∞ (1/T ) 0 X(u + t)dt. This random variable is
invariant. If the process is ergodic, then Z is the same for all ω: it is constant.
You should show that our reflected fortune process with p ≤ 0.5 is ergodic. The trick
is that this process must eventually go back to 0. One can then couple two versions of
the process that start off independently and merge the first time they meet. Since they
remained glued forever, the long-term statistics are the same. (See Example 14.8.14 for an
12.2.10 Markov
A random process {X(t), t ∈ <} is Markov if, given X(t), the past {X(s), s < t} and the
future {X(s), s > t} are independent. Markov chains are examples of Markov process.
A process with independent increments is Markov. For instance, the Poisson process
For a simple example of a process that is not Markov, let Yn be the reflected fortune
It is not Markov because if you see that Wn−1 = 0 and Wn = 1, then you know that
P [W2 = 0|W1 = 1, W0 = 0] = p.
However,
Note that this example shows that a function of a Markov process may not be a Markov
process.
c. Calculate L[N2 | N1 , N3 ].
λn −λ
a. We know that N1 =d P (λ), so that P (N1 = n) = n! e .
b. Same.
L[U + V | U, U + V + W ]
where the random variables U, V, W are i.i.d. P (λ). We could use the straightforward
approach, with the general formula. However, a symmetry argument turns out to be easier.
X := L[U + V | U, U + V + W ] = L[U + W | U, U + V + W ].
Consequently,
2X = L[(U + V ) + (U + W ) | U, U + V + W ] = U + (U + V + W ).
Hence,
1 1
X = L[N2 | N1 , N3 ] = (U + (U + V + W )) = (N1 + N3 ).
2 2
d. We are given the three i.i.d. P (λ) random variables U, V, W defined earlier. Then
λu+v+w −3λ
P [U = u, V = v, W = w | λ] = e .
u!v!w!
12.2. POISSON PROCESS 205
λn e−3λ ,
n
λ= .
3
Hence,
N3
M LE[λ | N1 , N2 , N3 ] = .
3
Example 12.2.2. Let N = {Nt , t ≥ 0} be a counting process. That is, Nt+s − Ns ∈
{0, 1, 2, . . .} for all s, t ≥ 0. Assume that the process has independent and stationary in-
crements. That is, assume that Nt+s − Ns is independent of {Nu , u ≤ s} for all s, t ≥ 0
and has a p.m.f. that does not depend on s. Assume further that P (Nt > 1 ) = o(t). Show
that N is a Poisson process. (Hint: Show that the times between jumps are exponentially
Let
a(s) = P (Nt+s = Nt ).
Then
a0 (s + u) = a0 (s)a(u).
At s = 0, this gives
a0 (u) = a0 (0)a(u).
The solution is a(u) = a(0) exp{a0 (0)u}, which shows that the distribution of the first jump
that the times between successive jumps are i.i.d. and exponentially distributed.
206 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
Example 12.2.3. Construct a counting process N such that for all t > 0 the random
variable Nt is Poisson with mean λt but the process is not a Poisson process.
Assume that At is a Poisson process with rate λ. Let F (t; x) = P (Nt ≤ x) for x, t ≥ 0.
Let also G(t; u) be the inverse function of F (t; .). That is, G(t; u) = min{x | F (t; x) ≥ u}
for u ∈ [0, 1]. Then we know that if we pick ω uniformly distributed in [0, 1], the random
variable G(t; ω) has p.d.f. F (t; x), i.e., is Poisson with mean λt. Consider the process
N = {Nt (ω) := G(t; ω), t ≥ 0}. It is such that Nt is Poisson with mean λt for all t.
However, if you know the first jump time of Nt , then you know ω and all the other jump
Example 12.2.4. Let N be a Poisson process with rate λ and define Xt = X0 (−1)Nt for
c. Assume that p = 0.5, so that, by symmetry, P (Xt = 1) = 0.5 for all t ≥ 0. Calculate
E(Xt+s Xs ) for s, t ≥ 0.
a. Yes, the process has independent increments. This can be seen by considering the
= P (N (t, s + t] is even)
Similarly,
= P (N (t, s + t] is odd)
12.2. POISSON PROCESS 207
Similarly, we can show independent increments for the case when X(t) = −1
∞
X (λt)2i e−λt
P (N (0, t] is even) =
2i!
i=0
X∞ ∞
e−λt (λt)i X (−λt)i
= ( + )
2 i! i!
i=0 i=0
e−λt + eλt
= e−λt ( )
2
1 + e−2λt
=
2
1−e−2λt
Similarly we get P (N (0, t] is odd) = 2
c. Assume that p = 0.5, so that, by symmetry, P (Xt = 1) = 0.5 for all t ≥ 0. Calculate
E(Xt+s Xs ) for s, t ≥ 0.
Example 12.2.5.
208 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
Problem 5. Let N be a Poisson process with rate λ. At each jump time Tn of N , a random
number Xn of customers arrive at a cashier waiting line. The random variables {Xn , n ≥ 1}
are i.i.d. with mean µ and variance σ 2 . Let At be the number of customers who arrived by
time t, for t ≥ 0.
At − µλt
√
t
as t → ∞?
P
a. We have, E(At |Nt = n) = E( ni=1 Xi ) = nµ and
∞
X
E(At ) = E(At |Nt = n)P (Nt = n)
i=1
∞
X
= µ nP (Nt = n)
i=1
= µλt
Xn
E(A2t |Nt = n) = E( Xi )2
i=1
n
X n
X
= µ Xi2 +2 Xi Xj
i=1 i=1,0<j<i
∞
X
E(A2t ) = E(A2t |Nt = n)P (Nt = n)
i=1
X∞
= (n(µ2 + σ 2 ) + n(n − 1)µ2 )P (Nt = n)
i=1
∞
X ∞
X
= (µ2 + σ 2 ) n P (Nt = n) + µ2 n(n − 1) P (Nt = n)
i=1 i=1
= (µ2 + σ 2 )λt + µ2 (λt + (λt)2 − λt)
Hence,
= (µ2 + σ 2 )λt
Hence,
At
limt→∞ =µλ
t
c. Define the RV Y1 as the number of customers added in the first 1 sec. Similarly,
define RV Yi as the number of customers added in the ith second. Then (Yn , n ≥ 1) are iid
Let tc = dte. Define ∆c = tc − t and let Zc denote the number of customers arriving in
time interval (t, tc ]. Then E(Zc ) = µλ∆c and var(Zc ) = (µ2 + σ 2 )λ∆c . Hence,
210 CHAPTER 12. RANDOM PROCESSES BERNOULLI - POISSON
Ptc
At − µλt i=1 Yi
− µλtc − Zc + µλ∆
limt→∞ √ = limtc →∞ √
t tc − ∆c
Ptc
Yi − µλtc Zc − µλ∆
≥ limtc →∞ i=1 √ − √
tc tc
The last term in the expression above goes to zero as t → ∞. And using CLT we can
Ptc
Yi −µλtc d
show that limtc →∞ i=1√
tc
→ N (0, λ(µ2 + σ 2 ))
Hence, limt→∞ At √
−µλt
t
→ N (0, λ(µ2 + σ 2 )).
Chapter 13
Filtering Noise
When your FM radio is tuned between stations it produces a noise that covers a wide range
of frequencies. By changing the settings of the bass or treble control, you can make that
noise sound harsher or softer. The change in the noise is called filtering. More generally,
In this chapter, we define the power that a random process has at different frequencies
and we describe how one can design filters that modify that power distribution across
frequencies. We also explain how one can use a filter to compute the LLSE of the current
We discuss the results in discrete time. That is, we consider sequences of values instead
of functions of a continuous time. Digital filters operate on discrete sets of values. For
instance, a piece of music is sampled before being processes, so that continuous time is
made discrete. Moreover, for simplicity, we consider that time takes only a finite number of
values. Physically, we can take a set of time that is large enough to represent an arbitrarily
211
212 CHAPTER 13. FILTERING NOISE
The terminology is that a system transforms its input into an output. The input and output
are sequences of values describing, for instance, a voltage as a function of time. Accordingly,
Such sequences represent signals that, for instance, encode voice or audio. For mathe-
matical convenience, we extend these sequences periodically over all the integers in Z :=
so that x(n) = x(n − N ) for all n ∈ Z. We will consider systems whose input are the
extension of the original finite sequence. For instance, imagine that we want to study how a
particular song is processed. We consider that the song is played over and over and we look
at the result of the processing. Intuitively, this artifice does not change the nature of the
result, except possibly at the very beginning or end of the song. However, mathematically,
13.1.1 Definition
A system is linear if the output that corresponds to a sum of inputs is the sum of their
respective outputs. A system is time-invariant if delaying the input only delays the output
N
X −1
y(n) = h(m)x(n − m), n ∈ Z (13.1.1)
n=0
Thus, the output is the convolution of the (extended) input with the impulse response.
Roughly, the impulse response is the output that corresponds to the input {1, 0, . . . , 0}.
P
We say roughly, because the extension of this input is x(n) = ∞
k=−∞ 1{n = kN }, so that
n ∈ {0, 1, . . . , N − 1}. In that case, the impulse response is the output that corresponds to
the impulse input {1, 0, . . . , 0}. If h(·) is nonzero over a long period of time that exceeds
the period N of the periodic repetition of the input, then the outputs of these different
repetitions superpose each other and the result is complicated. In applications, one chooses
N large enough to exceed the duration of the impulse response. Assuming that h(n) = 0
for n < 0 means that the system does not produce an output before the input is applied.
Such a system is said to be causal. Obviously, all physical systems are causal.
Consider a system that delays the input by k time units. Such a system has impulse
response h(n) = 1{n − k}. Similarly, a system which averages k successive input values
and produces y(n) = (x(n) + x(n − 1) + · · · + x(n − k + 1))/k for some k < N has impulse
response
1
h(n) = 1{0 ≤ n ≤ k − 1}, n = 0, . . . , N − 1.
k
We assume m < N − 1. This system is LTI because the identities are linear and their
coefficients do not depend on n. To find the impulse response, assume that x(n) = 1{n = 0}
and designate by h(n) the corresponding output y(n) when y(n) = 0 for n < 0. For instance,
214 CHAPTER 13. FILTERING NOISE
h(0) = b0 ;
and so on.
We want to analyze the effect of such LTI systems on random processes. Before doing
this, we introduce some tools to study the systems in the frequency domain.
LTI systems are easier to describe in the “frequency domain” than in the “time domain.”
To understand these ideas, we need to review (or introduce) the notion of frequencies.
then so is the output. Using complex-valued signals is a mathematical artefact that simpli-
fies the analysis. In a physical system, the quantities are real. For instance, h(n) is real. If
This gives
N
X −1 N
X −1
h(m) Im{x(n − m)} = h(m) sin(2πu(n − m)/N )
n=0 n=0
= Im{ei2πnu/N |H(u)|e iθ(u)
} = |H(u)| sin(2πnu/N + θ(u)).
13.1. LINEAR TIME-INVARIANT SYSTEMS 215
This expression shows that if the input is the sine wave sin(2πun/T ), then so is the output,
except that its amplitude is multiplied by some gain |H(u)| and that it is delayed. Note
that the analysis is easier with complex functions. One can then take the imaginary part
with β := e−i2π/N .
It turns out that one can recover x(·) from X(·) by computing the Inverse Discrete
Fourier Transform:
N −1
1 X
x(n) = X(u)β −nu , n = 0, 1, . . . , N − 1. (13.1.4)
N
u=0
Indeed,
N −1 N −1 N −1 N −1 N −1
1 X −nu 1 X X mu −nu
X 1 X u(m−n)
X(u)β = [ x(m)β ]β = x(m)[ β ].
N N N
u=0 u=0 m=0 m=0 u=0
But,
N −1
1 X u(m−n)
β = 1, for m = n
N
u=0
and, for m 6= n,
N −1
1 X u(m−n) 1 β N (m−n) − 1
β = = 0,
N N β m−n − 1
u=0
because βN = e−2iπ = 1. Hence,
N −1 N −1
1 X X
X(u)β −nu = x(m)1{m = n} = x(n),
N
u=0 m=0
as claimed.
The following result shows that LTI systems are easy to analyze in the frequency domain.
216 CHAPTER 13. FILTERING NOISE
Theorem 13.1.1. Let x and y be the input and output of an LTI system, as in (13.1.1).
Then
The theorem shows that the convolution (13.1.1) in the time domain becomes a multi-
Proof:
We find
N
X −1 N
X −1 N
X −1
nu
Y (u) = y(n)β = h(m)x(n − m)β nu
n=0 n=0 m=0
N
X NX
−1 −1 N
X −1 N
X −1
= h(m)x(n − m)β (n−m)u β mu = h(m)β mu [ x(n − m)β (n−m)u ]
n=0 m=0 m=0 n=0
N
X −1
= h(m)β mu X(u) = H(u)X(u).
m=0
PN −1
The next-to-last identity follows from the observation that n=0 x(n − m)β (n−m)u does
not depend on m because both x(n) and β nu are periodic with period N . ¤
As an example, consider the LTI system (13.1.2). Assume that the input is x(n) =
ei2πun = β −nu for n ∈ Z. We know that y(n) = H(u)β −nu , n ∈ Z for some H(u) that we
H(u)β −nu +a1 H(u)β −(n−1)u +· · ·+ak H(u)β −(n−k)u = b0 β −nu +b1 β −(n−1)u +· · ·+bm β −(n−m)u .
H(u)[1 + a1 β u + · · · + ak β ku ) = b0 + b1 β u + · · · + β mu .
Consequently,
b0 + b1 β u + · · · + β mu
H(u) = . (13.1.6)
1 + a1 β u + · · · + ak β ku
13.2. WIDE SENSE STATIONARY PROCESSES 217
1
y(n) = (x(n) + x(n − 1) + · · · + x(n − m + 1)).
m
1 1 1 − β mu 1 1 − e−i2πmu/N
H(u) = (1 + β u + · · · + β (m−1)u ) = = . (13.1.7)
m m 1 − βu m 1 − e−i2πu/N
13.2 Wide Sense Stationary Processes
We are interested in analyzing systems when their inputs are random processes. One could
argue that most real systems are better modelled that way than by assuming that the input
is purely deterministic. Moreover, we are concerned by the power of the noise, as we see
below.
The power of a random process has to do with the average value of its squared magnitude.
This should not be surprising. The power of an electrical signal is the product of its voltage
by its current. For a given load, the current is proportional to the voltage, so that
the power is proportional to the square of the voltage. If the signal fluctuates, the power
is proportional to the long term average value of the square of the voltage. If the law of
large numbers applies (say, if the process is ergodic), then this long term average value is
the expected value. This discussion motivates the definition of power as the expected value
of the square.
We introduce a few key ideas on a simple model. Assume that the input of the system
at time n is the random variable X(n) and that the output is Y(n) = X(n) + X(n − 1). The
power of the input is E(X(n)²). For this quantity to be well-defined, we assume that
it does not depend on n. The average power of the output is E(Y(n)²) = E((X(n) +
X(n − 1))²) = E(X(n)²) + 2E(X(n)X(n − 1)) + E(X(n − 1)²). For this quantity to be
well-defined, we assume that E(X(n)X(n − 1)) does not depend on n. More generally, this
example shows that the definitions become much simpler if we consider processes such that
E(X(n)X(n − m)) does not depend on n for m ≥ 0. These considerations motivate the
following definition: the process {X(n), n ∈ Z} is wide sense stationary (wss) if
E(X(n)) = µ, n ∈ Z (13.2.1)
and
E(X(n)X*(n − m)) = R_X(m), n, m ∈ Z. (13.2.2)
In this chapter we consider finite sequences. One can extend such a sequence
{X(n), n = 0, . . . , N − 1} into a sequence defined for n ∈ Z by repeating it with period N. For the extended sequence to be
wss, the original sequence must satisfy a related condition: that E(X(n)X*(n − m)) = R_X(m)
for n, m ∈ {0, . . . , N − 1}, with the convention that n − m is replaced by n − m + N if n − m < 0.
When this condition holds, we say that the sequence {X(0), . . . , X(N − 1)} is wss. Note
also in that case that R_X(n) is periodic with period N. Although this condition may seem
restrictive, the examples below show that it is natural.
A first example is when {X(n), n = 0, . . . , N − 1} are i.i.d. with mean µ and variance σ².
The following result explains how one can generate many examples of wss processes.
Theorem 13.2.1. Assume that the input X of an LTI system with impulse response {h(n), n =
0, . . . , N − 1} is wss. Then the output Y is also wss.
Proof:
First, E(Y(n)) = Σ_{k=0}^{N−1} h(k) E(X(n − k)) = µ Σ_{k=0}^{N−1} h(k), which does not depend on n.
Second, we calculate
E(Y(n)Y*(n − m)) = E( Σ_{k=0}^{N−1} h(k) X(n − k) Σ_{k′=0}^{N−1} h(k′) X*(n − m − k′) )
= Σ_{k=0}^{N−1} Σ_{k′=0}^{N−1} h(k) h(k′) R_X(m + k′ − k),
which shows that the result does not depend on n. We designate that result as RY (m).
Note also that RY (m) is periodic in m with period N , since RX has that property.
We take note of the result of the calculation above, for further reference:
R_Y(m) = Σ_{k=0}^{N−1} Σ_{k′=0}^{N−1} h(k) h(k′) R_X(m + k′ − k). (13.2.3)
The result of the calculation in (13.2.3) is cumbersome and hard to interpret. We introduce
the notion of power spectrum which will give us a simpler interpretation of the result.
13.3 Power Spectrum
Definition 13.3.1. Let X be a wss process. We define the power spectrum of the process
as
S_X(u) := Σ_{n=0}^{N−1} R_X(n) β^{nu}, u = 0, 1, . . . , N − 1,
with β := e^{−i2π/N}.
As a first example, consider a process that is a sine wave with frequency u0 and a random
phase. (The unit of frequency is 1/N.) Without the random phase, the process would
certainly not be wss since its mean would depend on n. With the random phase, one
computes the covariance R_X and finds the power spectrum S_X given in (13.3.2): it is
concentrated on the frequency u0.
Indeed, by the identity (13.1.4) for the inverse DFT, we see that
R_X(m) = (1/N) Σ_{u=0}^{N−1} S_X(u) β^{−mu}, m = 0, 1, . . . , N − 1, (13.3.3)
which agrees with the expression (13.3.2) for S_X. This expression (13.3.2) says that the
power spectrum of the process is concentrated on frequency u0, which is consistent with the
fact that the process is a pure sine wave at that frequency.
Second, we look at the process with i.i.d. random variables {X(n), n = 0, . . . , N − 1}. Here,
S_X(u) = Σ_{n=0}^{N−1} R_X(n) β^{nu} = Σ_{n=0}^{N−1} (µ² + σ²) 1{n = 0} β^{nu} = µ² + σ², u = 0, 1, . . . , N − 1.
This process has a constant power spectrum. A process with that property is said to be a
white noise. In a sense, it is the opposite of a pure sine wave with a random phase.
13.4 LTI Systems and Spectrum
Theorem 13.4.1. Let X be a wss input of an LTI system with impulse response h(·) and
output Y. Then Y is wss and
S_Y(u) = |H(u)|² S_X(u), u = 0, 1, . . . , N − 1.
Proof:
Taking the DFT of (13.2.3),
S_Y(u) = Σ_{m=0}^{N−1} R_Y(m) β^{mu} = Σ_{k=0}^{N−1} Σ_{k′=0}^{N−1} h(k) β^{ku} h(k′) β^{−k′u} [ Σ_{m=0}^{N−1} R_X(m + k′ − k) β^{(m+k′−k)u} ].
Now,
Σ_{m=0}^{N−1} R_X(m + k′ − k) β^{(m+k′−k)u} = S_X(u)
because both R_X(m + k′ − k) and β^{(m+k′−k)u} are periodic in m with period N, so that we can
shift the summation index. Hence S_Y(u) = H(u) H*(u) S_X(u) = |H(u)|² S_X(u),
as claimed. □
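The theorem is easy to test numerically. The sketch below (assuming numpy; the filter h and the white-noise covariance are made up) computes R_Y from (13.2.3) with indices taken modulo N and verifies that its DFT equals |H(u)|² S_X(u):

```python
import numpy as np

N = 16
rng = np.random.default_rng(1)
h = rng.standard_normal(N)              # an arbitrary real impulse response

sigma2 = 2.0
R_X = np.zeros(N)
R_X[0] = sigma2                         # white noise: R_X(n) = sigma^2 1{n = 0}

R_Y = np.zeros(N)                       # R_Y(m) per (13.2.3), indices mod N
for m in range(N):
    for k in range(N):
        for kp in range(N):
            R_Y[m] += h[k] * h[kp] * R_X[(m + kp - k) % N]

beta = np.exp(-2j * np.pi / N)
dft = lambda r: np.array([np.sum(r * beta ** (np.arange(N) * u)) for u in range(N)])
S_X, S_Y, H = dft(R_X), dft(R_Y), dft(h)

assert np.allclose(S_Y, np.abs(H) ** 2 * S_X)
```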
The theorem provides a way to interpret the meaning of the power spectrum. Imagine
a wss random process X. We want to understand the meaning of S_X(u0). Assume that we
build an ideal LTI system with transfer function H(u) = √N 1{u = u0}. This filter lets
only the frequency u0 go through. The process X goes through the LTI system. The power
of the output is then, by the theorem and (13.3.3),
E(|Y(n)|²) = R_Y(0) = (1/N) Σ_{u=0}^{N−1} S_Y(u) = (1/N) Σ_{u=0}^{N−1} |H(u)|² S_X(u) = S_X(u0).
Thus, S_X(u0) measures the power of X at the frequency u0.
13.5 Solved Problems
Example 13.5.1. Averaging a process over time should make it smoother and reduce its
rapid changes. Accordingly, we expect that such processing should cut down the high fre-
quencies of its power spectrum. Consider the averaging system with the transfer function
H(u) given by (13.1.7).
Figure 13.1 plots the value of |H(u)|². The figure shows the “low-pass” filtering effect:
the low frequencies go through but the high frequencies are greatly attenuated. This plot is
the power spectrum of the output when the input is a white noise such as an uncorrelated
sequence. That output is a “colored” noise with most of the power in the low frequencies.
Example 13.5.2. By calculating the differences between successive values of a process one
expects to amplify its rapid variations, i.e., its high frequencies. The system y(n) = x(n) − x(n − 1)
has transfer function
H(u) = 1 − β^u = 1 − e^{−i2πu/N}.
[Figure 13.1: |H(u)|² for the averaging filter, plotted for m = 10, 20, 40.]
Consequently,
|H(u)|² = |1 − e^{−i2πu/N}|² = 2 − 2 cos(2πu/N).
Figure 13.2 plots that expression and shows that the filter boosts the high frequencies, as
expected. You will note that this system acts as a high-pass filter for frequencies up to 64
(in this example, N = 128). In practice, one should choose N large enough so that the filter
covers most of the range of frequencies of the input process. Thus, if the power spectrum
of the input process is limited to K, one can choose N = 2K and this system will act as a
high-pass filter.
Chapter 14
Markov Chains - Discrete Time
A Markov chain models the random motion in time of an object in a countable set. The
key feature of that motion is that the object has no memory of its past and does not carry a
watch. That is, the future motion depends only on the current location. Consequently, the
law of motion is specified by the one-step transition probabilities from any given location.
Markov chains are an important class of models because they are fairly general and good
models of many practical systems.
14.1 Definition
A Markov chain on a countable set S is a sequence {Xn, n ≥ 0} of random variables with
values in S such that
P[Xn+1 = j | X0 = i0, . . . , Xn−1 = in−1, Xn = i] = P(i, j), for all n ≥ 0 and i0, . . . , in−1, i, j ∈ S.
The matrix P = [P(i, j), i, j ∈ S] is the transition probability matrix of the Markov
chain. The matrix P is any nonnegative matrix whose rows sum to 1. Such a matrix is
called a stochastic matrix.
If π0 designates the distribution of X0, then
P(X0 = i0, X1 = i1, . . . , Xn = in) = π0(i0) P(i0, i1) · · · P(in−1, in).
In particular, the distribution πn of Xn is given by πn = π0 P^n,
where P^n is the n-th power of the matrix P and π0 is the row vector with entries π0(j).
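For instance, the following sketch (assuming numpy; the two-state transition matrix is made up) computes πn = π0 P^n and shows it settling down:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                    # rows sum to 1
pi0 = np.array([1.0, 0.0])                    # start in state 0

pi_n = pi0 @ np.linalg.matrix_power(P, 50)    # distribution of X_50
print(pi_n)                                   # close to (0.8, 0.2)
```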
A state transition diagram can represent the transition probability matrix. Such a dia-
gram shows the states, and the probabilities are represented by numbers on arrows between
states. By convention, no arrow between two states means that the corresponding transition
probability is zero.
14.2 Examples
Diagram 14.1 represents a Markov chain with S = {0, 1}. Its transition probability
matrix can be read off the arrows of the diagram.
The following state transition diagram 14.2 corresponds to a Markov chain with S =
{0, 1, 2, . . .}.
This is the state transition diagram of the sequence of fortunes with the rich uncle.
Also, for future reference, we introduce a few other examples. Markov chain 14.3 has
two sets of states that do not communicate. (Recall that the absence of an arrow means
that the corresponding transition probability is zero.)
Markov chain 14.4 has one state, state 4, that cannot be exited from. Such a state is
said to be absorbing.
Markov chains 14.5 and 14.6 have characteristics that we will discuss later.
Consider the following “non-Markov” chain. The possible values are 0, 1, 2. The initial
value is picked randomly. Also, with probability 1/2, the sequence increases (modulo 3)
forever, and with probability 1/2 it decreases (modulo 3) forever. For instance, starting from
0, with probability 1/2 the sequence is {0, 1, 2, 0, 1, 2, 0, . . .}
and with probability 1/2 it is {0, 2, 1, 0, 2, 1, 0, . . .}. Similarly for the other two possible
starting values. This is not Markov since by looking at the previous two values you can
predict exactly the next one, which you cannot do if you only see the current value. Note
that you can “complete the state” by considering the pair of two successive values. This
pair is a Markov chain. More generally, if a sequence has a finite memory of duration k,
then the vector of k successive values forms a Markov chain.
As another example, let X0 be picked at random in {0, 1, 2} and let Xn+1 = Xn + 1 (modulo 3),
so that if X0 = 0, then (Xn, n ≥ 0) = (0, 1, 2, 0, 1, 2, . . .); if X0 = 1, then
(Xn, n ≥ 0) = (1, 2, 0, 1, 2, . . .), and similarly if X0 = 2. Let g(0) = g(1) = 5 and g(2) = 6.
Then Yn = g(Xn) is not a Markov chain: observing Yn = 5 does not tell us whether Xn = 0
or Xn = 1, and these two cases lead to different values of Yn+1.
14.3 Classification
The properties of Markov chains are determined largely (completely for finite Markov
chains) by the “topology” of their state transition diagram. We need some terminology.
A Markov chain (or its probability transition matrix) is said to be irreducible if it can
reach every state from every other state (not necessarily in one step). For instance, the
Markov chains in Figures 14.1, 14.2, 14.5, and 14.6 are irreducible but those in Figures 14.3
and 14.4 are not.
The period d(i) of a state i is defined by d(i) := g.c.d.{n ≥ 1 | P[Xn = i | X0 = i] > 0}.
Here, g.c.d. means the greatest common divisor of the integers in the set. For instance,
g.c.d.{6, 9, 15} = 3 and g.c.d.{12, 15, 25} = 1. As an example, for the Markov chain in Figure
14.5, every state has period 3.
If the Markov chain is irreducible, then it can be shown that d(i) has the same value
for all i ∈ S. If this common value d is equal to 1, then the Markov chain is said to be
aperiodic. Otherwise, the Markov chain is said to be periodic with period d. Accordingly,
the Markov chains in Figures 14.1, 14.2, and 14.6 are aperiodic and the Markov chain in
Figure 14.5 is periodic with period 3.
Define Ti = min{n ≥ 1 | Xn = i}. If the Markov chain is irreducible, then one can show
that P[Ti < ∞ | X0 = i] has the same value for all i ∈ S. Moreover, that value is either 1 or
strictly less than 1. In the first case, the Markov chain is said to be recurrent; in the second, it is
said to be transient.
Moreover, if the irreducible Markov chain is recurrent, then E[Ti |X0 = i] is either finite
for all i ∈ S or infinite for all i ∈ S. If E[Ti |X0 = i] is finite, then the Markov chain is
said to be positive recurrent. Every finite irreducible Markov chain is positive recurrent. If
E[Ti | X0 = i] is infinite, then the Markov chain is null recurrent. Also, one can show that
if the Markov chain is transient or null recurrent, then
lim_{N→∞} (1/N)(1{X1 = j} + 1{X2 = j} + · · · + 1{XN = j}) = 0, for all j ∈ S,
whereas if it is positive recurrent, then
lim_{N→∞} (1/N)(1{X1 = j} + 1{X2 = j} + · · · + 1{XN = j}) =: π(j) > 0, for all j ∈ S.
Finally, if the Markov chain is irreducible, aperiodic, and positive recurrent, then
P(Xn = j) → π(j) as n → ∞, for all j ∈ S.
The Markov chain in Figure 14.2 is transient when p > 0.5, null recurrent when p = 0.5,
and positive recurrent when p < 0.5.
14.4 Invariant Distribution
If P(Xn = i) = π(i) for i ∈ S (i.e., does not depend on n), the distribution π is said to be
invariant. Since
P(Xn+1 = i) = Σ_j P(Xn = j, Xn+1 = i) = Σ_j P[Xn+1 = i | Xn = j] P(Xn = j) = Σ_j P(Xn = j) P(j, i),
if π is invariant, then
π(i) = Σ_j π(j) P(j, i), for i ∈ S.
These identities are called the balance equations. Thus, a distribution is invariant if and
only if it satisfies the balance equations.
An irreducible Markov chain has at most one invariant distribution. It has one if and
only if it is positive recurrent. In that case, the Markov chain is ergodic and asymptotically
stationary.
The following theorem summarizes the discussion of the previous two sections.
Theorem 14.4.1. Consider an irreducible Markov chain. It is either transient, null recur-
rent, or positive recurrent. Only the last case is possible for a finite-state Markov chain.
If the Markov chain is transient or null recurrent, then it has no invariant distribution;
the fraction of time it is in state j converges to zero for all j, and the probability that it is
in state j also converges to zero for all j.
If the Markov chain is positive recurrent, it has a unique invariant distribution π. The
fraction of time that the Markov chain is in state j converges to π(j) for all j. If the Markov
chain is aperiodic, then the probability that it is in state j converges to π(j) for all j.
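Numerically, the invariant distribution of a finite irreducible chain can be found by solving the balance equations together with the normalization Σ_i π(i) = 1. A sketch, assuming numpy and reusing the made-up matrix from the earlier example:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
n = P.shape[0]

# Stack (P^T - I) pi = 0 with the normalization sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                                     # [0.8, 0.2]
```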
14.5 First Step Equations
We can extend the example of the gambler's ruin to a general Markov chain. For instance,
let A ⊂ S be a given subset, let T = min{n ≥ 0 | Xn ∈ A}, and define β(i) = E[T | X0 = i]. Then
one finds that
β(i) = 1 + Σ_j P(i, j) β(j), for i ∉ A.
Of course, β(i) = 0 for i ∈ A. In finite cases, these equations suffice to determine β(i).
In infinite cases, one may have to introduce a boundary as we did in the case of the reflected
fortune process. In many cases, no simple solution can be found. These are the first step
equations (FSE); a numerical sketch follows.
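The first step equations are linear, so for a finite chain they can be solved directly. Here is a sketch (assuming numpy) on a made-up fair gambler's ruin on {0, 1, 2, 3, 4} with A = {0, 4}, where the known answer is β(i) = i(4 − i):

```python
import numpy as np

P = np.zeros((5, 5))
P[0, 0] = P[4, 4] = 1.0
for i in (1, 2, 3):
    P[i, i - 1] = P[i, i + 1] = 0.5           # fair coin: up or down
A = [0, 4]
free = [i for i in range(5) if i not in A]

# beta = 1 + P beta, restricted to the states outside A (beta = 0 on A).
Q = P[np.ix_(free, free)]
beta = np.linalg.solve(np.eye(len(free)) - Q, np.ones(len(free)))
print(beta)                                   # [3. 4. 3.]
```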
14.6 Time Reversal
Assume that X is a stationary irreducible Markov chain with invariant distribution π and
transition probability matrix P. What does the reversed process X′ = {X′n :=
XN−n, 0 ≤ n ≤ N} look like? It turns out that X′ is also a stationary Markov chain with the
same invariant distribution (obviously) and with transition probability matrix P′ given by
P′(i, j) = π(j) P(j, i)/π(i), i, j ∈ S.
In some cases, P = P′. In those cases, X and X′ have the same finite dimensional
distributions; the Markov chain is then said to be time-reversible. Note that P = P′ if and
only if
π(i) P(i, j) = π(j) P(j, i), for all i, j ∈ S.
These equations are called the detailed balance equations. You can use these equations
to look for an invariant distribution: a distribution that satisfies them also satisfies the
balance equations.
14.7 Summary
The key point of this definition is that, given the present value of Xn , the future {Xm , m ≥
n + 1} and the past {Xm , m ≤ n − 1} are independent. That is, the evolution of X starts
afresh from Xn. In other words, the state Xn contains all the information that is useful for
predicting the future of the process.
The First Step Equations are difference equations about some statistics of a Markov
chain {Xn, n ≥ 0} that are derived by considering the different possible values of the first
step X1.
Finally, a stationary Markov chain reversed in time is again a Markov chain, generally
with a different transition probability matrix, unless the detailed balance equations hold,
in which case the process is time-reversible.
14.8 Solved Problems
Example 14.8.1. We flip a coin repeatedly until we get three successive heads. What is
the average number of flips?
We can model the problem as a Discrete Time Markov chain where the states denote
the number of successive heads obtained so far. Figure 14.7 shows the transition diagram
of this Markov chain.
If we are in state 0, we jump back to state 0 if we receive a Tail else we jump to state 1.
The probability of each of these transitions is 1/2. Similarly, if we are in state 2, we jump back
to state 0 if we receive a Tail, else we jump to state 3, and similarly for the other transitions.
[Figure 14.7: Transition diagram on the states 0, 1, 2, 3; each state i < 3 jumps to state i + 1 or back to state 0, each with probability 1/2.]
We are interested in finding the mean number of coin flips until we get 3 successive
heads. This is the mean number of steps taken to reach state 3 from state 0. Let N =
min{n ≥ 0, Xn = 3}, i.e., N is a random variable which specifies the first time we hit state
3. Let β(i) = E[N | X0 = i]. The first step equations are
β(0) = (1/2)β(0) + (1/2)β(1) + 1;
β(1) = (1/2)β(2) + (1/2)β(0) + 1;
β(2) = (1/2)β(3) + (1/2)β(0) + 1;
β(3) = 0.
Solving these equations, we get β(0) = 14, β(1) = 12, β(2) = 8.
Hence the expected number of tosses until we get 3 successive heads is 14.
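As a check, the same numbers come out of solving the three equations as a linear system (a sketch assuming numpy):

```python
import numpy as np

# Rewriting the FSE with beta(3) = 0:
A = np.array([[ 0.5, -0.5,  0.0],   # beta(0) - 0.5 beta(0) - 0.5 beta(1) = 1
              [-0.5,  1.0, -0.5],   # beta(1) - 0.5 beta(0) - 0.5 beta(2) = 1
              [-0.5,  0.0,  1.0]])  # beta(2) - 0.5 beta(0)               = 1
print(np.linalg.solve(A, np.ones(3)))   # [14. 12.  8.]
```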
the smallest σ-field that contains all the events of the form
{ω | ω0 = i0, . . . , ωn = in};
Example 14.8.3. We flip a biased coin forever. Let X1 = 0 and, for n ≥ 2, let Xn = 1
if the outcomes of the n-th and (n − 1)-st coin flips are identical and Xn = 0 otherwise. Is
X = {Xn, n ≥ 1} a Markov chain?
Algebra shows that the expressions are equal if and only if p = 0.5. Thus, if X is a Markov
chain, p = 0.5. Conversely, if p = 0.5, then we see that the random variables {Xn, n ≥ 2}
are i.i.d., so that X is indeed a Markov chain.
Example 14.8.4. A man tries to climb a ladder with n rungs. At each step, he climbs
up one rung with probability p, otherwise he falls back to the ground. What is the average
time until he reaches the top of the ladder?
Let β(m) be the average time to reach the n-th rung, starting from the m-th one, for
m ∈ {0, 1, . . . , n}. The first step equations are
β(m) = 1 + pβ(m + 1) + (1 − p)β(0), for m = 0, 1, . . . , n − 1;
β(n) = 0.
The first equation is of the form β(m + 1) = aβ(m) + b with a = 1/p and b = −1/p − ((1 −
p)/p)β(0). Consequently,
β(m) = a^m β(0) + b (1 − a^m)/(1 − a), m = 0, 1, . . . , n.
Setting β(n) = 0 and solving for β(0) gives
β(0) = (1 − p^n)/(p^n − p^{n+1}).
For instance, with p = 0.8 and n = 10, one finds β(0) = 41.5.
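A one-line check of the closed form (plain Python):

```python
p, n = 0.8, 10
print((1 - p ** n) / (p ** n - p ** (n + 1)))   # about 41.5 (41.57 more precisely)
```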
Example 14.8.5. You toss a fair coin repeatedly with results Y0, Y1, Y2, . . . that are 0 or 1
with probability 1/2 each. Define Xn = Yn−1 + Yn for n ≥ 1. Is X = {Xn, n ≥ 1} a Markov
chain?
No, because
P[Xn+1 = 0 | Xn = 1, Xn−1 = 2] = 1/2 ≠ P[Xn+1 = 0 | Xn = 1] = 1/4.
Example 14.8.6. Consider a small deck of three cards 1, 2, 3. At each step, you take the
middle card and you place it first with probability 1/2 or last with probability 1/2. What is
the average time until the cards are in the reversed order 3, 2, 1?
The possible states are the six permutations {123, 132, 312, 321, 231, 213}. The state
transition diagram consists of these six states placed around a circle (in the order indicated),
and each step moves the state to one of its two neighbors on the circle, each with probability
1/2. Denoting the states 1, 2, . . . , 6 for simplicity, with 1 = 321 and 4 = 123, we write the FSE for
the average time β(i) to reach state 1 from state i:
β(i) = 1 + (1/2)β(i − 1) + (1/2)β(i + 1), for i = 2, 3, . . . , 6;
β(1) = 0.
In these equations, the conventions are that 6 + 1 = 1 and 1 - 1 = 6. Solving the equations
gives β(1) = 0, β(2) = β(6) = 5, β(3) = β(5) = 8, β(4) = 9. Accordingly, the answer to our
problem is that it takes an average of 9 steps to reverse the order of the cards.
Example 14.8.7. For the same Markov chain as in the previous example, what is the
probability F(n) that it takes at most n steps to reverse the order of the cards?
Let F(n; i) be the probability that it takes at most n steps to reach state 1 from state
i. The first step equations are
F(n; 1) = 1, n ≥ 0;
F(n; i) = (1/2)F(n − 1; i − 1) + (1/2)F(n − 1; i + 1), for i ≠ 1 and n ≥ 1,
with F(0; i) = 0 for i ≠ 1.
Again we adopt the conventions that 6 + 1 = 1 and 1 − 1 = 6. We can solve the equations
numerically and plot the values of F(n) = F(n; 4). The graph is shown in Figure 14.8.
Example 14.8.8. Is the Markov chain of the previous example periodic?
Yes: it takes 2, 4, 6, . . . steps to go from state i back to itself. Thus, the Markov chain is
periodic with period 2. Recall that this implies that the probability of being in state i does
not converge to the invariant distribution (1/6, 1/6, . . . , 1/6). The graph in Figure 14.9
[Figure 14.8: F(n) as a function of n. Figure 14.9: P^n(4, 4) as a function of n.]
shows the probability of being in state 4 at time n given that X0 = 4. This is derived by
calculating P^n(4, 4). Since P^{n+1}(4, j) = Σ_i P^n(4, i)P(i, j) = 0.5 P^n(4, j − 1) + 0.5 P^n(4, j + 1),
one can compute recursively by iterating a vector with 6 elements instead of a matrix
with 36.
Example 14.8.9. We flip a fair coin repeatedly until we get either the pattern HHH or
the pattern HTH. What is the average number of flips?
Let Xn be the last two outcomes. After two flips, we start with X0 that is equally likely
to be any of the four pairs in {H, T}². Look at the transition diagram of Figure 14.10.
The FSE for the average time to hit one of the two states HTH or HHH from the other
states are as follows.
[Figure 14.10: Transition diagram on the pairs TT, TH, HT, HH, with transitions of probability 1/2 and the target patterns HTH and HHH.]
β(TT) = 1 + 0.5 β(TH) + 0.5 β(TT)
β(TH) = 1 + 0.5 β(HH) + 0.5 β(HT)
β(HT) = 1 + 0.5 β(TT)
β(HH) = 1 + 0.5 β(HT).
Solving, we find
(β(TT), β(HT), β(HH), β(TH)) = (1/5)(34, 22, 16, 24).
Since the first two flips lead to the four states with equal probabilities, the average number
of flips is
2 + (1/4)(β(TT) + β(TH) + β(HT) + β(HH)) = 34/5 = 6.8.
Example 14.8.10. Give an example of a discrete time irreducible Markov chain with period
3 that is transient.
The figure below shows the state diagram of an irreducible Markov chain with period 3.
Indeed, the Markov chain can go from 0 to 0 in 3, 6, 9, . . . steps. The probability of motion
to the right is much larger than to the left. Accordingly, one can expect that the Markov
chain is transient, and indeed it is.
[State transition diagram: an infinite irreducible chain with period 3; the transition probabilities to the right (0.9 or 1) are much larger than those to the left, so the chain drifts to the right.]
Example 14.8.11. For n = 1, 2, . . ., during year n, barring any catastrophe, your company
makes a profit equal to Xn , where the random variables Xn are i.i.d. and uniformly dis-
tributed in {0, 1, 2}. Unfortunately, during year n, there is also a probability of a catastrophe
that sets back your company’s total profit to 0. Such a catastrophe occurs with probability
5% independently each year. Explain precisely how to calculate the average time until your
company’s total profit reaches 100. (The company does not invest its money; the total profit
is the sum of the profits since the last catastrophe.) Do not perform the calculations but pro-
vide the equations to be solved and explain a complete algorithm that I could use to perform
the calculations.
Let β(i) be the average time until the profits reach 100 starting from i.
Then
β(100) = β(101) = 0
β(i) = 1 + 0.05 β(0) + (0.95/3)[β(i) + β(i + 1) + β(i + 2)], for i = 0, 1, . . . , 99.
Fix β(0) and β(1) arbitrarily. Use the second equation to find β(2), β(3), . . . , β(100), β(101),
in that order. Use the first two equations to determine the two unknowns β(0), β(1).
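Instead of the two-unknowns recursion described above, one can also solve the equations directly as one linear system; a sketch assuming numpy:

```python
import numpy as np

n = 102                      # unknowns beta(0), ..., beta(101)
A = np.eye(n)                # rows 100 and 101 encode beta(100) = beta(101) = 0
b = np.zeros(n)
for i in range(100):
    # beta(i) = 1 + 0.05 beta(0) + (0.95/3)[beta(i) + beta(i+1) + beta(i+2)]
    A[i, 0] -= 0.05
    for j in (i, i + 1, i + 2):
        A[i, j] -= 0.95 / 3
    b[i] = 1.0
beta = np.linalg.solve(A, b)
print(beta[0])               # average number of years to reach a total profit of 100
```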
Example 14.8.12. Consider the discrete time Markov chain on {0, 1, 2, 3, 4} with transition
probabilities P(i, i + 1) = P(i, i + 2) = 0.5, where the addition is modulo 5 (so that, e.g.,
P(4, 0) = P(4, 1) = 0.5). If you picture the states {0, 1, . . . , 4} as the vertices of a pentagon whose
labels increase clockwise, then the Markov chain makes one or two steps clockwise with
equal probabilities.
a. Is the Markov chain aperiodic?
b. What is its invariant distribution?
c. How many invariant distributions does the Markov chain have?
d. Write the first step equations to calculate the average time for the Markov chain to
go from state 0 back to state 0. Can you guess the answer from the result of part (b)?
a. The Markov chain is aperiodic. For instance, it can go from state 0 to itself in 4 or
5 steps.
b. By symmetry (each column of P also sums to 1), the invariant distribution is uniform:
π(i) = 1/5 for all i.
c. A finite irreducible Markov chain always has exactly one invariant distribution.
d. Let β(i) be the average time to reach state 0 from state i. The FSE are
β(0) = 0,
β(i) = 1 + 0.5 β(i + 1) + 0.5 β(i + 2), for i ≠ 0 (addition modulo 5),
and the average time to go from state 0 back to state 0 is 1 + 0.5 β(1) + 0.5 β(2). From
part (b), one can guess that this average return time is 1/π(0) = 5.
Example 14.8.13. Let {Xn , n ≥ 0} be i.i.d. Bernoulli with mean p. Define Y0 = X0 . For
The sequence {Yn , n ≥ 0} is not a Markov chain unless p = 1 or p = 0. To see this, note
that
P [Y3 = 1 | Y2 = 1, Y1 = 0] = P [X0 = 0, X1 = 0, X2 = 1 | X0 = 0, X1 = 0, X2 = 1] = 1.
The average time from state 0 to itself can be guessed from the invariant distribution as
follows. Imagine that once the Markov chain reaches state 0 it takes on average β steps for
it to return to 0. Then, the Markov chain spends one step out of every 1 + β steps in state 0,
on average. Hence, the probability of being in state 0 should be equal to 1/(1 + β). Thus,
1 + β = 1/π(0), which confirms the guess in part (d) above.
Example 14.8.14. Consider the Markov chain {Yn, n ≥ 0} on {0, 1, . . . , N} such that, if Yn = k, then
Yn+1 = max{0, min{k + Xn+1, N}}, where the {Xn, n ≥ 1} are i.i.d. with P(Xn = +1) =
p = 1 − P(Xn = −1). The random variable Y0 is independent of {Xn, n ≥ 1}. Assume that
0 < p < 1.
a. Find the invariant distribution of the Markov chain.
b. Show that the distribution of Yn converges to the invariant distribution, i.e., that the
Markov chain is asymptotically stationary.
c. Write the first step equations for the average time until Yn = 0 given Y0 = k.
a. The balance equations give
π(0) = (1 − p)π(0) + (1 − p)π(1) ⇒ π(1) = ρπ(0), where ρ := p/(1 − p);
π(1) = pπ(0) + (1 − p)π(2) = (1 − p)π(1) + (1 − p)π(2) ⇒ π(2) = ρ²π(0);
and, continuing in the same way,
π(n) = ρ^n π(0), n = 0, 1, . . . , N.
Normalizing so that the probabilities add up to one, we find
π(n) = ρ^n (1 − ρ)/(1 − ρ^{N+1}), n = 0, 1, . . . , N.
b. Let {Zn , n ≥ 0} be a stationary version of the Markov chain. That is, its initial value
Z0 is selected with the invariant distribution and if Zn = k, then Zn+1 = max{0, min{k +
Xn+1 , N }}. Define the sequence {Yn , n ≥ 0} so that Y0 is arbitrary and if Yn = k, then
Yn+1 = max{0, min{k + Xn+1 , N }}. The random variables Xn are the same for the two
sequences. Note that Zn and Yn will be equal after some finite random time τ . For instance,
we can choose τ to be the first time that N successive Xn ’s are equal to −1. Indeed, at
that time, both Y and Z must be zero and they remain equal thereafter. Now,
|P(Yn = k) − P(Zn = k)| ≤ P(τ > n) → 0 as n → ∞.
Since P(Zn = k) = π(k) for all n, this shows that P(Yn = k) → π(k).
c. Let β(k) be the average time until Yn = 0 given Y0 = k. Then
β(k) = 1 + p β(min{k + 1, N}) + (1 − p) β(k − 1), for k = 1, . . . , N;
β(0) = 0.
Chapter 15
Markov Chains - Continuous Time
We limit our discussion to a simple case: the regular Markov chains. Such a Markov chain
visits states in a countable set. When it reaches a state, it stays there for an exponentially
distributed random time (called the state holding time) with a mean that depends only on
the state. The Markov chain then jumps out of that state to another state with transition
probabilities that depend only on the current state. Given the current state, the state
holding time, the next state, and the evolution of the Markov chain prior to hitting the
current state are independent. The Markov chain is regular if jumps do not accumulate,
i.e., if it makes only finitely many jumps in finite time. We explain this construction and
we give some examples. We then state some results about the stationary distribution.
15.1 Definition
The definition involves a matrix Q = [q(i, j), i, j ∈ S] with nonnegative off-diagonal
elements and row sums equal to zero. Such a matrix is called a rate matrix or generator.
The formula says (if you can read between the symbols) that given X(t), the future and
the past are independent.
One represents the rate matrix by a state transition diagram that shows the states; an
arrow from i to j 6= i marked with q(i, j) shows the transition rate between these states.
Let Q be a rate matrix. Define the process X = {X(t), t ≥ 0} as follows. Start by choosing
X(0) according to some distribution in S. When X reaches a state i (and when it starts
in that state), it stays there for some exponentially distributed time with mean 1/q(i)
where q(i) = −q(i, i). When it leaves state i, the process jumps to state j 6= i with
probability q(i, j)/q(i). The evolution of X then continues as before. Figure 15.1 illustrates
this construction.
This construction defines a process on the positive real line if the jumps do not accumu-
late, which we assume here. A simple argument shows that this construction corresponds
to the definition. Essentially, the memoryless property of the exponential distribution and
of the jump mechanism imply the Markov property.
15.3 Examples
The example of Figure 15.2 corresponds to a Markov chain with two states. The example
of Figure 15.3 corresponds to a Poisson process with rate λ. The example of Figure 15.4
corresponds to the number of jobs in an M/M/1 queue with arrival rate λ and service rate µ.
15.4 Classification and Invariant Distribution
The classification results are similar to the discrete time case. An irreducible Markov chain
(can reach every state from every other state) is either null recurrent, positive recurrent, or
transient (defined as in discrete time). The positive recurrent Markov chains have a unique
invariant distribution; the others do not have any. Also, a distribution π is invariant if and
only if
πQ = 0. (15.4.1)
For instance, the Markov chain in Figure 15.4 is transient if ρ := λ/µ > 1; it is null recurrent
if ρ = 1; and it is positive recurrent if ρ < 1, in which case its invariant distribution is
π(n) = (1 − ρ)ρ^n, n = 0, 1, 2, . . . .
15.5 Time-Reversibility
A stationary irreducible Markov chain is time-reversible if and only if π satisfies the detailed
balance equations:
π(i) q(i, j) = π(j) q(j, i), for all i ≠ j. (15.5.1)
15.6 Summary
The random process X = {Xt, t ≥ 0} taking values in the countable set S is a Markov chain
with rate matrix Q if
P[Xt+h = j | Xt = i, Xu, 0 ≤ u ≤ t] = q(i, j)h + o(h), for j ≠ i.
The definition specifies the Markov property that given Xt the past and the future are
independent. Recall that the Markov chain stays in state i for an exponentially distributed
time with rate q(i), then jumps to state j with probability q(i, j)/q(i) for j ≠ i, and the
evolution then starts afresh.
A stationary Markov chain is time-reversible if the detailed balance equations (15.5.1) are
satisfied.
15.7 Solved Problems
Example 15.7.1. Consider n light bulbs that have independent lifetimes exponentially dis-
tributed with mean 1. What is the average time until the last bulb dies?
Let Xt be the number of bulbs still alive at time t ≥ 0. Because of the memoryless
property of the exponential distribution, {Xt, t ≥ 0} is a Markov chain. Also, the rate
of leaving state m is m, since the first of the m surviving bulbs dies at rate m.
The average time in state m is 1/m and the Markov chain goes from state n to n − 1 to
n − 2, and so on, down to state 0. Hence, the average time until the last bulb dies is
1/n + 1/(n − 1) + · · · + 1/3 + 1/2 + 1.
To fix ideas, one finds the average time to be about 3.6 when n = 20.
Example 15.7.2. In the previous example, assume that the janitor replaces a burned out
bulb after an exponentially distributed time with mean 0.1. What is the average time until
all the bulbs are dead at the same time?
The rate matrix now corresponds to the state diagram shown in Figure 15.5. Defining
[Figure 15.5: State diagram on {0, 1, . . . , n}; state m jumps to m − 1 with rate m and to m + 1 with rate 10.]
β(m) as the average time from state m to state 0, for m ∈ {0, 1, . . . , n}, we can write the
FSE as
β(m) = 1/(m + 10) + [m/(m + 10)] β(m − 1) + [10/(m + 10)] β(m + 1), for m ∈ {1, 2, . . . , n − 1};
β(n) = 1/n + β(n − 1);
β(0) = 0.
If we knew β(n − 1), we could solve recursively for all values of β(m). We could then check
that β(0) = 0. Choosing n = 20 and adjusting β(19) so that β(0) = 0, we find the answer
numerically.
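One way to carry out this computation is to treat the FSE as one linear system in β(1), . . . , β(n); a sketch assuming numpy:

```python
import numpy as np

n = 20
A = np.zeros((n, n))         # unknowns beta(1), ..., beta(n); beta(0) = 0
b = np.zeros(n)
for m in range(1, n):        # (m + 10) beta(m) - m beta(m-1) - 10 beta(m+1) = 1
    A[m - 1, m - 1] = m + 10
    if m >= 2:
        A[m - 1, m - 2] = -m
    A[m - 1, m] = -10
    b[m - 1] = 1.0
A[n - 1, n - 1] = 1.0        # beta(n) - beta(n-1) = 1/n
A[n - 1, n - 2] = -1.0
b[n - 1] = 1.0 / n
beta = np.linalg.solve(A, b)
print(beta[n - 1])           # average time from "all n bulbs alive" to "all dead"
```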
Example 15.7.3. Let A = {At, t ≥ 0} and D = {Dt, t ≥ 0} be two independent Poisson
processes with rates λ and µ, respectively. Let also X0 be a random variable independent
of A and D, and define Xt = X0 + At − Dt. Show that X is a Markov chain. What is its rate
matrix? Show that it is irreducible (unless λ = µ = 0). For what values of λ and µ is the
Markov chain recurrent or transient?
The process X is a Markov chain on the integers with q(i, i + 1) = λ and q(i, i − 1) = µ. For
instance,
P[Xt+h = i + 1 | Xt = i, X(u), 0 ≤ u ≤ t] = P(one jump of A and no jump of D in (t, t + h])
= λh + o(h).
To study the recurrence or transience of the Markov chain, consider the jump chain with
Pij = λ/(λ + µ) if j = i + 1; Pij = µ/(λ + µ) if j = i − 1; Pij = 0 otherwise.
We know that this DTMC is transient if λ/(λ + µ) ≠ µ/(λ + µ), i.e., if λ ≠ µ, and is null
recurrent if λ = µ. The same is then true for X.
Example 15.7.4. Let Q be a rate matrix on a finite state space S . For each pair of states
(i, j) ∈ S 2 with i 6= j, let N (i, j) = {Nt (i, j), t ≥ 0} be a Poisson process with rate q(i, j).
Assume that these Poisson processes are all mutually independent. Construct the process
If Xt = i, let s be the first jump time after time t of one of the Poisson processes N (i, j)
for j 6= i. If s is a jump time of N (i, j), then let Xu = i for u ∈ [t, s) and let Xs = j.
Continue the construction using the same procedure. Show that X is a Markov chain with
rate matrix Q.
First note the following. If X =_D Exd(λ) and Y =_D Exd(µ) are independent, then
min{X, Y} =_D Exd(λ + µ) and P(X < Y) = λ/(λ + µ). Now suppose the constructed process is in state i
at time t. The exponential clocks of the processes N(i, j), j ≠ i, each with rate q(i, j), are
running independently. The probability that the clock of N(i, j) expires first is
q(i, j)/Σ_{k≠i} q(i, k), and the distribution of the time spent in state i is
exponential with rate Σ_{j≠i} q(i, j).
Next, we must look at the infinitesimal rate of jumping from state i to state j. By
construction of X,
P(Xt+h = j | Xt = i, X(u), 0 ≤ u ≤ t) = q(i, j)h + o(h), for j ≠ i.
Also,
P(Xt+h = i | Xt = i, X(u), 0 ≤ u ≤ t) = 1 − Σ_{j≠i} q(i, j)h + o(h)
as h ↓ 0. Hence X is a Markov chain with rate matrix Q.
Example 15.7.5. Let X be a Markov chain with rate matrix Q on a finite state space S .
Let λ be such that λ ≥ −q(i, i) for all i ∈ S . Define the process Y as follows. Choose
Y0 = X0 . Let {Nt , t ≥ 0} be a Poisson process with rate λ and let T be its first jump time.
If Y0 = i, then let Yu = i for u ∈ [0, T ). Let also YT = j with probability q(i, j)/λ for j 6= i
and YT = i with probability 1 + q(i, i)/λ. Continue the construction in the same way. Show
that Y is also a Markov chain with rate matrix Q.
Consider the infinitesimal rate of jumping from state i to state j. In the original chain
we had:
P(Xt+h = j | Xt = i, X(u), 0 ≤ u ≤ t) = q(i, j)h + o(h), for j ≠ i.
In the new chain Y, when in state i, we start an exponentially distributed clock of rate
λ instead of rate −q(i, i) as in the original chain. When the clock expires we jump to state
j with probability q(i, j)/λ. (In chain X this probability was q(i, j)/(−q(i, i)).) The probability
of a jump from i to j during an interval of length h is the probability that the exponential
clock expires and that we actually jump:
P(Yt+h = j | Yt = i, Y(u), 0 ≤ u ≤ t) = (q(i, j)/λ)(λh + o(h))
= q(i, j)h + o(h), for j ≠ i.
In the new chain Y, the probability of staying in the current state is equal to the
probability that the exponential clock does not expire in the infinitesimal time interval h, or
that it expires but the chain jumps back to i:
P(Yt+h = i | Yt = i, Y(u), 0 ≤ u ≤ t) = 1 − λh + o(h) + (λh + o(h))(1 + q(i, i)/λ)
= 1 + q(i, i)h + o(h).
Hence the new chain Y exhibits the same rate matrix as chain X.
Example 15.7.6. Let X be a Markov chain on {1, 2, 3, 4} with the rate matrix Q given by
Q = [ −2  1  0  1
       0  0  0  0
       1  1 −2  0
       1  0  1 −2 ].
Let T2 := min{t ≥ 0 | Xt = 2}.
a. Calculate E[T2 | X0 = i]. b. Calculate E[e^{iuT2} | X0 = 1].
Define β(i) = E[T2 |X0 = i]. The general form of the first step equations for the CTMC
is:
β(i) = 1/(−q(i, i)) + Σ_{j≠i} [q(i, j)/(−q(i, i))] β(j).
β(1) = 1/2 + (1/2)β(2) + (1/2)β(4)
β(2) = 0
β(3) = 1/2 + (1/2)β(1) + (1/2)β(2)
β(4) = 1/2 + (1/2)β(1) + (1/2)β(3).
Solving these equations we get: β(1) = 1.4, β(3) = 1.2, β(4) = 1.8.
b. We can find E[eiuT2 |X0 = 1] by writing the first step equations for the characteristic
function. For instance, note that the time T (1) to hit 2 starting from 1 is equal to an
exponential time with rate 2, say τ , plus, with probability 1/2, the time T (4) to hit 2 from
state 4. Hence,
v1(u) := E(e^{iuT(1)}) = (1/2) E(e^{iuτ}) + (1/2) E(e^{iu(τ + T(4))})
= (1/2)(2/(2 − iu)) + (1/2)(2/(2 − iu)) v4(u)
= 1/(2 − iu) + (1/(2 − iu)) v4(u).
In this derivation, we used the fact that E(eiuτ ) = 2/(2 − iu) and we defined v4 (u) =
E(eiuT (4) ).
Similarly, we find
v3(u) := E(e^{iuT(3)}) = 1/(2 − iu) + (1/(2 − iu)) v1(u),
v4(u) = (1/(2 − iu)) v1(u) + (1/(2 − iu)) v3(u).
Solving these equations, we find
E[e^{iuT2} | X0 = 1] = ((2 − iu)³ + (2 − iu)) / ((2 − iu)⁴ − (2 − iu)² − (2 − iu)).
Chapter 16
Applications
Of course, one week of lectures on applications of this theory is vastly inadequate. Many
fields build on it: communication systems, computer
networks, information theory and coding, and many others. In this brief chapter we take a
brief look at a few representative examples.
16.1 Optical Communication Link
Consider the following model of an optical communication link: [Transmitter laser] →
[Photodetector] → [Receiver].
To send a bit “1” we turn the laser on for T seconds; to send a bit “0” we turn it
off for T seconds. We agree that we start by turning the laser on for T seconds before
the first bit and that we send always groups of N bits. [More sophisticated systems exist
that can send a variable number of bits.] When the laser is on, it produces light that
reaches the photodetector with some intensity λ1 . This light is “seen” by the photodetector
that converts it into electricity. When the laser is off, no light reaches the photodetector.
Unfortunately, the electronic circuitry adds some thermal noise. As a result, the receiver
sees light with an intensity λ0 + λ1 when the laser is on and with an intensity λ0 when
the laser is off. (That is, λ0 is the equivalent intensity of light that corresponds to the
noise current.) By “light with intensity λ” we mean a Poisson process of photons that has
intensity λ. Indeed, a laser does not produce photons at precise times. Rather, it produces
a Poisson stream of photons. The brighter the laser, the larger the intensity of the Poisson
process.
The problem of the receiver is to decide whether it has received a bit 0 or a bit 1 during
a given time interval of T seconds. For simplicity, we assume that the boundary between
successive bit intervals is known to the receiver.
Let Y be the number of photons that the receiver sees during the time interval. Let
X = 0 if the bit is 0 and X = 1 if the bit is 1. Then fY |X [y|0] is Poisson with mean λ0 T
and fY|X[y|1] is Poisson with mean (λ0 + λ1)T. The problem can then be formulated as a
hypothesis testing problem.
Assume that the bits 0 and 1 are equally likely. To minimize the probability of detection
error we should detect using the MLE (which is the same as the MAP in this case). That
is, we decide 1 if P[Y = y | X = 1] > P[Y = y | X = 0]. To simplify the math, we approximate
a Poisson random variable with mean µ by a N(µ, µ) random variable. With this approximation,
fY|X[y|0] = (1/√(2πλ0 T)) exp{−(y − λ0 T)²/(2λ0 T)}
and
fY|X[y|1] = (1/√(2π(λ0 + λ1)T)) exp{−(y − (λ0 + λ1)T)²/(2(λ0 + λ1)T)}.
Figure 16.1 shows these densities when λ0 T = 10 and (λ0 + λ1 )T = 25.
Using the graphs, one sees that the receiver decides that it got a bit 1 whenever Y > 16.3.
(In fact, I used the actual values of the densities to identify this threshold.) You also see
that P[Decide "1" | X = 0] = P(N(10, 10) > 16.3) = 0.023 and P[Decide "0" | X = 1] =
P(N(25, 25) < 16.3) = 0.041. To find the numerical values, you use a calculator or a
table of the c.d.f. of N(0, 1) after you write that P(N(10, 10) > 16.3) = P(N(0, 1) >
(16.3 − 10)/√10) ≈ 0.023.
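A short sketch of the numerical step (assuming scipy is available):

```python
from math import sqrt
from scipy.stats import norm

t = 16.3                                   # decision threshold
print(1 - norm.cdf((t - 10) / sqrt(10)))   # P(N(10,10) > 16.3), about 0.023
print(norm.cdf((t - 25) / sqrt(25)))       # P(N(25,25) < 16.3), about 0.041
```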
One interesting question is to figure out how to design the link so that the probability
of errors are acceptably small. A typical target for an optical link is a probability of error
of the order of 10−12 , which is orders of magnitude smaller than what we have achieved so
far. To reduce the probability of error, one must reduce the amount of “overlap” of the two
densities shown in the figure. If the values of λ0 and λ1 are given (λ0 depends on the noise
and λ1 depends on the received light power from the transmitter laser), one solution is to
increase T , i.e., to transmit the bits more slowly by spending more time for each bit. Note
that the graphs will separate if one multiplies the means and variances by some constant
larger than 1.
16.2 Wireless Communication Link
Consider the following model of a wireless communication link: [Transmitter and antenna]
→ [Receiver antenna and receiver].
For simplicity, assume a discrete time model of the system. To transmit bit 0 the
transmitter sends a signal a := {an, n = 1, 2, . . . , N}; to transmit bit 1, it
sends a signal b := {bn, n = 1, 2, . . . , N}. The actual values are selected based on the
efficiency of the antenna at transmitting such signals and on some other reasons. In any
case, it seems quite intuitive that the two signals should be quite different if the receiver
must be able to distinguish a 0 from a 1. We try to understand how the receiver makes its
choice. As always, the difficulty is that the receiver gets the transmitted signal corrupted
by noise: it observes
Yn = an + Zn, n = 1, 2, . . . , N,
when the transmitter sends a bit 0, and
Yn = bn + Zn, n = 1, 2, . . . , N,
when the transmitter sends a bit 1, where {Zn, n = 1, 2, . . . , N} are i.i.d., N(0, σ²).
The MLE decides 0 if ‖Y − a‖² := Σ_n (Yn − an)² < Σ_n (Yn − bn)² =: ‖Y − b‖², and
decides 1 otherwise. That is, the MLE decides 0 if the received signal Y is closer to the
signal a than to the signal b. Nothing counterintuitive; of course the key is to measure
“closeness” correctly. One can then study the probability of making an error and one finds
that it depends on the energy of the signals a and b ; again this is not too surprising. The
problem can be extended to more than 2 signals so that we can send more than one bit in
N steps. The problem of designing the best signals (that minimize the probability of error)
is then an interesting design problem.
16.3 The M/M/1 Queue
Consider jobs that arrive at a queue according to a Poisson process with rate λ. The jobs are
served by a single server and they require i.i.d. Exd(µ) service times. This queue is called
the M/M/1 queue.
Because of the memoryless property of the exponential distribution, the number X(t)
of jobs in the queue at time t is a Markov chain with a birth-death state transition diagram.
The balance equations are
λπ(0) = µπ(1);
(λ + µ)π(n) = λπ(n − 1) + µπ(n + 1), n ≥ 1.
You can check that the solution is π(n) = ρ^n (1 − ρ) for n = 0, 1, . . ., with ρ = λ/µ,
provided that λ < µ.
Consider a job that arrives at the queue and the queue is in steady-state (i.e., X(·) has
its invariant distribution). What is the probability that it finds n other jobs already in the
queue? Say that the job arrives during (t, t + ε); then we want
P[X(t) = n | X(t + ε) − X(t) = 1] = P(X(t) = n) = π(n),
because X(t + ε) − X(t) and X(t) are independent. (The memoryless property of the Poisson
process.) This result is known as PASTA: Poisson arrivals see time averages.
How long will the job spend in the queue? To do the calculation, first recall that a
random variable W is Exd(α) if and only if E(exp{−uW }) = α/(u+α). Next, observe that
with probability π(n), the job has to wait for n + 1 i.i.d. Exd(µ) service times Z1, . . . , Zn+1.
Hence, the total time V that the job spends in the queue satisfies
E(exp{−uV}) = Σ_n π(n) E(exp{−u(Z1 + · · · + Zn+1)})
= Σ_n ρ^n (1 − ρ)(µ/(u + µ))^{n+1} = · · · = (µ − λ)/(u + µ − λ).
This calculation shows that the job spends an exponential time in the queue with mean
1/(µ − λ).
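A Monte Carlo check of this result, via Lindley's recursion for the waiting times (a sketch assuming numpy; λ = 0.5 and µ = 1 are made-up values):

```python
import numpy as np

lam, mu, K = 0.5, 1.0, 200_000
rng = np.random.default_rng(2)
inter = rng.exponential(1 / lam, K)     # interarrival times
serv = rng.exponential(1 / mu, K)       # service times

w, sojourn = 0.0, np.empty(K)
for k in range(K):
    sojourn[k] = w + serv[k]            # waiting time plus own service
    w = max(0.0, w + serv[k] - inter[k])
print(sojourn.mean())                   # close to 1/(mu - lam) = 2.0
```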
16.4 Speech Recognition
We explain a simplified model of speech recognition and the algorithm that computes the
MAP.
The string X = (X1, . . . , Xn) of words that the speaker pronounces is modelled as a Markov
chain:
P(X = x) = π(x1)P(x1, x2) · · · P(xn−1, xn)
for x = (x1, . . . , xn) ∈ S^n. Here, π(·) and P(·, ·) are supposed to model the language. The listener
hears sounds Y = (Y1, . . . , Yn); given X = x, the sounds are independent with
P[Yk = yk | Xk = xk] = Q(xk, yk).
This model is called a hidden Markov chain model: The string that the speaker pro-
nounces is a Markov chain that is hidden from the listener who cannot read her mind but
instead only hears the sounds. We want to calculate M AP [X|Y ]. Note that
P[X = x | Y = y] = P(X = x) P[Y = y | X = x] / P(Y = y)
= π(x1)P(x1, x2) · · · P(xn−1, xn) Q(x1, y1) · · · Q(xn, yn)/P(Y = y).
Consequently,
MAP[X | Y = y] = arg max_x π(x1)P(x1, x2) · · · P(xn−1, xn) Q(x1, y1) · · · Q(xn, yn),
where the maximization is over x ∈ S^n. To maximize this product, we minimize the negative
of its logarithm. Define
d1(0, x) = − log(π(x)Q(x, y1))
and
dk(x, x′) = − log(P(x, x′)Q(x′, yk)), for k = 2, . . . , n.
Minimizing d1(0, x1) + d2(x1, x2) + · · · + dn(xn−1, xn) is equivalent to finding the shortest
path in a graph whose nodes at stage k are the possible values of xk.
A shortest path algorithm is due to Bellman-Ford. Let A(x, k) be the length of the
shortest path up to stage k that ends in x. Then
A(x, k + 1) = min_{x′} {A(x′, k) + dk+1(x′, x)},
where the minimum is over all x′ in S. This algorithm, applied to calculate the MAP, is
known as the Viterbi algorithm.
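A sketch of this shortest-path computation in Python (assuming numpy; the values of π, P, Q and the observed string y are made-up illustrative values):

```python
import numpy as np

pi = np.array([0.6, 0.4])                  # initial distribution on S = {0, 1}
P = np.array([[0.7, 0.3], [0.2, 0.8]])     # transition matrix of the word chain
Q = np.array([[0.9, 0.1], [0.3, 0.7]])     # Q(x, y) = P[hear y | pronounce x]
y = [0, 1, 1, 0]
S, n = len(pi), len(y)

A = np.zeros((S, n))                       # A[x, k]: shortest length ending at x
back = np.zeros((S, n), dtype=int)
A[:, 0] = -np.log(pi * Q[:, y[0]])         # d1(0, x) = -log(pi(x) Q(x, y1))
for k in range(1, n):
    for x in range(S):
        cand = A[:, k - 1] - np.log(P[:, x] * Q[x, y[k]])   # d_k(x', x)
        back[x, k] = int(np.argmin(cand))
        A[x, k] = cand[back[x, k]]

x_hat = [int(np.argmin(A[:, n - 1]))]      # backtrack the shortest path
for k in range(n - 1, 0, -1):
    x_hat.append(int(back[x_hat[-1], k]))
print(x_hat[::-1])                         # the MAP word string
```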
16.5 A Simple Game
Consider the following "matching pennies" game. Alice and Bob both have a penny and
they select which face they show. If they both show the same face, Alice wins $1.00 from
Bob. If they show different faces, Bob wins $1.00 from Alice. Intuitively it is quite clear
that the best way to play the game is for both players to choose randomly and with equal
probabilities which face to show. This strategy constitutes an “equilibrium” in the sense
that no player has an incentive to deviate unilaterally from it. Indeed, if Bob plays randomly
(50/50), then the average reward of Alice is 0 no matter how she plays, so she might as well
play randomly (50/50). It is also not hard to see that this is the only equilibrium. Such an
equilibrium is called a Nash equilibrium.
This example is a particular case of a general type of games that can be described as
follows. If Alice chooses the action a ∈ A and Bob the action b ∈ B, then Alice gets the
payoff A(a, b) and Bob the payoff B(a, b). If Alice and Bob choose a and b randomly and
independently in A and B, then they get the corresponding expected rewards. Nash proved
the remarkable result that if A and B are finite, then there must be at least one such
equilibrium in random (mixed) strategies.
16.6 Decisions
One is given a perfectly shuffled 52-card deck. The cards will be turned over one at a time.
You must try to guess when an ace is about to be turned over. If you guess correctly, you
win $1.00, otherwise you lose. One strategy might be to wait until a number of cards are
turned over; if you are lucky, the fraction of aces left in the deck will get larger than 4/52
and this will increase your odds of guessing right. Unfortunately, things might go the other
way and you might see a number of aces being turned over quickly, thus reducing your odds
of winning.
A simple argument shows that it does not really matter how you play. To see this,
designate by V (n, m) your maximum expected reward given that there remain m aces and
a total of n cards in the deck. By maximum, we mean the expected reward you can get by
playing the game in the best possible way. The key to the analysis is to observe that you
have two choices as the game starts with n cards and m aces: either you gamble on the next
card or you don’t. If you do, that next card is an ace with probability m/n and you win $1.00
with that probability. If you don’t, then, after the next card is turned over, with probability
m/n you find yourself with a deck of n − 1 cards with m − 1 aces; with probability 1 − m/n,
you face a deck of n − 1 cards and m aces. Accordingly, if you play the game optimally
after the next card, your expected reward is (m/n)V (n − 1, m − 1) + (1 − m/n)V (n − 1, m).
Hence, the maximum reward V(n, m) should be either m/n (if you gamble on the next
card) or (m/n)V(n − 1, m − 1) + (1 − m/n)V(n − 1, m) (if you do not), whichever is larger:
V(n, m) = max{m/n, (m/n)V(n − 1, m − 1) + (1 − m/n)V(n − 1, m)}. (16.6.1)
You can verify that the solution of the above equations is V (n, m) = m/n whenever
n > 0. Thus, both alternatives (gamble on the next card or don’t) yield the same expected
reward.
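Equation (16.6.1) is easy to solve by memoized recursion; the sketch below (plain Python) confirms numerically that both alternatives give V(n, m) = m/n:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V(n, m):
    """Maximum expected reward with n cards left, m of them aces."""
    if m == 0:
        return 0.0                          # no ace left: you cannot win
    gamble = m / n                          # bet that the next card is an ace
    wait = (m / n) * V(n - 1, m - 1)        # see one card, then play on
    if m < n:
        wait += (1 - m / n) * V(n - 1, m)
    return max(gamble, wait)

assert abs(V(52, 4) - 4 / 52) < 1e-9        # both strategies give m/n
```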
The equations (16.6.1) are called the Dynamic Programming Equations (DPE). They
express the maximum expected reward by comparing the consequences of the different
decisions that can be made at a stage of the game. The trick to write down that equation
is to identify correctly the “state” of the game (here, the pair (n, m)). By solving the DPE
and choosing at each stage the decision that corresponds to the maximum, one derives the
optimal strategy. These ideas are due to Richard Bellman. (I borrowed this simple example
Appendix A
Mathematics Review
A.1 Numbers
You are familiar with whole, rational, real, and complex numbers. You know how to perform
operations on complex numbers and how to convert them to and from the r × eiθ notation.
Recall that |a + ib| = √(a² + b²). For instance, you can verify that
(3 + i)/(1 − i) = 1 + 2i
and that
2 + i = √5 e^{iθ} with θ = arctan(1/2).
Let A be a set of real numbers. An upper bound of A is a finite number b such that b ≥ a
for all a in A. If there is an upper bound of A that is in A, it is called the maximal element
of A and is designated by max{A}. If A has an upper bound, it has a lowest upper bound
that is designated by sup{A}. One defines a lower bound, the minimal element min{A},
and the greatest lower bound inf{A} similarly.
For instance, let A = (2, 5]. Then 6 is an upper bound of A, 1 is a lower bound,
max{A} = sup{A} = 5, inf{A} = 2, and A has no minimal element.
For any real number x one defines x⁺ = max{x, 0} and x⁻ = (−x)⁺. Note that
|x| = x⁺ + x⁻. We also use the notation x ∧ y = min{x, y} and x ∨ y = max{x, y}. For
instance, 3 ∧ 5 = 3 and 3 ∨ 5 = 5.
A.2 Summations
You should be comfortable with sums of se-
quences {xn, n ≥ 1}. For instance, you remember and you can prove that if a ≠ 1, then
Σ_{n=0}^{N} a^n = (1 − a^{N+1})/(1 − a).
n=0
By taking the derivative of the above expression with respect to a, you find that, when
|a| < 1,
Σ_{n=0}^{∞} n a^{n−1} = 1/(1 − a)².
By taking the derivative one more time, we get
Σ_{n=0}^{∞} n(n − 1) a^{n−2} = 2/(1 − a)³.
You can also exchange the order of summation:
Σ_{n=0}^{N} Σ_{m=0}^{n} x_{m,n} = Σ_{m=0}^{N} Σ_{n=m}^{N} x_{m,n}.
A.3 Combinatorics
A.3.1 Permutations
There are n! different orderings of n distinct objects, where
n! = 1 × 2 × 3 × · · · × n.
By convention, 0! = 1.
For instance, there are 120 ways of seating 5 people at a table with 5 chairs.
A.3.2 Combinations
There are C(N, n) (read "N choose n") distinct groups of n objects selected without
replacement from a set of N objects, where
C(N, n) = N!/((N − n)! n!).
For instance, there are about 2.6 × 106 distinct sets of five cards picked from a 52-card
deck.
Recall also the binomial theorem:
(a + b)^N = Σ_{n=0}^{N} C(N, n) a^n b^{N−n}.
A.3.3 Variations
You should be able to apply these ideas and their variations. For instance, you can count
the number of ways of choosing objects with or without replacement, and with or without
ordering.
A.4 Calculus
You know the meaning of the integral
∫_a^b f(x) dx.
In particular, you know how to calculate some simple integrals. You recall that, for
n = 0, 1, 2, . . .,
∫_0^1 x^n dx = 1/(n + 1).
Also,
∫_1^A (1/x) dx = ln A.
You know the integration by parts formula and you can calculate
∫_0^y x^n e^x dx.
You remember that (1 + a/n)^n → e^a as n → ∞.
Also,
e^x = 1 + x + x²/2! + x³/3! + · · ·
and you know similar power series expansions.
A.5 Sets
A set is a well-defined collection of elements. That is, for every element one can determine
whether it is in the set or not. Recall the notation x ∈ A meaning that x is an element of
It is usual to characterize a set by a proposition that its elements satisfy. For instance
A ∩ B = {x | x ∈ A and x ∈ B}.
Similarly, you know how to define A ∪ B, A \ B, and A∆B. You also know the meaning
You are not confused by the notation and you would never write [1, 2] ∈ [0, 3] because
you know that [1, 2] ⊂ [0, 3]. Similarly, you would never write 1 ⊂ [0, 3] but you would write
1 ∈ [0, 3] or {1} ⊂ [0, 3]. Along the same lines, you know that 0 ∈ [0, 3] but you would never
write 0 ∈ (0, 3).
A.6 Countability
A set A is countable if it is finite or if one can enumerate its elements as A = {a1 , a2 , a3 , . . .}.
A subset of a countable set is countable. If the sets An are countable for n ≥ 1, then so is
their union
A = ∪_{n=1}^{∞} An := {a | a ∈ An for some n ≥ 1}.
The set [0, 1] is not countable. To see that, imagine that one can enumerate its elements
and write them as decimal expansions, for instance as {0.23145..., 0.42156.., 0.13798..., . . .}.
This list does not contain the number 0.135... selected in a way that the first digit in the
expansion differs from the first digit of the first element in the list, its second digit differs
from that of the second element, and so on. This diagonal argument shows that there is no
such enumeration.
A.7 Logic
Let p and q be two propositions. We say "if p then q" if the proposition q is true whenever
p is. For instance, if p means “it rains” and q means “the roof gets wet,” then we can
You know that if the statement “if p then q” is true, then so is the statement “if not q
then not p.” However, the statement “if not p then not q” may not be true.
Therefore, if we know that the statement "if p then q" is true, a method for proving
that p is false is to show that q is false: a proof by contradiction. For instance, to show
that √2 is not rational, assume that √2 = a/b where the integers a
and b are not both multiples of 2. Taking the square, we get 2 = a²/b². This implies that
a² = 2b² is even, which implies that a is even and that b is not (since a and b are not both
multiples of 2). But then a = 2c and a² = 4c² = 2b², which shows that b² = 2c² is even, so
that b is even, a contradiction.
Assume that for n ≥ 1, p(n) designates a proposition. The induction method for proving
that p(n) is true for all finite n ≥ 1 consists in showing first that p(1) is true and second
that if p(n) is true, then so is p(n + 1). The second step is called the induction step.
For instance, let us prove that a + a² + · · · + a^n = (a − a^{n+1})/(1 − a) for a ≠ 1. This
identity is certainly true for n = 1. Assume that it is true for some N. Then
a + a² + · · · + a^N + a^{N+1} = (a − a^{N+1})/(1 − a) + a^{N+1} = (a − a^{N+2})/(1 − a),
so that the identity also holds for N + 1, which completes the induction.
If p(n) is true for all finite n, this does not imply that p(∞) is true, even if p(∞) is
well-defined. For instance, the set {1, 2, . . . , n} is finite for all finite n, but {1, 2, . . .} is
infinite.
A.8 Sample Problems
Problem A.8.1. Express (1 + 3i)/(2 + i) in the form a + bi and in the form r × e^{iθ}.
Problem A.8.2. Prove by induction that
Σ_{k=1}^{n} k³ = (Σ_{k=1}^{n} k)².
Note: We want a proof by induction, not a direct proof. You may use the fact that
Σ_{k=1}^{n} k = n(n + 1)/2.
Problem A.8.3. Give an example of a bounded function f(x) defined on [0, 1] such that
the function f(x) does not have a maximum on [0, 1].
Problem A.8.4. Calculate ∫_0^1 (x + 1)/(x + 2) dx.
Problem A.8.5. Explain whether the following statements are correct:
1. 0 ∈ (0, 1);
2. 0 ⊂ (−1, 3).
Problem A.8.6. Calculate ∫_0^∞ x² e^{−x} dx.
Problem A.8.7. Let A = (1, 5), B = [0, 3), and C = (2, 4). What is A \ (B∆C)?
Problem A.8.10. Let A be a set of numbers and define B = {−a|a ∈ A}. Show that
inf{A} = − sup{B}.
Problem A.8.12. How many distinct sets of five cards with three red cards can one draw
from a 52-card deck?
Problem A.8.13. Let A be a set of real numbers with an upper bound b. Show that sup{A}
exists.
Problem A.8.14. Derive the expression for Σ_{n=0}^{N} a^n.
Problem A.8.15. Let {xn, n ≥ 1} be real numbers such that xn ≤ xn+1 and xn ≤ a < ∞
for all n. Show that xn converges to a finite limit.
Problem A.8.16. Show that the set of finite sentences in English is countable.
Appendix B
Functions
A function f(·) is a mapping from a set D into another set V. To each point x of D, the
function associates a value f(x) in V.
Appendix C
Nonmeasurable Set
C.1 Overview
We defined events as being some sets of outcomes. The collection of events is closed under
countable set operations. When the sample space is countable, we can define the probability
of every set of outcomes. However, in general this is not possible. For instance, one cannot
define the length of every subset of the real line. We explain that fact in this note. These
ideas are a bit subtle. We explain them only because some students always ask for a justification.
C.2 Outline
We construct a set S of real numbers between 0 and 1 with the following properties. Define
Sx = {y + x|y ∈ S}. That is, Sx is the set S shifted by x. Then there is a countable
collection C of numbers such that the union A of Sx for x in C is such that [1/3, 2/3] is a
subset of A and A is a subset of [0, 1]. Moreover, Sx and Sy are disjoint whenever x and y
are distinct in C.
Assume that the “length” of S is L. The length of Sx is also L for all x. (Indeed, the
length is first defined for intervals and is shift-invariant.) The length of A must be the sum
of the length of the sets Sx for x in C since it is a countable union of these disjoint sets. If
L > 0, then the length of A is infinite, which is not possible since A is contained in [0, 1].
If L = 0, then the length of A must be 0, which is not possible since A contains [1/3, 2/3].
C.3 Constructing S
We start by defining x and y in [0, 1/3] to be equivalent if they differ by a rational num-
ber. For instance, x = 2^{0.5}/8 and x + 0.12 are equivalent. We can then look at all the
equivalence classes of [0, 1/3], i.e., all the sets of equivalent numbers. Two different equiv-
alence classes must be disjoint. We then form a set S by picking one element from each
equivalence class. [Some philosophers will object to this selection, arguing that it is not
reasonable. They refuse the axiom of choice that postulates that such as set is well defined.]
Note that all the numbers in S are in [0, 1/3] and that any two numbers in S cannot be
equivalent since they were picked in different equivalence classes. That is, any two distinct
numbers in S differ by an irrational number. Recall also that S is a subset of
[0, 1/3].
Next, let C be all the rational numbers in [0, 2/3]. For x in C, Sx is a subset of [0, 1].
Also, for any two distinct x, y in C the sets Sx and Sy must be disjoint. Otherwise, they
would have a common element u + x = v + y with u, v in S,
but this implies that x − y = v − u is rational, which is not possible. It remains only to
show that the union of the sets Sx for x in C contains [1/3, 2/3]. Pick any number w in
[1/3, 2/3]. Note that w − 1/3 is in [0, 1/3] and must be equivalent to some s in S. That
is, w − 1/3 = s + r for some rational r, so that w = s + x with x := r + 1/3 rational. Since
w ∈ [1/3, 2/3] and s ∈ [0, 1/3], we see that x = w − s ∈ [0, 2/3], so that w ∈ Sx
for some x in C.
Appendix D
Key Results
We cover a number of important results in this course. In the table below we list these
results. We indicate the main reference and their applications. In the table, Ex. refers to
a solved example.
Appendix E
Bertrand's Paradox
The point of this note is that one has to be careful about the meaning of “choosing at
random.”
Consider the following question: What is the probability that a chord selected at random
in a circle is larger than the side of an inscribed equilateral triangle? There are three
plausible answers to this question: 1/2, 1/3, and 1/4. Of course, the answer depends on
what one means by "selecting a chord at random."
Answer 1: 1/3
The first choice is shown in the left-most part of Figure E.1. To choose the chord, we fix
a point A on the circle; it will be one of the ends of the chord. We then choose another
point X at random, uniformly on the circle. If X falls on the arc BC opposite to A
(where ABC is equilateral), then AX is longer than the sides of ABC. That arc covers one
third of the circle, so the requested
probability is 1/3.
[Figure E.1: The three ways of choosing a chord at random: a random second endpoint X (left), a random midpoint X inside the circle (middle), and a random midpoint X on a radius OA (right).]
Answer 2: 1/4
The second choice is illustrated in the middle part of Figure E.1. We choose the chord by
choosing its midpoint (e.g., X) at random inside the circle. The chord is longer than the
side of the inscribed equilateral triangle if and only if X falls inside the circle with half the
radius and the same center. The area of that smaller circle is one fourth of that of the
original circle, so the requested probability is 1/4.
Answer 3: 1/2
The third choice is illustrated in the right-most part of Figure E.1. We choose the chord by
choosing its midpoint (e.g., X) at random on a given radius OA of the circle. The chord is
longer than the side of the inscribed triangle if and only if the point is closer to the center
than half the radius. This happens with probability 1/2.
Appendix F
Simpson's Paradox
The point of this note is that proportions do not add up and that one has to be careful
with statistics.
Consider a university where 80% of the male applicants are accepted but only 51% of
the female applicants are accepted. You will be tempted to conclude that the university
discriminates against female applicants. However, a closer look at this university shows
that it has only two colleges with the admission records shown in the table.
Note that each college admits a larger fraction of female applicants than of male appli-
cants, so that the university cannot be accused of discrimination against the female students.
Appendix G
Familiar Distributions
We collect here the few distributions that we encounter repeatedly in the text.
G.1 Table
G.2 Examples
Here are typical random experiments that give rise to these distributions. We also comment
• Geometric: Number of flips until the first H. Memoryless. Holding time of a state of
a discrete-time Markov chain.
• Poisson: Number of photons that hit a given area in a given time interval. Limit of
B(n, p) as np = λ and n → ∞. The sum of independent P(λi) random variables is
P(Σ_i λi).
• Exponential: Time until the next photon hits. Memoryless. Holding time of a state of
a continuous-time Markov chain.
• Gaussian: Thermal noise. Sum of many small independent random variables (CLT).