This section treats some elementary ideas and concepts of set theory which are
necessary for a modern introduction to probability theory. Any well-defined list or
collection of objects is called a set; the objects comprising the set are called its
elements or members. We write p ∈ A if p is an element of the set A. We say that A is
a subset of B if every element of A also belongs to B. This is denoted by A ⊂ B.
Two sets are equal if each is contained in the other; that is, A = B if and only if A ⊂ B and B ⊂ A.
A = {1, 3, 5, 7, 9}
Let A and B be arbitrary sets. The union of A and B, denoted by A ∪ B, is the set
of elements which belong to A or to B:
A ∪ B = {x : x ∈ A or x ∈ B}.
A \ B = {x : x ∈ A, x ∉ B}.
A^c = {x : x ∈ Ω, x ∉ A}.
Two examples above are countable sets.
EX: Let I be the unit interval of real numbers, i.e. I = {x : 0 ≤ x ≤ 1}.
Then I is also an uncountable set.
This section explains the basic notions of combinatorial analysis and develops
the corresponding probabilistic background.
The product of the positive integers from 1 to n inclusive occurs very
often in mathematics and hence is denoted by the special symbol n!
(read "n factorial").
An arrangement of a set of n objects in a given order is called a permutation
of the objects (taken all at a time).
An arrangement of any r ≤ n of these objects in a given order is called an
r-permutation or a permutation of the n objects taken r at a time.
The number of permutations of n objects taken r at a time will be denoted
by P(n, r):
$$P(n, r) = \frac{n!}{(n - r)!}$$
In the special case that r = n, we have P(n, n) = n!.
The number of combinations of n objects taken r at a time will be denoted
by $\binom{n}{r}$:
$$\binom{n}{r} = \frac{n!}{r!\,(n - r)!}$$
EX: The combinations of the letters a, b, c, d taken 3 at a time are {a,b,c},
{a,b,d}, {a,c,d}, {b,c,d}, or simply abc, abd, acd, bcd.
That means we have
$$\binom{4}{3} = \frac{4!}{3!\,1!} = 4 \text{ cases.}$$
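The counting formulas above can be checked numerically. The following is a minimal sketch using Python's standard library; the variable names are illustrative and not part of the original notes.

```python
import math
from itertools import combinations, permutations

n, r = 4, 3
letters = "abcd"

# n! , P(n, r) = n!/(n-r)! and C(n, r) = n!/(r!(n-r)!)
n_factorial = math.factorial(n)
p_n_r = math.perm(n, r)
c_n_r = math.comb(n, r)

# Explicit listings, to compare against the formulas
listed_combinations = ["".join(c) for c in combinations(letters, r)]
listed_permutations = ["".join(p) for p in permutations(letters, r)]

print(n_factorial, p_n_r, c_n_r)
print(listed_combinations)
```

Listing the arrangements explicitly is only practical for small n, but it makes the difference between ordered and unordered selections visible.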
Then A ∪ C = {2, 3, 4, 5, 6}
B ∩ C = {3, 5}
C^c = {1, 4, 6}
Let Ω be a sample space, let T be the class of events and let P be a
real-valued function defined on T. Then P is called a probability function, and
P(A) is called the probability of the event A, if the following axioms hold:
(i) For every event A, 0 ≤ P(A) ≤ 1;
(ii) P(Ω) = 1;
(iii) If A and B are mutually exclusive events, then
P(A ∪ B) = P(A) + P(B).
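Now we can prove a number of theorems which follow directly from our axioms.
Case 1: If ∅ is the empty set, then P(∅) = 0.
Proof. Let A be any set. Then A and ∅ are disjoint and A ∪ ∅ = A, so by axiom (iii)
P(A) = P(A ∪ ∅) = P(A) + P(∅). Subtracting P(A) from both sides gives P(∅) = 0.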
EX: Let three coins be tossed and the number of heads observed; then
the sample space is Ω = {0, 1, 2, 3}. We obtain a probability space by the following assignment
$$P(0) = \frac{1}{8},\quad P(1) = \frac{3}{8},\quad P(2) = \frac{3}{8},\quad P(3) = \frac{1}{8}$$
since each probability is nonnegative and the sum of the probabilities is 1.
Let A be the event that at least one head appears and let B be the event
that all tails or all heads appear, i.e. A = {1, 2, 3} and B = {0, 3}.
By definition, we have
$$P(A) = P(1) + P(2) + P(3) = \frac{3}{8} + \frac{3}{8} + \frac{1}{8} = \frac{7}{8}$$
$$P(B) = P(0) + P(3) = \frac{1}{8} + \frac{1}{8} = \frac{1}{4}$$
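The assignment in this example can be recovered by enumerating the eight equally likely outcomes of three tosses. This is a small sketch; the names `p_heads`, `p_a` and `p_b` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes of three tosses
outcomes = list(product("HT", repeat=3))
p_outcome = Fraction(1, len(outcomes))

# Distribution of the number of heads
p_heads = {
    k: sum(p_outcome for o in outcomes if o.count("H") == k)
    for k in range(4)
}

# A = at least one head, B = all tails or all heads
p_a = sum(p for k, p in p_heads.items() if k >= 1)
p_b = p_heads[0] + p_heads[3]

print(p_heads, p_a, p_b)
```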
The probability P (A) of any event A is then the sum of the probabilities of
its points i.e.
P (A) = Σω∈A P (ω).
EX: Consider the sample space Ω = {1, 2, 3, ..., ∞} of the experiment of
tossing a coin till a head appears; here n denotes the number of times the
coin is tossed. A probability space is obtained by setting
$$P(1) = \frac{1}{2},\quad P(2) = \frac{1}{4},\quad \cdots,\quad P(n) = \frac{1}{2^n},\quad \cdots,\quad P(\infty) = 0$$
Let E be an arbitrary event in a sample space Ω with P(E) > 0. The
probability that an event A occurs once E has occurred or, in other words,
the conditional probability of A given E, written P(A|E), is defined as follows:
$$P(A|E) = \frac{P(A \cap E)}{P(E)}$$
EX: Let a pair of fair dice be tossed. If the sum is 6, find the probability
that one of the dice is a 2. In other words, if
E = {sum is 6} = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
A = {a 2 appears on at least one die}
find P(A|E).
E consists of 5 elements and two of them, (2,4) and (4,2), belong to A.
Then P(A ∩ E) = 2/36 and P(E) = 5/36, so P(A|E) = 2/5.
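The same conditional probability can be obtained by brute-force enumeration of the 36 outcomes. The sketch below applies the definition P(A|E) = P(A ∩ E)/P(E) directly; variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Equiprobable space of ordered pairs (first die, second die)
omega = list(product(range(1, 7), repeat=2))

E = [w for w in omega if sum(w) == 6]   # the sum is 6
A_and_E = [w for w in E if 2 in w]      # additionally, a 2 appears

p_e = Fraction(len(E), len(omega))
p_a_and_e = Fraction(len(A_and_E), len(omega))
p_a_given_e = p_a_and_e / p_e

print(E, p_a_given_e)
```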
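This result can be extended by induction as follows.
For any events A1, A2, ..., An,
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap \cdots \cap A_{n-1})$$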
EX: We have a box with w white marbles and r red marbles. Select
one marble from this box and then return it to the box together with x
additional marbles of the same color. Find the probability that all of the
marbles selected in the first four selections are white.
Solution: Let Wi be the event that a white marble is selected in the i-th selection, i = 1, ..., 4.
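Under the reading that each drawn marble is returned together with x extra marbles of its color, the multiplication theorem gives P(W1 ∩ W2 ∩ W3 ∩ W4) as a product of four ratios, the k-th being (w + kx)/(w + r + kx) for k = 0, 1, 2, 3. The function below is a sketch under that assumption; `prob_all_white` is a hypothetical helper name, and the numeric inputs are made-up illustrations.

```python
from fractions import Fraction

def prob_all_white(w, r, x, draws=4):
    """Probability that the first `draws` selections are all white,
    assuming each drawn marble goes back with x extra marbles of its color."""
    prob = Fraction(1)
    for k in range(draws):
        # After k white draws the box holds w + k*x white out of w + r + k*x
        prob *= Fraction(w + k * x, w + r + k * x)
    return prob

example = prob_all_white(3, 2, 1)
print(example)
```

With x = 0 the draws are independent and the result reduces to (w/(w + r))^4.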
In this equation, we use (1) to replace P (B) and use P (Ai ∩B) = P (Ai )P (B|Ai )
to replace P (Ai ∩ B), thus obtaining
Bayes' Theorem: Suppose A1, A2, ..., An is a partition of S and B is any
event. Then, for any i,
$$P(A_i|B) = \frac{P(A_i)P(B|A_i)}{P(A_1)P(B|A_1) + \cdots + P(A_n)P(B|A_n)}$$
EX: Box I has 2 white marbles and 3 yellow marbles. Box II has one white
marble and 2 yellow marbles. Take one marble from box I and put it into
box II. Then take one marble out of box II.
a) Find the probability that the second selection gives a white marble.
b) Suppose that the second selection gives a yellow marble. Find
the probability that the first selection gave a white one.
Solution. Let T1 be the event that we have a white marble for the
first selection,
T2 be the event that we have a white marble for the second selection,
V1 be the event that we have a yellow marble for the first selection,
V2 be the event that we have a yellow marble for the second selection.
a) By using the multiplication theorem for conditional probability,
$$P(T_2) = P(T_1)P(T_2|T_1) + P(V_1)P(T_2|V_1) = \frac{2}{5}\cdot\frac{2}{4} + \frac{3}{5}\cdot\frac{1}{4} = \frac{7}{20}$$
b) By using Bayes Theorem,
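Both parts of this example can be checked with exact fractions. The sketch below uses the event names T1, V1, T2, V2 from the solution; the conditional probabilities are read off from the box contents after the transfer (box II then holds 4 marbles).

```python
from fractions import Fraction

# First selection from box I (2 white, 3 yellow)
p_t1 = Fraction(2, 5)
p_v1 = Fraction(3, 5)

# Second selection from box II after the transfer
p_t2_given_t1 = Fraction(2, 4)   # box II: 2 white, 2 yellow
p_t2_given_v1 = Fraction(1, 4)   # box II: 1 white, 3 yellow

# a) total probability of a white marble on the second selection
p_t2 = p_t1 * p_t2_given_t1 + p_v1 * p_t2_given_v1

# b) Bayes' theorem: P(T1 | V2) = P(T1) P(V2 | T1) / P(V2)
p_v2 = 1 - p_t2
p_v2_given_t1 = 1 - p_t2_given_t1
p_t1_given_v2 = p_t1 * p_v2_given_t1 / p_v2

print(p_t2, p_t1_given_v2)
```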
P (A ∩ B) = P (A)P (B)
By using the above equation, we go to the definition of independence.
EX: Let a fair coin be tossed 3 times; we obtain the equiprobable space
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
Consider the events A = {the first toss is heads}, B = {the second toss is heads},
C = {exactly two heads are tossed in a row}.
Clearly A and B are independent events. We have
P(A) = P({HHH, HHT, HTH, HTT}) = 4/8 = 1/2
P(B) = P({HHH, HHT, THH, THT}) = 4/8 = 1/2
P(C) = P({HHT, THH}) = 2/8 = 1/4
Then P(A ∩ B) = P({HHH, HHT}) = 1/4,
P(A ∩ C) = P({HHT}) = 1/8 and P(B ∩ C) = P({HHT, THH}) = 1/4.
P(A)P(B) = 1/2 · 1/2 = 1/4 = P(A ∩ B), so A and B are independent;
P(A)P(C) = 1/2 · 1/4 = 1/8 = P(A ∩ C), so A and C are independent;
P(B)P(C) = 1/2 · 1/4 = 1/8 ≠ P(B ∩ C), so B and C are not independent.
Three events A, B and C are independent if
(i) P (A ∩ B) = P (A)P (B), P (A ∩ C) = P (A)P (C) and P (B ∩ C) = P (B)P (C)
i.e the events are pairwise independent, and
(ii) P (A ∩ B ∩ C) = P (A)P (B)P (C).
Condition (ii) does not follow from condition (i); in other words,
three events may be pairwise independent but not independent themselves.
EX: Let a pair of coins be tossed; here Ω = {HH, HT, TH, TT} is an
equiprobable space. Consider the events
A = {heads on the first coin} = {HH, HT}
B = {heads on the second coin} = {HH, TH}
C = {heads on exactly one coin} = {HT, TH}
Then P(A) = P(B) = P(C) = 2/4 = 1/2 and
P(A ∩ B) = P({HH}) = 1/4, P(A ∩ C) = P({HT}) = 1/4, P(B ∩ C) = P({TH}) = 1/4.
Thus condition (i) is satisfied, i.e. the events are pairwise independent.
However, A ∩ B ∩ C = ∅ and so
P(A ∩ B ∩ C) = P(∅) = 0 ≠ 1/8 = P(A)P(B)P(C).
In other words, condition (ii) is not satisfied and so the three events are
not independent.
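The distinction between pairwise and mutual independence can be verified mechanically. In the sketch below the events A and B are taken to be "heads on the first coin" and "heads on the second coin"; this is an assumption inferred from the intersections computed in the example.

```python
from fractions import Fraction

omega = ["HH", "HT", "TH", "TT"]

def prob(event):
    """Probability of an event in the equiprobable space omega."""
    return Fraction(len(event), len(omega))

A = {"HH", "HT"}   # assumed: heads on the first coin
B = {"HH", "TH"}   # assumed: heads on the second coin
C = {"HT", "TH"}   # heads on exactly one coin

# Condition (i): every pair factorizes
pairwise = all(
    prob(X & Y) == prob(X) * prob(Y)
    for X, Y in [(A, B), (A, C), (B, C)]
)
# Condition (ii): the triple intersection factorizes
mutual = prob(A & B & C) == prob(A) * prob(B) * prob(C)

print(pairwise, mutual)
```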
Definition 3. A random variable X on a sample space Ω is a function from Ω into
the set R of real numbers such that the preimage of every interval of R is an event of Ω.
We emphasize that if Ω is a discrete space in which every subset is an
event, then every real-valued function on Ω is a random variable. On the
other hand, it can be shown that if Ω is uncountable then certain real-valued
functions on Ω are not random variables.
If X and Y are random variables on the sample space Ω, then X + Y, X + k,
kX and XY (where k is a real number) are the functions on Ω defined by
(X + Y)(ω) = X(ω) + Y(ω)
(X + k)(ω) = X(ω) + k
(kX)(ω) = kX(ω)
(XY)(ω) = X(ω)Y(ω)
for every ω ∈ Ω. It can be shown that these are also random variables.
We denote
P(X = a) = P({ω ∈ Ω : X(ω) = a})
and P(a ≤ X ≤ b) = P({ω ∈ Ω : a ≤ X(ω) ≤ b}).
Now suppose X is a random variable on Ω with a countably infinite image
set, say X(Ω) = {x1, x2, ...}. Such random variables, together with those having
finite image sets, are called discrete random variables. In the finite case, we make
X(Ω) into a probability space by defining the probability of xi to be f(xi) =
P(X = xi), and call f the distribution of X.
The mean or expectation of X is
$$EX = \sum_i x_i f(x_i)$$
That is, EX is a weighted average of the possible values.
EX: A pair of fair dice is tossed. We obtain the finite equiprobable space Ω
consisting of the 36 ordered pairs of numbers between 1 and 6.
Let X assign to each point (a, b) in Ω the maximum of its numbers, i.e.
X(a, b) = max(a, b). Then the image set of the random variable X is X(Ω) =
{1, 2, 3, 4, 5, 6}.
The distribution of X is
f(1) = P(X = 1) = P({(1, 1)}) = 1/36
f(2) = P(X = 2) = P({(2, 1), (2, 2), (1, 2)}) = 3/36
f(3) = P(X = 3) = 5/36, f(4) = P(X = 4) = 7/36,
f(5) = P(X = 5) = 9/36, f(6) = P(X = 6) = 11/36.
This information is put in the form of a table as follows:
xi       1      2      3      4      5      6
f(xi)   1/36   3/36   5/36   7/36   9/36   11/36
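The distribution of the maximum, together with its mean and variance, can be computed by enumeration. This is a minimal sketch with exact fractions; the rounded values EX ≈ 4.47 and EX² ≈ 21.97 agree with the figures used later in the notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

# f(x) = P(max(a, b) = x)
counts = Counter(max(a, b) for a, b in omega)
f = {x: Fraction(counts[x], len(omega)) for x in sorted(counts)}

mean = sum(x * p for x, p in f.items())
second_moment = sum(x * x * p for x, p in f.items())
variance = second_moment - mean ** 2

print(f)
print(float(mean), float(second_moment), float(variance))
```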
(iv) If X ≥ Y then EX ≥ EY
(v) |EX| ≤ E|X|
is called the nth central moment of X.
In particular, for n = 2, E(X − µX)² is called the variance of X.
The variance of X, denoted by VarX, is
$$\mathrm{Var}X = \sum_i (x_i - \mu_X)^2 f(x_i) = E(X - \mu_X)^2 = EX^2 - \mu_X^2$$
Hence, VarX = EX² − µX² = 21.97 − (4.47)² = 1.99
and σX = √1.99 = 1.4.
Remark 1. There is a physical interpretation of mean and variance. Suppose that
at each point xi on the x axis there is placed a unit with mass f(xi). Then the mean is
the center of gravity of the system, and the variance is the moment of inertia of
the system.
Remark 2. Many random variables give rise to the same distribution; hence we
frequently speak of the mean, variance and standard deviation of a distribution instead
of the underlying random variable.
Remark 3. Let X be a random variable with mean µ and standard deviation σ > 0.
The standardized random variable X* is defined as
$$X^* = \frac{X - \mu}{\sigma}$$
Ex: Show that E(X*) = 0 and Var(X*) = 1.
Suppose that X is a random variable whose image set X(Ω) is a continuum of
numbers such as an interval. Recall from the definition of random variables that the set
{ω ∈ Ω : a ≤ X(ω) ≤ b} is an event in Ω and so the probability P(a ≤ X ≤ b) is well
defined. We assume that there is a piecewise continuous function f : R → R such that
P(a ≤ X ≤ b) is equal to the area under the graph of f between x = a and x = b:
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
when it exists.
The variance is defined by
$$\mathrm{Var}X = E(X - \mu_X)^2 = \int_{\mathbb{R}} (x - \mu_X)^2 f(x)\,dx$$
when it exists.
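Just as in the discrete case, VarX exists if and only if µX = EX and EX² both
exist, and then
$$\mathrm{Var}X = EX^2 - \mu_X^2 = \int_{\mathbb{R}} x^2 f(x)\,dx - \mu_X^2$$
$$EX = \int_{\mathbb{R}} x f(x)\,dx = \int_0^2 x \cdot \frac{1}{2}x\,dx = \int_0^2 \frac{x^2}{2}\,dx = \left[\frac{x^3}{6}\right]_0^2 = \frac{4}{3}$$
$$EX^2 = \int_{\mathbb{R}} x^2 f(x)\,dx = \int_0^2 x^2 \cdot \frac{1}{2}x\,dx = \int_0^2 \frac{x^3}{2}\,dx = \left[\frac{x^4}{8}\right]_0^2 = 2$$
$$\mathrm{Var}X = EX^2 - \mu^2 = 2 - \frac{16}{9} = \frac{2}{9} \quad\text{and}\quad \sigma_X = \sqrt{\frac{2}{9}} = \frac{\sqrt{2}}{3}$$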
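The integrals in this computation can be approximated numerically as a sanity check. The sketch below assumes the density f(x) = x/2 on [0, 2] (zero elsewhere) and uses a plain midpoint rule; `integrate` is a hypothetical helper written for this illustration, not a library routine.

```python
def density(x):
    """Assumed density: f(x) = x/2 on [0, 2], zero elsewhere."""
    return 0.5 * x if 0.0 <= x <= 2.0 else 0.0

def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

total = integrate(density, 0.0, 2.0)                        # should be 1
mean = integrate(lambda x: x * density(x), 0.0, 2.0)        # EX
second = integrate(lambda x: x * x * density(x), 0.0, 2.0)  # EX^2
variance = second - mean ** 2

print(total, mean, second, variance)
```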
A finite number of continuous random variables, say X, Y, ..., Z, are said to be
independent if for any intervals [a, a′], [b, b′], ..., [c, c′]
P(a ≤ X ≤ a′, b ≤ Y ≤ b′, ..., c ≤ Z ≤ c′) = P(a ≤ X ≤ a′) P(b ≤ Y ≤ b′) ··· P(c ≤ Z ≤ c′)
F(a) = P(X ≤ a)
If X is a discrete random variable with distribution f, then F is the "step
function" defined by
$$F(x) = \sum_{x_i \le x} f(x_i)$$
Observe that F has a step at each xi, with height f(xi).
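EX: Let X be a continuous random variable with the following distribution
$$f(x) = \begin{cases} \frac{1}{2}x & \text{if } 0 \le x \le 2 \\ 0 & \text{elsewhere} \end{cases}$$
The cumulative distribution function F is
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{4}x^2 & \text{if } 0 \le x \le 2 \\ 1 & \text{if } x > 2 \end{cases}$$
Here we use the fact that for 0 ≤ x ≤ 2
$$F(x) = \int_0^x \frac{1}{2}t\,dt = \frac{1}{4}x^2$$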
Let X and Y be random variables on a sample space Ω with respective image sets
X(Ω) = {x1, x2, ..., xn} and Y(Ω) = {y1, y2, ..., ym}.
We make the product set
X(Ω) × Y(Ω) = {(x1, y1), (x1, y2), ..., (xn, ym)}
into a probability space by defining the probability of the ordered pair (xi, yj) to be
P(X = xi, Y = yj), which is written h(xi, yj). Then h is called the joint distribution
or joint probability function of X and Y and is usually given in the form of a table.
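The functions f and g above are defined by
$$f(x_i) = \sum_{j=1}^{m} h(x_i, y_j) \quad\text{and}\quad g(y_j) = \sum_{i=1}^{n} h(x_i, y_j)$$
Both f and g are called marginal distributions; they are the distributions of
X and Y, respectively. If X and Y have joint distribution h and respective means µX and µY,
the covariance of X and Y, denoted by Cov(X, Y), is defined by
$$\mathrm{Cov}(X, Y) = \sum_{i,j} (x_i - \mu_X)(y_j - \mu_Y)\,h(x_i, y_j) = E(X - \mu_X)(Y - \mu_Y)$$
or equivalently by
$$\mathrm{Cov}(X, Y) = \sum_{i,j} x_i y_j\,h(x_i, y_j) - \mu_X\mu_Y = E(XY) - \mu_X\mu_Y$$
The correlation of X and Y, denoted by ρ(X, Y), is defined by
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$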
Some properties of ρ:
(i) ρ(X, Y) = ρ(Y, X)
(ii) −1 ≤ ρ ≤ 1
(iii) ρ(X, X) = 1
(iv) ρ(aX + b, cY + d) = ρ(X, Y) if ac > 0
EX: Let X and Y be random variables with the following joint distribution
Distribution of X and Y
            Y = 4    Y = 10
X = 1        0        1/2
X = 3       1/2        0
We compute EXY and µX, µY:
EXY = 1 · 4 · 0 + 1 · 10 · 1/2 + 3 · 4 · 1/2 + 3 · 10 · 0 = 11
µX = EX = 1 · 1/2 + 3 · 1/2 = 2
µY = EY = 4 · 1/2 + 10 · 1/2 = 7
Then Cov(X, Y) = E(XY) − µXµY = 11 − 2 · 7 = −3
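The covariance computed above, and the corresponding correlation, can be reproduced from the joint distribution. In this sketch the joint probabilities are the ones implied by the terms of EXY in the example; the dictionary name `h` mirrors the notation of the notes.

```python
import math
from fractions import Fraction

# Joint distribution h(x, y), read off from the terms of E(XY) in the example
h = {
    (1, 4): Fraction(0), (1, 10): Fraction(1, 2),
    (3, 4): Fraction(1, 2), (3, 10): Fraction(0),
}

mu_x = sum(x * p for (x, _), p in h.items())
mu_y = sum(y * p for (_, y), p in h.items())
e_xy = sum(x * y * p for (x, y), p in h.items())
cov = e_xy - mu_x * mu_y

var_x = sum((x - mu_x) ** 2 * p for (x, _), p in h.items())
var_y = sum((y - mu_y) ** 2 * p for (_, y), p in h.items())
rho = float(cov) / math.sqrt(var_x * var_y)

print(mu_x, mu_y, e_xy, cov, rho)
```

Here ρ comes out as −1, which reflects that Y is an exact decreasing linear function of X in this table.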
Remark. The notion of a joint distribution h is extended to any finite number of
random variables X, Y, ..., Z in the obvious way, that is, h(xi, yj, ..., zk) = P(X =
xi, Y = yj, ..., Z = zk).
A finite number of random variables X, Y, ..., Z on a sample space Ω are said to
be independent if
P(X = xi, Y = yj, ..., Z = zk) = P(X = xi) P(Y = yj) ··· P(Z = zk)
for all values xi, yj, ..., zk. In particular, X and Y are independent if
P(X = xi, Y = yj) = P(X = xi) P(Y = yj)
Now, if X and Y have respective distributions f and g, and joint distribution h, then
the above equation can be written as
h(xi, yj) = f(xi) g(yj)
Theorem 5. Let X be a random variable with mean µ and standard deviation σ. Then
for every ε > 0
$$P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$$
Proof. We begin with the definition of variance
$$\sigma^2 = \mathrm{Var}X = \sum_i (x_i - \mu)^2 f(x_i)$$
We delete all the terms in the above series for which |xi − µ| < ε. This does not increase
the value of the series, since all its terms are nonnegative; that is
$$\sigma^2 \ge \sum_i{}^{*}\,(x_i - \mu)^2 f(x_i)$$
where the asterisk indicates that the summation extends only over those i for which
|xi − µ| ≥ ε. Thus this new summation does not increase in value if we replace each
|xi − µ| by ε; that is
$$\sigma^2 \ge \sum_i{}^{*}\,\varepsilon^2 f(x_i) = \varepsilon^2 \sum_i{}^{*} f(x_i)$$
But Σ* f(xi) is equal to the probability that |X − µ| ≥ ε; hence
$$\sigma^2 \ge \varepsilon^2 P(|X - \mu| \ge \varepsilon)$$
Dividing by ε² we get the desired inequality.
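The inequality can be illustrated on a concrete discrete distribution. The sketch below uses a fair die as an assumed example (it is not taken from the notes) and compares the exact tail probability with the bound σ²/ε² for a few values of ε.

```python
from fractions import Fraction

# Assumed illustration: X is the face shown by a fair die
f = {x: Fraction(1, 6) for x in range(1, 7)}

mu = sum(x * p for x, p in f.items())
var = sum((x - mu) ** 2 * p for x, p in f.items())

def tail(eps):
    """Exact P(|X - mu| >= eps)."""
    return sum(p for x, p in f.items() if abs(x - mu) >= eps)

# Pairs (exact tail probability, Chebyshev bound) for several eps
checks = {
    eps: (tail(eps), var / eps ** 2)
    for eps in (Fraction(1, 2), Fraction(3, 2), Fraction(5, 2), 3)
}

for eps, (exact, bound) in checks.items():
    print(eps, exact, bound)
```

The bound is often far from tight; for small ε it can even exceed 1, in which case it carries no information.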
Consider an experiment which has 2 outcomes, one called success and the other
failure. Let p be the probability of success, so that q = 1 − p is the probability of failure.
Let X be the number of successes when the experiment is performed once.
Repeat the experiment in A.1 n times, the repetitions being independent.
Let S be the number of successes in the n repetitions. Then
$$P_k = P(S = k) = \binom{n}{k} p^k q^{n-k}, \quad k = 0, 1, \ldots, n$$
We denote S ∼ b(n, p).
We can calculate
EX: Let a die be tossed 20 times. Let S be the number of times the face 1 is obtained
in the 20 tosses.
a) Find the distribution of S.
b) Find the probability that in 20 tosses the face 1 is obtained
exactly 3 times.
a) S ∼ b(20, 1/6)
b) $$P(S = 3) = \binom{20}{3}\left(\frac{1}{6}\right)^{3}\left(\frac{5}{6}\right)^{17}$$
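The binomial probability in part b) is easy to evaluate numerically. The following sketch defines a small helper, `binomial_pmf`, which is a name chosen here for illustration; it also checks that the probabilities sum to 1 and that the mean equals np.

```python
import math

def binomial_pmf(k, n, p):
    """P(S = k) for S ~ b(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 1 / 6
p3 = binomial_pmf(3, n, p)
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

print(round(p3, 4), total, mean)
```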
$$P_k = P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Then we say that X has a Poisson distribution, denoted by X ∼ Poisson(λ).
We can calculate
This countably infinite distribution appears in many natural phenomena such as the
number of telephone calls per minute at some switchboard, the number of α particles
emitted by a radioactive substance.
Let a coin be tossed until we get a head, and let p and q denote the respective
probabilities of success (head) and failure (tail). Let X be the number of tosses. Then X takes
the values 1, 2, 3, ...
P(X = 1) = P(H) = p
P(X = 2) = P(TH) = P(T)P(H) = qp
P(X = 3) = P(TTH) = P(T)P(T)P(H) = q²p
and in general
P(X = k) = q^(k−1) p
The binomial distribution is generalized as follows. Suppose the sample space of an
experiment is partitioned into s mutually exclusive events A1, A2, ..., As with respective
probabilities p1, p2, ..., ps. Hence p1 + p2 + ··· + ps = 1. Then, in n repeated trials, the
probability that A1 occurs k1 times, A2 occurs k2 times, ..., and As occurs ks times is
$$\frac{n!}{k_1!\,k_2! \cdots k_s!}\; p_1^{k_1} p_2^{k_2} \cdots p_s^{k_s}, \qquad k_1 + k_2 + \cdots + k_s = n$$
The above numbers form the so-called multinomial distribution, since they are
precisely the terms in the expansion of (p1 + p2 + ··· + ps)^n. If s = 2 then we obtain the
binomial distribution, which is discussed at the beginning of the chapter.
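EX: A fair die is tossed 8 times. The probability of obtaining the faces 5 and 6
twice each and each of the other faces once is
$$\frac{8!}{2!\,2!\,1!\,1!\,1!\,1!}\left(\frac{1}{6}\right)^{2}\left(\frac{1}{6}\right)^{2}\frac{1}{6}\cdot\frac{1}{6}\cdot\frac{1}{6}\cdot\frac{1}{6} = \frac{35}{5832} \approx 0.006$$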
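The multinomial probability in this example can be evaluated exactly. The sketch below is one straightforward way to do it; `multinomial_prob` is a hypothetical helper name.

```python
from fractions import Fraction
from math import factorial

def multinomial_prob(counts, probs):
    """n!/(k1!...ks!) * p1^k1 ... ps^ks for the given counts and probabilities."""
    coeff = factorial(sum(counts))
    for k in counts:
        coeff //= factorial(k)       # stays an integer at every step
    prob = Fraction(coeff)
    for k, p in zip(counts, probs):
        prob *= p ** k
    return prob

# Faces 1-4 once each, faces 5 and 6 twice each, in 8 tosses of a fair die
counts = [1, 1, 1, 1, 2, 2]
probs = [Fraction(1, 6)] * 6
result = multinomial_prob(counts, probs)

print(result, float(result))
```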
A box contains N marbles. Of these, M are drawn at random, marked and returned
to the box. Next, n marbles are drawn at random from the box and the marked marbles are
counted. If X denotes the number of marked marbles, then
$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$
It is easy to check that
$$EX = \frac{nM}{N}, \qquad \mathrm{Var}X = \frac{nM(N - M)(N - n)}{N^2(N - 1)}$$
EX: A lot consisting of 50 bulbs is inspected by taking at random 10 bulbs and
testing them. If the number of defective bulbs is at most 1, the lot is accepted; otherwise,
it is rejected. If there are, in fact, 10 defective bulbs in the lot, the probability of accepting
the lot is
$$\frac{\binom{10}{1}\binom{40}{9} + \binom{40}{10}}{\binom{50}{10}} = 0.3487$$
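The acceptance probability can be recomputed from the hypergeometric formula. In the sketch below, `hypergeom_pmf` is an illustrative helper; its arguments follow the notation N (lot size), M (marked or defective items) and n (sample size) used above.

```python
from math import comb

def hypergeom_pmf(x, N, M, n):
    """P(X = x): x marked items in a sample of n drawn from N items, M of them marked."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

N, M, n = 50, 10, 10

# The lot is accepted if at most one defective bulb is found
p_accept = hypergeom_pmf(0, N, M, n) + hypergeom_pmf(1, N, M, n)
mean = sum(x * hypergeom_pmf(x, N, M, n) for x in range(n + 1))

print(round(p_accept, 4), mean)
```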
A random variable X is said to have a uniform distribution on the interval [a, b] if
its probability density function is given by
$$f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{elsewhere} \end{cases}$$
We will write X ∼ U[a, b] if X has the uniform distribution on [a, b].
A random variable X is said to have exponential distribution with positive parameter
λ(λ > 0) if its p.d.f is given by
We will write X ∼ N(µ, σ²) if X has the normal distribution. This function is one of
the most important examples of a continuous probability distribution.
We can calculate
EX: Given a random variable Y which has normal distribution, Y ∼ N (2, 9). Find
Table A.3 indicates the area under the standard normal curve corresponding
to P(Z < z), where Z is the standard normal random variable. To
illustrate the use of Table A.3, let us find the probability that Z is less than 1.74. First,
we locate a value of z equal to 1.7 in the left column, then move across the row to the
column under 0.04, where we read 0.9591. Therefore, P(Z < 1.74) = 0.9591.
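When no table is at hand, the same area can be computed from the error function, since P(Z < z) = (1 + erf(z/√2))/2 for a standard normal Z. The sketch below uses only the standard library; `std_normal_cdf` is a name chosen for illustration.

```python
import math

def std_normal_cdf(z):
    """P(Z < z) for a standard normal random variable Z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p = std_normal_cdf(1.74)
print(round(p, 4))
```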
The integral
$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx$$
converges or diverges according as α > 0 or α ≤ 0. For α > 0, this integral is called the
Gamma function.
Some properties of the Gamma function:
(i) Γ(1) = 1; Γ(1/2) = √π
(ii) Γ(α + 1) = αΓ(α)
(iii) Γ(n + 1) = n!
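A random variable X is said to have a Gamma distribution if its p.d.f. is given by
$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta} & x > 0 \\ 0 & x \le 0 \end{cases}$$
where α, β > 0.
We will write X ∼ Gamma(α, β) if X has a Gamma distribution. In particular, if
α = r/2 and β = 2, we say that X has a Chi-Square distribution, whose p.d.f. is
$$f(x) = \begin{cases} \dfrac{1}{\Gamma(r/2)\,2^{r/2}}\, x^{r/2 - 1} e^{-x/2} & x > 0 \\ 0 & x \le 0 \end{cases}$$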
What do we mean by statistics?
The outcome of a statistical experiment may be recorded either as a numerical value
or as a descriptive representation. When a pair of dice are tossed and the total is
the outcome of interest, we record a numerical value. However, if the students of a
certain school are given blood tests and the type of blood is of interest, then a descriptive
representation might be the most useful. A person's blood can be classified in 8 ways. It
must be AB, A, B or O, with a plus or minus sign depending on the presence or absence
of the Rh antigen.
Definition 10. Let X be a random variable with distribution function F and let
X1, X2, ..., Xn be independent, identically distributed random variables with common
distribution F. Then the collection X1, X2, ..., Xn is known as a random sample of size n
from the distribution function F, or simply as n independent observations on X.
If X1, X2, ..., Xn is a random sample from F, their joint distribution function is given by
$$F(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} F(x_i)$$
is called the sample variance. Moreover, the sample standard deviation, denoted
by S, is the positive square root of the sample variance.
It should be noted that the sample statistics X̄, S and S² (and others that will
be defined later) are random variables, while the parameters µ, σ² and so on are fixed
constants that may be unknown.
Example 12. (Coffee prices.) A comparison of coffee prices at 4 randomly selected
grocery stores in San Diego showed increases from the previous month of 12, 15, 17 and
20 cents for a pound bag. Find the variance of this random sample of price increases.
Solution. Calculating the sample mean, we get
$$\bar{X} = \frac{12 + 15 + 17 + 20}{4} = 16 \text{ cents.}$$
Therefore,
$$s^2 = \frac{\sum_{i=1}^{4} (x_i - 16)^2}{3} = \frac{34}{3}.$$
Whereas the expression for the sample variance in the above definition best illustrates
that S² is a measure of variability, an alternative expression does have some merit and
thus we should be aware of it.
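The computation in Example 12 can be repeated with exact arithmetic. The sketch below also evaluates one common alternative form of the sample variance, s² = [nΣxi² − (Σxi)²] / [n(n − 1)]; whether this is the exact alternative expression the notes go on to state is an assumption.

```python
from fractions import Fraction

data = [12, 15, 17, 20]   # price increases in cents
n = len(data)

mean = Fraction(sum(data), n)

# Definition: sum of squared deviations divided by n - 1
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)

# Assumed alternative (computational) form
s2_alt = (n * sum(x * x for x in data) - sum(data) ** 2) / Fraction(n * (n - 1))

print(mean, s2, s2_alt)
```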
Definition 13. (STATISTIC) Any function of the random variables constituting a
random sample is called a statistic.
(v) ES 2 = σ 2 . This is precisely the reason why we call S 2 the sample variance.
(vi) $$\mathrm{Var}(S^2) = \frac{E(X - \mu)^4}{n} + \frac{3 - n}{n(n - 1)}\left[E(X - \mu)^2\right]^2$$
$$X_1 + X_2 + \cdots + X_n \sim N(n\mu, n\sigma^2)$$
$$\frac{1}{n}(X_1 + X_2 + \cdots + X_n) \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Theorem 15. If X ∼ N(0, 1) then X² ∼ χ²(1).
Definition 17. Let X ∼ N(0, 1) and Y ∼ χ²(n), and let X and Y be independent. Then
the statistic
$$T = \frac{X}{\sqrt{Y/n}}$$
is said to have a t-distribution with n degrees of freedom (d.f.) and we write T ∼ t(n).
Its p.d.f. is
$$f_n(t) = \frac{\Gamma\left[\frac{n+1}{2}\right]}{\Gamma\left(\frac{n}{2}\right)\sqrt{n\pi}}\left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}, \qquad -\infty < t < +\infty$$
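Definition 19. Let X and Y be independent χ² random variables with m and n d.f.,
respectively. The random variable
$$F = \frac{X/m}{Y/n}$$
is said to have an F-distribution with (m, n) d.f. Its p.d.f. is
$$g(f) = \begin{cases} \dfrac{\Gamma\left[\frac{m+n}{2}\right]}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)}\,\dfrac{m}{n}\left(\dfrac{m}{n}f\right)^{\frac{m}{2} - 1}\left(1 + \dfrac{m}{n}f\right)^{-\frac{m+n}{2}} & f > 0 \\ 0 & f \le 0 \end{cases}$$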
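By some calculations, we obtain the following statements for a random sample from N(µ, σ²):
(i) X̄ and S² are independent;
(ii) $$\frac{(n - 1)S^2}{\sigma^2} \sim \chi^2(n - 1)$$
(iii) The distribution of $\frac{\sqrt{n}(\bar{X} - \mu)}{S}$ is t(n − 1).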
Consider a random sample X1, ..., Xn from a distribution that involves a parameter
θ whose value is unknown and must be estimated. In a problem of this type, it is
desirable to use an estimator θ̂(X1, ..., Xn) that, with high probability, will be close to θ.
This leads to the following definition. An estimator θ̂(X1, ..., Xn) is an unbiased
estimator of a parameter θ if
E θ̂ = θ
for every possible value of θ.
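EX: Let (X1, ..., Xn) be a random sample from a random variable X ∼ Poisson(θ), θ > 0.
We know that
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
It is very easy to see that
$$E\bar{X} = E\left[\frac{X_1 + \cdots + X_n}{n}\right] = \frac{1}{n}\sum_{i=1}^{n} EX_i = \frac{1}{n}\sum_{i=1}^{n} \theta = \frac{n\theta}{n} = \theta$$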
Let S be a random variable which has the binomial distribution b(n, p). By some simple
calculation, we have
ES = np
From this, we can infer that E(S/n) = p.
Therefore, S/n is an unbiased estimator of p and is denoted by p̂. Moreover, by the
De Moivre-Laplace theorem, for large n, approximately
$$Z = \frac{S - np}{\sqrt{npq}} \sim N(0, 1)$$
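Given a positive number c with P(|Z| < c) = α, the inequality |Z| < c
is equivalent to
$$\hat{p} - c\sqrt{\frac{pq}{n}} < p < \hat{p} + c\sqrt{\frac{pq}{n}}$$
where p̂ = S/n.
One recommended solution is to replace p by its unbiased estimator p̂:
$$\hat{p} - c\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < p < \hat{p} + c\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \qquad\text{where } \hat{p} = \frac{S}{n}$$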
EX: Make 95% and 99% confidence intervals for p, where p is the proportion of families in
Ho Chi Minh City having a washing machine. Interviewing 100 families at random, we find that
60 families have a washing machine.
Solve. Let S be the number of families having a washing machine among the 100 interviewed
families, S ∼ b(100, p), where p is an unknown parameter.
We know that S = 60
and $$\hat{p} = \frac{S}{n} = \frac{60}{100} = 0.6$$
We have
$$P\left(\left|\frac{S - np}{\sqrt{npq}}\right| < c\right) = P(|Z| < c) \qquad\text{where } Z \sim N(0, 1)$$
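The two intervals asked for in this example can be computed as follows. This is a sketch of the plug-in interval p̂ ± c·sqrt(p̂(1 − p̂)/n) with c taken from the standard normal quantile function in Python's `statistics` module; `proportion_ci` is a hypothetical helper name, and the rounded endpoints are computed here rather than quoted from the notes.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, level):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    # c satisfies P(|Z| < c) = level for a standard normal Z
    c = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = c * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

ci95 = proportion_ci(60, 100, 0.95)
ci99 = proportion_ci(60, 100, 0.99)

print([round(v, 3) for v in ci95])
print([round(v, 3) for v in ci99])
```

As expected, the 99% interval is wider than the 95% interval.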
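EX: The height of a student of one university has normal distribution N(µ, σ0²),
where σ0 = 10 cm. Make a 95% confidence interval for µ, knowing that when measuring
100 students, we have
$$\bar{X} = \frac{1}{100}\sum_{i=1}^{100} X_i = 158.6 \text{ cm}$$
Solve. We have
$$P\left(\left|\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma_0}\right| < c\right) = 0.95$$
This equation is equivalent to
2Φ(c) = 0.95
By using the Laplace table, we can infer c = 1.96. The confidence interval for µ is
$$\bar{X} - c\frac{\sigma_0}{\sqrt{n}} < \mu < \bar{X} + c\frac{\sigma_0}{\sqrt{n}}$$
$$158.6 - \frac{1.96 \cdot 10}{\sqrt{100}} < \mu < 158.6 + \frac{1.96 \cdot 10}{\sqrt{100}}$$
that is, 156.64 < µ < 160.56.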
MAL DISTRIBUTION(where both µ and σ0 are unknown )
We know that the confidence interval for µ is
$$\bar{X} - c\frac{\sigma_0}{\sqrt{n}} < \mu < \bar{X} + c\frac{\sigma_0}{\sqrt{n}}$$
⇔ P(|t_{n−1}| < c) = α
where t_{n−1} is a Student random variable.
We can get the value of c from the Student table.
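EX: The height of a student of one university has normal distribution N(µ, σ²),
where µ and σ are unknown. Make a 95% confidence interval for µ, knowing that when
measuring 10 students, we have
$$\bar{X} = 158.6 \quad\text{and}\quad S^2 = \frac{1}{9}\sum_{i=1}^{10} (X_i - \bar{X})^2 = 100$$
We have
$$P\left(\left|\frac{\sqrt{n}(\bar{X} - \mu)}{S}\right| < c\right) = \alpha \iff P(|t_9| < c) = \alpha$$
By using the Student table, we can infer c = 2.2622. Therefore, we have the confidence
interval for µ
$$\bar{X} - c\frac{S}{\sqrt{n}} < \mu < \bar{X} + c\frac{S}{\sqrt{n}}$$
$$158.6 - \frac{2.2622 \cdot 10}{\sqrt{10}} < \mu < 158.6 + \frac{2.2622 \cdot 10}{\sqrt{10}}$$
that is, approximately 151.45 < µ < 165.75.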
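Both height examples follow the same pattern: centre ± c · (spread / √n), with c read from the Laplace table when σ0 is known and from the Student table when S is used instead. The sketch below takes the critical values 1.96 and 2.2622 from the text rather than computing them; `mean_ci` is an illustrative helper name.

```python
import math

def mean_ci(xbar, spread, n, c):
    """Interval xbar -/+ c * spread / sqrt(n); spread is sigma0 or S."""
    half_width = c * spread / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Known sigma0 = 10, n = 100, c = 1.96 (normal table)
known_sigma = mean_ci(158.6, 10, 100, 1.96)

# Unknown sigma, S = 10, n = 10, c = 2.2622 (Student table, 9 d.f.)
unknown_sigma = mean_ci(158.6, 10, 10, 2.2622)

print([round(v, 2) for v in known_sigma])
print([round(v, 2) for v in unknown_sigma])
```

The second interval is much wider, both because the sample is smaller and because the Student critical value exceeds the normal one.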
In this section, we shall consider statistical problems involving a parameter θ whose
value is unknown but must lie in a certain parameter space Ω. We shall suppose that
Ω can be partitioned into 2 disjoint subsets Ω0 and Ω1 and that the statistician must
decide whether the unknown value of θ lies in Ω0 or in Ω1.
Let H0 denote the hypothesis that θ ∈ Ω0, and let H1 denote the hypothesis that θ ∈ Ω1.
Since the subsets Ω0 and Ω1 are disjoint and Ω0 ∪ Ω1 = Ω, exactly one of the hypotheses
H0 and H1 must be true. The statistician must decide whether to accept the hypothesis
H0 or to accept the hypothesis H1. Accepting H0 is equivalent to rejecting H1, and
accepting H1 is equivalent to rejecting H0. A problem of this type is called a problem
of testing hypotheses. H0 is called the null hypothesis and H1 is called the alternative hypothesis.
First, the test might result in the rejection of the null hypothesis H0 when, in fact,
H0 is true. This result is called an error of type 1.
Second, the test might result in the acceptance of the null hypothesis H0 when, in
fact, the alternative hypothesis H1 is true. This result is called an error of type 2. Let
α denote the probability of an error of type 1 and β denote the probability of an error of
type 2. Thus
α = P(rejecting H0 | H0 is true)
β = P(accepting H0 | H0 is false)
It is desirable to find a test procedure for which the probabilities α and β of the two
types of error will be small. Ordinarily, the smaller α is, the greater β is.
Therefore, the usual criterion is to fix α at a given value, which is called the level of
significance, and to look for a test which makes β a minimum.
This is called the Neyman-Pearson criterion.
Testing problem
H0 : p = p0
H1 : p ≠ p0
where p0 is the given ratio. Denote by p̂ the unbiased estimator of p, namely p̂ = S/n.
We know that
$$Z = \frac{S - np}{\sqrt{npq}} = \frac{\sqrt{n}(\hat{p} - p)}{\sqrt{pq}} \sim N(0, 1)$$
is a random variable which has the standard normal distribution.
By replacing p = p0 (H0 is true), we get
$$Z = \frac{\sqrt{n}(\hat{p} - p_0)}{\sqrt{p_0 q_0}}$$
Then
α = P(|Z| ≥ c | p = p0)
= 1 − P(|Z| < c)
= 1 − 2Φ(c)
By using the Laplace table, we can infer c.
EX: A survey of 100 random families in Ho Chi Minh City gives the result that
38 families have a washing machine. Can the statement "30% of families have a washing
machine" be accepted at the significance levels 10% and 5%?
Solve. Testing problem
H0 : p = p0 = 0.3
H1 : p ≠ p0
The statistic
$$Z = \frac{\sqrt{n}(\hat{p} - p_0)}{\sqrt{p_0 q_0}} = \frac{\sqrt{100}\left(\frac{38}{100} - 0.3\right)}{\sqrt{0.3 \cdot 0.7}} = 1.75$$
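The decision at each significance level follows by comparing |Z| with the two-sided critical value c, rejecting H0 when |Z| ≥ c. The sketch below obtains c from the standard normal quantile function instead of a printed table; the helper name `proportion_z` and the resulting decisions are computed here and are not quoted from the notes.

```python
import math
from statistics import NormalDist

def proportion_z(successes, n, p0):
    """Test statistic sqrt(n) (p_hat - p0) / sqrt(p0 q0)."""
    p_hat = successes / n
    return math.sqrt(n) * (p_hat - p0) / math.sqrt(p0 * (1 - p0))

z = proportion_z(38, 100, 0.3)

# Reject H0 at level alpha when |Z| >= c, where P(|Z| >= c) = alpha
reject = {
    alpha: abs(z) >= NormalDist().inv_cdf(1 - alpha / 2)
    for alpha in (0.10, 0.05)
}

print(round(z, 2), reject)
```

Under these assumptions H0 is rejected at the 10% level (c ≈ 1.645) but not at the 5% level (c ≈ 1.96).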
A well testing criterion
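H0 : µ = µ0
H1 : µ ≠ µ0
As we know,
$$Z = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma_0} \sim N(0, 1)$$
By replacing µ = µ0 (H0 is true), we get
$$Z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma_0}$$
A good testing criterion is: reject H0 if |Z| ≥ c, where c satisfies 1 − 2Φ(c) = α.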
EX: Let X be the yield of rice on 1 ha. Let X1, ..., X25 be a sample from N(µ, 9)
with X̄ = 4.3 ton/ha. Accept or reject the null hypothesis µ = 5 ton/ha at the level of
significance α = 5%.
Solve. H0 : µ = 5
H1 : µ ≠ 5
$$|Z| = \left|\frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma_0}\right| = \left|\frac{\sqrt{25}(4.3 - 5)}{3}\right| = 1.17$$
With the level of significance α = 5%, we have
α = 5% = P(reject H0 | H0 is true)
= P(|Z| ≥ c | H0 is true)
= 1 − 2Φ(c)
By using the Laplace table, we can infer c = 1.96. Clearly, we have |Z| < c.
Therefore, the null hypothesis H0 is accepted at the level of significance α = 5%.
X ∼ N(µ, σ²) (where σ² is an unknown parameter).
Testing problem
H0 : µ = µ0
H1 : µ ≠ µ0
Let X1, ..., Xn be a sample from the normal distribution N(µ, σ²). Then
$$t = \frac{\sqrt{n}(\bar{X} - \mu)}{S} \sim t_{n-1}$$
By replacing µ = µ0 (H0 is true), we have
$$t = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S}$$
A well testing criterion
We shall now consider a problem of testing hypotheses which uses the F distribution.
Suppose that the random variables X1, ..., Xm form a random sample of m observations
from a normal distribution for which both the mean µ1 and the variance σ1² are unknown,
and the random variables Y1, ..., Yn form a random sample of n observations from a
normal distribution for which both the mean µ2 and the variance σ2² are unknown.
Suppose finally that the following hypotheses are to be tested at a specified level of
significance α0 (0 < α0 < 1), i.e.
H0 : σ1² ≤ σ2²
H1 : σ1² > σ2²
We shall let the statistic V be defined by the following relation
$$V = \frac{S_X^2/(m - 1)}{S_Y^2/(n - 1)}$$
where
$$S_X^2 = \sum_{i=1}^{m} (X_i - \bar{X})^2 \quad\text{and}\quad S_Y^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$
Let α0 be the level of significance, i.e. P(V ≥ c) = α0. By using the Fisher table, we can
find c.
The likelihood ratio test procedure which we have just described specifies that the
hypothesis H0 should be rejected if V ≥ c.
EX: Suppose that 6 observations (X1, ..., X6) are selected at random from a normal
distribution for which both the mean µ1 and the variance σ1² are unknown. It is found
that S_X² = Σ(Xi − X̄)² = 30. Also, 21 observations (Y1, ..., Y21) are selected at random from
a normal distribution for which both the mean µ2 and the variance σ2² are unknown. It
is found that S_Y² = Σ(Yi − Ȳ)² = 40.
Test of hypotheses
H0 : σ1² ≤ σ2²
H1 : σ1² > σ2²
In this example, m = 6 and n = 21, so
$$V = \frac{S_X^2/(m - 1)}{S_Y^2/(n - 1)} = \frac{30/5}{40/20} = 3$$
V has an F distribution with 5 and 20 degrees of freedom. With the level of
significance α0 = 5%, i.e. P(V ≥ c) = 0.05,
we find c = 2.71 by using the Fisher table. Clearly, V > c.
Therefore, the hypothesis H0 that σ1² ≤ σ2² should be rejected at the level of
significance α0 = 5%.
If the level of significance is α0 = 2.5%, we get c = 3.29, so V < c. Therefore, the hypothesis
H0 that σ1² ≤ σ2² should be accepted at the level of significance α0 = 2.5%.
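The arithmetic of this F test is simple enough to script. The sketch below computes V and compares it with the two critical values quoted from the Fisher table in the example (2.71 and 3.29); the helper name `variance_ratio` is illustrative, and the critical values are copied from the text, not computed.

```python
def variance_ratio(ss_x, m, ss_y, n):
    """V = [S_X^2 / (m - 1)] / [S_Y^2 / (n - 1)], with S^2 the sums of squared deviations."""
    return (ss_x / (m - 1)) / (ss_y / (n - 1))

v = variance_ratio(30, 6, 40, 21)

# Critical values for F(5, 20), taken from the example's table lookups
critical = {0.05: 2.71, 0.025: 3.29}

# H0 is rejected at level alpha when V >= c
reject = {alpha: v >= c for alpha, c in critical.items()}

print(v, reject)
```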
[1] Given three sets A, B, C. Use the Venn diagram to illustrate the
following sets: A ∪ B ∪ C, A ∩ B ∪ B ∩ (C \ (A ∪ B)).
[2] Let a card be selected from two ordinary packs of 52 cards. Denote A = {the card
is a Diamond or a Club}, B = {the card is of a red suit}, C = {the card is not a Jack of Hearts}.
Compute the following probabilities: P(A), P(B ∪ C), P(C|B).
[3]
EX 1 (2 points): What are a sample space, events and their probability?
Formulate the axioms of probability. Give an example.
number of socks be ?
EX 8:(4.23) Let A be the event that a family has children of both sexes,
and let B denote the event that a family has at most one boy. The events A
and B are independent. How small can the number of children in a family
be ?
EX 10: (4.54) A box contains 5 radio tubes of which 2 are defective. The
tubes are tested one after the other until the two defective tubes are
discovered. What is the probability that the process stopped on the (i) second
test, (ii) third test?
EX 11: (Example 5.4) Let a die be tossed 20 times. Let S be the number
of times the face 1 is obtained in the 20 tosses.
a) Find the distribution of S.
b) Find the probability that in 20 tosses the face 1 is obtained
exactly 3 times.
a) S ∼ b(20, 1/6)
b) $$P(S = 3) = \binom{20}{3}\left(\frac{1}{6}\right)^{3}\left(\frac{5}{6}\right)^{17}$$
10 Introduction to Stochastic processes
I. Review of basic terminology and properties of random variables and
distribution functions.
The following concepts will be assumed familiar to the reader:
Given a pair (X, Y) of r.v.'s, their joint distribution function is the function
F_XY of two real variables given by
F_XY(λ1, λ2) = Pr{X ≤ λ1, Y ≤ λ2}.
Let X and Y be r.v.'s which can attain only countably many different
values, say 1, 2, .... The conditional distribution function of X given Y = y is
$$F_{X|Y}(x|y) = \frac{\Pr\{X \le x, Y = y\}}{\Pr\{Y = y\}}, \quad\text{if } \Pr\{Y = y\} > 0,$$
These three properties capture the essential features of conditional
distributions. In fact, from (C.P.3) we obtain
Pr{X ≤ x, Y = y} = Pr{X ≤ x, Y ≤ y} − Pr{X ≤ x, Y < y}
= Σ_{u≤y} F_{X|Y}(x|u) Pr{Y = u} − Σ_{u<y} F_{X|Y}(x|u) Pr{Y = u} = F_{X|Y}(x|y) Pr{Y = y},
which then implies
F_{X|Y}(x|y) = Pr{X ≤ x, Y = y} / Pr{Y = y}
at least when Pr{Y = y} > 0. In advanced work, (C.P.1-3) is taken as the
basis for the definition of conditional distributions. It can be established that
such conditional distributions exist for arbitrary real r.v.'s X and Y, and
even for real random vectors X = (X1, ..., Xn) and Y = (Y1, ..., Yn).
The application of (C.P.3) in the case y = ∞ produces the law of total
probability
$$\Pr\{X \le x\} = \Pr\{X \le x, Y \le \infty\} = \int_{-\infty}^{+\infty} F_{X|Y}(x|y)\,dF_Y(y),$$
When X and Y are jointly distributed continuous r.v.'s, we may define the
conditional density function
$$p_{X|Y}(x|y) = \frac{d}{dx}F_{X|Y}(x|y) = \frac{p_{XY}(x, y)}{p_Y(y)}$$
at values y for which pY(y) > 0, and as a fixed arbitrary probability density
function when pY(y) = 0.
Let g be a function for which the expectation of g(X) exists. The
conditional expectation of g(X) given Y = y can be expressed in the form
E[g(X)|Y = y] = \int g(x) dF_{X|Y}(x|y).
When X and Y are jointly continuous r.v.’s, E[g(X)|Y = y] may be computed
from
E[g(X)|Y = y] = \int g(x) p_{X|Y}(x|y) dx = \frac{\int g(x) p_{XY}(x, y) dx}{p_Y(y)}, if p_Y(y) > 0,
and if X and Y are jointly distributed discrete r.v.’s, taking the possible
values x_1, x_2, · · · , then the detailed formula reduces to
E[g(X)|Y = y] = Σ_{i=1}^{∞} g(x_i) Pr[X = x_i | Y = y]
= \frac{Σ_{i=1}^{∞} g(x_i) Pr{X = x_i, Y = y}}{Pr{Y = y}}, if Pr{Y = y} > 0.
In parallel with (C.P.1-3) we see that the conditional expectation of g(X)
given Y = y satisfies
(C.E.1) E[g(X)|Y = y] is a function of y for each function g for which
E[|g(X)|] < ∞; and
(C.E.2) For any bounded function h we have
E[g(X)h(Y)] = \int E[g(X)|Y = y] h(y) dF_Y(y).
E[g(X)] = Σ_{i=1}^{∞} E[g(X)|Y = y_i] Pr{Y = y_i}
are fixed numbers and g, h are given functions for which g(X) and h(X)
are integrable, then
When h(y) = 1 for all y, we get the law of total probability in the form
E[g(X)] = E[E[g(X)|Y]]
E[c|Y] = c, (7)
E[f(Y)|Y] = f(Y), (8)
E[g(X)f(Y)|Y] = f(Y)E[g(X)|Y], (9)
E[g(X)] = E[E[g(X)|Y]]. (10)
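The law of total probability in the form E[g(X)] = E[E[g(X)|Y]] can be checked by hand on a small discrete example. The sketch below uses a made-up joint distribution on a 2 x 2 grid and an arbitrary function g; both are assumptions chosen only for illustration. Exact fractions are used so that the two sides can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint pmf Pr{X = x, Y = y}; the values are illustrative only.
joint = {
    (0, 0): F(1, 8), (1, 0): F(3, 8),
    (0, 1): F(1, 4), (1, 1): F(1, 4),
}

def g(x):
    """An arbitrary test function of X."""
    return x * x + 1

def marginal_y(y):
    """Pr{Y = y}, obtained by summing the joint pmf over x."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_expectation(y):
    """E[g(X) | Y = y] = sum_i g(x_i) Pr{X = x_i, Y = y} / Pr{Y = y}."""
    return sum(g(x) * p for (x, yy), p in joint.items() if yy == y) / marginal_y(y)

# Left side: E[g(X)] computed directly from the joint pmf.
direct = sum(g(x) * p for (x, _), p in joint.items())

# Right side: sum over y of E[g(X) | Y = y] Pr{Y = y}.
y_values = sorted({y for (_, y) in joint})
iterated = sum(cond_expectation(y) * marginal_y(y) for y in y_values)

print(direct, iterated)
```

Both quantities equal 13/8 for this particular table; changing the entries of joint (keeping them nonnegative with total 1) leaves the two sides equal.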
F_{X_{i_1},...,X_{i_{j−1}},X_{i_{j+1}},...,X_{i_n}}(λ_1, ..., λ_{j−1}, λ_{j+1}, ..., λ_n) = \lim_{λ_j → ∞} F_{X_{i_1},...,X_{i_n}}(λ_1, ..., λ_n)
n-th toss of a die, its possible values are contained in the set {1, 2, 3, 4, 5, 6}
and a typical realization of the process would be 5, 1, 3, 2, 2, 4, 1, 6, 3, 6, .... This is
shown schematically in Fig. 1, where the ordinate for t = n is the value
of X_n. In this example the random variables X_n are mutually independent,
but in general the random variables X_n are dependent. Stochastic processes for which
T = [0, ∞) are particularly important in applications. Here t can usually be
interpreted as time. We will content ourselves, for the moment, with a very
brief discussion of some concepts of stochastic processes and two examples
thereof; a summary of various types of stochastic processes is presented at
the end of the chapter, while the examples themselves will be
treated in greater detail in succeeding chapters.
(a) Suppose that t_0 < t_1 < · · · < t_n; then the increments X_{t_1} − X_{t_0}, · · · , X_{t_n} − X_{t_{n−1}}
are mutually independent r.v.’s. (A process with this property is said
to be a process with independent increments; the property expresses the fact that
the changes of X_t over nonoverlapping time periods are independent r.v.’s.)
(b) The probability distribution of X_{t_2} − X_{t_1}, t_2 > t_1, depends only on t_2 − t_1
(and not, for example, on t_1).
Pr[X_t − X_s ≤ x] = [2πB(t − s)]^{−1/2} \int_{-\infty}^{x} exp[−u^2 / (2B(t − s))] du,
Einstein were later experimentally verified and extended by various physi-
cists and mathematicians.
Let X_t denote the displacement (from its starting point, along some fixed
axis) at time t of a Brownian particle. The displacement X_t − X_s over the
time interval (s, t) can be regarded as the sum of a large number of small
displacements. The central limit theorem is essentially applicable, and
it seems reasonable to assert that X_t − X_s is normally distributed. Similarly,
it seems reasonable to assume that the distributions of X_t − X_s and X_{t+h} − X_{s+h}
are the same, for any h > 0, if we suppose the medium to be in
equilibrium. Finally, it is intuitively clear that the displacement X_t − X_s should
depend only on the length t − s and not on the time we begin observation.
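The normal law for the increment can be evaluated numerically. The sketch below assumes that X_t − X_s is normal with mean 0 and variance B(t − s), with B a positive constant as in the distribution formula above; the default B = 1 and the function name increment_cdf are illustrative assumptions. It uses the standard identity Φ(z) = (1 + erf(z/√2))/2 for the normal distribution function.

```python
from math import erf, sqrt

def increment_cdf(x: float, t: float, s: float, B: float = 1.0) -> float:
    """Pr[X_t - X_s <= x] for an increment distributed N(0, B (t - s)), t > s."""
    variance = B * (t - s)
    return 0.5 * (1.0 + erf(x / sqrt(2.0 * variance)))

# The increment is symmetric about 0.
print(increment_cdf(0.0, t=2.0, s=1.0))

# Stationarity: shifting both times by h = 2 leaves the distribution unchanged,
# because only the length t - s enters the formula.
print(increment_cdf(0.7, t=3.0, s=1.0), increment_cdf(0.7, t=5.0, s=3.0))
```

Only the difference t − s appears inside the function, which is exactly the stationarity of increments described in the paragraph above.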
This is the space in which the possible values of X_t lie. In the case that
S = (0, 1, 2, · · · ) we refer to the process at hand as a discrete state process.
If S is the real line (−∞, +∞), then we call X_t a real-valued stochastic process. If
S is a k-dimensional Euclidean space, then X_t is called a k-vector process.
(a) Process with Stationary Independent Increments;
(b) Martingales; (c) Markov processes; (d) Stationary processes.
A stationary process is a stochastic process whose probabilistic laws
remain unchanged through shifts in time (sometimes in space). The concept
captures the very natural notion of a physical system that lacks an
inherent time (space) origin. It is an appropriate assumption for a variety of
processes in communication theory, astronomy, biology, ecology, and
economics.
1. Definition and examples
Let T be an abstract set having the property that the sum of any two
points in T is also in T. Often T will be the set Z+ = {0, 1, 2, ...} of nonnegative
integers, but it could just as well be the positive half or whole
real line, the plane, finite-dimensional space, the surface of a sphere, or
perhaps even an infinite-dimensional space.
The most important example of a second order process is the Gaussian
process, which is defined as follows.
Definition 1.3. A stochastic process {X(t), t ∈ T} is called a Gaussian process
if for any t_1, ..., t_k ∈ T the random vector (X(t_1), ..., X(t_k)) has a
k-dimensional Gaussian distribution, i.e. for any real numbers λ_1, ..., λ_k
the r.v. Σ_{j=1}^{k} λ_j X(t_j) is a Gaussian r.v.
Definition 1.4. If {X(t), t ∈ T} is a second order process, then the function
R(t, s) := E[X(t)X(s)], t, s ∈ T, is called its correlation function.
Definition 1.5. A second order process {X(t), t ∈ T} is called
a weakly stationary process if the mean m = EX(t) is constant, independent of
time, and the covariance function
12 Markov chains
The following are definitions and elementary properties of vectors and ma-
trices which are required for this chapter.
By a vector u we simply mean an n-tuple of numbers:
u = (u_1, u_2, · · · , u_n)
The u_j are called the components of u. If all the components are zero, the
vector u is called the zero vector. The set of all n-vectors forms an n-dimensional
Euclidean space.
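By a matrix A we mean a rectangular array of numbers:
A = (a_{ij}) = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\cdots & \cdots & \cdots & \cdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}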
the m horizontal n-tuples
Theorem 21. If u is a fixed vector of a matrix A, then every non-zero scalar multiple
ku of u is also a fixed vector of A.
Definition A stochastic matrix P is said to be regular if all the entries of some power
P^m are positive.
Example The matrix A = \begin{pmatrix} 0 & 1 \\ 1/2 & 1/2 \end{pmatrix} is regular, since
A^2 = \begin{pmatrix} 1/2 & 1/2 \\ 1/4 & 3/4 \end{pmatrix}
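The regularity check in this example can be reproduced mechanically. The sketch below multiplies the matrix by itself with exact fractions and reports the first power whose entries are all positive. The helper names and the cut-off max_power are illustrative assumptions; a result of None only means that no positive power was found up to the cut-off.

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def first_positive_power(P, max_power=10):
    """Smallest m <= max_power such that every entry of P^m is positive, else None."""
    power = P
    for m in range(1, max_power + 1):
        if all(entry > 0 for row in power for entry in row):
            return m
        power = mat_mul(power, P)
    return None

A = [[F(0), F(1)], [F(1, 2), F(1, 2)]]
A2 = mat_mul(A, A)

print(A2)                        # rows (1/2, 1/2) and (1/4, 3/4)
print(first_positive_power(A))   # 2, so A is regular

identity = [[F(1), F(0)], [F(0), F(1)]]
print(first_positive_power(identity))  # None: every power keeps its zero entries
```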
Theorem 22. Let P be a regular stochastic matrix. Then:
(i) P has a unique fixed probability vector t, and the components of t are all positive.
(ii) The sequence P, P^2, P^3, · · · of powers of P approaches the matrix T whose rows are
each the fixed point t.
(iii) If p is any probability vector, then the sequence of vectors pP, pP^2, pP^3, · · ·
approaches the fixed point t.
Example Consider the regular stochastic matrix P = \begin{pmatrix} 0 & 1 \\ 1/2 & 1/2 \end{pmatrix}. Its fixed
point is t = (1/3, 2/3).
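The fixed point quoted in this example can be verified by solving tP = t together with t_1 + t_2 = 1. For a 2 x 2 stochastic matrix with off-diagonal entries a = p_12 and b = p_21 (not both zero), these equations give t = (b/(a + b), a/(a + b)). The sketch below applies this formula and then checks tP = t exactly; the function names are illustrative assumptions.

```python
from fractions import Fraction as F

def fixed_probability_vector_2x2(P):
    """Solve tP = t, t_1 + t_2 = 1 for a 2 x 2 stochastic matrix with a + b > 0."""
    a = P[0][1]  # probability of moving from the first state to the second
    b = P[1][0]  # probability of moving from the second state to the first
    return (b / (a + b), a / (a + b))

def vec_mat(v, P):
    """Row vector times matrix: (vP)_j = sum_i v_i p_ij."""
    return tuple(sum(v[i] * P[i][j] for i in range(len(v))) for j in range(len(P[0])))

P = [[F(0), F(1)], [F(1, 2), F(1, 2)]]
t = fixed_probability_vector_2x2(P)

print(t)              # (1/3, 2/3)
print(vec_mat(t, P))  # the same vector again, confirming tP = t
```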
We now consider a sequence of trials whose outcomes, say X_1, X_2, · · · , satisfy the
following two properties:
(i) Each outcome belongs to a finite set of outcomes (a_1, a_2, · · · , a_m) called the state
space of the system; if the outcome on the n-th trial is a_i, then we say that the system is in
state a_i at time n or at the n-th step.
(ii) The outcome of any trial depends at most upon the outcome of the immediately
preceding trial and not upon any other previous outcome; with each pair of states (a_i, a_j)
there is given the probability p_{ij} that a_j occurs immediately after a_i occurs.
Alternatively, X_{n+1} depends on X_n but not on the earlier values X_0, X_1, ..., X_{n−1}.
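Such a stochastic process is called a (finite) Markov chain. The numbers p_{ij}, called the
transition probabilities, can be arranged in a matrix
P = (p_{ij}) = \begin{pmatrix}
p_{00} & p_{01} & \cdots & p_{0n} \\
p_{10} & p_{11} & \cdots & p_{1n} \\
\vdots & \vdots & & \vdots \\
p_{n0} & p_{n1} & \cdots & p_{nn}
\end{pmatrix}
called the transition matrix.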
One may represent the chain by a graph with nodes representing the sample space of
X_n and directed arcs (i, j) for all pairs of states i, j such that p_{ij} > 0.
Theorem 23. The transition matrix P of a Markov chain is a stochastic matrix.
Example A man either drives his car or catches a train to work each day. Suppose
he never goes by train two days in a row; but if he drives to work, then the next day
he is just as likely to drive again as he is to travel by train.
The state space of the system is (t (train), d (drive)). This stochastic process is a
Markov chain since the outcome on any day depends only on what happened on the preceding day.
The transition matrix of the Markov chain, with rows and columns ordered (t, d), is
P = \begin{pmatrix} 0 & 1 \\ 1/2 & 1/2 \end{pmatrix}
The first row of the matrix corresponds to the fact that he never goes by train two
days in a row and so he definitely will drive the day after he travels by train. The
second row of the matrix corresponds to the fact that the day after he drives he will
drive or go by train with equal probability.
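The commuting example can be written down as data and checked against the requirement that a transition matrix be stochastic (nonnegative entries, each row summing to 1). The sketch below also advances a distribution by one day at a time. The starting distribution (train with certainty) and the helper names are assumptions made for illustration.

```python
from fractions import Fraction as F

# Transition probabilities of the commuting example; "t" = train, "d" = drive.
P = {
    "t": {"t": F(0), "d": F(1)},
    "d": {"t": F(1, 2), "d": F(1, 2)},
}

def is_stochastic(matrix):
    """True if every entry is nonnegative and every row sums to 1."""
    return all(
        all(prob >= 0 for prob in row.values()) and sum(row.values()) == 1
        for row in matrix.values()
    )

def step(dist, matrix):
    """Distribution one day later: new[j] = sum_i dist[i] * p_ij."""
    return {j: sum(dist[i] * matrix[i][j] for i in matrix) for j in matrix}

today = {"t": F(1), "d": F(0)}   # assumed start: he took the train today
tomorrow = step(today, P)        # he certainly drives
day_after = step(tomorrow, P)    # train or car with probability 1/2 each

print(is_stochastic(P), tomorrow, day_after)
```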
Higher Transition probabilities
The entry p_{ij} in the transition matrix P of a Markov chain is the probability
that the system changes from the state a_i to the state a_j in one step. What is the probability,
say p^{(n)}_{ij}, that the system changes from the state a_i to the state a_j in exactly n steps?
Theorem 24. Let P be the transition matrix of a Markov chain.
Then the n-step transition matrix is equal to the n-th power of P. Moreover, if p = (p_i)
is the probability distribution of the system at an arbitrary time, then pP is the probability
distribution of the system one step later and pP^n is the probability distribution of the
system n steps later. In particular, if we denote by P^{(n)} := (p^{(n)}_{ij}) the n-step transition
matrix, then P^{(n)} = P^n.
Thus the probability that the system changes from, say, state t to state d in exactly 4
steps is 5/8, i.e. p^{(4)}_{td} = 5/8. Similarly, p^{(4)}_{tt} = 3/8, p^{(4)}_{dt} = 5/16 and p^{(4)}_{dd} = 11/16.
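The four-step probabilities for the train/drive chain can be recomputed by raising the transition matrix P with rows (0, 1) and (1/2, 1/2) to the fourth power, as Theorem 24 prescribes. The following sketch does this with exact fractions; the helper names mat_mul and mat_pow are illustrative assumptions, and states are ordered (t, d).

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def mat_pow(P, n):
    """n-th power of a square matrix by repeated multiplication (n >= 0)."""
    size = len(P)
    result = [[F(int(i == j)) for j in range(size)] for i in range(size)]  # identity
    for _ in range(n):
        result = mat_mul(result, P)
    return result

# Row/column order: index 0 = t (train), index 1 = d (drive).
P = [[F(0), F(1)], [F(1, 2), F(1, 2)]]
P4 = mat_pow(P, 4)

for row in P4:
    print([str(entry) for entry in row])
```

The rows of P^4 come out as (3/8, 5/8) and (5/16, 11/16); each row sums to 1, as it must for a stochastic matrix.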
Theorem 26. Let the transition matrix P of a Markov chain be regular. Then, in
the long run, the probability that any state a_j occurs is approximately equal to the
component t_j of the unique fixed probability vector t of P.
Hence we see that the effect of the initial state or the initial probability distribution
of the process wears off as the number of steps of the process increases. Furthermore,
every sequence of probability distributions approaches the fixed probability vector t of
P, called the stationary distribution of the Markov chain.
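The fading influence of the initial distribution can be observed numerically on the train/drive chain, whose transition matrix has rows (0, 1) and (1/2, 1/2) and fixed probability vector (1/3, 2/3). The sketch below applies the transition matrix repeatedly to several starting distributions; the three starting vectors and the choice of 50 steps are arbitrary assumptions made for illustration.

```python
def vec_mat(v, P):
    """Row vector times matrix: (vP)_j = sum_i v_i p_ij."""
    return tuple(sum(v[i] * P[i][j] for i in range(len(v))) for j in range(len(P[0])))

def distribution_after(p, P, n):
    """Distribution p P^n, obtained by applying P to p a total of n times."""
    for _ in range(n):
        p = vec_mat(p, P)
    return p

# Row/column order: index 0 = train, index 1 = drive.
P = [[0.0, 1.0], [0.5, 0.5]]

starts = [(1.0, 0.0), (0.0, 1.0), (0.3, 0.7)]
for start in starts:
    print(start, "->", distribution_after(start, P, 50))
```

All three starting vectors end up numerically indistinguishable from (1/3, 2/3), which illustrates the statement that the long-run behaviour does not depend on where the chain starts.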
Review Exercises From textbook of Schaum’s Outline of