Lecture ( - ): Probability Calculations
Probability calculations
You have to have something random to calculate probabilities, so
the basic ingredient is a random experiment. A random experiment is an
action whose outcome is uncertain in advance of its happening.
It is the squeaky wheel that gets the grease: if you don’t ask, you
don’t get.
The things we observe are the outcomes of nature’s random experiment. Nature does something—
stocks go up. Nature does something—you pass an exam. How long will
you live, will you get married—everything is a random experiment.
Suppose you toss a coin. You can have either heads or tails. The sample space is a set
that has two objects inside: {H, T}. Suppose I tossed a coin twice; then
the sample space is the Cartesian product of {H, T} with itself: {H, T} × {H, T} =
{(H, H), (H, T), (T, H), (T, T)}. The more complicated the experiment, the
more complicated the space.
What is an event? An event is any subset of the sample space. Look at
these sets: E ≝ {(H, H)}, F ≝ {(H, H), (H, T), (T, H)}. E is the event that you toss a
coin twice and both of the times you get heads.
The most powerful relational operator in mathematics is equals: “A = B”
means “A is identical to B”. Here we make the equality by definition, ≝.
There is no calculation behind it; we are imposing equality by definition.
If you toss a coin twice and get two heads, you say E has occurred. If you toss a coin
twice and see that at least one head has occurred, you say F has occurred. What
is the chance, probability, likelihood of the event E? In the classical theory of
Chevalier de Méré, Pierre Fermat and Blaise Pascal, the probability of E is the
probability of heads on the first toss times that on the second. The success of
this classical theory lasted until the XIXth century. The classical probability of an
event is the cardinality of the event divided by the cardinality of the sample
space. But the minute you go to science and engineering with infinitely large sample
spaces, it fails, because you can get the ratio ∞/∞, which makes
no sense. The price of a stock can take any value between 0 and +∞. You
cannot count the number of elements; this method is no longer sustainable.
Who is the creator of mathematical statistics? Andrey Kolmogorov. In
1933, he wrote a book on the mathematical theory of probability. It agrees with
the classical theory. So the theory is quite young. . .
The mathematical theory of probability is one of the greatest things that
have come to a human mind. We need a systematic way of measuring
probabilities. Consider a random experiment, and let S be the sample
space associated with this random experiment. We are given a particular
event: the event A is a subset of this sample space. A can have any cardinality,
including infinite. Everything is very, very general. The Greeks could not
do it, Laplace could not handle it, Newton could not do it. The answer is
remarkably simple: all you need is a function. It should take an event as
an input and return a number. How should we construct this function?
1. Consistent (at any time the answer should be the same);
2. Giving reasonable answers;
3. If the sample space is finite, it should give the same answer as classical
probability.
As long as you can find a function satisfying these axioms, you are ready
to measure probabilities. What should be the domain of the function? It
receives an event and returns a number, so the domain should be the
set of all subsets. The domain is any event, and Kolmogorov gave the function the
name probability measure: P : {set of all subsets of S} → R is called a
probability measure if it satisfies three properties:
1. Probabilities should be non-negative. They can be zero: P(A) ≥
0 ∀A ⊂ S. How do you interpret negative numbers? Ask a -year-old
what zero is, and he will crap his pants. It took humanity
some time to invent a symbol for nothingness.
2. The probability of the entire sample space is 1, and the probability of
an empty event is 0: P(S) = 1, P(∅) = 0.
3. The required property: if A₁, A₂, . . . is a sequence of pairwise disjoint
subsets of S, then the probability of the union is the sum of probabilities:

    P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

Disjoint means that the intersection is empty: the sequence is such that
the intersection of any two events is empty, A_i ∩ A_j = ∅ for i ≠ j.
Any function that satisfies Kolmogorov’s axioms is a probability measure.
Do we know that such a function exists? Otherwise this theory is silly.
Probability measures exist!
Example. Suppose card(S) < ∞. Then P(A) ≝ card(A)/card(S), A ⊂ S, is a
probability measure.
How about defining a probability measure on the real line, S = R?
Its cardinality is uncountable infinity. Then

    P₁(A) ≝ ∫_A I_{[0,1]}(x) dx

How about this:

    P₂(A) ≝ ∫_A (1/√(2π)) e^{−x²/2} dx
In the first example,

    P₁(R) = ∫_R I_{[0,1]}(x) dx = ∫_{(R∖[0,1]) ∪ [0,1]} I_{[0,1]}(x) dx
          = ∫_{R∖[0,1]} I_{[0,1]}(x) dx + ∫_{[0,1]} I_{[0,1]}(x) dx.

The first integral is always zero, so this is really ∫_{[0,1]} I_{[0,1]}(x) dx, and this
integral is equal to 1.
The last property is just a property of integrals. So probability measures
exist. In mathematics, existence is crucial; the rest builds on the
fact that these objects exist, otherwise it is just abstract nonsense. There are fewer
axioms here than in geometry; it is so parsimonious. And Kolmogorov
died in 1987. Professor Tripathi knows a person who knew Kolmogorov.
What if, instead of an infinite sequence, we have a finite number of disjoint
sets, say A₁, A₂, disjoint subsets of S? Then P(A₁ ∪ A₂) = P(A₁) + P(A₂).
But how do we know that? Take the countable sequence A₁, A₂, ∅, ∅, ∅, . . ..
Then ⋃_{i=1}^∞ A_i = A₁ ∪ A₂, so, by property 3,
P(⋃_{i=1}^∞ A_i) = P(A₁ ∪ A₂) = ∑_{i=1}^∞ P(A_i) = P(A₁) + P(A₂).
These properties have remarkable implications.
[Venn diagrams: an event A inside S; two events A and B; A and its complement A^C.]
Lecture (--)
Homework , additional problem . A and B are events of S. We have
shown that if P(A) = 0, then P(A ∩ B) = 0. First, A ∩ B ⊂ A. This implies
P(A ∩ B) ≤ P(A). But P(A) = 0 ≥ P(A ∩ B). Probabilities cannot be
negative, so this thing is zero.
Oh, s***, the sponge. . . Ah, here it is. I was about to use a harsh
word for the university.
Now suppose P(A) = 1. Then P(B) = P(B ∖ A) + P(B ∩ A). By definition, B ∖ A ≝ B ∩ A^C, the complement
taken with respect to the sample space. Therefore, P(B ∖ A) = P(B ∩ A^C) ≤
P(A^C) = 1 − P(A) = 1 − 1 = 0. Therefore, P(A) = 1 ⇒ P(B ∖ A) = 0.
Hence, P(B) = P(B ∩ A). If you take an event that happens with probability
one, its intersection with any other event has the probability of that other event.
These results are fantastically useful. They are used to simplify probability
calculations.
Let E ≝ Q, the set of all rational numbers in R. Then

    P₁(E) = ∫_E I_{[0,1]}(x) dx = ∫_{E∩[0,1]} dx = 0,

because E ∩ [0,1] is countable, and a countable set has zero length.
Never confuse a zero-probability event with an empty event. You might
think that this is a silly example.
Random variables
We introduced the probability measure, but there is one problem: the
sample space can be very, very abstract and consist of non-numerical
outcomes. The outcomes of a coin are heads or tails. You can think of a set of
donkeys as well as a set of real numbers. The language of mathematics does not require real numbers to exist.
Nature tosses a coin, there are some random processes in your mother’s body,
and you get a specific genome characteristic. Computationally it is difficult,
but mathematically it is feasible. We need a device to do the computations
easily. This is why you need a function to map the sample space into the
real line—the definition of a random variable. Once you have done this
mapping, for practical purposes, the real line becomes your sample space.
A random variable is a function. Suppose the sample space is S =
{H, T}. A possible mapping is

    X(ω) = 1, ω = H,
           0, ω = T.

Random variables are functions, but they are written without arguments. It
is a convention: X = 1 if the outcome is heads and X = 0 if the outcome
is tails.
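The coin mapping above is simple enough to write down literally. A minimal sketch, with the function name `X` taken from the notes and everything else my own scaffolding:

```python
# The random variable X: {H, T} -> {1, 0} from the notes.
def X(omega):
    # Map the non-numerical outcome to a number on the real line.
    return 1 if omega == "H" else 0

S = ["H", "T"]                  # the sample space
support = {X(w) for w in S}     # the set of values X takes: supp X = {0, 1}
```

After the mapping, for practical purposes you work with {0, 1} instead of {H, T}.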
Uppercase roman letters will be used to denote random variables. If
there is some notation your grandmother likes, I couldn’t care less. I do
not care what corner of the world you are coming from: uppercase
roman letters denote random variables. Lowercase roman letters will
denote outcomes: X = x, x ∈ {0, 1}. I will use this religiously.
Whatever properties apply to functions apply to random variables.
However, some terminology changes in the probabilistic setting. The
co-domain is the real line. The domain is heads and tails. The range is 0
and 1, and it is called the support. The support of a random variable is the set
of values it takes. It is written like this: supp X = {0, 1}. We are not saying
“range” because we are not using functional notation.
If you want to find the probability that a child is born with blue eyes and
brown hair and weighs . kg, you have to compute it, and you need two
pieces of information. All the fancy math is directed towards one goal:
how can I calculate the probability of an interesting event? You just have to
compute the probability that X > 9 with respect to some measure, because the
genes are combining in an abstract space.
Weight can be zero or arbitrarily large.
I have seen people with arbitrarily large weights.
1. You have to specify the support of a random variable, i. e. the set of
values taken by this random variable.
2. You have to specify the probability measure, depending on your
imagination and the question being asked.
Example . Suppose that supp X = R and

    Prob(X ∈ A) ≝ ∫_A (1/√(2π)) e^{−x²/2} dx.

This thing integrates to 1; it is the Gaussian integral. This is a well-defined
probability measure. Do you recognise this random variable? Because it is
so frequently applied, it is called the standard normal, or Gaussian, random
variable. You can call the function (1/√(2π)) e^{−x²/2} Tom Cruise, but let us be
scientific and call it the probability density function, PDF.
The support has to be specified. You can get the probabilities by integrating
densities over regions. The Gaussian density looks like this:

    PDF_X(x) = (1/√(2π)) e^{−x²/2},  x ∈ R.

𝜙 is a reserved letter for this density, just as in computer programming, so do not use it for
anything else. The shorthand is X =ᵈ 𝒩(0, 1).
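Getting probabilities by integrating the density over a region can be sketched numerically. This is an illustrative sketch using the trapezoidal rule; the helper names `phi` and `prob_interval` are my own, and the truncation at ±10 is an assumption justified by the negligible Gaussian tails.

```python
import math

def phi(x):
    # Standard normal density: phi(x) = exp(-x^2/2) / sqrt(2*pi).
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def prob_interval(a, b, n=100_000):
    # P(a < X <= b) by the trapezoidal rule; crude, but enough to see it work.
    h = (b - a) / n
    s = 0.5 * (phi(a) + phi(b)) + sum(phi(a + i * h) for i in range(1, n))
    return s * h

total = prob_interval(-10, 10)            # effectively P(X in R), close to 1
p_95 = prob_interval(-1.96, 1.96)         # the familiar ~0.95 central region
```

The near-unit value of `total` is the numerical counterpart of "this thing integrates to 1".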
Example . Let y be a random variable. What is wrong with this
statement? It is lowercase. The error propagates; better have a doubt
cleared. So let us use Y. Even V and v are distinguished by a squiggle. Let
supp Y = [a, b], where a < b, and PDF_Y(y) = (1/(b − a)) I_{[a,b]}(y), y ∈ R. Y takes
all values between a and b. How will you use this piece of information?
P(Y ∈ A) = ∫_{y∈A} PDF_Y(y) dy for A ⊂ R. This random variable
is also frequently used in applied work. This is the uniform random
variable. The shorthand for “Y is uniformly distributed on [a, b]” is Y =ᵈ
𝒰[a, b] (or Y ∼ 𝒰[a, b]).
Example . Let p ∈ (0, 1) be a constant and let Z be a random variable
such that supp Z = {0, 1} and P(Z ∈ A) = ∑_{z∈A} p^z (1 − p)^{1−z} I_{{0,1}}(z), A ⊂
R. Is this a probability measure? An empty summation is zero by definition.
Then

    P(Z ∈ R) = ∑_{z∈R} p^z (1 − p)^{1−z} I_{{0,1}}(z)
             = p⁰(1 − p)^{1−0} I_{{0,1}}(0) + p¹(1 − p)^{1−1} I_{{0,1}}(1) = (1 − p) + p = 1.
For a continuously distributed variable, the chance that it takes any particular value is zero. On the other hand, a Bernoulli
random variable has a non-zero probability of 0 and 1. The first
difference between the Gaussian and the Bernoulli random variable is
the support: the support of a Bernoulli random variable is countable.
The support of a Gaussian variable is uncountable. If the support is an
uncountable set and the probability of X being any particular number is 0,
then X is said to be continuously distributed. On the other hand, Z is
said to be discrete if supp Z is a countable set. Some random variables can
be both. In applied work, random variables can be partially continuous.
Example. Let X =ᵈ 𝒩(0, 1). Let c be a constant. Define

    W ≝ X, X > c,
        c, X ≤ c,

where c is the threshold. This is called a censored random variable, or a bottom-coded
random variable.
Example . We can be more generic. Let λ > 0. X =ᵈ exp(λ) if
supp(X) = (0, ∞) and PDF_X(x) = (1/λ) e^{−x/λ} I_{(0,∞)}(x), x ∈ R. It is a
continuous random variable. It is used to measure the survival of electronic
objects: the probability of an SSD disk failing does not depend on its prior
history because there are no moving parts.
Example . Let λ > 0. X =ᵈ Poisson(λ) if and only if supp(X) = {0} ∪ N
and

    PMF_X(x) = e^{−λ} (λ^x / x!) I_{{0}∪N}(x),  x ∈ R.

To get P(X ∈ A), you compute ∑_{x∈A} PMF_X(x).
Do you agree that P(−∞ < X ≤ x) = P[(−∞ < X < ∞) ∩ (X ≤ x)]? The
first event happens with probability one. So this is simply P(X ≤ x).
A result without a proof: X =ᵈ Y if and only if CDF_X = CDF_Y. We do
not have to worry about the support. It means that the CDFs have to be equal at
every point on the real line, t ∈ R. A CDF just gives the probability of an interval.
But you can live your life happily even with a casual acquaintance with the
CDF.
Suppose X is continuously distributed. Then CDF_X(x) = P(X ∈
(−∞, x]). How would you get this? By integrating the PDF:

    CDF_X(x) = ∫_{−∞}^{x} PDF_X(t) dt.

Result:

    d/dx CDF_X(x) = d/dx ∫_{−∞}^{x} PDF_X(t) dt = PDF_X(x).

This is the result that connects differential to integral calculus: the
fundamental theorem of calculus, or the Leibniz rule.
Let us do an example. Let Y =ᵈ Bernoulli(p). Then

    CDF_Y(y) ≝ P(Y ≤ y) = 0,      y < 0,
                           1 − p,  0 ≤ y < 1,
                           1,      y ≥ 1.

You know that P(Y ≤ 0) = P(Y < 0) + P(Y = 0) = 1 − p. However, the drawing is
incomplete without a hole at each jump; with bullets and holes it is unambiguous. The reason some
variables are called continuous is that the CDF of a continuous variable is
continuous. You can recover the PMF by computing the jumps.
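Recovering the PMF from the jumps of the CDF can be sketched directly. The function name `cdf_bernoulli` and the value p = 0.3 are my own choices for illustration:

```python
def cdf_bernoulli(y, p):
    # CDF of Bernoulli(p): a jump of size 1-p at 0 and a jump of size p at 1.
    if y < 0:
        return 0.0
    if y < 1:
        return 1 - p
    return 1.0

p = 0.3
# The jump at each point recovers the PMF there: CDF(y) - CDF(y-).
jump_at_0 = cdf_bernoulli(0, p) - cdf_bernoulli(-1e-9, p)       # should be 1 - p
jump_at_1 = cdf_bernoulli(1, p) - cdf_bernoulli(1 - 1e-9, p)    # should be p
```

The two jumps are 1 − p and p, i.e. exactly PMF(0) and PMF(1).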
Remark. What is the relationship between these two statements: X =ᵈ Y
and X = Y? The first says: X has the same distribution as Y. The second says:
X is equal to Y, that is, X − Y = 0. But X − Y is a random variable; in
what sense is it equal to 0? With probability 1. You have to interpret random
variables very precisely. So what is the relationship between the statements
X =ᵈ Y and “X = Y with probability 1”? The implication is the following:
X = Y ⇒ X =ᵈ Y. The first is an identity.
Example. Let X ∼ 𝒩(0, 1). I want to find the distribution of −X. The
transformation is very simple. The answer is:
X =ᵈ −X, but X ≠ −X. Let t ∈ R. Then

    CDF_{−X}(t) ≝ P(−X ≤ t) = P(X ≥ −t) = ∫_{−t}^{∞} PDF_X(x) dx
                = ∫_{−t}^{∞} 𝜙(x) dx = ∫_{t}^{−∞} 𝜙(−u)(−du)
                = ∫_{−∞}^{t} 𝜙(−u) du = ∫_{−∞}^{t} 𝜙(u) du = CDF_X(t),

using the substitution u = −x and the symmetry 𝜙(−u) = 𝜙(u).
The two CDFs are equal, so −X =ᵈ X. However, X ≠ −X. Suppose to the
contrary that this statement is false, i.e. X = −X. This means that 2X = 0
and X = 0 with probability 1, which is not true because X is Gaussian.
The vanilla equals sign is the most powerful operator in mathematics. If I put a
topping on it, like =ᵈ, it is no longer vanilla.
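The distributional equality −X =ᵈ X can be seen in simulation by comparing empirical CDFs. A hedged sketch; the helper name `ecdf`, the sample size and the checkpoints are my own choices:

```python
import random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100_000)]

def ecdf(sample, t):
    # Empirical CDF: fraction of draws at or below t.
    return sum(1 for v in sample if v <= t) / len(sample)

neg_xs = [-v for v in xs]   # draws of -X from the same experiment

# The empirical CDFs of X and -X should agree up to sampling noise.
gaps = [abs(ecdf(xs, t) - ecdf(neg_xs, t)) for t in (-1.0, 0.0, 1.0)]
```

The gaps shrink at the usual 1/√n rate, while of course each individual draw satisfies x ≠ −x.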
Expectation

Let X be a random variable. The expected value of X is defined to be

    E X ≝ ∫_{x∈supp X} x PDF_X(x) dx   if X is continuously distributed,
          ∑_{x∈supp X} x PMF_X(x)      if X is discrete.

The learning will not happen if you do not work it out yourself. The best
thing is to ask a question in class; it benefits everybody. Learning literally
happens when you make mistakes, unless you are Isaac Newton. For a Bernoulli
variable, X takes only 0 and 1, and on average it takes the value p, although
p itself is never taken.
There is a very classic book by Peter Whittle, “Probability via Expectation”.
Definition. Let X be a random variable. Let h be a function R → R.
The randomness in h(X) is coming from X. You can simplify
probability calculations by the use of clever conditioning. Then

    E h(X) ≝ ∫_{x∈supp X} h(x) PDF_X(x) dx   if X is continuously distributed,
             ∑_{x∈supp X} h(x) PMF_X(x)      if X is discrete.
In principle, if Y ≝ h(X), then E h(X) = E Y = ∫_{supp Y} y PDF_Y(y) dy.
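The two routes to E h(X), summing h(x) against the PMF of X versus first deriving the distribution of Y = h(X), can be checked on a tiny discrete example. A sketch with made-up probabilities:

```python
# A discrete X with PMF on {-1, 0, 1} (values chosen for illustration).
pmf_x = {-1: 0.25, 0: 0.5, 1: 0.25}
h = lambda x: x * x

# Route 1: sum h(x) * PMF_X(x) over supp X.
e_h_x = sum(h(x) * p for x, p in pmf_x.items())

# Route 2: build the PMF of Y = h(X), then sum y * PMF_Y(y) over supp Y.
pmf_y = {}
for x, p in pmf_x.items():
    pmf_y[h(x)] = pmf_y.get(h(x), 0) + p
e_y = sum(y * p for y, p in pmf_y.items())
```

Both routes give the same number, here 0.5, as the "in principle" remark promises.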
Remark. Let X be a random variable. Then what is E I_A(X)? The
randomness comes from X. The indicator is 1 if X lies in A. Then

    E I_A(X) = ∫_{x∈supp X} I_A(x) PDF_X(x) dx
             = ∫_R I_A(x) I_{supp X}(x) PDF_X(x) dx
             = ∫_A I_{supp X}(x) PDF_X(x) dx = P(X ∈ A).

If something is not clear, look at this object: ∫_R I_A(x) I_{supp X}(x) PDF_X(x) dx.
It is equal to

    ∫_{supp X ∪ (R∖supp X)} I_A(x) I_{supp X}(x) PDF_X(x) dx
        = ∫_{supp X} I_A(x) I_{supp X}(x) PDF_X(x) dx
        + ∫_{R∖supp X} I_A(x) I_{supp X}(x) PDF_X(x) dx.

The indicator functions are your friends, so use them a lot. Things that lie
outside the support contribute zero probability.
Expectation is a linear operator.
Definition. If X is a random variable,

    Var X ≝ E(X − E X)².

The mean measures the average location, and the variance measures the
spread. The variance is called a measure of dispersion.
Lecture (--)
If X =ᵈ 𝒩(0, 1), then E X = ∫_{−∞}^{∞} x𝜙(x) dx = 0. Then Var X ≝ E(X −
E X)² = E X² − (E X)².
What is this: Y =ᵈ 𝒩(μ, σ²)? This is a Gaussian random variable with
mean μ and variance σ². This happens if and only if supp Y = R and

    PDF_Y(y) = (1/σ) 𝜙((y − μ)/σ) = (1/(√(2π) σ)) e^{−(y−μ)²/(2σ²)},  y ∈ R.
Now consider the Cauchy distribution, with density PDF_X(x) = 1/(π(1 + x²)), x ∈ R.
This is an example of a fat-tailed, or heavy-tailed, distribution. The reason
Gaussian integrals are finite is the fact that the integrand decreases very
rapidly; here the tails go to zero, but not as fast. Why are heavy tails useful?
Extreme events. If you want to model extreme events,
massive financial crises, the normal distribution is not OK. The Cauchy
distribution is immensely useful for rare phenomena: earthquakes, financial
crises, shocks. How would you compute its mean?

    E X = ∫_{x∈supp X} x PDF_X(x) dx = ∫_{−∞}^{∞} x · (1/π) · 1/(1 + x²) dx
        = (1/(2π)) log(1 + x²) |_{−∞}^{+∞} = ∞ − ∞,

which is undefined: the Cauchy distribution has no mean.
Quantiles
Let X be a random variable. Let α ∈ (0, 1). An αth quantile (or 100·α-th
percentile) of X is a number Q_X(α) ∈ R, which could lie anywhere on the
real line, that satisfies two properties:

1. P(X ≤ Q_X(α)) ≥ α,
2. P(X ≥ Q_X(α)) ≥ 1 − α.

Some quantiles have names attached to them. Remark: Q_X(0.5) is called
the median of X, med X. Q_X(0.25) is the first quartile, Q_X(0.75) is the
third quartile. Quantile regression is a great alternative to linear
regression.
Example. Let X =ᵈ Bernoulli(0.5). Its support is {0, 1}, and the probability
of success is 0.5: we are tossing a fair coin. Question: find the
median of X. When everything else fails, apply the definition. Let us check
every number. Let q < 0. Can any negative number be the median of X?
P(X ≤ q) = 0. However, this probability should be at least 0.5. How about
q > 1? P(X ≤ q) = 1, but P(X ≥ q) = 0. So no number below 0 or above 1
is the median. Now, let us pick any number between 0 and 1. Let q = 0.
Then P(X ≤ 0) = P(X < 0) + P(X = 0) = 0.5. So zero satisfies the first
condition. Next, P(X ≥ 0) = P(X = 0) + P(X > 0) = 0.5 + 0.5 = 1. If you
want to find something out and you get confused, use set theory: P(X >
0) = P(X ∈ (0, ∞)) = P(X ∈ {1}) + P(X ∈ (0, ∞) ∖ {1}) = 0.5 + 0 = 0.5.
Who said “zero”? Jesus, I am ageing rapidly.
Human beings are stupid. We are doing s*** because of history.
You can check that 1 is also a median. In general, let q ∈ (0, 1). Then
P(X ≤ q) = 0.5 and P(X ≥ q) = 0.5. So in this problem, Q_X(0.5) =
med X = [0, 1]: every point of [0, 1] is a median.
Quantiles might exist, but they are not necessarily uniquely defined. The
median of the standard Cauchy distribution is 0. There are ways of redefining
quantiles so that unique values exist.
Quantiles have some remarkable properties. Let g be an increasing
function on R. What does it mean? Well, it is increasing. Is there any
relationship between the quantiles of g(X) and those of X? For increasing g,

    Q_{g(X)}(α) = g(Q_X(α)).

This is a hugely powerful property that expectations do not obey: in general,
E g(X) ≠ g(E X). This is the equivariance property of quantiles.
Example. Let us use a set of data. Consider the following numbers:
, , . So far, you have not seen any data. There you go, you see data
in math camp. It is a dataset. In sets, duplicates are not allowed, but in
datasets, they are. What is the median? . Very good. Why? What is the
th percentile of this dataset?
Now you have another dataset: , , , . What is its median?
What should you do to answer the question statistically?
Consider the data: a, a, b, b, c, d, e, f, f, g. Each time, you get an outcome.
People have made choices. What is the median choice? How do we
answer this question? Consider this dataset:

    I, b, #, p, ♣, ♣, b

You should not stick to that s*** middle-school rule. It has no statistical
justification. It is entirely arbitrary.
Again, consider this assignment. If the data are , , , , define

    X ≝ 1, prob. 1/4,
        2, prob. 1/4,
        3, prob. 1/4,
        4, prob. 1/4.

Can 23 be the median outcome? P(X ≤ 23) = 0.6, P(X ≥ 23) = 0.8. So 23 is
a candidate, and the corresponding outcome is a. As long as the mapping
is one-to-one, it should be right.
Try the following mapping:

    Y ≝ 23,   prob. 0.4,
        58.5, prob. 0.4,
        100,  prob. 0.2.

In this case, P(Y ≤ 23) = 0.4, and a is not the median anymore. How you
do the coding matters. X and Y are different random variables, so this is
why the medians differ.
However, in some cases there might be no ambiguity, for instance if half or more of
the outcomes are the same. Some statistical methods are not invariant to
the choice of coding.
Remember that a quantile is not always a function. However, there is a way
to adjust the definition. If you analyse the definition of the quantile, you
will see that the empirical CDF has flat areas. However, for well-known
continuous statistical distributions with an increasing CDF the quantiles
are unique. Otherwise, you can use the generalised inverse.
Lecture (--)

Whenever you see a problem, the result is important.
Asking about the 0th quantile of a random variable does not give any
additional information about this variable. The only way to check is to
try any number: P(X ≤ q) ≥ 0, P(X ≥ q) ≥ 1. Suppose q < a, where a is
the smallest support point. What is the probability of being less than or equal
to q? Zero. What is the probability that X ≥ q? One. So q qualifies as the
0th quantile, and so does a. So really Q_X(0) = (−∞, a]. This is really to
illustrate that the 0th percentile is a silly thing.
If X =ᵈ Bernoulli(p), then p is the probability of success. When p = 0,
the probability of failure is 1: the measure puts mass 1 at the point 0. So a Bernoulli(0)
random variable takes the value 0 with probability 1; it is really a
constant random variable. If q < 0, P(X ≤ q) = 0: this is not a median. If
q > 0, P(X ≥ q) = 0: this is not a median. But q = 0 qualifies as the 0th
quantile.
Now for the quantile symmetry property. Show that Q_{−X}(α) = −Q_X(1 − α);
equivalently, −Q_{−X}(α) = Q_X(1 − α). Let us apply the definition.
Conditioning
Conditioning is the really important topic in econometrics. This is
probably the most important part of the course. Suppose we have two
random variables, X and Y. What is Cov(X, Y)? It is

    Cov(X, Y) ≝ E(X − E X)(Y − E Y) = E XY − E X E Y.

For discrete variables, joint probabilities come from the joint PMF:

    P(X ∈ A, Y ∈ B) = ∑_{x∈A} ∑_{y∈B} PMF_{X,Y}(x, y).

Now look at the inner integral in the continuous case. You integrate the joint
density over a region to get a probability, and the framed expression is the
marginal density of X. Therefore,

    PDF_X(x) = ∫_{y∈R} PDF_{X,Y}(x, y) dy,  x ∈ R.

In the same vein, PDF_Y(y) = ∫_{x∈R} PDF_{X,Y}(x, y) dx. If two variables are
stochastically independent, you can multiply the marginal densities to get the joint
density, but not otherwise.
Conditional probabilities. Let A, B be events from a sample space S;
we briefly go back to events. Suppose that P(B) > 0. Then, by definition,

    P(A | B) ≝ P(A ∩ B) / P(B).

Some books write A/B, but it is not this slash /, not this backslash ∖, and
no other symbol. This definition was given by the Reverend Thomas Bayes.
Why is conditioning important? What does conditioning do? Suppose
Y and X are random variables. Also suppose that Y is not a deterministic
function of X, so you cannot write Y = g(X). Part of the randomness in Y
can come from X, and part of its randomness is inherent. How would Y
behave if you stopped the randomness in X?
If both X and Y are discrete, I am going to define the conditional
probability mass function. The conditional probability mass function of Y | X = x is

    PMF_{Y|X=x}(y) ≝ P(Y = y, X = x) / P(X = x),  y ∈ supp Y, x ∈ supp X.

You can talk about the mean, the variance and the quantiles of conditional
distributions.
Now let us generalise this slightly. You know what a vector is? It is
basically a bunch of numbers with an order attached to it. It is important
that a vector is a column vector; the standard mathematical convention
is a column vector. We can generalise this definition. A random vector is
just a column vector of random variables. Say, (X, Y, Z)′ is a random column
vector with random variables as its elements. Now, if W is a random vector,
I do not need to write W⃗ or boldface W; in research papers, you do not write
in boldface or with arrows. If dim W = 3 × 1, then W = (W⁽¹⁾, W⁽²⁾, W⁽³⁾)′. However, I do not
use the notation W₁, because the subscript indexes observations.
Let X be a 2 × 1 random vector. Let X₁, X₂, . . . be identical copies of X,
so X₁ =ᵈ X, X₂ =ᵈ X, X_n =ᵈ X. Then X₁ = (X₁⁽¹⁾, X₁⁽²⁾)′. Mathematics is a universal
language, but even there, people have preferences.
You can find means, variances and quantiles of conditional distributions.
But let us focus on the conditional mean of Y | X. It has several useful
properties.
Another definition. Let Y be a random variable. Let X be a random
vector. The conditional expectation function of Y | X = x is defined to
be

    CEF : x ↦ E(Y | X = x) ≝ ∫_{y∈supp Y} y PDF_{Y|X=x}(y) dy,  x ∈ supp X,

or, in the discrete case,

    E(Y | X = x) ≝ ∑_{y∈supp Y} y PMF_{Y|X=x}(y),  x ∈ supp X.
By the same logic, E(Y | X = 2) = ∑_{y∈{0,1}} y PMF_{Y|X=2}(y) = 0 · PMF_{Y|X=2}(0) +
1 · PMF_{Y|X=2}(1) = 3/4. Finally, E(Y | X = 3) = ∑_{y∈{0,1}} y PMF_{Y|X=3}(y) =
If you have a question, then ask your question loudly. The
members of parliament never address each other: Mister
Speaker... It has positive externalities, and a private question
has zero welfare benefits.
This is read as “the conditional expectation of Y given X”. And this is a
random variable! This random variable has remarkable properties. How
about we list these properties? We shall skip the proofs. These properties
are of supreme interest.
Properties of conditional expectations.
Read section .
Lecture (--)
The result in problem is used in the IV regression. You can have a
look at it.
Let us look at properties of conditional expectations. These properties
follow from conditional expectation functions: you first compute
the conditional expectation function, and then plug in the random variables.
Let Y be a random variable. Let X be a random vector. Then the
random variable E(Y | X) has the following properties:

1. E(Y | X) is a function of X. The source of randomness here is X.
2. The conditional expectation operator is a linear operator: E(A + B |
X) = E(A | X) + E(B | X), where A, B are random variables.
3. E(c | X) = c if c is a constant.
4. The reason why conditional expectations were invented: the intuitive
meaning of E(Y | X) is that the randomness of X has been
stopped. This means that E(g(X)Y | X) = g(X)E(Y | X). Constants
come outside the expectation. This has a non-standard name: the
useful rule.
5. This property has a standard name: the law of iterated expectations.
You can take the expectation of this random variable: E_X[E_{Y|X}(Y |
X)] = E[E(Y | X)] = E Y. There is an inner expectation and an
outer expectation, and you get the marginal expectation of Y. If you
abbreviate it, you get LIE, so it becomes the “law of iterated
mathematical expectations”, LIME.
6. E(Y | X) = E(Y | h(X)) if h is injective, h : supp X → supp X. This
is a one-to-one, or injective, function. Definition: ∀x₁, x₂ ∈
supp(X), h(x₁) = h(x₂) ⇒ x₁ = x₂.
7. Let g be a function, not necessarily one-to-one. I am going to do
repeated conditioning (not iterated conditioning): E[E(Y | X) | g(X)] =
E(Y | g(X)). There are two conditionings. This is used in econometrics
all the time. In general, when you condition on less information,
the outcome is determined by the smaller information set. This is
called the “smallest information wins” property.

All of this can be proved. Whatever properties integrals have, they are
inherited by the conditional expectation operator. These results hold in
great generality. You can solve a difficult probability problem if you split
the problem into several sources of randomness by conditioning, and then
work with the remaining randomness.
Remark. In terms of the conditional expectation function, E(g(X)Y | X =
x) = E(g(x)Y | X = x) = g(x)E(Y | X = x).
Conditional expectation has a couple of interpretations; depending
on the application, you use the interpretation you need. One way
to interpret it: given the information on X, you try to predict the value
of Y. This is the best possible predictor of Y given X.
If you have two values, X and g(X), which contains more information?
X. If g(0) = 3 and g(1) = 3, then you reduce information. Applying a
function is a loss of information. The only time such loss does not occur is
when the function is injective.
Example of property 7. This property tells you a useful way of
discarding information. You may or may not like this example. Let N be the
number of nuclear warheads in North Korea. Kim Jong Un has a production
function. You want to use the available information given the information
on capital and labour (say, the number of cars and people). Unfortunately,
their assembly units are located underground and the data on K are
missing. You want to have E(N | K, L), but how do you statistically remove
the information about K? Remarkably, E[E(N | K, L) | L] = E(N | L).
In econometric terms, this is how you incorporate the effect of omitted
variables. Once you have omitted a variable, you get a short model. This is
how you do it: you condition the long model on the short information set.
What is the effect of endogeneity due to the omitted variable? You use
this property. This does not mean that you erase the variable; you are
conditioning on smaller information.
Example. Suppose E(Y | X) = α₀ + α₁X + α₂X². Can you tell me
E(Y | X, X²)? Let h(x) = (x, x²); h maps x into two things. Is it a
one-to-one function? Suppose h(x₁) = h(x₂). This means that x₁ = x₂ and
x₁² = x₂². By construction, this is an injective transformation; any mapping
of the form x ↦ (x, K(x)) is one-to-one by construction. By property 6, this is
really E(Y | X, X²) = E(Y | h(X)) = E(Y | X). This shows how we should
write regression models; this is the parsimonious way of writing them.
Now suppose E(Y | X) = α₀ + α₁X + α₂X². What is E(Y | X²)? Do
you agree that E(Y | X²) = E[E(Y | X) | X²]? This is by property 7:
E(α₀ + α₁X + α₂X² | X²) = α₀ + α₁E(X | X²) + α₂X². And the middle
term has to be determined: more information has to be given to allow the
simplification.
Result. Let Y be a random variable. Let X be a random vector. Think
of E(Y | X) as predicting Y given the information on X. I claim that this
is the best predictor of Y given X.
First approach. Let U ≝ Y − E(Y | X). What is this? This is a mismatch:
the prediction error, the mistake that you make.
1. The following is true: E(U | X) = E[Y − E(Y | X) | X] = E(Y | X) −
E[E(Y | X) | X] = E(Y | X) − E(Y | X) = 0, by linearity and the fact that
E(Y | X) is a function of X. Taking expectations and using the law of iterated
expectations, E U = 0: when you predict a variable with its conditional expectation,
the expected prediction error is zero. However, U will never be 0 with probability 1,
because that would imply that Y = E(Y | X) with probability 1, but the latter
is a function of X!
2. Statistically, when would you conclude that E(Y | X) is the best predictor?
It uses information on X, and it would be best if it used up all the
information that X has to give. So the prediction error should not contain
any information about X: U should be uncorrelated with any function
of X. And indeed U is uncorrelated with any function of X. Let g be any
function of X. Then Corr(U, g(X)) = Cov(U, g(X)) / (sd U · sd g(X)). The
standard deviations are positive, so the correlation is zero if and only if the
covariance is zero: Cov(U, g(X)) = E U g(X) − E U · E g(X). Now, E U = E[E(U | X)] = 0
by the law of iterated expectations, and E U g(X) = E[E(U g(X) | X)] =
E[g(X)E(U | X)] = 0 by the useful rule. We see that the population residual has no
information on X in it.
Example. Let Y =ᵈ Bernoulli(p). Find the CDF of Y. Solution: let y ∈ R.
Then CDF_Y(y) ≝ P(Y ≤ y). It is

    CDF_Y(y) = 0,      y < 0,
               1 − p,  0 ≤ y < 1,
               1,      y ≥ 1.

If you get confused, follow the mathematics. Just do not forget to put the
bullets and the holes.
Lecture (--)

Let Y be a random variable and let X be a random vector. Recall from
the last time that U ≝ Y − E(Y | X) has two properties: (1) E(U |
X) = 0, and we used this property to show that (2) U is uncorrelated with
every function of X! This is such a good predictor!
It has several interpretations. It makes sense to call E(𝑌 | 𝑋) the be
predior. Do you agree that 𝑌 = E(𝑌 | 𝑋) + 𝑌 − E(𝑌 | 𝑋). You can call
⏟ ⏞ ⏟ ⏞
the fir
part predied by 𝑋, and the latter is the part that cannot be
predied by 𝑋. So we rewrite is as 𝑌 = E(𝑌 | 𝑋) + 𝑈 . This looks like
a regression model. There you go! This is the
ati
ical foundation of a
correly specified regression model.
There is a very useful definition that you will use throughout your career. Let A be a random variable and B a random vector. If E(A | B) = 0, then B is said to be exogenous with respect to A. Exogeneity implies zero correlation: Corr(A, B) = 0. The property E(U | X) = 0 can thus be restated as "X is exogenous with respect to U". Notice that the converse is not true: zero correlation does not imply exogeneity. You will always have endogenous and exogenous variables, and you must specify with respect to what.
Let us look at another interpretation of the conditional expectation as the best predictor of Y given the information in X. This is a classic regression: you want to predict log wage given education, gender, race, etc. Y is a random variable, and you want to predict it using some function of X. The predictor is not allowed to be a function of Y, because then the prediction error would be trivially zero. The prediction error is Y − g(X). What function should you pick? Ideally, one that minimises the prediction error. The easiest way to make the error positive is |Y − g(X)|. The problem is that this is a random object, and we want to remove the randomness in it:

min over g of E|Y − g(X)|.
This is an optimisation problem: search across all functions of X! The objective is called the mean absolute error. Now we have to solve this optimisation problem. It is quite hard, but it has a unique solution. (The only way an integral of a non-negative function can be zero is for the integrand to be zero: if f ≥ 0, then ∫f = 0 ⇔ f = 0.) Searching across all functions of X, the optimal solution turns out to be med(Y | X), the conditional median. However, this optimisation problem is difficult. It is the basis for quantile regression, but that requires a much more fundamental understanding. Purely for the sake of computation we shall not use this objective: the absolute value has a kink at the origin that does not allow calculus, and this causes problems. What is the easiest way to fix this? How do you remove the kink? You should smooth the function at the origin.
How about we minimise the mean squared error instead:

min over g of E[(Y − g(X))²].
The unique solution to this problem is E(Y | X). This is the basis for mean regression. These objects have different properties. The conditional median is the best prediction in the sense that it minimises mean absolute error. The conditional expectation is the best predictor in the sense that it minimises mean squared error, so the meaning of the word "best" changes. Quantile regression is a hugely interesting problem, but it takes pages of proof.
Claim. E(Y | X) really is the minimiser of mean squared error across all functions of X. To see this, decompose E[(Y − g(X))²] = E[(Y − E(Y | X))²] + E[(E(Y | X) − g(X))²] (the cross term vanishes because U is uncorrelated with functions of X). The first part does not depend on g. The second part is zero if and only if g = E(Y | X), and this gives the unique solution! This is a remarkable result because you are minimising over the set of all functions of X. This set is infinite-dimensional: there are uncountably many functions of X. So the conditional expectation is sometimes written as BP(Y | X), the best predictor.
Once you have the conditional density, you can compute the conditional expectation. However, only nature knows the true conditional expectation. Estimation can be done, but it requires non-parametric methods. So, how about lowering our expectations? Instead of best predictors, let us use the best linear predictor. You are still minimising, but instead of searching across all functions, you search only across linear functions of X.
The proof is quite elementary: you can derive these properties using pure algebra. It can be shown that the best linear predictor of Y given X is

BLP(Y | X) = β₀ + X′β₁,

where β₀ = EY − (EX)′β₁ and β₁ = (Var X)⁻¹ Cov(X, Y). It is very easy to estimate the population means and covariances. If X is a vector, the standard mathematical convention is that it is a column vector. Whatever the dimension of X, this stuff works: these definitions are dimension-independent. Here Var X is a matrix and Cov(X, Y) is a vector.
Remark. Let A be an m × 1 random vector (any dimension times one). Then

Var A ≝ E[(A − EA)(A − EA)′],

where A − EA is m × 1 and its transpose is 1 × m, so Var A is m × m. Hence Var X = E(XX′) − (EX)(EX)′ and Cov(X, Y) = E(XY) − (EX)(EY). Do the dimensions match? Var X is a matrix and Cov(X, Y) is a vector, so everything matches.
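A quick numerical sketch of these formulas; the joint distribution below (Y = X₁² + 0.5X₂ + noise) is an invented illustration. We estimate β₁ = (Var X)⁻¹Cov(X, Y) and β₀ from a large simulated sample and check that the BLP residual is uncorrelated with the coordinates of X, even though the truth is nonlinear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a large IID sample with a nonlinear truth, so BLP != BP here.
n = 200_000
X = rng.normal(size=(n, 2))                      # X is a 2-dimensional random vector
Y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(size=n)

# beta1 = (Var X)^{-1} Cov(X, Y),  beta0 = EY - (EX)' beta1
V = np.cov(X, rowvar=False)                      # Var X, a 2x2 matrix
c = np.array([np.cov(X[:, i], Y)[0, 1] for i in range(2)])   # Cov(X, Y), a vector
beta1 = np.linalg.solve(V, c)
beta0 = Y.mean() - X.mean(axis=0) @ beta1

# The BLP residual is uncorrelated with every *linear* function of X.
V_res = Y - (beta0 + X @ beta1)
print(beta1, np.corrcoef(V_res, X[:, 0])[0, 1])
```

Because Cov(X₁, X₁²) = 0 for a standard normal X₁, the estimated β₁ is close to (0, 0.5), yet the residual still carries the nonlinear part X₁² − 1: uncorrelatedness holds only against linear functions.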
I want to compare BPs and BLPs.

  BP                                        | BLP
  BP(Y | X) = E(Y | X)                      | BLP(Y | X) = β₀ + X′β₁, with β₀ = EY − (EX)′β₁, β₁ = (Var X)⁻¹Cov(X, Y)
  U ≝ Y − E(Y | X) is the population        | V ≝ Y − BLP(Y | X) is the population
  residual from the best prediction problem | residual from the best linear prediction problem
  E(U | X) = 0                              | BLP(V | X) = 0
  U is uncorrelated with every function     | V is uncorrelated with every linear
  of X                                      | function of X

The best prediction of U given X is zero; for V, the analogous statement holds only for linear functions. The population residual V carries no information about linear functions of X only, which is a much weaker property. Moreover, there is no analogue of the useful rules for conditional expectations for BLPs, because BLPs are invariant only with respect to linear transformations of X.
The conditional variance is the function

x ↦ Var(Y | X = x) ≝ E[(Y − E(Y | X = x))² | X = x] = E(Y² | X = x) − (E(Y | X = x))².

If this function is not constant in x, then Y is said to be heteroskedastic.
Another thing. Let Y be a random variable and X a random vector. Then Y = E(Y | X) + Y − E(Y | X), as before.

  The more you write, the less I am convinced that you know. Do not write two pages of the state of your mind.

When X is a binary variable D,
this is E(Y | D) = D·E(Y | D = 1) + (1 − D)·E(Y | D = 0). If D = 0, the RHS equals (1 − 0)·E(Y | D = 0), and if D = 1, the RHS is 1·E(Y | D = 1). Since E(Y | D) = D·E(Y | D = 1) + (1 − D)·E(Y | D = 0), multiplying by D gives D·E(Y | D) = D²·E(Y | D = 1) + D(1 − D)·E(Y | D = 0). Now look at D²: it is zero if and only if D = 0, and one if and only if D = 1, so D² = D. However, D(1 − D) is always zero. Therefore D·E(Y | D) = D·E(Y | D = 1). Now take expectations on both sides: E[D·E(Y | D)] = E[D·E(Y | D = 1)] = (ED)·E(Y | D = 1) = P(D = 1)·E(Y | D = 1). So

E(Y | D = 1) = E[D·E(Y | D)] / P(D = 1) = E(YD) / P(D = 1).
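The identity E(Y | D = 1) = E(YD)/P(D = 1) can be sanity-checked by simulation; the particular distribution of (Y, D) below is an invented illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# D is Bernoulli(0.3); Y depends on D so that E(Y | D = 1) = 2.
n = 100_000
D = (rng.random(n) < 0.3).astype(float)
Y = 2.0 * D + rng.normal(size=n)

lhs = Y[D == 1].mean()            # direct conditional average
rhs = (Y * D).mean() / D.mean()   # E[YD] / P(D = 1)
print(lhs, rhs)
```

The two estimators are algebraically the same average, so they agree up to floating-point noise, and both are close to the true value 2.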
QED is quod erat demonstrandum: "what we were trying to show". Euclid, famously, wrote everything explicitly: he stated the claims and, at the end, summarised.
Exercise (i). Find E(Y | X ∈ S). What is the conditional expectation of Y given that X ∈ S? This might seem strange: so far we have been conditioning on other variables, but how do you condition on an event?
Let D ≝ I_S(X), the indicator that X ∈ S. How many values does D take? Two, like a Bernoulli random variable: 1 if X ∈ S and 0 if X ∉ S. So E(Y | X ∈ S) = E(Y | D = 1). The point is to write this object as a conditional expectation. Then, by the previous problem, this is E(YD) / P(D = 1). Do you agree that this is nothing but

E[Y·I_S(X)] / P(D = 1) = E[E(Y·I_S(X) | X)] / P(D = 1) = E[I_S(X)·E(Y | X)] / P(D = 1)?

Now you have to fiddle around so that the answer appears in the required form. You should learn from this problem how to write events as random variables.
Exercise (ii). Censored or truncated regressions are called Tobit models. When you use the margins command in Stata, it is using the result from this problem. Let Y ~ N(μ, σ²). Show that

E(Y | a < Y < b) = μ + σ · [φ((a − μ)/σ) − φ((b − μ)/σ)] / [Φ((b − μ)/σ) − Φ((a − μ)/σ)],

where φ(x) ≝ (1/√(2π)) e^{−x²/2} and Φ(t) ≝ ∫_{−∞}^{t} φ(x) dx, t ∈ R.
First of all, E(Y | a < Y < b) = E(Y | Y ∈ S) where S ≝ (a, b). But from part (i), we know that

E(Y | X ∈ S) = (1 / P(X ∈ S)) ∫_{x∈S} E(Y | X = x) PDF_X(x) dx.
Now,

∫_a^b x · (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx = ∫_a^b (x − μ + μ) · (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx
  = ∫_a^b (x − μ) · (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx + μ ∫_a^b PDF_Y(x) dx.

Let t = (x − μ)²/(2σ²); then dt = ((x − μ)/σ²) dx, so σ² dt = (x − μ) dx. The first integral is then equal to

∫_{(a−μ)²/(2σ²)}^{(b−μ)²/(2σ²)} (1/(√(2π)σ)) e^{−t} σ² dt = −σ · (1/√(2π)) e^{−t} |_{(a−μ)²/(2σ²)}^{(b−μ)²/(2σ²)}
  = σ · (1/√(2π)) ( e^{−(a−μ)²/(2σ²)} − e^{−(b−μ)²/(2σ²)} ) = σ ( φ((a−μ)/σ) − φ((b−μ)/σ) ).
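The truncated-mean formula can be checked against a Monte Carlo average; the values of μ, σ, a, b below are illustrative choices, not from the lecture.

```python
import math
import random

# phi and Phi as defined in the exercise (Phi via the error function).
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))

mu, sigma, a, b = 1.0, 2.0, 0.0, 3.0
alpha, beta = (a - mu) / sigma, (b - mu) / sigma
formula = mu + sigma * (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha))

# Monte Carlo: average of normal draws that land in (a, b).
random.seed(0)
draws = [random.gauss(mu, sigma) for _ in range(400_000)]
kept = [y for y in draws if a < y < b]
mc = sum(kept) / len(kept)
print(formula, mc)
```

The simulated conditional mean agrees with the closed form to Monte Carlo accuracy.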
You are here to learn. You should be learning every day of the week, apart from the hours you need for sleep. Welcome to grad school! It is work, work, work. Life is tough.
The point is not the algebra; the point is the concept.
Let us move on to another topic.
Independence
This topic is called independence. It is a concept that is really remarkable, formalised by Kolmogorov, and it exists to make probability calculations simpler. It may or may not hold.
Suppose that A and B are events of some sample space. We say that A and B are independent, and write A ⊥⊥ B, if

P(A ∩ B) = P(A)P(B).

If some events are independent, do they have to be disjoint, and vice versa? If A ∩ B = ∅, does it imply A ⊥⊥ B? Let X ~ N(0, 1), A ≝ (X > 0) and B ≝ (X ≤ 0). Then A ∩ B = ∅. Apply the definition of the concept: P(A) = 0.5 and P(B) = 0.5, but P(A ∩ B) = P(∅) = 0 ≠ P(A)P(B) = 0.25. The other way around is also not true. Independence is a property of probabilities and has nothing to do with set theory. Sometimes this concept is called stochastic independence.
What if you had three events? You would then say that independence should hold for every subcollection. We say that A, B, C are mutually independent if P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C), and P(A ∩ B ∩ C) = P(A)P(B)P(C). This definition is the culmination of a huge amount of human thought.
When can we claim that two random variables are independent? When the events they generate are independent.
Definition. Let X and Y be random variables or random vectors. We say that X is independent of Y (and write X ⊥⊥ Y) if P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) for all A ⊂ R^{dim X}, B ⊂ R^{dim Y}.
Result. It can be shown that

X ⊥⊥ Y ⇔ PDF_{X,Y}(x, y) = PDF_X(x) PDF_Y(y) ∀(x, y) ∈ supp(X, Y).

Another result: X ⊥⊥ Y ⇒ Corr(X, Y) = 0, because under independence joint probabilities factor into products of probabilities, so E(XY) = (EX)(EY).
For example, let A ≝ {X ∈ (1, ∞)} and B ≝ {Y ∈ (−∞, 0.25)}. Then, under independence, P(A ∩ B) = P(A)P(B).
Lecture ( - )
Let us discuss the relationship between independence and conditioning. Let X and Y be random variables. If X ⊥⊥ Y, there are two equivalent ways to read this: P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) ∀A, B ⊂ R, or PDF_{X,Y} = PDF_X · PDF_Y.
Suppose X ⊥⊥ Y. What is the conditional density PDF_{Y|X}? It is

PDF_{Y|X} = PDF_{X,Y} / PDF_X = (PDF_Y · PDF_X) / PDF_X = PDF_Y.

If Y and X are independent, they give no information about one another.
What about the other way? If PDF_{Y|X} = PDF_Y, then PDF_{X,Y} = PDF_{Y|X} · PDF_X = PDF_Y · PDF_X. So we have the equivalence:

Y ⊥⊥ X ⇔ PDF_{Y|X} = PDF_Y.
Claim: A ⊥⊥ B ⇔ A^C ⊥⊥ B ⇔ A ⊥⊥ B^C ⇔ A^C ⊥⊥ B^C.
1. A ⊥⊥ B ⇒ A^C ⊥⊥ B. Now B = (B ∖ A) ∪ (B ∩ A), so P(B) = P(B ∖ A) + P(B ∩ A) = P(B ∩ A^C) + P(B ∩ A). Therefore P(B ∩ A^C) = P(B) − P(B ∩ A), and by independence this is P(B) − P(B)P(A) = P(B)(1 − P(A)) = P(B)P(A^C).
2. A^C ⊥⊥ B ⇒ A ⊥⊥ B^C. Do you agree that B = (B ∖ A) ∪ (B ∩ A)? So P(B ∩ A) = P(B) − P(B ∩ A^C). Also, A = (A ∖ B) ∪ (A ∩ B), therefore P(A ∩ B^C) = P(A) − P(A ∩ B). Since P(B ∩ A) = P(A ∩ B), we get P(A ∩ B) = P(B) − P(A^C)P(B) = P(A)P(B), and hence P(A ∩ B^C) = P(A) − P(A)P(B) = P(A)P(B^C).
the useful rule. Now we know the definition:

E(S_n | N = n) = { E[Σ_{i=1}^n X_i | N = n],  n ≥ 1;   0,  n = 0 }
  = { Σ_{i=1}^n E(X_i | N = n),  n ≥ 1;   0,  n = 0 }
  = { Σ_{i=1}^n E(X_i),  n ≥ 1;   0,  n = 0 }  = np,

using the independence of N and the X_i in the last step.

In the other example,

E(X | X² = t) = 0·I(t = 0) + ((p₃ − p₁)/(p₁ + p₃))·I(t = 1) = ((p₃ − p₁)/(p₁ + p₃))·I(t = 1),

and E(X | X²) = ((p₃ − p₁)/(p₁ + p₃))·I(X² = 1).
Lecture ( - )
You will need the properties of Gaussian random vectors in portfolio theory. Gaussian objects possess properties that make them really unique: properties that other variables do not have.
Recall that X is said to be a standard normal (Gaussian) random variable, X ~ N(0, 1), if and only if supp X = R and

PDF_X(x) = φ(x) = (1/√(2π)) e^{−x²/2},  x ∈ R.
Claim. Let a and b be constants. Then aX + b is a Gaussian random variable. What is the expected value of aX + b? It is b, since EX = 0. What is its variance? a² Var X = a². The distribution family remains the same; only the mean and the variance change. The family is preserved under linear transformations: this is called the reproducing property of Gaussian random variables. Try and see if you can prove it: take the CDF of aX + b, differentiate it, and verify.
There is a standard convention in English: if you really like Shakespeare, you write "shakespearean"; if you are not familiar with him, you write "Shakespearean".
Do not make up your own definitions if you are asked whether a vector is Gaussian or not. Before we look at Gaussian random vectors, recall some facts about random vectors.
Let X = (X⁽¹⁾, …, X⁽ᵖ⁾)′ be a p × 1 random vector. Then EX ≝ (EX⁽¹⁾, …, EX⁽ᵖ⁾)′. Note that Y ↦ AY + b is a linear mapping, a linear transformation of Y. Its mean is E(AY + b) = A(EY) + b, and its variance is Var(AY + b) = A(Var Y)A′. Never forget these results.
Gaussian random vector
Let X be a p × 1 random vector. Let μ (p × 1) ≝ E(X) and V (p × p) ≝ Var X. Now we introduce a very strange-looking definition: X is said to be a Gaussian random vector if and only if every linear combination of the coordinates of X is a Gaussian random variable, that is, α′X is a Gaussian random variable for each α ∈ Rᵖ. This definition is really remarkable: it frees you from remembering any formulæ.
If you have a random vector Z, you check every possible linear combination of its coordinates.
Consequence of this definition. Let Z ≝ (X, Y, W)′ be a Gaussian random vector. This means that any linear combination of the components X, Y and W is a Gaussian random variable. (Do not say "elements", because this is not a set.) Can you tell me the distribution of X? It is normal. Why? Because α = (1, 0, 0)′ fits the definition: X = (1, 0, 0)·Z. It is automatically implied that any such combination is Gaussian. If you stack two random variables, it is not necessary that the result is jointly Gaussian: you need many requirements on their joint distribution. In the same vein, W = (0, 0, 1)·Z, and so is X + W = (1, 0, 1)·Z.
Another piece of terminology: if X is a Gaussian random vector, then its coordinates are said to be jointly normal.
Notation. "Z is a Gaussian random vector with mean μ and variance V" is denoted Z ~ N(μ, V). Nice people write it as Z ~ MVN(μ, V).
Result. Let Z be a Gaussian random vector, Z ~ N(μ, V). The dimension of μ is the same as Z's, dim Z × 1, and the dimension of V is dim Z × dim Z. Let A be a matrix of constants and b a vector of constants, where the number of columns of A equals dim Z. Then AZ + b ~ N. We claim that this vector is Gaussian: Gaussianity is preserved. Let's see why.
How can we check this result? First, the outcome AZ + b is clearly a random vector. Then, we must check every linear combination of its coordinates for Gaussianity. Let α ∈ R^{dim b}. Then α′(AZ + b) = α′AZ + α′b = a′Z + c, where a ≝ A′α and c ≝ α′b. But a′Z is just a linear combination of the coordinates of Z, and this is a Gaussian random variable because Z was Gaussian in the first place. So: any linear transformation of a Gaussian random vector is Gaussian.
Therefore, in the example, Y ~ N by the reproducing property of Gaussian random vectors under linear maps. So
Y ~ N( (1 1; 2 −3)·EX,  (1 1; 2 −3)·(Var X)·(1 1; 2 −3)′ ),

which, with EX = (1, 2)′ and Var X = (1 −2; −2 7), gives

(1 1; 2 −3)(1; 2) = (3; −4)  and  (1 1; 2 −3)(1 −2; −2 7)(1 2; 1 −3) = (4 −17; −17 91).
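The matrix arithmetic in this example can be verified mechanically; the moments EX = (1, 2)′ and Var X = [[1, −2], [−2, 7]] are as read from the displays here.

```python
import numpy as np

# Mean and variance of Y = A X under a linear map: A*EX and A*(Var X)*A'.
A = np.array([[1, 1], [2, -3]])
mu = np.array([1, 2])
V = np.array([[1, -2], [-2, 7]])

mean_Y = A @ mu          # A * EX
var_Y = A @ V @ A.T      # A * (Var X) * A'
print(mean_Y, var_Y)
```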
Let us see what we have learned.
Example. Let X ~ N(0, 1) be a standard normal variable. (Note the notational contrast: Y ~ N(0, I) would denote a multivariate normal vector.) Let Y ≝ XR, where R is Rademacher (±1 with probability ½ each) and R ⊥⊥ X.
What is the distribution of Y? Observe that supp Y = R, because X takes all values on the real line. Let us find the CDF of Y. Let y ∈ R. Then

CDF_Y(y) ≝ P(Y ≤ y) = P(XR ≤ y) = P(XR ≤ y, R = −1) + P(XR ≤ y, R = 1)
  = P(−X ≤ y)·P(R = −1) + P(X ≤ y)·P(R = 1) = ½Φ(y) + ½Φ(y) = Φ(y),

because R ⊥⊥ X and the standard Gaussian is symmetric around the origin, so P(−X ≤ y) = P(X ≤ y): you are adding the same thing twice. This means that Y is also standard normal, Y ~ N(0, 1). We created another standard normal by using this trick.
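The trick can be checked by simulation: Y = XR should be indistinguishable from a standard normal.

```python
import numpy as np

rng = np.random.default_rng(2)

# X standard normal, R Rademacher (+/-1 each with probability 1/2), R independent of X.
n = 500_000
X = rng.normal(size=n)
R = rng.choice([-1.0, 1.0], size=n)
Y = X * R

# Y should look standard normal: mean ~ 0, sd ~ 1, 97.5% quantile ~ 1.96.
print(Y.mean(), Y.std(), np.quantile(Y, 0.975))
```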
Same example, next question. Now I am going to create a random vector: let Z ≝ (X, Y)′. If you do not agree that Z is a random vector, you have a huge existential issue here. Is Z a Gaussian random vector? What do you have to check? Whether every linear combination is Gaussian. So let's check uncountably many linear combinations, hopefully by the end of this lecture. Let's check…
Lecture ( - )
Yet another reproducing property of Gaussian random vectors: the second reproducing property of Gaussians. I am not sure whether there are other variables with this property. Suppose we have

(X; Y) ~ N( (μ_X; μ_Y),  (σ²_X  σ_XY;  σ_XY  σ²_Y) ).

Result: Y | X ~ N(E(Y | X), Var(Y | X)), where E(Y | X) = BLP(Y | X) and Var(Y | X) = Var(Y − BLP(Y | X)). Likewise, X | Y ~ N(E(X | Y), Var(X | Y)), where E(X | Y) = BLP(X | Y) and Var(X | Y) = Var(X − BLP(X | Y)). So Y is homoskedastic with respect to X, and X is homoskedastic with respect to Y. This is implied by joint Gaussianity.
Example. Suppose I told you

(X₁; X₂) ~ N( (1; 1),  (3 1; 1 2) ).

Find the distribution of X₁ + X₂ | X₁ = X₂. Let Y₁ ≝ X₁ + X₂ and Y₂ ≝ X₁ − X₂. Then Law(X₁ + X₂ | X₁ = X₂) = Law(Y₁ | Y₂ = 0). (There is another word for "distribution": "law".)
Do you agree that

(Y₁; Y₂) = (X₁ + X₂; X₁ − X₂) = (1 1; 1 −1)(X₁; X₂)
  ~ N( (1 1; 1 −1)(1; 1),  (1 1; 1 −1)(3 1; 1 2)(1 1; 1 −1) ) = N( (2; 0),  (7 1; 1 3) )?

So really Y₁ | Y₂ ~ N(2 + (1/3)Y₂, 20/3), and hence Y₁ | Y₂ = 0 ~ N(2, 20/3).
Try at home: 2𝑋2 | 𝑋1 − 𝑋2 = 0.
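The answer N(2, 20/3) follows from the standard bivariate-normal conditioning formula, which can be evaluated directly for the joint law N((2, 0), [[7, 1], [1, 3]]) derived above:

```python
# Conditional law for jointly normal (Y1, Y2):
# Y1 | Y2 = y2  is  N( mu1 + (s12/s22)(y2 - mu2),  s11 - s12**2/s22 ).
mu1, mu2 = 2.0, 0.0
s11, s12, s22 = 7.0, 1.0, 3.0

def cond_mean(y2):
    return mu1 + s12 / s22 * (y2 - mu2)

cond_var = s11 - s12 ** 2 / s22

print(cond_mean(0.0), cond_var)   # 2.0 and 20/3, i.e. N(2, 20/3)
```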
This is not an easy result to show: one computes that the conditional density has exactly this form, and then every linear combination of coordinates is Gaussian.
If two random variables are independent, they are surely uncorrelated. If two random variables are uncorrelated, are they independent? No. However, if they are jointly normal, uncorrelatedness does imply independence. This is such a strong property!
Result. Let X be a p × 1 Gaussian random vector. Then the coordinates of X are independent if and only if they are pairwise uncorrelated. Why is this result so useful? Suppose I give you a bunch of random variables, each of them marginally normal and mutually independent, and suppose you stack them. Will the result be a Gaussian vector? Yes.
Proof. Independence always implies uncorrelatedness, so it only remains to prove the other direction: if the coordinates of X are pairwise uncorrelated, then they are independent. Assume the coordinates of X are pairwise uncorrelated. Then the variance-covariance matrix is diagonal:

Var X = diag(σ₁², …, σ_p²).
Then the joint density factorises:

PDF_X(x) = (1/(2π))^{p/2} · (σ₁² · … · σ_p²)^{−1/2} · exp( −½ Σ_{i=1}^p (x⁽ⁱ⁾ − μ⁽ⁱ⁾)² / σᵢ² )
  = Π_{i=1}^p (1/(√(2π)σᵢ)) e^{−(x⁽ⁱ⁾−μ⁽ⁱ⁾)²/(2σᵢ²)} = Π_{i=1}^p PDF_{X⁽ⁱ⁾}(x⁽ⁱ⁾),

so the coordinates are independent.
…are just bounded.
Definition. (x_n) is said to be eventually bounded if there exist numbers M and N such that n > N ⇒ |x_n| < M. A sequence that is eventually bounded never gets too large or too small.
Example. For n ∈ N, the sequences x_n = sin n and x_n = (−1)ⁿ are not convergent, but they are eventually bounded. The set of all eventually bounded non-random sequences is denoted O(1), read "big oh of one". There is another symbol: we would like to isolate the sequences that go to zero. Among convergent sequences, it is good to have a name for the ones that converge to zero, since x_n → x as n → ∞ ⇔ x_n − x → 0 as n → ∞.
Definition. The set of all non-random sequences that converge to zero as n → ∞ is denoted o(1).
But what if, instead of real numbers, we had sequences of random vectors? If x_n → x as n → ∞, then x_n − x = o(1) as n → ∞. The big advantage: we do not know how to do algebra with arrows, but we do know algebra with equalities.
o(1) is a sequence converging to zero. Then o(1)·O(1) = o(1), although the two symbols on the ends denote different sequences. Next, o(1) + O(1) = O(1). Next, o(1) − o(1) = o(1), because these might be different sequences. This is not arithmetic; this is symbol manipulation.
What would change if we had real vectors? Recall x_n → x as n → ∞ ⇔ x_n − x → 0 as n → ∞ ⇔ ∀ε > 0 ∃N_ε > 0 : ∀n > N_ε, x_n − x ∈ B(0, ε). This is a remarkable definition, because nowhere have we said that x is a real number. For real numbers the condition means |x_n − x| < ε, measuring the distance between two real numbers; for a vector x, the only thing that changes is that ‖x_n − x‖ < ε.
Let us now talk about the convergence of random vectors. A random vector is just a vector of random variables; you just have to do something to account for the randomness inside.
Definition. We say that X_n converges in probability to X, written X_n →p X as n → ∞, or plim_{n→∞} X_n = X, if ∀ε > 0 ∃N_ε such that n > N_ε ⇒ P(‖X_n − X‖ > ε) < ε. This is just a probabilistic translation of the real definition.
Example. Let X_n ~ N(0, n), n ∈ N. Then X_n/n →p 0 as n → ∞. (The reason Europe became great was calculus, nothing else.)
Let ε > 0. We have to show that P(|X_n/n| > ε) goes to zero. It is nothing else than

P(|N(0, n)/n| > ε) = P(|N(0, 1/n)| > ε) = P(N(0, 1/n) > ε) + P(N(0, 1/n) < −ε)
  = P(N(0, 1) > √n·ε) + P(N(0, 1) < −√n·ε) = 1 − Φ(√n·ε) + Φ(−√n·ε) → 0 as n → ∞.

This is true because the CDF converges to zero as its argument goes to −∞ and to one as it goes to ∞.
Why was convergence invented? When you say x_n → x as n → ∞, what is the intuition? If you go far out into the tail, you can approximate x with x_n: you can use elements from the tail of the sequence to approximate the limit. And translating this into statistics, the statistical counterpart of approximation is estimation.
The expectation is an unknown object. How would you calculate the expected value of X? It is the integral E(X) = ∫_{x∈supp X} x·PDF_X(x) dx. Who alone knows the densities of random variables? Nature. We want to estimate the unknown value.
Result. Let X₁, X₂, …, X_n be IID copies of the random vector X — we have some data on X. Then

(1/n) Σ_{i=1}^n X_i →p E(X) as n → ∞.
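A minimal simulation of this law of large numbers; the Exponential(1) distribution (EX = 1) is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sample means of IID Exponential(1) draws for growing n:
# the deviation |mean - EX| should shrink as n grows.
devs = [abs(rng.exponential(1.0, size=n).mean() - 1.0)
        for n in (100, 10_000, 1_000_000)]
print(devs)
```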
Lecture ( - )
Homework problem. Let U ≝ 2X₁ − X₂ + X₃ and V ≝ X₁ + 2X₂ + 3X₃. The first step is to find the joint distribution of U and V. They can be written as

(U; V) = (2 −1 1; 1 2 3)·(X₁; X₂; X₃) ~ N(…).
Last time, we looked at Bernoulli's weak law of large numbers. This is the standard version; if you change the assumptions, different laws of large numbers exist. Assume that X₁, …, X_n are IID random vectors — you could say they are copies of one random vector. Then X̄ ≝ (1/n) Σ_{i=1}^n X_i →p E(X₁) as n → ∞.
The proof of this is super simple. The value of this result is that it gives a theoretical justification for using sample means to estimate theoretical means. Last year there was a talk by a statistician about the history of statistics: people have been using sample averages since Babylonian times. They used cuneiform to report sample averages of wheat given to the tax collector. How much on average did the tax collectors get? The theoretical justification took thousands of years.
It can be shown that if X̄ →p E(X₁), then X̄ − E(X₁) →p 0 as n → ∞. That is,

(1/n) Σ_{i=1}^n X_i − E(X₁) →p 0 ⇔ (1/n) Σ_{i=1}^n (X_i − μ) →p 0 ⇔ (1/n) Σ_{i=1}^n (X_i − μ) = o_p(1).

We denoted such non-random sequences by o(1), but due to the randomness we use different notation here.
Definition. A generic sequence of random vectors or variables that converges in probability to zero is denoted o_p(1), "little oh pee one".
There are situations where the law does not apply, for example when the expectation does not exist.
Let me introduce another notion of convergence. (1/n) Σ_{i=1}^n X_i is the sample mean, and it is consistent for the population mean: in a certain sense, the sample mean gets closer and closer to the population mean. But this is a degenerate result: in the limit, the distribution of the sample mean puts all its mass at the point E(X₁). It was natural to ask: can we refine this? Can we modify the result so that we get a non-degenerate limiting distribution? How fast is this convergence? This is a logical question. Amazingly, the final answer was given in the early 1920s by the Finnish mathematician Lindeberg, with all i's dotted and t's crossed. Some very great people had failed to give a rigorous proof.
In order to show this result, we need another notion of convergence: convergence in distribution. Let (Y_n) be a sequence of random vectors and let Y be another random vector. We say that (Y_n) converges in distribution to Y if

CDF_{Y_n}(y) → CDF_Y(y) as n → ∞,

that is, if the CDF converges pointwise for each y at which CDF_Y is continuous. This is the usual pointwise convergence of functions: everything that happens here is non-random. We write this as

Y_n →d Y.

There is a slight technical wrinkle: you need only the points of continuity. Remember, Y_n could be discrete, continuous, or mixed; then the CDF can have jumps, and you just exclude them from consideration. But this notion of convergence was invented to prove one of the main limiting results in statistics: the central limit theorem. The CLT required the development of new mathematics, including complex analysis; it was not until complex analysis was fully developed that a rigorous proof was possible. Gauss could not do it; Laplace could not do it; there was always a mistake or an inconsistency. Lindeberg did it.
This is also one of the laws that works in nature. In Chicago, at the Museum of Science and Industry, it works: the marbles drop randomly, but their final distribution is the bell curve.
This has huge practical implications. All the inferential testing — when NASA says that the probability of success is such-and-such — is based on this approximation result. People had been using it without proof since the days of Gauss and Laplace.
Recall the WLLN:

(1/n) Σ_{j=1}^n (X_j − μ) →p 0.
The CLT refines this: you look at the same centred sum, but instead of dividing by n, you divide by something smaller. Once you do this, the limit is no longer degenerate: it is a well-defined random variable with a certain distribution.
A related question was the speed of convergence: the limit in the WLLN is approached too quickly for the scaled sum to have a non-trivial limit. What should α be so that n^α · (1/n) Σ_{i=1}^n (X_i − μ) neither degenerates to zero nor blows up? It turns out that the only α possible is 1/2. You should not confuse the WLLN and the CLT.
Note that

(1/√n) Σ_{i=1}^n (X_i − μ) = (√n/n) Σ_{i=1}^n (X_i − μ) = √n ( (1/n) Σ_{j=1}^n X_j − μ ) = √n (X̄ − μ).
What is the meaning of this result? What does it mean that Y_n converges to Y? The limit can be replaced by the tail of the sequence. What is the practical meaning? Y can be approximated probabilistically by the tail of Y_n: the distribution of Y_n is very close to that of Y if n is large. If the sample size is large enough… but how large? If someone says n = 25, this is shit. That is why econometricians do Monte Carlo simulations. For empirical purposes, whatever sample size you have, you should use it. If in an exam you are given a sample size n and you write that it is not large enough for the CLT, you get zero.
If n is large enough, then

√n (X̄ − μ) ≈d N(0, V) ⇔ X̄ ≈d N(μ, V/n).
The CLT says that the sample mean is approximately normal. Let μ ≝ E(X) and V ≝ Var X. Then

P(0.6 ≤ X̄_n ≤ 0.8) = P(0.6 − μ ≤ X̄_n − μ ≤ 0.8 − μ)
  = P( (0.6 − μ)/√(V/n) ≤ (X̄_n − μ)/√(V/n) ≤ (0.8 − μ)/√(V/n) ).

What is the justification that (X̄_n − μ)/√(V/n) is approximately standard normal? The CLT:

√n (X̄ − μ) ≈d N(0, V) ⇔ (X̄ − μ)/√(V/n) ≈d N(0, 1).

So the probability above is approximately

Φ( (0.8 − μ)/√(V/n) ) − Φ( (0.6 − μ)/√(V/n) ).
In order to get μ and V (here PDF_X(x) = 3x² on supp X = (0, 1) and n = 15), compute

E(X) = ∫_{supp X} x·PDF_X(x) dx = ∫₀¹ 3x³ dx = (3/4)x⁴ |₀¹ = 3/4,

and Var X = E(X²) − (EX)² = 3/5 − 9/16 = (48 − 45)/80 = 3/80. So you have

Φ( (4/5 − 3/4)/√(3/80 · 1/15) ) − Φ( (3/5 − 3/4)/√(3/80 · 1/15) ).
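The final expression can be evaluated and cross-checked by simulating sample means of n = 15 draws from the density 3x² on (0, 1); since the CDF is x³, draws are generated by inverse transform, X = U^{1/3}.

```python
import math
import random

Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))

# mu = 3/4, V = 3/80 for the density 3x^2 on (0, 1); n = 15 as above.
mu, V, n = 3 / 4, 3 / 80, 15
sd = math.sqrt(V / n)
approx = Phi((0.8 - mu) / sd) - Phi((0.6 - mu) / sd)

# Simulation: empirical frequency of 0.6 <= sample mean <= 0.8.
random.seed(0)
reps = 200_000
hits = 0
for _ in range(reps):
    xbar = sum(random.random() ** (1 / 3) for _ in range(n)) / n
    hits += (0.6 <= xbar <= 0.8)
mc = hits / reps
print(approx, mc)
```

The CLT plug-in works out to Φ(1) − Φ(−3) ≈ 0.84, and the simulated frequency lands nearby.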
The interesting questions in science do not have exact answers, but they have approximate answers.
HW, additional problem. Define the random variable X_i that denotes the outcome of the i-th game:

X_i ≝ { 5, with prob. 1/3;  2, with prob. 2/3 }

and recall the continuous mapping theorem: for continuous g,

X_n →d X ⇒ g(X_n) →d g(X) as n → ∞.
You could say the estimate is such-and-such because your grandmother told you, but that is a shitty estimator! Estimators should be dictated by the data. A consistent estimator of p/(1 − p) is p̂/(1 − p̂). Why? Because of the CMT: g(p̂) →p g(p). Let g(p) ≝ p/(1 − p); g : (0, 1) → R is continuous, so there you go.
Example. Suppose

(X_n; Y_n) →d (X; Y), where (X; Y) ~ N( (μ_X; μ_Y),  (σ²_X  σ_XY;  σ_XY  σ²_Y) ).
…random variable, so √n(θ̂ − θ) ≈d N(0, σ²). Therefore

θ̂ ≈d N(θ, σ²/n).

Here σ²/n plays the role of the variance. Can we call it the variance of θ̂? Not really, because they are not equal: it is an approximation to the variance of θ̂.
Definition. The asymptotic variance is defined as asVar(θ̂) ≝ σ²/n. What do you call the square root of a variance? A standard deviation:

assd(θ̂) = √asVar(θ̂) = σ/√n.

Replacing σ by an estimate gives the standard error:

se(θ̂) ≡ âssd(θ̂) ≝ σ̂/√n.

Define the t-statistic

T̂ ≝ (θ̂ − θ)/se(θ̂).

Then

(θ̂ − θ)/se(θ̂) = (θ̂ − θ)/(σ̂/√n) = (σ/σ̂) · (θ̂ − θ)/(σ/√n) →d N(0, 1),

since σ/σ̂ →p 1 and (θ̂ − θ)/(σ/√n) →d N(0, 1).
We said that σ̂² →p σ². Then σ̂²/σ² →p 1, and so σ̂/σ →p 1, and σ/σ̂ →p 1. The product of the two factors above converges in distribution to the standard normal by Slutsky's lemma. This is the basis for inference in large samples: when you do t-tests without the assumption of normality, the statistic is still asymptotically normal. The t distribution came from William Gosset; it requires the normality assumption, and nowhere in this derivation did we use that assumption.
Delta method
There are two versions of the delta method: the approximate delta method and the exact delta method.
Suppose we know that θ̂ is a consistent estimator of θ and

√n(θ̂ − θ) →d N(0, σ²) as n → ∞.

By the mean value theorem,

g(θ̂) = g(θ) + g_θ(λθ̂ + (1 − λ)θ)·(θ̂ − θ),  λ ∈ (0, 1),

where g_θ denotes the derivative dg/dθ. Therefore

√n (g(θ̂) − g(θ)) = g_θ(θ + λ(θ̂ − θ)) · √n(θ̂ − θ) →d g_θ(θ)·N(0, σ²) = N(0, g_θ(θ)²σ²),  λ ∈ (0, 1),

since θ̂ − θ →p 0 makes the first factor converge in probability to g_θ(θ), while the second factor converges in distribution to N(0, σ²).
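A simulation sketch of the delta method; the choice g(θ) = e^θ with θ̂ a sample mean is an invented illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(4)

# Delta method for g(theta) = exp(theta), theta = EX estimated by the sample mean:
# asVar(g(thetahat)) ≈ g'(theta)^2 * sigma^2 / n = exp(theta)^2 * sigma^2 / n.
theta, sigma, n, reps = 0.5, 1.0, 400, 20_000
g = np.exp

# Many replications of the estimator g(sample mean of n N(theta, sigma^2) draws).
est = g(rng.normal(theta, sigma, size=(reps, n)).mean(axis=1))
simulated_var = est.var()
delta_var = g(theta) ** 2 * sigma ** 2 / n
print(simulated_var, delta_var)
```

The Monte Carlo variance of g(θ̂) matches the delta-method approximation closely; the gap shrinks further as n grows, reflecting the caution about nonlinearity below.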
A related back-of-the-envelope rule: E[g(X, Y)] ≈ g(EX, EY), the first-order terms of the expansion having expectation zero. This is purely a back-of-the-envelope calculation, only to save your arse if someone is holding a gun to your head; it is exact only for linear functions.
What about the variance? You should treat the delta method with some caution given the nonlinearity of g.
Example. Additional problem, part (ii).