Lecture ( - ): Probability Calculations


Lecture  (--)

If you want a Ph.D. in econometrics, you will have to do measure theory, but this course is going to require only calculus and integration. DeGroot's "Probability and Statistics" is a fundamental book, fantastically well written. We shall not follow the book closely, because otherwise we would not need the course.
We are going to look at two predictors: linear predictors and best predictors; then distributions, in particular the Gaussian distribution. Econometrics does not require normality; normality is required in portfolio theory. Then some basic asymptotic theory, the CLT. Finally, inference and confidence intervals.
The problem sets and solutions will be posted on Moodle (ideological rejection or something). Group work is encouraged. It is amazing how our brains work: we have a problem, and each one approaches it with a different solution. Solving the same problem in different ways is essential for group work. It is not in your interest to free-ride.
The first homework is due Friday. There are  problem sets. If you look at the additional problems, there are  parts. You have to solve part . The exam is on the  th. All the problem sets are light-headed. You have to get cracking; there is no time to waste.
Professor Tripathi's office number is BC.. If you want a specific time, send an email to gautam.tripathi@uni.lu.
No fancy stuff as long as your handwriting is legible.
Not having enough blackboards sucks.
Let us review some basic probability calculations.

 Probability calculations
You have to have something random to calculate probabilities, so the basic ingredient is a random experiment. A random experiment is an action whose outcome is uncertain in advance of its happening.

 It is the squeaky wheel that gets the grease: if you don't ask, you don't get.

A sample space for an experiment is the set of all outcomes associated with this experiment.
The simplest experiment is tossing a coin. Your sex is the outcome of an experiment of nature combining the genes of your parents. The outcomes that you see are the outcomes of nature's random experiment. Nature does something—stocks go up. Nature does something—you pass an exam. How long you will live, whether you will get married—everything is a random experiment.
Suppose you toss a coin. You can have either heads or tails. The sample space is a set with two objects inside: {𝐻, 𝑇}. Suppose I toss a coin twice; then the sample space is the Cartesian product of {𝐻, 𝑇} with itself: {𝐻, 𝑇} × {𝐻, 𝑇} = {(𝐻, 𝐻), (𝐻, 𝑇), (𝑇, 𝐻), (𝑇, 𝑇)}. The more complicated the experiment, the more complicated the space.
What is an event? An event is any subset of the sample space. Look at these sets: 𝐸 def= {(𝐻, 𝐻)} and 𝐹 def= {(𝐻, 𝐻), (𝐻, 𝑇), (𝑇, 𝐻)}. 𝐸 says: you toss a coin twice, and both times you get heads.
The most powerful relational operator in mathematics is equality: "𝐴 = 𝐵" means "𝐴 is identical to 𝐵". Here we impose the equality by definition, def=. There is no calculation behind it; we are imposing equality by definition.
If you toss a coin twice and get two heads, you say 𝐸 has occurred. If you toss a coin twice and at least one head has occurred, you say 𝐹 has occurred. What is the chance, probability, likelihood of the event 𝐸? We need to figure out the probability of the event 𝐸: the probability of the first head times the probability of the second. Chevalier de Méré, Pierre de Fermat, Blaise Pascal. The success of this classical theory lasted until the XIXth century. The classical probability of an event is the cardinality of the event divided by the cardinality of the sample space. But the minute you go to science and engineering with infinitely large sample spaces, it fails, because you can get the ratio ∞/∞, which makes no sense. The price of a stock can take any value between 0 and +∞. You cannot count the number of elements; this method is no longer sustainable. Who is the creator of mathematical statistics? Andrey Kolmogorov. In 1933, he wrote a book on the mathematical theory of probability that agrees with the classical theory on finite sample spaces. So the theory is quite young...
The mathematical theory of probability is one of the greatest things that have come to a human mind. We need a systematic way of measuring probabilities. Consider a random experiment, and let 𝑆 be the sample space associated with it. We are given a particular event: an event 𝐴 is a subset of this sample space. 𝐴 can have any cardinality, including infinite. Everything is very, very general. The Greeks could not do it, Laplace could not handle it, Newton could not do it. The answer is remarkably simple: all you need is a function. It should take an event as an input and return a number. How should we construct this function? It should be:
1. consistent (at any time the answer should be the same);
2. giving reasonable answers;
3. such that if the sample space is finite, it gives the same answer as classical probability.
As long as you can find a function satisfying these requirements, you are ready to measure probabilities. What should be the domain of the function? It receives an event and returns a number, so the domain should be the set of all subsets. Kolmogorov gave this function the name probability measure: P : {set of all subsets of 𝑆} ↦→ R is called a probability measure if it satisfies three properties:
1. Probabilities should be non-negative (they can be zero): P(𝐴) ≥ 0 ∀𝐴 ⊂ 𝑆. How do you interpret negative numbers? Ask a young child what zero is, and he will crap his pants. It took humanity some time to invent a symbol for nothingness.
2. The probability of the entire sample space is 1, and the probability of the empty event is 0: P(𝑆) = 1, P(∅) = 0.
3. The crucial property, countable additivity: if 𝐴₁, 𝐴₂, … is a sequence of pairwise disjoint subsets of 𝑆, then the probability of the union is the sum of the probabilities: P(⋃_{i=1}^∞ 𝐴ᵢ) = ∑_{i=1}^∞ P(𝐴ᵢ). Disjoint means that the intersection is empty; the sequence is such that the intersection of any two events is empty: 𝐴ᵢ ∩ 𝐴ⱼ = ∅ for 𝑖 ≠ 𝑗.
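Not from the lecture, but the axioms can be checked mechanically on a finite sample space. A minimal Python sketch for the two-coin-toss space (the code and names below are illustrative):

```python
# Illustrative sketch: check Kolmogorov's axioms for the classical
# measure P(A) = card(A)/card(S) on the two-coin-toss sample space.
from itertools import combinations

S = {("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")}

def P(A):                                   # classical probability measure
    return len(A) / len(S)

subsets = [set(c) for r in range(len(S) + 1) for c in combinations(S, r)]

assert all(P(A) >= 0 for A in subsets)      # axiom 1: non-negativity
assert P(S) == 1 and P(set()) == 0          # axiom 2: P(S) = 1, P(empty) = 0
for A in subsets:                           # axiom 3 (finite form):
    for B in subsets:                       # additivity over disjoint events
        if A.isdisjoint(B):
            assert P(A | B) == P(A) + P(B)
print("all axioms hold")
```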
Any function that satisfies Kolmogorov's axioms is a probability measure. Do we know that such a function exists? Otherwise this theory is silly. Probability measures exist!
Example. Suppose card(𝑆) < ∞. Then P(𝐴) def= card(𝐴)/card(𝑆), 𝐴 ⊂ 𝑆, is a probability measure.
How about defining a probability measure on the real line, 𝑆 = R? Its cardinality is uncountable infinity. Then

    P₁(𝐴) def= ∫_𝐴 I_[0,1](𝑥) d𝑥.

We use the indicator function of the set [0, 1]:

    I_[0,1](𝑥) = { 1, 𝑥 ∈ [0, 1],
                   0, 𝑥 ∉ [0, 1].

How about this:

    P₂(𝐴) def= ∫_𝐴 (1/√(2π)) e^{−𝑥²/2} d𝑥.

In the first example,

    P₁(R) = ∫_R I_[0,1](𝑥) d𝑥 = ∫_{(R∖[0,1])∪[0,1]} I_[0,1](𝑥) d𝑥
          = ∫_{R∖[0,1]} I_[0,1](𝑥) d𝑥 + ∫_{[0,1]} I_[0,1](𝑥) d𝑥.

The first integral is always zero, so this is really ∫_{[0,1]} I_[0,1](𝑥) d𝑥, and this integral is equal to 1.
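A quick numerical sanity check (a sketch, not part of the notes): both P₁ and P₂ assign probability 1 to the whole real line, which scipy confirms:

```python
# Sketch: verify numerically that P1 and P2 assign probability 1 to R.
from scipy.integrate import quad
import numpy as np

ind = lambda x: 1.0 if 0 <= x <= 1 else 0.0          # indicator of [0, 1]
phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

P1_R = quad(ind, -1, 2, points=[0, 1])[0]  # I_[0,1] vanishes outside [0,1]
P2_R = quad(phi, -np.inf, np.inf)[0]       # the Gaussian integral
print(P1_R, P2_R)                          # both 1 up to numerical error
```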
The last property is just a property of integrals. So probability measures exist. In mathematics, existence is crucial; the rest is improving the fact that we exist, otherwise it is just abstract nonsense. These 3 properties are even fewer than in geometry—it is so parsimonious. And Kolmogorov died in 1987. Professor Tripathi knows a person who knew Kolmogorov.
If instead of an infinite sequence we have a finite number of disjoint sets—say, 𝐴₁, 𝐴₂ are disjoint subsets of 𝑆—then P(𝐴₁ ∪ 𝐴₂) = P(𝐴₁) + P(𝐴₂). But how do we know that? Take the countable sequence 𝐴₁, 𝐴₂, ∅, ∅, ∅, …. Then ⋃_{i=1}^∞ 𝐴ᵢ = 𝐴₁ ∪ 𝐴₂, so, by property 3, P(⋃_{i=1}^∞ 𝐴ᵢ) = P(𝐴₁ ∪ 𝐴₂) = ∑_{i=1}^∞ P(𝐴ᵢ) = P(𝐴₁) + P(𝐴₂).
These properties have remarkable implications.

 Implications of the probability axioms


Let 𝑆 be a sample space.
1. If 𝐴 is an event, then P(𝐴^𝐶) = 1 − P(𝐴). How do we show this? Only using the probability axioms. It is easiest to draw a picture (figure, left panel). 𝑆 = 𝐴 ∪ 𝐴^𝐶 ⇒ P(𝑆) = P(𝐴 ∪ 𝐴^𝐶) = 1 = P(𝐴) + P(𝐴^𝐶). Why is the probability of the sample space one? Did your grandmother tell you? No, it is Kolmogorov's second axiom.
2. If 𝐴 ⊂ 𝐵, then P(𝐴) ≤ P(𝐵). Again, let us draw a picture (figure, right panel). 𝐵 = (𝐵 ∖ 𝐴) ∪ 𝐴, and these sets are disjoint, so P(𝐵) = P((𝐵 ∖ 𝐴) ∪ 𝐴) = P(𝐵 ∖ 𝐴) + P(𝐴) ≥ P(𝐴). Pure magic!


[Figure: left panel—a set 𝐴 and its complement 𝐴^𝐶 inside 𝑆; right panel—a subset 𝐴 ⊂ 𝐵 inside 𝑆. Caption: A complement of a set and a subset.]

3. P(𝐴 ∪ 𝐵) = P(𝐴) + P(𝐵) − P(𝐴 ∩ 𝐵).
Now we are not saying that these sets are disjoint! That's just one way to say, "your arse". Do you agree that 𝐴 ∪ 𝐵 = (𝐴 ∖ 𝐵) ∪ (𝐴 ∩ 𝐵) ∪ (𝐵 ∖ 𝐴)? What is the defining characteristic of these sets? They are pairwise disjoint. (A set is not disjoint with itself unless it is the empty set.) This is why P(𝐴 ∪ 𝐵) = P(𝐴 ∖ 𝐵) + P(𝐴 ∩ 𝐵) + P(𝐵 ∖ 𝐴). Now, 𝐴 = (𝐴 ∖ 𝐵) ∪ (𝐴 ∩ 𝐵), so P(𝐴) = P(𝐴 ∖ 𝐵) + P(𝐴 ∩ 𝐵), so P(𝐴 ∖ 𝐵) = P(𝐴) − P(𝐴 ∩ 𝐵). The same goes for 𝐵: 𝐵 = (𝐵 ∖ 𝐴) ∪ (𝐵 ∩ 𝐴), so P(𝐵 ∖ 𝐴) = P(𝐵) − P(𝐵 ∩ 𝐴). This does the trick: P(𝐴 ∪ 𝐵) = P(𝐴) + P(𝐵) − P(𝐴 ∩ 𝐵).
4. Suppose 𝐴 and 𝐵 are subsets of 𝑆, and suppose P(𝐴) = 0. Then P(𝐴 ∩ 𝐵) = 0.
First, 𝐴 ∩ 𝐵 ⊂ 𝐴, therefore P(𝐴 ∩ 𝐵) ≤ P(𝐴) = 0, and a probability cannot be smaller than zero, so it is zero.
Remember that P(𝐴) = 0 does not mean that 𝐴 = ∅. Likewise, if P(𝐵) = 1, 𝐵 is not necessarily the sample space.
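These implications can also be verified mechanically on a toy finite example—a sketch assuming the classical measure, with the sets chosen arbitrarily:

```python
# Sketch: check the four implications on S = {0,...,5} with the
# classical measure P(A) = |A|/|S|.
S = set(range(6))
P = lambda A: len(A) / len(S)

A, B = {0, 1, 2}, {1, 2, 3, 4}
assert abs(P(S - A) - (1 - P(A))) < 1e-12                  # 1. complement rule
assert P({1, 2}) <= P(B)                                   # 2. monotonicity
assert abs(P(A | B) - (P(A) + P(B) - P(A & B))) < 1e-12    # 3. inclusion-exclusion
E = set()                                                  # an event with P(E) = 0
assert P(E & B) == 0                                       # 4. null intersections
```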

Lecture  (--)
Homework , additional problem . 𝐴 and 𝐵 are events of 𝑆. We have
shown that if P(𝐴), then P(𝐴 ∩ 𝐵) = 0. Fir , 𝐴 ∩ 𝐵 ⊂ 𝐴. This implies
P(𝐴 ∩ 𝐵) ≤ P(𝐴). But P(𝐴) = 0 ≥ P(𝐴 ∩ 𝐵). But probabilities cannot be
negative, so this thing is zero.

 Oh, s***, the sponge... Ah, here it is. I was about to use a harsh word for the university.

Same problem, part . Is this statement true or false? P(𝐴) = 1 ⇒ P(𝐴 ∩ 𝐵) = P(𝐵). First, 𝐵 = (𝐵 ∖ 𝐴) ∪ (𝐵 ∩ 𝐴). These sets are disjoint, therefore P(𝐵) = P(𝐵 ∖ 𝐴) + P(𝐵 ∩ 𝐴). By definition, 𝐵 ∖ 𝐴 def= 𝐵 ∩ 𝐴^𝐶, the complement taken with respect to the sample space. Therefore, P(𝐵 ∖ 𝐴) = P(𝐵 ∩ 𝐴^𝐶) ≤ P(𝐴^𝐶) = 1 − P(𝐴) = 1 − 1 = 0. Therefore, P(𝐴) = 1 ⇒ P(𝐵 ∖ 𝐴) = 0. Hence P(𝐵) = P(𝐵 ∩ 𝐴): if you take an event that happens with probability one, the probability of its intersection with any other event equals the probability of that other event.
These results are fantastically useful. They are used to simplify probability calculations.

 (On downloaded books.) "His morals are laxer than mine!"

Suppose that an event happens with probability zero. It is not necessarily empty! This is the mathematical theory of probability. Once intuition gets stuck, it has to be replaced with mathematics. You have to rely upon doing things systematically.
Additional problem , part . Give an example of an event 𝐸 such that 𝐸 ≠ ∅ but P(𝐸) = 0. This is a mathematical statement that has nothing to do with intuition.
Consider a random experiment whose sample space is R (some random experiment about which we do not care). I am going to define a probability measure on subsets of R:

    P(𝐴) def= ∫_𝐴 I_[0,1](𝑥) d𝑥,  𝐴 ⊂ R.

Let 𝐸 def= Q, the set of all rational numbers in R. Then

    P(𝐸) = ∫_𝐸 I_[0,1](𝑥) d𝑥 = ∫_{𝐸∩[0,1]} d𝑥.

How many rational numbers are there? A countable infinity. A measure is the size of a set. Rational numbers are countably infinite: they are in one-to-one correspondence with the natural numbers. (So the infinity of the irrational numbers is uncountable.) A single number is an interval collapsed on itself. The Lebesgue measure of the rationals is zero: if you have an interval, its measure is its length; the measure of a figure on the plane is its area. The rationals in [0, 1] are also countable, so

    ∫_{Q∩[0,1]} d𝑥 = ∫_{⋃_{i=1}^∞ {qᵢ}} d𝑥 = ∑_{i=1}^∞ ∫_{{qᵢ}} d𝑥 = 0.

Never confuse a zero-probability event with an empty event. You might think that this is a silly example.

 Random variables
We introduced the probability measure, but there is one problem: the sample space can be very, very abstract and consist of non-numerical outcomes. The outcomes of a coin are heads or tails. You can think of a set of donkeys as well as a set of real numbers—a set of real numbers is a set like any other. The language of mathematics does not require real numbers to exist. Nature tosses a coin, there are some random processes in your mother's body, and you get a specific genome characteristic. Computationally it is difficult, but mathematically it is feasible. We need a device to do the computations easily. This is why you need a function mapping the sample space into the real line—this is the definition of a random variable. Once you have done this mapping, for practical purposes the real line becomes your sample space.
A random variable is a function. Suppose the sample space is 𝑆 = {𝐻, 𝑇}. A possible mapping is

    𝑋(ω) = { 1, ω = 𝐻,
             0, ω = 𝑇.

Random variables are functions, but they are written without arguments. It is a convention: 𝑋 = 1 if the outcome is heads and 𝑋 = 0 if the outcome is tails.
Uppercase roman letters will be used to denote random variables. If there is some notation your grandmother likes, I couldn't care less; I do not care what corner of the world you are coming from. Uppercase roman letters denote random variables. Lowercase roman letters will denote outcomes: 𝑋 = 𝑥, 𝑥 ∈ {0, 1}. I will use this religiously.
Whatever properties apply to functions apply to random variables. However, some of the terminology changes in the probabilistic setting. The co-domain is the real line. The domain is heads and tails. The range is {0, 1}, and it is called the support: the support of a random variable is the set of values it takes. It is written like this: supp 𝑋 = {0, 1}. We are not saying "range" because we are not using functional notation.
If you want to find the probability that a child is born with blue eyes and brown hair and weighs . kg, you have to compute it, and you need two pieces of information. All the fancy math is directed towards one goal: how can I calculate the probability of an interesting event? You just have to compute the probability of 𝑋 > 9 with respect to some measure, because the genes are combining in an abstract space. Weight can be zero or arbitrarily large.

 I have seen people with arbitrarily large weights.

1. You have to specify the support of a random variable, i.e. the set of values taken by this random variable.
2. You have to specify the probability measure, depending on your imagination and the question being asked.
Example . Suppose that supp 𝑋 = R and
∫︁
def 1 2
Prob(𝑋 ∈ 𝐴) = √ e −𝑥 /2 d𝑥.
𝐴 2𝜋
This thing integrates to , it is the Gaussian integral. This is a well-defined
probability measure. Do you recognise this random variable? Because it is
so frequently applied, it is called the andard normal, or Gaussian random
2
variable. You can call the fun ion √12𝜋 e −𝑥 /2 Tom Cruise, but let us be
scientific and call it the probability density fun ion, PDF.
The support has to be specified. You can get the probabilities by integ-
rating densities over regions. The Gaussian density looks like this:

1 2
PDF𝑋 (𝑥) = √ e −𝑥 /2 , 𝑥∈R
2𝜋
𝜙 is a reserved letter, ju as in computer programming, so do not use it for
d
anything else. The shorthand is 𝑋 = 𝒩 (0, 1).
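A computational aside (a sketch, not from the lecture): probabilities of the standard normal are obtained exactly as described, by integrating φ over the region; scipy's norm.cdf gives the same answer:

```python
# Sketch: probabilities of the standard normal by integrating phi,
# cross-checked against scipy's CDF.
from scipy.integrate import quad
from scipy.stats import norm
import numpy as np

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

a, b = -1.96, 1.96
p_quad = quad(phi, a, b)[0]               # P(X in [a, b]) by integration
p_cdf = norm.cdf(b) - norm.cdf(a)         # the same via the CDF
print(p_quad, p_cdf)                      # both about 0.95
```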
Example . Let 𝑦 be a random variable. What is wrong with this
atement? This is lowercase. The error propagates. Better have a doubt
cleared. So let us use 𝑌 . Even 𝑉 and 𝑣 are di inguished by a squiggle. Let
1
supp 𝑌 = [𝑎, 𝑏] where 𝑎 < 𝑏 and PDF𝑌 (𝑦) = 𝑏−𝑎 I[𝑎,𝑏] (𝑦), 𝑦 ∈ R. It takes
all values between 𝐴 and ∫︀ 𝐵. How will you use this piece of information?
Therefore, P(𝑌 ∈ 𝐴) = 𝑦∈𝐴 PDF𝑌 (𝑦) d𝑦 for 𝑦 ⊂ R. This random variable
is also frequently used in applied work. This is the uniform random
d
variable. The shorthand for “𝑌 is uniformly di ributed on [𝑎, 𝑏]” is 𝑌 =
𝒰[𝑎, 𝑏] (or 𝑌 ∼ 𝒰[𝑎, 𝑏]).
Example . Let 𝑝 ∈ (0, 1) be a con ant ∑︀
and let 𝑍 be a random variable
such that supp 𝑍 = {0, 1} and P(𝑍 ∈ 𝐴) = 𝑧∈𝐴 𝑝𝑧 (1 − 𝑝)1−𝑧 I{0,1} (𝑧), 𝐴 ∈


𝑅. Is this a probability
∑︀ measure? An empty summation is zero by definition.
Then, P(𝑍 ∈ 𝑅) = 𝑧∈R = 𝑝 (1 − 𝑝)1−0 I{0,1} (0) + 𝑝1 (1 − 𝑝)1−1 I{0,1} (1) =
0

1 − 𝑝 + 𝑝 = 1. The sum a s exa ly like an integral, so the third axiom


d
works, too. Then 𝑍 is di ributed as a Bernoulli random variable, 𝑍 =
Bernoulli(𝑝). If you go to Iceland or We eros and say “Bernoulli”, everyone
will under and it. In ead of a density fun ion, we call it probability
mass fun ion. It places masses at  and . ∫︀
A ually, there is no difference between integrals and sums: 𝑔(𝑥) d𝑥 is
a Lebesgue measure.
Let 𝑍 =d Bernoulli(𝑝). What is the probability that 𝑍 = 1? There is one way to do the calculation: using the rule above. This is the same as asking for P(𝑍 ∈ {1}), which is ∑_{𝑧∈{1}} PMF_𝑍(𝑧) = PMF_𝑍(1) = 𝑝. The probability that 𝑍 takes the value 0 is 1 − 𝑝 (you can use the fact that it takes only two values). The probability that 𝑍 takes any other value is 0—an example for the homework.
Suppose 𝑋 =d 𝒩(0, 1) ⇔ supp 𝑋 = R and PDF_𝑋(𝑥) = φ(𝑥), 𝑥 ∈ R. Can you tell me the probability that 𝑋 = 3? It is zero. How do you figure this out? This is the same as saying P(𝑋 ∈ {3}) = ∫_{{3}} φ(𝑥) d𝑥, and the Lebesgue measure of this set—its length—is 0. In general, P(𝑋 = 𝑐) = ∫_{{𝑐}} φ(𝑥) d𝑥 = 0: the chance that 𝑋 takes any particular value is zero. On the other hand, a Bernoulli random variable has non-zero probability at 0 and 1. The first difference between the Gaussian and the Bernoulli random variable is the support: the support of a Bernoulli random variable is countable, while the support of a Gaussian variable is uncountable. If the support is an uncountable set and the probability of 𝑋 being any particular number is 0, then 𝑋 is said to be continuously distributed. On the other hand, 𝑍 is said to be discrete if supp 𝑍 is a countable set. Some random variables are neither purely discrete nor continuous: in applied work, random variables can be partially continuous.
Example. Let 𝑋 =d 𝒩(0, 1) and let 𝑐 be a constant. Define

    𝑊 def= { 𝑋, 𝑋 > 𝑐,
            𝑐, 𝑋 ≤ 𝑐.

If 𝑋 is a random variable, 𝑔(𝑋) is random; 𝑊 is a function of 𝑋. What is its support? [𝑐, +∞). Is 𝑊 discrete? No, because its support is uncountable. Is 𝑊 continuous? No: P(𝑊 = 𝑐) = P(𝑋 ≤ 𝑐) = P(𝒩(0, 1) ≤ 𝑐) > 0. This is an example of a mixed random variable. Think of 𝑋 as your income: when you fill in a form in the USA, you have to report the exact amount only if it is above a threshold. This is called a censored random variable, or a bottom-coded random variable.
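A small simulation sketch (illustrative; the cutoff c = 0.5 is arbitrary) showing why 𝑊 is neither discrete nor continuous:

```python
# Sketch: the censored variable W = max(X, c) is mixed: it has an atom
# at c but is continuously spread over (c, inf).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
c = 0.5
X = rng.standard_normal(1_000_000)
W = np.maximum(X, c)                       # bottom-coding at c

print(np.mean(W == c), norm.cdf(c))        # P(W = c) = P(X <= c) > 0
print(np.mean(W == 1.0))                   # individual points above c get 0
```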
Example. We can be more generic. Let λ > 0. 𝑋 =d exp(λ) if and only if supp 𝑋 = (0, ∞) and PDF_𝑋(𝑥) = (1/λ) e^{−𝑥/λ} I_(0,∞)(𝑥), 𝑥 ∈ R. It is a continuous random variable. It is used to model the survival of electronic objects: the probability of an SSD failing does not depend on its prior history, because there are no moving parts.
Example. Let λ > 0. 𝑋 =d Poisson(λ) if and only if supp 𝑋 = {0} ∪ N and

    PMF_𝑋(𝑥) = e^{−λ} λ^𝑥 / 𝑥! · I_{0}∪N(𝑥),  𝑥 ∈ R.

To get P(𝑋 ∈ 𝐴), you compute ∑_{𝑥∈𝐴} PMF_𝑋(𝑥).

 "The French include zero as a natural number." — "Well, the French are unnatural in several ways."

Suppose we have two random variables, 𝑋 and 𝑌. Question: when is 𝑋 =d 𝑌? Answer: first, you must have supp 𝑋 = supp 𝑌; next, you must have PDF_𝑋(𝑡) = PDF_𝑌(𝑡) ∀𝑡 ∈ supp 𝑋 = supp 𝑌, or PMF_𝑋(𝑡) = PMF_𝑌(𝑡) ∀𝑡 ∈ supp 𝑋 = supp 𝑌. It turns out that there is one function that settles this question by itself. The cumulative distribution function is not very useful for practical applications—you normally compute probabilities by integrating the density or summing the masses—but it characterises the distribution.
Definition. Let 𝑋 be a random variable. The cumulative distribution function (CDF) of 𝑋 is the function from R to [0, 1] given by

    CDF_𝑋(𝑥) def= P(−∞ < 𝑋 ≤ 𝑥) = P(𝑋 ≤ 𝑥),  𝑥 ∈ R.

Do you agree that P(−∞ < 𝑋 ≤ 𝑥) = P[(−∞ < 𝑋 < ∞) ∩ (𝑋 ≤ 𝑥)]? The first event happens with probability one, so this is simply P(𝑋 ≤ 𝑥).
A result without a proof: 𝑋 =d 𝑌 if and only if CDF_𝑋 = CDF_𝑌. We do not even have to worry about the support; the CDFs have to be equal at every point on the real line, 𝑡 ∈ R. The CDF just gives the probability of an interval, and you can live your life happily even with a casual acquaintance with it.
Suppose 𝑋 is continuously distributed. Then CDF_𝑋(𝑥) = P(𝑋 ∈ (−∞, 𝑥]). How would you get this? By integrating the PDF:

    CDF_𝑋(𝑥) = ∫_{−∞}^𝑥 PDF_𝑋(𝑡) d𝑡.

Result:

    (d/d𝑥) CDF_𝑋(𝑥) = (d/d𝑥) ∫_{−∞}^𝑥 PDF_𝑋(𝑡) d𝑡 = PDF_𝑋(𝑥).

This is the result that connects differential to integral calculus: the fundamental theorem of calculus, or the Leibniz rule.
Let us do an example. Let 𝑌 =d Bernoulli(𝑝). Then

    CDF_𝑌(𝑦) def= P(𝑌 ≤ 𝑦) = { 0,      𝑦 < 0,
                               1 − 𝑝,  0 ≤ 𝑦 < 1,
                               1,      𝑦 ≥ 1.

You know that P(𝑌 ≤ 0) = P(𝑌 < 0) + P(𝑌 = 0) = 1 − 𝑝. However, the drawing of this step function is incomplete without a hole at each jump; with the bullets and holes it is unambiguous. The reason continuous variables are called continuous is that the CDF of a continuous variable is continuous. You can recover the PMF by computing the jumps of the CDF.
Remark. What is the relationship between these two statements: 𝑋 =d 𝑌 and 𝑋 = 𝑌? The first says that 𝑋 has the same distribution as 𝑌. The second says that 𝑋 is equal to 𝑌, i.e. 𝑋 − 𝑌 = 0. But 𝑋 − 𝑌 is a random variable—in what sense is it equal to 0? With probability 1. You have to interpret random variables very precisely. So what is the relationship between the statements "𝑋 =d 𝑌" and "𝑋 = 𝑌 with probability 1"? The implication is the following: 𝑋 = 𝑌 ⇒ 𝑋 =d 𝑌. The first is an identity.
Example. Let 𝑋 ∼ 𝒩(0, 1). I want to find the distribution of −𝑋. The transformation is very simple. The answer is: 𝑋 =d −𝑋, but 𝑋 ≠ −𝑋. Let 𝑡 ∈ R. Then

    CDF_{−𝑋}(𝑡) def= P(−𝑋 ≤ 𝑡) = P(𝑋 ≥ −𝑡) = ∫_{−𝑡}^∞ PDF_𝑋(𝑥) d𝑥 = ∫_{−𝑡}^∞ φ(𝑥) d𝑥
                = [𝑢 = −𝑥] = ∫_𝑡^{−∞} φ(−𝑢)(−d𝑢) = ∫_{−∞}^𝑡 φ(−𝑢) d𝑢 = ∫_{−∞}^𝑡 φ(𝑢) d𝑢 = CDF_𝑋(𝑡),

using the symmetry φ(−𝑢) = φ(𝑢). The two CDFs are equal, so −𝑋 =d 𝑋. However, 𝑋 ≠ −𝑋: suppose to the contrary that 𝑋 = −𝑋. This means that 2𝑋 = 0, i.e. 𝑋 = 0 with probability 1, which is not true because 𝑋 is Gaussian. The vanilla equals sign is the most powerful operator in mathematics; if I put a topping on it, like =d, it is no longer vanilla.
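A Monte Carlo sketch of this distinction (not from the lecture): the empirical CDFs of 𝑋 and −𝑋 coincide, while 𝑋 − (−𝑋) = 2𝑋 is nowhere near zero:

```python
# Sketch: -X has the same distribution as X ~ N(0,1), yet X != -X.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal(200_000)
grid = np.linspace(-3, 3, 13)
ecdf = lambda s, t: np.mean(s[:, None] <= t, axis=0)   # empirical CDF on a grid

print(np.max(np.abs(ecdf(X, grid) - ecdf(-X, grid))))  # ~0: same distribution
print(np.mean(np.abs(X - (-X))))                       # E|2X| ~ 1.6: X != -X
```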

Expectation
Let 𝑋 be a random variable. The expected value of 𝑋 is defined to be

    E𝑋 def= ∫_{𝑥∈supp 𝑋} 𝑥 PDF_𝑋(𝑥) d𝑥  if 𝑋 is continuously distributed,
    E𝑋 def= ∑_{𝑥∈supp 𝑋} 𝑥 PMF_𝑋(𝑥)    if 𝑋 is discrete.

Sometimes it is called the population mean of 𝑋. Why is it a useful object to look at? For psychological reasons: we do not like volatility. This is why insurance markets exist—people try to minimise uncertainty and will do anything for this. A variable is jumping around, and we want one number that summarises it. It says: "𝑋 is jumping around, but which value does it take on average?" It may have nothing to do with the values 𝑋 actually takes.
Example. Suppose 𝑋 =d Bernoulli(𝑝). What is the expected value of 𝑋?

    E𝑋 = ∑_{𝑥∈supp 𝑋} 𝑥 PMF_𝑋(𝑥) = ∑_{𝑥∈{0,1}} 𝑥 𝑝^𝑥 (1−𝑝)^{1−𝑥} I_{0,1}(𝑥) = 𝑝.

The learning will not happen if you do not work it out yourself. The best thing is to ask a question in class; it benefits everybody. Learning literally happens when you make mistakes, unless you are Isaac Newton. Note that 𝑋 takes only the values 0 and 1, yet on average it takes 𝑝, a value it never actually takes.
There is a very classic book by Peter Whittle, "Probability via Expectation".
Definition. Let 𝑋 be a random variable. Let ℎ be a function R ↦→ R. The randomness in ℎ(𝑋) is coming from 𝑋. (You can simplify probability calculations by the use of clever conditioning—more on this later.) Then

    Eℎ(𝑋) def= ∫_{𝑥∈supp 𝑋} ℎ(𝑥) PDF_𝑋(𝑥) d𝑥  if 𝑋 is continuously distributed,
    Eℎ(𝑋) def= ∑_{𝑥∈supp 𝑋} ℎ(𝑥) PMF_𝑋(𝑥)    if 𝑋 is discrete.

In principle, if 𝑌 def= ℎ(𝑋), then Eℎ(𝑋) = E𝑌 = ∫_{supp 𝑌} 𝑦 PDF_𝑌(𝑦) d𝑦.
Remark. Let 𝑋 be a random variable. What is EI_𝐴(𝑋)? The randomness in this is coming from 𝑋; the indicator is 1 if 𝑋 lies in 𝐴. Then

    EI_𝐴(𝑋) = ∫_{𝑥∈supp 𝑋} I_𝐴(𝑥) PDF_𝑋(𝑥) d𝑥
            = ∫_R I_𝐴(𝑥) I_{supp 𝑋}(𝑥) PDF_𝑋(𝑥) d𝑥
            = ∫_𝐴 I_{supp 𝑋}(𝑥) PDF_𝑋(𝑥) d𝑥 = P(𝑋 ∈ 𝐴).

If something is not clear, look at this object: ∫_R I_𝐴(𝑥) I_{supp 𝑋}(𝑥) PDF_𝑋(𝑥) d𝑥. It is equal to

    ∫_{supp 𝑋 ∪ (R∖supp 𝑋)} I_𝐴(𝑥) I_{supp 𝑋}(𝑥) PDF_𝑋(𝑥) d𝑥
      = ∫_{supp 𝑋} I_𝐴(𝑥) I_{supp 𝑋}(𝑥) PDF_𝑋(𝑥) d𝑥 + ∫_{R∖supp 𝑋} I_𝐴(𝑥) I_{supp 𝑋}(𝑥) PDF_𝑋(𝑥) d𝑥,

and the second term is zero. The indicator functions are your friends, so use them a lot. Things that lie outside the support contribute zero probability.
Expectation is a linear operator:

    E(𝑐₁𝑋 + 𝑐₂) = 𝑐₁E𝑋 + 𝑐₂,  E(α𝑋 + β𝑌) = αE𝑋 + βE𝑌.

Definition. If 𝑋 is a random variable,

    Var 𝑋 def= E(𝑋 − E𝑋)².

The mean measures the average location, and the variance measures the spread: it is called a measure of dispersion.

Lecture  (--)
If 𝑋 =d 𝒩(0, 1), then E𝑋 = ∫_{−∞}^∞ 𝑥φ(𝑥) d𝑥 = 0. Also, Var 𝑋 def= E(𝑋 − E𝑋)² = E𝑋² − (E𝑋)².
What is this: 𝑌 =d 𝒩(μ, σ²)? This is a Gaussian random variable with mean μ and variance σ². This happens if and only if supp 𝑌 = R and

    PDF_𝑌(𝑦) = (1/σ) φ((𝑦−μ)/σ) = (1/(√(2π)σ)) e^{−(𝑦−μ)²/(2σ²)},  𝑦 ∈ R.

The standard deviation, σ, is the square root of the variance, σ².


Expectation inherits all the properties of integrals. You can have 3 cases:
1. the expected value exists (lies on the real line);
2. it exists and is infinite;
3. it does not exist (you run into an illegal mathematical operation).
Some random variables have expectations that diverge to infinity. All these cases are important: when you are faced with a random variable whose expectation does not exist and you want a measure of location, you have to use another measure of location.
Example of the second case. Let 𝑋 be a continuously distributed random variable with support [1, ∞) and PDF_𝑋(𝑥) = (1/𝑥²) I_[1,∞)(𝑥). This variable does not have a specific name. What is its expected value?

    E𝑋 = ∫_{𝑥∈supp 𝑋} 𝑥 PDF_𝑋(𝑥) d𝑥 = ∫_1^∞ 𝑥 (1/𝑥²) d𝑥 = ∫_1^∞ (1/𝑥) d𝑥 = log 𝑥 |₁^∞ = +∞.
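Numerically (a sketch, not from the lecture), the truncated integral ∫₁ᵀ 𝑥·PDF(𝑥) d𝑥 equals log 𝑇 and grows without bound:

```python
# Sketch: E X = int_1^inf 1/x dx diverges; truncating at T gives log T.
from scipy.integrate import quad
import numpy as np

f = lambda x: x * x**-2.0          # integrand x * PDF(x) = 1/x on [1, inf)
for T in [10.0, 1e3, 1e5]:
    val, _ = quad(f, 1, T)
    print(T, val, np.log(T))       # grows like log T without bound
```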

A density must be non-negative and integrate to 1.
There is a very well-known random variable, especially for those in finance. Let 𝑋 be a continuously distributed random variable with supp 𝑋 = R and PDF_𝑋(𝑥) = (1/π) · 1/(1+𝑥²), 𝑥 ∈ R. We say it has full support. This is the Cauchy random variable, or Student's t with 1 degree of freedom: 𝑋 =d Cauchy. This is an example of a fat-tailed, or heavy-tailed, distribution. The reason the Gaussian's integrals are finite is that its density decreases very rapidly; here the tails go to zero, but not as fast. Why are heavy tails useful? Extreme events. If you want to model extreme events—massive financial crises—the normal distribution is not OK. The Cauchy distribution is immensely useful for rare phenomena: earthquakes, financial crises, shocks. How would you compute its mean?

    E𝑋 = ∫_{𝑥∈supp 𝑋} 𝑥 PDF_𝑋(𝑥) d𝑥 = ∫_{−∞}^∞ 𝑥 (1/π) · 1/(1+𝑥²) d𝑥
       = (1/π) ∫_{−∞}^∞ 𝑥/(1+𝑥²) d𝑥 = (1/(2π)) log(1+𝑥²) |_{−∞}^{+∞} = ∞ − ∞.

And this is an illegal operation. The only reasonable conclusion: this integral does not exist. In fact, you can try to compute any moment of a random variable. E𝑋 is the first moment of the random variable, E𝑋² is the second moment of 𝑋, E𝑋¹⁸ is the 18th moment of 𝑋. Note that E𝑋¹⁸ = E(𝑋¹⁸), not (E𝑋)¹⁸! Too many parentheses make reading confusing; in papers and research articles they are deprecated—it is not scholarly writing. For the Cauchy distribution, no moments exist. And if you model financial crises, you will have a great number of financial crises. We need a new measure of location—in mathematics, existence is fundamental, and we really need another measure of location. This is called the quantile of a random variable.
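A quick illustration by simulation (not from the lecture): Cauchy sample means never settle down, while Gaussian ones do:

```python
# Sketch: the Cauchy has no mean; its sample averages do not converge,
# unlike the Gaussian averages printed next to them.
import numpy as np

rng = np.random.default_rng(2)
for n in [10**3, 10**5, 10**7]:
    print(n, rng.standard_cauchy(n).mean(), rng.standard_normal(n).mean())
```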

 Quantiles
Let 𝑋 be a random variable and let α ∈ (0, 1). An αth quantile (or 100·α percentile) of 𝑋 is a number 𝑄_𝑋(α) ∈ R—it could lie anywhere on the real line—that satisfies two properties:
1. P(𝑋 ≤ 𝑄_𝑋(α)) ≥ α;
2. P(𝑋 ≥ 𝑄_𝑋(α)) ≥ 1 − α.
Some quantiles have names attached to them. Remark: 𝑄_𝑋(0.5) is called the median of 𝑋, med 𝑋; 𝑄_𝑋(0.25) is the first quartile; 𝑄_𝑋(0.75) is the third quartile. Quantile regression is a great alternative to linear regression.


Example. Let 𝑋 =d Bernoulli(0.5). Its support is {0, 1}, and the probability of success is 0.5—we are tossing a fair coin. Question: find the median of 𝑋. When everything else fails, apply the definition. Let us check every number. Let 𝑞 < 0. Can any negative number be the median of 𝑋? No: P(𝑋 ≤ 𝑞) = 0, but this probability should be at least 0.5. How about 𝑞 > 1? P(𝑋 ≤ 𝑞) = 1, but P(𝑋 ≥ 𝑞) = 0. So no number below 0 or above 1 is the median. Now let us pick numbers in [0, 1]. Let 𝑞 = 0. Then P(𝑋 ≤ 0) = P(𝑋 < 0) + P(𝑋 = 0) = 0.5, so zero satisfies the first condition. Next, P(𝑋 ≥ 0) = P(𝑋 = 0) + P(𝑋 > 0) = 0.5 + 0.5 = 1. If you want to find something out and you get confused, use set theory: P(𝑋 > 0) = P(𝑋 ∈ (0, ∞)) = P(𝑋 ∈ {1}) + P(𝑋 ∈ (0, ∞) ∖ {1}) = 0.5 + 0 = 0.5.
Who said "zero"? Jesus, I am ageing rapidly. Human beings are stupid; we do things because of history.
You can check that 1 is also a median. In general, let 𝑞 ∈ (0, 1). Then P(𝑋 ≤ 𝑞) = 0.5 and P(𝑋 ≥ 𝑞) = 0.5. So in this problem, 𝑄_𝑋(0.5) = med 𝑋 = [0, 1].
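The definition-checking can be mechanised; a sketch that scans candidate medians of Bernoulli(0.5) over a grid:

```python
# Sketch: check the two quantile conditions for X ~ Bernoulli(0.5);
# every q in [0, 1] qualifies as a median.
import numpy as np

def cdf(q):                 # P(X <= q)
    return 0.0 if q < 0 else (0.5 if q < 1 else 1.0)

def surv(q):                # P(X >= q)
    return 1.0 if q <= 0 else (0.5 if q <= 1 else 0.0)

medians = [q for q in np.linspace(-1, 2, 3001)
           if cdf(q) >= 0.5 and surv(q) >= 0.5]
print(min(medians), max(medians))   # 0.0 ... 1.0
```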
Quantiles exist, but they are not necessarily unique. (The median of the Cauchy distribution is 0.) There are ways of redefining quantiles so that unique values are obtained.
Quantiles have some remarkable properties. Let 𝑔 be an increasing function on R. What does it mean? Well, it is increasing. Is there any relationship between these two objects:

    E𝑔(𝑋) ? 𝑔(E𝑋)

No, in general there is no relation. However, quantiles are amazing:

    𝑄_{𝑔(𝑋)}(α) = 𝑔(𝑄_𝑋(α)).

This is a hugely powerful property that expectations do not obey. It is the equivariance property of quantiles.
Example. Let us use a set of data. Consider the following numbers: , , . So far you have not seen any data—there you go, you see data in math camp. It is a dataset. In sets, duplicates are not allowed, but in datasets they are. What is the median? . Very good. Why? What is the  th percentile of this dataset?
Now you have another dataset: , , , . What is its median? What should you do to answer the question statistically?
Consider the data 𝑎, 𝑎, 𝑏, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓, 𝑓, 𝑔. Each time, you get an outcome: 10 people have made choices. What is the median choice? How do we answer this question? Consider this dataset:

    I, b, #, p, ♣, ♣, b

You should not stick to that s*** middle-school rule ("sort the data and take the middle one"). It has no statistical justification; it is entirely arbitrary.
Again, consider this assignment. If the data are 1, 2, 3, 4, let

    𝑋 def= { 1 with prob. 1/4,
            2 with prob. 1/4,
            3 with prob. 1/4,
            4 with prob. 1/4.

If the data are 𝑎, 𝑎, 𝑏, 𝑏, 𝑏, 𝑐, coded as 𝑎 ↦ 0, 𝑏 ↦ 1, 𝑐 ↦ 2, then

    𝑋 def= { 0 with prob. 1/3,
            1 with prob. 1/2,
            2 with prob. 1/6.

We have created a probability distribution. This distribution actually has a name—the empirical distribution. The numbers you see are actually outcomes of a random variable. Back to the dataset 1, 2, 3, 4: can the number 1 be a median? Unfortunately, with quantiles you really have to check every number; there is no other option. 1 cannot be a median because P(𝑋 ≤ 1) = 0.25. Can 4 be a median? P(𝑋 ≤ 4) = 1 but P(𝑋 ≥ 4) = 0.25, so it is not a median. What about 3? 3 is a median because P(𝑋 ≤ 3) = 3/4 and P(𝑋 ≥ 3) = 0.5. What about 2? 2 is also a median. Actually, med 𝑋 = [2, 3], which after intersection with the dataset gives {2, 3}.
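The same brute-force check for the empirical distribution of 1, 2, 3, 4 (an illustrative sketch):

```python
# Sketch: scan candidate medians of the empirical distribution of 1,2,3,4.
import numpy as np

data = np.array([1, 2, 3, 4])
cand = np.linspace(0, 5, 5001)
is_med = [(np.mean(data <= q) >= 0.5) and (np.mean(data >= q) >= 0.5)
          for q in cand]
ok = cand[np.array(is_med)]
print(ok.min(), ok.max())        # 2.0 ... 3.0, i.e. med X = [2, 3]
```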
What happens if the data are not an ordered set? If the data are 𝑎, 𝑎, 𝑏, 𝑏, 𝑐, let

    𝑋 def= { 23   with prob. 0.4,
            58.5 with prob. 0.4,
            0    with prob. 0.2.

Can 23 be the median outcome? P(𝑋 ≤ 23) = 0.6 and P(𝑋 ≥ 23) = 0.8, so 23 is a candidate, and the corresponding outcome is 𝑎. As long as the mapping is one-to-one, it should be right.
Try the following mapping:

    𝑌 def= { 23   with prob. 0.4,
            58.5 with prob. 0.4,
            100  with prob. 0.2.

In this case, P(𝑌 ≤ 23) = 0.4, and 𝑎 is not the median anymore. How you do the coding matters: 𝑋 ≠ 𝑌, and this is why the medians differ. However, in some cases there is no ambiguity—for example, if half or more of the outcomes are the same. Some statistical methods are not invariant to the choice of coding.
Remember that a quantile is not a function (it can be set-valued). However, there is a way to adjust the definition. If you analyse the definition of the quantile, you will see that the empirical CDF has flat areas. For well-known continuous statistical distributions with a strictly increasing CDF, the quantiles are unique; otherwise, you can use the generalised inverse.

Lecture  (--)
Whenever you see a problem, the result is important. Asking about the 0th quantile of a random variable does not give any additional information about this variable: the defining conditions become P(𝑋 ≤ 𝑞) ≥ 0 and P(𝑋 ≥ 𝑞) ≥ 1. The only way to check is to try every number. Suppose 𝑞 < 𝑎, where 𝑎 is the smallest point of the support. What is the probability of being less than or equal to 𝑞? Zero. What is the probability that 𝑋 ≥ 𝑞? One. So 𝑞 qualifies as a 0th quantile—and so does 𝑎. So really 𝑄_𝑋(0) = (−∞, 𝑎]. This is really to illustrate that the 0th percentile is a silly thing.
If 𝑋 =d Bernoulli(𝑝), then 𝑝 is the probability of success. When 𝑝 = 0, the probability of failure is 1: the distribution puts mass 1 at the point 0. So a Bernoulli(0) random variable takes the value 0 with probability 1—it is really a constant random variable. If 𝑞 < 0, P(𝑋 ≤ 𝑞) = 0, so 𝑞 is not a median. If 𝑞 > 0, P(𝑋 ≥ 𝑞) = 0, so 𝑞 is not a median. But 𝑞 = 0 qualifies as the median.
Now for the quantile property. Show that 𝑄_{−𝑋}(α) = −𝑄_𝑋(1 − α). Let us apply the definition:

    P(−𝑋 ≤ 𝑄_{−𝑋}(α)) ≥ α  and  P(−𝑋 ≥ 𝑄_{−𝑋}(α)) ≥ 1 − α.

Do you agree that this is

    P(𝑋 ≥ −𝑄_{−𝑋}(α)) ≥ α  and  P(𝑋 ≤ −𝑄_{−𝑋}(α)) ≥ 1 − α,

i.e.

    P(𝑋 ≤ −𝑄_{−𝑋}(α)) ≥ 1 − α  and  P(𝑋 ≥ −𝑄_{−𝑋}(α)) ≥ α?

But if P(𝑋 ≤ 𝑞) ≥ β and P(𝑋 ≥ 𝑞) ≥ 1 − β, then 𝑞 is a βth quantile of 𝑋. With β = 1 − α, this implies

    −𝑄_{−𝑋}(α) = 𝑄_𝑋(1 − α).
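A numeric sanity check of this reflection property (a sketch), using the normal, whose quantile function is scipy's norm.ppf; the parameter values are arbitrary:

```python
# Sketch: Q_{-X}(alpha) = -Q_X(1 - alpha) for X ~ N(mu, sigma^2),
# using the fact that -X ~ N(-mu, sigma^2).
from scipy.stats import norm

mu, sigma, alpha = 1.0, 2.0, 0.1
lhs = norm.ppf(alpha, loc=-mu, scale=sigma)      # quantile of -X
rhs = -norm.ppf(1 - alpha, loc=mu, scale=sigma)  # -Q_X(1 - alpha)
print(lhs, rhs)                                  # equal
```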

 Conditioning
Conditioning is a really important topic in econometrics—probably the most important part of the course. Suppose we have two random variables, 𝑋 and 𝑌. What is Cov(𝑋, 𝑌)? It is

    Cov(𝑋, 𝑌) def= E(𝑋 − E𝑋)(𝑌 − E𝑌) = E𝑋𝑌 − E𝑋E𝑌.

To calculate E𝑋, you integrate 𝑥 against the density of 𝑋; to calculate E𝑌, you integrate 𝑦 against the density of 𝑌. But how do you calculate the object E𝑋𝑌? You have to integrate with respect to the joint density.
Definition. The joint density (or PMF) of 𝑋 and 𝑌 is a function PDF_{𝑋,𝑌} (or PMF_{𝑋,𝑌}): R × R ↦→ [0, ∞) such that

    P(𝑋 ∈ 𝐴, 𝑌 ∈ 𝐵) = ∫_{𝑥∈𝐴} ∫_{𝑦∈𝐵} PDF_{𝑋,𝑌}(𝑥, 𝑦) d𝑦 d𝑥

or

    P(𝑋 ∈ 𝐴, 𝑌 ∈ 𝐵) = ∑_{𝑥∈𝐴} ∑_{𝑦∈𝐵} PMF_{𝑋,𝑌}(𝑥, 𝑦)

for 𝐴, 𝐵 ⊂ R. These functions must integrate (or sum) to one.


Suppose I give you the joint density PDF_{𝑋,𝑌}. How do you find the marginal density of 𝑋? The marginal determines the probabilistic behaviour of 𝑋 on its own. If I tell you how 𝑋 and 𝑌 behave together, can you derive their individual distributions? Going from joint to marginal is always possible, provided you discard the information in the proper way. Do you agree that P(𝑋 ∈ 𝐴) = P(𝑋 ∈ 𝐴, 𝑌 ∈ R), because the event 𝑌 ∈ R has probability one? This is

    ∫_{𝑥∈𝐴} ( ∫_{𝑦∈R} PDF_{𝑋,𝑌}(𝑥, 𝑦) d𝑦 ) d𝑥.

Now look at the inner integral. You integrate a density over a region to get a probability, so the inner expression is the marginal density of 𝑋. Therefore,

    PDF_𝑋(𝑥) = ∫_{𝑦∈R} PDF_{𝑋,𝑌}(𝑥, 𝑦) d𝑦,  𝑥 ∈ R.

In the same vein, PDF_𝑌(𝑦) = ∫_{𝑥∈R} PDF_{𝑋,𝑌}(𝑥, 𝑦) d𝑥. If two variables are stochastically independent, you can multiply the marginal densities to get the joint density, but not otherwise.
Conditional probabilities. We briefly go back to events. Let 𝐴, 𝐵 be events from a sample space 𝑆, and suppose that P(𝐵) > 0. Then, by definition,

    P(𝐴 | 𝐵) def= P(𝐴 ∩ 𝐵) / P(𝐵).

Some books write 𝐴/𝐵, but it is not this line /, not this line ∖, nor any other symbol. This definition was given by the Reverend Thomas Bayes.
Why is conditioning important? What does conditioning do? Suppose 𝑌 and 𝑋 are random variables. Also suppose that 𝑌 is not a deterministic function of 𝑋, so you cannot write 𝑌 = 𝑔(𝑋). Part of the randomness in 𝑌 can come from 𝑋, and part of its randomness is inherent. How would 𝑌 behave if you stopped the randomness in 𝑋?
If both 𝑋 and 𝑌 are discrete, I am going to define the conditional probability mass function of 𝑌 | 𝑋 = 𝑥:

    PMF_{𝑌|𝑋=𝑥}(𝑦) def= P(𝑌 = 𝑦, 𝑋 = 𝑥) / P(𝑋 = 𝑥),  𝑦 ∈ supp 𝑌, 𝑥 ∈ supp 𝑋.

The conditioning means that we are stopping the randomness by fixing 𝑋 = 𝑥. Once you have a conditional PMF, how will you compute P(𝑌 ∈ 𝐴 | 𝑋 = 𝑥)? You do the summation:

    P(𝑌 ∈ 𝐴 | 𝑋 = 𝑥) = ∑_{𝑦∈𝐴} PMF_{𝑌|𝑋=𝑥}(𝑦),  𝑥 ∈ supp 𝑋.

You can talk about the mean, the variance and the quantiles of conditional distributions.
Now let us generalise this slightly. You know what a vector is: it is basically a bunch of numbers with an order attached to it. It is important that a vector is a column vector—the standard mathematical convention. We can generalise this definition: a random vector is just a column vector of random variables. Say, (𝑋, 𝑌, 𝑍)′ is a random column vector with random variables as its elements. Now, if 𝑊 is a random vector, I do not need to write an arrow over 𝑊 or boldface it—in research papers, you do not write vectors in boldface or with arrows. If dim 𝑊 = 3 × 1, then 𝑊 = (𝑊⁽¹⁾, 𝑊⁽²⁾, 𝑊⁽³⁾)′. I do not use the notation 𝑊₁ for components, because the subscript is reserved for observations.
Let 𝑋 be a 2 × 1 random vector. Let 𝑋₁, 𝑋₂, … be identical copies of 𝑋, so 𝑋₁ =d 𝑋, 𝑋₂ =d 𝑋, 𝑋ₙ =d 𝑋. Then 𝑋₁ = (𝑋₁⁽¹⁾, 𝑋₁⁽²⁾)′. Mathematics is a universal language, but even there, people have preferences.

 I am being nice to you by making the distinction between random variables and vectors.

Let 𝑌 be a random variable and let 𝑋 be a random vector (it could have dimension 1). We will assume that both 𝑌 and 𝑋 are continuously distributed; this means that every component of 𝑋 is continuously distributed. Now the conditional density of 𝑌 | 𝑋 = 𝑥 is defined to be

    PDF_{𝑌|𝑋=𝑥}(𝑦) def= PDF_{𝑌,𝑋}(𝑦, 𝑥) / PDF_𝑋(𝑥),  𝑦 ∈ supp 𝑌, 𝑥 ∈ supp 𝑋.

This object gives rise to a valid probability distribution. Then

    P(𝑌 ∈ 𝐴 | 𝑋 = 𝑥) = ∫_{𝑦∈𝐴} PDF_{𝑌|𝑋=𝑥}(𝑦) d𝑦,  𝑥 ∈ supp 𝑋.

You can find means, variances and quantiles of conditional distributions. But let us focus on the conditional mean of 𝑌 | 𝑋. It has several useful properties.
Another definition. Let 𝑌 be a random variable and 𝑋 a random vector. The conditional expectation function of 𝑌 | 𝑋 = 𝑥 is defined to be

    CEF : 𝑥 ↦→ E(𝑌 | 𝑋 = 𝑥) def= ∫_{𝑦∈supp 𝑌} 𝑦 PDF_{𝑌|𝑋=𝑥}(𝑦) d𝑦,  𝑥 ∈ supp 𝑋,

or

    E(𝑌 | 𝑋 = 𝑥) def= ∑_{𝑦∈supp 𝑌} 𝑦 PMF_{𝑌|𝑋=𝑥}(𝑦),  𝑥 ∈ supp 𝑋.

The same logic applies whether 𝑌 is discrete or continuous. In this object, 𝑦 has been integrated or summed out, so we are talking about a function of 𝑥. This function changes only as I change 𝑥; there is no 𝑦 in it—it has been integrated out. It is a function of the conditioning variable.
The treatment of conditioning is limited in the book. Read section . and follow the class notes. The exam will be similar to what is done here. People have built careers studying the properties of random variables.
Let us do an example with two random variables whose joint distribution is given by the table

    𝑌∖𝑋    1      2      3
    0     0.2    0.1    0.15
    1     0.1    0.3    0.15

This type of table is called a contingency table. It is just an obscure name for a joint distribution.
Let us first compute the marginal probabilities, then find the conditional expectation function CEF : 𝑥 ↦→ E(𝑌 | 𝑋 = 𝑥). (Homework: find 𝑦 ↦→ E(𝑋 | 𝑌 = 𝑦).)
What is the probability that 𝑋 = 1? It is P(𝑋 = 1, 𝑌 = 0) + P(𝑋 = 1, 𝑌 = 1) = 0.3. Next, P(𝑋 = 2) = 0.4 and P(𝑋 = 3) = 0.3.
Recall that

    PMF_{𝑌|𝑋=𝑥}(𝑦) def= P(𝑌 = 𝑦, 𝑋 = 𝑥) / P(𝑋 = 𝑥),  𝑦 ∈ supp 𝑌, 𝑥 ∈ supp 𝑋.

Then:

    PMF_{𝑌|𝑋=1}(0) = P(𝑌 = 0, 𝑋 = 1)/P(𝑋 = 1) = 0.2/0.3 = 2/3,
    PMF_{𝑌|𝑋=1}(1) = P(𝑌 = 1, 𝑋 = 1)/P(𝑋 = 1) = 0.1/0.3 = 1/3,
    PMF_{𝑌|𝑋=2}(0) = P(𝑌 = 0, 𝑋 = 2)/P(𝑋 = 2) = 0.1/0.4 = 1/4,
    PMF_{𝑌|𝑋=2}(1) = P(𝑌 = 1, 𝑋 = 2)/P(𝑋 = 2) = 0.3/0.4 = 3/4,
    PMF_{𝑌|𝑋=3}(0) = P(𝑌 = 0, 𝑋 = 3)/P(𝑋 = 3) = 0.15/0.3 = 1/2,
    PMF_{𝑌|𝑋=3}(1) = P(𝑌 = 1, 𝑋 = 3)/P(𝑋 = 3) = 0.15/0.3 = 1/2.
Recall E(𝑌 | 𝑋 = 𝑥) def= ∑_{𝑦∈supp 𝑌} 𝑦 PMF_{𝑌|𝑋=𝑥}(𝑦), 𝑥 ∈ supp 𝑋. We have to do this for all three values of 𝑋. Then

    E(𝑌 | 𝑋 = 1) = ∑_{𝑦∈{0,1}} 𝑦 PMF_{𝑌|𝑋=1}(𝑦) = 0 · PMF_{𝑌|𝑋=1}(0) + 1 · PMF_{𝑌|𝑋=1}(1) = 1/3.

Following the same logic, E(𝑌 | 𝑋 = 2) = 0 · PMF_{𝑌|𝑋=2}(0) + 1 · PMF_{𝑌|𝑋=2}(1) = 3/4. Finally, E(𝑌 | 𝑋 = 3) = 0 · PMF_{𝑌|𝑋=3}(0) + 1 · PMF_{𝑌|𝑋=3}(1) = 1/2.
Now we need to write this function in one line. Therefore,

    E(𝑌 | 𝑋 = 𝑥) = (1/3) I_{1}(𝑥) + (3/4) I_{2}(𝑥) + (1/2) I_{3}(𝑥),  𝑥 ∈ supp 𝑋.
See if you can find the conditional expectation function 𝑦 ↦→ E(𝑋 | 𝑌 = 𝑦) yourself. You can condition anything on anything.
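The whole table computation fits in a few lines; the sketch below (not from the lecture) also verifies the law of iterated expectations E[E(𝑌 | 𝑋)] = E𝑌, which shows up among the properties of conditional expectations later in the notes:

```python
# Sketch: CEF from the contingency table above, plus a LIE check.
import numpy as np

# joint PMF: rows are y in {0, 1}, columns are x in {1, 2, 3}
joint = np.array([[0.20, 0.10, 0.15],
                  [0.10, 0.30, 0.15]])
ys = np.array([0.0, 1.0])
px = joint.sum(axis=0)                      # marginal of X: 0.3, 0.4, 0.3
cef = (ys[:, None] * joint).sum(axis=0) / px
print(cef)                                  # 1/3, 3/4, 1/2

EY = (ys[:, None] * joint).sum()            # marginal mean of Y: 0.55
print(cef @ px, EY)                         # LIE: E[E(Y|X)] = E Y
```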

 If you have a question, then ask your question loudly. The members of parliament never address each other: "Mister Speaker..." A question asked aloud has positive externalities, while a private question has zero welfare benefits.

Now suppose 𝑍 is a random variable. What is 𝑔(𝑍)? It is a random variable. What is the source of randomness in it? 𝑍. Do not forget this. We all agree that E(𝑌 | 𝑋 = 𝑥) is a function of 𝑥; let us call it ℎ(𝑥). Can you tell me what ℎ(𝑋) is? A random variable. Which random variable? It is (1/3) I_{1}(𝑋) + (3/4) I_{2}(𝑋) + (1/2) I_{3}(𝑋). I took the object E(𝑌 | 𝑋 = 𝑥)|_{𝑥=𝑋}—a plug-in operation. By doing this, we have created a hugely important random variable. Let us define this object as

    E(𝑌 | 𝑋) def= E(𝑌 | 𝑋 = 𝑥)|_{𝑥=𝑋}.

This is read as "the conditional expectation of 𝑌 given 𝑋". And it is a random variable! This random variable has remarkable properties. How about we list these properties? We shall skip the proofs. These properties are of supreme interest.
Properties of conditional expectations: read section ..

Lecture  (--)
The result in problem  is used in the IV regression. You can have a
look at it.


Let us look at the properties of conditional expectations. These properties follow from conditional expectation functions: you first compute the conditional expectation function, and then plug in the random vector.
Let 𝑌 be a random variable and 𝑋 a random vector. Then the random variable E(𝑌 | 𝑋) has the following properties.
1. E(𝑌 | 𝑋) is a function of 𝑋. The source of randomness here is 𝑋.
2. The conditional expectation operator is a linear operator: E(𝐴 + 𝐵 | 𝑋) = E(𝐴 | 𝑋) + E(𝐵 | 𝑋), where 𝐴, 𝐵 are random variables.
3. E(𝑐 | 𝑋) = 𝑐 if 𝑐 is a constant.
4. The reason conditional expectations were invented: the intuitive meaning of E(𝑌 | 𝑋) is that the randomness of 𝑋 has been stopped. This means that E(𝑔(𝑋)𝑌 | 𝑋) = 𝑔(𝑋)E(𝑌 | 𝑋)—once you condition on 𝑋, 𝑔(𝑋) acts like a constant and comes outside the expectation. This has a non-standard name: the useful rule.
5. This property has a standard name: the law of iterated expectations. You can take the expectation of this random variable: E_𝑋[E_{𝑌|𝑋}(𝑌 | 𝑋)] = E[E(𝑌 | 𝑋)] = E𝑌. There is an inner expectation and an outer expectation, and you get the marginal expectation of 𝑌. If you abbreviate it, you get LIE, so it becomes the "law of iterated mathematical expectations", LIME.
6. E(𝑌 | 𝑋) = E(𝑌 | ℎ(𝑋)) if ℎ is injective on supp 𝑋. This is a one-to-one, or injective, function. Definition: ∀𝑥₁, 𝑥₂ ∈ supp 𝑋, ℎ(𝑥₁) = ℎ(𝑥₂) ⇒ 𝑥₁ = 𝑥₂.
7. Let 𝑔 be a function, not necessarily one-to-one. I am going to do repeated conditioning (not iterated conditioning): E[E(𝑌 | 𝑋) | 𝑔(𝑋)] = E(𝑌 | 𝑔(𝑋)). There are two conditionings. This is used in econometrics all the time. In general, when you condition on less information, the outcome is determined by the smaller information set. This is called the "smallest information set wins" property.
All of this can be proved. Whatever properties integrals have are inherited by the conditional expectation operator. These results hold in great generality. You can solve a difficult probability problem if you split it into several sources of randomness by conditioning, and then work with the remaining randomness.
Remark. In terms of the conditional expectation function, E(𝑔(𝑋)𝑌 | 𝑋 = 𝑥) = E(𝑔(𝑥)𝑌 | 𝑋 = 𝑥) = 𝑔(𝑥)E(𝑌 | 𝑋 = 𝑥).
Conditional expectation has a couple of interpretations; depending on the application, you use the one you need. One way to interpret it: given the information on 𝑋, you try to predict the value of 𝑌. E(𝑌 | 𝑋) is the best possible predictor of 𝑌 given 𝑋.
If you have two values, 𝑋 and 𝑔(𝑋), which contains more information? 𝑋. If 𝑔(0) = 3 and 𝑔(1) = 3, then you reduce information. Applying a function is a loss of information; the only time such loss does not occur is when the function is injective.
Example of property . This property is telling you a useful way of
discarding information. You may or may not like this example. Let 𝑁 be the
number of nuclear warheads in North Korea. Kim Jong Un has a produ ion
fun ion. You want to use the available information given the information
on capital and labour (say, number of cars and people). Unfortunately,
their assembly units are located under the ground and the data of 𝐾 is
missing. You want to have E(𝑁 | 𝐾, 𝐿), but how do you ati ically remove
the information about 𝐾? Remarkably, E[E(𝑁 | 𝐾, 𝐿) | 𝐿] = E(𝑁 | 𝐿).
In econometric terms, this is how you incorporate the effe of omitted
variables. Once you have omitted a variable, you get a short model. This is
how you do it: you are conditioning the long model on short information.
What is the effe of endogeneity due to the omitted variable? You are
using this property. This does not mean that you erase the variable; you
are conditioning on small information.
Example. Suppose E(𝑌 | 𝑋) = α₀ + α₁𝑋 + α₂𝑋². Can you tell me E(𝑌 | 𝑋, 𝑋²)? Let ℎ(𝑥) = (𝑥, 𝑥²)′; ℎ maps 𝑥 into two things. Is it a one-to-one function? Suppose ℎ(𝑥₁) = ℎ(𝑥₂). This means that 𝑥₁ = 𝑥₂ (and 𝑥₁² = 𝑥₂²), so the transformation is injective by construction. Any mapping of the form 𝑥 ↦→ (𝑥, 𝐾(𝑥))′ is one-to-one by construction. By property 6, E(𝑌 | 𝑋, 𝑋²) = E(𝑌 | ℎ(𝑋)) = E(𝑌 | 𝑋). This shows how we should write regression models; this is the parsimonious way of writing them.
Now suppose E(𝑌 | 𝑋) = α₀ + α₁𝑋 + α₂𝑋². What is E(𝑌 | 𝑋²)? Do you agree that E(𝑌 | 𝑋²) = E[E(𝑌 | 𝑋) | 𝑋²]? This is by property 7: E(α₀ + α₁𝑋 + α₂𝑋² | 𝑋²) = α₀ + α₁E(𝑋 | 𝑋²) + α₂𝑋². The middle term has to be determined; more information is needed to allow a simplification.
Result. Let 𝑌 be a random variable and 𝑋 a random vector. Think of E(𝑌 | 𝑋) as predicting 𝑌 given the information in 𝑋. I claim that this is the best predictor of 𝑌 given 𝑋.
First approach. Let 𝑈 def= 𝑌 − E(𝑌 | 𝑋). What is this? A mismatch: the prediction error, the mistake that you make.
1. The following is true: E(𝑈 | 𝑋) = E[𝑌 − E(𝑌 | 𝑋) | 𝑋] = E(𝑌 | 𝑋) − E[E(𝑌 | 𝑋) | 𝑋], by linearity. But E(𝑌 | 𝑋) is a function of 𝑋, so E[E(𝑌 | 𝑋) | 𝑋] = E(𝑌 | 𝑋), and the whole thing is E(𝑌 | 𝑋) − E(𝑌 | 𝑋) = 0. When you predict a variable with its conditional expectation, the expected prediction error is zero. Note that 𝑈 itself will never be 0 with probability 1, because that would imply 𝑌 = E(𝑌 | 𝑋) with probability 1, but the latter is a function of 𝑋!
2. Statistically, when would you conclude that a predictor is best? It uses the information on 𝑋, and it would be best if it used up all the information that 𝑋 has to give. So the prediction error should not contain any information about 𝑋: 𝑈 should be uncorrelated with any function of 𝑋. And indeed it is. Let 𝑔 be any function of 𝑋. Then Corr(𝑈, 𝑔(𝑋)) = Cov(𝑈, 𝑔(𝑋))/(sd 𝑈 · sd 𝑔(𝑋)). The standard deviations are positive, so we need the covariance to be zero: Cov(𝑈, 𝑔(𝑋)) = E𝑈𝑔(𝑋) − E𝑈 E𝑔(𝑋). Now, E𝑈 = E[E(𝑈 | 𝑋)] = 0 by the law of iterated expectations, and E𝑈𝑔(𝑋) = E[E(𝑈𝑔(𝑋) | 𝑋)] = E[𝑔(𝑋)E(𝑈 | 𝑋)] = 0 by the useful rule. We see that the population residual has no information about 𝑋 in it.
Example. Let 𝑋 =d Bernoulli(𝑝). Find the CDF of 𝑋. Solution: let 𝑦 ∈ R. Then

    CDF_𝑋(𝑦) def= P(𝑋 ≤ 𝑦) = { 0,      𝑦 < 0,
                               1 − 𝑝,  0 ≤ 𝑦 < 1,
                               1,      𝑦 ≥ 1.

If you get confused, follow the mathematics. Just do not forget to put the bullets and the holes at the jumps.

Lecture  (--)
Let 𝑌 be a random variable and let 𝑋 be a random vector. Recall from last time that 𝑈 def= 𝑌 − E(𝑌 | 𝑋) has two properties: (1) E(𝑈 | 𝑋) = 0, and, using this, (2) 𝑈 is uncorrelated with every function of 𝑋! This is why E(𝑌 | 𝑋) is such a good predictor.
It has several interpretations. It makes sense to call E(𝑌 | 𝑋) the best predictor. Do you agree that 𝑌 = E(𝑌 | 𝑋) + [𝑌 − E(𝑌 | 𝑋)]? You can call the first part the part predicted by 𝑋, and the second the part that cannot be predicted by 𝑋. So we rewrite it as 𝑌 = E(𝑌 | 𝑋) + 𝑈. This looks like a regression model. There you go! This is the statistical foundation of a correctly specified regression model.
There is a very useful definition that you will use throughout your career. Let 𝐴 be a random variable and 𝐵 a random vector. If E(𝐴 | 𝐵) = 0, 𝐵 is said to be exogenous with respect to 𝐴. This implies zero correlation: Corr(𝐴, 𝐵) = 0. The property E(𝑈 | 𝑋) = 0 can then be restated as "𝑋 is exogenous with respect to 𝑈". Note that the converse is not true: zero correlation does not imply exogeneity. You will always have endogenous and exogenous variables, and you must specify with respect to what.
Let us look at another interpretation of the conditional expectation as the best predictor of 𝑌 given the information on 𝑋. This is classic regression: you want to predict log wage given education, gender, race, etc. 𝑌 is a random variable, and you want to predict it using some function of 𝑋. It is not allowed to be a function of 𝑌, because then the prediction error would be zero. You have the prediction error 𝑌 − 𝑔(𝑋). What function should you pick? Ideally, one that minimises the prediction error. The easiest way to make the error positive is |𝑌 − 𝑔(𝑋)|. The problem is that this is a random object, and we want to remove the randomness in it:

    arg min_{𝑔∈{functions of 𝑋}} E|𝑌 − 𝑔(𝑋)|.

This is an optimisation problem: search across all functions of 𝑋! The objective is called the mean absolute error. Now we have to solve this optimisation problem, and it is quite hard. (A useful fact along the way: if 𝑓 ≥ 0, then ∫𝑓 = 0 ⇔ 𝑓 = 0—the only way the integral is zero is for the integrand to be zero.) Since you are searching for a function given the information, the optimal solution turns out to be med(𝑌 | 𝑋). This is the basis for quantile regression, but it requires a much more fundamental understanding. Purely for the sake of computation we shall not use this objective: the absolute value has a kink at the origin that does not allow calculus, and this causes problems. What is the easiest way to fix this problem? How do you remove the kink? You smooth the function at the origin.


How about we minimise the MSE:

    arg min_{𝑔∈{functions of 𝑋}} E(𝑌 − 𝑔(𝑋))².

The unique solution to this problem is E(𝑌 | 𝑋). This is the basis for mean regression. These objects have different properties. The conditional median is the best predictor in the sense that it minimises the mean absolute error; the conditional expectation is the best predictor in the sense that it minimises the mean squared error—the meaning of the word "best" changes. Quantile regression is a hugely interesting topic, but it takes  pages of proof.
Claim. E(𝑌 | 𝑋) is really the minimiser of the mean squared error across all functions of 𝑋:

    E(𝑌 | 𝑋) = arg min_{𝑔∈{functions of 𝑋}} E(𝑌 − 𝑔(𝑋))².

For any function 𝑔,

    E(𝑌 − 𝑔(𝑋))² = E(𝑌 − E(𝑌 | 𝑋) + E(𝑌 | 𝑋) − 𝑔(𝑋))²
                 = E(𝑈 + E(𝑌 | 𝑋) − 𝑔(𝑋))²
                 = E𝑈² + E(E(𝑌 | 𝑋) − 𝑔(𝑋))² + 2E[𝑈 · (E(𝑌 | 𝑋) − 𝑔(𝑋))],

where E(𝑌 | 𝑋) − 𝑔(𝑋) is a function of 𝑋. You remember that E(𝑈 | 𝑋) = 0 ⇒ E𝑈 = 0 by the law of iterated expectations: E𝑈 = E[E(𝑈 | 𝑋)] = 0. The cross term is 0 because 𝑈 is uncorrelated with every function of 𝑋. Hence

    E(𝑌 − 𝑔(𝑋))² = E𝑈² + E(E(𝑌 | 𝑋) − 𝑔(𝑋))².

The first term does not depend on 𝑔. The second term is non-negative, and setting 𝑔(𝑋) = E(𝑌 | 𝑋) makes it zero—the unique solution! This is a remarkable result, because you are minimising over the set of all functions of 𝑋, an infinite-dimensional set: there are uncountably many functions of 𝑋. So E(𝑌 | 𝑋) is sometimes written as BP(𝑌 | 𝑋).
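A simulation sketch of the claim (the data-generating process below is invented for illustration): among several candidate predictors, the CEF attains the smallest mean squared error:

```python
# Sketch: E(Y|X) minimises MSE. Here Y = X^2 + noise, so E(Y|X) = X^2.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal(1_000_000)
Y = X**2 + rng.standard_normal(X.size)      # E(Y | X) = X^2 by construction

candidates = {"CEF: X^2": X**2,
              "constant 1 (the BLP here)": np.ones_like(X),
              "g(X) = |X|": np.abs(X)}
for name, g in candidates.items():
    print(name, np.mean((Y - g)**2))        # CEF gives MSE ~ Var(noise) = 1
```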
Once you have the conditional density, you can compute the conditional expectation. However, only nature knows the true conditional expectation. Estimation can be done, but it requires non-parametric methods. So how about lowering our expectations? Instead of best predictors, let us use best linear predictors. You are still minimising, but instead of searching across all functions, you search across linear functions of 𝑋.

Best linear predictors


Let 𝑌 be a random variable and 𝑋 a random vector. The best linear predictor of 𝑌 given the information on 𝑋 is the unique linear function of 𝑋 that solves the following optimisation problem:

    BLP(𝑌 | 𝑋) = arg min_{𝑔∈{linear functions of 𝑋}} E(𝑌 − 𝑔(𝑋))².

The proof is quite elementary; you can derive everything using pure algebra. It can be shown that the best linear predictor of 𝑌 given 𝑋 is equal to

    BLP(𝑌 | 𝑋) = β₀ + 𝑋′β₁,

where β₀ = E𝑌 − (E𝑋)′β₁ and β₁ = (Var 𝑋)⁻¹ Cov(𝑋, 𝑌). It is very easy to estimate population means and covariances. If 𝑋 is a vector, the standard mathematical convention is a column vector. Whatever the dimension of 𝑋, this stuff works; these definitions are dimension-independent. Here Var 𝑋 is a matrix and Cov(𝑋, 𝑌) is a vector.
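A sketch of these formulas on simulated data (the coefficients and the mixing matrix are made up for the example):

```python
# Sketch: BLP coefficients beta1 = (Var X)^{-1} Cov(X, Y),
# beta0 = EY - (EX)' beta1, estimated from simulated data.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
X = rng.standard_normal((n, 2)) @ np.array([[1.0, 0.3], [0.0, 1.0]])
Y = 1.0 + X @ np.array([2.0, -1.0]) + rng.standard_normal(n)

VarX = np.cov(X, rowvar=False)              # 2x2 variance-covariance matrix
CovXY = np.array([np.cov(X[:, j], Y)[0, 1] for j in range(2)])
beta1 = np.linalg.solve(VarX, CovXY)
beta0 = Y.mean() - X.mean(axis=0) @ beta1
print(beta0, beta1)                          # close to 1, [2, -1]
```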
Remark. Let 𝐴 and 𝐵 be random vectors; they can be of any dimensions (each is something × 1). Then, if 𝐴 is 𝑚 × 1,

    Var 𝐴 def= E[(𝐴 − E𝐴)(𝐴 − E𝐴)′],

an 𝑚 × 𝑚 matrix (sometimes people just call it the variance-covariance matrix). This can be rewritten as

    Var 𝐴 = E𝐴𝐴′ − (E𝐴)(E𝐴)′.

Now, what about Cov(𝐴, 𝐵)? The sensible way of defining it is

    Cov(𝐴, 𝐵) def= E[(𝐴 − E𝐴)(𝐵 − E𝐵)′],  a dim 𝐴 × dim 𝐵 matrix,
    Cov(𝐴, 𝐵) = E𝐴𝐵′ − (E𝐴)(E𝐵)′.

So Var 𝑋 = E𝑋𝑋′ − (E𝑋)(E𝑋)′ and Cov(𝑋, 𝑌) = E𝑋𝑌 − (E𝑋)(E𝑌). Do the dimensions match? Var 𝑋 is a matrix, Cov(𝑋, 𝑌) is a vector, so everything matches.
I want to compare BPs and BLPs.

    BP                                        BLP
    BP(𝑌 | 𝑋) = E(𝑌 | 𝑋)                      BLP(𝑌 | 𝑋) = β₀ + 𝑋′β₁,
                                              β₀ = E𝑌 − (E𝑋)′β₁,
                                              β₁ = (Var 𝑋)⁻¹ Cov(𝑋, 𝑌)
    𝑈 def= 𝑌 − E(𝑌 | 𝑋) is the population      𝑉 def= 𝑌 − BLP(𝑌 | 𝑋) is the population
    residual from the best prediction         residual from the best linear
    problem                                   prediction problem
    E(𝑈 | 𝑋) = 0                              BLP(𝑉 | 𝑋) = 0
    𝑈 is uncorrelated with every              𝑉 is uncorrelated with every linear
    function of 𝑋                             function of 𝑋

The best prediction of 𝑈 given 𝑋 is zero; with 𝑉, the analogous statement holds only for linear predictions. The population residual 𝑉 carries no information about linear functions of 𝑋 only, which is a much more restricted statement. Also, there is no useful rule for BLPs, because they are invariant only with respect to linear transformations of 𝑋.

Conditional variance function


Let 𝑌 be a random variable and 𝑋 a random vector. Now that you have conditional densities, you can define any conditional objects. The conditional variance function is the function

    𝑥 ↦→ Var(𝑌 | 𝑋 = 𝑥) def= E[(𝑌 − E(𝑌 | 𝑋 = 𝑥))² | 𝑋 = 𝑥] = E(𝑌² | 𝑋 = 𝑥) − (E(𝑌 | 𝑋 = 𝑥))².

Definition. The conditional variance of 𝑌 given 𝑋 is the random variable Var(𝑌 | 𝑋) def= Var(𝑌 | 𝑋 = 𝑥)|_{𝑥=𝑋}. In the literature, there is a specific name for this function: the skedastic function. Ask Christos. This is the function that captures variation.
Definition. If the skedastic function 𝑥 ↦→ Var(𝑌 | 𝑋 = 𝑥) is constant, then 𝑌 is said to be homoskedastic with respect to 𝑋. If not, 𝑌 is said to be heteroskedastic.
Another thing. Let 𝑌 be a random variable and 𝑋 a random vector. Then 𝑌 = E(𝑌 | 𝑋) + [𝑌 − E(𝑌 | 𝑋)] = E(𝑌 | 𝑋) + 𝑈, and since 𝑈 is uncorrelated with every function of 𝑋,

    Var 𝑌 = Var(E(𝑌 | 𝑋)) + Var 𝑈.

But Var 𝑈 = E𝑈² − (E𝑈)² = E𝑈², because E(𝑈 | 𝑋) = 0 ⇒ E𝑈 = 0. (The converse is not true, ⇍. For a counterexample, try to find two random variables such that the unconditional mean is zero but the conditional mean is not. First try to find the solution yourself—if you never get stuck, you never learn.)
Do you agree that E𝑈² = E[E(𝑈² | 𝑋)]? This is the law of iterated expectations. But E(𝑈² | 𝑋) = E[(𝑌 − E(𝑌 | 𝑋))² | 𝑋] = Var(𝑌 | 𝑋). So E[E(𝑈² | 𝑋)] = E[Var(𝑌 | 𝑋)], and

    Var 𝑌 = Var E(𝑌 | 𝑋) + E Var(𝑌 | 𝑋).

This is called the variance decomposition formula, or ANOVA (ANalysis Of VAriance).
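The decomposition can be checked exactly on the contingency-table distribution from earlier (an illustrative sketch):

```python
# Sketch: Var Y = Var E(Y|X) + E Var(Y|X) on the table used earlier.
import numpy as np

joint = np.array([[0.20, 0.10, 0.15],                 # y = 0 row
                  [0.10, 0.30, 0.15]])                # y = 1 row
ys = np.array([0.0, 1.0])
px = joint.sum(axis=0)                                # marginal of X
cef = (ys[:, None] * joint).sum(axis=0) / px          # E(Y | X = x)
cvf = (ys[:, None]**2 * joint).sum(axis=0) / px - cef**2   # Var(Y | X = x)

EY = cef @ px
VarY = (ys[:, None]**2 * joint).sum() - EY**2
print(VarY, (cef**2 @ px - EY**2) + cvf @ px)         # both 0.2475
```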
Let us start doing some problems from homework . Additional problem  cannot be googled, because it is new.

 The more you write, the less I am convinced that you know. Do not write two pages on the state of your mind.

Additional problem  has  parts.


1. E(𝑌 | 𝑋) is not E(𝑋 | 𝑌), because the first is a function of 𝑋 and the second is a function of 𝑌.
2. Let 𝑔 be a function. Is Var(𝑔(𝑋)𝑌 | 𝑋) = 𝑔²(𝑋) Var(𝑌 | 𝑋) true? Yes. Once you condition on 𝑋, the randomness in 𝑋 is stopped, and by definition Var(𝑐𝑍) = 𝑐² Var 𝑍. Alternatively, you can rewrite this as Var(𝑔(𝑋)𝑌 | 𝑋) = E(𝑔(𝑋)²𝑌² | 𝑋) − (E(𝑔(𝑋)𝑌 | 𝑋))², which by the useful rule is 𝑔²(𝑋)E(𝑌² | 𝑋) − (𝑔(𝑋)E(𝑌 | 𝑋))² = 𝑔²(𝑋)[E(𝑌² | 𝑋) − (E(𝑌 | 𝑋))²] = 𝑔²(𝑋) Var(𝑌 | 𝑋).
Problem  will allow you to solve problem . Let 𝐷 be a Bernoulli
random variable. Let 𝑌 be a random variable. Show that E(𝑌 | 𝐷 =
E𝑌 𝐷
1) = P(𝐷=1) . We show this using the definition. Fir , do you agree that
E𝑌 𝐷 = E(E(𝑌 𝐷 | 𝐷)). It is by LIE. Next, it is equal to E(𝐷E(𝑌 | 𝐷)). How
many values does 𝐷 take? Only two. Do you agree, since 𝐷 ∈ {0, 1}, that


this is E(𝑌 | 𝐷) = 𝐷E(𝑌 | 𝐷 = 1) + (1 − 𝐷)E(𝑌 | 𝐷 = 0). If 𝐷 = 0, then it
is equal to (1 − 0)E(𝑌 | 𝐷 = 0), and if 𝐷 = 1, the RHS is 1 · E(𝑌 | 𝐷 = 1).
Since E(𝑌 | 𝐷) = 𝐷E(𝑌 | 𝐷 = 1) + (1 − 𝐷)E(𝑌 | 𝐷 = 0), then
𝐷E(𝑌 | 𝐷) = 𝐷2 E(𝑌 | 𝐷 = 1) + 𝐷(1 − 𝐷)E(𝑌 | 𝐷 = 0). Now look at 𝐷2 .
It is zero if and only if 𝐷 = 0, and one if and only if 𝐷 = 1. So 𝐷2 = 𝐷.
However, 𝐷(1−𝐷) is always zero. Therefore, 𝐷E(𝑌 | 𝐷) = 𝐷E(𝑌 | 𝐷 = 1).
Therefore, we take expe ation on both sides: E[𝐷E(𝑌 | 𝐷)] = E[𝐷E(𝑌 |
𝐷 = 1)] = E𝐷E(𝑌 | 𝐷 = 1) = P(𝐷 = 1)E(𝑌 | 𝐷 = 1). So

E[𝐷E(𝑌 | 𝐷)] E𝑌 𝐷
E(𝑌 | 𝐷 = 1) = =
P(𝐷 = 1) P(𝐷 = 1)
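
This identity is easy to check by simulation. A minimal sketch, assuming an arbitrary joint model in which 𝑌 depends on 𝐷:

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Assumed model for the check: D ~ Bernoulli(0.3), Y shifted upward when D = 1
d = (rng.random(n) < 0.3).astype(float)
y = rng.standard_normal(n) + 2.0 * d

print(y[d == 1].mean())            # empirical E(Y | D = 1)
print((y * d).mean() / d.mean())   # E[YD] / P(D = 1): the same number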

QED is quod erat demonstrandum: what we were trying to show. Euclid is the famous author here: he wrote everything explicitly, showed the statements and, in the end, summarised.
Exercise  (i). Find E(𝑌 | 𝑋 ∈ 𝑆). What is the conditional expe ation
of 𝑌 given that 𝑋 ∈ 𝑆? This might seem range. So really, we have been
conditioning on other variables, but how do you condition on an event?
def
Let 𝐷 = I𝑆 (𝑋). How many values does 𝐷 take? Like a Bernoulli random
variable: 1 if 𝑋 ∈ 𝑆 or 0 if 𝑋 ̸∈ 𝑆. So E(𝑌 | 𝑋 ∈ 𝑆) = E(𝑌 | 𝐷 = 1). The
point is to write this obje as a conditional expe ation fun ion. Then,
look at the problem : this is P(𝐷=1)
E𝑌 𝐷
. Do you agree that this is nothing but
E𝑌 I𝑆 (𝑋) I𝑆 (𝑋)|𝑋)] (𝑋)E(𝑌 |𝑋)]
P(𝐷=1) = E[E(𝑌P(𝐷=1) = E[I𝑆 P(𝐷=1) . Now you have to fiddle around
so that the answer appears as in part . You should learn from this problem
how to write events as random variables.
Exercise  (ii). Censored or truncated regressions are called Tobit
models. When you use the margins command in Stata, it is using the result
d
from this problem. Let 𝑌 = 𝒩 (𝜇, 𝜎 2 ). Show that
(︀ 𝑎−𝜇 )︀ (︁ )︁
𝑏−𝜇
𝜙 𝜎 −𝜙 𝜎
E(𝑌 | 𝑎 < 𝑌 < 𝑏) = 𝜇 + 𝜎 (︁ )︁ (︀ 𝑎−𝜇 )︀ ,
𝑏−𝜇
Φ 𝜎 −Φ 𝜎

def 2 def ∫︀ 𝑡
where 𝜙(𝑥) = √1 e −𝑥 /2 ,
2𝜋
Φ(𝑡) = −∞ 𝜙(𝑥) d𝑥, 𝑡 ∈ R.
First of all, E(𝑌 | 𝑎 < 𝑌 < 𝑏) = E(𝑌 | 𝑌 ∈ 𝑆) where 𝑆 def= (𝑎, 𝑏). But from the previous part, we know that

E(𝑌 | 𝑋 ∈ 𝑆) = (1 / P(𝑋 ∈ 𝑆)) ∫_{𝑥∈𝑆} E(𝑌 | 𝑋 = 𝑥) PDF𝑋(𝑥) d𝑥

But if 𝑋 = 𝑌, then this implies that

E(𝑌 | 𝑌 ∈ 𝑆) = (1 / P(𝑌 ∈ 𝑆)) ∫_{𝑥∈(𝑎,𝑏)} E(𝑌 | 𝑌 = 𝑥) PDF𝑌(𝑥) d𝑥

For E(𝑌 | 𝑌 = 𝑥), apply the useful rule: E(𝑌 | 𝑌 = 𝑥) = 𝑥. Therefore, E(𝑌 | 𝑌 ∈ 𝑆) = (1 / P(𝑎 < 𝑌 < 𝑏)) ∫_{𝑥∈(𝑎,𝑏)} 𝑥 PDF𝑌(𝑥) d𝑥. Now this becomes

(1 / P(𝑎 < 𝑌 < 𝑏)) ∫_{𝑎}^{𝑏} 𝑥 (1/(√(2𝜋)𝜎)) e^(−(𝑥−𝜇)²/(2𝜎²)) d𝑥
Next,

P(𝑎 < 𝑌 < 𝑏) = P(𝑎 < 𝑌 ≤ 𝑏) = P(𝑌 ≤ 𝑏) − P(𝑌 ≤ 𝑎) =
= P((𝑌 − 𝜇)/𝜎 ≤ (𝑏 − 𝜇)/𝜎) − P((𝑌 − 𝜇)/𝜎 ≤ (𝑎 − 𝜇)/𝜎) =
= P(𝒩(0, 1) ≤ (𝑏 − 𝜇)/𝜎) − P(𝒩(0, 1) ≤ (𝑎 − 𝜇)/𝜎) = Φ((𝑏 − 𝜇)/𝜎) − Φ((𝑎 − 𝜇)/𝜎)

Now,

∫_{𝑎}^{𝑏} 𝑥 (1/(√(2𝜋)𝜎)) e^(−(𝑥−𝜇)²/(2𝜎²)) d𝑥 = ∫_{𝑎}^{𝑏} (𝑥 − 𝜇 + 𝜇) (1/(√(2𝜋)𝜎)) e^(−(𝑥−𝜇)²/(2𝜎²)) d𝑥 =
= ∫_{𝑎}^{𝑏} (𝑥 − 𝜇) (1/(√(2𝜋)𝜎)) e^(−(𝑥−𝜇)²/(2𝜎²)) d𝑥 + 𝜇 ∫_{𝑎}^{𝑏} (1/(√(2𝜋)𝜎)) e^(−(𝑥−𝜇)²/(2𝜎²)) d𝑥,

where the second integrand is PDF𝑌. Let 𝑡 = (𝑥 − 𝜇)²/(2𝜎²); therefore, d𝑡 = ((𝑥 − 𝜇)/𝜎²) d𝑥 and 𝜎² d𝑡 = (𝑥 − 𝜇) d𝑥. Then the first integral is equal to

∫_{(𝑎−𝜇)²/(2𝜎²)}^{(𝑏−𝜇)²/(2𝜎²)} (1/(√(2𝜋)𝜎)) e^(−𝑡) 𝜎² d𝑡 = −𝜎 (1/√(2𝜋)) e^(−𝑡) |_{(𝑎−𝜇)²/(2𝜎²)}^{(𝑏−𝜇)²/(2𝜎²)} =
= 𝜎 (1/√(2𝜋)) (e^(−(𝑎−𝜇)²/(2𝜎²)) − e^(−(𝑏−𝜇)²/(2𝜎²))) = 𝜎 (𝜙((𝑎 − 𝜇)/𝜎) − 𝜙((𝑏 − 𝜇)/𝜎))
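
Dividing by P(𝑎 < 𝑌 < 𝑏) and adding the 𝜇 term gives the claimed formula. A minimal numeric check, with arbitrary illustrative values of 𝜇, 𝜎, 𝑎, 𝑏:

import math
import numpy as np

mu, sigma, a, b = 1.0, 2.0, 0.0, 3.0

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))

za, zb = (a - mu) / sigma, (b - mu) / sigma
formula = mu + sigma * (phi(za) - phi(zb)) / (Phi(zb) - Phi(za))

rng = np.random.default_rng(2)
y = rng.normal(mu, sigma, 2_000_000)
mc = y[(y > a) & (y < b)].mean()   # empirical E(Y | a < Y < b)
print(formula, mc)                 # agree up to Monte Carlo error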

You are here to learn. You should be learning every day of the week, apart from the hours you need for sleep. Welcome to grad school! It is work, work, work. Life is tough.
The point is not the algebra, the point is the concept.
Let’s move on to another topic.

Independence

This topic is called independence. It is a concept that is really remarkable. It was invented by Kolmogorov. It is there to make probability calculations simpler. It may or may not hold.
Suppose that 𝐴 and 𝐵 are events of some sample space. We say that 𝐴 and 𝐵 are independent, and we write 𝐴 ⊥⊥ 𝐵, if

P(𝐴 ∩ 𝐵) = P(𝐴)P(𝐵)

If some events are independent, do they have to be disjoint, and vice versa? If 𝐴 ∩ 𝐵 = ∅, does it imply 𝐴 ⊥⊥ 𝐵? Let 𝑋 =ᵈ 𝒩(0, 1), let 𝐴 def= (𝑋 > 0) and 𝐵 def= (𝑋 ≤ 0). Then 𝐴 ∩ 𝐵 = ∅. You have to apply the definition of this concept. Then P(𝐴) = 0.5, P(𝐵) = 0.5, but P(𝐴 ∩ 𝐵) = P(∅) = 0 ≠ P(𝐴)P(𝐵). The other way around is also not true. Independence is a property of events and has nothing to do with set theory. Sometimes, this concept is called stochastic independence.
What if you had three events? You would then say that independence should hold for any subcollection. We say that 𝐴, 𝐵, 𝐶 are mutually independent if P(𝐴 ∩ 𝐵) = P(𝐴)P(𝐵), P(𝐴 ∩ 𝐶) = P(𝐴)P(𝐶), P(𝐵 ∩ 𝐶) = P(𝐵)P(𝐶), and P(𝐴 ∩ 𝐵 ∩ 𝐶) = P(𝐴)P(𝐵)P(𝐶). This definition is the culmination of a huge amount of human thought.
When can we claim that two random variables are independent? When the events they generate are independent.


Definition. Let 𝑋 and 𝑌 be random variables or random vectors. We say that 𝑋 is independent of 𝑌 (and we write 𝑋 ⊥⊥ 𝑌 ) if P(𝑋 ∈ 𝐴, 𝑌 ∈ 𝐵) = P(𝑋 ∈ 𝐴)P(𝑌 ∈ 𝐵) ∀𝐴 ⊂ R^{dim 𝑋}, 𝐵 ⊂ R^{dim 𝑌}.
Result. It can be shown that

𝑋 ⊥⊥ 𝑌 ⇔ PDF𝑋,𝑌(𝑥, 𝑦) = PDF𝑋(𝑥) PDF𝑌(𝑦) ∀(𝑥, 𝑦) ∈ supp(𝑋, 𝑌)

This is a result that is not so easy to prove. Independence is very strong, and it simplifies calculations.
Result. Here is the reason why independence is a really strong condition. When you declare that two variables are independent, you are imposing a very strong restriction: if 𝑋 ⊥⊥ 𝑌, then 𝑔(𝑋) ⊥⊥ ℎ(𝑌) ∀𝑔, ℎ. Any function of 𝑋 is independent of any function of 𝑌. If you take a constant function, 𝑋 is independent of 𝑐.
Result. If 𝑋 is a random variable and 𝑐 is a constant, then 𝑋 ⊥⊥ 𝑐. This is easy to prove. Just accept it.
Let us look at some restrictions imposed by independence. If two variables are independent, then their correlation is zero. The converse is not true.
Let 𝑋 and 𝑌 be random variables. If 𝑋 ⊥⊥ 𝑌, then

E𝑋𝑌 = ∫_{𝑥∈supp 𝑋} ∫_{𝑦∈supp 𝑌} 𝑥𝑦 PDF𝑋,𝑌(𝑥, 𝑦) d𝑥 d𝑦 = ∫_{𝑥} ∫_{𝑦} 𝑥𝑦 PDF𝑋(𝑥) PDF𝑌(𝑦) d𝑥 d𝑦 =
= ∫_{𝑥} 𝑥 PDF𝑋(𝑥) d𝑥 · ∫_{𝑦} 𝑦 PDF𝑌(𝑦) d𝑦 = E𝑋 · E𝑌

Another result: 𝑋 ⊥⊥ 𝑌 ⇒ Corr(𝑋, 𝑌) = 0 because

Corr(𝑋, 𝑌) = Cov(𝑋, 𝑌) / (sd 𝑋 sd 𝑌) = (E𝑋𝑌 − E𝑋E𝑌) / (sd 𝑋 sd 𝑌) = 0.
Example. Let 𝑋 =ᵈ 𝒩(0, 1) and let 𝑌 def= 𝑋². Then

Corr(𝑌, 𝑋) = Cov(𝑌, 𝑋) / (sd 𝑌 sd 𝑋) = (E𝑋𝑌 − E𝑋E𝑌) / (sd 𝑌 sd 𝑋) = E𝑋𝑌 / (√2 · 1) = E𝑋³ / √2 = 0

Are these variables independent? 𝑌 is a deterministic function of 𝑋. The behaviour of 𝑌 is strictly dependent on 𝑋. You cannot even find a joint density of 𝑌 and 𝑋. The probability of an intersection is not equal to the product of probabilities.
Let 𝐴 def= {𝑋 ∈ (1, ∞)} and 𝐵 def= {𝑌 ∈ (−∞, 0.25)}. Then

P(𝐴 ∩ 𝐵) = P(𝑋 > 1, 𝑌 < 0.25) = P(𝑋 > 1, 𝑋² < 0.25) =
= P(𝑋 > 1, −0.5 < 𝑋 < 0.5) = 0 ≠ P(𝑋 > 1)P(𝑌 < 0.25)

because the latter is (1 − Φ(1)) · (Φ(0.5) − Φ(−0.5)) > 0. These numbers can be found in tables. Typically, you numerically approximate these integrals.
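
A minimal simulation of this example (uncorrelated yet dependent):

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
x = rng.standard_normal(2_000_000)
y = x**2

print(np.corrcoef(x, y)[0, 1])                # ~ 0: uncorrelated

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))
print(np.mean((x > 1) & (y < 0.25)))          # P(A ∩ B) = 0 exactly
print((1 - Phi(1)) * (Phi(0.5) - Phi(-0.5)))  # P(A)P(B) ≈ 0.061 > 0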

Lecture
Let us discuss the relationship between independence and conditioning. Let 𝑋 and 𝑌 be random variables. If 𝑋 ⊥⊥ 𝑌, there are two ways to interpret this: P(𝑋 ∈ 𝐴, 𝑌 ∈ 𝐵) = P(𝑋 ∈ 𝐴)P(𝑌 ∈ 𝐵) ∀𝐴, 𝐵 ⊂ R, or PDF𝑋,𝑌 = PDF𝑋 · PDF𝑌.
Suppose 𝑋 ⊥⊥ 𝑌. What is the conditional density PDF𝑌|𝑋? It is equal to PDF𝑋,𝑌 / PDF𝑋 = PDF𝑌 · PDF𝑋 / PDF𝑋 = PDF𝑌. If 𝑌 and 𝑋 are independent, they give no information on one another.
What about the other way? If PDF𝑌|𝑋 = PDF𝑌, then PDF𝑋,𝑌 = PDF𝑌|𝑋 PDF𝑋 = PDF𝑌 PDF𝑋. So we have an equivalence:

𝑌 ⊥⊥ 𝑋 ⇔ PDF𝑌|𝑋 = PDF𝑌

In some books, this is used as the definition.


Example. Suppose 𝑋 ⊥⊥ 𝑌. Then E(𝑌 | 𝑋) = E𝑌 because conditioning gives no information. Also, Var(𝑌 | 𝑋) = Var 𝑌. These are one-line proofs. What about the other way around? Suppose we know E(𝑌 | 𝑋) = E𝑌. Can we conclude that 𝑌 ⊥⊥ 𝑋? Saying that the conditional mean is equal to the unconditional one is just one particular feature of a distribution. What about the conditional quantiles, variances and modes? So this is not true. Similarly, you can find variables such that Var(𝑌 | 𝑋) = Var 𝑌, but 𝑌 is not independent of 𝑋. Independence imposes restrictions on all features of the distribution.
Homework , AP . Let 𝐴 and 𝐵 be events. Show that everything
between them is independent:

⊥ 𝐵 ⇔ 𝐴𝐶 ⊥
𝐴⊥ ⊥ 𝐵 𝐶 ⇔ 𝐴𝐶 ⊥
⊥𝐵 ⇔ 𝐴⊥ ⊥ 𝐵𝐶 ⇔ 𝐴 ⊥
⊥𝐵


1. 𝐴 ⊥⊥ 𝐵 ⇒ 𝐴ᶜ ⊥⊥ 𝐵. Now 𝐵 = (𝐵 ∖ 𝐴) ∪ (𝐵 ∩ 𝐴). Therefore, P(𝐵) = P(𝐵 ∖ 𝐴) + P(𝐵 ∩ 𝐴) = P(𝐵 ∩ 𝐴ᶜ) + P(𝐵 ∩ 𝐴). Therefore, P(𝐵 ∩ 𝐴ᶜ) = P(𝐵) − P(𝐵 ∩ 𝐴), and through independence it is P(𝐵) − P(𝐵)P(𝐴) = P(𝐵)(1 − P(𝐴)) = P(𝐵)P(𝐴ᶜ).
2. 𝐴ᶜ ⊥⊥ 𝐵 ⇒ 𝐴 ⊥⊥ 𝐵ᶜ. Do you agree that 𝐵 = (𝐵 ∖ 𝐴) ∪ (𝐵 ∩ 𝐴)? So P(𝐵 ∩ 𝐴) = P(𝐵) − P(𝐵 ∩ 𝐴ᶜ). Also, 𝐴 = (𝐴 ∖ 𝐵) ∪ (𝐴 ∩ 𝐵); therefore, P(𝐴 ∩ 𝐵) = P(𝐴) − P(𝐴 ∩ 𝐵ᶜ). Since P(𝐵 ∩ 𝐴) = P(𝐴 ∩ 𝐵),

P(𝐵) − P(𝐵 ∩ 𝐴ᶜ) = P(𝐴) − P(𝐴 ∩ 𝐵ᶜ)

Therefore, P(𝐴 ∩ 𝐵ᶜ) = P(𝐴) − P(𝐵) + P(𝐵 ∩ 𝐴ᶜ) = P(𝐴) − P(𝐵) + P(𝐵)P(𝐴ᶜ) = P(𝐴) − P(𝐵)(1 − P(𝐴ᶜ)) = P(𝐴) − P(𝐵)P(𝐴) = P(𝐴)(1 − P(𝐵)) = P(𝐴)P(𝐵ᶜ).
All constants are independent of all random variables.
𝑋 =ᵈ Rademacher if supp 𝑋 = {−1, 1} and P(𝑋 = −1) = P(𝑋 = 1) = 0.5. It is useful in bootstrapping.
AP. Let 𝑆𝑚 def= Σ_{𝑖=1}^{𝑚} 𝑋𝑖, where (𝑋1, ..., 𝑋𝑛) is a collection of independent, identically distributed random variables. They are independent, and their joint PDF is the product of the marginals. We say that a random sample is a collection of IID random variables. Here, the summation index is random.
Let 𝑁 =ᵈ Poisson(𝜆). Then E𝑆𝑁 is the expectation of a sum of a random number of terms. We declare

𝑆𝑚 def= Σ_{𝑖=1}^{𝑚} 𝑋𝑖 for 𝑚 ≥ 1, and 𝑆0 def= 0.

Now, E𝑆𝑛 = Σ_{𝑖=1}^{𝑛} E𝑋𝑖: the only randomness is coming from the 𝑋’s. However, 𝑆𝑁 has two sources of randomness. You have too much randomness. If we could somehow shut down the randomness in the index, we could solve it. And we do it by conditioning on 𝑁.
Step . E𝑆𝑁 = E[E(𝑆𝑁 | 𝑁 )] (LIE). Now, E(𝑆𝑁 | 𝑁 ). In order to proceed,
def
we need E(𝑌 | 𝑋) = E(𝑌 | 𝑋 = 𝑥)|𝑥→𝑋 . You fir have to compute the
conditional expe ation fun ion. I want to find E(𝑆𝑁 | 𝑁 ) = E(𝑆𝑁 | 𝑁 =
𝑛)|𝑛→𝑁 , the CEF.
Step . Now use the information E(𝑆𝑁 | 𝑁 = 𝑛) = E(𝑆𝑛 | 𝑁 = 𝑛) by


the useful rule. Now we know the definition of it:
{︃ ∑︀
E[ 𝑛𝑖=1 𝑋𝑖 | 𝑁 = 𝑛], 𝑛 ≥ 1,
E(𝑆𝑛 | 𝑁 = 𝑛) = =
0, 𝑛 = 0.
{︃∑︀ {︃∑︀
𝑛 𝑛
𝑖=1 E(𝑋 𝑖 | 𝑁 = 𝑛), 𝑛 ≥ 1, 𝑖=1 E𝑋𝑖 , 𝑛 ≥ 1,
= = = 𝑛𝑝,
0, 𝑛 = 0. 0, 𝑛 = 0.

𝑛 ≥ 0. If all 𝑋’s are independent of 𝑛, then this is ju the marginal


expe ation of 𝑋! So we have shown that E(𝑆𝑁 | 𝑁 = 𝑛) = 𝑛𝑝. Now,
def
E(𝑆𝑁 | 𝑁 ) = 𝑛𝑝|𝑛→𝑁 = 𝑁 𝑝, and

E𝑆𝑁 = E(E[𝑆𝑁 | 𝑁 ]) = E(𝑁 𝑝) = 𝑝E𝑁 = 𝑝𝜆
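
A minimal simulation of this random-sum identity, assuming for concreteness that the 𝑋’s are Bernoulli(𝑝) (only E𝑋1 = 𝑝 matters for the mean):

import numpy as np

rng = np.random.default_rng(4)
lam, p, reps = 7.0, 0.4, 1_000_000

n = rng.poisson(lam, reps)     # N ~ Poisson(lambda)
s = rng.binomial(n, p)         # S_N: sum of N Bernoulli(p) draws, vectorised

print(s.mean(), p * lam)       # E S_N vs. p * lambda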



Homework, AP. Let 𝑋 = −1 with probability 𝑝1, 0 with probability 𝑝2, and 1 with probability 𝑝3. Finding E(𝑋² | 𝑋) is obvious: it is equal to 𝑋². Once you condition on 𝑋, everything that depends on 𝑋 acts like a constant. Now, E(𝑋 | 𝑋²) def= E(𝑋 | 𝑋² = 𝑡)|𝑡→𝑋². Note that supp 𝑋² = {0, 1}. First, E(𝑋 | 𝑋² = 0) = 0 because 𝑋² = 0 ⇔ 𝑋 = 0 and because of the useful rule.
Now,

E(𝑋 | 𝑋² = 1) def= Σ_{𝑥∈supp 𝑋} 𝑥 PMF_{𝑋|𝑋²=1}(𝑥) = Σ_{𝑥∈{−1,0,1}} 𝑥 P(𝑋 = 𝑥 | 𝑋² = 1).

This is equal to −1 · P(𝑋 = −1 | 𝑋² = 1) + 1 · P(𝑋 = 1 | 𝑋² = 1) = −P(𝑋 = −1, 𝑋² = 1)/P(𝑋² = 1) + P(𝑋 = 1, 𝑋² = 1)/P(𝑋² = 1). Now, if 𝑋 = −1, then 𝑋² = 1 automatically, so this is equal to −P(𝑋 = −1)/P(𝑋² = 1) + P(𝑋 = 1)/P(𝑋² = 1). Next, P(𝑋² = 1) = 𝑝1 + 𝑝3, so the expression is equal to −𝑝1/(𝑝1 + 𝑝3) + 𝑝3/(𝑝1 + 𝑝3) = (𝑝3 − 𝑝1)/(𝑝1 + 𝑝3). So

E(𝑋 | 𝑋² = 𝑡) = 0 · I_{𝑡=0} + ((𝑝3 − 𝑝1)/(𝑝1 + 𝑝3)) I_{𝑡=1} = ((𝑝3 − 𝑝1)/(𝑝1 + 𝑝3)) I_{𝑡=1},

and E(𝑋 | 𝑋²) = ((𝑝3 − 𝑝1)/(𝑝1 + 𝑝3)) I_{𝑋²=1}.
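
A quick empirical check of this CEF, with arbitrary probabilities:

import numpy as np

rng = np.random.default_rng(5)
p1, p2, p3 = 0.2, 0.5, 0.3
x = rng.choice([-1, 0, 1], size=2_000_000, p=[p1, p2, p3])

print(x[x**2 == 1].mean())         # empirical E(X | X^2 = 1)
print((p3 - p1) / (p1 + p3))       # formula from the derivation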


Lecture
You will need the properties of Gaussian random vectors in portfolio theory. Gaussian objects possess properties that make them really unique. They possess a property that other variables do not have.
Recall that 𝑋 is said to be a standard normal (Gaussian) random variable (𝑋 =ᵈ 𝒩(0, 1)) if and only if supp 𝑋 = R and

PDF𝑋(𝑥) = 𝜙(𝑥) = (1/√(2𝜋)) e^(−𝑥²/2), 𝑥 ∈ R
Claim. Let 𝑎 and 𝑏 be constants. Then 𝑎𝑋 + 𝑏 is a Gaussian random variable. What is the expected value of 𝑎𝑋 + 𝑏? It is 𝑏. What is its variance? 𝑎² Var 𝑋 = 𝑎². The family of the distribution remains the same; only the mean and the variance change. This is true if and only if any linear combination is Gaussian. The distribution is preserved under linear transformations. This is called the reproducing property of Gaussian random variables. Try and see if you can prove it: take the CDF of 𝑎𝑋 + 𝑏, differentiate it, and verify.
There is a standard convention in English: if you really like Shakespeare, you write “shakespearean”. If you are not familiar with him, you write “Shakespearean”.
Do not make your own definitions if you are asked whether a vector is Gaussian or not. Before we look at Gaussian random vectors, recall some stuff about random vectors.
Let 𝑋 be a 𝑝 × 1 random vector, (𝑋⁽¹⁾, ..., 𝑋⁽ᵖ⁾)⊤. Then

E𝑋 def= (E𝑋⁽¹⁾, ..., E𝑋⁽ᵖ⁾)⊤.

It is a vector of constants. The variance of 𝑋 is defined as E(𝑋 − E𝑋)(𝑋 − E𝑋)′ = E𝑋𝑋′ − (E𝑋)(E𝑋)′, which is a 𝑝 × 𝑝 matrix.
Remark. Var 𝑋 is a symmetric and positive semi-definite matrix. Symmetric means that Var 𝑋 = (Var 𝑋)′. Positive semi-definiteness means that 𝛼′(Var 𝑋)𝛼 ≥ 0 ∀𝛼 ∈ Rᵖ. If a matrix is positive definite, it is guaranteed to be invertible. We shall see that if the coordinates of 𝑋 are linearly independent with probability 1, then this matrix is going to be positive definite.
Remark. If 𝑌 is a random vector, 𝐴 is a matrix of constants, and 𝑏 is a vector of constants, then E(𝐴𝑌 + 𝑏) = 𝐴E𝑌 + 𝑏. It is implicit that 𝐴 and 𝑏 have such dimensions that the multiplication and addition are conformable. Note that 𝑌 ↦→ 𝐴𝑌 + 𝑏 is a linear mapping, a linear transformation of 𝑌. The variance is Var(𝐴𝑌 + 𝑏) = 𝐴(Var 𝑌)𝐴′. Never forget these results.

Gaussian random vectors

Let 𝑋 be a 𝑝 × 1 random vector. Let 𝜇 (𝑝 × 1) def= E𝑋 and 𝑉 (𝑝 × 𝑝) def= Var 𝑋. Now we introduce a very strange-looking definition: 𝑋 is said to be a Gaussian random vector if and only if every linear combination of the coordinates of 𝑋 is a Gaussian random variable, that is, 𝛼′𝑋 is a Gaussian random variable for each 𝛼 ∈ Rᵖ. This definition is really remarkable. It frees you from remembering any formulæ.
If you have a random vector 𝑍, you check every possible linear combination of its coordinates.
Consequence of this definition. Let 𝑍 def= (𝑋, 𝑌, 𝑊)⊤ be a Gaussian random vector. This means that any linear combination of the components 𝑋, 𝑌, and 𝑊 is a Gaussian random variable. Do not say “elements”, because this is not a set. Can you tell me what the distribution of 𝑋 is? It is normal. Why? Because 𝛼 = (1, 0, 0)⊤ fits the definition: 𝑋 = (1, 0, 0) · 𝑍. It is automatically implied that any such combination is Gaussian. If you stack two random variables, it is not necessary that the result is Gaussian: you need many requirements on their joint distribution. In the same vein, 𝑊 = (0, 0, 1)𝑍. And so is 𝑋 + 𝑊 = (1, 0, 1)𝑍.
Another piece of terminology: if 𝑋 is a Gaussian random vector, then its coordinates are said to be jointly normal.
Notation. That 𝑍 is a Gaussian random vector with mean 𝜇 and variance 𝑉 is denoted by 𝑍 =ᵈ 𝒩(𝜇, 𝑉). Nice people write it as 𝑍 =ᵈ MVN(𝜇, 𝑉).
Result. Let 𝑍 be a Gaussian random vector, that is, 𝑍 =ᵈ 𝒩(𝜇, 𝑉). The dimension of 𝜇 is the same as 𝑍’s: dim 𝑍 × 1. The dimension of 𝑉 is dim 𝑍 × dim 𝑍. Let 𝐴 be a matrix of constants. Let 𝑏 be a vector of constants. Then 𝐴𝑍 + 𝑏 =ᵈ 𝒩. The number of columns of 𝐴 must be equal to the number of rows of 𝑍. We claim that this vector is Gaussian. Gaussianity is preserved. Let’s see why.
How can we check this result? We should show that 𝛼′(𝐴𝑍 + 𝑏) is Gaussian. First, the outcome is clearly a vector; it is a random vector. Then, we must check every linear combination for Gaussianity. Let 𝛼 ∈ R^{dim 𝑏}. Then 𝛼′(𝐴𝑍 + 𝑏) = 𝛼′𝐴𝑍 + 𝛼′𝑏 = 𝑎′𝑍 + 𝑐, where 𝑎 def= 𝐴′𝛼 and 𝑐 def= 𝛼′𝑏. Then 𝑎′𝑍 is just a linear combination of the coordinates of 𝑍. This is a Gaussian random variable because 𝑍 was Gaussian in the first place, and adding the constant 𝑐 keeps it Gaussian. So: any linear transformation of a Gaussian random vector is Gaussian. Then

𝐴𝑍 + 𝑏 =ᵈ 𝒩(𝐴E𝑍 + 𝑏, 𝐴(Var 𝑍)𝐴′)

This result is really remarkable: a linear transformation of a Gaussian random vector is Gaussian. There is a name for it: the reproducing property of Gaussian random vectors.
Why is this so important and beloved in science? Let us do an example.
Example. Let

𝑋 =ᵈ 𝒩( (1, 2)⊤, [1 −2; −2 7] )

Let 𝑌⁽¹⁾ def= 𝑋⁽¹⁾ + 𝑋⁽²⁾ and 𝑌⁽²⁾ def= 2𝑋⁽¹⁾ − 3𝑋⁽²⁾. Find the distribution of 𝑌 = (𝑌⁽¹⁾, 𝑌⁽²⁾)⊤. This probability calculation can be done without integrations. It can be figured out with linear algebra. Probability calculations that are highly non-linear are replaced with simple linear algebra. People who do stock trading prefer to work with linear approximations under normality assumptions.
Note that

𝑌 = (𝑌⁽¹⁾, 𝑌⁽²⁾)⊤ = (𝑋⁽¹⁾ + 𝑋⁽²⁾, 2𝑋⁽¹⁾ − 3𝑋⁽²⁾)⊤ = [1 1; 2 −3] 𝑋

Therefore, 𝑌 =ᵈ 𝒩 by the reproducing property of Gaussian random vectors under linear maps. So

𝑌 =ᵈ 𝒩( [1 1; 2 −3] E𝑋, [1 1; 2 −3] (Var 𝑋) [1 2; 1 −3] ),

which gives the mean [1 1; 2 −3] (1, 2)⊤ = (3, −4)⊤ and the variance [1 1; 2 −3] [1 −2; −2 7] [1 2; 1 −3] = [4 −17; −17 91].
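
The arithmetic is easy to confirm numerically; a short check with numpy:

import numpy as np

A = np.array([[1, 1], [2, -3]])
mu = np.array([1, 2])
V = np.array([[1, -2], [-2, 7]])

print(A @ mu)        # [ 3 -4]
print(A @ V @ A.T)   # [[  4 -17]
                     #  [-17  91]]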
Let us see what we have learned.
Example. Let 𝑋 =ᵈ 𝒩(0, 1) be a standard normal variable. (A whole vector 𝑌 =ᵈ 𝒩(0, 𝐼) would be a multivariate normal vector.) Let 𝑌 def= 𝑋𝑅, where 𝑅 =ᵈ Rademacher and 𝑅 ⊥⊥ 𝑋.
What is the distribution of 𝑌? Observe that supp 𝑌 = R because 𝑋 takes all values on the real line. Let us find the CDF of 𝑌. Let 𝑦 ∈ R.


Then

CDF𝑌(𝑦) def= P(𝑌 ≤ 𝑦) = P(𝑋𝑅 ≤ 𝑦) = P(𝑋𝑅 ≤ 𝑦, 𝑅 = −1) + P(𝑋𝑅 ≤ 𝑦, 𝑅 = 1)

because these two events are disjoint. Now, it is equal to

P(−𝑋 ≤ 𝑦, 𝑅 = −1) + P(𝑋 ≤ 𝑦, 𝑅 = 1) = P(−𝑋 ≤ 𝑦)P(𝑅 = −1) + P(𝑋 ≤ 𝑦)P(𝑅 = 1)

because 𝑅 and 𝑋 are independent. This is equal to

0.5 P(−𝒩(0, 1) ≤ 𝑦) + 0.5 P(𝒩(0, 1) ≤ 𝑦) = 0.5 P(𝒩(0, 1) ≤ 𝑦) + 0.5 P(𝒩(0, 1) ≤ 𝑦)

because the standard Gaussian is symmetric around the origin. Here, you are adding the same thing twice:

= P(𝒩(0, 1) ≤ 𝑦) = CDF_{𝒩(0,1)}(𝑦)

This means that 𝑌 is also standard normal, 𝑌 =ᵈ 𝒩(0, 1). We created another standard normal by using this trick.
Same example, next question. Now, I am going to create a random vector: let 𝑍 def= (𝑋, 𝑌)⊤. If you do not agree that 𝑍 is a random vector, you have a huge existential issue here. Is 𝑍 a Gaussian random vector? What do you have to check? Whether every linear combination is Gaussian. So let’s check uncountably infinitely many linear combinations, hopefully by the end of this lecture. Let’s check one:

P(𝑋 + 𝑌 = 0) = P(𝑋 + 𝑅𝑋 = 0) = P(𝑅 = −1) = 0.5

If 𝑋 and 𝑌 were jointly normal, then you would have P(𝑋 + 𝑌 = 0) = 0 because the sum would have a continuous distribution. However, here we observe some point mass. This tells you that 𝑍 is not a Gaussian random vector. Gaussian random vectors can only be recognised through linear combinations.
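
A minimal simulation of this counterexample:

import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000

x = rng.standard_normal(n)
r = rng.choice([-1, 1], size=n)         # Rademacher, independent of x
y = x * r                               # marginally standard normal

print(np.mean(np.abs(x + y) < 1e-12))   # ~ 0.5: point mass at zero, so not jointly Gaussian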


Lecture

Yet another reproducing property of Gaussian random vectors. There is a second reproducing property of Gaussians. I am not sure whether there are other variables with this property. Suppose we have

(𝑌, 𝑋)⊤ =ᵈ 𝒩( (𝜇𝑌, 𝜇𝑋)⊤, [𝜎𝑌² 𝜎𝑋𝑌; 𝜎𝑋𝑌 𝜎𝑋²] )

Result: 𝑌 | 𝑋 =ᵈ 𝒩(E(𝑌 | 𝑋), Var(𝑌 | 𝑋)), where E(𝑌 | 𝑋) = BLP(𝑌 | 𝑋) and Var(𝑌 | 𝑋) = Var(𝑌 − BLP(𝑌 | 𝑋)). Similarly, 𝑋 | 𝑌 =ᵈ 𝒩(E(𝑋 | 𝑌), Var(𝑋 | 𝑌)), where E(𝑋 | 𝑌) = BLP(𝑋 | 𝑌) and Var(𝑋 | 𝑌) = Var(𝑋 − BLP(𝑋 | 𝑌)). So 𝑌 is homoskedastic with respect to 𝑋, and 𝑋 is homoskedastic with respect to 𝑌. This is implied by the joint Gaussianity.
Example. Suppose I told you

(𝑌1, 𝑌2)⊤ =ᵈ 𝒩( (2, 0)⊤, [7 1; 1 3] )

What is the distribution of 𝑌1 | 𝑌2? Then

𝑌1 | 𝑌2 =ᵈ 𝒩(BLP(𝑌1 | 𝑌2), Var(𝑌1 − BLP(𝑌1 | 𝑌2)))

But BLP(𝑌1 | 𝑌2) = 𝛽0 + 𝛽1𝑌2, 𝛽0 = E𝑌1 − 𝛽1E𝑌2, 𝛽1 = Cov(𝑌2, 𝑌1)/Var 𝑌2. Then 𝛽1 = 1/3 and 𝛽0 = 2 − 𝛽1 · 0 = 2. So BLP(𝑌1 | 𝑌2) = 2 + (1/3)𝑌2. What is the variance of the residual of this linear predictor problem? This is Var(𝑌1 − 2 − (1/3)𝑌2) = Var(𝑌1 − (1/3)𝑌2) = Var 𝑌1 + (1/9) Var 𝑌2 − (2/3) Cov(𝑌1, 𝑌2) = 7 + 1/3 − 2/3 = 20/3.
Therefore, I can write that 𝑌1 | 𝑌2 =ᵈ 𝒩(2 + (1/3)𝑌2, 20/3). So the conditioning variable appears only in the mean function.
Once you assume a random vector is Gaussian, calculations become super-easy. It is a huge computational saving. This is why people dump Gaussianity in empirical finance.
Example. The only difficult thing is the way we write it.

The moment you see Gaussian random vectors, you should jump with joy.

Suppose

(𝑋1, 𝑋2)⊤ =ᵈ 𝒩( (1, 1)⊤, [3 1; 1 2] )


Find the distribution of 𝑋1 + 𝑋2 | 𝑋1 = 𝑋2. Let 𝑌1 def= 𝑋1 + 𝑋2 and 𝑌2 def= 𝑋1 − 𝑋2. Then Law(𝑋1 + 𝑋2 | 𝑋1 = 𝑋2) = Law(𝑌1 | 𝑌2 = 0). There is another word for “distribution”: “law”.
Do you agree that

(𝑌1, 𝑌2)⊤ = (𝑋1 + 𝑋2, 𝑋1 − 𝑋2)⊤ = [1 1; 1 −1] (𝑋1, 𝑋2)⊤ =ᵈ
=ᵈ 𝒩( [1 1; 1 −1] (1, 1)⊤, [1 1; 1 −1] [3 1; 1 2] [1 1; 1 −1] ) = 𝒩( (2, 0)⊤, [7 1; 1 3] )?

So really 𝑌1 | 𝑌2 =ᵈ 𝒩(2 + (1/3)𝑌2, 20/3), as in the previous example. Hence 𝑌1 | (𝑌2 = 0) =ᵈ 𝒩(2, 20/3).
Try at home: 2𝑋2 | 𝑋1 − 𝑋2 = 0.

The PDF of Gaussian random vectors

Let 𝑋 be a 𝑝 × 1 random vector. Then 𝑋 is a Gaussian random vector with mean 𝜇 and variance 𝑉 if and only if the joint PDF of the coordinates of 𝑋 is

PDF𝑋(𝑥) = (1/(2𝜋))^{𝑝/2} (1/√(det 𝑉)) exp(−(1/2)(𝑥 − 𝜇)′𝑉⁻¹(𝑥 − 𝜇)), 𝑥 ∈ Rᵖ

This is not an easy result to show: the density has to be exactly this for every linear combination of the coordinates to be Gaussian.
If two random variables are independent, they are surely uncorrelated. If two random variables are uncorrelated, are they independent? No. However, if they are jointly normal, are they independent? For jointly normal variables, uncorrelatedness implies independence. This is such a strong property!
Result. Let 𝑋 be a 𝑝 × 1 Gaussian random vector. Then the coordinates of 𝑋 are independent if and only if they are pairwise uncorrelated. Why is this result so useful? Suppose I give you a bunch of random variables. Suppose each of them is marginally normal and independent of the others, and suppose you stack them. Will the result be Gaussian? Yes.
Proof. Independence always implies uncorrelatedness. It only remains to prove the other direction, so we want to show that if the coordinates of 𝑋 are pairwise uncorrelated, then they are independent. Assume that the coordinates of 𝑋 are pairwise uncorrelated. When you write down the variance-covariance matrix, it will be diagonal:

Var 𝑋 = diag(𝜎1², ..., 𝜎𝑝²)

But we know that 𝑋 is a Gaussian random vector. So the PDF will be

PDF𝑋(𝑥) = (1/(2𝜋))^{𝑝/2} (1/√(det 𝑉)) exp(−(1/2)(𝑥 − 𝜇)′𝑉⁻¹(𝑥 − 𝜇))

The inverse of a diagonal matrix is the diagonal matrix of reciprocal entries, so this is equal to

PDF𝑋(𝑥) = (1/(2𝜋))^{𝑝/2} (1/√(𝜎1² · ... · 𝜎𝑝²)) exp(−(1/2) Σ_{𝑖=1}^{𝑝} (𝑥⁽ⁱ⁾ − 𝜇⁽ⁱ⁾)²/𝜎𝑖²) =
= Π_{𝑖=1}^{𝑝} (1/(√(2𝜋)𝜎𝑖)) e^(−(𝑥⁽ⁱ⁾−𝜇⁽ⁱ⁾)²/(2𝜎𝑖²)) = Π_{𝑖=1}^{𝑝} PDF_{𝑋⁽ⁱ⁾}(𝑥⁽ⁱ⁾),

which is the product of the marginals.

What is the moral of this example? If you have a bunch of random variables that are marginally normally distributed and independent, then they are jointly normal when you stack them up. This is how Gaussian random vectors are constructed.

Large sample theory

This is quite important because it is the basis for all of econometrics. In econometrics, the objects are usually indexed by the sample size, and we want to observe their behaviour as 𝑛 → ∞. Let us begin with convergence of real sequences.
Let (𝑥𝑛) be a sequence of real numbers. Then 𝑥𝑛 → 𝑥 as 𝑛 → ∞ means lim_{𝑛→∞} 𝑥𝑛 = 𝑥, which in turn means that if you draw a ball around 𝑥, all terms of the sequence from some point on lie within this ball, or

∀𝜀 > 0 ∃𝑁𝜀 > 0 : 𝑛 > 𝑁𝜀 ⇒ 𝑥𝑛 ∈ 𝐵(𝑥, 𝜀).

However, some sequences do not converge. Some sequences just oscillate. Another type is boundedness: they need not converge to anything, but they are just bounded.
Definition. (𝑥𝑛) is said to be eventually bounded if there exist numbers 𝑀 and 𝑁 such that 𝑛 > 𝑁 ⇒ |𝑥𝑛| < 𝑀. A sequence that is bounded never gets too large or too small.
Example. For 𝑛 ∈ N, the sequences 𝑥𝑛 = sin 𝑛 and 𝑥𝑛 = (−1)ⁿ are not convergent, but they are eventually bounded. The set of all eventually bounded non-random sequences is denoted by 𝑂(1). This is read as “big oh of one”. There is another symbol. We would like to isolate all sequences that go to zero: among convergent sequences, it is good to have a name for the ones that converge to zero. Note that 𝑥𝑛 → 𝑥 as 𝑛 → ∞ ⇔ 𝑥𝑛 − 𝑥 → 0 as 𝑛 → ∞.
Definition. The set of all non-random sequences that converge to zero as 𝑛 → ∞ is denoted by 𝑜(1).
But what if, instead of real numbers, we had sequences of random vectors? If 𝑥𝑛 → 𝑥 as 𝑛 → ∞, then 𝑥𝑛 − 𝑥 = 𝑜(1) as 𝑛 → ∞. The big advantage: we do not know how to do algebra with arrows, but we do know algebra with equalities.
𝑜(1) is a sequence converging to zero. Then 𝑜(1)𝑂(1) = 𝑜(1), although the 𝑜(1)’s on the two sides are different sequences. Next, 𝑜(1) + 𝑂(1) = 𝑂(1). Next, 𝑜(1) − 𝑜(1) = 𝑜(1) because these might be different sequences. This is not arithmetic, this is symbol manipulation.
What would change if we had real vectors? Recall 𝑥𝑛 → 𝑥 as 𝑛 → ∞ ⇔ 𝑥𝑛 − 𝑥 → 0 as 𝑛 → ∞ ⇔ ∀𝜀 > 0 ∃𝑁𝜀 > 0 : ∀𝑛 > 𝑁𝜀, 𝑥𝑛 − 𝑥 ∈ 𝐵(0, 𝜀). This is a remarkable definition because nowhere have we said that 𝑥 is a real number. For real numbers, the very same thing means that |𝑥𝑛 − 𝑥| < 𝜀, where you are measuring the distance between two real numbers. For vectors 𝑥, the only thing that would change is ‖𝑥𝑛 − 𝑥‖ < 𝜀.
Let us now talk about the convergence of random vectors. A random vector is just like a vector of random variables. You would just have to do something to account for the randomness inside.
Definition. We say that 𝑋𝑛 converges in probability to 𝑋, written 𝑋𝑛 →ᵖ 𝑋 as 𝑛 → ∞, or plim_{𝑛→∞} 𝑋𝑛 = 𝑋, if ∀𝜀 > 0, P(‖𝑋𝑛 − 𝑋‖ > 𝜀) → 0 as 𝑛 → ∞. This is just a probabilistic translation of the real definition.
Example. Let 𝑋𝑛 =ᵈ 𝒩(0, 𝑛), 𝑛 ∈ N. Then 𝑋𝑛/𝑛 →ᵖ 0 as 𝑛 → ∞. The reason Europe became great was calculus, nothing else.
Let 𝜀 > 0. Then we have to show that P(|𝑋𝑛/𝑛| > 𝜀) goes to zero. It is nothing else than

P(|𝒩(0, 𝑛)/𝑛| > 𝜀) = P(|𝒩(0, 1/𝑛)| > 𝜀) = P(𝒩(0, 1/𝑛) > 𝜀) + P(𝒩(0, 1/𝑛) < −𝜀) =
= P(𝒩(0, 1) > √𝑛 𝜀) + P(𝒩(0, 1) < −√𝑛 𝜀) = 1 − Φ(√𝑛 𝜀) + Φ(−√𝑛 𝜀) → 0 as 𝑛 → ∞.

This is true because the CDF converges to zero as its argument goes to −∞ and to one as it goes to +∞.
Why was convergence invented? When you say 𝑥𝑛 → 𝑥 as 𝑛 → ∞, what is the intuition? If you go far out into the tail, you can approximate 𝑥 with 𝑥𝑛: you can use elements from the tail of the sequence to approximate the limit. And translated into statistics, the statistical equivalent of approximation is estimation.
The expectation is an unknown object. How would you calculate the expected value of 𝑋? It is the integral: E𝑋 = ∫_{𝑥∈supp 𝑋} 𝑥 PDF𝑋(𝑥) d𝑥. Who alone knows the densities of random variables? Nature. We want to estimate the unknown value.
Result. Let 𝑋1, 𝑋2, ..., 𝑋𝑛 be IID copies of the random vector 𝑋. We have some data on 𝑋. Then

(1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑋𝑖 →ᵖ E𝑋 as 𝑛 → ∞

We say that the sample average converges to the population average. We can denote the average of the 𝑛 variables by 𝑋̄. In terms of math, the sample average converges in probability to the population average. So you can approximate E𝑋 by the tail of the sequence of sample averages. In statistics, you say that the sample average is a good estimate of the population mean. This was proved by Jacob Bernoulli, and it is called Bernoulli’s weak law of large numbers (WLLN). The Bernoulli family must have drunk from a strange source, because three generations of them made contributions to calculus and medicine. The sample mean is a consistent estimator of the population mean.
Here is the definition. If 𝑋𝑛 →ᵖ 𝑋 as 𝑛 → ∞, then we say that 𝑋𝑛 is consistent for 𝑋. The word “consistency” just means “convergence in probability”.
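
A minimal sketch of the WLLN in action (exponential draws are an arbitrary choice with E𝑋 = 1):

import numpy as np

rng = np.random.default_rng(7)

for n in (10, 1_000, 100_000):
    x = rng.exponential(1.0, n)
    print(n, x.mean())     # drifts towards the population mean 1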

Lecture

Homework problem. 𝑈 def= 2𝑋1 − 𝑋2 + 𝑋3, 𝑉 def= 𝑋1 + 2𝑋2 + 3𝑋3. The first step is to find the joint distribution of 𝑈 and 𝑉. They can be written as

(𝑈, 𝑉)⊤ = [2 −1 1; 1 2 3] (𝑋1, 𝑋2, 𝑋3)⊤ =ᵈ 𝒩(...)
Last time, we looked at Bernoulli’s weak law of large numbers. This is the standard convention; if you change something, different laws of large numbers exist. Assume that 𝑋1, ..., 𝑋𝑛 are IID random vectors. Then 𝑋̄ = (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑋𝑖 →ᵖ E𝑋1 as 𝑛 → ∞. You could say that these vectors are copies of a random vector.
The proof of this is super simple. The value of this result is that it gives you a theoretical justification for using sample means to estimate theoretical means. Last year, there was a talk by a statistician about the history of statistics. People have been using sample averages since Babylonian times: they used cuneiform to report sample averages of wheat given to the tax collector. How much on average did the tax collectors get? The theoretical justification took thousands of years.
It can be shown that if 𝑋̄ →ᵖ E𝑋1, then 𝑋̄ − E𝑋1 →ᵖ 0 as 𝑛 → ∞. That is, (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑋𝑖 − E𝑋1 →ᵖ 0 ⇔ (1/𝑛) Σ_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝜇) →ᵖ 0 ⇔ (1/𝑛) Σ_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝜇) = 𝑜𝑝(1). We denoted such sequences by 𝑜(1), but due to the randomness, we shall use other notation.
Definition. A generic sequence of random vectors or variables that converges in probability to zero is denoted by 𝑜𝑝(1), “little oh pee one”.
There are situations where the law does not apply, for example, when the expectation does not exist.
Let me introduce another notion of convergence. (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑋𝑖 is the sample mean. It is consistent for the population mean. We know that in a certain sense, the sample mean gets closer and closer to the population mean. This is a degenerate result: the distribution of the sample mean is putting all its mass, in the limit, at the point E𝑋1. It was natural to ask: can we refine this result? Can we modify it so that we have a non-degenerate limiting distribution? How fast is this convergence? This is a logical question. Amazingly, the final answer was given in the early twentieth century by the Finnish mathematician Lindeberg, with all i’s dotted and t’s crossed. Some very great people failed to give a rigorous proof.
In order to show this result, we need another notion of convergence: convergence in distribution. Let (𝑌𝑛) be a sequence of random vectors and let 𝑌 be another random vector. We say that (𝑌𝑛) converges in distribution to 𝑌 if the following condition holds:

CDF_{𝑌𝑛}(𝑦) → CDF𝑌(𝑦) as 𝑛 → ∞

for each 𝑦 at which CDF𝑌 is continuous. This is the usual pointwise convergence of functions: everything here is non-random. We write this as

𝑌𝑛 →ᵈ 𝑌

And there is a slight technical wrinkle: you need only the points of continuity. Remember, 𝑌𝑛 could be discrete, continuous, or mixed. Then the CDF can have jumps, and you just exclude them from consideration. This notion of convergence was invented to prove one of the main limiting results in statistics. The CLT required the development of new mathematics, namely complex analysis. It was not possible until complex analysis was fully developed: Gauss could not do it, Laplace could not do it. There was always a mistake or an inconsistency. Lindeberg did this.
This is also one of the laws that work in Nature. In Chicago, it works: there is a Museum of Science and Industry where the marbles drop randomly, but their final distribution is the bell curve.

You can find it on YouTube, but I have seen this personally.


Several times, namely three.

Lindeberg’s Central Limit Theorem (CLT). Assume 𝑋1, ..., 𝑋𝑛 are IID random vectors. Let 𝜇 def= E𝑋1 and 𝑉 def= Var 𝑋1. Then

(1/√𝑛) Σ_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝜇) →ᵈ 𝒩(0, 𝑉) as 𝑛 → ∞

This has huge practical implications. All the inference and testing (when NASA says that the probability of success is such-and-such) is based on this approximation result. People have been using this result without proving it since the days of Gauss and Laplace.
Recall the WLLN:

(1/𝑛) Σ_{𝑗=1}^{𝑛} (𝑋𝑗 − 𝜇) →ᵖ 0


The CLT says: look at something with population mean zero. Instead of dividing by 𝑛, you divide by something smaller, √𝑛. Once you do this, it turns out that this object is no longer degenerate in the limit. The limit is a well-defined random variable with a certain distribution.
Another question was the speed of convergence. The average in the WLLN goes to zero too quickly. What should 𝛼 be so that 𝑛^𝛼 (1/𝑛) Σ_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝜇) neither collapses to zero nor blows up? It turns out that the only 𝛼 possible is 0.5. You should not confuse the WLLN and the CLT.
Note that

(1/√𝑛) Σ_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝜇) = (√𝑛/𝑛) Σ_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝜇) = √𝑛 ((1/𝑛) Σ_{𝑗=1}^{𝑛} 𝑋𝑗 − 𝜇) = √𝑛 (𝑋̄ − 𝜇)

Then the CLT can be rewritten as

√𝑛 (𝑋̄ − 𝜇) →ᵈ 𝒩(0, 𝑉)

What is the meaning of this result? What does it mean that 𝑌𝑛 converges to 𝑌? The limit can be replaced by the tail of the sequence. What is the practical meaning? 𝑌 can be approximated probabilistically by the tail of 𝑌𝑛. This really means that the distribution of 𝑌𝑛 is very close to that of 𝑌 if 𝑛 is large. If the sample size is large enough... but how large? If someone says 𝑛 = 25, this is shit. That’s why econometricians are doing Monte-Carlo simulations. For empirical purposes, whatever size you have, you should use it. If in an exam you are given a sample size 𝑛 and you write that it is not large enough for the CLT, you get zero.
If 𝑛 is large enough, then

√𝑛 (𝑋̄ − 𝜇) ≈ᵈ 𝒩(0, 𝑉) ⇔ 𝑋̄ ≈ᵈ 𝒩(𝜇, 𝑉/𝑛)

In what sense is the approximation valid? Convergence in distribution.
Definition. A random sample is a collection of IID random variables.
Example. What about an example? Problem set exercise: 𝑋1, ..., 𝑋𝑛 are IID draws from PDF𝑋(𝑥) = 3𝑥² I_{[0,1]}(𝑥), 𝑥 ∈ R. Find P(0.6 ≤ 𝑋̄15 ≤ 0.8). How will you solve this problem? You need to know the distribution of the sample mean. Do you know it? The CLT says that it is approximately normal. Let 𝜇 def= E𝑋 and 𝑉 def= Var 𝑋. Then

P(0.6 ≤ 𝑋̄𝑛 ≤ 0.8) = P(0.6 − 𝜇 ≤ 𝑋̄𝑛 − 𝜇 ≤ 0.8 − 𝜇) =
= P((0.6 − 𝜇)/√(𝑉/𝑛) ≤ (𝑋̄𝑛 − 𝜇)/√(𝑉/𝑛) ≤ (0.8 − 𝜇)/√(𝑉/𝑛))
𝑉 /𝑛 𝑉 /𝑛 𝑉 /𝑛

We are using that, if everything is scalar,

√𝑛 (𝑋̄ − 𝜇) ≈ᵈ 𝒩(0, 𝑉) ⇔ (𝑋̄ − 𝜇)/√(𝑉/𝑛) ≈ᵈ 𝒩(0, 1)

What is the justification that (𝑋̄𝑛 − 𝜇)/√(𝑉/𝑛) is approximately standard normal? The CLT. So the probability above is

Φ((0.8 − 𝜇)/√(𝑉/𝑛)) − Φ((0.6 − 𝜇)/√(𝑉/𝑛))
In order to get 𝜇 and 𝑉, you need to do

E𝑋 = ∫_{supp 𝑋} 𝑥 PDF𝑋(𝑥) d𝑥 = ∫_{0}^{1} 3𝑥³ d𝑥 = (3/4)𝑥⁴ |_{0}^{1} = 3/4

Then, Var 𝑋 = E𝑋² − (E𝑋)².

E𝑋² = ∫_{supp 𝑋} 𝑥² · 3𝑥² d𝑥 = ∫_{0}^{1} 3𝑥⁴ d𝑥 = (3/5)𝑥⁵ |_{0}^{1} = 3/5

And Var 𝑋 = 3/5 − 9/16 = (48 − 45)/80 = 3/80.
So you have

Φ((4/5 − 3/4)/√(3/80 · 1/15)) − Φ((3/5 − 3/4)/√(3/80 · 1/15))

The interesting questions in science do not have exact answers, but they have approximate answers.
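
Evaluating the answer and comparing it with a brute-force simulation (inverse-CDF sampling, since F(𝑥) = 𝑥³ on [0, 1]):

import numpy as np
from math import erf, sqrt

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))

mu, V, n = 3/4, 3/80, 15
s = sqrt(V / n)
print(Phi((0.8 - mu) / s) - Phi((0.6 - mu) / s))   # CLT approximation

rng = np.random.default_rng(9)
xbar = (rng.random((1_000_000, n)) ** (1/3)).mean(axis=1)
print(np.mean((0.6 <= xbar) & (xbar <= 0.8)))      # simulated truth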
HW, Additional problem . Define the random variable 𝑋𝑖 that de-


notes the outcome of the 𝑖th game.
{︃
def 5, with prob. 1/3,
𝑋𝑖 =
2, with prob. 2/3

This∑︀ ∑︀𝑛 Your total winnings after 𝑛 games


is ju a discrete random variable.
𝑛
are 𝑖=1 𝑋𝑖 . Find 𝑛 such that P( 𝑖=1 𝑋𝑖 ≥ 0) = 0.99. You have to do
enough algebra to make the CLT appear.
P(Σ_{𝑖=1}^{𝑛} 𝑋𝑖 ≥ 0) = P(𝑋̄ ≥ 0) = P((𝑋̄ − 𝜇)/√(𝑉/𝑛) ≥ −𝜇/√(𝑉/𝑛)) ≈
≈ P(𝒩(0, 1) ≥ −𝜇/√(𝑉/𝑛)) = 0.99 ⇔ −𝜇/√(𝑉/𝑛) = 𝑄_{𝒩(0,1)}(0.01),

where 𝑄_{𝒩(0,1)} is the standard normal quantile function. So

√𝑛 = −(√𝑉/𝜇) 𝑄_{𝒩(0,1)}(0.01) ⇒ 𝑛 = (𝑉/𝜇²)(𝑄_{𝒩(0,1)}(0.01))²
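
Plugging in the moments (with the −2 loss as above), 𝜇 = 1/3 and 𝑉 = 11 − 1/9 = 98/9. A small numeric sketch:

from math import sqrt

mu = 5 * (1/3) + (-2) * (2/3)           # 1/3
V = 25 * (1/3) + 4 * (2/3) - mu**2      # 98/9
q = -2.3263478740408408                 # standard normal 1% quantile

n = V / mu**2 * q**2
print(n)                                # ~ 530.4, so take n = 531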
When you talk about convergence of real numbers, one main reason it is defined the way it is, is so that if you apply a continuous function, then 𝑔(𝑥𝑛) → 𝑔(𝑥). People realised very early on that continuous functions are very nice (but infinitely differentiable ones are even nicer). The convergence must be in the very same sense: convergence should be preserved under continuous transformations.
Continuous mapping theorem (CMT). Let (𝑋𝑛) be a sequence of random vectors. Let 𝑋 be a random vector. Let 𝑔 be a continuous mapping. You can think of the mapping as defined on the entire Euclidean space. Then

𝑋𝑛 →ᵖ 𝑋 ⇒ 𝑔(𝑋𝑛) →ᵖ 𝑔(𝑋)

and

𝑋𝑛 →ᵈ 𝑋 ⇒ 𝑔(𝑋𝑛) →ᵈ 𝑔(𝑋)

Example. Let 𝑋1, ..., 𝑋𝑛 be IID Bernoulli(𝑝). Find a consistent estimator of 𝑝. So far, we have not talked about estimation. 𝑝 is the probability of success, but it is also the expected value! Recall, 𝑝 = E𝑋1. What is the theoretical justification? The WLLN. So 𝑝̂ def= 𝑋̄ is a consistent estimator of 𝑝.
You are interested in the odds ratio 𝑝/(1 − 𝑝). Odds ratios are useful objects. They look very strange, but in England, they are used in horse race betting. You could say the estimate is whatever number your grandmother told you, but this is a shitty estimator! Estimators should be told by the data. A consistent estimator of 𝑝/(1 − 𝑝) is 𝑝̂/(1 − 𝑝̂). Why? Because of the CMT: 𝑔(𝑝̂) →ᵖ 𝑔(𝑝), so let 𝑔(𝑝) def= 𝑝/(1 − 𝑝); 𝑔 : (0, 1) → R is continuous, so there you go.
Example. Suppose

(𝑋𝑛, 𝑌𝑛)⊤ →ᵈ 𝒩( (𝜇𝑋, 𝜇𝑌)⊤, [𝜎𝑋² 𝜎𝑋𝑌; 𝜎𝑋𝑌 𝜎𝑌²] ) =ᵈ (𝑋, 𝑌)⊤

We do not know the distributions of 𝑋𝑛 and 𝑌𝑛, but their asymptotic distribution is normal. Find the asymptotic distribution of 𝑋𝑛 + 𝑌𝑛.
First, let 𝑔(𝑎, 𝑏) def= 𝑎 + 𝑏, so 𝑔 : R² → R is continuous. By the CMT, 𝑔(𝑋𝑛, 𝑌𝑛) →ᵈ 𝑔(𝑋, 𝑌), i.e. 𝑋𝑛 + 𝑌𝑛 →ᵈ 𝑋 + 𝑌 =ᵈ 𝒩(𝜇𝑋 + 𝜇𝑌, 𝜎𝑋² + 𝜎𝑌² + 2𝜎𝑋𝑌).
If instead the question is about the limiting distribution of 𝑋𝑛² + 𝑌𝑛², then by the CMT, 𝑋𝑛² + 𝑌𝑛² →ᵈ 𝑋² + 𝑌².
Slutsky’s lemma. Now, let (𝑋𝑛), (𝑌𝑛) be sequences of random variables. Let 𝑋 be a random variable and let 𝑐 be a constant. Suppose 𝑋𝑛 →ᵈ 𝑋 as 𝑛 → ∞ and 𝑌𝑛 →ᵖ 𝑐. Then convergence in distribution is preserved:
1. 𝑋𝑛 + 𝑌𝑛 →ᵈ 𝑋 + 𝑐,
2. 𝑋𝑛𝑌𝑛 →ᵈ 𝑋𝑐,
3. 𝑋𝑛/𝑌𝑛 →ᵈ 𝑋/𝑐, provided 𝑐 ≠ 0.
For example, if 𝑋 is normal, you can divide it by 𝑌𝑛.
You have heard about Slutsky in economics. It is the same Slutsky!
Let 𝜃 be a parameter (an unknown constant). Suppose we are estimating 𝜃 by the estimator 𝜃̂. Suppose that √𝑛(𝜃̂ − 𝜃) →ᵈ 𝒩(0, 𝜎²) as 𝑛 → ∞, where 𝜎² is unknown. In econometrics, convergence in distribution really comes from the CLT, and convergence in probability comes from the WLLN. Because the variance is unknown, let’s estimate it. Suppose 𝜎² is consistently estimable by 𝜎̂².
What is the practical meaning of √𝑛(𝜃̂ − 𝜃) →ᵈ 𝒩(0, 𝜎²)? For large enough 𝑛, the distribution of this object is close enough to that of the Gaussian random variable, so √𝑛(𝜃̂ − 𝜃) ≈ᵈ 𝒩(0, 𝜎²). Therefore,

𝜃̂ ≈ᵈ 𝒩(𝜃, 𝜎²/𝑛)

Here 𝜎²/𝑛 is the variance of the approximating distribution. Can we call it the variance of 𝜃̂? Not really, because they are not equal. It is an approximation to the variance of 𝜃̂.
Definition. The asymptotic variance is defined as asVar(𝜃̂) def= 𝜎²/𝑛.
What do you call the square root of the variance? The standard deviation:

√(asVar(𝜃̂)) = 𝜎/√𝑛 = assd(𝜃̂)

However, there is a really nice name for its estimate: the standard error is defined as

se(𝜃̂) def= 𝜎̂/√𝑛

Question: what is the limiting distribution of

𝑇̂ def= (𝜃̂ − 𝜃)/se(𝜃̂)?

This is called a 𝑡-statistic. The limiting distribution is

(𝜃̂ − 𝜃)/se(𝜃̂) = (𝜃̂ − 𝜃)/(𝜎̂/√𝑛) = (𝜎/𝜎̂) · ((𝜃̂ − 𝜃)/(𝜎/√𝑛)) →ᵈ 𝒩(0, 1),

since 𝜎/𝜎̂ →ᵖ 1 and (𝜃̂ − 𝜃)/(𝜎/√𝑛) →ᵈ 𝒩(0, 1).
We said that 𝜎̂² →ᵖ 𝜎². Then 𝜎̂²/𝜎² →ᵖ 1, and so 𝜎̂/𝜎 →ᵖ 1, and 𝜎/𝜎̂ →ᵖ 1 (by the CMT). The product of the two, by Slutsky’s lemma, converges in distribution to the standard normal. This is the basis for inference in large samples. When you are doing 𝑡-tests without the assumption of normality, the statistic is still asymptotically standard normal. The 𝑡 distribution came from William Gosset. However, the 𝑡 distribution requires the normality assumption, and nowhere in this derivation did we use that assumption.
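
A minimal simulation of the asymptotic normality of the 𝑡-statistic with non-normal data:

import numpy as np

rng = np.random.default_rng(10)
n, reps = 500, 50_000

x = rng.exponential(1.0, (reps, n))       # skewed population, true mean 1
theta_hat = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)

t = (theta_hat - 1.0) / se
print(t.mean(), t.std())                  # ~ 0, ~ 1
print(np.mean(np.abs(t) > 1.96))          # ~ 0.05 despite non-normality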


Delta method

There are two versions of the delta method: the approximate delta method and the exact delta method.
Suppose we know that 𝜃̂ is a consistent estimator of 𝜃 and

√𝑛(𝜃̂ − 𝜃) →ᵈ 𝒩(0, 𝜎²) as 𝑛 → ∞

What can you say about the limiting distribution of 𝑔(𝜃̂)? Let’s assume that 𝑔 is continuous. Then 𝑔(𝜃̂) − 𝑔(𝜃) →ᵖ 0 by the CMT. Question: what can we say about the distribution of √𝑛(𝑔(𝜃̂) − 𝑔(𝜃))?
Solution. Let us do a Taylor expansion of 𝑔. Suppose 𝑔 is continuously differentiable, that is, the derivative exists and is continuous. Then, by the mean-value form of the Taylor expansion,

𝑔(𝜃̂) = 𝑔(𝜃) + 𝑔𝜃(𝜆𝜃̂ + (1 − 𝜆)𝜃)(𝜃̂ − 𝜃), 𝜆 ∈ (0, 1),

where 𝑔𝜃 def= d𝑔/d𝜃. Therefore,

√𝑛(𝑔(𝜃̂) − 𝑔(𝜃)) = 𝑔𝜃(𝜃 + 𝜆(𝜃̂ − 𝜃)) √𝑛(𝜃̂ − 𝜃) →ᵈ 𝑔𝜃(𝜃) · 𝒩(0, 𝜎²), 𝜆 ∈ (0, 1),

because 𝑔𝜃(𝜃 + 𝜆(𝜃̂ − 𝜃)) →ᵖ 𝑔𝜃(𝜃) (as 𝜃̂ →ᵖ 𝜃 and 𝑔𝜃 is continuous) while √𝑛(𝜃̂ − 𝜃) →ᵈ 𝒩(0, 𝜎²). The limit is distributed as 𝒩(0, (𝑔𝜃(𝜃))²𝜎²).


This is a real proof: everything is exact, and we are dealing with equalities everywhere.
Now, there is a method that makes it really sketchy: the approximate delta method.
Let 𝑋 and 𝑌 be random variables. Let 𝑔 be a function R² → R. Question: what is E𝑔(𝑋, 𝑌)? In general, this is a very difficult question, but the approximate delta method really works on the back of an envelope. By a Taylor expansion,

𝑔(𝑋, 𝑌) ≈ 𝑔(E𝑋, E𝑌) + 𝑔1(E𝑋, E𝑌)(𝑋 − E𝑋) + 𝑔2(E𝑋, E𝑌)(𝑌 − E𝑌)

We are expanding 𝑔 around the expected values. Therefore,

E𝑔(𝑋, 𝑌) ≈ 𝑔(E𝑋, E𝑌) + 0 + 0


This is purely a back-of-the-envelope calculation, only to save your arse if someone is holding a gun to your head. The approximation is an equality only in the case of linear functions.
What about the variance?

Var 𝑔(𝑋, 𝑌) ≈ Var(𝑔1(E𝑋, E𝑌)(𝑋 − E𝑋) + 𝑔2(E𝑋, E𝑌)(𝑌 − E𝑌)) =
= 𝑔1²(E𝑋, E𝑌) Var 𝑋 + 𝑔2²(E𝑋, E𝑌) Var 𝑌 + 2𝑔1(E𝑋, E𝑌)𝑔2(E𝑋, E𝑌) Cov(𝑋, 𝑌)

You should treat the delta method with some caution given the non-linearity of 𝑔.
Example. Additional problem , part (ii).

You under and what “widget” means? Some crap.


d d
𝑃 = 𝒰(0.9𝑝, 1.1𝑝) and 𝑄 = 𝒰(𝑞 −5, 𝑞 +5). Find E𝑃 𝑄 using the approximate
delta method.
Let 𝑔(𝑎, 𝑏) = 𝑎𝑏. Then, by the approximate delta method, E𝑔(𝑃, 𝑄) ≈
𝑔(E𝑃, E𝑄) = E𝑃 E𝑄. The variance will not be exa . Var 𝑔(𝑃, 𝑄) ≈
𝑔12 (E𝑃, E𝑄) Var 𝑃 + 𝑔22 (E𝑃, E𝑄) Var 𝑄 + 2𝑔12 (E𝑃, E𝑄)𝑔22 (E𝑃, E𝑄) Cov(𝑃, 𝑄).
The latter covariance is zero. Next, if 𝑔(𝑎, 𝑏) = 𝑎𝑏, then 𝑔1 (𝑎, 𝑏) = 𝑏 and
𝑔2 (𝑏, 𝑎) = 𝑎. Then, 𝑔1 (E𝑃, E𝑄) = E𝑄, 𝑔2 (E𝑃, E𝑄) = E𝑃 .
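
A minimal check, assuming 𝑃 ⊥⊥ 𝑄 and illustrative values 𝑝 = 10, 𝑞 = 50 (the variance of 𝒰(𝑎, 𝑏) is (𝑏 − 𝑎)²/12):

import numpy as np

rng = np.random.default_rng(11)
p, q, n = 10.0, 50.0, 2_000_000

P = rng.uniform(0.9 * p, 1.1 * p, n)
Q = rng.uniform(q - 5, q + 5, n)

var_P = (0.2 * p) ** 2 / 12
var_Q = 10.0 ** 2 / 12

print((P * Q).mean(), p * q)                       # mean: exact for a product
print((P * Q).var(), q**2 * var_P + p**2 * var_Q)  # variance: only approximate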
Your brain should reserve space for the useful stuff.


