Lecture ( - ): Probability Calculations
Probability calculations
You have to have something random to calculate probabilities, so
the basic ingredient is a random experiment. A random experiment is an
action whose outcome is uncertain in advance of its happening.
It is the squeaky wheel that gets the grease: if you don’t ask, you
don’t get.
The things we observe are the outcomes of nature’s random experiment. Nature does something—
stocks go up. Nature does something—you pass an exam. How long will
you live, will you get married—everything is a random experiment.
Suppose you toss a coin. You can have either heads or tails. The sample space is a set
that has two objects inside: {H, T}. Suppose I tossed a coin twice; then
the sample space is the Cartesian product of {H, T} with itself: {H, T} × {H, T} =
{(H, H), (H, T), (T, H), (T, T)}. The more complicated the experiment, the
more complicated the space.
What is an event? An event is any subset of the sample space. Look at
these sets: E ≝ {(H, H)}, F ≝ {(H, H), (H, T), (T, H)}. E is the event that you toss a
coin twice and both of the times you get heads.
The most powerful relational operator in mathematics is equals: “A = B”
means “A is identical to B”. Here we make the equality by definition, ≝.
There is no calculation behind it; we are imposing equality by definition.
If you toss a coin twice and get two heads, you say E has occurred. If you toss a coin
twice and see that at least one head has occurred, you say F has occurred. What
is the chance, probability, likelihood of the event E? In the classical theory of
Chevalier de Méré, Pierre Fermat and Blaise Pascal, the probability of E is the
probability of heads on the first toss times that on the second. The success of
this classical theory lasted until the XIXth century. The classical probability of an
event is the cardinality of the event divided by the cardinality of the sample
space. But the minute you go to science and engineering with infinitely large sample
spaces, it fails, because you can get the ratio ∞/∞, which makes
no sense. The price of a stock can take any value between 0 and +∞. You
cannot count the number of elements; this method is no longer sustainable.
Who is the creator of mathematical statistics? Andrey Kolmogorov. In
1933, he wrote a book on the mathematical theory of probability. It agrees with
the classical theory. So the theory is quite young. . .
The mathematical theory of probability is one of the greatest things that
have come to a human mind. We need a systematic way of measuring
probabilities. Consider a random experiment, and let S be the sample
space associated with this random experiment. We are given a particular
event: the event A is a subset of this sample space. A can have any cardinality,
including infinite. Everything is very, very general. The Greeks could not
do it, Laplace could not handle it, Newton could not do it. The answer is
remarkably simple: all you need is a function. It should take an event as
an input and return a number. How should we construct this function?
1. Consistent (at any time the answer should be the same);
2. Giving reasonable answers;
3. If the sample space is finite, it should give the same answer as classical
probability.
As long as you can find a function satisfying these axioms, you are ready
to measure probabilities. What should be the domain of the function? It
receives an event and returns a number, so the domain should be the
set of all subsets. The domain is any event, and Kolmogorov gave the function the
name probability measure: P : {set of all subsets of S} → R is called a
probability measure if it satisfies three properties:
1. Probabilities should be non-negative. They can be zero: P(A) ≥
0 ∀A ⊂ S. How do you interpret negative numbers? Ask a -year-old
what zero is, and he will crap his pants. It took humanity
some time to invent a symbol for nothingness.
2. The probability of the entire sample space is 1, and the probability of
an empty event is 0: P(S) = 1, P(∅) = 0.
3. The required property: if A₁, A₂, . . . is a sequence of pairwise disjoint
subsets of S, then the probability of the union is the sum of probabilities:

    P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

Disjoint means that the intersection is empty: the sequence is such that
the intersection of any two events is empty, A_i ∩ A_j = ∅ for i ≠ j.
Any function that satisfies Kolmogorov’s axioms is a probability measure.
Do we know that such a function exists? Otherwise this theory is silly.
Probability measures exist!
Example. Suppose card(S) < ∞. Then P(A) ≝ card(A)/card(S), A ⊂ S, is a
probability measure.
How about defining a probability measure on the real line, S = R?
Its cardinality is uncountable infinity. Then

    P₁(A) ≝ ∫_A I_{[0,1]}(x) dx

How about this:

    P₂(A) ≝ ∫_A (1/√(2π)) e^{−x²/2} dx
In the first example,

    P₁(R) = ∫_R I_{[0,1]}(x) dx = ∫_{(R∖[0,1]) ∪ [0,1]} I_{[0,1]}(x) dx
          = ∫_{R∖[0,1]} I_{[0,1]}(x) dx + ∫_{[0,1]} I_{[0,1]}(x) dx.

The first integral is always zero, so this is really ∫_{[0,1]} I_{[0,1]}(x) dx, and this
integral is equal to 1.
The last property is just a property of integrals. So probability measures
exist. In mathematics, existence is crucial; the rest builds on the
fact that these objects exist, otherwise it is just abstract nonsense. There are fewer
axioms here than in geometry; it is so parsimonious. And Kolmogorov
died in 1987. Professor Tripathi knows a person who knew Kolmogorov.
What if, instead of an infinite sequence, we have a finite number of disjoint
sets, say A₁, A₂, disjoint subsets of S? Then P(A₁ ∪ A₂) = P(A₁) + P(A₂).
But how do we know that? Take the countable sequence A₁, A₂, ∅, ∅, ∅, . . ..
Then ⋃_{i=1}^∞ A_i = A₁ ∪ A₂, so, by property 3,
P(⋃_{i=1}^∞ A_i) = P(A₁ ∪ A₂) = ∑_{i=1}^∞ P(A_i) = P(A₁) + P(A₂).
These properties have remarkable implications.
[Venn diagrams: an event A inside S; two events A and B; A and its complement A^C.]
Lecture (--)
Homework , additional problem . A and B are events of S. We have
shown that if P(A) = 0, then P(A ∩ B) = 0. First, A ∩ B ⊂ A. This implies
P(A ∩ B) ≤ P(A). But P(A) = 0 ≥ P(A ∩ B). Probabilities cannot be
negative, so this thing is zero.
Oh, s***, the sponge. . . Ah, here it is. I was about to use a harsh
word for the university.
Now suppose P(A) = 1. Then P(B) = P(B ∖ A) + P(B ∩ A). By definition, B ∖ A ≝ B ∩ A^C, the complement
taken with respect to the sample space. Therefore, P(B ∖ A) = P(B ∩ A^C) ≤
P(A^C) = 1 − P(A) = 1 − 1 = 0. Therefore, P(A) = 1 ⇒ P(B ∖ A) = 0.
Hence, P(B) = P(B ∩ A). If you take an event that happens with probability
one, its intersection with any other event has the probability of that other event.
These results are fantastically useful. They are used to simplify probability
calculations.
Let E ≝ Q, the set of all rational numbers in R. Then

    P₁(E) = ∫_E I_{[0,1]}(x) dx = ∫_{E∩[0,1]} dx = 0,

because E ∩ [0,1] is countable, and a countable set has zero length.
Never confuse a zero-probability event with an empty event. You might
think that this is a silly example.
Random variables
We introduced the probability measure, but there is one problem: the
sample space can be very, very abstract and consist of non-numerical
outcomes. The outcomes of a coin are heads or tails. You can think of a set of
donkeys as well as a set of real numbers. The language of mathematics does not require real numbers to exist.
Nature tosses a coin, there are some random processes in your mother’s body,
and you get a specific genome characteristic. Computationally it is difficult,
but mathematically it is feasible. We need a device to do the computations
easily. This is why you need a function to map the sample space into the
real line—the definition of a random variable. Once you have done this
mapping, for practical purposes, the real line becomes your sample space.
A random variable is a function. Suppose the sample space is S =
{H, T}. A possible mapping is

    X(ω) = 1, ω = H,
           0, ω = T.

Random variables are functions, but they are written without arguments. It
is a convention: X = 1 if the outcome is heads and X = 0 if the outcome
is tails.
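The coin mapping above is simple enough to write down literally. A minimal sketch, with the function name `X` taken from the notes and everything else my own scaffolding:

```python
# The random variable X: {H, T} -> {1, 0} from the notes.
def X(omega):
    # Map the non-numerical outcome to a number on the real line.
    return 1 if omega == "H" else 0

S = ["H", "T"]                  # the sample space
support = {X(w) for w in S}     # the set of values X takes: supp X = {0, 1}
```

After the mapping, for practical purposes you work with {0, 1} instead of {H, T}.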
Uppercase roman letters will be used to denote random variables. If
there is some notation your grandmother likes, I couldn’t care less. I do
not care what corner of the world you are coming from: uppercase
roman letters denote random variables. Lowercase roman letters will
denote outcomes: X = x, x ∈ {0, 1}. I will use this religiously.
Whatever properties apply to functions apply to random variables.
However, some terminology changes in the probabilistic setting. The
co-domain is the real line. The domain is heads and tails. The range is 0
and 1, and it is called the support. The support of a random variable is the set
of values it takes. It is written like this: supp X = {0, 1}. We are not saying
“range” because we are not using functional notation.
If you want to find the probability that a child is born with blue eyes and
brown hair and weighs . kg, you have to compute it, and you need two
pieces of information. All the fancy math is directed towards one goal:
how can I calculate the probability of an interesting event? You just have to
compute the probability that X > 9 with respect to some measure, because the
genes are combining in an abstract space.
Weight can be zero or arbitrarily large.
I have seen people with arbitrarily large weights.
1. You have to specify the support of a random variable, i. e. the set of
values taken by this random variable.
2. You have to specify the probability measure, depending on your
imagination and the question being asked.
Example . Suppose that supp X = R and

    Prob(X ∈ A) ≝ ∫_A (1/√(2π)) e^{−x²/2} dx.

This thing integrates to 1; it is the Gaussian integral. This is a well-defined
probability measure. Do you recognise this random variable? Because it is
so frequently applied, it is called the standard normal, or Gaussian, random
variable. You can call the function (1/√(2π)) e^{−x²/2} Tom Cruise, but let us be
scientific and call it the probability density function, PDF.
The support has to be specified. You can get the probabilities by integrating
densities over regions. The Gaussian density looks like this:

    PDF_X(x) = (1/√(2π)) e^{−x²/2},  x ∈ R.

𝜙 is a reserved letter for this density, just as in computer programming, so do not use it for
anything else. The shorthand is X =ᵈ 𝒩(0, 1).
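Getting probabilities by integrating the density over a region can be sketched numerically. This is an illustrative sketch using the trapezoidal rule; the helper names `phi` and `prob_interval` are my own, and the truncation at ±10 is an assumption justified by the negligible Gaussian tails.

```python
import math

def phi(x):
    # Standard normal density: phi(x) = exp(-x^2/2) / sqrt(2*pi).
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def prob_interval(a, b, n=100_000):
    # P(a < X <= b) by the trapezoidal rule; crude, but enough to see it work.
    h = (b - a) / n
    s = 0.5 * (phi(a) + phi(b)) + sum(phi(a + i * h) for i in range(1, n))
    return s * h

total = prob_interval(-10, 10)            # effectively P(X in R), close to 1
p_95 = prob_interval(-1.96, 1.96)         # the familiar ~0.95 central region
```

The near-unit value of `total` is the numerical counterpart of "this thing integrates to 1".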
Example . Let y be a random variable. What is wrong with this
statement? It is lowercase. The error propagates; better have a doubt
cleared. So let us use Y. Even V and v are distinguished by a squiggle. Let
supp Y = [a, b], where a < b, and PDF_Y(y) = (1/(b − a)) I_{[a,b]}(y), y ∈ R. Y takes
all values between a and b. How will you use this piece of information?
P(Y ∈ A) = ∫_{y∈A} PDF_Y(y) dy for A ⊂ R. This random variable
is also frequently used in applied work. This is the uniform random
variable. The shorthand for “Y is uniformly distributed on [a, b]” is Y =ᵈ
𝒰[a, b] (or Y ∼ 𝒰[a, b]).
Example . Let p ∈ (0, 1) be a constant and let Z be a random variable
such that supp Z = {0, 1} and P(Z ∈ A) = ∑_{z∈A} p^z (1 − p)^{1−z} I_{{0,1}}(z), A ⊂
R. Is this a probability measure? An empty summation is zero by definition.
Then

    P(Z ∈ R) = ∑_{z∈R} p^z (1 − p)^{1−z} I_{{0,1}}(z)
             = p⁰(1 − p)^{1−0} I_{{0,1}}(0) + p¹(1 − p)^{1−1} I_{{0,1}}(1) = (1 − p) + p = 1.
For a continuously distributed variable, the chance that it takes any particular value is zero. On the other hand, a Bernoulli
random variable has a non-zero probability of 0 and 1. The first
difference between the Gaussian and the Bernoulli random variable is
the support: the support of a Bernoulli random variable is countable.
The support of a Gaussian variable is uncountable. If the support is an
uncountable set and the probability of X being any particular number is 0,
then X is said to be continuously distributed. On the other hand, Z is
said to be discrete if supp Z is a countable set. Some random variables can
be both. In applied work, random variables can be partially continuous.
Example. Let X =ᵈ 𝒩(0, 1). Let c be a constant. Define

    W ≝ X, X > c,
        c, X ≤ c,

where c is the threshold. This is called a censored random variable, or a bottom-coded
random variable.
Example . We can be more generic. Let λ > 0. X =ᵈ exp(λ) if
supp(X) = (0, ∞) and PDF_X(x) = (1/λ) e^{−x/λ} I_{(0,∞)}(x), x ∈ R. It is a
continuous random variable. It is used to measure the survival of electronic
objects: the probability of an SSD disk failing does not depend on its prior
history because there are no moving parts.
Example . Let λ > 0. X =ᵈ Poisson(λ) if and only if supp(X) = {0} ∪ N
and

    PMF_X(x) = e^{−λ} (λ^x / x!) I_{{0}∪N}(x),  x ∈ R.

To get P(X ∈ A), you compute ∑_{x∈A} PMF_X(x).
Do you agree that P(−∞ < X ≤ x) = P[(−∞ < X < ∞) ∩ (X ≤ x)]? The
first event happens with probability one. So this is simply P(X ≤ x).
A result without a proof: X =ᵈ Y if and only if CDF_X = CDF_Y. We do
not have to worry about the support. It means that the CDFs have to be equal at
every point on the real line, t ∈ R. A CDF just gives the probability of an interval.
But you can live your life happily even with a casual acquaintance with the
CDF.
Suppose X is continuously distributed. Then CDF_X(x) = P(X ∈
(−∞, x]). How would you get this? By integrating the PDF:

    CDF_X(x) = ∫_{−∞}^{x} PDF_X(t) dt.

Result:

    d/dx CDF_X(x) = d/dx ∫_{−∞}^{x} PDF_X(t) dt = PDF_X(x).

This is the result that connects differential to integral calculus: the
fundamental theorem of calculus, or the Leibniz rule.
Let us do an example. Let Y =ᵈ Bernoulli(p). Then

    CDF_Y(y) ≝ P(Y ≤ y) = 0,      y < 0,
                           1 − p,  0 ≤ y < 1,
                           1,      y ≥ 1.

You know that P(Y ≤ 0) = P(Y < 0) + P(Y = 0) = 1 − p. However, the drawing is
incomplete without a hole at each jump; with bullets and holes it is unambiguous. The reason some
variables are called continuous is that the CDF of a continuous variable is
continuous. You can recover the PMF by computing the jumps.
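Recovering the PMF from the jumps of the CDF can be sketched directly. The function name `cdf_bernoulli` and the value p = 0.3 are my own choices for illustration:

```python
def cdf_bernoulli(y, p):
    # CDF of Bernoulli(p): a jump of size 1-p at 0 and a jump of size p at 1.
    if y < 0:
        return 0.0
    if y < 1:
        return 1 - p
    return 1.0

p = 0.3
# The jump at each point recovers the PMF there: CDF(y) - CDF(y-).
jump_at_0 = cdf_bernoulli(0, p) - cdf_bernoulli(-1e-9, p)       # should be 1 - p
jump_at_1 = cdf_bernoulli(1, p) - cdf_bernoulli(1 - 1e-9, p)    # should be p
```

The two jumps are 1 − p and p, i.e. exactly PMF(0) and PMF(1).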
Remark. What is the relationship between these two statements: X =ᵈ Y
and X = Y? The first says: X has the same distribution as Y. The second says:
X is equal to Y, that is, X − Y = 0. But X − Y is a random variable; in
what sense is it equal to 0? With probability 1. You have to interpret random
variables very precisely. So what is the relationship between the statements
X =ᵈ Y and “X = Y with probability 1”? The implication is the following:
X = Y ⇒ X =ᵈ Y. The first is an identity.
Example. Let X ∼ 𝒩(0, 1). I want to find the distribution of −X. The
transformation is very simple. The answer is:
X =ᵈ −X, but X ≠ −X. Let t ∈ R. Then

    CDF_{−X}(t) ≝ P(−X ≤ t) = P(X ≥ −t) = ∫_{−t}^{∞} PDF_X(x) dx
                = ∫_{−t}^{∞} 𝜙(x) dx = ∫_{t}^{−∞} 𝜙(−u)(−du)
                = ∫_{−∞}^{t} 𝜙(−u) du = ∫_{−∞}^{t} 𝜙(u) du = CDF_X(t),

using the substitution u = −x and the symmetry 𝜙(−u) = 𝜙(u).
The two CDFs are equal, so −X =ᵈ X. However, X ≠ −X. Suppose to the
contrary that this statement is false, i.e. X = −X. This means that 2X = 0
and X = 0 with probability 1, which is not true because X is Gaussian.
The vanilla equals sign is the most powerful operator in mathematics. If I put a
topping on it, like =ᵈ, it is no longer vanilla.
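The distributional equality −X =ᵈ X can be seen in simulation by comparing empirical CDFs. A hedged sketch; the helper name `ecdf`, the sample size and the checkpoints are my own choices:

```python
import random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100_000)]

def ecdf(sample, t):
    # Empirical CDF: fraction of draws at or below t.
    return sum(1 for v in sample if v <= t) / len(sample)

neg_xs = [-v for v in xs]   # draws of -X from the same experiment

# The empirical CDFs of X and -X should agree up to sampling noise.
gaps = [abs(ecdf(xs, t) - ecdf(neg_xs, t)) for t in (-1.0, 0.0, 1.0)]
```

The gaps shrink at the usual 1/√n rate, while of course each individual draw satisfies x ≠ −x.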
Expectation

Let X be a random variable. The expected value of X is defined to be

    E X ≝ ∫_{x∈supp X} x PDF_X(x) dx   if X is continuously distributed,
          ∑_{x∈supp X} x PMF_X(x)      if X is discrete.

The learning will not happen if you do not work it out yourself. The best
thing is to ask a question in class; it benefits everybody. Learning literally
happens when you make mistakes, unless you are Isaac Newton. For a Bernoulli
variable, X takes only 0 and 1, and on average it takes the value p, although
p itself is never taken.
There is a very classic book by Peter Whittle, “Probability via Expectation”.
Definition. Let X be a random variable. Let h be a function R → R.
The randomness in h(X) is coming from X. You can simplify
probability calculations by the use of clever conditioning. Then

    E h(X) ≝ ∫_{x∈supp X} h(x) PDF_X(x) dx   if X is continuously distributed,
             ∑_{x∈supp X} h(x) PMF_X(x)      if X is discrete.
In principle, if Y ≝ h(X), then E h(X) = E Y = ∫_{supp Y} y PDF_Y(y) dy.
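The two routes to E h(X), summing h(x) against the PMF of X versus first deriving the distribution of Y = h(X), can be checked on a tiny discrete example. A sketch with made-up probabilities:

```python
# A discrete X with PMF on {-1, 0, 1} (values chosen for illustration).
pmf_x = {-1: 0.25, 0: 0.5, 1: 0.25}
h = lambda x: x * x

# Route 1: sum h(x) * PMF_X(x) over supp X.
e_h_x = sum(h(x) * p for x, p in pmf_x.items())

# Route 2: build the PMF of Y = h(X), then sum y * PMF_Y(y) over supp Y.
pmf_y = {}
for x, p in pmf_x.items():
    pmf_y[h(x)] = pmf_y.get(h(x), 0) + p
e_y = sum(y * p for y, p in pmf_y.items())
```

Both routes give the same number, here 0.5, as the "in principle" remark promises.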
Remark. Let X be a random variable. Then what is E I_A(X)? The
randomness comes from X. The indicator is 1 if X lies in A. Then

    E I_A(X) = ∫_{x∈supp X} I_A(x) PDF_X(x) dx
             = ∫_R I_A(x) I_{supp X}(x) PDF_X(x) dx
             = ∫_A I_{supp X}(x) PDF_X(x) dx = P(X ∈ A).

If something is not clear, look at this object: ∫_R I_A(x) I_{supp X}(x) PDF_X(x) dx.
It is equal to

    ∫_{supp X ∪ (R∖supp X)} I_A(x) I_{supp X}(x) PDF_X(x) dx
        = ∫_{supp X} I_A(x) I_{supp X}(x) PDF_X(x) dx
        + ∫_{R∖supp X} I_A(x) I_{supp X}(x) PDF_X(x) dx.

The indicator functions are your friends, so use them a lot. Things that lie
outside the support contribute zero probability.
Expectation is a linear operator.
Definition. If X is a random variable,

    Var X ≝ E(X − E X)².

The mean measures the average location, and the variance measures the
spread. The variance is called a measure of dispersion.
Lecture (--)
If X =ᵈ 𝒩(0, 1), then E X = ∫_{−∞}^{∞} x𝜙(x) dx = 0. Then Var X ≝ E(X −
E X)² = E X² − (E X)².
What is this: Y =ᵈ 𝒩(μ, σ²)? This is a Gaussian random variable with
mean μ and variance σ². This happens if and only if supp Y = R and

    PDF_Y(y) = (1/σ) 𝜙((y − μ)/σ) = (1/(√(2π) σ)) e^{−(y−μ)²/(2σ²)},  y ∈ R.
Now consider the Cauchy distribution, with density PDF_X(x) = 1/(π(1 + x²)), x ∈ R.
This is an example of a fat-tailed, or heavy-tailed, distribution. The reason
Gaussian integrals are finite is the fact that the integrand decreases very
rapidly; here the tails go to zero, but not as fast. Why are heavy tails useful?
Extreme events. If you want to model extreme events,
massive financial crises, the normal distribution is not OK. The Cauchy
distribution is immensely useful for rare phenomena: earthquakes, financial
crises, shocks. How would you compute its mean?

    E X = ∫_{x∈supp X} x PDF_X(x) dx = ∫_{−∞}^{∞} x · (1/π) · 1/(1 + x²) dx
        = (1/(2π)) log(1 + x²) |_{−∞}^{+∞} = ∞ − ∞,

which is undefined: the Cauchy distribution has no mean.
Quantiles
Let X be a random variable. Let α ∈ (0, 1). An αth quantile (or 100·α-th
percentile) of X is a number Q_X(α) ∈ R, which could lie anywhere on the
real line, that satisfies two properties:

1. P(X ≤ Q_X(α)) ≥ α,
2. P(X ≥ Q_X(α)) ≥ 1 − α.

Some quantiles have names attached to them. Remark: Q_X(0.5) is called
the median of X, med X. Q_X(0.25) is the first quartile, Q_X(0.75) is the
third quartile. Quantile regression is a great alternative to linear
regression.
Example. Let X =ᵈ Bernoulli(0.5). Its support is {0, 1}, and the probability
of success is 0.5: we are tossing a fair coin. Question: find the
median of X. When everything else fails, apply the definition. Let us check
every number. Let q < 0. Can any negative number be the median of X?
P(X ≤ q) = 0. However, this probability should be at least 0.5. How about
q > 1? P(X ≤ q) = 1, but P(X ≥ q) = 0. So no number below 0 or above 1
is the median. Now, let us pick any number between 0 and 1. Let q = 0.
Then P(X ≤ 0) = P(X < 0) + P(X = 0) = 0.5. So zero satisfies the first
condition. Next, P(X ≥ 0) = P(X = 0) + P(X > 0) = 0.5 + 0.5 = 1. If you
want to find something out and you get confused, use set theory: P(X >
0) = P(X ∈ (0, ∞)) = P(X ∈ {1}) + P(X ∈ (0, ∞) ∖ {1}) = 0.5 + 0 = 0.5.
Who said “zero”? Jesus, I am ageing rapidly.
Human beings are stupid. We are doing s*** because of history.
You can check that 1 is also a median. In general, let q ∈ (0, 1). Then
P(X ≤ q) = 0.5 and P(X ≥ q) = 0.5. So in this problem, Q_X(0.5) =
med X = [0, 1]: every point of [0, 1] is a median.
Quantiles might exist, but they are not necessarily uniquely defined. The
median of the standard Cauchy distribution is 0. There are ways of redefining
quantiles so that unique values exist.
Quantiles have some remarkable properties. Let g be an increasing
function on R. What does it mean? Well, it is increasing. Is there any
relationship between the quantiles of g(X) and those of X? For increasing g,

    Q_{g(X)}(α) = g(Q_X(α)).

This is a hugely powerful property that expectations do not obey: in general,
E g(X) ≠ g(E X). This is the equivariance property of quantiles.
Example. Let us use a set of data. Consider the following numbers:
, , . So far, you have not seen any data. There you go, you see data
in math camp. It is a dataset. In sets, duplicates are not allowed, but in
datasets, they are. What is the median? . Very good. Why? What is the
th percentile of this dataset?
Now you have another dataset: , , , . What is its median?
What should you do to answer the question statistically?
Consider the data: a, a, b, b, c, d, e, f, f, g. Each time, you get an outcome.
People have made choices. What is the median choice? How do we
answer this question? Consider this dataset:

    I, b, #, p, ♣, ♣, b

You should not stick to that s*** middle-school rule. It has no statistical
justification. It is entirely arbitrary.
Again, consider this assignment. If the data are , , , , define

    X ≝ 1, prob. 1/4,
        2, prob. 1/4,
        3, prob. 1/4,
        4, prob. 1/4.

Can 23 be the median outcome? P(X ≤ 23) = 0.6, P(X ≥ 23) = 0.8. So 23 is
a candidate, and the corresponding outcome is a. As long as the mapping
is one-to-one, it should be right.
Try the following mapping:

    Y ≝ 23,   prob. 0.4,
        58.5, prob. 0.4,
        100,  prob. 0.2.

In this case, P(Y ≤ 23) = 0.4, and a is not the median anymore. How you
do the coding matters. X and Y are different random variables, so this is
why the medians differ.
However, in some cases there might be no ambiguity, for instance if half or more of
the outcomes are the same. Some statistical methods are not invariant to
the choice of coding.
Remember that a quantile is not always a function. However, there is a way
to adjust the definition. If you analyse the definition of the quantile, you
will see that the empirical CDF has flat areas. However, for well-known
continuous statistical distributions with an increasing CDF the quantiles
are unique. Otherwise, you can use the generalised inverse.
Lecture (--)

Whenever you see a problem, the result is important.
Asking about the 0th quantile of a random variable does not give any
additional information about this variable. The only way to check is to
try any number: P(X ≤ q) ≥ 0, P(X ≥ q) ≥ 1. Suppose q < a, where a is
the smallest support point. What is the probability of being less than or equal
to q? Zero. What is the probability that X ≥ q? One. So q qualifies as the
0th quantile, and so does a. So really Q_X(0) = (−∞, a]. This is really to
illustrate that the 0th percentile is a silly thing.
If X =ᵈ Bernoulli(p), then p is the probability of success. When p = 0,
the probability of failure is 1: the measure puts mass 1 at the point 0. So a Bernoulli(0)
random variable takes the value 0 with probability 1; it is really a
constant random variable. If q < 0, P(X ≤ q) = 0: this is not a median. If
q > 0, P(X ≥ q) = 0: this is not a median. But q = 0 qualifies as the 0th
quantile.
Now for the quantile symmetry property. Show that Q_{−X}(α) = −Q_X(1 − α);
equivalently, −Q_{−X}(α) = Q_X(1 − α). Let us apply the definition.
Conditioning
Conditioning is the really important topic in econometrics. This is
probably the most important part of the course. Suppose we have two
random variables, X and Y. What is Cov(X, Y)? It is

    Cov(X, Y) ≝ E(X − E X)(Y − E Y) = E XY − E X E Y.

For discrete variables, joint probabilities come from the joint PMF:

    P(X ∈ A, Y ∈ B) = ∑_{x∈A} ∑_{y∈B} PMF_{X,Y}(x, y).

Now look at the inner integral in the continuous case. You integrate the joint
density over a region to get a probability, and the framed expression is the
marginal density of X. Therefore,

    PDF_X(x) = ∫_{y∈R} PDF_{X,Y}(x, y) dy,  x ∈ R.

In the same vein, PDF_Y(y) = ∫_{x∈R} PDF_{X,Y}(x, y) dx. If two variables are
stochastically independent, you can multiply the marginal densities to get the joint
density, but not otherwise.
Conditional probabilities. Let A, B be events from a sample space S;
we briefly go back to events. Suppose that P(B) > 0. Then, by definition,

    P(A | B) ≝ P(A ∩ B) / P(B).

Some books write A/B, but it is not this slash /, not this backslash ∖, and
no other symbol. This definition was given by the Reverend Thomas Bayes.
Why is conditioning important? What does conditioning do? Suppose
Y and X are random variables. Also suppose that Y is not a deterministic
function of X, so you cannot write Y = g(X). Part of the randomness in Y
can come from X, and part of its randomness is inherent. How would Y
behave if you stopped the randomness in X?
If both X and Y are discrete, I am going to define the conditional
probability mass function. The conditional probability mass function of Y | X = x is

    PMF_{Y|X=x}(y) ≝ P(Y = y, X = x) / P(X = x),  y ∈ supp Y, x ∈ supp X.

You can talk about the mean, the variance and the quantiles of conditional
distributions.
Now let us generalise this slightly. You know what a vector is? It is
basically a bunch of numbers with an order attached to it. It is important
that a vector is a column vector; the standard mathematical convention
is a column vector. We can generalise this definition. A random vector is
just a column vector of random variables. Say, (X, Y, Z)′ is a random column
vector with random variables as its elements. Now, if W is a random vector,
I do not need to write W⃗ or boldface W; in research papers, you do not write
in boldface or with arrows. If dim W = 3 × 1, then W = (W⁽¹⁾, W⁽²⁾, W⁽³⁾)′. However, I do not
use the notation W₁, because the subscript indexes observations.
Let X be a 2 × 1 random vector. Let X₁, X₂, . . . be identical copies of X,
so X₁ =ᵈ X, X₂ =ᵈ X, X_n =ᵈ X. Then X₁ = (X₁⁽¹⁾, X₁⁽²⁾)′. Mathematics is a universal
language, but even there, people have preferences.
You can find means, variances and quantiles of conditional distributions.
But let us focus on the conditional mean of Y | X. It has several useful
properties.
Another definition. Let Y be a random variable. Let X be a random
vector. The conditional expectation function of Y | X = x is defined to
be

    CEF : x ↦ E(Y | X = x) ≝ ∫_{y∈supp Y} y PDF_{Y|X=x}(y) dy,  x ∈ supp X,

or, in the discrete case,

    E(Y | X = x) ≝ ∑_{y∈supp Y} y PMF_{Y|X=x}(y),  x ∈ supp X.
By the same logic, E(Y | X = 2) = ∑_{y∈{0,1}} y PMF_{Y|X=2}(y) = 0 · PMF_{Y|X=2}(0) +
1 · PMF_{Y|X=2}(1) = 3/4. Finally, E(Y | X = 3) = ∑_{y∈{0,1}} y PMF_{Y|X=3}(y) =
If you have a question, then ask your question loudly. The
members of parliament never address each other: Mister
Speaker... It has positive externalities, and a private question
has zero welfare benefits.
This is read as “the conditional expectation of Y given X”. And this is a
random variable! This random variable has remarkable properties. How
about we list these properties? We shall skip the proofs. These properties
are of supreme interest.
Properties of conditional expectations.
Read section .
Lecture (--)
The result in problem is used in the IV regression. You can have a
look at it.
Let us look at properties of conditional expectations. These properties
follow from conditional expectation functions: you first compute
the conditional expectation function, and then plug in the random variables.
Let Y be a random variable. Let X be a random vector. Then the
random variable E(Y | X) has the following properties:

1. E(Y | X) is a function of X. The source of randomness here is X.
2. The conditional expectation operator is a linear operator: E(A + B |
X) = E(A | X) + E(B | X), where A, B are random variables.
3. E(c | X) = c if c is a constant.
4. The reason why conditional expectations were invented: the intuitive
meaning of E(Y | X) is that the randomness of X has been
stopped. This means that E(g(X)Y | X) = g(X)E(Y | X). Constants
come outside the expectation. This has a non-standard name: the
useful rule.
5. This property has a standard name: the law of iterated expectations.
You can take the expectation of this random variable: E_X[E_{Y|X}(Y |
X)] = E[E(Y | X)] = E Y. There is an inner expectation and an
outer expectation, and you get the marginal expectation of Y. If you
abbreviate it, you get LIE, so it becomes the “law of iterated
mathematical expectations”, LIME.
6. E(Y | X) = E(Y | h(X)) if h is injective, h : supp X → supp X. This
is a one-to-one, or injective, function. Definition: ∀x₁, x₂ ∈
supp(X), h(x₁) = h(x₂) ⇒ x₁ = x₂.
7. Let g be a function, not necessarily one-to-one. I am going to do
repeated conditioning (not iterated conditioning): E[E(Y | X) | g(X)] =
E(Y | g(X)). There are two conditionings. This is used in econometrics
all the time. In general, when you condition on less information,
the outcome is determined by the smaller information set. This is
called the “smallest information wins” property.

All of this can be proved. Whatever properties integrals have, they are
inherited by the conditional expectation operator. These results hold in
great generality. You can solve a difficult probability problem if you split
the problem into several sources of randomness by conditioning, and then
work with the remaining randomness.
Remark. In terms of the conditional expectation function, E(g(X)Y | X =
x) = E(g(x)Y | X = x) = g(x)E(Y | X = x).
Conditional expectation has a couple of interpretations; depending
on the application, you use the interpretation you need. One way
to interpret it: given the information on X, you try to predict the value
of Y. This is the best possible predictor of Y given X.
If you have two values, X and g(X), which contains more information?
X. If g(0) = 3 and g(1) = 3, then you reduce information. Applying a
function is a loss of information. The only time such loss does not occur is
when the function is injective.
Example of property 7. This property tells you a useful way of
discarding information. You may or may not like this example. Let N be the
number of nuclear warheads in North Korea. Kim Jong Un has a production
function. You want to use the available information given the information
on capital and labour (say, the number of cars and people). Unfortunately,
their assembly units are located underground and the data on K are
missing. You want to have E(N | K, L), but how do you statistically remove
the information about K? Remarkably, E[E(N | K, L) | L] = E(N | L).
In econometric terms, this is how you incorporate the effect of omitted
variables. Once you have omitted a variable, you get a short model. This is
how you do it: you condition the long model on the short information set.
What is the effect of endogeneity due to the omitted variable? You use
this property. This does not mean that you erase the variable; you are
conditioning on smaller information.
Example. Suppose E(Y | X) = α₀ + α₁X + α₂X². Can you tell me
E(Y | X, X²)? Let h(x) = (x, x²); h maps x into two things. Is it a
one-to-one function? Suppose h(x₁) = h(x₂). This means that x₁ = x₂ and
x₁² = x₂². By construction, this is an injective transformation; any mapping
of the form x ↦ (x, K(x)) is one-to-one by construction. By property 6, this is
really E(Y | X, X²) = E(Y | h(X)) = E(Y | X). This shows how we should
write regression models; this is the parsimonious way of writing them.
Now suppose E(Y | X) = α₀ + α₁X + α₂X². What is E(Y | X²)? Do
you agree that E(Y | X²) = E[E(Y | X) | X²]? This is by property 7:
E(α₀ + α₁X + α₂X² | X²) = α₀ + α₁E(X | X²) + α₂X². And the middle
term has to be determined: more information has to be given to allow the
simplification.
Result. Let Y be a random variable. Let X be a random vector. Think
of E(Y | X) as predicting Y given the information on X. I claim that this
is the best predictor of Y given X.
First approach. Let U ≝ Y − E(Y | X). What is this? This is a mismatch:
the prediction error, the mistake that you make.
1. The following is true: E(U | X) = E[Y − E(Y | X) | X] = E(Y | X) −
E[E(Y | X) | X] = E(Y | X) − E(Y | X) = 0, by linearity and the fact that
E(Y | X) is a function of X. Taking expectations and using the law of iterated
expectations, E U = 0: when you predict a variable with its conditional expectation,
the expected prediction error is zero. However, U will never be 0 with probability 1,
because that would imply that Y = E(Y | X) with probability 1, but the latter
is a function of X!
2. Statistically, when would you conclude that E(Y | X) is the best predictor?
It uses information on X, and it would be best if it used up all the
information that X has to give. So the prediction error should not contain
any information about X: U should be uncorrelated with any function
of X. And indeed U is uncorrelated with any function of X. Let g be any
function of X. Then Corr(U, g(X)) = Cov(U, g(X)) / (sd U · sd g(X)). The
standard deviations are positive, so the correlation is zero if and only if the
covariance is zero: Cov(U, g(X)) = E U g(X) − E U · E g(X). Now, E U = E[E(U | X)] = 0
by the law of iterated expectations, and E U g(X) = E[E(U g(X) | X)] =
E[g(X)E(U | X)] = 0 by the useful rule. We see that the population residual has no
information on X in it.
Example. Let Y =ᵈ Bernoulli(p). Find the CDF of Y. Solution: let y ∈ R.
Then CDF_Y(y) ≝ P(Y ≤ y). It is

    CDF_Y(y) = 0,      y < 0,
               1 − p,  0 ≤ y < 1,
               1,      y ≥ 1.

If you get confused, follow the mathematics. Just do not forget to put the
bullets and the holes.
Lecture (--)

Let Y be a random variable and let X be a random vector. Recall from
the last time that U ≝ Y − E(Y | X) has two properties: (1) E(U |
X) = 0, and we used this property to show that (2) U is uncorrelated with
every function of X! This is such a good predictor!
It has several interpretations. It makes sense to call E(𝑌 | 𝑋) the be
predior. Do you agree that 𝑌 = E(𝑌 | 𝑋) + 𝑌 − E(𝑌 | 𝑋). You can call
⏟ ⏞ ⏟ ⏞
the fir
part predied by 𝑋, and the latter is the part that cannot be
predied by 𝑋. So we rewrite is as 𝑌 = E(𝑌 | 𝑋) + 𝑈 . This looks like
a regression model. There you go! This is the
ati
ical foundation of a
correly specified regression model.
There is a very useful definition that you will use throughout your career. Let A be a random variable and B a random vector. If E(A | B) = 0, then B is said to be exogenous with respect to A. Exogeneity implies zero correlation: Corr(A, B) = 0. The property E(U | X) = 0 can thus be restated as "X is exogenous with respect to U". Notice that the converse is not true: zero correlation does not imply exogeneity. You will always have endogenous and exogenous variables, and you must specify with respect to what.
Let us look at another interpretation of the conditional expectation as the best predictor of Y given the information in X. This is a classic regression: you want to predict log wage given education, gender, race, etc. Y is a random variable, and you want to predict it using some function of X. The predictor is not allowed to be a function of Y, because then the prediction error would be trivially zero. The prediction error is Y − g(X). What function should you pick? Ideally, one that minimises the prediction error. The easiest way to make the error positive is |Y − g(X)|. The problem is that this is a random object, and we want to remove the randomness in it:

min over g of E|Y − g(X)|.
This is an optimisation problem: search across all functions of X! The objective is called the mean absolute error. Now we have to solve this optimisation problem. It is quite hard, but it has a unique solution. (The only way an integral of a non-negative function can be zero is for the integrand to be zero: if f ≥ 0, then ∫f = 0 ⇔ f = 0.) Searching across all functions of X, the optimal solution turns out to be med(Y | X), the conditional median. However, this optimisation problem is difficult. It is the basis for quantile regression, but that requires a much more fundamental understanding. Purely for the sake of computation we shall not use this objective: the absolute value has a kink at the origin that does not allow calculus, and this causes problems. What is the easiest way to fix this? How do you remove the kink? You should smooth the function at the origin.
How about we minimise the mean squared error instead:

min over g of E[(Y − g(X))²].
The unique solution to this problem is E(Y | X). This is the basis for mean regression. These objects have different properties. The conditional median is the best prediction in the sense that it minimises mean absolute error. The conditional expectation is the best predictor in the sense that it minimises mean squared error, so the meaning of the word "best" changes. Quantile regression is a hugely interesting problem, but it takes pages of proof.
Claim. E(Y | X) really is the minimiser of mean squared error across all functions of X. To see this, decompose E[(Y − g(X))²] = E[(Y − E(Y | X))²] + E[(E(Y | X) − g(X))²] (the cross term vanishes because U is uncorrelated with functions of X). The first part does not depend on g. The second part is zero if and only if g = E(Y | X), and this gives the unique solution! This is a remarkable result because you are minimising over the set of all functions of X. This set is infinite-dimensional: there are uncountably many functions of X. So the conditional expectation is sometimes written as BP(Y | X), the best predictor.
Once you have the conditional density, you can compute the conditional expectation. However, only nature knows the true conditional expectation. Estimation can be done, but it requires non-parametric methods. So, how about lowering our expectations? Instead of best predictors, let us use the best linear predictor. You are still minimising, but instead of searching across all functions, you search only across linear functions of X.
The proof is quite elementary: you can derive these properties using pure algebra. It can be shown that the best linear predictor of Y given X is

BLP(Y | X) = β₀ + X′β₁,

where β₀ = EY − (EX)′β₁ and β₁ = (Var X)⁻¹ Cov(X, Y). It is very easy to estimate the population means and covariances. If X is a vector, the standard mathematical convention is that it is a column vector. Whatever the dimension of X, this stuff works: these definitions are dimension-independent. Here Var X is a matrix and Cov(X, Y) is a vector.
Remark. Let A be an m × 1 random vector (any dimension times one). Then

Var A ≝ E[(A − EA)(A − EA)′],

where A − EA is m × 1 and its transpose is 1 × m, so Var A is m × m. Hence Var X = E(XX′) − (EX)(EX)′ and Cov(X, Y) = E(XY) − (EX)(EY). Do the dimensions match? Var X is a matrix and Cov(X, Y) is a vector, so everything matches.
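A quick numerical sketch of these formulas; the joint distribution below (Y = X₁² + 0.5X₂ + noise) is an invented illustration. We estimate β₁ = (Var X)⁻¹Cov(X, Y) and β₀ from a large simulated sample and check that the BLP residual is uncorrelated with the coordinates of X, even though the truth is nonlinear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a large IID sample with a nonlinear truth, so BLP != BP here.
n = 200_000
X = rng.normal(size=(n, 2))                      # X is a 2-dimensional random vector
Y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(size=n)

# beta1 = (Var X)^{-1} Cov(X, Y),  beta0 = EY - (EX)' beta1
V = np.cov(X, rowvar=False)                      # Var X, a 2x2 matrix
c = np.array([np.cov(X[:, i], Y)[0, 1] for i in range(2)])   # Cov(X, Y), a vector
beta1 = np.linalg.solve(V, c)
beta0 = Y.mean() - X.mean(axis=0) @ beta1

# The BLP residual is uncorrelated with every *linear* function of X.
V_res = Y - (beta0 + X @ beta1)
print(beta1, np.corrcoef(V_res, X[:, 0])[0, 1])
```

Because Cov(X₁, X₁²) = 0 for a standard normal X₁, the estimated β₁ is close to (0, 0.5), yet the residual still carries the nonlinear part X₁² − 1: uncorrelatedness holds only against linear functions.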
I want to compare BPs and BLPs.

  BP                                        | BLP
  BP(Y | X) = E(Y | X)                      | BLP(Y | X) = β₀ + X′β₁, with β₀ = EY − (EX)′β₁, β₁ = (Var X)⁻¹Cov(X, Y)
  U ≝ Y − E(Y | X) is the population        | V ≝ Y − BLP(Y | X) is the population
  residual from the best prediction problem | residual from the best linear prediction problem
  E(U | X) = 0                              | BLP(V | X) = 0
  U is uncorrelated with every function     | V is uncorrelated with every linear
  of X                                      | function of X

The best prediction of U given X is zero; for V, the analogous statement holds only for linear functions. The population residual V carries no information about linear functions of X only, which is a much weaker property. Moreover, there is no analogue of the useful rules for conditional expectations for BLPs, because BLPs are invariant only with respect to linear transformations of X.
The conditional variance is the function

x ↦ Var(Y | X = x) ≝ E[(Y − E(Y | X = x))² | X = x] = E(Y² | X = x) − (E(Y | X = x))².

If this function is not constant in x, then Y is said to be heteroskedastic.
Another thing. Let Y be a random variable and X a random vector. Then Y = E(Y | X) + Y − E(Y | X), as before.

  The more you write, the less I am convinced that you know. Do not write two pages of the state of your mind.

When X is a binary variable D,
this is E(Y | D) = D·E(Y | D = 1) + (1 − D)·E(Y | D = 0). If D = 0, the RHS equals (1 − 0)·E(Y | D = 0), and if D = 1, the RHS is 1·E(Y | D = 1). Since E(Y | D) = D·E(Y | D = 1) + (1 − D)·E(Y | D = 0), multiplying by D gives D·E(Y | D) = D²·E(Y | D = 1) + D(1 − D)·E(Y | D = 0). Now look at D²: it is zero if and only if D = 0, and one if and only if D = 1, so D² = D. However, D(1 − D) is always zero. Therefore D·E(Y | D) = D·E(Y | D = 1). Now take expectations on both sides: E[D·E(Y | D)] = E[D·E(Y | D = 1)] = (ED)·E(Y | D = 1) = P(D = 1)·E(Y | D = 1). So

E(Y | D = 1) = E[D·E(Y | D)] / P(D = 1) = E(YD) / P(D = 1).
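The identity E(Y | D = 1) = E(YD)/P(D = 1) can be sanity-checked by simulation; the particular distribution of (Y, D) below is an invented illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# D is Bernoulli(0.3); Y depends on D so that E(Y | D = 1) = 2.
n = 100_000
D = (rng.random(n) < 0.3).astype(float)
Y = 2.0 * D + rng.normal(size=n)

lhs = Y[D == 1].mean()            # direct conditional average
rhs = (Y * D).mean() / D.mean()   # E[YD] / P(D = 1)
print(lhs, rhs)
```

The two estimators are algebraically the same average, so they agree up to floating-point noise, and both are close to the true value 2.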
QED is quod erat demonstrandum: "what we were trying to show". Euclid, famously, wrote everything explicitly: he stated the claims and, at the end, summarised.
Exercise (i). Find E(Y | X ∈ S). What is the conditional expectation of Y given that X ∈ S? This might seem strange: so far we have been conditioning on other variables, but how do you condition on an event?
Let D ≝ I_S(X), the indicator that X ∈ S. How many values does D take? Two, like a Bernoulli random variable: 1 if X ∈ S and 0 if X ∉ S. So E(Y | X ∈ S) = E(Y | D = 1). The point is to write this object as a conditional expectation. Then, by the previous problem, this is E(YD) / P(D = 1). Do you agree that this is nothing but

E[Y·I_S(X)] / P(D = 1) = E[E(Y·I_S(X) | X)] / P(D = 1) = E[I_S(X)·E(Y | X)] / P(D = 1)?

Now you have to fiddle around so that the answer appears in the required form. You should learn from this problem how to write events as random variables.
Exercise (ii). Censored or truncated regressions are called Tobit models. When you use the margins command in Stata, it is using the result from this problem. Let Y ~ N(μ, σ²). Show that

E(Y | a < Y < b) = μ + σ · [φ((a − μ)/σ) − φ((b − μ)/σ)] / [Φ((b − μ)/σ) − Φ((a − μ)/σ)],

where φ(x) ≝ (1/√(2π)) e^{−x²/2} and Φ(t) ≝ ∫_{−∞}^{t} φ(x) dx, t ∈ R.
First of all, E(Y | a < Y < b) = E(Y | Y ∈ S) where S ≝ (a, b). But from part (i), we know that

E(Y | X ∈ S) = (1 / P(X ∈ S)) ∫_{x∈S} E(Y | X = x) PDF_X(x) dx.
Now,

∫_a^b x · (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx = ∫_a^b (x − μ + μ) · (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx
  = ∫_a^b (x − μ) · (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx + μ ∫_a^b PDF_Y(x) dx.

Let t = (x − μ)²/(2σ²); then dt = ((x − μ)/σ²) dx, so σ² dt = (x − μ) dx. The first integral is then equal to

∫_{(a−μ)²/(2σ²)}^{(b−μ)²/(2σ²)} (1/(√(2π)σ)) e^{−t} σ² dt = −σ · (1/√(2π)) e^{−t} |_{(a−μ)²/(2σ²)}^{(b−μ)²/(2σ²)}
  = σ · (1/√(2π)) ( e^{−(a−μ)²/(2σ²)} − e^{−(b−μ)²/(2σ²)} ) = σ ( φ((a−μ)/σ) − φ((b−μ)/σ) ).
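The truncated-mean formula can be checked against a Monte Carlo average; the values of μ, σ, a, b below are illustrative choices, not from the lecture.

```python
import math
import random

# phi and Phi as defined in the exercise (Phi via the error function).
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))

mu, sigma, a, b = 1.0, 2.0, 0.0, 3.0
alpha, beta = (a - mu) / sigma, (b - mu) / sigma
formula = mu + sigma * (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha))

# Monte Carlo: average of normal draws that land in (a, b).
random.seed(0)
draws = [random.gauss(mu, sigma) for _ in range(400_000)]
kept = [y for y in draws if a < y < b]
mc = sum(kept) / len(kept)
print(formula, mc)
```

The simulated conditional mean agrees with the closed form to Monte Carlo accuracy.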
You are here to learn. You should be learning every day of the week, apart from the hours you need for sleep. Welcome to grad school! It is work, work, work. Life is tough.
The point is not the algebra; the point is the concept.
Let us move on to another topic.
Independence
This topic is called independence. It is a concept that is really remarkable, formalised by Kolmogorov, and it exists to make probability calculations simpler. It may or may not hold.
Suppose that A and B are events of some sample space. We say that A and B are independent, and write A ⊥⊥ B, if

P(A ∩ B) = P(A)P(B).

If some events are independent, do they have to be disjoint, and vice versa? If A ∩ B = ∅, does it imply A ⊥⊥ B? Let X ~ N(0, 1), A ≝ (X > 0) and B ≝ (X ≤ 0). Then A ∩ B = ∅. Apply the definition of the concept: P(A) = 0.5 and P(B) = 0.5, but P(A ∩ B) = P(∅) = 0 ≠ P(A)P(B) = 0.25. The other way around is also not true. Independence is a property of probabilities and has nothing to do with set theory. Sometimes this concept is called stochastic independence.
What if you had three events? You would then say that independence should hold for every subcollection. We say that A, B, C are mutually independent if P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C), and P(A ∩ B ∩ C) = P(A)P(B)P(C). This definition is the culmination of a huge amount of human thought.
When can we claim that two random variables are independent? When the events they generate are independent.
Definition. Let X and Y be random variables or random vectors. We say that X is independent of Y (and write X ⊥⊥ Y) if P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) for all A ⊂ R^{dim X}, B ⊂ R^{dim Y}.
Result. It can be shown that

X ⊥⊥ Y ⇔ PDF_{X,Y}(x, y) = PDF_X(x) PDF_Y(y) ∀(x, y) ∈ supp(X, Y).

Another result: X ⊥⊥ Y ⇒ Corr(X, Y) = 0, because under independence joint probabilities factor into products of probabilities, so E(XY) = (EX)(EY).
For example, let A ≝ {X ∈ (1, ∞)} and B ≝ {Y ∈ (−∞, 0.25)}. Then, under independence, P(A ∩ B) = P(A)P(B).
Lecture ( - )
Let us discuss the relationship between independence and conditioning. Let X and Y be random variables. If X ⊥⊥ Y, there are two equivalent ways to read this: P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) ∀A, B ⊂ R, or PDF_{X,Y} = PDF_X · PDF_Y.
Suppose X ⊥⊥ Y. What is the conditional density PDF_{Y|X}? It is

PDF_{Y|X} = PDF_{X,Y} / PDF_X = (PDF_Y · PDF_X) / PDF_X = PDF_Y.

If Y and X are independent, they give no information about one another.
What about the other way? If PDF_{Y|X} = PDF_Y, then PDF_{X,Y} = PDF_{Y|X} · PDF_X = PDF_Y · PDF_X. So we have the equivalence:

Y ⊥⊥ X ⇔ PDF_{Y|X} = PDF_Y.
Claim: A ⊥⊥ B ⇔ A^C ⊥⊥ B ⇔ A ⊥⊥ B^C ⇔ A^C ⊥⊥ B^C.
1. A ⊥⊥ B ⇒ A^C ⊥⊥ B. Now B = (B ∖ A) ∪ (B ∩ A), so P(B) = P(B ∖ A) + P(B ∩ A) = P(B ∩ A^C) + P(B ∩ A). Therefore P(B ∩ A^C) = P(B) − P(B ∩ A), and by independence this is P(B) − P(B)P(A) = P(B)(1 − P(A)) = P(B)P(A^C).
2. A^C ⊥⊥ B ⇒ A ⊥⊥ B^C. Do you agree that B = (B ∖ A) ∪ (B ∩ A)? So P(B ∩ A) = P(B) − P(B ∩ A^C). Also, A = (A ∖ B) ∪ (A ∩ B), therefore P(A ∩ B^C) = P(A) − P(A ∩ B). Since P(B ∩ A) = P(A ∩ B), we get P(A ∩ B) = P(B) − P(A^C)P(B) = P(A)P(B), and hence P(A ∩ B^C) = P(A) − P(A)P(B) = P(A)P(B^C).
the useful rule. Now we know the definition:

E(S_n | N = n) = { E[Σ_{i=1}^n X_i | N = n],  n ≥ 1;   0,  n = 0 }
  = { Σ_{i=1}^n E(X_i | N = n),  n ≥ 1;   0,  n = 0 }
  = { Σ_{i=1}^n E(X_i),  n ≥ 1;   0,  n = 0 }  = np,

using the independence of N and the X_i in the last step.

In the other example,

E(X | X² = t) = 0·I(t = 0) + ((p₃ − p₁)/(p₁ + p₃))·I(t = 1) = ((p₃ − p₁)/(p₁ + p₃))·I(t = 1),

and E(X | X²) = ((p₃ − p₁)/(p₁ + p₃))·I(X² = 1).
Lecture ( - )
You will need the properties of Gaussian random vectors in portfolio theory. Gaussian objects possess properties that make them really unique: properties that other variables do not have.
Recall that X is said to be a standard normal (Gaussian) random variable, X ~ N(0, 1), if and only if supp X = R and

PDF_X(x) = φ(x) = (1/√(2π)) e^{−x²/2},  x ∈ R.
Claim. Let a and b be constants. Then aX + b is a Gaussian random variable. What is the expected value of aX + b? It is b, since EX = 0. What is its variance? a² Var X = a². The distribution family remains the same; only the mean and the variance change. The family is preserved under linear transformations: this is called the reproducing property of Gaussian random variables. Try and see if you can prove it: take the CDF of aX + b, differentiate it, and verify.
There is a standard convention in English: if you really like Shakespeare, you write "shakespearean"; if you are not familiar with him, you write "Shakespearean".
Do not make up your own definitions if you are asked whether a vector is Gaussian or not. Before we look at Gaussian random vectors, recall some facts about random vectors.
Let X = (X⁽¹⁾, …, X⁽ᵖ⁾)′ be a p × 1 random vector. Then EX ≝ (EX⁽¹⁾, …, EX⁽ᵖ⁾)′. Note that Y ↦ AY + b is a linear mapping, a linear transformation of Y. Its mean is E(AY + b) = A(EY) + b, and its variance is Var(AY + b) = A(Var Y)A′. Never forget these results.
Gaussian random vector
Let X be a p × 1 random vector. Let μ (p × 1) ≝ E(X) and V (p × p) ≝ Var X. Now we introduce a very strange-looking definition: X is said to be a Gaussian random vector if and only if every linear combination of the coordinates of X is a Gaussian random variable, that is, α′X is a Gaussian random variable for each α ∈ Rᵖ. This definition is really remarkable: it frees you from remembering any formulæ.
If you have a random vector Z, you check every possible linear combination of its coordinates.
Consequence of this definition. Let Z ≝ (X, Y, W)′ be a Gaussian random vector. This means that any linear combination of the components X, Y and W is a Gaussian random variable. (Do not say "elements", because this is not a set.) Can you tell me the distribution of X? It is normal. Why? Because α = (1, 0, 0)′ fits the definition: X = (1, 0, 0)·Z. It is automatically implied that any such combination is Gaussian. If you stack two random variables, it is not necessary that the result is jointly Gaussian: you need many requirements on their joint distribution. In the same vein, W = (0, 0, 1)·Z, and so is X + W = (1, 0, 1)·Z.
Another piece of terminology: if X is a Gaussian random vector, then its coordinates are said to be jointly normal.
Notation. "Z is a Gaussian random vector with mean μ and variance V" is denoted Z ~ N(μ, V). Nice people write it as Z ~ MVN(μ, V).
Result. Let Z be a Gaussian random vector, Z ~ N(μ, V). The dimension of μ is the same as Z's, dim Z × 1, and the dimension of V is dim Z × dim Z. Let A be a matrix of constants and b a vector of constants, where the number of columns of A equals dim Z. Then AZ + b ~ N. We claim that this vector is Gaussian: Gaussianity is preserved. Let's see why.
How can we check this result? First, the outcome AZ + b is clearly a random vector. Then, we must check every linear combination of its coordinates for Gaussianity. Let α ∈ R^{dim b}. Then α′(AZ + b) = α′AZ + α′b = a′Z + c, where a ≝ A′α and c ≝ α′b. But a′Z is just a linear combination of the coordinates of Z, and this is a Gaussian random variable because Z was Gaussian in the first place. So: any linear transformation of a Gaussian random vector is Gaussian.
Therefore, in the example, Y ~ N by the reproducing property of Gaussian random vectors under linear maps. So
Y ~ N( (1 1; 2 −3)·EX,  (1 1; 2 −3)·(Var X)·(1 1; 2 −3)′ ),

which, with EX = (1, 2)′ and Var X = (1 −2; −2 7), gives

(1 1; 2 −3)(1; 2) = (3; −4)  and  (1 1; 2 −3)(1 −2; −2 7)(1 2; 1 −3) = (4 −17; −17 91).
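The matrix arithmetic in this example can be verified mechanically; the moments EX = (1, 2)′ and Var X = [[1, −2], [−2, 7]] are as read from the displays here.

```python
import numpy as np

# Mean and variance of Y = A X under a linear map: A*EX and A*(Var X)*A'.
A = np.array([[1, 1], [2, -3]])
mu = np.array([1, 2])
V = np.array([[1, -2], [-2, 7]])

mean_Y = A @ mu          # A * EX
var_Y = A @ V @ A.T      # A * (Var X) * A'
print(mean_Y, var_Y)
```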
Let us see what we have learned.
Example. Let X ~ N(0, 1) be a standard normal variable. (Note the notational contrast: Y ~ N(0, I) would denote a multivariate normal vector.) Let Y ≝ XR, where R is Rademacher (±1 with probability ½ each) and R ⊥⊥ X.
What is the distribution of Y? Observe that supp Y = R, because X takes all values on the real line. Let us find the CDF of Y. Let y ∈ R. Then

CDF_Y(y) ≝ P(Y ≤ y) = P(XR ≤ y) = P(XR ≤ y, R = −1) + P(XR ≤ y, R = 1)
  = P(−X ≤ y)·P(R = −1) + P(X ≤ y)·P(R = 1) = ½Φ(y) + ½Φ(y) = Φ(y),

because R ⊥⊥ X and the standard Gaussian is symmetric around the origin, so P(−X ≤ y) = P(X ≤ y): you are adding the same thing twice. This means that Y is also standard normal, Y ~ N(0, 1). We created another standard normal by using this trick.
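The trick can be checked by simulation: Y = XR should be indistinguishable from a standard normal.

```python
import numpy as np

rng = np.random.default_rng(2)

# X standard normal, R Rademacher (+/-1 each with probability 1/2), R independent of X.
n = 500_000
X = rng.normal(size=n)
R = rng.choice([-1.0, 1.0], size=n)
Y = X * R

# Y should look standard normal: mean ~ 0, sd ~ 1, 97.5% quantile ~ 1.96.
print(Y.mean(), Y.std(), np.quantile(Y, 0.975))
```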
Same example, next question. Now I am going to create a random vector: let Z ≝ (X, Y)′. If you do not agree that Z is a random vector, you have a huge existential issue here. Is Z a Gaussian random vector? What do you have to check? Whether every linear combination is Gaussian. So let's check uncountably many linear combinations, hopefully by the end of this lecture. Let's check…
Lecture ( - )
Yet another reproducing property of Gaussian random vectors: the second reproducing property of Gaussians. I am not sure whether there are other variables with this property. Suppose we have

(X; Y) ~ N( (μ_X; μ_Y),  (σ²_X  σ_XY;  σ_XY  σ²_Y) ).

Result: Y | X ~ N(E(Y | X), Var(Y | X)), where E(Y | X) = BLP(Y | X) and Var(Y | X) = Var(Y − BLP(Y | X)). Likewise, X | Y ~ N(E(X | Y), Var(X | Y)), where E(X | Y) = BLP(X | Y) and Var(X | Y) = Var(X − BLP(X | Y)). So Y is homoskedastic with respect to X, and X is homoskedastic with respect to Y. This is implied by joint Gaussianity.
Example. Suppose I told you

(X₁; X₂) ~ N( (1; 1),  (3 1; 1 2) ).

Find the distribution of X₁ + X₂ | X₁ = X₂. Let Y₁ ≝ X₁ + X₂ and Y₂ ≝ X₁ − X₂. Then Law(X₁ + X₂ | X₁ = X₂) = Law(Y₁ | Y₂ = 0). (There is another word for "distribution": "law".)
Do you agree that

(Y₁; Y₂) = (X₁ + X₂; X₁ − X₂) = (1 1; 1 −1)(X₁; X₂)
  ~ N( (1 1; 1 −1)(1; 1),  (1 1; 1 −1)(3 1; 1 2)(1 1; 1 −1) ) = N( (2; 0),  (7 1; 1 3) )?

So really Y₁ | Y₂ ~ N(2 + (1/3)Y₂, 20/3), and hence Y₁ | Y₂ = 0 ~ N(2, 20/3).
Try at home: 2𝑋2 | 𝑋1 − 𝑋2 = 0.
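The answer N(2, 20/3) follows from the standard bivariate-normal conditioning formula, which can be evaluated directly for the joint law N((2, 0), [[7, 1], [1, 3]]) derived above:

```python
# Conditional law for jointly normal (Y1, Y2):
# Y1 | Y2 = y2  is  N( mu1 + (s12/s22)(y2 - mu2),  s11 - s12**2/s22 ).
mu1, mu2 = 2.0, 0.0
s11, s12, s22 = 7.0, 1.0, 3.0

def cond_mean(y2):
    return mu1 + s12 / s22 * (y2 - mu2)

cond_var = s11 - s12 ** 2 / s22

print(cond_mean(0.0), cond_var)   # 2.0 and 20/3, i.e. N(2, 20/3)
```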
This is not an easy result to show: one computes that the conditional density has exactly this form, and then every linear combination of coordinates is Gaussian.
If two random variables are independent, they are surely uncorrelated. If two random variables are uncorrelated, are they independent? No. However, if they are jointly normal, uncorrelatedness does imply independence. This is such a strong property!
Result. Let X be a p × 1 Gaussian random vector. Then the coordinates of X are independent if and only if they are pairwise uncorrelated. Why is this result so useful? Suppose I give you a bunch of random variables, each of them marginally normal and mutually independent, and suppose you stack them. Will the result be a Gaussian vector? Yes.
Proof. Independence always implies uncorrelatedness, so it only remains to prove the other direction: if the coordinates of X are pairwise uncorrelated, then they are independent. Assume the coordinates of X are pairwise uncorrelated. Then the variance-covariance matrix is diagonal:

Var X = diag(σ₁², …, σ_p²).
Then the joint density factorises:

PDF_X(x) = (1/(2π))^{p/2} · (σ₁² · … · σ_p²)^{−1/2} · exp( −½ Σ_{i=1}^p (x⁽ⁱ⁾ − μ⁽ⁱ⁾)² / σᵢ² )
  = Π_{i=1}^p (1/(√(2π)σᵢ)) e^{−(x⁽ⁱ⁾−μ⁽ⁱ⁾)²/(2σᵢ²)} = Π_{i=1}^p PDF_{X⁽ⁱ⁾}(x⁽ⁱ⁾),

so the coordinates are independent.
…are just bounded.
Definition. (x_n) is said to be eventually bounded if there exist numbers M and N such that n > N ⇒ |x_n| < M. A sequence that is eventually bounded never gets too large or too small.
Example. For n ∈ N, the sequences x_n = sin n and x_n = (−1)ⁿ are not convergent, but they are eventually bounded. The set of all eventually bounded non-random sequences is denoted O(1), read "big oh of one". There is another symbol: we would like to isolate the sequences that go to zero. Among convergent sequences, it is good to have a name for the ones that converge to zero, since x_n → x as n → ∞ ⇔ x_n − x → 0 as n → ∞.
Definition. The set of all non-random sequences that converge to zero as n → ∞ is denoted o(1).
But what if, instead of real numbers, we had sequences of random vectors? If x_n → x as n → ∞, then x_n − x = o(1) as n → ∞. The big advantage: we do not know how to do algebra with arrows, but we do know algebra with equalities.
o(1) is a sequence converging to zero. Then o(1)·O(1) = o(1), although the two symbols on the ends denote different sequences. Next, o(1) + O(1) = O(1). Next, o(1) − o(1) = o(1), because these might be different sequences. This is not arithmetic; this is symbol manipulation.
What would change if we had real vectors? Recall x_n → x as n → ∞ ⇔ x_n − x → 0 as n → ∞ ⇔ ∀ε > 0 ∃N_ε > 0 : ∀n > N_ε, x_n − x ∈ B(0, ε). This is a remarkable definition, because nowhere have we said that x is a real number. For real numbers the condition means |x_n − x| < ε, measuring the distance between two real numbers; for a vector x, the only thing that changes is that ‖x_n − x‖ < ε.
Let us now talk about the convergence of random vectors. A random vector is just a vector of random variables; you just have to do something to account for the randomness inside.
Definition. We say that X_n converges in probability to X, written X_n →p X as n → ∞, or plim_{n→∞} X_n = X, if ∀ε > 0 ∃N_ε such that n > N_ε ⇒ P(‖X_n − X‖ > ε) < ε. This is just a probabilistic translation of the real definition.
Example. Let X_n ~ N(0, n), n ∈ N. Then X_n/n →p 0 as n → ∞. (The reason Europe became great was calculus, nothing else.)
Let ε > 0. We have to show that P(|X_n/n| > ε) goes to zero. It is nothing else than

P(|N(0, n)/n| > ε) = P(|N(0, 1/n)| > ε) = P(N(0, 1/n) > ε) + P(N(0, 1/n) < −ε)
  = P(N(0, 1) > √n·ε) + P(N(0, 1) < −√n·ε) = 1 − Φ(√n·ε) + Φ(−√n·ε) → 0 as n → ∞.

This is true because the CDF converges to zero as its argument goes to −∞ and to one as it goes to ∞.
Why was convergence invented? When you say x_n → x as n → ∞, what is the intuition? If you go far out into the tail, you can approximate x with x_n: you can use elements from the tail of the sequence to approximate the limit. And translating this into statistics, the statistical counterpart of approximation is estimation.
The expectation is an unknown object. How would you calculate the expected value of X? It is the integral E(X) = ∫_{x∈supp X} x·PDF_X(x) dx. Who alone knows the densities of random variables? Nature. We want to estimate the unknown value.
Result. Let X₁, X₂, …, X_n be IID copies of the random vector X — we have some data on X. Then

(1/n) Σ_{i=1}^n X_i →p E(X) as n → ∞.
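A minimal simulation of this law of large numbers; the Exponential(1) distribution (EX = 1) is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sample means of IID Exponential(1) draws for growing n:
# the deviation |mean - EX| should shrink as n grows.
devs = [abs(rng.exponential(1.0, size=n).mean() - 1.0)
        for n in (100, 10_000, 1_000_000)]
print(devs)
```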
Lecture ( - )
Homework problem. Let U ≝ 2X₁ − X₂ + X₃ and V ≝ X₁ + 2X₂ + 3X₃. The first step is to find the joint distribution of U and V. They can be written as

(U; V) = (2 −1 1; 1 2 3)·(X₁; X₂; X₃) ~ N(…).
Last time, we looked at Bernoulli's weak law of large numbers. This is the standard version; if you change the assumptions, different laws of large numbers exist. Assume that X₁, …, X_n are IID random vectors — you could say they are copies of one random vector. Then X̄ ≝ (1/n) Σ_{i=1}^n X_i →p E(X₁) as n → ∞.
The proof of this is super simple. The value of this result is that it gives a theoretical justification for using sample means to estimate theoretical means. Last year there was a talk by a statistician about the history of statistics: people have been using sample averages since Babylonian times. They used cuneiform to report sample averages of wheat given to the tax collector. How much on average did the tax collectors get? The theoretical justification took thousands of years.
It can be shown that if X̄ →p E(X₁), then X̄ − E(X₁) →p 0 as n → ∞. That is,

(1/n) Σ_{i=1}^n X_i − E(X₁) →p 0 ⇔ (1/n) Σ_{i=1}^n (X_i − μ) →p 0 ⇔ (1/n) Σ_{i=1}^n (X_i − μ) = o_p(1).

We denoted such non-random sequences by o(1), but due to the randomness we use different notation here.
Definition. A generic sequence of random vectors or variables that converges in probability to zero is denoted o_p(1), "little oh pee one".
There are situations where the law does not apply, for example when the expectation does not exist.
Let me introduce another notion of convergence. (1/n) Σ_{i=1}^n X_i is the sample mean, and it is consistent for the population mean: in a certain sense, the sample mean gets closer and closer to the population mean. But this is a degenerate result: in the limit, the distribution of the sample mean puts all its mass at the point E(X₁). It was natural to ask: can we refine this? Can we modify the result so that we get a non-degenerate limiting distribution? How fast is this convergence? This is a logical question. Amazingly, the final answer was given in the early 1920s by the Finnish mathematician Lindeberg, with all i's dotted and t's crossed. Some very great people had failed to give a rigorous proof.
In order to show this result, we need another notion of convergence: convergence in distribution. Let (Y_n) be a sequence of random vectors and let Y be another random vector. We say that (Y_n) converges in distribution to Y if

CDF_{Y_n}(y) → CDF_Y(y) as n → ∞,

that is, if the CDF converges pointwise for each y at which CDF_Y is continuous. This is the usual pointwise convergence of functions: everything that happens here is non-random. We write this as

Y_n →d Y.

There is a slight technical wrinkle: you need only the points of continuity. Remember, Y_n could be discrete, continuous, or mixed; then the CDF can have jumps, and you just exclude them from consideration. But this notion of convergence was invented to prove one of the main limiting results in statistics: the central limit theorem. The CLT required the development of new mathematics, including complex analysis; it was not until complex analysis was fully developed that a rigorous proof was possible. Gauss could not do it; Laplace could not do it; there was always a mistake or an inconsistency. Lindeberg did it.
This is also one of the laws that works in nature. In Chicago, at the Museum of Science and Industry, it works: the marbles drop randomly, but their final distribution is the bell curve.
This has huge practical implications. All the inferential testing — when NASA says that the probability of success is such-and-such — is based on this approximation result. People had been using it without proof since the days of Gauss and Laplace.
Recall the WLLN:

(1/n) Σ_{j=1}^n (X_j − μ) →p 0.
The CLT refines this: you look at the same centred sum, but instead of dividing by n, you divide by something smaller. Once you do this, the limit is no longer degenerate: it is a well-defined random variable with a certain distribution.
A related question was the speed of convergence: the limit in the WLLN is approached too quickly for the scaled sum to have a non-trivial limit. What should α be so that n^α · (1/n) Σ_{i=1}^n (X_i − μ) neither degenerates to zero nor blows up? It turns out that the only α possible is 1/2. You should not confuse the WLLN and the CLT.
Note that

(1/√n) Σ_{i=1}^n (X_i − μ) = (√n/n) Σ_{i=1}^n (X_i − μ) = √n ( (1/n) Σ_{j=1}^n X_j − μ ) = √n (X̄ − μ).
What is the meaning of this result? What does it mean that Y_n converges to Y? The limit can be replaced by the tail of the sequence. What is the practical meaning? Y can be approximated probabilistically by the tail of Y_n: the distribution of Y_n is very close to that of Y if n is large. If the sample size is large enough… but how large? If someone says n = 25, this is shit. That is why econometricians do Monte Carlo simulations. For empirical purposes, whatever sample size you have, you should use it. If in an exam you are given a sample size n and you write that it is not large enough for the CLT, you get zero.
If n is large enough, then

√n (X̄ − μ) ≈d N(0, V) ⇔ X̄ ≈d N(μ, V/n).
The CLT says that the sample mean is approximately normal. Let μ ≝ E(X) and V ≝ Var X. Then

P(0.6 ≤ X̄_n ≤ 0.8) = P(0.6 − μ ≤ X̄_n − μ ≤ 0.8 − μ)
  = P( (0.6 − μ)/√(V/n) ≤ (X̄_n − μ)/√(V/n) ≤ (0.8 − μ)/√(V/n) ).

What is the justification that (X̄_n − μ)/√(V/n) is approximately standard normal? The CLT:

√n (X̄ − μ) ≈d N(0, V) ⇔ (X̄ − μ)/√(V/n) ≈d N(0, 1).

So the probability above is approximately

Φ( (0.8 − μ)/√(V/n) ) − Φ( (0.6 − μ)/√(V/n) ).
In order to get μ and V (here PDF_X(x) = 3x² on supp X = (0, 1) and n = 15), compute

E(X) = ∫_{supp X} x·PDF_X(x) dx = ∫₀¹ 3x³ dx = (3/4)x⁴ |₀¹ = 3/4,

and Var X = E(X²) − (EX)² = 3/5 − 9/16 = (48 − 45)/80 = 3/80. So you have

Φ( (4/5 − 3/4)/√(3/80 · 1/15) ) − Φ( (3/5 − 3/4)/√(3/80 · 1/15) ).
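The final expression can be evaluated and cross-checked by simulating sample means of n = 15 draws from the density 3x² on (0, 1); since the CDF is x³, draws are generated by inverse transform, X = U^{1/3}.

```python
import math
import random

Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))

# mu = 3/4, V = 3/80 for the density 3x^2 on (0, 1); n = 15 as above.
mu, V, n = 3 / 4, 3 / 80, 15
sd = math.sqrt(V / n)
approx = Phi((0.8 - mu) / sd) - Phi((0.6 - mu) / sd)

# Simulation: empirical frequency of 0.6 <= sample mean <= 0.8.
random.seed(0)
reps = 200_000
hits = 0
for _ in range(reps):
    xbar = sum(random.random() ** (1 / 3) for _ in range(n)) / n
    hits += (0.6 <= xbar <= 0.8)
mc = hits / reps
print(approx, mc)
```

The CLT plug-in works out to Φ(1) − Φ(−3) ≈ 0.84, and the simulated frequency lands nearby.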
The interesting questions in science do not have exact answers, but they have approximate answers.
HW, additional problem. Define the random variable X_i that denotes the outcome of the i-th game:

X_i ≝ { 5, with prob. 1/3;  2, with prob. 2/3 }

and recall the continuous mapping theorem: for continuous g,

X_n →d X ⇒ g(X_n) →d g(X) as n → ∞.
You could say the estimate is such-and-such because your grandmother told you, but that is a shitty estimator! Estimators should be dictated by the data. A consistent estimator of p/(1 − p) is p̂/(1 − p̂). Why? Because of the CMT: g(p̂) →p g(p). Let g(p) ≝ p/(1 − p); g : (0, 1) → R is continuous, so there you go.
Example. Suppose

(X_n; Y_n) →d (X; Y), where (X; Y) ~ N( (μ_X; μ_Y),  (σ²_X  σ_XY;  σ_XY  σ²_Y) ).
…random variable, so √n(θ̂ − θ) ≈d N(0, σ²). Therefore

θ̂ ≈d N(θ, σ²/n).

Here σ²/n plays the role of the variance. Can we call it the variance of θ̂? Not really, because they are not equal: it is an approximation to the variance of θ̂.
Definition. The asymptotic variance is defined as asVar(θ̂) ≝ σ²/n. What do you call the square root of a variance? A standard deviation:

assd(θ̂) = √asVar(θ̂) = σ/√n.

Replacing σ by an estimate gives the standard error:

se(θ̂) ≡ âssd(θ̂) ≝ σ̂/√n.

Define the t-statistic

T̂ ≝ (θ̂ − θ)/se(θ̂).

Then

(θ̂ − θ)/se(θ̂) = (θ̂ − θ)/(σ̂/√n) = (σ/σ̂) · (θ̂ − θ)/(σ/√n) →d N(0, 1),

since σ/σ̂ →p 1 and (θ̂ − θ)/(σ/√n) →d N(0, 1).
We said that σ̂² →p σ². Then σ̂²/σ² →p 1, and so σ̂/σ →p 1, and σ/σ̂ →p 1. The product of the two factors above converges in distribution to the standard normal by Slutsky's lemma. This is the basis for inference in large samples: when you do t-tests without the assumption of normality, the statistic is still asymptotically normal. The t distribution came from William Gosset; it requires the normality assumption, and nowhere in this derivation did we use that assumption.
Delta method
There are two versions of the delta method: the approximate delta method and the exact delta method.
Suppose we know that θ̂ is a consistent estimator of θ and

√n(θ̂ − θ) →d N(0, σ²) as n → ∞.

By the mean value theorem,

g(θ̂) = g(θ) + g_θ(λθ̂ + (1 − λ)θ)·(θ̂ − θ),  λ ∈ (0, 1),

where g_θ denotes the derivative dg/dθ. Therefore

√n (g(θ̂) − g(θ)) = g_θ(θ + λ(θ̂ − θ)) · √n(θ̂ − θ) →d g_θ(θ)·N(0, σ²) = N(0, g_θ(θ)²σ²),  λ ∈ (0, 1),

since θ̂ − θ →p 0 makes the first factor converge in probability to g_θ(θ), while the second factor converges in distribution to N(0, σ²).
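A simulation sketch of the delta method; the choice g(θ) = e^θ with θ̂ a sample mean is an invented illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(4)

# Delta method for g(theta) = exp(theta), theta = EX estimated by the sample mean:
# asVar(g(thetahat)) ≈ g'(theta)^2 * sigma^2 / n = exp(theta)^2 * sigma^2 / n.
theta, sigma, n, reps = 0.5, 1.0, 400, 20_000
g = np.exp

# Many replications of the estimator g(sample mean of n N(theta, sigma^2) draws).
est = g(rng.normal(theta, sigma, size=(reps, n)).mean(axis=1))
simulated_var = est.var()
delta_var = g(theta) ** 2 * sigma ** 2 / n
print(simulated_var, delta_var)
```

The Monte Carlo variance of g(θ̂) matches the delta-method approximation closely; the gap shrinks further as n grows, reflecting the caution about nonlinearity below.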
A related back-of-the-envelope rule: E[g(X, Y)] ≈ g(EX, EY), the first-order terms of the expansion having expectation zero. This is purely a back-of-the-envelope calculation, only to save your arse if someone is holding a gun to your head; it is exact only for linear functions.
What about the variance? You should treat the delta method with some caution given the nonlinearity of g.
Example. Additional problem, part (ii).