Multivariate Probability
Chris Piech and Mehran Sahami
Oct 2017

1 Discrete Joint Distributions
Often you will work on problems where there are several random variables (often interacting with one another). We are going to start to formally look at how those interactions play out.
For now we will think of joint probabilities with two random variables X and Y .
In the discrete case, the joint probability mass function of X and Y is

p_{X,Y}(a, b) = P(X = a, Y = b)

This function tells you the probability of any combination of events (the “,” means “and”). If you want to back-calculate the probability of an event for only one variable, you can compute a “marginal” from the joint probability mass function:

p_X(a) = P(X = a) = \sum_y p_{X,Y}(a, y)
In the continuous case, a joint probability density function tells you the relative probability of any combination of events X = a and Y = b.
In the discrete case, we can define the function p_{X,Y} non-parametrically. Instead of using a formula for p, we simply state the probability of each possible outcome.
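For example, here is a minimal sketch (in Python, with made-up table values) of a joint PMF stored non-parametrically, along with a marginal computed by summing over the other variable:

    # A joint PMF for two discrete random variables X and Y, stored
    # non-parametrically as a table mapping (x, y) -> P(X = x, Y = y).
    # The table values are invented for illustration.
    joint_pmf = {
        (0, 0): 0.10, (0, 1): 0.20,
        (1, 0): 0.30, (1, 1): 0.40,
    }

    def marginal_x(a):
        # p_X(a) = sum over y of p_{X,Y}(a, y)
        return sum(prob for (x, y), prob in joint_pmf.items() if x == a)

    print(marginal_x(0))   # 0.10 + 0.20 = 0.30
    print(marginal_x(1))   # 0.30 + 0.40 = 0.70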
3 Multinomial Distribution
Say you perform n independent trials of an experiment where each trial results in one of m outcomes, with respective probabilities p_1, p_2, \dots, p_m (constrained so that \sum_i p_i = 1). Define X_i to be the number of trials with outcome i. The multinomial distribution is a closed-form function that answers the question: what is the probability that there are c_i trials with outcome i, for each i? Mathematically:
P(X_1 = c_1, X_2 = c_2, \dots, X_m = c_m) = \binom{n}{c_1, c_2, \dots, c_m} p_1^{c_1} p_2^{c_2} \cdots p_m^{c_m}
Example 1
A 6-sided die is rolled 7 times. What is the probability that you roll 1 one, 1 two, 0 threes, 2 fours, 0 fives, and 3 sixes (disregarding order)?
P(X_1 = 1, X_2 = 1, X_3 = 0, X_4 = 2, X_5 = 0, X_6 = 3) = \frac{7!}{2! \, 3!} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{0} \left(\frac{1}{6}\right)^{2} \left(\frac{1}{6}\right)^{0} \left(\frac{1}{6}\right)^{3} = 420 \left(\frac{1}{6}\right)^{7}
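A quick way to check this arithmetic is to evaluate the multinomial formula directly; the sketch below is a straightforward transcription of the equation above into Python:

    from math import factorial

    def multinomial_pmf(counts, probs):
        # n! / (c_1! ... c_m!) * p_1^c_1 * ... * p_m^c_m
        n = sum(counts)
        coeff = factorial(n)
        for c in counts:
            coeff //= factorial(c)
        prob = 1.0
        for c, p in zip(counts, probs):
            prob *= p ** c
        return coeff * prob

    # 7 rolls of a fair die: 1 one, 1 two, 0 threes, 2 fours, 0 fives, 3 sixes
    print(multinomial_pmf([1, 1, 0, 2, 0, 3], [1 / 6] * 6))   # 420 * (1/6)^7 ≈ 0.0015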
Federalist Papers
In class we wrote a program to decide whether James Madison or Alexander Hamilton wrote Federalist Paper 49. Both men claimed to have written it, and hence the authorship is in dispute. First we used historical essays to estimate p_i, the probability that Hamilton generates the word i (independent of all previous and future choices of words). Similarly, we estimated q_i, the probability that Madison generates the word i. For each word i we observe the number of times that word occurs in Federalist Paper 49 (we call that count c_i). We assume that, given no evidence, the paper is equally likely to have been written by Madison or Hamilton.
Define three events: H is the event that Hamilton wrote the paper, M is the event that Madison wrote the paper, and D is the event that a paper has the collection of words observed in Federalist Paper 49. We would like to know whether P(H|D) is larger than P(M|D). This is equivalent to deciding whether P(H|D)/P(M|D) is larger than 1.
The event D|H is a multinomial parameterized by the values p_i. The event D|M is also a multinomial, this time parameterized by the values q_i.
Using Bayes Rule we can simplify the desired probability.
\frac{P(H|D)}{P(M|D)} = \frac{P(D|H)P(H)/P(D)}{P(D|M)P(M)/P(D)} = \frac{P(D|H)P(H)}{P(D|M)P(M)} = \frac{P(D|H)}{P(D|M)}

= \frac{\binom{n}{c_1, c_2, \dots, c_m} \prod_i p_i^{c_i}}{\binom{n}{c_1, c_2, \dots, c_m} \prod_i q_i^{c_i}} = \frac{\prod_i p_i^{c_i}}{\prod_i q_i^{c_i}}
This seems great! We have our desired probability statement expressed in terms of a product of values we have already estimated. However, when we plug this into a computer, both the numerator and denominator come out to be zero: the product of many numbers close to zero underflows, becoming too small for a computer to represent. To fix this problem, we use a standard trick in computational probability: apply a log to both sides and use some basic rules of logs.
\log \frac{P(H|D)}{P(M|D)} = \log \frac{\prod_i p_i^{c_i}}{\prod_i q_i^{c_i}}

= \log\left(\prod_i p_i^{c_i}\right) - \log\left(\prod_i q_i^{c_i}\right)

= \sum_i \log\left(p_i^{c_i}\right) - \sum_i \log\left(q_i^{c_i}\right)

= \sum_i c_i \log(p_i) - \sum_i c_i \log(q_i)
This expression is “numerically stable,” and my computer returned that the answer was a negative number. We can exponentiate to recover P(H|D)/P(M|D). Since e raised to a negative number is smaller than 1, this implies that P(H|D)/P(M|D) is smaller than 1. As a result, we conclude that Madison was more likely to have written Federalist Paper 49.
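Here is a minimal sketch of that log-space computation, assuming you already have per-word probability estimates for each author and the word counts from the disputed paper. The tiny dictionaries below are placeholders, not the real estimates:

    from math import log

    # Hypothetical estimates: P(word | Hamilton), P(word | Madison), and the
    # observed counts in the disputed paper. Real values would come from data.
    p = {"upon": 0.012, "whilst": 0.001, "the": 0.080}
    q = {"upon": 0.002, "whilst": 0.008, "the": 0.085}
    c = {"upon": 2, "whilst": 1, "the": 90}

    # log[P(H|D) / P(M|D)] = sum_i c_i log(p_i) - sum_i c_i log(q_i)
    log_ratio = sum(c[w] * log(p[w]) for w in c) - sum(c[w] * log(q[w]) for w in c)

    # A negative log ratio means the ratio itself is less than 1,
    # i.e. Madison is the more likely author under this model.
    print("Madison" if log_ratio < 0 else "Hamilton")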
4 Expectation with Multiple RVs

The expected value of a sum of random variables is the sum of their expected values, whether or not the variables are independent. For discrete random variables X and Y with joint PMF p(x, y):

E[X + Y] = \sum_{x,y} (x + y)\, p(x, y)

= \sum_{x,y} x\, p(x, y) + \sum_{x,y} y\, p(x, y)

= \sum_x x \sum_y p(x, y) + \sum_y y \sum_x p(x, y)

= \sum_x x\, p_X(x) + \sum_y y\, p_Y(y)

= E[X] + E[Y]

This can be generalized to multiple variables:

E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]
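A quick simulation sketch of this lemma: the variables below are deliberately dependent (Y is a function of X), yet the expectation of the sum still matches the sum of the expectations.

    import random

    random.seed(0)
    n = 100_000
    xs = [random.randint(1, 6) for _ in range(n)]   # X: a fair die roll
    ys = [x * x for x in xs]                        # Y = X^2, so Y depends on X

    e_x = sum(xs) / n
    e_y = sum(ys) / n
    e_sum = sum(x + y for x, y in zip(xs, ys)) / n

    # Both estimates are close to 3.5 + 91/6 ≈ 18.67, despite the dependence.
    print(e_x + e_y, e_sum)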
Example 3
A disk surface is a circle of radius R. A single point imperfection is uniformly distributed on the disk with
joint PDF:
f_{X,Y}(x, y) = \begin{cases} \frac{1}{\pi R^2} & \text{if } x^2 + y^2 \le R^2 \\ 0 & \text{else} \end{cases}

Let D be the distance from the origin: D = \sqrt{X^2 + Y^2}. What is E[D]? Hint: use the lemmas.
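One way to finish the example (a sketch using the expectation-of-a-function lemma and a change to polar coordinates):

E[D] = \int\!\!\int_{x^2 + y^2 \le R^2} \sqrt{x^2 + y^2} \cdot \frac{1}{\pi R^2}\, dx\, dy = \frac{1}{\pi R^2} \int_0^{2\pi}\!\!\int_0^{R} r \cdot r\, dr\, d\theta = \frac{1}{\pi R^2} \cdot 2\pi \cdot \frac{R^3}{3} = \frac{2R}{3}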
5 Independence with Multiple RVs
Discrete
Two discrete random variables X and Y are called independent if:
P(X = x,Y = y) = P(X = x)P(Y = y) for all x, y
Intuitively: knowing the value of X tells us nothing about the distribution of Y. If two variables are not independent, they are called dependent. This is conceptually similar to independence of events, but now we are dealing with random variables. Make sure to keep your events and variables distinct.
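Here is a short sketch that checks this definition directly on a non-parametric joint PMF table. The table values are made up and happen to factor, so the check reports independence:

    # Joint PMF as a table; these example values factor into p_X(x) * p_Y(y).
    joint = {
        (0, 0): 0.06, (0, 1): 0.14,
        (1, 0): 0.24, (1, 1): 0.56,
    }

    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}   # marginal of X
    p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}   # marginal of Y

    independent = all(
        abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
        for x in xs for y in ys
    )
    print(independent)   # True for this table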
Continuous
Two continuous random variables X and Y are called independent if:
P(X ≤ a,Y ≤ b) = P(X ≤ a)P(Y ≤ b) for all a, b
This can be stated equivalently as:
FX,Y (a, b) = FX (a)FY (b) for all a, b
fX,Y (a, b) = fX (a) fY (b) for all a, b
More generally, if you can factor the joint density function, then your continuous random variables are independent:
fX,Y (x, y) = h(x)g(y) where − ∞ < x, y < ∞
Example 2
Let N be the # of requests to a web server per day and assume N ∼ Poi(λ). Each request comes from a human (with probability p) or from a “bot” (with probability 1 − p), independently. Define X to be the # of requests from humans per day and Y to be the # of requests from bots per day.
Since requests come in independently, the distribution of X conditioned on knowing the total number of requests is binomial. Specifically:
(X|N) ∼ Bin(N, p)
(Y |N) ∼ Bin(N, 1 − p)
Calculate the probability of getting exactly i human requests and j bot requests. Start by expanding using the
chain rule:
P(X = i,Y = j) = P(X = i,Y = j|X +Y = i + j)P(X +Y = i + j)
We can calculate each term in this expression:
P(X = i, Y = j \mid X + Y = i + j) = \binom{i + j}{i} p^{i} (1 - p)^{j}

P(X + Y = i + j) = e^{-\lambda} \frac{\lambda^{i+j}}{(i + j)!}
Now we can put those together and simplify:
P(X = i, Y = j) = \binom{i + j}{i} p^{i} (1 - p)^{j} \, e^{-\lambda} \frac{\lambda^{i+j}}{(i + j)!}
As an exercise you can simplify this expression into two independent Poisson distributions.
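If you would rather check that factorization numerically than algebraically, here is a sketch with made-up values λ = 10 and p = 0.3:

    from math import comb, exp, factorial

    lam, p = 10.0, 0.3

    def poisson_pmf(k, mu):
        return exp(-mu) * mu ** k / factorial(k)

    def joint(i, j):
        # (i+j choose i) p^i (1-p)^j * e^{-lam} lam^{i+j} / (i+j)!
        return comb(i + j, i) * p ** i * (1 - p) ** j * poisson_pmf(i + j, lam)

    # The joint probability factors into Poi(lam * p) and Poi(lam * (1 - p)) terms.
    i, j = 4, 7
    print(joint(i, j))
    print(poisson_pmf(i, lam * p) * poisson_pmf(j, lam * (1 - p)))   # same value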
Symmetry of Independence
Independence is symmetric. That means that if random variables X and Y are independent, X is independent
of Y and Y is independent of X. This claim may seem trivial, but it can be very useful. Imagine a sequence of random variables X_1, X_2, .... Let A_i be the event that X_i is a “record value” (e.g., it is larger than all previous values). Is A_{n+1} independent of A_n? It is easier to answer that A_n is independent of A_{n+1}. By symmetry of independence, both claims must be true.
6 Convolution of Distributions
Convolution gives the distribution that results from adding two random variables together. For some particular random variables, computing the convolution has intuitive closed-form equations. Importantly, convolution concerns the sum of the random variables themselves, not the addition of the probability density functions (PDFs) that correspond to the random variables.
Independent Poissons
For any two independent Poisson random variables X ∼ Poi(λ_1) and Y ∼ Poi(λ_2), the sum of those two random variables is another Poisson: X + Y ∼ Poi(λ_1 + λ_2). This holds even when λ_1 is not the same as λ_2.
Independent Normals
For any two independent normal random variables X ∼ N(μ_1, σ_1^2) and Y ∼ N(μ_2, σ_2^2), the sum of those two random variables is another normal: X + Y ∼ N(μ_1 + μ_2, σ_1^2 + σ_2^2).
In general, for two independent continuous random variables X and Y, the PDF of their sum is given by the convolution integral

f_{X+Y}(a) = \int_{-\infty}^{\infty} f_X(a - y)\, f_Y(y)\, dy

There are direct analogies in the discrete case, where you replace the integrals with sums and change the notation for the CDF and PDF.
Example 1
Calculate the PDF of X + Y for independent uniform random variables X ∼ Uni(0, 1) and Y ∼ Uni(0, 1). First plug in the equation for the general convolution of independent random variables:
f_{X+Y}(a) = \int_{y=0}^{1} f_X(a - y)\, f_Y(y)\, dy

f_{X+Y}(a) = \int_{y=0}^{1} f_X(a - y)\, dy    (because f_Y(y) = 1)
It turns out that is not the easiest thing to integrate. By trying a few different values of a in the range [0, 2], we can observe that the PDF we are trying to calculate takes a different form on either side of the point a = 1, and thus will be easier to think about as two cases: a ≤ 1 and a > 1. If we calculate f_{X+Y} for both cases and correctly constrain the bounds of the integral, we get simple closed forms for each case:
f_{X+Y}(a) = \begin{cases} a & \text{if } 0 < a \le 1 \\ 2 - a & \text{if } 1 < a \le 2 \\ 0 & \text{else} \end{cases}
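A quick simulation sketch to sanity-check the closed form: integrating the density gives P(X + Y ≤ 1) = \int_0^1 a\, da = 1/2, and an empirical estimate lands close to that value.

    import random

    random.seed(1)
    n = 100_000
    below_one = sum(random.random() + random.random() <= 1.0 for _ in range(n)) / n

    # The closed form above gives P(X + Y <= 1) = 1/2.
    print(below_one)   # ≈ 0.5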
7 Conditional Distributions
Previously we looked at conditional probabilities for events. Here we formally go over conditional probabilities for random variables. The equations for both the discrete and continuous case are intuitive extensions of our understanding of conditional probability:
Discrete
The conditional probability mass function (PMF) for the discrete case:
p_{X|Y}(x|y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{p_{X,Y}(x, y)}{p_Y(y)}

The conditional cumulative distribution function (CDF) for the discrete case:

F_{X|Y}(a|y) = P(X \le a \mid Y = y) = \frac{\sum_{x \le a} p_{X,Y}(x, y)}{p_Y(y)} = \sum_{x \le a} p_{X|Y}(x|y)
Continuous
The conditional probability density function (PDF) for the continuous case:
f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}

The conditional cumulative distribution function (CDF) for the continuous case:

F_{X|Y}(a|y) = P(X \le a \mid Y = y) = \int_{-\infty}^{a} f_{X|Y}(x|y)\, dx
Let X be a continuous random variable and let N be a discrete random variable. The conditional probabilities of X given N and of N given X are, respectively:

f_{X|N}(x|n) = \frac{p_{N|X}(n|x)\, f_X(x)}{p_N(n)} \qquad\qquad p_{N|X}(n|x) = \frac{f_{X|N}(x|n)\, p_N(n)}{f_X(x)}
Example 2
Let's say we have two independent Poisson random variables for requests received at a web server in a day: X = # of requests from humans per day, with X ∼ Poi(λ_1), and Y = # of requests from bots per day, with Y ∼ Poi(λ_2). Since the convolution of Poisson random variables is also a Poisson, we know that the total number of requests X + Y is also Poisson: (X + Y) ∼ Poi(λ_1 + λ_2). What is the probability of having k human requests on a particular day given that there were n total requests?
P(X = k \mid X + Y = n) = \frac{P(X = k, Y = n - k)}{P(X + Y = n)} = \frac{P(X = k)\,P(Y = n - k)}{P(X + Y = n)}

= \frac{e^{-\lambda_1} \lambda_1^{k}}{k!} \cdot \frac{e^{-\lambda_2} \lambda_2^{n-k}}{(n - k)!} \cdot \frac{n!}{e^{-(\lambda_1 + \lambda_2)} (\lambda_1 + \lambda_2)^{n}}

= \binom{n}{k} \left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^{k} \left(\frac{\lambda_2}{\lambda_1 + \lambda_2}\right)^{n-k}

That is, (X \mid X + Y = n) \sim \text{Bin}\!\left(n, \frac{\lambda_1}{\lambda_1 + \lambda_2}\right)
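A numeric sketch of this result, using made-up rates λ_1 = 4 and λ_2 = 6: the left-hand side is computed from Poisson PMFs, the right-hand side from the binomial PMF, and the two agree.

    from math import comb, exp, factorial

    lam1, lam2 = 4.0, 6.0

    def poisson_pmf(k, mu):
        return exp(-mu) * mu ** k / factorial(k)

    def binomial_pmf(k, n, p):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    n, k = 12, 5
    lhs = poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2) / poisson_pmf(n, lam1 + lam2)
    rhs = binomial_pmf(k, n, lam1 / (lam1 + lam2))
    print(lhs, rhs)   # the two values match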
Tracking in 2D Example
In this example we are going to explore the problem of tracking an object in 2D space. The object exists at some (x, y) location; however, we are not sure exactly where! Thus we are going to use random variables X and Y to represent its location.
We have a prior belief about where the object is. In this example our prior has X and Y be normals which are independently distributed, each with mean 3 and variance 4. First let's write the prior belief as a joint probability density function:
f(X = x, Y = y) = f(X = x) \cdot f(Y = y)    (in the prior, x and y are independent)

= \frac{1}{\sqrt{2 \cdot 4 \cdot \pi}} \cdot e^{-\frac{(x - 3)^2}{2 \cdot 4}} \cdot \frac{1}{\sqrt{2 \cdot 4 \cdot \pi}} \cdot e^{-\frac{(y - 3)^2}{2 \cdot 4}}    (using the PDF equation for normals)

= K_1 \cdot e^{-\frac{(x - 3)^2 + (y - 3)^2}{8}}    (all constants are put into K_1)
This combination of normals is called a bivariate normal distribution. Here is a visualization of the PDF of our prior.
The interesting part about tracking an object is the process of updating your belief about its location based on an observation. Let's say that we get an instrument reading from a sonar that is sitting at the origin. The instrument reports that the object is 4 units away. Our instrument is not perfect: if the true distance is t units, then the instrument will give a reading which is normally distributed with mean t and variance 1. Let's visualize the observation:

Based on this information about the noisiness of our instrument, we can compute the conditional probability of seeing a particular distance reading D, given the true location of the object X, Y. If we knew the object was at location (x, y), we could calculate the true distance to the origin, \sqrt{x^2 + y^2}, which would give us the mean for the instrument Gaussian:
f(D = d \mid X = x, Y = y) = \frac{1}{\sqrt{2 \cdot 1 \cdot \pi}} \cdot e^{-\frac{\left(d - \sqrt{x^2 + y^2}\right)^2}{2 \cdot 1}}    (normal PDF where μ = \sqrt{x^2 + y^2})

= K_2 \cdot e^{-\frac{\left(d - \sqrt{x^2 + y^2}\right)^2}{2 \cdot 1}}    (all constants are put into K_2)
Let us test out this formula with numbers. How much more likely is an instrument reading of 1 compared to
2, given that the location of the object is at (1, 1)?
\frac{f(D = 1 \mid X = 1, Y = 1)}{f(D = 2 \mid X = 1, Y = 1)} = \frac{K_2 \cdot e^{-\frac{\left(1 - \sqrt{1^2 + 1^2}\right)^2}{2 \cdot 1}}}{K_2 \cdot e^{-\frac{\left(2 - \sqrt{1^2 + 1^2}\right)^2}{2 \cdot 1}}}    (substituting into the conditional PDF of D)

= e^{\frac{(2 - \sqrt{2})^2 - (1 - \sqrt{2})^2}{2}} = e^{\frac{3 - 2\sqrt{2}}{2}} \approx 1.09    (notice how the K_2 s cancel out)
At this point we have a prior belief and we have an observation. We would like to compute an updated belief given that observation. This is a classic Bayes' formula scenario. We are using jointly continuous random variables, but that doesn't change the math much; it just means we will be dealing with densities instead of probabilities:
f(X = x, Y = y \mid D = 4) = \frac{f(D = 4 \mid X = x, Y = y) \cdot f(X = x, Y = y)}{f(D = 4)}    (Bayes using densities)

= \frac{K_2 \cdot e^{-\frac{\left(4 - \sqrt{x^2 + y^2}\right)^2}{2}} \cdot K_1 \cdot e^{-\frac{(x - 3)^2 + (y - 3)^2}{8}}}{f(D = 4)}    (substituting the likelihood and the prior)

= \frac{K_1 \cdot K_2}{f(D = 4)} \cdot e^{-\left[\frac{\left(4 - \sqrt{x^2 + y^2}\right)^2}{2} + \frac{(x - 3)^2 + (y - 3)^2}{8}\right]}    (f(D = 4) is a constant with respect to (x, y))

= K_3 \cdot e^{-\left[\frac{\left(4 - \sqrt{x^2 + y^2}\right)^2}{2} + \frac{(x - 3)^2 + (y - 3)^2}{8}\right]}    (K_3 is a new constant)
Wow! That looks like a pretty interesting function! You have successfully computed the updated belief. Let’s
see what it looks like. Here is a figure with our prior on the left and the posterior on the right:
How beautiful is that! It's like a 2D normal distribution merged with a circle. But wait, what about that constant? We do not know the value of K_3, and that is not a problem, for two reasons. The first reason is that if we ever want to calculate the relative probability of two locations, K_3 will cancel out. The second reason is that if we really wanted to know what K_3 was, we could solve for it.
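Here is a sketch of the first reason in action: comparing the unnormalized posterior at two candidate locations. Because K_3 cancels in the ratio, the code simply drops it; the candidate points are arbitrary.

    from math import exp, sqrt

    def unnormalized_posterior(x, y, d=4.0):
        # exp(-[ (d - sqrt(x^2 + y^2))^2 / 2 + ((x - 3)^2 + (y - 3)^2) / 8 ]),
        # i.e. the posterior with the constant K_3 dropped.
        dist = sqrt(x ** 2 + y ** 2)
        return exp(-((d - dist) ** 2 / 2 + ((x - 3) ** 2 + (y - 3) ** 2) / 8))

    # How much more likely is the object to be at (2.5, 3.1) than at (1.0, 1.0)?
    print(unnormalized_posterior(2.5, 3.1) / unnormalized_posterior(1.0, 1.0))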
This math is used every day in millions of applications. If there are multiple observations, the equations can get truly complex (even worse than this one). To represent these complex functions, people often use an algorithm called particle filtering.
8 Covariance and Correlation

Covariance is a quantitative measure of the extent to which the deviation of one variable from its mean matches the deviation of the other from its mean. It is a mathematical relationship that is defined as:

Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
That is a little hard to wrap your mind around (but worth pushing on a bit). The outer expectation will be a
weighted sum of the inner function evaluated at a particular (x, y) weighted by the probability of (x, y). If x
and y are both above their respective means, or if x and y are both below their respective means, that term will
be positive. If one is above its mean and the other is below, the term is negative. If the weighted sum of terms
is positive, the two random variables will have a positive correlation. We can rewrite the above equation to get an equivalent equation:

Cov(X, Y) = E[XY] - E[X]E[Y]

Using this equation (and the product lemma), it is easy to see that if two random variables are independent, their covariance is 0. The reverse is not true in general.
Properties of Covariance
Say that X and Y are arbitrary random variables:

Cov(X, Y) = Cov(Y, X)

Cov(X, X) = E[X^2] - E[X]E[X] = Var(X)

Cov(aX + b, Y) = a\,Cov(X, Y)

And if X = X_1 + X_2 + \cdots + X_n and Y = Y_1 + Y_2 + \cdots + Y_m, then

Cov(X, Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} Cov(X_i, Y_j)

That last property gives us a third way to calculate variance. You could use it to calculate the variance of the binomial.
Correlation
Covariance is interesting because it is a quantitative measurement of the relationship between two variables. The correlation between two random variables, ρ(X, Y), is the covariance of the two variables normalized by the standard deviation of each variable. This normalization cancels the units out and bounds the measure so that it is always in the range [−1, 1]:

\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}}
If ρ(X, Y) = 0 we say that X and Y are “uncorrelated.” If two variables are independent, then their correlation will be 0. However, it doesn't go the other way: a correlation of 0 does not imply independence.
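Here is a small sketch that estimates covariance and correlation from samples; the data-generating process is made up just to produce two linearly related variables.

    import random
    from math import sqrt

    random.seed(2)
    n = 100_000
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = 0.5 * x + random.gauss(0, 1)   # y depends linearly on x
        xs.append(x)
        ys.append(y)

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n

    rho = cov / sqrt(var_x * var_y)
    print(cov, rho)   # Cov ≈ 0.5, rho ≈ 0.45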
When people use the term correlation, they are actually referring to a specific type of correlation called “Pearson” correlation. It measures the degree to which there is a linear relationship between the two variables. An alternative measure is “Spearman” correlation, which has a formula almost identical to your regular correlation score, with the exception that the underlying random variables are first transformed into their ranks. “Spearman” correlation is outside the scope of CS109.