01 Probability Theory
Branislav L. Slantchev
Department of Political Science, University of California – San Diego
February 3, 2006
Consider any random process that can generate different outcomes. Let S be
the set of these outcomes. That is, we assume a sample space S and a set of
subsets A ⊂ S, B ⊂ S, C ⊂ S, . . . . We call subsets A, B, C, . . . events. Note that
events may not be individual outcomes but rather collections of outcomes. For
any events A and B, we shall interpret A ∩ B = AB as the event “both A and B
occur” (recall that this is what we had in set theory). Similarly, we shall interpret
A ∪ B as “A or B occurs,” and the complement of A, written ¬A, as “A does not occur.”
Suppose we conduct an experiment many times over. We may then observe
some event or another, in no regular pattern. The probability of an event is
the proportion of times it occurs during these repetitions. For example, the
experiment may be a coin toss, and the two events “heads” and “tails”. We
could model the process that causes heads or tails to occur, but this would
not be necessary for just about anything we’re going to need. Instead, if we
assume that the coin is fair, we can model the event “heads” as occurring with
probability 1/2, and the event “tails” as occurring with the same probability.
The probability of an event A is denoted by Pr(A) = a, where a is a real
number such that a ∈ [0, 1]. We assume that Pr(S) = 1, that is, some event in S
will occur with certainty. Further, if A = ∪_{i=1}^{∞} Ai, where Ai ⊂ S and Ai ∩ Aj = ∅
for all i ≠ j, then

Pr(A) = Σ_{i=1}^{∞} Pr(Ai).
Recall that Pr(A) denotes the probability of event A occurring while Pr(¬A)
is the probability of event A not occurring. Also, Pr(A ∪ B) is the probability
of event A or event B occurring (the union of the events), and Pr(A ∩ B) is the
probability of events A and B both occurring (the intersection of the events).
Axiom 1. The probability of any event A is a real number between zero and
one:

0 ≤ Pr(A) ≤ 1.

Axiom 2. Some outcome in the sample space occurs with certainty:

Pr(S) = 1.

Axiom 3. The probability of an event which is the union of two mutually ex-
clusive events is the sum of the probabilities of the two:

Pr(A ∪ B) = Pr(A) + Pr(B) whenever A ∩ B = ∅.

Several useful results follow from these axioms.

1. The probability of the complement: Pr(¬A) = 1 − Pr(A).

2. For any two events (i.e. not just mutually exclusive ones):

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).

Recalling that Pr(A ∩ B) = 0 for mutually exclusive events, we get our
axiom above. Rewriting this gives us a formula for the intersection:

Pr(A ∩ B) = Pr(A) + Pr(B) − Pr(A ∪ B).

3. Given any two events (this is very useful and is known as the total proba-
bility theorem):

Pr(A) = Pr(A ∩ B) + Pr(A ∩ ¬B).

4. De Morgan's laws carry over to probabilities:

Pr(¬A ∩ ¬B) = Pr(¬(A ∪ B))
Pr(¬A ∪ ¬B) = Pr(¬(A ∩ B)).
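These identities are easy to check by brute-force enumeration on a small sample space. The following sketch (Python, not part of the original notes; the two dice events are chosen purely for illustration) verifies inclusion-exclusion, the total probability theorem, and De Morgan's laws:

```python
from fractions import Fraction
from itertools import product

# Sample space S: all 36 equally likely outcomes of rolling two dice.
S = set(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event (a subset of S) under equally likely outcomes."""
    return Fraction(len(event), len(S))

A = {o for o in S if o[0] % 2 == 0}   # first die is even
B = {o for o in S if sum(o) > 7}      # the two dice sum to more than 7

# Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)
# Total probability: Pr(A) = Pr(A ∩ B) + Pr(A ∩ ¬B)
assert pr(A) == pr(A & B) + pr(A & (S - B))
# De Morgan: Pr(¬A ∩ ¬B) = Pr(¬(A ∪ B))
assert pr((S - A) & (S - B)) == pr(S - (A | B))
```

Working with `Fraction` keeps every probability exact, so the identities hold with `==` rather than up to rounding error.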
We shall mostly deal with probabilities when we want to represent some player’s
belief about another player. For example, consider arms reduction negotiations.
One player may possess private information about his preferences: He may be
“tough” and prefer no deal to even small concessions, or “weak” and prefer
significant concessions to no deal at all. The other player has a prior belief
about this negotiator (possibly based on previous experience, etc.) but in the
course of bargaining, as new information accumulates, it is reasonable that this
prior belief will change. We shall call the belief updated in the light of new
evidence the posterior belief.
The classic illustration is from a medical case. Suppose 1% of the population
carries the virus X, and so the prior probability that I carry the virus is 0.01.
There is an imperfect test for the presence of X: it is positive in 90% of the
subjects who carry X and in 20% of the subjects who do not carry X. If I test
positive, what is my posterior belief about carrying the virus?
To answer this question, we need to deal with conditional probability, that is,
the probability of an event A given that event B has occurred. The probability of
event A occurring conditional on event B having occurred is the joint probability
of the two events occurring divided by the probability of B occurring:
Pr(A|B) = Pr(A ∩ B) / Pr(B).    (1)
In words, take the probability of two events occurring jointly, and then incorpo-
rate the information that one of them did, in fact, occur. Note that if Pr(B) = 0,
then Pr(A|B) is undefined. You cannot condition on zero-probability events.
This problem is going to crop up later, so make sure you remember it. Further,
we can get another formula for the intersection of events by rearranging terms
in (1):
Pr(A ∩ B) = Pr(A|B) · Pr(B),
which you can extend to an arbitrary number of events. Here's an example with
three:

Pr(A ∩ B ∩ C) = Pr(A|B ∩ C) · Pr(B|C) · Pr(C).
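As a quick sanity check, definition (1) and the product rule can be verified by enumeration. This is a Python sketch of mine, not part of the notes; the dice events are made up for illustration:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
outcomes = set(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

A = {o for o in outcomes if o[0] == 6}    # first die shows 6
B = {o for o in outcomes if sum(o) == 8}  # the dice sum to 8

pr_A_given_B = pr(A & B) / pr(B)          # definition (1)
assert pr(A & B) == pr_A_given_B * pr(B)  # Pr(A ∩ B) = Pr(A|B) · Pr(B)
print(pr_A_given_B)                       # 1/5
```

Knowing the sum is 8 leaves five equally likely outcomes, exactly one of which has a 6 on the first die, hence the 1/5.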
With these results, we can now approach the problem above. Let A be the event
that I carry X and let B be the event that I test positive. We are thus given:
Pr(A) = 0.01
Pr(B|A) = 0.90
Pr(B|¬A) = 0.20

We do not know either Pr(A ∩ B) or Pr(B). However, Bayes' Rule gives us a way
to solve this by expressing the conditional probability Pr(A|B) in terms of Pr(A),
Pr(B|A), and Pr(B|¬A):

Pr(A|B) = Pr(B|A) Pr(A) / [Pr(B|A) Pr(A) + Pr(B|¬A) Pr(¬A)].    (2)
Here’s how you can obtain the formula in (2) from (1). Note that Pr(A ∩ B) =
Pr(B|A) Pr(A) from (1) by exchanging A and B. This yields the numerator in
Equation 2. To see that the denominator equals Pr(B), note that this is just
an application of the total probability theorem. We add (i) the probability of B
occurring conditional on A having occurred times the probability of A occurring
and (ii) the probability of B occurring conditional on A having not occurred
times the probability that A does not occur. Since A can either occur or not (but
not both), these are mutually exclusive and exhaustive events, so the sum equals
the probability of B occurring. That is, this is the total probability theorem.1
1
There may be more possible events. For example, suppose I can get to the office in only
three possible ways: by scooter, shuttle, or bike. Then the probability of “I got to my office”
equals the probability that either “I got to my office on my bike” or “I got to my office on my
scooter,” or “I got to my office on the shuttle.” The point is that since there are three states of
the world in which I can get to the office (bike, scooter, or shuttle), summing the probabilities of
getting to the office in any of these states yields the probability of getting to the office without
reference to any particular state.
We now apply Bayes’ Rule to our little medical problem (using the fact that
Pr(¬A) = 1 − Pr(A) = 0.99):

Pr(A|B) = (0.90)(0.01) / [(0.90)(0.01) + (0.20)(0.99)] = 0.009 / (0.009 + 0.198) ≈ 0.04.
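Here is the same computation as a short Python sketch (not part of the original notes; the variable names are mine):

```python
# Bayes' rule for the virus test: A = "carry X", B = "test positive".
p_A = 0.01              # prior Pr(A)
p_B_given_A = 0.90      # Pr(B|A): positive rate among carriers
p_B_given_not_A = 0.20  # Pr(B|¬A): positive rate among non-carriers

# The total probability theorem gives the denominator Pr(B).
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)
posterior = p_B_given_A * p_A / p_B  # Pr(A|B)
print(round(posterior, 4))           # 0.0435
```

Despite the positive test, the posterior stays small because carriers are so rare and false positives are so common.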
We shall call event B a “signal” and events Ai “states” with every conditional
probability Pr(B|Ai ) being either 0 or 1, depending on whether the state Ai
generates the signal B or not. In our negotiator example, suppose the signal
B is “quit angrily.” The two states are “opponent is tough” and “opponent is
weak.” We shall be interested in the conditional probabilities “weak opponent
walks out” (that is, the probability of walking out conditional on the opponent
being weak) and “tough opponent walks out” (that is, the probability of walking
out conditional on the opponent being tough). We want to be able to calculate
the posterior probability so that a player who observes a walk-out can update
his prior beliefs given the new information.
Because Bayes' rule is essential, we shall now go through several examples of
its application. These are taken from the Gintis book.
In a bolt factory, machines A, B, and C manufacture 25%, 35%, and 40% of the
total output, and have defective rates of 5%, 4%, and 2%, respectively. A bolt is
chosen at random and is found to be defective. What are the probabilities that
it was manufactured by each of the three machines?
First, let’s translate the worded problem into symbols to see exactly what
information we have. The probability that a randomly chosen bolt was manu-
factured by machine A is Pr(A) = 0.25, and similarly Pr(B) = 0.35, and Pr(C) =
0.40. Let D denote the event “defective bolt”. We now have Pr(D|A) = 0.05,
Pr(D|B) = 0.04, and Pr(D|C) = 0.02. We want to know each of Pr(A|D), Pr(B|D),
and Pr(C|D). Let’s do one, the other two are analogous. By Bayes rule, we have
Pr(A|D) = Pr(D|A) Pr(A) / [Pr(D|A) Pr(A) + Pr(D|¬A) Pr(¬A)].

Because ¬A ≡ B ∨ C, we have

Pr(A|D) = Pr(D|A) Pr(A) / [Pr(D|A) Pr(A) + Pr(D|B) Pr(B) + Pr(D|C) Pr(C)]
        = (0.05)(0.25) / [(0.05)(0.25) + (0.04)(0.35) + (0.02)(0.40)]
        ≈ 0.3623.
We conclude that given a defective bolt chosen at random, the probability that
it came from machine A is 36.23%. You should calculate the other two probabil-
ities and then verify that the three sum to 1. Why should they?
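You can let the computer do the other two calculations and the check. A Python sketch (mine, not from the notes):

```python
priors = {"A": 0.25, "B": 0.35, "C": 0.40}       # Pr(machine made the bolt)
defect_rate = {"A": 0.05, "B": 0.04, "C": 0.02}  # Pr(D|machine)

# Pr(D) by the total probability theorem, then Bayes' rule per machine.
p_D = sum(priors[m] * defect_rate[m] for m in priors)
posterior = {m: defect_rate[m] * priors[m] / p_D for m in priors}

print({m: round(p, 4) for m, p in posterior.items()})
# The three posteriors sum to 1: the defective bolt came from *some* machine,
# and the events "made by A", "made by B", "made by C" are mutually exclusive
# and exhaustive.
assert abs(sum(posterior.values()) - 1) < 1e-12
```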
Let A be the event that a “woman has been abused by her husband”, let M be
the event that the “woman was murdered”, and let H be the event “murdered by
her husband.” We know (from sociological research) that (a) 5% of women are
abused by their husbands, (b) 0.5% of women are murdered, (c) 0.25% of women
are murdered by their husbands, (d) 90% of women who are murdered by their
husbands have been abused by them, (e) a woman who is murdered but not by
her husband is neither more nor less likely to have been abused by her husband
than a randomly selected woman.2
Suppose a woman is found murdered and the prosecution has ascertained
that she was abused by her husband. What is the probability that she was mur-
dered by her husband?
We have to be careful how we ask this question. Are we interested in the
probability that a husband murders his wife given that he has abused her? Or
are we interested in the probability that a husband has murdered his wife given
that he has abused her and that she was murdered? The answer depends on
whether you are working for the defense team or the prosecution. Let’s calculate
both probabilities.
First, what is the probability that an abusive husband murders his wife?
Pr(H|A) = Pr(A|H) Pr(H) / [Pr(A|H) Pr(H) + Pr(A|¬H) Pr(¬H)].

We know that Pr(A|H) = 0.9 from (d) and Pr(H) = 0.0025 from (c). By (e),
Pr(A|¬H) = Pr(A) = 0.05 from (a), and Pr(¬H) = 0.9975. Hence

Pr(H|A) = (0.9)(0.0025) / [(0.9)(0.0025) + (0.05)(0.9975)] ≈ 0.0432.

That is, the probability that an abusive husband murders his wife is a measly
4.32%, which would seem like a strong argument for the husband's presumed
innocence.
However, this ignores one piece of additional information we have. Namely,
the fact that event M has occurred. What, then, is the probability that a husband
murders his wife given that he has abused her and that she was murdered?
Pr(H|AM) = Pr(AMH) / Pr(AM)
         = Pr(AMH) / [Pr(AMH) + Pr(AM¬H)]
         = Pr(A|MH) Pr(MH) / [Pr(A|MH) Pr(MH) + Pr(A|M¬H) Pr(M¬H)]
         = Pr(A|MH) Pr(H|M) Pr(M) / [Pr(A|MH) Pr(H|M) Pr(M) + Pr(A|M¬H) Pr(¬H|M) Pr(M)]
         = Pr(A|MH) Pr(H|M) / [Pr(A|MH) Pr(H|M) + Pr(A|M¬H) Pr(¬H|M)]
         = Pr(A|MH) Pr(H|M) / [Pr(A|MH) Pr(H|M) + Pr(A)(1 − Pr(H|M))]
         = Pr(A|H) Pr(H|M) / [Pr(A|H) Pr(H|M) + Pr(A)(1 − Pr(H|M))]
         = Pr(A|H) Pr(H) / [Pr(A|H) Pr(H) + Pr(A)(Pr(M) − Pr(H))],

where the last line follows from multiplying through by Pr(M) and using
Pr(H|M) = Pr(H)/Pr(M), which holds because H implies M.
As before, we know that Pr(A|H) = 0.9 from (d), and that Pr(H) = 0.0025 from
(c). Further, Pr(A) = 0.05 from (a), and Pr(M) = 0.005 from (b). Calculating the
probability gives us Pr(H|AM) = 0.9474. That is, the probability that an abusive
husband has killed his murdered wife is a whopping 94.74%, which would seem
like a strong argument for the husband's guilt.
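Both numbers are easy to verify numerically. A Python sketch (mine, not from the notes), using the two final formulas derived above:

```python
p_A = 0.05          # Pr(A): woman abused by her husband
p_M = 0.005         # Pr(M): woman murdered
p_H = 0.0025        # Pr(H): woman murdered by her husband
p_A_given_H = 0.90  # Pr(A|H)

# Defense's number: Pr(H|A), ignoring the fact that she was murdered.
p_H_given_A = (p_A_given_H * p_H
               / (p_A_given_H * p_H + p_A * (1 - p_H)))
# Prosecution's number: Pr(H|AM), conditioning on the murder as well.
p_H_given_AM = (p_A_given_H * p_H
                / (p_A_given_H * p_H + p_A * (p_M - p_H)))

print(round(p_H_given_A, 4), round(p_H_given_AM, 4))  # 0.0432 0.9474
```

The huge jump from 4.32% to 94.74% comes entirely from conditioning on the additional event M.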
Let’s just apply Bayes’ Rule to an interesting example. You are a contestant in a
game show. There are three closed doors, with a car behind one and dead goats
behind the other two. You may choose any door. Since you have no information
other than that, your prior about where the car is (we assume you prefer the car
to a dead goat) is 1/3 probability for each door. So you pick door 1. Monty (the
show host) opens door 2 and shows you that there is a dead goat behind it. He
then asks, “Would you now like to change your choice?” What should you do?3
There are really two cases depending on what you want to assume about the
way Monty makes his choices. We shall show that the answer to the question
depends on whether you assume that he chooses randomly from all doors ex-
cept the one you picked, or non-randomly by never choosing to open the door
with the car behind it.
Consider the contest with just three doors and suppose (without loss of gener-
ality) that the contestant chooses door 1. Monty opens door 2 and shows him
the goat. What next? We want to know if the contestant can gain from switching
his choice to door 3. So we have to compare the probability of winning a car by
staying with door 1, and the probability of winning by switching to door 3.
Let’s get some notation to facilitate exposition. Let A be the event “car is
behind door 1,” let C be the event “car is behind door 3,” and let B be the event
“Monty picks door 2 and there is a goat behind it.” Since from the contestant’s
initial perspective, the car is equally likely to be behind any of the three doors,
Pr(A) = Pr(C) = 1/3. We want to know Pr(A|B) and Pr(C|B). Bayes rule tells us
that:
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B) = 1/3 × Pr(B|A)/Pr(B)
Pr(C|B) = Pr(B|C) Pr(C) / Pr(B) = 1/3 × Pr(B|C)/Pr(B).
First, suppose Monty randomly picks one of the remaining doors. The probabil-
ity that he picks door 2 is then 1/2. The probability that there’s a goat behind
any given door is 1 − 1/3 = 2/3, so the probability that Monty picks door 2 and it
has a goat behind it is Pr(B) = 1/2 × 2/3 = 1/3 because the two events are indepen-
dent by our assumption that Monty picks randomly. We now need to determine
Pr(B|A) and Pr(B|C). If the car is behind door 1, then door 2 certainly has a goat
behind it, and hence the probability that Monty picks it and it has a goat behind
it will equal the probability that Monty picks it: Pr(B|A) = 1/2. Similarly, if the
car is behind door 3, then door 2 also has a goat behind it for sure, yielding
Pr(B|C) = 1/2. Putting all this together yields:
Pr(A|B) = (1/3 × 1/2) / (1/3) = 1/2
Pr(C|B) = (1/3 × 1/2) / (1/3) = 1/2.
3
This problem caused quite a controversy on the Internet a (long) while back until it was
finally banned from the newsgroups. The problem was that most people who posted it did not
realize there are really two cases they should deal with. Fortunately, we do realize that (an
example of strategic thinking), so we shall cover both of them.
Hence, the contestant cannot gain from switching. This is not surprising: if
Monty makes a random choice, his action reveals nothing.
Suppose now that Monty would never pick a door with a car behind it (much
more likely in a real show). As before, Pr(B|A) = 1/2 because if the car is behind
door 1, then both doors 2 and 3 certainly have goats behind them, and so Monty
will pick randomly between them. However, Pr(B|C) = 1 because if the car is
behind door 3, then door 2 certainly has a goat behind it, but since Monty never
reveals the car (and cannot open door 1 because the contestant picked it), he
will certainly pick door 2. Finally, because Monty will never pick a door with a
car behind it, Pr(B) = 1/2. You can use the total probability theorem to obtain
this number. Let “2” denote the event “the car is behind door 2.” Then:

Pr(B) = Pr(B|A) Pr(A) + Pr(B|2) Pr(2) + Pr(B|C) Pr(C) = (1/2)(1/3) + (0)(1/3) + (1)(1/3) = 1/2.

We used the fact that the probability of the event “Monty picks door 2 and it
has a goat behind it” is zero if there is a car behind that door: Pr(B|2) = 0. This
now yields:
Pr(A|B) = (1/2 × 1/3) / (1/2) = 1/3
Pr(C|B) = (1 × 1/3) / (1/2) = 2/3.
Since Pr(C|B) > Pr(A|B), the contestant should definitely switch. The reason is
that Monty’s informed choice reveals additional information which the contes-
tant should incorporate in his beliefs.
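If the algebra feels slippery, both cases can be checked by Monte Carlo simulation. This Python sketch is mine, not part of the notes; by symmetry it conditions on Monty showing a goat behind *some* unchosen door rather than door 2 specifically:

```python
import random

def win_rate(switch, informed, trials=200_000, seed=7):
    """Estimated win probability. The contestant picks door 0; Monty opens
    one of the other two doors. If `informed`, he never opens the car door;
    otherwise he picks at random and we condition on a goat being shown."""
    rng = random.Random(seed)
    wins = valid = 0
    for _ in range(trials):
        car = rng.randrange(3)
        options = [1, 2]
        if informed:
            options = [d for d in options if d != car]
        opened = rng.choice(options)
        if opened == car:
            continue  # random Monty revealed the car: event B did not occur
        valid += 1
        chosen = (3 - opened) if switch else 0  # 3 - opened is the other door
        wins += (chosen == car)
    return wins / valid

# Informed host: switching wins ~2/3, staying ~1/3.
# Random host: both are ~1/2, conditional on a goat being shown.
for informed in (True, False):
    print(informed, win_rate(True, informed), win_rate(False, informed))
```

The simulated frequencies match the posteriors computed above for both assumptions about Monty.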
Suppose there are n ≥ 3 doors. Let Ak be the event “car is behind door k.” Let
Bk be the event “Monty chooses door k, which contestant did not choose, and
door k has a goat behind it.” The prior probability is Pr(Ak ) = 1/n for all k.
Bayes’ Rule then gives
Pr(Ai|Bk) = Pr(Bk|Ai) Pr(Ai) / Pr(Bk) = (1/n) × Pr(Bk|Ai)/Pr(Bk).    (3)
Without loss of generality, suppose that the contestant chooses door 1, and so
he wins if A1 . That is, Pr(Win) = 1/n.
Random Choice. Suppose Monty chooses randomly from all remaining doors
k > 1. He will open a particular door with probability 1/(n − 1), and since the
probability that it has a goat behind it is 1 − (1/n) = (n − 1)/n, it means that
Pr(Bk ) = [(n − 1)/n][1/(n − 1)] = 1/n for all k > 1. If A1 , then door k certainly
has a goat behind it, and so Pr(Bk |A1 ) = 1/(n − 1). From (3), we have
Pr(A1|Bk) = [1/(n − 1)] / [n(1/n)] = 1/(n − 1) for all k > 1.
Since Monty picks randomly (as if he did not know what door had the car behind
it), he chooses doors 2, 3, . . . , n with equal probability. Moreover, whatever door
he chooses certainly has a goat behind it since k ≠ i. From Pr(Bk |Ai ) = 1/(n−1)
and (3) we get
Pr(Ai|Bk) = [1/(n − 1)] / [n(1/n)] = 1/(n − 1) for i ≠ k, i > 1.
That is, if Ai for i > 1, then for k ≠ 1, i, the probability of Bk conditional on Ai
is 1/(n − 1), which is exactly the same as Pr(A1 |Bk ). The contestant cannot gain
from switching.
Non-Random Choice. Of course, one would expect Monty to know where
the car is and never to reveal it when he opens a door, for that would spoil the
fun. Suppose therefore that he never picks the door with the car behind it. We now
have Pr(Bk |A1 ) = 1/(n − 1), as before, but Pr(Bk ) is now also 1/(n − 1) because
Monty will certainly pick one of the n − 1 doors without a car behind it. From
(3) we have
Pr(A1|Bk) = [1/(n − 1)] / [n · 1/(n − 1)] = 1/n for all k > 1.
For i > 1 and k ≠ i, Pr(Bk |Ai ) = 1/(n − 2) since Monty must now choose
randomly among all doors except 1 and i. From (3) we get
Pr(Ai|Bk) = [1/(n − 2)] / [n · 1/(n − 1)] = (n − 1)/[n(n − 2)] for i ≠ k, i > 1.
Since (n − 1)/[n(n − 2)] > 1/n, the contestant should switch. The idea is that
when acting strategically (i.e. non-random selection), Monty conveys additional
information by opening a door. Since Monty knows there’s no car behind the
door he picks, the contestant is able to update his estimate about where the car
actually is. A random pick by Monty, of course, does not reveal anything the
contestant did not know before.
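The general-n formulas are easy to tabulate. A small Python sketch (mine, not from the notes):

```python
from fractions import Fraction

def posteriors(n):
    """(stay, switch) win probabilities with n doors after the informed host
    opens one goat door, per the formulas derived above."""
    stay = Fraction(1, n)                  # Pr(A1|Bk)
    switch = Fraction(n - 1, n * (n - 2))  # Pr(Ai|Bk) for i != 1, k
    return stay, switch

for n in (3, 4, 10, 100):
    stay, switch = posteriors(n)
    assert switch > stay  # switching always helps, though the edge shrinks
    print(n, stay, switch)
```

At n = 3 the gap is 1/3 versus 2/3; as n grows, both probabilities shrink and the advantage of switching becomes small.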
3 Probability Distributions
are probabilistic. As we shall see next time, this defines a “lottery” over the
outcomes, which is just another probability distribution. Also, suppose the
decision-maker has three different options to choose from. The choice of each
option is called a “pure strategy.” An important concept is the “mixed strat-
egy” where the decision-maker randomizes among the pure strategies. That
is, instead of choosing one of them with certainty, the decision-maker chooses
among them according to some probability distribution. Mixed strategies are
but probability distributions over the space of pure strategies.
As we shall see, it is often too cumbersome to work with outcomes and events
directly, and it is much more convenient to associate numbers with them. A nu-
merical representation is called a random variable.4 That is, a random variable
X is a function that maps all possible outcomes to real numbers, or X : S → R.
That is, X(s) = x means that the real number x corresponds to outcome s ∈ S.
Events then become sets of real numbers, and we can define probability
distributions over random variables. Letting A denote any subset of R, and
Pr(X ∈ A) = Pr({s ∈ S : X(s) ∈ A}) denote the probability that X is in this set, the
probability distribution specifies Pr(X ∈ A) for all A.
The most common way of specifying a probability distribution is with a prob-
ability function. When a random variable can take only a finite number of
values, we say that it has a discrete distribution. When it has a discrete
distribution, the probability function is defined as f (x) = Pr(X = x) for any
x ∈ R. If A is the set of all possible values that X can take, then f (x) = 0 for
any x ∉ A. We also require that f (x) ∈ [0, 1] and Σ_{x∈A} f (x) = 1.
For example, consider an experiment that consists of five tosses of a biased
coin that comes up heads with probability 1/3 and tails with probability 2/3. The
sample space has 2^5 = 32 elements, so I won't enumerate it. We are interested
in the total number of heads, so let X denote that number. For example,
X(HT T HH) = 3. Clearly, the set of all possible realizations of the random
variable is A = {0, 1, 2, 3, 4, 5}. The probability function takes any x ∈ A and
returns the probability associated with that realization of X.
What is f (5)? There is only one outcome with 5 heads, and it requires that the
coin comes up heads in each and every toss. The probability of this event is
(1/3)^5 = 1/243. Similarly, the probability of the event no heads at all is f (0) =
(2/3)^5 = 32/243. You could enumerate the rest, or you could use the well-known
formula for the binomial distribution that returns the probability of x successes
in n trials, where each trial results in success with probability p and failure with
probability 1 − p:
f (x) = C(n, x) p^x (1 − p)^(n−x)
4
The usage of “random” does not correspond to the everyday one. Colloquially, we take
“random” to be synonymous with “equally likely.” With a random variable, we mean that it
takes values in a non-deterministic way. That is, when we repeat an experiment, we do not
know for sure what the outcome will be. There is no requirement that the possible outcomes
are equally likely.
for all x = 0, 1, . . . , n. Recalling that C(n, x) = n!/[x!(n − x)!], we can easily compute other
values of f . For example,

f (2) = C(5, 2) (1/3)^2 (2/3)^3 = 10 × 1/9 × 8/27 = 80/243.
Let's check. There are C(5, 2) = 10 outcomes with exactly two heads: HHTTT,
HTHTT, HTTHT, HTTTH, THHTT, THTHT, THTTH, TTHHT, TTHTH, TTTHH. Each
of these has the same probability of occurring, and it is the probability of get-
ting two heads and three tails, which is (1/3)^2 × (2/3)^3 = 8/243. Since these 10
outcomes are mutually exclusive, we just add the 10 individual probabilities and
obtain the answer above.
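The hand count agrees with the formula, which a short sketch can confirm exactly (Python, mine; `math.comb` computes the binomial coefficient):

```python
from fractions import Fraction
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p_heads = Fraction(1, 3)  # the biased coin from the text
f = {x: binom_pmf(x, 5, p_heads) for x in range(6)}

assert f[2] == Fraction(80, 243)  # matches the enumeration above
assert f[0] == Fraction(32, 243)
assert f[5] == Fraction(1, 243)
assert sum(f.values()) == 1       # a probability function must sum to 1
```

Using `Fraction` rather than floats reproduces the exact 243rds from the text.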
What happens if X can take on a continuum of values, such as any value in an
interval? In this case, we say it is a continuous random variable or that it has a
continuous distribution. Defining the probability function is a bit trickier now
because Pr(X = x) = 0 for every x.
Instead, the non-negative function f is such that for any interval A = [a, b],
Pr(X ∈ A) = Pr(a ≤ X ≤ b) = ∫_a^b f (x) dx.
That is, the probability density function (pdf) f is some function such that the
area under its curve between two points a and b in the range of X equals the
probability that X will take on some value between these points. It is important
to realize that the pdf f may not return values that are probabilities because
unlike the discrete case, it does not assign probabilities directly. Rather, its
integral assigns these probabilities. Hence, the two requirements for a pdf are
f (x) ≥ 0 and ∫_{−∞}^{∞} f (x) dx = 1.
One very common and convenient distribution is the uniform (rectangular)
distribution defined on some interval [a, b]. When X is distributed uniformly,
the probability that it lies in some subset of the interval with a given length
equals the probability that it lies in any other subset that has the same length.
This means that the pdf must have the same value, say f (x) = c, for any point
x ∈ [a, b] and 0 everywhere else. We now have:
∫_{−∞}^{∞} f (x) dx = ∫_a^b f (x) dx = ∫_a^b c dx = 1.
We now need to find the precise value of c, which we can obtain by solving the
equation:

∫_a^b c dx = 1  ⟹  c(b − a) = 1  ⟹  c = 1/(b − a).
Hence, the uniform pdf is defined as follows:

f (x) = 1/(b − a) if x ∈ [a, b], and f (x) = 0 otherwise.
It is worth repeating (yet again!) that the pdf does not give us a probability
value; its integral does. One very common and convenient interval we shall see
repeatedly is [0, 1], in which case the pdf reduces to f (x) = 1 for all x ∈ [0, 1].
3.1 Cumulative Distributions
The cumulative distribution function (cdf) F (x) = Pr(X ≤ x) gives the probability
that X takes a value of at most x; in the discrete case it is the sum F (x) =
Σ_{u≤x} f (u). The continuous case is analogous except that instead of summing
we take the integral:

F (x) = ∫_{−∞}^{x} f (u) du.

Conversely, differentiating the cdf recovers the density:

f (x) = dF (x)/dx.
In other words, the pdf f (x) is the first derivative of the cdf F (x) with respect
to x, wherever the derivative exists. Hence, a continuous random variable can be
represented either by the pdf or the cdf.
Returning to the uniform distribution example, consider the cumulative probability
distribution function F , for which F (a) = 0 (i.e. the probability of a number
less than a is 0) and F (b) = 1 (the probability of a number at most equal to b is
1). So, F (x) represents the probability that X will take on a value at most equal
to x with the property that subsets of the interval that have the same size have
the same probability. We use the pdf we derived above and obtain:
F (x) = ∫_{−∞}^{x} f (u) du = ∫_a^x [1/(b − a)] du = (x − a)/(b − a).
Suppose a = 0 and b = 2. The probability that x ∈ [0.25, 0.35] is F (0.35) − F (0.25) =
0.35/2 − 0.25/2 = 0.05, which is the same as the probability that x ∈ [1.9, 2],
which is, of course, why the distribution is called “uniform.” When the interval
is [0, 1] (i.e. a = 0 and b = 1), then F (x) = x.
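A sketch of the uniform cdf (Python, mine, not part of the notes) makes the equal-length property concrete:

```python
def uniform_cdf(x, a, b):
    """F(x) = (x - a)/(b - a) on [a, b], clamped to 0 below a and 1 above b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 0, 2
p1 = uniform_cdf(0.35, a, b) - uniform_cdf(0.25, a, b)
p2 = uniform_cdf(2.0, a, b) - uniform_cdf(1.9, a, b)
# Subintervals of equal length get equal probability under the uniform.
assert abs(p1 - 0.05) < 1e-12 and abs(p2 - 0.05) < 1e-12
```

The comparison uses a tolerance rather than `==` only because of floating-point rounding; the two probabilities are identical.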
When the number of possible values is uncountably infinite, the probability of
any particular one is 0, and so we must rely on a cumulative distribution function
to describe probabilities of sets of events. In the discrete case, to recover the
probability for x from F , all we need to do is calculate F (x) − F (x′), i.e. the
difference between the values of F at x and at the next smaller event x′. In the
continuous case, we have Pr(X = x) = lim_{ε→0} [F (x + ε) − F (x)].
Let’s recap. Every random variable has a distribution function. There are a few
things about these functions that are useful to remember. First, the cumulative
distribution function (cdf) is a probability, so its values always lie in the interval
between zero and one:
0 ≤ F (x) ≤ 1.
Also, the cdf must be non-decreasing: if x ≤ y, then F (x) ≤ F (y).
The density function, on the other hand, is not necessarily a probability! That
is, it can easily take on values that exceed one. However, it cannot be negative,
and so we have:
f (x) ≥ 0 for all x.
One more important property is that the area under the curve defined by this
function must be exactly one (this follows immediately from the definition of
the cdf):
∫_{−∞}^{∞} f (x) dx = 1.
One crucial concept we shall see over and over again should already be familiar
to you if you’ve taken the intro to statistics class. This is the idea of expectation.
The expected value of a random variable generalizes the mean value (or, in
everyday parlance, the average). If X is a continuous variable with density f (·),
then its expected value is:

E [X] = ∫_{−∞}^{∞} x f (x) dx,

and if X is discrete with probability function f (·), the integral is replaced by the
sum E [X] = Σ_x x f (x).
For example, go back to the number of heads from five tosses of a biased coin.
What is the expected number of heads? Letting xi denote i heads:
E [X] = Σ_{i=0}^{5} xi f (xi)
      = 0(32/243) + 1(80/243) + 2(80/243) + 3(40/243) + 4(10/243) + 5(1/243)
      = 405/243.
We should expect to get fewer than 2 heads (approximately 1.67) in five consec-
utive tosses of the biased coin.
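The same expectation in code (a Python sketch of mine), checked against the binomial mean n·p:

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 3)  # probability of heads on each toss
f = {x: comb(5, x) * p**x * (1 - p)**(5 - x) for x in range(6)}

# E[X] = sum of x*f(x); for a binomial this also equals n*p.
mean = sum(x * f[x] for x in f)
assert mean == Fraction(405, 243) == Fraction(5, 3) == 5 * p
print(float(mean))  # ≈ 1.67
```

That 405/243 reduces to 5/3 is no accident: the expected number of successes in n trials with success probability p is always n·p.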
We can generalize the idea of expectation to functions, and so the expected
value of a function is:
E [g(X)] = ∫_{−∞}^{∞} g(x) f (x) dx
for the continuous case, where g(x) is some function of the random variable
whose density is given by f (x). The analogous case for discrete variables (which
we shall be dealing with almost exclusively) is:
E [g(X)] = Σ_x g(x) f (x).

The expectation operator has several handy properties. For any two constants
a and b:

E [a] = a,
E [aX + b] = a E [X] + b.

A bit more generally, you should remember that
E [aX + bY + c] = a E [X] + b E [Y ] + c.
E [aX · bY ] = ab E [XY ] .
This you can get by applying the summation rules and the fact that the expec-
tation is linear. All of these will come in handy very soon when we prove the
expected utility theorem.
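These rules can be verified on a small discrete distribution. A sketch (Python, mine, not part of the notes) using two independent fair dice:

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two independent fair dice: each pair has weight 1/36.
outcomes = list(product(range(1, 7), repeat=2))
w = Fraction(1, len(outcomes))

def E(g):
    """Expectation of g(x, y) under the joint distribution."""
    return sum(g(x, y) * w for x, y in outcomes)

a, b, c = 2, -3, 5
EX, EY = E(lambda x, y: x), E(lambda x, y: y)

assert E(lambda x, y: a) == a                                    # E[a] = a
assert E(lambda x, y: a * x + b) == a * EX + b                   # E[aX+b]
assert E(lambda x, y: a * x + b * y + c) == a * EX + b * EY + c  # linearity
assert E(lambda x, y: (a * x) * (b * y)) == a * b * E(lambda x, y: x * y)
```

Note that the last identity, E[aX · bY] = ab E[XY], holds for any joint distribution; it does not require X and Y to be independent.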