Homework 2 AI


HOMEWORK 2

Introduction to Artificial Intelligence


Shumin Guo
1 Exercise 13.7
Consider the set of all possible five-card poker hands dealt fairly from a standard deck of fifty-two
cards.
a. How many atomic events are there in the joint probability distribution (i.e., how many five-card
hands are there)?
Since the order of the cards in a hand does not matter, the total number of hands is the number of
ways of choosing five cards from fifty-two:

C(52, 5) = 52! / (5! (52 − 5)!) = (8.06581752 × 10^67) / (120 × 2.58623242 × 10^59) = 2,598,960.
b. What is the probability of each atomic event?
Since all the atomic events happen with equal probability, the probability of one atomic event is

1 / 2,598,960 ≈ 3.84769292 × 10^−7.
c. What is the probability of being dealt a royal straight flush? Four of a kind?

A royal straight flush is the combination A, K, Q, J, 10, all of the same suit, so there are exactly
four royal straight flushes:

A(♠), K(♠), Q(♠), J(♠), 10(♠)
A(♥), K(♥), Q(♥), J(♥), 10(♥)
A(♦), K(♦), Q(♦), J(♦), 10(♦)
A(♣), K(♣), Q(♣), J(♣), 10(♣)

The probability of being dealt a royal straight flush is therefore

4 / 2,598,960 ≈ 1.53907717 × 10^−6.

For four of a kind there are 13 choices of rank and 48 choices for the remaining card, giving
13 × 48 = 624 hands and a probability of 624 / 2,598,960 ≈ 2.40 × 10^−4.
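These counts are easy to verify directly; the short Python sketch below (assuming Python 3.8+ for math.comb) reproduces the numbers above.

    # Quick numerical check of the hand counts and probabilities above.
    from math import comb

    total_hands = comb(52, 5)
    print(total_hands)              # 2598960 five-card hands
    print(1 / total_hands)          # probability of one atomic event, ~3.85e-07
    print(4 / total_hands)          # probability of a royal straight flush, ~1.54e-06
    print(13 * 48 / total_hands)    # probability of four of a kind, ~2.40e-04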
2 Exercise 13.11
We wish to transmit an n-bit message to a receiving agent. The bits in the message are independently
corrupted (flipped) during transmission with probability ε each. With an extra parity bit sent along
with the original information, a message can be corrected by the receiver if at most one bit in the
entire message (including the parity bit) has been corrupted. Suppose we want to ensure that the
correct message is received with probability at least 1 − δ. What is the maximum feasible value of n?
Calculate this value for the case ε = 0.001, δ = 0.01.

To ensure that the message is received correctly with probability at least 1 − δ, the total probability
of zero-bit and one-bit errors must be at least 1 − δ.
Let P(0) denote the probability of zero bit errors and P(1) the probability of exactly one bit error
among the n + 1 transmitted bits. We have

P(0) = C(n + 1, 0) ε^0 (1 − ε)^(n+1) = (1 − ε)^(n+1)
P(1) = C(n + 1, 1) ε^1 (1 − ε)^n = (n + 1) ε (1 − ε)^n
and the requirement for correct reception is

P(0) + P(1) = (1 − ε)^(n+1) + (n + 1) ε (1 − ε)^n = (1 + nε)(1 − ε)^n ≥ 1 − δ.

For ε = 0.001 and δ = 0.01,

(1 + 0.001n)(1 − 0.001)^n ≥ 1 − 0.01
(1 + 0.001n)(0.999)^n ≥ 0.99,

which holds up to n = 147; at n = 148 the left-hand side is ≈ 0.989998, just below 0.99. So the
maximum feasible value of n is 147.
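The boundary value can be confirmed numerically with a short sketch:

    # Find the largest n with (1 + n*eps) * (1 - eps)**n >= 1 - delta.
    eps, delta = 0.001, 0.01

    n = 0
    while (1 + (n + 1) * eps) * (1 - eps) ** (n + 1) >= 1 - delta:
        n += 1
    print(n)  # 147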
3 Exercise 13.17
Show that the statement of conditional independence
P(X, Y |Z) = P(X|Z)P(Y |Z)
is equivalent to each of the statements
P(X|Y, Z) = P(X|Z) and P(Y |X, Z) = P(Y |Z).
For the equivalence of P(X, Y|Z) = P(X|Z)P(Y|Z) and P(X|Y, Z) = P(X|Z):

(a) P(X, Y|Z) = P(X|Z)P(Y|Z) ⇒ P(X|Y, Z) = P(X|Z):

P(X, Y|Z) = P(X|Z)P(Y|Z)
P(X, Y, Z)/P(Z) = P(X|Z)P(Y|Z)
P(X|Y, Z)P(Y|Z) = P(X|Z)P(Y|Z)      (since P(X, Y, Z)/P(Z) = P(X|Y, Z)P(Y, Z)/P(Z) = P(X|Y, Z)P(Y|Z))
P(X|Y, Z) = P(X|Z)

(b) P(X|Y, Z) = P(X|Z) ⇒ P(X, Y|Z) = P(X|Z)P(Y|Z):

P(X|Y, Z) = P(X|Z)
P(X, Y, Z)/P(Y, Z) = P(X|Z)
P(X, Y|Z)/P(Y|Z) = P(X|Z)
P(X, Y|Z) = P(X|Z)P(Y|Z)
For the equivalence of P(X, Y|Z) = P(X|Z)P(Y|Z) and P(Y|X, Z) = P(Y|Z):

(a) P(X, Y|Z) = P(X|Z)P(Y|Z) ⇒ P(Y|X, Z) = P(Y|Z):

P(X, Y|Z) = P(X|Z)P(Y|Z)
P(X, Y, Z)/P(Z) = P(X|Z)P(Y|Z)
P(Y|X, Z)P(X|Z) = P(X|Z)P(Y|Z)
P(Y|X, Z) = P(Y|Z)

(b) P(Y|X, Z) = P(Y|Z) ⇒ P(X, Y|Z) = P(X|Z)P(Y|Z):

P(Y|X, Z) = P(Y|Z)
P(X, Y|Z)/P(X|Z) = P(Y|Z)
P(X, Y|Z) = P(X|Z)P(Y|Z)
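As a numerical sanity check (a sketch, not part of the proof), we can build a joint distribution in which X and Y are conditionally independent given Z by construction and confirm that P(X|Y, Z) = P(X|Z) holds; the parameter values below are arbitrary.

    # Build P(x, y, z) = P(z) P(x|z) P(y|z) and check P(X | Y, Z) = P(X | Z).
    from itertools import product

    p_z = {0: 0.3, 1: 0.7}
    p_x1_z = {0: 0.2, 1: 0.6}   # P(X = 1 | z)
    p_y1_z = {0: 0.9, 1: 0.4}   # P(Y = 1 | z)

    def joint(x, y, z):
        px = p_x1_z[z] if x else 1 - p_x1_z[z]
        py = p_y1_z[z] if y else 1 - p_y1_z[z]
        return p_z[z] * px * py

    for x, y, z in product((0, 1), repeat=3):
        p_x_given_yz = joint(x, y, z) / sum(joint(xx, y, z) for xx in (0, 1))
        p_x_given_z = (sum(joint(x, yy, z) for yy in (0, 1)) /
                       sum(joint(xx, yy, z) for xx in (0, 1) for yy in (0, 1)))
        assert abs(p_x_given_yz - p_x_given_z) < 1e-12
    print("P(X|Y,Z) = P(X|Z) holds for this distribution")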
4 Exercise 13.18
Suppose you are given a bag containing n unbiased coins. You are told that n − 1 of these coins
are normal, with heads on one side and tails on the other, whereas one coin is a fake, with heads
on both sides.

a. Suppose you reach into the bag, pick out a coin at random, flip it, and get a head. What is the
(conditional) probability that the coin you chose is the fake coin?
Let H denote heads, T tails, C_normal the event that the chosen coin is normal, and C_fake the event
that it is the fake. Because the coin is drawn uniformly at random and the normal coins are unbiased,
the priors and likelihoods are:

P(C_normal) = (n − 1)/n
P(C_fake) = 1/n
P(H | C_normal) = 0.5
P(T | C_normal) = 0.5
P(H | C_fake) = 1
P(T | C_fake) = 0
Using these probabilities we can calculate

P(H) = P(H | C_normal)P(C_normal) + P(H | C_fake)P(C_fake)
     = 0.5 × (n − 1)/n + 1 × 1/n
     = 0.5(n + 1)/n.

The probability that the coin is the fake one, given that a single flip of the randomly chosen coin came
up heads, is P(C_fake | H), and by Bayes' rule

P(C_fake | H) = P(H | C_fake)P(C_fake) / P(H) = (1 × 1/n) / (0.5(n + 1)/n) = 2/(n + 1).
b. Suppose you continue flipping the coin for a total of k times after picking it and see k heads.
Now what is the conditional probability that you picked the fake coin?

The conditional probability of having the fake coin can be written as P(C_fake | H_1, …, H_k), where
H_1, …, H_k denotes the k observed heads. The flips are independent given the coin, so by Bayes' rule

P(C_fake | H_1, …, H_k) = P(H_1, …, H_k | C_fake)P(C_fake) / P(H_1, …, H_k)
  = P(H | C_fake)^k P(C_fake) / [P(H | C_normal)^k P(C_normal) + P(H | C_fake)^k P(C_fake)]
  = (1^k × 1/n) / (0.5^k × (n − 1)/n + 1^k × 1/n)
  = 1 / (0.5^k (n − 1) + 1)
  = 2^k / (2^k + n − 1).
c. Suppose you wanted to decide whether the chosen coin was fake by flipping it k times. The
decision procedure returns fake if all k flips come up heads; otherwise it returns normal. What
is the (unconditional) probability that this procedure makes an error?

The procedure errs only when the chosen coin is actually a normal coin and nevertheless shows k heads
in a row; the fake coin always shows heads, so it can never be misreported as normal. The unconditional
error probability is therefore

P(error) = P(C_normal) P(H_1, …, H_k | C_normal) = ((n − 1)/n) × 0.5^k = (n − 1)/(n 2^k).

(For comparison, the conditional probability that the procedure is wrong given that it returned fake is
P(C_normal | H_1, …, H_k) = 1 − 2^k/(2^k + n − 1) = (n − 1)/(2^k + n − 1).)

Either way, for a fixed number of coins n, the error probability becomes smaller as k grows.
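The closed-form results above are easy to evaluate numerically; the sketch below uses the illustrative values n = 10 and k = 3, which are not part of the exercise.

    # Check the fake-coin posterior and error probability for example n, k.
    n, k = 10, 3

    p_fake, p_normal = 1 / n, (n - 1) / n
    lik_fake, lik_normal = 1.0 ** k, 0.5 ** k            # P(k heads | coin type)

    posterior_fake = lik_fake * p_fake / (lik_fake * p_fake + lik_normal * p_normal)
    print(posterior_fake, 2 ** k / (2 ** k + n - 1))     # both ~0.4706

    error = p_normal * lik_normal                        # normal coin, yet k heads in a row
    print(error, (n - 1) / (n * 2 ** k))                 # both 0.1125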
5 Exercise 13.21
(Adapted from Pearl (1988).) Suppose you are a witness to a nighttime hit-and-run accident
involving a taxi in Athens. All taxis in Athens are blue or green. You swear, under oath, that
the taxi was blue. Extensive testing shows that, under the dim lighting conditions, discrimination
between blue and green is 75% reliable.
a. Is it possible to calculate the most likely color for the taxi? (Hint: distinguish carefully between
the proposition that the taxi is blue and the proposition that it appears blue.)
According to the description of the problem, let B denote that the taxi is blue, G that it is green,
LB that the taxi looks blue, and LG that it looks green. We then have the following:

The prior probabilities of blue and green are unknown:
P(B) = ?
P(G) = ?

Discrimination between blue and green:
P(LB | B) = 0.75
P(LG | B) = 0.25
P(LG | G) = 0.75
P(LB | G) = 0.25
The most likely color for the taxi is determined by P(B | LB), and according to Bayes' rule

P(B | LB) = P(LB | B)P(B) / P(LB).

To evaluate this formula we need the prior probability P(B), which is not given, so it is impossible
to calculate the most likely color for the taxi.
b. What if you know that 9 out of 10 Athenian taxis are green?

Now we have P(B) = 0.1 and P(G) = 0.9, and we can calculate the probability of the most likely
color as follows:

P(B | LB) = P(LB | B)P(B) / P(LB)
          = P(LB | B)P(B) / [P(LB | B)P(B) + P(LB | G)P(G)]
          = (0.75 × 0.1) / (0.75 × 0.1 + 0.25 × 0.9)
          = 0.075 / 0.3
          = 0.25.

Hence P(G | LB) = 1 − P(B | LB) = 1 − 0.25 = 0.75, so the taxi is most likely green despite the
witness's report.
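A one-screen sketch of the part (b) computation:

    # Posterior color probabilities given the report "looks blue".
    p_blue, p_green = 0.1, 0.9
    p_lb_given_blue, p_lb_given_green = 0.75, 0.25

    evidence = p_lb_given_blue * p_blue + p_lb_given_green * p_green
    print(p_lb_given_blue * p_blue / evidence)     # P(Blue | looks blue)  = 0.25
    print(p_lb_given_green * p_green / evidence)   # P(Green | looks blue) = 0.75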
6 Exercise 13.22
Text categorization is the task of assigning a given document to one of a fixed set of categories on
the basis of the text it contains. Naive Bayes models are often used for this task. In these models,
the query variable is the document category, and the effect variables are the presence or absence
of each word in the language; the assumption is that words occur independently in documents,
with frequencies determined by the document category.
a. Explain precisely how such a model can be constructed, given as training data a set of docu-
ments that have been assigned to categories.
The probability model of a document classifier can be represented as the conditional distribution
p(c | f_1, f_2, …, f_n), where c is the class (category) variable of the document and f_1, …, f_n is the
vector of features, for example the words that appear in the document.
Using Bayes' theorem, we have

p(c | f_1, f_2, …, f_n) = p(c) p(f_1, f_2, …, f_n | c) / p(f_1, f_2, …, f_n).
Under the Naive Bayes model the features are conditionally independent given the class, so we can
rewrite the above formula as

p(c | f_1, f_2, …, f_n) = p(c) ∏_{i=1}^{n} p(f_i | c) / p(f_1, f_2, …, f_n).

Since the denominator does not depend on the class c, and the feature values are fixed by the
document being classified, the denominator is constant as far as the classifier is concerned and can
be ignored. The final classifier is therefore

p(c | f_1, f_2, …, f_n) ∝ p(c) ∏_{i=1}^{n} p(f_i | c).
In text classification our goal is to find the best class for a given document d, and the best class in
Naive Bayes classification is the most likely, or maximum a posteriori (MAP), class c_map:

c_map = argmax_{c ∈ C} p(c | d) = argmax_{c ∈ C} p(c) ∏_{1 ≤ i ≤ n} p(f_i | c).

It is usually easier to compute with log probabilities; because the logarithm is a monotonically
increasing function, the maximization can equivalently be written as

c_map = argmax_{c ∈ C} [log p(c) + Σ_{1 ≤ i ≤ n} log p(f_i | c)].
It is not possible to know the true values of p(c) and p(f_i | c), but we can estimate them from the
given training data.

p(c) is the relative frequency of class c; more frequent classes are more likely to be the correct class
than infrequent classes. The sum of the log prior and the term weights is a measure of how much
evidence there is for the document being in the class. We use the estimate

p(c) = N_c / N,

where N_c is the total number of training documents in class c and N is the total number of training
documents. Similarly, the conditional probability p(f | c) can be estimated as the relative frequency
of feature f (here, a word) in documents belonging to class c:

p(f | c) = N_cf / Σ_{f′ ∈ F} N_cf′,

where N_cf is the number of occurrences of feature f in class c and Σ_{f′ ∈ F} N_cf′ is the total count
of all features occurring in class c.
In summary, to build this Naive Bayes classifier from the given training data we need to do the
following (see the sketch below):

(a) Extract the bag of words that appear in the training data.
(b) Count the total number of documents and the number of documents in each class c.
(c) Count the occurrences of each word in each class.
(d) Estimate the prior p(c) and the conditional probabilities p(f_i | c).
(e) Score each class with the log posterior log p(c) + Σ_{1 ≤ i ≤ n} log p(f_i | c).
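The following is a minimal sketch of such a classifier in Python. The class name NaiveBayesText and the add-one (Laplace) smoothing used for words with zero counts are illustrative assumptions, not something specified in the exercise.

    # Minimal multinomial Naive Bayes text classifier (sketch, with add-one smoothing).
    import math
    from collections import Counter, defaultdict

    class NaiveBayesText:
        def fit(self, docs, labels):
            self.classes = set(labels)
            self.vocab = set(w for d in docs for w in d.split())
            self.doc_counts = Counter(labels)           # N_c
            self.word_counts = defaultdict(Counter)     # N_cf
            for d, c in zip(docs, labels):
                self.word_counts[c].update(d.split())
            self.total_docs = len(docs)                 # N
            return self

        def log_posterior(self, doc, c):
            # log p(c) + sum_i log p(f_i | c), with add-one smoothing
            score = math.log(self.doc_counts[c] / self.total_docs)
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in doc.split():
                score += math.log((self.word_counts[c][w] + 1) / total)
            return score

        def predict(self, doc):
            return max(self.classes, key=lambda c: self.log_posterior(doc, c))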
b. Explain precisely how to categorize a new document.
To categorize a new document, we follow the steps below (a usage example is given after the list):

- Extract the words (features) from the new document.
- For each class c, calculate the log prior probability log p(c).
- For each class c, add the summed log conditional probabilities of the document's words,
  Σ_{1 ≤ i ≤ n} log p(f_i | c).
- Return the class that maximizes p(c | f_1, f_2, …, f_n); this is the most likely class of the new
  document given the training data.
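For instance, continuing the NaiveBayesText sketch above (the documents and categories here are made-up examples):

    # Hypothetical usage of the NaiveBayesText sketch.
    train_docs = ["cheap pills buy now", "meeting agenda attached",
                  "buy cheap meds", "project meeting notes"]
    train_labels = ["spam", "ham", "spam", "ham"]

    clf = NaiveBayesText().fit(train_docs, train_labels)
    print(clf.predict("buy cheap pills"))   # expected: "spam"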
c. Is the conditional independence assumption reasonable? Discuss.
In reality the conditional independence assumption does not hold for text data: terms are
conditionally dependent on each other. In practice, however, Naive Bayes classifiers perform
remarkably well. The reason is that the classification decision does not depend on accurate
probability estimates, only on which class receives the highest relative estimate. So even though
the independence assumption produces badly calibrated probabilities, the estimates usually still
rank the correct class highest over all classes. Thus, although Naive Bayes estimates probabilities
poorly, it can classify very well.
7 Exercise 14.1
We have a bag of three biased coins a, b, and c with probabilities of coming up heads of 20%, 60%,
and 80%, respectively. One coin is drawn randomly from the bag (with equal likelihood of drawing
each of the three coins), and then the coin is flipped three times to generate the outcomes X_1, X_2,
and X_3.
a. Draw the Bayesian network corresponding to this setup and define the necessary CPTs.

Please see the Bayesian network in Figure 1. Conditioned on which coin was drawn, X_1, X_2 and
X_3 are independent and have the same distribution over heads (H) and tails (T), so a single CPT
(the Flip Result table in the figure) serves all three flip variables, with no links among them.

Figure 1: Bayesian network for coin flipping.
The CPTs are given in Table 1; because X_1, X_2 and X_3 have the same probability tables, only the
CPT of X_1 is listed.
b. Calculate which coin was most likely to have been drawn from the bag if the observed flips come
out heads twice and tails once.

Probability         Value
P(a)                1/3
P(b)                1/3
P(c)                1/3
P(X_1 = H | a)      0.2
P(X_1 = T | a)      0.8
P(X_1 = H | b)      0.6
P(X_1 = T | b)      0.4
P(X_1 = H | c)      0.8
P(X_1 = T | c)      0.2

Table 1: CPT for the coin-flipping Bayesian network.

The observed flips give heads with empirical frequency 2/3 and tails with frequency 1/3. We need to
calculate the posterior probability of the three coin types given this evidence,
P(C|H, H, T)
where C is the type of coin, which can be a, b or c. Using Bayes' rule and the conditional
independence of the flips given the coin, we can rewrite this probability as

P(C | H, H, T) = P(H, H, T | C)P(C) / P(H, H, T) = P(H|C)P(H|C)P(T|C)P(C) / P(H, H, T).
Because the denominator P(H, H, T) is independent of the coin type and is a constant for the given
evidence, we can simply ignore it. So

P(C | H, H, T) ∝ P(H|C)P(H|C)P(T|C)P(C).
Then we can calculate the (unnormalized) posterior probability for each coin as

P(a | H, H, T) ∝ P(H|a)P(H|a)P(T|a)P(a) = 0.2 × 0.2 × 0.8 × 1/3 ≈ 0.0107
P(b | H, H, T) ∝ P(H|b)P(H|b)P(T|b)P(b) = 0.6 × 0.6 × 0.4 × 1/3 = 0.048
P(c | H, H, T) ∝ P(H|c)P(H|c)P(T|c)P(c) = 0.8 × 0.8 × 0.2 × 1/3 ≈ 0.0427

P(b | H, H, T) is the largest of the three, so coin b is the most likely coin to have been drawn from
the bag.
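A short sketch of this MAP computation:

    # Unnormalized posteriors P(coin | H, H, T) for the three coins.
    p_heads = {"a": 0.2, "b": 0.6, "c": 0.8}
    prior = 1 / 3

    posterior = {c: prior * p * p * (1 - p) for c, p in p_heads.items()}  # two heads, one tail
    print(posterior)                          # approx {'a': 0.0107, 'b': 0.048, 'c': 0.0427}
    print(max(posterior, key=posterior.get))  # 'b'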
8 Exercise 14.4
Consider the Bayesian network in Figure 2.
a. If no evidence is observed, are Burglary and Earthquake independent? Prove this from the
numerical semantics and from the topological semantics.
Yes, they are independent if no evidence is observed.
For the numerical semantics, summing the joint distribution defined by the network over Alarm,
JohnCalls and MaryCalls leaves

P(Burglary, Earthquake) = P(Burglary | parents(Burglary)) P(Earthquake | parents(Earthquake))
                        = P(Burglary) P(Earthquake),

since neither Burglary nor Earthquake has any parents. So Burglary and Earthquake are independent.

Figure 2: A typical Bayesian network, showing both the topology and the conditional
probability tables (CPTs). In the CPTs, the letters B, E, A, J, and M stand for
Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls, respectively.

The topological semantics specifies that each variable is conditionally independent of its
non-descendants, given its parents. As can be seen from Figure 2, Earthquake is a non-descendant of
Burglary and Burglary has no parents, so the two variables are also independent under the
topological semantics.
b. If we observe Alarm = true, are Burglary and Earthquake independent? Justify your answer by
calculating whether the probabilities involved satisfy the definition of conditional independence.

No, they are not independent. We need to test whether P(B, E|A) = P(B|A)P(E|A) holds.

P(A) = P(A|B, E)P(B)P(E) + P(A|B, ¬E)P(B)P(¬E) + P(A|¬B, E)P(¬B)P(E) + P(A|¬B, ¬E)P(¬B)P(¬E)
     = 0.95 × 0.001 × 0.002 + 0.94 × 0.001 × 0.998 + 0.29 × 0.999 × 0.002 + 0.001 × 0.999 × 0.998
     = 0.0000019 + 0.00093812 + 0.00057942 + 0.00099700
     ≈ 0.0025164

P(B|A) = P(A|B)P(B) / P(A)
       = [P(A|B, E)P(B)P(E) + P(A|B, ¬E)P(B)P(¬E)] / P(A)
       = (0.95 × 0.001 × 0.002 + 0.94 × 0.001 × 0.998) / 0.0025164
       = 0.00094002 / 0.0025164
       ≈ 0.3736

P(E|A) = P(A, E) / P(A)
       = [P(A|B, E)P(B)P(E) + P(A|¬B, E)P(¬B)P(E)] / P(A)
       = (0.95 × 0.001 × 0.002 + 0.29 × 0.999 × 0.002) / 0.0025164
       = 0.00058132 / 0.0025164
       ≈ 0.2310

P(B|A)P(E|A) ≈ 0.3736 × 0.2310 ≈ 0.0863

P(B, E|A) = P(A|B, E)P(B)P(E) / P(A) = (0.95 × 0.001 × 0.002) / 0.0025164 ≈ 0.00076

So P(B, E|A) ≠ P(B|A)P(E|A), which means Burglary and Earthquake are not independent given
Alarm = true.
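A quick numerical re-check of these figures (a sketch that just re-evaluates the sums above):

    # Verify P(B,E|A) != P(B|A) * P(E|A) for the CPT values used above.
    P_B, P_E = 0.001, 0.002
    P_A = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}   # P(Alarm=true | B, E)

    def pb(b): return P_B if b else 1 - P_B
    def pe(e): return P_E if e else 1 - P_E

    p_a = sum(P_A[(b, e)] * pb(b) * pe(e) for b in (True, False) for e in (True, False))
    p_b_given_a = sum(P_A[(True, e)] * pb(True) * pe(e) for e in (True, False)) / p_a
    p_e_given_a = sum(P_A[(b, True)] * pb(b) * pe(True) for b in (True, False)) / p_a
    p_be_given_a = P_A[(True, True)] * pb(True) * pe(True) / p_a

    print(p_b_given_a * p_e_given_a)   # ~0.0863
    print(p_be_given_a)                # ~0.00076 -> different, so B and E are dependent given A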
9 Exercise 14.11
In your local nuclear power station, there is an alarm that senses when a temperature gauge
exceeds a given threshold. The gauge measures the temperature of the core. Consider the Boolean
variables A (alarm sounds), F_A (alarm is faulty), and F_G (gauge is faulty) and the multivalued
nodes G (gauge reading) and T (actual core temperature).
a. Draw a Bayesian network for this domain, given that the gauge is more likely to fail when the
core temperature gets too high.

Please find the Bayesian network in Figure 3.

Figure 3: Bayesian network for the nuclear power station alarm system.
b. Is your network a polytree? Why or why not?

No, this is not a polytree, because the three nodes T, G and F_G form a loop (T → G, T → F_G and
F_G → G), so the underlying undirected graph is no longer a tree.
c. Suppose there are just two possible actual and measured temperatures, normal and high; the
probability that the gauge gives the correct temperature is x when it is working, but y when it
is faulty. Give the conditional probability table associated with G.
Let T denote a normal core temperature and ¬T a high one; let G denote a normal gauge reading
and ¬G a high reading; and let F_G denote a faulty gauge and ¬F_G a working gauge. The CPT for
G is given in Table 2.

Probability            Value
P(G | T, F_G)          y
P(G | ¬T, F_G)         1 − y
P(G | T, ¬F_G)         x
P(G | ¬T, ¬F_G)        1 − x

Table 2: CPT for G in the nuclear power station alarm network.
d. Suppose the alarm works correctly unless it is faulty, in which case it never sounds. Give the
conditional probability table associated with A.
Please see the CPT in Table 3. The parents of A are the gauge reading G and the alarm-fault
variable F_A; the alarm sounds only when the gauge reads high and the alarm is not faulty.

                 G                 ¬G
           F_A      ¬F_A      F_A      ¬F_A
A           0         0        0         1
¬A          1         1        1         0

Table 3: CPT for A in the nuclear power station alarm network.
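These two tables can also be written down programmatically; the sketch below encodes them as dictionaries keyed by the parent values, with illustrative numbers substituted for x and y.

    # Tables 2 and 3 as Python dictionaries keyed by the parent assignments.
    x, y = 0.95, 0.30   # example values: gauge correct w.p. x when working, y when faulty

    # P(G = "reads normal" | temperature_normal?, gauge_faulty?)
    p_g_normal = {
        (True, True): y,        # normal temperature, faulty gauge
        (False, True): 1 - y,   # high temperature, faulty gauge
        (True, False): x,       # normal temperature, working gauge
        (False, False): 1 - x,  # high temperature, working gauge
    }

    # P(A = true | gauge_reads_normal?, alarm_faulty?): sounds only for a high reading, working alarm
    p_alarm = {(g, fa): (1.0 if (not g and not fa) else 0.0)
               for g in (True, False) for fa in (True, False)}
    print(p_alarm[(False, False)])   # 1.0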
e. Suppose the alarm and gauge are working and the alarm sounds. Calculate an expression for the
probability that the temperature of the core is too high, in terms of the various conditional
probabilities in the network.

The evidence is ¬F_A, ¬F_G and A = true. Since the alarm is working and sounds, the CPT for A
(Table 3) tells us that the gauge reading must be high (¬G); A and F_A influence T only through
this information, so it suffices to condition on ¬F_G and ¬G. The probability of interest is therefore
P(¬T | ¬F_G, ¬G), the probability that the core temperature is high given that the gauge is working
and reads high. Using Bayes' rule and the conditional probability table of G, we have

P(¬T | ¬F_G, ¬G) = P(¬G | ¬T, ¬F_G) P(¬F_G | ¬T) P(¬T) / P(¬F_G, ¬G).

Similarly, for a normal core temperature,

P(T | ¬F_G, ¬G) = P(¬G | T, ¬F_G) P(¬F_G | T) P(T) / P(¬F_G, ¬G).
10 Exercise 14.14
Consider the Bayes net shown in Figure 4.
Figure 4: A simple Bayes net with Boolean variables B = BrokeElectionLaw, I = Indicted, M =
PoliticallyMotivatedProsecutor, G = FoundGuilty, J = Jailed.
a. Which of the following are asserted by the network structure?
i. P(B, I, M) = P(B)P(I)P(M).
ii. P(J|G) = P(J|G, I).
iii. P(M|G, B, I) = P(M|G, B, I, J).
(ii) and (iii) are asserted by the network structure; (i) is not.

For (i), nodes B and M are parents of I (a common-effect structure), so the network only asserts
P(B, I, M) = P(B)P(M)P(I|B, M), which is not P(B)P(I)P(M) in general.
For (ii), I, G and J form a causal chain, so when G is given, J is independent of I, which is
P(J|G) = P(J|G, I).
For (iii), there are two paths from M to J, namely M → I → G → J and M → G → J. Both are causal
chains through G, so when G is given, M and J are independent, that is, P(M|G, B, I) = P(M|G, B, I, J).
b. Calculate the value of P(b, i, ¬m, g, j).

(As an aside, the marginal probability of indictment is
P(i) = P(i|b, m)P(b)P(m) + P(i|b, ¬m)P(b)P(¬m) + P(i|¬b, m)P(¬b)P(m) + P(i|¬b, ¬m)P(¬b)P(¬m)
     = 0.9 × 0.9 × 0.1 + 0.5 × 0.9 × 0.9 + 0.5 × 0.1 × 0.1 + 0.1 × 0.1 × 0.9
     = 0.081 + 0.405 + 0.005 + 0.009 = 0.5,
although this marginal is not needed for the joint entry.)

Using the Bayesian network joint probability (chain) rule with the CPT entries,

P(b, i, ¬m, g, j) = P(b)P(¬m)P(i|b, ¬m)P(g|b, i, ¬m)P(j|g)
                  = 0.9 × 0.9 × 0.5 × 0.8 × 0.9
                  = 0.2916.
c. Calculate the probability that someone goes to jail given that they broke the law, have been
indicted, and face a politically motivated prosecutor.

We need to calculate P(j|b, i, m). Since J depends directly only on G, we condition on G and sum
it out:

P(j|b, i, m) = P(j|g)P(g|b, i, m) + P(j|¬g)P(¬g|b, i, m)
             = 0.9 × 0.9 + 0 × 0.1
             = 0.81,

using P(g|b, i, m) = 0.9, P(j|g) = 0.9 and P(j|¬g) = 0 from the network CPTs.
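A quick enumeration check of parts (b) and (c); the CPT entries below are the ones used in the calculations above (P(j|¬g) = 0 is taken from the figure and assumed here).

    # Re-derive parts (b) and (c) from the CPT entries used above.
    P_b, P_m = 0.9, 0.1
    P_i_given = {(True, True): 0.9, (True, False): 0.5}              # P(i | b, m)
    P_g_given = {(True, True, True): 0.9, (True, True, False): 0.8}  # P(g | b, i, m)
    P_j_given_g = {True: 0.9, False: 0.0}                            # P(j | g)

    # (b) P(b, i, not-m, g, j) by the chain rule
    joint = (P_b * (1 - P_m) * P_i_given[(True, False)]
             * P_g_given[(True, True, False)] * P_j_given_g[True])
    print(joint)   # 0.2916

    # (c) P(j | b, i, m) by summing out G
    p_g = P_g_given[(True, True, True)]
    print(P_j_given_g[True] * p_g + P_j_given_g[False] * (1 - p_g))   # 0.81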
d. A context-specific independence (see page 542) allows a variable to be independent of some
of its parents given certain values of others. In addition to the usual conditional independences
given by the graph structure, what context-specific independences exist in the Bayes net in
Figure 4?

J ⊥ B | G
J ⊥ I | G
J ⊥ M | G
B ⊥ M
e. Suppose we want to add the variable P = PresidentialPardon to the network; draw the new
network and briefly explain any links you add.

Please see the new Bayesian network in Figure 5. Node P is added as a parent of J, because a
presidential pardon directly influences only whether a person is jailed; the other variables are not
affected by it.

Figure 5: Bayesian network with PresidentialPardon.
11 Exercise 14.2
Equation (14.1) on page 513 defines the joint distribution represented by a Bayesian network in
terms of the parameters θ(X_i | Parents(X_i)). This exercise asks you to derive the equivalence
between the parameters θ and the conditional probabilities P(X_i | Parents(X_i)) from this definition.

a. Consider a simple network X → Y → Z with three Boolean variables. Use Equations (13.3)
and (13.6) (pages 485 and 492) to express the conditional probability P(z|y) as the ratio of two
sums, each over entries in the joint distribution P(X, Y, Z).
Equation (13.3) says P(a|b) = P(a, b)/P(b), and Equation (13.6) says P(Y) = Σ_z P(Y, z). (The
network also tells us that X ⊥ Z | Y, i.e. P(X, Z|Y) = P(X|Y)P(Z|Y).) Using the two equations,

P(z|y) = P(z, y) / P(y) = Σ_x P(x, y, z) / Σ_{x, z′} P(x, y, z′).
b. Now use Equation (14.1) to write this expression in terms of the network parameters θ(X), θ(Y|X),
and θ(Z|Y).

Equation (14.1) says P(x_1, x_2, …, x_n) = ∏_{i=1}^{n} θ(x_i | parents(X_i)). Substituting it into the
ratio above gives

Σ_x P(x, y, z) / Σ_{x, z′} P(x, y, z′) = Σ_x θ(x)θ(y|x)θ(z|y) / Σ_{x, z′} θ(x)θ(y|x)θ(z′|y).
c. Next, expand out the summations in your expression from part (b), writing out explicitly
the terms for the true and false values of each summed variable. Assuming that all network
parameters satisfy the constraint Σ_{x_i} θ(x_i | parents(X_i)) = 1, show that the resulting expression
reduces to θ(x|y) (which should read θ(z|y)).

Expanding the sums over the Boolean variables x and z′ gives

numerator:   θ(z|y) [θ(x)θ(y|x) + θ(¬x)θ(y|¬x)]
denominator: [θ(x)θ(y|x) + θ(¬x)θ(y|¬x)] [θ(z|y) + θ(¬z|y)].

By the constraint, θ(z|y) + θ(¬z|y) = 1, so the common factor θ(x)θ(y|x) + θ(¬x)θ(y|¬x) cancels and

Σ_x θ(x)θ(y|x)θ(z|y) / Σ_{x, z′} θ(x)θ(y|x)θ(z′|y) = θ(z|y).
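A numerical sanity check of this identity for the chain X → Y → Z, with arbitrary parameter values:

    # Verify that P(z|y) computed from the joint equals theta(z|y).
    theta_x = 0.3                          # theta(X = 1)
    theta_y_given_x = {1: 0.7, 0: 0.2}     # theta(Y = 1 | x)
    theta_z_given_y = {1: 0.9, 0: 0.4}     # theta(Z = 1 | y)

    def pr(p_true, val):                   # probability of val when P(...=1) = p_true
        return p_true if val else 1 - p_true

    def joint(x, y, z):
        return pr(theta_x, x) * pr(theta_y_given_x[x], y) * pr(theta_z_given_y[y], z)

    y, z = 1, 1
    num = sum(joint(x, y, z) for x in (0, 1))
    den = sum(joint(x, y, zz) for x in (0, 1) for zz in (0, 1))
    print(num / den, theta_z_given_y[y])   # both 0.9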
d. Generalize this derivation to show that θ(X_i | Parents(X_i)) = P(X_i | Parents(X_i)) for any
Bayesian network.

In any Bayesian network the joint probability is P(x_1, x_2, …, x_n) = ∏_{i=1}^{n} θ(x_i | parents(X_i)),
which is Equation (14.1). Writing P(x_i | parents(X_i)) as a ratio of two sums over this joint, exactly
as in part (a), and expanding as in part (c), every parameter θ(x_j | parents(X_j)) with j ≠ i either
sums to 1 by the normalization constraint or appears as a common factor of the numerator and
denominator and cancels. The only factor left is θ(x_i | parents(X_i)), so it is easy to see that
θ(X_i | Parents(X_i)) = P(X_i | Parents(X_i)).
12 Exercise 14.7
The Markov blanket of a variable is defined on page 517. Prove that a variable is independent
of all other variables in the network given its Markov blanket, and derive Equation (14.12) (page 538):

P(x_i′ | mb(X_i)) = α P(x_i′ | parents(X_i)) ∏_{Y_j ∈ Children(X_i)} P(y_j | parents(Y_j)).
Proof. By the d-separation theorem and the topological semantics of independence, and considering
first a polytree-structured Bayesian network, we have the following assertions.

X_i is independent of its other ancestors given its parents:

P(x_i | ancestors(X_i)) = P(x_i | parents(X_i)).

Given the children of X_i, X_i is independent of any further descendants of those children:

P(x_i | descendants(X_i)) = P(x_i | children(X_i)).

If a node is connected to X_i through a v-structure whose middle node is a child of X_i, then

P(x_i | descendants(X_i)) = P(x_i | children(X_i), parents(children(X_i)) \ {X_i}).

Finally, if there is no path at all between X_i and a node X_j, then the two are simply independent.
The four cases above cover all possible relationships between X_i and any other node. Generalizing
from the polytree to an arbitrary directed acyclic graph (DAG), in which there can be multiple paths
between X_i and another node, and denoting the Markov blanket by

MB(X_i) = parents(X_i) ∪ children(X_i) ∪ [parents(children(X_i)) \ {X_i}],

we have that a variable is independent of all other variables given its Markov blanket.
Therefore

P(x_i′ | mb(X_i)) = α P(x_i′ | parents(X_i)) ∏_{Y_j ∈ Children(X_i)} P(y_j | parents(Y_j)). □