Probabilistic Modelling and Reasoning
Alexander McMurray
Student Number: 1367329
February 25, 2014
1 Question 1 (32%)
(a) Write down the joint distribution defined by the belief network
[Belief network diagram appears here; node labels D and C are visible in the original.]
(b) Suppose you observe that the temperature sensor reads cold and that the wheels are not rotating. Using simple probability rules, efficiently compute the probability of the rover being stuck. In other words, calculate P(S = stuck | V = static, T = cold). Show your working. One possible way to tackle this question is to use the elimination algorithm. Do not construct a junction tree for this question.
By Bayes' rule,
\[
P(S=\text{st}\mid V=\text{st}, T=\text{co}) = \frac{P(S=\text{st}, V=\text{st}, T=\text{co})}{P(V=\text{st}, T=\text{co})}.
\]
The numerator is obtained by writing the joint distribution in its belief-network factorisation, fixing the observed states and pushing the sums over the unobserved variables inwards, which leaves $P(S=\text{st})$ outside the sums and factors such as $P(I\mid C)$ inside them (equations 1.2 and 1.3).
We now begin eliminating the variables and computing their messages, beginning with I:
\[
m_I(C) = \sum_{I} P(I\mid C) = 1. \tag{1.4}
\]
Similarly for $J$ (these are the values with the evidence and $S=\text{st}$ fixed):
\[
m_J(C=\text{hi}) = 0.3416, \qquad m_J(C=\text{lo}) = 0.2378. \tag{1.10--1.12}
\]
And for $C$, combining its incoming messages and summing out $C$ gives $m_C$ (equations 1.13 and 1.14), which, together with the remaining factors, yields the numerator $P(S=\text{st}, V=\text{st}, T=\text{co}) = 0.2897$.
To calculate the denominator, $P(V=\text{st}, T=\text{co})$, we must also sum over $S$. This changes $m_J(C)$, as it will now depend on $S$, so we must recalculate it as $m_J(C, S)$:
\[
m_J(C=\text{hi}, S=\text{st}) = 0.3416, \qquad m_J(C=\text{lo}, S=\text{st}) = 0.2378, \tag{1.15, 1.16}
\]
as before, and
\begin{align}
m_J(C=\text{hi}, S=\text{fr}) &= (1 \times 0.2030 \times 0.1) + (0 \times 0.3570 \times 0.9) = 0.0203, \tag{1.17}\\
m_J(C=\text{lo}, S=\text{fr}) &= (1 \times 0.1670 \times 0.1) + (0.8 \times 0.2730 \times 0.9) = 0.2133. \tag{1.18}
\end{align}
Finally, eliminating $S$ gives the denominator $P(V=\text{st}, T=\text{co}) = 0.4065$.
Now that we have both the numerator and the denominator, we can finally calculate the desired conditional probability:
\[
P(S=\text{st}\mid V=\text{st}, T=\text{co}) = \frac{0.2897}{0.4065} = 0.7127. \tag{1.22}
\]
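As a sanity check of the ratio structure in these equations, the same kind of conditional can be computed by brute force from any joint table over the variables. The sketch below uses a purely hypothetical, randomly generated joint table (the assignment's actual CPTs are not reproduced here) and assumes index 1 encodes the states stuck, static and cold.

```matlab
% Brute-force illustration of the ratio: numerator and denominator are
% sums over the same joint table, here a hypothetical random one.
joint = rand(2, 2, 2, 4);                 % dims: S, V, T, remaining variables
joint = joint / sum(joint(:));            % normalise to a valid distribution
stuck = 1; st = 1; co = 1;                % assumed state encodings
numerator   = sum(squeeze(joint(stuck, st, co, :)));       % P(S = st, V = st, T = co)
denominator = sum(sum(squeeze(joint(:, st, co, :))));      % P(V = st, T = co)
pStuckGivenEvidence = numerator / denominator;             % P(S = st | V = st, T = co)
```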
(c) Suppose that after observing V = static and T = hot, and calculating P (S = stuck|V = static, T = hot),
you had the option of observing one extra variable in order to obtain a more accurate posterior distribution
for S. Are there any variables whose observation would lead to no extra information, and would therefore be
poor choices? In other words, is S conditionally independent of any other node given V and T ? Show your
reasoning.
No, because S is d-connected to every other node given V and T. This is because V is a collider, so including it in the conditioning set allows S to be d-connected to the other nodes. Meanwhile, T is not a collider, but it has no descendants that would become d-separated from S through the observation of T.
We can state possible paths explicitly:
Connection   Path
S-J          S-V-J
S-C          S-V-C
S-I          S-V-C-I
S-D          S-V-C-D
(d) Convert the Belief Network to a Markov Network. Give one conditional independence relationship that is true in the Belief Network, but is not specified by the Markov Network.
Conversion from a belief network to a Markov network is done by moralising the graph (connecting together
any nodes which share a child) and changing all the directed edges to undirected edges.
[Moralised Markov network diagram appears here; node labels D and C are visible in the original.]
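As an illustration of the moralisation step described above, the following sketch operates on a hypothetical DAG given as an adjacency matrix; it is not the assignment's network, just the general procedure (marry co-parents, then drop edge directions).

```matlab
% Moralisation of a hypothetical DAG: A(i, j) = 1 means a directed edge i -> j.
A = [0 0 1 0;                               % illustrative 4-node DAG: 1 -> 3, 2 -> 3, 3 -> 4
     0 0 1 0;
     0 0 0 1;
     0 0 0 0];
M = A | A';                                 % drop edge directions
for k = 1:size(A, 1)
    parents = find(A(:, k));                % parents of node k
    M(parents, parents) = 1;                % connect ("marry") co-parents
end
M(logical(eye(size(M)))) = 0;               % no self-loops in the Markov network
```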
2 Question 2 (32%)
The EM algorithm is an iterative procedure: at each step we wish to obtain an updated estimate of the parameters $\theta$ that maximises the difference
\[
L(\theta) - L(\theta_n) = \ln P(X\mid\theta) - \ln P(X\mid\theta_n), \tag{2.2}
\]
where $L(\theta) = \ln P(X\mid\theta)$ denotes the log-likelihood and $\theta_n$ is the current (i.e. old, as the updated value will be the $(n+1)$-th iterate) estimate of $\theta$.
We can now introduce the latent, unobserved variables $Z$. In the case of the mixture model, $Z$ denotes which component a given datapoint belongs to (i.e. which distribution is responsible for generating that datapoint). Thus we can state the total probability in terms of the latent variables:
\[
P(X\mid\theta) = \sum_Z P(X\mid Z,\theta)\,P(Z\mid\theta). \tag{2.3}
\]
Substituting this into the difference above gives
\[
L(\theta) - L(\theta_n) = \ln \sum_Z P(X\mid Z,\theta)\,P(Z\mid\theta) - \ln P(X\mid\theta_n). \tag{2.4}
\]
It is known from Jensen's inequality (since $-\ln(x)$ is convex and thus $\ln(x)$ is concave) that
\[
\ln\!\left(\sum_{i=1}^{n} \lambda_i x_i\right) \;\ge\; \sum_{i=1}^{n} \lambda_i \ln(x_i) \tag{2.5}
\]
for constants $\lambda_i \ge 0$ with $\sum_{i=1}^{n} \lambda_i = 1$.
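As a quick numerical check of (2.5), take $n = 2$, $\lambda = (\tfrac{1}{2}, \tfrac{1}{2})$ and $x = (1, 4)$: the left-hand side is $\ln(2.5) \approx 0.916$, while the right-hand side is $\tfrac{1}{2}\ln 1 + \tfrac{1}{2}\ln 4 \approx 0.693$, so the inequality holds.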
We may apply this to Equation (2.4), letting $\lambda_i$ take the form $P(Z\mid X,\theta_n)$, which as a probability measure satisfies the requirements that $P(Z\mid X,\theta_n) \ge 0$ and $\sum_Z P(Z\mid X,\theta_n) = 1$. Therefore, starting from Equation (2.4), we obtain:
\begin{align}
L(\theta) - L(\theta_n) &= \ln \sum_Z P(X\mid Z,\theta)\,P(Z\mid\theta) - \ln P(X\mid\theta_n) \nonumber\\
&= \ln \sum_Z P(X\mid Z,\theta)\,P(Z\mid\theta)\,\frac{P(Z\mid X,\theta_n)}{P(Z\mid X,\theta_n)} - \ln P(X\mid\theta_n) \nonumber\\
&\ge \sum_Z P(Z\mid X,\theta_n)\,\ln\frac{P(X\mid Z,\theta)\,P(Z\mid\theta)}{P(Z\mid X,\theta_n)} - \ln P(X\mid\theta_n) \tag{2.6}\\
&= \sum_Z P(Z\mid X,\theta_n)\,\ln\frac{P(X\mid Z,\theta)\,P(Z\mid\theta)}{P(Z\mid X,\theta_n)\,P(X\mid\theta_n)} \tag{2.7}\\
&\triangleq \Delta(\theta\mid\theta_n). \tag{2.8}
\end{align}
In the step from Equation (2.6) to Equation (2.7) we used the fact that $\sum_Z P(Z\mid X,\theta_n) = 1$, so that $\ln P(X\mid\theta_n) = \sum_Z P(Z\mid X,\theta_n)\,\ln P(X\mid\theta_n)$ and hence $\ln P(X\mid\theta_n)$ may be brought inside the summation.
We now define
\[
l(\theta\mid\theta_n) \triangleq L(\theta_n) + \Delta(\theta\mid\theta_n), \tag{2.9}
\]
so that $L(\theta) \ge l(\theta\mid\theta_n)$. Therefore $l(\theta\mid\theta_n)$ is bounded from above by $L(\theta)$. Furthermore, we can show that:
\begin{align}
l(\theta_n\mid\theta_n) &= L(\theta_n) + \Delta(\theta_n\mid\theta_n) \nonumber\\
&= L(\theta_n) + \sum_Z P(Z\mid X,\theta_n)\,\ln\frac{P(X\mid Z,\theta_n)\,P(Z\mid\theta_n)}{P(Z\mid X,\theta_n)\,P(X\mid\theta_n)} \nonumber\\
&= L(\theta_n) + \sum_Z P(Z\mid X,\theta_n)\,\ln\frac{P(X, Z\mid\theta_n)}{P(X, Z\mid\theta_n)} \nonumber\\
&= L(\theta_n) + \sum_Z P(Z\mid X,\theta_n)\,\ln 1 \nonumber\\
&= L(\theta_n). \tag{2.10}
\end{align}
Thus $l(\theta\mid\theta_n)$ and $L(\theta)$ are equal at the point $\theta = \theta_n$. Since we wish to maximise $L(\theta)$, and we know that $L(\theta)$ is an upper bound on $l(\theta\mid\theta_n)$ and that the two functions are equal at our current estimate $\theta_n$, any new value $\theta_{n+1}$ which increases the value of $l(\theta\mid\theta_n)$ must also increase the value of $L(\theta)$.
Therefore our update rule can be stated as:
\begin{align}
\theta_{n+1} &= \arg\max_{\theta}\,\{\, l(\theta\mid\theta_n) \,\} \nonumber\\
&= \arg\max_{\theta} \left\{ L(\theta_n) + \sum_Z P(Z\mid X,\theta_n)\,\ln\frac{P(X\mid Z,\theta)\,P(Z\mid\theta)}{P(Z\mid X,\theta_n)\,P(X\mid\theta_n)} \right\} \nonumber\\
&= \arg\max_{\theta} \left\{ \sum_Z P(Z\mid X,\theta_n)\,\ln\big[ P(X\mid Z,\theta)\,P(Z\mid\theta) \big] \right\} \qquad \text{(dropping terms constant in $\theta$)} \nonumber\\
&= \arg\max_{\theta} \left\{ \sum_Z P(Z\mid X,\theta_n)\,\ln\frac{P(X, Z, \theta)}{P(Z, \theta)}\,\frac{P(Z, \theta)}{P(\theta)} \right\} \nonumber\\
&= \arg\max_{\theta} \left\{ \sum_Z P(Z\mid X,\theta_n)\,\ln P(X, Z\mid\theta) \right\}. \tag{2.11}
\end{align}
Therefore the EM algorithm consists of two steps that are iterated until convergence:
1. E-step: compute the expected complete-data log-likelihood $Q(\theta\mid\theta_n) = E_{Z\mid X,\theta_n}\{\ln P(X, Z\mid\theta)\}$.
2. M-step: maximise $Q(\theta\mid\theta_n)$ with respect to $\theta$.
So to derive the update rules for the mixture of multivariate Bernoullis we wish to calculate $\arg\max_{\theta} Q(\theta\mid\theta^{\text{old}})$ with respect to each of the parameters: the vectors of probabilities $\mathbf{p}_m$ and the mixing proportions $\pi_m$.
The complete-data log-likelihood may be determined as follows:
\[
L = \ln P(X, Z\mid \mathbf{p}, \boldsymbol{\pi}) = \sum_{i=1}^{N} \sum_{m=1}^{M} \mathbb{I}[z_i = m] \left( \ln \pi_m + \sum_{d=1}^{D} \big[ x_{id} \ln p_{md} + (1 - x_{id}) \ln(1 - p_{md}) \big] \right), \tag{2.12}
\]
where $\mathbb{I}[S]$ is the indicator function, equal to one if the statement $S$ is true and zero otherwise.
Therefore we have:
\[
Q(\theta\mid\theta^{\text{old}}) = \sum_{i=1}^{N} \sum_{m=1}^{M} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}}) \left( \ln \pi_m + \sum_{d=1}^{D} \big[ x_{id} \ln p_{md} + (1 - x_{id}) \ln(1 - p_{md}) \big] \right). \tag{2.13}
\]
Differentiating $Q$ with respect to the components of $\mathbf{p}_m$, i.e. the gradient $\nabla_{\mathbf{p}_m} Q = \left( \partial Q/\partial p_{m1}, \ldots, \partial Q/\partial p_{mD} \right)$, and setting each component to zero gives
\[
\frac{\partial Q}{\partial p_{md}} = \sum_{i=1}^{N} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}}) \left[ \frac{x_{id}}{p_{md}} - \frac{1 - x_{id}}{1 - p_{md}} \right] = 0. \tag{2.14}
\]
Multiplying through by $p_{md}(1 - p_{md})$ and rearranging,
\[
\sum_{i=1}^{N} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}})\, x_{id} = p_{md} \sum_{i=1}^{N} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}}),
\]
so that, collecting the $D$ components into a vector,
\[
\mathbf{p}_m = \frac{\sum_{i=1}^{N} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}})\, \mathbf{x}_i}{\sum_{i=1}^{N} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}})}, \tag{2.15}
\]
as desired.
To derive the update rule for $\pi_m$ we must differentiate $Q$ with respect to $\pi_m$, using a Lagrange multiplier $\lambda$ to satisfy the constraint $\sum_{m=1}^{M} \pi_m = 1$:
\[
\mathcal{L}(\boldsymbol{\pi}, \lambda) = Q(\theta\mid\theta^{\text{old}}) + \lambda\left( \sum_{m=1}^{M} \pi_m - 1 \right). \tag{2.16}
\]
Setting the derivative to zero,
\begin{align}
\frac{\partial \mathcal{L}}{\partial \pi_m} &= \frac{\partial Q}{\partial \pi_m} + \lambda = \sum_{i=1}^{N} \frac{P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}})}{\pi_m} + \lambda = 0, \tag{2.17, 2.18}\\
\pi_m &= -\frac{1}{\lambda} \sum_{i=1}^{N} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}}). \tag{2.19, 2.20}
\end{align}
Summing over $m$ and using the constraint,
\[
1 = \sum_{m=1}^{M} \pi_m = -\frac{1}{\lambda} \sum_{m=1}^{M} \sum_{i=1}^{N} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}}) = -\frac{N}{\lambda}. \tag{2.21}
\]
So $\lambda = -N$ and therefore
\[
\pi_m = \frac{1}{N} \sum_{i=1}^{N} P(z_i = m\mid \mathbf{x}_i, \theta^{\text{old}}), \tag{2.22}
\]
as desired.
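The update rules (2.15) and (2.22), together with the E-step responsibilities, could be implemented along the following lines. This is a minimal sketch rather than the provided mix_bernoulli program; the function name, interface and initialisation are my own assumptions, and X is taken to be an N-by-D binary data matrix.

```matlab
function [p, mix, loglik] = em_bernoulli_mixture_sketch(X, M, nIter)
% Minimal EM sketch for a mixture of M multivariate Bernoullis (not the
% provided mix_bernoulli program).  X is an N-by-D matrix of 0/1 values.
[N, D] = size(X);
mix = ones(1, M) / M;                       % mixing proportions pi_m
p   = 0.25 + 0.5 * rand(M, D);              % p_md, initialised away from 0 and 1
for it = 1:nIter
    % E-step: responsibilities R(i, m) = P(z_i = m | x_i, theta_old)
    logR = zeros(N, M);
    for m = 1:M
        logR(:, m) = log(mix(m)) + X * log(p(m, :))' ...
                     + (1 - X) * log(1 - p(m, :))';
    end
    maxLog = max(logR, [], 2);                          % log-sum-exp for stability
    R = exp(bsxfun(@minus, logR, maxLog));
    loglik = sum(maxLog + log(sum(R, 2)));              % ln P(X | theta)
    R = bsxfun(@rdivide, R, sum(R, 2));                 % normalise each row
    % M-step: the update rules (2.15) and (2.22) derived above
    Nm  = sum(R, 1);                                    % effective counts per component
    p   = bsxfun(@rdivide, R' * X, Nm');                % new p_m
    p   = min(max(p, 1e-6), 1 - 1e-6);                  % keep away from 0 and 1
    mix = Nm / N;                                       % new pi_m
end
end
```

A call such as [p, mix, ll] = em_bernoulli_mixture_sketch(data, 3, 200), repeated from several random initialisations and keeping the run with the highest ll, would mirror the procedure used in the experiments below.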
(b) Using the program mix_bernoulli with different parameter settings, try to model the data. Use values of M from 1 to 5. Comment on the results for a different number of mixture components - how does this fit with what we know to be the true generative model? How do you recommend dealing with the fact that the parameters found by the EM algorithm depend on the initialization? Provide your findings and observations.
The code I used for this question is contained in q2b.m and the MATLAB workspace is saved in q2b.mat. I saved the workspace because the use of random initialisation means that simply running the script again is unlikely to produce exactly the same answers (although sufficiently large sample sizes were used that there should be no major differences).
First I generated a test set of 200 data points (this number was chosen to match the size of the training set, allowing the log-likelihood values to be directly compared) using the known true parameters of the model. Then I calculated the log-likelihood of models with a varying number of components, from one to five. For each component number, 200 models were generated and tested against the test set and the training set in order to investigate the effect of the random initialisation. It should be noted that in the case of the single-component model there is no random initialisation, as it has an analytic solution.
Finally, graphs of the maximum obtained log-likelihood and the average log-likelihood obtained from the models were plotted for each component number, based on the performance on both the training set (see Fig. 2.1) and the test set (see Fig. 2.2).
On the training data we see that the larger the number of components, the better the fit (i.e. the higher the log-likelihood); however, this is simply a result of overfitting, as a larger number of components provides more freedom to fit the observed data but will not generalise well to new data. This can be confirmed by the poor performance of high numbers of components on the test set, and by the fact that models with a higher number of components achieve a higher log-likelihood on the training set than the model that we know was used to generate the data, which shouldn't be possible.
On the test set it is interesting to look at the effect of averaging the repeated models (because of the random initialisation). As we know the true model had 3 components, we would expect the 3-component model to have the highest log-likelihood, but for the average values we see that the 2-component model performs better. This is because averaging combines all the results: both the good models resulting from good initialisation values and the bad models resulting from poor initialisation values. The larger the number of parameters you have to randomly initialise (i.e. the larger the number of components), the more likely you are to produce a poor model; but a very small number of parameters does not have much freedom to fit the data, and so will also perform poorly. The compromise between these two effects results in the 2-component model performing best when the models resulting from the random initialisations are averaged.
[Figure 2.1 appears here: log-likelihood versus number of components on the training data; legend: Average, Maximum.]
Figure 2.1: The log-likelihood values obtained from the models on the training data versus the component number of the models. The reference line is the log-likelihood value of the true model used to generate the data. Note that for the case of a single component the average is equal to the maximum because there is an analytical solution, so there is no random initialisation. Note the apparent better performance of models with more parameters due to overfitting.
When we take the maximum log-likelihood value obtained from the random initialisations (i.e. the best model produced for each component number), we find that the 3-component model performs best, as we would expect given that the model used to generate the data was a 3-component model. Interestingly, it still doesn't fit the data as well as the known true model, which means it didn't converge onto the true model, but this is perhaps expected as the optimisation is a difficult problem and is not guaranteed to find the global optimum.
To get better performance one could use the output of a k-means run on the data to initialise the EM algorithm rather than using random values; a possible sketch is given below. The mixing ratios can be estimated as the proportion of the data attributed to each class by the k-means run, and the probabilities can be set to a high value if k-means attributed the value to the class and a low value otherwise (avoiding values of exactly one and zero, as these should only be used in the case of logical certainties as per Cromwell's Rule, and it is almost certain that the k-means algorithm has not perfectly classified the components).
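A possible implementation of this k-means initialisation is sketched below. It assumes the Statistics Toolbox function kmeans is available and that X is the binary training data matrix; the shrinkage towards 0.5 is one way of avoiding the exact zeros and ones mentioned above.

```matlab
% Sketch: initialise the EM parameters from a hard k-means clustering.
M = 3;                                     % number of mixture components
idx = kmeans(double(X), M);                % cluster index for each datapoint
D = size(X, 2);
mix0 = zeros(1, M);
p0   = zeros(M, D);
for m = 1:M
    mix0(m) = mean(idx == m);              % proportion of data in cluster m
    % cluster means shrunk away from 0 and 1 (Cromwell's rule)
    p0(m, :) = 0.1 + 0.8 * mean(X(idx == m, :), 1);
end
```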
(c)(i) . . . Explain how to estimate the parameters of these class-conditional distributions, and state the parameter vectors obtained for each class.
(ii) Given the models for the ham and spam classes, use them to compute P(zi = ham|xi). Make a plot of this posterior probability against the index i = 1, . . . , 100. How well do these results agree with the known ham/spam labels? Make any other relevant observations.
(iii) If we did not know the labels (ham/spam) of the dataset, we might be tempted to fit a mixture of multivariate Bernoullis to the data. Compare the goodness-of-fit of such mixture models to the class-conditional model described above. Make any other relevant observations.
The parameters can be estimated by maximum likelihood, using the joint log-likelihood of the attributes and class labels:
[Figure 2.2 appears here: log-likelihood versus number of components on the test data; legend: Average, Maximum.]
Figure 2.2: The log-likelihood values obtained from the models on the test data versus the component number of the models. The reference line is the log-likelihood value of the true model used to generate the data. Note that for the case of a single component the average is equal to the maximum because there is an analytical solution, so there is no random initialisation. Note the difference between the best performing model type according to the average and the best performing model type according to the maximum.
\[
L = \sum_{i=1}^{N=100} \ln P(x_i, c_i) = \sum_{i=1}^{N=100} \left[ \ln P(c_i) + \sum_{d=1}^{D=10} \ln P(x_{id}\mid c_i) \right] = n_0 \ln P(C=0) + n_1 \ln P(C=1) + \sum_{d=1}^{D=10} \sum_{i=1}^{N=100} \ln P(x_{id}\mid c_i). \tag{2.23}
\]
Where n0 is the number of data points belonging to class 0 and n1 is the number of data points belonging to
class 1.
Finding the MLE of $p_{dc}$:
\begin{align}
\frac{\partial L}{\partial p_{dc}} &= \sum_{i=1}^{N=100} \mathbb{I}[c_i = c] \left[ \frac{x_{id}}{p_{dc}} - \frac{1 - x_{id}}{1 - p_{dc}} \right] = 0, \nonumber\\
\sum_{i=1}^{N=100} \mathbb{I}[c_i = c]\, x_{id} &= p_{dc} \sum_{i=1}^{N=100} \mathbb{I}[c_i = c] = p_{dc}\, n_c, \nonumber\\
p_{dc} &= \frac{1}{n_c} \sum_{i=1}^{N=100} \mathbb{I}[c_i = c]\, x_{id}, \tag{2.24}
\end{align}
where $n_c$ is the number of data points belonging to class $c$.
Therefore the estimate of the probability of each feature being true given a class is just the number of data points in that class for which the feature was true, divided by the total number of data points in that class (i.e. the fraction of the class's data for which the feature was true).
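For example, the value $0.28$ for feature 5 of $\mathbf{p}_{\text{ham}}$ reported below corresponds to that feature being true in $14$ of the $50$ ham messages, since $14/50 = 0.28$.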
Finding the MLE of $P(C=0)$ (using the fact that $P(C=1) = 1 - P(C=0)$):
\begin{align}
\frac{\partial L}{\partial P(C=0)} &= \frac{n_0}{P(C=0)} - \frac{n_1}{1 - P(C=0)} = 0, \nonumber\\
n_0 - n_0 P(C=0) - n_1 P(C=0) &= 0, \nonumber\\
n_0 &= n_0 P(C=0) + n_1 P(C=0), \nonumber\\
P(C=0) &= \frac{n_0}{n_0 + n_1} = \frac{n_0}{N}. \tag{2.25}
\end{align}
Therefore the class priors are simply the proportion of the data that belongs to each class.
Using these equations we obtain P (C = 0) = P (C = 1) = 0.5 and the following probability vectors:
\[
\mathbf{p}_{\text{ham}} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0.28 \\ 0 \\ 0.32 \\ 0.22 \\ 0.26 \\ 0.22 \end{pmatrix}
\quad\text{and}\quad
\mathbf{p}_{\text{spam}} = \begin{pmatrix} 0.56 \\ 0.44 \\ 0.40 \\ 0.22 \\ 0.08 \\ 0.26 \\ 0 \\ 0 \\ 0.12 \\ 0.18 \end{pmatrix}. \tag{2.26}
\]
The posterior for each datapoint follows from Bayes' rule,
\[
P(Z_i = \text{ham}\mid x_i) = \frac{P(x_i\mid Z_i = \text{ham})\, P(Z_i = \text{ham})}{\sum_{Z_i} P(x_i\mid Z_i)\, P(Z_i)},
\]
where the priors are $P(Z_i = \text{ham}) = P(Z_i = \text{spam}) = 0.5$ as found previously. The likelihood may be defined as:
\[
P(x_i\mid Z_i = c) = \prod_{d=1}^{D=10} p_{dc}^{\,x_{id}} (1 - p_{dc})^{1 - x_{id}}, \tag{2.27}
\]
where $i$ denotes the datapoint, $c$ denotes the class value and $d$ denotes the feature.
Therefore one can calculate the posterior by repeating this calculation for each datapoint $x_i$ in a loop. It may be possible to entirely vectorise the code, but given the small size of the dataset this seemed to be unnecessary optimisation. The code is given in q2c.m; a vectorised sketch of the same computation is shown below.
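The vectorised sketch referred to above could look as follows. It assumes X is the 100-by-10 binary data matrix and c a 100-by-1 label vector with 0 for ham and 1 for spam; these variable names are my own, not necessarily those used in q2c.m.

```matlab
% Class-conditional MLEs, eqs. (2.24)-(2.25), and posterior P(Z_i = ham | x_i).
pHam  = mean(X(c == 0, :), 1);             % p_{d,ham}
pSpam = mean(X(c == 1, :), 1);             % p_{d,spam}
priorHam = mean(c == 0);                   % P(C = 0), here 0.5

% Likelihoods under each class, eq. (2.27); note 0^0 = 1 in MATLAB, so
% parameters that are exactly zero do not produce NaNs.
likHam  = prod(bsxfun(@power, pHam,  X) .* bsxfun(@power, 1 - pHam,  1 - X), 2);
likSpam = prod(bsxfun(@power, pSpam, X) .* bsxfun(@power, 1 - pSpam, 1 - X), 2);

% Posterior via Bayes' rule
postHam = likHam * priorHam ./ (likHam * priorHam + likSpam * (1 - priorHam));
```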
The resulting posterior is plotted in Figure 2.3. Assuming a classification threshold of $P(Z_i = \text{ham}\mid x_i) > 0.5$, we can see that all the ham examples were classified correctly, although some spam examples were incorrectly classified as ham. We could of course vary the threshold: requiring that $P(Z_i = \text{ham}\mid x_i) > 0.8$ would remove all the false positives (i.e. it would correctly classify all the spam examples) but at the expense of producing many false negatives (i.e. classifying some ham as spam). In practice the lower threshold is probably better, as it is more important not to lose potentially important real messages than to ensure that no spam gets through.
In the unsupervised setting that does not utilise the class labels (imagine a situation where we were not given the class labels), we can attempt to fit a two-component mixture of multivariate Bernoullis in order to classify the data (i.e. one component will correspond to ham, the other to spam). I trained 200 separate multivariate Bernoulli mixture models on the training data, selected the best-performing model based on which had the highest log-likelihood value on the data, and used this to compute the class posteriors. The code is contained within q2c.m and the workspace is in q2c.mat. The workspace is saved because the random initialisation means that it is not guaranteed to generate the same result every time it is run (although taking the best of the 200 repeats should somewhat reduce this problem by making a global optimum more likely).
[Figure 2.3 appears here: P(Z_i = ham | x_i) plotted against the datapoint index i.]
Figure 2.3: The posterior, P(Zi = ham|xi), plotted against the datapoint index, i. The reference lines separate the graph into quadrants to show classification performance assuming a classification threshold of 0.5: the top left are the true positives (i.e. ham classified as ham), the top right are the false positives (i.e. spam classified as ham), the bottom right are the true negatives and the bottom left are the false negatives (there were none).
This resulted in the following class priors (i.e. the mixing ratios): P (Z = 1) = 0.2328 and P (Z = 2) = 0.7672.
Note that as this is unsupervised learning we do not know which class corresponds best to the ham class and
which to the spam class. However as we do in fact know the class labels we can determine this from examining
the posterior distributions. We obtain the following probability vectors:
\[
\mathbf{p}_1 = \begin{pmatrix} 0.5262 \\ 0.9450 \\ 0.8591 \\ 0.4725 \\ 1.3303\times10^{-32} \\ 0.4391 \\ 3.2497\times10^{-35} \\ 1.2150\times10^{-30} \\ 2.9505\times10^{-31} \\ 0.3869 \end{pmatrix}
\quad\text{and}\quad
\mathbf{p}_2 = \begin{pmatrix} 0.2053 \\ 1.7194\times10^{-9} \\ 5.4740\times10^{-43} \\ 3.6010\times10^{-24} \\ 0.2346 \\ 0.0362 \\ 0.2086 \\ 0.1434 \\ 0.2477 \\ 0.1433 \end{pmatrix}.
\]
Plotting the posterior distributions results in Figures 2.4 and 2.5. From these it is clear that Zi = 2 corresponds
to the ham class while Zi = 1 corresponds to the spam class.
To compare the goodness of fit of the unsupervised mixture model to the supervised naive Bayes (class-conditional) model, we can calculate the number of false positives, false negatives, true positives and true negatives for both models, for the $Z_i = \text{ham}$ and $Z_i = 2$ cases respectively (as $Z_i = 2$ appeared to be the mixture component corresponding to ham); that is, we compare their confusion matrices. These are given in Tables 2.1 and 2.2.
[Figure 2.4 appears here: P(Z_i = 1 | x_i) plotted against the datapoint index i.]
Figure 2.4: The posterior, P(Zi = 1|xi), plotted against the datapoint index, i. This appears to be the spam class. The reference lines separate the graph into quadrants to show classification performance assuming a classification threshold of 0.5: the top left are the false positives (i.e. ham classified as spam; there were none), the top right are the true positives (i.e. spam classified as spam), the bottom right are the false negatives and the bottom left are the true negatives.
Table 2.1:
                       Predicted Class
                       Ham    Spam
Actual Class    Ham     50       0
                Spam    11      39

Table 2.2:
                       Predicted Class
                       Ham    Spam
Actual Class    Ham     50       0
                Spam    27      23
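For reference, the overall accuracies implied by these two confusion matrices are $(50 + 39)/100 = 89\%$ and $(50 + 23)/100 = 73\%$ respectively.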
3 Question 3 (25 marks)
(a) Fred suggests that a Gibbs sampling approach could be used. Is he correct? Explain your reasoning.
[Figure 2.5 appears here: P(Z_i = 2 | x_i) plotted against the datapoint index i.]
Figure 2.5: The posterior, P(Zi = 2|xi), plotted against the datapoint index, i. This appears to be the ham class. The reference lines separate the graph into quadrants to show classification performance assuming a classification threshold of 0.5: the top left are the true positives (i.e. ham classified as ham), the top right are the false positives (i.e. spam classified as ham), the bottom right are the true negatives and the bottom left are the false negatives (there were none).