cs228 HW 1
For each factor $P(x_i \mid x_{i-1})$ for $i \geq 2$ you are given the probability $P(X_i = u \mid X_{i-1} = v)$ for each $u, v \in S$ in the form of an $m \times m$ table. You are also given $P(X_1 = v)$ for each $v \in S$.
$$\max_{x_1, x_2, \ldots, x_n \in S^n} P(x_1, x_2, \ldots, x_n).$$
State the complexity of your algorithm using Big O notation. Your algorithm should run in time polynomial in $m$ and $n$. (Hint: use dynamic programming; decompose the problem into a sequence of optimization problems, each over a single variable.)
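For intuition, here is a minimal sketch of this style of recursion (a max-product, Viterbi-like dynamic program). The representation is an assumption for illustration: `p1` is a length-$m$ vector of initial probabilities and `trans` is a list of $n-1$ transition tables, not anything specified by the assignment.

```python
import numpy as np

def max_prob(p1, trans):
    """Max-product DP over a chain: computes max_{x_1,...,x_n} P(x_1,...,x_n).

    p1:    length-m array with p1[v] = P(X_1 = v)
    trans: list of n-1 arrays of shape (m, m), where trans[i][v, u] is the
           transition probability P(X_{i+2} = u | X_{i+1} = v)
    Runs in O(n m^2) time and O(m) space.
    """
    best = np.asarray(p1, dtype=float)  # best[v] = max prob of a prefix ending in state v
    for T in trans:
        # For each next state u, maximize best[v] * T[v, u] over previous states v.
        best = (best[:, None] * np.asarray(T, dtype=float)).max(axis=0)
    return best.max()
```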
Problem 3: Bayesian networks (6 points)
Can be attempted after lecture 2
Let us try to relax the definition of Bayesian networks by removing the assumption that the directed graph is acyclic. Suppose we have a directed graph $G = (V, E)$ and discrete random variables $X_1, \ldots, X_n$, and define
$$f(x_1, \ldots, x_n) = \prod_{v \in V} f_v(x_v \mid x_{\mathrm{pa}(v)})$$
where $X_{\mathrm{pa}(v)}$ refers to the parents of variable $X_v$ in $G$ and $f_v(x_v \mid x_{\mathrm{pa}(v)})$ specifies a distribution over $X_v$ for every assignment to the parents of $X_v$, i.e., $0 \leq f_v(x_v \mid x_{\mathrm{pa}(v)}) \leq 1$ for all $x_v \in \mathrm{Val}(X_v)$, and for all $x_{\mathrm{pa}(v)} \in \mathrm{Val}(X_{\mathrm{pa}(v)})$ we have $\sum_{x_v \in \mathrm{Val}(X_v)} f_v(x_v \mid x_{\mathrm{pa}(v)}) = 1$. Recall that this is precisely the
definition of the joint probability distribution associated with the Bayesian network $G$, where the $f_v$ are the conditional probability distributions. Show that if $G$ has a directed cycle, $f$ may no longer define a valid probability distribution.
In particular, give an example of a cyclic graph $G$ and distributions $f_v$ that lead to an improper probability distribution. Very briefly explain which property of valid probability distributions is violated (a proof is not required). Remember, a valid probability distribution must be non-negative and sum to one. This is why Bayesian networks must be defined on acyclic graphs.
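One way to sanity-check a candidate counterexample is to enumerate all joint assignments and sum the product of the local factors; for a valid distribution the total must be exactly 1. Below is a minimal brute-force checker for a two-node cycle $X_1 \leftrightarrow X_2$; the table layout and the uniform placeholder tables are assumptions for illustration, not the requested counterexample.

```python
import itertools

def total_mass(f1, f2, vals=(0, 1)):
    """Sum of f(x1, x2) = f1(x1 | x2) * f2(x2 | x1) over all joint assignments."""
    return sum(f1[x1][x2] * f2[x2][x1]
               for x1, x2 in itertools.product(vals, repeat=2))

# Placeholder tables on the 2-cycle X1 <-> X2 (each column is a valid
# conditional distribution on its own); substitute your own candidate CPDs.
f1 = [[0.5, 0.5], [0.5, 0.5]]   # f1[x1][x2] = f_1(x_1 | x_2)
f2 = [[0.5, 0.5], [0.5, 0.5]]   # f2[x2][x1] = f_2(x_2 | x_1)
print(total_mass(f1, f2))       # a valid joint distribution must print 1.0
```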
For each case, justify your response either by showing how to calculate the desired answer or by
explaining why this is not possible.
• (6 points) Suppose we know that β and γ are conditionally independent given α. Now which of
the preceding three sets is sufficient? Justify your response as before.
Consider the Bayesian network $B$ given above.
1. (2 points) Compute $\Pr(A = 0, B = 0)$ and $\Pr(E = 1 \mid A = 1)$. Justify your answers.
2. (3 points) True or false? Why?
[Figure: the Burglary Alarm network, with nodes Burglary, Earthquake, TV, Alarm, Nap, JohnCall, and MaryCall.]
1. (8 points) Consider the Burglary Alarm network given above. Construct a Bayesian network, over all the nodes except Alarm, that is a minimal I-map for the marginal distribution over the remaining variables (namely, over $B, E, N, T, J, M$). Hint: the minimal I-map is not unique, so there can be different correct solutions; be sure to capture all the dependencies of the original network.
2. (8 points) Generalize the procedure you used above to an arbitrary network. More precisely, assume we are given a network BN, an ordering $X_1, \ldots, X_n$ that is consistent with the ordering of the variables in BN, and a node $X_i$ to be removed. Specify a network BN′ such that BN′ is consistent with this ordering, and such that BN′ is a minimal I-map of the marginal distribution $P_{\mathrm{BN}}(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. Your answer must be an explicit specification of the set of parents for each variable in BN′. As a (possibly factually incorrect) example of the expected answer format: for every node $X_j$ that is a child of $X_i$, $\mathrm{Pa}'_{X_j} = \mathrm{Pa}_{X_i} \cup \mathrm{Pa}_{X_j} \setminus \{X_i\}$, where $\mathrm{Pa}$ denotes the parent nodes in the original BN and $\mathrm{Pa}'$ denotes the parent nodes in the new BN′.
1. (4 points) Suppose you have a Bayes net over variables $X_1, \ldots, X_n$ and all variables except $X_i$ are observed. Using the chain rule and Bayes' rule, find an efficient algorithm to compute $P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ in terms of local conditional distributions. In particular, your algorithm should not require evaluation of the full joint distribution. (A sketch of one possible approach appears after this list.)
2. (4 points) Find an efficient algorithm to generate random samples from the probability distribution defined by a Bayesian network. You can assume access to a routine that generates random samples from any given categorical distribution. Hint: it is possible to sample from any joint distribution $P(X, Y)$ by first drawing a sample $x \sim P(X)$ and then drawing a sample $y \sim P(Y \mid X = x)$. Hint: you may want to look up topological sorting. (A sampling sketch also appears after this list.)
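For part 1, the following is a sketch of one possible approach, not a definitive solution: only the factors that mention $X_i$ survive the chain-rule cancellation, namely $X_i$'s own CPD and those of its children (its Markov blanket). The data-structure conventions (`cpd`, `parents`, `children`) are assumptions for illustration.

```python
import numpy as np

def single_var_conditional(i, x_obs, cpd, parents, children, k):
    """P(X_i | all other variables) in a discrete Bayes net, via the Markov blanket.

    By the chain rule, every CPD that does not mention X_i cancels between
    the numerator and the normalizing sum, leaving only X_i's own CPD and
    the CPDs of X_i's children.

    x_obs:    dict {var: value} with observed values for every other variable
    cpd:      function cpd(v, val, pa_vals) -> P(X_v = val | Pa(X_v) = pa_vals),
              where pa_vals is a tuple ordered as parents[v]
    parents:  dict {var: tuple of parent vars}
    children: dict {var: tuple of child vars}
    k:        number of values X_i can take, encoded 0..k-1
    """
    scores = np.empty(k)
    for xi in range(k):
        x = dict(x_obs)
        x[i] = xi
        score = cpd(i, xi, tuple(x[p] for p in parents[i]))
        for c in children[i]:
            score *= cpd(c, x[c], tuple(x[p] for p in parents[c]))
        scores[xi] = score
    return scores / scores.sum()   # renormalize over the k values of X_i
```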
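For part 2, the two hints combine into ancestral (forward) sampling: visit the variables in a topological order and draw each one from its CPD given its parents' already-sampled values. A minimal sketch, with the `cpd_table` accessor assumed for illustration:

```python
import numpy as np

def ancestral_sample(order, parents, cpd_table, rng=None):
    """Draw one joint sample from a Bayes net by forward (ancestral) sampling.

    order:     all variables in a topological order (parents before children)
    parents:   dict {var: tuple of parent vars}
    cpd_table: function cpd_table(v, pa_vals) -> 1-D array of probabilities
               over the values of X_v given its parents' sampled values
    """
    rng = rng or np.random.default_rng()
    sample = {}
    for v in order:
        probs = cpd_table(v, tuple(sample[p] for p in parents[v]))
        sample[v] = rng.choice(len(probs), p=probs)   # categorical draw
    return sample
```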
Figure 1: Bayesian network for the MNIST dataset. The $X_{1:784}$ variables correspond to pixels in an image; the $Z_1$ and $Z_2$ variables are latent.
Models of this kind are a popular family of deep generative models. We will return to the exact details of learning such models later in the course.
For this programming assignment, we provide a pretrained model, trained_mnist_model. The starter code pa1.py loads this model and provides functions to directly access the conditional probability tables. Further, we simplify the problem by discretizing the latent and manifest variables such that $\mathrm{Val}(Z_1) = \mathrm{Val}(Z_2) = \{-3, -2.75, \ldots, 2.75, 3\}$ and $\mathrm{Val}(X_j) = \{0, 1\}$, i.e., the image is binary.
Note: the index for $X$ starts at 0 and ends at 783 in the Python starter code (which corresponds to $X_{1:784}$ in the problem description).
Note: The programming portion of this homework is graded with the Gradescope autograder. Nevertheless, you must include all plots in your writeup to be considered for full credit. The plots will be reviewed in the unlikely case that the autograder is in error.
Note: Using an IDE with a type/syntax checker such as VSCode or PyCharm will likely help you
catch many simple bugs (such as unused variables and incorrect type signatures) and reduce the time
taken to complete this and following programming assignments.
Note: The autograder will assign grades based on your final submission. To receive credit, the
autograder must run and complete within the allotted time.
1. (2 points) How many values can the random vector $X_{1:784}$ take, i.e., how many different $28 \times 28$ binary images are there?
2. (2 points) How many parameters would you minimally need to specify an arbitrary probability distribution over all possible $28 \times 28$ binary images?
3. (4 points) How many parameters do you minimally need to specify the Bayesian network in Figure 1?
For parts 4-7 below, refer to pa1.py. The starter code contains some helper functions for solving
these questions, and some incomplete functions that you must complete yourself as indicated in the
code. Feel free to introduce your own additional helper functions when useful.
4. (5 points) Produce 5 samples from the joint probability distribution, $(z_1, z_2, x_{1:784}) \sim p(Z_1, Z_2, X_{1:784})$, and plot the corresponding images (values of the pixel variables). Hint: they should look like (binarized) handwritten digits.
5. (5 points) For each possible value of $(\bar{z}_1, \bar{z}_2) \in \{-3, -2.75, \ldots, 2.75, 3\} \times \{-3, -2.75, \ldots, 2.75, 3\}$, compute the conditional expectation $E[X_{1:784} \mid (Z_1, Z_2) = (\bar{z}_1, \bar{z}_2)]$. This is the expected image corresponding to each possible value of the latent variables $Z_1, Z_2$. Plot the images on a 2D grid where the grid axes correspond to $Z_1$ and $Z_2$ respectively. What is the intuitive role of the $Z_1, Z_2$ variables in this model?
6. (10 points) In q6.mat, you are given a validation and a test dataset. In the test dataset, some
images are “real” handwritten digits, and some are anomalous (corrupted images). We would
like to use our Bayesian network to distinguish real images from the anomalous ones. Intuitively,
our Bayesian network should assign low probability to corrupted images and high probability
to the real ones, and we can use this for classification. To do this, we first compute the average marginal log-likelihood,
$$\log p(x_{1:784}) = \log \sum_{z_1} \sum_{z_2} p(z_1, z_2, x_{1:784}),$$
on the validation dataset, along with its standard deviation (again, over the validation set). Consider a simple prediction rule where images whose marginal log-likelihood, $\log p(x_{1:784})$, lies more than three standard deviations from the average marginal log-likelihood are classified as corrupted. Classify images in the test set as corrupted or real using this rule. Then plot a histogram of the marginal log-likelihood for the images classified as “real”, and a separate histogram for the images classified as “corrupted”.
Hint: If you run into numerical stability issues (which might not be immediately apparent as such), search for the “log-sum-exp trick” online for help. In general, you should always compute quantities like $\log(\exp(\log p_1) + \exp(\log p_2))$ with a library routine that incorporates this trick, such as scipy.special.logsumexp. (A short sketch appears at the end of this section.)
Take extra caution: the variables are labeled from 1 to 784, not from 0 to 783; please read the comments in the code to avoid indexing errors.
7. (7 points) In q7.mat, you are given a labeled dataset of images of handwritten digits (the label corresponds to the digit identity). For each image $I^k$, compute the conditional probabilities $p((Z_1, Z_2) = (\bar{z}_1, \bar{z}_2) \mid X_{1:784} = I^k)$. Use these probabilities to compute the conditional expectation
$$E[(Z_1, Z_2) \mid X_{1:784} = I^k].$$
Plot all the conditional expectations in a single plot, color-coding each point according to its label. What is the relationship to the figure you produced for part 5?
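As referenced in the hint for part 6, here is a minimal sketch of the numerical pattern for the marginal log-likelihood. The accessor log_joint(z1, z2, image), returning $\log p(z_1, z_2, x_{1:784})$, is hypothetical; it stands in for whatever CPT-access functions pa1.py actually provides.

```python
import numpy as np
from scipy.special import logsumexp

Z_VALS = np.arange(-3, 3.25, 0.25)   # the 25 discretized latent values

def marginal_loglik(image, log_joint):
    """log p(x_{1:784}) = logsumexp over (z1, z2) of log p(z1, z2, x_{1:784})."""
    terms = [log_joint(z1, z2, image) for z1 in Z_VALS for z2 in Z_VALS]
    return logsumexp(terms)   # stable: never exponentiates large magnitudes directly
```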
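Likewise for part 7, a sketch of the posterior expectation over the discretized latent grid: normalize in log space, then take a weighted average. The same hypothetical log_joint accessor is assumed, not the real starter-code API.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_mean_z(image, z_vals, log_joint):
    """E[(Z1, Z2) | X_{1:784} = image] over the discretized latent grid."""
    grid = np.array([(z1, z2) for z1 in z_vals for z2 in z_vals])
    log_p = np.array([log_joint(z1, z2, image) for z1, z2 in grid])
    weights = np.exp(log_p - logsumexp(log_p))   # normalized posterior p(z1, z2 | x)
    return weights @ grid                        # expected (z1, z2), shape (2,)
```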