cs228 HW 1
For each factor $P(x_i \mid x_{i-1})$ for $i \geq 2$ you are given the probability $P(X_i = u \mid X_{i-1} = v)$ for each $u, v \in S$ in the form of an $m \times m$ table. You are also given $P(X_1 = v)$ for each $v \in S$.
$$\max_{x_1, x_2, \ldots, x_n \in S^n} P(x_1, x_2, \ldots, x_n).$$
State the complexity of your algorithm using Big O notation. Your algorithm should run in time polynomial in $m$ and $n$. (Hint: use dynamic programming; decompose the problem into a sequence of optimization problems, each over a single variable.)
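For intuition, here is a minimal sketch of this style of recursion (a max-product, Viterbi-like dynamic program). The representation is an assumption for illustration: `p1` is a length-$m$ vector of initial probabilities and `trans` is a list of $n-1$ transition tables, not anything specified by the assignment.

```python
import numpy as np

def max_prob(p1, trans):
    """Max-product DP over a chain: computes max_{x_1,...,x_n} P(x_1,...,x_n).

    p1:    length-m array with p1[v] = P(X_1 = v)
    trans: list of n-1 arrays of shape (m, m), where trans[i][v, u] is the
           transition probability P(X_{i+2} = u | X_{i+1} = v)
    Runs in O(n m^2) time and O(m) space.
    """
    best = np.asarray(p1, dtype=float)  # best[v] = max prob of a prefix ending in state v
    for T in trans:
        # For each next state u, maximize best[v] * T[v, u] over previous states v.
        best = (best[:, None] * np.asarray(T, dtype=float)).max(axis=0)
    return best.max()
```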
Problem 3: Bayesian networks (6 points)
Can be attempted after lecture 2
Let us try to relax the definition of Bayesian networks by removing the assumption that the directed graph is acyclic. Suppose we have a directed graph $G = (V, E)$ and discrete random variables $X_1, \ldots, X_n$, and define
$$f(x_1, \ldots, x_n) = \prod_{v \in V} f_v(x_v \mid x_{\mathrm{pa}(v)})$$
where $X_{\mathrm{pa}(v)}$ refers to the parents of variable $X_v$ in $G$ and $f_v(x_v \mid x_{\mathrm{pa}(v)})$ specifies a distribution over $X_v$ for every assignment to the parents of $X_v$, i.e., $0 \leq f_v(x_v \mid x_{\mathrm{pa}(v)}) \leq 1$ for all $x_v \in \mathrm{Val}(X_v)$, and for all $x_{\mathrm{pa}(v)} \in \mathrm{Val}(X_{\mathrm{pa}(v)})$ we have $\sum_{x_v \in \mathrm{Val}(X_v)} f_v(x_v \mid x_{\mathrm{pa}(v)}) = 1$. Recall that this is precisely the
definition of the joint probability distribution associated with the Bayesian network $G$, where the $f_v$ are the conditional probability distributions. Show that if $G$ has a directed cycle, $f$ may no longer define a valid probability distribution.
In particular, give an example of a cyclic graph $G$ and distributions $f_v$ that lead to an improper probability distribution. Very briefly explain which property of valid probability distributions is violated (a proof is not required). Remember, a valid probability distribution must be non-negative and sum to one. This is why Bayesian networks must be defined on acyclic graphs.
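One way to sanity-check a candidate counterexample is to enumerate all joint assignments and sum the product of the local factors; for a valid distribution the total must be exactly 1. Below is a minimal brute-force checker for a two-node cycle $X_1 \leftrightarrow X_2$; the table layout and the uniform placeholder tables are assumptions for illustration, not the requested counterexample.

```python
import itertools

def total_mass(f1, f2, vals=(0, 1)):
    """Sum of f(x1, x2) = f1(x1 | x2) * f2(x2 | x1) over all joint assignments."""
    return sum(f1[x1][x2] * f2[x2][x1]
               for x1, x2 in itertools.product(vals, repeat=2))

# Placeholder tables on the 2-cycle X1 <-> X2 (each column is a valid
# conditional distribution on its own); substitute your own candidate CPDs.
f1 = [[0.5, 0.5], [0.5, 0.5]]   # f1[x1][x2] = f_1(x_1 | x_2)
f2 = [[0.5, 0.5], [0.5, 0.5]]   # f2[x2][x1] = f_2(x_2 | x_1)
print(total_mass(f1, f2))       # a valid joint distribution must print 1.0
```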
For each case, justify your response either by showing how to calculate the desired answer or by
explaining why this is not possible.
• (6 points) Suppose we know that β and γ are conditionally independent given α. Now which of
the preceding three sets is sufficient? Justify your response as before.
Consider the Bayesian network $B$ given above.
1. (2 points) Compute $\Pr(A = 0, B = 0)$ and $\Pr(E = 1 \mid A = 1)$. Justify your answers.
2. (3 points) True or false? Why?
[Figure: the Burglary Alarm network, with nodes Burglary, Earthquake, TV, Alarm, Nap, JohnCall, and MaryCall.]
1. (8 points) Consider the Burglary Alarm network given above. Construct a Bayesian network, over all the nodes except Alarm, that is a minimal I-map for the marginal distribution over the remaining variables (namely, over $B, E, N, T, J, M$). Hint: the minimal I-map is not unique, so there can be different correct solutions; be sure to capture all the dependencies of the original network.
2. (8 points) Generalize the procedure you used above to an arbitrary network. More precisely, assume we are given a network BN, an ordering $X_1, \ldots, X_n$ that is consistent with the ordering of the variables in BN, and a node $X_i$ to be removed. Specify a network BN′ such that BN′ is consistent with this ordering, and such that BN′ is a minimal I-map of the marginal distribution $P_{\mathrm{BN}}(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. Your answer must be an explicit specification of the set of parents for each variable in BN′. As a (possibly factually incorrect) example of the expected answer format: for every node $X_j$ that is a child of $X_i$, $\mathrm{Pa}'_{X_j} = \mathrm{Pa}_{X_i} \cup \mathrm{Pa}_{X_j} \setminus \{X_i\}$, where $\mathrm{Pa}$ denotes the parent nodes in the original BN and $\mathrm{Pa}'$ denotes the parent nodes in the new BN′.
1. (4 points) Suppose you have a Bayes net over variables $X_1, \ldots, X_n$ and all variables except $X_i$ are observed. Using the chain rule and Bayes' rule, find an efficient algorithm to compute $P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ in terms of local conditional distributions. In particular, your algorithm should not require evaluation of the full joint distribution. (A sketch of one possible approach appears after this list.)
2. (4 points) Find an efficient algorithm to generate random samples from the probability distribution defined by a Bayesian network. You can assume access to a routine that generates random samples from any given categorical distribution. Hint: it is possible to sample from any joint distribution $P(X, Y)$ by first drawing a sample $x \sim P(X)$ and then drawing a sample $y \sim P(Y \mid X = x)$. Hint: you may want to look up topological sorting. (A sampling sketch also appears after this list.)
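For part 1, the following is a sketch of one possible approach, not a definitive solution: only the factors that mention $X_i$ survive the chain-rule cancellation, namely $X_i$'s own CPD and those of its children (its Markov blanket). The data-structure conventions (`cpd`, `parents`, `children`) are assumptions for illustration.

```python
import numpy as np

def single_var_conditional(i, x_obs, cpd, parents, children, k):
    """P(X_i | all other variables) in a discrete Bayes net, via the Markov blanket.

    By the chain rule, every CPD that does not mention X_i cancels between
    the numerator and the normalizing sum, leaving only X_i's own CPD and
    the CPDs of X_i's children.

    x_obs:    dict {var: value} with observed values for every other variable
    cpd:      function cpd(v, val, pa_vals) -> P(X_v = val | Pa(X_v) = pa_vals),
              where pa_vals is a tuple ordered as parents[v]
    parents:  dict {var: tuple of parent vars}
    children: dict {var: tuple of child vars}
    k:        number of values X_i can take, encoded 0..k-1
    """
    scores = np.empty(k)
    for xi in range(k):
        x = dict(x_obs)
        x[i] = xi
        score = cpd(i, xi, tuple(x[p] for p in parents[i]))
        for c in children[i]:
            score *= cpd(c, x[c], tuple(x[p] for p in parents[c]))
        scores[xi] = score
    return scores / scores.sum()   # renormalize over the k values of X_i
```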
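For part 2, the two hints combine into ancestral (forward) sampling: visit the variables in a topological order and draw each one from its CPD given its parents' already-sampled values. A minimal sketch, with the `cpd_table` accessor assumed for illustration:

```python
import numpy as np

def ancestral_sample(order, parents, cpd_table, rng=None):
    """Draw one joint sample from a Bayes net by forward (ancestral) sampling.

    order:     all variables in a topological order (parents before children)
    parents:   dict {var: tuple of parent vars}
    cpd_table: function cpd_table(v, pa_vals) -> 1-D array of probabilities
               over the values of X_v given its parents' sampled values
    """
    rng = rng or np.random.default_rng()
    sample = {}
    for v in order:
        probs = cpd_table(v, tuple(sample[p] for p in parents[v]))
        sample[v] = rng.choice(len(probs), p=probs)   # categorical draw
    return sample
```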
Figure 1: Bayesian network for the MNIST dataset. The $X_{1:784}$ variables correspond to pixels in an image; the $Z_1$ and $Z_2$ variables are latent.
Models of this kind are a popular family of deep generative models. We will return to the exact details of learning such models later in the course.
For this programming assignment, we provide a pretrained model, trained_mnist_model. The starter code pa1.py loads this model and provides functions to directly access the conditional probability tables. Further, we simplify the problem by discretizing the latent and manifest variables such that $\mathrm{Val}(Z_1) = \mathrm{Val}(Z_2) = \{-3, -2.75, \ldots, 2.75, 3\}$ and $\mathrm{Val}(X_j) = \{0, 1\}$, i.e., the image is binary.
Note: the index for $X$ starts at 0 and ends at 783 in the Python starter code (which corresponds to $X_{1:784}$ in the problem description).
Note: The programming portion of this homework is graded with the Gradescope autograder. Nevertheless, you must include all plots in your writeup to be considered for full credit. The plots will be reviewed in the unlikely case that the autograder is in error.
Note: Using an IDE with a type/syntax checker such as VSCode or PyCharm will likely help you
catch many simple bugs (such as unused variables and incorrect type signatures) and reduce the time
taken to complete this and following programming assignments.
Note: The autograder will assign grades based on your final submission. To receive credit, the
autograder must run and complete within the allotted time.
1. (2 points) How many values can the random vector $X_{1:784}$ take, i.e., how many different $28 \times 28$ binary images are there?
2. (2 points) How many parameters would you minimally need to specify an arbitrary probability distribution over all possible $28 \times 28$ binary images?
3. (4 points) How many parameters do you minimally need to specify the Bayesian network in Figure 1?
For parts 4-7 below, refer to pa1.py. The starter code contains some helper functions for solving
these questions, and some incomplete functions that you must complete yourself as indicated in the
code. Feel free to introduce your own additional helper functions when useful.
4. (5 points) Produce 5 samples from the joint probability distribution, $(z_1, z_2, x_{1:784}) \sim p(Z_1, Z_2, X_{1:784})$, and plot the corresponding images (values of the pixel variables). Hint: they should look like (binarized) handwritten digits.
5. (5 points) For each possible value of $(\bar{z}_1, \bar{z}_2) \in \{-3, -2.75, \ldots, 2.75, 3\} \times \{-3, -2.75, \ldots, 2.75, 3\}$, compute the conditional expectation $E[X_{1:784} \mid (Z_1, Z_2) = (\bar{z}_1, \bar{z}_2)]$. This is the expected image corresponding to each possible value of the latent variables $Z_1, Z_2$. Plot the images on a 2D grid where the grid axes correspond to $Z_1$ and $Z_2$ respectively. What is the intuitive role of the $Z_1, Z_2$ variables in this model?
6. (10 points) In q6.mat, you are given a validation and a test dataset. In the test dataset, some
images are “real” handwritten digits, and some are anomalous (corrupted images). We would
like to use our Bayesian network to distinguish real images from the anomalous ones. Intuitively,
our Bayesian network should assign low probability to corrupted images and high probability
to the real ones, and we can use this for classification. To do this, we first compute the average marginal log-likelihood,
$$\log p(x_{1:784}) = \log \sum_{z_1} \sum_{z_2} p(z_1, z_2, x_{1:784}),$$
on the validation dataset, along with its standard deviation (again, over the validation set). Consider a simple prediction rule where images whose marginal log-likelihood, $\log p(x_{1:784})$, lies more than three standard deviations from the average marginal log-likelihood are classified as corrupted. Classify images in the test set as corrupted or real using this rule. Then plot a histogram of the marginal log-likelihood for the images classified as “real”, and a separate histogram for the images classified as “corrupted”.
Hint: If you run into numerical stability issues (which might not be immediately apparent as such), search for the “log-sum-exp trick” online for help. In general, you should always compute quantities like $\log(\exp(\log p_1) + \exp(\log p_2))$ with a library routine that incorporates this trick, such as scipy.special.logsumexp. (A short sketch appears at the end of this section.)
Take extra caution: the variables are labeled from 1 to 784, not from 0 to 783; please read the comments in the code to avoid indexing errors.
7. (7 points) In q7.mat, you are given a labeled dataset of images of handwritten digits (the label corresponds to the digit identity). For each image $I^k$, compute the conditional probabilities $p((Z_1, Z_2) = (\bar{z}_1, \bar{z}_2) \mid X_{1:784} = I^k)$. Use these probabilities to compute the conditional expectation
$$E[(Z_1, Z_2) \mid X_{1:784} = I^k].$$
Plot all the conditional expectations in a single plot, color-coding each point according to its label. What is the relationship to the figure you produced for part 5?
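As referenced in the hint for part 6, here is a minimal sketch of the numerical pattern for the marginal log-likelihood. The accessor log_joint(z1, z2, image), returning $\log p(z_1, z_2, x_{1:784})$, is hypothetical; it stands in for whatever CPT-access functions pa1.py actually provides.

```python
import numpy as np
from scipy.special import logsumexp

Z_VALS = np.arange(-3, 3.25, 0.25)   # the 25 discretized latent values

def marginal_loglik(image, log_joint):
    """log p(x_{1:784}) = logsumexp over (z1, z2) of log p(z1, z2, x_{1:784})."""
    terms = [log_joint(z1, z2, image) for z1 in Z_VALS for z2 in Z_VALS]
    return logsumexp(terms)   # stable: never exponentiates large magnitudes directly
```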
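Likewise for part 7, a sketch of the posterior expectation over the discretized latent grid: normalize in log space, then take a weighted average. The same hypothetical log_joint accessor is assumed, not the real starter-code API.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_mean_z(image, z_vals, log_joint):
    """E[(Z1, Z2) | X_{1:784} = image] over the discretized latent grid."""
    grid = np.array([(z1, z2) for z1 in z_vals for z2 in z_vals])
    log_p = np.array([log_joint(z1, z2, image) for z1, z2 in grid])
    weights = np.exp(log_p - logsumexp(log_p))   # normalized posterior p(z1, z2 | x)
    return weights @ grid                        # expected (z1, z2), shape (2,)
```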