AA228/CS238 Exercises
Probabilistic Models
Question 1. Consider the definition of conditional probability:
P(A, B) = P(A | B) P(B)
Can you come up with a simple explanation in words as to why this works? Use similar reasoning to come
up with an expression for P (A, B | C).
Question 2. 1% of women at age forty who participate in routine screening have breast cancer. 80% of
women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will
also get positive mammographies. A woman in this age group had a positive mammography in a routine
screening. What is the probability that she actually has breast cancer?
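A minimal Python sketch of the Bayes' rule computation this question calls for (variable names are illustrative):

```python
# Bayes' rule for the mammography question.
p_cancer = 0.01                 # P(cancer): prevalence at age forty
p_pos_given_cancer = 0.80       # P(positive | cancer)
p_pos_given_healthy = 0.096     # P(positive | no cancer)

# Law of total probability: P(positive)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes' rule: P(cancer | positive)
print(p_pos_given_cancer * p_cancer / p_pos)   # about 0.078
```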
Question 3. There is a 50% chance there is both life and water on Mars, a 25% chance there is life but no
water, and a 25% chance there is no life and no water. What is the probability that there is life on Mars
given that there is water?
Question 4. In the textbook, it is stated that if all variables in a Bayesian network are binary, the probability
distribution over some variable X with n parents Pa_X can be represented by 2^n independent parameters.
Imagine that X is a binary variable with two parent variables that are not necessarily binary. Imagine
that the first parent can assume three different values, and that the second can assume two values. How
many independent parameters are needed to represent this distribution, P(X | Pa_X)? How many would it
be if you added another parent that can assume four different values?
Now assume that X itself is not binary, but can assume three different values (and still has the three
parents as specified above). How many values are needed to represent this distribution? Can you come up
with a general rule for the number of independent parameters needed to represent a distribution over some
variable X with parents Pa_X?
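As a sanity check for the counting, here is a small Python sketch assuming the rule is (number of values of X minus one) times the product of the parents' value counts:

```python
from math import prod

def num_parameters(x_values, parent_values):
    """Independent parameters needed to represent P(X | Pa_X)."""
    return (x_values - 1) * prod(parent_values)

print(num_parameters(2, [3, 2]))       # binary X, parents with 3 and 2 values
print(num_parameters(2, [3, 2, 4]))    # add a parent with 4 values
print(num_parameters(3, [3, 2, 4]))    # X itself takes 3 values
```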
Question 5. Given the Bayes net below, determine whether the following are true or false:
(Bayesian network diagram over nodes A, B, C, D, E, F)
1. (B ⊥ D | A)
2. (B ⊥ D | C)
3. (B ⊥ D | E)
4. (B ⊥ C | A)
Question 6. It is known that 80-foot blue whales consume, on average, 3200 kg of krill per day. 100-footers
consume on average 3600 kg of krill per day. Assume that the mean daily krill consumption varies linearly
with whale length and that the daily consumption for a given whale follows a Gaussian distribution with a
standard deviation of 200 kg of krill per day. Define the linear Gaussian distribution, P (k | l), relating the
rate of krill consumption k to whale length l .
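A short Python sketch of one way to set up this linear-Gaussian model, fitting the mean line through the two given (length, consumption) points; the function and variable names are illustrative:

```python
import numpy as np

lengths = np.array([80.0, 100.0])       # whale lengths in feet
means = np.array([3200.0, 3600.0])      # mean krill consumption in kg/day

# Mean consumption is linear in length: m(l) = a * l + b
a, b = np.polyfit(lengths, means, 1)    # a = 20, b = 1600
sigma = 200.0                           # given standard deviation (kg/day)

def sample_krill(l, rng=np.random.default_rng()):
    """Draw k ~ N(a*l + b, sigma^2), i.e. a sample from P(k | l)."""
    return rng.normal(a * l + b, sigma)

print(a, b, a * 90 + b)                 # mean for a 90-foot whale: 3400 kg/day
```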
Question 7. Assuming a hidden Markov model with states s0:t and observations o0:t , prove the following:
Question 8. What is the Markov blanket for some node O_t of the hidden Markov model below? Explain
why this is so.
(Hidden Markov model diagram: state chain S0 → S1 → S2 with observations O0, O1, O2)
Question 9. Consider the law of total probability, P(A) = Σ_{B ∈ Bset} P(A | B) P(B), where Bset is a set of mutually exclusive and exhaustive propositions. Can you find a similar expression for
P(A | C)?
Question 10. What is a topological sort? Why is it important to perform a topological sort before sampling
from a Bayesian network? Does a topological sort always exist? Is a topological sort always unique?
Question 11. Given the following Bayesian network where each node can take on one of four values, how
many independent parameters are there?
(Bayesian network diagram over nodes B, C, D, G)
Question 12. Give a topological sort for the following Bayesian network:
(Bayesian network diagram over nodes A, B, C, D, F, G)
Question 13. Formulate the following 3SAT problem as a Bayesian network.
Question 14. What are the differences between inference, parameter learning, and structure learning?
What are you looking for in each case and what is assumed to be known? When might you use each of
them?
Question 15. What is a classification task? Assume you are classifying using a naive Bayes model. What
assumptions are you making? Draw a naive Bayes model using the compact representation shown in class.
What is the name of this kind of representation?
Question 16. What is an important drawback of maximum likelihood estimation?
Question 17. Bayesian parameter learning estimates a posterior p(θ | D) for the parameters θ given the
data D. What are some of its advantages and drawbacks?
Question 18. What is the gamma function Γ? What is Γ(5)?
Question 19. Imagine that you want to estimate θ, the probability that one basketball team (call them
Team A) beats another team (call them Team B). Assume you know nothing else about the two teams.
What is a reasonable prior distribution?
Now imagine that you know the two teams well and are confident that they are evenly matched. Would
a prior of Beta(9,9) be better than a Beta(2,2) in this case? If so, why?
Now imagine that you know that Team A is more likely to win (maybe they are the Warriors). What
kind of prior might you use in this case? Imagine that the teams are going to play many games against each
other. What does this mean for the prior you select?
Question 20. Consider the two Beta distributions Beta(2,2) and Beta(9,9). Beta(9,9) gives much more
weight to θ = 0.5. Can you explain intuitively why this is so?
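A quick numeric comparison of the two densities, using scipy (a sketch, not part of the exercise):

```python
from scipy.stats import beta

for a, b in [(2, 2), (9, 9)]:
    dist = beta(a, b)
    # density at 0.5, and probability mass within 0.1 of 0.5
    print((a, b), dist.pdf(0.5), dist.cdf(0.6) - dist.cdf(0.4))
```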
Question 21. Suppose you have a lot of data and are trying to learn the structure of a Bayesian network
that fits this data. Consider two arbitrary Bayesian network designs. One is relatively sparse, whereas the
other has many connections between its nodes.
Imagine that your data consists of very few samples. Which Bayesian network would you expect to
achieve a better Bayesian score? How would this change if there were many samples?
Question 22. How many members are there in the Markov equivalence class represented by the partially
directed graph shown below?
(Partially directed graph over nodes A, B, C, D)
Question 23. Gibbs sampling offers a fast way to produce samples with which to estimate a distribution.
What are some downsides of Gibbs sampling and how are they handled?
Question 24. What is a topological sorting of the nodes shown in the following Bayesian network?
(Bayesian network diagram over nodes A, B, C, D, E, F)
Question 25. What does the neighborhood of a graph consist of and what are the basic graph operations
that can be performed?
Question 26. Draw two Markov equivalent Directed Acyclic Graphs (DAGs) that contain four nodes, three
edges, and one v-structure.
Question 27. What is P(s0 | b0, d1) for the probabilities P(s1) = 0.6, P(d1 | e1) = 0.3, P(d1 | e0) = 0.5,
P(e0 | s0, b0) = 0.2, and P(e0 | s1, b0) = 0.9, where S, B, E, D are binary random variables, given the
following Bayes net?
(Bayesian network diagram over nodes S, B, E, D)
Question 28. Given the same Bayes net as the previous question and the following probability tables, use
variable elimination with tables to find P(S | b0, d1).
Decision Problems
Question 29. What does it mean to be rational?
Question 30. Explain the value of information in words. What is the value of information of an observation
that does not change the optimal action? Imagine that the optimal action changes after an observation. What
does this say about the value of information of that observation?
Question 31. The prisoner's dilemma is an example of a game with a dominant strategy equilibrium.
Imagine that the game is modified so that if one prisoner testifies the other only gets four years of prison
instead of ten. Does this game still have a dominant strategy equilibrium? Are there any other equilibria?
Question 32. Explain why the traveler’s dilemma has a unique Nash equilibrium of 2. Draw the utility
matrix and use it to show the equilibrium.
Question 33. What is the Nash equilibrium for a game with the following payoff matrix?
Heads Tails
Heads 1,-1 -1,1
Tails -1,1 1,-1
Question 34. Give an example of a 3-player game where no pure strategy Nash equilibria exist.
Question 35. Suppose that a satellite in orbit has detected a probability of collision with a piece of space
debris, P(c1) = 0.0001. The satellite operators can decide to maneuver (m1) or not maneuver (m0). The utility
of a collision is U(c1) = −40,000,000, the utility of maneuvering is U(m1) = −5,000, and the utility of doing
nothing is U(m0) = 0. After performing a simulation of the maneuver, the probability of collision becomes
approximately 0. What action should they take assuming that there is no uncertainty in the probability of
collision measurement?
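A sketch of the expected-utility comparison, using only the quantities stated in the question:

```python
p_collision = 1e-4          # P(c1) without a maneuver
u_collision = -40_000_000   # U(c1)
u_maneuver = -5_000         # U(m1); the maneuver reduces the collision probability to ~0
u_nothing = 0               # U(m0)

eu_maneuver = u_maneuver                                # no collision risk after maneuvering
eu_no_maneuver = u_nothing + p_collision * u_collision  # expected utility of doing nothing
print(eu_maneuver, eu_no_maneuver)
```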
Sequential Problems
Question 36. What is the Markov assumption? What does a Markov decision process consist of? What is
a stationary MDP? Draw a compact representation of a stationary MDP.
Question 37. What is the purpose of the discount factor in infinite horizon problems? What is an alternative
to using a discount factor in infinite horizon problems? What effect does a small discount factor have? What
about a large one? When is one preferable to the other?
Question 38. Does the optimal policy have to be unique? Does the optimal value for each state have to
be unique?
Question 39. What is the Bellman equation? How does it simplify if transitions are deterministic?
Question 40. The policy evaluation equation in matrix form is
U^π = (I − γ T^π)^(−1) R^π
where U^π and R^π are the utility and reward functions represented as vectors. What is the meaning of T^π?
How does this relate Markov decision processes to Markov chains?
Question 41. What is dynamic programming? Can you give an example? Why is dynamic programming
more efficient than brute force methods for solving MDPs?
Question 42. Can you explain what policy iteration and value iteration are? What are their similarities
and differences?
Question 43. What is the difference between open and closed-loop planning?
Question 44. Consider the simple gridworld shown below. An agent in this world can move to the cell to
its immediate left or to the cell to its immediate right, and the transitions are deterministic. Moving left in
s1 gives a reward of 100 and terminates the game. Moving right in s4 does nothing. Perform value iteration
and determine the utility of being in each state assuming a discount factor of 0.9.
(Gridworld: a single row of four cells, s1 s2 s3 s4)
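A minimal value-iteration sketch for this four-state problem under the stated dynamics (deterministic moves, reward 100 for moving left in s1 followed by termination, moving right in s4 does nothing, γ = 0.9); the indexing and loop structure are my own:

```python
import numpy as np

gamma = 0.9
n = 4                          # states s1..s4, indexed 0..3
U = np.zeros(n)

for _ in range(100):           # iterate until effectively converged
    new_U = np.zeros(n)
    for s in range(n):
        # "left": in s1 the agent receives 100 and the game ends; otherwise it moves to s-1.
        left = 100.0 if s == 0 else gamma * U[s - 1]
        # "right": in s4 nothing happens; otherwise the agent moves to s+1.
        right = gamma * U[s] if s == n - 1 else gamma * U[s + 1]
        new_U[s] = max(left, right)
    U = new_U

print(U)   # approximately [100, 90, 81, 72.9]
```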
Question 45. How does asynchronous value iteration differ from standard value iteration? What is the
importance of the state ordering?
Apply Gauss-Seidel value iteration to the simple gridworld from the previous problem. First, use a state
ordering of s1 , s2 , s3 , s4 . Then use an ordering of s4 , s3 , s2 , s1 . How many iterations did each ordering take
to converge?
Question 46. In what cases would you prefer to use dynamic programming? Approximate dynamic programming? Online methods?
Question 47. Establish a lower bound on the optimal value function for an MDP with discount factor γ.
Assume you know the reward R(s, a) for all states and actions, but do not know the transition function.
Question 48. (a) What is the difference between U^π and U^*?
(b) What is the difference between U^π and Q^π? Express U^π as a function of Q^π, and vice versa.
Question 49. Why would you use an online solution technique as opposed to an offline method? Why not?
Question 50. Consider a continuous state MDP which we want to solve using Local Approximation Value
Iteration. A friend suggests finding the value at the finite set of states {s1, s2} and using the weighting
functions β1(s) = e^(−(s−s1)^2) and β2(s) = e^(−(s−s2)^2). What is wrong with these weighting functions, and how would
you change them so they are valid for Local Approximation Value Iteration?
Question 51. Consider a value function U(s) = e^s. Say we want to approximate this value function
using Global Approximation Value Iteration. We choose a basis β1:4(s) = {1, s, s^2, s^3}. What will the λ1:4
approximately converge to?
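One way to see what λ1:4 should roughly be is to least-squares fit e^s with this basis over a state interval; the interval [0, 1] below is an assumption:

```python
import numpy as np

s = np.linspace(0.0, 1.0, 101)                              # assumed state range
basis = np.column_stack([np.ones_like(s), s, s**2, s**3])   # beta_1..beta_4 evaluated on the grid
target = np.exp(s)

lam, *_ = np.linalg.lstsq(basis, target, rcond=None)
print(lam)   # compare with the Taylor coefficients of e^s: 1, 1, 1/2, 1/6
```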
Model Uncertainty
Question 52. For what types of problems do we use reinforcement learning? What are the two main
approaches?
Question 53. Why is the concept of exploration versus exploitation so important in reinforcement learning?
What is a multi-armed bandit? Describe the various parameters involved in a multi-armed bandit problem.
Imagine you have a two-armed bandit and are convinced that one of the levers yields a payout of $1 with
probability 0.9. You have never pulled the other lever, and are unsure if it has any payout. Relate this to
the problem of exploration and exploitation.
Question 54. Suppose we have a two-armed bandit. Our estimate of the payout rate of the first lever is
0.7, and our estimate of the payout rate for the second lever is 0.6. That is, ρ1 =0.7 and ρ2 =0.6. Our 95%
confidence intervals for θ1 and θ2 are (0.6, 0.8) and (0.3, 0.9), respectively.
What is the difference between θi and ρi? Suppose you used an ε-greedy strategy with ε = 0.5. How
might you decide which lever to pull? Suppose you used an interval exploration strategy with 95% confidence
intervals. Which lever would you pull?
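A sketch of the two exploration strategies mentioned, using the estimates and interval bounds from the question; the data-structure choices are my own:

```python
import random

rho = [0.7, 0.6]              # payout-rate estimates for levers 1 and 2
upper = [0.8, 0.9]            # upper ends of the 95% confidence intervals

def epsilon_greedy(epsilon=0.5):
    """With probability epsilon explore randomly; otherwise pull the greedy lever."""
    if random.random() < epsilon:
        return random.randrange(len(rho))
    return max(range(len(rho)), key=lambda i: rho[i])

def interval_exploration():
    """Pull the lever with the highest upper confidence bound."""
    return max(range(len(upper)), key=lambda i: upper[i])

print(epsilon_greedy(), interval_exploration())
```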
Question 55. What are Q-values and how do they differ from utility values U ? Imagine you have a model
of the reward and transition functions. If you were to run a value iteration using the Q-values instead of the
utility values U , what would be the update equation?
Question 56. What is the central equation behind incremental estimation? Identify the temporal difference
error and the learning rate. Imagine you have an estimate of some random variable X. Imagine that this
estimate is x̂ = 3. If the learning rate is 0.1, what happens to your estimate after observing a new sample
x = 7? What happens if the learning rate is 0.5? Comment on the effect that learning rate has on incremental
estimation.
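A worked check of the incremental update x̂ ← x̂ + α(x − x̂) with the numbers from the question:

```python
def incremental_update(estimate, sample, alpha):
    # (sample - estimate) is the temporal difference error; alpha is the learning rate.
    return estimate + alpha * (sample - estimate)

print(incremental_update(3.0, 7.0, 0.1))   # 3.4
print(incremental_update(3.0, 7.0, 0.5))   # 5.0
```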
Question 57. What are the similarities and differences between Q-learning and Sarsa?
Question 58. Use Q-values, the Bellman equation, and the incremental update equation to derive the
update equations for Q-learning and Sarsa.
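For reference, a sketch of the two resulting update rules in code, with Q stored as a dictionary over (state, action) pairs; the representation and example values are my own:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: bootstrap with the best action in the next state."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap with the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {(s, a): 0.0 for s in range(2) for a in range(2)}
q_learning_update(Q, 0, 1, 1.0, 1, actions=range(2), alpha=0.1, gamma=0.9)
print(Q[(0, 1)])   # 0.1
```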
Question 59. What is the difference between Sarsa and Sarsa(λ)? What types of problems can be solved
more efficiently using eligibility traces?
Question 60. What are the differences between model-based reinforcement learning and model-free reinforcement learning in terms of the quality of the learned policy and computational cost?
Question 61. Use the temporal difference equation to derive a value function update for samples of the
form (s, s′, r). Is the resulting algorithm model-based or model-free?
Question 62. When is generalization needed in the context of model uncertainty? Describe different
generalization methods.
State Uncertainty
Question 63. What is a POMDP and how does it differ from an MDP? Draw the structure of a POMDP
and compare it to that of an MDP.
Question 64. Examine the two gridworlds shown below. In the left-most gridworld, you know the position
of the agent, represented by the red square. In the right-most gridworld, you only have a probability
distribution over possible states. How might you represent the “state” for each case? Use this to explain
why POMDPs are sometimes called “belief-state MDPs” and are generally intractable.
(Two 3×3 gridworlds with cells numbered 1–9)
Question 65. A key to solving POMDPs is the ability to maintain a belief, or probability distribution, over
states. What methods can be used to update beliefs? When might one be preferred over the others?
Question 66. Derive the following equation for a discrete state filter:
b′(s′) ∝ O(o | s′, a) Σ_s T(s′ | s, a) b(s)
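A sketch of this discrete belief update in code, with small illustrative transition and observation arrays (the array shapes and numbers are assumptions):

```python
import numpy as np

def discrete_belief_update(b, a, o, T, O):
    """b'(s') proportional to O[o | s', a] * sum_s T[s' | s, a] * b(s).
    T has shape (S, A, S'); O has shape (S', A, observations)."""
    predicted = T[:, a, :].T @ b          # sum_s T(s' | s, a) b(s), for every s'
    bp = O[:, a, o] * predicted
    return bp / bp.sum()

# Illustrative two-state, two-action, two-observation example.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
              [[0.3, 0.7], [0.5, 0.5]]])  # transitions from state 1 under actions 0, 1
O = np.array([[[0.8, 0.2], [0.8, 0.2]],   # observation probabilities in next state 0
              [[0.1, 0.9], [0.1, 0.9]]])  # observation probabilities in next state 1
b = np.array([0.5, 0.5])
print(discrete_belief_update(b, a=0, o=1, T=T, O=O))
```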
Write down the alpha vectors for this problem. How sure should you be that there will be no exam before
you take the evening off? Imagine you have a third option, which is to drop out of school and live in the
wilderness. This simple lifestyle would give you a reward of 30, regardless of whether the exam takes place
or not. What can you say about this option? Would you ever take it?
Question 71. Imagine that you have already solved for the policy of a 3-state POMDP, and you have the following
alpha vectors:
α1 = [300, 100, 0],  α2 = [167, 10, 100],  α3 = [27, 50, 50]
The first and third alpha vectors correspond to action 1, and the second alpha vector corresponds to action 2.
Is this a valid policy? Can you have multiple alpha vectors per action? If the policy is valid, determine the
action you would take given you have the following belief: 0% chance in state 1, 70% chance in state 2, 30%
chance in state 3.
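A quick numeric check of the action choice, taking the expected value of each alpha vector under the given belief:

```python
import numpy as np

alphas = np.array([[300, 100, 0],      # alpha 1 (action 1)
                   [167, 10, 100],     # alpha 2 (action 2)
                   [27, 50, 50]])      # alpha 3 (action 1)
actions = [1, 2, 1]
b = np.array([0.0, 0.7, 0.3])

values = alphas @ b                    # expected utility of each alpha vector under b
print(values, actions[int(np.argmax(values))])   # the best alpha vector determines the action
```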
Question 72. What does it mean to solve a POMDP offline versus solving it online? What are the
advantages and disadvantages of each? How do QMDP, FIB, and point-based value iteration work? What
are the advantages and disadvantages of each?
Question 73. The update equation for QMDP is shown below.
α_a^(k+1)(s) = R(s, a) + γ Σ_{s′} T(s′ | s, a) max_{a′} α_{a′}^(k)(s′)
How many operations does each iteration take? Compare this with the number of operations required per
iteration for FIB, whose update equation is shown below.
α_a^(k+1)(s) = R(s, a) + γ Σ_o max_{a′} Σ_{s′} O(o | s′, a) T(s′ | s, a) α_{a′}^(k)(s′)
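A sketch of one sweep of the QMDP update with numpy arrays; the array shapes and the tiny example problem are assumptions:

```python
import numpy as np

def qmdp_iteration(alpha, R, T, gamma):
    """One sweep of alpha_a(s) <- R(s, a) + gamma * sum_{s'} T(s' | s, a) * max_{a'} alpha_{a'}(s').
    alpha: (A, S); R: (S, A); T: (S, A, S')."""
    best_next = alpha.max(axis=0)                      # max over actions, per next state
    return R.T + gamma * np.einsum("sap,p->as", T, best_next)

# Tiny illustrative problem: 2 states, 2 actions.
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
alpha = np.zeros((2, 2))
for _ in range(100):
    alpha = qmdp_iteration(alpha, R, T, 0.9)
print(alpha)
```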
Question 74. What is a practical advantage of using a particle filter to update the beliefs?
Question 75. Since in a POMDP the next belief state depends only on the current belief state (and the
current action and observation), a POMDP can be thought of as an MDP where the continuous state space
is the belief space. Why do we not solve a POMDP by constructing an equivalent MDP and then using value
iteration?
Question 76. Suppose Question 44 from Sequential Problems is modified such that moving right in s4 also
gives a reward of 100 and terminates the game. Use QMDP to find the alpha vectors associated with the
actions move left and move right.
Suppose our belief state is b = [0.3, 0.1, 0.5, 0.1]^T. What is the optimal action?
Question 77. Suppose you have discrete state, action, and observation spaces. Suppose further that you
have a transition function T(s′ | s, a) and an observation function O(o | s′, a). Given that you take action a
from belief b, what is the probability of observing o? What is the computational complexity of calculating
this probability?
Now suppose your state is static; the state cannot change. That is, T(s′ = s | s, a) = 1. Now what is
the probability of observing o? What is the computational complexity of calculating this probability?
Question 78. Following up on the previous question, what would the belief update look like when the state
of the world doesn't change, i.e., T(s′ = s | s, a) = 1?
Question 79. The traditional way that we model POMDPs is to identify a set of goal states and to stipulate
that the process terminates when it enters a goal state. But this assumes that the goal state is completely
observable and reachable with probability one. However, this assumption may not always hold. Give one
alternative way to model such problems.
Question 80. You are concerned with collision avoidance for two aircraft that are flying near each other.
Why might you want to model this as a POMDP rather than an MDP? What is the difference between the
observation and the state in the POMDP formulation?