Assignment 4: Reinforcement Learning
Prof. B. Ravindran
1. State True/False
The state transition graph for any MDP is a directed acyclic graph.
(a) True
(b) False
Sol. (b)
The statement is false. An MDP can transition from a state back to itself (a self-loop), and its
transition graph can contain longer cycles as well, so it need not be acyclic.
2. Consider the following statements:
(i) The optimal policy of an MDP is unique.
(ii) We can determine an optimal policy for an MDP using only the optimal value function ($v^*$),
without accessing the MDP parameters.
(iii) We can determine an optimal policy for a given MDP using only the optimal q-value
function ($q^*$), without accessing the MDP parameters.
Which of these statements are true?
(a) They do not require the state of the agent for solving an MDP.
(b) They do not require the action taken by the agent for solving an MDP.
(c) They do not require the state transition probability matrix for solving an MDP.
(d) They do not require the reward signal for solving an MDP.
Sol. (c)
RL algorithms need to know the state the agent is in, the action it takes, and the reward
signal received from the environment in order to solve the MDP. However, they do not need
to know the state transition probability matrix.
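To make this concrete, here is a minimal sketch (assuming numpy; the function names are illustrative, not from the assignment) contrasting statements (ii) and (iii): acting greedily with respect to $q^*$ needs only an argmax over actions, whereas acting greedily with respect to $v^*$ requires a one-step lookahead through the MDP model ($p$ and $r$).

```python
import numpy as np

def greedy_from_q(q_star, s):
    # Model-free: pick the action with the largest q*-value in state s.
    return int(np.argmax(q_star[s]))

def greedy_from_v(v_star, p, r, gamma, s):
    # Model-based: evaluate sum_s' p(s'|s,a) * [r(s,a,s') + gamma * v*(s')]
    # for every action a, which requires the transition and reward model.
    lookahead = [(p[s, a] * (r[s, a] + gamma * v_star)).sum()
                 for a in range(p.shape[1])]
    return int(np.argmax(lookahead))
```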
(i) $v^\pi(s) = \mathbb{E}_\pi\left[\sum_{i=t}^{\infty} \gamma^{i-t} R_{i+1} \mid S_t = s\right]$
(ii) $q^\pi(s, a) = \sum_{s'} p(s' \mid s, a)\, v^\pi(s')$
Sol. (d)
(i) is the definition of $v^\pi(s)$, and (iii) follows from the definitions of $v^\pi(s)$ and $q^\pi(s, a)$. (ii) is
wrong because it omits the immediate reward term (and the discount factor): the correct relation
is $q^\pi(s, a) = \sum_{s'} p(s' \mid s, a)\,[\, r(s, a, s') + \gamma\, v^\pi(s') \,]$.
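As a quick numerical illustration (a hypothetical 2-state, 2-action MDP, assuming numpy; the numbers are not from the assignment), solving the Bellman expectation equation for $v^\pi$ and then computing $q^\pi$ both ways shows that the reward-free formula in (ii) does not agree with the correct one:

```python
import numpy as np

gamma = 0.9
# hypothetical MDP: p[s, a, s'] transition probabilities, r[s, a, s'] rewards
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
pi = [0, 1]  # deterministic policy: action taken in each state

# v^pi solves v = r_pi + gamma * P_pi v, i.e. v = (I - gamma * P_pi)^{-1} r_pi
P_pi = np.array([p[s, pi[s]] for s in range(2)])
r_pi = np.array([(p[s, pi[s]] * r[s, pi[s]]).sum() for s in range(2)])
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

s, a = 0, 1
q_correct = (p[s, a] * (r[s, a] + gamma * v)).sum()  # with the immediate reward
q_claimed = (p[s, a] * v).sum()                      # statement (ii): reward omitted
print(q_correct, q_claimed)                          # the two values differ
```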
5. State True/False
While solving MDPs, in case of discounted rewards, the value of γ (the discount factor) cannot
affect the optimal policy.
(a) True
(b) False
Sol. (b)
Changing γ changes the relative weight of immediate versus delayed rewards, so the expected
return of each state, and hence the optimal policy, can change; see the sketch below for a small
example.
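For instance (a hypothetical two-action example, not from the assignment): suppose action a gives reward 2 and terminates, while action b gives reward 0 now and reward 3 one step later. The preferred action flips as γ crosses 2/3:

```python
# Compare the discounted returns of the two hypothetical actions for two values of gamma.
for gamma in (0.5, 0.9):
    return_a = 2.0                  # immediate reward only
    return_b = 0.0 + gamma * 3.0    # delayed reward, discounted once
    best = "a" if return_a > return_b else "b"
    print(f"gamma={gamma}: return_a={return_a:.2f}, return_b={return_b:.2f} -> best action {best}")
```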
6. Consider the following statements for a finite MDP ($I$ is the identity matrix of dimensions
$|S| \times |S|$, where $S$ is the set of all states, and $P_\pi$ is a stochastic matrix):
(i) MDP with stochastic rewards may not have a deterministic optimal policy.
(ii) There can be multiple optimal stochastic policies.
(iii) If $0 \le \gamma < 1$, then the rank of the matrix $I - \gamma P_\pi$ is equal to $|S|$.
(iv) If $0 \le \gamma < 1$, then the rank of the matrix $I - \gamma P_\pi$ is less than $|S|$.
Which of the above statements are true?
Sol. (a)
As stated in the lectures, there always exists a deterministic optimal policy, so (i) is false.
The lectures also give an example of multiple stochastic optimal policies, so (ii) is true.
Since $P_\pi$ is stochastic, every eigenvalue of $\gamma P_\pi$ has magnitude at most $\gamma < 1$, so $I - \gamma P_\pi$ has no
zero eigenvalue; its rank therefore equals the number of rows, which is $|S|$. A quick numerical
check is sketched below.
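A minimal numerical sanity check (assuming numpy; the random matrix is illustrative, not from the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.95
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)        # normalise rows so P is a stochastic matrix

M = np.eye(n) - gamma * P
print(np.linalg.matrix_rank(M))                      # n, i.e. |S|
print(np.abs(np.linalg.eigvals(gamma * P)).max())    # at most gamma < 1
```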
7. Consider an MDP with 3 states A, B, C. From each state, we can go to either of the two
states; i.e., from state A, we can perform 2 actions that lead to states B and C respectively.
The rewards for all the transitions are: r(A, B) = 2 (reward if we go from A to B), r(B, A) = 5,
r(B, C) = 7, r(C, B) = 10, r(A, C) = 1, r(C, A) = 12. The discount factor is 0.7. Find the
value function for the policy given by: π(A) = C (if we are in state A, we choose the action
to go to C), π(B) = A and π(C) = B ([$v^\pi(A)$, $v^\pi(B)$, $v^\pi(C)$]).
Sol. (c)
We can substitute each option to find which one is a fixed point of the Bellman equation;
alternatively, compute $v^\pi = (I - \gamma P_\pi)^{-1} r_\pi$ directly, as in the sketch below.
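A minimal sketch of the second approach (assuming numpy), using the deterministic transitions and rewards given in the question:

```python
import numpy as np

gamma = 0.7
# state order: A, B, C; each row is the deterministic transition under pi
P_pi = np.array([[0.0, 0.0, 1.0],   # pi(A) = C
                 [1.0, 0.0, 0.0],   # pi(B) = A
                 [0.0, 1.0, 0.0]])  # pi(C) = B
r_pi = np.array([1.0, 5.0, 10.0])   # r(A, C), r(B, A), r(C, B)

v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v)  # approximately [15.91, 16.13, 21.29]
```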
8. Suppose x is a fixed point for the function A, y is a fixed point for the function B, and x =
BA(x), where BA is the composition of B and A. Consider the following statements:
(i) x is a fixed point for B
(ii) x = y
(iii) BA(y) = y
Sol. (c)
(c) can be zero when x is not the zero vector.
10. Which of the following is a contraction mapping in any norm?
Sol. (a)
(a) is a contraction mapping in any norm, since $\|Tu - Tv\| = 0.5\,\|u - v\|$.
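The option text for (a) is not reproduced above; as an illustration of the property used in the solution, any map of the form $T(v) = 0.5\,v + b$ satisfies $\|Tu - Tv\| = 0.5\,\|u - v\|$ in every norm, which a quick check (assuming numpy) confirms:

```python
import numpy as np

rng = np.random.default_rng(1)
b = rng.random(4)
T = lambda v: 0.5 * v + b          # illustrative affine map with contraction factor 0.5

u, v = rng.random(4), rng.random(4)
for ord_ in (1, 2, np.inf):        # check in a few different norms
    lhs = np.linalg.norm(T(u) - T(v), ord=ord_)
    rhs = 0.5 * np.linalg.norm(u - v, ord=ord_)
    print(np.isclose(lhs, rhs))    # True for each norm
```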