
Assignment 4

Reinforcement Learning
Prof. B. Ravindran
1. State True/False
The state transition graph for any MDP is a directed acyclic graph.
(a) True
(b) False
Sol. (b)
The statement is false. An MDP can transition back to the same state (a self-loop) and can also contain longer cycles, so its state transition graph need not be acyclic.
2. Consider the following statements:
(i) The optimal policy of an MDP is unique.
(ii) We can determine an optimal policy for an MDP using only the optimal value function (v∗), without accessing the MDP parameters.
(iii) We can determine an optimal policy for a given MDP using only the optimal q-value function (q∗), without accessing the MDP parameters.
Which of these statements are true?

(a) Only (ii)


(b) Only (iii)
(c) Only (i), (ii)
(d) Only (i), (iii)
(e) Only (ii), (iii)
Sol. (b)
An optimal policy can be recovered from an optimal q-value function by acting greedily: π∗(s) ∈ argmax_a q∗(s, a). Recovering a policy from v∗ alone requires a one-step lookahead through the transition probabilities and rewards, i.e., access to the MDP parameters, so (ii) is false. Also, a given MDP can have multiple optimal policies, so (i) is false.
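As a minimal illustration (the q-values below are made up, not taken from any particular MDP), acting greedily with respect to a tabular q∗ recovers an optimal action in every state without ever touching the transition probabilities or rewards:

```python
import numpy as np

# Hypothetical optimal q-values for a 3-state, 2-action MDP
# (numbers invented purely for illustration).
q_star = np.array([[1.0, 2.5],    # q*(s0, a0), q*(s0, a1)
                   [0.3, 0.1],    # q*(s1, .)
                   [4.0, 4.0]])   # q*(s2, .) -- a tie, so several greedy policies exist

# Greedy policy: in each state, pick an action maximising q*(s, a).
pi_star = q_star.argmax(axis=1)
print(pi_star)  # [1 0 0]
```

Doing the same from v∗ alone would require evaluating Σ_{s′} p(s′|s, a)[r + γ v∗(s′)] for every action, which needs the MDP parameters.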

3. Which of the following is a benefit of using RL algorithms for solving MDPs?

(a) They do not require the state of the agent for solving an MDP.
(b) They do not require the action taken by the agent for solving an MDP.
(c) They do not require the state transition probability matrix for solving an MDP.
(d) They do not require the reward signal for solving an MDP.

Sol. (c)
RL algorithms need to know the state the agent is in, the action it takes, and the reward signal from the environment in order to solve an MDP. However, they do not need to know the state transition probability matrix.
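For instance, a tabular Q-learning update (a standard RL algorithm; the table sizes and the sampled transition below are made up for illustration) consumes only an observed (state, action, reward, next state) tuple:

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.9
Q = np.zeros((n_states, n_actions))

def q_learning_update(Q, s, a, r, s_next):
    """One tabular Q-learning step. Only the sampled transition (s, a, r, s')
    is used; the transition probability matrix p(s'|s, a) never appears."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# A single update from one observed transition (s=0, a=1, r=2.0, s'=3).
q_learning_update(Q, s=0, a=1, r=2.0, s_next=3)
print(Q[0, 1])  # 0.2
```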

4. Consider the following equations:

(i) v^π(s) = E_π[ Σ_{i=t}^{∞} γ^{i−t} R_{i+1} | S_t = s ]

(ii) q^π(s, a) = Σ_{s′} p(s′|s, a) v^π(s′)

(iii) v^π(s) = Σ_a π(a|s) q^π(s, a)

Which of the above are correct?


(a) Only (i)
(b) Only (i), (ii)
(c) Only (ii), (iii)
(d) Only (i), (iii)
(e) (i), (ii), (iii)

Sol. (d)
(i) is the definition of v^π(s), and (iii) follows from the definitions of v^π(s) and q^π(s, a). (ii) is wrong because it omits the immediate reward term (and the discount factor); the correct relation is q^π(s, a) = E[R_{t+1} + γ v^π(S_{t+1}) | S_t = s, A_t = a].
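A quick numerical check of (iii) on a small made-up MDP (2 states, 2 actions; all numbers below are purely illustrative): solve the Bellman expectation equation for v^π, form q^π, and confirm that averaging q^π under π recovers v^π, while the reward-free expression in (ii) does not.

```python
import numpy as np

gamma = 0.9
# Made-up 2-state, 2-action MDP: p[s, a, s'] and r[s, a].
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [2.0, 0.5]])
pi = np.array([[0.7, 0.3],   # pi(a|s=0)
               [0.4, 0.6]])  # pi(a|s=1)

# Policy-induced transition matrix and expected reward vector.
P_pi = np.einsum('sa,sat->st', pi, p)
r_pi = np.einsum('sa,sa->s', pi, r)

# v^pi from the Bellman expectation equation: v = r_pi + gamma * P_pi v.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# q^pi(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) v^pi(s').
q_pi = r + gamma * np.einsum('sat,t->sa', p, v_pi)

# (iii) holds: v^pi(s) = sum_a pi(a|s) q^pi(s, a).
print(np.allclose((pi * q_pi).sum(axis=1), v_pi))  # True

# (ii) omits the immediate reward (and gamma), so it does not equal q^pi.
q_wrong = np.einsum('sat,t->sa', p, v_pi)
print(np.allclose(q_wrong, q_pi))  # False
```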
5. State True/False
While solving MDPs with discounted rewards, the value of γ (the discount factor) cannot
affect the optimal policy.
(a) True
(b) False
Sol. (b)
Changing γ changes the expected return of states and can therefore change the optimal policy. For example, if one action yields an immediate reward of 1 and another yields a reward of 10 one step later, the delayed action is preferred only when 10γ > 1, i.e., when γ > 0.1.
6. Consider the following statements for a finite MDP (I is an identity matrix of dimensions |S| × |S|, where S is the set of all states, and Pπ is a stochastic matrix):
(i) An MDP with stochastic rewards may not have a deterministic optimal policy.
(ii) There can be multiple optimal stochastic policies.
(iii) If 0 ≤ γ < 1, then the rank of the matrix I − γPπ is equal to |S|.
(iv) If 0 ≤ γ < 1, then the rank of the matrix I − γPπ is less than |S|.
Which of the above statements are true?

(a) Only (ii), (iii)


(b) Only (ii), (iv)
(c) Only (i), (iii)
(d) Only (i), (ii), (iii)

Sol. (a)
As stated in the lectures, every finite MDP has at least one deterministic optimal policy, so (i) is false.
The lectures also give an example of multiple stochastic optimal policies, so (ii) is true.
Since Pπ is stochastic, all of its eigenvalues have magnitude at most 1. For 0 ≤ γ < 1, every eigenvalue 1 − γλ of I − γPπ is therefore non-zero, so the matrix is full rank and its rank equals |S|.
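A small numerical sanity check of the rank claim (the stochastic matrix below is an arbitrary illustrative example):

```python
import numpy as np

gamma = 0.7
# An arbitrary 3-state row-stochastic matrix (each row sums to 1).
P_pi = np.array([[0.2, 0.5, 0.3],
                 [0.0, 0.6, 0.4],
                 [0.9, 0.0, 0.1]])

M = np.eye(3) - gamma * P_pi
print(np.linalg.matrix_rank(M))               # 3, i.e. |S|: full rank
print(np.abs(np.linalg.eigvals(P_pi)).max())  # ~1.0: stochastic matrices have |eigenvalue| <= 1
```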

7. Consider an MDP with 3 states A, B, C. From each state, we can go to either of the other two
states, i.e., from state A, we can perform 2 actions that lead to states B and C respectively.
The rewards for all the transitions are: r(A, B) = 2 (reward if we go from A to B), r(B, A) = 5,
r(B, C) = 7, r(C, B) = 10, r(A, C) = 1, r(C, A) = 12. The discount factor is 0.7. Find the
value function for the policy given by: π(A) = C (if we are in state A, we choose the action
to go to C), π(B) = A and π(C) = B ([v^π(A), v^π(B), v^π(C)]).

(a) [10.2, 16.7, 20.2]


(b) [14.2, 16.5, 15.1]
(c) [15.9, 16.1, 21.3]
(d) [12.2, 6.2, 14.5]

Sol. (c)
We can substitute the options to find which one is a fixed point of the Bellman equation; alternatively, compute v^π = (I − γPπ)^{−1} rπ directly.

Note: Pπ = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]; rπ = [1, 5, 10]^T; γ = 0.7
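Solving the linear system directly with the Pπ, rπ, and γ from the note above:

```python
import numpy as np

gamma = 0.7
# Policy-induced transitions (states ordered A, B, C): A -> C, B -> A, C -> B.
P_pi = np.array([[0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])
r_pi = np.array([1.0, 5.0, 10.0])  # r(A, C), r(B, A), r(C, B)

# v^pi = (I - gamma * P_pi)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(np.round(v_pi, 1))  # [15.9 16.1 21.3] -> option (c)
```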

8. Suppose x is a fixed point for the function A, y is a fixed point for the function B, and x =
BA(x), where BA is the composition of B and A. Consider the following statements:
(i) x is a fixed point for B
(ii) x = y
(iii) BA(y) = y

Which of the above must be true?


(a) Only (i)
(b) Only (ii)
(c) Only (i), (ii)
(d) (i), (ii), (iii)
Sol. (a)
Since x is a fixed point of A, A(x) = x, so x = B(A(x)) = B(x); therefore x is a fixed point of B and (i) is true.
However, this does not imply x = y: B could have multiple fixed points (consider the identity function), so (ii) need not hold.
There is also no guarantee that y is a fixed point of A, so BA(y) = B(A(y)) need not equal y; hence (iii) need not hold.
9. Which of the following is not a valid norm function? (x is a D dimensional vector)
(a) max_{d∈{1,...,D}} |x_d|
(b) √( Σ_{d=1}^{D} x_d² )
(c) min_{d∈{1,...,D}} |x_d|
(d) Σ_{d=1}^{D} |x_d|

Sol. (c)
(c) is not a norm: it can be zero for a non-zero vector (e.g., x = (0, 1, . . . , 1)), violating the requirement that ∥x∥ = 0 only if x = 0. Options (a), (b), and (d) are the L∞, L2, and L1 norms respectively.

10. Which of the following is a contraction mapping in any norm?

(a) T ([x1 , x2 ]) = [0.5x1 , 0.5x2 ]


(b) T ([x1 , x2 ]) = [2x1 , 2x2 ]
(c) T ([x1 , x2 ]) = [2x1 , 3x2 ]
(d) T ([x1 , x2 ]) = [x1 + x2 , x1 − x2 ]

Sol. (a)
(a) is a contraction mapping in any norm, since ∥Tu − Tv∥ = 0.5∥u − v∥. Each of the other maps has an eigenvalue of magnitude greater than 1 (2 for (b), 2 and 3 for (c), ±√2 for (d)); along the corresponding eigenvector, ∥Tu − Tv∥ = |λ|∥u − v∥ > ∥u − v∥, so none of them is a contraction in any norm.
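A quick empirical check (random points and the Euclidean norm, used here only for illustration): map (a) shrinks distances by a factor of 0.5, while the others expand them.

```python
import numpy as np

rng = np.random.default_rng(0)

maps = {
    "(a)": lambda x: np.array([0.5 * x[0], 0.5 * x[1]]),
    "(b)": lambda x: np.array([2.0 * x[0], 2.0 * x[1]]),
    "(c)": lambda x: np.array([2.0 * x[0], 3.0 * x[1]]),
    "(d)": lambda x: np.array([x[0] + x[1], x[0] - x[1]]),
}

u, v = rng.normal(size=2), rng.normal(size=2)
for name, T in maps.items():
    # Ratio ||T(u) - T(v)|| / ||u - v||; a contraction needs this < 1 for all u, v.
    ratio = np.linalg.norm(T(u) - T(v)) / np.linalg.norm(u - v)
    print(name, round(float(ratio), 3))
# (a) prints 0.5; (b), (c), and (d) print values greater than 1.
```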
