UNIT-II
2.1 Markov Decision Problem:
A Markov Decision Problem (MDP) is a mathematical framework used in the field of
reinforcement learning and decision-making under uncertainty. It provides a formal way to
model situations where an agent interacts with an environment and must make a sequence of
decisions to maximize a cumulative reward. Let us delve into the details of MDP:
Components of a Markov Decision Problem:
State Space (S): This is a finite set of all possible situations or states the agent can be in. States
represent the information needed to make decisions.
Action Space (A): This is a finite set of all possible actions the agent can take. Actions are the
decisions or choices made by the agent.
Transition Probabilities (P): These represent the probability of transitioning from one state
to another after taking a particular action. It defines the dynamics of the environment.
Reward Function (R): This function assigns a numerical reward to each state-action pair. It
quantifies the immediate benefit or cost of taking a specific action in a particular state.
Policy (π): A policy is a strategy that specifies which action to take in each state. It defines the
agent's behaviour and can be deterministic or stochastic.
Value Function (V): The value function, denoted as V(s), measures the expected cumulative
reward an agent can achieve starting from a particular state while following a given policy π.
It helps evaluate the desirability of states.
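To make the components above concrete, here is a minimal sketch of a hypothetical two-state, two-action MDP written as plain Python dictionaries. The state names, actions, probabilities, and reward values are invented purely for illustration and are not part of the original notes.

```python
# A hypothetical two-state, two-action MDP written as plain Python dicts.
# P[s][a] maps each next state to its transition probability;
# R[(s, a)] is the immediate reward for taking action a in state s.
states = ["s0", "s1"]
actions = ["stay", "move"]

P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
gamma = 0.9  # discount factor: how much future rewards are worth today
```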

Markov Property:
One of the key assumptions in an MDP is the Markov property, which means that the future
state and reward only depend on the current state and action, not on the history of states and
actions that led to the current state. This property simplifies modeling and computation.
Objective:
The goal in an MDP is typically to find an optimal policy, denoted as π*, that maximizes the
expected cumulative reward over time. This optimal policy guides the agent's decision-making
to achieve the highest possible long-term reward.
Solving MDPs:
There are various algorithms for solving MDPs, including:
Value Iteration: An iterative method that computes the optimal value function and policy.
Policy Iteration: An algorithm that alternates between policy evaluation and policy
improvement until convergence to an optimal policy.
Q-Learning: A model-free reinforcement learning algorithm that learns the optimal action-
value function (Q-function) directly through exploration.

Applications:
MDPs are used in a wide range of applications, including robotics, game playing, autonomous
decision-making systems, finance, healthcare, and more. They are especially relevant in
scenarios where an agent must make sequential decisions under uncertainty.
Extensions:
MDPs can be extended to handle continuous state and action spaces (Continuous MDPs),
partially observable environments (Partially Observable MDPs or POMDPs), and situations
with multiple agents (Multi-Agent MDPs).
Trade-Offs:
In practice, finding the optimal policy for large-scale MDPs can be computationally expensive.
Approximation techniques, such as function approximation and deep reinforcement learning,
are often used to address this challenge.
A Markov Decision Problem provides a formal framework for modelling decision-making in
environments with uncertainty. It involves defining states, actions, transition probabilities,
rewards, and policies, with the goal of finding the optimal policy that maximizes the agent's
expected cumulative reward. MDPs are fundamental to the field of reinforcement learning and
have numerous real-world applications.

2.2 Policy (π):


In the context of a Markov Decision Problem (MDP), a policy is a crucial concept that defines
the strategy or behavior an agent should follow when making decisions in different states of
the environment. Policies guide the agent in selecting actions based on its current knowledge
and the goal of maximizing its expected cumulative reward over time. Let us explore policies
in MDPs in detail:
Definition of a Policy (π):
A policy, denoted as π, is a mapping from states to actions. It specifies, for each possible state
in the MDP, which action the agent should take.
Mathematically, a policy can be represented as a mapping π: S → A, where π(s) = a is the action
chosen in state 's'.
Types of Policies:
Policies can be classified into two main categories:
Deterministic Policy: In a deterministic policy, for each state 's', there is exactly one associated
action 'a'. It provides a clear and fixed mapping from states to actions.
Stochastic Policy: In a stochastic policy, there can be multiple actions associated with each
state, and each action is chosen with a certain probability. Stochastic policies introduce
randomness into decision-making.
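The difference between the two policy types can be sketched in a few lines of Python. The state and action names below reuse the illustrative toy MDP from earlier, and the sampling helper is a hypothetical convenience function, not something prescribed by the notes.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "move", "s1": "stay"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"stay": 0.2, "move": 0.8},
    "s1": {"stay": 0.7, "move": 0.3},
}

def sample_action(policy, state):
    """Return an action for `state`, handling both policy types."""
    choice = policy[state]
    if isinstance(choice, str):            # deterministic: a fixed action
        return choice
    acts, probs = zip(*choice.items())     # stochastic: sample by probability
    return random.choices(acts, weights=probs, k=1)[0]
```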

Exploration and Exploitation:


Policies play a crucial role in balancing exploration and exploitation in MDPs.
Exploration involves trying out different actions to gather information about the environment,
especially in states with uncertain outcomes.
Exploitation involves selecting actions that are known or believed to maximize expected
rewards based on the current knowledge.
Optimal Policy (π*):
The goal in solving an MDP is typically to find an optimal policy, denoted as π*, that
maximizes the expected cumulative reward over time.
The optimal policy guides the agent to take actions that lead to the highest possible long-term
rewards in each state.
Value Function and Policy Evaluation:
To determine the quality of a policy, MDPs often involve the concept of a value function,
denoted as V(s) or Q(s, a).
The value function measures the expected cumulative reward that can be obtained by following
the policy π (or by taking action 'a' in state 's' for Q(s, a)) starting from a particular state.
The value function is used to evaluate and compare different policies, and the goal is to find a
policy that maximizes this value.
Policy Improvement:
Once the value function is computed, policies can be improved iteratively.
Policy improvement involves selecting actions that lead to higher value estimates in each state.
It is common to use algorithms like Policy Iteration to find an optimal policy by iteratively
evaluating and improving the current policy.
Exploration Strategies:
In cases where exploration is necessary, stochastic policies or ε-greedy policies are often used.
An ε-greedy policy chooses the action with the highest estimated value with probability 1-ε and
selects a random action (exploration) with probability ε.
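As a rough illustration of this rule, the sketch below implements an ε-greedy choice over a hypothetical table of action-value estimates; the dictionary layout for q_values and the default value of ε are assumptions made only for the example.

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """Choose the greedy action with probability 1 - epsilon, else explore.

    q_values is assumed to be a dict keyed by (state, action) pairs;
    unseen pairs default to 0.0.
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore: pick a random action
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))  # exploit
```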
Applications:
Policies are fundamental in various applications, including robotics, autonomous systems,
game playing, recommendation systems, finance, and more.
Different domains may require different types of policies, such as deterministic or stochastic,
depending on the nature of the problem and the level of uncertainty.
A policy in a Markov Decision Problem defines how an agent should behave in different states
of the environment. It is a fundamental component in reinforcement learning and decision-
making, and the goal is often to find an optimal policy that maximizes expected cumulative
rewards over time. Policies can be deterministic or stochastic and are key to balancing
exploration and exploitation in the agent's decision-making process.

2.3 Value function:


In a Markov Decision Problem (MDP), the value function is a critical concept that quantifies
the expected cumulative reward an agent can achieve by following a specific policy while
interacting with the environment. The value function provides a way to evaluate and compare
different policies, helping the agent make decisions that maximize its long-term rewards. Let's
explore the value function in MDPs in detail:
Definition of Value Function:
The value function, denoted as V(s) for a state 's', represents the expected cumulative reward
an agent can obtain by starting in state 's' and following a given policy π.
Mathematically, it is defined as the expected sum of rewards when following policy π from
state 's' and is often represented as:
V(s) = E[ Σ [t=0 to ∞] γ^t * R_t | s, π ]
Here: 'E' denotes the expectation over all possible sequences of rewards, given the initial state
's' and policy π.
'γ' is the discount factor (0 ≤ γ < 1), which reflects the agent's preference for immediate rewards
over future rewards.
'R_t' represents the reward received at time step 't'.
Use Cases of Value Function:
The value function serves several purposes in MDPs:
Policy Evaluation: It helps evaluate the quality of a policy by quantifying how much expected
cumulative reward it can achieve in different states.
Policy Improvement: It guides policy improvement algorithms by identifying actions that lead
to higher value estimates in each state.
Policy Comparison: It allows for the comparison of different policies to determine which one
is better in terms of expected long-term rewards.
Optimal Policy Selection: The optimal policy is often selected based on the highest value
estimates across states.
Bellman Equation for Value Function:
The Bellman equation provides a recursive relationship for the value function based on the
principle of optimality. For a state 's', it can be expressed as:
V(s) = max [a in A] Σ[sp in S] P(sp | s, a) * [R(s, a, sp) + γ * V(sp)]
Here: 'a' is an action from the action space A.
'sp' represents the possible successor states.

'P(sp | s, a)' is the probability of transitioning to state 'sp' when taking action 'a' in state 's'.
'R(s, a, sp)' is the immediate reward received after taking action 'a' in state 's' and transitioning
to state 'sp'.
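Assuming the transition and reward tables are stored as dictionaries (a layout chosen only for this example, mirroring the equation above), a single Bellman backup for one state could look like the following sketch.

```python
def bellman_backup(V, state, actions, P, R, gamma):
    """Apply the Bellman optimality backup to one state and return max_a value.

    P[state][a] is a dict {sp: probability}; R[(state, a, sp)] is the immediate
    reward for that transition. Both layouts are illustrative assumptions.
    """
    return max(
        sum(prob * (R[(state, a, sp)] + gamma * V[sp])
            for sp, prob in P[state][a].items())
        for a in actions
    )
```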
Methods for Solving the Value Function:
There are several methods to compute or approximate the value function:
Dynamic Programming: Algorithms like Value Iteration and Policy Iteration iteratively
update the value function until convergence.
Monte Carlo Methods: These methods estimate the value function by sampling sequences of
states and rewards to approximate expected values.
Temporal Difference Learning: Algorithms like Q-learning and SARSA use temporal
differences to update value estimates based on observed rewards and transitions.
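As a small illustration of the temporal-difference idea, here is a sketch of a single Q-learning update; the table layout and the learning-rate value are assumptions made for the example, not prescriptions from the notes.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: move Q(s, a) toward the temporal-difference target.

    Q is assumed to be a dict keyed by (state, action); alpha is the learning rate.
    """
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next            # observed reward plus bootstrap
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
```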
Applications:
Value functions are widely used in applications such as reinforcement learning, robotics, game
playing, autonomous systems, finance, and optimization, where long-term decision-making is
critical.
Discount Factor (γ):
The choice of the discount factor 'γ' affects how much importance is given to immediate
rewards versus future rewards. Higher values of 'γ' prioritize long-term rewards.
Optimal Value Function (V*):
The optimal value function, denoted as V*, represents the maximum expected cumulative
reward achievable under an optimal policy. It satisfies the Bellman optimality equation:
V*(s) = max [a in A] Σ[sp in S] P(sp | s, a) * [R(s, a, sp) + γ * V*(sp)]
The value function in a Markov Decision Problem quantifies the expected cumulative reward
an agent can achieve by following a specific policy while considering the trade-off between
immediate and future rewards. It plays a central role in policy evaluation, policy improvement,
policy comparison, and the selection of an optimal policy in the context of decision-making
under uncertainty.

2.4 Reward models (infinite discounted, total, finite horizon and average)
Reward models play a central role in Markov Decision Processes (MDPs), which are a formal
framework for modelling sequential decision-making problems in reinforcement learning. Let
us discuss each of the reward models—Infinite Discounted, Total, Finite Horizon, and Average
in the context of MDPs:

2.4.1 Infinite Discounted Reward (Discounted Sum):


The infinite discounted reward model is a fundamental concept in Markov Decision Processes
(MDPs) within the field of reinforcement learning. It provides a framework for optimizing an
agent's behavior over an infinite time horizon while taking into account the uncertainty
associated with future rewards. Let's delve into the details of the infinite discounted reward
model in the context of an MDP:
Markov Decision Process (MDP):
An MDP is a mathematical framework used to model sequential decision-making problems.
It consists of:
A set of states (S): These represent all possible situations or configurations of the environment.
A set of actions (A): These are the choices available to the agent to interact with the
environment.
Transition probabilities (P): These describe the likelihood of transitioning from one state to
another when taking a particular action.
Reward function (R): This defines the immediate reward received by the agent when it takes
a specific action in a given state.
Discount factor (γ): A value between 0 and 1 that determines the importance of future rewards
relative to immediate rewards.
Infinite Discounted Reward Objective:
In the infinite discounted reward model, the agent aims to maximize the expected sum of
rewards over an infinite time horizon.
At each time step t, the agent receives a reward R(t) when transitioning from one state to
another by taking a particular action.
The agent discounts these rewards over time by multiplying each reward by the discount factor
γ (0 ≤ γ < 1). This discounting represents the agent's preference for immediate rewards over
future rewards due to uncertainty and the desire for stability in decision-making.
The objective is to maximize the expected cumulative discounted reward:
G = Σ [t=0 to ∞] γ^t * R(t)
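A short sketch makes the effect of the discount factor visible: the same reward sequence is worth very different amounts under a far-sighted and a near-sighted γ. The reward values below are invented for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G = sum over t of gamma^t * R(t) for a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The same reward sequence valued with two different discount factors.
rewards = [1.0, 0.0, 0.0, 10.0]
print(discounted_return(rewards, gamma=0.9))  # distant reward still counts heavily
print(discounted_return(rewards, gamma=0.1))  # behaviour is nearly myopic
```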
Optimal Policy:
The agent's goal is to find an optimal policy, which is a mapping from states to actions that
maximizes the expected cumulative discounted reward.
The optimal policy, denoted as π*, should yield the highest expected sum of rewards over an
infinite time horizon.
Balancing Immediate and Future Rewards:
The discount factor γ plays a critical role in the agent's decision-making.
A smaller γ places more emphasis on immediate rewards, making the agent act more
myopically.

A larger γ values future rewards more, encouraging the agent to consider long-term
consequences.
The choice of γ depends on the nature of the problem. For example, in some scenarios, it's more
important to optimize immediate outcomes, while in others, long-term planning is necessary.
Computational Challenges:
Solving for the optimal policy in the infinite discounted reward model can be computationally
challenging, especially for large MDPs.
Common algorithms for solving MDPs include value iteration, policy iteration, and various
forms of reinforcement learning algorithms like Q-learning and deep reinforcement learning.
In summary, the infinite discounted reward model in an MDP frames the agent's objective as
maximizing the expected cumulative reward over an infinite time horizon while accounting for
the uncertainty of future rewards through discounting. The choice of the discount factor γ
influences the balance between immediate and future rewards and guides the agent's decision-
making process in the face of uncertainty.

2.4.2 Total Reward:
In some scenarios, it is natural to consider a finite time horizon H. The agent aims to maximize
the expected cumulative reward over this fixed horizon.
The objective is to maximize the expected cumulative reward within this horizon:
G = Σ [t=0 to H-1] R(t)
Total reward models are suitable for episodic tasks where there's a natural end after a certain
number of time steps or episodes.
2.4.3 Finite Horizon Reward:
Similar to total rewards, finite horizon rewards focus on optimizing behavior within a fixed
time horizon H.
The agent aims to maximize the expected cumulative reward over the time horizon:
G = Σ [t=0 to H-1] R(t)
This model is used when there's a predetermined limit to how long the agent can operate
effectively.
2.4.4 Average Reward:
Average reward models are concerned with maximizing the expected long-term average reward
in an MDP.
They are typically applied in scenarios where the MDP is ergodic, meaning it eventually
reaches a steady-state distribution.
The objective is to maximize the expected average reward:
G = lim (T → ∞) [ (1/T) * Σ [t=0 to T-1] R(t) ]

This model assumes that as the agent interacts with the environment for a long time, the average
reward converges to a stable value. It is commonly used in continuous control tasks.
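For comparison with the discounted case, the sketch below computes the undiscounted total return and the empirical average reward for a hypothetical reward sequence; both quantities correspond to the objectives described above.

```python
def total_return(rewards):
    """Undiscounted total return over a finite horizon or a single episode."""
    return sum(rewards)

def average_reward(rewards):
    """Empirical average reward over the T time steps observed so far."""
    return sum(rewards) / len(rewards)

rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical rewards over four time steps
print(total_return(rewards))     # 4.0
print(average_reward(rewards))   # 1.0
```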
The choice of reward model in an MDP depends on the specific problem and the agent's
objectives. Infinite discounted rewards emphasize balancing short-term and long-term rewards,
total and finite horizon rewards are suited for tasks with fixed horizons or episodic nature, and
average reward models are useful when considering long-term average performance.
Designing appropriate reward functions is a critical aspect of specifying an MDP and
influencing the agent's learning and decision-making process within that environment.

2.5 Episodic & continuing tasks:


2.5.1 Episodic tasks:
Episodic tasks are a specific type of problem where the agent operates in a sequence of
episodes, each of which has a distinct beginning and end. Understanding episodic tasks is
fundamental to grasping the principles of reinforcement learning. Here is a detailed
explanation:
Key Characteristics of Episodic Tasks:

Episodes: In episodic tasks, the agent interacts with the environment in a series of episodes.
Each episode represents a self-contained task with its own unique start and finish. When an
episode ends, the environment typically resets to some initial state or condition, and a new
episode begins.
Termination Condition: Episodic tasks are defined by a termination condition that determines
when an episode ends. This condition can take various forms, such as reaching a specific state,
a time limit, or achieving a certain goal. Once this condition is met, the current episode
concludes, and a new one starts.
Objective: The primary objective of the agent in episodic tasks is to maximize the expected
cumulative reward obtained within each episode. This means that at the start of each episode,
the agent's goal resets, and it aims to make the best sequence of decisions to earn the most
reward within that episode.
Independence: Episodes in episodic tasks are typically considered independent of each other.
The agent's experience and learning from one episode do not directly influence subsequent
episodes. This independence allows for learning from scratch in each episode.
Examples: Episodic tasks are often exemplified by games and puzzles, where each round or
game represents an episode. For instance, in a game of chess or a game of Tic-Tac-Toe, each
match is an episode with a clear beginning (start of the game) and end (win, lose, or draw).
Mathematical Formulation:
In the mathematical representation of episodic tasks within MDPs, key components include:
State Space (S): The set of all possible states that the environment can be in during an episode.
Action Space (A): The set of all possible actions that the agent can take.

Reward Function (R): A function that maps state-action pairs to immediate rewards. In
episodic tasks, rewards are typically assigned at the end of each episode, based on the outcome
of that episode.
Transition Probability (P): A function that specifies the probability of transitioning from one
state to another when taking a particular action. In episodic tasks, this transition function is
often simplified because there's a reset to an initial state at the end of each episode.
Policy (π): The agent's strategy or policy, which defines the mapping from states to actions.
The objective is to find the optimal policy that maximizes expected cumulative rewards within
an episode.
Challenges and Considerations:
In episodic tasks, since each episode starts from scratch, it is possible to apply simpler methods
like Monte Carlo methods for estimating value functions and policies.
Exploration-exploitation strategies play a significant role within each episode, as the agent
must balance exploration to learn and exploitation to maximize reward within that episode.
While episodic tasks have clear endpoints, the challenge is to find the best sequence of actions
within each episode to achieve the maximum cumulative reward, considering the stochastic
nature of the environment.
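Because every episode terminates, complete returns can be computed and averaged. The sketch below shows a first-visit Monte Carlo estimate of V(s); the episode format (a list of state-reward pairs) is an assumption made only for this example.

```python
from collections import defaultdict

def first_visit_mc_value(episodes, gamma=1.0):
    """Estimate V(s) from completed episodes by averaging first-visit returns.

    Each episode is assumed to be a list of (state, reward) pairs; episodic
    tasks make this possible because every episode terminates.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return following every time step, working backwards.
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_at[t] = G
        # Credit each state only at its first visit within the episode.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:
                seen.add(state)
                returns_sum[state] += returns_at[t]
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```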
Episodic tasks in MDPs involve problems where the agent operates in distinct episodes, each
with its own start and finish. The agent's goal is to maximize cumulative rewards within each
episode, and episodes are typically considered independent of each other. This framework is
well suited for modelling scenarios with clear task boundaries, such as games or puzzle-
solving.

2.5.2 Continuing tasks:


Continuing tasks in the context of Markov Decision Processes (MDPs) represent a different
type of problem compared to episodic tasks. In continuing tasks, the agent interacts with the
environment indefinitely, and there is no natural concept of episodes or distinct endpoints.
Instead, the goal is to maximize the agent's expected cumulative reward over an extended
period. Here's a detailed explanation of continuing tasks:
Key Characteristics of Continuing Tasks:
Infinite Horizon: Unlike episodic tasks, which have well-defined episodes with clear
beginnings and ends, continuing tasks involve interactions that continue indefinitely. There is
no predefined termination condition that signals the end of an episode.
Reward Discounting: To handle continuing tasks, a discount factor, typically denoted as γ
(gamma), is introduced. This discount factor is a number between 0 and 1. It represents how
much the agent values immediate rewards compared to rewards received in the future. A higher
γ places more emphasis on long-term rewards, while a lower γ emphasizes short-term rewards.
Objective: In continuing tasks, the primary objective is to maximize the expected sum of
discounted rewards over an infinite time horizon. This means that the agent aims to make
decisions that not only consider immediate rewards but also account for the expected future
rewards, appropriately weighted by the discount factor.
No Natural Endpoint: Continuing tasks are used to model scenarios where there is no natural
endpoint to the agent's interactions with the environment. For example, in the case of
controlling a robot, managing a financial portfolio, or operating a recommendation system, the
agent's actions continue indefinitely.
Policy Evaluation and Improvement: In continuing tasks, policies are evaluated based on
their long-term expected reward. Techniques like the Bellman equation and dynamic
programming are used to compute the value function, which represents the expected
cumulative rewards from a given state under a policy. Policies are improved iteratively to
maximize this value function.
Mathematical Formulation:
In the mathematical representation of continuing tasks within MDPs, similar components to
episodic tasks are present:
State Space (S): The set of all possible states in the environment.
Action Space (A): The set of all possible actions that the agent can take.
Reward Function (R): The reward function still maps state-action pairs to immediate rewards,
but these rewards are discounted based on the γ factor to account for the future.
Transition Probability (P): The transition function specifies the probability of transitioning
from one state to another when taking a particular action. In continuing tasks, there is no reset
to an initial state at the end of an episode.
Policy (π): The agent's strategy or policy, which defines the mapping from states to actions.
The objective is to find the optimal policy that maximizes the expected cumulative rewards
over the infinite time horizon.
Challenges and Considerations:
In continuing tasks, the agent must balance short-term rewards with long-term considerations,
as the impact of actions can reverberate through an infinite time horizon.
Convergence of algorithms and value functions in continuing tasks can be more challenging
due to the infinite time horizon. Techniques like discounting and the use of the discount factor
γ are essential for ensuring convergence.
Exploration strategies are crucial to continue exploring the environment and potentially
discover more rewarding actions, even as the agent accumulates knowledge over time.
Continuing tasks in MDPs represent problems where the agent interacts with the environment
indefinitely, without predefined episodes. The agent's objective is to maximize the expected
cumulative reward over an infinite time horizon, considering the discount factor γ. These tasks
are well suited for modelling scenarios where the agent's actions have ongoing consequences,
and the objective is to make decisions that balance short-term and long-term rewards.

2.6 Bellman’s optimality operator:


Bellman's optimality operator, often denoted as "T" or "𝓣," is a fundamental concept in
reinforcement learning and dynamic programming. It plays a central role in finding the optimal
policy and value function in Markov Decision Processes (MDPs). This operator is essential for
solving reinforcement learning problems, such as finding the best strategy for an agent in a
given environment. Here is a detailed explanation of Bellman's optimality operator:
Definition: Bellman's optimality operator, represented as "T" or "𝓣," is a mathematical
operator used to define and update the value function associated with an MDP. The value
function, denoted as "V(s)," assigns a value to each state in the MDP, indicating the expected
cumulative reward an agent can achieve when starting from that state and following a particular
policy.
Mathematically, the Bellman operator associated with a given policy π (often written T^π and
used for policy evaluation) is defined as follows:

T^π(V)(s) = Σ [a] π(a|s) Σ [s', r] p(s', r | s, a) * [r + γ * V(s')]

where:
- T^π(V)(s) is the updated value of state s under policy π after applying the Bellman operator.
- π(a|s) is the probability of taking action a in state s according to policy π.
- p(s', r | s, a) is the probability of transitioning to state s' and receiving reward r when taking
action a in state s.
- r is the immediate reward received when transitioning from state s to state s' by taking action a.
- γ is the discount factor, a value between 0 and 1 that represents the agent's preference for
immediate rewards over future rewards.
- V(s') is the current value estimate for the next state s'.
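Under the assumption that the policy, transition, and reward tables are stored as dictionaries (the same illustrative layout as the earlier sketches), one sweep of this operator over all states could be written as follows.

```python
def bellman_policy_operator(V, policy, states, actions, P, R, gamma):
    """One sweep of the Bellman operator T^pi over all states.

    policy[s] is a dict {action: probability}; P[s][a] is {sp: probability};
    R[(s, a, sp)] is the immediate reward. All container layouts are
    illustrative assumptions, not prescribed by the notes.
    """
    new_V = {}
    for s in states:
        value = 0.0
        for a in actions:
            pi_a = policy[s].get(a, 0.0)
            if pi_a == 0.0:
                continue
            value += pi_a * sum(prob * (R[(s, a, sp)] + gamma * V[sp])
                                for sp, prob in P[s][a].items())
        new_V[s] = value
    return new_V
```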

Role in Finding the Optimal Policy:

Bellman's optimality operator is used iteratively to find the optimal policy and value
function in an MDP. The optimal value function, denoted as V*(s), represents the maximum
expected cumulative reward achievable from each state when following the optimal policy.

The Bellman Optimality Equation, which uses Bellman's optimality operator, defines the
optimal value function V*(s) as follows:

V*(s) = max [a] Σ [s', r] p(s', r | s, a) * [r + γ * V*(s')]

In this equation:

- The operator max over a selects the action that maximizes the expected sum of rewards over all
possible actions.

- The right-hand side of the equation computes the expected cumulative reward for each action
a and selects the action that yields the maximum value.

Algorithmic Usage:

To find the optimal policy and value function in an MDP, the Bellman optimality operator is
applied iteratively, as in the Value Iteration algorithm, until the value function converges to
the optimal value function, V*(s). Once V*(s) is found, the optimal policy can be derived by
selecting actions that maximize the right-hand side of the Bellman Optimality Equation for
each state.

Bellman's optimality operator is a key concept in reinforcement learning, providing a
principled way to compute the optimal policy and value function in MDPs. It forms the
foundation for many reinforcement learning algorithms, such as Value Iteration and
Q-Learning, which aim to find the best strategy for an agent in uncertain environments.
2.7 Value Iteration and Policy Iteration
Value Iteration and Policy Iteration are two fundamental algorithms used in reinforcement
learning and dynamic programming to solve Markov Decision Processes (MDPs) and find the
optimal policy. Both methods are iterative and aim to determine the best policy for an agent in a
given environment. Here's a detailed explanation of each:

1. Value Iteration:

Overview: Value Iteration is an iterative algorithm used to find the optimal value function and,
consequently, the optimal policy for an MDP. It's based on the principle of iteratively improving
the value estimates for each state until convergence.

Algorithm: The core idea behind Value Iteration is to iteratively update the value function using
Bellman's Optimality Equation until it converges to the optimal value function. Here is the step-
by-step process:

1. Initialization: Start with an initial estimate of the value function, denoted as V_0(s), for all states s
in the MDP.
2. Iteration:
- For each state s in the state space, compute the new value estimate V_{k+1}(s) using the Bellman
Optimality Equation (k is the iteration number):
V_{k+1}(s) = max [a] Σ [s', r] p(s', r | s, a) * [r + γ * V_k(s')]
3. Convergence Check: Repeat step 2 until the change in the value function between iterations is
below a predefined threshold, indicating convergence.
4. Policy Extraction: Once the value function has converged, derive the optimal policy by selecting
actions that maximize the right-hand side of the Bellman Optimality Equation for each state:
π*(s) = argmax [a] Σ [s', r] p(s', r | s, a) * [r + γ * V*(s')]
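A compact sketch of these four steps, using the same illustrative dictionary layout as the earlier examples, might look like this; the convergence threshold theta is an arbitrary choice.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Value Iteration: repeat Bellman optimality backups until the largest
    change across states drops below theta, then extract a greedy policy.

    P[s][a] is {sp: probability}; R[(s, a, sp)] is the immediate reward.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(sum(prob * (R[(s, a, sp)] + gamma * V[sp])
                            for sp, prob in P[s][a].items())
                        for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                       # convergence check
            break
    policy = {}
    for s in states:                            # greedy policy extraction
        policy[s] = max(actions,
                        key=lambda a: sum(prob * (R[(s, a, sp)] + gamma * V[sp])
                                          for sp, prob in P[s][a].items()))
    return V, policy
```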

Advantages:

- Value Iteration is guaranteed to converge to the optimal policy in a finite number of iterations for
finite MDPs.
- It is conceptually simple and often easy to implement.

Disadvantages:

- It can be computationally expensive for large state spaces because it requires iterating over all
states in each iteration.

2. Policy Iteration:

Overview: Policy Iteration is another iterative algorithm for finding the optimal policy in MDPs. It
alternates between two main steps: policy evaluation and policy improvement.

Algorithm: Policy Iteration proceeds as follows:

1. Initialization: Start with an initial policy π for all states s in the MDP.
2. Policy Evaluation:
- Compute the value function V^π for the current policy π. This can be done using iterative
methods based on Bellman's equation or by solving a system of linear equations.
3. Policy Improvement:
- For each state s, compute the action a that maximizes the expected sum of rewards when
starting from state s and thereafter following policy π:
π'(s) = argmax [a] Σ [s', r] p(s', r | s, a) * [r + γ * V^π(s')]
- Update the policy: π ← π'.
4. Convergence Check: Repeat steps 2 and 3 until the policy no longer changes (i.e., it converges
to the optimal policy).
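A corresponding sketch of Policy Iteration is given below; for simplicity the evaluation step uses a fixed number of iterative sweeps rather than solving the linear equations exactly, which is an implementation choice for this example, not part of the notes.

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=200):
    """Policy Iteration: alternate policy evaluation and greedy improvement.

    P and R follow the same illustrative dict layout used in the earlier
    sketches: P[s][a] is {sp: probability}, R[(s, a, sp)] is the reward.
    """
    policy = {s: actions[0] for s in states}    # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: estimate V^pi for the current policy.
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = sum(prob * (R[(s, a, sp)] + gamma * V[sp])
                           for sp, prob in P[s][a].items())
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best_a = max(actions,
                         key=lambda a: sum(prob * (R[(s, a, sp)] + gamma * V[sp])
                                           for sp, prob in P[s][a].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                              # policy no longer changes
            return V, policy
```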

Advantages:

- Policy Iteration can converge faster than Value Iteration in certain cases, especially when the
initial policy is relatively good.
- It often requires fewer iterations to reach the optimal policy compared to Value Iteration.

Disadvantages:

- With exact policy evaluation it converges to the optimal policy for finite MDPs, but approximate
evaluation can leave it at a suboptimal policy.
- Like Value Iteration, it may be computationally expensive for large state spaces because it
involves policy evaluation, which can require solving linear equations or many iterative sweeps.

Comparison:

- Value Iteration focuses on directly finding the optimal value function and then deriving the
optimal policy from it.

- Policy Iteration alternates between policy evaluation and policy improvement steps, which can
converge faster if the initial policy is reasonable.

In practice, the choice between Value Iteration and Policy Iteration depends on the specific
problem and computational resources available. Both methods are widely used in reinforcement
learning and dynamic programming and serve as fundamental building blocks for more
advanced algorithms.
