UNIT-II
2.1 Markov Decision Problem:
A Markov Decision Problem (MDP) is a mathematical framework used in the field of
reinforcement learning and decision-making under uncertainty. It provides a formal way to
model situations where an agent interacts with an environment and must make a sequence of
decisions to maximize a cumulative reward. Let us delve into the details of MDP:
Components of a Markov Decision Problem:
State Space (S): This is a finite set of all possible situations or states the agent can be in. States
represent the information needed to make decisions.
Action Space (A): This is a finite set of all possible actions the agent can take. Actions are the
decisions or choices made by the agent.
Transition Probabilities (P): These represent the probability of transitioning from one state
to another after taking a particular action. It defines the dynamics of the environment.
Reward Function (R): This function assigns a numerical reward to each state-action pair. It
quantifies the immediate benefit or cost of taking a specific action in a particular state.
Policy (π): A policy is a strategy that specifies which action to take in each state. It defines the
agent's behaviour and can be deterministic or stochastic.
Value Function (V): The value function, denoted as V(s), measures the expected cumulative
reward an agent can achieve starting from a particular state while following a given policy π.
It helps evaluate the desirability of states.
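To make these components concrete, the sketch below writes a hypothetical two-state MDP as plain Python data structures; the state names, actions, probabilities, and rewards are invented purely for illustration.

# A hypothetical two-state MDP, written as plain Python dictionaries.
states = ["s0", "s1"]          # state space S
actions = ["stay", "move"]     # action space A
gamma = 0.9                    # discount factor

# Transition probabilities: P[s][a] -> {next_state: probability}
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.8, "s0": 0.2}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}

# Reward function: R[s][a] -> immediate reward for the state-action pair
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

# A deterministic policy: one action chosen per state
policy = {"s0": "move", "s1": "stay"}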
Markov Property:
One of the key assumptions in an MDP is the Markov property, which means that the future
state and reward only depend on the current state and action, not on the history of states and
actions that led to the current state. This property simplifies modeling and computation.
Objective:
The goal in an MDP is typically to find an optimal policy, denoted as π*, that maximizes the
expected cumulative reward over time. This optimal policy guides the agent's decision-making
to achieve the highest possible long-term reward.
Solving MDPs:
There are various algorithms for solving MDPs, including:
Value Iteration: An iterative method that computes the optimal value function and policy.
Policy Iteration: An algorithm that alternates between policy evaluation and policy
improvement until convergence to an optimal policy.
Q-Learning: A model-free reinforcement learning algorithm that learns the optimal action-
value function (Q-function) directly through exploration.
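As a brief illustration of the last point, the tabular Q-learning update rule can be sketched in a few lines of Python. The table Q (indexed by state-action pairs), the step size alpha, and the sampled transition (s, a, r, s_next) are assumed inputs and are not tied to any particular environment.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for an observed transition (s, a, r, s_next)."""
    # Best value currently estimated for the next state
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Temporal-difference target and error
    td_target = r + gamma * best_next
    td_error = td_target - Q[(s, a)]
    # Move the current estimate a small step towards the target
    Q[(s, a)] += alpha * td_error
    return Q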
Applications:
MDPs are used in a wide range of applications, including robotics, game playing, autonomous
decision-making systems, finance, healthcare, and more. They are especially relevant in
scenarios where an agent must make sequential decisions under uncertainty.
Extensions:
MDPs can be extended to handle continuous state and action spaces (Continuous MDPs),
partially observable environments (Partially Observable MDPs or POMDPs), and situations
with multiple agents (Multi-Agent MDPs).
Trade-Offs:
In practice, finding the optimal policy for large-scale MDPs can be computationally expensive.
Approximation techniques, such as function approximation and deep reinforcement learning,
are often used to address this challenge.
A Markov Decision Problem provides a formal framework for modelling decision-making in
environments with uncertainty. It involves defining states, actions, transition probabilities,
rewards, and policies, with the goal of finding the optimal policy that maximizes the agent's
expected cumulative reward. MDPs are fundamental to the field of reinforcement learning and
have numerous real-world applications.
A policy determines how the agent behaves and, therefore, how it accumulates rewards over time. Policies can be deterministic or stochastic and are key to balancing exploration and exploitation in the agent's decision-making process.
For a given policy π, the value function satisfies the Bellman expectation equation:
Vπ(s) = Σ[a in A] π(a | s) * Σ[sp in S] P(sp | s, a) * [R(s, a, sp) + γ * Vπ(sp)]
where:
'P(sp | s, a)' is the probability of transitioning to state 'sp' when taking action 'a' in state 's'.
'R(s, a, sp)' is the immediate reward received after taking action 'a' in state 's' and transitioning to state 'sp'.
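For concreteness, the right-hand side of this equation can be evaluated for a single state as below. The sketch assumes tabular dictionaries: a stochastic policy pi[s] -> {action: probability}, transitions P[s][a] -> {next_state: probability}, and rewards R[s][a][s_next] holding R(s, a, sp); all names are illustrative only.

def bellman_expectation_backup(s, V, pi, P, R, gamma=0.9):
    """One-step Bellman expectation backup for state s under policy pi."""
    value = 0.0
    for a, action_prob in pi[s].items():
        for s_next, trans_prob in P[s][a].items():
            # expected immediate reward plus discounted value of the successor state
            value += action_prob * trans_prob * (R[s][a][s_next] + gamma * V[s_next])
    return value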
Methods for Solving the Value Function:
There are several methods to compute or approximate the value function:
Dynamic Programming: Algorithms like Value Iteration and Policy Iteration iteratively
update the value function until convergence.
Monte Carlo Methods: These methods estimate the value function by sampling sequences of
states and rewards to approximate expected values.
Temporal Difference Learning: Algorithms like Q-learning and SARSA use temporal
differences to update value estimates based on observed rewards and transitions.
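As a small illustration of the temporal-difference idea, the tabular TD(0) update used for policy evaluation is sketched below; V is a value table, (s, r, s_next) a single observed transition, and alpha a step-size parameter (the names are illustrative).

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update of the value estimate for state s."""
    td_target = r + gamma * V[s_next]      # bootstrap from the next state's current estimate
    V[s] += alpha * (td_target - V[s])     # move the estimate towards the target
    return V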
Applications:
Value functions are widely used in applications such as reinforcement learning, robotics, game
playing, autonomous systems, finance, and optimization, where long-term decision-making is
critical.
Discount Factor (γ):
The choice of the discount factor 'γ' affects how much importance is given to immediate
rewards versus future rewards. Higher values of 'γ' prioritize long-term rewards.
Optimal Value Function (V*):
The optimal value function, denoted as V*, represents the maximum expected cumulative
reward achievable under an optimal policy. It satisfies the Bellman optimality equation:
V*(s) = max [a in A] Σ[sp in S] P(sp | s, a) * [R(s, a, sp) + γ * V*(sp)]
The value function in a Markov Decision Problem quantifies the expected cumulative reward
an agent can achieve by following a specific policy while considering the trade-off between
immediate and future rewards. It plays a central role in policy evaluation, policy improvement,
policy comparison, and the selection of an optimal policy in the context of decision-making
under uncertainty.
2.4.1 Infinite Discounted Reward:
The infinite discounted reward model frames the agent's objective as maximizing the expected discounted sum of rewards, summarizing the agent's behaviour over an infinite time horizon while taking into account the uncertainty associated with future rewards. Let us delve into the details of the infinite discounted reward model in the context of an MDP:
Markov Decision Process (MDP):
An MDP is a mathematical framework used to model sequential decision-making problems.
It consists of:
A set of states (S): These represent all possible situations or configurations of the environment.
A set of actions (A): These are the choices available to the agent to interact with the
environment.
Transition probabilities (P): These describe the likelihood of transitioning from one state to
another when taking a particular action.
Reward function (R): This defines the immediate reward received by the agent when it takes
a specific action in a given state.
Discount factor (γ): A value between 0 and 1 that determines the importance of future rewards
relative to immediate rewards.
Infinite Discounted Reward Objective:
In the infinite discounted reward model, the agent aims to maximize the expected sum of
rewards over an infinite time horizon.
At each time step t, the agent receives a reward R(t) when transitioning from one state to
another by taking a particular action.
The agent discounts these rewards over time by multiplying each reward by the discount factor
γ (0 ≤ γ < 1). This discounting represents the agent's preference for immediate rewards over
future rewards due to uncertainty and the desire for stability in decision-making.
The objective is to maximize the expected cumulative discounted reward:
G = Σ[γ^t * R(t)], where t = 0, 1, 2, ...
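For intuition, if the agent receives a constant reward r at every time step, this sum is a geometric series: G = r + γ*r + γ^2*r + ... = r / (1 - γ). With r = 1 and γ = 0.9, for example, G = 10, whereas with γ = 0.5 the same reward stream is worth only G = 2.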
Optimal Policy:
The agent's goal is to find an optimal policy, which is a mapping from states to actions that
maximizes the expected cumulative discounted reward.
The optimal policy, denoted as π*, should yield the highest expected sum of rewards over an
infinite time horizon.
Balancing Immediate and Future Rewards:
The discount factor γ plays a critical role in the agent's decision-making.
A smaller γ places more emphasis on immediate rewards, making the agent act more
myopically.
A larger γ gives more weight to future rewards, encouraging the agent to consider long-term consequences.
The choice of γ depends on the nature of the problem. For example, in some scenarios, it's more
important to optimize immediate outcomes, while in others, long-term planning is necessary.
Computational Challenges:
Solving for the optimal policy in the infinite discounted reward model can be computationally
challenging, especially for large MDPs.
Common algorithms for solving MDPs include value iteration, policy iteration, and various
forms of reinforcement learning algorithms like Q-learning and deep reinforcement learning.
In summary, the infinite discounted reward model in an MDP frames the agent's objective as
maximizing the expected cumulative reward over an infinite time horizon while accounting for
the uncertainty of future rewards through discounting. The choice of the discount factor γ
influences the balance between immediate and future rewards and guides the agent's decision-
making process in the face of uncertainty.
2.4.2 Total Reward:
In some scenarios, it is natural to consider a finite time horizon H. The agent aims to maximize
the expected cumulative reward over this fixed horizon.
The objective is to maximize the expected cumulative reward within this horizon:
G = Σ[R(t)], where t = 0, 1, 2, ..., H - 1.
Total reward models are suitable for episodic tasks where there's a natural end after a certain
number of time steps or episodes.
Finite Horizon Reward:
Similar to total rewards, finite horizon rewards focus on optimizing behavior within a fixed
time horizon H.
The agent aims to maximize the expected cumulative reward over the time horizon:
G = Σ[R(t)], where t = 0, 1, 2, ..., H - 1.
This model is used when there's a predetermined limit to how long the agent can operate
effectively.
Average Reward:
Average reward models are concerned with maximizing the expected long-term average reward
in an MDP.
They are typically applied in scenarios where the MDP is ergodic, meaning it eventually
reaches a steady-state distribution.
The objective is to maximize the expected average reward:
G = lim (T -> ∞) [1/T * Σ[R(t)]], where t = 0, 1, 2, ..., T - 1.
This model assumes that as the agent interacts with the environment for a long time, the average
reward converges to a stable value. It is commonly used in continuous control tasks.
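To make the three objectives concrete, the short sketch below computes the discounted, total (finite-horizon), and average returns for a hypothetical sequence of rewards; the function names and numbers are illustrative only.

def discounted_return(rewards, gamma):
    """Sum of gamma^t * R(t) over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def total_return(rewards):
    """Undiscounted sum of rewards over a finite horizon."""
    return sum(rewards)

def average_reward(rewards):
    """Mean reward per time step, approximating the long-run average."""
    return sum(rewards) / len(rewards)

# Hypothetical reward sequence over a horizon of H = 5 steps
rewards = [1.0, 0.0, 2.0, 0.0, 3.0]
print(discounted_return(rewards, gamma=0.9))   # earlier rewards weigh more
print(total_return(rewards))                   # 6.0
print(average_reward(rewards))                 # 1.2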
The choice of reward model in an MDP depends on the specific problem and the agent's
objectives. Infinite discounted rewards emphasize balancing short-term and long-term rewards,
total and finite horizon rewards are suited for tasks with fixed horizons or episodic nature, and
average reward models are useful when considering long-term average performance.
Designing appropriate reward functions is a critical aspect of specifying an MDP and
influencing the agent's learning and decision-making process within that environment.
Episodic Tasks:
Episodes: In episodic tasks, the agent interacts with the environment in a series of episodes.
Each episode represents a self-contained task with its own unique start and finish. When an
episode ends, the environment typically resets to some initial state or condition, and a new
episode begins.
Termination Condition: Episodic tasks are defined by a termination condition that determines
when an episode ends. This condition can take various forms, such as reaching a specific state,
a time limit, or achieving a certain goal. Once this condition is met, the current episode
concludes, and a new one starts.
Objective: The primary objective of the agent in episodic tasks is to maximize the expected
cumulative reward obtained within each episode. This means that at the start of each episode,
the agent's goal resets, and it aims to make the best sequence of decisions to earn the most
reward within that episode.
Independence: Episodes in episodic tasks are typically considered independent of each other.
The agent's experience and learning from one episode do not directly influence subsequent
episodes. This independence allows for learning from scratch in each episode.
Examples: Episodic tasks are often exemplified by games and puzzles, where each round or
game represents an episode. For instance, in a game of chess or a game of Tic-Tac-Toe, each
match is an episode with a clear beginning (start of the game) and end (win, lose, or draw).
Mathematical Formulation:
In the mathematical representation of episodic tasks within MDPs, key components include:
State Space (S): The set of all possible states that the environment can be in during an episode.
Action Space (A): The set of all possible actions that the agent can take.
Reward Function (R): A function that maps state-action pairs to immediate rewards. In
episodic tasks, rewards are typically assigned at the end of each episode, based on the outcome
of that episode.
Transition Probability (P): A function that specifies the probability of transitioning from one
state to another when taking a particular action. In episodic tasks, this transition function is
often simplified because there's a reset to an initial state at the end of each episode.
Policy (π): The agent's strategy or policy, which defines the mapping from states to actions.
The objective is to find the optimal policy that maximizes expected cumulative rewards within
an episode.
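To connect these components, the following sketch shows a generic episodic interaction loop. The environment interface (reset() -> state, step(action) -> (next_state, reward, done)) and the function name are hypothetical stand-ins for whatever simulator is actually used.

def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the cumulative reward collected in it."""
    state = env.reset()            # each episode starts from an initial state
    episode_reward = 0.0
    for _ in range(max_steps):     # the step limit acts as a termination condition
        action = policy(state)     # the policy maps states to actions
        state, reward, done = env.step(action)
        episode_reward += reward
        if done:                   # termination condition reached (goal, time limit, etc.)
            break
    return episode_reward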
Challenges and Considerations:
In episodic tasks, since each episode starts from scratch, it is possible to apply simpler methods
like Monte Carlo methods for estimating value functions and policies.
Exploration-exploitation strategies play a significant role within each episode, as the agent
must balance exploration to learn and exploitation to maximize reward within that episode.
While episodic tasks have clear endpoints, the challenge is to find the best sequence of actions
within each episode to achieve the maximum cumulative reward, considering the stochastic
nature of the environment.
Episodic tasks in MDPs involve problems where the agent operates in distinct episodes, each
with its own start and finish. The agent's goal is to maximize cumulative rewards within each
episode, and episodes are typically considered independent of each other. This framework is
well suited for modelling scenarios with clear task boundaries, such as games or puzzle-
solving.
Continuing Tasks:
In continuing tasks, the agent's interaction with the environment does not break up into episodes but continues indefinitely. Rewards are discounted by a factor γ, so the agent must make decisions that not only consider immediate rewards but also account for the expected future rewards, appropriately weighted by the discount factor.
No Natural Endpoint: Continuing tasks are used to model scenarios where there is no natural
endpoint to the agent's interactions with the environment. For example, in the case of
controlling a robot, managing a financial portfolio, or operating a recommendation system, the
agent's actions continue indefinitely.
Policy Evaluation and Improvement: In continuing tasks, policies are evaluated based on
their long-term expected reward. Techniques like the Bellman equation and dynamic
programming are used to compute the value function, which represents the expected
cumulative rewards from a given state under a policy. Policies are improved iteratively to
maximize this value function.
Mathematical Formulation:
In the mathematical representation of continuing tasks within MDPs, similar components to
episodic tasks are present:
State Space (S): The set of all possible states in the environment.
Action Space (A): The set of all possible actions that the agent can take.
Reward Function (R): The reward function still maps state-action pairs to immediate rewards,
but these rewards are discounted based on the γ factor to account for the future.
Transition Probability (P): The transition function specifies the probability of transitioning
from one state to another when taking a particular action. In continuing tasks, there is no reset
to an initial state at the end of an episode.
Policy (π): The agent's strategy or policy, which defines the mapping from states to actions.
The objective is to find the optimal policy that maximizes the expected cumulative rewards
over the infinite time horizon.
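Because a continuing task has no per-episode reset, an interaction loop for it (using the same hypothetical env interface as in the episodic sketch) might simply truncate the infinite horizon and accumulate the discounted return incrementally.

def run_continuing(env, policy, gamma=0.99, num_steps=10_000):
    """Interact for a fixed number of steps and return the discounted return
    accumulated from the starting state (a truncation of the infinite horizon)."""
    state = env.reset()              # a single starting point; no episode boundaries
    discounted_return = 0.0
    discount = 1.0                   # gamma ** t, updated incrementally
    for _ in range(num_steps):
        action = policy(state)
        state, reward, _ = env.step(action)
        discounted_return += discount * reward
        discount *= gamma            # future rewards are weighted down by gamma
    return discounted_return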
Challenges and Considerations:
In continuing tasks, the agent must balance short-term rewards with long-term considerations,
as the impact of actions can reverberate through an infinite time horizon.
Convergence of algorithms and value functions in continuing tasks can be more challenging
due to the infinite time horizon. Techniques like discounting and the use of the discount factor
γ are essential for ensuring convergence.
Exploration strategies are crucial to continue exploring the environment and potentially
discover more rewarding actions, even as the agent accumulates knowledge over time.
Continuing tasks in MDPs represent problems where the agent interacts with the environment
indefinitely, without predefined episodes. The agent's objective is to maximize the expected
cumulative reward over an infinite time horizon, considering the discount factor γ. These tasks
are well suited for modelling scenarios where the agent's actions have ongoing consequences,
and the objective is to make decisions that balance short-term and long-term rewards.
Bellman Operator:
For a fixed policy π, the Bellman operator Tπ maps a value function V to an updated value function:
Tπ(V)(s) = Σ[a in A] π(a | s) * Σ[s', r] p(s', r | s, a) * [r + γ * V(s')]
In this equation:
Tπ(V)(s) represents the updated value of state s under policy π after applying the Bellman operator.
π(a | s) is the probability of taking action a in state s according to policy π.
p(s', r | s, a) is the probability of transitioning to state s' and receiving reward r when taking action a in state s.
r is the immediate reward received when transitioning from state s to state s' by taking action a.
γ is the discount factor, a value between 0 and 1 that represents the agent's preference for immediate rewards over future rewards.
V(s') is the current value estimate for the next state s'.
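The operator can be applied to every state in one sweep. The sketch below assumes tabular dictionaries: pi[s] -> {action: probability} and p[s][a] -> list of (next_state, reward, probability) triples; the names are illustrative only.

def bellman_operator(V, pi, p, gamma):
    """Apply the Bellman operator T_pi once and return the updated value table."""
    new_V = {}
    for s in V:
        total = 0.0
        for a, action_prob in pi[s].items():
            for s_next, r, prob in p[s][a]:
                # expected immediate reward plus discounted value of the successor
                total += action_prob * prob * (r + gamma * V[s_next])
        new_V[s] = total
    return new_V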
The Bellman optimality operator is used iteratively to find the optimal policy and value function in an MDP. The optimal value function, denoted as V*(s), represents the maximum expected cumulative reward achievable from each state when following the optimal policy.
The Bellman Optimality Equation, which uses the Bellman optimality operator, defines the optimal value function V*(s) as follows:
V*(s) = max[a in A] Σ[s', r] p(s', r | s, a) * [r + γ * V*(s')]
In this equation:
The max operator over actions a selects the action that maximizes the expected sum of rewards over all possible actions.
The right-hand side of the equation computes the expected cumulative reward for each action
a and selects the action that yields the maximum value.
Algorithmic Usage:
To find the optimal policy and value function in an MDP, an algorithm called the Bellman
Optimality Operator Iteration is often used. This algorithm iteratively applies the Bellman optimality operator until the value function converges to the optimal value function, V*(s). Once V*(s) is found, the optimal policy can be derived by selecting actions that maximize the right-hand side of the Bellman Optimality Equation for each state.
1. Value Iteration:
Overview: Value Iteration is an iterative algorithm used to find the optimal value function and,
consequently, the optimal policy for an MDP. It's based on the principle of iteratively improving
the value estimates for each state until convergence.
Algorithm: The core idea behind Value Iteration is to iteratively update the value function using
Bellman's Optimality Equation until it converges to the optimal value function. Here is the step-
by-step process:
1. Initialization: Start with an initial estimate of the value function, denoted as V0(s), for all states s
in the MDP.
2. Iteration:
For each state s in the state space:
Compute the new value estimate Vk+1(s) using the Bellman Optimality Equation:
Vk+1(s) = max[a in A] Σ[s', r] p(s', r | s, a) * [r + γ * Vk(s')]
k represents the iteration number.
3. Convergence Check: Repeat step 2 until the change in the value function between iterations is
below a predefined threshold, indicating convergence.
4. Policy Extraction: Once the value function has converged, derive the optimal policy by selecting
actions that maximize the right-hand side of the Bellman Optimality Equation for each state:
π*(s) = argmax[a in A] Σ[s', r] p(s', r | s, a) * [r + γ * V*(s')]
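Putting the four steps together, a tabular Value Iteration sketch might look as follows. It reuses the hypothetical p[s][a] -> list of (next_state, reward, probability) representation from the Bellman operator sketch above.

def value_iteration(states, actions, p, gamma, theta=1e-6):
    """Return (V, pi) for a tabular MDP via Value Iteration."""
    V = {s: 0.0 for s in states}                       # step 1: initialization
    while True:                                        # step 2: sweep over all states
        delta = 0.0
        for s in states:
            q_values = [sum(prob * (r + gamma * V[s_next])
                            for s_next, r, prob in p[s][a])
                        for a in actions]
            new_v = max(q_values)                      # Bellman optimality backup
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                              # step 3: convergence check
            break
    # step 4: extract a greedy policy from the converged value function
    pi = {}
    for s in states:
        pi[s] = max(actions,
                    key=lambda a: sum(prob * (r + gamma * V[s_next])
                                      for s_next, r, prob in p[s][a]))
    return V, pi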
Advantages:
Value Iteration is guaranteed to converge to the optimal policy in a finite number of iterations for
finite MDPs.
It is conceptually simple and often easy to implement.
Disadvantages:
It can be computationally expensive for large state spaces because it requires iterating over all
states in each iteration.
2. Policy Iteration:
Overview: Policy Iteration is another iterative algorithm for finding the optimal policy in MDPs. It
alternates between two main steps: policy evaluation and policy improvement.
1. Initialization: Start with an initial policy π for all states s in the MDP.
2. Policy Evaluation:
Compute the value function Vπ for the current policy π. This can be done using iterative methods based on the Bellman expectation equation or by solving a system of linear equations.
3. Policy Improvement:
For each state s, compute the action a that maximizes the expected sum of rewards when
starting from state s and following policy π:
π'(s) = argmax[a in A] Σ[s', r] p(s', r | s, a) * [r + γ * Vπ(s')]
Update the policy: π←π′.
4. Convergence Check: Repeat steps 2 and 3 until the policy no longer changes (i.e., it converges
to the optimal policy).
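A corresponding Policy Iteration sketch for deterministic policies, using the same hypothetical p[s][a] representation as before, might be:

def policy_iteration(states, actions, p, gamma, theta=1e-6):
    """Return (V, pi) for a tabular MDP via Policy Iteration."""
    pi = {s: actions[0] for s in states}               # step 1: arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # step 2: policy evaluation under the current policy
        while True:
            delta = 0.0
            for s in states:
                new_v = sum(prob * (r + gamma * V[s_next])
                            for s_next, r, prob in p[s][pi[s]])
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < theta:
                break
        # step 3: policy improvement (greedy with respect to V)
        stable = True
        for s in states:
            best_a = max(actions,
                         key=lambda a: sum(prob * (r + gamma * V[s_next])
                                           for s_next, r, prob in p[s][a]))
            if best_a != pi[s]:
                pi[s] = best_a
                stable = False
        if stable:                                     # step 4: policy no longer changes
            break
    return V, pi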
Advantages:
Policy Iteration can converge faster than Value Iteration in certain cases, especially when the
initial policy is relatively good.
It often requires fewer iterations to reach the optimal policy compared to Value Iteration.
Disadvantages:
Convergence may require many evaluation and improvement cycles if the initial policy is far from the optimal policy.
Like Value Iteration, it may be computationally expensive for large state spaces because it
involves policy evaluation, which can require solving linear equations or iterations.
Comparison:
Value Iteration focuses on directly finding the optimal value function and then deriving the
optimal policy from it.
Policy Iteration alternates between policy evaluation and policy improvement steps, which can
converge faster if the initial policy is reasonable.
In practice, the choice between Value Iteration and Policy Iteration depends on the specific
problem and computational resources available. Both methods are widely used in reinforcement
learning and dynamic programming and serve as fundamental building blocks for more
advanced algorithms.