Unit 5 Notes

The document discusses Markov Decision Processes and Reinforcement Learning. It defines key concepts like states, actions, rewards, and policies. It also explains how to model problems as Markov Decision Processes using a grid world example. Bellman equations and Monte Carlo methods are introduced for solving MDPs.


Markov Decision Process

Reinforcement Learning:
Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. In fact, Reinforcement Learning is defined by a specific type of problem, and all of its solutions are classed as Reinforcement Learning algorithms. In this problem, an agent must decide the best action to take based on its current state. When this decision step is repeated over time, the problem is modeled as a Markov Decision Process.
A Markov Decision Process (MDP) model contains:

 A set of possible world states S.
 A set of Models (transition models).
 A set of possible actions A.
 A real-valued reward function R(s, a).
 A policy π, the solution of the Markov Decision Process.

What is a State?
A State is a set of tokens that represent every state that the agent can be in.
What is a Model?
A Model (sometimes called a Transition Model) gives an action’s effect in a state. In particular, T(S, a, S’) defines a transition T where being in state S and taking an action ‘a’ takes us to state S’ (S and S’ may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S’|S, a), which represents the probability of reaching state S’ if action ‘a’ is taken in state S. Note that the Markov property states that the effect of an action taken in a state depends only on that state and not on the prior history.
What are Actions?
An Action set A is the set of all possible actions. A(s) defines the set of actions that can be taken while in state S.
What is a Reward?
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in state S. R(S, a) indicates the reward for being in state S and taking action ‘a’. R(S, a, S’) indicates the reward for being in state S, taking action ‘a’ and ending up in state S’.
What is a Policy?
A Policy is a solution to the Markov Decision Process. A policy π is a mapping from states to actions; it indicates the action ‘a’ to be taken while in state S.
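
To make these components concrete, here is a minimal sketch (plain Python, with made-up state names, probabilities and rewards that are not taken from the text) of how an MDP's model, rewards and policy can be represented:

# Sketch of MDP components as plain Python dictionaries.
# All state/action names, probabilities and rewards below are illustrative only.

states = ["s0", "s1", "s2"]           # S: set of possible world states
actions = {"s0": ["a", "b"],          # A(s): actions available in each state
           "s1": ["a"],
           "s2": []}                  # terminal state: no actions

# Transition model P(S' | S, a): maps (state, action) to {next_state: probability}
P = {
    ("s0", "a"): {"s1": 0.8, "s0": 0.2},
    ("s0", "b"): {"s2": 1.0},
    ("s1", "a"): {"s2": 1.0},
}

# Reward function R(S, a, S')
R = {
    ("s0", "a", "s1"): -0.04,
    ("s0", "a", "s0"): -0.04,
    ("s0", "b", "s2"): -1.0,
    ("s1", "a", "s2"): +1.0,
}

# A policy: mapping from state S to an action a
policy = {"s0": "a", "s1": "a"}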
Let us take the example of a grid world:

An agent lives in the grid. The above example is a 3×4 grid. The grid has a START state (grid no 1,1). The purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a blocked grid; it acts like a wall, hence the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.

Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it stays put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such
sequences can be found:
 RIGHT RIGHT UP UP RIGHT
 UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The move is now noisy. 80% of the time the intended action works correctly. 20% of the time the action taken causes the agent to move at right angles to the intended direction. For example, if the agent chooses UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).
The agent receives a reward at each time step:
 A small reward each step (this can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire grid has a reward of -1).
 Big rewards come at the end (good or bad).
 The goal is to maximize the sum of rewards.
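
As a rough illustration (a sketch of the noisy movement and wall behaviour described above, not a full grid-world solution), the transition behaviour can be written as two small functions:

# Sketch of the noisy grid-world dynamics described above.
BLOCKED = {(2, 2)}                      # grid no 2,2 acts like a wall
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
OFFSET = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

def noisy_outcomes(intended):
    """Return (actual_move, probability) pairs: 0.8 intended, 0.1 for each right angle."""
    left, right = PERPENDICULAR[intended]
    return [(intended, 0.8), (left, 0.1), (right, 0.1)]

def step(pos, move):
    """Apply a move on the 3x4 grid (columns 1..4, rows 1..3); stay put if blocked."""
    dx, dy = OFFSET[move]
    nx, ny = pos[0] + dx, pos[1] + dy
    if not (1 <= nx <= 4 and 1 <= ny <= 3) or (nx, ny) in BLOCKED:
        return pos                      # walls block the agent's path
    return (nx, ny)

print(noisy_outcomes("UP"))             # [('UP', 0.8), ('LEFT', 0.1), ('RIGHT', 0.1)]
print(step((1, 1), "LEFT"))             # LEFT from START (1,1) keeps the agent at (1,1)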
Bellman Equation for Value Function

The Bellman Equation helps us find optimal policies and value functions. We know that our policy changes with experience, so we will have different value functions according to different policies. The optimal value function is the one which gives maximum value compared to all other value functions.

The Bellman Equation states that the value function can be decomposed into two parts:

 The immediate reward, R[t+1]
 The discounted value of the successor state, γ·v(S[t+1])

Mathematically, we can define the Bellman Equation as:

v(s) = E[ R[t+1] + γ·v(S[t+1]) | S[t] = s ]

Let’s understand what this equation says with the help of an example:

Suppose there is a robot in some state (s) and it then moves from this state to some other state (s’). Now, the question is: how good was it for the robot to be in state (s)? Using the Bellman equation, we can say that it is the expectation of the reward it got on leaving state (s) plus the value of the state (s’) it moved to.

Let’s look at another example:

Backup Diagram

We want to know the value of state s. The value of state (s) is the reward we get upon leaving that state, plus the discounted value of each state we could land in, multiplied by the transition probability that we move into it, summed over all successor states:

v(s) = R(s) + γ · ∑_{s′} P(s, s′) · v(s′)

The above equation can be expressed in matrix form as follows:

v = R + γ·P·v

where v is the vector of state values, R is the vector of immediate rewards, and P is the state-transition probability matrix: the value of each state equals its immediate reward plus the discounted values of the next states weighted by the probability of moving into them. Solving for v gives v = (I − γP)⁻¹·R.
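
As a small illustration (a sketch with made-up numbers, not tied to the grid example above), this linear system can be solved directly with NumPy:

import numpy as np

# Sketch: solve v = R + gamma * P v, i.e. (I - gamma * P) v = R.
# The 3-state transition matrix and reward vector below are made up for illustration.
gamma = 0.9
P = np.array([[0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])        # P[s, s'] = probability of moving from s to s'
R = np.array([-0.04, -0.04, 0.0])      # immediate reward for each state
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)                               # value of each state in this Markov reward process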

The running time complexity for this computation is O(n³), since it requires inverting an n × n matrix. Therefore, this is clearly not a practical solution for solving larger MRPs (and the same holds for MDPs). In later sections, we will look at more efficient methods such as Dynamic Programming (value iteration and policy iteration), Monte Carlo methods and TD learning.
Monte Carlo Policy Evaluation
We begin by considering Monte Carlo methods for learning the state-value function for a given
policy. Recall that the value of a state is the expected return--expected cumulative future
discounted reward--starting from that state. An obvious way to estimate it from experience, then, is
simply to average the returns observed after visits to that state. As more returns are observed, the
average should converge to the expected value. This idea underlies all Monte Carlo methods.

In particular, suppose we wish to estimate vπ(s), the value of a state s under policy π, given a set of episodes obtained by following π and passing through s. Each occurrence of state s in an episode is called a visit to s. The every-visit MC method estimates vπ(s) as the average of the returns following all the visits to s in a set of episodes. Within a given episode, the first time s is visited is called the first visit to s. The first-visit MC method averages just the returns following first visits to s. These two Monte Carlo methods are very similar but have slightly different theoretical properties. First-visit MC has been most widely studied, dating back to the 1940s, and is the one we focus on here; every-visit MC is revisited below. First-visit MC is shown in procedural form in Figure 5.1.

 
Figure 5.1: First-visit MC method for estimating vπ.

Both first-visit MC and every-visit MC converge to vπ(s) as the number of visits (or first visits) to s goes to infinity. This is easy to see for the case of first-visit MC. In this case each return is an independent, identically distributed estimate of vπ(s). By the law of large numbers the sequence of averages of these estimates converges to their expected value. Each average is itself an unbiased estimate, and the standard deviation of its error falls as 1/√n, where n is the number of returns averaged. Every-visit MC is less straightforward, but its estimates also converge asymptotically to vπ(s).
What is the difference between First-Visit Monte-Carlo and Every-Visit Monte-Carlo Policy
Evaluation?

The first-visit and the every-visit Monte Carlo (MC) algorithms are both used to solve the prediction problem (also called the evaluation problem), that is, the problem of estimating the value function associated with a given fixed policy, denoted by π. The policy is given as input to the algorithms and does not change during their execution. In general, even if we are given the policy π, we are not necessarily able to find the exact corresponding value function, so these two algorithms are used to estimate the value function associated with π.

Intuitively, we care about the value function associated with π because we might want or need to know "how good it is to be in a certain state" if the agent behaves in the environment according to the policy π.

For simplicity, assume that the value function is the state value function (but it could also be e.g. the state-action value function), denoted by vπ(s), where vπ(s) is the expected return (or, in other words, the expected cumulative future discounted reward) starting from state s (at some time step t) and then following (after time step t) the given policy π. Formally, vπ(s) = Eπ[Gt | St = s], where Gt = ∑_{k=0}^{∞} γ^k · R_{t+k+1} is the return (after time step t).

In the case of MC algorithms, Gt is often defined as ∑_{k=0}^{T} R_{t+k+1}, where T ∈ ℕ+ is the last time step of the episode; that is, the sum goes up to the final time step of the episode, T. This is because MC algorithms, in this context, often assume that the problem can be naturally split into episodes, and each episode proceeds in a discrete number of time steps (from t = 0 to t = T).

As I defined it here, the return, in the case of MC algorithms, is only associated with a single
episode (that is, it is the return of one episode). However, in general, the expected return can be
different from one episode to the other, but, for simplicity, we will assume that the expected return
(of all states) is the same for all episodes.
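
As a concrete illustration (a small sketch with an arbitrary reward sequence, not taken from the text), the return of a single episode can be computed like this:

# Sketch: discounted return G_t = sum_k gamma^k * R_{t+k+1} for one episode.
def episode_return(rewards, gamma=0.9):
    """rewards = [R_{t+1}, R_{t+2}, ..., R_T] observed after time step t."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(episode_return([-1, -1, -1, 10]))  # return from time step t for this episode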

To recapitulate, the first-visit and every-visit MC (prediction) algorithms are used to estimate vπ(s) for all states s ∈ S. To do that, at every episode, these two algorithms use π to behave in the environment, so as to obtain some knowledge of the environment in the form of sequences of states, actions and rewards. This knowledge is then used to estimate vπ(s). How is this knowledge used in order to estimate vπ? Let us have a look at the pseudocode of these two algorithms.


N(s) is a "counter" variable that counts the number of times we visit state s throughout the entire run of the algorithm (i.e. from episode one to num_episodes). Returns(s) is a list of the (undiscounted) returns for state s.

I think it is more useful for you to read the pseudocode (which should be easily translatable to actual code)
and understand what it does rather than explaining it with words. Anyway, the basic idea (of both
algorithms) is to generate trajectories (of states, actions and rewards) at each episode, keep track of the
returns (for each state) and number of visits (of each state), and then, at the end of all episodes, average these
returns (for all states). This average of returns should be an approximation of the expected return (which is
what we wanted to estimate).

The differences between the two algorithms are highlighted in red in that pseudocode. The part "if state St is not in the sequence S0, S1, …, St−1" means that the associated block of code will be executed only if St is not part of the sequence of states that were visited (in the episode generated with π) before time step t. In other words, that block of code will be executed only if it is the first time we encounter St in the sequence of states, actions and rewards S0, A0, R1, S1, A1, R2, …, S_{T−1}, A_{T−1}, R_T (which can collectively be called the "episode sequence"), with respect to the time step and not the way the episode sequence is processed. Note that a certain state s might appear more than once in the episode sequence: for example, S3 = s and S5 = s.

Do not get confused by the fact that, within each episode, we proceed from time step T−1 back to time step t = 0, that is, we process the episode sequence backwards. We do that only to more conveniently compute the returns (the returns are computed iteratively as G ← γ·G + R_{t+1}).
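
Since the pseudocode figures themselves are not reproduced here, the following is a rough Python sketch of first-visit MC prediction along the lines described above (generate_episode is a hypothetical function that returns one episode of (state, reward) pairs obtained by following π):

from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=0.9):
    """Rough sketch of first-visit MC prediction for a fixed policy pi.

    generate_episode() is assumed to return a list of (state, reward) pairs
    [(S0, R1), (S1, R2), ..., (S_{T-1}, R_T)] obtained by following pi.
    """
    N = defaultdict(int)             # N(s): number of first visits to s
    returns_sum = defaultdict(float)
    V = defaultdict(float)           # estimated v_pi(s)

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Process the episode backwards: G <- gamma * G + R_{t+1}
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: use this return only if s does not occur earlier in the episode
            if s not in states[:t]:
                N[s] += 1
                returns_sum[s] += G
                V[s] = returns_sum[s] / N[s]
    return V

The every-visit variant would simply drop the "if s not in states[:t]" check and average the returns of all visits.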
What is the difference between value iteration and policy iteration?
The basic difference is:

In Policy Iteration, you start with a random policy and find the value function corresponding to it, then find a new (improved) policy based on the previous value function, and so on; this leads to the optimal policy.

In Value Iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function; the optimal policy is then derived from that optimal value function.

Policy iteration works on the principle of "policy evaluation -> policy improvement".

Value iteration works on the principle of "optimal value function -> optimal policy".

Let's look at them side by side; the key parts for comparison are highlighted in the corresponding figures from Sutton and Barto's book, Reinforcement Learning: An Introduction (not reproduced here).

Key points:

1. Policy iteration includes: policy evaluation + policy improvement, and the two are
repeated iteratively until the policy converges.
2. Value iteration includes: finding the optimal value function + one policy extraction. There
is no repetition of the two, because once the value function is optimal, the policy extracted
from it should also be optimal (i.e. converged).
3. Finding the optimal value function can also be seen as a combination of policy improvement
(due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep
of all states, regardless of convergence).
4. The algorithms for policy evaluation and for finding the optimal value function are highly
similar except for a max operation (as highlighted).
5. Similarly, the key steps of policy improvement and policy extraction are identical, except
that the former involves a stability check.
Value iteration (VI) is the result of directly applying the optimal Bellman operator to the value
function in a recursive manner, so that it converges to the optimal value. Then, we get the optimal
policy as the one that is greedy with respect to the optimal value function for every state.

On the other hand, policy iteration (PI) performs two steps at each iteration of the main outer loop:

1. Policy evaluation: apply the Bellman operator for the current best policy (i.e., we average
over the policy distribution, rather than only getting the greedy action) in a recursive
manner until convergence to the value function for such policy.
2. Policy improvement: then, improve the current policy by taking the action that maximizes
the value function for every state.
3. In short, you can think of value iteration as a special case of policy iteration: whereas
each policy iteration might involve several policy evaluations, in value iteration the
evaluation is stopped after only one sweep.
4. See the pseudocode of the two algorithms (adapted from Sutton's RL book, 2nd edition, not
reproduced here); the highlighted evaluation loop in the policy-iteration pseudocode is the
part that leads to possibly multiple policy evaluations. A rough sketch of value iteration
with policy extraction is given below.
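
To make the comparison concrete, here is a rough sketch of value iteration followed by greedy policy extraction, under assumed inputs (a list of states, an actions(s) function, a transition function P(s, a) returning (probability, next_state, reward) triples, and a discount factor gamma; these names are placeholders, not taken from the text above):

def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """Sketch of value iteration: repeatedly apply the optimal Bellman operator.

    P(s, a) is assumed to return a list of (prob, next_state, reward) triples.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions(s):          # terminal state: value stays 0
                continue
            # One sweep: V(s) <- max_a sum_{s',r} p(s',r|s,a) * (r + gamma * V(s'))
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P(s, a))
                       for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Policy extraction: act greedily with respect to the (near-)optimal value function
    policy = {}
    for s in states:
        if actions(s):
            policy[s] = max(actions(s),
                            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P(s, a)))
    return V, policy

Policy iteration, by contrast, would run the evaluation loop to convergence for the current fixed policy, then perform a separate improvement step, and repeat until the policy is stable.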
What is Q-Learning ? Mathematics behind Q-Learning

Ans. Q-Learning estimates the value function through experimentation in the environment (it needs a huge number of samples to learn, and is infeasible in environments in which choosing the wrong action has catastrophic consequences).

Q-Learning — a simplistic overview


Let’s say that a robot has to cross a maze and reach the end point. There are mines, and the robot
can only move one tile at a time. If the robot steps onto a mine, the robot is dead. The robot has to
reach the end point in the shortest time possible.
The scoring/reward system is as below:
1. The robot loses 1 point at each step. This is done so that the robot takes the shortest path and reaches
the goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.
Now, the obvious question is: How do we train a robot to reach the end goal with the shortest
path without stepping on a mine?

So, how do we solve this?


Introducing the Q-Table
A Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected future reward for each action at each state. Basically, this table will guide us to the best action at each state.
There are four possible actions at each non-edge tile: when the robot is at a state, it can move up, down, right or left.
So, let’s model this environment in our Q-Table.
In the Q-Table, the columns are the actions and the rows are the states.

Each Q-table score will be the maximum expected future reward that the robot will get if it takes
that action at that state. This is an iterative process, as we need to improve the Q-Table at each
iteration.
But the questions are:
 How do we calculate the values of the Q-table?
 Are the values available or predefined?
To learn each value of the Q-table, we use the Q-Learning algorithm.

Mathematics: the Q-Learning algorithm


Q-function
The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). It gives the expected cumulative discounted reward for taking action a in state s:

Q(s, a) = E[ R[t+1] + γ·R[t+2] + γ²·R[t+3] + … | s, a ]

Using the above function, we get the values of Q for the cells in the table.
When we start, all the values in the Q-table are zeros.
There is an iterative process of updating the values. As we start to explore the environment, the Q-
function gives us better and better approximations by continuously updating the Q-values in the
table.
Now, let’s understand how the updating takes place.
Introducing the Q-learning algorithm process

Each of the colored boxes is one step. Let’s understand each of these steps in detail.
Step 1: initialize the Q-Table
We will first build a Q-table. There are n columns, where n= number of actions. There are m rows,
where m= number of states. We will initialise the values at 0.

In our robot example, we have four actions (a=4) and five states (s=5). So we will build a table with
four columns and five rows.
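
For instance (a tiny sketch using NumPy), the initial Q-table for this example could be created as:

import numpy as np

# Q-table with one row per state (5) and one column per action (4), initialised to zero.
Q = np.zeros((5, 4))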
Steps 2 and 3: choose and perform an action
This combination of steps is done for an undefined amount of time. This means that this step runs
until the time we stop the training, or the training loop stops as defined in the code.
We will choose an action (a) in the state (s) based on the Q-Table. But, as mentioned earlier, when
the episode initially starts, every Q-value is 0.
So now the concept of the exploration/exploitation trade-off comes into play.
We’ll use something called the epsilon greedy strategy.
In the beginning, the epsilon rates will be higher. The robot will explore the environment and
randomly choose actions. The logic behind this is that the robot does not know anything about the
environment.
As the robot explores the environment, the epsilon rate decreases and the robot starts to exploit the
environment.
During the process of exploration, the robot progressively becomes more confident in estimating
the Q-values.
For the robot example, there are four actions to choose from: up, down, left, and right. We are
starting the training now — our robot knows nothing about the environment. So the robot chooses a
random action, say right.

We can now update the Q-values for being at the start and moving right using the Bellman
equation.
Steps 4 and 5: evaluate
Now we have taken an action and observed an outcome and a reward. We need to update the function Q(s, a). The Q-learning update rule is:

Q(s, a) ← Q(s, a) + α · [ R(s, a) + γ · max_a′ Q(s′, a′) − Q(s, a) ]

where α is the learning rate and γ is the discount factor.
In the case of the robot game, to reiterate, the scoring/reward structure is:
 power = +1
 mine = -100
 end = +100

We will repeat this again and again until the learning is stopped. In this way the Q-Table will be
updated.
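
Putting the steps together, here is a rough sketch of the tabular Q-learning loop under the same assumptions (the env object with reset() and step() methods, and the specific constants, are hypothetical placeholders rather than part of the original example):

import random
import numpy as np

def q_learning(env, num_states, num_actions,
               episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Sketch of tabular Q-learning; env is assumed to expose reset() -> state
    and step(state, action) -> (next_state, reward, done)."""
    Q = np.zeros((num_states, num_actions))           # Step 1: initialise the Q-table
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Steps 2-3: epsilon-greedy action selection, then act
            if random.random() < epsilon:
                a = random.randrange(num_actions)      # explore
            else:
                a = int(np.argmax(Q[s]))               # exploit
            s2, r, done = env.step(s, a)
            # Steps 4-5: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
            s = s2
    return Q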
What is Q-Learning?
Q-Learning is a basic form of Reinforcement Learning which uses Q-values (also called action values) to
iteratively improve the behavior of the learning agent.
SARSA
SARSA is an acronym for State-Action-Reward-State-Action.

SARSA is an on-policy TD control method. A policy maps the action to be taken at each state; in Python, you can think of it as a dictionary with states as keys and actions as values. An on-policy control method chooses the action for each state during learning by following a certain policy (mostly the one it is evaluating itself, as in policy iteration).
Our aim is to estimate Qπ(s, a) for the current policy π and all state-action (s, a) pairs. We do this using the TD update rule applied at every time step, by letting the agent transition from one state-action pair to another state-action pair (unlike model-dependent RL techniques, where the agent transitions from one state to another state).

Q-value: you must already be familiar with the utility value of a state; the Q-value is the same, with the only difference being that it is defined over a state-action pair rather than just a state. It is a mapping between a state-action pair and a real number denoting its utility. Q-learning and SARSA are both policy control methods which work by evaluating the optimal Q-value for all state-action pairs.

If a state S is terminal (goal state or end state) then Q(S, a) = 0 ∀ a ∈ A, where A is the set of all possible actions.

Source: Sutton and Barto, Reinforcement Learning: An Introduction, Chapter 6.

The SARSA update, applied at each time step of the algorithm, is:

Q(S, A) ← Q(S, A) + α · [ R + γ · Q(S′, A′) − Q(S, A) ]

The action A′ in the above update is obtained by following the same policy (ε-greedy over the Q values), because SARSA is an on-policy method.

ε-greedy policy

The epsilon-greedy policy is this:

1. Generate a random number r ∈ [0, 1].
2. If r < ε, choose a random action (explore).
3. Else, choose the action derived from the Q values, i.e. the one which yields the maximum utility (exploit).
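For completeness, here is a rough sketch of the SARSA loop using that ε-greedy choice (as with the Q-learning sketch above, the env object with reset() and step() methods is a hypothetical placeholder):

import random
import numpy as np

def epsilon_greedy(Q, s, num_actions, epsilon):
    """Steps 1-3 above: with probability epsilon explore, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    return int(np.argmax(Q[s]))

def sarsa(env, num_states, num_actions,
          episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Sketch of SARSA: on-policy TD control over state-action pairs."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, num_actions, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(s, a)
            # The next action A' is chosen with the same epsilon-greedy policy (on-policy)
            a2 = epsilon_greedy(Q, s2, num_actions, epsilon)
            # SARSA update: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
            s, a = s2, a2
    return Q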
What is Model-Based Reinforcement
Learning?
Model-based Reinforcement Learning refers to learning optimal behavior indirectly, by learning a model of the environment through taking actions and observing the outcomes, which include the next state and the immediate reward. The model predicts the outcomes of actions and is used in lieu of, or in addition to, interaction with the environment to learn optimal policies.

As reinforcement learning is a broad field, let’s focus on one specific aspect: model-based reinforcement learning. As we’ll see, model-based RL attempts to overcome the issue of a lack of prior knowledge by enabling the agent (whether this agent happens to be a robot in the real world, an avatar in a virtual one, or just a piece of software that takes actions) to construct a functional representation of its environment.

“Model” is one of those terms that gets thrown around a lot in machine learning (and in scientific
disciplines more generally), often with a relatively vague explanation of what we mean.
Fortunately, in reinforcement learning, a model has a very specific meaning: it refers to the
different dynamic states of an environment and how these states lead to a reward.

Model-based RL entails constructing such a model. Model-free RL, conversely, forgoes this
environmental information and only concerns itself with determining what action to take given a
specific state. As a result, model-based RL tends to emphasize planning, whereas model-free RL
tends to emphasize learning (that said, a lot of learning also goes on in model-based RL). The
distinction between these two approaches can seem a bit abstract, so let’s consider a real-world
analogy.

In general, the core function of RL algorithms is to determine a policy that maximizes this long-term return,
though there are a variety of different methods and algorithms to accomplish this. And again, the major
difference between model-based and model-free RL is simply that the former incorporates a model of the
agent’s environment, specifically one that influences how the agent’s overall policy is determined.
What's the difference between model-free and model-based reinforcement
learning?

In Reinforcement Learning, the terms "model-based" and "model-free" do not refer to the use of a
neural network or other statistical learning model to predict values, or even to predict next state
(although the latter may be used as part of a model-based algorithm and be called a "model"
regardless of whether the algorithm is model-based or model-free).

Instead, the term refers strictly to whether, during learning or acting, the agent uses predictions of the environment's response. The agent can use a single prediction from the model of the next reward and next state (a sample), or it can ask the model for the expected next reward, or for the full distribution of next states and next rewards. These predictions can be provided entirely outside of the learning agent, e.g. by computer code that understands the rules of a dice or board game. Or they can be learned by the agent, in which case they will be approximate.

Just because there is a model of the environment implemented does not mean that an RL agent is "model-based". To qualify as "model-based", the learning algorithm has to explicitly reference the model:

 Algorithms that purely sample from experience such as Monte Carlo Control, SARSA, Q-
learning, Actor-Critic are "model free" RL algorithms. They rely on real samples from the
environment and never use generated predictions of next state and next reward to alter
behaviour (although they might sample from experience memory, which is close to being a
model).
 The archetypical model-based algorithms are Dynamic Programming (Policy Iteration and
Value Iteration) - these all use the model's predictions or distributions of next state and
reward in order to calculate optimal actions. Specifically in Dynamic Programming, the
model must provide state transition probabilities, and expected reward from any state,
action pair. Note this is rarely a learned model.
 Basic TD learning, using state values only, must also be model-based in order to work as a control system and pick actions. In order to pick the best action, it needs to query a model that predicts what will happen on each action, and implement a policy like

π(s) = argmax_a ∑_{s′, r} p(s′, r | s, a) · (r + v(s′))

where p(s′, r | s, a) is the probability of receiving reward r and next state s′ when taking action a in state s. That function p(s′, r | s, a) is essentially the model (a small sketch of this greedy choice is given after this list).
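
Here is that small sketch (the model(s, a) and actions(s) functions and the state-value table V are hypothetical placeholders, not from the text above):

def greedy_action(s, actions, model, V):
    """Sketch of pi(s) = argmax_a sum over (s', r) of p(s', r | s, a) * (r + v(s')).

    model(s, a) is assumed to return (prob, next_state, reward) triples and V is a
    dict of current state-value estimates. A discount factor would normally also
    multiply V[next_state] when gamma < 1.
    """
    def expected_value(a):
        return sum(p * (r + V[s2]) for p, s2, r in model(s, a))
    return max(actions(s), key=expected_value)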

The RL literature differentiates between "model" as a model of the environment for "model-based"
and "model-free" learning, and use of statistical learners, such as neural networks.

In RL, neural networks are often employed to learn and generalise value functions, such as the Q
value which predicts total return (sum of discounted rewards) given a state and action pair. Such a
trained neural network is often called a "model" in e.g. supervised learning. However, in RL
literature, you will see the term "function approximator" used for such a network to avoid
ambiguity.
