0% found this document useful (0 votes)

36 views15 pages

Reinforcement Learning and Control: CS229 Lecture Notes

This document summarizes key concepts from Andrew Ng's CS229 lecture notes on reinforcement learning and control. It introduces Markov decision processes (MDPs) as a framework for modeling reinforcement learning problems. An MDP is defined by its states, actions, transition probabilities between states, rewards, and discount factor. The goal in reinforcement learning is to find a policy that maximizes the expected discounted reward by choosing optimal actions over time. Two algorithms for solving MDPs are described: value iteration and policy iteration. Value iteration works by iteratively updating the value function to satisfy the Bellman equations until convergence.

Uploaded by

Ivan Avramov

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

36 views15 pages

Reinforcement Learning and Control: CS229 Lecture Notes

Uploaded by

Ivan Avramov

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

CS229 Lecture notes

Andrew Ng
Part XIII
Reinforcement Learning and
Control
We now begin our study of reinforcement learning and adaptive control.
In supervised learning, we saw algorithms that tried to make their outputs
mimic the labels y given in the training set. In that setting, the labels gave
an unambiguous right answer for each of the inputs x. In contrast, for
many sequential decision making and control problems, it is very dicult to
provide this type of explicit supervision to a learning algorithm. For example,
if we have just built a four-legged robot and are trying to program it to walk,
then initially we have no idea what the correct actions to take are to make
it walk, and so do not know how to provide explicit supervision for a learning
algorithm to try to mimic.
In the reinforcement learning framework, we will instead provide our al-
gorithms only a reward function, which indicates to the learning agent when
it is doing well, and when it is doing poorly. In the four-legged walking ex-
ample, the reward function might give the robot positive rewards for moving
forwards, and negative rewards for either moving backwards or falling over.
It will then be the learning algorithms job to gure out how to choose actions
over time so as to obtain large rewards.
Reinforcement learning has been successful in applications as diverse as
autonomous helicopter ight, robot legged locomotion, cell-phone network
routing, marketing strategy selection, factory control, and ecient web-page
indexing. Our study of reinforcement learning will begin with a denition of
the Markov decision processes (MDP), which provides the formalism in
which RL problems are usually posed.
1
2
1 Markov decision processes
A Markov decision process is a tuple (S, A, {P
sa
}, , R), where:
S is a set of states. (For example, in autonomous helicopter ight, S
might be the set of all possible positions and orientations of the heli-
copter.)
A is a set of actions. (For example, the set of all possible directions in
which you can push the helicopters control sticks.)
P
sa
are the state transition probabilities. For each state s S and
action a A, P
sa
is a distribution over the state space. Well say more
about this later, but briey, P
sa
gives the distribution over what states
we will transition to if we take action a in state s.
[0, 1) is called the discount factor.
R : S A R is the reward function. (Rewards are sometimes also
written as a function of a state S only, in which case we would have
R : S R).
The dynamics of an MDP proceeds as follows: We start in some state s
0
,
and get to choose some action a
0
A to take in the MDP. As a result of our
choice, the state of the MDP randomly transitions to some successor state
s
1
, drawn according to s
1
P
s
0
a
0
. Then, we get to pick another action a
1
.
As a result of this action, the state transitions again, now to some s
2
P
s
1
a
1
.
We then pick a
2
, and so on. . . . Pictorially, we can represent this process as
follows:
s
0
a
0
s
1
a
1
s
2
a
2
s
3
a
3
. . .
Upon visiting the sequence of states s
0
, s
1
, . . . with actions a
0
, a
1
, . . ., our
total payo is given by
R(s
0
, a
0
) + R(s
1
, a
1
) +
2
R(s
2
, a
2
) + .
Or, when we are writing rewards as a function of the states only, this becomes
R(s
0
) + R(s
1
) +
2
R(s
2
) + .
For most of our development, we will use the simpler state-rewards R(s),
though the generalization to state-action rewards R(s, a) oers no special
diculties.
3
Our goal in reinforcement learning is to choose actions over time so as to
maximize the expected value of the total payo:
E
_
R(s
0
) + R(s
1
) +
2
R(s
2
) +

Note that the reward at timestep t is discounted by a factor of

t
. Thus, to
make this expectation large, we would like to accrue positive rewards as soon
as possible (and postpone negative rewards as long as possible). In economic
applications where R() is the amount of money made, also has a natural
interpretation in terms of the interest rate (where a dollar today is worth
more than a dollar tomorrow).
A policy is any function : S A mapping from the states to the
actions. We say that we are executing some policy if, whenever we are
in state s, we take action a = (s). We also dene the value function for
a policy according to
V

(s) = E
_
R(s
0
) + R(s
1
) +
2
R(s
2
) +

s
0
= s, ].
V

(s) is simply the expected sum of discounted rewards upon starting in
state s, and taking actions according to .
1
Given a xed policy , its value function V

satises the Bellman equa-
tions:
V

(s) = R(s) +

S
P
s(s)
(s

)V

(s

).
This says that the expected sum of discounted rewards V

(s) for starting
in s consists of two terms: First, the immediate reward R(s) that we get
rightaway simply for starting in state s, and second, the expected sum of
future discounted rewards. Examining the second term in more detail, we
see that the summation term above can be rewritten E
s

P
s(s)
[V

(s

)]. This
is the expected sum of discounted rewards for starting in state s

, where s

is distributed according P
s(s)
, which is the distribution over where we will
end up after taking the rst action (s) in the MDP from state s. Thus, the
second term above gives the expected sum of discounted rewards obtained
after the rst step in the MDP.
Bellmans equations can be used to eciently solve for V

. Specically,
in a nite-state MDP (|S| < ), we can write down one such equation for
V

(s) for every state s. This gives us a set of |S| linear equations in |S|
variables (the unknown V

(s)s, one for each state), which can be eciently
solved for the V

(s)s.
1
This notation in which we condition on isnt technically correct because isnt a
random variable, but this is quite standard in the literature.
4
We also dene the optimal value function according to
V

(s) = max

V

(s). (1)
In other words, this is the best possible expected sum of discounted rewards
that can be attained using any policy. There is also a version of Bellmans
equations for the optimal value function:
V

(s) = R(s) + max
aA

S
P
sa
(s

)V

(s

). (2)
The rst term above is the immediate reward as before. The second term
is the maximum over all actions a of the expected future sum of discounted
rewards well get upon after action a. You should make sure you understand
this equation and see why it makes sense.
We also dene a policy

: S A as follows:

(s) = arg max

S
P
sa
(s

)V

(s

). (3)
Note that

(s) gives the action a that attains the maximum in the max
in Equation (2).
It is a fact that for every state s and every policy , we have
V

(s) = V

(s) V

(s).
The rst equality says that the V

, the value function for

, is equal to the
optimal value function V

for every state s. Further, the inequality above
says that

s value is at least a large as the value of any other other policy.

In other words,

as dened in Equation (3) is the optimal policy.

Note that

has the interesting property that it is the optimal policy

for all states s. Specically, it is not the case that if we were starting in
some state s then thered be some optimal policy for that state, and if we
were starting in some other state s

then thered be some other policy thats

optimal policy for s

. Specically, the same policy

attains the maximum

in Equation (1) for all states s. This means that we can use the same policy

no matter what the initial state of our MDP is.

2 Value iteration and policy iteration
We now describe two ecient algorithms for solving nite-state MDPs. For
now, we will consider only MDPs with nite state and action spaces (|S| <
, |A| < ).
The rst algorithm, value iteration, is as follows:
5
1. For each state s, initialize V (s) := 0.
2. Repeat until convergence {
For every state, update V (s) := R(s) + max
aA

P
sa
(s

)V (s

).
}
This algorithm can be thought of as repeatedly trying to update the esti-
mated value function using Bellman Equations (2).
There are two possible ways of performing the updates in the inner loop of
the algorithm. In the rst, we can rst compute the new values for V (s) for
every state s, and then overwrite all the old values with the new values. This
is called a synchronous update. In this case, the algorithm can be viewed as
implementing a Bellman backup operator that takes a current estimate of
the value function, and maps it to a new estimate. (See homework problem
for details.) Alternatively, we can also perform asynchronous updates.
Here, we would loop over the states (in some order), updating the values one
at a time.
Under either synchronous or asynchronous updates, it can be shown that
value iteration will cause V to converge to V

. Having found V

, we can
then use Equation (3) to nd the optimal policy.
Apart from value iteration, there is a second standard algorithm for nd-
ing an optimal policy for an MDP. The policy iteration algorithm proceeds
as follows:
1. Initialize randomly.
2. Repeat until convergence {
(a) Let V := V

.
(b) For each state s, let (s) := arg max
aA

P
sa
(s

)V (s

).
}
Thus, the inner-loop repeatedly computes the value function for the current
policy, and then updates the policy using the current value function. (The
policy found in step (b) is also called the policy that is greedy with re-
spect to V .) Note that step (a) can be done via solving Bellmans equations
as described earlier, which in the case of a xed policy, is just a set of |S|
linear equations in |S| variables.
After at most a nite number of iterations of this algorithm, V will con-
verge to V

, and will converge to

.
6
Both value iteration and policy iteration are standard algorithms for solv-
ing MDPs, and there isnt currently universal agreement over which algo-
rithm is better. For small MDPs, policy iteration is often very fast and
converges with very few iterations. However, for MDPs with large state
spaces, solving for V

explicitly would involve solving a large system of lin-
ear equations, and could be dicult. In these problems, value iteration may
be preferred. For this reason, in practice value iteration seems to be used
more often than policy iteration.
3 Learning a model for an MDP
So far, we have discussed MDPs and algorithms for MDPs assuming that the
state transition probabilities and rewards are known. In many realistic prob-
lems, we are not given state transition probabilities and rewards explicitly,
but must instead estimate them from data. (Usually, S, A and are known.)
For example, suppose that, for the inverted pendulum problem (see prob-
lem set 4), we had a number of trials in the MDP, that proceeded as follows:
s
(1)
0
a
(1)
0
s
(1)
1
a
(1)
1
s
(1)
2
a
(1)
2
s
(1)
3
a
(1)
3
. . .
s
(2)
0
a
(2)
0
s
(2)
1
a
(2)
1
s
(2)
2
a
(2)
2
s
(2)
3
a
(2)
3
. . .
. . .
Here, s
(j)
i
is the state we were at time i of trial j, and a
(j)
i
is the cor-
responding action that was taken from that state. In practice, each of the
trials above might be run until the MDP terminates (such as if the pole falls
over in the inverted pendulum problem), or it might be run for some large
but nite number of timesteps.
Given this experience in the MDP consisting of a number of trials,
we can then easily derive the maximum likelihood estimates for the state
transition probabilities:
P
sa
(s

) =
#times took we action a in state s and got to s

#times we took action a in state s

(4)
Or, if the ratio above is 0/0corresponding to the case of never having
taken action a in state s beforethe we might simply estimate P
sa
(s

) to be
1/|S|. (I.e., estimate P
sa
to be the uniform distribution over all states.)
Note that, if we gain more experience (observe more trials) in the MDP,
there is an ecient way to update our estimated state transition probabilities
7
using the new experience. Specically, if we keep around the counts for both
the numerator and denominator terms of (4), then as we observe more trials,
we can simply keep accumulating those counts. Computing the ratio of these
counts then given our estimate of P
sa
.
Using a similar procedure, if R is unknown, we can also pick our estimate
of the expected immediate reward R(s) in state s to be the average reward
observed in state s.
Having learned a model for the MDP, we can then use either value it-
eration or policy iteration to solve the MDP using the estimated transition
probabilities and rewards. For example, putting together model learning and
value iteration, here is one possible algorithm for learning in an MDP with
unknown state transition probabilities:
1. Initialize randomly.
2. Repeat {
(a) Execute in the MDP for some number of trials.
(b) Using the accumulated experience in the MDP, update our esti-
mates for P
sa
(and R, if applicable).
(c) Apply value iteration with the estimated state transition probabil-
ities and rewards to get a new estimated value function V .
(d) Update to be the greedy policy with respect to V .
}
We note that, for this particular algorithm, there is one simple optimiza-
tion that can make it run much more quickly. Specically, in the inner loop
of the algorithm where we apply value iteration, if instead of initializing value
iteration with V = 0, we initialize it with the solution found during the pre-
vious iteration of our algorithm, then that will provide value iteration with
a much better initial starting point and make it converge more quickly.
4 Continuous state MDPs
So far, weve focused our attention on MDPs with a nite number of states.
We now discuss algorithms for MDPs that may have an innite number of
states. For example, for a car, we might represent the state as (x, y, , x, y,

),
comprising its position (x, y); orientation ; velocity in the x and y directions
x and y; and angular velocity

. Hence, S = R
6
is an innite set of states,
8
because there is an innite number of possible positions and orientations
for the car.
2
Similarly, the inverted pendulum you saw in PS4 has states
(x, , x,

), where is the angle of the pole. And, a helicopter ying in 3d
space has states of the form (x, y, z, , , , x, y, z,

,

,

), where here the roll
, pitch , and yaw angles specify the 3d orientation of the helicopter.
In this section, we will consider settings where the state space is S = R
n
,
and describe ways for solving such MDPs.
4.1 Discretization
Perhaps the simplest way to solve a continuous-state MDP is to discretize
the state space, and then to use an algorithm like value iteration or policy
iteration, as described previously.
For example, if we have 2d states (s
1
, s
2
), we can use a grid to discretize
the state space:
Here, each grid cell represents a separate discrete state s. We can then ap-
proximate the continuous-state MDP via a discrete-state one (

S, A, {P
sa
}, , R),
where

S is the set of discrete states, {P
sa
} are our state transition probabil-
ities over the discrete states, and so on. We can then use value iteration or
policy iteration to solve for the V

( s) and

( s) in the discrete state MDP

(

S, A, {P
sa
}, , R). When our actual system is in some continuous-valued
state s S and we need to pick an action to execute, we compute the
corresponding discretized state s, and execute action

( s).
This discretization approach can work well for many problems. However,
there are two downsides. First, it uses a fairly naive representation for V

2
Technically, is an orientation and so the range of is better written [, ) than
R; but for our purposes, this distinction is not important.
9
(and

). Specically, it assumes that the value function is takes a constant

value over each of the discretization intervals (i.e., that the value function is
piecewise constant in each of the gridcells).
To better understand the limitations of such a representation, consider a
supervised learning problem of tting a function to this dataset:
1 2 3 4 5 6 7 8
1.5
2
2.5
3
3.5
4
4.5
5
5.5
x
y
Clearly, linear regression would do ne on this problem. However, if we
instead discretize the x-axis, and then use a representation that is piecewise
constant in each of the discretization intervals, then our t to the data would
look like this:
1 2 3 4 5 6 7 8
1.5
2
2.5
3
3.5
4
4.5
5
5.5
x
y
This piecewise constant representation just isnt a good representation for
many smooth functions. It results in little smoothing over the inputs, and no
generalization over the dierent grid cells. Using this sort of representation,
we would also need a very ne discretization (very small grid cells) to get a
good approximation.
10
A second downside of this representation is called the curse of dimen-
sionality. Suppose S = R
n
, and we discretize each of the n dimensions of the
state into k values. Then the total number of discrete states we have is k
n
.
This grows exponentially quickly in the dimension of the state space n, and
thus does not scale well to large problems. For example, with a 10d state, if
we discretize each state variable into 100 values, we would have 100
10
= 10
20
discrete states, which is far too many to represent even on a modern desktop
computer.
As a rule of thumb, discretization usually works extremely well for 1d
and 2d problems (and has the advantage of being simple and quick to im-
plement). Perhaps with a little bit of cleverness and some care in choosing
the discretization method, it often works well for problems with up to 4d
states. If youre extremely clever, and somewhat lucky, you may even get it
to work for some 6d problems. But it very rarely works for problems any
higher dimensional than that.
4.2 Value function approximation
We now describe an alternative method for nding policies in continuous-
state MDPs, in which we approximate V

directly, without resorting to dis-
cretization. This approach, caled value function approximation, has been
successfully applied to many RL problems.
4.2.1 Using a model or simulator
To develop a value function approximation algorithm, we will assume that
we have a model, or simulator, for the MDP. Informally, a simulator is
a black-box that takes as input any (continuous-valued) state s
t
and action
a
t
, and outputs a next-state s
t+1
sampled according to the state transition
probabilities P
stat
:
Therere several ways that one can get such a model. One is to use
physics simulation. For example, the simulator for the inverted pendulum
11
in PS4 was obtained by using the laws of physics to calculate what position
and orientation the cart/pole will be in at time t +1, given the current state
at time t and the action a taken, assuming that we know all the parameters
of the system such as the length of the pole, the mass of the pole, and so
on. Alternatively, one can also use an o-the-shelf physics simulation software
package which takes as input a complete physical description of a mechanical
system, the current state s
t
and action a
t
, and computes the state s
t+1
of the
system a small fraction of a second into the future.
3
An alternative way to get a model is to learn one from data collected in
the MDP. For example, suppose we execute m trials in which we repeatedly
take actions in an MDP, each trial for T timesteps. This can be done picking
actions at random, executing some specic policy, or via some other way of
choosing actions. We would then observe m state sequences like the following:
s
(1)
0
a
(1)
0
s
(1)
1
a
(1)
1
s
(1)
2
a
(1)
2

a
(1)
T1
s
(1)
T
s
(2)
0
a
(2)
0
s
(2)
1
a
(2)
1
s
(2)
2
a
(2)
2

a
(2)
T1
s
(2)
T

s
(m)
0
a
(m)
0
s
(m)
1
a
(m)
1
s
(m)
2
a
(m)
2

a
(m)
T1
s
(m)
T
We can then apply a learning algorithm to predict s
t+1
as a function of s
t
and a
t
.
For example, one may choose to learn a linear model of the form
s
t+1
= As
t
+ Ba
t
, (5)
using an algorithm similar to linear regression. Here, the parameters of the
model are the matrices A and B, and we can estimate them using the data
collected from our m trials, by picking
arg min
A,B
m

i=1
T1

t=0
_
_
_s
(i)
t+1

_
As
(i)
t
+ Ba
(i)
t
__
_
_
2
.
(This corresponds to the maximum likelihood estimate of the parameters.)
Having learned A and B, one option is to build a deterministic model,
in which given an input s
t
and a
t
, the output s
t+1
is exactly determined.
3
Open Dynamics Engine (http://www.ode.com) is one example of a free/open-source
physics simulator that can be used to simulate systems like the inverted pendulum, and
that has been a reasonably popular choice among RL researchers.
12
Specically, we always compute s
t+1
according to Equation (5). Alterna-
tively, we may also build a stochastic model, in which s
t+1
is a random
function of the inputs, by modelling it as
s
t+1
= As
t
+ Ba
t
+
t
,
where here
t
is a noise term, usually modeled as
t
N(0, ). (The covari-
ance matrix can also be estimated from data in a straightforward way.)
Here, weve written the next-state s
t+1
as a linear function of the current
state and action; but of course, non-linear functions are also possible. Specif-
ically, one can learn a model s
t+1
= A
s
(s
t
) + B
a
(a
t
), where
s
and
a
are
some non-linear feature mappings of the states and actions. Alternatively,
one can also use non-linear learning algorithms, such as locally weighted lin-
ear regression, to learn to estimate s
t+1
as a function of s
t
and a
t
. These
approaches can also be used to build either deterministic or stochastic sim-
ulators of an MDP.
4.2.2 Fitted value iteration
We now describe the tted value iteration algorithm for approximating
the value function of a continuous state MDP. In the sequel, we will assume
that the problem has a continuous state space S = R
n
, but that the action
space A is small and discrete.
4
Recall that in value iteration, we would like to perform the update
V (s) := R(s) + max
a
_
s

P
sa
(s

)V (s

)ds

(6)
= R(s) + max
a
E
s

Psa
[V (s

)] (7)
(In Section 2, we had written the value iteration update with a summation
V (s) := R(s) + max
a

P
sa
(s

)V (s

) rather than an integral over states;

the new notation reects that we are now working in continuous states rather
than discrete states.)
The main idea of tted value iteration is that we are going to approxi-
mately carry out this step, over a nite sample of states s
(1)
, . . . , s
(m)
. Specif-
ically, we will use a supervised learning algorithmlinear regression in our
4
In practice, most MDPs have much smaller action spaces than state spaces. E.g., a car
has a 6d state space, and a 2d action space (steering and velocity controls); the inverted
pendulum has a 4d state space, and a 1d action space; a helicopter has a 12d state space,
and a 4d action space. So, discretizing ths set of actions is usually less of a problem than
discretizing the state space would have been.
13
description belowto approximate the value function as a linear or non-linear
function of the states:
V (s) =
T
(s).
Here, is some appropriate feature mapping of the states.
For each state s in our nite sample of m states, tted value itera-
tion will rst compute a quantity y
(i)
, which will be our approximation
to R(s) + max
a
E
s

Psa
[V (s

)] (the right hand side of Equation 7). Then,

it will apply a supervised learning algorithm to try to get V (s) close to
R(s) + max
a
E
s

Psa
[V (s

)] (or, in other words, to try to get V (s) close to

y
(i)
).
In detail, the algorithm is as follows:
1. Randomly sample m states s
(1)
, s
(2)
, . . . s
(m)
S.
2. Initialize := 0.
3. Repeat {
For i = 1, . . . , m {
For each action a A {
Sample s

1
, . . . , s

k
P
s
(i)
a
(using a model of the MDP).
Set q(a) =
1
k

k
j=1
R(s
(i)
) + V (s

j
)
// Hence, q(a) is an estimate of R(s
(i)
)+E
s

P
s
(i)
a
[V (s

)].
}
Set y
(i)
= max
a
q(a).
// Hence, y
(i)
is an estimate of R(s
(i)
)+ max
a
E
s

P
s
(i)
a
[V (s

)].
}
// In the original value iteration algorithm (over discrete states)
// we updated the value function according to V (s
(i)
) := y
(i)
.
// In this algorithm, we want V (s
(i)
) y
(i)
, which well achieve
// using supervised learning (linear regression).
Set := arg min

1
2

m
i=1
_

T
(s
(i)
) y
(i)
_
2
}
14
Above, we had written out tted value iteration using linear regression as
the algorithm to try to make V (s
(i)
) close to y
(i)
. That step of the algorithm is
completely analogous to a standard supervised learning (regression) problem
in which we have a training set (x
(1)
, y
(1)
), (x
(2)
, y
(2)
), . . . , (x
(m)
, y
(m)
), and
want to learn a function mapping from x to y; the only dierence is that
here s plays the role of x. Even though our description above used linear
regression, clearly other regression algorithms (such as locally weighted linear
regression) can also be used.
Unlike value iteration over a discrete set of states, tted value iteration
cannot be proved to always to converge. However, in practice, it often does
converge (or approximately converge), and works well for many problems.
Note also that if we are using a deterministic simulator/model of the MDP,
then tted value iteration can be simplied by setting k = 1 in the algorithm.
This is because the expectation in Equation (7) becomes an expectation over
a deterministic distribution, and so a single example is sucient to exactly
compute that expectation. Otherwise, in the algorithm above, we had to
draw k samples, and average to try to approximate that expectation (see the
denition of q(a), in the algorithm pseudo-code).
Finally, tted value iteration outputs V , which is an approximation to
V

. This implicitly denes our policy. Specically, when our system is in
some state s, and we need to choose an action, we would like to choose the
action
arg max
a
E
s

Psa
[V (s

)] (8)
The process for computing/approximating this is similar to the inner-loop of
tted value iteration, where for each action, we sample s

1
, . . . , s

k
P
sa
to
approximate the expectation. (And again, if the simulator is deterministic,
we can set k = 1.)
In practice, therere often other ways to approximate this step as well.
For example, one very common case is if the simulator is of the form s
t+1
=
f(s
t
, a
t
) +
t
, where f is some determinstic function of the states (such as
f(s
t
, a
t
) = As
t
+ Ba
t
), and is zero-mean Gaussian noise. In this case, we
can pick the action given by
arg max
a
V (f(s, a)).
In other words, here we are just setting
t
= 0 (i.e., ignoring the noise in
the simulator), and setting k = 1. Equivalent, this can be derived from
Equation (8) using the approximation
E
s
[V (s

)] V (E
s
[s

]) (9)
= V (f(s, a)), (10)
15
where here the expection is over the random s

P
sa
. So long as the noise
terms
t
are small, this will usually be a reasonable approximation.
However, for problems that dont lend themselves to such approximations,
having to sample k|A| states using the model, in order to approximate the
expectation above, can be computationally expensive.

Creating and Managing Users and Groups: This Lab Contains The Following Projects and Activities
No ratings yet
Creating and Managing Users and Groups: This Lab Contains The Following Projects and Activities
16 pages
Sierra Soft-ProSt User Guide 2009
75% (4)
Sierra Soft-ProSt User Guide 2009
718 pages
Reinforcement Learning Cheat Sheet: Return
No ratings yet
Reinforcement Learning Cheat Sheet: Return
7 pages
CS 188 Fall 2018 Written HW4 Soln
No ratings yet
CS 188 Fall 2018 Written HW4 Soln
6 pages
Process Control Report No.1
No ratings yet
Process Control Report No.1
21 pages
My Philosophy On Alerting
No ratings yet
My Philosophy On Alerting
8 pages
MovieProject Report
100% (2)
MovieProject Report
23 pages
Reinforcement Learning and Control: CS229 Lecture Notes
No ratings yet
Reinforcement Learning and Control: CS229 Lecture Notes
7 pages
cs229-notes12 Reinforcement in Control
No ratings yet
cs229-notes12 Reinforcement in Control
17 pages
CS229
No ratings yet
CS229
17 pages
DEEP RL - CONTENT BEYOND SYLLABUS
No ratings yet
DEEP RL - CONTENT BEYOND SYLLABUS
16 pages
Reinforcement Learning in A Nutshell
No ratings yet
Reinforcement Learning in A Nutshell
12 pages
Finite Markov Decision Processes-BR
No ratings yet
Finite Markov Decision Processes-BR
31 pages
02 MarkovDecisionProcess
No ratings yet
02 MarkovDecisionProcess
51 pages
Instructor (Andrew NG) :okay, Good Morning. Welcome Back. So I Hope All of You Had
No ratings yet
Instructor (Andrew NG) :okay, Good Morning. Welcome Back. So I Hope All of You Had
14 pages
کتاب هشتم بارگزاری شده
No ratings yet
کتاب هشتم بارگزاری شده
112 pages
lec12
No ratings yet
lec12
60 pages
ML Unit 4
No ratings yet
ML Unit 4
9 pages
Reinforcement Learning: Part I - Definitions
No ratings yet
Reinforcement Learning: Part I - Definitions
26 pages
2024 MDPs Part 1
No ratings yet
2024 MDPs Part 1
59 pages
20AI903_RL_UNIT 2
No ratings yet
20AI903_RL_UNIT 2
27 pages
Reinforcement learning lec12
No ratings yet
Reinforcement learning lec12
60 pages
Machine Learning
No ratings yet
Machine Learning
5 pages
Reinforcement Learning Note
No ratings yet
Reinforcement Learning Note
16 pages
242 Sheet 02 03
No ratings yet
242 Sheet 02 03
5 pages
Reinforcement Learning: Karan Kathpalia
No ratings yet
Reinforcement Learning: Karan Kathpalia
80 pages
Lecture 3 - MDPs and Dynamic Programming
No ratings yet
Lecture 3 - MDPs and Dynamic Programming
66 pages
A crash course on reinforcement learning - Felix Wagner
No ratings yet
A crash course on reinforcement learning - Felix Wagner
84 pages
Lec 09
No ratings yet
Lec 09
51 pages
Rl Lecture4
No ratings yet
Rl Lecture4
7 pages
A Brief Introduction To Reinforcement Learning
No ratings yet
A Brief Introduction To Reinforcement Learning
4 pages
Lec 04 Reinforcement Learning
No ratings yet
Lec 04 Reinforcement Learning
57 pages
Lec17-ReinforcementLearning
No ratings yet
Lec17-ReinforcementLearning
58 pages
17 - Markov Decision Processes.pptx
No ratings yet
17 - Markov Decision Processes.pptx
59 pages
mdp-cheatsheet
No ratings yet
mdp-cheatsheet
3 pages
DRL #4-5 - Introducing MDP and Dynamic Programming Solution
No ratings yet
DRL #4-5 - Introducing MDP and Dynamic Programming Solution
74 pages
An Introduction to Reinforcement Learning From theory to algorithms (December 19, 2024)_Joon Kwon
No ratings yet
An Introduction to Reinforcement Learning From theory to algorithms (December 19, 2024)_Joon Kwon
66 pages
New CZ3005 Module 4 - Markov Decision Process
No ratings yet
New CZ3005 Module 4 - Markov Decision Process
38 pages
Reinforcement-Learning-Cheatsheet
No ratings yet
Reinforcement-Learning-Cheatsheet
16 pages
Lecture_12_slides_-_after
No ratings yet
Lecture_12_slides_-_after
50 pages
AR514_MDP
No ratings yet
AR514_MDP
27 pages
5.4-Reinforcement Learning-Part1-Introduction
No ratings yet
5.4-Reinforcement Learning-Part1-Introduction
15 pages
RL 10 QUESTIONS FOR MID II Scheme of Evaluvation
No ratings yet
RL 10 QUESTIONS FOR MID II Scheme of Evaluvation
15 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
31 pages
Add-On DRL CS06
No ratings yet
Add-On DRL CS06
23 pages
Reinforcement Learning: Amulya Viswambaran (202090007) Kehkashan Fatima (202090202) Sruthi Krishnan (202090333)
No ratings yet
Reinforcement Learning: Amulya Viswambaran (202090007) Kehkashan Fatima (202090202) Sruthi Krishnan (202090333)
40 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
101 pages
DSA5102_lecture11
No ratings yet
DSA5102_lecture11
44 pages
subtitle (10)
No ratings yet
subtitle (10)
2 pages
cs229 Notes13
No ratings yet
cs229 Notes13
15 pages
L12 Markov Decision Processes
No ratings yet
L12 Markov Decision Processes
64 pages
Fundamentals of Reinforcement Learning Learning Objectives
No ratings yet
Fundamentals of Reinforcement Learning Learning Objectives
3 pages
Markov Decision Processes & Reinforcement Learning: Megan Smith Lehigh University, Fall 2006
No ratings yet
Markov Decision Processes & Reinforcement Learning: Megan Smith Lehigh University, Fall 2006
40 pages
16 RL PDF
No ratings yet
16 RL PDF
87 pages
DLMAIRIL01_Q4-2024_Session2
No ratings yet
DLMAIRIL01_Q4-2024_Session2
68 pages
Reinforcement Learning MY101
No ratings yet
Reinforcement Learning MY101
15 pages
06 MDP
No ratings yet
06 MDP
89 pages
Unit 5 Reinforcement Learning Notes
No ratings yet
Unit 5 Reinforcement Learning Notes
20 pages
1.1 Discounted (Infinite-Horizon) Markov Decision Processes
No ratings yet
1.1 Discounted (Infinite-Horizon) Markov Decision Processes
26 pages
1-markov
No ratings yet
1-markov
34 pages
Stochastic Process - Markov Property - Markov Chain - Markov Decision Process - Reinforcement Learning - RL Techniques - Example Applications
No ratings yet
Stochastic Process - Markov Property - Markov Chain - Markov Decision Process - Reinforcement Learning - RL Techniques - Example Applications
39 pages
A Tutorial For Reinforcement Learning
No ratings yet
A Tutorial For Reinforcement Learning
17 pages
UNIT VI
No ratings yet
UNIT VI
17 pages
2025_MDPs 1
No ratings yet
2025_MDPs 1
62 pages
Top Numerical Methods With Matlab For Beginners!
From Everand
Top Numerical Methods With Matlab For Beginners!
Andrei Besedin
No ratings yet
!!! Denavit-Hartenberg Convention
No ratings yet
!!! Denavit-Hartenberg Convention
17 pages
The Anatomy of A Humanoid Robot
No ratings yet
The Anatomy of A Humanoid Robot
8 pages
Robot Assist Users Manual PDF
No ratings yet
Robot Assist Users Manual PDF
80 pages
Mechanical Arm Teleoperation Control System by Dynamic Hand Gesture Recognition Based PDF
No ratings yet
Mechanical Arm Teleoperation Control System by Dynamic Hand Gesture Recognition Based PDF
4 pages
31-Telerobotics !!!!!
No ratings yet
31-Telerobotics !!!!!
17 pages
Robotics
No ratings yet
Robotics
17 pages
Kinematical Analysis of Wunderlich Mechanism
No ratings yet
Kinematical Analysis of Wunderlich Mechanism
16 pages
T-Test and ANOVA: Appendix F
No ratings yet
T-Test and ANOVA: Appendix F
1 page
Personal Edu Robotics
No ratings yet
Personal Edu Robotics
4 pages
MIT OCW CEE 1-964 - Design For Sustainability - Energy in Buildings
No ratings yet
MIT OCW CEE 1-964 - Design For Sustainability - Energy in Buildings
61 pages
UJ ARC 364 Climate and Design Course Outline 120311a
No ratings yet
UJ ARC 364 Climate and Design Course Outline 120311a
5 pages
PerceptionsAndDice
No ratings yet
PerceptionsAndDice
13 pages
Defining Climatic Zones For Architectural Design in Nigeria
No ratings yet
Defining Climatic Zones For Architectural Design in Nigeria
14 pages
Climate Variability and Climate Change
No ratings yet
Climate Variability and Climate Change
2 pages
Robotics in Edutainment
No ratings yet
Robotics in Edutainment
6 pages
Read&Write
No ratings yet
Read&Write
7 pages
Human Computer Interaction May 2007 Question Paper
100% (1)
Human Computer Interaction May 2007 Question Paper
4 pages
SmarterMail and Microsoft Exchange - An Administrative Comparison
No ratings yet
SmarterMail and Microsoft Exchange - An Administrative Comparison
10 pages
A Quick Tutorial On RSLogix Emulator 5000
No ratings yet
A Quick Tutorial On RSLogix Emulator 5000
12 pages
Swarm Optimized Fake News Detection On Social-Media Textual Content Using Deep Learning
No ratings yet
Swarm Optimized Fake News Detection On Social-Media Textual Content Using Deep Learning
8 pages
Abstract Classes and Interface in PHP March 24
No ratings yet
Abstract Classes and Interface in PHP March 24
6 pages
ME96SS: Mitsubishi Electronic Multi-Measuring Instrument
No ratings yet
ME96SS: Mitsubishi Electronic Multi-Measuring Instrument
32 pages
Computer Programming 1 Module Final 2024-2025 -First Year
No ratings yet
Computer Programming 1 Module Final 2024-2025 -First Year
129 pages
IBM VIOS Maintenance
100% (1)
IBM VIOS Maintenance
46 pages
Semaphore
No ratings yet
Semaphore
6 pages
Slides WK1
No ratings yet
Slides WK1
77 pages
Planning Guide SAP Business Suite
No ratings yet
Planning Guide SAP Business Suite
72 pages
Challenges of Promoting E-Governance in Tanzania
0% (1)
Challenges of Promoting E-Governance in Tanzania
5 pages
856 Ship Notice/Manifest: Version: 1.3 Final
No ratings yet
856 Ship Notice/Manifest: Version: 1.3 Final
41 pages
AISC Search DLL
0% (1)
AISC Search DLL
550 pages
Enrico Pagello: Intelligent Autonomous System Laboratory (IAS-Lab) University of Padua, Italy
No ratings yet
Enrico Pagello: Intelligent Autonomous System Laboratory (IAS-Lab) University of Padua, Italy
13 pages
Electronic Health Records System
100% (1)
Electronic Health Records System
22 pages
Cisco 4-Port and 8-Port Layer 2 Gigabit Etherswitch Network Interface Module Configuration Guide For Cisco 4000 Series Isr
No ratings yet
Cisco 4-Port and 8-Port Layer 2 Gigabit Etherswitch Network Interface Module Configuration Guide For Cisco 4000 Series Isr
26 pages
Mesh-Intro 14.0 L-01 Introduction To ANSYS
No ratings yet
Mesh-Intro 14.0 L-01 Introduction To ANSYS
14 pages
Track Schedule
No ratings yet
Track Schedule
9 pages
Online Hotel Management System: Synopsis Synopsis Synopsis Synopsis
No ratings yet
Online Hotel Management System: Synopsis Synopsis Synopsis Synopsis
34 pages
Ambulance Management System Using GIS
100% (1)
Ambulance Management System Using GIS
77 pages
ADI Nat Nan
No ratings yet
ADI Nat Nan
3 pages
Thesis On FPGA
No ratings yet
Thesis On FPGA
79 pages
Ansys ICEM CFD 10.0 Manual
No ratings yet
Ansys ICEM CFD 10.0 Manual
1,287 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Reinforcement Learning and Control: CS229 Lecture Notes

Uploaded by

Reinforcement Learning and Control: CS229 Lecture Notes

Uploaded by

CS229 Lecture notes

Note that the reward at timestep t is discounted by a factor of

(s) = arg max

, the value function for

s value is at least a large as the value of any other other policy.

as dened in Equation (3) is the optimal policy.

has the interesting property that it is the optimal policy

then thered be some other policy thats

. Specically, the same policy

attains the maximum

no matter what the initial state of our MDP is.

#times we took action a in state s

( s) in the discrete state MDP

). Specically, it assumes that the value function is takes a constant

) rather than an integral over states;

)] (the right hand side of Equation 7). Then,

)] (or, in other words, to try to get V (s) close to

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.