Q-Learning and Deep Q Networks (DQN)

Q-learning and deep Q-networks (DQN) are value-based reinforcement learning methods for solving sequential decision-making problems. Q-learning uses the Bellman equation to iteratively update Q-values towards the optimal Q* function through trial and error, without requiring a model of the environment. When deep neural networks are used as function approximators, DQN can handle high-dimensional (and continuous) state spaces that are intractable for tabular Q-learning, and it has been successfully applied to challenging domains such as Atari games.


Q-learning and Deep Q networks (DQN)

Vincent François-Lavet

December 8, 2021
Outline

- Motivation for value-based reinforcement learning
- The Bellman operator
  - Dynamic programming
  - Q-learning
- Q-learning with deep learning as a function approximator
- A few variants of DQN
- Discussion of a parallel with neurosciences
  - How to discount deep RL
- Conclusions
- Project
Motivation for value-based reinforcement learning
Overview of deep RL
In general, a reinforcement learning (RL) agent may include one or more of the following components:
- a representation of a value function that predicts how good each state or each state-action pair is,
- a direct representation of the policy π(x) or π(x, a), or
- a model of the environment in conjunction with a planning algorithm.

Figure: General schema of RL methods: experience is used either for model learning (model-based RL) or for directly learning a value function or a policy (model-free RL, value-based or policy-based); a learned model supports planning, and the resulting value/policy drives acting, which generates new experience.
Deep learning has brought its generalization capabilities to RL.


The Bellman operator
Value based methods: recall

In an MDP (X, A, T, R, γ), the expected return V^π(x) : X → R of a policy π ∈ Π (e.g., π : X → A) is defined such that

    V^\pi(x) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; x_t = x, \pi \right],

with γ ∈ [0, 1).


Value based methods: recall

In addition to the V-value function, the Q-value function Q^π(x, a) : X × A → R is defined as follows:

    Q^\pi(x, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; x_t = x, a_t = a, \pi \right].

The particularity of the Q-value function, as compared to the V-value function, is that the optimal policy can be obtained directly from Q^*(x, a):

    \pi^*(x) = \operatorname*{arg\,max}_{a \in A} Q^*(x, a).
Value based methods

The Bellman equation, which is at the core of reinforcement learning, makes use of the fact that the Q-function can be written in a recursive form:

    Q^\pi(x, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; x_t = x, a_t = a, \pi \right]
                = \mathbb{E}\left[ r_t + \sum_{k=1}^{\infty} \gamma^k r_{t+k} \;\middle|\; x_t = x, a_t = a, \pi \right]
                = \mathbb{E}\left[ r_t + \gamma\, Q^\pi(x_{t+1}, a') \;\middle|\; x_t = x, a_t = a, \pi \right], \quad a' \sim \pi.

In particular:

    Q^*(x, a) = \mathbb{E}\left[ r_t + \gamma\, Q^*(x_{t+1}, a') \;\middle|\; x_t = x, a_t = a, \pi^* \right], \quad a' \sim \pi^*
              = \mathbb{E}\left[ r_t + \gamma \max_{a' \in A} Q^*(x_{t+1}, a') \;\middle|\; x_t = x, a_t = a \right].
Value-based method: Q-learning with one entry for every
state-action pair

To obtain Q ∗ , you can:


1. Solve the system of equations (if you know T and R),
2. Initialize the Q-values and repeatedly apply “Bellman
iterations” until you find the fixed point (if you know T and
R) → the dynamic programming case, or
3. Use reinforcement learning to perform the Bellman iterations from data (trial and error in the environment).
Dynamic programming
Value-based method: Q-learning with one entry for every
state-action pair

In order to learn the optimal Q-value function, the Q-learning algorithm makes use of the Bellman equation for the Q-value function, whose unique solution is Q^*(x, a):

    Q^*(x, a) = (B Q^*)(x, a),

where B is the Bellman operator mapping any function K : X × A → R into another function X × A → R, defined as follows:

    (B K)(x, a) = \sum_{x' \in X} T(x, a, x') \left[ R(x, a, x') + \gamma \max_{a' \in A} K(x', a') \right].
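As a concrete illustration (added here, not part of the original slides), a minimal sketch of this Bellman operator for a tabular MDP, assuming T and R are stored as NumPy arrays indexed by (state, action, next state):

import numpy as np

def bellman_operator(K, T, R, gamma):
    """Apply the Bellman optimality operator B to a tabular Q-function K.

    K: array of shape (n_states, n_actions)
    T: transition probabilities T(x, a, x'), shape (n_states, n_actions, n_states)
    R: rewards R(x, a, x'), shape (n_states, n_actions, n_states)
    """
    max_next = K.max(axis=1)   # max over a' of K(x', a'), one value per next state x'
    # (BK)(x, a) = sum_x' T(x, a, x') [ R(x, a, x') + gamma * max_a' K(x', a') ]
    return np.sum(T * (R + gamma * max_next), axis=2)

Repeatedly applying bellman_operator to any initial K converges to Q^*, since B is a γ-contraction.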
The chain problem

Figure: The chain environment (γ = 0.9): five states; action a moves one state to the right with reward 0 and yields reward 1 in state 5 (where the agent stays), while action b brings the agent back to state 1, with reward 0.2 when taken in state 1 and 0 otherwise.
The chain problem: tabular Q-values with dynamic
programming

         After 1 iteration      After 2 iterations               At convergence
state     Q(·,a)   Q(·,b)        Q(·,a)   Q(·,b)        ···       Q(·,a)   Q(·,b)
  1         0       0.2            0       0.38                     ?        ?
  2         0       0              0       0.18                     ?        ?
  3         0       0              0       0                        ?        ?
  4         0       0              0.9     0                        ?        ?
  5         1       0              1.9     0                        ?        ?

Table: Updates of the tabular Q-values, starting from an initialization at 0.

The resulting policy is to choose action a in all states once the Bellman iterations have converged to their fixed point.
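The sketch below (an illustration added here, not from the slides) encodes the chain MDP as NumPy arrays and applies the Bellman iterations until the fixed point; the first iterations reproduce the table above, and the converged Q-values give action a as the greedy choice in every state.

import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9    # actions: 0 = a, 1 = b
T = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

for x in range(n_states):
    x_next = min(x + 1, n_states - 1)     # action a: move right (stay in state 5)
    T[x, 0, x_next] = 1.0
    R[x, 0, x_next] = 1.0 if x == n_states - 1 else 0.0
    T[x, 1, 0] = 1.0                      # action b: back to state 1
    R[x, 1, 0] = 0.2 if x == 0 else 0.0

Q = np.zeros((n_states, n_actions))
for i in range(200):                      # Bellman iterations (value iteration on Q)
    Q = np.sum(T * (R + gamma * Q.max(axis=1)), axis=2)

print(np.round(Q, 3))        # converged Q-values (e.g., Q(5, a) = 1 / (1 - 0.9) = 10)
print(Q.argmax(axis=1))      # greedy policy: action a (index 0) in every state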
Value-based method: Q-learning (dynamic programming)
Figure: Value iteration on a 3×3 grid-world MDP (γ = 0.9), showing the value function V = max_a Q(x, a) and the resulting policy π = argmax_a Q(x, a) for iterations i = 0, 1, 2. Starting from an initialization at 0, the value of the rewarding transition R = 1 propagates backwards through the grid, discounted by γ at each step (1, 0.9, 0.81, ...). After obtaining R = 1, the agent is assumed to end up in a terminal state (i.e., all subsequent rewards are 0).
Q-learning
Value-based method: Q-learning

As opposed to dynamic programming, which assumes a priori knowledge of the MDP, RL learns through trial and error.

Algorithm 1: Pseudocode for the Q-learning algorithm in the tabular setting

1: Initialize Q(x, a) arbitrarily
2: for each episode do
3:     Initialize x
4:     for each step of the episode do
5:         Choose a in x using a policy derived from Q (e.g., ε-greedy)
6:         Take action a, observe r, x'
7:         Q(x, a) ← Q(x, a) + α [ r + γ max_{a'} Q(x', a') − Q(x, a) ]
8:         x ← x'
9: return Q(·, ·)
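For concreteness, here is a minimal sketch (added here, not from the slides) of tabular Q-learning with ε-greedy exploration, applied to the chain environment above; the step function and the values of α and ε are illustrative choices.

import numpy as np

rng = np.random.RandomState(0)
n_states, n_actions, gamma = 5, 2, 0.9   # actions: 0 = a, 1 = b
alpha, epsilon = 0.1, 0.1                # illustrative hyper-parameters

def step(x, a):
    """One transition of the chain environment: returns (reward, next state)."""
    if a == 0:                                   # action a: move right
        x_next = min(x + 1, n_states - 1)
        return (1.0 if x == n_states - 1 else 0.0), x_next
    return (0.2 if x == 0 else 0.0), 0           # action b: back to state 1

Q = np.zeros((n_states, n_actions))
x = 0
for t in range(100000):
    # epsilon-greedy action selection
    a = rng.randint(n_actions) if rng.rand() < epsilon else int(Q[x].argmax())
    r, x_next = step(x, a)
    # tabular Q-learning update
    Q[x, a] += alpha * (r + gamma * Q[x_next].max() - Q[x, a])
    x = x_next

print(np.round(Q, 2))   # close to the fixed point of the Bellman iterations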
Convergence Q-learning

Theorem: Given a finite MDP, the Q-learning algorithm given by the update rule

    Q_t(x_t, a_t) \leftarrow Q_t(x_t, a_t) + \alpha_t \left[ r_t + \gamma \max_{a' \in A} Q_t(x_{t+1}, a') - Q_t(x_t, a_t) \right]

converges w.p. 1 to the optimal Q-function as long as
- \sum_t \alpha_t = \infty and \sum_t \alpha_t^2 < \infty, and
- the exploration policy π is such that P_π[a_t = a | x_t = x] > 0 for all (x, a).

More details: "Convergence of Q-learning: a simple proof", Francisco S. Melo.
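As a simple example of a schedule meeting these conditions (added here for illustration), one can use a learning rate based on the visit count n_t(x, a) of each state-action pair:

    \alpha_t(x,a) = \frac{1}{1 + n_t(x,a)}, \qquad
    \sum_{t} \alpha_t = \infty \ \text{(harmonic series)}, \qquad
    \sum_{t} \alpha_t^2 < \infty \ \text{(convergent $p$-series, $p=2$)}.

In practice, deep RL methods typically use a small constant learning rate, trading this theoretical guarantee for simplicity.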
Example 1: Mountain car
A car tries to reach the top of a hill, but its engine is not strong enough.
- State: position and velocity.
- Action: accelerate forward, accelerate backward, coast.
- Goal: get the car to the top of the hill (e.g., reward = 1 at the top).

Figure: Mountain car


Example 1: Mountain car

Figure: Application to the mountain car domain: V = max_a Q(x, a), where the state space has been finely discretized.
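Below is a hedged sketch (added here, not part of the slides) of how such a finely discretized tabular approach can be set up with the Gymnasium MountainCar-v0 environment; the number of bins, α, ε, and the episode budget are arbitrary illustrative choices.

import numpy as np
import gymnasium as gym

env = gym.make("MountainCar-v0")                 # state: (position, velocity); 3 discrete actions
n_bins = 40                                      # illustrative discretization of each dimension
low, high = env.observation_space.low, env.observation_space.high
Q = np.zeros((n_bins, n_bins, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1           # illustrative hyper-parameters

def to_cell(obs):
    """Map a continuous observation to a discrete (i, j) cell."""
    ratios = (obs - low) / (high - low)
    return tuple(np.clip((ratios * n_bins).astype(int), 0, n_bins - 1))

for episode in range(5000):
    obs, _ = env.reset()
    x, done = to_cell(obs), False
    while not done:
        a = env.action_space.sample() if np.random.rand() < epsilon else int(Q[x].argmax())
        obs, r, terminated, truncated, _ = env.step(a)   # note: Gymnasium gives reward -1 per step
        x_next, done = to_cell(obs), terminated or truncated
        Q[x][a] += alpha * (r + gamma * Q[x_next].max() - Q[x][a])
        x = x_next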
Example 1: Mountain car

Figure: Mountain car optimal policy


Q-learning with deep learning as a function approximator
Why function approximators?

A tabular approach with discretization fails due to the curse of dimensionality when the number of (initially continuous) state dimensions is roughly ≳ 10, or when the number of states is very large.

When do we need function approximators?
- Large and/or continuous state space → DQN.
- (Large and/or continuous action space) → next week we will see the continuous action space case.
Q-learning with function approximator

To deal with continuous state and/or action spaces, we can represent the value function with a function approximator with parameters θ:

    Q(x, a; θ) ≈ Q(x, a).

The parameters θ are updated by taking a gradient step on the squared error (Q(x, a; θ) − Y_k^Q)²:

    \theta := \theta - \alpha \nabla_\theta \left( Q(x, a; \theta) - Y_k^Q \right)^2,

which is equivalent (up to a factor of 2) to θ := θ + α (Y_k^Q − Q(x, a; θ)) ∇_θ Q(x, a; θ), with the target

    Y_k^Q = r + \gamma \max_{a' \in A} Q(x', a'; \theta_k).

With deep learning, the update usually uses a mini-batch (e.g., 32 elements) of tuples ⟨x, a, r, x′⟩.
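A minimal PyTorch sketch of this mini-batch update (added here for illustration; the network size, optimizer, and learning rate are assumptions) could look as follows.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_state_dims, n_actions, gamma = 4, 3, 0.99      # illustrative dimensions
q_net = nn.Sequential(nn.Linear(n_state_dims, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_update(x, a, r, x_next):
    """One gradient step on a mini-batch of transitions (x, a, r, x')."""
    with torch.no_grad():
        # target Y^Q = r + gamma * max_a' Q(x', a'; theta_k)
        y = r + gamma * q_net(x_next).max(dim=1).values
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(x, a; theta)
    loss = F.mse_loss(q_xa, y)                             # (Q(x, a; theta) - Y^Q)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()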
DQN algorithm
For deep Q-learning, we can represent the value function by a deep Q-network with weights θ (which, naively, leads to instabilities!). The DQN algorithm addresses this with:
- a replay memory, and
- a target network.

Figure: Sketch of the DQN algorithm. Transitions ⟨x, a, r, x′⟩ generated by acting in the environment are stored in a replay memory of size N_replay. Q(x, a; θ_k) is initialized to random values (close to 0) everywhere on its domain and the replay memory is initially empty; the target Q-network parameters θ_k^− are only updated every C iterations with the Q-network parameters θ_k and are held fixed between updates; the update uses a mini-batch (e.g., 32 elements) of tuples ⟨x, a, r, x′⟩ taken randomly from the replay memory, with targets r_t + γ max_{a′∈A} Q(x_{t+1}, a′; θ_k^−).
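The following sketch (added here for illustration, with assumed sizes and hyper-parameters) shows how the replay memory and the target network fit together around the mini-batch update; the target network is refreshed every C iterations.

import copy, random, collections
import torch
import torch.nn as nn
import torch.nn.functional as F

n_state_dims, n_actions = 4, 3
gamma, batch_size, C = 0.99, 32, 1000            # illustrative hyper-parameters

q_net = nn.Sequential(nn.Linear(n_state_dims, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)                # theta^- := theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = collections.deque(maxlen=100000)        # replay memory of tuples (x, a, r, x', done)

def dqn_step(k):
    """One DQN training iteration on a random mini-batch from the replay memory."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    x, a, r, x_next, done = zip(*batch)
    x = torch.as_tensor(x, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    x_next = torch.as_tensor(x_next, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)
    with torch.no_grad():
        # targets use the frozen parameters theta^-
        y = r + gamma * (1.0 - done) * target_net(x_next).max(dim=1).values
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_xa, y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if k % C == 0:                               # every C iterations: theta^- := theta
        target_net.load_state_dict(q_net.state_dict())

In a full agent, an ε-greedy policy would generate the transitions appended to the replay memory, and dqn_step(k) would be called at every interaction step.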
Visualization of Q-values in mountain car

Figure: DQN for mountain car


Example 2: toy example in finance
This environment simulates the possibility of buying or selling a good.
The agent can either have one unit or zero unit of that good. At each
transaction with the market, the agent obtains a reward equivalent to the
price of the good when selling it and the opposite when buying. In
addition, a penalty of 0.5 (negative reward) is added for each transaction.
The price pattern is made by repeating the following signal plus a
random constant between 0 and 3:

Figure: Price signal


- State: the current price, the price at the last five time steps, and whether the agent holds one unit of the good. (This problem becomes very complex without a function approximator.)
- Action: buy, sell, do nothing.
- Goal: get as much $$$ as possible.
Example using the DeeR library

You can then launch "run_toy_env_simple.py" in the folder "examples/toy_env/".
Example: run_toy_env_simple.py

# Imports as in the DeeR toy_env example (module paths may differ slightly across DeeR versions)
import numpy as np
from deer.agent import NeuralAgent
from deer.learning_algos.q_net_keras import MyQNetwork
import deer.experiment.base_controllers as bc
from Toy_env import MyEnv as Toy_env

rng = np.random.RandomState(123456)

# --- Instantiate environment ---
env = Toy_env(rng)

# --- Instantiate qnetwork ---
qnetwork = MyQNetwork(
    environment=env,
    random_state=rng)

# --- Instantiate agent ---
agent = NeuralAgent(
    env,
    qnetwork,
    random_state=rng)

# --- Bind controllers to the agent ---

# Before every training epoch, we want to print a summary of important elements.
agent.attach(bc.VerboseController())

# During training epochs, we want to train the agent after every action it takes.
agent.attach(bc.TrainerController())

# We also want to interleave a "test epoch" between each training epoch.
agent.attach(bc.InterleavedTestEpochController(epoch_length=500))

# --- Run the experiment ---
agent.run(n_epochs=100, epoch_length=1000)
Example: run_toy_env_simple.py
Every 10 epochs, a graph is saved in the "toy_env" folder:

In this graph, you can see that the agent has successfully learned to take advantage of the price pattern. It is important to note that the results shown are obtained on a validation set that is different from the training set, so the learning generalizes well. For instance, buying at time steps 7 and 16 is the expected behaviour because, on average, this allows the agent to make a profit, given that it has no information about the future.
Real-world application of deep RL: the microgrid
benchmark
A microgrid is an electrical system that includes multiple loads and
distributed energy resources that can be operated in parallel with
the broader utility grid or as an electrical island.

Microgrid
Microgrids and storage
There exist opportunities with microgrids featuring:
- a short-term storage capacity (typically batteries), and
- a long-term storage capacity (e.g., hydrogen).
Structure of the Q-network

Figure: Sketch of the structure of the neural network architecture. The network processes the time-series inputs (input #1, #2, #3, ...) with a set of convolutional layers; the output of the convolutions and the other inputs are then followed by fully-connected layers and the output layer. Architectures based on LSTMs instead of convolutions obtain similar results.
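As an illustration (added here; the layer sizes, number of inputs, and history length are assumptions, not the ones used in the original work), such an architecture could be sketched in PyTorch as:

import torch
import torch.nn as nn

class MicrogridQNetwork(nn.Module):
    """Conv layers on the time-series inputs, concatenated with scalar inputs, then FC layers."""
    def __init__(self, history_len=24, n_series=2, n_scalars=2, n_actions=3):
        super().__init__()
        self.conv = nn.Sequential(                       # processes the time series
            nn.Conv1d(n_series, 16, kernel_size=4), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=4), nn.ReLU(),
            nn.Flatten())
        conv_out = 16 * (history_len - 6)                # output size after the two convolutions
        self.fc = nn.Sequential(                         # fully-connected layers + output layer
            nn.Linear(conv_out + n_scalars, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, series, scalars):
        # series: (batch, n_series, history_len); scalars: (batch, n_scalars)
        h = self.conv(series)
        return self.fc(torch.cat([h, scalars], dim=1))   # Q-values, one per action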
Results

Figure: LEC (per kWh) on the test data as a function of the sizings of the microgrid (% of the robust sizings of PV, battery and H2 storage), comparing the learned policy without any external info, with seasonal info, and with solar prediction against the optimal deterministic LEC and a naive policy.
A few variants of DQN
Distributional DQN

Another approach is to aim for a richer representation through a value distribution, i.e., the distribution of possible cumulative returns.
The value distribution Z^π is a mapping from state-action pairs to distributions of returns when following policy π. Its expectation equals Q^π:

    Q^\pi(x, a) = \mathbb{E}\, Z^\pi(x, a).

This random return is also described by a recursive equation, but one of a distributional nature:

    Z^\pi(x, a) = R(x, a, X') + \gamma\, Z^\pi(X', A'),

where we use capital letters to emphasize the random nature of the next state-action pair (X', A'), with A' ∼ π(·|X').
Distributional DQN
It has been shown that such a distributional Bellman equation can be used in practice, with deep learning as the function approximator. This approach has the following advantages:
- It is possible to implement risk-aware behaviour.
- It leads to more performant learning in practice. One of the main elements is that the distributional perspective naturally provides a richer set of training signals than a scalar value function Q(x, a) (an effect similar to auxiliary tasks).
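For intuition, here is an illustrative sketch (added here, loosely following a categorical C51-style parameterization; the atom range and counts are assumptions) of how a value distribution can be represented by probabilities over a fixed set of return atoms, with Q recovered as its expectation.

import numpy as np

n_actions, n_atoms = 3, 51
v_min, v_max = -10.0, 10.0                      # assumed support of the return distribution
z = np.linspace(v_min, v_max, n_atoms)          # fixed atoms z_i

# For one state x, the network would output a probability vector over the atoms for
# each action (random values here, just to illustrate the bookkeeping).
logits = np.random.randn(n_actions, n_atoms)
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax per action

# Q(x, a) = E[ Z(x, a) ] = sum_i p_i(x, a) * z_i
q_values = p @ z
greedy_action = int(q_values.argmax())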
Multi-step learning

In DQN, the target value used is estimated based on its own value
estimate at the next time-step. For that reason, the learning
algorithm is said to bootstrap as it recursively uses its own value
estimates.

Such a variant of DQN can be obtained by using the n-step target value given by:

    Y_k^{Q,n} = \sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n \max_{a' \in A} Q(x_n, a'; \theta_k),

where (x_0, a_0, r_0, \dots, x_{n-1}, a_{n-1}, r_{n-1}, x_n) is any trajectory of n + 1 time steps with x = x_0 and a = a_0.
Warning: Online data is required for convergence without bias (or
other specific techniques)
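A small sketch (added here for illustration) of computing this n-step target from a stored trajectory, assuming q_net returns the Q-values of a state as a NumPy array:

import numpy as np

def n_step_target(rewards, x_n, q_net, gamma, n):
    """Y^{Q,n} = sum_{t=0}^{n-1} gamma^t r_t + gamma^n max_a' Q(x_n, a')."""
    assert len(rewards) == n
    discounted = sum(gamma ** t * r for t, r in enumerate(rewards))
    return discounted + gamma ** n * np.max(q_net(x_n))

# Example with a dummy Q-function over 2 actions:
q_net = lambda x: np.array([0.5, 1.0])
print(n_step_target([0.0, 0.0, 1.0], x_n=None, q_net=q_net, gamma=0.9, n=3))
# 0.9^2 * 1.0 + 0.9^3 * 1.0 = 0.81 + 0.729 = 1.539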
Discussion of a parallel with neurosciences
How to discount deep RL
Motivations

Effect of the discount factor in an online setting.


- Empirical studies of cognitive mechanisms in delay of gratification: the capacity to wait longer for preferred rewards seems to develop markedly only at about ages 3-4 (the "marshmallow experiment").
Increasing the discount factor (using the DQN algorithm)

Figure: Illustration, for the game Q*bert, of a discount factor γ held fixed versus an adaptive (increasing) discount factor.
Conclusions
Summary of the lecture

- Introduction to Q-learning in the tabular case and with deep learning (DQN).
- Toy examples and real-world examples.
- Brief discussion of the role of the discount factor and some relations to neuroscience.
Further resources (optional)

- Watkins, Christopher J. C. H., and Peter Dayan. "Q-learning." Machine Learning 8, no. 3-4 (1992): 279-292.
- Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Further resources

Figure: Overview of the DeeR framework: the agent combines function approximators, learning algorithms, controllers (handling train/validation/test phases and hyper-parameter management), policies (exploration/exploitation, e.g., via ε-greedy) and a replay memory, and interacts with the environment.

Implementation: https://github.com/VinF/deer
Questions?
Project
Project
You consider the chain environment made up of 5 discrete states and 2 discrete actions, where you get a reward of 0.2 at one end of the chain and 1 at the other end (see illustration below).

Figure: The chain environment (γ = 0.9): action a moves one state to the right with reward 0 and yields reward 1 in state 5, while action b brings the agent back to state 1, with reward 0.2 when taken in state 1 and 0 otherwise. The initial state is state 1.

In part 1, you work in the tabular context:
- Solve the problem using tabular Q-learning with ε-greedy exploration. Provide the optimal Q-values and discuss the learning rate α and ε (3 points).
- Increase the size of the chain to 10 states while keeping the rewards at both ends of the chain. Discuss the new results, in particular ε (2 points).
Project

In part 2, you will solve the chain problem using function approximators (5 points), for γ = 0.9 and 10 states.
- Provide illustrations of the solutions of your optimal Q-values (2 points).
- Discuss the hyper-parameters and the convergence (3 points).

If you go for deep Q-learning, here are additional tips:
- We advise starting from an existing implementation (e.g., DeeR; docs and examples are available).
- Normalize the state encoding, e.g., uniformly in [−1, 1].
- Start coding as early as possible.

Deadline: 24th of December (try to aim for one week earlier!)
Example: run_toy_env_simple.py
If you start from https://github.com/VinF/deer/blob/master/examples/toy_env,
you must modify Toy_env.py and run_toy_env_simple.py.

- You must code the MDP transition (and the reward) in the method act (you don't need to use rng):

def act(self, action):
    ...

- Your state is simply defined as one scalar (without history):

def inputDimensions(self):
    return [(1,)]

- You have two actions:

def nActions(self):
    return 2
Example: run_toy_env_simple.py

- You never have terminal states:

def inTerminalState(self):
    return False

- The function "observe" provides the encoded representation of the state:

def observe(self):
    return np.array(self._last_ponctual_observation)
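To make the first tip concrete, here is a hedged sketch (an assumption about how one might fill in Toy_env.py, not the official solution) of the act method for the 10-state chain; it assumes self._last_ponctual_observation holds the current state and that, following the DeeR convention, act returns the reward.

def act(self, action):
    """Chain MDP transition: action 0 = a (move right), action 1 = b (back to state 1)."""
    state = int(self._last_ponctual_observation)   # current state in {0, ..., 9}
    if action == 0:
        reward = 1.0 if state == 9 else 0.0        # reward 1 at the right end of the chain
        state = min(state + 1, 9)
    else:
        reward = 0.2 if state == 0 else 0.0        # reward 0.2 at the left end of the chain
        state = 0
    self._last_ponctual_observation = state        # remember to normalize in observe(), e.g., state / 9
    return reward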
Questions?
