Q-Learning and Deep Q Networks (DQN)

Q-learning and deep Q-networks (DQN) are value-based reinforcement learning methods for solving sequential decision-making problems. Q-learning uses the Bellman equation to iteratively update Q-values towards the optimal Q* function through trial and error, without requiring a model of the environment. When deep neural networks are used as function approximators, DQN can handle high-dimensional (and continuous) state spaces that are intractable for tabular Q-learning, and it has been successfully applied to challenging domains such as Atari games.


Q-learning and Deep Q networks (DQN)

Vincent François-Lavet

December 8, 2021
Outline

- Motivation for value-based reinforcement learning
- The Bellman operator
  - Dynamic programming
  - Q-learning
- Q-learning with deep learning as a function approximator
- A few variants of DQN
- Discussion of a parallel with neurosciences
  - How to discount deep RL
- Conclusions
- Project
Motivation for value-based reinforcement learning
Overview of deep RL
In general, a reinforcement learning (RL) agent may include one or more of the following components:
- a representation of a value function that predicts how good each state or each state-action pair is,
- a direct representation of the policy π(x) or π(x, a), or
- a model of the environment in conjunction with a planning algorithm.

Figure: General schema of RL methods: experience is used either for model learning (model-based RL) or for directly learning a value function or a policy (model-free RL, value-based or policy-based); a learned model supports planning, and the resulting value/policy drives acting, which generates new experience.
Deep learning has brought its generalization capabilities to RL.


The Bellman operator
Value based methods: recall

In an MDP (X, A, T, R, γ), the expected return V^π(x) : X → R of a policy π ∈ Π (e.g., π : X → A) is defined such that

    V^\pi(x) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; x_t = x, \pi \right],

with γ ∈ [0, 1).


Value based methods: recall

In addition to the V-value function, the Q-value function Q^π(x, a) : X × A → R is defined as follows:

    Q^\pi(x, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; x_t = x, a_t = a, \pi \right].

The particularity of the Q-value function, as compared to the V-value function, is that the optimal policy can be obtained directly from Q^*(x, a):

    \pi^*(x) = \operatorname*{arg\,max}_{a \in A} Q^*(x, a).
Value based methods

The Bellman equation, which is at the core of reinforcement learning, makes use of the fact that the Q-function can be written in a recursive form:

    Q^\pi(x, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; x_t = x, a_t = a, \pi \right]
                = \mathbb{E}\left[ r_t + \sum_{k=1}^{\infty} \gamma^k r_{t+k} \;\middle|\; x_t = x, a_t = a, \pi \right]
                = \mathbb{E}\left[ r_t + \gamma\, Q^\pi(x_{t+1}, a') \;\middle|\; x_t = x, a_t = a, \pi \right], \quad a' \sim \pi.

In particular:

    Q^*(x, a) = \mathbb{E}\left[ r_t + \gamma\, Q^*(x_{t+1}, a') \;\middle|\; x_t = x, a_t = a, \pi^* \right], \quad a' \sim \pi^*
              = \mathbb{E}\left[ r_t + \gamma \max_{a' \in A} Q^*(x_{t+1}, a') \;\middle|\; x_t = x, a_t = a \right].
Value-based method: Q-learning with one entry for every
state-action pair

To obtain Q ∗ , you can:


1. Solve the system of equations (if you know T and R),
2. Initialize the Q-values and repeatedly apply “Bellman
iterations” until you find the fixed point (if you know T and
R) → the dynamic programming case, or
3. Use reinforcement learning to perform the Bellman iterations from data (trial and error in the environment).
Dynamic programming
Value-based method: Q-learning with one entry for every
state-action pair

In order to learn the optimal Q-value function, the Q-learning algorithm makes use of the Bellman equation for the Q-value function, whose unique solution is Q^*(x, a):

    Q^*(x, a) = (B Q^*)(x, a),

where B is the Bellman operator mapping any function K : X × A → R into another function X × A → R, defined as follows:

    (B K)(x, a) = \sum_{x' \in X} T(x, a, x') \left[ R(x, a, x') + \gamma \max_{a' \in A} K(x', a') \right].
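As a concrete illustration (added here, not part of the original slides), a minimal sketch of this Bellman operator for a tabular MDP, assuming T and R are stored as NumPy arrays indexed by (state, action, next state):

import numpy as np

def bellman_operator(K, T, R, gamma):
    """Apply the Bellman optimality operator B to a tabular Q-function K.

    K: array of shape (n_states, n_actions)
    T: transition probabilities T(x, a, x'), shape (n_states, n_actions, n_states)
    R: rewards R(x, a, x'), shape (n_states, n_actions, n_states)
    """
    max_next = K.max(axis=1)   # max over a' of K(x', a'), one value per next state x'
    # (BK)(x, a) = sum_x' T(x, a, x') [ R(x, a, x') + gamma * max_a' K(x', a') ]
    return np.sum(T * (R + gamma * max_next), axis=2)

Repeatedly applying bellman_operator to any initial K converges to Q^*, since B is a γ-contraction.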
The chain problem

Figure: The chain environment (γ = 0.9): five states; action a moves one state to the right with reward 0 and yields reward 1 in state 5 (where the agent stays), while action b brings the agent back to state 1, with reward 0.2 when taken in state 1 and 0 otherwise.
The chain problem: tabular Q-values with dynamic
programming

         After 1 iteration      After 2 iterations               At convergence
state     Q(·,a)   Q(·,b)        Q(·,a)   Q(·,b)        ···       Q(·,a)   Q(·,b)
  1         0       0.2            0       0.38                     ?        ?
  2         0       0              0       0.18                     ?        ?
  3         0       0              0       0                        ?        ?
  4         0       0              0.9     0                        ?        ?
  5         1       0              1.9     0                        ?        ?

Table: Updates of the tabular Q-values, starting from an initialization at 0.

The resulting policy is to choose action a in all states once the Bellman iterations have converged to their fixed point.
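The sketch below (an illustration added here, not from the slides) encodes the chain MDP as NumPy arrays and applies the Bellman iterations until the fixed point; the first iterations reproduce the table above, and the converged Q-values give action a as the greedy choice in every state.

import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9    # actions: 0 = a, 1 = b
T = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

for x in range(n_states):
    x_next = min(x + 1, n_states - 1)     # action a: move right (stay in state 5)
    T[x, 0, x_next] = 1.0
    R[x, 0, x_next] = 1.0 if x == n_states - 1 else 0.0
    T[x, 1, 0] = 1.0                      # action b: back to state 1
    R[x, 1, 0] = 0.2 if x == 0 else 0.0

Q = np.zeros((n_states, n_actions))
for i in range(200):                      # Bellman iterations (value iteration on Q)
    Q = np.sum(T * (R + gamma * Q.max(axis=1)), axis=2)

print(np.round(Q, 3))        # converged Q-values (e.g., Q(5, a) = 1 / (1 - 0.9) = 10)
print(Q.argmax(axis=1))      # greedy policy: action a (index 0) in every state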
Value-based method: Q-learning (dynamic programming)
Figure: Value iteration on a 3×3 grid-world MDP (γ = 0.9), showing the value function V = max_a Q(x, a) and the resulting policy π = argmax_a Q(x, a) for iterations i = 0, 1, 2. Starting from an initialization at 0, the value of the rewarding transition R = 1 propagates backwards through the grid, discounted by γ at each step (1, 0.9, 0.81, ...). After obtaining R = 1, the agent is assumed to end up in a terminal state (i.e., all subsequent rewards are 0).
Q-learning
Value-based method: Q-learning

As opposed to dynamic programming, which assumes a priori knowledge of the MDP, RL learns through trial and error.

Algorithm 1: Pseudocode for the Q-learning algorithm in the tabular setting

1: Initialize Q(x, a) arbitrarily
2: for each episode do
3:     Initialize x
4:     for each step of the episode do
5:         Choose a in x using a policy derived from Q (e.g., ε-greedy)
6:         Take action a, observe r, x'
7:         Q(x, a) ← Q(x, a) + α [ r + γ max_{a'} Q(x', a') − Q(x, a) ]
8:         x ← x'
9: return Q(·, ·)
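For concreteness, here is a minimal sketch (added here, not from the slides) of tabular Q-learning with ε-greedy exploration, applied to the chain environment above; the step function and the values of α and ε are illustrative choices.

import numpy as np

rng = np.random.RandomState(0)
n_states, n_actions, gamma = 5, 2, 0.9   # actions: 0 = a, 1 = b
alpha, epsilon = 0.1, 0.1                # illustrative hyper-parameters

def step(x, a):
    """One transition of the chain environment: returns (reward, next state)."""
    if a == 0:                                   # action a: move right
        x_next = min(x + 1, n_states - 1)
        return (1.0 if x == n_states - 1 else 0.0), x_next
    return (0.2 if x == 0 else 0.0), 0           # action b: back to state 1

Q = np.zeros((n_states, n_actions))
x = 0
for t in range(100000):
    # epsilon-greedy action selection
    a = rng.randint(n_actions) if rng.rand() < epsilon else int(Q[x].argmax())
    r, x_next = step(x, a)
    # tabular Q-learning update
    Q[x, a] += alpha * (r + gamma * Q[x_next].max() - Q[x, a])
    x = x_next

print(np.round(Q, 2))   # close to the fixed point of the Bellman iterations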
Convergence Q-learning

Theorem: Given a finite MDP, the Q-learning algorithm given by the update rule

    Q_t(x_t, a_t) \leftarrow Q_t(x_t, a_t) + \alpha_t \left[ r_t + \gamma \max_{a' \in A} Q_t(x_{t+1}, a') - Q_t(x_t, a_t) \right]

converges w.p. 1 to the optimal Q-function as long as
- \sum_t \alpha_t = \infty and \sum_t \alpha_t^2 < \infty, and
- the exploration policy π is such that P_π[a_t = a | x_t = x] > 0 for all (x, a).

More details: "Convergence of Q-learning: a simple proof", Francisco S. Melo.
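As a simple example of a schedule meeting these conditions (added here for illustration), one can use a learning rate based on the visit count n_t(x, a) of each state-action pair:

    \alpha_t(x,a) = \frac{1}{1 + n_t(x,a)}, \qquad
    \sum_{t} \alpha_t = \infty \ \text{(harmonic series)}, \qquad
    \sum_{t} \alpha_t^2 < \infty \ \text{(convergent $p$-series, $p=2$)}.

In practice, deep RL methods typically use a small constant learning rate, trading this theoretical guarantee for simplicity.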
Example 1: Mountain car
A car tries to reach the top of a hill, but its engine is not strong enough.
- State: position and velocity.
- Action: accelerate forward, accelerate backward, coast.
- Goal: get the car to the top of the hill (e.g., reward = 1 at the top).

Figure: Mountain car


Example 1: Mountain car

Figure: Application to the mountain car domain: V = max_a Q(x, a), where the state space has been finely discretized.
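Below is a hedged sketch (added here, not part of the slides) of how such a finely discretized tabular approach can be set up with the Gymnasium MountainCar-v0 environment; the number of bins, α, ε, and the episode budget are arbitrary illustrative choices.

import numpy as np
import gymnasium as gym

env = gym.make("MountainCar-v0")                 # state: (position, velocity); 3 discrete actions
n_bins = 40                                      # illustrative discretization of each dimension
low, high = env.observation_space.low, env.observation_space.high
Q = np.zeros((n_bins, n_bins, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1           # illustrative hyper-parameters

def to_cell(obs):
    """Map a continuous observation to a discrete (i, j) cell."""
    ratios = (obs - low) / (high - low)
    return tuple(np.clip((ratios * n_bins).astype(int), 0, n_bins - 1))

for episode in range(5000):
    obs, _ = env.reset()
    x, done = to_cell(obs), False
    while not done:
        a = env.action_space.sample() if np.random.rand() < epsilon else int(Q[x].argmax())
        obs, r, terminated, truncated, _ = env.step(a)   # note: Gymnasium gives reward -1 per step
        x_next, done = to_cell(obs), terminated or truncated
        Q[x][a] += alpha * (r + gamma * Q[x_next].max() - Q[x][a])
        x = x_next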
Example 1: Mountain car

Figure: Mountain car optimal policy


Q-learning with deep learning as a function approximator
Why function approximators?

A tabular approach with discretization fails due to the curse of dimensionality when the number of (initially continuous) state dimensions is roughly ≳ 10, or when the number of states is very large.

When do we need function approximators?
- Large and/or continuous state space → DQN.
- (Large and/or continuous action space) → next week we will see the continuous action space case.
Q-learning with function approximator

To deal with continuous state and/or action spaces, we can represent the value function with a function approximator with parameters θ:

    Q(x, a; θ) ≈ Q(x, a).

The parameters θ are updated by taking a gradient step on the squared error (Q(x, a; θ) − Y_k^Q)²:

    \theta := \theta - \alpha \nabla_\theta \left( Q(x, a; \theta) - Y_k^Q \right)^2,

which is equivalent (up to a factor of 2) to θ := θ + α (Y_k^Q − Q(x, a; θ)) ∇_θ Q(x, a; θ), with the target

    Y_k^Q = r + \gamma \max_{a' \in A} Q(x', a'; \theta_k).

With deep learning, the update usually uses a mini-batch (e.g., 32 elements) of tuples ⟨x, a, r, x′⟩.
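A minimal PyTorch sketch of this mini-batch update (added here for illustration; the network size, optimizer, and learning rate are assumptions) could look as follows.

import torch
import torch.nn as nn
import torch.nn.functional as F

n_state_dims, n_actions, gamma = 4, 3, 0.99      # illustrative dimensions
q_net = nn.Sequential(nn.Linear(n_state_dims, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_update(x, a, r, x_next):
    """One gradient step on a mini-batch of transitions (x, a, r, x')."""
    with torch.no_grad():
        # target Y^Q = r + gamma * max_a' Q(x', a'; theta_k)
        y = r + gamma * q_net(x_next).max(dim=1).values
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(x, a; theta)
    loss = F.mse_loss(q_xa, y)                             # (Q(x, a; theta) - Y^Q)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()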
DQN algorithm
For deep Q-learning, we can represent the value function by a deep Q-network with weights θ (which, naively, leads to instabilities!). The DQN algorithm addresses this with:
- a replay memory, and
- a target network.

Figure: Sketch of the DQN algorithm. Transitions ⟨x, a, r, x′⟩ generated by acting in the environment are stored in a replay memory of size N_replay. Q(x, a; θ_k) is initialized to random values (close to 0) everywhere on its domain and the replay memory is initially empty; the target Q-network parameters θ_k^− are only updated every C iterations with the Q-network parameters θ_k and are held fixed between updates; the update uses a mini-batch (e.g., 32 elements) of tuples ⟨x, a, r, x′⟩ taken randomly from the replay memory, with targets r_t + γ max_{a′∈A} Q(x_{t+1}, a′; θ_k^−).
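The following sketch (added here for illustration, with assumed sizes and hyper-parameters) shows how the replay memory and the target network fit together around the mini-batch update; the target network is refreshed every C iterations.

import copy, random, collections
import torch
import torch.nn as nn
import torch.nn.functional as F

n_state_dims, n_actions = 4, 3
gamma, batch_size, C = 0.99, 32, 1000            # illustrative hyper-parameters

q_net = nn.Sequential(nn.Linear(n_state_dims, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)                # theta^- := theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = collections.deque(maxlen=100000)        # replay memory of tuples (x, a, r, x', done)

def dqn_step(k):
    """One DQN training iteration on a random mini-batch from the replay memory."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    x, a, r, x_next, done = zip(*batch)
    x = torch.as_tensor(x, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    x_next = torch.as_tensor(x_next, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)
    with torch.no_grad():
        # targets use the frozen parameters theta^-
        y = r + gamma * (1.0 - done) * target_net(x_next).max(dim=1).values
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_xa, y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if k % C == 0:                               # every C iterations: theta^- := theta
        target_net.load_state_dict(q_net.state_dict())

In a full agent, an ε-greedy policy would generate the transitions appended to the replay memory, and dqn_step(k) would be called at every interaction step.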
Visualization of Q-values in mountain car

Figure: DQN for mountain car


Example 2: toy example in finance
This environment simulates the possibility of buying or selling a good.
The agent can either have one unit or zero unit of that good. At each
transaction with the market, the agent obtains a reward equivalent to the
price of the good when selling it and the opposite when buying. In
addition, a penalty of 0.5 (negative reward) is added for each transaction.
The price pattern is made by repeating the following signal plus a
random constant between 0 and 3:

Figure: Price signal


- State: the current price, the price at the last five time steps, and whether the agent holds one unit of the good. (This problem becomes very complex without a function approximator.)
- Action: buy, sell, do nothing.
- Goal: get as much $$$ as possible.
Example using the DeeR library

You can then launch "run_toy_env_simple.py" in the folder "examples/toy_env/".
Example: run_toy_env_simple.py

# Imports as in the DeeR toy_env example (module paths may differ slightly across DeeR versions)
import numpy as np
from deer.agent import NeuralAgent
from deer.learning_algos.q_net_keras import MyQNetwork
import deer.experiment.base_controllers as bc
from Toy_env import MyEnv as Toy_env

rng = np.random.RandomState(123456)

# --- Instantiate environment ---
env = Toy_env(rng)

# --- Instantiate qnetwork ---
qnetwork = MyQNetwork(
    environment=env,
    random_state=rng)

# --- Instantiate agent ---
agent = NeuralAgent(
    env,
    qnetwork,
    random_state=rng)

# --- Bind controllers to the agent ---

# Before every training epoch, we want to print a summary of important elements.
agent.attach(bc.VerboseController())

# During training epochs, we want to train the agent after every action it takes.
agent.attach(bc.TrainerController())

# We also want to interleave a "test epoch" between each training epoch.
agent.attach(bc.InterleavedTestEpochController(epoch_length=500))

# --- Run the experiment ---
agent.run(n_epochs=100, epoch_length=1000)
Example: run_toy_env_simple.py
Every 10 epochs, a graph is saved in the "toy_env" folder:

In this graph, you can see that the agent has successfully learned to take advantage of the price pattern. It is important to note that the results shown are obtained on a validation set that is different from the training set, so the learning generalizes well. For instance, buying at time steps 7 and 16 is the expected behaviour because, on average, this allows the agent to make a profit, given that it has no information about the future.
Real-world application of deep RL: the microgrid
benchmark
A microgrid is an electrical system that includes multiple loads and
distributed energy resources that can be operated in parallel with
the broader utility grid or as an electrical island.

Microgrid
Microgrids and storage
There exist opportunities with microgrids featuring:
- a short-term storage capacity (typically batteries), and
- a long-term storage capacity (e.g., hydrogen).
Structure of the Q-network

Figure: Sketch of the structure of the neural network architecture. The network processes the time-series inputs (input #1, #2, #3, ...) with a set of convolutional layers; the output of the convolutions and the other inputs are then followed by fully-connected layers and the output layer. Architectures based on LSTMs instead of convolutions obtain similar results.
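As an illustration (added here; the layer sizes, number of inputs, and history length are assumptions, not the ones used in the original work), such an architecture could be sketched in PyTorch as:

import torch
import torch.nn as nn

class MicrogridQNetwork(nn.Module):
    """Conv layers on the time-series inputs, concatenated with scalar inputs, then FC layers."""
    def __init__(self, history_len=24, n_series=2, n_scalars=2, n_actions=3):
        super().__init__()
        self.conv = nn.Sequential(                       # processes the time series
            nn.Conv1d(n_series, 16, kernel_size=4), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=4), nn.ReLU(),
            nn.Flatten())
        conv_out = 16 * (history_len - 6)                # output size after the two convolutions
        self.fc = nn.Sequential(                         # fully-connected layers + output layer
            nn.Linear(conv_out + n_scalars, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, series, scalars):
        # series: (batch, n_series, history_len); scalars: (batch, n_scalars)
        h = self.conv(series)
        return self.fc(torch.cat([h, scalars], dim=1))   # Q-values, one per action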
Results

Figure: LEC (per kWh) on the test data as a function of the sizings of the microgrid (% of the robust sizings of PV, battery and H2 storage), comparing the learned policy without any external info, with seasonal info, and with solar prediction against the optimal deterministic LEC and a naive policy.
A few variants of DQN
Distributional DQN

Another approach is to aim for a richer representation through a value distribution, i.e., the distribution of possible cumulative returns.
The value distribution Z^π is a mapping from state-action pairs to distributions of returns when following policy π. Its expectation equals Q^π:

    Q^\pi(x, a) = \mathbb{E}\, Z^\pi(x, a).

This random return is also described by a recursive equation, but one of a distributional nature:

    Z^\pi(x, a) = R(x, a, X') + \gamma\, Z^\pi(X', A'),

where we use capital letters to emphasize the random nature of the next state-action pair (X', A'), with A' ∼ π(·|X').
Distributional DQN
It has been shown that such a distributional Bellman equation can be used in practice, with deep learning as the function approximator. This approach has the following advantages:
- It is possible to implement risk-aware behaviour.
- It leads to more performant learning in practice. One of the main elements is that the distributional perspective naturally provides a richer set of training signals than a scalar value function Q(x, a) (an effect similar to auxiliary tasks).
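For intuition, here is an illustrative sketch (added here, loosely following a categorical C51-style parameterization; the atom range and counts are assumptions) of how a value distribution can be represented by probabilities over a fixed set of return atoms, with Q recovered as its expectation.

import numpy as np

n_actions, n_atoms = 3, 51
v_min, v_max = -10.0, 10.0                      # assumed support of the return distribution
z = np.linspace(v_min, v_max, n_atoms)          # fixed atoms z_i

# For one state x, the network would output a probability vector over the atoms for
# each action (random values here, just to illustrate the bookkeeping).
logits = np.random.randn(n_actions, n_atoms)
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax per action

# Q(x, a) = E[ Z(x, a) ] = sum_i p_i(x, a) * z_i
q_values = p @ z
greedy_action = int(q_values.argmax())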
Multi-step learning

In DQN, the target value used is estimated based on its own value
estimate at the next time-step. For that reason, the learning
algorithm is said to bootstrap as it recursively uses its own value
estimates.

Such a variant of DQN can be obtained by using the n-step target value given by:

    Y_k^{Q,n} = \sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n \max_{a' \in A} Q(x_n, a'; \theta_k),

where (x_0, a_0, r_0, \dots, x_{n-1}, a_{n-1}, r_{n-1}, x_n) is any trajectory of n + 1 time steps with x = x_0 and a = a_0.
Warning: Online data is required for convergence without bias (or
other specific techniques)
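A small sketch (added here for illustration) of computing this n-step target from a stored trajectory, assuming q_net returns the Q-values of a state as a NumPy array:

import numpy as np

def n_step_target(rewards, x_n, q_net, gamma, n):
    """Y^{Q,n} = sum_{t=0}^{n-1} gamma^t r_t + gamma^n max_a' Q(x_n, a')."""
    assert len(rewards) == n
    discounted = sum(gamma ** t * r for t, r in enumerate(rewards))
    return discounted + gamma ** n * np.max(q_net(x_n))

# Example with a dummy Q-function over 2 actions:
q_net = lambda x: np.array([0.5, 1.0])
print(n_step_target([0.0, 0.0, 1.0], x_n=None, q_net=q_net, gamma=0.9, n=3))
# 0.9^2 * 1.0 + 0.9^3 * 1.0 = 0.81 + 0.729 = 1.539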
Discussion of a parallel with neurosciences
How to discount deep RL
Motivations

Effect of the discount factor in an online setting.


- Empirical studies of cognitive mechanisms in delay of gratification: the capacity to wait longer for preferred rewards seems to develop markedly only at about ages 3-4 (the "marshmallow experiment").
Increasing the discount factor (using the DQN algorithm)

Figure: Illustration, for the game Q*bert, of a discount factor γ held fixed versus an adaptive (increasing) discount factor.
Conclusions
Summary of the lecture

- Introduction to Q-learning in the tabular case and with deep learning (DQN).
- Toy examples and real-world examples.
- Brief discussion of the role of the discount factor and some relations to neuroscience.
Further resources (optional)

- Watkins, Christopher J. C. H., and Peter Dayan. "Q-learning." Machine Learning 8, no. 3-4 (1992): 279-292.
- Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Further resources

Figure: Overview of the DeeR framework: the agent combines function approximators, learning algorithms, controllers (handling train/validation/test phases and hyper-parameter management), policies (exploration/exploitation, e.g., via ε-greedy) and a replay memory, and interacts with the environment.

Implementation: https://github.com/VinF/deer
Questions?
Project
Project
You consider the chain environment made up of 5 discrete states and 2 discrete actions, where you get a reward of 0.2 at one end of the chain and 1 at the other end (see illustration below).

Figure: The chain environment (γ = 0.9): action a moves one state to the right with reward 0 and yields reward 1 in state 5, while action b brings the agent back to state 1, with reward 0.2 when taken in state 1 and 0 otherwise. The initial state is state 1.

In part 1, you work in the tabular context:
- Solve the problem using tabular Q-learning with ε-greedy exploration. Provide the optimal Q-values and discuss the learning rate α and ε (3 points).
- Increase the size of the chain to 10 states while keeping the rewards at both ends of the chain. Discuss the new results, in particular ε (2 points).
Project

In part 2, you will solve the chain problem using function approximators (5 points), for γ = 0.9 and 10 states.
- Provide illustrations of the solutions of your optimal Q-values (2 points).
- Discuss the hyper-parameters and the convergence (3 points).

If you go for deep Q-learning, here are additional tips:
- We advise starting from an existing implementation (e.g., DeeR; docs and examples are available).
- Normalize the state encoding, e.g., uniformly in [−1, 1].
- Start coding as early as possible.

Deadline: 24th of December (try to aim for one week earlier!)
Example: run_toy_env_simple.py
If you start from https://github.com/VinF/deer/blob/master/examples/toy_env,
you must modify Toy_env.py and run_toy_env_simple.py.

- You must code the MDP transition (and the reward) in the method act (you don't need to use rng):

def act(self, action):
    ...

- Your state is simply defined as one scalar (without history):

def inputDimensions(self):
    return [(1,)]

- You have two actions:

def nActions(self):
    return 2
Example: run_toy_env_simple.py

- You never have terminal states:

def inTerminalState(self):
    return False

- The function "observe" provides the encoded representation of the state:

def observe(self):
    return np.array(self._last_ponctual_observation)
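To make the first tip concrete, here is a hedged sketch (an assumption about how one might fill in Toy_env.py, not the official solution) of the act method for the 10-state chain; it assumes self._last_ponctual_observation holds the current state and that, following the DeeR convention, act returns the reward.

def act(self, action):
    """Chain MDP transition: action 0 = a (move right), action 1 = b (back to state 1)."""
    state = int(self._last_ponctual_observation)   # current state in {0, ..., 9}
    if action == 0:
        reward = 1.0 if state == 9 else 0.0        # reward 1 at the right end of the chain
        state = min(state + 1, 9)
    else:
        reward = 0.2 if state == 0 else 0.0        # reward 0.2 at the left end of the chain
        state = 0
    self._last_ponctual_observation = state        # remember to normalize in observe(), e.g., state / 9
    return reward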
Questions?
