Q-Learning and Deep Q Networks (DQN)
Vincent François-Lavet
December 8, 2021
Outline
Conclusions
Project
Motivation for value-based reinforcement learning
Overview of deep RL
In general, a reinforcement learning (RL) agent may include one or more of the following components:
- a representation of a value function that provides a prediction of how good each state or each state-action pair is,
- a direct representation of the policy π(x) or π(x, a), or
- a model of the environment in conjunction with a planning algorithm.
[Figure: relationships between experience, model, and value/policy. Model-free RL (value-based and policy-based) learns a value function or a policy directly from experience gathered by acting, whereas model-based RL learns a model of the environment and obtains a value/policy through planning.]
In particular:
Q*(x, a) = E[ r_t + γ Q*(x_{t+1}, a′ ∼ π*) | x_t = x, a_t = a, π* ]
         = E[ r_t + γ max_{a′ ∈ A} Q*(x_{t+1}, a′) | x_t = x, a_t = a, π* ]
Value-based method: Q-learning with one entry for every
state-action pair
[Figure: tabular Q-values for every state-action pair at iterations i = 0 and i = 1; the values 1, 0.9 and 0.81 propagate backwards from the transition with R = 1.]
Figure: Grid-world MDP with γ = 0.9, where we assume that after obtaining R = 1 we end up in a terminal state (i.e., all following rewards are 0).
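To make the backward propagation of values concrete, here is a minimal sketch of synchronous Bellman-optimality backups with γ = 0.9, assuming a hypothetical small deterministic chain rather than the exact grid world of the figure; the values 1, γ·1 = 0.9 and γ²·1 = 0.81 appear after successive iterations, as in the figure.

import numpy as np

# Hypothetical 4-state deterministic chain: taking action 1 ("right") from the
# state next to the goal gives R = 1 and ends the episode; all other rewards are 0.
gamma = 0.9
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))

def backup(Q):
    # One synchronous Bellman-optimality backup over all state-action pairs.
    new_Q = np.zeros_like(Q)
    for x in range(n_states):
        for a in range(n_actions):
            x_next = max(x - 1, 0) if a == 0 else min(x + 1, n_states - 1)
            r = 1.0 if (a == 1 and x == n_states - 2) else 0.0
            terminal = (r == 1.0)
            new_Q[x, a] = r + (0.0 if terminal else gamma * Q[x_next].max())
    return new_Q

for i in range(3):
    Q = backup(Q)
    print("i =", i)
    print(np.round(Q, 2))   # the values 1, 0.9 and 0.81 propagate backwards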
Q-learning
Value-based method: Q-learning converges w.p.1 to the optimal Q-function as long as:
- the learning rates satisfy Σ_t α_t = ∞ and Σ_t α_t² < ∞, and
- the exploration policy π is such that P_π[a_t = a | x_t = x] > 0, ∀(x, a).
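As an illustration, here is a minimal sketch of tabular Q-learning with ε-greedy exploration on a hypothetical 5-state, 2-action chain (an assumption for illustration, chosen to match the project environment later in these slides); a constant learning rate α is used for simplicity, whereas a decaying α_t (e.g., one over the visit count) would satisfy the conditions above.

import numpy as np

rng = np.random.RandomState(0)
n_states, n_actions = 5, 2
gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))

def step(x, a):
    # Hypothetical chain dynamics: action 0 moves left (reward 0.2 when reaching
    # the left end), action 1 moves right (reward 1.0 when reaching the right end).
    if a == 0:
        x_next = max(x - 1, 0)
        return x_next, (0.2 if x_next == 0 else 0.0)
    x_next = min(x + 1, n_states - 1)
    return x_next, (1.0 if x_next == n_states - 1 else 0.0)

x = 2
for t in range(10000):
    # epsilon-greedy exploration ensures P[a_t = a | x_t = x] > 0 for all (x, a)
    a = rng.randint(n_actions) if rng.rand() < epsilon else int(Q[x].argmax())
    x_next, r = step(x, a)
    # Q-learning update: Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_a' Q(x',a') - Q(x,a))
    Q[x, a] += alpha * (r + gamma * Q[x_next].max() - Q[x, a])
    x = x_next

print(np.round(Q, 2))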
Deep Q-network (DQN)
[Figure: the agent interacts with the environment and stores the last N_replay transitions in a replay memory; mini-batches drawn from the replay memory are used to update the Q-network towards the target r_t + γ max_{a′ ∈ A} Q(x_{t+1}, a′; θ_k^−), where θ_k^− are the parameters of the target network, which is only periodically updated.]
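Below is a minimal PyTorch sketch of one DQN update using such a bootstrapped target; this is an illustration only, not the DeeR implementation, and q_net, target_net, optimizer and the mini-batch tensors are assumed to be provided by the surrounding training loop.

import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch: tensors of states, integer actions, rewards, next states and terminal flags
    x, a, r, x_next, done = batch
    with torch.no_grad():
        # Bootstrapped target r_t + gamma * max_a' Q(x_{t+1}, a'; theta_k^-),
        # computed with the frozen target-network parameters theta_k^-.
        target = r + gamma * (1.0 - done) * target_net(x_next).max(dim=1).values
    # Q(x_t, a_t; theta_k) for the actions actually taken
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q_xa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()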
You can then launch "run_toy_env_simple.py" in the folder "examples/toy_env/".
Example: run_toy_env_simple.py
import numpy as np
# (the DeeR objects used below -- Toy_env, MyQNetwork, NeuralAgent and the
# controllers module bc -- are imported from the deer package in the full example script)

rng = np.random.RandomState(123456)

# --- Instantiate environment ---
env = Toy_env(rng)

# --- Instantiate qnetwork ---
qnetwork = MyQNetwork(
    environment=env,
    random_state=rng)

# --- Instantiate agent ---
agent = NeuralAgent(
    env,
    qnetwork,
    random_state=rng)

# During training epochs, we want to train the agent after every action it takes.
agent.attach(bc.TrainerController())

# We also want to interleave a "test epoch" between each training epoch.
agent.attach(bc.InterleavedTestEpochController(epoch_length=500))
In this graph, you can see that the agent has successfully learned to take advantage of the price pattern. It is important to note that the results shown are obtained on a validation set that is different from the training set, and we can see that learning generalizes well. For instance, buying at time steps 7 and 16 is the expected behaviour because, on average, this allows the agent to make a profit, given that it has no information on the future.
Real-world application of deep RL: the microgrid
benchmark
A microgrid is an electrical system that includes multiple loads and
distributed energy resources that can be operated in parallel with
the broader utility grid or as an electrical island.
Microgrid
Microgrids and storage
There exist opportunities with microgrids featuring:
- a short-term storage capacity (typically batteries),
- a long-term storage capacity (e.g., hydrogen).
Structure of the Q-network
[Figure: the inputs (Input #1, Input #2, Input #3, ...) are processed by convolutional layers followed by fully-connected layers that produce the outputs.]
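As an illustration of this structure, here is a minimal Keras sketch; the input shape, layer sizes and number of actions are assumptions for illustration, not the DeeR network definition.

import tensorflow as tf
from tensorflow.keras import layers

def build_q_network(input_shape=(84, 84, 4), n_actions=4):
    # Hypothetical sizes: an 84x84x4 image-like input and 4 discrete actions.
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)   # convolutional layers
    x = layers.Conv2D(64, 4, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)                      # fully-connected layer
    outputs = layers.Dense(n_actions)(x)                             # one output Q-value per action
    return tf.keras.Model(inputs, outputs)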
Figure: LEC on the test data as a function of the sizings of the microgrid (x-axis: % of the robust sizings for PV, battery and H2 storage).
A few variants of DQN
Distributional DQN
Z^π(x, a) = R(x, a, X′) + γ Z^π(X′, A′),
where X′ is the next state and A′ ∼ π(·|X′); the usual Q-function is recovered as the expectation Q^π(x, a) = E[Z^π(x, a)].
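The slides only state this recursion on Z; as one concrete instance, here is a minimal numpy sketch of the categorical projection used in the C51-style distributional DQN (the support bounds, number of atoms and discount factor are assumptions for illustration).

import numpy as np

v_min, v_max, n_atoms, gamma = -10.0, 10.0, 51, 0.99
z = np.linspace(v_min, v_max, n_atoms)          # fixed support of the return distribution
delta_z = (v_max - v_min) / (n_atoms - 1)

def categorical_target(p_next, r, done):
    # Project the distribution of r + gamma * Z(x', a*) back onto the fixed support z.
    # p_next: probabilities over the atoms z for the greedy next action a*.
    m = np.zeros(n_atoms)
    tz = np.clip(r + (1.0 - done) * gamma * z, v_min, v_max)   # shifted/scaled atoms
    b = (tz - v_min) / delta_z                                 # position on the support
    l, u = np.floor(b).astype(int), np.ceil(b).astype(int)
    for j in range(n_atoms):
        if l[j] == u[j]:                 # atom falls exactly on a support point
            m[l[j]] += p_next[j]
        else:                            # distribute mass to the two neighbouring atoms
            m[l[j]] += p_next[j] * (u[j] - b[j])
            m[u[j]] += p_next[j] * (b[j] - l[j])
    return m                             # target distribution for the cross-entropy loss

# Example: a next-state distribution concentrated on z = 0.
p_next = np.zeros(n_atoms); p_next[n_atoms // 2] = 1.0
print(categorical_target(p_next, r=1.0, done=0.0).sum())   # probability mass is preserved (= 1.0)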
In DQN, the target value used is estimated based on its own value
estimate at the next time-step. For that reason, the learning
algorithm is said to bootstrap as it recursively uses its own value
estimates.
Figure: Illustration for the game Q*bert of a discount factor γ held fixed in one panel and an adaptive discount factor in the other.
Conclusions
Summary of the lecture
[Figure: summary diagram of the agent-environment loop: the agent stores transitions in a replay memory, learns values/policies, and trades off exploration/exploitation (e.g., via ε-greedy) while interacting with the environment.]
Implementation: https://github.com/VinF/deer
Questions?
Project
You consider the chain environment made up of 5 discrete states and 2 discrete actions, where you get a reward of 0.2 on one end of the chain and 1 at the other end (see illustration below).
[Illustration: 5-state chain with edge labels (b, 0.2) at one end and (a, 1) at the other end.]
def inputDimensions(self):
    return [(1,)]

def nActions(self):
    return 2
Example: run_toy_env_simple.py
def inTerminalState(self):
    return False

def observe(self):
    return np.array(self._last_ponctual_observation)
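For the project, the environment dynamics still need to be specified. Here is one possible sketch of an act method for the 5-state chain, assuming the same environment interface as in the Toy_env example (the act method returning the reward, and the _last_ponctual_observation attribute, are assumptions based on that example rather than something given in the slides).

def act(self, action):
    # Hypothetical chain dynamics: action 0 ("b") moves left and gives a reward
    # of 0.2 at the left end; action 1 ("a") moves right and gives a reward of 1
    # at the right end of the 5-state chain.
    state = self._last_ponctual_observation[0]
    if action == 0:
        state = max(state - 1, 0)
        reward = 0.2 if state == 0 else 0.0
    else:
        state = min(state + 1, 4)
        reward = 1.0 if state == 4 else 0.0
    self._last_ponctual_observation = [state]
    return reward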
Questions?