10 Deep Reinforcement Learning
Ruxandra Stoean
Further bibliography
R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition
(Adaptive Computation and Machine Learning series), 2018
M. Lapan, Deep Reinforcement Learning Hands-On: Apply modern RL methods to
practical problems of chatbots, robotics, discrete optimization, web automation, and
more, 2nd Edition, 2018
L. Graesser, W. L. Keng, Foundations of Deep Reinforcement Learning: Theory and
Practice in Python, 2019
M. Morales, Grokking Deep Reinforcement Learning, 2020
W. B. Powell, Reinforcement Learning and Stochastic Optimization: A Unified
Framework for Sequential Decisions, 2022
Reinforcement learning
A learning paradigm different from
Supervised learning
Associates inputs with outputs in labeled data
Unsupervised learning
Finds patterns in unlabeled data
Reinforcement learning
An agent starts in an initial state of an environment
It loops until the target is reached
Experience: it takes actions -> moves to the next state
It gets a reward from the environment
It aims to maximize the cumulative reward
It balances exploiting and exploring the gathered information
It performs several such episodes (similar to epochs in neural networks); see the sketch below
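A minimal sketch of this agent-environment loop, assuming a Gymnasium-style interface; the environment name "FrozenLake-v1", the random placeholder policy and the episode count are illustrative choices, not part of the slides.

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1")
num_episodes = 5                           # an episode runs from the initial state to a terminal state

for episode in range(num_episodes):
    state, info = env.reset()              # the agent starts in an initial state
    done = False
    cumulative_reward = 0.0
    while not done:                        # loop until the target (a terminal state) is reached
        action = env.action_space.sample() # placeholder policy: a random action
        next_state, reward, terminated, truncated, info = env.step(action)
        cumulative_reward += reward        # the quantity the agent tries to maximize
        state = next_state
        done = terminated or truncated
    print(f"episode {episode}: cumulative reward = {cumulative_reward}")
```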
Concepts 1/3
Action: taken by the agent in the environment
Stochastic environment
The reward and the transition to a new state after the same action may not be the same as
in a previous encounter
Concepts 2/3
Policy (π): the strategy followed by the agent when choosing its actions
Optimal when it maximizes the value function
Value function
The expected return (cumulative reward) of a state s if the agent follows the policy π
The state-value function for a policy
Q-value (quality-value) function
The expected long-term gain if the agent, in a state s, takes the action a and then follows the
policy π
Action-value function for a policy
Temporal difference (TD)
Computes the estimated value of a state for the policy π, based on the reward received by
the agent and the value of the next state
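A compact restatement of these three definitions in standard textbook notation (the notation itself is an assumption, not taken from the slides); γ is the discount factor and α the learning rate introduced in Concepts 3/3:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t} R_{t+1} \,\Big|\, S_0 = s\Big] \quad \text{(state-value function)}$$

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t} R_{t+1} \,\Big|\, S_0 = s,\, A_0 = a\Big] \quad \text{(action-value function)}$$

$$\text{TD(0) update:} \quad V(s) \leftarrow V(s) + \alpha\,\big[R(s, a, s') + \gamma\, V(s') - V(s)\big]$$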
Exploration-Exploitation Dilemma
Exploitation
Take the best learned action, with the maximum expected reward at a given state
Exploration
Take a random action, without taking rewards into account
Trade-off between exploitation and exploration
Exploitation only: the agent gets stuck in a local optimum
Exploration only: it takes a long time to discover all the information
The ε-greedy policy
A random action is selected with probability ε
The best learned (greedy) action with probability 1 - ε (see the sketch below)
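A minimal sketch of ε-greedy action selection over a tabular Q-function; the (num_states, num_actions) array shape and the NumPy random generator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """Return a random action with probability epsilon, else the best learned action.

    Q is assumed to be a (num_states, num_actions) array of Q-values (illustrative shape).
    """
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action, ignoring rewards
    return int(np.argmax(Q[state]))            # exploit: action with the maximum Q-value
```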
On- and off-policy approaches
On- versus off-policy
On - SARSA (State-Action-Reward-State-Action)
Employs the ε-greedy policy
To estimate the Q-value, it takes the next action a' in the next state s' using the same strategy
target(s') = R(s, a, s') + γ Q_k(s', a')
Q_{k+1}(s, a) = (1 - α) Q_k(s, a) + α · target(s')
Off – Q-learning
Also employs the ε-greedy policy to select actions
But, to estimate the Q-value, it uses a greedy target policy: the best action (with the
maximum value) in the next state s' (the two updates are compared in the sketch below)
target(s') = R(s, a, s') + γ max_a' Q_k(s', a')   (Bellman equation)
Q_{k+1}(s, a) = (1 - α) Q_k(s, a) + α · target(s')
Alternative formulation: Q_{k+1}(s, a) = Q_k(s, a) + α [R(s, a, s') + γ max_a' Q_k(s', a') - Q_k(s, a)]
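A side-by-side sketch of the two tabular updates, assuming Q is a (num_states, num_actions) NumPy array and alpha, gamma are the learning rate and discount factor; the function names are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal):
    """On-policy: the target uses the action a_next actually chosen in s' by the same ε-greedy policy."""
    target = r if terminal else r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal):
    """Off-policy: the target uses the greedy (maximum-value) action in the next state s'."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```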
Concepts 3/3
Learning rate α
Values in [0,1]
A value of 0 leads to no learning
A value of 0.9 leads to very fast learning
ε-decay
Initially high ε; the value is then decreased so that fewer random actions are taken over time (see the schedule sketch below)
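One possible ε-decay schedule, sketched as a multiplicative (exponential) decay with a minimum floor; all constants are placeholders rather than values from the slides.

```python
epsilon = 1.0            # start almost fully exploratory
epsilon_min = 0.01       # always keep a small amount of exploration
decay_rate = 0.995       # multiplicative decay applied after each episode

for episode in range(1000):
    # ... run one episode with ε-greedy action selection using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay_rate)
```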
Q-learning
Model-free RL approach
Trial-and-error algorithm, learning from action outcomes as it moves through the
environment
It does not construct an internal model of the environment
https://www.baeldung.com/cs/reinforcement-learning-neural-network
Tabular (exact) Q-learning Algorithm
Initialize Q_0(s, a) for all states and actions (with 0)
Repeat
    Initialize state s
    For k = 1, 2, …
        Sample an action a according to the policy
        Execute a and get the next state s'
        If s' is terminal
            target(s') = R(s, a, s')   (reward of the transition)
        Else
            target(s') = R(s, a, s') + γ max_a' Q_k(s', a')
        Update Q_{k+1}(s, a) = (1 - α) Q_k(s, a) + α · target(s') to move closer to the target
        s = s'
Until the number of episodes is reached
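A runnable sketch of the tabular algorithm above, assuming a Gymnasium discrete environment; "FrozenLake-v1" and the hyperparameter values are illustrative choices, not part of the slides.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))   # Q_0(s, a) = 0

alpha, gamma = 0.1, 0.99                  # learning rate and discount factor (placeholders)
epsilon, epsilon_min, decay = 1.0, 0.01, 0.995
num_episodes = 5000
rng = np.random.default_rng(0)

for episode in range(num_episodes):
    s, _ = env.reset()                    # initialize state s
    done = False
    while not done:
        # ε-greedy behavior policy
        if rng.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # target(s') = R(s, a, s')                         if s' is terminal
        #            = R(s, a, s') + γ max_a' Q_k(s', a')  otherwise
        target = r if terminated else r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target   # move Q(s, a) closer to the target
        s = s_next
    epsilon = max(epsilon_min, epsilon * decay)            # ε-decay between episodes
```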
Example
Q function parameterized by a function approximator
Q-values computed by e.g. a neural network (deep learning) -> obtain the parameters θ of
the Q function; initially random weights
Iterative regression -> fit the Q-values to the computed targets
Optimizing a squared loss function (see the sketch below)
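A minimal sketch of a neural-network Q-function fitted by iterative regression on the computed targets with a squared (MSE) loss; the network architecture, optimizer settings and batch tensors are assumptions for illustration, not details given on the slide.

```python
import torch
import torch.nn as nn

state_dim, num_actions = 4, 2                     # placeholder dimensions
gamma = 0.99

q_net = nn.Sequential(                            # Q(s, ·; θ), θ starts as random weights
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, num_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def regression_step(states, actions, rewards, next_states, dones):
    """One iterative-regression step: fit Q(s, a; θ) to the Bellman targets with a squared loss.

    states/next_states: float tensors (batch, state_dim); actions: long tensor (batch,);
    rewards/dones: float tensors (batch,), with dones = 1.0 for terminal transitions.
    """
    with torch.no_grad():                         # targets are treated as fixed regression labels
        max_next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * max_next_q * (1.0 - dones)
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; θ)
    loss = nn.functional.mse_loss(q_sa, targets)  # squared loss between Q-values and targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```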