
Deep reinforcement learning for time series decision making
Ruxandra Stoean
Further bibliography
• R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd edition (Adaptive Computation and Machine Learning series), 2018
• M. Lapan, Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd edition, 2020
• L. Graesser, W. L. Keng, Foundations of Deep Reinforcement Learning: Theory and Practice in Python, 2019
• M. Morales, Grokking Deep Reinforcement Learning, 2020
• W. B. Powell, Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions, 2022
Reinforcement learning
• A learning paradigm different from
  • Supervised learning
    • Associate inputs to outputs in labeled data
  • Unsupervised learning
    • Find patterns in unlabeled data
• Reinforcement learning
  • An agent starts in an initial state in an environment
  • Loop until the target is reached (sketched in code below)
    • Experience: take an action -> move to the next state
    • Get a reward from the environment
    • Maximize the cumulative reward
    • Trade off exploiting and exploring information
  • Perform several such episodes (similar to epochs in neural networks)
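To make the loop concrete, here is a minimal, self-contained sketch of the agent-environment interaction: a toy corridor environment and a purely random agent. All names, rewards, and constants are illustrative and not taken from the lecture.

import random

class CorridorEnv:
    """Toy environment: positions 0..4, start at position 0, terminal state at 4."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                    # action: 0 = left, 1 = right
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        done = self.pos == 4
        reward = 1.0 if done else -0.01        # small step penalty, reward at the goal
        return self.pos, reward, done

env = CorridorEnv()
for episode in range(3):                       # episodes play the role of epochs
    state, done, total_reward = env.reset(), False, 0.0
    while not done:                            # loop until the terminal state is reached
        action = random.choice([0, 1])         # placeholder policy: pure exploration
        state, reward, done = env.step(action)
        total_reward += reward                 # the quantity the agent tries to maximize
    print(f"episode {episode}: cumulative reward = {total_reward:.2f}")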
Concepts 1/3
• Action: taken by the agent in the environment
• Environment response to the agent
  • Reward (value): feedback to reinforce behavior
  • State: the state changes for the agent as a consequence of its action
• Loop until a terminal state is reached
  • Reach the destination
  • Obtain a maximal reward
  • A number of time steps is reached
  • Game over
Environment
• Deterministic
  • The state transition and the reward are deterministic functions
  • The reward for the same action in a given state is always the same
  • The specific action in the particular state leads to the same next state every time
• Stochastic
  • The reward and the transition to a new state after the same action may differ from a previous encounter
Concepts 2/3
• Policy (π): the strategy followed by the agent in its quest
  • Optimal when it maximizes the value
• Value function
  • The expected value (cumulative reward) of a state s if the agent follows the policy π
  • The state-value function for a policy
• Q-value (quality-value) function
  • The value of the long-term gain if the agent takes the action a in a state s and then follows the policy π
  • The action-value function for a policy
• Temporal difference (TD)
  • Computes the estimated value of a state for the policy π, based on the reward received by the agent and the value of the next state
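Written out, these are the standard definitions, in the same notation as the update rules used later in the slides:

V_π(s)    = E_π[ R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … | s_t = s ]            (state-value function)
Q_π(s, a) = E_π[ R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … | s_t = s, a_t = a ]   (action-value function)
TD(0):      V(s) <- V(s) + α·[ R(s, a, s') + γ·V(s') − V(s) ]                 (reward received plus the value of the next state)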
Exploration-Exploitation Dilemma
• Exploitation
  • Take the best learned action, i.e. the one with the maximum expected reward in the given state
• Exploration
  • Take a random action, without taking rewards into account
• Trade-off between exploitation and exploration
  • Exploitation only: the agent gets stuck in a local optimum
  • Exploration only: it takes a long time to discover all the information
• The ε-greedy policy (sketched below)
  • A random action is selected with probability ε
  • The greedy (best known) action with probability 1 − ε
• On- and off-policy approaches
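A minimal sketch of ε-greedy action selection over a tabular Q function; the array and variable names are illustrative, not taken from the lecture code:

import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions, rng=np.random.default_rng()):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniformly random action
    return int(np.argmax(Q[state]))           # exploit: best action under current estimates

# usage: Q is an (n_states, n_actions) table of estimated action values
Q = np.zeros((5, 2))
action = epsilon_greedy(Q, state=0, epsilon=0.1, n_actions=2)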
On- versus off-policy
 On - SARSA (State-Action-Reward-State-Action)
 Employs the ε-greedy policy
 To estimate the Q-value, it takes the next action a’ in next state s’ using the same strategy
 target(s’) = R(s, a, s’) + γQk(s’, a’)
 Qk+1(s, a) = (1- α)Qk(s, a) + α[target(s’)]
 Off – Q-learning
 ε-greedy policy
 And, to estimate the Q-value, it uses a max greedy target policy for the best action (with
the maximum value) in the next state s’
 target(s’) = R(s, a, s’) + γmaxa’Qk(s’, a’) (Bellman equation)
 Qk+1(s, a) = (1- α)Qk(s, a) + α[target(s’)]
 Alternative formulation: Qk+1(s, a) = Qk(s, a) + α[R(s, a, s’) + γmaxa’Qk(s’, a’) – Qk(s, a)]
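The two updates side by side, as a minimal sketch; Q is assumed to be a NumPy array of shape (n_states, n_actions), and the variable names are illustrative:

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: uses the action a_next actually chosen by the same ε-greedy policy."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: uses the greedy (maximum) action value in the next state."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target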
Concepts 3/3
• Learning rate α
  • Values in [0, 1]
  • A value of 0 leads to no learning
  • A value close to 1 (e.g. 0.9) makes new information dominate, i.e. very fast learning
• The discount factor γ
  • Also in [0, 1]
  • Makes distant rewards count less than immediate ones
• ε-decay
  • Initially a high ε, then the value is decreased to allow fewer random actions over time
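A small sketch of a typical multiplicative ε-decay schedule; the constants are illustrative, not the ones used later in the lecture code:

epsilon = 1.0          # start fully exploratory
epsilon_min = 0.01     # never stop exploring entirely
epsilon_decay = 0.995  # multiplicative decay applied once per episode

for episode in range(1000):
    # ... run one episode with ε-greedy action selection ...
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay   # gradually shift from exploration to exploitation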
Q-learning
• Model-free RL approach
• Trial-and-error algorithm, learning from action outcomes as it moves through the environment
• It does not construct an internal model of the environment

https://www.baeldung.com/cs/reinforcement-learning-neural-network
Tabular (exact) Q-learning Algorithm
• Initialize Q_0(s, a) for all states and actions (e.g. with 0)
• Repeat
  • Initialize the state s
  • For k = 1, 2, …
    • Sample an action a according to the policy
    • Execute a and get the next state s'
    • If s' is terminal
      • target(s') = R(s, a, s')   (reward of the transition)
    • Else
      • target(s') = R(s, a, s') + γ·max_{a'} Q_k(s', a')
    • Update Q_{k+1}(s, a) = (1 − α)·Q_k(s, a) + α·target(s'), moving the estimate closer to the target
    • s = s'
• Until the number of episodes is reached
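A compact, runnable sketch of this algorithm on a toy 1-D corridor (positions 0-4, goal at position 4); all names and constants are illustrative:

import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))   # initialize Q(s, a) with 0

def step(s, a):
    s_next = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    done = s_next == n_states - 1
    return s_next, (1.0 if done else -0.01), done

for episode in range(200):
    s, done = 0, False
    while not done:
        # ε-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next

print(np.round(Q, 2))   # the learned values should prefer "right" in every state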
Example (worked example shown as a figure on the slides)
Q function parameterized by a function approximator
• Q-values computed by, e.g., a neural network (deep learning) -> the parameters θ of the Q function; initially random weights
• Iterative regression -> fit the Q-values to the computed targets
  • Optimize a squared loss between Q_θ(s, a) and target(s') (written out below)
• Problem: non-stationary targets, catastrophic forgetting
  1. Q-values for a state and action do not remain stationary as in the tabular case, because the neural network generalizes between states
  2. Large swings in the state distribution
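The squared loss mentioned above can be written out as follows (standard form, in the slides' notation, not copied from the original figure):

L(θ) = E_{s'}[ (Q_θ(s, a) − target(s'))² ],   with target(s') = R(s, a, s') + γ·max_{a'} Q_{θ_k}(s', a')   (the target is held fixed when differentiating)
θ_{k+1} = θ_k − α·∇_θ L(θ) |_{θ = θ_k}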
Approximate Q-learning Algorithm
• Initialize Q_0(s, a) for all states and actions (via the initial, e.g. random, weights θ_0)
• Repeat
  • Initialize the state s
  • For k = 1, 2, …
    • Sample an action a according to the policy
    • Execute a and get the next state s'
    • If s' is terminal
      • target(s') = R(s, a, s')
    • Else
      • target(s') = R(s, a, s') + γ·max_{a'} Q_k(s', a')
    • Gradient update on the function approximator: θ_{k+1} = θ_k − α·∇_θ E_{s'}[(Q_θ(s, a) − target(s'))²] |_{θ = θ_k}
    • s = s'
• Until the number of episodes is reached (complete passes of the data)
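A minimal sketch of one such gradient step, using a linear function approximator over hand-crafted one-hot features instead of a full neural network; all names, sizes, and constants are illustrative:

import numpy as np

def features(s, a, n_states=5, n_actions=2):
    """One-hot (state, action) features; a stand-in for richer representations."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

theta = np.random.default_rng(0).normal(scale=0.01, size=10)   # Q_theta(s, a) = theta · x(s, a)

def q_value(s, a):
    return float(theta @ features(s, a))

def gradient_step(s, a, r, s_next, done, alpha=0.05, gamma=0.95):
    global theta
    target = r if done else r + gamma * max(q_value(s_next, b) for b in range(2))
    error = q_value(s, a) - target              # TD error of the approximator
    # gradient of the squared loss with the target held fixed (constant factor absorbed into alpha)
    theta -= alpha * error * features(s, a)

gradient_step(s=0, a=1, r=-0.01, s_next=1, done=False)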
DQN Algorithm
• Transform Q-learning into a supervised learning task
1. Experience replay buffer
  • Take an action - get the reward - go to the next state, and store each transition in the buffer
  • The online single-sample update is replaced with a batch update: sample a mini-batch of past transitions from the buffer -> a more stable update
    • The data distribution is more stationary
    • Steadier learning
2. Keep a copy of the weights fixed for some time and use it to compute the target (target network), instead of the current weights: γ·max_{a'} Q(s', a', θ⁻)
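A condensed sketch of the two ideas in code, using a small Keras network, a deque replay buffer, and a periodically synchronized target network; the architecture and hyperparameters are illustrative, not those of the lecture notebook:

import random
from collections import deque
import numpy as np
import tensorflow as tf

state_dim, n_actions, gamma = 4, 3, 0.95

def build_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

online_net = build_model()
target_net = build_model()
target_net.set_weights(online_net.get_weights())      # fixed copy used for the targets

replay_buffer = deque(maxlen=10_000)                   # experience replay memory

def train_step(batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)   # mini-batch of past transitions
    states = np.array([t[0] for t in batch])
    next_states = np.array([t[3] for t in batch])
    q_values = online_net.predict(states, verbose=0)
    q_next = target_net.predict(next_states, verbose=0)   # targets from the frozen network
    for i, (s, a, r, s_next, done) in enumerate(batch):
        q_values[i, a] = r if done else r + gamma * np.max(q_next[i])
    online_net.fit(states, q_values, epochs=1, verbose=0)  # supervised regression step

# every C environment steps: target_net.set_weights(online_net.get_weights())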
Example: Trading actions in stock time series
• Problem
  • Given a historical stock price time series, decide on the best trading action at each step
    • BUY
    • SELL
    • HOLD
• Could also be solved with a recurrent architecture (LSTM, GRU) to estimate the stock price evolution
  • Take the estimates and formulate a separate optimization problem to determine the best trading action per time step, solved e.g. with evolutionary algorithms
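One common way to define the state for such an agent is a window of recent price differences squashed into [0, 1]; this is an assumption for illustration, since the lecture's own state definition appears only in the code slides:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def get_state(prices, t, window=10):
    """State at time t: sigmoid of consecutive price differences over the last `window` prices."""
    start = t - window + 1
    block = prices[max(start, 0):t + 1]
    if start < 0:                                    # pad the beginning of the series
        block = np.concatenate([np.full(-start, prices[0]), block])
    return sigmoid(np.diff(block)).reshape(1, -1)    # shape (1, window - 1), ready for the network

prices = np.array([100.0, 101.5, 101.0, 102.3, 103.0, 102.2, 104.1, 105.0, 104.5, 106.0, 107.2])
state = get_state(prices, t=10)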
Implementation walkthrough (code shown on the slides)
• State representation: definition and construction of the state from the price series
• Portfolio performance
• Final plots: transaction history and returns across RL episodes
• Agent definition
  • Deep model architecture
  • Reset, remember transition, take action
  • Experience replay
• Initialize the agent, import the data, define the actions (Hold, Buy, Sell)
• Logs
• RL loop
  • Predict the action from the state and execute it
  • Compute the reward
  • Call the experience buffer
• In practice, for the experience memory, a deque structure is used, which is larger than the batch on which the model is trained
  • The model is updated once the replay buffer length exceeds the batch_size threshold
  • New memories are pushed in and, once the maximum length is reached, older ones are pushed out of the deque
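A small sketch of this memory mechanism in isolation; the class, method, and variable names are illustrative, not necessarily those of the lecture notebook:

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=2000, batch_size=32):
        self.buffer = deque(maxlen=capacity)    # old transitions fall out automatically
        self.batch_size = batch_size

    def remember(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def can_replay(self):
        return len(self.buffer) > self.batch_size   # only update once enough experience exists

    def sample(self):
        return random.sample(self.buffer, self.batch_size)

memory = ReplayMemory()
memory.remember(state=[0.5], action=1, reward=0.0, next_state=[0.6], done=False)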
• Save the model at each episode and plot the returns across episodes

Evaluation stage on test data (code shown on the slides)
• Take the model saved at episode 10 and try to obtain a portfolio return different from 0
• Trading actions and their plot
• Return by episode
• Trading decisions on the test data
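A minimal sketch of the evaluation idea: reload a saved model and act greedily (ε = 0) over the test series, accumulating profit. The file name, the get_state function (from the earlier sketch), and the action encoding (0 HOLD, 1 BUY, 2 SELL) are assumptions for illustration:

import numpy as np
import tensorflow as tf

def evaluate(model, prices, window=10):
    """Run the trained agent greedily over a test price series and return the profit."""
    inventory, profit = [], 0.0
    for t in range(len(prices) - 1):
        state = get_state(prices, t, window)                        # same state function as in the sketch above
        action = int(np.argmax(model.predict(state, verbose=0)[0])) # greedy action, no exploration
        if action == 1:                                             # BUY: open a position at the current price
            inventory.append(prices[t])
        elif action == 2 and inventory:                             # SELL: close the earliest open position
            profit += prices[t] - inventory.pop(0)
    return profit

# usage (illustrative file name):
# model = tf.keras.models.load_model("agent_episode_10.keras")
# print(evaluate(model, test_prices))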
Further deep RL architectures to avoid overestimations
• DDPG (Deep Deterministic Policy Gradient)
  • Combines DQN with DPG (Deterministic Policy Gradient)
  • An actor-critic method (two neural networks)
    • The actor is a deterministic policy network that determines the action
    • The critic estimates the Q-value
• DDQN (Double DQN)
  • Two networks: an online DQN and a target network
    • The online DQN selects the best action (maximum Q-value) for the next state
    • The target network calculates the estimated Q-value of the selected action
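A short sketch of the Double DQN target computation, contrasted with the plain DQN target; the batch arrays and names are illustrative (dones is a 0/1 float array, and q_next_online / q_next_target would come from the online and target networks of the earlier DQN sketch):

import numpy as np

def dqn_targets(rewards, q_next_target, dones, gamma=0.95):
    # Plain DQN: the target network both selects and evaluates the next action
    return rewards + gamma * (1 - dones) * np.max(q_next_target, axis=1)

def double_dqn_targets(rewards, q_next_online, q_next_target, dones, gamma=0.95):
    # Double DQN: the online network selects the action, the target network evaluates it
    best_actions = np.argmax(q_next_online, axis=1)
    evaluated = q_next_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1 - dones) * evaluated

# usage: q_next_online = online_net.predict(next_states), q_next_target = target_net.predict(next_states)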
