10 Deep Reinforcement Learning
Ruxandra Stoean
Further bibliography
R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition
(Adaptive Computation and Machine Learning series), 2018
M. Lapan, Deep Reinforcement Learning Hands-On: Apply modern RL methods to
practical problems of chatbots, robotics, discrete optimization, web automation, and
more, 2nd Edition, 2018
L. Graesser, W. L. Keng, Foundations of Deep Reinforcement Learning: Theory and
Practice in Python, 2019
M. Morales, Grokking Deep Reinforcement Learning, 2020
W. B. Powell, Reinforcement Learning and Stochastic Optimization: A Unified
Framework for Sequential Decisions, 2022
Reinforcement learning
A learning paradigm different from
Supervised learning
Associates inputs with outputs in labeled data
Unsupervised learning
Finds patterns in unlabeled data
Reinforcement learning
An agent starts in an initial state of an environment
It loops until the target is reached
Experience: it takes actions -> moves to the next state
It gets a reward from the environment
It aims to maximize the cumulative reward
It balances exploiting and exploring the gathered information
It performs several such episodes (similar to epochs in neural networks); see the sketch below
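A minimal sketch of this agent-environment loop, assuming a Gymnasium-style interface; the environment name "FrozenLake-v1", the random placeholder policy and the episode count are illustrative choices, not part of the slides.

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1")
num_episodes = 5                           # an episode runs from the initial state to a terminal state

for episode in range(num_episodes):
    state, info = env.reset()              # the agent starts in an initial state
    done = False
    cumulative_reward = 0.0
    while not done:                        # loop until the target (a terminal state) is reached
        action = env.action_space.sample() # placeholder policy: a random action
        next_state, reward, terminated, truncated, info = env.step(action)
        cumulative_reward += reward        # the quantity the agent tries to maximize
        state = next_state
        done = terminated or truncated
    print(f"episode {episode}: cumulative reward = {cumulative_reward}")
```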
Concepts 1/3
Action: taken by the agent in the environment
Stochastic environment
The reward and the transition to a new state after the same action may not be the same as
in a previous encounter
Concepts 2/3
Policy (π): the strategy followed by the agent when choosing its actions
Optimal when it maximizes the value function
Value function
The expected return (cumulative reward) of a state s if the agent follows the policy π
The state-value function for a policy
Q-value (quality-value) function
The expected long-term gain if the agent, in a state s, takes the action a and then follows the
policy π
Action-value function for a policy
Temporal difference (TD)
Computes the estimated value of a state for the policy π, based on the reward received by
the agent and the value of the next state
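A compact restatement of these three definitions in standard textbook notation (the notation itself is an assumption, not taken from the slides); γ is the discount factor and α the learning rate introduced in Concepts 3/3:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t} R_{t+1} \,\Big|\, S_0 = s\Big] \quad \text{(state-value function)}$$

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t \ge 0} \gamma^{t} R_{t+1} \,\Big|\, S_0 = s,\, A_0 = a\Big] \quad \text{(action-value function)}$$

$$\text{TD(0) update:} \quad V(s) \leftarrow V(s) + \alpha\,\big[R(s, a, s') + \gamma\, V(s') - V(s)\big]$$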
Exploration-Exploitation Dilemma
Exploitation
Take the best learned action, with the maximum expected reward at a given state
Exploration
Take a random action, without taking rewards into account
Trade-off between exploitation and exploration
Exploitation only: the agent gets stuck in a local optimum
Exploration only: it takes a long time to discover all the information
The ε-greedy policy
A random action is selected with probability ε
The best learned (greedy) action with probability 1 - ε (see the sketch below)
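A minimal sketch of ε-greedy action selection over a tabular Q-function; the (num_states, num_actions) array shape and the NumPy random generator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """Return a random action with probability epsilon, else the best learned action.

    Q is assumed to be a (num_states, num_actions) array of Q-values (illustrative shape).
    """
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action, ignoring rewards
    return int(np.argmax(Q[state]))            # exploit: action with the maximum Q-value
```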
On- and off-policy approaches
On- versus off-policy
On - SARSA (State-Action-Reward-State-Action)
Employs the ε-greedy policy
To estimate the Q-value, it takes the next action a' in the next state s' using the same strategy
target(s') = R(s, a, s') + γ Q_k(s', a')
Q_{k+1}(s, a) = (1 - α) Q_k(s, a) + α · target(s')
Off – Q-learning
Also employs the ε-greedy policy to select actions
But, to estimate the Q-value, it uses a greedy target policy: the best action (with the
maximum value) in the next state s' (the two updates are compared in the sketch below)
target(s') = R(s, a, s') + γ max_a' Q_k(s', a')   (Bellman equation)
Q_{k+1}(s, a) = (1 - α) Q_k(s, a) + α · target(s')
Alternative formulation: Q_{k+1}(s, a) = Q_k(s, a) + α [R(s, a, s') + γ max_a' Q_k(s', a') - Q_k(s, a)]
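A side-by-side sketch of the two tabular updates, assuming Q is a (num_states, num_actions) NumPy array and alpha, gamma are the learning rate and discount factor; the function names are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal):
    """On-policy: the target uses the action a_next actually chosen in s' by the same ε-greedy policy."""
    target = r if terminal else r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal):
    """Off-policy: the target uses the greedy (maximum-value) action in the next state s'."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```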
Concepts 3/3
Learning rate α
Values in [0,1]
A value of 0 leads to no learning
A value of 0.9 leads to very fast learning
ε-decay
Initially high ε; the value is then decreased so that fewer random actions are taken over time (see the schedule sketch below)
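One possible ε-decay schedule, sketched as a multiplicative (exponential) decay with a minimum floor; all constants are placeholders rather than values from the slides.

```python
epsilon = 1.0            # start almost fully exploratory
epsilon_min = 0.01       # always keep a small amount of exploration
decay_rate = 0.995       # multiplicative decay applied after each episode

for episode in range(1000):
    # ... run one episode with ε-greedy action selection using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * decay_rate)
```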
Q-learning
Model-free RL approach
Trial-and-error algorithm, learning from action outcomes as it moves through the
environment
It does not construct an internal model of the environment
https://www.baeldung.com/cs/reinforcement-learning-neural-network
Tabular (exact) Q-learning Algorithm
Initialize Q_0(s, a) for all states and actions (with 0)
Repeat
    Initialize state s
    For k = 1, 2, …
        Sample an action a according to the policy
        Execute a and get the next state s'
        If s' is terminal
            target(s') = R(s, a, s')   (reward of the transition)
        Else
            target(s') = R(s, a, s') + γ max_a' Q_k(s', a')
        Update Q_{k+1}(s, a) = (1 - α) Q_k(s, a) + α · target(s') to move closer to the target
        s = s'
Until the number of episodes is reached
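A runnable sketch of the tabular algorithm above, assuming a Gymnasium discrete environment; "FrozenLake-v1" and the hyperparameter values are illustrative choices, not part of the slides.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))   # Q_0(s, a) = 0

alpha, gamma = 0.1, 0.99                  # learning rate and discount factor (placeholders)
epsilon, epsilon_min, decay = 1.0, 0.01, 0.995
num_episodes = 5000
rng = np.random.default_rng(0)

for episode in range(num_episodes):
    s, _ = env.reset()                    # initialize state s
    done = False
    while not done:
        # ε-greedy behavior policy
        if rng.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # target(s') = R(s, a, s')                         if s' is terminal
        #            = R(s, a, s') + γ max_a' Q_k(s', a')  otherwise
        target = r if terminated else r + gamma * np.max(Q[s_next])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target   # move Q(s, a) closer to the target
        s = s_next
    epsilon = max(epsilon_min, epsilon * decay)            # ε-decay between episodes
```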
Example
Q function parameterized by a function approximator
Q-values computed by e.g. a neural network (deep learning) -> obtain the parameters θ of
the Q function; initially random weights
Iterative regression -> fit the Q-values to the computed targets
Optimizing a squared loss function (see the sketch below)
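A minimal sketch of a neural-network Q-function fitted by iterative regression on the computed targets with a squared (MSE) loss; the network architecture, optimizer settings and batch tensors are assumptions for illustration, not details given on the slide.

```python
import torch
import torch.nn as nn

state_dim, num_actions = 4, 2                     # placeholder dimensions
gamma = 0.99

q_net = nn.Sequential(                            # Q(s, ·; θ), θ starts as random weights
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, num_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def regression_step(states, actions, rewards, next_states, dones):
    """One iterative-regression step: fit Q(s, a; θ) to the Bellman targets with a squared loss.

    states/next_states: float tensors (batch, state_dim); actions: long tensor (batch,);
    rewards/dones: float tensors (batch,), with dones = 1.0 for terminal transitions.
    """
    with torch.no_grad():                         # targets are treated as fixed regression labels
        max_next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * max_next_q * (1.0 - dones)
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a; θ)
    loss = nn.functional.mse_loss(q_sa, targets)  # squared loss between Q-values and targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```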