Hota-ML-ReinforcementLearning
29.11.2024
[Figure. Image source: https://highlandcanine.com/]
[Figure: UltraTech Cement Stock, 27th Nov 2024. Image source: www.nseindia.com]
[Figure: agent-environment loop with reward and state transitions St → St+1 → St+2]
Q-Learning Algorithm
• Q-learning is a model-free reinforcement learning (RL) algorithm used to
learn the optimal policy for a Markov Decision Process (MDP)
1. Initialize the Q-table
2. Select an action
3. Perform the action
4. Measure the reward
5. Update the Q-table:
   Q(s, a) ← Q(s, a) + α [R + γ max_a' Q(s', a') - Q(s, a)]
6. Repeat from step 2; after multiple episodes, a good Q-table is ready (a minimal code sketch follows)
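As a quick illustration, here is a minimal tabular sketch of this loop in Python. The environment interface (reset(), step(), is_terminal()) and the ε-greedy parameters are illustrative assumptions, not part of the slides.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=100,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))           # 1. initialize the Q-table
    for _ in range(episodes):
        s = env.reset()                           # start a new episode
        while not env.is_terminal(s):
            if np.random.rand() < epsilon:        # 2. select an action:
                a = np.random.randint(n_actions)  #    explore randomly, or
            else:
                a = int(np.argmax(Q[s]))          #    exploit the current Q-table
            r, s_next = env.step(s, a)            # 3.-4. perform action, measure reward
            # 5. update the Q-table with the rule above
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q  # after multiple episodes, a good Q-table is ready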
An Example of Q-Learning
• Initializing the environment: States: {s0, s1, s2}, Actions: {a0, a1}, Rewards:
R(s0, a0) = -1, R(s0, a1) = +2, R(s1, a0) = +3, R(s1, a1) = +1, R(s2, any action) = 0
(terminal state).
• Transitions: T(s0, a0) → s1, T(s0, a1) → s2 (goal), T(s1, a0) → s2, T(s1, a1) → s0
• Episode 1:
• current state: s0, action chosen: a0 (randomly using exploration), reward: R(s0,
a0) = -1, next state: s1.
• Update Q(s0, a0) using Bellman’s equation:
• Q(s0, a0) ← 0 + 0.5 * [-1 + 0.9 * 0 - 0] = -0.5 (since Q(s1, a') = 0 initially; no knowledge of s1)
Ex. Continued…
Updated Q-values after 3 Episodes:
State    Action (a0)    Action (a1)
s0       -0.5           1.0
s1       1.5            0.0
s2       0.0            0.0
• Episode 2: From s1
• current state: s1, action chosen: a0, reward: R(s1, a0) = +3, next state: s2.
• Update Q(s1, a0) using Bellman’s equation:
  Q(s1, a0) ← Q(s1, a0) + α [R + γ max_a' Q(s2, a') - Q(s1, a0)]
• Q(s1, a0) ← 0 + 0.5 * [3 + 0.9 * 0 - 0] = 1.5
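These updates are easy to verify in code. The sketch below encodes the rewards and transitions defined earlier (with α = 0.5 and γ = 0.9, as in the worked numbers) and replays the updates; the third one, inferred from the table above, fills in Q(s0, a1).

import numpy as np

# Encode the example MDP: states s0=0, s1=1, s2=2; actions a0=0, a1=1
R = {(0, 0): -1, (0, 1): 2, (1, 0): 3, (1, 1): 1}  # rewards
T = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 0}   # transitions; s2 is terminal

alpha, gamma = 0.5, 0.9
Q = np.zeros((3, 2))

def update(s, a):
    r, s_next = R[(s, a)], T[(s, a)]
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

update(0, 0)  # Episode 1: Q(s0,a0) = 0.5 * (-1 + 0.9*0 - 0) = -0.5
update(1, 0)  # Episode 2: Q(s1,a0) = 0.5 * ( 3 + 0.9*0 - 0) =  1.5
update(0, 1)  # Episode 3: Q(s0,a1) = 0.5 * ( 2 + 0.9*0 - 0) =  1.0
print(Q)      # matches the Q-table shown above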
• Alternatively, you may use an ANN to learn Q-values: Deep Q-Learning (DQN)
Optimal Solution using Q-Learning: Maze
import numpy as np
import matplotlib.pyplot as plt

# Maze parameters
maze = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 2],  # '2' is the diamond (goal state)
]
maze = np.array(maze)
Python Code Continued…
…
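The remainder of the slide's code is elided. One plausible continuation, building on the setup above and sketched under stated assumptions (four moves; cells marked '1' are walls that block movement; an assumed reward of -1 per step and +10 at the diamond; illustrative hyperparameters), trains a Q-table over the 5x5 grid:

# A sketch of a possible training loop for the maze (assumptions noted above)
n_rows, n_cols = maze.shape
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
Q = np.zeros((n_rows, n_cols, len(actions)))
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(500):
    r_pos, c_pos = 0, 0                       # assumed start: top-left corner
    while maze[r_pos, c_pos] != 2:            # run until the diamond is reached
        if np.random.rand() < epsilon:
            a = np.random.randint(len(actions))  # explore
        else:
            a = int(np.argmax(Q[r_pos, c_pos]))  # exploit
        dr, dc = actions[a]
        nr, nc = r_pos + dr, c_pos + dc
        # walls ('1') and the grid border leave the agent in place
        if not (0 <= nr < n_rows and 0 <= nc < n_cols) or maze[nr, nc] == 1:
            nr, nc = r_pos, c_pos
        reward = 10 if maze[nr, nc] == 2 else -1  # assumed reward scheme
        Q[r_pos, c_pos, a] += alpha * (
            reward + gamma * np.max(Q[nr, nc]) - Q[r_pos, c_pos, a])
        r_pos, c_pos = nr, nc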
Deep Q-Learning (DQN) for RL
• When the number of states and actions becomes very large, how do you scale?
• Solution: Combine Q-Learning and Deep Learning → Deep Q-Networks (DQN)
• Goal: Approximate a function Q(s, a; θ), where θ represents the trainable weights of the network
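As one concrete illustration of Q(s, a; θ) (a sketch, not the slides' implementation), a small PyTorch network can map a state vector to one Q-value per action; the layer sizes and dimensions below are arbitrary assumptions:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action; θ = this network's weights."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # Q(s, a; θ) for every action a

q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.zeros(1, 4))  # Q-values for a dummy state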