Hota-ML-ReinforcementLearning

The document provides an introduction to Reinforcement Learning (RL) and its applications, including gaming and robotics. It explains the formal modeling of RL using Markov Decision Processes (MDP) and details the Q-Learning algorithm for learning optimal policies. Additionally, it discusses the use of Deep Q-Learning (DQN) to handle large state and action spaces by combining Q-Learning with deep learning techniques.


Birla Institute of Technology and Science Pilani, Hyderabad Campus

29.11.2024

BITS F464: Machine Learning (1st Sem 2024-25)


Introduction to Reinforcement Learning (RL)

Chittaranjan Hota, Sr. Professor


Dept. of Computer Sc. and Information Systems
hota@hyderabad.bits-pilani.ac.in
What is Reinforcement Learning?
• An agent tries to maximize the cumulative reward it receives from the
environment by performing a sequence of actions.

(Image sources: https://highlandcanine.com/; UltraTech Cement stock chart, 27th Nov 2024, www.nseindia.com)

Applications: Gaming, Robotics, Autonomous vehicles, Personalized treatment, etc.
Formal Modelling: Markov Decision Process
Markov property: the future state depends only on the present state, which
encapsulates all the necessary information from the past.

What should the player ‘O’ do here to avoid a loss?


MDP Continued…

Discounted cumulative reward:

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … = Σ(k≥0) γ^k Rt+k+1

γ : discount factor (0 ≤ γ ≤ 1) controlling how strongly future rewards are weighted.

(Figure: the agent-environment loop. At each step the agent observes the state, takes an action, and receives a reward while the environment moves from St to St+1 to St+2, ….)
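
To make the discounted return concrete, a minimal Python sketch (the reward list and γ value below are illustrative assumptions, not from the slides):

# Discounted cumulative reward for a finite reward sequence (illustrative values).
def discounted_return(rewards, gamma=0.9):
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r          # gamma^k weights a reward k steps ahead
    return G

print(discounted_return([1.0, 0.0, 2.0]))   # 1 + 0 + 0.81*2 = 2.62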
Q-Learning Algorithm
• Q-learning is a model-free reinforcement learning (RL) algorithm used to
learn the optimal policy for a Markov Decision Process (MDP)

Q-learning loop (repeated over multiple episodes until a good Q-table is ready):

• Initialize the Q-table
• Select an action
• Perform the action
• Measure the reward
• Update the Q-table: Q(s, a) ← Q(s, a) + α [R + γ max_a' Q(s', a') – Q(s, a)]
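
A minimal sketch of one step of this loop in Python, assuming a dictionary-based Q-table and hypothetical state/action objects (illustrative code, not from the slides):

import random
from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)] defaults to 0.0

def epsilon_greedy(state, actions, eps=0.1):
    # Select an action: explore with probability eps, otherwise exploit the Q-table.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    # One Q-learning update: Q(s,a) += alpha * (R + gamma * max_a' Q(s',a') - Q(s,a)).
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])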
An Example of Q-Learning
• Initializing the environment: States: {s0, s1, s2}, Actions: {a0, a1}, Rewards:
R(s0, a0) = -1, R(s0, a1) = +2, R(s1, a0) = +3, R(s1, a1) = +1, R(s2, any action) = 0
(terminal state).

• Transitions: T(s0, a0) → s1, T(s0, a1) → s2 (goal), T(s1, a0) → s2, T(s1, a1) → s0

• Parameters: α = 0.5, γ = 0.9, Initial Q-values (Q(s, a) = 0 for all s, a).

• Episode 1:
• current state: s0, action chosen: a0 (randomly using exploration), reward: R(s0,
a0) = -1, next state: s1.
• Update Q(s0,a0) using Bellman’s equation:

• Q(s0, a0) ← 0 + 0.5 [-1 + 0.9 * max_a' Q(s1, a') – 0]

• Q(s0, a0) ← 0.5 * [-1 + 0] = -0.5 (since Q(s1, a') = 0 initially; no knowledge of s1).
Ex. Continued…

Updated Q-values after 3 Episodes:

State   Action (a0)   Action (a1)
s0      -0.5          1.0
s1      1.5           0.0
s2      0.0           0.0
• Episode 2: From s1
• current state: s1, action chosen: a0, reward: R(s1, a0) = +3, next state: s2.
• Update Q(s1, a0) using Bellman’s equation:
  Q(s1, a0) ← Q(s1, a0) + α[R + γ max_a' Q(s2, a') – Q(s1, a0)]
• Q(s1, a0) ← 0 + 0.5 [3 + 0.9 * 0 – 0] = 1.5

• Episode 3: Back to s0 (different action)
• current state: s0, action chosen: a1, reward: R(s0, a1) = +2, next state: s2.
• Update Q(s0, a1) using Bellman’s equation:
  Q(s0, a1) ← Q(s0, a1) + α[R + γ max_a' Q(s2, a') – Q(s0, a1)]
• Q(s0, a1) ← 0 + 0.5 [2 + 0.9 * 0 – 0] = 1.0
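
The three updates above can be reproduced in a few lines of Python; this is a minimal sketch using the example’s rewards, transitions, and parameters (the terminal state s2 simply keeps Q = 0):

# Worked example: states {s0, s1, s2}, actions {a0, a1}, alpha = 0.5, gamma = 0.9.
alpha, gamma = 0.5, 0.9
R = {('s0','a0'): -1, ('s0','a1'): 2, ('s1','a0'): 3, ('s1','a1'): 1}
T = {('s0','a0'): 's1', ('s0','a1'): 's2', ('s1','a0'): 's2', ('s1','a1'): 's0'}
Q = {(s, a): 0.0 for s in ('s0', 's1', 's2') for a in ('a0', 'a1')}

def step(s, a):
    # One Q-learning update for taking action a in state s.
    s_next, r = T[(s, a)], R[(s, a)]
    best_next = max(Q[(s_next, b)] for b in ('a0', 'a1'))    # 0 for terminal s2
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

step('s0', 'a0')   # Episode 1: Q(s0, a0) -> -0.5
step('s1', 'a0')   # Episode 2: Q(s1, a0) ->  1.5
step('s0', 'a1')   # Episode 3: Q(s0, a1) ->  1.0
print(Q)           # matches the "Updated Q-values after 3 Episodes" table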

• Alternatively, you may use an ANN to learn Q-values: Deep Q-Learning (DQN)
Optimal Solution using Q-Learning: Maze
import numpy as np
import matplotlib.pyplot as plt

# Maze parameters ('2' is the diamond, i.e. the goal state)
maze = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 2],
]
maze = np.array(maze)

(Python code continued on the following slides.)
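
As a stand-in for the continuation, here is a minimal sketch of how the Q-table for this maze could be trained, continuing from the maze array above. The start cell (0, 0), the reward scheme, ε-greedy exploration, and treating 1-cells as walls are assumptions, not the slides’ actual code:

rows, cols = maze.shape
n_actions = 4                                # 0: up, 1: down, 2: left, 3: right
Q = np.zeros((rows * cols, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1

def move(r, c, a):
    # Next cell after action a; stay put when hitting a wall (1) or the border.
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    nr, nc = r + dr, c + dc
    if 0 <= nr < rows and 0 <= nc < cols and maze[nr, nc] != 1:
        return nr, nc
    return r, c

for episode in range(500):
    r, c = 0, 0                              # assumed start: top-left corner
    for _ in range(200):                     # cap on steps per episode
        if maze[r, c] == 2:                  # reached the diamond
            break
        s = r * cols + c
        if np.random.rand() < eps:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        nr, nc = move(r, c, a)
        reward = 10.0 if maze[nr, nc] == 2 else -0.1    # assumed reward scheme
        Q[s, a] += alpha * (reward + gamma * np.max(Q[nr * cols + nc]) - Q[s, a])
        r, c = nr, nc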


Deep Q-Learning (DQN) for RL
• When the number of states and actions becomes very large, how do you scale?
• Solution: combine Q-Learning and Deep Learning → Deep Q-Networks (DQN)
• Goal: Approximate a function: Q(s,a; θ), where θ represents the
trainable weights of the network

• Q(s, a) = r(s, a) + γ max_a' Q(s', a')   (Bellman’s equation)

• Cost = {Q(s, a; θ) – [r(s, a) + γ max_a' Q(s', a'; θ)]}²


(Figure: two network designs. Left: the state s and an action a are fed into the ANN, which outputs a single Q value; this is inefficient because it needs one forward pass per action. Right, improved: only the state s is fed in and the network outputs Q(a1), Q(a2), Q(a3), one value per action, in a single pass.)
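
A minimal DQN sketch in PyTorch, following the “improved” design above (state in, one Q-value per action out). The layer sizes, optimizer, and single-transition update without a replay buffer or target network are simplifying assumptions, not the lecture’s implementation:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Approximates Q(s, a; θ): one forward pass returns a Q-value for every action.
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)

q_net = QNetwork(state_dim=4, n_actions=3)          # illustrative sizes
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.9

def dqn_update(s, a, r, s_next, done):
    # One gradient step on the squared Bellman error for a single transition.
    q_sa = q_net(s)[a]                               # Q(s, a; θ)
    with torch.no_grad():                            # target treated as a constant
        target = r + gamma * (0.0 if done else q_net(s_next).max().item())
    loss = (q_sa - target) ** 2                      # {Q(s,a;θ) – [r + γ max_a' Q(s',a';θ)]}²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with a random 4-dimensional state (purely illustrative):
# dqn_update(torch.randn(4), a=1, r=0.5, s_next=torch.randn(4), done=False)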
Thank You!

Good luck for Comprehensive Exams!
