10 ML Introduction to Reinforcement Learning

The document provides an overview of Reinforcement Learning (RL), including its key features, terminology, and the Markov Decision Process (MDP) framework. It explains the roles of agents, environments, states, actions, rewards, and policies in RL, as well as the concepts of episodic and continuing tasks. Additionally, it discusses the importance of maximizing expected returns and the relationship between returns and policies in the context of MDPs.


Content
Introduction
Simplified view of Reinforcement Learning
Different ML techniques
Applications of RL
RL Terminology
Categories of RL algorithms
Markov Decision Process
Definition
Markov Property
Reward, Goal, Episodes, Returns, Policy, Value Functions

Introduction | Key Features of RL


No explicit teacher
Learning by trial and error
Learning through repeated agent-environment interactions
Goal-oriented learning by maximizing cumulative reward
Delayed rewards
Need to balance exploitation vs exploration
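To make the trial-and-error loop concrete, here is a minimal sketch of the repeated agent-environment interaction in Python. The Environment and Agent classes, their methods, and all numbers are hypothetical placeholders invented for this sketch, not part of any particular library or of the slides above.

# Minimal sketch of the agent-environment interaction loop.
# Environment, Agent and all values are hypothetical placeholders.

class Environment:
    def reset(self):
        """Return the initial state of a new episode."""
        return 0

    def step(self, action):
        """Apply the action; return (next_state, reward, done)."""
        next_state = action                      # toy dynamics
        reward = 1.0 if action == 1 else 0.0     # scalar reward signal
        done = next_state == 1                   # terminal state reached?
        return next_state, reward, done

class Agent:
    def act(self, state):
        """Choose an action for the current state (here: a trivial rule)."""
        return 1

env, agent = Environment(), Agent()
state, total_reward, done = env.reset(), 0.0, False
while not done:                                  # one episode of interaction
    action = agent.act(state)                    # agent acts on the observed state
    state, reward, done = env.step(action)       # environment responds with state and reward
    total_reward += reward                       # cumulative reward the agent tries to maximize
print("return:", total_reward)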

RL Terminology

Agent: The artificial entity that is being trained to perform a task by learning from its own experience.
Environment: Everything outside the purview of the agent. The environment has its own internal dynamics, which are usually not visible to the agent.
State: Current situation of the environment (as observed by the agent), which forms the basis for the decisions taken by the agent.
Action: Choices made by the agent to change the state of the environment.
Reward: Scalar quantity emitted by the environment in response to the action.
Return: The cumulative sum of future rewards to be received by the agent.
Goal: Maximize the expected return.
Policy: Defines the agent's behavior. This can be viewed as a mapping from perceived states to actions to be taken in those states.
Value Function: Specifies what is good in the long run. The value of a state is the expected return that the agent can expect to receive starting from that state.
Model: Something that mimics the behavior of the environment and allows inferences to be made about how the environment will behave.

Markov Decision Process | Intuitive Definition


An agent in RL must be capable of:
sensing the state of the environment
taking actions for affecting the state
realizing goals related to the state of the environment

What is a Markov Decision Process?
A Markov Decision Process (MDP) is a formal mathematical framework used to define the interaction between the agent and its environment in terms of states, actions and rewards.

Markov Decision Process | Formal Definition

Markov Decision Process (MDP)

An MDP is defined as the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where:

$\mathcal{S}$ is a finite set of states.
$\mathcal{A}$ is a finite set of actions.
$P(s' \mid s, a)$ is a state transition probability function, which defines the probability of transitioning to the next state $s'$ from the current state $s$ on taking the action $a$.
$R(s, a)$ is a reward function, which defines the expected reward to be received on taking a particular action in a given state.
$\gamma \in [0, 1]$ is a discount factor for assigning more importance to immediate rewards.
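As an illustration, here is a minimal sketch of how such a tuple might be written down in Python for a toy two-state problem. The state names, action names and all numeric values are made-up example values, not taken from the slides.

# A toy MDP (S, A, P, R, gamma) written out explicitly.
# All states, actions and numbers are hypothetical example values.

states = ["s0", "s1"]                      # finite set of states S
actions = ["left", "right"]                # finite set of actions A

# Transition probabilities P[(s, a)] = {next_state: probability}
P = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s1": 1.0},
}

# Expected reward R[(s, a)] for taking action a in state s
R = {
    ("s0", "left"):  0.0,
    ("s0", "right"): 1.0,
    ("s1", "left"):  0.0,
    ("s1", "right"): 5.0,
}

gamma = 0.9                                # discount factor

# Sanity check: every (s, a) pair defines a proper probability distribution
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)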

Markov Property
A state is said to possess the Markov Property when it includes information about all aspects of the past agent-environment interaction that make a difference for the future (given the present state, the future is independent of past states, actions and rewards).
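Written out formally (in the standard textbook formulation), the property says that the one-step dynamics depend on the history only through the current state and action:

$$
\Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_0, A_0, R_1, \ldots, S_t, A_t\}
= \Pr\{S_{t+1}=s',\, R_{t+1}=r \mid S_t, A_t\}
$$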

Exercise: Markovian and Non-Markovian Env.


(1) Devise an example task that fits into the MDP framework, identifying for each its states,
actions, and rewards.

(2) Can you think of an environment in which states do not have the Markov property?

Reward
The reward is a scalar quantity that forms the basis of evaluating the action taken by an agent.
Reward is a measure of the immediate benefit of taking a particular action.
The agent must be able to measure how well it is performing frequently over its lifespan.
If rewards are sparse, figuring out good actions can be difficult.

Exercise: Maze Runner


What is a good choice of rewards for a maze solver?

Goal
The goal of an RL agent is to maximize the expected return (cumulative rewards from
current state to final state).

Any goal can be thought of as the maximization of the expected return


A goal should be outside the agent's direct control

Episodic Tasks
Agent-environment interaction breaks down naturally into subsequences known as episodes.
The agent's state is reset after reaching a terminal state.

Continuing Tasks
Interaction does not break down into subsequences (e.g. gas pipeline monitoring, heating system monitoring).

Markov Decision Process | Returns
For episodic tasks, if the agent expects to receive rewards $R_{t+1}, R_{t+2}, \ldots, R_T$ from time $t+1$ till the final time $T$, the return is defined as:

$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$

For continuing tasks, $T = \infty$, so $G_t$ can evaluate to infinity.

A discounting factor $\gamma \in [0, 1)$ is used to limit the value of $G_t$ to a finite quantity:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

Markov Decision Process | Recursive Relationship of Return

The relationship between returns at successive time steps can be easily derived:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$
$\;\;\; = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right)$
$\;\;\; = R_{t+1} + \gamma G_{t+1}$

Recursive relationship between $G_t$ and $G_{t+1}$: $G_t = R_{t+1} + \gamma G_{t+1}$
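This recursion is also the basis for computing returns efficiently in code: start at the end of an episode and work backwards. Below is a small sketch; the reward sequence and discount factor are made-up example values, not the ones from the exercise that follows.

# Compute G_t for every time step of a finite episode using
# the recursion G_t = R_{t+1} + gamma * G_{t+1}, working backwards.
# The rewards and gamma are hypothetical example values.

def returns_from_rewards(rewards, gamma):
    """rewards[t] is R_{t+1}; the result list holds G_t for each t."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G        # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = G
    return returns

print(returns_from_rewards([1.0, 0.0, 2.0], gamma=0.9))
# math: [2.62, 1.8, 2.0] (floating-point output may differ in the last digits)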

Exercise: Calculate the Return


Suppose a discount factor $\gamma$ is given and the agent receives a finite sequence of rewards $R_1, R_2, \ldots, R_T$. What are the returns $G_0, G_1, \ldots, G_T$? Hint: work backwards.

Markov Decision Process | Unified Notation


Episodic tasks can be viewed as a special case of continuing tasks.
The terminal state acts as an absorbing state for which the reward is always 0.

Both continuing and episodic tasks can then be expressed with a single definition of the return, $G_t = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k$, where either $T = \infty$ or $\gamma = 1$ is allowed (but not both).

Markov Decision Process | Policy

Policy

A policy $\pi$ is a mapping from states to probabilities of selecting each available action; $\pi(a \mid s)$ denotes the probability of taking action $a$ in state $s$.


RL methods specify how the agent changes its policy based on its experience.
A good policy is one that results in a lot of reward in the long run.
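As a concrete illustration, a tabular stochastic policy can be represented as a dictionary from states to action distributions. The states, actions and probabilities below are hypothetical example values chosen only to show the shape of such a mapping.

import random

# A tabular stochastic policy pi(a | s): state -> {action: probability}.
# States, actions and probabilities are hypothetical example values.
policy = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
}

def sample_action(pi, state):
    """Sample an action in the given state according to pi(a | s)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))   # e.g. 'right' with probability 0.7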

Markov Decision Process | Value Functions

State-value Function
The state-value function $v_\pi(s)$ of a state $s$ under a policy $\pi$ is defined as the expected return when starting in $s$ and following $\pi$ thereafter.

Action-value Function
The action-value function $q_\pi(s, a)$ of a state $s$ and action $a$ under a policy $\pi$ is defined as the expected return when starting in $s$, taking the action $a$ (which may not necessarily be the one prescribed by $\pi$) and following $\pi$ thereafter.
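Using the return $G_t$ defined earlier, the two value functions can be written compactly in the standard textbook notation as:

$$
v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right],
\qquad
q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s,\, A_t = a \right]
$$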

Markov Decision Process | Recap

Term: MDP
Description: Framework defining the agent-environment interaction.
Expression: the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P$ is a state transition probability function, $R$ is a reward function, and $\gamma$ is a discount factor.

Term: Markov Property
Description: The current state includes all relevant information about the past.
Expression: $\Pr\{S_{t+1}, R_{t+1} \mid S_t, A_t\} = \Pr\{S_{t+1}, R_{t+1} \mid S_0, A_0, R_1, \ldots, S_t, A_t\}$

Term: Reward
Description: Scalar quantity for evaluating the agent's action.
Expression: $R_t \in \mathbb{R}$

Term: Return
Description: Discounted sum of future rewards.
Expression: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

Term: Goal
Description: Maximize the expected return at each time step.
Expression: $\max \mathbb{E}[G_t]$
Markov Decision Process | Recap

Term: Policy
Description: Mapping from states to probabilities of selecting actions.
Expression: $\pi(a \mid s)$

Term: State-value Function
Description: Expected return when starting in state $s$ and following $\pi$ thereafter.
Expression: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

Term: Action-value Function
Description: Expected return when starting in state $s$, taking action $a$, and following $\pi$ thereafter.
Expression: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s,\, A_t = a]$
