10 ML Introduction to Reinforcement Learning
Content
Introduction
Simplified view of Reinforcement Learning
Different ML techniques
Applications of RL
RL Terminology
Categories of RL algorithms
Markov Decision Process
Definition
Markov Property
Reward, Goal, Episodes, Returns, Policy, Value Functions
RL Terminology

Agent: The artificial entity that is being trained to perform a task by learning from its own experience.

Environment: Everything outside the purview of the agent. The environment has its own internal dynamics, which are usually not visible to the agent.

State: The current situation of the environment (as observed by the agent), which forms the basis for the decisions taken by the agent.

Action: A choice made by the agent to change the state of the environment.

Reward: A scalar quantity emitted by the environment in response to the action.

Return: The cumulative sum of future rewards to be received by the agent.

Goal: Maximize the expected return.

Policy: Defines the agent's behavior. This can be viewed as a mapping from perceived states to actions to be taken in those states.

Value Function: Specifies what is good in the long run. The value of a state is the expected return an agent can expect to receive starting from that state.

Model: Something that mimics the behavior of the environment and allows inferences to be made about how the environment will behave.
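To see how these terms fit together, here is a minimal Python sketch of the agent-environment interaction loop; the corridor environment, its reward of 1 at the goal, and the random policy are all invented purely for illustration.

```python
import random

class Environment:
    """Toy corridor: positions 0..3, the agent starts at 0 and tries to reach 3."""

    def reset(self):
        self.position = 0
        return self.position                      # initial state

    def step(self, action):
        # Internal dynamics, normally hidden from the agent.
        self.position = min(3, max(0, self.position + action))
        done = self.position == 3                 # terminal state ends the episode
        reward = 1.0 if done else 0.0             # scalar reward from the environment
        return self.position, reward, done

class Agent:
    def act(self, state):
        # A (very naive) policy: pick a direction at random.
        return random.choice([-1, +1])

env, agent = Environment(), Agent()
state = env.reset()
episode_return, done = 0.0, False
while not done:                                   # one episode of agent-environment interaction
    action = agent.act(state)                     # agent chooses an action based on the state
    state, reward, done = env.step(action)        # environment returns next state and reward
    episode_return += reward                      # return = cumulative reward
print("Return for this episode:", episode_return)
```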
What is a Markov Decision Process?
A Markov Decision Process (MDP) is a formal mathematical framework used to define the interaction between the agent and its environment in terms of states, actions, and rewards.
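As an illustration, a small MDP can be written down explicitly as tables of transition probabilities and expected rewards; the two states, two actions, and all numbers below are assumed values, not taken from any particular problem.

```python
# A hypothetical two-state MDP written out as explicit tables (all numbers are made up).
states = ["idle", "working"]
actions = ["wait", "push"]

# P[s][a] -> list of (next_state, probability) pairs: the state transition function.
P = {
    "idle":    {"wait": [("idle", 1.0)],
                "push": [("working", 0.8), ("idle", 0.2)]},
    "working": {"wait": [("working", 0.6), ("idle", 0.4)],
                "push": [("working", 1.0)]},
}

# R[s][a] -> expected immediate reward for taking action a in state s.
R = {
    "idle":    {"wait": 0.0, "push": -0.1},
    "working": {"wait": 1.0, "push": 0.5},
}

gamma = 0.9  # discount factor

# Sanity check: transition probabilities for every (state, action) pair sum to 1.
for s in states:
    for a in actions:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```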
Markov Property
A state is said to possess the Markov Property when it includes information about all aspects of the past agent-environment interaction that make a difference for the future (given the current state, the future is independent of past states, actions, and rewards).
(2) Can you think of an environment in which states do not have the Markov property?
Reward
The reward is a scalar quantity that forms the basis for evaluating the action taken by an agent.

Reward is a measure of the immediate benefit of taking a particular action.

The agent must be able to measure, frequently over its lifespan, how well it is performing.

If rewards are sparse, figuring out good actions can be difficult.
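The difference between sparse and dense rewards can be made concrete with a toy example; both reward functions below are hypothetical.

```python
GOAL = 10  # hypothetical target position on a number line

def sparse_reward(position):
    # Feedback only when the goal is reached exactly; most actions look equally bad.
    return 1.0 if position == GOAL else 0.0

def dense_reward(position):
    # Feedback at every step: getting closer to the goal yields a higher (less negative) reward.
    return -abs(GOAL - position)

for pos in [0, 5, 9, 10]:
    print(pos, sparse_reward(pos), dense_reward(pos))
```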
Goal
The goal of an RL agent is to maximize the expected return (cumulative rewards from
current state to final state).
Episodic Tasks
Agent-environment interaction breaks down naturally into subsequences known as
episodes
The agent's state is reset after reaching a terminal state.
Continuing Tasks
Interaction does not break down into sub-sequences (e.g. gas pipeline monitoring, heating
system monitoring)
Markov Decision Process Returns
For episodic tasks, if the agent expects to receive rewards from time $t$ until the final time step $T$, the return is defined as:

$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$

For continuing tasks, future rewards are discounted by the factor $\gamma$ ($0 \le \gamma \le 1$), and the return becomes:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
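A quick numerical check of these definitions, using a made-up reward sequence and an assumed discount factor of 0.9:

```python
# Hypothetical reward sequence R_{t+1}, ..., R_T received after time t.
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]
gamma = 0.9  # assumed discount factor

# Undiscounted episodic return: G_t = R_{t+1} + R_{t+2} + ... + R_T
episodic_return = sum(rewards)

# Discounted return: G_t = sum_k gamma^k * R_{t+k+1}
discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))

print(episodic_return)    # 3.0
print(discounted_return)  # 0.9**2 * 1.0 + 0.9**4 * 2.0 ≈ 2.1222
```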
Markov Decision Process | Policy
Policy

A policy $\pi$ defines the agent's behavior: it maps each state to the probability of selecting each available action, $\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$.

State-value Function

The state-value function $v_\pi(s)$ of a state $s$ under a policy $\pi$ is defined as the expected return when starting in $s$ and following $\pi$ thereafter:

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

Action-value Function

The action-value function $q_\pi(s, a)$ of a state $s$ and action $a$ under a policy $\pi$ is defined as the expected return when starting in $s$, taking the action $a$ (which may not necessarily be the one predicted by $\pi$), and following $\pi$ thereafter:

$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
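Since both value functions are expectations over returns, they can be approximated by averaging sampled returns. The sketch below estimates $v_\pi(s)$ for a toy corridor task under a uniformly random policy; the environment, the policy, and the discount factor are all assumptions made for illustration.

```python
import random

def rollout(start_state, gamma=0.9, goal=3):
    """One episode in a toy corridor (positions 0..goal, reward 1 on reaching the goal)
    under a uniformly random policy pi; returns the discounted return G."""
    state, discount, G = start_state, 1.0, 0.0
    while state != goal:
        action = random.choice([-1, +1])          # random policy: pi(a|s) = 0.5
        state = min(goal, max(0, state + action))
        reward = 1.0 if state == goal else 0.0
        G += discount * reward                    # accumulate gamma^k * R_{t+k+1}
        discount *= gamma
    return G

def estimate_state_value(s, episodes=10_000):
    # Monte Carlo estimate of v_pi(s) = E_pi[ G_t | S_t = s ].
    return sum(rollout(s) for _ in range(episodes)) / episodes

for s in [0, 1, 2]:
    print(s, round(estimate_state_value(s), 3))
```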
MDP: Framework defining the agent-environment interaction. Expression: $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where
- $\mathcal{S}$ is a finite set of states,
- $\mathcal{A}$ is a finite set of actions,
- $P$ is a state transition probability function,
- $R$ is a reward function,
- $\gamma$ is a discount factor, $0 \le \gamma \le 1$.

Markov Property: The current state includes all information about the past. Expression: $\Pr\{S_{t+1}, R_{t+1} \mid S_t, A_t\} = \Pr\{S_{t+1}, R_{t+1} \mid S_0, A_0, R_1, \dots, S_t, A_t\}$

Reward: Scalar quantity for evaluating the agent's action. Expression: $R_t \in \mathbb{R}$

Return: Discounted sum of future rewards. Expression: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

Goal: Maximize the expected return at each time step. Expression: $\max \mathbb{E}[G_t]$
Policy: Mapping from states to probabilities of selecting each available action. Expression: $\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$

State-value Function: Expected return when starting in $s$ and following $\pi$ thereafter. Expression: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

Action-value Function: Expected return when starting in $s$, taking action $a$, and following $\pi$ thereafter. Expression: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$