Unit-4 MDP

Markov Decision Process

Track
• MDP formulation,
• utility theory,
• utility functions,
• value iteration,
• policy iteration and
• partially observable MDPs.
Markov Decision Process
A Markov decision process (MDP) is a stochastic decision-making process: a mathematical framework for modeling the decision-making of a dynamic system in scenarios where outcomes are partly random and partly under the control of a decision maker, who makes sequential decisions over time.
• MDPs rely on variables such as the environment, agent’s actions, and
rewards to decide the system’s next optimal action. They are classified into
four types — finite, infinite, continuous, or discrete — depending on
various factors such as sets of actions, available states, and the decision-
making frequency.

• MDPs have been around since the early 1950s. The name Markov refers to the Russian mathematician Andrey Markov, who played a pivotal role in shaping the theory of stochastic processes. In their early days, MDPs were used to solve problems in inventory management and control, queuing optimization, and routing. Today, MDPs find applications in studying optimization problems via dynamic programming, robotics, automatic control, economics, manufacturing, etc.
• In artificial intelligence, MDPs model sequential decision-making
scenarios with probabilistic dynamics. They are used to design intelligent
machines or agents that need to function longer in an environment where
actions can yield uncertain results.
• MDP models are typically popular in two sub-areas of AI: probabilistic
planning and reinforcement learning (RL).

• Probabilistic planning is the discipline that uses known models to accomplish an agent's goals and objectives. While doing so, it emphasizes guiding machines or agents to make decisions while enabling them to learn how to behave to achieve their goals.
• Reinforcement learning allows applications to learn from the
feedback the agents receive from the environment.
• Let’s understand this through a real-life example:

• Consider a hungry antelope in a wildlife sanctuary looking for food in its environment. It stumbles upon
a place with a mushroom on the right and a cauliflower on the left. If the antelope eats the mushroom,
it receives water as a reward. However, if it opts for the cauliflower, the nearby lion’s cage opens and
sets the lion free in the sanctuary. With time, the antelope learns to choose the side of the mushroom,
as this choice offers a valuable reward in return.

• In the above MDP example, two important elements exist — agent and environment. The agent here is
the antelope, which acts as a decision-maker. The environment reveals the surrounding (wildlife
sanctuary) in which the antelope resides. As the agent performs different actions, different situations
emerge. These situations are labeled as states. For example, when the antelope performs an action of
eating the mushroom, it receives the reward (water) in correspondence with the action and transitions
to another state. The agent (antelope) repeats the process over a period and learns the optimal action
at each state.

• In the context of an MDP, we can formalize that the antelope learns the optimal action to perform (eat the mushroom). It therefore avoids eating the cauliflower, as that action generates an outcome that can harm its survival. The example illustrates that MDPs are essential in capturing the dynamics of RL problems.
• The MDP model operates using key elements such as the agent, states, actions, rewards, and optimal policies. The agent is the system responsible for making decisions and performing actions. It operates in an environment that describes the various states the agent occupies as it transitions from one state to another. The MDP defines the mechanism by which a given state and the agent's action lead to the next state. Moreover, the agent receives rewards depending on the action it performs and the state it reaches (the current state). The policy of the MDP model specifies the agent's next action depending on its current state.

• The MDP framework has the following key components:

• S: states (s ∈ S)
• A: actions (a ∈ A)
• P(St+1 | St, At): transition probabilities
• R(s): reward
• The graphical representation of the MDP model is as follows:

[Figure: MDP model]
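
• As a rough illustration, the components listed above might be written down in Python as plain dictionaries. This is only a sketch; the state and action names and all probabilities below are hypothetical placeholders, not values from these slides.

# Hypothetical two-state, two-action MDP written as plain data.
S = ["s0", "s1"]          # set of states
A = ["a0", "a1"]          # set of actions

# P[(s, a)] maps each possible next state to P(St+1 | St = s, At = a);
# each row of probabilities sums to 1.
P = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.2, "s1": 0.8},
}

# R[s] is the reward received in state s.
R = {"s0": 0.0, "s1": 1.0}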
• The MDP model uses the Markov Property, which states that the future can be determined
only from the present state that encapsulates all the necessary information from the past. The
Markov Property can be evaluated by using this equation:

• P[St+1 | St] = P[St+1 | S1, S2, S3, ..., St]

• According to this equation, the probability of the next state St+1 given only the present state St is the same as the probability of St+1 given the entire history of states S1, S2, S3, ..., St. In other words, the present state already encapsulates everything relevant from the past, so an MDP evaluates the next action using only the current state, without any dependence on previous states or actions.
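
• A small sketch of what this property means in practice: the next state is sampled from a distribution conditioned only on the current state, so it makes no difference how the process arrived there. The states and probabilities below are hypothetical.

import random

# Hypothetical transition probabilities P(St+1 | St). Note that the keys
# involve only the current state, never the earlier history S1, ..., St-1.
P = {
    "s0": {"s0": 0.6, "s1": 0.4},
    "s1": {"s0": 0.3, "s1": 0.7},
}

def step(current):
    # Sample the next state using only the current state.
    next_states, probs = zip(*P[current].items())
    return random.choices(next_states, weights=probs)[0]

state = "s0"
trajectory = [state]
for _ in range(5):
    state = step(state)
    trajectory.append(state)
print(trajectory)  # e.g. ['s0', 's1', 's1', 's0', 's0', 's1']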
• We have a problem where we need to decide whether the tribes
should go deer hunting or not in a nearby forest to ensure long-term
returns. Each deer generates a fixed return. However, if the tribes
hunt beyond a limit, it can result in a lower yield next year. Hence, we
need to determine the optimum portion of deer that can be caught
while maximizing the return over a longer period.

• The problem statement can be simplified in this case: whether to hunt a certain portion of deer or not. In the context of MDP, the problem can be expressed as follows:

• States: The number of deer available in the forest in the year under consideration. The four states include empty, low, medium, and high, which are defined as follows:

• Empty: No deer available to hunt
• Low: Available deer count is below a threshold t_1
• Medium: Available deer count is between t_1 and t_2
• High: Available deer count is above a threshold t_2
• Rewards: Hunting at each state generates rewards of some kind. The rewards for hunting at different states, such as low, medium, and high, may be $5K, $50K, and $100K, respectively. Moreover, if the action results in an empty state, the reward is -$200K. This is due to the required re-breeding of new deer, which involves time and money.
• State transitions: Hunting in a state causes a transition to a state with fewer deer. Conversely, the action of no_hunting causes a transition to a state with more deer, except for the ‘high’ state.
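
• The deer-hunting formulation above can be sketched in code. The rewards follow the slides; the discount factor and the simplified deterministic transitions (hunting moves the population down one level, not hunting moves it up one level) are assumptions made for illustration, since the slides give no numeric probabilities. Value iteration, listed in this unit's track, is used here only to show how the optimal long-term policy could be computed.

# Deer-hunting MDP sketch: rewards are in $K, per the slides; the
# transitions and the discount factor are illustrative assumptions.
STATES = ["empty", "low", "medium", "high"]
ACTIONS = ["hunt", "no_hunting"]
GAMMA = 0.9  # assumed discount factor

HUNT_REWARD = {"empty": 0, "low": 5, "medium": 50, "high": 100}
EMPTY_PENALTY = -200  # reward when an action leaves the forest empty

def next_state(state, action):
    # Assumed deterministic transitions: hunting moves one level down,
    # no_hunting moves one level up (except from "high").
    i = STATES.index(state)
    if action == "hunt":
        return STATES[max(i - 1, 0)]
    return STATES[min(i + 1, len(STATES) - 1)]

def reward(state, action, nxt):
    r = HUNT_REWARD[state] if action == "hunt" else 0
    if nxt == "empty":
        r += EMPTY_PENALTY
    return r

# Plain value iteration to maximize the discounted long-run return.
V = {s: 0.0 for s in STATES}
for _ in range(200):
    V = {s: max(reward(s, a, next_state(s, a)) + GAMMA * V[next_state(s, a)]
                for a in ACTIONS)
         for s in STATES}

policy = {s: max(ACTIONS,
                 key=lambda a: reward(s, a, next_state(s, a))
                               + GAMMA * V[next_state(s, a)])
          for s in STATES}
print(policy)  # under these assumptions: hunt only in the "high" state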
Examples of the Markov Decision Process
1. Routing problems
2. Managing maintenance and repair of dynamic systems
3. Designing intelligent machines
4. Designing quiz games
5. Managing wait time at a traffic intersection
6. Determining the number of patients to admit to a hospital
UTILITY THEORY AND UTILITY FUNCTIONS
