Unit-4 MDP
Track
• MDP formulation,
• utility theory,
• utility functions,
• value iteration,
• policy iteration and
• partially observable MDPs.
Markov Decision process
A Markov decision process (MDP) is a stochastic decision-making process: a
mathematical framework for modeling the decisions of a dynamic system in
scenarios where outcomes are partly random and partly controlled by a
decision maker who makes sequential decisions over time.
• MDPs rely on variables such as the environment, agent’s actions, and
rewards to decide the system’s next optimal action. They are classified into
four types — finite, infinite, continuous, or discrete — depending on
various factors such as sets of actions, available states, and the decision-
making frequency.
• MDPs have been studied since the early 1950s. The name Markov refers to
the Russian mathematician Andrey Markov, who played a pivotal role in
shaping the theory of stochastic processes. In their early days, MDPs were
used to solve problems in inventory management and control, queueing
optimization, and routing. Today, MDPs find applications in optimization
problems solved via dynamic programming, robotics, automatic control,
economics, manufacturing, and more.
• In artificial intelligence, MDPs model sequential decision-making
scenarios with probabilistic dynamics. They are used to design intelligent
machines or agents that must operate for extended periods in an environment
where actions can yield uncertain results.
• MDP models are typically popular in two sub-areas of AI: probabilistic
planning and reinforcement learning (RL).
• Consider a hungry antelope in a wildlife sanctuary looking for food in its environment. It stumbles upon
a place with a mushroom on the right and a cauliflower on the left. If the antelope eats the mushroom,
it receives water as a reward. However, if it opts for the cauliflower, the nearby lion’s cage opens and
sets the lion free in the sanctuary. With time, the antelope learns to choose the mushroom,
as this choice yields a valuable reward.
• In the above MDP example, two important elements exist: the agent and the environment. The agent
here is the antelope, which acts as the decision-maker. The environment is the surroundings (the
wildlife sanctuary) in which the antelope resides. As the agent performs different actions, different
situations emerge; these situations are labeled as states. For example, when the antelope performs
the action of eating the mushroom, it receives the corresponding reward (water) and transitions to
another state. The agent (antelope) repeats this process over time and learns the optimal action in
each state.
• In the context of the MDP, we can say that the antelope learns the optimal action to perform (eat
the mushroom). It therefore avoids eating the cauliflower, as that choice produces an outcome that
harms its survival. The example illustrates how an MDP captures the dynamics of an RL problem.
• The MDP model operates using key elements: the agent, states, actions, rewards,
and the policy. The agent is the system responsible for making decisions and
performing actions. It operates in an environment, which defines the set of states
the agent can be in as it transitions from one state to another. The MDP specifies
how a given state and the agent's action determine the next state. The agent also
receives a reward that depends on the action it performs and the state it reaches
(the current state). The policy of the MDP model specifies the agent's next action
as a function of its current state.
• S: set of states (s ∈ S)
• A: set of actions (a ∈ A)
• P(St+1 | St, At): transition probabilities
• R(s): reward
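As a concrete and purely illustrative sketch, the Python snippet below shows one way to represent these elements as plain data structures and to sample a transition. Every state name, probability, and reward value here is an invented assumption, not part of the formal definition:

```python
import random

# Hypothetical MDP specification; all values below are invented for illustration.
S = ["hungry", "fed"]                       # states (s ∈ S)
A = ["eat_mushroom", "eat_cauliflower"]     # actions (a ∈ A)

# P[(s, a)] maps a (state, action) pair to the next-state distribution P(s' | s, a).
P = {
    ("hungry", "eat_mushroom"):    {"fed": 0.9, "hungry": 0.1},
    ("hungry", "eat_cauliflower"): {"fed": 0.2, "hungry": 0.8},
    ("fed",    "eat_mushroom"):    {"fed": 1.0},
    ("fed",    "eat_cauliflower"): {"fed": 0.5, "hungry": 0.5},
}

# R[s] is the reward received on reaching state s.
R = {"hungry": -1.0, "fed": 1.0}

def step(s, a):
    """Sample the next state from P(s' | s, a) and return (next_state, reward)."""
    next_states, probs = zip(*P[(s, a)].items())
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[s_next]

print(step("hungry", "eat_mushroom"))       # e.g. ('fed', 1.0)
```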
• [Figure: graphical representation of the MDP model]
• The MDP model uses the Markov Property, which states that the future can be determined
from the present state alone, because the present state encapsulates all the necessary
information from the past. The Markov Property can be expressed by this equation:

P[St+1 | St] = P[St+1 | S1, S2, S3, ..., St]

• According to this equation, the probability of the next state St+1 given only the present
state St is the same as the probability of the next state given all the previous states
(S1, S2, S3, ..., St). This implies that an MDP uses only the present/current state to
evaluate the next action, without any dependence on earlier states or actions.
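A tiny Python sketch of this property is shown below: the next-state distribution is looked up from the current state only, so any earlier history passed in is ignored. The two-state chain and its probabilities are made-up assumptions used only to illustrate the idea:

```python
import random

# Made-up two-state Markov chain; probabilities are illustrative only.
transition = {
    "s1": {"s1": 0.8, "s2": 0.2},
    "s2": {"s1": 0.4, "s2": 0.6},
}

def next_state(history):
    """P[St+1 | S1, ..., St] = P[St+1 | St]: only the last state matters."""
    current = history[-1]                    # the present state St
    states, probs = zip(*transition[current].items())
    return random.choices(states, weights=probs, k=1)[0]

# Both calls sample from the same distribution, because only the last state is used.
print(next_state(["s2", "s1"]))
print(next_state(["s1"]))
```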
• Consider a problem where we need to decide whether a tribe should hunt
deer in a nearby forest, with the goal of maximizing long-term returns.
Each deer hunted generates a fixed return. However, if the tribe hunts
beyond a certain limit, the yield next year will be lower. Hence, we need
to determine the optimal fraction of the deer population to hunt while
maximizing the return over the long run.
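Value iteration (covered later in this unit) is one way to solve such a problem once it is cast as an MDP. The sketch below is a deliberately tiny, hypothetical version of the deer-hunting decision: the two population states, the hunting actions, the transition probabilities, the rewards, and the discount factor are all invented assumptions, chosen only to show the mechanics:

```python
# Toy deer-hunting MDP solved with value iteration; all numbers are invented.
GAMMA = 0.9                         # discount factor for long-term returns

states  = ["low", "high"]           # deer population level next season
actions = ["hunt_little", "hunt_a_lot"]

# P[s][a] -> {next_state: probability}; heavy hunting tends to push the population low.
P = {
    "low":  {"hunt_little": {"high": 0.6, "low": 0.4},
             "hunt_a_lot":  {"high": 0.1, "low": 0.9}},
    "high": {"hunt_little": {"high": 0.9, "low": 0.1},
             "hunt_a_lot":  {"high": 0.3, "low": 0.7}},
}
# R[s][a] -> immediate return from this season's hunt.
R = {
    "low":  {"hunt_little": 1.0, "hunt_a_lot": 3.0},
    "high": {"hunt_little": 2.0, "hunt_a_lot": 6.0},
}

def value_iteration(theta=1e-6):
    """Compute V*(s) by repeatedly applying the Bellman optimality update."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function.
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + GAMMA *
                         sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in states}
    return V, policy

V, policy = value_iteration()
print(V)        # optimal long-run value of each population state
print(policy)   # optimal hunting action in each state
```

With these made-up numbers, the point of the exercise is that the optimal policy balances this season's return against the risk of driving the population (and future returns) down, which is exactly the trade-off the problem statement describes.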