
Monte Carlo Tree Search

Games like Tic-Tac-Toe, the Rubik's Cube, Sudoku, Chess and Go share a common property: the number of possible actions grows exponentially as the game goes forward. Ideally, if you could predict every possible move and its result, you could increase your chance of winning.

But since the moves increase exponentially, the computational power required to calculate them also goes through the roof.

Monte Carlo Tree Search is a method, usually used in games, for predicting the path (the moves) that the policy should take to reach the final winning solution.


Before we can discover the right path (the moves) that will lead us to the win, we first need to arrange the moves from the present state of the game. Connected together, these moves look like a tree, hence the name Tree Search. Each level of the tree multiplies the number of moves, which is where the sudden exponential increase comes from.
Tree Search Algorithm

It is a method for searching every possible move that may exist after a turn in the game. For example, a player in the game of tic-tac-toe has many different options on each turn, which can be visualised using a tree representation. Adding the moves available on the next turn grows the diagram into a deeper tree.


Brute-forcing an exponentially growing tree, considering every child move/node until the final solution is found, requires a lot of computational power and results in extremely slow output.

Monte Carlo Tree Search (MCTS) is a well-known technique for solving decision-making problems. It has been successfully used in games like Go and chess to search through the possible moves in a game tree using Monte Carlo simulations. MCTS has also been used in various industries, including robotics and driverless cars.

What is Monte Carlo Tree Search?

MCTS is an algorithm that figures out the best move from a set of moves by Selecting → Expanding → Simulating → Updating the nodes in the tree. This loop is repeated until it reaches the solution and learns the policy of the game.

How does Monte Carlo Tree Search Work?


Let's look at the parts of the loop one by one.

SELECTION
Selecting 👆 | This process selects the node in the tree that has the highest possibility of winning. For example, consider children with winning possibilities 2/3, 0/1 and 1/2 under the first move 4/6; the node 2/3 has the highest possibility of winning.

The selected node is searched for from the current state of the tree and sits at the end of a branch. Since the selected node has the highest possibility of winning, that path is also the one most likely to reach the solution faster than the other paths in the tree.
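In practice, the selection rule usually balances exploiting nodes with a high win rate against exploring rarely visited ones; the classic choice is the UCB1 formula used by UCT. Below is a minimal sketch (not from this article) that scores the example children 2/3, 0/1 and 1/2 under the 4/6 parent; the exploration constant c is an assumed tuning parameter.

```python
import math

def ucb1(wins: int, visits: int, parent_visits: int, c: float = 1.41) -> float:
    """UCB1 score: average win rate (exploitation) plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# The article's example: children 2/3, 0/1 and 1/2 under a parent visited 6 times.
children = {"2/3": (2, 3), "0/1": (0, 1), "1/2": (1, 2)}
scores = {name: ucb1(w, v, parent_visits=6) for name, (w, v) in children.items()}
print(max(scores, key=scores.get), scores)
```

By pure win rate the 2/3 node wins, as in the example above; with the exploration bonus added, a barely visited node such as 0/1 can temporarily score higher, which is exactly how MCTS avoids fixating on one branch too early.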
EXPANSION

Expanding | After selecting the right node, expansion is used to increase the options further in the game by expanding the selected node and creating child nodes (we create only one child node at a time in this case). These child nodes are the future moves that can be played in the game. The nodes that are not expanded further for the time being are known as leaves.
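As a rough sketch of expansion, here is a minimal Node class that turns one untried move into a child, matching the one-child-at-a-time choice above. The toy game (a Nim-like pile where each move removes 1 or 2 stones) exists only to make the snippet runnable; legal_moves and apply_move are stand-ins for real game logic.

```python
import random

# Toy game, only so the sketch runs: a Nim-like pile where a move takes 1 or 2.
def legal_moves(state):
    return [m for m in (1, 2) if m <= state]

def apply_move(state, move):
    return state - move

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.wins = 0
        self.visits = 0
        self.untried_moves = list(legal_moves(state))  # leaves keep these un-expanded

    def expand(self):
        """Turn one untried move into a new child node."""
        move = self.untried_moves.pop(random.randrange(len(self.untried_moves)))
        child = Node(apply_move(self.state, move), parent=self)
        self.children.append(child)
        return child

root = Node(5)
child = root.expand()
print(len(root.children), child.state)
```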

SIMULATION
Simulating | Exploring 🚀 Nobody knows which node is the best child/leaf in the group, i.e. which move will perform best and lead to the correct answer down the tree. So,

How do we find the best child, the one that will lead us to the correct solution?

We use Reinforcement Learning to make random decisions in the game further down from every child node. Then a reward is given to every child node, calculated from how close the output of its random decisions came to the final output that we need to win the game.

For example, in the game of Tic-Tac-Toe: does the random decision to place a cross (X) next to the previous cross (X) result in the three consecutive crosses (X-X-X) that are needed to win the game?

FYI: this can be considered the policy π of the RL algorithm.


Learn more about policy and value networks: Policy Networks vs Value Networks in Reinforcement Learning (towardsdatascience.com).

The simulation is run for every child node, and each node is then given its individual reward.
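A simulation (rollout) can be sketched as nothing more than random play to the end of the game, with the reward recording whether the player who moved first in the rollout won. This reuses the toy Nim-like game from the expansion sketch, where taking the last stone wins; a real game would replace these rules.

```python
import random

def legal_moves(state):
    return [m for m in (1, 2) if m <= state]

def rollout(state, player_to_move):
    """Play uniformly random moves to the end; reward 1 if the player who was
    to move at the start of the rollout takes the last stone, else 0."""
    current = player_to_move
    while state > 0:
        state -= random.choice(legal_moves(state))
        if state == 0:
            return 1 if current == player_to_move else 0
        current = 1 - current  # switch players
    return 0

# Averaging many rollouts estimates a node's winning possibility.
wins = sum(rollout(5, player_to_move=0) for _ in range(1000))
print(f"estimated win rate from random play: {wins / 1000:.2f}")
```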
UPDATING | BACK-PROPAGATION

Let's say the simulation of the node gives optimistic results for its future and gets a positive score of 1/1.

Updating | Back-propagation | Because of the new nodes and their positive or negative scores in the environment, the total scores of their parent nodes must be updated by going back up the tree one by one. The newly updated scores change the state of the tree and may also change which node the next selection process picks.
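The update step can be sketched as a walk from the simulated node back to the root, incrementing every visit count and adding the reward. This assumes the Node class from the expansion sketch; flipping the reward at each level is an assumption that applies to two-player, alternating-turn games, where a win for one side is a loss for the other.

```python
def backpropagate(node, reward):
    """Propagate a simulation result back up the tree, updating the
    wins/visits totals that the selection step reads."""
    while node is not None:
        node.visits += 1
        node.wins += reward
        reward = 1 - reward  # a win for this node's player is a loss one level up
        node = node.parent
```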

After all the nodes are updated, the loop begins again: selecting the best node in the tree → expanding the selected node → using RL-style random play for the simulation/exploration → back-propagating the updated scores → and finally arriving, further down the tree, at the node that is the actual required final winning result.

For example: a solved Rubik's Cube, the correct solution of a Sudoku, checkmating the King in Chess, or the final required winning board of Tic-Tac-Toe.

Conclusion
Instead of brute-forcing through millions of possible ways to find the right path, the Monte Carlo Tree Search algorithm chooses the best possible move from the current state of the game's tree with the help of Reinforcement Learning.

Stochastic Games
A stochastic game:
A repeated interaction between several participants in which the underlying state of the environment changes stochastically, in a way that depends on the decisions of the participants.

A strategy:
A rule that dictates how a participant in an interaction makes his
decisions as a function of the observed behavior of the other participants
and of the evolution of the environment.

Evaluation of stage payoffs:
The way that a participant in a repeated interaction evaluates the stream of stage payoffs that he receives (or stage costs that he pays) along the interaction.

An equilibrium:
A collection of strategies, one for each player, such that each player
maximizes (or minimizes, in case of stage costs) his evaluation of stage
payoffs given the strategies of the other players.

A correlated equilibrium:
An equilibrium in an extended game in which at the outset of the game
each player receives a private signal, and the vector of private signals is
chosen according to a known joint probability distribution. In the
extended game, a strategy of a player depends, in addition to past play, on
the signal he received.

Stochastic Games in Artificial Intelligence

• Many unexpected external occurrences can place us in unexpected circumstances in real life.
• Many games, such as dice tossing, have a random element to reflect this unpredictability. These are known as stochastic games.
• Backgammon is a classic game that mixes skill and luck. The legal moves are determined by rolling dice at the start of each player's turn. White, for example, has rolled a 6–5 and has four alternative moves in the backgammon scenario described below.

This is a standard backgammon position. The object of the game is to get all of one's pieces off the board as quickly as possible. White moves clockwise toward 25, while Black moves counterclockwise toward 0. A piece can advance to any position unless several opponent pieces are there; if there is exactly one opponent piece, it is captured and must start over. White has rolled a 6–5 and must pick between four valid moves: (5–10,5–11), (5–11,19–24), (5–10,10–16), and (5–11,11–16), where the notation (5–11,11–16) means moving one piece from position 5 to 11 and then another from 11 to 16.
Stochastic game tree for a backgammon position
White knows his or her own legal moves, but has no idea how Black will roll, and thus no idea what Black's legal moves will be. That means White cannot build a standard game tree of the kind used in chess or tic-tac-toe. In backgammon, in addition to MAX and MIN nodes, a game tree must include chance nodes, usually drawn as circles. The branches leading from each chance node denote the possible dice rolls; each branch is labelled with the roll and its probability. There are 36 ways to roll two dice, each equally likely, yet only 21 distinct rolls, because a 6–5 is the same as a 5–6. Each of the six doubles (1–1 through 6–6) has a probability of 1/36, so P(1–1) = 1/36. Each of the other 15 distinct rolls has a probability of 1/18.
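The counting above is easy to verify in a few lines. This small check (not from the original text) enumerates the 36 ordered rolls and groups them into unordered ones:

```python
from itertools import product
from collections import Counter

# Group the 36 ordered rolls of two dice into unordered (distinct) rolls.
rolls = Counter(tuple(sorted(pair)) for pair in product(range(1, 7), repeat=2))
print(len(rolls))          # 21 distinct rolls
print(rolls[(1, 1)] / 36)  # a double such as 1-1: 1/36
print(rolls[(5, 6)] / 36)  # a non-double such as 6-5: 2/36 = 1/18
```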

The next phase is to learn how to make good decisions. Obviously, we want to choose the move that leads to the best position. Positions, however, do not have definite minimax values. Instead, we can only compute a position's expected value, which is the average over all possible outcomes of the chance nodes.
As a result, we can generalize the deterministic minimax value to an expectiminimax value for games with chance nodes. Terminal nodes and MAX and MIN nodes (for which the dice roll is known) work exactly as before. For chance nodes we compute the expected value, which is the sum of the values of all outcomes, weighted by the probability of each chance action:

EXPECTIMINIMAX(s) =
  UTILITY(s)                                  if TERMINAL-TEST(s)
  max_a EXPECTIMINIMAX(RESULT(s, a))          if PLAYER(s) = MAX
  min_a EXPECTIMINIMAX(RESULT(s, a))          if PLAYER(s) = MIN
  Σ_r P(r) · EXPECTIMINIMAX(RESULT(s, r))     if PLAYER(s) = CHANCE

where r is a possible dice roll (or other random event) and RESULT(s, r) denotes the same state as s, with the addition that the result of the dice roll is r.
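As a rough sketch, the definition above translates almost line for line into code. The helper functions here (is_terminal, utility, player, actions, outcomes, result) are hypothetical placeholders for a real game implementation, not part of any particular library:

```python
def expectiminimax(state):
    """Expectiminimax value of a state in a game tree with chance nodes.
    All helper functions are assumed to be supplied by the game."""
    if is_terminal(state):
        return utility(state)
    turn = player(state)  # "MAX", "MIN" or "CHANCE"
    if turn == "MAX":
        return max(expectiminimax(result(state, a)) for a in actions(state))
    if turn == "MIN":
        return min(expectiminimax(result(state, a)) for a in actions(state))
    # Chance node: probability-weighted average over the random outcomes r.
    return sum(p * expectiminimax(result(state, r)) for r, p in outcomes(state))
```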
3.5 Partially Observable Games
• In a partially observable environment, the agent is not familiar with the complete environment at a given time.
• Real-life example: playing card games is a perfect example of a partially observable environment, where a player is not aware of the cards in the opponent's hand.
• A partially observable environment is one in which the agent does not have complete information about the current state of the environment. The agent can only observe a subset of the environment, and some aspects of the environment may be hidden or uncertain. Examples of partially observable environments include driving a car in traffic.
3.6 Limitations of Game Search Algorithms

• Firstly, they rely on the assumption that the game is fully observable, deterministic, and has perfect information.
• Secondly, their effectiveness is highly dependent on the complexity of the game and the branching factor of the game tree.
• Thirdly, they can be computationally expensive and time-consuming to run.
• Fourthly, the assumption that players are rational may not always hold, leading to suboptimal decisions.
• Fifthly, game search algorithms cannot solve games beyond their rule-based definitions and cannot cope with games that incorporate random events.
• Lastly, they may not always result in an optimal or even satisfactory solution, as they do not take into account human behaviour or intuition, which can sometimes lead to unexpected outcomes.
3.7 Constraint Satisfaction Problems (CSP)
• Constraint satisfaction problems (CSPs) are a class of AI problems whose goal is to find a solution that meets a set of constraints.
• The aim of a constraint satisfaction problem is to find values for a group of variables that fulfil a set of restrictions or rules.
• CSPs are frequently employed in AI for tasks including resource allocation, planning, scheduling, and decision-making.

There are three basic components in a constraint satisfaction problem:

1) Variables: the things that need to be determined. Variables in a CSP are the objects that must have values assigned to them in order to satisfy a particular set of constraints. Boolean, integer, and categorical variables are just a few examples of the various types of variables. In a Sudoku puzzle, for instance, the variables could stand for the puzzle cells that need to be filled with numbers.
2) Domains: the range of potential values that a variable can take. Depending on the problem, a domain may be finite or infinite. In Sudoku, for instance, the set of numbers from 1 to 9 can serve as the domain of a variable representing a puzzle cell.

3) Constraints: the rules that govern how variables relate to one another. Constraints in a CSP restrict the values the variables may take. Unary constraints, binary constraints, and higher-order constraints are a few examples of the various sorts of constraints. In a Sudoku puzzle, for instance, the constraints might be that each row, column, and 3×3 box can contain only one instance of each number from 1 to 9.
Constraint Satisfaction Problems (CSP) representation:

• The finite set of variables V1, V2, V3, …, Vn.
• A non-empty domain for every variable: D1, D2, D3, …, Dn.
• The finite set of constraints C1, C2, …, Cm.
  • Each constraint Ci restricts the possible values of the variables, e.g., V1 ≠ V2.
• Each constraint Ci is a pair <scope, relation>.
  • Example: <(V1, V2), V1 ≠ V2>
  • Scope = the set of variables that participate in the constraint.
  • Relation = the list of valid combinations of values for those variables.
  • There might be an explicit list of permitted combinations, or an abstract relation that supports membership testing and listing (see the code sketch below).
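The <variables, domains, constraints> description above maps directly onto data structures. Here is a minimal sketch using the V1 ≠ V2 example, with each constraint stored as a (scope, relation) pair whose relation is a membership test:

```python
# Variables, domains and constraints for a tiny illustrative CSP.
variables = ["V1", "V2", "V3"]
domains = {v: {1, 2, 3} for v in variables}
constraints = [
    (("V1", "V2"), lambda a, b: a != b),  # V1 ≠ V2
    (("V2", "V3"), lambda a, b: a != b),  # V2 ≠ V3
]

def satisfies(assignment, constraints):
    """Check every constraint whose whole scope is already assigned."""
    return all(
        rel(*(assignment[v] for v in scope))
        for scope, rel in constraints
        if all(v in assignment for v in scope)
    )

print(satisfies({"V1": 1, "V2": 2}, constraints))  # True
print(satisfies({"V1": 1, "V2": 1}, constraints))  # False: violates V1 ≠ V2
```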

Constraint Satisfaction Problems (CSP) algorithms:


1) Backtracking algorithm
• The backtracking algorithm is a depth-first search that methodically investigates the space of potential solutions until it discovers one that satisfies all the constraints.
• The method begins by choosing a variable and giving it a value, then repeatedly attempts to give values to the remaining variables.
• If at any point a variable cannot be given a value that fulfils the constraints, the method returns to the previous variable and tries a different value.
• The algorithm ends once all assignments have been tried or a solution that satisfies all constraints has been discovered (a sketch in code follows this list).
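A minimal sketch of the algorithm just described, reusing the variables, domains, constraints and satisfies names from the CSP snippet above:

```python
def backtrack(assignment, variables, domains, constraints):
    """Depth-first backtracking search for a complete consistent assignment."""
    if len(assignment) == len(variables):
        return assignment  # every variable has a value: solution found
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if satisfies(assignment, constraints):  # consistent so far?
            solution = backtrack(assignment, variables, domains, constraints)
            if solution is not None:
                return solution
        del assignment[var]  # undo and try the next value (the "backtrack")
    return None  # no value works: forces backtracking one level up

print(backtrack({}, variables, domains, constraints))
```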

2) Forward-checking algorithm
• The forward-checking algorithm is a variation of the backtracking algorithm that shrinks the search space using a form of local consistency.
• For each unassigned variable, the method keeps a set of remaining values and applies local constraints to eliminate inconsistent values from these sets. After a variable is given a value, the algorithm examines the variable's neighbours to see whether any of their remaining values have become inconsistent, and removes them from the sets if they have. If, after forward checking, some variable has no values left, the algorithm goes backward (see the sketch after this list).
• Constraint-propagation algorithms are a related class that uses local consistency and inference to shrink the search space. They operate by propagating constraints between variables and using the information obtained to remove inconsistent values from the variable domains.
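A sketch of the domain-pruning step for binary constraints, again reusing the (scope, relation) constraint format from the CSP snippet; returning None signals the dead end that makes the search go backward:

```python
import copy

def forward_check(var, value, assignment, domains, constraints):
    """After tentatively assigning var = value, remove incompatible values
    from the domains of unassigned neighbours. None means a domain emptied."""
    new_domains = copy.deepcopy(domains)
    new_domains[var] = {value}
    for scope, rel in constraints:
        if var not in scope:
            continue
        for other in scope:
            if other == var or other in assignment:
                continue
            # Keep only the values of `other` compatible with var = value.
            new_domains[other] = {
                w for w in new_domains[other]
                if rel(*(value if v == var else w for v in scope))
            }
            if not new_domains[other]:
                return None  # dead end: backtrack
    return new_domains

print(forward_check("V1", 1, {}, domains, constraints))
```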
