DAVID AVELLAN-HULTMAN
EMIL GUNNBERG QUERAT
Abstract
This thesis investigates general game-playing by comparing the well-known methods Alpha-beta Pruning and Monte Carlo Tree Search in a new context, namely a three-dimensional version of the game Connect Four. The methods are compared by conducting a tournament with instances of both methods at varying levels of allowed search extent and measuring their performance as a function of the average thinking time taken per move. Alpha-beta Pruning clearly proves to be the stronger method at thinking times below 0.1 seconds. However, Monte Carlo Tree Search seems to scale better with increased thinking time and, in this experiment, overtakes Alpha-beta Pruning as the better method at thinking times of about 10 seconds. This study is a contribution to the body of knowledge on how these methods perform in the context of general game-playing, but further comparisons of the methods with regard to varying game complexity, game-specific heuristics and augmentations of the methods are needed before any definite generalizations can be made.
Sammanfattning
This thesis aims to investigate general game-playing techniques through a comparison between the well-known methods Alpha-beta Pruning and Monte Carlo Tree Search in a three-dimensional version of the game Connect Four. The methods are compared in a tournament with instances of both methods at varying levels of allowed search extent, and their performance is measured as a function of the average thinking time per move. Alpha-beta Pruning is clearly the better method at thinking times below 0.1 seconds. Monte Carlo Tree Search, however, appears to scale better with increased thinking time and becomes the better method at thinking times of about 10 seconds in this experiment. This study is a contribution to the understanding of how these methods perform in general, but further comparisons of the methods with regard to varying game complexity, game-specific heuristics and improvements of the methods are required before any definite generalizations can be made.
Contents
1 Introduction
1.1 Aim and Research Topic
1.2 Scope and Approach
1.3 Thesis outline
2 Background
2.1 The 3D Connect 4 game
2.2 Minimax and Alpha-beta Pruning
2.3 Monte Carlo Tree Search
2.4 Related work
3 Method
3.1 Implementation and execution
3.2 Players
3.2.1 Alpha-beta Pruning
3.2.2 Monte Carlo Tree Search
3.3 Tournament
3.3.1 Playing speed measurement
4 Results
4.1 Results overview
4.2 Result matrix
4.3 Player performance
4.4 Convergence of win rate
4.5 Game repetition
5 Discussion
5.1 Analyzing the ratings
5.2 Generalizability
5.3 Potential issues
Bibliography
Chapter 1
Introduction
The task of designing game-playing algorithms has been studied since the early days of computers. For simple games like Tic Tac Toe, a computer can easily search through all future possibilities and always play the best move. However, for most games, such as Chess or Go, the state space is far too large to search through by brute force, and the central question then becomes how to play these games as well as possible given restrictions on the time the algorithm is allowed to run.

Instead of brute force, other methods of game-play have been developed. Most traditional approaches to playing board games search through a smaller portion of future possibilities starting from the current position, employing game-specific heuristics and various speed-up techniques such as Alpha-beta Pruning. One such example is IBM's Deep Blue computer [1], which used these methods in 1997 to defeat the then-reigning World Chess Champion in a six-game match. There has since been a growing interest in developing general algorithms for game-playing, which work well for many different games without the need for game-specific heuristics [2]. An example of an algorithm that has been used in systems that attempt general game-playing is Monte Carlo Tree Search, which conducts numerous random play-outs in order to estimate the likelihood of winning for each possible next move [3]. This method has been used as a component of DeepMind's program AlphaZero, together with a deep neural network, to achieve superhuman performance in Go, Chess and Shogi [4]. This advancement is a great achievement for general game-play; however, it requires great computational resources for training. Not all applications of game-playing algorithms, such as simple AI opponents in computer games, require the sophistication of AlphaZero or have access to resources like DeepMind's. Instead, the generalizability of Alpha-beta Pruning
• The different players that were studied and how Alpha-beta Pruning and
Monte Carlo Tree Search were implemented.
Chapter 2
Background
to search for winning future states and thus determine the next move of the
current player. Two different methods of tree-search will be presented in the
following sections.
If using Negamax to determine which moves to play, any move that transitions to a state t achieving the maximum −U(t) is optimal. However, since naïvely computing this requires traversing the whole exponentially-sized game tree, a depth cutoff is usually used, i.e. states at a certain depth from the start node s_0 of the computation are considered terminal and assigned a heuristic utility value.
A commonly used speed-up technique for Negamax (and Minimax as well) is Alpha-beta Pruning, a method which allows certain sections of the game tree to be excluded when calculating Negamax utility values if they can be proven not to affect the final outcome [5]. This is done by always maintaining lower and upper bounds (α and β) on the final answer. Pseudocode for limited-depth Negamax with Alpha-beta Pruning (in later sections referred to simply as Alpha-beta Pruning) is given in algorithm 1.
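As a concrete illustration of the idea, a minimal Python sketch of depth-limited Negamax with Alpha-beta Pruning could look as follows. The state interface (is_terminal, utility, valid_moves, apply) and the heuristic function are assumptions made purely for illustration and are not the thesis implementation.

import math

def negamax(state, alpha, beta, depth):
    """Depth-limited Negamax with Alpha-beta Pruning.

    Returns the utility of `state` for the player to move. The state
    interface used here (is_terminal, utility, valid_moves, apply) is
    hypothetical and only serves to illustrate the control flow.
    """
    if state.is_terminal():
        return state.utility()       # exact value of a finished game
    if depth == 0:
        return heuristic(state)      # depth cutoff: heuristic estimate

    best = -math.inf
    for move in state.valid_moves():
        child = state.apply(move)
        # Negamax: the child's value is negated, and the (alpha, beta)
        # window is negated and swapped for the opponent's perspective.
        value = -negamax(child, -beta, -alpha, depth - 1)
        best = max(best, value)
        alpha = max(alpha, value)
        if alpha >= beta:
            break                    # prune: the rest cannot change the result
    return best

def heuristic(state):
    # Without game-specific knowledge, non-terminal leaves can simply be
    # valued as a draw (the choice made later in this thesis).
    return 0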
1. Selection: Starting from the root node s_0, select child nodes by repeatedly making moves that are at this point in some way optimal (see below), until either an unvisited node is reached or the game ends, i.e. a node with no children is reached.
2. Expansion: If the reached node is not terminal, add its child nodes to the search tree so that they can be selected in later iterations.
3. Simulation: Evaluate the predicted result starting from the reached node. Terminal nodes have fixed results, but for non-terminal nodes the result is estimated by performing random moves until a terminal state is reached.
4. Backpropagation: Update the nodes on the path taken in the first step
with the information gained from the result or estimated result.
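As an illustration of how these phases fit together, the following is a minimal Python sketch of one search iteration. The Node class, the state interface (is_terminal, valid_moves, apply, player_to_move, result_for) and the tree policy passed in as select are assumptions made for illustration only; one possible select rule is the UCT formula discussed next.

import random

class Node:
    """Hypothetical tree node storing visit and value statistics."""
    def __init__(self):
        self.children = {}       # move -> Node, filled in on expansion
        self.visits = 0
        self.total_value = 0.0   # reward sum, from the perspective of the
                                 # player to move at this node's state

def mcts_iteration(root, root_state, select):
    """One iteration: selection, expansion, simulation, backpropagation."""
    node, state, path = root, root_state, [root]

    # 1. Selection: follow the tree policy until an unexpanded or terminal node.
    while node.children and not state.is_terminal():
        move = select(node)
        node = node.children[move]
        state = state.apply(move)
        path.append(node)

    # 2. Expansion: add the children of a non-terminal leaf to the tree.
    if not state.is_terminal():
        for move in state.valid_moves():
            node.children[move] = Node()

    # 3. Simulation: play random moves to the end of the game and read off
    #    the result (1 win, 0.5 draw, 0 loss) for the player to move at `state`.
    sim = state
    while not sim.is_terminal():
        sim = sim.apply(random.choice(sim.valid_moves()))
    reward = sim.result_for(state.player_to_move())

    # 4. Backpropagation: update the path, flipping the reward between the
    #    two alternating players on the way up.
    for n in reversed(path):
        n.visits += 1
        n.total_value += reward
        reward = 1 - reward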
The main problem in Monte Carlo Tree Search is how to choose which
child node to explore. Kocsis and Szepesvári [6] suggest picking the move i
which has the highest upper confidence bound on its predicted reward, which
is calculated as
$$\mathrm{UCT} = v_i + c\sqrt{\frac{\ln N}{n_i}},$$
where v_i is the previously estimated value of the child node corresponding to the move i, N is the number of times the current node has been reached, n_i is the number of times the child node corresponding to the move i has been reached, and c is a hyper-parameter corresponding to the degree of exploration in the search.
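A minimal Python sketch of this selection rule, compatible with the node statistics in the sketch above, could look as follows; the constant c and the Node fields are assumptions for illustration.

import math

def uct_select(parent, c=1.4):
    """Return the move with the highest UCT value at `parent`.

    Unvisited children are tried first. Because a child's statistics are
    assumed to be stored from its own player-to-move perspective (see the
    Node sketch above), the selecting player's value estimate is one minus
    the child's average reward.
    """
    best_move, best_score = None, -math.inf
    for move, child in parent.children.items():
        if child.visits == 0:
            return move                      # explore every move at least once
        v_i = 1.0 - child.total_value / child.visits
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        score = v_i + explore                # the UCT formula above
        if score > best_score:
            best_move, best_score = move, score
    return best_move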
Chapter 3
Method
3.2 Players
The experiment compared 14 different players: 8 Alpha-beta Pruning players with different search depths, 5 Monte Carlo Tree Search players with different numbers of play-outs, and one completely random player as a baseline. The selection of players was based solely on their average thinking times, chosen so that the run time of the experiment did not grow too long. Players with increasing search depth and play-out counts were added until the average thinking time passed 20 seconds. The players are enumerated in table 3.1.
Table 3.1: The players included in the tournament.

Player      Description
Random      Selects a random move every time
ABP-1       Alpha-beta Pruning with max depth 1
ABP-2       Alpha-beta Pruning with max depth 2
ABP-3       Alpha-beta Pruning with max depth 3
ABP-4       Alpha-beta Pruning with max depth 4
ABP-5       Alpha-beta Pruning with max depth 5
ABP-6       Alpha-beta Pruning with max depth 6
ABP-7       Alpha-beta Pruning with max depth 7
ABP-8       Alpha-beta Pruning with max depth 8
MCTS-20     Monte Carlo Tree Search with 20 play-outs
MCTS-200    Monte Carlo Tree Search with 200 play-outs
MCTS-2k     Monte Carlo Tree Search with 2 000 play-outs
MCTS-20k    Monte Carlo Tree Search with 20 000 play-outs
MCTS-200k   Monte Carlo Tree Search with 200 000 play-outs
choosing a move among those which produce the best result, as shown in algorithm 2. The randomness is necessary to avoid game repetition, see sections 4.5 and 5.3.1. As the study is concerned with algorithms that do not use game-specific heuristics, the value of non-terminal leaf nodes used in the search is set to the constant 0, equivalent to a draw.
Algorithm 2 Move selection from state s in the ABP player with depth d_0
1: function MakeAbpMove(s)
2:   v ← {}
3:   for each valid move i from s do
4:     t ← the state after making the move i
5:     v[i] ← −Negamax(t, −∞, ∞, d_0 − 1)
6:   return RandomElement({i : v[i] = max(v)})

Algorithm 3 Move selection from state s in the MCTS player with p play-outs
1: function MakeMctsMove(s)
2:   loop p times
3:     Playout(s)
4:   v ← {}
5:   for each valid move i from s do
6:     t ← the state after making the move i
7:     v[i] ← 1 − W[t]/N[t]
8:   return RandomElement({i : v[i] > (1 − ε) max(v)})
The variables N and W are maintained by the player throughout every game, with N[s] storing the number of times the state s has been visited and W[s] storing the total utility (from 0 to N[s]) to the current player at state s, summed over each time s was visited. As with Alpha-beta Pruning, the move choices are random, but with the addition that moves that are close to, but not equal to, the maximum may also be chosen, controlled by the parameter ε, since the move values produced by Monte Carlo Tree Search are not discrete like those in Alpha-beta Pruning. For the implementation used in the experiment, ε = 0.05 was chosen, since it gave sufficient randomness (see section 5.3.1) without a significant decrease in playing strength.
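As an illustration of this ε-tolerant selection, the final move choice in Algorithm 3 could be sketched in Python as follows; the state interface and the W and N tables are assumed to be maintained as described above, and every child state is assumed to have been visited by at least one play-out.

import random

def make_mcts_move(state, W, N, eps=0.05):
    """Pick a move whose estimated value is within a factor (1 - eps)
    of the best value, choosing at random among those candidates."""
    values = {}
    for move in state.valid_moves():
        t = state.apply(move)
        # W[t] holds the utility accumulated for the player to move at t
        # (the opponent), so the value to the current player is 1 - W/N.
        values[move] = 1.0 - W[t] / N[t]
    best = max(values.values())
    candidates = [m for m, v in values.items() if v > (1 - eps) * best]
    return random.choice(candidates)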
3.3 Tournament
The tournament consisted of 10 rounds, each round comprising 2 games (giving each player the first-move advantage once) per possible match-up of different contestants. In total, 13 · 14 = 182 games were played per round, and the complete tournament ran 1820 games.
After each game, the sequence of moves that were made was recorded, and
after each round, all the match-up results so far and the average thinking time
for each contestant were recorded in a separate file for analysis.
opponent’s last move, followed by a function call asking the contestant for its
move in the resulting position, after which the clock was immediately stopped.
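A minimal Python sketch of this per-move timing could look as follows; the player interface (notify_opponent_move, make_move) is a hypothetical stand-in for the actual function calls used in the experiment.

import time

def timed_move(player, opponent_move):
    """Time a single move: the clock covers informing the contestant of the
    opponent's last move and asking it for its own move, and is stopped as
    soon as the move is returned."""
    start = time.perf_counter()
    player.notify_opponent_move(opponent_move)   # update the contestant's state
    move = player.make_move()                    # contestant searches and answers
    elapsed = time.perf_counter() - start
    return move, elapsed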
Chapter 4
Results
The win rate of a player over a set of games is computed as
$$\text{win rate} = \frac{w + 0.5\,d}{n}, \qquad (4.1)$$
where w is the number of wins, d the number of draws and n the total number of games played.
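For example, a player with 7 wins, 2 draws and 1 loss over n = 10 games has a win rate of (7 + 0.5 · 2)/10 = 0.8.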
The result matrix in figure 4.1 can be divided into 4 sections, each corresponding to either intra-method or inter-method match-up groups.
Figure 4.1: Average game result of the y-axis player against the x-axis player when the y-axis player starts. A win for the y-axis player gives score 1, a draw gives 0.5 and a loss gives 0.
The top left section holds ABP vs ABP match-ups, the top right ABP vs MCTS, the bottom left MCTS vs ABP and the bottom right MCTS vs MCTS.
The win rates in the respective match-up groups show a few over-arching trends. First, greater search depth and a greater number of play-outs consistently result in higher win rates for Alpha-beta Pruning and Monte Carlo Tree Search, respectively, in the intra-method match-ups. For instance, the lower triangles in the intra-method sections have win rates of 0.8 or greater in 24/38 match-ups for Alpha-beta Pruning and 9/10 match-ups for Monte Carlo Tree Search. In the upper intra-method triangles the same relationship holds, with stronger players generally beating weaker players, although there are a few outliers where a weaker algorithm happens to beat a stronger one in a majority of the games.
The inter-method win rates display the same relationship: an increased number of play-outs or an increased search depth generally results in higher win rates. However, the boundary where one method gains an advantage over the other is not as clear, since there are many roughly equal match-ups.
Figure 4.2: The overall win rates and average thinking times of the different players. From left to right, Alpha-beta Pruning players (orange) are displayed in order of increasing search depth and Monte Carlo Tree Search players (green) are displayed in order of increasing number of play-outs.
stronger method for thinking times over 1 second, and indeed MCTS-200k
performs much better than the strongest Alpha-beta Pruning player, despite
having similar thinking times.
Chapter 5
Discussion
would likely result in fewer random moves and a higher chance of adequate moves regardless of whether a guaranteed win is within the search depth or not. Such an implementation would, however, undermine the ability to make generalizations about the relationship between the methods with regard to general game-playing, and so was not within the scope of this report. Nevertheless, there is value in learning what role game-specific heuristics play in the performance of different methods.
5.2 Generalizability
In contrast to Clune's [7] findings on general game-play for Alpha-beta Pruning and Monte Carlo Tree Search, which show that Alpha-beta Pruning performs better with more thinking time in games of branching factor 16, this study's results show the opposite: more thinking time benefits Monte Carlo Tree Search. Clune compared the methods with limits of 2 and 16 seconds of thinking time, where Alpha-beta Pruning outperformed Monte Carlo Tree Search in the latter case, while Monte Carlo Tree Search had the advantage when given less time. Admittedly, different games were used for comparing the methods, suggesting there could be other factors at play causing this discrepancy and warranting further research into scalability in terms of thinking time.
Clune's comparison of the methods across many different games identifies Monte Carlo Tree Search as the favoured method in games with a high branching factor. Adding an additional column to the standard 7x6 board of 2D Connect 4, i.e. increasing the branching factor by 1, reduced Alpha-beta Pruning's advantage over Monte Carlo Tree Search. 3D Connect 4 has a branching factor of 16, and this study's results suggest Monte Carlo Tree Search as the preferred choice of method, adding evidence in favour of one of Clune's findings and further supporting the idea that Monte Carlo Tree Search scales better with branching factor.
and yield no new information. Instead, both methods have been made non-deterministic by the same construction of selecting a random move among all those that the algorithm considers at least close to optimal. The hope was that this would avoid repetition of games, and indeed the results in section 4.5 show that no games were repeated.
5.5 Conclusion
To summarize, the experiment has provided insights into the scalability of Alpha-beta Pruning and Monte Carlo Tree Search with regard to thinking time in the game of 3D Connect 4. Alpha-beta Pruning is the preferred method when thinking time is highly restricted, below 0.1 seconds. On the other
hand, Monte Carlo Tree Search has proven to scale better with increased thinking times and could potentially be the favoured method in time-abundant contexts. It should be noted that the experiment studied players with at most 20 seconds of average thinking time, and that the observed effect of thinking time on the match-up between the two methods contradicts previous research. Therefore, additional research into the relationship between thinking time and the methods' playing strengths would contribute to a better understanding of how the methods compare.
Bibliography
[10] Xiyu Kang, Yiqi Wang, and Yanrui Hu. "Research on Different Heuristics for Minimax Algorithm Insight from Connect-4 Game". In: Journal of Intelligent Learning Systems and Applications 11 (Jan. 2019), pp. 15–31. doi: 10.4236/jilsa.2019.112002.