CZ3005 Artificial Intelligence
Intelligent Agents and Search
Assoc Prof Bo AN
www.ntu.edu.sg/home/boan
Email: boan@ntu.edu.sg
Office: N4-02b-55

Lesson Outline
• How can one describe the task/problem for the agent?
• What are the properties of the task environment for the agent?
• Problem formulation
• Uninformed search strategies
• Informed search strategies: greedy search, A* search
Example: Automated taxi driver
• Percepts: …, radar, …
• Actions: steer, accelerate, brake, horn, display, …
• Goals: safety, reach destination, maximise profits, obey laws, passenger comfort, …
• Environment: Singapore urban

More agent examples
Agent type — Percepts — Actions — Goals — Environment
Medical diagnosis system — Symptoms, findings, patient's answers — Questions, tests, treatments — Healthy patient, minimize costs — Patient, hospital
Satellite image analysis system — Pixels of varying intensity, colour — Print a categorization of scene — Correct categorization — Images from orbiting satellite
Part-picking robot — Pixels of varying intensity — Pick up parts and sort into bins — Place parts in correct bins — Conveyor belt with parts
Properties of the task environment (example: taxi driving)
• Deterministic (vs nondeterministic): the next state of the environment is completely determined by the current state and the actions selected by the agent
  – Deterministic? No. Some cars in front may turn right suddenly
• Episodic (vs sequential): each episode is not affected by previously taken actions
  – Episodic? No. The current action depends on previous driving actions
• Static (vs dynamic): the environment does not change while an agent is deliberating
  – Static? No. While the taxi moves, other cars are moving as well
• Discrete (vs continuous): a limited number of distinct percepts and actions
  – Discrete? No. Speed, distance and fuel consumption take values in continuous domains
Example: Chess
• Accessible? Yes. All positions on the chessboard can be observed
• Deterministic? Yes. The outcome of each move can be determined
• Episodic? No. Each move depends on previous moves
• Static? Yes, without a clock: while you are considering the next move, the opponent cannot move. Semi, with a clock: when time is up, you must give up the move
• Discrete? Yes. All positions and moves are in discrete domains

More Examples
Environment — Accessible — Deterministic — Episodic — Static — Discrete
Chess with a clock — Yes — Yes — No — Semi — Yes
Chess without a clock — Yes — Yes — No — Yes — Yes
Poker — No — No — No — Yes — Yes
Backgammon — Yes — No — No — Yes — Yes
Taxi driving — No — No — No — No — No
Medical diagnosis system — No — No — No — No — No
Image-analysis system — Yes — Yes — Yes — Semi — No
Part-picking robot — No — No — Yes — No — No
Refinery controller — No — No — No — No — No
Interactive English tutor — No — No — No — No — Yes
• Solution:
a path (connecting sets of states) that leads to a set of states all of which are
goal states
Touring problems:
• Traveling Salesperson problem
• “Shortest tour": visit every city exactly once
Search Strategies
• A strategy is defined by picking the order of node expansion.
• Strategies are evaluated along the following dimensions:
  – Completeness: does it always find a solution if one exists?
  – Time complexity: how long does it take to find a solution (the number of nodes generated)?
  – Space complexity: maximum number of nodes in memory
  – Optimality: does it always find the best (least-cost) solution?

Uninformed vs Informed
• Uninformed search strategies: use only the information available in the problem definition
  1. Breadth-first search
  2. Uniform-cost search
  3. Depth-first search
  4. Depth-limited search
  5. Iterative deepening search
• Informed search strategies: use problem-specific knowledge to guide the search; usually more efficient

Breadth-First Search
• Optimal: Yes, when all step costs are equal
• Time and space are the main problems: at depth 14, about 10^14 nodes are generated, roughly 3500 years of computation and 11111 terabytes of memory
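A minimal sketch (not from the slides) of breadth-first search over an assumed adjacency-dict graph; the names `graph`, `start` and `goal` are illustrative.

```python
from collections import deque

def breadth_first_search(graph, start, goal):
    """Return a path from start to goal, or None.
    `graph` maps a node to an iterable of its successors (assumed representation)."""
    frontier = deque([[start]])          # FIFO queue of paths
    explored = {start}
    while frontier:
        path = frontier.popleft()        # shallowest node first
        node = path[-1]
        if node == goal:
            return path
        for succ in graph.get(node, []):
            if succ not in explored:
                explored.add(succ)
                frontier.append(path + [succ])
    return None
```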
Uniform-Cost Search
• To consider edge costs, expand the unexpanded node with the least path cost g
• A modification of breadth-first search: instead of a First-In-First-Out (FIFO) queue, use a priority queue ordered by path cost g(n)
• BFS = UCS with g(n) = Depth(n)
• Here we do not expand nodes that have already been expanded
[Figure: worked example on a small graph with start S, goal G and intermediate nodes A, B, C, showing the order of expansion and the frontier path costs at each step]
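A minimal sketch of the priority-queue idea above, assuming the graph maps each node to (successor, step cost) pairs; names are illustrative.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Expand the frontier node with the least path cost g(n)."""
    frontier = [(0, start, [start])]      # priority queue ordered by g
    expanded = set()
    while frontier:
        g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in expanded:              # do not re-expand nodes
            continue
        expanded.add(node)
        for succ, cost in graph.get(node, []):
            if succ not in expanded:
                heapq.heappush(frontier, (g + cost, succ, path + [succ]))
    return None
```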
Depth-First Search
Denote m: maximum depth of the state space
• Complete: No
  – infinite-depth spaces: No
  – finite-depth spaces with loops: No
  – with repeated-state checking: Yes
  – finite-depth spaces without loops: Yes
• Time: O(b^m)
• Space: O(bm)
• Optimal: No
• If solutions are dense, may be much faster than breadth-first search

Depth-Limited Search
To avoid infinite searching: depth-first search with a cutoff l on the maximum depth of a path
• Complete: Yes, if l ≥ d
• Time: O(b^l)
• Space: O(bl)
• Optimal: No
Iterative Deepening Search
Iteratively increase the max depth l of DLS one step at a time

function ITERATIVE-DEEPENING-SEARCH(problem) returns a solution sequence
  inputs: problem, a problem
  for depth ← 0 to ∞ do
    if DEPTH-LIMITED-SEARCH(problem, depth) succeeds then return its result
  end
  return failure

[Figure: the search trees explored for Limit = 0, 1, 2, 3]

• Complete: Yes
• Time: O(b^d)
• Space: O(bd)
• Optimal: Yes
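A minimal Python sketch of DEPTH-LIMITED-SEARCH and the iterative-deepening loop above, assuming the same adjacency-dict graph representation as earlier; the finite `max_depth` bound stands in for the unbounded loop in the pseudocode.

```python
def depth_limited_search(graph, node, goal, limit, path=None):
    """Depth-first search with a cutoff `limit` on the remaining depth."""
    path = [node] if path is None else path
    if node == goal:
        return path
    if limit == 0:
        return None                       # cutoff reached
    for succ in graph.get(node, []):
        if succ not in path:              # simple loop check along the current path
            result = depth_limited_search(graph, succ, goal, limit - 1, path + [succ])
            if result is not None:
                return result
    return None

def iterative_deepening_search(graph, start, goal, max_depth=50):
    """Run DLS with depth 0, 1, 2, ... until a solution is found."""
    for depth in range(max_depth + 1):
        result = depth_limited_search(graph, start, goal, depth)
        if result is not None:
            return result
    return None                           # treated as failure
```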
Informed (Heuristic) Search
• Useful but potentially fallible (heuristic)
• Heuristic functions are problem-specific
• Example: route-finding in Romania, using the straight-line distance to Bucharest h_SLD(n) as the heuristic
[Figure: map of Romania with road distances, and the table of straight-line distances to Bucharest: Arad 366, Bucharest 0, Craiova 160, Dobreta 242, Eforie 161, Fagaras 176, Giurgiu 77, Hirsova 151, Iasi 226, Lugoj 244, Mehadia 241, Neamt 234, Oradea 380, Pitesti 98, Rimnicu Vilcea 193, Sibiu 253, Timisoara 329, Urziceni 80, Vaslui 199, Zerind 374]
Example: greedy search from Arad to Bucharest (always expand the node with the smallest h_SLD)
[Figure: (b) After expanding Arad: Sibiu 253, Timisoara 329, Zerind 374. (c) After expanding Sibiu: Arad 366, Fagaras 176, Oradea 380, Rimnicu Vilcea 193]
[Figure: (d) After expanding Fagaras: Sibiu 253, Bucharest 0 — the goal Bucharest is reached, giving the greedy path Arad–Sibiu–Fagaras–Bucharest]

Is Greedy Search Complete?
• Question: is this approach complete?
• Example: find a path from Iasi to Fagaras
[Figure: the Romania map; greedy search from Iasi gets stuck and never reaches Fagaras]
• Answer: No
Greedy Search: Properties
Denote m: maximum depth of the search space
• Complete: No
• Time: O(b^m)
• Space: O(b^m) (keeps all nodes in memory)
• Optimal: No

A* Search
• Uniform-cost search
  – g(n): cost to reach n (past experience)
  – optimal and complete, but can be very inefficient
• Greedy search
  – h(n): cost from n to goal (future prediction)
  – neither optimal nor complete, but cuts the search space considerably
• A* evaluation function: f(n) = g(n) + h(n)
  – f(n): estimated total cost of the path through n to the goal (whole life)
  – If g = 0: greedy search; if h = 0: uniform-cost search
• function A*-SEARCH(problem) returns a solution
  – return BEST-FIRST-SEARCH(problem, g + h)
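A minimal sketch of best-first search with f = g + h, under the same assumed graph representation as the earlier sketches; `h` is any heuristic function supplied by the caller, and all names are illustrative.

```python
import heapq

def a_star_search(graph, start, goal, h):
    """Best-first search ordered by the evaluation function f(n) = g(n) + h(n)."""
    frontier = [(h(start), 0, start, [start])]   # ordered by f = g + h
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for succ, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(succ, float("inf")):   # keep the cheapest path found so far
                best_g[succ] = new_g
                heapq.heappush(frontier, (new_g + h(succ), new_g, succ, path + [succ]))
    return None
```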
Example: route-finding from Arad to Bucharest — best-first search with evaluation function g + h
[Figure: (a) the initial state: Arad 366 = 0 + 366. (b) After expanding Arad: Sibiu 393 = 140 + 253, Timisoara 447 = 118 + 329, Zerind 449 = 75 + 374. (c) After expanding Sibiu: Arad 646 = 280 + 366, Oradea 671 = 291 + 380, Fagaras 415 = 239 + 176, Rimnicu Vilcea 413 = 220 + 193]
[Figure: (d) After expanding Rimnicu Vilcea: Craiova 526 = 360 + 166, Pitesti 415 = 317 + 98, Sibiu 553 = 300 + 253. (e) After expanding Fagaras: Sibiu 591 = 338 + 253, Bucharest 450 = 450 + 0. After expanding Pitesti: Bucharest 418 = 418 + 0, Craiova 615 = 455 + 160, Rimnicu Vilcea 607 = 414 + 193. A* returns the path Arad–Sibiu–Rimnicu Vilcea–Pitesti–Bucharest with cost 418]

Example: route-finding in Manhattan
[Figure: a grid of avenues (2nd to 10th Ave) and streets (50th to 53rd St) with start S and goal G; the Manhattan-distance heuristic estimates the remaining distance, and the shortest path has length 8]
[Figures: the same grid searched by Greedy search (frontier ordered by the Manhattan-distance heuristic h), by Uniform-Cost Search (ordered by path cost g), and by A* (ordered by f = g + h)]

Complexity of A*
• Time: exponential in the length of the solution
• Space: exponential in the length of the solution (all generated nodes are kept in memory)
• With a good heuristic, significant savings are still possible compared to uninformed search methods
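An illustrative helper (not from the slides) for the heuristic used in the Manhattan example, assuming grid nodes are (avenue, street) pairs.

```python
def manhattan_distance(node, goal):
    """Heuristic for grid route-finding: |Δ avenue| + |Δ street|."""
    (ax, ay), (gx, gy) = node, goal
    return abs(ax - gx) + abs(ay - gy)

# Greedy search would order the frontier by manhattan_distance(n, goal) alone,
# UCS by the path cost g(n) alone, and A* by their sum g(n) + manhattan_distance(n, goal).
```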
CZ3005 Artificial Intelligence
Constraint Satisfaction and Adversarial Search
Assoc Prof Bo AN
www.ntu.edu.sg/home/boan
Email: boan@ntu.edu.sg
Office: N4-02b-55

Lesson Outline
• Constraint Satisfaction
• Adversarial Search (Game Playing)
• Goal test: a set of constraints specifying allowable combinations of values for subsets of variables

Example: cryptarithmetic
• Variables: D, E, M, N, O, R, S, Y
• Domains: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
• Constraints:
  – Y = D + E or Y = D + E − 10, etc.
  – D ≠ E, D ≠ M, D ≠ N, etc.
  – M ≠ 0, S ≠ 0 (unary constraints: they concern the value of a single variable)

Example: 8-queens
• Goal test: no two queens in the same row, column or diagonal
Example: map colouring of Australia
• Regions: WA (Western Australia), NT (Northern Territory), Q (Queensland), NSW (New South Wales), V (Victoria), SA (South Australia), T (Tasmania)
[Figures: the map of Australia, its constraint graph, and the table of partial assignments over WA, NT, Q, NSW, V, SA, T explored during backtracking]
Most Constraining Variable
• Select the variable that is involved in the largest number of constraints on unassigned variables, to reduce the branching factor on future choices
• Example: in map colouring, choose SA first

Least Constraining Value
• Choose the value that leaves maximum flexibility for subsequent variable assignments
• Example: in map colouring, prefer the colour that still allows 1 value for SA over the colour that allows 0 values
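A minimal backtracking sketch for the Australia map-colouring CSP, combining the most-constraining-variable and least-constraining-value heuristics just described; the data structures and function names are illustrative assumptions, not from the slides.

```python
# Adjacency of the Australian regions (Tasmania has no neighbours).
NEIGHBOURS = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
}
COLOURS = ["red", "green", "blue"]

def consistent(var, value, assignment):
    """A value is consistent if no assigned neighbour already uses it."""
    return all(assignment.get(n) != value for n in NEIGHBOURS[var])

def select_variable(assignment):
    # Most constraining variable: most constraints on still-unassigned variables.
    unassigned = [v for v in NEIGHBOURS if v not in assignment]
    return max(unassigned,
               key=lambda v: sum(1 for n in NEIGHBOURS[v] if n not in assignment))

def order_values(var, assignment):
    # Least constraining value: prefer values that rule out the fewest
    # options for the unassigned neighbours of var.
    def conflicts(value):
        return sum(1 for n in NEIGHBOURS[var]
                   if n not in assignment and consistent(n, value, assignment))
    return sorted(COLOURS, key=conflicts)

def backtrack(assignment=None):
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(NEIGHBOURS):
        return assignment
    var = select_variable(assignment)
    for value in order_values(var, assignment):
        if consistent(var, value, assignment):
            result = backtrack({**assignment, var: value})
            if result is not None:
                return result
    return None

print(backtrack())   # prints a complete consistent colouring (SA is assigned first)
```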
Game Playing (example: tic-tac-toe)
• Operators: legal moves
• Terminal test: determines when the game is over
  – states where the game has ended: terminal states
• Utility function (payoff function): returns a numeric score to quantify the outcome of a game
  – the utility value of a terminal state is from the point of view of MAX
[Figure: part of the tic-tac-toe game tree, with MAX playing X and MIN playing O; terminal utilities are −1, 0, +1]
What Search Strategy? Minimax Search Strategy
• Search strategy: find a sequence of moves that leads to a terminal state (goal)
• Minimax search strategy: maximise one's own utility and minimise the opponent's
  – Assumption: the opponent does the same
[Figure: one-ply and two-ply game trees rooted at MAX node A, with children B (8), C (3), D (−2) and, in the two-ply tree, leaves E (9), F (−6), G (0), H (0), I (−2), J (−4), K (−3)]
• Minimax procedure (bottom-up):
  – Determine the best utility of the parents of the terminal states
  – Repeat the process for their parents until the root is reached
  – Select the best move (i.e. the move with the highest utility value)
• MAX tries to maximise the utility, assuming that MIN will try to minimise it
• Choose the move with the best achievable payoff against best play
[Figure: a two-ply tree with moves A11, A12, A13, A21, A22, A23, A31, A32, A33 and terminal values 3 12 8, 2 4 6, 14 5 2; the minimax value of the root is 3]
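A minimal recursive sketch of the minimax procedure above; the `game` object with `terminal`, `utility` and `successors` methods is an assumed interface for illustration, not part of the slides.

```python
def minimax_value(state, maximizing, game):
    """Minimax: MAX maximises utility and MIN minimises it, assuming best play."""
    if game.terminal(state):
        return game.utility(state)        # utility is from MAX's point of view
    values = [minimax_value(s, not maximizing, game) for s in game.successors(state)]
    return max(values) if maximizing else min(values)

def best_move(state, game):
    """Choose MAX's successor state with the highest minimax value."""
    return max(game.successors(state),
               key=lambda s: minimax_value(s, False, game))
```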
Example: Othello
• 'X' plays first
• A player can place a new piece in a position if there exists at least one straight (horizontal, vertical, or diagonal) occupied line between the new piece and another piece of the same kind, with one or more contiguous pieces from the opponent in between
• After placing the new piece, those opponent pieces are captured and become pieces of the same player
• The player with the most pieces on the board wins
[Figure: a sequence of small boards showing the positions X and O consider as the game progresses]
• An evaluation function for a position can be computed as
  3(Xc − Oc) + 2(Xb − Ob) + (Xm − Om)
  where Xc is the number of X's at corners, Xb is the number of X's at the border (excluding corners), and Xm counts the remaining X's (Oc, Ob, Om are defined analogously for O)
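A small sketch of this evaluation function, assuming an 8x8 board stored as a list of lists containing 'X', 'O' or None; treating Xm/Om as the non-corner, non-border squares is an assumption where the slides are not explicit.

```python
def othello_eval(board):
    """Weighted piece count 3(Xc - Oc) + 2(Xb - Ob) + (Xm - Om), positive favouring X."""
    n = len(board)
    corners = {(0, 0), (0, n - 1), (n - 1, 0), (n - 1, n - 1)}
    score = 0
    for r in range(n):
        for c in range(n):
            piece = board[r][c]
            if piece not in ("X", "O"):
                continue
            if (r, c) in corners:
                weight = 3                 # corner squares
            elif r in (0, n - 1) or c in (0, n - 1):
                weight = 2                 # border squares, excluding corners
            else:
                weight = 1                 # all remaining squares (assumed meaning of Xm/Om)
            score += weight if piece == "X" else -weight
    return score
```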
[Figure: a small Othello game tree with MAX (X) and MIN (O) levels and backed-up evaluation values]

Quiescence
[Figure: a board position with Black to move and White about to lose]
• Expansion of non-quiescent positions continues until quiescent positions are reached
Markov Decision Processes
• Policy π(s): the action that an agent takes in any given state
  – The "solution" to an MDP
[Figure: an example MDP with states s0, s1, s2, a "Quit" option worth $100, $1,100 or $11,100, and its transition model; the optimal policy π* is s0 → a1, s1 → a0, s2 → a0]
Policy Iteration (PI)
• Start from a policy π_i, estimate V^{π_i} for all states based on this policy, then improve the policy; iterate this process until convergence (like VI)
• Steps of PI
  1. Initialization
  2. Policy Evaluation (calculate V)
  3. Policy Improvement (calculate the policy π)
• Policy Evaluation (step 2):
  a. Repeat
  b.   Δ ← 0
  c.   For each state s:
  d.     v ← V(s)
  e.     V(s) ← Σ_{s'} p(s' | s, π(s)) [R(s, π(s), s') + γ V(s')]
  f.     Δ ← max(Δ, |v − V(s)|)
  g. until Δ < ω (a small positive number)
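A minimal tabular sketch of the policy-evaluation loop above; the dictionary-based representation of p and R and all names are assumptions made for illustration.

```python
def policy_evaluation(states, p, r, policy, gamma=0.9, omega=1e-6):
    """Iterative policy evaluation for a fixed policy pi (step 2 of PI).
    p[(s, a, s2)] is the transition probability and r[(s, a, s2)] the reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            a = policy[s]
            V[s] = sum(p.get((s, a, s2), 0.0) * (r.get((s, a, s2), 0.0) + gamma * V[s2])
                       for s2 in states)
            delta = max(delta, abs(v - V[s]))
        if delta < omega:                  # small positive threshold, as in the pseudocode
            return V
```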
• Example: suppose the optimal values are V*(s0) = 8.03, V*(s1) = 11.2, V*(s2) = 8.9
  – We can easily calculate the optimal policy from these values. Can you try it?
  – Answer: π*(s0) = a1, π*(s1) = a0, π*(s2) = a0

Collaborative AI
• How can AI agents learn to recognise someone's intent (that is, what they are trying to achieve)?
• How can AI agents learn what behaviours are helpful when working toward a common goal?
• How can they coordinate or communicate with another agent to agree on a shared strategy for problem-solving?
Reinforcement Learning
[Diagram: Experience → learning → Policy/Value]
First-Visit Monte Carlo Policy Evaluation
• Average returns only for the first time s is visited in an episode
• Algorithm (a Python sketch follows below):
  – Initialize:
    • π ← policy to be evaluated
    • V ← an arbitrary state-value function
    • Returns(s) ← an empty list, for all states s
  – Repeat many times:
    • Generate an episode using π
    • For each state s appearing in the episode:
      – R ← return following the first occurrence of s
      – Append R to Returns(s)
      – V(s) ← average(Returns(s))

Monte Carlo in RL: Control
• Now we have the value function of all states given a policy
• We need to improve the policy to make it better
• Policy Iteration
  – Policy evaluation
  – Policy improvement
• However, we need to know how good an action is
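A minimal sketch of the first-visit algorithm above; `generate_episode` is an assumed helper that returns one episode as (state, reward-that-follows) pairs, and discounting is omitted for brevity.

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, policy, num_episodes=1000):
    """First-visit Monte Carlo policy evaluation (sketch)."""
    returns = defaultdict(list)
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode(policy)     # list of (state, reward) pairs
        G = 0.0
        first_visit_returns = {}
        # Walk backwards so G is the return following each step; repeated states are
        # overwritten until the value for the first occurrence remains.
        for state, reward in reversed(episode):
            G = reward + G                     # undiscounted return (assumption)
            first_visit_returns[state] = G
        for state, G in first_visit_returns.items():
            returns[state].append(G)
            V[state] = sum(returns[state]) / len(returns[state])
    return V
```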
Q-value and Deep Q-Network (DQN)
[Diagram: state s → network with weights w_q → estimated q(s, a, w_q) for each action a]
• Input state s is a stack of raw pixels from the last 4 frames
• Output is q(s, a) for each of the 18 buttons
• Reward is the change in score for that step
• Pong video: https://www.youtube.com/watch?v=PSQt5KGv7Vk
• Beats humans on many games
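A minimal PyTorch sketch of such a network; the 84x84 input resolution and the layer sizes follow the common Atari DQN setup and are assumptions, not taken from the slides.

```python
import torch.nn as nn

class DQN(nn.Module):
    """Input: a stack of the last 4 frames; output: q(s, a) for each of 18 buttons."""
    def __init__(self, num_actions=18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 feature map for 84x84 input
            nn.Linear(512, num_actions),
        )

    def forward(self, x):          # x: (batch, 4, 84, 84) stack of raw frames
        return self.head(self.features(x))
```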
What is Game Theory?
• The study of strategic decision-making among multiple interacting players
• Used in games, biology, and many other areas

Example: Rock-Paper-Scissors (payoffs to Player A, Player B)
            Rock      Paper     Scissors
Rock        0, 0      -1, 1     1, -1
Paper       1, -1     0, 0      -1, 1
Scissors    -1, 1     1, -1     0, 0

Nash Equilibrium
• A strategy profile from which no player can gain by unilaterally deviating
• Rock-Paper-Scissors has no pure-strategy Nash equilibrium; in the mixed Nash equilibrium each player plays Rock, Paper and Scissors with probability 1/3 each
Matching Pennies
• Players i and j simultaneously choose the face of a coin, either "heads" or "tails".
• If they show the same face, then i wins; if they show different faces, then j wins.
• The payoff matrix (payoffs to i, j):
              heads      tails
  heads       1, -1      -1, 1
  tails       -1, 1      1, -1

Mixed Strategies for Matching Pennies
• NO pair of strategies forms a pure-strategy NE: whatever pair of strategies is chosen, somebody will wish they had done something else.
• The solution is to allow mixed strategies:
  – play "heads" with probability 0.5
  – play "tails" with probability 0.5
• This is a NE strategy.
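A small numeric check (not from the slides) that the 0.5/0.5 mixed strategies are an equilibrium: against the opponent's mix, both pure strategies give the same expected payoff, so no unilateral deviation helps.

```python
import numpy as np

# Payoffs to player i in matching pennies (rows: i plays H/T, cols: j plays H/T);
# player j's payoffs are the negation, since the game is zero-sum.
U_i = np.array([[ 1, -1],
                [-1,  1]])

p = np.array([0.5, 0.5])   # i's mixed strategy over {heads, tails}
q = np.array([0.5, 0.5])   # j's mixed strategy

print(U_i @ q)             # [0. 0.]: i is indifferent between heads and tails
print(p @ U_i @ q)         # 0.0: expected payoff to i at the mixed NE
```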
Games with complete information: Chess, Go
Games with incomplete information: Negotiation, Poker
Texas Hold'em Poker
• Flop: three public cards are dealt, followed by a betting round
• Turn: a fourth public card is dealt, followed by a betting round
• River: a last public card is dealt, followed by a betting round
• The game ends when
  – only one player is left (all the other players fold), or
  – there is a showdown: the hand with the best 5 cards, using both the two private cards and the five public cards, wins

Architecture of Libratus
• Equilibrium finding (offline): CFR, CFR+, Monte Carlo CFR
• Subgame decomposition and refinement (online): endgame solving, subgame re-solving, max-margin subgame refinement
Security Games
• Security allocation
  – Target weights
  – Opponent reaction
• Stackelberg game: security forces commit first
  – Optimal allocation: weighted random
  – Solution concept: Strong Stackelberg Equilibrium
• Key challenges: limited resources, surveillance
• Example payoff matrix (defender, attacker):
                         Attacker: Target #1    Attacker: Target #2
  Defender: Target #1    4, -3                  -1, 1
  Defender: Target #2    -5, 5                  2, -1
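A small sketch (not from the slides) of how the defender's optimal mixed commitment for the 2x2 example above could be found by brute force; the grid search and the tie-breaking rule are assumptions reflecting the Strong Stackelberg Equilibrium just mentioned.

```python
import numpy as np

# Payoffs from the example matrix: rows = target the defender covers,
# cols = target the attacker attacks; one matrix per player.
def_pay = np.array([[ 4, -1],
                    [-5,  2]])
att_pay = np.array([[-3,  1],
                    [ 5, -1]])

best = None
for p in np.linspace(0, 1, 10001):        # p = probability of covering Target #1
    cover = np.array([p, 1 - p])
    att_util = cover @ att_pay            # attacker's expected utility per target
    def_util = cover @ def_pay            # defender's expected utility per target
    # Strong Stackelberg assumption: the attacker best-responds,
    # breaking ties in the defender's favour.
    targets = np.flatnonzero(np.isclose(att_util, att_util.max()))
    value = def_util[targets].max()
    if best is None or value > best[0]:
        best = (value, p)

print(best)   # for this matrix, the defender covers Target #1 with probability ~0.6
```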
Game Theory for Security: Applications
• Game Theory + Optimization + Uncertainty + Learning + …
• Deployed applications include: LAX, TSA, Coast Guard, LA Sheriff, USC, green security (Panthera/WWF), opportunistic crime, cyber security games, Argentina airport, Chile border, India

IRIS: Federal Air Marshals Service [2009]
• Scale up the number of defender strategies
• 1000 flights, 20 air marshals: 10^41 combinations
  – ARMOR runs out of memory
• Do not enumerate all combinations
  – Branch and price: branch & bound + column generation
[Figure: strategy matrices illustrating column generation over defender strategies]
PROTECT: Randomized Patrol Scheduling [2011]
• Coordination (scale-up) and ferries (continuous space/time)
[Figure: a patrol area with targets t1–t7]

PAWS: Protection Assistant for Wildlife Security — Trials in Uganda and Malaysia [2014]
• Important lesson: geography!
[Photos: Uganda (Andrew Lemieux), Malaysia (Panthera)]

PAWS deployed in 2015 in Southeast Asia (with Panthera and WWF)
• PAWS Version 2 features
  – Street map
  – Ridgelines, rivers/streams
[Photos: Indonesia, Malaysia]