Abstract—This paper presents a novel use of motif-finding techniques from computational biology to find recurring action sequences across many observations of expert humans carrying out a complex task. Information about recurring action sequences is used to produce a behavior tree without any additional domain information besides a simple similarity metric – no action models or reward functions are provided. This technique is applied to produce a behavior tree for strategic-level actions in the real-time strategy game StarCraft. The behavior tree was able to represent and summarise a large amount of information from the expert behavior examples in a much more compact form. The method could still be improved by discovering reactive actions present in the expert behavior and encoding these in the behavior tree.

I. INTRODUCTION

An ongoing challenge in Artificial Intelligence (AI) is to create problem-solving agents that are able to carry out some task by selecting a series of appropriate actions that lead from a starting state to a goal – the field of planning. Ideally these agents could be applied to the many practical problems that require a sequence of actions in order to carry out a task, such as robotic automation, game playing, and autonomous vehicles. However, applying a classical planning agent to a new domain typically requires significant knowledge engineering effort [1]. It would be preferable if domain knowledge could be learned automatically from examples, but current automated planning systems capable of learning domain knowledge are generally designed to operate under strong assumptions that do not hold in complex domains [2]. Conversely, case-based planning systems capable of acquiring domain knowledge can make few assumptions about the domain, but can have difficulty reacting to failures or exogenous events [3].

In many potential application areas, a planner capable of transitioning from any starting state to any goal state is not actually required; instead it is sufficient or even desirable to have an agent capable of robustly carrying out a specific task or behavior. For example, in game playing there is usually a very similar starting state and goal for each match or activity within a game – in board games this is the starting board layout and the object of the game, and in video games this could be the starting and win conditions of a match or the daily activities of a non-player character. In the genre of real-time strategy games, video game industry developers tend to use scripting and finite state machines instead of more complex approaches because those techniques are well-understood, easy to customise, and are sufficient to produce the desired behavior [4], [5].

Since their introduction by the video game industry in 2005 [6], Behavior Trees (BTs) have become increasingly common in the industry for encoding agent behavior [4], [7], [8]. They have been used in major published games [6] and they are supported by major game engines such as Unity¹, Unreal Engine², and CryEngine³. BTs are hierarchical goal-oriented structures that appear somewhat similar to Hierarchical Task Networks (HTNs), but instead of being used to dynamically generate plans, BTs are static structures used to store and execute plans [4], [9]. This is a vital advantage for game designers because it allows them fine control over agent behavior by editing the BT, while still allowing complex behavior and behavior reuse through the hierarchical structure [4], [9]. Although they have a fixed structure, BTs produce reactive behaviour through the interaction of conditional checks and success and failure propagation within the hierarchy. Various types of nodes (discussed further in section V) can be composed to produce parallel or sequential behavior, or to choose amongst different possible behaviors based on the situation [9].

¹ Unity — Behavior Designer: https://www.assetstore.unity3d.com/en/#!/content/15277
² Unreal Engine — Behavior Trees: https://docs.unrealengine.com/latest/INT/Engine/AI/BehaviorTrees/
³ CryEngine — Modular Behavior Tree: http://docs.cryengine.com/display/SDKDOC4/Modular+Behavior+Tree

We are creating a system that can automatically learn domain knowledge from examples of expert behavior, with few assumptions about the domain, and can react quickly to changes in state during execution. This would combine some of the benefits of learning systems in automated planning and case-based planning. Instead of learning a set of planning operators, we aim to automatically learn to carry out a single complex task within a domain, creating a less flexible but still widely applicable planning system. The learned knowledge will be represented and acted upon in the form of a BT, which is ideal for a single task within a domain. Furthermore, the resulting BT can be hand-customised, so this approach could be used as an initial step, followed by human refinement, in the process of defining new behavior for an agent.
In the remainder of this paper we start by outlining related work on automatically learning planning knowledge in the form of HTNs, case-based planners, and BTs. We concretely define the challenging problem of learning a task from observations of expert behaviour, and outline the domain of the Real-Time Strategy (RTS) game StarCraft as our motivating example. We then present our approach to the first part of the learning system: using a motif-finding technique to find and collapse repeated patterns of actions. We present some results from the current system and discuss its limitations. Finally, we discuss potential future directions and conclude the paper.
II. RELATED WORK

Early automated planning systems such as STRIPS [10] made strong assumptions about the domain in order to operate, such as a fully observable, deterministic world that changes only due to agent actions, and actions that are sequential and instantaneous, with known preconditions and effects. More recent work has aimed to make planning more practically applicable by automatically learning action models [1], [11]–[13], task models [14], [15], or hierarchical structure [16], [17]. Some work also expands the applicability of planners by relaxing assumptions from classical planning, addressing learning with nondeterminism [18], [19], partial observability [13], [20]–[22], or durative actions [2]. Almost all of this work on learning in automated planning learns by observing plan executions, including observations of the world state, as carried out by an external expert, allowing the learner to get good coverage of the common cases in what could be a huge (or infinite) space of possible actions and observations. All of this work still requires strong assumptions about the domain, or provided domain knowledge, usually in the form of accurate action models.

An alternative approach to learning from examples of expert behavior is case-based planning, which finds solution action sequences by retrieving and adapting previously-encountered solutions to similar problems. Unlike learning in automated planning, which focuses on acquiring logical building blocks for the planner, case-based planners learn associations between initial states and partial or complete plans as solutions. Case-based planners can operate with very little domain knowledge and few assumptions about the domain, but because they do most of the processing at runtime (for the retrieval and adaptation parts of the case-based planning process), they can have efficiency issues when problems have large case bases or time constraints [23]. Case-based planning also faces difficulty in adapting solutions to particular circumstances – long solutions may react slowly to unexpected outcomes during execution, while short solutions may react excessively to small differences in state or have difficulty reasoning about action ordering [3], [24]. A possible remedy to these issues is to introduce conditional checks and hierarchical structure into cases [3], [23], [24].

Other work has examined building probabilistic behavior models from examples using Hidden Markov Models [25] and Bayesian Models [26]. These approaches require very little domain knowledge and are capable of recognising or predicting plans. However, they are not designed to be used for creating plans – their predictions could be extrapolated into a plan, but this would likely lead to increasing error and cyclic behavior. There are also task-learning methods based on explanation-based learning [15], in which the agent explores the domain but also interacts with a human teacher in order to learn. This requires an action model for the exploration phase, and a human operator with domain knowledge in the interaction phase.

Instead of learning from examples, some work has used genetic algorithms to evolve BTs in an exploratory process [27], [28]. These approaches hold promise but can become prohibitively computationally expensive for complex domains. They also require the addition of a fitness function for evaluating evolved BTs. To the best of the authors' knowledge, no prior work has investigated automatically building BTs from examples of expert behavior.

Probably the work most closely related to ours involves automatically learning domain-specific planners from example plans [23]. These domain-specific planners are static structures for solving specific planning problems, and are made up of programming components such as loops and conditionals, combined with planning operators. The system is provided with accurate action models in order to build the plans, and implicitly assumes fully observable, deterministic domains.

III. PROBLEM

We propose a problem definition that relaxes the assumptions of the classical planning restricted model (as defined in [29]) in order to more closely reflect the real world. In this problem there is a potentially infinite number of states, which may be partial observations of the complete system. The system may be nondeterministic and may change without agent actions. Actions may occur in parallel, may have a duration, and need not occur at fixed intervals. A policy is learned instead of action models or a plan library, in order to allow robust reactive behaviour in a dynamic environment without expensive replanning [15].

However, we do restrict the problem to learning to carry out a single task or achieve a single goal that is being carried out in the examples, instead of the more general automated planning requirement of being able to form a plan for any specified goal. This reduces the burden on the learner so that it is not forced to depend upon accurate action models for these complex domains. Thus, we define the problem of learning a single task by observation:

Given a set of examples of experts carrying out a single high-level task, {E_1, E_2, ..., E_n}
    Where an example is a sequence of cases ordered by time, E_i = (C_i1, C_i2, ..., C_im)
    Where a case is an observation and action pair, C_ij = (O_ij, A_ij)
    Where an observation and an action are arbitrary information available to the agent (e.g. a key-value mapping)
Given a similarity metric between pairs of observations and pairs of actions, M(O_ij, O_kl) ∈ [0, 1] and M(A_ij, A_kl) ∈ [0, 1]
Find a policy that will decide the next action given the previous cases and the current observation, π((C_i1, C_i2, ..., C_i(j-1)), O_ij) → A_ij
    This policy should be able to reproduce the input action sequences and generalise well to unseen action sequences
    This policy should have low run-time cost for selecting actions, so that it is applicable for embedded or real-time applications

Note that no information is given about the preconditions or effects of actions, or any conceptual reasoning or task structure behind groups of cases. There is also limited information about failure, as possible actions considered but unused by experts will not be observed, and subsequences of actions which had a negative outcome are observed just like other actions. Experts are assumed to have made appropriate actions, but there may not be one optimal action for a given situation.
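To make the shape of the inputs and the learning target concrete, the following is a minimal sketch of the definition above in Python. It is illustrative only: the names (Case, Example, similarity, Policy) are our own, and the overlap-counting metric is merely a stand-in for whatever domain-appropriate metric M is actually supplied.

```python
from dataclasses import dataclass
from typing import Protocol

# Observations and actions are arbitrary information available to the
# agent; per the definition above, a key-value mapping is one option.
Observation = dict[str, object]
Action = dict[str, object]

@dataclass
class Case:
    """A case C_ij = (O_ij, A_ij): an observation-action pair."""
    observation: Observation
    action: Action

# An example E_i is a sequence of cases ordered by time.
Example = list[Case]

def similarity(a: dict, b: dict) -> float:
    """Stand-in similarity metric M in [0, 1]: fraction of shared key-value pairs."""
    if not a and not b:
        return 1.0
    shared = sum(1 for k in a if k in b and a[k] == b[k])
    return shared / max(len(a), len(b))

class Policy(Protocol):
    """The learning target pi: map the previous cases of an episode and the
    current observation to the next action."""
    def __call__(self, history: list[Case], observation: Observation) -> Action: ...
```

Nothing in this sketch encodes preconditions, effects, or task structure: matching the definition, the learner sees only sequences of cases and the similarity metric.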
IV. THE DOMAIN

The domain motivating our problem is the Real-Time Strategy (RTS) video game StarCraft. RTS games are essentially simplified military simulations, in which players indirectly control many units to gather resources, build infrastructure and armies, and manage units in battle against an opponent. RTS games present some of the toughest challenges for AI agents, making them a difficult area for developing competent AI [30]. It is a particularly attractive area for AI research because of how quickly human players can become adept at dealing with the complexity of the game, with experienced humans outplaying even the best agents from both industry and academia [31].

StarCraft is a very popular RTS game which has recently been increasingly used as a platform for AI research [5]. Due to the popularity of StarCraft, there are many expert players available to provide knowledge and examples of play, and it also has the advantage of the Brood War Application Programming Interface (BWAPI), which provides a way for external code to programmatically query the game state and execute actions as if it were a player in a match. In terms of complexity, StarCraft has real-time constraints, hidden information, minor nondeterminism, long-term goals, multiple levels of abstraction and reasoning, a vast space of actions and game states, durative actions, and long-term action effects [3], [30]–[32]. In order to make the domain slightly more manageable, we have chosen to deal with only the strategic-level actions: build, train, morph, research, and upgrade actions. We also assume that only successfully executed actions are shown, not all inputs from the human, because in StarCraft most professional players very rapidly repeat action inputs until they are executed in order to make actions execute as soon as possible.
V. BEHAVIOR TREES

As mentioned earlier in the paper, Behavior Trees (BTs) are being used to represent and act upon the knowledge learned by our system, so this section provides a short overview of BTs. BTs have a hierarchical structure in which the top levels generally represent abstract tasks, and subtrees represent different subtasks and behavior for achieving each task. Deeper subtrees represent increasingly-specific behaviors, and leaf nodes represent conditions and primitive actions that interact with the agent's environment (Fig. 1). Although conceptually represented as trees, it is common for task behaviors to be reused at different places in the tree, so the resulting structure is really a directed acyclic graph [6].

[Figure: a root selector (1) with children: a sequence (2) over an action (3), a decorator "Return Failure" (4), and an unexecuted action (*); a parallel node (5) over two actions (6, 6); and an unexecuted action (*).]
Fig. 1. An example BT, showing the order in which each node would execute. Asterisks indicate nodes which are not executed. Execution begins at the root selector node. Next the sequence node begins execution – assuming the leftmost child is selected first – and executes its children until a failure is returned by the decorator node. The sequence node returns a failure and the selector node executes its next child. The parallel node executes both children simultaneously and successfully returns, allowing the selector to return successfully.

Execution of a BT is essentially a depth-first traversal of the directed graph structure, but there are four main non-leaf node types that control the flow of execution in a BT: sequence, selector, parallel, and decorator nodes [7]. Each node type has a different effect on the execution of its children, and responds differently to failures reported by its children. Sequence nodes run their children in sequence, and usually return with a failure status if any of their children fail. Selector nodes run their children in a priority order, switching to the next child if one of their children fails, and usually return with a success status if any of their children succeed. Selector nodes may alternatively be set to cancel the execution of a child if a higher-priority child becomes executable. Parallel nodes run all their children in parallel, and usually return with a success status if a certain number of their children succeed, or a failure status if a certain number of their children fail. Finally, decorator nodes add extra modifiers or logical conditions to other nodes, for example always returning a success status, or executing a node only when it has not run before. The specific behavior and even the types of nodes can vary depending on the needs of the user.
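These node semantics are concise enough to sketch directly. The following is a minimal, illustrative Python rendering of the four composite node types under a simple tick model in which each node reports SUCCESS or FAILURE; as noted above, real implementations vary these details (e.g. adding a RUNNING status or configurable thresholds), and the class names here are our own.

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2

class Sequence:
    """Runs children in order; fails as soon as any child fails."""
    def __init__(self, *children): self.children = children
    def tick(self) -> Status:
        for child in self.children:
            if child.tick() == Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS

class Selector:
    """Runs children in priority order; succeeds as soon as any child succeeds."""
    def __init__(self, *children): self.children = children
    def tick(self) -> Status:
        for child in self.children:
            if child.tick() == Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE

class Parallel:
    """Runs all children; succeeds if at least `required` of them succeed."""
    def __init__(self, required, *children):
        self.required, self.children = required, children
    def tick(self) -> Status:
        wins = sum(1 for c in self.children if c.tick() == Status.SUCCESS)
        return Status.SUCCESS if wins >= self.required else Status.FAILURE

class ReturnFailure:
    """An example decorator: runs its child but always reports failure."""
    def __init__(self, child): self.child = child
    def tick(self) -> Status:
        self.child.tick()
        return Status.FAILURE
```

Under this sketch, ticking a root Selector whose first child is a Sequence containing a ReturnFailure decorator reproduces the trace described in Fig. 1: the sequence fails at the decorator, so the selector falls through to its next child.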
VI. METHOD

The first stage in being able to build BTs is to be able to locate areas of commonality within action sequences, as these likely represent common or repeated sub-behaviors. The overall method for creating the behavior tree is an iterative process, as follows (Fig. 2). First, a maximally-specific BT is created from the given example case sequences. The BT is then iteratively reduced in size by finding and combining common patterns of actions. When no new satisfactory patterns are found, the process stops. By merging similar action patterns, we are forced to generalise the BT and can find where common patterns diverge, so we can attempt to infer the reasons for different actions being chosen. Reducing the size of the BT will also help to make it more understandable if people wish to read and edit it.⁴

⁴ The code implementation of this method is available online at https://github.com/phoglenix/bt-builder.

[Figure stages: input examples → maximally-specific BT; find common pattern (input sequences to GLAM2) → identified pattern and alignments; merge into new sequence → merged alignments; attach to tree → merged sequence replaces aligned regions, with the sequence after a matched region joined by a selector.]
Fig. 2. Overview of the general BT construction process. Input examples are converted into a maximally-specific BT. The BT is then iteratively reduced by finding common patterns, merging them into new sequences, and attaching them to the tree. When no more patterns are found, the process stops.
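The loop in Fig. 2 can be summarised in a few lines. The sketch below is a schematic rendering only, not the implementation from the linked repository: the four helpers are passed in as functions and are hypothetical stand-ins for the corresponding stages (per the figure, the pattern-finding stage feeds the tree's action sequences to the GLAM2 motif finder).

```python
def build_behavior_tree(examples, make_specific_bt, find_common_pattern,
                        merge_into_sequence, attach_to_tree):
    """Schematic of the iterative BT construction process in Fig. 2.

    Hypothetical helper contracts:
      make_specific_bt(examples)      -> maximally-specific BT from the example
                                         case sequences
      find_common_pattern(tree)      -> alignments of a common action pattern,
                                         or None when no satisfactory pattern
                                         remains (motif finding, e.g. via GLAM2)
      merge_into_sequence(alignments) -> new sequence node generalising the
                                         aligned regions
      attach_to_tree(tree, alignments, merged) -> replace the aligned regions
                                         with the merged sequence, joining the
                                         sequences after a matched region by a
                                         selector
    """
    tree = make_specific_bt(examples)
    while True:
        alignments = find_common_pattern(tree)
        if alignments is None:  # no new satisfactory pattern: stop
            break
        merged = merge_into_sequence(alignments)
        attach_to_tree(tree, alignments, merged)
    return tree
```

Each iteration both shrinks the tree and generalises it, since one merged sequence now stands in for several near-identical runs of actions drawn from different examples.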
A. Creating the original BT joined by selector