Deep Sequential Models For Sampling-Based Planning

Computer Science and AI Laboratory, MIT
{ylkuo,abarbu,boris}@mit.edu

This work was supported by the Center for Brains, Minds and Machines, NSF STC award 1231216, the Toyota Research Institute, CBMM-Siemens Graduate Fellowship, and the MIT-IBM Brain-Inspired Multimedia Comprehension project.
Abstract— …the search for good paths. The resulting model, called DeRRT∗, observes the state of the planner and the local environment to bias the next move and next planner state. The neural-network-based models avoid manual feature engineering by co-training a convolutional network which processes map features and observations from sensors. We incorporate this sequence model in a manner that combines its likelihood with the existing bias for searching large unexplored Voronoi regions. This leads to more efficient trajectories with fewer rejected samples even in difficult domains such as when escaping bug traps. This model can also be used for dimensionality reduction in multi-agent environments with dynamic obstacles. Instead of planning in a high-dimensional space that includes the configurations of the other agents, we plan in a low-dimensional subspace relying on the sequence model to bias samples using the observed behavior of the other agents. The techniques presented here are general, include both graphical models and deep learning approaches, and can be adapted to a range of planners.

Fig. 1. A DeRRT∗-based planner starts at the red square and tries to reach the green square while escaping from a bug trap. The search tree, shown as blue circles, is mirrored by a sequence model, an HMM or LSTM. When expanding the tree, a free-space sample is drawn, steered toward, and the resulting node, shown as a red circle, is used to find the closest node in the tree, as in RRT. The sequence model, with state corresponding to that closest node, observes this free-space sample, the path leading to this node, along with local visual or map features, shown in gray, and predicts a modified direction, shown in green, which is then connected to the search tree. A new state for the sequence model is also predicted and connected. This process incorporates the bias to explore free space of RRT-based planners with a co-evolving sequence model and observations of the environment.
I. INTRODUCTION
When you navigate an environment containing new agents, obstacles, and goals, you can rely on previous experiences to guide your actions. Having seen similar agents before allows you to predict the motions of the ones you encounter in the future. Having seen obstacles, whether static or dynamic, allows you to efficiently navigate around them. Having seen certain goal types allows you to determine what preconditions must be satisfied before meeting those goals. In each case, your expectations about the future of the plan are conditioned on your previous experiences, current plan, and local observations to help you navigate. This is the process we are modeling here.

Existing sampling-based planners have difficulty taking advantage of such information. Most planners, like RRT∗ [1], sample uniformly and take no heed of the environment. RRT∗, Rapidly-exploring Random Tree, is part of a family of algorithms [2, 3, 4, 5] that explore a configuration space by sampling moves while avoiding invalid states. Dynamic environments, in particular, pose many challenges. They combine uncertain sensing of the position of obstacles and agents with uncertainty about the future path of those obstacles and the actions being performed by other agents. To improve planning in these domains, we adopt a set of techniques from computer vision. We bias the growth of the RRT∗ search tree [6, 7] given prior experience and a sensed environment. Hidden Markov Models (HMMs) [8] and stacked LSTMs [9] are powerful activity recognizers [10, 11, 12] but so far they have seen little use in improving robotic planning. We demonstrate how to adapt such sequence models to robotic planning using a general approach that can employ either graphical models or neural networks.

A sequence model co-evolves alongside a sampling-based planner as shown in Fig. 1. At each planning step, both the planner and the sequence model are stepped forward while the next sample from the planner is conditioned on the sequence model. That sequence model can observe local features of the environment as well as the current plan to provide good samples. Moreover, we can avoid feature engineering and learn the relevant features of the environment by co-training a convolutional network (CNN) with the LSTMs. We refer to this algorithm as DeRRT∗, for deep RRT∗, although the techniques presented here can be adapted to other sampling-based planners.

Several prior approaches have considered guiding planners with local observations of static and dynamic obstacles. Fulgenzi et al. [13] demonstrate how a Gaussian process
can be combined with RRT to update plans conditioned on the observations of the motions of dynamic agents. This model estimates the positions of static and dynamic agents and assumes that the velocity and direction of motion are constant. Like the approach presented there, we plan incrementally and sample from the stochastic model at every timestep. Unlike this earlier work, we model the uncertainty in observing the position of obstacles by relying on the ability of the HMMs and LSTMs [14, 15] to capture this notion. Additionally, we also allow for obstacles and agents that change velocity or direction while including observations of features of those obstacles and agents. Features are learned entirely automatically in the case of LSTMs. Capturing the dynamics and appearance features of other entities can be helpful in predicting future behavior and enhancing planner performance. Aoude et al. [16] extend the work above to include a simulator which further constrains the possible trajectories of other agents. Such simulation is a natural extension to the models presented here using a range of probabilistic programming approaches [17] developed for computer vision [18] and robotics [19].

The closest work to our approach is that of Bowen and Alterovitz [20, 21]. Their approach learns an HMM model for the trajectory of plans which achieves a task and performs exhaustive inference in the cross-product space of that HMM and the configuration space of the planner. That work considers only domains where bidirectional planning is possible; we do not rely on such information here. Additionally, it only considers sequence models which observe the state of the configuration space and one additional feature, the distance between landmarks and the end effector. This prevents modeling the time-varying motion of other agents, although by detecting the arrival of new landmarks using a manually-set threshold, their approach can replan when the environment changes. This process does not account for perceptual uncertainty as to the position or even presence of objects. The inference algorithm considered in that work relies on a discrete HMM state. The approach presented here can employ either continuous HMMs or arbitrary deep learning sequence models such as stacked LSTMs or GRUs.

Similarly to the work above, Kim et al. [22] use a generative adversarial network, GAN [23], to learn an action sampler. At each timepoint, they sample a new action conditioned on the learned model. Unlike our approach, the GAN does not model the dynamics of other objects or capture uncertainty when sensing. Janson et al. [24] show a Monte-Carlo planning approach that, like the model described here, incorporates uncertainty in the observation of obstacles. It does not, however, consider the dynamics of obstacles or learn to extract relevant local features automatically. Arslan et al. [25] modify the steering function of a sampling-based planner, in their case PRM, to include sensory information. Unlike our approach, their work does not include a sequence model to model the dynamics of obstacles or other agents, or automatically learn to extract features from the environment.

Encoding the dynamics of other agents provides a major advantage: efficiently reasoning in multi-agent scenarios. Nominally the presence of other agents makes the reasoning problem exponentially harder. One must reason both about one's configuration space and about the configuration space of the other agents. In the case of a single robotic arm manipulating an object, Schmitt et al. [26] show how, without needing any training, one can reduce the configuration space. Our method would be useful in this domain as it is more general at the expense of requiring training data. Work by Čáp et al. [27] and Chen et al. [28] demonstrates how planners, including sampling-based planners, can be adapted to multi-agent environments and cooperative planning without this exponential slowdown. Rather than cooperatively learning to plan, we demonstrate an agent avoidance task which relies on the sequence model to learn to avoid the other agents. Kiesel et al. [29] consider a related problem, learning to perform kinodynamic planning. Here we do not consider learning the dynamics of the robot being driven, instead only focusing on other agents and objects, but this would be an interesting direction for future work.

This work makes a number of contributions. We show how to combine sequence models with sampling-based planners in a manner that incorporates either graphical models or neural networks. Learned features of the trajectory, the local map, and the obstacles are combined together to improve the generated plans in novel environments. By borrowing its structure from computer vision algorithms for object tracking and activity recognition, the model incorporates perceptual uncertainty. The resulting model captures the dynamics of other agents to plan in multi-agent scenarios. Additionally, the model we present is flexible and can easily be adapted to new sampling-based planners. We demonstrate DeRRT∗ with a classical narrow passage, a bug trap scenario that co-trains a CNN to encode environmental cues, and a multi-agent scenario where we take advantage of the learned patterns of motion of the other agents. We expect that by employing general-purpose sequence models, which have seen great success in natural language processing and computer vision, to planning, this approach not only improves performance but opens the door to further cross-pollination between these areas.

II. PLANNING WITH SEQUENCE MODELS

The algorithm presented here, DeRRT∗, combines a sequence model with a sampling-based planner, RRT∗. RRT-based algorithms create a tree which explores a configuration space. The tree is used to efficiently connect an initial state to a goal state in that configuration space. Given an initial state, RRT∗ samples locations uniformly and then attempts to connect them to that original node. The tree reaches outward to cover the configuration space with a bias for large unexplored Voronoi cells, eventually finding paths to the goal state. In this way, RRT interleaves two steps: picking a point and steering toward it while avoiding obstacles and infeasible areas. See algorithm 1 for a prototypical RRT. RRT∗ is an asymptotically optimal version of RRT [1]. There are a number of common extensions to RRT, for example, bidirectional RRT which considers the goal in addition to the initial state. The work presented here can easily be adapted to such enhanced RRTs.

Algorithm 1 A prototypical RRT algorithm.
1: V ← {xinit}; E ← ∅
2: for 1 . . . n do
3:   xrand ← SampleFree()
4:   xnearest ← Nearest(G = (V, E), xrand)
5:   xnew ← Steer(xnearest, xrand)
6:   if ObstacleFree(xnearest, xnew) then
7:     V ← V ∪ {xnew}
8:     E ← E ∪ {(xnearest, xnew)}
9: return G = (V, E)
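For concreteness, the loop in algorithm 1 can also be written as a short, self-contained Python sketch. The helper functions, workspace bounds, and step size below are illustrative assumptions for a 2D configuration space, not the implementation used in this work.

# A minimal 2D RRT in the spirit of algorithm 1; all helpers are illustrative.
import math
import random

def sample_free(bounds):
    # Uniform sample inside a rectangular workspace.
    (xmin, xmax), (ymin, ymax) = bounds
    return (random.uniform(xmin, xmax), random.uniform(ymin, ymax))

def nearest(V, x):
    # Closest existing tree node to the sampled point.
    return min(V, key=lambda v: math.dist(v, x))

def steer(x_nearest, x_rand, r=0.5):
    # Move from x_nearest toward x_rand by at most a distance r.
    d = math.dist(x_nearest, x_rand)
    if d <= r:
        return x_rand
    t = r / d
    return (x_nearest[0] + t * (x_rand[0] - x_nearest[0]),
            x_nearest[1] + t * (x_rand[1] - x_nearest[1]))

def rrt(x_init, obstacle_free, bounds, n=1000):
    # obstacle_free(a, b) should return True when the segment a-b is collision free.
    V, E = {x_init}, set()
    for _ in range(n):
        x_rand = sample_free(bounds)
        x_nearest = nearest(V, x_rand)
        x_new = steer(x_nearest, x_rand)
        if obstacle_free(x_nearest, x_new):
            V.add(x_new)
            E.add((x_nearest, x_new))
    return V, E

# Example: V, E = rrt((0.0, 0.0), lambda a, b: True, ((0.0, 10.0), (0.0, 10.0)))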
A. RRT∗ with sequence models

At each iteration of the RRT algorithm, we simultaneously extend the tree and a sequence model. Just as RRT trees have a branching structure, the sequence model will have that same branching structure. This conditions future states of the sequence model on past states for that particular hypothesized plan. Fig. 1 shows an example of this process. As a node is expanded, the precise position in the configuration space of a new candidate node is sampled from the sequence model. To implement this, we modify the steering function to move in a direction given by the sequence model while conditioning it on the current state, the desired free space direction, and observations of the local environment around the current state.

While several extensions to RRT consider changing the sampling function, here we instead change the steering function. This distinction is important and we take this approach for several reasons.

1) It preserves the most desirable property of RRT, its bias for large Voronoi regions. Exploring novel regions helps in difficult domains where simply attempting to directly reach the goal is unlikely to succeed. At the same time, if one wants to incorporate the goal position, this approach is easily extended to bidirectional RRT where two trees grow toward each other: one tree from the initial state and one tree from the goal state.

2) There is no need to change the sampling function. If the sequence model for the steering function has high confidence, the randomly sampled direction in free space is irrelevant. Intuitively, the sequence model controls how much exploration vs. exploitation is occurring based on its confidence in the next direction.

3) We would like to take advantage of local observations to help guide the algorithm. To do this, we allow the sequence model to observe those features, and in the case of the deep learning approach, to learn the nature of those features. Steering moves are small, making local decisions about the direction of motion, while free-space sampling controls the overall direction of motion. Local features are far more informative for small moves than for deciding what the overall direction across an entire map or maze might be.

We keep the overall structure of RRT∗ [1] unchanged, modifying two of its component functions, Steer and Rewire. As described above, we modify Steer to guide the planner toward a direction informed by the sequence model instead of just minimizing the distance to the uniformly sampled node xrand. For performance reasons, this necessitates an update to Rewire to cache the state of the sequence model when a node changes its parent.

When steering, one starts from node xnearest and heads in the direction of the sampled point xrand. The end node of a single step of the steering function, xnew, lies within a distance r of xnearest, within a sphere Bxnearest,r. In the original RRT∗, xnew is chosen to minimize the distance to xrand. We replace this function with SteerWithModel, as shown in algorithm 2.

SteerWithModel proceeds as follows. First, we find µ, the optimal point according to the original RRT∗ algorithm. Next, we would like to sample a point within steering distance r of xnearest conditioned on the sequence model, λ, along with any observations from available sensors, Obs. When the sequence model allows for efficient conditioning of the samples based on this sphere and sensor data, we can directly sample from the posterior. Practically, most models do not allow for this and we instead sample a fixed number of points, k points, compute the likelihood of each, and sample proportionally to those likelihoods. Other approaches to drawing samples for the steering function, such as Monte-Carlo methods, would also be appropriate, but we intend to draw few samples in a small region, meaning that the advantages of such approaches are outweighed by their additional runtime.

Algorithm 2 SteerWithModel(xnearest, xrand)
1: µ ← argmin_{z ∈ Bxnearest,r} ‖z − xrand‖
2: S ← ∅
3: for 1 . . . k do
4:   xnext ← SampleUniform(xnearest, µ, r)
5:   pnext ← P(xnext, µ, Obs | λ, xinit, . . . , xnearest)
6:   S ← S ∪ {(xnext, pnext)}
7: (xnew, pnew) ← Sample(S)
8: return xnew

Intuitively, when the sequence model doesn't provide any information about the configuration space, it can learn to simply provide high likelihood when the hypothesized direction xnext is close to µ. This reverts the steering function to the one from the original RRT∗. At the other extreme, the sequence model may choose to disregard the free-space samples if the future path is clear. The structure of most sequence models makes computing this likelihood term very efficient by decoupling the cost into the cost of the previous path, which is shared by all future paths, and an additional term for the new position and the new observations. Next, we describe how this generic model is instantiated in the case of HMMs and then neural networks.
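As an illustration of how algorithm 2 replaces the original steering function, the following Python sketch draws k candidate points in the steering ball, scores each under a sequence model, and samples one in proportion to those scores. The sequence-model interface (a log_likelihood method conditioned on the state cached at xnearest) and the 2D candidate sampler are assumptions made for this sketch, not the interface of the actual implementation.

# Rough sketch of SteerWithModel (algorithm 2) for a 2D configuration space.
import math
import random

def steer_with_model(x_nearest, x_rand, seq_model, obs, r=0.5, k=10):
    # mu: the point the original RRT* steering would choose, i.e. the point
    # within distance r of x_nearest that is closest to x_rand.
    d = math.dist(x_nearest, x_rand)
    t = min(1.0, r / d) if d > 0 else 0.0
    mu = tuple(xn + t * (xr - xn) for xn, xr in zip(x_nearest, x_rand))

    candidates, log_ps = [], []
    for _ in range(k):
        # Draw a candidate within the steering ball around x_nearest.
        angle = random.uniform(0.0, 2.0 * math.pi)
        radius = r * math.sqrt(random.random())
        x_next = (x_nearest[0] + radius * math.cos(angle),
                  x_nearest[1] + radius * math.sin(angle))
        # Score the candidate under the sequence model, conditioned on the
        # state cached at x_nearest and on the local observations.
        log_ps.append(seq_model.log_likelihood(x_next, mu, obs, state_at=x_nearest))
        candidates.append(x_next)

    # Sample one candidate proportionally to its likelihood, as in algorithm 2.
    m = max(log_ps)
    weights = [math.exp(lp - m) for lp in log_ps]
    return random.choices(candidates, weights=weights, k=1)[0]

If the model concentrates its likelihood near µ, this reduces to the original RRT∗ steering move, matching the behavior described above.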
B. Steering with HMMs

HMMs are easy to train, they do not generally require much data, and they usually provide efficient exact inference algorithms. They do, on the other hand, require feature engineering. Regardless of the domain, we always include a feature modeled by a normal distribution which observes the difference between the hypothesized direction, xnext, and the optimal direction according to the original RRT∗, µ. This allows the HMMs to elect to follow the original steering function. Nominally, it is possible to also recover this original behavior when the HMMs assign equal likelihood to all outcomes, but this can be difficult to learn. We extract other features from the environment and current trajectory and also model them using normal distributions.

We only consider HMMs with a finite number of discrete states, although the inference algorithm only requires that a likelihood of a path is computable. It is agnostic to the number or type of states since we marginalize over all states. To do this efficiently, we employ the forward algorithm. We take advantage of the Markov property to decompose the likelihood function into one computed for the existing path, P(xinit . . . xnearest | λ), and an additional update term, P(xnearest, xnext, µ, Obs | λ). The Markov property also allows us to efficiently cache only the final entry in the lattice at each node of the RRT tree. Training the model employs an EM algorithm along with a collection of traces.

HMMs have more important limitations than just requiring feature engineering. They have difficulties capturing complex temporal dynamics because of an implied exponential state duration model; efficient algorithms are scarce for other state duration models. Additionally, the Markov property also limits the complexity of the dynamics that can be modeled without an explosion in the number of states.

C. Steering with neural networks

These limitations of HMMs prompt us to employ neural networks instead. Using recurrent networks to approximate complex probability distributions is not new. For example, Le et al. [17] show that probabilistic programs can be compiled into neural networks that take observations as input and learn to perform inference. One could accurately approximate the HMMs with neural networks using such techniques.

We consider classes of recurrent models such as RNNs [30], LSTMs [9], and GRUs [31]. While we often refer to LSTMs for shorthand reasons, the planning algorithm is intentionally agnostic to the specifics of the chosen recurrent model. In each case, these algorithms have a state that is propagated at each time step. Unlike HMMs, states are not necessarily discrete or interpretable. At each time step, models observe the same quantities the HMMs do. And like with the HMMs, we use the outputs of the recurrent models to compute a likelihood for each direction in the steering function.

The recurrent networks take a local observation around the current point, the current position, and the current optimal direction. Local observations and map features are embedded into a fixed-dimensional input vector. Convolutional layers can be co-trained with the recurrent model to take input images of the map or any other perceptual information. This eliminates the need for feature engineering and provides robustness to perceptual uncertainty; we do not need to commit to the presence or absence of a feature in the environment. The network learns one embedding layer to embed the local observations at the current position. In addition, at each time step, the previous state — an arbitrary vector — is propagated and both a new state and a new direction are produced. The network is initialized with a zeroed state vector.

The recurrent models can in principle directly score a future state. In practice, we found that having an explicit mixture model that combines the optimal direction per the original RRT∗ with the direction preferred by the LSTM results in models which are easier to train. It also provides a level of interpretability to the model. At each time step, we use the recurrent network, a step of which is evaluated by the function η, to produce a mean and covariance matrix for a normal distribution from which new directions can be sampled. The likelihood of a steering move is computed as a mixture of

q(xt | η(xt−1, st−1, Obs, φ))    (1)

and the likelihood of following the RRT∗ direction, µ, which is computed as a normal distribution N(µ, σ), where q is a normal proposal distribution, η is the recurrent model (a function returning a mean direction and a covariance matrix), s is the state vector of the model, Obs is an embedding of the observation vector, and φ are the parameters of the recurrent model. In practice the value of σ is not important, as the network learns to compensate for the chosen value, but its addition makes for a coherent stochastic model. For efficiency, similarly to the HMM case, we store the current state of the recurrent network at each node in the search tree and incrementally compute the likelihood of a path.

At training time, the network is supplied with a series of traces of successful plans. While we do not do so here, we could include unsuccessful plans as part of the training set for the neural networks and even the HMMs by using discriminative training. At each time step during training, stochastic gradient descent is used to maximize the likelihood shown in equation 1. In essence, we have samples from an n-dimensional normal distribution, where n is the size of the configuration space, along with the network which produced the mean and covariance matrix from which these samples were drawn at each time step. We then train this network to maximize the likelihood of the observed sequences.
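To make the recurrent proposal concrete, the sketch below, assuming PyTorch, implements one step of η as a GRU cell that emits a mean and a diagonal covariance, together with a training step that maximizes the likelihood of observed steering moves under equation 1 with stochastic gradient descent. The layer sizes, the diagonal-covariance parameterization, and the names are assumptions; the mixture with the RRT∗ term N(µ, σ) used at planning time is omitted for brevity.

# Minimal sketch of a recurrent steering proposal (section II-C), assuming PyTorch.
import torch
import torch.nn as nn

class RecurrentProposal(nn.Module):
    def __init__(self, obs_dim, conf_dim, hidden=32):
        super().__init__()
        self.embed = nn.Linear(obs_dim + conf_dim, hidden)   # observation embedding
        self.cell = nn.GRUCell(hidden, hidden)                # one step of eta
        self.mean = nn.Linear(hidden, conf_dim)               # proposed direction
        self.log_std = nn.Linear(hidden, conf_dim)            # diagonal covariance

    def forward(self, obs, x, state):
        inp = torch.relu(self.embed(torch.cat([obs, x], dim=-1)))
        state = self.cell(inp, state)                          # propagate the state vector
        return self.mean(state), self.log_std(state).exp(), state

def training_step(model, optimizer, obs_seq, x_seq):
    # obs_seq: (T, obs_dim) observations; x_seq: (T+1, conf_dim) positions of one trace.
    state = torch.zeros(1, model.cell.hidden_size)             # zeroed initial state
    nll = 0.0
    for t in range(len(obs_seq)):
        mean, std, state = model(obs_seq[t:t+1], x_seq[t:t+1], state)
        step = x_seq[t+1:t+2] - x_seq[t:t+1]                    # observed steering move
        dist = torch.distributions.Normal(mean, std)
        nll = nll - dist.log_prob(step).sum()                   # maximize the likelihood of eq. 1
    optimizer.zero_grad()
    nll.backward()
    optimizer.step()
    return nll.item()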
D. Multi-agent planning

Sequence models can capture the dynamics of other agents. Doing so is useful both for avoiding moving obstacles, such as pedestrians, and for coordinating with those agents, such as merging into traffic with other autonomous cars. Coordination problems such as these are often computationally difficult for each individual agent, particularly when perception is unreliable. On the other hand, simulating the correct behavior of multiple agents when a shared oracle is available is easy. This is a good fit for the type of planning performed by DeRRT∗. Recurrent models have sufficiently large capacity to recognize the actions of other agents and learn a multi-agent plan directly. Simulations provide large amounts of training data to tune these recurrent models. Intuitively, rather than attempting to plan in a large configuration space that is a cross-product of motions of both the robot and the other agents, we plan in a subspace, the robot's own motions, and rely on the sequence models to perform a type of dimensionality reduction. The space of the robot's motions is warped to account for the motions of the other agents.

We augment the model to explicitly include observations of other agents and to explicitly reason about them. At each time step, an embedding of the sensor information for each agent is produced and also provided to the network. Nominally, one could encode this information by taking as input a sequence of observations, one for each interacting agent, embedding each individually and using a separate sequence model embedding all information for all agents into a single vector. This is similar to how sentences are embedded into vectors using embeddings for each word and a recurrent model to combine the per-word embeddings [32].

However, such models can be difficult to train. Here, we consider a complementary approach that takes advantage of the probabilistic interpretation of the sequence models described above. For each agent observed, we predict a mean and covariance matrix for the direction to steer in. Agents that do not influence the current plan will have uninformative directions while critical agents will have highly concentrated distributions. One could also compute directions for pairs or triples of agents in this manner if the relationships between agents, not just between the robot and a single agent, are important for coordination. This process is also data efficient: a single example of multiple cooperating agents leads to 2 × (number of agent pairs) training examples as network weights are shared between all agent pairs. At inference time, the steering direction is a mixture of the optimal RRT∗ direction, the direction according to the local observations of the model, and the directions predicted for each observation of each agent. This process efficiently scales to a large number of agents.

III. EXPERIMENT RESULTS

We tested DeRRT∗ in a number of challenging environments. The sequence models described above were implemented in PyTorch and integrated with the Open Motion Planning Library, OMPL [33], using the provided Python bindings (source code is available at https://github.com/ylkuo/derrt). The selected planning environments demonstrate three key features of the approach presented here: planning efficiently in difficult domains such as narrow channels, using local perceptual features to learn to escape a bug trap, and multi-agent navigation that relies on learning the other agents' motion patterns for coordination.

A. Long narrow passage

Narrow passages pose a problem for sampling-based planners [34]. The volume of a narrow passage is much smaller than the volume of the free space, making it hard to find good directions to move in. Prior work has shown that the convergence of RRT-like algorithms depends on the thickness of the narrow passage.

We created a 2D environment where a robot navigates from a start location to a goal region. At each instance of this problem, we randomized the start and end positions. We uniformly sample the thickness and the length of the passage and place its opening at a randomly sampled position. Fig. 2(a) shows an example map. The model was trained with 200 example sequences.

For the case of the HMMs, we had 3 hidden states, perhaps interpretable as corresponding to having not yet entered the passage, being in the passage, and having traversed it. We extracted the agent's distance to the passage entrance, the presence of the agent in the passage, and the coordinates of the agent on the map as observed features. Each feature was assumed to be independent from the rest by the use of a diagonal covariance matrix. The neural network sequence models used a GRU with an input layer to embed these local features, two hidden layers with 32 dimensions, and a linear proposal layer.

While at training time the map was always 300 by 300, at test time the map was always 600 by 300. We also significantly narrowed the passage size at test time. This both makes the problem more challenging and makes the test examples different from the training examples, showing the generalization capabilities of the sequence models. Fig. 2(a) shows a test case along with the expanded search tree and the final path. There is no configuration similar to this one in the training examples.

At test time, we ran each of RRT∗, DeRRT∗ with HMMs, and DeRRT∗ with GRUs for 500 rounds with 600 sampled nodes in each round. DeRRT∗ was able to succeed far more reliably with such a small number of samples, 24% to 47% of the time, a 6 to 12-fold increase, depending on whether the HMM- or GRU-based models were used. Table I summarizes the success rate of each planner. We emphasize that the test set is disjoint from the training set not only in terms of individual examples but also in terms of the statistics of the examples; the narrower passages required the planners to generalize.

TABLE I
SUCCESS RATES FOR THE LONG NARROW PASSAGE PROBLEM.

                       RRT∗     DeRRT∗/HMM   DeRRT∗/GRU
success %              3.83%    24.02%       47.67%
standard deviation     1.68     3.74         4.38

A test set with different statistics than the training set was used, drawing 600 samples. Note the much higher success rate of DeRRT∗ with either HMMs or GRUs. While the standard deviation is also somewhat higher, it is minuscule compared to the difference in performance between the approaches.

We would like for the algorithms to not just have high performance but also to behave in a directed manner, while still exploring free space. Fig. 2(b) and 2(c) summarize the heat map of the search trees on the given test example. Intensity corresponds to the number of times the tree visited
that region. One can see that DeRRT∗ trades off exploration vs. exploitation. It still searches the free space but does so toward the channel entrance, more easily traverses the channel, and samples more densely in the free space after the channel where the goal is.

Fig. 2. A test case for the long narrow passage. (a) The DeRRT∗ search tree, black, and solution, red. (b) A heat map of the DeRRT∗/HMM search tree. Note the distribution bias toward the channel entrance in the free space, efficient traversal of the channel, and more samples after the channel exit. (c) A heat map of the RRT∗ search tree. Note the high proportion of time spent looking for the entrance, less efficient traversal of the channel, and few samples in the second free space looking for the goal. Pure black represents allocating 1% of the samples inside the given cell with linear interpolation to pure white.

B. Bug trap

The bug trap is a standard benchmark in OMPL. It requires that a 2D robot escape from an inner chamber through a narrow passage and then reach a goal in a large free space; see Fig. 3(a). This is made particularly hard by the shape of the exit, which includes two dead ends. Most samples in the free space will lead to steering into these areas. For planners to escape from the bug trap, particularly sampling-based planners, a very deliberate sequence of samples must be drawn to guide the robot to the passage and then down the passage without getting trapped.

Fig. 3. (a) The default OMPL bug trap example with its start and end position. Note the difficult central passage and dead-ends on either side of it. (b) A heat map of the DeRRT∗/GRU search tree. It efficiently learns to exit the trap and quickly focuses on sweeping in large arcs to locate the goal. (c) A heat map of the RRT∗ search tree. It spends more time in the trap and less time on finding the goal in the free space. Pure black represents allocating 0.2% of the samples inside the given cell with linear interpolation to pure white.

Unlike the narrow passage problem, the bug trap has much more distinctive local visual features, i.e., the shape of the trap. To quickly escape, one has to recognize not just the presence of a gap but the particular features of the central channel. In this experiment, we co-train the convolutional layers of the neural-network sequence models to learn to recognize the presence of relevant map features in order to reach a goal. This eliminates the need for feature engineering or any other annotation aside from a series of prior plans.

To again ensure that the training and test data are disjoint, we manipulated the example provided in OMPL. We randomly rotated and translated the trap and randomly sampled the starting position inside the trap and the goal configuration outside the trap. Training samples were provided by running RRT∗ for 10000 planning steps on each problem instance. In total, we collected 1000 training sequences.

Instead of manually engineering features, we take as an observation a 21 × 21 local patch centered at the current node from a 110 × 110-sized map. The sequence model used was similar to that from section III-A, with two modifications. First, manually-engineered features were replaced with a two-layer convolutional network, each layer containing a convolution followed by max pooling. The convolutions used 3 × 3 filters with 32 and 64 output channels respectively. Max pooling used a 2 × 2 window with step size 2. Second, we used GRUs instead of LSTMs as they proved easier to co-train with the convolutional layers.

At test time, we compared RRT∗ to DeRRT∗. Fig. 3(a) shows the default bug trap from OMPL. We ensure that the test and training sets are disjoint and that no motion sequence in the training set solves a map from the test set. Each planner is run with a maximum of 10000 planning steps.

Fig. 4 shows the solution length as a function of the number of samples drawn. Already by 4000 samples the sequence-model-guided planner is performing as well as RRT∗ with twice the number of samples. The colored regions show 95% confidence intervals. Additionally, DeRRT∗ is more stable than RRT∗, likely because the proportion of valid moves is around 0.8, as shown in Fig. 5. This makes DeRRT∗ far more efficient in its proposals when compared to RRT∗. When considering more complex scenarios, such as an articulated robot with a complex mesh, this can have an even more significant impact as the more expensive collision checking can become a dominant concern in the runtime of sampling-based planners.

Figs. 3(b) and 3(c) show heat maps computed the same way as in the previous section, where intensity is proportional to the density of nodes. DeRRT∗ learns to sample less in dead ends and focuses on leaving the channel and exploring the outside space which contains the goal, while RRT∗ tends to spend much more time in the trap.
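A rough sketch of the kind of co-trained observation encoder used in this experiment is shown below, assuming PyTorch. The two 3 × 3 convolutions with 32 and 64 channels and the 2 × 2 max pooling follow the description above; the padding, the output embedding size, and the class and variable names are assumptions.

# Sketch of the co-trained convolutional observation encoder (section III-B).
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    def __init__(self, out_dim=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 3x3 conv, 32 channels
            nn.ReLU(),
            nn.MaxPool2d(2),                               # 2x2 max pooling, stride 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 3x3 conv, 64 channels
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # A 21x21 patch becomes 64 x 5 x 5 after the two pooling steps (21 -> 10 -> 5).
        self.project = nn.Linear(64 * 5 * 5, out_dim)

    def forward(self, patch):
        # patch: (batch, 1, 21, 21) local occupancy window around the current node.
        z = self.features(patch)
        return self.project(z.flatten(start_dim=1))

# The encoder output replaces hand-engineered features and can be fed to the
# recurrent proposal and trained jointly with it, e.g.:
#   obs = PatchEncoder()(patch); mean, std, state = proposal(obs, x, state)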
Fig. 4. Solution length as a function of the number of samples. DeRRT∗ is both more efficient and more stable.

Fig. 5. Proportion of valid graph moves as a function of the number of samples. DeRRT∗ learns to avoid proposing invalid moves.

C. Multi-agent coordination around a static obstacle

Finally, we test the ability of the sequence models to learn to coordinate with other agents using a task where agents must swap their positions around a central obstacle while avoiding each other; see Fig. 7. The obstacle makes the problem significantly harder than a standard position swap in free space, since moving randomly in free space makes it very unlikely that agents will collide with each other. To further increase the difficulty, we constrain all other agents to move counter-clockwise around the obstacle. This scenario is closely related to driving; one might view it as a roundabout. To scale the difficulty of the problem, we change the number of agents that must avoid collisions.

Fig. 6. Solution path length in the multi-agent navigation task.

Fig. 7. Trajectory comparison in the six-agent position-swap case at 500 samples per re-plan step. The grey lines are trajectories of other preloaded agents. (a) The trajectory midway through execution. (b) The trajectory at the end.

Training data was generated by randomly placing an obstacle in a 100 × 100 map, and up to four agents at random orientations around the obstacle. Motion sequences were supplied by running RRT∗ for 10,000 iterations. This resulted in 200 different maps and 600 sequences in total. We trained a 3-state HMM including the orientation and distances to the goal and other agents. Similarly to the previous sections, we trained a GRU-based planner including those same features. As described in section II-D, the neural-network-based planner evaluates each of the different agents separately, proposes means and covariance matrices for each, and draws a sample from the resulting mixture model.

We compared DeRRT∗ to two other approaches, RRT∗ and RRT∗-joint, with up to eight agents. RRT∗ only considers the configuration space of the robot while replanning at each step. It treats the map as providing snapshots of fixed obstacles, i.e., the center block and the other agents, some of which happen to move between the snapshots. RRT∗-joint plans in the joint configuration space of all agents. By reasoning explicitly about the configuration space of other agents, it can in principle take into account the expected trajectories of those agents. Without a model of the behavior of the agents, RRT∗-joint has difficulties making meaningful inferences aside from constraining the likely paths of other agents to avoid imminent collisions, while at the same time incurring the cost of an exponentially increasing configuration space.

Table II shows detailed results comparing these four approaches for a fixed number of samples per re-plan step, 100 for all except for RRT∗-joint, which due to its exponentially larger search space requires 5000. RRT∗-joint quickly degrades because even though it is nominally representing the problem with higher fidelity than RRT∗, sampling in high-dimensional spaces is known to be very difficult. DeRRT∗ with GRUs performs considerably better in terms of its success rate. Note that DeRRT∗ was never trained on the same scenarios it was tested on and it was never trained with 6 or 8 agents on the map, yet it seems to generalize well to these new instances. DeRRT∗ with either sequence model finds much shorter paths. Fig. 6 shows the path length as a function of the number of agents. RRT∗-joint is omitted because its poor performance would make it difficult to see the difference between any of the other algorithms. As the number of agents increases, DeRRT∗ does not produce poorer paths, although its likelihood of success does go down as finding collision-free paths becomes more difficult. Figs. 7(a) and 7(b) show an instance of this problem along with full tracks, in grey, and partial tracks halfway through the execution, in color, for RRT∗ and DeRRT∗ solutions. Even when both approaches reach the goal, the collision-free paths produced by DeRRT∗ are much smoother.
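As a rough illustration of the per-agent mixture proposal of section II-D, the sketch below, assuming PyTorch, shares one Gaussian head across all observed agents and mixes its predictions with the RRT∗ direction and the local-observation proposal. The head architecture and the equal mixture weights are assumptions and do not reproduce the exact model evaluated here.

# Sketch of the per-agent mixture proposal (section II-D), assuming PyTorch.
import torch
import torch.nn as nn

class AgentHead(nn.Module):
    # One head, with weights shared across every observed agent, mapping an
    # embedded observation of that agent to a steering-direction Gaussian.
    def __init__(self, agent_dim, conf_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(agent_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, conf_dim)
        self.log_std = nn.Linear(hidden, conf_dim)

    def forward(self, agent_obs):
        h = self.net(agent_obs)
        return self.mean(h), self.log_std(h).exp()

def mixture_log_likelihood(x_next, mu, sigma, local_mean, local_std, head, agent_obs_list):
    # Components: the RRT* direction N(mu, sigma), the local-observation proposal,
    # and one Gaussian per observed agent. Agents that do not matter can learn
    # broad, uninformative covariances and barely affect the mixture.
    comps = [torch.distributions.Normal(mu, sigma),
             torch.distributions.Normal(local_mean, local_std)]
    for obs in agent_obs_list:
        m, s = head(obs)
        comps.append(torch.distributions.Normal(m, s))
    log_ps = torch.stack([c.log_prob(x_next).sum() for c in comps])
    # Equal mixture weights (an assumption): log of the averaged component densities.
    return torch.logsumexp(log_ps, dim=0) - torch.log(torch.tensor(float(len(comps))))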
TABLE II
COMPARING THE SUCCESS RATE AND PATH LENGTH AS A FUNCTION OF THE NUMBER OF AGENTS OF FOUR DIFFERENT PLANNING APPROACHES.