Monte Carlo Vehicle Routing

Abstract. Nested Rollout Policy Adaptation (NRPA) is a Monte Carlo search algorithm that learns a playout policy in order to solve a single player game. In this paper we apply NRPA to the vehicle routing problem. This problem is important for large companies that have to manage a fleet of vehicles on a daily basis. Real problems are often too large to be solved exactly. The algorithm is applied to standard problems of the literature and to the specific problems of EDF (Electricité De France, the main French electric utility company). These specific problems have peculiar constraints. NRPA gives better results than the algorithm previously used by EDF.

1 Introduction

Monte Carlo Tree Search (MCTS) has been successfully applied to many games and problems [8].

Nested Monte Carlo Search (NMCS) [9] is an algorithm that works well for puzzles. It biases its playouts using lower level playouts. At level zero NMCS adopts a uniform random playout policy. Online learning of playout strategies combined with NMCS has given good results on optimization problems [38]. Other applications of NMCS include Single Player General Game Playing [32], Cooperative Pathfinding [6], Software testing [36], heuristic Model-Checking [37], the Pancake problem [7], Games [11], Cryptography [19] and the RNA inverse folding problem [33].

Online learning of a playout policy in the context of nested searches has been further developed for puzzles and optimization with Nested Rollout Policy Adaptation (NRPA) [40]. NRPA has found new world records in Morpion Solitaire and crossword puzzles. Stefan Edelkamp and co-workers have applied the NRPA algorithm to multiple problems. They have optimized the algorithm for the Traveling Salesman with Time Windows (TSPTW) problem [12, 21]. Other applications deal with 3D Packing with Object Orientation [23], the physical traveling salesman problem [24], the Multiple Sequence Alignment problem [25] or Logistics [22]. The principle of NRPA is to adapt the playout policy so as to learn the best sequence of moves found so far at each level.

Electricité De France (EDF) is the main French electric utility company. Each year, eleven million services have to be carried out by EDF technicians at customers' places (problem fixing, meter replacement, energy diagnosis, business prospects). All together, EDF technicians drive more than 220 million kilometers per year. Each service is accomplished by a technician having the required skills, within a time window determined during the appointment booking. As a result, numerous Vehicle Routing Problems with Time Windows (VRPTW) have to be solved on a daily basis (one problem per catchment area). Any of these VRPTW involves quite a big search space, since the EDF technicians of a single catchment area have to carry out hundreds of visits at customers' places every day. The objective of the work described in this paper was to adapt the NRPA algorithm to the specifics of the EDF problem.

Related approaches that apply Artificial Intelligence techniques to transportation include agent based approaches [5], Monte Carlo Search approaches [42, 30, 22, 1] and learning of simulations of urban traffic [16]. Our work is close to the Monte-Carlo Search approach applied to the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW).

We now give the outline of the paper. The next section briefly describes the state of the art of Vehicle Routing. The third section details the NRPA algorithm and its proposed improvements. The fourth section describes the problems of EDF and the current solver. The fifth section gives experimental results for various instances of Vehicle Routing, both from the literature (Solomon instances) and from real-life problems. The last section concludes.

1 LAMSADE, Université Paris-Dauphine, PSL, CNRS, France, email: Tristan.Cazenave@dauphine.psl.eu
2 EDF R&D, France, email: jean-yves.lucas@edf.fr hyoseok.kim@ensiie.fr thomas.triboulet@edf.fr

2 The Vehicle Routing Problem

The Vehicle Routing Problem (VRP) is one of the most studied optimization problems. It was first stated in 1959 in [18]. Basically, it consists of finding an optimal route for a number of vehicles that are used to deliver goods or services to a set of customers, taking into account a set of constraints. In the simplest variant of the problem, all vehicles start from a single depot and end their tour at this depot. The objective function to be minimized may combine up to three criteria (namely the number of customers that are not serviced, the number of vehicles used, and the total distance travelled by the vehicles), each criterion having an associated weight. The VRP can be formalized as a graph problem: let G = (V, A) be a directed graph where V is the set of vertices and A is the set of arcs. Each arc is labelled with a non-negative number. One of the vertices represents the depot, the others represent the customer locations. The arcs symbolize the roads between two customers, and the label of an arc gives the distance (or the travel time, or the travel cost) between two points. Let us recall that G is a directed graph, and thus its associated distance matrix is not necessarily symmetric. The problem then consists of finding the minimum number of tours such that each vertex except the depot belongs to one and only one tour, and the depot belongs to all of them. The VRP is NP-hard, which implies that so far we do not know of a general method able to optimally solve any instance of the problem in polynomial time. As a result, the VRP is considered very difficult. Nowadays exact methods can solve to optimality quite challenging instances of the problem (for instance up to one hundred visits or so with Branch and Price methods).
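The graph formulation above can be made concrete with a small runnable sketch. The instance below is a toy example of ours (not from the paper): node 0 is the depot and the distance matrix is deliberately asymmetric, so the cost of a tour depends on the direction in which it is driven.

```python
# Toy VRP instance: node 0 is the depot, nodes 1-3 are customers.
# dist[i][j] is the label of the arc (i, j); the matrix is asymmetric,
# as happens with one-way streets or direction-dependent travel times.
dist = [
    [0, 4, 7, 3],
    [5, 0, 2, 6],
    [7, 3, 0, 4],
    [2, 6, 5, 0],
]

def tour_length(tour, dist):
    """Length of a tour given as a node sequence that starts and
    ends at the depot, e.g. [0, 2, 1, 0]."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:]))

print(tour_length([0, 1, 2, 0], dist))  # 4 + 2 + 7 = 13
print(tour_length([0, 2, 1, 0], dist))  # 7 + 3 + 5 = 15: direction matters
```

The same two customers visited in the opposite order give a different length, which is exactly why the distance matrix of a directed graph need not be symmetric.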
2.1 VRP variations

The VRP is a real challenge for delivery companies operating a fleet of vehicles. It has given rise to a number of variations. The CVRP (Capacitated VRP) is defined as a VRP with a demand associated to each customer (e.g. the number of parcels they have purchased), each vehicle having a limited capacity. The VRP with time windows (VRPTW) requires serving each customer within a given time window (possibly different for each customer). The capacitated VRP with time windows (CVRPTW) combines the characteristics of the two previous variants. Although the previous variations are the most widely studied because of their great practical importance, many other extensions of the basic VRP have also been proposed. Let us just mention two of them. First, the dynamic VRP (DVRP), where vehicles may be dynamically re-routed during their tour in order to fulfill new customer orders. Second, the VRP with Pickup and Delivery (VRPPD), where a fleet of vehicles must satisfy transportation requests (e.g. picking parcels up at some places and delivering them at given locations).

2.2 Principal methods used for solving VRP

A wide range of methods have been used to solve the VRP. These methods may be broken down into three sub-groups, namely exact methods, heuristic methods and metaheuristic methods. They are presented hereafter.

Exact methods These methods aim at finding an optimal solution for the VRP. As mentioned above, they are used to solve instances of the VRP of substantial size. Among the most studied exact methods for the VRP and its variations we may mention the Branch-and-Cut and Branch-and-Price algorithms [28], the column-generation algorithm [4] and the set-partitioning method [2]. These methods can also provide a bound on the optimum, especially when relaxing some of the most complicated constraints, for example by dropping the classical subtour elimination constraints.

Heuristic methods Without being comprehensive, let us mention two interesting approaches: first the cluster-first, route-second heuristic [26], second the savings heuristic [14, 3]. For local search approaches, [44] describes a heuristic projection pool with powerful insertion and guided local search strategies. The local search described in [39] gives reference solutions on several instances of the Solomon VRP benchmark tested in this work.

Metaheuristic methods The most commonly used metaheuristics for solving the VRP and its variations are particle swarm optimization [31], simulated annealing [13], genetic algorithms [34, 43] and tabu search [15, 35]. Among the best algorithms for solving the VRP Solomon instances tested in this work, we can cite the genetic algorithm described in [29], the evolutionary algorithm detailed in [17] and the tabu search proposed in [20].

3 Nested Rollout Policy Adaptation for Vehicle Routing

In this section we start by explaining the NRPA algorithm. Then we give our modelling of the CVRPTW problem for NRPA. We finish the section by describing how the weights are heuristically initialized in order to speed up the convergence of the algorithm.

3.1 Description of NRPA

An effective combination of nested levels of search [9] and of policy learning has been proposed with the NRPA algorithm [40]. NRPA holds world records for Morpion Solitaire and crossword puzzles.

NRPA is given in Algorithm 3. The principle is to learn weights for the possible actions so as to bias the playouts. The adaptive rollout policy is a policy parameterized by weights on each action; during the playout phase, actions are sampled according to these weights. The playout algorithm is given in Algorithm 1. It performs Gibbs sampling, choosing each action with a probability proportional to the exponential of its weight. A move is coded as an integer that gives the index of its weight in the policy array of floats. The algorithm starts by initializing the sequence of moves that it will play (line 2). Then it performs a loop until it reaches a terminal state (lines 3-6). At each step of the playout it calculates the sum of the exponentials of the weights of the possible moves (lines 7-10) and chooses a move proportionally to its probability given by the softmax function (line 11). Then it plays the chosen move and adds it to the sequence of moves (lines 12-13).

Algorithm 1 The playout algorithm
1: playout (state, policy)
2:   sequence ← []
3:   while true do
4:     if state is terminal then
5:       return (score (state), sequence)
6:     end if
7:     z ← 0.0
8:     for m in possible moves for state do
9:       z ← z + exp (policy [code(m)])
10:    end for
11:    choose a move with probability exp(policy[code(move)]) / z
12:    state ← play (state, move)
13:    sequence ← sequence + move
14:  end while

Then, the policy is adapted on the best current sequence found, by increasing the weights of the best actions. The Adapt algorithm is given in Algorithm 2. The weights are updated at each step so as to favor the moves of the best sequence found so far at the current level. The principle of the adaptation is to add α to the weight of the action of the best sequence for each state encountered in the best sequence (lines 3-5), and to decrease the weights of the other possible actions by an amount proportional to their probabilities of being played (lines 6-12).

Algorithm 2 The Adapt algorithm
1: Adapt (policy, sequence)
2:   polp ← policy
3:   state ← root
4:   for move in sequence do
5:     polp [code(move)] ← polp [code(move)] + α
6:     z ← 0.0
7:     for m in possible moves for state do
8:       z ← z + exp (policy [code(m)])
9:     end for
10:    for m in possible moves for state do
11:      polp [code(m)] ← polp [code(m)] − α ∗ exp(policy[code(m)]) / z
12:    end for
13:    state ← play (state, move)
14:  end for
15:  policy ← polp

In NRPA, each nested level takes as input a policy and returns a sequence. Inside a level, the algorithm makes many recursive calls to lower levels, providing weights, getting sequences back and adapting the weights on those sequences. In the end, the algorithm returns the best sequence found at that level. At the lowest level, the algorithm simply makes a rollout.

The NRPA algorithm is given in Algorithm 3. At level zero it simply performs a playout (lines 2-3). At greater levels it performs N iterations, and for each iteration it calls itself recursively to get a score and a sequence (lines 4-7). If it finds a new best sequence for the level it keeps it as the best sequence (lines 8-11). Then it adapts the policy using the best sequence found so far at the current level (line 12).

Algorithm 3 The NRPA algorithm
1: NRPA (level, policy)
2:   if level == 0 then
3:     return playout (root, policy)
4:   else
5:     bestScore ← −∞
6:     for N iterations do
7:       (result,new) ← NRPA(level − 1, policy)
8:       if result ≥ bestScore then
9:         bestScore ← result
10:        seq ← new
11:      end if
12:      policy ← Adapt (policy, seq)
13:    end for
14:    return (bestScore, seq)
15:  end if

NRPA balances exploitation, by adapting the probabilities of playing moves toward the best sequence of the level, and exploration, by using Gibbs sampling at the lowest level. It is a general algorithm that has proven to work well for many optimization problems.

Playout policy adaptation has also been used with success for games such as Go [27] and various other games [10].

3.2 Modelling of the problem

There are several design choices when implementing NRPA. The first choice is how to code the possible moves. For the vehicle routing problem we choose to code a move as a starting node, an arrival node and a vehicle. More precisely, a solution to the problem is an ordered sequence of visits, including special visits (SV) that represent the fact that the corresponding technician is at the depot. A solution sequence thus always starts and ends with a SV. A SV between two visits within the sequence marks the end of a tour for one technician and the beginning of a tour for another technician (with no chronological ordering of these two tours, which will be carried out simultaneously).

Another design choice is the definition of the score of a playout. The score includes the number of non visited customers multiplied by a large penalty, the number of used vehicles multiplied by a rather large weight, and the number of kilometers. We filter out the moves that do not respect the time windows in the playout algorithm: all solutions respect the customer time windows and the vehicle time windows.

If there were more objectives, a lexicographic ordering of the objectives would be the best way to represent the scores of a playout. Playouts could then be compared simply and easily.

3.3 Initialization of the Weights

In NRPA the weights are uniformly initialized to 0.0. We propose to initialize the weights of NRPA heuristically in order to speed up convergence. The greater the distance from the current city to another city, the less likely the other city is a good choice. Standard NRPA starts with a uniform policy with all weights set to 0.0 and does not distinguish between close and far cities. The heuristic initialization sets each weight to a value proportional to the inverse of the distance. This initialization is only used for the first iteration of NRPA.

3.4 The Quantile Heuristic

The objective of the Quantile Heuristic is to penalize the moves of bad solutions. The solution scores of all playouts need to be stored. When a new playout is computed, if its score is worse than a defined quantile, then the weights of its moves are decreased. An efficient implementation is needed to limit the increase in computation time required to sort the solution scores and compute the quantile. The resulting algorithm is described in Algorithm 4. The adapt function is the same as for good solutions, with a negative coefficient α, as described in Algorithm 5. The coefficient α used for bad solutions can be different from the one used to adapt the policy to good solutions.

Algorithm 4 The NRPA algorithm with quantiles
1: NRPA with quantile (level, policy)
2:   allScores ← []
3:   quantile ← −∞
4:   if level == 0 then
5:     (result,new) ← playout(root, policy)
6:     allScores ← allScores + result
7:     quantile ← updateQuantile(allScores)
8:     return (result,new)
9:   else
10:    bestScore ← −∞
11:    for N iterations do
12:      (result,new) ← NRPA(level − 1, policy)
13:      if result ≥ bestScore then
14:        bestScore ← result
15:        seq ← new
16:      else
17:        if result ≤ quantile then
18:          policy ← Adapt Bad(policy, new)
19:        end if
20:      end if
21:      policy ← Adapt(policy, seq)
22:    end for
23:    return (bestScore, seq)
24:  end if
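The interplay of Algorithms 1-3 can be condensed into a short runnable sketch. The code below is a simplified illustration of ours, not the authors' implementation: it applies NRPA to a toy single-vehicle instance where a move is the next city to visit, the score is the negated tour length, and the policy maps each (from, to) pair of cities to a weight. The instance, constants and function names are our own assumptions.

```python
import math
import random

random.seed(0)

# Toy instance: asymmetric distance matrix, city 0 is the depot.
DIST = [
    [0, 4, 7, 3],
    [5, 0, 2, 6],
    [7, 3, 0, 4],
    [2, 6, 5, 0],
]
N_CITIES = len(DIST)
ALPHA = 1.0

def code(frm, to):
    # A move is coded as an integer indexing the policy array of floats.
    return frm * N_CITIES + to

def playout(policy):
    """Algorithm 1: Gibbs-sampling rollout; returns (score, sequence)."""
    current, remaining, seq, length = 0, set(range(1, N_CITIES)), [], 0.0
    while remaining:
        moves = list(remaining)
        weights = [math.exp(policy[code(current, m)]) for m in moves]
        move = random.choices(moves, weights=weights)[0]  # softmax choice
        seq.append(move)
        length += DIST[current][move]
        current = move
        remaining.remove(move)
    length += DIST[current][0]  # return to the depot
    return -length, seq         # higher score means shorter tour

def adapt(policy, sequence):
    """Algorithm 2: reinforce the moves of the best sequence."""
    polp = list(policy)
    current, remaining = 0, set(range(1, N_CITIES))
    for move in sequence:
        polp[code(current, move)] += ALPHA
        z = sum(math.exp(policy[code(current, m)]) for m in remaining)
        for m in remaining:  # decrease every legal move by alpha * prob
            polp[code(current, m)] -= ALPHA * math.exp(policy[code(current, m)]) / z
        remaining.remove(move)
        current = move
    return polp

def nrpa(level, policy, iterations=10):
    """Algorithm 3: nested search with policy adaptation."""
    if level == 0:
        return playout(policy)
    best_score, best_seq = -math.inf, None
    for _ in range(iterations):
        score, seq = nrpa(level - 1, list(policy))
        if score >= best_score:
            best_score, best_seq = score, seq
        policy = adapt(policy, best_seq)
    return best_score, best_seq

score, seq = nrpa(2, [0.0] * (N_CITIES * N_CITIES))
print(-score, seq)  # tour length and visiting order
```

On this 4-city instance the search space is tiny, so a level-2 search converges almost immediately; the point of the sketch is the structure: each level copies the policy for its recursive calls and adapts its own copy toward the best sequence it has seen.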
Algorithm 5 The Adapt algorithm for bad solutions
1: Adapt Bad (policy, sequence)
2:   polp ← policy
3:   state ← root
4:   for move in sequence do
5:     polp [code(move)] ← polp [code(move)] − α
6:     z ← 0.0
7:     for m in possible moves for state do
8:       z ← z + exp (policy [code(m)])
9:     end for
10:    for m in possible moves for state do
11:      polp [code(m)] ← polp [code(m)] + α ∗ exp(policy[code(m)]) / z
12:    end for
13:    state ← play (state, move)
14:  end for
15:  policy ← polp

4 The EDF Capacitated Vehicle Routing Problem with Time Windows

In this section we first explain the optimization problem encountered at EDF and then describe the current EDF approach to the problem.

4.1 Description of the EDF problems

The Vehicle Routing Problem modeled here includes time windows, which is a classical feature of VRP problems. This means that:

• Each technician has an availability time window. Each tour starts at the beginning of this time window and must end before the end of this time window.
• Each appointment has a time window and a duration. A tour visiting the appointment must start after the beginning of the time window and end (including the appointment duration) before the end of the time window.
• If a tour arrives at an appointment location before the beginning of its time window, it is possible to add waiting time before starting the appointment.

The problem also includes capacities, another classical feature of VRP problems:

• Each vehicle has several stock capacities (in the EDF problem, 6 per vehicle). Vehicle stocks are initialized with their initial capacity at the beginning of each tour.
• For any stock, each operation consumes a stock quantity.
• For each tour and each capacity, the sum of the quantities used by the operations of the tour must not be greater than the capacity of the stock.

Several peculiar features are also taken into account in the EDF modelling:

• An off-duty time window for the lunch break. No fixed place is imposed for it. It can interrupt a trip but not a service to a customer.
• Specific skills for the visits. For each visit, only a subset of technicians is skilled to carry out the technical operations.
• Trip distance and trip duration are not proportional, because the mean speed of a vehicle depends on the trip. Consequently, two different trip matrices are provided.
• The trip matrices are not symmetric: a trip between two points depends on the direction of travel.

Another difference with academic data is that on real problems there are not always enough people to carry out the whole set of visits. Consequently, we must first maximize the number of visits that will be done. Some of the appointments have a high priority and, for the sake of customer satisfaction, cannot be canceled. Other appointments have a lower priority, because they consist for example in a technical operation that can be postponed, with no corresponding customer appointment. Moreover, network preventive maintenance visits have a lower priority than troubleshooting activities. In order to evaluate a solution, a lexicographic objective function is taken into account:

• Maximization of the number of achieved visits of high priority.
• Maximization of the number of achieved visits of low priority.
• Minimization of the economic function, taking into account the number of technicians used (weighted with a proportional daily wage) and the number of kilometers (weighted with a proportional cost per kilometer).

Thus, when comparing two solutions, we first compare the number of high priority visits achieved in each solution. If these numbers are not equal then we know which solution is best. If they are equal, then we compare the numbers of low priority visits; if these numbers are again equal, then we compare the third criterion of the two solutions, that is the economic function. The problem is solved each day to determine the technician tours of the next day, with a computing time which must not exceed a few hours.

4.2 Description of the current EDF approach

The current approach to the problem is based on a stochastic greedy algorithm and variable neighborhood search. The stochastic greedy algorithm is based on the algorithm of [41]. The first phase consists of creating tours and inserting into each tour the visits that increase the travelled distance the least, until there is no room for new visits. For each insertion, the place in the tour is chosen to minimize the increase in kilometers. If no insertion is possible, a new tour is created. The process is then repeated until no visits or technicians are left.

The local search consists of 2 kinds of moves:

• Small moves consist of randomly choosing an appointment and finding the best place to move it, i.e. the best place in all the existing tours to reinsert the appointment.
• Large moves consist of randomly choosing a tour, entirely destroying it and trying to reinsert all its appointments into the other tours. If no improvement is possible, the move is canceled.

The advantage of such a method is that it rapidly computes good solutions, which are very often acceptable by operators because the greedy algorithm choices, based on distance, are humanly intuitive.
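The lexicographic comparison of solutions described in section 4.1 maps naturally onto tuple ordering. The sketch below is an illustrative simplification of ours (the field names are hypothetical, not EDF's): a solution is summarized by its number of achieved high priority visits, achieved low priority visits, and economic cost, and comparing key tuples implements the three-step rule in a single expression.

```python
from typing import NamedTuple

class Solution(NamedTuple):
    high_done: int  # achieved high priority visits (to maximize)
    low_done: int   # achieved low priority visits (to maximize)
    cost: float     # economic function: wages + km cost (to minimize)

def key(s: Solution):
    # Negate the criteria to be maximized so that a smaller key is better;
    # Python compares tuples lexicographically, element by element.
    return (-s.high_done, -s.low_done, s.cost)

a = Solution(high_done=40, low_done=10, cost=900.0)
b = Solution(high_done=40, low_done=12, cost=950.0)
# Equal on high priority visits, so the low priority count decides,
# and the higher cost of b never gets compared.
best = min([a, b], key=key)
print(best is b)  # True: b achieves more low priority visits
```

Encoding the ordering as a key tuple means any later criterion is only consulted when all earlier ones tie, exactly as in the comparison rule above.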
Table 1: The different algorithms tested on the 56 standard instances.

Table 3: Results for the different algorithms tested on the EDF instances.

Table 4: Summary for the different algorithms tested on the EDF instances.

                       nrpa                     nrpaD                    nrpaDQ                   opturn
                Missed   V      Km       Missed   V      Km       Missed   V      Km       Missed   V      Km
Summary (mean)  1.88     4.75   363.30   1.88     4.75   326.54   1.88     4.75   324.68   2.13     4.88   344.51
5 Experimental Results

The operational process of EDF consists in providing a solution to the CVRPTW in less than two hours. Each problem is solved during the night so as to provide solutions for the next day. The parameters used in our experiments for NRPA are level 3, α = 1 and 100 iterations per level. This algorithm is called nrpa in our tables. The results are given over 11 independent runs. The Distance heuristic is tested with the initialization of the weights only applied at the first iteration. The corresponding name of the algorithm is nrpaD in the tables. The Quantile heuristic is applied to the worst 80% of the playouts. It uses all the simulations performed to date and uses a specific α = 0.5. The algorithm that uses both the Distance and the Quantile heuristics is named nrpaDQ in the tables.

The current solver of EDF, Opturn, is launched only once with 3 000 iterations of greedy search and 3 000 iterations of local search. These values were chosen during the tuning of Opturn, after many experiments showed that no improvement is obtained with bigger values. As a result, the running time of Opturn is smaller than the running time of NRPA.

The running time for NRPA is approximately 7 000 seconds, and up to 8 000 seconds when using the Quantile heuristic. The running time of Opturn is approximately 5 000 seconds. Importantly, these run times are compatible with the operational process as they last approximately two hours.

Table 1 gives the results of the runs on the literature instances. It compares 4 algorithms:

• nrpa: standard NRPA (3 levels) without heuristics
• nrpaD: NRPA (3 levels) with the Distance initialization heuristic
• nrpaDQ: NRPA (3 levels) with the Distance initialization heuristic and the Quantile heuristic
• opturn: the current EDF solver, based on the Solomon and LNS heuristics

To compare the results of the 4 variants, the lexicographical approach takes into account first the number of vehicles used and then the kilometers:

• V: number of needed vehicles
• Km: sum of the kilometers of the tours

Table 2 gives the summary for all the runs of the algorithms on the standard problems. For the Km and the Vehicles (V), the figures are the averages of each algorithm on the Solomon instances. The δ value is the number of problems that are solved better than with standard NRPA minus the number of problems that are solved worse. The comparison is lexicographic, based on the number of vehicles and, if equal, on the number of kilometers. Logically, the δ value is 0 for the reference (nrpa variant).

We observe that nrpaDQ scores 55 while Opturn scores -50. This is a great improvement over the current solver. However, the current solver was optimized on the EDF problems. The Distance heuristic improves a lot over standard NRPA. The Distance + Quantile heuristic only improves slightly over the Distance heuristic alone.

Tables 3 and 4 give the results of the 4 algorithms on the EDF instances. For each algorithm, the 3 components of the objective function are displayed:

• Missed: number of missed appointments
• V: number of needed vehicles
• Km: sum of the kilometers of the tours

Table 3 compares the results of the runs of the algorithms on the EDF problems. For each instance we compare the number of realized technical operations, because on real EDF instances there are not necessarily enough vehicles to achieve all the operations.
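Keeping the running 80% quantile cheap matters for the Quantile heuristic, since every playout adds one score to the pool. One way to do this (an illustrative sketch of ours, not necessarily the authors' implementation) is to keep the scores in a sorted list maintained with bisect, so each insertion costs an O(log n) search plus an O(n) shift, and reading the quantile is a direct index lookup rather than a full re-sort.

```python
import bisect

class QuantileTracker:
    """Stores all playout scores in ascending order and reports the score
    below which a fraction q of the playouts fall (higher score = better)."""
    def __init__(self, q=0.8):
        self.q = q
        self.scores = []

    def add(self, score):
        bisect.insort(self.scores, score)  # keeps the list sorted

    def quantile(self):
        if not self.scores:
            return float("-inf")
        # Index of the q-quantile in the ascending list of scores.
        i = int(self.q * (len(self.scores) - 1))
        return self.scores[i]

tracker = QuantileTracker(q=0.8)
for s in [10, 50, 20, 40, 30]:
    tracker.add(s)
print(tracker.quantile())  # 40: 80% of the scores are at or below it
```

A playout whose score falls at or below this threshold would then be passed to the bad-solution adaptation, matching the "worst 80%" setting used in the experiments.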
the number of vehicles and the number of kilometers, as in classical [5] Ana LC Bazzan and Franziska Klügl, ‘A review on agent-based tech-
VRP problem instances. To compare the results of the 4 variants, the nology for traffic and transportation’, The Knowledge Engineering Re-
view, 29(3), 375–403, (2014).
lexicographical approach takes into account first the achieved opera- [6] Bruno Bouzy, ‘Monte-carlo fork search for cooperative path-finding’,
tions, then the number of vehicles used and finally the kilometers. in Computer Games - Workshop on Computer Games, CGW 2013,
Table 4 gives the average values on the whole set of EDF instances. Held in Conjunction with the 23rd International Conference on Arti-
The average number of missed visits is smaller with NRPA than ficial Intelligence, IJCAI 2013, Beijing, China, August 3, 2013, Revised
with Opturn. It is the same for all NRPA variants. The number of Selected Papers, pp. 1–15, (2013).
[7] Bruno Bouzy, ‘Burnt pancake problem: New lower bounds on the di-
vehicles used for the operations is also slightly smaller and this is ameter and new experimental optimality ratios’, in Proceedings of the
the same for all NRPA variants. The average number of kilometers is Ninth Annual Symposium on Combinatorial Search, SOCS 2016, Tar-
greater with standard NRPA than with Opturn, however it is smaller rytown, NY, USA, July 6-8, 2016, pp. 119–120, (2016).
with nrpaD and nrpaDQ than with Opturn. With respect to the lexico- [8] Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas,
Peter Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez,
graphic objective function described in subsection 4.1, we can con- Spyridon Samothrakis, and Simon Colton, ‘A survey of Monte Carlo
clude that nrpaD and nrpaDQ improves on the current solver for the tree search methods’, IEEE Transactions on Computational Intelli-
operational EDF instances. gence and AI in Games, 4(1), 1–43, (March 2012).
The Distance and Quantile heuristics provide solutions with the [9] Tristan Cazenave, ‘Nested Monte-Carlo Search’, in IJCAI, ed., Craig
same values for the 3 components of the objective function. These Boutilier, pp. 456–461, (2009).
[10] Tristan Cazenave, ‘Playout policy adaptation with move features’,
two heuristics perform better in average than the standard NRPA Theor. Comput. Sci., 644, 43–52, (2016).
variant as for the number of travelled kilometers : 324 vs 363, that [11] Tristan Cazenave, Abdallah Saffidine, Michael John Schofield, and
is a 10 percents improvement. If we compare with Opturn results, Michael Thielscher, ‘Nested monte carlo search for two-player games’,
the improvement account for around 5 percents. This is still signif- in Proceedings of the Thirtieth AAAI Conference on Artificial Intelli-
gence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 687–693,
icant from a practical point of view because of the huge amount of (2016).
kilometers travelled by the technicians on a yearly basis : about 220 [12] Tristan Cazenave and Fabien Teytaud, ‘Application of the nested rollout
millions of kilometers. policy adaptation algorithm to the traveling salesman problem with time
windows’, in Learning and Intelligent Optimization - 6th International
Conference, LION 6, Paris, France, January 16-20, 2012, Revised Se-
6 Conclusion lected Papers, pp. 42–54, (2012).
[13] W.C. Chiang and R. Russel, ‘Simulated annealing metaheuristics for
We have presented the NRPA algorithm and its application to the the vehicle routing problem with time windows’, Annals of Operations
Research, 13(1), 3–27, (1996).
CVRPTW problems. This problem is important for EDF, a company [14] G. Clarke and J. Wright, ‘Scheduling of vehicles from a central depot
that plans numerous operations every day on the french electrical net- to a number of delivery points’, Operations Research, 12, 171–183,
work. We have given the modelization used to address the CVRPTW (1964).
problem and two heuristics to improve on standard NRPA: the Dis- [15] J-F. Cordeau, M. Gendreau, and G. Laporte, ‘A tabu search heuristic
for the periodic and multi-depot vehicle routing problems’, Networks,
tance heuristic and the Quantile heuristic. The Distance heuristic im- 30(2), 105–119, (1997).
proves a lot on standard NRPA and the Quantile heuristic is a slight [16] Luca Crociani, Gregor Lämmel, Giuseppe Vizzari, and Stefania Ban-
improvement. We also compared NRPA with the heuristics to the dini, ‘Learning obervables of a multi-scale simulation system of urban
current EDF solver Opturn. On standard instances NRPA with the heuristic is much better. On the operational EDF instances it is still better, even though Opturn was tuned for these instances while NRPA is a general algorithm that uses a Distance heuristic that works for all kinds of VRP problems.

As we have seen from the EDF instances, NRPA with heuristics performs better than the current solver. The percentage of kilometers saved by heuristic NRPA is greater than 5% of the total number of kilometers. EDF agents drive hundreds of millions of kilometers each year, so the use of heuristic NRPA could save millions of kilometers each year and reduce the carbon footprint of EDF by hundreds of tons of CO2.

REFERENCES

[1] Ashraf Abdo, Stefan Edelkamp, and Michael Lawo, ‘Nested rollout policy adaptation for optimizing vehicle selection in complex VRPs’, in 2016 IEEE 41st Conference on Local Computer Networks Workshops (LCN Workshops), pp. 213–221. IEEE, (2016).
[2] Y. Agrawal, K. Mathur, and H.M. Salkin, ‘A set-partitioning-based algorithm for the vehicle routing problem’, Networks, 19(7), 731–749, (1989).
[3] S.P. Anbuudayasankar, K. Ganesh, S.C. Lenny Koh, and Y. Ducq, ‘Modified savings heuristics and genetic algorithm for bi-objective vehicle routing problem with forced backhauls’, Expert Systems with Applications, 30, 2296–2305, (2012).
[4] N. Azi, M. Gendreau, and J.-Y. Potvin, ‘An exact algorithm for a vehicle routing problem with time windows and multiple use of vehicles’, European Journal of Operational Research, 202(3), 756–763, (2010).
[16] … traffic.’, in ATT@IJCAI, pp. 40–48, (2018).
[17] D. Mester, O. Bräysy, and W. Dullaert, ‘A multi-parametric evolution strategies algorithm for vehicle routing problems’, Working Paper, Institute of Evolution, University of Haifa, Israel, (2005).
[18] G.B. Dantzig and J.H. Ramser, ‘The truck dispatching problem’, Management Science, 6(1), 80–91, (1959).
[19] Ashutosh Dhar Dwivedi, Paweł Morawiecki, and Sebastian Wójtowicz, ‘Finding differential paths in ARX ciphers through nested Monte-Carlo search’, International Journal of Electronics and Telecommunications, 64(2), 147–150, (2018).
[20] É. Taillard, P. Badeau, M. Gendreau, F. Guertin, and J.-Y. Potvin, ‘A tabu search heuristic for the vehicle routing problem with time windows’, Transportation Science, 31, 170–186, (1997).
[21] Stefan Edelkamp, Max Gath, Tristan Cazenave, and Fabien Teytaud, ‘Algorithm and knowledge engineering for the TSPTW problem’, in Computational Intelligence in Scheduling (SCIS), 2013 IEEE Symposium on, pp. 44–51. IEEE, (2013).
[22] Stefan Edelkamp, Max Gath, Christoph Greulich, Malte Humann, Otthein Herzog, and Michael Lawo, ‘Monte-Carlo tree search for logistics’, in Commercial Transport, 427–440, Springer, (2016).
[23] Stefan Edelkamp, Max Gath, and Moritz Rohde, ‘Monte-Carlo tree search for 3D packing with object orientation’, in KI 2014: Advances in Artificial Intelligence, 285–296, Springer International Publishing, (2014).
[24] Stefan Edelkamp and Christoph Greulich, ‘Solving physical traveling salesman problems with policy adaptation’, in Computational Intelligence and Games (CIG), 2014 IEEE Conference on, pp. 1–8. IEEE, (2014).
[25] Stefan Edelkamp and Zhihao Tang, ‘Monte-Carlo tree search for the multiple sequence alignment problem’, in Eighth Annual Symposium on Combinatorial Search, (2015).
[26] M. Fisher and R. Jaikumar, ‘A generalized assignment heuristic for vehicle routing’, Networks, 11(2), 109–124, (1981).
[27] Tobias Graf and Marco Platzner, ‘Adaptive playouts in monte-carlo tree
search with policy-gradient reinforcement learning’, in Advances in
Computer Games - 14th International Conference, ACG 2015, Leiden,
The Netherlands, July 1-3, 2015, Revised Selected Papers, pp. 1–11,
(2015).
[28] G. Gutierrez-Jarpa, G. Desaulniers, G. Laporte, and V. Marianov, ‘A branch-and-price algorithm for the vehicle routing problem with deliveries, selective pickups and time windows’, European Journal of Operational Research, 206(2), 341–349, (2010).
[29] J. Berger, M. Barkaoui, and O. Bräysy, ‘A parallel hybrid genetic algorithm for the vehicle routing problem with time windows’, Working Paper, Defense Research Establishment Valcartier, Canada, (2001).
[30] Jacek Mańdziuk and Cezary Nejman, ‘UCT-based approach to capacitated vehicle routing problem’, in International Conference on Artificial Intelligence and Soft Computing, pp. 679–690. Springer, (2015).
[31] Y. Marinakis, G-R. Iordanidou, and M. Marinaki, ‘Particle swarm op-
timization for the vehicle routing problem with stochastic demands’,
Applied Soft Computing, 13, 1693–1704, (2013).
[32] Jean Méhat and Tristan Cazenave, ‘Combining UCT and Nested Monte
Carlo Search for single-player general game playing’, IEEE Transac-
tions on Computational Intelligence and AI in Games, 2(4), 271–277,
(2010).
[33] Fernando Portela, ‘An unexpectedly effective Monte Carlo technique for the RNA inverse folding problem’, bioRxiv, 345587, (2018).
[34] J.-Y. Potvin and S. Bengio, ‘The vehicle routing problem with time windows Part II: Genetic search’, INFORMS Journal on Computing, 8(2), 165–172, (1996).
[35] J.-Y. Potvin, T. Kervahut, B.-L. Garcia, and J.-M. Rousseau, ‘The vehicle routing problem with time windows Part I: Tabu search’, INFORMS Journal on Computing, 8(2), 158–164, (1996).
[36] Simon M. Poulding and Robert Feldt, ‘Generating structured test data
with specific properties using nested monte-carlo search’, in Genetic
and Evolutionary Computation Conference, GECCO ’14, Vancouver,
BC, Canada, July 12-16, 2014, pp. 1279–1286, (2014).
[37] Simon M. Poulding and Robert Feldt, ‘Heuristic model checking using
a monte-carlo tree search algorithm’, in Proceedings of the Genetic and
Evolutionary Computation Conference, GECCO 2015, Madrid, Spain,
July 11-15, 2015, pp. 1359–1366, (2015).
[38] Arpad Rimmel, Fabien Teytaud, and Tristan Cazenave, ‘Optimization
of the Nested Monte-Carlo algorithm on the traveling salesman prob-
lem with time windows’, in Applications of Evolutionary Computa-
tion - EvoApplications 2011: EvoCOMNET, EvoFIN, EvoHOT, Evo-
MUSART, EvoSTIM, and EvoTRANSLOG, Torino, Italy, April 27-29,
2011, Proceedings, Part II, volume 6625 of Lecture Notes in Computer
Science, pp. 501–510. Springer, (2011).
[39] Y. Rochat and E.D. Taillard, ‘Probabilistic diversification and intensification in local search for vehicle routing’, Journal of Heuristics, 1, 147–167, (1995).
[40] Christopher D. Rosin, ‘Nested rollout policy adaptation for Monte
Carlo Tree Search’, in IJCAI, pp. 649–654, (2011).
[41] M.M. Solomon, ‘Algorithms for the vehicle routing and scheduling problems with time window constraints’, Operations Research, 35(2), 254–265, (1987).
[42] Kenneth Sörensen and Marc Sevaux, ‘A practical approach for robust
and flexible vehicle routing using metaheuristics and monte carlo sam-
pling’, Journal of Mathematical Modelling and Algorithms, 8(4), 387,
(2009).
[43] S.R. Thangiah, K.E. Nygard, and P.L. Juell, ‘GIDEON: A genetic algorithm system for vehicle routing with time windows’, in Proceedings of the 7th IEEE Conference on Artificial Intelligence Applications (CAIA 1991), 24–28 February 1991, Miami Beach, Florida, USA, pp. 322–328, (1991).
[44] Y. Nagata and O. Bräysy, ‘A powerful route minimization heuristic for the vehicle routing problem with time windows’, Operations Research Letters, 37(5), 333–338, (2009).