Path planning of autonomous UAVs using reinforcement learning
Christos Chronis, Georgios Anagnostopoulos, Elena Politi, Antonios Garyfallou, Iraklis Varlamis, George Dimitrakopoulos
Department of Informatics and Telematics, Harokopio University of Athens, Greece
E-mail: chronis@hua.gr; geoaiagia@gmail.com; politie@hua.gr;
it21577@hua.gr; varlamis@hua.gr; gdimitra@hua.gr
Abstract.
Autonomous BVLOS Unmanned Aerial Vehicles (UAVs) are gradually gaining their share of the drone market. Together with the demand for extended levels of autonomy comes the necessity for high-performance obstacle avoidance and navigation algorithms that will allow autonomous drones to operate with minimum or no human intervention. Traditional AI algorithms have been used extensively in the literature for finding the shortest path in 2-D or 3-D environments and navigating drones successfully through a known and stable environment. However, the situation becomes much more complicated when the environment is changing or not known in advance. In this work, we explore the use of advanced artificial intelligence techniques, such as reinforcement learning, to successfully navigate a drone within unspecified environments. We compare our approach against traditional AI algorithms in a set of validation experiments in a simulation environment, and the results show that, using only a couple of low-cost distance sensors, it is possible to navigate the drone successfully past obstacles.
1. Introduction
Unmanned Aerial Vehicles (UAVs), or drones, have recently been gaining popularity as sophisticated monitoring instruments for a number of applications. Advances in communication technologies, together with the growing miniaturization of onboard sensors and the development of new algorithms and software, have paved the way for the expansion of UAV use in many fields, opening new avenues in scientific research, environmental monitoring and beyond.
In particular, Beyond Visual Line of Sight (BVLOS) operations are coming into the spotlight of the drone industry, since they endow flights with several degrees of autonomy and efficiency, while reducing costs and increasing the granularity of surveillance or delivery. Owing to these aspects, such systems are gradually gaining their share of the UAV market. Together with the demand for extended levels of autonomy, BVLOS drones bring the necessity for high-performance obstacle avoidance and navigation algorithms that will allow autonomous UAVs to operate with minimum or no human intervention.
Traditional AI algorithms have been used extensively in the literature for finding the shortest path in 2-D or 3-D environments, avoiding obstacles, and navigating UAVs successfully through a known and stable environment. However, the situation can become much more complicated when the environment is changing or not known in advance. Such algorithms
take as input information from onboard sensors, such as cameras, depth cameras, LiDARs and proximity sensors, process it, and help the UAVs navigate safely to their destination.
In this work, we examine i) how the input from simple distance sensors can assist in the
detection of potential obstacles and ii) how more advanced artificial intelligence techniques,
such as reinforcement learning, can be employed to safely navigate a drone within unspecified
environments.
In UAV operations, reinforcement learning (RL) algorithms are widely used to support navigation through unknown environments [2]. Generally, in RL methods, learning occurs when the agent is rewarded for each desired action taken, while being punished for undesired ones. In the scenario of this study, the multi-rotor agent was rewarded for every action that would bring it closer to the landing target, while being penalised for any action in a different direction or for colliding with any intermediary obstacle.
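To make this agent-environment interaction concrete, the following minimal Python sketch shows a generic RL episode loop in which a multi-rotor agent is rewarded for approaching the target and penalised for collisions. It is only an illustration of the paradigm; env, agent and their methods are hypothetical placeholders and do not correspond to code provided in this paper.

def run_episode(env, agent, max_steps=500):
    # One training episode: the agent acts, the environment returns the next
    # state and a reward (positive when moving closer to the target, negative
    # on collisions), and the agent learns from the transition.
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.observe(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        if done:  # target reached or collision
            break
    return total_reward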
In order to validate the performance of our proposed solution, we perform our experiments in a simulated world environment (using the AirSim framework). For a more realistic representation of the UAV airspace, the simulation environment is enriched with multiple obstacles. We also experiment with different departure and arrival points to generalise our models. We compare the performance of the RL model, which uses only two low-cost distance sensors, against a typical AI approach that employs a LiDAR to detect obstacles and the A* algorithm to navigate the UAV to the final target.
The contributions of this work can be summarized as follows:
• it proposes an RL path-planning approach to safely navigate UAVs through obstacles using the input from two distance sensors and the GPS,
• it provides a low-cost and energy-efficient solution for obstacle detection and avoidance,
• the proposed solution considers in tandem the position, height and speed of the vehicle, providing a holistic solution for UAV navigation,
• it takes advantage of model training within a simulation environment, which allows unlimited trial and error, and permits transferring the model to the real world for verification.
The first results for the RL algorithm are promising and show that autonomous UAV navigation can be performed successfully in dynamic and non-predefined environments. Of course, similar experiments have to be performed in real-world environments, transferring the navigation algorithms to the edge (i.e. the UAV) and testing their performance and potential difficulties.
Section 2 provides an overview of related work on UAV navigation, obstacle avoidance and detection. Section 3 details the proposed methodology for UAV path planning and obstacle avoidance using reinforcement learning. Section 4 explains the experimental setup, the simulated environment, the task, and the evaluation metrics. Section 5 discusses the results achieved by the evaluated methods and the advantages and implications of each alternative. Finally, Section 6 concludes the paper and provides directions for the next steps in this field.
2. Related Work
As UAVs are increasingly considered for a wide range of applications, BVLOS operations bring significant opportunities for greater efficiency, productivity, safety and economic value. Nevertheless, the autonomous features of BVLOS set a number of complex and often conflicting requirements on the operation of drones in terms of performance, energy efficiency, or security, particularly for computationally intensive operations, such as navigation in dynamic environments where the re-allocation of resources is necessary [9]. Autonomous navigation through unknown environments requires vigilant path planning strategies, a continuous spatio-temporal perception of environmental conditions through data acquisition, and the ability of the
vehicle to make knowledge-based decisions based on the above with minimal human intervention [6]. Onboard information can be captured by sensors, cameras or the global positioning system (GPS). Path planning for UAVs typically involves the determination of a path from an initial point to a destination point, without the possibility of a collision with any existing obstacles. Path planning methods can be classified into two categories: those that are based on a sampling of the search space, and those that employ artificial intelligence (AI) to find a solution with respect to the representation of the environment [1].
Sampling-based methods search for paths in a predefined navigation space, using a discrete representation that maps the area into nodes or solutions [3]. Some examples include rapidly-exploring random trees (RRT), A-star (A*), probabilistic roadmaps (PRM), particle swarm optimization (PSO), etc. These methods have been implemented extensively for providing collision-free trajectories in typical UAV path planning problems. A comparison of A* and Dijkstra's algorithm was performed in [8] in an environment with static obstacles, where the performance of the A* algorithm was overall good in terms of obtaining optimal paths with respect to trajectory acquisition, path length and total travelling time. Rapidly-exploring Random Trees (RRT), known for producing optimal solutions in high-dimensional spaces, have been used with different variations in [13, 16] for randomly building a space-filling tree in the navigation space and finding the best path each time. Evolutionary algorithms, such as Particle Swarm Optimization (PSO), are generally recognized for solving path planning problems [7]. The interesting work in [15] proposed a 4D collision-free path planning algorithm, namely the Spatial Refined Voting Mechanism (SRVM), based on an improved mechanism designed for standard PSO.
Deep learning-based approaches have been proposed lately to increase robustness in more complex UAV path planning scenarios [12]. In this respect, trial-and-error based methods, such as Reinforcement Learning (RL), have been adopted to deal with the lack of available training datasets and the dynamic nature of the task [2]. Since RL often involves many failed attempts during learning, virtual environments that simulate the real world are preferred at the preparation stage for training the navigation agent, which can then be transferred to the real drone for testing. RL methods have been used to control a limited set of flight parameters, such as the altitude [4], the speed [5] or the path [14], while keeping the remaining attributes of the flight fixed. In [11], the authors investigated a typical path planning problem for a drone using deep RL algorithms to find a 3D path through various 3D obstacles within the AirSim¹ simulation environment.
As mentioned above, the existing path planning and navigation methods that rely on reinforcement learning use input from various sensors in order to provide a fusion of rich data to the algorithms and help them decide the best action in each state. The most common sensor for navigation and obstacle detection tasks is the Light Detection And Ranging sensor (LiDAR), which, however, is expensive and sensitive to bad weather or lighting conditions. Cameras are the most common sensors used in autonomous vehicle applications, due to their low cost, light weight and small energy footprint, but they lack depth information and are sensitive to bad lighting and weather conditions. Distance sensors are lightweight, robust sensors, immune to lighting or weather conditions, and offer good performance at a reasonable price.
All the existing RL methods, whether applied in a simulated or a real environment, have two main characteristics: i) they try to solve a single task at a time, such as navigation, obstacle avoidance, or altitude and speed control, and ii) they employ as input a rich fusion of information coming from cameras, LiDAR sensors and distance sensors in tandem. In the proposed approach, a limited set of distance sensors (one at the front and one at the bottom) is combined with RL algorithms to solve multiple tasks at a time and safely navigate the drone from start to destination, avoiding obstacles and keeping a safe altitude. We compare our method with a sampling-based method on the same task, which employs rich information from a front-facing
¹ https://microsoft.github.io/AirSim/
LiDAR, and demonstrate its superiority in finding a fast path to the destination, while avoiding the obstacles and any collision with them or with the ground.
3. Methodology
The main purpose of our work is to create a reinforcement learning model that is capable of navigating the UAV from a starting point A to a target point B, by deciding on a non-predefined path P(A, B) and without colliding with any obstacles during the flight. In this direction, we may assume that the final path P is decomposed into a series of sub-steps p1, . . . , pN, where each pi, i ∈ 1..N, refers to an obstacle avoidance maneuver, as shown in Figure 1a. The obstacles in the route can be buildings or any other objects that appear in the way of the drone. For this purpose, the drone must detect obstacles early, perceive their location, position and possible movement, and avoid collisions. A second objective is to use the minimum number of sensors, in an attempt to minimize the cost and load for the drone.
Figure 1: The UAV environment space (left) and the main flow of the A2C algorithm (right).
\[
\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(\alpha_t \mid s_t)\, A(s_t, \alpha_t) \qquad (1)
\]
The A2C algorithm updates the weights of both networks (the actor and the critic) at every step, and not at the end of a predefined number of episodes. In Equation (1), αt is the action taken at step t and st is the state at the same step. A(st, αt) is the Advantage Function (the Temporal Difference error), i.e. the difference between the predicted future reward (the critic's output) and the actual reward received at st. Finally, log πθ(αt | st) is the log-probability of the action taken at step t. Another big difference from other algorithms, such as DQN, is that A2C uses only a small batch of states in every training step, and that batch is replaced with a new batch of states at each step. This characteristic makes A2C a memory-efficient algorithm.
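The update rule of Equation (1) can be illustrated with the following PyTorch sketch of a single A2C step on a small batch of transitions. This is not the authors' implementation; the network sizes, learning rate and loss weighting are assumptions made for the example.

import torch
import torch.nn as nn

obs_dim, n_actions = 10, 6  # assumed dimensions of the state vector and action space
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=7e-4)

def a2c_update(states, actions, rewards, next_states, dones, gamma=0.99):
    # Critic estimates V(s); the TD error serves as the advantage A(s_t, a_t).
    values = critic(states).squeeze(-1)
    with torch.no_grad():
        targets = rewards + gamma * critic(next_states).squeeze(-1) * (1.0 - dones)
    advantages = targets - values
    # Policy-gradient term of Eq. (1): log-probability of the taken action times the advantage.
    log_probs = torch.log_softmax(actor(states), dim=-1)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(taken * advantages.detach()).mean()
    critic_loss = advantages.pow(2).mean()
    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()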
The altitude is used to teach the agent an acceptable flight altitude, and the velocities to teach it to approach the obstacles and the target with the proper speed, avoiding collisions or overshooting. The orientation is used to know where the drone is facing at any moment; in combination with the front distance sensor, it gradually allows the agent to learn where the obstacles are and how to avoid them carefully.
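As an illustration of how such an observation vector could be assembled, the sketch below combines the altitude, velocities, orientation and the two distance-sensor readings with the GPS-derived offset to the target. The exact composition and normalisation of the state vector is not fully specified in the paper, so the fields and their order here are assumptions.

import numpy as np

def build_observation(altitude, velocity_xyz, yaw, front_distance,
                      bottom_distance, position_xyz, target_xyz):
    # Offset to the target computed from the GPS positions.
    dx, dy, dz = np.asarray(target_xyz, dtype=np.float32) - np.asarray(position_xyz, dtype=np.float32)
    return np.array([
        altitude,                  # teaches an acceptable flight altitude
        *velocity_xyz,             # vx, vy, vz: approach speed, collision/overshoot avoidance
        np.sin(yaw), np.cos(yaw),  # heading encoded without the 2*pi discontinuity
        front_distance,            # front distance sensor: obstacles ahead
        bottom_distance,           # bottom distance sensor: clearance below
        dx, dy, dz,                # remaining displacement to the target
    ], dtype=np.float32)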
Action space: The action space contains all the available actions for the agent to execute at any moment. Given that the drone can move in all three dimensions, we define six different actions, two per axis. More specifically, we choose to increase the UAV speed by a default amount along the following axes: -Z and +Z for altitude control, -X and +X for forward or backward movement, and -Y and +Y for movement to the left or right.
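A possible mapping of this discrete action space to velocity commands is sketched below; the default increment and the sign convention (AirSim uses NED coordinates, where negative Z points upwards) are assumptions for illustration.

DELTA = 1.0  # m/s, assumed default speed increment

ACTIONS = {
    0: (+DELTA, 0.0, 0.0),  # +X: forward
    1: (-DELTA, 0.0, 0.0),  # -X: backward
    2: (0.0, +DELTA, 0.0),  # +Y: right
    3: (0.0, -DELTA, 0.0),  # -Y: left
    4: (0.0, 0.0, -DELTA),  # -Z: climb (NED: negative Z is up)
    5: (0.0, 0.0, +DELTA),  # +Z: descend
}

def apply_action(current_velocity, action_id):
    # Increase the UAV speed by a default amount along the chosen axis.
    dvx, dvy, dvz = ACTIONS[action_id]
    vx, vy, vz = current_velocity
    return (vx + dvx, vy + dvy, vz + dvz)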
Given the observation and action spaces, the objective of the agent is to choose the most
suitable action at every moment by evaluating the UAV state. The training process evaluates
each action after it is executed and computes the reward that must be given to the agent using
a reward function.
Reward function: The reward function is one of the hardest parts of any RL implementation, and its aim is to assist the agent in learning the policy for the task.
In the UAV case, the reward function forces the drone to minimize the distance between its current position and the target position without colliding with any object or the ground. Consequently, the reward for an action that does not result in a collision is a function of the distance to the target gained by that action. A negative reward is given in the case of a collision, related to the drone's distance from the target when the collision occurred. When the drone reaches the target area, a maximum positive reward is given. Two more negative
rewards are given when the drone flies for a long time without reaching the target, and when it overshoots the world limits in any direction.
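A minimal sketch of such a reward function is given below, assuming the -100 penalty mentioned later in the training parameters; the terminal bonus and the scaling of the collision penalty with the distance from the target are simplified assumptions for illustration.

def compute_reward(prev_dist, curr_dist, collided, reached_target,
                   timed_out, out_of_bounds):
    if reached_target:
        return 100.0   # maximum positive reward at the target area (value assumed)
    if collided or timed_out or out_of_bounds:
        return -100.0  # penalty for collisions, time-outs and leaving the world limits
    return prev_dist - curr_dist  # distance towards the target gained by this action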
Figure 2: A2C architecture.
Training parameters: For the training of A2C we have to set up several parameters concerning the rewards and penalties, the exploration/exploitation ratio and the optimization strategy for the learning rate. The reward function used a penalty of -100 for collisions with obstacles, collisions with the floor, time-outs and overshoots.
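The paper does not state which A2C implementation or exact hyperparameter values were used; the following sketch shows how such a training run could be set up with Stable-Baselines3, assuming a Gym-style wrapper around the AirSim multi-rotor (the wrapper and all values below are assumptions).

from stable_baselines3 import A2C

def train_a2c(env, total_timesteps=1_000_000):
    # env: a hypothetical Gym-style wrapper around the AirSim multi-rotor.
    model = A2C("MlpPolicy", env,
                learning_rate=7e-4,  # assumed; the optimisation strategy is tuned separately
                gamma=0.99,          # assumed discount factor
                ent_coef=0.01,       # entropy bonus governing exploration vs. exploitation
                verbose=1)
    model.learn(total_timesteps=total_timesteps)
    model.save("a2c_uav_path_planning")
    return model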
Evaluation metrics and baseline: For the evaluation of our method we use the percentage of test routes (1,000 routes in total) that were successfully completed or failed (i.e. crashed on the floor, crashed on an obstacle, got stuck in a looping movement for a long time, or went out of the world limits). We also measure the time needed to complete a route (flight time), and we report the average time over all successful routes and the standard deviation. Our baseline for comparison is our previous implementation of A* [8], which constitutes a hard baseline, since the drone in the case of A* is equipped with a very dense LiDAR that provides rich depth information about obstacles. In addition, we locked the movement on the z-axis in the case of A*, which further simplified the task.
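The metrics above can be computed from the per-route outcomes of the 1,000 test flights as in the sketch below; the outcome labels and the record structure are assumptions for illustration.

from statistics import mean, stdev

def summarise(routes):
    # routes: list of dicts such as {"outcome": "reached", "flight_time": 17.3};
    # assumed outcomes: "reached", "crash_obstacle", "crash_floor", "timeout_or_oob".
    n = len(routes)
    share = lambda label: 100.0 * sum(r["outcome"] == label for r in routes) / n
    times = [r["flight_time"] for r in routes if r["outcome"] == "reached"]
    return {
        "reach_target_%": share("reached"),
        "crash_obstacle_%": share("crash_obstacle"),
        "crash_floor_%": share("crash_floor"),
        "failed_other_%": share("timeout_or_oob"),
        "flight_time_avg_sec": mean(times),
        "flight_time_std_sec": stdev(times),
    }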
5. Results
The total training time for the A2C models was approximately 24 hours for 10,000 routes, on an AMD Ryzen 7 5800X3D (8-core) CPU with 32 GB of RAM, using an RTX 3090 Ti 24 GB graphics card. All the experiments were performed with a 10x acceleration of the AirSim engine clock speed, with a frame limiter at 60 FPS for consistency between the experiments. The respective evaluation time for the 1,000 test routes (i.e. random start and end points within the designated areas) was approximately 5 hours for A2C and 25 hours for A*.
From the results depicted in Table 1 it is obvious that the reinforcement learning approach, using only two distance sensors, cannot yet compete with the A* that uses the LiDAR in terms of detecting and avoiding obstacles. However, it is very promising that the agent has learned to keep a proper altitude and avoid crashes on the floor (only 1.5% failures). In addition, the percentage of cases in which the RL agent made the drone reach its target is very promising (62.5%), especially if we take into account that it used only two distance sensors to detect obstacles on its route. Given that we can continue the training in the simulated environment, this success ratio is expected to increase.
What is even more promising is the ability of the A2C algorithm to quickly navigate the UAV to the destination (17 seconds on average), which is 4 to 5 times faster than the A*. Faster routes also mean shorter routes and better energy efficiency for the UAVs. The stability in flight times (a smaller standard deviation from the mean) is also a positive sign.
Increasing the training time of the agent, and further adjusting the exploration/exploitation setup and other parameters, is expected to further improve the agent's performance before transferring the model to the real world. Given the current results and the limitations of this study, we can say that A2C, and RL algorithms in general, can be a promising solution for training path planning and obstacle avoidance agents in UAV simulation environments.
Table 1: Evaluation results for the 1,000 test routes.

                                              A*          A2C
Flight duration (avg. in completed routes)    88.51 sec   17.00 sec
Flight duration (stdev)                       71.84 sec   5.22 sec
Reach target percentage                       98.2%       62.5%
Crash on obstacle percentage                  1.8%        17.0%
Crash on floor percentage                     0.0%        1.5%
Failed to reach target percentage             0.0%        19.0%
6. Conclusions
In this work, we examined the power of reinforcement learning in teaching an agent to navigate a UAV within a simulated town environment, using a limited set of distance sensors to detect obstacles and avoid crashes, and GPS to properly locate the UAV at every moment until
reaching its destination. It is among our next steps to further optimize the RL algorithm parameters in order to improve the navigation success rate while keeping the flight duration short. In this direction, we will examine more complex reward functions that will teach the agent to slow down the drone when it is moving next to an obstacle or near the target area, to avoid crossing the world limits, and to keep a minimum altitude. We will also examine the use of more sensor types, especially cameras or additional distance sensors, in order to improve the agent's perception of the world. Finally, the capabilities of UAV swarm systems are a research area that the authors wish to explore further.
Acknowledgments
This project received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 876019. The JU receives support from the European Union's Horizon 2020 research and innovation programme and national authorities.
7. References
[1] Aggarwal, S., and Kumar, N. Path planning techniques for unmanned aerial vehicles: A review, solutions, and challenges. Computer Communications 149 (2020), 270–299.
[2] Azar, A. T., Koubaa, A., Ali Mohamed, N., Ibrahim, H. A., Ibrahim, Z. F., Kazim, M., Ammar, A., Benjdira, B., Khamis, A. M., Hameed, I. A., et al. Drone deep reinforcement learning: A review. Electronics 10, 9 (2021), 999.
[3] Garrett, C. R., Lozano-Pérez, T., and Kaelbling, L. P. Sampling-based methods for factored task and motion planning. The International Journal of Robotics Research 37, 13-14 (2018), 1796–1825.
[4] Koch, W., Mancuso, R., West, R., and Bestavros, A. Reinforcement learning for UAV attitude control. ACM Transactions on Cyber-Physical Systems 3, 2 (2019), 1–21.
[5] Lu, H., Li, Y., Mu, S., Wang, D., Kim, H., and Serikawa, S. Motor anomaly detection for unmanned aerial vehicles using reinforcement learning. IEEE Internet of Things Journal 5, 4 (2017), 2315–2322.
[6] Nex, F., Armenakis, C., Cramer, M., Cucci, D. A., Gerke, M., Honkavaara, E., Kukko, A., Persello, C., and Skaloud, J. UAV in the advent of the twenties: Where we stand and what is next. ISPRS Journal of Photogrammetry and Remote Sensing 184 (2022), 215–242.
[7] Politi, E., and Dimitrakopoulos, G. Comparison of evolutionary algorithms for AUVs for path planning in variable conditions. In 2019 International Conference on Computational Science and Computational Intelligence (CSCI) (2019), IEEE, pp. 1230–1235.
[8] Politi, E., Garyfallou, A., Panagiotopoulos, I., Varlamis, I., and Dimitrakopoulos, G. Path planning and landing for unmanned aerial vehicles using AI. In Proceedings of the Future Technologies Conference (2023), Springer, pp. 343–357.
[9] Politi, E., Varlamis, I., Tserpes, K., Larsen, M., and Dimitrakopoulos, G. The future of safe BVLOS drone operations with respect to system and service engineering. In 2022 IEEE International Conference on Service-Oriented System Engineering (SOSE) (2022), IEEE, pp. 133–140.
[10] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
[11] Shin, S.-Y., Kang, Y.-W., and Kim, Y.-G. Obstacle avoidance drone by deep reinforcement learning and its racing with human pilot. Applied Sciences 9, 24 (2019), 5571.
[12] Shin, S.-Y., Kang, Y.-W., and Kim, Y.-G. Reward-driven U-Net training for obstacle avoidance drone. Expert Systems with Applications 143 (2020), 113064.
[13] Sun, Q., Li, M., Wang, T., and Zhao, C. UAV path planning based on improved rapidly-exploring random tree. In 2018 Chinese Control and Decision Conference (CCDC) (Shenyang, China, 2018), IEEE, pp. 6420–6424.
[14] Wang, C., Wang, J., Zhang, X., and Zhang, X. Autonomous navigation of UAV in large-scale unknown complex environment with deep reinforcement learning. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP) (Montreal, Quebec, Canada, 2017), IEEE, pp. 858–862.
[15] Yang, L., Zhang, X., Zhang, Y., and Xiangmin, G. Collision free 4D path planning for multiple UAVs based on spatial refined voting mechanism and PSO approach. Chinese Journal of Aeronautics 32, 6 (2019), 1504–1519.
[16] Zu, W., Fan, G., Gao, Y., Ma, Y., Zhang, H., and Zeng, H. Multi-UAVs cooperative path planning method based on improved RRT algorithm. In 2018 IEEE International Conference on Mechatronics and Automation (ICMA) (Changchun, China, 2018), IEEE, pp. 1563–1567.