A Survey of FPGA-Based Robotic Computing
A Survey of FPGA-Based Robotic Computing
A Survey of FPGA-Based
Robotic Computing
Zishen Wan,* Bo Yu,* Thomas Yuang Li, Jie Tang, Yuhao Zhu,
Yu Wang, Arijit Raychowdhury, and Shaoshan Liu
O
easy-to-use development frameworks, so they have been widely ver the last decade, we have seen significant
adopted in several applications. On the other hand, FPGA-based progress in the development of robotics, span-
robotic accelerators are becoming increasingly competitive al-
ning from algorithms, mechanics to hardware
ternatives, especially in latency-critical and power-limited sce-
narios. With specialized designed hardware logic and algorithm platforms. Various robotic systems, like manipulators,
kernels, FPGA-based accelerators can surpass CPU and GPU legged robots, unmanned aerial vehicles, self-driving cars
Digital Object Identifier 10.1109/MCAS.2021.3071609 * These authors contributed equally to this work.
Date of current version: 24 May 2021 Corresponding author: Shaoshan Liu (email: shaoshan.liu@perceptin.io).
48 IEEE CIRCUITS AND SYSTEMS MAGAZINE 1531-636X/21©2021IEEE SECOND QUARTER 2021
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
have been designed for search and rescue [1], [2], explora- for high-performance scenarios. Recently, benefiting in
tion [3], [4], package delivery [5], entertainment [6], [7] part from the better accessibility provided by CUDA/
and more applications and scenarios. These robots are OpenCL, GPU has been predominantly used in many
on the rise of demonstrating their full potential. Take robotic applications. However, conventional CPU and
drones, a type of aerial robots, as an example, the num- GPUs usually consume 10 W to 100 W of power, which
ber of drones has grown by 2.83x between 2015 and 2019 are orders of magnitude higher than what is available on
based on the U.S. Federal Aviation Administration (FAA) the resource-limited robotic system.
report [8]. The registered number has reached 1.32 mil- Besides CPU and GPU, FPGAs are attracting attention
lion in 2019, and the FFA expects this number will come to and becoming a platform candidate to achieve energy-effi-
1.59 billion by 2024. cient robotics tasks processing. FPGAs require little pow-
However, robotic systems are pretty complicated er and are often built into small systems with less memory.
[9]–[11]. They tightly integrate many technologies and They have the ability to parallel computations massively
algorithms, including sensing, percep- and makes use of the properties of perception (e.g., ste-
tion, mapping, localization, decision reo matching), localization (e.g., SLAM), and planning
making, control, etc. This complexity (e.g., graph search) kernels to remove additional logic
poses many challenges for the design and simplify the implementation. Taking into account
of robotic edge computing systems [12], hardware characteristics, several algorithms are pro-
[13]. On the one hand, the robotic system posed which can be run in a hardware-friendly way and
needs to process an enormous amount achieve similar software performance. Therefore, FP-
of data in real-time. The incoming data GAs are possible to meet real-time requirements while
often comes from multiple sensors and achieving high energy efficiency compared to CPUs
is highly heterogeneous. However, the and GPUs.
robotic system usually has limited on- Unlike the ASIC counterparts, FPGA technology pro-
board resources, such as memory stor- vides the flexibility of on-site programming and re-pro-
age, bandwidth, and compute capabili- gramming without going through re-fabrication with a
ties, making it hard to meet the real-time modified design. Partial Reconfiguration (PR) takes this
requirements. On the other hand, the flexibility one step further, allowing the modification of
current state-of-the-art robotic system an operating FPGA design by loading a partial configu-
usually has strict power constraints on ration file. Using PR, part of the FPGA can be reconfig-
the edge that cannot support the amount ured at runtime without compromising the integrity of
of computation required for performing the applications running on those parts of the device
©SHUTTERSTOCK.COM/POPTIKA
tasks, such as 3D sensing, localization, that are not being reconfigured. As a result, PR can al-
navigation, and path planning. Therefore, low different robotic applications to time-share part of
the computation and storage complex- an FPGA, leading to energy and performance efficiency,
ity, as well as real-time and power con- and making FPGA a suitable computing platform for dy-
straints of the robotic system, hinder its namic and complex robotic workloads.
wide application in latency-critical or FPGAs have been successfully utilized in commercial
power-limited scenarios [14]. autonomous vehicles. Particularly, over the past three
Therefore, it is essential to choose a proper compute years, PerceptIn has built and commercialized autono-
platform for the robotic system. CPU and GPU are two mous vehicles for micromobility, and PerceptIn’s prod-
widely used commercial compute platforms. CPU is de- ucts have been deployed in China, US, Japan and Switzer-
signed to handle a wide range of tasks quickly and is of- land. In this paper, we review how PerceptIn developed
ten used to develop novel algorithms. A typical CPU can its computing system by relying heavily on FPGAs, which
achieve 10-100 GFLOPS with below 1GOP/J power effi- perform not only heterogeneous sensor synchroniza-
ciency [15]. In contrast, GPU is designed with thousands tions, but also the acceleration of software components
of processor cores running simultaneously, which en- on the critical path. In addition, FPGAs are used heavily
able massive parallelism. A typical GPU can perform up in space robotic applications, for FPGAs offered unprec-
to 10 TOPS performance and become a good candidate edented flexibility and significantly reduced the design
Zishen Wan,* School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA and John A. Paulson School of
Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138 USA. Bo Yu,* Thomas Yuang Li and Shaoshan Liu, PerceptIn Inc, Fremont,
CA 94539 USA. Jie Tang, School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong, China. Yuhao
Zhu, Department of Computer Science, University of Rochester, Rochester, NY 14627 USA. Yu Wang, Department of Electronic Engineering, Tsinghua
University, Beijing, China. Arijit Raychowdhury, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA.
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
cycle and development cost. In this paper, we also delve Cameras. Cameras are usually used for object rec-
into space-grade FPGAs for robotic applications. ognition and object tracking, such as lane detection in
The rest of paper is organized as follows: Section II autonomous vehicles and obstacle detection in drones,
introduces the basic workloads of the robotic system. etc. RGB-D camera can also be utilized to determine
Section III, IV and V reviews the various perception, local- object distances and positions. Take autonomous ve-
ization and motion planning algorithms and their imple- hicle as an example, the current system usually mounts
mentations on FPGA platforms. In section VI, we discuss eight or more 1080p cameras around the vehicle to de-
about FPGA partial reconfiguration techniques. Section tect, recognize and track objects in different directions,
VII and VIII present robotics FPGA applications in com- which can greatly improve safety. Usually, these cam-
mercial and space areas. Section IX concludes the paper. eras run at 60 Hz, which will process multiple gigabytes
of raw data per second when combined.
II. Overview of Robotics workloads GNSS/IMU. The global navigation satellite system
(GNSS) and inertial measurement unit (IMU) system
A. Overview help the robot localize itself by reporting both inertial
Robotics is not one technology but rather an integration updates and an estimate of the global location at a high
of many technologies. As shown in Fig 1, the stack of rate. Different robots have different requirements for lo-
the robotic system consists of three major components: calization sensing. For instance, 10 Hz may be enough
application workloads, including sensing, perception, for low-speed mobile robots, but high-speed autono-
localization, motion planning, and control; a software edge mous vehicles usually demand 30 Hz or higher for local-
subsystem, including operating system and runtime lay- ization, and high-speed drones may need 100 Hz or more
er; and computing hardware, including both microcon- for localization, thus we are facing a broad spectrum of
trollers and companion computers. sensing speeds. Fortunately, different sensors have their
We focus on the robotic application workloads in this own advantages and drawbacks. GNSS can enable fairly
section. The application subsystem contains multiple algo- accurate localization, while it runs at only 10 Hz, thus un-
rithms that are used by the robot to extract meaningful in- able to provide real-time updates. By contrast, both ac-
formation from raw sensor data to understand the environ- celerometer and gyroscope in IMU can run at 100–200 Hz,
ment and dynamically make decisions about its actions. which can satisfy the real-time requirement. However,
IMU suffers bias wandering over time or perturbation by
B. Sensing some thermo-mechanical noise, which may lead to an
The sensing stage is responsible for extracting meaning- accuracy degradation in the position estimates. By com-
ful information from the sensor raw data. To enable intel- bining GNSS and IMU, we can get accurate and real-
ligent actions and improve reliability, the robot platform time updates for robots.
usually supports a wide range of sensors. The number LiDAR. Light detection and ranging (LiDAR) is used
and type of sensors are heavily dependent on the specifi- for evaluating distance by illuminating the obstacles
cations of the workload and the capability of the onboard with laser light and measuring the reflection time. These
compute platform. The sensors can include the following: pulses, along with other recorded data, can generate
precise and three-dimensional information about the
surrounding characteristics. LiDAR plays an important
role in localization, obstacle detection and avoidance.
As indicated in [16], the choice of sensors dictates the
algorithm and hardware design. Take autonomous driv-
ing as an instance, almost all the autonomous vehicle
Sensing Perception Decision companies use LiDAR at the core of their technologies.
GPS/IMU Mapping Path Planning Examples include Uber, Waymo, and Baidu. PerceptIn
and Tesla are among the very few that do not use Li-
LiDAR Localization Action Prediction
DAR and, instead, rely on cameras and vision-based
Camera Object Detection Obstacle Avoidance
systems, and in particular PerceptIn’s data demon-
Radar/Sonar Object Tracking Feedback Control strated that for the low-speed autonomous driving sce-
Operating System nario, LiDAR processing is slower than camera-based
vision processing, but increases the power consump-
Hardware Platform
tion and cost.
Radar and Sonar. The Radio Detection and Rang-
Figure 1. The stack of the robotic system.
ing (Radar) and Sound Navigation and Ranging (Sonar)
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
system is used to determine the distance and speed to work [27] and pyramid scene parsing network (PSPNet)
a certain object, which usually serves as the last line [28] to combine global image-level information with the
of defense to avoid obstacles. Take autonomous vehicle locally extracted feature. By using auxiliary natural im-
as an example, a danger of collision may occur when ages, a stacked autoencoder model can be trained of-
near obstacles are detected, then the vehicle will apply fline to learn generic image features and then applied for
brakes or turn to avoid obstacles. Compared to LiDAR, online object tracking [29].
the Radar and Sonar system is cheaper and smaller, and
their raw data is usually fed to the control processor D. Localization
directly without going through the main compute pipe- The localization layer is responsible for aggregating
line, which can be used to implement some urgent func- data from various sensors to locate the robot in the en-
tions as swerving or applying the brakes. vironment model.
GNSS/IMU system is used for localization. The GNSS
C. Perception consist of several satellite systems, such as GPS, Galileo
The sensor data is then fed into the perception layer and BeiDou, which can provide accurate localization re-
to sense the static and dynamic objects, and build a sults but with a slow update rate. In comparison, IMU
reliable and detailed representation of the robot’s envi- can provide a fast update with less accurate rotation
ronment using computer vision techniques (including and acceleration results. A mathematical filter, such as
deep learning). Kalman Filter, can be used to combine the advantages of
The perception layer is responsible for object detec- the two and minimize the localization error and latency.
tion, segmentation and tracking. There are obstacles, However, this sole system has some problems, such as
lane dividers and other objects to detect. Traditionally, the signal may bounce off obstacles, introduce more
a detection pipeline starts with image pre-processing, noise, and fail to work in closed environments.
followed by a region of interest detector and then a LiDAR and High-Definition (HD) maps are used for
classifier that outputs detected objects. In 2005, Dalal localization. LiDAR can generate point clouds and pro-
and Triggs [17] proposed an algorithm based on histo- vide a shape description of the environment, while it
gram of orientation (HOG) and support vector machine is hard to differentiate individual points. HD map has a
(SVM) to model both the appearance and shape of the higher resolution compared to digital maps and makes
object under various condition. The goal of segmenta- the route familiar to the robot, where the key is to fuse
tion is to give the robot a structured understanding different sensor information to minimize the errors in
of its environment. Semantic segmentation is usually each grid cell. Once the HD map is built, a particle fil-
formulated as a graph labeling problem with vertices ter method can be applied to localize the robot in real-
of the graph being pixels or super-pixels. Inference al- time correlated with LiDAR measurement. However,
gorithms on graphical models such as conditional ran- the LiDAR performance may be severely affected by
dom field (CRF) [18], [19] are used. The goal of tracking weather conditions (e.g., rain, snow) and bring local-
is to estimate the trajectory of moving obstacles. Track- ization error.
ing can be formulated as a sequential Bayesian filter- Cameras are used for localization as well. The pipe-
ing problem by recursively running the prediction step line of vision-based localization is simplified as follows:
and correction step. Tracking can also be formulated 1) by triangulating stereo image pairs, a disparity map is
by tracking-by-detection handling with Markovian deci- obtained and used to derive depth information for each
sion process (MDP) [20], where an object detector is point; 2) by matching salient features between successive
applied to consecutive frames and detected objects are stereo image frames in order to establish correlations
linked across frames. between feature points in different frames, the motion
In recent years, deep neural networks (DNN), also between the past two frames is estimated; and 3) by com-
known as deep learning, have greatly affected computer paring the salient features against those in the known
vision and made significant progress in solving robot map, the current position of the robot is derived [30].
perception problems. Most state-of-the-art algorithms Apart from these techniques, sensor fusion strategy
now apply one type of neural network based on con- is also often utilized to combine multiple sensors for lo-
volution operation. Fast R-CNN [21], Faster R-CNN [22], calization, which can improve the reliability and robust-
SSD [23], YOLO [24], and YOLO9000 [25] were used to ness of robot [31], [32].
get much better speed and accuracy in object detection.
Most CNN-based semantic segmentation work is based E. Planning and Control
on Fully Convolutional Networks (FCN) [26], and there The planning and control layer is responsible for generat-
are some recent work in spatial pyramid pooling net- ing trajectory plans and passing the control commands
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
based on the original and destination of the robot. III. Perception on FPGA
Broadly, prediction and routing modules are also in-
cluded here, where their outputs are fed into down- A. Overview
stream planning and control layers as input. The pre- Perception is related to many robotic applications where
diction module is responsible for predicting the future sensory data and artificial intelligence techniques are
behavior of surrounding objects identified by the per- involved. Examples of such applications include stereo
ception layer. The routing module can be a lane-level matching, object detection, scene understanding, seman-
routing based on lane segmentation of the HD maps for tic classification, etc. The recent developments in ma-
autonomous vehicles. chine learning, especially deep learning, have exposed
Planning and Control layers usually include behav- robotic perception systems to more tasks. In this section,
ioral decision, motion planning and feedback control. we will focus on the recent algorithms and FPGA imple-
The mission of the behavioral decision module is to mentations in the stereo vision system, which is one of
make effective and safe decisions by leveraging all the key components in the robotic perception stage.
various input data sources. Bayesian models are be- Real-time and robust stereo vision systems are in-
coming more and more popular and have been applied creasingly popular and widely used in many percep-
in recent works [33], [34]. Among the Bayesian mod- tion applications, e.g., robotics navigation, obstacle
els, Markov Decision Process (MDP) and Partially Ob- avoidance [46] and scene reconstruction [47]–[49]. The
servable Markov Decision Process (POMDP) are the purpose of stereo vision systems is to obtain 3 D struc-
widely applied methods in modeling robot behavior. ture information of the scene using stereoscopic rang-
The task of motion planning is to generate a trajectory ing techniques. The system usually has two cameras to
and send it to the feedback control for execution. The capture images from two points of view within the same
planned trajectory is usually specified and represent- scenario. The disparities between the corresponding
ed as a sequence of planned trajectory points, and pixels in two stereo images are searched using stereo
each of these points contains attributes like location, matching algorithms. Then the depth information can
time, speed, etc. Low-dimensional motion planning be calculated from the inverse of this disparity.
problems can be solved with grid-based algorithms Throughout the whole pipeline, stereo matching is
(such as Dijkstra [35] or A* [36]) or geometric algo- the bottleneck and time-consuming stage. The stereo
rithms. High-dimensional motion planning problems matching algorithms can be mainly classified into two
can be dealt with sampling-based algorithms, such as categories: local algorithms [50]–[56] and global algo-
Rapidly-exploring Random Tree (RRT) [37] and Prob- rithms [57]–[61]. Local methods compute the dispari-
abilistic Roadmap (PRM) [38], which can avoid the ties by only processing and matching the pixels around
problem of local minima. Reward-based algorithms, the points of interest within windows. They are fast
such as the Markov decision process (MDP), can also and computationally-cheap, and the lack of pixel de-
generate the optimal path by maximizing cumula- pendencies makes them suitable for parallel accelera-
tive future rewards. The goal of feedback control is tion. However, they may suffer in textureless areas and
to track the difference between the actual pose and occluded regions, which will result in incorrect dispari-
the pose on the predefined trajectory by continuous ties estimation.
feedback. The most typical and widely used algorithm In contrast, global methods compute the disparities
in robot feedback control is PID. by matching all other pixels and minimizing a global
While optimization-based approaches enjoy main- cost function. They can achieve much higher accuracy
stream appeal in solving motion planning and control than local methods. However, they tend to come at high
problems, learning-based approaches [39]–[43] are be- computation cost and require much more resources due
coming increasingly popular with recent developments to their large and irregular memory access as well as
in artificial intelligence. Learning-based methods, such the sequential nature of algorithms, thus not suitable for
as reinforcement learning, can naturally make full use of real-time and low-power applications. Many research
historical data and iteratively interact with the environ- works in stereo systems focus on the speed and accu-
ment through actions to deal with complex scenarios. racy improvement of stereo matching algorithms, and
Some model the behavioral level decisions via reinforce- some of the implementations are summarized in Tab. I
ment learning [41], [43], while other approaches directly
work on motion planning trajectory output or even B. Local Stereo Matching on FPGA
direct feedback control signals [40]. Q-learning [44], Local algorithms are usually based on correlation, where
Actor-Critic learning [45], policy gradient [38] are some the process involves finding matching pixels in the left
popular algorithms in reinforcement learning. and right image patches by aggregating costs within a
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
specific region. There are many ways for cost aggrega- 640 × 480 resolution images by applying fast local consis-
tion, such as the sum of absolute differences (SAD) [62], tent dense stereo functions and cost aggregation.
the sum of squared differences (SSD) [63], normalized
cross-correlation (NCC) [64], and census transform C. Global Stereo Matching on FPGA
(CT) [65]. Many FPGA implementations are based on Global algorithms can provide state-of-the-art accuracy
these methods. Jin et al. [66] develop a real-time ste- and disparity map quality, however, they are usually
reo vision system based on census rank transformation processed through high computational-intensive optimi-
matching cost for 640 × 480 resolution images. Zhang et zation techniques or massive convolutional neural net-
al. [67] propose a real-time high definition stereo match- works, making them difficult to be deployed on resource-
ing design on FPGA based on mini-census transform limited embedded systems for real-time applications.
and cross-based cost aggregation, which achieves 60 However, some works have attempted to implement glob-
fps at 1024 × 768 pixel stereo images. The implementa- al algorithms on FPGA for better performance. Park et al.
tion of Honegger et al. [68] achieves 127 fps at 376 × 240 [70] present a trellis-based stereo matching system on
pixel resolution with 32 disparity levels based on block FPGA with a low error rate and achieved 30 fps at 320 × 240
matching. Jin et al. [69] further achieve 507.9 fps for resolution with 128 disparity levels. Sabihuddin et al. [71]
Table I.
Comparison of Stereo Vision Systems on FPGA platforms, across local stereo matching, global stereo matching,
semi-global stereo matching (SGM) and efficient large-scale stereo matching (ELAS) algorithms. The results reported
in each design are evaluated by frame rate (fps), image resolution (width # height), disparity levels, million disparity
estimations per second (MDE/s), power (W), resource utilization (logic% and BRAM%) and hardware platforms, where
MDE/s = width # height # fps # disparity.
Image
Frame Resolution
Rate (Width # Disparity Power Resource(%)
Algorithm Reference (FPS) Height) Level MDE/s (W) Logic/BRAM FPGA Platform
Local Jin et al. [66] 230 640 # 480 64 4522 – 34.0/95.0 Xilinx Virtex-4
Stereo Zhang et al. 60 1024 # 768 64 3020 1.56 61.8/67.0 XC4VLX200-10
Matching [67] 127 376 # 240 32 367 2.8 49.0/68.0 Altera EP3SL150
Honegger 507.9 640 # 480 60 9362 3.35 81.0/39.7 AItera Cyclone III
et al. [68] EP3C80
Jin et al. [69] Xilinx Vertex-6
Global Park et al. [70] 30 320 # 240 128 295 – –/– Xilinx Virtex II
Stereo Sabihuddin et 63.54 640 # 480 128 2498 – 23.0/58.0 pro-100
Matching al. [71] 32 640 # 480 60 590 1.40 72.0/46.0 Xilinx XC2VP100
Jin et al. [72] 30 1920 # 1680 60 5806 – 84.8/91.9 Xilinx XC4VLX160
Zha et al. [59] 30 1024 # 768 64 1510 0.17 57.0/53.0 Xilinx Kintex 7
Puglia et al. Xilinx Virtex-7
[60] XC7Z020CLG484-1
Semi- Banz et al. [74] 37 640 # 480 128 1455 2.31 51.2/43.2 Xilinx Virtex-5
Global Wang et al. 42.61 1600 # 1200 128 10472 2.79 93.9/97.3 Altera 5SGSMD5K2
Stereo [75] 127 1024 # 768 128 12784 – –/– AItera Cyclone IV
Matching Cambuim 72 1242 # 375 128 4292 3.94 75.7/30.7 Xilinx ZC706
et al. [76] 25 1024 # 768 256 5033 6.5 50.0/38.0 AItera Cyclone IV
Rahnama 147 1242 # 375 64 4382 9.8 68.7/38.7 GX, Stratix IV GX
et al. [77] Xilinx Ultrascale +
Cambuim ZCU102
et al. [78]
Zhao et al.
[79]
Efficient Rahnama 47 1242 # 375 – – 2.91 11.9/15.7 Xilinx ZC706
Large- et al. [80] 50 1242 # 375 – – 5 70.7/8.7 Xilinx ZCU104
Scale Rahnama
Stereo et al. [81]
Matching
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
implement a dynamic programming maximum likeli- design is evaluated in a full stereo vision system using
hood (DPML) based hardware architecture for dense two heterogeneous platforms, DE2i-150 and DE4, and
binocular disparity estimation and achieved 63.54 fps at achieves a 25 fps processing rate in 1024 × 768 HD maps
640 × 480 pixel resolution with 128 disparity levels. The with 256 disparity levels.
design in Jin et al. [72] uses a tree-structured dynamic While most existing SGM designs on FPGA are imple-
programming method, and achieves 58.7 fps at 640 × 480 mented using the register-transfer level (RTL), some
resolution as well as a low error rate. Recently, some works leveraged the high-level synthesis (HLS) ap-
other adaptations of global approaches for FPGA-imple- proach. Rahnama et al. [77] implement an SGM varia-
mentation have been proposed, such as cross-trees [59], tion on FPGA using HLS, which achieves 72 fps speed at
dynamic programming for DNA sequence alignment [60], 1242 × 375 pixel size with 128 disparity levels. To reduce
and graph cuts [73], where all of these implementations the design effort and achieve an appropriate balance
achieve real-time processing. among speed, accuracy and hardware cost, Zhao et al.
[79] recently propose FP-Stereo for building high-perfor-
D. Semi-Global Matching on FPGA mance SGM pipelines on FPGAs automatically. A series
Semi-global matching (SGM) [82] bridges the gap be- of optimization techniques are applied in this system to
tween local and global methods, and achieves a notable exploit parallelism and reduce resource consumption.
improvement in accuracy. SGM calculates the initial Compared to GPU designs [86], it achieves the same ac-
matching disparities by comparing local pixels, and curacy at a competitive speed while consuming much
then approximates an image-wide smoothness con- less energy.
straint with global optimization, which can obtain more To compare these implementations, the depth qual-
robust disparity maps through this combination. There ity of are evaluated on Middlebury Benchmark [87],
are several critical challenges for implementing SGM on with four image pairs Tsukuba, Venus, Teddy, Cones. As
hardware, e.g., data dependence, high complexity, and shown in Tab. II, there is a general trade-off between
large storage, so this is an active research field with accuracy and processing speed. The stereo vision sys-
recent works proposing FPGA-friendly variants of SGM tem designs in Tab. I are drawn as points in Fig. 2 (if
[74], [75], [83]–[85]. both power and speed number are reported), using
Banz et al. [74] propose a systolic-array based hard- log 10 (power) as x-coordinate and log 10 (speed ) as y-
ware architecture for SGM disparity estimation along coordinate ( y – x = log 10 (energy_efficiency)). Besides
with a two-dimensional parallelization concept for SGM. FPGA-based implementations, we also plot GPU and
This design achieves 30 fps performance at 640 × 480 CPU experimental results as a comparison to FPGA de-
pixel images with a 128-disparity range on the Xilinx signs’ performance. In general, local and semi-global
Virtex-5 FPGA platform. Wang et al. [75] implement a stereo matching designs have achieved higher perfor-
complete real-time FPGA-based hardware system that mance and energy efficiency than global stereo match-
supports both absolute difference-census cost initial- ing designs. As introduced in section III-C, global stereo
ization, cross-based cost aggregation and semi-global matching algorithms usually involve massive computa-
optimization. The system achieves 67 fps at 1024 × 768 tional-intensive optimization techniques. Even for the
resolution with 96 disparity levels on the Altera Stratix- same design, varying design parameters (e.g., window
IV FPGA platform, and 42 fps at 1600 × 1200 resolution size) may result in a 10x difference in energy efficiency.
with 128 disparity levels on the Altera Stratix-V FPGA Compared to GPU and CPU-based designs, FPGA-based
platform. The design in Cambuim et al. [76] uses a scal- designs have achieved higher energy efficiency, and the
able systolic-array based architecture for SGM based on speed of many FPGA implementations have surpassed
the Cyclone IV FPGA platform, and it achieves a 127 fps general-purpose processors.
image delivering rate in 1024 × 768 pixel HD resolution
with 128 disparity levels. The key point of this design E. Efficient Large-Scale Stereo Matching on FPGA
is the combination of disparity and multi-level paral- Another popular stereo matching algorithm that offers a
lelisms such as image line processing to deal with data good trade-off between speed and accuracy is Efficient
dependency and irregular data access pattern problems Large-Scale Stereo Matching (ELAS) [90], which is cur-
in SGM. Later, to improve the robustness of SGM and rently one of the fastest and accurate CPU algorithms
achieve a more accurate stereo matching, Cambuim concerning the resolution on Middlebury dataset. ELAS
et al. [78] combine the sampling-insensitive absolute implements a slanted plane prior very effectively while
difference in the pre-processing phase, and propose its dense estimation of depth is completely decompos-
a novel streaming architecture to detect noisy and able over all pixels, which make it attractive for eas-
occluded regions in the post-processing phase. The ily parallelized.
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
Rahnama et al. [80] first implement and evaluate
nonocc = 6.7
Average Bad
an FPGA accelerated adaptation of the ELAS algo-
Pixel Rate
all = 17.3
rithm, which achieved a frame rate of 47 fps (up to
30× compared high-end CPU) while consuming un-
6.05
5.61
8.71
7.65
17.2
8.2
der 4 W of power. By taking advantage of different
components on the SoC, several elaboration blocks
such as feature extraction and dense matching are
9.64
9.62
16.4
13.9
21.0
6.19
disc
—
—
executed on FPGA, while I/O and other conditional/
sequential blocks are executed on ARM-core CPU.
Cones
8.97
13.8
11.0
14.1
17.6
11.1
7.74
The authors also reveal the strategy to accelerate
—
all
complex and computationally diverse algorithms
for low power and real-time systems by collabora-
nonocc
3.34
3.51
5.41
7.34
tively utilizing different compute components. Lat-
8.12
2.12
8.4
—
er, by leveraging and combining the best features
of SGM and ELAS-based methods, Rahnama et al.
30.6
15.5
19.4
15.4
[81] propose a sophisticated stereo approach and
disc
17.4
17.1
—
—
achieve an 8.7% error rate on the challenging KITTI
2015 dataset at over 50 fps, with a power consump-
Teddy
A comparison between different designs on performance (MDE/s) and accuracy results on Middlebury Benchmark.
21.5
12.4
13.6
12.6
14.7
12.1
15.1
tion of only 4.5 W.
all
—
F. CNN-Based Stereo Vision System on FPGA
nonocc
6.08
6.79
12.5
7.54
11.4
Convolutional neural networks (CNNs) have been
8.11
7.17
—
demonstrated to perform very well on many vision
tasks such as image classification, object detec-
tion, and semantic segmentation. Recently, CNN has
36.8
5.62
2.79
1.92
1.95
13.1
disc
—
—
also been utilized in stereo estimation [91], [92] and
stereo matching [93]. CNN is applied to determine
(The lower of average bad pixel rate means the better stereo matching performance.)
Venus
5.27
0.89
0.87
2.97
1.68
15.7
0.6
—
all
3.59
2.37
0.4
2.7
works [97]–[100], with an example of lightweight YO- 1.2
—
20.3
8.87
14.2
14.0
7.64
6.6
YOLOv2 with a binarized CNN on Xilinx ZCU102 FPGA
—
2.51
11.6
2.17
4.15
all2
9.79
1.43
1.66
A. Overview
MDE/s
15437
10472
13076
9362
3020
4522
1455
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
requires accurate landmarks or pose estimates from sensory measurements in consecutive frames to physi-
known positions. cal landmarks. It incrementally deduces the robot mo-
Many SLAM algorithms have been developed in the tion by applying geometry constraints on the associated
last decades to improve the accuracy and robustness, sensory observations. The back-end tries to minimize
and its implementation comes in a diverse set of sizes errors introduced from sensory measurement noises by
and shapes. One end of the spectrum is dense SLAM performing optimizations on a batch of observed land-
algorithms [102]–[105], which can generate high-qual- marks and tracked poses. Filter based (e.g., Extended
ity maps of the environment with complex computa- Kalman Filter) and numerical optimization based (e.g.,
tions. Dense SLAM algorithms usually are executed on bundle adjustment) algorithms are two prevalent meth-
powerful and high-performance machines to ensure ods for SLAM back-end.
real-time performance. At the same time, the intensive A critical challenge to mobile robot localization is
computation characteristic makes dense SLAM hard to accuracy and efficiency under stringent power and re-
deploy on edge devices. The other end of the spectrum source constraints. To avoid losing tracked features due
is sparse SLAM [106]–[109], which is computationally to large motions between consecutive frames, SLAM
light by only selecting limited numbers of landmarks systems need to process sensory data at a high frame
or features. rate. For example, open data sets for evaluating local-
To form a compromise in terms of compute intensi- ization algorithms [112], [113] for drones and vehicles
ty and accuracy quality between these two extremes, provide images at 10 to 20 fps. Low power computing
a family of works described as semi-dense SLAM has systems are always required to extend the battery life
emerged [110], [111]. They aim to achieve better compu- of mobile robots. Most SLAM algorithms are developed
tational efficiency compared to dense methods by only on CPU or GPU platforms, of which power consump-
processing a subset of high-quality sensory information tion is hundreds of Watts. To execute SLAM efficiently
while providing a more dense and informative map com- on mobile robots and meet real-time and power con-
pared to sparse methods. straints, specialized chips and accelerators have been
A typical SLAM system includes two components: developed. FPGA SoCs provide rich sensor interfaces,
the front-end and the back-end, which are with different dedicated hardware logic and programmability, hence
computational characteristics. The front-end associates they have been explored in diverse ways in recent years.
3
[83] NVIDIA Jetson TX2 [79]
[72]
[68]
[84]
2.5
Local Stereo Matching Global Stereo Matching Semi-Global Stereo Matching GPU CPU
Figure 2. A comparison between different designs for perception tasks on a logarithm coordinate of power (W) and performance
(MDE/s).
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
We summarize and discuss FPGA-based accelerators for 1) EKF-SLAM
SLAMs in the following sections. EKF-SLAM [106] is a class of algorithms that utilizes
the extended Kalman Filter (EKF) for SLAM. EKF-SLAM
B. Dense SLAM on FPGA algorithms are typically feature-based and use the
Dense SLAM can construct high quality and complete maximum likelihood algorithm for data association.
models of the environment, and most of them are run- Several heterogeneous architectures using multi-core
ning in high-end hardware platforms (especially GPU). CPUs, GPUs, DSPs, and FPGAs are proposed to acceler-
One of the representative real-time dense SLAM algo- ate the complex computation in EKF-SLAM algorithms.
rithms is KinectFusion [114], which was released by Mi- Bonato et al. [115] presents the first FPGA-based ar-
crosoft in 2011. As a scene reconstruction algorithm, it chitecture for the EKF-SLAM based algorithm that is
continuously updates the global 3D map and tracks the capable of processing 2 D maps at up to 1800 features
location of depth cameras within the surrounding envi- at real-time with a frequency of 14 Hz, compared to
ronment. KinectFusion is generally composed of three 572 features with Pentium CPU and 131 features with
algorithms: ray-casting algorithm for generating graph- ARM. They analyze the computational complexity and
ics from surface information, iterative closest point memor y bandwidth requirements for FPG A-based
(ICP) algorithm for camera-tracking and volumetric in- EKF-SLAM, and then propose an architecture with a
tegration (VI) algorithm for integrating depth streams parallel memory access pattern to accelerate the ma-
into the 3D surface. Several works have attempted to trix multiplication. This design achieves two orders of
implement real-time dense SLAM algorithms on a het- magnitude more power-efficient than a general-pur-
erogeneous system with FPGA embedded. pose processor.
Several works implement computationally inten- Similarly, Tertei et al. [116] propose an efficient FP-
sive components of dense SLAMs, such as ICP and VI, GA-SoC hardware architecture for matrix multiplication
on FPGA to accelerate the critical path. Belshaw [102] with systolic arrays to accelerate EKF-SLAM algorithms.
presents an FPGA implementation of the ICP algorithm, The setup of this design is a PLB peripheral to PPC440
which achieves over 200 fps tracking speed with low hardcore embedded processor on a Virtex5 FPGA, and
tracking errors. This design divides the ICP algorithm it achieves a 7.3× speedup with a processing frequency
into filtering, nearest neighbor, transform recovery and of 44 Hz compared to the pure software implementation.
transform application stages. It leverages fixed-point Later, taking into account the symmetry in cross-cova-
arithmetic and the power of two data points to utilize riance matrix-related computations, Tertei et al. [117]
FPFA logic efficiently. Williams [103] notices that the improve the previous implementation to further reduce
nearest neighbor search takes up the majority of ICP the computational time and on-chip memory storage on
runtime, and then proposes two hybrid CPU-FPGA Zynq-7020 FPGA.
architectures to accelerate the bottleneck of the ICP- DSP is also leveraged in some works to accelerate
SLAM algorithm. The implementation is performed with EKF-SLAM algorithms. Vincke et al. [118] implement
Vivado HLS, a high-level synthesis tool from Xilinx, and an efficient implementation of EKF-SLAM on a low-cost
achieves a maximum 17.22× speedup over the ARM soft- heterogeneous architecture system consisting of a sin-
ware implementation. Hoorick [104] presents an FPGA- gle-core ARM processor with a SIMD coprocessor and
based heterogeneous framework using a similar HLS a DSP core. The EKF-SLAM program is partitioned into
method to accelerate the KinectFusion algorithm and different functional blocks based on the profiling char-
explored various ways of dataflow and data manage- acteristics results. Compared to a non-optimized ARM
ment patterns. Gautier et al. [105] embed both ICP and implementation, this design achieved 4.7× speed up
VI algorithms on an Altera Stratix V FPGA by using the from 12 fps to 57 fps. In a later work, Vincke et al. [119]
OpenCL language and the Altera OpenCL SDK. This de- replace the single-core ARM with a double-core ARM to
sign was a heterogeneous system with NVIDIA GTX 760 optimize the non-optimized blocks using the OpenMP
GPU and Altera Stratix V FPGA. By distributing different library. This design achieves a 2.75× speedup compared
workloads on different parts of SoC, the entire system to non-optimized implementation.
achieves up to 28 fps real-time speed.
2) ORB-SLAM
C. Sparse SLAM on FPGA ORB-SLAM [107] is an accurate and widely-used sparse
Sparse SLAM algorithms usually use a small set of fea- SLAM algorithm for monocular, stereo, and RGB-D cam-
tures to track and maintain a sparse map of surround- eras. Its framework usually consists of five main proce-
ing environments. These algorithms exhibit lower power dures: feature extraction, feature matching, pose esti-
consumption but are limited to the localization accuracy. mation, pose optimization and map updating. Based on
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
the profiling results on a quad-core ARM v8 mobile SoC, Abouzahir et al. [125] implement Fast-SLAM 2.0 on
feature extraction is the most computation-intensive a CPU-GPGPU-based SoC architecture. The algorithm
stage in the ORB-SLAM system, which consumes more is partitioned into function blocks, and each of them is
than half of CPU resources and energy budget [120]. implemented on the CPU or GPU accordingly. This op-
ORB based feature extraction algorithm usually timized and efficient CPU-GPGPU partitioning enables
consists of two parts, namely Oriented Feature from accurate localization and a 37× execution speedup com-
Accelerated Segment Test (oFAST) [121] based feature pared to non-optimized implementation on a single-core
detection and Binary Robust Independent Elementary CPU. Further, Abouzahir et al. [126] perform a complete
(BRIEF) [122] based feature descriptors computation. study of the processing time of different SLAM algo-
To accelerate this bottleneck, Fang et al. [120] design rithms under popular embedded devices, and demon-
and implement a hardware ORB feature extractor and strate that Fast-SLAM2.0 allowed a compromise between
achieved a great balance between performance and en- the consistency of localization results and computation
ergy consumption, which outperforms ARM Krait by 51% time. This algorithm is then optimized and implemented
and Intel Core i5 by 41% in computation latency as well on GPU and FPGA using HLS and parallel computing
as outperforms ARM Krait by 10% and Intel Core i5 by frameworks OpenCL and OpenGL. It is observed that
83% in energy consumption. Liu et al. [123] propose an the global processing time of FastSLAM2.0 on FPGA
energy-efficient FPGA implementation eSLAM to acceler- implementations achieves 7.5× acceleration compared
ate both feature extraction and feature matching stages. to high-end GPU. The processing frequency achieves
This design achieves up to 3× and 31× speedup in fram- 102 fps and meets the real-time performance constraints
erate, as well as up to 71× and 25× in energy efficiency of an operated robot.
improvement compared to Intel i7 and ARM Cortex-A9
CPUs, respectively. This eSLAM design utilizes a rota- 4) VO-SLAM
tionally symmetric ORB descriptor pattern to make the The visual odometry based SLAM algorithm (VO-SLAM)
algorithm more hardware-friendly, resulting in a 39% also belongs to the Sparse SLAM class with low com-
less latency compared to [120]. Rescheduling and par- putational complexity. Gu et al. [109] implement the
allelizing optimization techniques are also exploited to VO-SLAM algorithm on a DE3 board (Altera Stratix III)
improve the computation throughput in eSLAM design. to perform drift-free pose estimation, resulting in lo-
Scale-invariant feature transform (SIFT) and Harris calization results accurate to 1-2cm. A Nios II soft-core
corner detector are also commonly-used feature extrac- is used as a master processor. The authors design a
tion methods. SIFT is invariant to rotation and transla- dedicated matrix accelerator and propose a hierarchi-
tion. Gu et al. [109] implement SIFT-feature based SLAM cal matrix computing mechanism to support applica-
algorithm on FPGA and accelerate the matrix computa- tion requirements. This design achieves a processing
tion part to achieve speedup. Harris corner detector is speed of 31 fps with 30000 global map features, and 10×
used to extract corners and features of an image, and energy saving for each frame processing compared to
Schulz et al. [124] propose an implementation of Harris Intel i7 CPU.
and Stephen corner detector optimized for an embed-
ded SoC platform that integrates a multicore ARM pro- D Semi-Dense SLAM on FPGA
cessor with Zynq-7000 FPGA. Taking into account I/O Semi-dense SLAM algorithms have emerged to provide a
requirements and the advantage of parallelization and compromise between sparse SLAM and dense SLAM al-
pipeline, this design achieves a speedup of 1.77 com- gorithms, which attempt to achieve improved efficiency
pared to dual-core ARM processors. and dense point clouds. However, they are still usually
computationally intensive and require multicore CPUs
3) Fast-SLAM for real-time processing.
One of the key limitations of EKF-SLAM is its computation- Large-Scale Direct Monocular SLAM (LSD-SLAM)
al complexity since EKF-SLAM requires time quadratic in is one of the state-of-the-art and widely-used semi-
the number of landmarks to incorporate each sensor dense SLAM algorithms, and it directly operates on
update. In 2002, Montemerlo et al. [108] propose an ef- image intensities for both tracking and mapping prob-
ficient SLAM algorithm called Fast-SLAM. Fast-SLAM lems. The camera is tracked by direct image align-
decomposes the SLAM problem into a robot localiza- ment, while geometry is estimated from semi-dense
tion problem and a landmark estimation problem. It depth maps acquired by filtering over multiple stereo
recursively estimates the full posterior distribution pixel-wise comparisons.
over landmark positions and robot path with a loga- Several works have explored LSD-SLAM FPGA-SoC
rithmic scale. implementation. Boikos et al. [127] investigate the
Authorized licensed use limited to: Josif Kosev. Downloaded on July 19,2021 at 08:21:08 UTC from IEEE Xplore. Restrictions apply.
performance and acceleration opportunities for LSD- tra post-processing operations within CNN-based fea-
SLAM in the SoC system. This design achieves an aver- ture extraction networks. 8-bit fixed-point numerics are
age framerate of more than 4 fps for a resolution of 320 × leveraged in the post-processing operations and CNN
240 with an estimated power of less than 1 W, which is a backbone. Similar hardware-oriented model compres-
2× acceleration and more than 4.3× energy efficiency com- sion techniques (e.g., data quantization and weight re-
pared to a software version running on embedded CPUs. duction) have been widely adopted in robotics and CNN
The author also notes that the communication between related designs [135]–[142].
two accelerators is via DDR memory since the produced Yu et al. [143] build a CNN-based monocular decen-
intermediate data is too large to be fully cached on the tralized-SLAM (DSLAM) on the Xilinx ZCU102 MPSoC
FPGA. Hence, it is important to optimize the memory ar- platform with DPU. DSLAM is usually used in multi-ro-
chitecture (e.g., data movement and caching techniques) bot applications that can share environment informa-
to ensure the scalability and compatibility of the design. tion and locations between agents. To accelerate the
To further improve the performance of [127], Boikos main components in DSLAM, namely visual odometry
et al. [128] re-implement the design using a dataflow ar- (VO) and decentralized place recognition (DPR), the
chitecture and distributed asynchronous blocks to al- authors adopt CNN-based Depth-VO-Feat [144] and Net-
low the memory system and the custom hardware pipe- VLAD [145] to replace handcrafted approaches and pro-
lines to function at peak efficiency. This implementation pose a cross-component pipeline scheduling algorithm
can process and track more than 22 fps with an embed- to improve the performance.
ded power budget and achieves a 5× speedup over [127]. To enable multi-tasking processing in embedded ro-
Furthermore, Boikos et al. [129] combine a scalable bots on CNN accelerators, Yu et al. [146] further propose
depth estimation with direct semi-dense SLAM architec- an INterruptible CNN accelerator (INCA) with a novel
ture and propose a complete accelerator for semi-dense virtual-instruction-based interrupt method. Feature ex-
SLAM on FPGA. This architecture achieved more than traction and place recognition of DSLAM are deployed
60 fps at the resolution of 640 × 480 and an order of mag- and accelerated on the same CNN accelerator of the
nitude power consumption improvement compared to embedded FPGA system, and the interrupt response la-
Intel i7-4770 CPU. This implementation leverages multi- tency is reduced by 1%.
rate and multi-modal units to deal with LSD-SLAM’s
complex control flow. A new dataflow paradigm is also F. Bundle Adjustment on FPGA
proposed where the kernel is linked with a single con- Besides the hardware implementation of the frontend of
sumer and a single producer to achieve high efficiency. the SLAM system, several works investigate to acceler-
ate the backend of the SLAM system, mainly Bundle Ad-
E. CNN-Based SLAM on FPGA justment (BA). BA is heavily used in robot localization
Recently, CNNs have made significant progress in the [107], [147], autonomous driving [148], space exploration
perception and localization ability of the robots com- missions [149] and some commercial products [150],
pared to handcrafted methods. Take one of the main where it is usually employed in the last stage of the pro-
SLAM components, feature extraction, as an exam- cessing pipeline to refine camera trajectories and 3D
ple, the CNN-based approach SuperPoint [130] can structures further.
achieve 10%-30% higher matching accuracy compared Essentially, BA is a massive joint non-linear optimiza-
to handcrafted ORB. Other CNN-based methods, such tion problem that usually consumes a significant amount
as DeepDesc [131] and GeM [132], also present sig- of power and processing time in both offline visual re-
nificant improvements in feature extraction and de- construction and real-time localization applications.
scriptor generation stage. However, CNN has a much Several works aim to accelerate BA on multi-core
higher computational complexity and requires more CPUs or GPUs using parallel or distributed computing
memory footprint. techniques. Jeong et al. [151] exploit efficient memory
Several works have explored to deploy CNN on FP- handling and fast block-based linear solving, and pro-
GAs. Xilinx DPU [133] is one of the state-of-the-art pro- pose a novel embedded point iterations method, which
grammable dedicated to CNN, which has a specialized substantially improves the BA performance on CPU.
instruction set and works efficiently across various CNN Wu et al. [152] present a multi-core parallel process-
topologies. Xu et al. [134] propose a hardware architec- ing solution running on CPUs and GPUs. The matrix-
ture to accelerate CNN-based feature extraction Super- vector product is carefully restructured in this design
Point on the Xilinx ZCU102 platform and achieve 20 fps to reduce memory requirements and compute latency
in a real-time SLAM system. The key point of this design is an optimized software dataflow to deal with the expensive computation and improve BA performance substantially. Eriksson et al. [153] propose a distributed approach for very large-scale global bundle adjustment computation to achieve BA performance improvement. The authors present a consensus framework using the proximal splitting method to reduce the computational cost. Similarly, Zhang et al. [154] propose a distributed formulation to accelerate the global BA computation without much distributed-computing communication overhead.

To better deploy BA in embedded systems with strict power and real-time constraints, recent works explore BA algorithm acceleration using specialized hardware. The design in [155] implements both the image frontend and the BA backend of a VIO algorithm on a single chip for nano-drone-scale applications. Liu et al. [156] propose a hardware-software co-designed BA accelerator and its implementation on an embedded FPGA-SoC to achieve higher performance and power efficiency simultaneously. In particular, a co-observation optimization technique and a hardware-friendly differentiation method are proposed to accelerate BA operations with optimized usage of memory and computation resources. Sun et al. [157] present a hardware architecture running local BA on FPGAs, which works without external memory access and refines both the camera poses and the 3D map points simultaneously.

G. Discussion
We summarize FPGA-based SLAM systems in Tab. III. It only includes works that implement the whole SLAM pipeline on an FPGA and provide an overall performance and power evaluation. The works in the table adopt a similar FPGA-SoC architecture that accelerates the computationally intensive components with FPGA fabrics and offloads the other tasks to the embedded processors on the FPGAs. Compared with the sparse methods, the semi-dense implementation has a lower frame rate, which is mainly due to the high-resolution data processed in the pipeline. Due to their high frame rates and low power consumption, sparse SLAM FPGA systems have been used in drones and autonomous vehicles [16]. The two sparse SLAM implementations achieve similar performance in terms of frame rate. Compared with the ORB design, the VO SLAM design includes pre-processing and outlier-removal hardware, such as image rectification and RANSAC, which leads to a more accurate but less power-efficient implementation.

Table III. Comparison of FPGA SLAM Systems.

  Work                 Method      Platform               Frame Rate   Power   Indoor Error
  Boikos et al. [127]  Semi-dense  Xilinx Zynq 7020 SoC   4.5 fps      2.5 W   n/a
  Liu et al. [123]     ORB         Xilinx Zynq 7000 SoC   31 fps       1.9 W   4.5 cm
  Gu et al. [109]      VO          Altera Stratix III     31 fps       5.9 W   2 cm

V. Planning and Control on FPGA

A. Overview
Planning and control are the modules that compute how the robot should maneuver itself. They usually include behavioral decision, motion planning and feedback control kernels. Without loss of generality, we focus on the motion planning algorithms and their FPGA implementations in this section.

As a fundamental problem in robotic systems, motion planning aims to find the optimal collision-free path from the current position to a goal position for a robot in complex surroundings. Generally, motion planning contains three steps, namely roadmap construction, collision detection and graph search [38], [158]. Motion planning becomes a relatively complicated problem when robots work with high degree-of-freedom (DOF) configurations, since the search space increases exponentially. Typically, state-of-the-art CPU-based approaches take a few seconds to find a collision-free trajectory [159]–[161], making the existing motion planning algorithms too slow to meet the real-time requirement for complex robot tasks and environments. Several works have investigated approaches to speed up motion planning, either for individual stages or for the whole pipeline.

B. Roadmap Construction
In the roadmap construction step, the planner generates a set of states in the robot's configuration space and then connects them with edges to construct a general-purpose roadmap in the obstacle-free space. Each state represents a robot configuration, and each edge represents a possible robot movement. Conventional algorithms build the roadmap by randomly sampling poses from the configuration space at runtime to navigate around the obstacles present at that time.

Several works explore roadmap construction acceleration. Yershova et al. [162] improve the nearest-neighbor search to accelerate roadmap construction by orders of magnitude compared to naive nearest-neighbor searching. Wang et al. [163] reduce the computation workload by trimming roadmap edges and keeping the roadmap at a reasonable size to achieve speedup. Different from these online runtime approaches, Murray et al. [164] completely remove the runtime latency by conducting the roadmap construction only once at design time. A more general and much larger roadmap is precomputed, which allows for fast and successive queries in complex environments without reprogramming the accelerator during runtime.
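To make the roadmap construction step concrete, the sketch below builds a small probabilistic roadmap (PRM) in the spirit of [38]: it samples random collision-free configurations and connects each sample to its nearest neighbors with edges that pass a collision check. The 2D configuration space, the circular obstacle model, and all parameter values are illustrative assumptions, not details of any surveyed design.

    import math
    import random

    # Illustrative 2D configuration space with circular obstacles (assumed model).
    OBSTACLES = [((5.0, 5.0), 1.5), ((2.0, 7.0), 1.0)]  # (center, radius)

    def in_collision(p):
        return any(math.dist(p, c) <= r for c, r in OBSTACLES)

    def edge_in_collision(p, q, step=0.05):
        # Check sampled points along the segment p-q.
        n = max(1, int(math.dist(p, q) / step))
        return any(in_collision((p[0] + (q[0] - p[0]) * t / n,
                                 p[1] + (q[1] - p[1]) * t / n))
                   for t in range(n + 1))

    def build_prm(num_samples=200, k=5, bounds=(0.0, 10.0)):
        # 1) Sample collision-free states in the configuration space.
        nodes = []
        while len(nodes) < num_samples:
            p = (random.uniform(*bounds), random.uniform(*bounds))
            if not in_collision(p):
                nodes.append(p)
        # 2) Connect each node to its k nearest neighbors with collision-free edges.
        edges = set()
        for i, p in enumerate(nodes):
            by_dist = sorted(range(len(nodes)), key=lambda j: math.dist(p, nodes[j]))
            for j in by_dist[1:k + 1]:  # index 0 is the node itself
                if not edge_in_collision(p, nodes[j]):
                    edges.add((min(i, j), max(i, j)))
        return nodes, edges

    nodes, edges = build_prm()
    print(f"roadmap: {len(nodes)} states, {len(edges)} edges")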
C. Collision Detection
In the collision detection step, the planner determines whether there are potential collisions with the environment or the robot itself during movement. Collision detection is the primary challenge in motion planning, and it often comprises around 90% of the processing time [165].

Several works leverage data-parallel computing on GPUs to achieve speedup [165]–[167]. For example, Bialkowski et al. [165] divide the collision detection tasks of the RRT* algorithm into three parallel dimensions and construct thread-block grids to execute collision computations simultaneously. However, GPUs can only provide a constant speedup factor due to core-count limitations, which still makes it hard to meet the real-time requirement.

Recently, [168]–[170] developed high-efficiency custom hardware implementations based on FPGAs. Atay and Bayazit [168] focus on directly accelerating the PRM algorithm on an FPGA by creating functional units that perform random sampling and nearest-neighbor search, and by parallelizing triangle-triangle testing. However, this design cannot be reconfigured at runtime, and its huge resource demands make it unable to support a large roadmap. Murray et al. [169] present a novel microarchitecture for an FPGA-based accelerator that speeds up collision detection by creating a specialized circuit for each motion in the roadmap. This solution achieves sub-millisecond speed for motion planning queries and improves power consumption by more than one order of magnitude, which is sufficient to enable real-time robotics applications.

Besides the real-time constraint, motion planning algorithms also have flexibility requirements so that robots can adapt to dynamic environments. Dadu-P [170] builds a scalable motion planning accelerator to attain both high efficiency and flexibility, where a motion plan can be solved in around 300 microseconds in a dynamic environment. A hardware-friendly data structure representing roadmap edges is adopted to achieve flexibility, and batched processing as well as a priority-rating method are proposed to achieve high efficiency. But this design incurs a 25× latency increase to make it retargetable to different robots and scenarios, due to external memory accesses. Murray et al. [164] develop a fully retargetable microarchitecture for a collision detection and graph search accelerator that can perform motion planning in less than 3 ms with a modest power consumption of 35 W. This design divides the collision detection workflow into two stages. The collision detection results for the discretized roadmap are precomputed in the first stage before runtime, and then the collision detection accelerator streams in the voxels of obstacles at runtime and flags the edges that are in collision.
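A minimal software model of this two-stage idea follows, under assumed data structures: offline, each roadmap edge is mapped to the set of voxels its swept volume covers; at runtime, obstacle voxels are streamed in and every edge whose voxel set intersects them is flagged as in collision. The voxel encoding and the example numbers are illustrative, not taken from the accelerator itself.

    from collections import defaultdict

    # Offline stage (design time): edge id -> set of voxel ids covered by the
    # edge's swept volume. edge_sweeps is an assumed, precomputed table.
    edge_sweeps = {0: {101, 102, 103}, 1: {103, 104}, 2: {200, 201}}

    voxel_to_edges = defaultdict(set)
    for edge, voxels in edge_sweeps.items():
        for v in voxels:
            voxel_to_edges[v].add(edge)

    # Runtime stage: stream in obstacle voxels and flag colliding edges.
    def flag_colliding_edges(obstacle_voxel_stream):
        flagged = set()
        for v in obstacle_voxel_stream:
            flagged |= voxel_to_edges.get(v, set())
        return flagged

    print(flag_colliding_edges([103, 999]))  # edges 0 and 1 collide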
D. Graph Search
After collision detection, the planner tries to find the shortest safe path from the start position to the target position on the obtained collision-free roadmap through graph search. Several works explore graph search acceleration. Bondhugula et al. [171] employ a parallel FPGA-based design using a blocked algorithm to solve large instances of the All-Pairs Shortest-Paths (APSP) problem, which achieves a 15× speedup over an optimized CPU-based implementation. Sridharan et al. [172] present an architecture-efficient solution based on Dijkstra's algorithm to accelerate the shortest-path search, and Takei et al. [173] extend it for a high degree of parallelism and large-scale graph search. Recently, Murray et al. [164] accelerate graph search with the Bellman-Ford algorithm. By leveraging a precomputed roadmap and bounding specific robot quantities, this design enables a more compact and efficient storage structure, dataflows and a low-cost interconnection network.
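The Bellman-Ford algorithm used in [164] maps well to hardware because every edge can be relaxed independently within an iteration. A plain software version over a small roadmap is sketched below; the graph representation and weights are illustrative assumptions.

    import math

    def bellman_ford(num_nodes, edges, source):
        """edges: iterable of (u, v, weight) over a collision-free roadmap."""
        dist = [math.inf] * num_nodes
        dist[source] = 0.0
        # Relax every edge up to |V| - 1 times; in hardware, the inner loop
        # is what gets unrolled and executed in parallel.
        for _ in range(num_nodes - 1):
            updated = False
            for u, v, w in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    updated = True
            if not updated:  # early exit once distances converge
                break
        return dist

    edges = [(0, 1, 1.0), (1, 2, 2.5), (0, 2, 4.0), (2, 3, 1.0)]
    print(bellman_ford(4, edges, source=0))  # [0.0, 1.0, 3.5, 4.5]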
VI. Partial Reconfiguration
FPGA technology provides the flexibility of on-site programming and re-programming without going through re-fabrication with a modified design. Partial Reconfiguration (PR) takes this flexibility one step further, allowing the modification of an operating FPGA design by loading a partial configuration file, usually a partial BIT file [174]. Using PR, after a full BIT file configures the FPGA, partial BIT files can be downloaded to modify reconfigurable regions in the FPGA without compromising the integrity of the applications running on those parts of the device that are not being reconfigured.

A major performance bottleneck for PR is the configuration overhead, which seriously limits the usefulness of PR. To address this problem, the authors of [175] propose a combination of two techniques to minimize the overhead. First, they design and implement fully streaming DMA engines to saturate the configuration throughput. Second, they exploit a simple form of data redundancy to compress the configuration bitstreams, and implement an intelligent internal configuration access port (ICAP) controller to perform decompression at runtime. This design achieves an effective configuration data transfer throughput of up to 1.2 Gbytes/s, which actually well surpasses the theoretical
upper bound of the data transfer throughput, 400 Mbytes/s. Specifically, the proposed fully streaming DMA engines reduce the configuration time from the range of seconds to the range of milliseconds, a more than 1000-fold improvement. In addition, the proposed compression scheme achieves up to a 75% reduction in bitstream size and results in a decompression circuit with negligible hardware overhead.
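The survey does not pin down the specific coding scheme, but the flavor of exploiting "a simple form of data redundancy" can be illustrated with run-length encoding of repeated configuration words, sketched below; the word width and frame contents are assumptions for illustration only.

    def rle_compress(words):
        """Run-length encode a sequence of configuration words."""
        out = []
        i = 0
        while i < len(words):
            j = i
            while j < len(words) and words[j] == words[i]:
                j += 1
            out.append((j - i, words[i]))  # (run length, word)
            i = j
        return out

    def rle_decompress(pairs):
        # In [175] the decompressor sits in front of the ICAP port;
        # here it is just a software stand-in.
        return [w for count, w in pairs for _ in range(count)]

    bitstream = [0x0] * 6 + [0xDEADBEEF] + [0x0] * 4
    packed = rle_compress(bitstream)
    assert rle_decompress(packed) == bitstream
    print(packed)  # [(6, 0), (1, 3735928559), (4, 0)]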
Another problem with PR is that it may incur additional energy consumption. In [176], the authors investigate whether PR can be used to reduce FPGA energy consumption. The core idea is that there are a number of independent circuits within a hardware design, and some can be idle for long periods of time. Idle circuits still consume power, though, especially through clock oscillation and static leakage. Using PR, one can replace these circuits during their idle time with others that consume much less power. Since the reconfiguration process itself introduces an energy overhead, it is unclear whether this approach actually leads to an overall energy saving or to a loss. This study identifies the precise conditions under which partial reconfiguration reduces the total energy consumption, and proposes solutions to minimize the configuration energy overhead. In the study, PR is compared against clock gating to evaluate its effectiveness. The authors apply these techniques to an existing embedded microprocessor design, and successfully demonstrate that FPGAs can be used to accelerate application performance while also reducing overall energy consumption.

Further, PerceptIn demonstrates in its commercial products that runtime partial reconfiguration (RPR) is useful for robotic computing, especially computing for autonomous vehicles, because many on-vehicle tasks usually have multiple versions, each of which is used in a particular scenario [16]. For instance, in PerceptIn's design, the localization algorithm relies on salient features; features in key frames are extracted by a feature extraction algorithm (based on ORB features [177]), whereas features in non-key frames are tracked from previous frames (using optical flow [178]); the latter executes in 10 ms, 50% faster than the former. Spatially sharing the FPGA is not only area-inefficient, but also power-inefficient, as the unused portion of the FPGA consumes non-trivial static power. In order to temporally share the FPGA and "hot-swap" different algorithms, PerceptIn developed a partial reconfiguration engine (PRE) that dynamically reconfigures part of the FPGA at runtime. The PRE achieves a 400 MB/s reconfiguration throughput (i.e., bitstream programming rate). Both the feature extraction and tracking bitstreams are less than 4 MB. Thus, the reconfiguration delay is less than 1 ms.
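A simplified software model of this time-sharing policy is sketched below: key frames load the feature-extraction bitstream, other frames load the tracking bitstream, and a swap is skipped when the right module is already resident. The PRE interface shown (load_bitstream and its timing model) is hypothetical; only the swap policy reflects the design described above.

    class PartialReconfigEngine:
        """Toy stand-in for PerceptIn's PRE (this interface is hypothetical)."""
        THROUGHPUT_MB_S = 400.0  # bitstream programming rate from the text

        def __init__(self):
            self.resident = None

        def load_bitstream(self, name, size_mb):
            if self.resident == name:
                return 0.0  # module already resident: no reconfiguration
            self.resident = name
            return size_mb / self.THROUGHPUT_MB_S * 1000.0  # delay in ms

    pre = PartialReconfigEngine()
    total_delay = 0.0
    for frame_id in range(6):
        is_key_frame = frame_id % 3 == 0  # assumed key-frame cadence
        if is_key_frame:
            total_delay += pre.load_bitstream("orb_extraction", size_mb=3.8)
        else:
            total_delay += pre.load_bitstream("optical_flow_tracking", size_mb=3.8)
    print(f"total reconfiguration delay: {total_delay:.1f} ms")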
VII. Commercial Applications of FPGAs in Autonomous Vehicles
Over the past three years, PerceptIn has built and commercialized autonomous vehicles for micromobility, and our products have been deployed in China, the US, Japan and Switzerland. We summarize the system design constraints, workloads and their performance characteristics from these real products. A custom computing system is developed by taking into account the inherent task-level parallelism, cost, safety and programmability [16], [179]. The FPGA plays a critical role in our system: it synchronizes the various sensors and accelerates the components on the critical path.

A. Computing System
Software pipeline. Fig. 3 shows the block diagram of the processing pipeline in our vehicle, which consists of three parts: sensing, perception and planning. The sensing module bridges the sensors and the computing system. It synchronizes the various sensor samples for the downstream perception module, which performs two fundamental tasks: 1) locating the vehicle itself in a global map and 2) understanding the surroundings through depth estimation and object detection. The planning module uses the perception results to devise a drivable route, and then converts the planned path into a sequence of control commands, which drive the vehicle along the path. The control commands are sent to the vehicle's Engine Control Unit (ECU) via the CAN bus interface.

[Fig. 3: Block diagram of the processing pipeline: sensing feeds perception (ego-motion estimation via visual-inertial odometry (VIO) and GPS fusion, plus scene understanding via object detection, tracking, and depth estimation, which yields the position in global coordinates and object velocity, position, and class), which feeds planning (traffic prediction, collision detection, and path generation) to produce the control commands.]

Sensing, perception and planning are serialized; they are all on the critical path of the end-to-end latency. We pipeline the three modules to improve the throughput. Within perception, localization and scene understanding are independent and can execute in parallel. While there are multiple tasks within scene understanding, they are mostly independent, with the only exception that object tracking must be serialized with object detection. These task-level parallelisms influence how the tasks are mapped to the hardware platform.

Algorithm. Our localization module is based on Visual Inertial Odometry (VIO) algorithms [180], [181], which fuse camera images, IMU and GPS samples to estimate the vehicle pose in the global map. The depth estimation employs traditional stereo vision algorithms, which calculate depth according to the principle of triangulation [182]. In particular, our method is based on the classic ELAS algorithm, which uses hand-crafted features [183]. While DNN models for depth estimation exist, they are orders of magnitude more compute-intensive than non-DNN algorithms [184] while providing only marginal accuracy improvements for our use-cases. We detect objects using DNN models, such as YOLO [24]. We use the Kernelized Correlation Filter (KCF) [185] to track detected objects. The planning algorithm is formulated as Model Predictive Control (MPC) [186].
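For reference, the stereo triangulation principle mentioned above reduces to a one-line relation, Z = f·B/d, where f is the focal length in pixels, B the baseline between the two cameras, and d the disparity of a matched pixel pair. The sketch below applies it with made-up camera parameters; it is the textbook relation, not PerceptIn's ELAS implementation.

    def depth_from_disparity(disparity_px, focal_px, baseline_m):
        """Pinhole stereo triangulation: Z = f * B / d."""
        if disparity_px <= 0:
            raise ValueError("disparity must be positive")
        return focal_px * baseline_m / disparity_px

    # Illustrative parameters (not PerceptIn's actual rig).
    focal_px = 700.0     # focal length in pixels
    baseline_m = 0.12    # 12 cm between the stereo cameras
    for d in (70.0, 35.0, 7.0):
        z = depth_from_disparity(d, focal_px, baseline_m)
        print(f"disparity {d:5.1f} px -> depth {z:5.2f} m")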
Hardware architecture. Fig. 4 shows the hardware system designed for our autonomous vehicles. The sensing hardware consists of stereo cameras, an IMU and a GPS receiver. In particular, our system uses the stereo cameras for depth estimation; one of the cameras is also used for semantic tasks such as object detection. The cameras, along with the IMU and the GPS, drive the VIO-based localization task.

Considering the cost, compute requirements and power budget, our computing platform is composed of a Xilinx Zynq UltraScale+ FPGA and an on-vehicle PC equipped with an Intel Coffee Lake CPU and an Nvidia GTX 1060 GPU. The PC is the main computing platform, while the FPGA plays a critical role: it bridges the sensors and the PC, and provides an acceleration platform. To optimize the end-to-end latency, exploit the task-level parallelism, and ease practical development and deployment, planning and scene understanding are mapped onto the CPU and the GPU, respectively, while sensing and localization are implemented on the FPGA platform.

[Fig. 4: Hardware architecture: the sensors (stereo camera, IMU, GPS) feed the FPGA platform (sensor synchronizer, image signal processor, localization accelerator, memory controller), which connects to the on-vehicle PC (multicore CPUs and a GPU).]

B. Sensing on FPGA
Sensor synchronization is now widely adopted to unify the various measurements in a global timing domain. Software-based synchronization associates samples with timestamps at the application or driver layer. This approach is inaccurate because of the software processing that takes place before the timestamping stage, which introduces variable, non-deterministic latency.

To obtain more precise synchronization, we use a hardware synchronizer implemented with FPGA fabrics. The hardware synchronizer triggers the camera sensors and the IMU using a common timer initialized by the satellite atomic time provided by the GPS device. It records the triggering time of each sensor sample, and then packs the timestamp with the corresponding sensor data. In terms of cost, the synchronizer is extremely lightweight: it uses only 1,443 LUTs and 1,587 registers, and consumes 5 mW of power.
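The sketch below mimics the synchronizer's output in software: a common timer value is latched at each trigger and packed in front of the raw sensor payload. The packet layout (magic word, sensor id, 64-bit timestamp) is an assumption for illustration; the survey does not specify the actual format.

    import struct

    MAGIC = 0x5A5A  # assumed frame marker, not the actual format

    def pack_sample(sensor_id, trigger_time_ns, payload):
        """Prepend the latched trigger timestamp to the raw sensor payload."""
        header = struct.pack("<HHQ", MAGIC, sensor_id, trigger_time_ns)
        return header + payload

    def unpack_sample(packet):
        magic, sensor_id, t_ns = struct.unpack_from("<HHQ", packet)
        assert magic == MAGIC, "bad frame marker"
        return sensor_id, t_ns, packet[12:]  # header is 2 + 2 + 8 bytes

    pkt = pack_sample(sensor_id=1, trigger_time_ns=1_620_000_000_123,
                      payload=b"\x10\x20\x30")
    print(unpack_sample(pkt))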
C. Perception on FPGA
For our autonomous vehicles, the perception tasks include scene understanding (depth estimation and object detection) and localization, which are independent. The slower one dictates the overall perception latency.

We evaluate our perception algorithms on the CPU, GPU and Zynq FPGA platforms. Fig. 5 compares the latency of each perception task on the FPGA platform with that on the GPU. Due to the available resources, the FPGA platform is faster than the GPU only for localization, which is more lightweight than the other tasks. We offload localization to the FPGA while leaving the other perception tasks on the GPU. This partitioning frees more GPU resources for depth estimation and object detection, which benefits the perception pipeline's latency.

[Fig. 5: Latency (ms) of each perception task on the CPU, GPU, and FPGA platforms.]

As with classic SLAM algorithms, our localization algorithm consists of a front-end and a back-end. The front-end uses ORB features and descriptors for detecting and tracking key points [120], [187]. The back-end uses the Levenberg-Marquardt (LM) algorithm, a non-linear optimization algorithm, to optimize the positions of the 3D key points and the pose of the camera [156], [188].

The ORB feature extraction/matching and the LM optimizer are the most time-consuming parts of our SLAM algorithm, taking up nearly all the execution time. We accelerate ORB feature extraction/matching and the non-linear optimizer on FPGA fabrics. The remaining lightweight parts are implemented on the ARM core of the Zynq platform. We use independent hardware for each camera to extract features and compute descriptors. Hamming-distance and Sum of Absolute Differences (SAD) matching are implemented to obtain stable matching results. Compared with the CPU implementation, our FPGA implementation achieves a 2.2× speedup at 44 fps, and saves 83% energy.
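Binary ORB descriptors make matching cheap in hardware: the distance between two descriptors is the Hamming distance, i.e., a population count of their XOR, which maps naturally onto LUTs. A minimal software version of brute-force Hamming matching with an acceptance threshold is sketched below; the descriptor width and threshold are illustrative.

    import random

    def hamming(a, b):
        """Hamming distance between two binary descriptors (as ints)."""
        return bin(a ^ b).count("1")

    def match_descriptors(query, train, max_dist=40):
        """Brute-force nearest-neighbor matching, which a hardware matcher
        would perform in parallel; 256-bit ORB descriptors are assumed."""
        matches = []
        for qi, q in enumerate(query):
            ti, dist = min(((i, hamming(q, t)) for i, t in enumerate(train)),
                           key=lambda x: x[1])
            if dist <= max_dist:  # reject weak matches for stability
                matches.append((qi, ti, dist))
        return matches

    random.seed(0)
    train = [random.getrandbits(256) for _ in range(8)]
    query = [d ^ (1 << 5) for d in train[:3]]  # near-duplicates of 3 descriptors
    print(match_descriptors(query, train))    # [(0, 0, 1), (1, 1, 1), (2, 2, 1)]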
Within the LM optimizer, the Jacobian updates and Schur elimination are the most time-consuming parts. By profiling our algorithm on the datasets of [189], the Schur and Jacobian computations account for 29.8% and 48.27% of the total time, respectively. We implemented Schur elimination and the Jacobian updates on FPGA fabrics [156]. Compared with the CPU implementation, the FPGA achieves 4× and 27× speedups for Schur and Jacobian, respectively, and saves 76% energy.
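To see why Schur elimination dominates, recall the BA normal equations. With camera block C, point block P, and cross term E, solving [[C, E], [Eᵀ, P]] x = b is reduced to the much smaller camera system (C − E P⁻¹ Eᵀ) x_c = b_c − E P⁻¹ b_p, exploiting the fact that P is block-diagonal and cheap to invert in real BA problems. The numpy sketch below runs this reduction on a tiny random instance; the sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    nc, npt = 4, 6  # camera-block and point-block dimensions (toy sizes)

    # Build a symmetric positive-definite system [[C, E], [E.T, P]] x = b.
    A = rng.standard_normal((nc + npt, nc + npt))
    H = A @ A.T + (nc + npt) * np.eye(nc + npt)
    C, E, P = H[:nc, :nc], H[:nc, nc:], H[nc:, nc:]
    b = rng.standard_normal(nc + npt)
    b_c, b_p = b[:nc], b[nc:]

    # Schur elimination: reduce to the (much smaller) camera system.
    P_inv = np.linalg.inv(P)            # block-diagonal in real BA, so cheap
    S = C - E @ P_inv @ E.T             # Schur complement
    x_c = np.linalg.solve(S, b_c - E @ P_inv @ b_p)
    x_p = P_inv @ (b_p - E.T @ x_c)     # back-substitute the point update

    assert np.allclose(H @ np.concatenate([x_c, x_p]), b)
    print("camera update:", x_c)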
VIII. Application of FPGAs in Space Robotics
In the 1980s, field-programmable gate arrays (FPGAs) emerged as a result of increasing integration in electronics. Before the use of FPGAs, glue-logic designs were based on individual boards with fixed components interconnected via a shared standard bus, which had various drawbacks, such as hindering high-volume data processing and a higher susceptibility to radiation-induced errors, in addition to inflexibility. The utilization of FPGAs in space applications began in 1992, for FPGAs offered unprecedented flexibility and significantly reduced the design cycle and development cost [190].

FPGAs can be categorized by the type of their programmable interconnection switches: antifuse, SRAM, and flash. Each of the three technologies comes with trade-offs. Antifuse FPGAs are non-volatile and have minimal routing delay, resulting in faster speed and lower power consumption. The drawback is evident, as they have a relatively more complicated fabrication process and are only one-time programmable. SRAM-based FPGAs are the most common type employed in space missions. They are field-reprogrammable and use the standard fabrication process that foundries put significant effort into optimizing, resulting in a faster rate of performance increase. However, being based on SRAM, these FPGAs are volatile and may not hold their configuration if a power glitch occurs. They also have more substantial routing delay, require more power, and have a higher susceptibility to bit errors. Flash-based FPGAs are non-volatile and reprogrammable, and also have low power consumption and routing delay. The major drawback is that in-flight reconfiguration is not recommended for flash-based FPGAs due to the potentially destructive results if radiation effects occur during the reconfiguration process [191]. Also, the stability of the stored charge on the floating gate is of concern: it is a function of factors such as the operating temperature and the electric fields that might disturb the charge. As a result, flash-based FPGAs are not as frequently used in space missions [192].

A. Radiation Tolerance for Space Computing
For electronics intended to operate in space, the harsh space radiation present is an essential factor to consider. Radiation has various effects on electronics, but the two most commonly considered are the total ionizing dose effect (TID) and single event effects (SEE). TID results from the accumulation of ionizing radiation over time, which causes permanent damage by creating electron-hole pairs in the silicon dioxide layers of MOS devices. The effect of TID is that electronics gradually degrade in their performance parameters and eventually fail to function. Electronics intended for application in space are tested for the total amount of radiation, measured in kRads, that they can endure before failure. Usually, electronics that can withstand 100 kRads are sufficient for low earth orbit missions to use for several years [191].

SEE occurs when high-energy particles from space radiation strike electronics and leave behind an ionized trail. The results are various types of SEEs [193], which can be categorized as either soft errors, which usually do not cause permanent damage, or hard errors, which often cause permanent damage. Examples of soft errors include the single event upset (SEU) and the single event transient (SET). In an SEU, a radiation particle strikes a memory element, causing a bit flip. Noteworthy is that as the cell density and clock rate of modern devices increase, the multiple cell upset (MCU), the corruption of two or more memory cells by a single particle strike, is increasingly becoming a concern. A special type of SEU is the single event functional interrupt (SEFI), where the upset leads to a loss of normal function of the device by affecting control registers or the clock. In an SET, a radiation particle passes through a sensitive node, which generates a transient voltage pulse, causing a wrong logic state at the combinatorial logic output. Depending on whether the impact occurs during an active clock edge or not, the error may or may not propagate. Examples of hard errors include the single event latch-up (SEL), in which an energized particle activates a parasitic transistor and then causes a short across the device, and the single event burnout (SEB), in which radiation induces high local power dissipation, leading to device failure. In these hard error cases, radiation effects may cause the failure of an entire space mission.

Space-grade FPGAs can withstand considerable levels of TID and have been designed against most destructive SEEs [194]. However, SEU susceptibility is pervasive. For the most part, radiation effects on FPGAs are not different from those on other CMOS-based ICs. The primary anomaly stems from FPGAs' unique structure, which involves programmable interconnections. Depending on their type, FPGAs have different susceptibilities toward SEUs in their configuration. SRAM FPGAs are designated by NASA as the most susceptible ones due to their volatile nature. Even after the radiation hardening
process, the configuration of SRAM FPGAs is only designated as "hardened" or simply having embedded SEE mitigation techniques, rather than "hard," which means close to immune [191]. Configuration SRAM is not used in the same way as traditional SRAM. A bit flip in the configuration causes an instantaneous effect without the need for a read-write cycle. Moreover, instead of producing one single error in the output, the bit flip shifts the user logic directly, changing the device's behavior. Scrubbing is needed to rectify the SRAM configuration. Antifuse and flash FPGAs are less susceptible to effects in their configuration and are designated "hard" against SEEs in their configuration without applying radiation hardening techniques [191].

Design-based SEU/fault mitigation techniques are commonly used, for, in contrast to fabrication-level radiation hardening techniques, they can be readily applied to commercial off-the-shelf (COTS) FPGAs. These techniques can be classified into static and dynamic. Static techniques rely on fault masking, the toleration of errors without requiring active fixing; one such example is passive redundancy with voting mechanisms. Dynamic techniques, in contrast, detect faults and act to correct them. The common SEU mitigation methods include [195], [196]:

1) Hardware Redundancy: functional blocks are replicated to detect or tolerate faults. Triple modular redundancy (TMR) is perhaps the most widely used mitigation technique (see the sketch after this list). It can be applied to entire processors or to parts of circuits. At the circuit level, registers are implemented using three or more flip-flops or latches; voters then compare the values and output the majority, reducing the likelihood of error due to SEUs. As internal voters are also susceptible to SEUs, they are sometimes triplicated as well. For mission-critical applications, global signals may be triplicated to mitigate SEUs further. TMR can be implemented with ease with the help of supporting HDLs [197]. It is important to note that a limitation of TMR is that, at most, one fault can be tolerated per voter stage. As a result, TMR is often used with other techniques, such as scrubbing, to prevent error accumulation.

2) Scrubbing: the vast majority of memory cells in reprogrammable FPGAs contain configuration information. As discussed earlier, a configuration memory upset may lead to alteration of the routing network, loss of function, and other critical effects. Scrubbing, the refreshing and restoration of the configuration memory to a known-good state, is therefore needed [196]. The reference configuration memory is usually stored in radiation-hardened memory cells either off or on the device. Scrubbers, which are processors or configuration controllers, carry out the scrubbing. Some advanced SRAM FPGAs, including ones made by Xilinx, support partial reconfiguration, which allows memory repairs to be made without interrupting the operation of the whole device. Scrubbing can be done at the frame level (partial) or the device level (full), which will inevitably lead to some downtime; some devices may not be able to tolerate such an interruption. Blind scrubbing is the most straightforward implementation: individual frames are scrubbed periodically without error detection. Blind scrubbing avoids the complexity required for error detection, but extra scrubbing may increase vulnerability to SEUs, as errors may be written into frames during the scrubbing process. An alternative to blind scrubbing is readback scrubbing, where scrubbers actively detect errors in the configuration through error-correcting codes or cyclic redundancy checks [195]. If an error is found, the scrubber initiates frame-level scrubbing.
Currently, the majority of space-grade FPGAs come from Xilinx and Microsemi. Xilinx offers the Virtex family and the Kintex; both are SRAM-based and highly flexible. Microsemi offers the antifuse-based RTAX and the flash-based RTG4 and RT PolarFire, which have lower susceptibility to SEEs and lower power consumption. The 20 nm Kintex and the 28 nm RT PolarFire are the latest generations. The European market is served by Atmel devices and NanoXplore space-grade FPGAs [198]. Table IV shows the specifications of the above devices.

Table IV. Specifications of Space-Grade FPGAs.

  Device                      Logic         Memory    DSPs   Technology        Rad. Tolerance
  Xilinx Virtex-5QV           81.9 K LUT6   12.3 Mb   320    65 nm SRAM        SEE immune up to LET > 100 MeV·cm²/mg and 1 Mrad TID
  Xilinx RT Kintex UltraScale 331 K LUT6    38 Mb     2760   20 nm SRAM        SEE immune up to LET > 80 MeV·cm²/mg and 100–120 Krads TID
  Microsemi RTG4              150 K LE      5 Mb      462    65 nm Flash       SEE immune up to LET > 37 MeV·cm²/mg and TID > 100 Krads
  Microsemi RT PolarFire      481 K LE      33 Mb     1480   28 nm Flash       SEE immune up to LET > 63 MeV·cm²/mg and 300 Krads TID
  Microsemi RTAX              4 M gates     0.5 Mb    120    150 nm antifuse   SEE immune up to LET > 37 MeV·cm²/mg and 300 Krads TID
  Atmel ATFEE560              560 K gates   0.23 Mb   –      180 nm SRAM       SEL immune up to 95 MeV·cm²/mg and 60 Krads TID
  NanoXplore NG-LARGE         137 K LUT4    9.2 Mb    384    65 nm SRAM        SEL immune up to 60 MeV·cm²/mg and 100 Krads TID

B. FPGAs in Space Missions
For space robotics, processing power is of particular importance, given the range of information that must be processed accurately and efficiently. Many current and previous space missions are packed with sophisticated algorithms that are mostly static. They serve to increase the efficiency of data transmission; nevertheless, data processing is done mainly on the ground. As the travel distance of missions increases, transmitting all data to, and processing it on, the ground is no longer an efficient or even viable option due to transmission delay. As a result, space robots need to become more adaptable and autonomous. They will also need to pre-process a large amount of collected data on board and compress it before sending it back to Earth [199].

The rapid development of new-generation FPGAs may fill this need in space robotics. FPGAs enable robotic systems to be reconfigured in real time, making the systems more adaptable by allowing them to respond more efficiently to changes in the environment and data. As a result, autonomous reconfiguration and performance
optimization can be achieved. FPGAs also have a high capability for parallel processing, which is useful in boosting processing performance. FPGAs are present in various space robots; some of the most prominent examples are the NASA Mars rovers. Since the first pair of rovers was launched in 2003, the presence of FPGAs has steadily increased in the later rovers.

1) Mars Exploration Rover Missions
Beginning in the early 2000s, NASA has been using FPGAs in exploration rover control and lander control. In Opportunity and Spirit, the two Mars rovers launched in 2003, two Xilinx Virtex XQVR1000s were in the motor control board [200], which operates the motors on the instruments as well as the rover wheels. In addition, an Actel RT 1280 FPGA was used in each of the 20 cameras on the rovers to receive and dispatch hardware commands. The camera electronics consist of a clock driver that provides timing pulses through the charge-coupled device (CCD), an IC containing an array of linked or coupled capacitors, as well as signal chains that amplify the CCD output and convert it from analog to digital. The Actel FPGA provides the timing, logic, and control functions in the CCD signal chain and inserts a camera ID into the camera telemetry to simplify processing [201].

Selected electronic parts have to undergo a multi-step flight consideration process before being utilized in any space exploration mission [200], [202]. The first step is the general flight approval, during which the manufacturers perform additional space-grade verification tests beyond the normal commercial evaluation, and NASA meticulously examines the results. Additional device parameters, such as temperature considerations and semiconductor characteristics, are verified in these tests. What follows is flight-specific approval. In this step, NASA engineers examine the device's compatibility with the mission, for instance, considerations of the operating environment, including factors like temperature and radiation. Also included are a variety of mission-specific situations that the robot may encounter and the associated risk assessment. Depending on the specific application of the device, whether mission-critical or not, and the expected mission lifetime, the risk standards vary. Finally, parts go through specific design consideration to ensure all the design requirements have been met. Parts are examined for designs addressing issues such as SEL, SEU, and SEFI. The Xilinx FPGAs used addressed some of the SEEs through the following methods [201]:
1) fabrication processes that largely prevent SEL;
2) TMR, which reduces the SEU frequency;
3) scrubbing, which allows device recovery from single event functional interrupts.

The MER missions were successful, and despite being designed for only 90 Martian days (1 Martian day = 24.6 hours), they continued until 2019. The implementation of mitigation techniques was also proven effective, as the observed error rate was very similar to that predicted [200].

2) Mars Science Laboratory Mission
Launched in 2011, the Mars Science Laboratory (MSL) was the new rover mission sent to Mars. FPGAs were heavily used in its key components, mainly responsible for scientific instrument control, image processing, and communications. Curiosity has 17 cameras on board: four navigation cameras, eight hazard cameras, the Mars Hand Lens
Imager (MAHLI), two Mast Cameras, the Mars Descent Imager (MARDI), and the ChemCam Remote Microscopic Imager [203]. MAHLI, the Mast Cameras, and MARDI share the same electronics design. Similar to the system used on MER, an Actel FPGA provides the timing, logic, and control functions in the CCD signal chain and transmits pixels to the digital electronics assembly (DEA), which interfaces the camera heads with the rover electronics, transmitting commands to the camera heads and data back to the rover. There is one DEA dedicated to each of the imagers above. Each has a Virtex-II FPGA that contains a MicroBlaze soft-processor core. All of the core functionalities of the DEA, including timing, interfacing, and compression, are implemented in the FPGA as logic peripherals of the MicroBlaze. Specifically, the DEA provides an image processing pipeline that includes 12-to-8-bit companding of input pixels, horizontal subframing, and lossless or JPEG image compression [203]. What runs on the MicroBlaze is the DEA flight software, which coordinates DEA hardware functions such as camera movements. It receives and executes commands from the Earth and transmits data back. The flight software also implements image acquisition algorithms, including autofocus and autoexposure, performs error correction of the flash memory, and provides mechanism-control fault protection [203]. In total, the flight software consists of 10,000 lines of ANSI C code, all implemented on the FPGA. Additionally, FPGAs power the communication boxes (Electra-Lite) that provide critical communication to Earth from the rovers through a Mars relay network [204]; they are responsible for a variety of high-speed bulk signal processing.

3) Mars 2020 Mission
Launched in 2020, the Mars 2020 mission carried the Perseverance rover, whose vision compute element includes a main processor card, a Compute Element Power Conditioning Unit (CEPCU), and a Computer Vision Acceleration Card (CVAC). While the former two parts were inherited from the MSL mission, the CVAC is new. It has two FPGAs. One is called the Vision Processor, a Xilinx Virtex-5QV that contains image processing modules for matching landmarks to estimate position. The other is called the Housekeeping FPGA, a Microsemi RTAX 2000 antifuse FPGA that handles tasks such as synchronization with the spacecraft, power management, and Vision Processor configuration.

Through more than two decades of use in space, FPGAs have shown their reliability and applicability for space robotic missions. The properties of FPGAs make them good onboard processors with high reliability, adaptability, processing power, and power efficiency: FPGAs have been used for space robotic missions for decades and are proven in reliability; they have unrivaled adaptability and can even be reconfigured at run time; their capability for highly parallel processing allows significant acceleration in executing many complex algorithms; and the hardware/software co-design method makes them potentially more power-efficient. They may finally help us close the two-decade performance gap between commercial processors and space-grade ASICs. As a direct result, the achievements that the world has made in fields such as deep learning and computer vision, which were often too computationally intense for space-grade processors, may become applicable for robots in space in the near future. The implementation of those new technologies will be of great benefit for space robots, boosting their autonomy and capabilities and allowing us to explore farther and faster.
IX. Conclusion
FPGAs are a promising computing substrate for robotic workloads, for several reasons. First, robotic algorithms are still evolving rapidly, and thus any ASIC-based accelerators will be months or even years behind the state-of-the-art algorithms; on the other hand, FPGAs can be dynamically updated as needed. Second, robotic workloads are highly diverse; thus, it is difficult for any ASIC-based robotic computing accelerator to reach economies of scale in the near future; on the other hand, FPGAs are a cost-effective and energy-effective alternative before one type of accelerator reaches economies of scale. Third, compared to SoCs that have reached economies of scale, e.g., mobile SoCs, FPGAs deliver a significant performance advantage. Fourth, partial reconfiguration allows multiple robotic workloads to time-share an FPGA, thus allowing one chip to serve multiple applications and leading to overall cost and energy reduction.

However, FPGAs are still not the mainstream computing substrate for robotic workloads, for several reasons. First, FPGA programming is still much more challenging than regular software programming, and the supply of FPGA engineers is still limited. Second, although there has been significant progress in FPGA high-level synthesis (HLS) automation in the past few years, such as [206], HLS is still not able to produce optimized code, and IP support for robotic workloads is still extremely limited. Third, commercial software support for robotic workloads on FPGAs is still missing; for instance, there is no official ROS support on any commercial FPGA platform today. For robotic companies to fully exploit the power of FPGAs, these problems need to be addressed first, and we use these problems to motivate our future research work.

Zishen Wan (Student Member, IEEE) is currently a Ph.D. student in Electrical and Computer Engineering at the Georgia Institute of Technology, Atlanta, GA, U.S.A. He received the M.S. degree from Harvard University, Cambridge, MA, in 2020, and the B.S. degree from the Harbin Institute of Technology, Harbin, China, in 2018, both in Electrical Engineering. He has a broad research interest in VLSI design, computer architecture, and edge intelligence, with a focus on energy-efficient and robust hardware and system design for autonomous machines. He received the Best Paper Award in DAC 2020 and CAL 2020.

Bo Yu (Senior Member, IEEE) received the B.S. degree in electronic technology and science from Tianjin University, Tianjin, China, in 2006, and the Ph.D. degree from the Institute of Microelectronics, Tsinghua University, Beijing, China, in 2013. He is currently the CTO of PerceptIn, Fremont, CA, U.S.A., a company focusing on providing visual perception solutions for robotics and autonomous driving. His current research interests include algorithms and systems for robotics and autonomous vehicles. Dr. Yu is also a Founding Member of the IEEE Special Technical Community on Autonomous Driving.

Thomas Yuang Li (Student Member, IEEE) is a research intern at PerceptIn, U.S.A., and a student member of the IEEE. His research interests include building autonomous space explorers for future commercial robotic space-exploration missions, as well as space robotics and computing related topics.

Jie Tang (Senior Member, IEEE) is currently an associate professor in the School of Computer Science and Engineering of the South China University of Technology, Guangzhou, China. She received her B.E. degree from the University of Defense Technology and her Ph.D. degree from the Beijing Institute of Technology, both in Computer Science. She was previously a visiting researcher at the Embedded Systems Center at the University of California, Irvine, USA, and a research scientist at the Intel China Runtime Technology Lab. Dr. Tang is mainly doing research on computing systems for autonomous machines. She is a founding member and secretary of the IEEE Computer Society Special Technical Community on Autonomous Driving Technologies.

Yuhao Zhu (Member, IEEE) is an Assistant Professor of Computer Science at the University of Rochester, U.S.A. His research group focuses on applications and computer systems for visual computing. His work is recognized by the Honorable Mention of the 2018 ACM SIGARCH/IEEE-CS TCCA Outstanding Dissertation Award and multiple IEEE Micro Top Picks designations. He is a recipient of the NSF CAREER Award in 2020.

Yu Wang (Senior Member, IEEE) received his B.S. degree in 2002 and his Ph.D. degree (with honors) in 2007 from Tsinghua University, Beijing, China. He is currently a Tenured Professor and Chair with the Department of Electronic Engineering, Tsinghua University. His research interests include application-specific hardware computing, parallel circuit analysis, and power/reliability-aware system
design methodology. Dr. Wang has authored and coauthored over 250 papers in refereed journals and conferences. He has received the Best Paper Award at ASPDAC 2019, FPGA 2017, NVMSA 2017, and ISVLSI 2012, and the Best Poster Award at HEART 2012, with nine Best Paper Nominations. He is a recipient of the DAC Under-40 Innovator Award in 2018. He served as TPC chair for ICFPT 2019, ISVLSI 2018, and ICFPT 2011, as Finance Chair of ISLPED 2012-2016, and as a program committee member for leading conferences in the EDA/FPGA area. Currently he serves as Associate Editor for the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on CAD, and ACM TECS. He is an IEEE/ACM senior member. He is the co-founder of Deephi Tech (acquired by Xilinx in 2018), a leading deep learning computing platform provider.

Arijit Raychowdhury (Senior Member, IEEE) is currently a Professor in the School of Electrical and Computer Engineering at the Georgia Institute of Technology, U.S.A., which he joined in January 2013. From 2013 to July 2019 he was an Associate Professor and held the ON Semiconductor Junior Professorship in the department. He received his Ph.D. degree in Electrical and Computer Engineering from Purdue University (2007) and his B.E. in Electrical and Telecommunication Engineering from Jadavpur University, India (2001). His industry experience includes five years as a Staff Scientist in the Circuits Research Lab, Intel Corporation, and a year as an Analog Circuit Researcher with Texas Instruments Inc. His research interests include low-power digital and mixed-signal circuit design, the design of power converters and sensors, and exploring interactions of circuits with device technologies. Dr. Raychowdhury holds more than 25 U.S. and international patents and has published over 200 articles in journals and refereed conferences. He currently serves on the Technical Program Committees of ISSCC, the VLSI Circuits Symposium, CICC, and DAC. He was an Associate Editor of the IEEE Transactions on Computer-Aided Design from 2013 to 2018 and the Editor of the Microelectronics Journal, Elsevier Press, from 2013 to 2017. He is the winner of the Qualcomm Faculty Award, 2020; the IEEE/ACM Innovator Under 40 Award; the NSF CISE Research Initiation Initiative Award (CRII), 2015; the Intel Labs Technical Contribution Award, 2011; the Dimitris N. Chorafas Award for outstanding doctoral research, 2007; the Best Thesis Award, College of Engineering, Purdue University, 2007; the SRC Technical Excellence Award, 2005; the Intel Foundation Fellowship, 2006; the NASA INAC Fellowship, 2004; and the Meissner Fellowship, 2002. He and his students have won several fellowships and eleven best paper awards over the years. Dr. Raychowdhury is a Senior Member of the IEEE.

Shaoshan Liu (Senior Member, IEEE) is Founder and CEO of PerceptIn (www.perceptin.io), U.S.A., a company focusing on providing visual perception solutions for autonomous robots and vehicles. Dr. Liu received his Ph.D. in Computer Engineering from the University of California, Irvine, and an M.P.A. from Harvard University. His research focuses on computing systems for autonomous machines. Dr. Liu has published over 80 research papers and holds over 150 U.S. and international patents on autonomous machines. Dr. Liu is an ACM Distinguished Speaker and an IEEE Computer Society Distinguished Speaker.

References
[1] A. Qiantori, A. B. Sutiono, H. Hariyanto, H. Suwa, and T. Ohta, "An emergency medical communications system by low altitude platform at the early stages of a natural disaster in Indonesia," J. Med. Syst., vol. 36, no. 1, pp. 41–52, 2012. doi: 10.1007/s10916-010-9444-9.
[2] A. Ryan and J. K. Hedrick, "A mode-switching path planner for UAV-assisted search and rescue," in Proc. 44th IEEE Conf. Decision and Control, 2005, pp. 1471–1476.
[3] N. Smolyanskiy, A. Kamenev, J. Smith, and S. Birchfield, "Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness," in Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst. (IROS), 2017, pp. 4241–4247.
[4] A. Giusti et al., "A machine learning approach to visual perception of forest trails for mobile robots," IEEE Robot. Automat. Lett., vol. 1, no. 2, pp. 661–667, 2015. doi: 10.1109/LRA.2015.2509024.
[5] J. K. Stolaroff, C. Samaras, E. R. O'Neill, A. Lubers, A. S. Mitchell, and D. Ceperley, "Energy use and life cycle greenhouse gas emissions of drones for commercial package delivery," Nature Commun., vol. 9, no. 1, pp. 1–13, 2018. doi: 10.1038/s41467-017-02411-5.
[6] S. J. Kim, Y. Jeong, S. Park, K. Ryu, and G. Oh, "A survey of drone use for entertainment and AVR (augmented and virtual reality)," in Augmented Reality and Virtual Reality. Springer-Verlag, 2018, pp. 339–352.
[7] S. Jung, S. Cho, D. Lee, H. Lee, and D. H. Shim, "A direct visual servoing-based framework for the 2016 IROS autonomous drone racing challenge," J. Field Robot., vol. 35, no. 1, pp. 146–166, 2018. doi: 10.1002/rob.21743.
[8] "Fact sheet—The Federal Aviation Administration (FAA) aerospace forecast fiscal years (FY) 2020–2040," 2020. https://www.faa.gov/news/fact_sheets/news_story.cfm?newsId=24756
[9] S. Liu, L. Li, J. Tang, S. Wu, and J.-L. Gaudiot, "Creating autonomous vehicle systems," Synthesis Lectures Comput. Sci., vol. 6, no. 1, pp. 1–186, 2017. doi: 10.2200/S00787ED1V01Y201707CSL009.
[10] S. Krishnan et al., "The sky is not the limit: A visual performance model for cyber-physical co-design in autonomous machines," IEEE Comput. Arch. Lett., vol. 19, no. 1, pp. 38–42, 2020. doi: 10.1109/LCA.2020.2981022.
[11] S. Krishnan et al., "Machine learning-based automated design space exploration for autonomous aerial robots," 2021, arXiv:2102.02988.
[12] S. Liu and J.-L. Gaudiot, "Autonomous vehicles lite self-driving technologies should start small, go slow," IEEE Spectr., vol. 57, no. 3, pp. 36–49, 2020. doi: 10.1109/MSPEC.2020.9014458.
[13] S. Liu, L. Liu, J. Tang, B. Yu, Y. Wang, and W. Shi, "Edge computing for autonomous driving: Opportunities and challenges," Proc. IEEE, vol. 107, no. 8, pp. 1697–1716, 2019. doi: 10.1109/JPROC.2019.2915983.
[14] S. Liu, J. Tang, Z. Zhang, and J.-L. Gaudiot, "Computer architectures for autonomous driving," Computer, vol. 50, no. 8, pp. 18–25, 2017. doi: 10.1109/MC.2017.3001256.
[15] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, "[DL] A survey of FPGA-based neural network inference accelerators," ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 12, no. 1, pp. 1–26, 2019. doi: 10.1145/3289185.
[16] B. Yu, W. Hu, L. Xu, J. Tang, S. Liu, and Y. Zhu, "Building the computing system for autonomous micromobility vehicles: Design constraints and architectural optimizations," in Proc. 53rd Annu. IEEE/ACM Int. Symp. Microarch. (MICRO), 2020. doi: 10.1109/MICRO50266.2020.00089.
[17] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn. (CVPR'05), 2005, vol. 1, pp. 886–893.
[18] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale conditional random fields for image labeling," in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn. (CVPR 2004), 2004, vol. 2, p. II.
[19] X. He, R. S. Zemel, and D. Ray, "Learning and incorporating top-down cues in image segmentation," in Proc. Comput. Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 338–351.
[20] Y. Xiang, A. Alahi, and S. Savarese, "Learning to track: Online multi-object tracking by decision making," in Proc. IEEE Int. Conf. Comput. Vision (ICCV), 2015, pp. 4705–4713. doi: 10.1109/ICCV.2015.534.
[21] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Dec. 2015. doi: 10.1109/ICCV.2015.169.
[22] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015.
[23] W. Liu et al., "SSD: Single shot multibox detector," CoRR, vol. abs/1512.02325, 2015.
[24] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CoRR, vol. abs/1506.02640, 2015.
[25] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," CoRR, vol. abs/1612.08242, 2016.
[26] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," CoRR, vol. abs/1411.4038, 2014.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," CoRR, vol. abs/1406.4729, 2014.
[28] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," CoRR, vol. abs/1612.01105, 2016.
[29] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, "Fully-convolutional Siamese networks for object tracking," CoRR, vol. abs/1606.09549, 2016.
[30] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: Part I," IEEE Robot. Automat. Mag., vol. 13, no. 2, pp. 99–110, 2006. doi: 10.1109/MRA.2006.1638022.
[31] M. Montemerlo et al., "Junior: The Stanford entry in the urban challenge," J. Field Robot., vol. 25, no. 9, pp. 569–597, 2008. doi: 10.1002/rob.20258.
[32] J. Ziegler et al., "Making bertha drive—an autonomous journey on a historic route," IEEE Intell. Transp. Syst. Mag., vol. 6, no. 2, pp. 8–20, 2014. doi: 10.1109/MITS.2014.2306552.
[33] C. Katrakazas, M. Quddus, W.-H. Chen, and L. Deka, "Real-time motion planning methods for autonomous on-road driving: State-of-the-art and future research directions," Transp. Res. C, Emerg. Technol., vol. 60, pp. 416–442, 2015. doi: 10.1016/j.trc.2015.09.011.
[34] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, "A survey of motion planning and control techniques for self-driving urban vehicles," IEEE Trans. Intell. Veh., vol. 1, no. 1, pp. 33–55, 2016. doi: 10.1109/TIV.2016.2578706.
[35] Y. Deng, Y. Chen, Y. Zhang, and S. Mahadevan, "Fuzzy dijkstra algorithm for shortest path problem under uncertain environment," Appl. Soft Comput., vol. 12, no. 3, pp. 1231–1237, 2012. doi: 10.1016/j.asoc.2011.11.011.
[36] P. E. Hart, N. J. Nilsson, and B. Raphael, "A formal basis for the heuristic determination of minimum cost paths," IEEE Trans. Syst. Sci. Cybern., vol. 4, no. 2, pp. 100–107, 1968. doi: 10.1109/TSSC.1968.300136.
[37] S. M. LaValle and J. J. Kuffner Jr., "Randomized kinodynamic planning," Int. J. Robot. Res., vol. 20, no. 5, pp. 378–400, 2001. doi: 10.1177/02783640122067453.
[38] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, "Probabilistic roadmaps for path planning in high-dimensional configuration spaces," IEEE Trans. Robot. Autom. (1989–June 2004), vol. 12, no. 4, pp. 566–580, 1996. doi: 10.1109/70.508439.
[39] S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua, "Long-term planning by short-term prediction," 2016, arXiv:1602.01580.
[40] M. Gómez, R. González, T. Martínez-Marín, D. Meziat, and S. Sánchez, "Optimal motion planning by reinforcement learning in autonomous mobile vehicles," Robotica, vol. 30, no. 2, pp. 159, 2012. doi: 10.1017/S0263574711000452.
[41] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," 2016, arXiv:1610.03295.
[42] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316.
[43] X. Geng, H. Liang, B. Yu, P. Zhao, L. He, and R. Huang, "A scenario-adaptive driving behavior prediction approach to urban autonomous driving," Appl. Sci., vol. 7, no. 4, p. 426, 2017. doi: 10.3390/app7040426.
[44] C. J. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3-4, pp. 279–292, 1992. doi: 10.1023/A:1022676722315.
[45] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances Neural Inf. Process. Syst., 2000, pp. 1008–1014.
[46] S. L. Hicks, I. Wilson, L. Muhammed, J. Worsfold, S. M. Downes, and C. Kennard, "A depth-based head-mounted visual display to aid navigation in partially sighted individuals," PloS One, vol. 8, no. 7, p. e67695, 2013. doi: 10.1371/journal.pone.0067695.
[47] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, "Elasticfusion: Real-time dense slam and light source estimation," Int. J. Robot. Res., vol. 35, no. 14, pp. 1697–1716, 2016. doi: 10.1177/0278364916669237.
[48] V. A. Prisacariu et al., "Infinitam v3: A framework for large-scale 3d reconstruction with loop closure," 2017, arXiv:1708.00783.
[49] S. Golodetz, T. Cavallari, N. A. Lord, V. A. Prisacariu, D. W. Murray, and P. H. Torr, "Collaborative large-scale dense 3d reconstruction with online inter-agent pose optimisation," IEEE Trans. Vis. Comput. Graphics, vol. 24, no. 11, pp. 2895–2905, 2018. doi: 10.1109/TVCG.2018.2868533.
[50] M. Pérez-Patricio and A. Aguilar-González, "FPGA implementation of an efficient similarity-based adaptive window algorithm for real-time stereo matching," J. Real-Time Image Process., vol. 16, no. 2, pp. 271–287, 2019. doi: 10.1007/s11554-015-0530-6.
[51] D.-W. Yang, L.-C. Chu, C.-W. Chen, J. Wang, and M.-D. Shieh, "Depth-reliability-based stereo-matching algorithm and its VLSI architecture design," IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 6, pp. 1038–1050, 2014. doi: 10.1109/TCSVT.2014.2361419.
[52] A. Aguilar-González and M. Arias-Estrada, "An FPGA stereo matching processor based on the sum of hamming distances," in Proc. Int. Symp. Appl. Reconfigurable Comput., 2016, pp. 66–77.
[53] M. Pérez-Patricio, A. Aguilar-González, M. Arias-Estrada, H.-R. Hernandez-de Leon, J.-L. Camas-Anzueto, and J. de Jesús Osuna-Coutiño, "An FPGA stereo matching unit based on fuzzy logic," Microprocessors Microsyst., vol. 42, pp. 87–99, 2016. doi: 10.1016/j.micpro.2015.10.011.
[54] G. Cocorullo, P. Corsonello, F. Frustaci, and S. Perri, "An efficient hardware-oriented stereo matching algorithm," Microprocessors Microsyst., vol. 46, pp. 21–33, 2016. doi: 10.1016/j.micpro.2016.09.010.
[55] P. M. Santos, J. C. Ferreira, and J. S. Matos, "Scalable hardware architecture for disparity map computation and object location in real-time," J. Real-Time Image Process., vol. 11, no. 3, pp. 473–485, 2016. doi: 10.1007/s11554-013-0338-1.
[56] K. M. Ali, R. B. Atitallah, N. Fakhfakh, and J.-L. Dekeyser, "Exploring HLS optimizations for efficient stereo matching hardware implementation," in Proc. Int. Symp. Appl. Reconfigurable Comput., 2017, pp. 168–176.
[57] B. McCullagh, "Real-time disparity map computation using the cell broadband engine," J. Real-Time Image Process., vol. 7, no. 2, pp. 87–93, 2012. doi: 10.1007/s11554-010-0155-8.
[58] L. Li, X. Yu, S. Zhang, X. Zhao, and L. Zhang, "3d cost aggregation with multiple minimum spanning trees for stereo matching," Appl. Opt., vol. 56, no. 12, pp. 3411–3420, 2017. doi: 10.1364/AO.56.003411.
[59] D. Zha, X. Jin, and T. Xiang, "A real-time global stereo-matching on FPGA," Microprocessors Microsyst., vol. 47, pp. 419–428, 2016. doi: 10.1016/j.micpro.2016.08.005.
[60] L. Puglia, M. Vigliar, and G. Raiconi, "Real-time low-power FPGA architecture for stereo vision," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 64, no. 11, pp. 1307–1311, 2017. doi: 10.1109/TCSII.2017.2691675.
[61] A. Kjær-Nielsen et al., "A two-level real-time vision machine combining coarse- and fine-grained parallelism," J. Real-Time Image Process., vol. 5, no. 4, pp. 291–304, 2010. doi: 10.1007/s11554-010-0159-4.
[62] S. Wong, S. Vassiliadis, and S. Cotofana, "A sum of absolute differences implementation in FPGA hardware," in Proc. 28th Euromicro Conf., 2002, pp. 183–188.
[63] M. Hisham, S. N. Yaakob, R. A. Raof, A. A. Nazren, and N. W. Embedded, “Template matching using sum of squared difference and normalized cross correlation,” in Proc. IEEE Student Conf. Res. Development (SCOReD), 2015, pp. 100–104. doi: 10.1109/SCORED.2015.7449303.
[64] J.-C. Yoo and T. H. Han, “Fast normalized cross-correlation,” Circuits Syst. Signal Process., vol. 28, no. 6, p. 819, 2009. doi: 10.1007/s00034-009-9130-7.
[65] B. Froba and A. Ernst, “Face detection with the modified census transform,” in Proc. 6th IEEE Int. Conf. Automatic Face and Gesture Recogn., 2004, pp. 91–96.
[66] S. Jin et al., “FPGA design and implementation of a real-time stereo vision system,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 1, pp. 15–26, 2009.
[67] L. Zhang, K. Zhang, T. S. Chang, G. Lafruit, G. K. Kuzmanov, and D. Verkest, “Real-time high-definition stereo matching on FPGA,” in Proc. 19th ACM/SIGDA Int. Symp. Field Programmable Gate Arrays, 2011, pp. 55–64. doi: 10.1145/1950413.1950428.
[68] D. Honegger, P. Greisen, L. Meier, P. Tanskanen, and M. Pollefeys, “Real-time velocity estimation based on optical flow and disparity matching,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst., 2012, pp. 5177–5182.
[69] M. Jin and T. Maruyama, “Fast and accurate stereo vision system on FPGA,” ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 7, no. 1, pp. 1–24, 2014. doi: 10.1145/2567659.
[70] S. Park and H. Jeong, “Real-time stereo vision FPGA chip with low error rate,” in Proc. Int. Conf. Multimedia and Ubiquitous Eng. (MUE’07), 2007, pp. 751–756.
[71] S. Sabihuddin, J. Islam, and W. J. MacLean, “Dynamic programming approach to high frame-rate stereo correspondence: A pipelined architecture implemented on a field programmable gate array,” in Proc. Canadian Conf. Electr. Comput. Eng., 2008, pp. 1461–1466.
[72] M. Jin and T. Maruyama, “A real-time stereo vision system using a tree-structured dynamic programming on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays, 2012, pp. 21–24. doi: 10.1145/2145694.2145698.
[73] R. Kamasaka, Y. Shibata, and K. Oguri, “An FPGA-oriented graph cut algorithm for accelerating stereo vision,” in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig), 2018, pp. 1–6. doi: 10.1109/RECONFIG.2018.8641737.
[74] C. Banz, S. Hesselbarth, H. Flatt, H. Blume, and P. Pirsch, “Real-time stereo vision system using semi-global matching disparity estimation: Architecture and FPGA-implementation,” in Proc. Int. Conf. Embedded Comput. Syst., Arch., Model. Simulation, 2010, pp. 93–101.
[75] W. Wang, J. Yan, N. Xu, Y. Wang, and F.-H. Hsu, “Real-time high-quality stereo vision system in FPGA,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 10, pp. 1696–1708, 2015. doi: 10.1109/TCSVT.2015.2397196.
[76] L. F. Cambuim, J. P. Barbosa, and E. N. Barros, “Hardware module for low-resource and real-time stereo vision engine using semi-global matching approach,” in Proc. 30th Symp. Integrated Circuits Syst. Des., 2017, pp. 53–58. doi: 10.1145/3109984.3109992.
[77] O. Rahnama, T. Cavalleri, S. Golodetz, S. Walker, and P. Torr, “R3SGM: Real-time raster-respecting semi-global matching for power-constrained systems,” in Proc. Int. Conf. Field-Programmable Technol. (FPT), 2018, pp. 102–109. doi: 10.1109/FPT.2018.00025.
[78] L. F. Cambuim, L. A. Oliveira, E. N. Barros, and A. P. Ferreira, “An FPGA-based real-time occlusion robust stereo vision system using semi-global matching,” J. Real-Time Image Process., vol. 17, no. 5, pp. 1–22, 2019. doi: 10.1007/s11554-019-00902-w.
[79] J. Zhao et al., “FP-Stereo: Hardware-efficient stereo vision for embedded applications,” 2020, arXiv:2006.03250.
[80] O. Rahnama, D. Frost, O. Miksik, and P. H. Torr, “Real-time dense stereo matching with ELAS on FPGA-accelerated embedded devices,” IEEE Robot. Automat. Lett., vol. 3, no. 3, pp. 2008–2015, 2018. doi: 10.1109/LRA.2018.2800786.
[81] O. Rahnama et al., “Real-time highly accurate dense depth on a power budget using an FPGA-CPU hybrid SoC,” IEEE Trans. Circuits Syst. II, Express Briefs, vol. 66, no. 5, pp. 773–777, 2019. doi: 10.1109/TCSII.2019.2909169.
[82] H. Hirschmuller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn. (CVPR’05), 2005, vol. 2, pp. 807–814.
[83] D. Honegger, H. Oleynikova, and M. Pollefeys, “Real-time and low latency embedded computer vision hardware based on a combination of FPGA and mobile CPU,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst., 2014, pp. 4930–4935.
[84] S. Mattoccia and M. Poggi, “A passive RGBD sensor for accurate and real-time depth sensing self-contained into an FPGA,” in Proc. 9th Int. Conf. Distrib. Smart Cameras, 2015, pp. 146–151. doi: 10.1145/2789116.2789148.
[85] S. K. Gehrig, F. Eberli, and T. Meyer, “A real-time low-power stereo vision engine using semi-global matching,” in Proc. Int. Conf. Comput. Vision Syst., 2009, pp. 134–143.
[86] D. Hernandez-Juarez, A. Chacón, A. Espinosa, D. Vázquez, J. C. Moure, and A. M. López, “Embedded real-time stereo estimation via semi-global matching on the GPU,” Procedia Comput. Sci., vol. 80, pp. 143–153, 2016. doi: 10.1016/j.procs.2016.05.305.
[87] H. Hirschmuller and D. Scharstein, “Evaluation of cost functions for stereo matching,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2007, pp. 1–8. doi: 10.1109/CVPR.2007.383248.
[88] Y. Shan et al., “FPGA based memory efficient high resolution stereo vision system for video tolling,” in Proc. Int. Conf. Field-Programmable Technol., 2012, pp. 29–32.
[89] Y. Shan et al., “Hardware acceleration for an accurate stereo vision system using mini-census adaptive support region,” ACM Trans. Embedded Comput. Syst. (TECS), vol. 13, no. 4s, pp. 1–24, 2014. doi: 10.1145/2584659.
[90] A. Geiger, M. Roser, and R. Urtasun, “Efficient large-scale stereo matching,” in Proc. Asian Conf. Comput. Vision, 2010, pp. 25–38.
[91] S. Zagoruyko and N. Komodakis, “Learning to compare image patches via convolutional neural networks,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2015, pp. 4353–4361.
[92] J. Žbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” J. Mach. Learn. Res., vol. 17, no. 1, pp. 2287–2318, 2016.
[93] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2016, pp. 5695–5703.
[94] A. Seki and M. Pollefeys, “SGM-Nets: Semi-global matching with neural networks,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2017, pp. 231–240.
[95] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2016, pp. 4040–4048.
[96] A. Kuzmin, D. Mikushin, and V. Lempitsky, “End-to-end learning of cost-volume aggregation for real-time dense stereo,” in Proc. IEEE 27th Int. Workshop Mach. Learn. Signal Process. (MLSP), 2017, pp. 1–6. doi: 10.1109/MLSP.2017.8168183.
[97] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator for large-scale convolutional neural networks,” in Proc. 26th Int. Conf. Field Programmable Logic and Appl. (FPL), 2016, pp. 1–9.
[98] J. Qiu et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 26–35. doi: 10.1145/2847263.2847265.
[99] K. Guo et al., “Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, 2017. doi: 10.1109/TCAD.2017.2705069.
[100] J. Yu et al., “Instruction driven cross-layer CNN accelerator for fast detection on FPGA,” ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 11, no. 3, pp. 1–23, 2018. doi: 10.1145/3283452.
[101] H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, “A lightweight YOLOv2: A binarized CNN with a parallel support vector regression for an FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2018, pp. 31–40.
[102] M. S. Belshaw, “A high-speed iterative closest point tracker on an FPGA platform,” Ph.D. thesis, 2008.
[103] B. Williams, “Evaluation of a SoC for real-time 3D SLAM,” 2017.
[104] B. Van Hoorick, “FPGA-based simultaneous localization and mapping (SLAM) using high-level synthesis,” 2019.
[105] Q. Gautier et al., “Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK,” in Proc. Int. Conf. Field-Programmable Technol. (FPT), 2014, pp. 326–329. doi: 10.1109/FPT.2014.7082810.
[106] T. Bailey, J. Nieto, J. Guivant, M. Stevens, and E. Nebot, “Consistency of the EKF-SLAM algorithm,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst., 2006, pp. 3562–3568.
[107] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, 2015. doi: 10.1109/TRO.2015.2463671.
[108] M. Montemerlo et al., “FastSLAM: A factored solution to the simultaneous localization and mapping problem,” in Proc. AAAI/IAAI, 2002, pp. 593–598.
[109] M. Gu, K. Guo, W. Wang, Y. Wang, and H. Yang, “An FPGA-based real-time simultaneous localization and mapping system,” in Proc. Int. Conf. Field Programmable Technol. (FPT), 2015, pp. 200–203. doi: 10.1109/FPT.2015.7393150.
[110] C. Cadena et al., “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, 2016. doi: 10.1109/TRO.2016.2624754.
[111] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocular camera,” in Proc. IEEE Int. Conf. Comput. Vision, 2013, pp. 1449–1456.
[112] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, 2013. doi: 10.1177/0278364913491297.
[113] M. Burri et al., “The EuRoC micro aerial vehicle datasets,” Int. J. Robot. Res., vol. 35, no. 10, pp. 1157–1163, 2016. doi: 10.1177/0278364915620033.
[114] R. A. Newcombe et al., “KinectFusion: Real-time dense surface mapping and tracking,” in Proc. 10th IEEE Int. Symp. Mixed and Augmented Reality, 2011, pp. 127–136.
[115] V. Bonato, E. Marques, and G. A. Constantinides, “A floating-point extended Kalman filter implementation for autonomous mobile robots,” J. Signal Process. Syst., vol. 56, no. 1, pp. 41–50, 2009. doi: 10.1007/s11265-008-0257-8.
[116] D. T. Tertei, J. Piat, and M. Devy, “FPGA design and implementation of a matrix multiplier based accelerator for 3D EKF SLAM,” in Proc. Int. Conf. ReConFigurable Comput. FPGAs (ReConFig14), 2014, pp. 1–6.
[117] D. T. Tertei, J. Piat, and M. Devy, “FPGA design of EKF block accelerator for 3D visual SLAM,” Comput. Electr. Eng., vol. 55, pp. 123–137, 2016. doi: 10.1016/j.compeleceng.2016.05.003.
[118] B. Vincke, A. Elouardi, and A. Lambert, “Real time simultaneous localization and mapping: Towards low-cost multiprocessor embedded systems,” EURASIP J. Embedded Syst., vol. 2012, no. 1, p. 5, 2012. doi: 10.1186/1687-3963-2012-5.
[119] B. Vincke, A. Elouardi, A. Lambert, and A. Dine, “SIMD and OpenMP optimization of EKF-SLAM,” in Proc. Int. Conf. Multimedia Comput. Syst. (ICMCS), 2014, pp. 712–716. doi: 10.1109/ICMCS.2014.6911157.
[120] W. Fang, Y. Zhang, B. Yu, and S. Liu, “FPGA-based ORB feature extraction for real-time visual SLAM,” in Proc. Int. Conf. Field Programmable Technol. (ICFPT), 2017, pp. 275–278. doi: 10.1109/FPT.2017.8280159.
[121] Y. Biadgie and K.-A. Sohn, “Feature detector using adaptive accelerated segment test,” in Proc. Int. Conf. Inf. Sci. Appl. (ICISA), 2014, pp. 1–4.
[122] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in Proc. European Conf. Comput. Vision, 2010, pp. 778–792.
[123] R. Liu, J. Yang, Y. Chen, and W. Zhao, “eSLAM: An energy-efficient accelerator for real-time ORB-SLAM on FPGA platform,” in Proc. 56th Annu. Des. Automat. Conf., 2019, pp. 1–6.
[124] V. H. Schulz, F. G. Bombardelli, and E. Todt, “A Harris corner detector implementation in SoC-FPGA for visual SLAM,” in Robotics. Springer-Verlag, 2016, pp. 57–71.
[125] M. Abouzahir, A. Elouardi, S. Bouaziz, R. Latif, and A. Tajer, “Large-scale monocular FastSLAM 2.0 acceleration on an embedded heterogeneous architecture,” EURASIP J. Adv. Signal Process., vol. 2016, no. 1, p. 88, 2016. doi: 10.1186/s13634-016-0386-3.
[126] M. Abouzahir, A. Elouardi, R. Latif, S. Bouaziz, and A. Tajer, “Embedding SLAM algorithms: Has it come of age?” Robot. Autonom. Syst., vol. 100, pp. 14–26, 2018. doi: 10.1016/j.robot.2017.10.019.
[127] K. Boikos and C.-S. Bouganis, “Semi-dense SLAM on an FPGA SoC,” in Proc. 26th Int. Conf. Field Programmable Logic Appl. (FPL), 2016, pp. 1–4. doi: 10.1109/FPL.2016.7577365.
[128] K. Boikos and C.-S. Bouganis, “A high-performance system-on-chip architecture for direct tracking for SLAM,” in Proc. 27th Int. Conf. Field Programmable Logic Appl. (FPL), 2017, pp. 1–7. doi: 10.23919/FPL.2017.8056831.
[129] K. Boikos and C.-S. Bouganis, “A scalable FPGA-based architecture for depth estimation in SLAM,” in Proc. Int. Symp. Appl. Reconfigurable Comput., 2019, pp. 181–196.
[130] D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoint: Self-supervised interest point detection and description,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn. Workshops, 2018, pp. 224–236.
[131] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, “Discriminative learning of deep convolutional feature point descriptors,” in Proc. IEEE Int. Conf. Comput. Vision, 2015, pp. 118–126.
[132] F. Radenović, G. Tolias, and O. Chum, “Fine-tuning CNN image retrieval with no human annotation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1655–1668, 2018. doi: 10.1109/TPAMI.2018.2846566.
[133] Xilinx, “DPU for convolutional neural network.”
[134] Z. Xu, J. Yu, C. Yu, H. Shen, Y. Wang, and H. Yang, “CNN-based feature-point extraction for real-time visual SLAM on embedded FPGA,” in Proc. IEEE 28th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), 2020, pp. 33–37. doi: 10.1109/FCCM48280.2020.00014.
[135] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” 2015, arXiv:1510.00149.
[136] S. Krishnan, S. Chitlangia, M. Lam, Z. Wan, A. Faust, and V. J. Reddi, “Quantized reinforcement learning (QUARL),” 2019, arXiv:1910.01055.
[137] H. F. Langroudi, V. Karia, J. L. Gustafson, and D. Kudithipudi, “Adaptive posit: Parameter aware numerical format for deep learning inference on the edge,” in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recogn. Workshops, 2020, pp. 726–727.
[138] T. Tambe et al., “Algorithm-hardware co-design of adaptive floating-point encodings for resilient deep learning inference,” in Proc. 57th ACM/IEEE Des. Automat. Conf. (DAC), 2020, pp. 1–6. doi: 10.1109/DAC18072.2020.9218516.
[139] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” 2016, arXiv:1605.04711.
[140] J. Choi, S. Venkataramani, V. Srinivasan, K. Gopalakrishnan, Z. Wang, and P. Chuang, “Accurate and efficient 2-bit quantized neural networks,” in Proc. 2nd SysML Conf., 2019.
[141] J. Kim, K. Yoo, and N. Kwak, “Position-based scaled gradient for model quantization and pruning,” Adv. Neural Inform. Process. Syst., vol. 33, 2020.
[142] T. Tambe et al., “AdaptivFloat: A floating-point based data type for resilient deep learning inference,” 2019, arXiv:1909.13271.
[143] J. Yu et al., “CNN-based monocular decentralized SLAM on embedded FPGA,” 2020.
[144] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2018, pp. 340–349.
[145] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2016, pp. 5297–5307.
[146] J. Yu et al., “INCA: Interruptible CNN accelerator for multi-tasking in embedded robots,” in Proc. 57th ACM/ESDA/IEEE Des. Automat. Conf. (DAC), 2020.
[147] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, 2017. doi: 10.1109/TRO.2017.2705103.
[148] S. Liu, Engineering Autonomous Vehicles and Robots: The DragonFly Modular-Based Approach, 1st ed. Wiley-IEEE Press, Mar. 2020.
[149] M. Maimone, Y. Cheng, and L. Matthies, “Two years of visual odometry on the Mars Exploration Rovers,” J. Field Robot., vol. 24, no. 3, pp. 169–186, 2007. doi: 10.1002/rob.20184.
[150] B. Klingner, D. Martin, and J. Roseborough, “Street view motion-from-structure-from-motion,” in Proc. IEEE Int. Conf. Comput. Vision, 2013, pp. 953–960.
[151] Y. Jeong, D. Nister, D. Steedly, R. Szeliski, and I.-S. Kweon, “Pushing the envelope of modern methods for bundle adjustment,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1605–1617, 2011. doi: 10.1109/TPAMI.2011.256.
[152] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz, “Multicore bundle adjustment,” in Proc. CVPR 2011, 2011, pp. 3057–3064.
[153] A. Eriksson, J. Bastian, T.-J. Chin, and M. Isaksson, “A consensus-based framework for distributed bundle adjustment,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn., 2016, pp. 1754–1762.
[154] R. Zhang, S. Zhu, T. Fang, and L. Quan, “Distributed very large scale bundle adjustment by global camera consensus,” in Proc. IEEE Int. Conf. Comput. Vision, 2017, pp. 29–38.
[155] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze, “Navion: A 2-mW fully integrated real-time visual-inertial odometry accelerator for autonomous navigation of nano drones,” IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 1106–1119, 2019. doi: 10.1109/JSSC.2018.2886342.
[156] Q. Liu, S. Qin, B. Yu, J. Tang, and S. Liu, “π-BA: Bundle adjustment hardware accelerator based on distribution of 3D-point observations,” IEEE Trans. Comput., 2020.
[157] R. Sun, P. Liu, J. Xue, S. Yang, J. Qian, and R. Ying, “BAX: A bundle adjustment accelerator with decoupled access/execute architecture for visual odometry,” IEEE Access, vol. 8, pp. 75,530–75,542, 2020. doi: 10.1109/ACCESS.2020.2988527.
[158] P. Leven and S. Hutchinson, “A framework for real-time path planning in changing environments,” Int. J. Robot. Res., vol. 21, no. 12, pp. 999–1030, 2002. doi: 10.1177/0278364902021012001.
[159] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” Int. J. Robot. Res., vol. 30, no. 7, pp. 846–894, 2011. doi: 10.1177/0278364911406761.
[160] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Batch informed trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs,” in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2015, pp. 3067–3074.
[161] K. Hauser, “Lazy collision checking in asymptotically-optimal motion planning,” in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2015, pp. 2951–2957.
[162] A. Yershova and S. M. LaValle, “Improving motion-planning algorithms by efficient nearest-neighbor searching,” IEEE Trans. Robot., vol. 23, no. 1, pp. 151–157, 2007. doi: 10.1109/TRO.2006.886840.
[163] W. Wang, D. Balkcom, and A. Chakrabarti, “A fast online spanner for roadmap construction,” Int. J. Robot. Res., vol. 34, no. 11, pp. 1418–1432, 2015. doi: 10.1177/0278364915576491.
[164] S. Murray, W. Floyd-Jones, G. Konidaris, and D. J. Sorin, “A programmable architecture for robot motion planning acceleration,” in Proc. IEEE 30th Int. Conf. Appl.-Specific Syst., Arch. Process. (ASAP), 2019, vol. 2160, pp. 185–188.
[165] J. Bialkowski, S. Karaman, and E. Frazzoli, “Massively parallelizing the RRT and the RRT*,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst., 2011, pp. 3513–3518.
[166] J. Pan and D. Manocha, “GPU-based parallel collision detection for fast motion planning,” Int. J. Robot. Res., vol. 31, no. 2, pp. 187–200, 2012. doi: 10.1177/0278364911429335.
[167] J. Pan, C. Lauterbach, and D. Manocha, “G-Planner: Real-time motion planning and global navigation using GPUs,” in Proc. AAAI, 2010.
[168] N. Atay and B. Bayazit, “A motion planning processor on reconfigurable hardware,” in Proc. IEEE Int. Conf. Robot. Automat. (ICRA 2006), 2006, pp. 125–132.
[169] S. Murray, W. Floyd-Jones, Y. Qi, G. Konidaris, and D. J. Sorin, “The microarchitecture of a real-time robot motion planning accelerator,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarch. (MICRO), 2016, pp. 1–12. doi: 10.1109/MICRO.2016.7783748.
[170] S. Lian, Y. Han, X. Chen, Y. Wang, and H. Xiao, “DADU-P: A scalable accelerator for robot motion planning in a dynamic environment,” in Proc. 55th ACM/ESDA/IEEE Des. Automat. Conf. (DAC), 2018, pp. 1–6. doi: 10.1109/DAC.2018.8465785.
[171] U. Bondhugula et al., “Hardware/software integration for FPGA-based all-pairs shortest-paths,” in Proc. 14th Annu. IEEE Symp. Field-Programmable Custom Comput. Mach., 2006, pp. 152–164. doi: 10.1109/FCCM.2006.48.
[172] K. Sridharan, T. Priya, and P. R. Kumar, “Hardware architecture for finding shortest paths,” in Proc. TENCON 2009 IEEE Region 10 Conf., 2009, pp. 1–5.
[173] Y. Takei, M. Hariyama, and M. Kameyama, “Evaluation of an FPGA-based shortest-path-search accelerator,” in Proc. Int. Conf. Parallel Distrib. Process. Techn. Appl. (PDPTA), 2015, p. 613.
[174] K. Vipin and S. A. Fahmy, “FPGA dynamic and partial reconfiguration: A survey of architectures, methods, and applications,” ACM Comput. Surveys (CSUR), vol. 51, no. 4, pp. 1–39, 2018. doi: 10.1145/3193827.
[175] S. Liu, R. N. Pittman, and A. Forin, “Minimizing partial reconfiguration overhead with fully streaming DMA engines and intelligent ICAP controller,” in Proc. FPGA, 2010, p. 292.
[176] S. Liu, R. N. Pittman, A. Forin, and J.-L. Gaudiot, “Achieving energy efficiency through runtime partial reconfiguration on reconfigurable systems,” ACM Trans. Embedded Comput. Syst. (TECS), vol. 12, no. 3, p. 72, 2013. doi: 10.1145/2442116.2442122.
[177] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in Proc. ICCV, 2011, vol. 11, p. 2.
[178] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proc. 7th Int. Joint Conf. Artif. Intell., 1981.
[179] W. Fang, Y. Zhang, B. Yu, and S. Liu, “DragonFly+: FPGA-based quad-camera visual SLAM system for autonomous vehicles,” in Proc. IEEE HotChips, 2018, p. 1.
[180] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, 2018. doi: 10.1109/TRO.2018.2853729.
[181] K. Sun et al., “Robust stereo visual inertial odometry for fast autonomous flight,” IEEE Robot. Automat. Lett., vol. 3, no. 2, pp. 965–972, 2018. doi: 10.1109/LRA.2018.2793349.
[182] R. Szeliski, Computer Vision: Algorithms and Applications (Texts in Computer Science). London: Springer-Verlag, 2010.
[183] A. Geiger, M. Roser, and R. Urtasun, “Efficient large-scale stereo matching,” in Proc. 10th Asian Conf. Comput. Vision, 2010.
[184] Y. Feng, P. Whatmough, and Y. Zhu, “ASV: Accelerated stereo vision system,” in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarch. (MICRO ’52), 2019, pp. 643–656.
[185] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015. doi: 10.1109/TPAMI.2014.2345390.
[186] A. Kelly, Mobile Robotics: Mathematics, Models, and Methods. Cambridge Univ. Press, 2013.
[187] J. Tang, B. Yu, S. Liu, Z. Zhang, W. Fang, and Y. Zhang, “π-SoC: Heterogeneous SoC architecture for visual inertial SLAM applications,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst. (IROS), 2018, pp. 8302–8307. doi: 10.1109/IROS.2018.8594181.
[188] S. Qin, Q. Liu, B. Yu, and S. Liu, “π-BA: Bundle adjustment acceleration on embedded FPGAs with co-observation optimization,” in Proc. 27th IEEE Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), San Diego, CA, Apr. 28–May 1, 2019, pp. 100–108.
[189] https://grail.cs.washington.edu/projects/bal/
[190] P. L. Mckerracher, R. P. Cain, J. C. Barnett, W. S. Green, and J. D. Kinnison, “Design and test of field programmable gate arrays in space applications,” 1992.
[191] M. Berg, “FPGA mitigation strategies for critical applications,” 2019.
[192] D. Sheldon, “Flash-based FPGA NEPP FY12 summary report.”
[193] R. Gaillard, “Single event effects: Mechanisms and classification,” in Soft Errors in Modern Electronic Systems. Springer-Verlag, 2011, pp. 27–54.
[194] M. Wirthlin, “FPGAs operating in a radiation environment: Lessons learned from FPGAs in space,” J. Instrumentation, vol. 8, no. 2, p. C02020, 2013. doi: 10.1088/1748-0221/8/02/C02020.
[195] F. Brosser and E. Milh, “SEU mitigation techniques for advanced reprogrammable FPGA in space,” Master’s thesis, 2014.
[196] B. Ahmed and C. Basha, “Fault mitigation strategies for reliable FPGA architectures,” Ph.D. thesis, Rennes 1, 2016.
[197] S. Habinc, “Suitability of reprogrammable FPGAs in space applications,” Gaisler Research, Feasibility Rep., 2002.
[198] G. Lentaris et al., “High-performance embedded computing in space: Evaluation of platforms for vision-based navigation,” J. Aerospace Inform. Syst., vol. 15, no. 4, pp. 178–192, 2018. doi: 10.2514/1.I010555.
[199] T. Y. Li and S. Liu, “Enabling commercial autonomous robotic space explorers,” IEEE Potentials, vol. 39, no. 1, pp. 29–36, 2019. doi: 10.1109/MPOT.2019.2935338.
[200] D. Ratter, “FPGAs on Mars,” Xcell J., vol. 50, pp. 8–11, 2004.
[201] J. F. Bell III et al., “Mars Exploration Rover Athena panoramic camera (Pancam) investigation,” J. Geophys. Res. Planets, vol. 108, 2003. doi: 10.1029/2003JE002070.
[202] “Space flight system design and environmental test.” https://www.nasa.gov/sites/default/files/atoms/files/std8070.1.pdf (accessed Sept. 1, 2020).
[203] M. C. Malin et al., “The Mars Science Laboratory (MSL) mast cameras and descent imager: Investigation and instrument descriptions,” Earth Space Sci., vol. 4, no. 8, pp. 506–539, 2017. doi: 10.1002/2016EA000252.
[204] C. D. Edwards, T. C. Jedrey, A. Devereaux, R. DePaula, and M. Dapore, “The Electra proximity link payload for Mars relay telecommunications and navigation,” 2003. doi: 10.2514/6.IAC-03-Q.3.a.06.
[205] A. Johnson et al., “The lander vision system for Mars 2020 entry descent and landing,” 2017.
[206] “Vivado high-level synthesis.” https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html (accessed Sept. 10, 2020).