Integration of Open Source Platform Duckietown and Gesture Recognition As An Interactive Interface For The Museum Robotic Guide
Abstract—In recent years, population aging has become a serious problem. To decrease the demand for labor when navigating visitors in museums, exhibitions, or libraries, this research designs an automatic museum robotic guide which integrates image and gesture recognition technologies to enhance the guided tour quality of visitors. The robot is a self-propelled vehicle developed with ROS (Robot Operating System), in which automatic driving is achieved by a lane-following function based on image recognition. This enables the robot to lead guests to visit artworks along a preplanned route. In conjunction with the vocal service for each artwork, the robot can convey a detailed description of the artwork to the guest. We also design a simple wearable device to perform gesture recognition. As a human machine interface, it allows the guest to interact with the robot through hand gestures. To improve the accuracy of gesture recognition, we design a two-phase hybrid machine learning-based framework. In the first phase (the training phase), the k-means algorithm is used to train on historical data and filter out outlier samples so that they do not interfere with the recognition phase. In the second phase (the recognition phase), we apply the KNN (k-nearest neighbor) algorithm to recognize the hand gesture of the user in real time. Experiments show that our method works in real time and achieves better accuracy than other methods.

Keywords—ROS (Robot Operating System), wireless networks, navigation, gesture recognition, autonomous driving, guide robot

This research is co-sponsored by MOST 105-2221-E-024-010, 106-2221-E-024-004, and ITRI.

I. INTRODUCTION

The development of artificial intelligence and automation technologies has been advancing by leaps and bounds in recent years. This has also helped several emerging applications make big progress, including autonomous driving vehicles. To reduce the setup cost for researchers and students and to help them test their autonomous driving algorithms quickly, the Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory (CSAIL) designed an open source platform called Duckietown [7]. It helps more researchers and students enter the field of autonomous vehicle technology [1], and its value has been confirmed in actual educational use [2]. In this paper, we aim to improve the quality of museum navigation with autonomous vehicle-based robots. This expands the possibilities of autonomous vehicle technology to indoor applications, not only the common outdoor car applications.

In [3], a map-based navigation method with indoor vision is proposed. To improve the adaptability of the robot, a small vehicle is equipped with a camera to capture visual information about the environment, which helps the robot sense environmental changes so that the vehicle can smoothly navigate along the planned route. In [4], Kwapisz et al. utilize the three-axis inertial sensor built into a smartphone to collect sensing data over time and then select 43 different features from the data to analyze and predict ongoing human activities. Several classification algorithms are compared and used to analyze the relationship between the features and human activities, and the prediction accuracy of each algorithm is reported. The use of wearable devices to identify human activities can effectively improve recognition performance. In [5], Weiss et al. use machine learning methods to recognize human activities and compare the recognition accuracy of smartwatches and smartphones. The studied human activities include not only large-range limb movements but also small-range hand movements. Their results confirm that smartwatches achieve a better accuracy rate than smartphones. In [6], Schlömer et al. use a Wii controller to collect a series of simple hand movements and propose a framework for gesture recognition. The framework first preprocesses the collected raw data with the k-means algorithm, converts the output into input vectors for a Hidden Markov Model (HMM), and finally cascades a Bayes decision classifier. The gesture recognition accuracy of the framework is 85%-95%. On the other hand, autonomous driving is an emerging and hot research area [8]. It is a complex system whose functional modules are highly interdependent, so the stability of the system is very important. Many researchers use ROS to implement automatic driving and prove that this is feasible [9]. The authors in [9] demonstrate that ROS is suitable for developing autonomous driving applications in various configurations, considering ROS characteristics, communication overhead, and a comparison of advantages and disadvantages.
To implement an auxiliary robot that helps with museum guide service, we propose a novel framework which integrates an autonomous driving system for navigation and a wearable device with an embedded gesture recognition function as the human machine interface. For gesture recognition, we propose a two-phase hybrid machine learning-based algorithm. As a preprocessor, the k-means algorithm is used to train on and filter our dataset to decrease interference. We show that the proposed method helps our classifier achieve better prediction accuracy than other methods. In this work, our contribution is threefold. (1) Facing the trend of fewer children and more elderly people, our system provides a way to reduce the labor demand of guide service in museums, exhibitions, and libraries, so that the precious working population can handle more complex and important jobs rather than highly repetitive work. (2) The hardware of our system is based on the low-cost Raspberry Pi and Arduino Yun, while the software is entirely open source; the setup cost and maintenance fee are acceptable for most organizations. (3) For gesture recognition, we adopt a hybrid machine learning-based algorithm and implement it in the end device, a wearable apparatus for the visitor. Through this intelligence-embedded device, visitors can easily interact with the robotic guide. Moreover, new gestures (i.e., new commands) can be added to the wearable device at any time without difficulty, which guarantees the scalability of our system. Experiments show that the gesture recognition accuracy of our hybrid machine learning-based algorithm is better than that of other methods.

II. SYSTEM ARCHITECTURE AND PROPOSED METHODS

In this section, we first introduce the architecture of the whole system. Then, the operation procedure of the ROS-based museum robotic guide is described. Finally, the proposed two-phase hybrid machine learning-based gesture recognition algorithm is illustrated.

A. System Architecture

Fig. 1. System Architecture

In our proposed system, the museum guide robot follows a preplanned lane on the floor to guide visitors. Each artwork in the museum is tagged with an AprilTag. Once the museum guide robot identifies an AprilTag, it notifies the user with both voice and vibration and then checks whether the user requests further information about the artwork or chooses to skip it. On the other hand, each visitor wears a wearable device on the hand and can interact with the robot or issue commands to it with hand gestures. Moreover, the system administrator can see what the robot sees via the remote monitoring subsystem and check the status of the robot. In the following, the wearable device, the museum guide robot, and the remote monitoring subsystem are described in more detail.

Control node (wearable device):
For human-computer interaction, the user sends control instructions to the robot by making predefined hand gestures. As shown in Fig. 1(1), the wearable device is installed with a trained gesture database, which is trained and selected by an unsupervised learning algorithm; details of the algorithm and procedure are illustrated later. On receiving the gesture raw data, the hand gesture recognition process computes the feature vector of the raw data and then uses the KNN learning algorithm to classify and determine the user's hand gesture in real time. After that, the classification result (i.e., the control command) is sent to the robot through Wi-Fi. To improve the accuracy of hand gesture recognition, the wearable device measures the calibration parameters of its inertial sensor whenever it is turned on.

Museum guide robot:
The robot is designed and implemented based on ROS. The system architecture of the robot, shown in Fig. 1(2), focuses on how the ROS nodes cooperate with each other and how each function works. The functions provided by the robot include (1) autonomous driving, (2) AprilTag identification, and (3) voice service. The robot is equipped with a fisheye camera, which provides image data of the environment to the guide robot. On receiving each image from the camera, the robot analyzes the information contained in the image to determine its next action. For example, the Lane Control Module in Fig. 1(2) converts the 2D image to 3D, determines the robot's current coordinates in the image, and decides the driving direction according to the lane information derived from the image. Moreover, if an AprilTag is found in the image, the robot analyzes the content of the AprilTag and acts accordingly. Furthermore, upon receipt of the analyzed result message published by the Lane Control Module, the Finite State Machine (FSM) Control Module sets its state and then publishes a wheel motor control message for the Wheel Motor Control Module. The Socket node is responsible for communicating with the visitor's wearable device and the remote monitoring subsystem via Wi-Fi.
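To make the role of the Socket node more concrete, the following minimal sketch shows how such a node could bridge Wi-Fi traffic into ROS: it listens on a TCP socket for gesture commands from the wearable device and republishes them on a ROS topic. The node name, topic name, port, and message format here are illustrative assumptions, not the exact values used in our implementation.

    # Illustrative sketch only: a minimal ROS "Socket node" that accepts gesture
    # commands from the wearable device over Wi-Fi (TCP) and republishes them as
    # a ROS topic. Topic name, port, and message format are assumptions.
    import socket
    import rospy
    from std_msgs.msg import String

    def socket_node(host="0.0.0.0", port=5005):
        rospy.init_node("socket_node")
        pub = rospy.Publisher("/gesture_cmd", String, queue_size=10)  # assumed topic
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        while not rospy.is_shutdown():
            conn, _ = server.accept()              # wearable device connects via Wi-Fi
            data = conn.recv(64).decode().strip()  # e.g. "UP", "DOWN", "LEFT"
            if data:
                pub.publish(String(data=data))     # forward the command into ROS
            conn.close()

    if __name__ == "__main__":
        socket_node()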
Remote monitoring subsystem:
The remote monitoring subsystem displays both the raw video and the analyzed results that the museum guide robot sees with its camera. With this information, the administrator can monitor the status of the robot through the remote monitoring subsystem and, once any problem happens, deal with it immediately. The remote monitoring subsystem is also implemented based on ROS, and the robot and the subsystem are connected via Wi-Fi.
B. Operation Procedure
The robot is equipped with a fisheye camera, which keeps recording the road image and passes the images to the Lane Control Module and the Voice Control Module. As shown in Fig. 1(2), the Lane Control Module, FSM Control Module, and Wheel Motor Control Module realize the lane-following function of the robot. On getting a compressed image from the camera node, the Line detector node uses the OpenCV library to identify the desired colors on the road and then passes the image to the Ground projection node. The Ground projection node marks and converts the image to obtain exact coordinates in the space. Afterwards, the information enters the Lane filter node, which calculates the driving direction according to the identified lane and coordinates. The driving direction is then sent to the Lane control node, which converts the direction into a dedicated car_cmd. The operation procedure then enters the FSM Control Module. Upon receipt of the car_cmd from the Lane Control Module, the FSM node switches or keeps its state accordingly. Inside the module, the FSM and Car cmd switch nodes communicate with each other to keep the FSM status in the Car cmd switch node up to date. Specifically, the Car cmd switch node is responsible for driving the Wheel driver node in the Wheel Motor Control Module, which manipulates the motors according to the state of the FSM. On the other hand, upon receipt of the recorded image from the camera node, the AprilTags node in the Voice Control Module processes the image and recognizes whether there is any AprilTag in it. If so, the ID of the tag is extracted and forwarded to the Artwork_query node. On receiving the tag information from the AprilTags node, the Artwork_query node notifies the guided visitor and waits for the visitor's command. If the visitor is interested in the artwork in front of him or her, the Artwork_query node plays the corresponding audio file to introduce the artwork to the visitor.
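As an illustration of the color segmentation step performed by the Line detector node, the sketch below thresholds an image in HSV space and keeps the larger contours as lane-marking candidates. The HSV bounds and the area threshold are placeholder values for a yellow marking, not the exact parameters used on our Duckiebot.

    # Sketch of the color-segmentation step in the Line detector node.
    # The HSV thresholds are placeholder values for a yellow lane marking.
    import cv2
    import numpy as np

    def detect_lane_pixels(bgr_image):
        hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
        lower = np.array([20, 100, 100], dtype=np.uint8)   # assumed lower HSV bound
        upper = np.array([35, 255, 255], dtype=np.uint8)   # assumed upper HSV bound
        mask = cv2.inRange(hsv, lower, upper)
        # findContours returns (contours, hierarchy) in OpenCV 4 and
        # (image, contours, hierarchy) in OpenCV 3, so take the second-to-last item.
        contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                    cv2.CHAIN_APPROX_SIMPLE)[-2]
        # Keep only the larger connected segments as lane candidates.
        return [c for c in contours if cv2.contourArea(c) > 100.0]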
In our proposed framework, the visitor interacts with the guide robot through the wearable device as shown in Fig. 1(1). The operation procedure of the device is as follows. Whenever the wearable device is turned on, the first step is to calibrate the output of the accelerometer of the inertial sensor. This step ensures the correctness of later measurements in both the training phase and the recognition phase (or execution phase). To recognize the hand gesture of the visitor, as shown on the right-hand side of Fig. 1(1), the trained database must be set up first. The database setup phase is also called the training phase. Our wearable device is equipped with an inertial sensor. For each kind of hand gesture, the user repeats the gesture several times in the training phase, and the wearable device captures the raw sensor data for each hand gesture sample. After collecting the training samples, each sample is translated into a feature vector, which is treated as the attribute value in the next step. In the training phase, we exploit the k-means algorithm to complete data clustering and filter out outlier samples; details of this operation are explained in the next subsection. When an ideal trained database is created, it is saved and referred to in the recognition phase (or execution phase). To analyze and recognize the hand gesture of the visitor, the wearable device follows the operation procedure shown on the left-hand side of Fig. 1(1). In the recognition phase, each time the visitor makes a hand gesture, the wearable device records the sensing values of the inertial sensor and translates the raw data into a feature vector. The feature vector is used as the input of the supervised learning algorithm, KNN, which classifies the visitor's hand gesture with the trained database obtained in the training phase. Finally, the classification result is translated into the corresponding command and sent to the guide robot through Wi-Fi.

C. Hand Gesture Recognition
Our goal is to improve the accuracy rate of gesture recognition so that different hand gestures can be identified successfully. We formulate hand gesture recognition as a clustering problem, one of the most popular problems in machine learning. To solve it, we propose a two-phase hybrid machine learning-based gesture recognition algorithm that combines the k-means and KNN algorithms. When classifying an unknown input, KNN is liable to be affected by outliers, and it is hard to totally remove noise from the input sensing data in the preprocessing stage. Therefore, we exploit the k-means algorithm to detect outliers and remove these samples to construct a reliable database; k-means is sensitive to noise, and this property helps us detect noisy samples. In brief, in the first phase (the training phase) of our method, we use k-means to train on the feature vectors of a series of known hand gestures and filter out outlier feature vectors. When a feature vector is removed, a new feature vector is added to keep the number of feature vectors of each class equal. The above operation is repeated until no outlier is detected. In the second phase (the execution phase), we use the KNN algorithm with the trained database derived in the training phase as reference information to predict the hand gesture of the user in real time. In the following, the two phases are described in detail.

Training phase:
In this phase, we aim to set up a database of reference feature vectors for the later phase. The setup of the database is assisted by the k-means algorithm. Suppose we have n samples, each labeled with its hand gesture, and we try to cluster them into k classes, i.e., there are k kinds of hand gestures. Each sample x^(i) is an m-element vector:

    x^(i) = {x_1^(i), ..., x_m^(i)},  x_j^(i) ∈ R,  j = 1..m.   (1)

First, each sample x^(i), i = 1..n, is translated into a feature vector y^(i). Then, k points are randomly selected as the initial cluster centers for the n feature vectors {y^(i) | i = 1..n}. The set of k cluster centers is

    u = {u_1, ..., u_k},  k ∈ N.   (2)

Next, each y^(i) computes its Euclidean distance to each u_j, selects the nearest center u_j* as its cluster center, and is added to the corresponding set, i.e., S_j* = S_j* ∪ {y^(i)}. After the assignment, we evaluate the objective function in Eq. (3), which sums the squared Euclidean distances between each sample point and its cluster center:

    Z_1 = Σ_{j=1}^{k} Σ_{y^(i) ∈ S_j} || y^(i) − u_j ||^2.   (3)
After computing Z_1, we update the cluster center of each cluster S_j, j = 1..k, as follows:

    u_j = ( Σ_{y^(i) ∈ S_j} y^(i) ) / |S_j|.   (4)

In addition, we reset S_j = {}, j = 1..k. Then, we iteratively let each y^(i), i = 1..n, select the nearest u_j* as its cluster center, calculate the objective function in Eq. (3), and compare Z_h with Z_{h-1}. If the difference between Z_h and Z_{h-1} is less than or equal to a predefined threshold, the iteration stops; otherwise, we update u_j, j = 1..k, following Eq. (4) and reset S_j = {}, j = 1..k. At this point, we check whether the clustering result of each sample is consistent with its labeled hand gesture. Ideally, they should be consistent. If not, some sample points are polluted by too much noise; we remove these samples and add new samples with the same label to balance the number of samples of each kind of hand gesture. This operation is repeated until the clustering results of all samples are consistent with their labeled hand gestures.
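The following sketch summarizes the training phase under simplifying assumptions: an outlier is taken to be a feature vector whose cluster's majority label differs from its own label, and the collection of replacement samples is omitted. It is an illustration of Eqs. (3) and (4), not our exact MATLAB implementation.

    # Sketch of the training phase: k-means clustering followed by removal of
    # feature vectors whose assigned cluster disagrees with their gesture label.
    # Replacement of removed samples with newly collected ones is omitted.
    import numpy as np

    def kmeans_assign(Y, k, iters=100, tol=1e-6):
        # Y: (n, m) array of feature vectors.
        centers = Y[np.random.choice(len(Y), k, replace=False)]
        prev_obj = np.inf
        for _ in range(iters):
            dist = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
            assign = dist.argmin(axis=1)                        # nearest center per sample
            obj = (dist[np.arange(len(Y)), assign] ** 2).sum()  # objective Z of Eq. (3)
            if prev_obj - obj <= tol:
                break
            prev_obj = obj
            centers = np.array([Y[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])  # Eq. (4)
        return assign

    def build_database(Y, labels, k):
        # labels: (n,) array of integer gesture labels in 0..k-1.
        assign = kmeans_assign(Y, k)
        majority = {j: np.bincount(labels[assign == j]).argmax()
                    for j in range(k) if np.any(assign == j)}
        keep = np.array([j in majority and majority[j] == l
                         for j, l in zip(assign, labels)])
        return Y[keep], labels[keep]                            # filtered reference database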
Execution phase:
The execution phase recognizes the ongoing hand gesture of the user. To identify the hand gesture, we adopt the KNN (k-nearest neighbor) algorithm. For an input sample point p, we calculate its feature vector Feature(p) and then measure its Euclidean distance to each point y^(i), i = 1..n, in the database. Next, we check the q samples with the shortest distances and count the labeled hand gesture of each of the q samples. If most of the chosen samples are labeled with hand gesture g, the user is classified as performing hand gesture g. The classification result is then sent to the museum guide robot, and the robot acts accordingly.
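A compact sketch of the execution phase is given below: the q nearest neighbors of the input feature vector are retrieved from the trained database and the majority label is returned. The value q = 5 is only an example.

    # Sketch of the execution phase: classify a new feature vector by majority
    # vote among its q nearest neighbours in the trained database.
    import numpy as np
    from collections import Counter

    def classify_gesture(feature, db_features, db_labels, q=5):
        dist = np.linalg.norm(db_features - feature, axis=1)  # Euclidean distances
        nearest = np.argsort(dist)[:q]                        # q closest samples
        votes = Counter(db_labels[nearest])                   # count neighbour labels
        return votes.most_common(1)[0][0]                     # majority-vote gesture label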
III. SYSTEM DESIGN AND PROTOTYPE IMPLEMENTATION

In this section, we illustrate the system design and implementation of the museum guide robot and the gesture recognition.

A. Museum Guide Robot
We use the Duckiebot, the vehicle that serves as the primary client for autonomous driving in Duckietown [7], as the basic architecture of the navigation robot. Our main hardware and software are based on a Raspberry Pi 2 B+ running the Ubuntu operating system. The other hardware includes an Adafruit DC & Stepper Motor HAT, an Adafruit 16-Channel PWM/Servo HAT, a fisheye camera, a car module, and a mobile power supply, and the internal system of the robot is the open-source ROS.

Implementation:
First, the various components are welded and assembled onto the RPi. The installation of the Duckiebot consists of four steps: network settings, laptop environment settings, robot calibration, and node coding.

(i) Setup of the Duckiebot and network configuration:
We enter basic RPi commands to enable the camera and set up the network, and then use Git commands to clone the Duckietown software suite. After setting the local IP address, the laptop can connect to and control the Raspberry Pi via SSH (remote connection). Next, the required programs are installed, for example ROS Indigo, the OpenCV bridge, and other ROS-related packages. Then the ROS environment has to be set, such as which machine is the Master, the ROS workspace path, the hostname, and so on.

(ii) Configuration of the laptop:
We use a laptop running Ubuntu and connect it to the same LAN. Then, we open a terminal and enter the SSH command to connect to the robot. As a result, the RPi on the robot can be controlled remotely; for example, with the ROS environment set in step (i), we can call up the topic messages and the camera images on the car.

(iii) Camera and motor calibration:
We use a checkerboard image and OpenCV functions to calibrate the camera. The Duckiebot's camera collects many samples of the checkerboard image. Then, within the RPi, we use commands to modify two parameters, gain and trim, where trim adjusts the speed of a single wheel and gain adjusts the overall wheel speed.

Fig. 2. Camera calibration

(iv) Implementation of newly added ROS nodes and modification of existing nodes:
The common programming languages are C++ and Python, and we currently use Python. First, we create a folder for development in the workspace. Then, we use the command $catkin_create_pkg to create the node package that we want to develop, together with the required libraries for this node. After that, we can edit our own .py file as the main code of the node. It is important to establish the ROS roles of Publisher and Subscriber; the communication of messages between a Subscriber and a Publisher depends on a Topic.

Publisher: In the code, a line declares what the publisher wants to publish, the message format for the topic, and the size of the message queue.

Subscriber: In the code, the subscriber receives the messages on the topic to complete the communication between nodes. As long as the messages are processed in the code of each node and passed on, the whole system becomes active.
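For reference, a minimal rospy publisher/subscriber pair is sketched below; the node and topic names are placeholders rather than the actual names used in our packages.

    # Minimal publisher/subscriber pair in rospy; node and topic names here are
    # placeholders for illustration, not the actual names used in our packages.
    import rospy
    from std_msgs.msg import String

    def publisher():
        rospy.init_node("demo_publisher")
        pub = rospy.Publisher("/demo_topic", String, queue_size=10)
        rate = rospy.Rate(10)                       # publish at 10 Hz
        while not rospy.is_shutdown():
            pub.publish(String(data="hello"))
            rate.sleep()

    def callback(msg):
        rospy.loginfo("received: %s", msg.data)

    def subscriber():
        rospy.init_node("demo_subscriber")
        rospy.Subscriber("/demo_topic", String, callback)
        rospy.spin()                                # let ROS process incoming messages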
B. Gesture Recognition
We designed a wearable device for gesture recognition based on the Arduino Yun. An MPU6050 acceleration sensor is embedded to detect the raw acceleration of the user's hand when establishing the database in the training phase and to provide the instant gesture data when performing gesture recognition. To judge when to capture the gesture information, we directly use the quaternion value produced by the hardware solution built into the MPU6050 as output. For database establishment, we use the k-means function in MATLAB to train the database and design an iteration and verification method that automatically optimizes the database, instead of manually screening out bad sample points. For feature selection, we adopt general statistics including the average, variance, SVM, maximum, minimum, and root mean square.

In order to enhance the reliability of the database, we add the X, Y, Z axis accelerations and angular accelerations as our training features, for a total of 36 features. When the database is completed, it is loaded into the program of the wearable device. Then, the KNN algorithm runs in the Arduino Yun main loop, reading the acceleration and quaternion of the gesture after start-up. When the starting signal is detected from the quaternion value (i.e., horizontal placement), the device blinks a red light to inform the user to perform the specific gesture within two seconds, and the gesture is then recognized. The recognition is based on the database generated by k-means, and the gesture classification is performed by the KNN algorithm. The raw data values are read out in this time interval and the corresponding features are calculated as the input parameters of the KNN algorithm; finally, the classification result is sent via the socket terminal of the Arduino Yun built-in chip to the robot under the same AP. This process completes the command transmission of gestures.
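As a sketch of the feature extraction step, the code below computes per-axis statistics over one gesture window of accelerometer and gyroscope samples (mean, variance, maximum, minimum, and root mean square); the remaining statistic referred to as SVM above and the exact ordering of the 36 features are left out of this illustration.

    # Sketch of per-axis feature extraction from one gesture window of IMU
    # samples (rows = time steps, columns = 3 accelerometer + 3 gyroscope axes).
    import numpy as np

    def extract_features(window):
        window = np.asarray(window, dtype=float)     # shape: (samples, 6 axes)
        feats = [
            window.mean(axis=0),                     # average per axis
            window.var(axis=0),                      # variance per axis
            window.max(axis=0),                      # maximum per axis
            window.min(axis=0),                      # minimum per axis
            np.sqrt((window ** 2).mean(axis=0)),     # root mean square per axis
        ]
        return np.concatenate(feats)                 # one flat feature vector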
IV. EXPERIMENTAL RESULTS

In the experiments, the baud rate of the experimental device is 115200. Table I shows the effect of the sampling window on the accuracy of gesture recognition. The accuracy of different gestures for our method is compared in Table II. In Table III, the recognition performance of our proposed framework is compared with other classifiers provided in the open source software Weka. In our designed museum guide service, Duckiebots follow the preset route to guide users in the museum. The user can switch the moving speed of his or her Duckiebot by making gestures. Once the Duckiebot finds an AprilTag, it stops and starts the voice navigation function. The user can make gestures to control the voice function, such as replay, pause, play, move forward, etc.

*Video link: https://ez2o.com/3DXwQ

TABLE I. ACCURACY OF DIFFERENT SAMPLING WINDOWS

  Sampling window of gesture (sec)    0.5     1      2      3      4
  Accuracy (%)                       43.01  82.24  99.05  87.22  88.53

TABLE II. CONFUSION MATRIX OF OUR CLASSIFIER WITH SAMPLE RATE 187/2MS

                          Predicted action       Accuracy (%)   Total number
                          Up    Down   Left
  Actual action   Up      70      1      0           98.59           71
                  Down     0     70      0          100.00           70
                  Left     0      1     69           98.57           70

TABLE III. COMPARISON OF OUR PROPOSED METHOD AND OTHER CLASSIFIERS

  Classifier     Ours   Random Tree   J48    KNN (IB1)   Logistic Regression   SVM
  Accuracy (%)   99.08     98.97     98.71     97.43            94.86         36.25

V. CONCLUSION

In this paper, to decrease the demand for labor when navigating visitors in museums and to enhance the guided tour quality of visitors, we design an automatic museum robotic guide which integrates image and gesture recognition technologies. The guide robot is a self-propelled vehicle developed with the open source software ROS, in which automatic driving is achieved via image recognition. The robot can lead guests to visit artworks along the preplanned route and, in conjunction with the vocal service for each artwork, convey a detailed description of the artwork to the guest. As a human machine interface, the guest can interact with the robot by his or her hand gestures through our designed wearable device. The proposed two-phase hybrid machine learning-based method is shown to achieve better gesture recognition accuracy than other methods.

REFERENCES

[1] L. Paull, J. Tani, H. Ahn, et al., "Duckietown: an Open, Inexpensive and Flexible Platform for Autonomy Education and Research," 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1497-1504, 2017.
[2] J. Tani, L. Paull, M. T. Zuber, et al., "Duckietown: An Innovative Way to Teach Autonomy," International Conference EduRobotics 2016, 2016.
[3] C.-S. Lin, "Study on Map-Based Indoor Mobile Robot Vision Navigation," Master's Thesis, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, 2009.
[4] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, "Activity recognition using cell phone accelerometers," ACM SIGKDD Explorations Newsletter, 12(2): 74-82, Mar. 2011.
[5] G. M. Weiss, J. L. Timko, C. M. Gallagher, et al., "Smartwatch-based Activity Recognition: A Machine Learning Approach," 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pp. 426-429, 2016.
[6] T. Schlömer, et al., "Gesture recognition with a Wii controller," Proceedings of the 2nd International Conference on Tangible and Embedded Interaction, pp. 11-14, 2008.
[7] What is Duckietown? [Online]. Available: http://duckietown.mit.edu/
[8] Ö. Ş. Taş, F. Kuhnt, J. M. Zöllner, and C. Stiller, "Functional system architectures towards fully automated driving," 2016 IEEE Intelligent Vehicles Symposium (IV), 2016.
[9] A.-M. Hellmund, S. Wirges, Ö. Ş. Taş, et al., "Robot operating system: A modular software framework for automated driving," 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), 2016.