Control of Robot Arm Based On Speech Recognition Using Mel-Frequency Cepstrum Coefficients (MFCC) and K-Nearest Neighbors (KNN) Method
Abstract—This study describes the implementation of speech recognition to pick and place an object using a 5 Degree of Freedom (DoF) Robot Arm based on an Arduino microcontroller. Speech is identified using the Mel-Frequency Cepstrum Coefficients (MFCC) method for feature extraction and the K-Nearest Neighbors (KNN) method to learn and classify the speech, implemented in Python 2.7. The speech database uses 12 features for the KNN process. Tests with trained and untrained respondents gave recognition accuracies of 85% and 80%, respectively. Finally, the speech recognition system is implemented to control the Robot Arm to pick and place an object.

Keywords—Speech Recognition, Arduino, Robot Arm, MFCC, KNN, Python.

I. INTRODUCTION

Speech recognition is the process of identifying speech from acoustic signal data captured by an audio device. The process takes an audio signal, which is then identified by audio feature extraction and machine learning. The advantage of speech recognition is the convenience of controlling devices by voice, especially to assist people with disabilities, among other purposes.

Feature extraction methods that can be used to identify an audio signal include Mel-Frequency Cepstrum Coefficients (MFCC) [1] [2] [3] and Linear Predictive Coding (LPC) [4] [5]. Machine learning methods that can be used to group and classify speech include Support Vector Machine (SVM) [1], Artificial Neural Networks (ANN) [1] [5] [6], Hidden Markov Model (HMM) [4], Fuzzy Logic [7], Adaptive Neuro-Fuzzy Inference System (ANFIS) [8], K-Nearest Neighbors (KNN) [9], and other soft computing methods. Applications of speech recognition in robotics and related fields include biometrics [10], robot arm control [11], mobile robot control [12], smart home control [13], wheelchair control [14] [15], and social robot control [16] [17].

This study describes voice signal processing using the Mel-Frequency Cepstrum Coefficients (MFCC) and K-Nearest Neighbors (KNN) methods to recognize human speech in Python 2.7. Finally, the speech recognition system is implemented to control a 5 Degree of Freedom (DoF) Robot Arm, based on an Arduino microcontroller, that picks and places an object.

The paper is organized as follows. Section II describes the theoretical background of MFCC and KNN in detail. Section III describes the experimental method and the system design. Section IV describes the implementation of speech recognition in detail. Finally, concluding remarks are given in Section V.

II. THEORETICAL BACKGROUND

A. Feature Extraction using Mel Frequency Cepstrum Coefficient (MFCC) Method

Mel Frequency Cepstrum Coefficient (MFCC) is a feature extraction method for signals, especially audio signals. Feature extraction is the process of determining a value or vector that can be used as the identity of an individual signal. Because it represents the signal well, MFCC is among the most widely used methods in voice processing [13].

The features are cepstral coefficients that take the perception of the human hearing system into account. MFCC is based on the different frequencies captured by the human ear, so that it can represent sound signals. The MFCC process is shown in Fig. 1.

1) Preemphasis: A pre-emphasis filter is applied after sampling in order to obtain a smoother spectral form of the speech signal or to reduce noise introduced during sound capture. The pre-emphasis filter is based on the input/output relationship in the time domain expressed by (1).

$$y(n) = x(n) - a\,x(n-1), \qquad (1)$$

In (1), $a$ is the pre-emphasis filter constant, usually $0.9 < a < 1.0$.

2) Frame Blocking: The audio signal is segmented into multiple overlapping frames in this process, so that no part of the signal is lost. This process continues until all
signals have been placed into one or more frames. The voice is then analyzed by short-time analysis.

3) Windowing: Windowing analyzes a long sound signal by taking a sufficiently representative section of it. Using a Finite Impulse Response (FIR) digital filter approach, this process removes the aliasing caused by discontinuities between the signal pieces. The discontinuities arise from the frame blocking process.

4) Fast Fourier Transform (FFT): The Fourier transform converts a bounded time-domain signal into a frequency spectrum. The FFT is a fast algorithm for the Discrete Fourier Transform (DFT), which converts each frame of $N$ samples from the time domain to the frequency domain. The FFT reduces the repeated multiplications contained in the DFT.

$$X_n = \sum_{k=0}^{N-1} x_k \, e^{-2\pi j k n / N}, \qquad (2)$$

In (2), $n = 0, 1, 2, \ldots, N-1$ and $j = \sqrt{-1}$. $X_n$ is the $n$-th frequency component generated by the Fourier transform, and $x_k$ is the signal of a frame. The result of this stage is called the spectrum or periodogram.

5) Mel-Frequency Wrapping: Human perception of audio frequency does not follow a linear scale. Physical frequency is measured in Hz, whereas the scale that models the human ear is the mel scale, which is linear below 1000 Hz and logarithmic above 1000 Hz [18]. The relation between the mel scale and the frequency in Hz is shown in (3).

$$F_{mel} = \begin{cases} F_{Hz}, & F_{Hz} < 1000 \\ 2595 \log_{10}\left(1 + \dfrac{F_{Hz}}{700}\right), & F_{Hz} > 1000 \end{cases} \qquad (3)$$

In (3), $F_{mel}$ is the mel scale value and $F_{Hz}$ is the frequency in Hz. One approach to mapping the frequency spectrum onto the mel scale, so that it acts as a filter in the way the human ear does, is the filter bank.

In Mel-frequency wrapping, the FFT result is grouped into a bank of triangular filters. Each FFT value is multiplied by the corresponding filter gain, the products are summed, and the base-10 logarithm of the sum is taken, as in (4).

$$X_i = \log_{10}\left( \sum_{k=0}^{N-1} |X(k)| \, H_i(k) \right), \qquad (4)$$

In (4), $i = 1, 2, 3, \ldots, M$ ($M$ is the number of triangular filters) and $H_i(k)$ is the value of the $i$-th triangular filter at acoustic frequency $k$.

Fig. 2. The original amplitude spectrum and the Mel Bank filter.

6) Cepstrum: Humans perceive voice as time-domain signals. In this stage, the mel spectrum is converted back into the time domain using the Discrete Cosine Transform (DCT); the result is called the Mel-frequency cepstrum coefficients (MFCC). The cosine transform used is given in (5).

$$C_j = \sum_{i=1}^{M} X_i \cos\left( j \left(i - \tfrac{1}{2}\right) \frac{\pi}{M} \right), \qquad (5)$$

In (5), $C_j$ is the MFCC coefficient, $X_i$ is the mel-frequency power spectrum from (4), $j = 1, 2, 3, \ldots, K$ ($K$ is the number of desired coefficients), and $M$ is the number of filters.

B. Machine Learning using K-Nearest Neighbors (KNN) Method

K-Nearest Neighbors (KNN), proposed by Cover and Hart (1968) [19] [20], is a learning method based on statistical theory. KNN is one of the simplest and most fundamental classification methods [21] [22] [23]. In pattern recognition, the KNN algorithm assumes that each sample should be classified similarly to its surrounding samples. Thus, when an unknown or new sample is encountered during classification, its class can be predicted from the classes of its nearest neighbor samples. KNN is based on the idea that any new sample can be classified by the majority vote of its k nearest neighbors, where k is a positive integer, usually small. The KNN method is illustrated in Fig. 3.

The advantages of the KNN method are its simplicity, effectiveness, and robustness to noisy training data. But, KNN
Fig. 3. K-Nearest Neighbors illustration.
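The majority-vote rule described for KNN can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean distance, the choice k = 3, and the two-dimensional toy features are all assumptions made for the example (the paper's database uses 12 MFCC features per sample). The labels 1 and 0 follow the command values used for "Ambil" and "Simpan".

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_features, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    # Rank all training samples by their distance to the query.
    ranked = sorted(zip(train_features, train_labels),
                    key=lambda pair: euclidean_distance(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    # The most frequent label among the k neighbors wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters (1 = "Ambil", 0 = "Simpan").
train_features = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                  (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_features, train_labels, (1.1, 1.0)))  # prints 1
print(knn_predict(train_features, train_labels, (4.9, 5.0)))  # prints 0
```

With two classes, an odd k avoids tied votes, which is one reason small odd values of k are a common choice.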
Fig. 5. Robot arm design: (a) design; (b) realization.
Fig. 7. The interface of the speech recognition system and the waveform of the speech command: (a) "Ambil"; (b) "Simpan".
Fig. 8. The database of speech recognition.
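Each entry in the speech database is a vector of MFCC features (12 per sample, according to the abstract). As a rough illustration of how such a vector could be computed by following (1) to (5), the sketch below processes a single frame using only the Python standard library. It is not the authors' code. The assumptions are: a pre-emphasis constant of 0.97, a Hamming window (the paper does not name its window), a direct DFT instead of an FFT, 20 triangular filters, the logarithmic mel formula applied over the whole frequency range rather than the piecewise form of (3), and a synthetic 440 Hz test tone sampled at 8000 Hz. All function names are hypothetical.

```python
import cmath
import math


def pre_emphasis(x, a=0.97):
    # Eq. (1): y(n) = x(n) - a*x(n-1); the first sample is passed through.
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


def hamming(frame):
    # Hamming window (an assumption) to soften the frame edges.
    size = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)))
            for n, s in enumerate(frame)]


def magnitude_spectrum(frame):
    # Eq. (2) evaluated directly as a DFT; only bins 0..N/2 are kept.
    size = len(frame)
    return [abs(sum(frame[k] * cmath.exp(-2j * math.pi * k * n / size)
                    for k in range(size)))
            for n in range(size // 2 + 1)]


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(num_filters, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for i in range(1, num_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):      # rising edge
            filt[k] = (k - left) / float(centre - left)
        for k in range(centre, right):     # falling edge
            filt[k] = (right - k) / float(right - centre)
        bank.append(filt)
    return bank


def mfcc_frame(frame, sample_rate, num_filters=20, num_coeffs=12):
    spectrum = magnitude_spectrum(hamming(pre_emphasis(frame)))
    bank = mel_filter_bank(num_filters, len(frame), sample_rate)
    # Eq. (4): log10 of the filter-weighted spectrum sum (floored to avoid log 0).
    log_energies = [math.log10(max(sum(s * h for s, h in zip(spectrum, filt)),
                                   1e-10))
                    for filt in bank]
    # Eq. (5): cosine transform of the log filter-bank outputs.
    return [sum(x * math.cos(j * (i - 0.5) * math.pi / num_filters)
                for i, x in enumerate(log_energies, start=1))
            for j in range(1, num_coeffs + 1)]


# Synthetic test frame: 256 samples of a 440 Hz tone at 8000 Hz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * n / float(sample_rate)) for n in range(256)]
coeffs = mfcc_frame(frame, sample_rate)
print(len(coeffs))  # prints 12
```

For a full recording, the same function would be applied to each overlapping frame produced by the frame blocking step, and the per-frame vectors combined (for example, averaged) into one feature vector per utterance; how the paper combines frames is not stated in this section.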
TABLE I
THE SPEECH RECOGNITION TEST DATA OF TRAINED AND UNTRAINED RESPONDENTS

Examination   Command    Value   Result (Trained)   Result (Untrained)
1             "Ambil"    1       1                  1
2             "Simpan"   0       0                  0
3             "Ambil"    1       1                  1
4             "Simpan"   0       0                  1
5             "Ambil"    1       1                  0
6             "Simpan"   0       1                  1
7             "Ambil"    1       1                  1
8             "Simpan"   0       0                  0
9             "Ambil"    1       1                  1
10            "Simpan"   0       1                  0
11            "Ambil"    1       1                  1
12            "Simpan"   0       0                  0
13            "Ambil"    1       1                  1
14            "Simpan"   0       0                  0
15            "Ambil"    1       1                  1
16            "Simpan"   0       0                  0
17            "Ambil"    1       1                  1
18            "Simpan"   0       0                  0
19            "Ambil"    1       1                  1
20            "Simpan"   0       1                  1

C. Speech Recognition Implementation to Robot Arm

After the speech recognition database had been successfully identified, it was implemented in the Robot Arm system to perform the pick ("Ambil") and place ("Simpan") tasks. Fig. 9 shows the test of the Robot Arm, which, controlled by speech, successfully picked and placed an object.

Fig. 9. Robot arm performing the assignment: (a) initial pose, (b) "Ambil" (pick) an object, (c) "Simpan" (place) an object.

V. CONCLUSIONS

This study has investigated the development of a Robot Arm controlled by speech recognition to pick and place an object. Speech recognition using MFCC and KNN, implemented in Python 2.7, responded correctly to the speech commands. The system achieved an average recognition accuracy of 85% for the trained respondent and 80% for the untrained respondent. The speech recognition system has been implemented on a 5 DoF Robot Arm based on an Arduino microcontroller, which picked and placed an object effectively. Future work will focus on combining speech recognition with a Social Robot for Human-Robot Interaction.
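The 85% and 80% accuracy figures can be reproduced from Table I. The short check below is a sketch under one assumption: a trial counts as correct when the value recognized for a respondent equals the command value (1 for "Ambil", 0 for "Simpan"). The lists are transcribed from the table rows in order; the variable and function names are hypothetical.

```python
# Command values for examinations 1..20: "Ambil" = 1, "Simpan" = 0, alternating.
command = [1, 0] * 10

# Recognized values per respondent type, transcribed from Table I.
trained = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
untrained = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]


def accuracy_percent(expected, recognized):
    """Percentage of trials in which the recognized value matches the command."""
    matches = sum(1 for e, r in zip(expected, recognized) if e == r)
    return 100.0 * matches / len(expected)


print(accuracy_percent(command, trained))    # prints 85.0
print(accuracy_percent(command, untrained))  # prints 80.0
```

Under this reading, the trained respondent has 3 mismatches out of 20 trials (examinations 6, 10, and 20) and the untrained respondent has 4 (examinations 4, 5, 6, and 20).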
REFERENCES

[1] P. A. Sawakare, R. R. Deshmukh, and P. P. Shrishrimal, “Speech Recognition Techniques: A Review,” International Journal of Scientific & Engineering Research, vol. 6, no. 8, pp. 1693–1698, 2015.
[2] A. Setiawan, A. Hidayatno, and R. R. Isnanto, “Aplikasi Pengenalan Ucapan dengan Ekstraksi Mel-Frequency Cepstrum Coefficients (MFCC) Melalui Jaringan Syaraf Tiruan (JST) Learning Vector Quantization (LVQ) untuk Mengoperasikan Kursor Komputer,” Tech. Rep. 3, 2011.
[3] I. B. Fredj and K. Ouni, “Optimization of Features Parameters for HMM Phoneme Recognition of TIMIT Corpus,” in International Conference on Control, Engineering & Information Technology, vol. 2. IPCO, 2013, pp. 90–94.
[4] Thiang and Wanto, “Speech Recognition Using LPC and HMM Applied for Controlling Movement of Mobile Robot,” Seminar Nasional Teknologi Informasi, 2010.
[5] Thiang and S. Wijoyo, “Speech Recognition Using Linear Predictive Coding and Artificial Neural Network for Controlling Movement of Mobile Robot,” in International Conference on Information and Electronics Engineering, 2011.
[6] B. P. Das and R. Parekh, “Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers,” International Journal of Modern Engineering Research, vol. 2, no. 3, pp. 854–858, 2012.
[7] I. B. Fredj and K. Ouni, “A novel phonemes classification method using fuzzy logic,” Science Journal of Circuits, Systems and Signal Processing, vol. 2, no. 1, pp. 1–5, 2013.
[8] W. S. M. Sanjaya and D. Anggraeni, “Sistem Kontrol Robot Arm 5 DOF Berbasis Pengenalan Pola Suara Menggunakan Mel-Frequency Cepstrum Coefficients (MFCC) dan Adaptive Neuro-Fuzzy Inference System (ANFIS),” Wahana Fisika, vol. 1, no. 2, pp. 152–165, 2016.
[9] R. P. Gadhe, R. R. Deshmukh, and V. B. Waghmare, “KNN based emotion recognition system for isolated Marathi speech,” International Journal of Computer Science Engineering (IJCSE), vol. 4, no. 04, pp. 173–177, 2015.
[10] I. N. K. Wardana and I. G. Harsemadi, “Identifikasi Biometrik Intonasi Suara untuk Sistem Keamanan Berbasis Mikrokomputer,” Jurnal Sistem dan Informatika, vol. 9, no. 1, pp. 29–39, 2014.
[11] W. S. M. Sanjaya, D. Anggraeni, and I. P. Santika, “Speech Recognition using Linear Predictive Coding (LPC) and Adaptive Neuro-Fuzzy (ANFIS) to Control 5 DoF Arm Robot,” in ICCSE. Bandung: IOP Conference, 2018.
[12] Z. H. Abdullahi, N. A. Muhammad, J. S. Kazaure, and F. A. Amuda, “Mobile Robot Voice Recognition in Control Movements,” International Journal of Computer Science and Electronics Engineering, vol. 3, no. 1, pp. 11–16, 2015.
[13] W. S. M. Sanjaya and Z. Salleh, “Implementasi Pengenalan Pola Suara Menggunakan Mel-Frequency Cepstrum Coefficients (MFCC) Dan Adaptive Neuro-Fuzzy Inferense System (ANFIS) Sebagai Kontrol Lampu Otomatis,” Al-HAZEN Jurnal of Physics, vol. 1, no. 1, 2014.
[14] A. Kumar, P. Singh, A. Kumar, and S. K. Pawar, “Speech Recognition Based Wheelchair Using Device Switching,” International Journal of Emerging Technology and Advanced Engineering, vol. 4, no. 2, pp. 391–393, 2014.
[15] K. P. Tiwari and K. K. Dewangan, “Voice Controlled Autonomous Wheelchair,” International Journal of Science and Research, no. April, pp. 10–11, 2015.
[16] C. Breazeal, “Breazeal-AR03.pdf,” Advanced Robotics, vol. 17, no. 2, pp. 97–113, 2003.
[17] O. Mubin, J. Henderson, and C. Bartneck, “You Just Do Not Understand Me! Speech Recognition in Human Robot Interaction,” in International Symposium on Robot and Human Interactive Communication. IEEE, 2014.
[18] A. Mustofa, “Sistem Pengenalan Penutur dengan Metode Mel-frequency Wrapping,” J. Tek. Elektro, vol. 7, no. 2, pp. 88–96, 2007.
[19] T. M. Cover, “Estimation by the Nearest Neighbor Rule,” IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 50–55, 1968.
[20] T. M. Cover and P. E. Hart, “Nearest Neighbor Pattern Classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[21] S. B. Imandoust and M. Bolandraftar, “Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background,” Int. Journal of Engineering Research and Applications, vol. 3, no. 5, pp. 605–610, 2013.
[22] G. M. S. Najah, “Emotion estimation from facial images,” Ph.D. dissertation, 2017.
[23] H. Parvin, H. Alizadeh, and B. Minati, “A Modification on K-Nearest Neighbor Classifier,” Global Journal of Computer Science and Technology, vol. 10, no. 14, pp. 37–41, 2010.
[24] D. Anggraeni, W. S. M. Sanjaya, M. Y. Solih, and M. Munawwaroh, “The Implementation of Speech Recognition using Mel-Frequency Cepstrum Coefficients (MFCC) and Support Vector Machine (SVM) method based Python to Control Robot Arm,” in Annual Applied Science and Engineering Conference, vol. 2. IOP Conference, 2018, pp. 1–9.