

Real-time Indian Sign Language (ISL) Recognition

Kartik Shenoy, Tejas Dastane, Varun Rao, Devendra Vyavaharkar


Department of Computer Engineering,
K. J. Somaiya College of Engineering, University of Mumbai
Mumbai, India
kartik.s@somaiya.edu, tejas.dastane@somaiya.edu, varun.rao@somaiya.edu, devendra.v@somaiya.edu

9th ICCCNT 2018, July 10-12, 2018, IISc, Bengaluru, India

Abstract—This paper presents a system that can recognise hand poses and gestures from the Indian Sign Language (ISL) in real time using grid-based features. The system attempts to bridge the communication gap between the hearing and speech impaired and the rest of society. Existing solutions either provide relatively low accuracy or do not work in real time; this system performs well on both parameters. It can identify 33 hand poses and some gestures from ISL. Sign language is captured from a smartphone camera and its frames are transmitted to a remote server for processing. The use of any external hardware (such as gloves or the Microsoft Kinect sensor) is avoided, making the system user-friendly. Techniques such as face detection, object stabilisation and skin colour segmentation are used for hand detection and tracking. The image is then subjected to a grid-based feature extraction technique which represents the hand's pose in the form of a feature vector. Hand poses are classified using the k-Nearest Neighbours algorithm. For gesture classification, the motion and intermediate hand pose observation sequences are fed to Hidden Markov Model chains corresponding to the 12 pre-selected gestures defined in ISL. Using this methodology, the system achieves an accuracy of 99.7% for static hand poses and 97.23% for gesture recognition.

Keywords—Indian Sign Language Recognition; Gesture Recognition; Sign Language Recognition; Grid-based feature extraction; k-Nearest Neighbours (k-NN); Hidden Markov Model (HMM); Kernelized Correlation Filter (KCF) Tracker; Histogram of Oriented Gradients (HOG).

I. INTRODUCTION

Indian Sign Language (ISL) is a sign language used by hearing and speech impaired people to communicate with other people. The research presented in this paper pertains to ISL as defined on the Talking Hands website [1]. ISL uses gestures for representing complex words and sentences. It contains 33 hand poses, including 10 digits and 23 letters. Amongst the letters in ISL, 'h' and 'j' are represented by gestures, and the letter 'v' is similar to the digit 2. The system is trained with the hand poses in ISL as shown in Fig. 1.

Fig. 1. Hand poses in ISL.

Most people find it difficult to comprehend ISL gestures, which has created a communication gap between people who understand ISL and those who do not. One cannot always find an interpreter to translate these gestures when needed. To facilitate this communication, a potential solution was implemented which translates hand poses and gestures from ISL in real time. It comprises an Android smartphone camera to capture hand poses and gestures, and a server to process the frames received from the camera. The purpose of the system is to implement a fast and accurate recognition technique.

The system described in this paper successfully classifies all 33 hand poses in ISL. For the initial research, only gestures performed with one hand were considered; the solution can be easily extended to two-handed gestures. In the next section, related work pertaining to sign language translation is discussed. Section III explains the techniques used to process each frame and translate a hand pose or gesture. Section IV discusses the experimental results of the techniques described in Section III. Section V describes the Android application developed for the system that enables real-time sign language translation. Section VI discusses future work in ISL translation.

II. RELATED WORK

There has been considerable work in the field of sign language recognition, with novel approaches towards gesture recognition. Different methods, such as the use of gloves or the Microsoft Kinect sensor for tracking the hand, have been employed earlier. A study of many existing systems was carried out in order to design a system that is more efficient and robust than the rest.

A Microsoft Kinect sensor is used in [2] for recognising sign languages. The sensor creates depth frames; a gesture is viewed as a sequence of these depth frames. T. Pryor et al. [3] designed a pair of gloves, called SignAloud, which uses sensors embedded in the gloves to track the position and movement of the hands, converting gestures to speech. R. Hait-Campbell et al. [4] developed MotionSavvy, a technology that uses a Windows tablet and the Leap Motion accelerator AXLR8R to recognise the hand and arm skeleton. Sceptre [5] uses Myo gesture-control armbands that provide accelerometer, gyroscope and electromyography (EMG) data for sign and gesture classification. These hardware solutions


provide good accuracy but are usually expensive and not portable. Our system eliminates the need for external sensors by relying on an Android phone camera.

Among software-based solutions, there are coloured-glove-based [6, 7] and skin-colour-based approaches. R. Y. Wang et al. [6] used a multi-coloured glove for accurate hand pose reconstruction, but the sign demonstrator has to wear it every time the sign language is demonstrated. Skin-colour-based solutions may use the RGB colour space with some motion cues [8], or the HSV [9, 10, 11] or YCrCb [12] colour spaces for luminosity invariance. G. Awad et al. [13] used the initial frames of the video sequence to train an SVM for skin colour variations in the subsequent frames; to speed up skin segmentation, they used a Kalman filter to predict the positions of skin-coloured objects, thus reducing the search space. Z. H. Al-Tairi et al. [14] used the YUV and RGB colour spaces for skin segmentation, and the colour ranges they chose handle a good variation of skin tones across races.

After obtaining the segmented hand image, A. B. Jmaa et al. [12] used rules defined in hand anthropometry (the study of comparative measurements of the human body) for localizing and eliminating the palm. They then used the rest of the segmented image, containing only the fingers, to create a skin-pixel histogram with respect to the palm centroid; this histogram is fed to a decision tree classifier. In [15], a hand contour was obtained from the segmented hand image and used to fit a convex hull, from which convexity defects were found. Using these, the fingers were identified and the angles between adjacent fingers determined; this feature set of angles was fed to an SVM for classification. [10] used a distance transform to identify the hand centroid, followed by elimination of the palm and classification using the angles between fingers. Fourier descriptors have been used to describe hand contours in [8, 16]; [16] used an RBF network on these Fourier descriptors for hand pose classification. S. C. Agrawal et al. [17] used a combination of geometric features (eccentricity, aspect ratio, orientation, solidity), Histogram of Oriented Gradients (HOG) and Scale Invariant Feature Transform (SIFT) key points as feature vectors; the accuracy obtained using geometric features drops sharply as the number of hand poses increases. [18] used Local Binary Patterns (LBP) as features. Our paper is mainly inspired by [9], who trained a k-NN model on the binary segmented hand images directly. This technique provides great speed when combined with fast indexing methods, making it suitable for real-time applications, but more data must be captured to handle variations in hand poses. With the use of grid-based features in our system, the model becomes more user-invariant.

For gesture recognition, hand centroid tracking provides motion information [16]. Gesture recognition can be done using a Finite State Machine [19], which has to be defined for each gesture. C. Y. Kao et al. [20] used two-hand gestures for training an HMM for gesture recognition: they defined directive gestures such as up, left, right and down for the two hands, and a time series of these pairs was input to the HMM. C. W. Ng et al. [16] used a combination of HMM and RNN classifiers. The HMM gesture recognition used in our system is mainly inspired by [16], who used 5 hand poses and the same 4 directive gestures; this 9-element vector was used as input to the HMM classifier, and training was done using the Baum-Welch re-estimation formulas.

III. IMPLEMENTATION

Using an Android smartphone, gestures and signs performed by the person using ISL are captured, and their frames are transmitted to the server for processing. To make the frames ready for recognition of gestures and hand poses, they need to be pre-processed. Pre-processing first involves face removal, stabilisation and skin colour segmentation to remove background details, and later morphology operations to reduce noise. The hand of the person is extracted and tracked in each frame. For recognition of hand poses, features are extracted from the hand and fed into a classifier, and the recognised hand pose class is sent back to the Android device. For classification of hand gestures, the intermediate hand poses are recognised, and a pattern represented as tuples is defined from these recognised poses and their intermediate motion. This pattern is encoded and fed to the HMM; the gesture whose HMM chain gives the highest score with the forward-backward algorithm is taken as the recognised gesture. An overview of this process is shown in Fig. 2.

Fig. 2. Flow diagram for gesture recognition.
A. Dataset used

For the digits 0 to 9 in ISL, an average of 1450 images per digit were captured. For the letters in ISL excluding 'h', 'j' and 'v', about 300 images per letter were captured. For the 9 gesture-related intermediate hand poses, such as Thumbs_Up and Sun_Up, about 500 images per pose were captured. The dataset contains a total of 24,624 images. All the images show the sign demonstrator wearing a full-sleeve shirt.


Most of these images were captured with an ordinary webcam, and a few were captured with a smartphone camera; the images are of varying resolutions. For training the HMMs, 15 gesture videos were captured for each of the 12 one-handed pre-selected gestures defined in [1] (After, All The Best, Apple, Good Afternoon, Good Morning, Good Night, I Am Sorry, Leader, Please Give Me Your Pen, Strike, That is Good, Towards). These videos have slight variations in the sequences of hand poses and hand motion, so as to make the HMMs robust. They were captured with a smartphone camera and also show the sign demonstrator wearing a full-sleeve shirt.

B. Pre-processing

1) Face detection and elimination

The hand poses and gestures in ISL can be represented by particular movements of the hands, and facial features are not necessary. Moreover, the face creates an issue during the hand extraction process. To resolve this, face detection is carried out using Histogram of Oriented Gradients (HOG) descriptors followed by a linear SVM classifier, applied over an image pyramid with a sliding window, as described in [21]. HOG feature extraction combined with a linear classifier reduces false positive rates by more than an order of magnitude compared with the best Haar wavelet-based detector [21]. After the face is detected, the face contour region is identified and the entire face-neck region is blackened out, as demonstrated in Fig. 3.

Fig. 3. Face detection and elimination operation.
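The HOG-plus-linear-SVM detector of [21] is available off the shelf. Below is a minimal sketch assuming the dlib library, whose default frontal face detector is exactly this combination of HOG features, a linear classifier, an image pyramid and a sliding window; the paper does not name its implementation, and the neck margin used here is a guessed heuristic.

    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()  # HOG + linear SVM detector

    def remove_face(frame_bgr):
        """Detect the face and blacken the face-neck region in place."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        for rect in detector(gray, 1):           # upsample once for small faces
            x1, y1 = max(rect.left(), 0), max(rect.top(), 0)
            x2, y2 = rect.right(), rect.bottom()
            # extend the box downwards to also cover the neck (heuristic margin)
            frame_bgr[y1:y2 + (y2 - y1), x1:x2] = 0
        return frame_bgr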
2) Skin colour segmentation

To identify skin-like regions in the image, YUV- and RGB-based skin colour segmentation is used. This model was chosen because it gives the best results among the options considered: HSV, YCbCr, RGB, YIQ, YUV and a few pairs of these colour spaces [14]. The frame is converted from RGB to YUV using the conversion given in [22], specified in equation (1) (coefficients as given for the YUV conversion with a 128 offset on the chrominance channels):

\begin{bmatrix} Y \\ U \\ V \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.147 & -0.289 & 0.436 \\ 0.615 & -0.515 & -0.100 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + \begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix} \qquad (1)

[14] gives the criteria for classifying a pixel as skin, specified in equation (2):

80 < U < 130, \quad 136 < V < 200, \quad V > U, \quad R > 80,\ G > 30,\ B > 15, \quad |R - G| > 15 \qquad (2)

The segmentation mask thus obtained contains little noise and few false positives. An illustration of the segmentation mask is shown in Fig. 4.

Fig. 4. Skin segmentation mask and effect of morphology operations. (Left) Segmentation mask; (Right) mask after application of morphology operations.

3) Morphology operations

Morphology operations are performed to remove any noise remaining after skin colour segmentation. There are two types of errors in skin colour segmentation:
1. Non-skin pixels classified as skin
2. Skin pixels classified as non-skin

Morphology involves two basic sub-operations:
1. Erosion, in which the active (white) areas in the mask are reduced in size
2. Dilation, in which the active (white) areas in the mask are increased in size

The Morphology Open operation (erosion followed by dilation) handles the first error; the Morphology Close operation (dilation followed by erosion) handles the second. The result of applying the morphology operations can be seen in Fig. 4.
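A sketch of the segmentation and clean-up steps in NumPy/OpenCV follows. The thresholds implement equations (1) and (2) as reconstructed from [14]; the 5x5 elliptical kernel is an assumed parameter, not one given in the paper.

    import cv2
    import numpy as np

    def skin_mask(frame_bgr):
        """YUV+RGB skin rule of [14]; Y is not needed, only U and V."""
        f = frame_bgr.astype(np.float32)
        b, g, r = f[..., 0], f[..., 1], f[..., 2]
        u = -0.147 * r - 0.289 * g + 0.436 * b + 128   # equation (1)
        v = 0.615 * r - 0.515 * g - 0.100 * b + 128
        skin = ((u > 80) & (u < 130) & (v > 136) & (v < 200) & (v > u) &
                (r > 80) & (g > 30) & (b > 15) & (np.abs(r - g) > 15))
        return skin.astype(np.uint8) * 255

    def clean_mask(mask):
        """Open removes false skin pixels (error 1); close fills holes (error 2)."""
        k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))  # assumed size
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, k)
        return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, k)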
4) Object stabilisation using facial reference

For tracking hand motion accurately, a steady camera position is desired, but camera movement caused by shaky hands is common. If the sign demonstrator (the person using ISL) does not move their hand but the person capturing the video shakes theirs, false movements will be detected. This problem is tackled using object stabilisation.

Under the assumption that the person's face is always included in the gesture video, the face of the sign demonstrator is tracked to stabilise hand movements. The tracker is initialised with the coordinates extracted by face detection before the face is removed. It detects the location of the facial blob and shifts the entire frame in the direction opposite to the detected motion of the facial coordinates. The system uses the Kernelized Correlation Filter (KCF) tracker implemented in the OpenCV library to track the face in each frame; tracking is performed on the image before the face is blackened. Fig. 5 demonstrates object stabilisation.

Fig. 5. (a) Actual motion of the camera: the subject's position with respect to the frame changes. (b) Effect after stabilisation: the position of the subject with respect to the frame remains constant.
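A minimal sketch of the stabilisation step with OpenCV's KCF tracker; applying the compensating shift with warpAffine is one plausible realisation, since the paper does not detail how the frame is shifted.

    import cv2
    import numpy as np

    tracker = cv2.TrackerKCF_create()  # cv2.legacy.TrackerKCF_create in newer builds

    def init_stabiliser(first_frame, face_box):
        """face_box = (x, y, w, h) from face detection, before blackening."""
        tracker.init(first_frame, face_box)
        x, y, w, h = face_box
        return (x + w / 2.0, y + h / 2.0)          # reference face centre

    def stabilise(frame, ref_centre):
        """Shift the frame opposite to the detected face motion."""
        ok, (x, y, w, h) = tracker.update(frame)
        if not ok:
            return frame
        dx = (x + w / 2.0) - ref_centre[0]
        dy = (y + h / 2.0) - ref_centre[1]
        shift = np.float32([[1, 0, -dx], [0, 1, -dy]])
        return cv2.warpAffine(frame, shift, (frame.shape[1], frame.shape[0]))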

C. Hand extraction and tracking

As all ISL hand poses and gestures can be represented using hand movements, hand extraction and tracking are an important part of the system. After pre-processing, each frame yields a black-and-white image in which white areas represent skin. These skin areas do not contain facial regions; they contain parts of the hand and other skin-like parts of the original image. Since each frame contains only one hand (the other hand is not visible) or both hands touching each other, the only prominent contour present in the frame is the person's hand. The areas of all contours in the frame are therefore calculated, and the contour with the largest area is extracted as the hand region. Fig. 6 illustrates the importance of the face elimination operation: had the face not been eliminated, the face region would have been the largest contour in the image and would have been classified as the hand.

Fig. 6. Importance of eliminating the face before hand extraction.
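In OpenCV terms, the extraction step reduces to taking the largest external contour of the cleaned mask; a sketch (the return conventions are our own):

    import cv2

    def extract_hand(mask):
        """Largest contour of the mask, taken to be the hand (the face is
        already blackened). Returns the contour and its centroid."""
        # OpenCV 4 returns (contours, hierarchy); OpenCV 3 returns three values
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None, None
        hand = max(contours, key=cv2.contourArea)
        m = cv2.moments(hand)
        if m["m00"] == 0:
            return hand, None
        return hand, (int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"]))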
For tracking hand motion, the centroid of the hand is calculated in each frame. If the hand moves, the coordinates of its centroid change. The slope of the line formed by the centroid of the hand in the current frame and the centroid in the previous frame is then determined. Depending upon the value of the slope, the motion of the hand is determined as follows:

- If -1 < slope < 1 and the difference between the x coordinates of the two centroids is positive, the hand moved leftwards.
- If -1 < slope < 1 and the difference between the x coordinates of the two centroids is negative, the hand moved rightwards.
- If |slope| > 1 and the difference between the y coordinates of the two centroids is positive, the hand moved upwards.
- If |slope| > 1 and the difference between the y coordinates of the two centroids is negative, the hand moved downwards.

This slope-based motion determination is illustrated in Fig. 7. Note that the hand motion observed by the camera is opposite to the actual motion performed by the sign demonstrator. The system uses the OpenCV library to calculate the contour area and the centroid of the hand contour.

To reduce noise during hand tracking, an imaginary circle with a 20-pixel radius around the previous hand centroid is considered. If the new centroid lies within this radius, the shift is treated as noise and the motion is neglected; in this case the previous coordinate used for comparison is not updated. If the current centroid lies outside the 20-pixel threshold, the shift is considered a movement of the hand. In subsequent frames, the radius is set to 7 pixels instead of 20 until there is no movement, after which it is restored to 20 pixels. This imaginary circle greatly reduces noise and gives highly accurate tracking of hand movements.

Fig. 7. Determining hand motion using slope.
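The slope test and the noise circle combine into a few lines. The sketch below compares squared distances to avoid a square root; the direction labels assume the mirrored camera view described above, and which centroid is subtracted from which is our assumption.

    def classify_motion(prev, curr, radius=20):
        """Map the centroid shift between consecutive frames to a direction.
        Returns None while the shift stays inside the noise circle
        (radius 20 px normally, 7 px while a movement is in progress)."""
        dx, dy = curr[0] - prev[0], curr[1] - prev[1]
        if dx * dx + dy * dy <= radius * radius:
            return None                          # inside the circle: noise
        if abs(dy) <= abs(dx):                   # -1 < slope < 1: horizontal
            # mirrored view: a shift to the right means the hand moved leftwards
            return "Leftwards" if dx > 0 else "Rightwards"
        return "Upwards" if dy > 0 else "Downwards"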

Fig. 8. The hand pose ‘a’ in ISL fragmented by a 3x3 grid.


D. Feature extraction using Grid-based fragmentation technique

Using an M x N grid, the extracted hand sample is fragmented into M*N blocks. From this grid, a feature vector containing M*N values is obtained, one feature value per block. In each block, the feature value is the percentage of the block covered by the hand contour, as specified in equation (3):

f_{ij} = 100 \cdot \frac{\text{number of hand pixels in block } (i, j)}{\text{total number of pixels in block } (i, j)} \qquad (3)

If no contour is found in a block, its feature value is 0. In Fig. 8, a 3 x 3 grid is constructed on a sample for the purpose of illustration. The advantage of this approach is that the generated features vary with the orientation of each hand pose: different hand poses occupy different numbers of blocks and different areas within them, so the feature vector accurately represents the shape and position of the hand. Using these M*N features, each hand pose is represented by distinct clusters.
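A vectorised sketch of this fragmentation; trimming the image so its dimensions divide evenly by the grid is an edge-handling choice the paper does not specify.

    import numpy as np

    def grid_features(hand_mask, grid=(10, 10)):
        """Fraction of hand pixels per block of an M x N grid, row-major;
        multiply by 100 for the percentage of equation (3). Empty blocks
        naturally yield 0."""
        m, n = grid
        h, w = hand_mask.shape
        trimmed = hand_mask[:h - h % m, :w - w % n]   # make divisible by the grid
        blocks = trimmed.reshape(m, trimmed.shape[0] // m,
                                 n, trimmed.shape[1] // n)
        return (blocks > 0).mean(axis=(1, 3)).ravel().astype(np.float32)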
Fig. 9 supports these arguments. The data visualised in Fig. 9 was subjected to a 10 x 10 grid, and 100 features were extracted per sample. Using Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbour Embedding (t-SNE) [23], the dimensionality of this data was reduced from 100 to 2: PCA was first applied to reduce the dimensionality to 40 features, and t-SNE then reduced it further to 2. As seen in Fig. 9, separate clusters representing different orientations of each hand pose are visible in the extracted features.

Fig. 9. ISL hand poses' data visualised using PCA and t-SNE dimensionality reduction.
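The two-stage reduction used for Fig. 9 maps directly onto scikit-learn; a sketch with random stand-in features:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.random((5000, 100))                   # stand-in for 10x10 grid features
    X40 = PCA(n_components=40).fit_transform(X)   # 100 -> 40, as in the text
    X2 = TSNE(n_components=2).fit_transform(X40)  # 40 -> 2 for the Fig. 9 scatter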
E. Classification

1) Recognition of ISL hand poses using k-NN

From the plot in Fig. 9, it can be observed that the data is organised into clusters, with more than one cluster for the same hand pose. For classification, an algorithm was needed that can distinguish clustered data efficiently, and k-Nearest Neighbours (k-NN) was found suitable for such a distribution. The hand extracted from each frame of the live feed is subjected to feature extraction using the grid-based fragmentation discussed above, and the sample is represented as an M*N-dimensional feature vector.

Using Euclidean distance as the metric, the k nearest samples among those previously fitted to the classifier are computed. The distance computation can be performed by brute force, where the Euclidean distance between the sample and each fitted sample is calculated and the k lowest distances are selected; other approaches include the KD-tree and the Ball tree. The most suitable approach depends on the size of the data: brute force works well on small datasets [24], the KD-tree works well for low-dimensional data, and the Ball tree works best for high-dimensional data [24]. From these k nearest neighbours, the classifier selects the most frequently occurring class.
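A sketch of this classifier with scikit-learn on random stand-in data. k = 5 is an assumed value, as the paper does not state its choice of k; algorithm='auto' lets the library choose among brute force, KD-tree and Ball tree as discussed in [24].

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_train = rng.random((1000, 100))       # stand-in for 10x10 grid features
    y_train = rng.integers(0, 33, 1000)     # stand-in for the 33 hand-pose labels

    clf = KNeighborsClassifier(n_neighbors=5, metric="euclidean",
                               algorithm="auto")
    clf.fit(X_train, y_train)
    print(clf.predict(rng.random((1, 100))))  # most frequent class among the k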


2) Gesture classification using HMM

There are always some variations present when demonstrating gestures, even when they are performed by the same sign demonstrator. Handling such variations requires a statistical model, and Hidden Markov Models (HMMs) are one type of statistical model that handles them well [25]. There are two types of HMM: continuous and discrete. In a continuous HMM, the number of possible observation symbols in each element of the observation sequences is infinite; in a discrete HMM, it is finite. An HMM can also be either ergodic or left-to-right. In a left-to-right HMM, transitions can occur in only one direction, i.e. once the HMM moves to the next state it cannot go back to a previous state, as shown in Fig. 10; in an ergodic HMM, transitions are possible from any state to any other state. The initial state probabilities (π) and the transition probabilities for a left-to-right HMM are shown in Fig. 10.

Fig. 10. HMM chain for a gesture with 3 hidden states (e.g. Good Afternoon).

A human brain perceives a gesture as a combination of a few intermediate hand poses and hand movements executed in a particular order. Following this idea, ISL gestures are treated as being composed of intermediate stationary hand poses and the hand movements between them. Thus, a discrete left-to-right HMM is used in this system. It uses the segmented hand centroid motion and the pose classification results to classify a given observation sequence as one of the 12 predefined gestures.

The input to the HMM is an observation sequence extracted from the video feed. The number of possible observation symbols is the sum of the number of directions tracked and the number of intermediate stationary hand poses with which the system is trained. In this system, 4 directions are tracked (as described in Section III-C), and the system is trained with 9 intermediate stationary hand poses, such as 'Thumbs_Up', 'Sun_Up' and 'Fist', as shown in Fig. 11.

Fig. 11. Thumbs_Up and Sun_Up stationary hand poses.

The intermediate hand pose recognition also uses grid-based feature extraction. It is similar to the recognition of hand poses for letters and digits, but is carried out only when there is no movement. The total number of observation symbols is thus 13, i.e. the observation sequence can have elements with values 0-12. At each frame, a tuple of the form <S, M> is generated, where M represents the motion of the hand relative to the previous frame and S represents the pose classified if no motion was detected. If motion is detected, it is mapped to the corresponding observation symbol ('Upwards': 0, 'Rightwards': 1, 'Leftwards': 2, 'Downwards': 3); if no motion is detected, the hand pose is mapped accordingly. This mapping is pre-decided. Therefore, each frame yields one observation symbol, and the video sequence generates an observation sequence encoding motion and hand pose information. For example, the time series [ <Thumbs_Up, None>, <Thumbs_Up, None>, <Thumbs_Up, None>, <None, Up>, <None, Up>, <None, Up>, <None, Up>, <Sun_Up, None>, <Sun_Up, None>, <Sun_Up, None> ] represents the gesture "Good Afternoon"; encoding it into an observation sequence gives [ 4, 4, 4, 0, 0, 0, 5, 5, 5 ]. A few frames of this gesture are illustrated in Fig. 12.

Fig. 12. Four frames of the gesture "Good Afternoon".
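The encoding step is a lookup table. The direction symbols 0-3 are given in the text; the numbering of the 9 pose symbols (4-12, with Thumbs_Up mapped to 4 and Sun_Up to 5) is an assumption consistent with the "Good Afternoon" example.

    MOTION_SYMBOLS = {"Upwards": 0, "Rightwards": 1,
                      "Leftwards": 2, "Downwards": 3}
    POSE_SYMBOLS = {"Thumbs_Up": 4, "Sun_Up": 5, "Fist": 6}  # ... 9 poses in all

    def encode(series):
        """series: one <pose, motion> tuple per frame; returns symbols 0-12."""
        return [MOTION_SYMBOLS[m] if m is not None else POSE_SYMBOLS[p]
                for p, m in series]

    frames = ([("Thumbs_Up", None)] * 3 + [(None, "Upwards")] * 3
              + [("Sun_Up", None)] * 3)
    print(encode(frames))   # [4, 4, 4, 0, 0, 0, 5, 5, 5] -> "Good Afternoon"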
The gesture recognition used in this system involves 12 HMM chains, one per gesture. The number of hidden states in each chain is determined by breaking the gesture into a combination of hand poses and the motions between them. For example, the Good Afternoon gesture shown in Fig. 12 can be said to have 3 states: the 'Thumbs_Up' hand pose, the 'Upwards' motion and the 'Sun_Up' hand pose. All these chains being left-to-right HMMs, their initial state probabilities and initial state transition probabilities are similar to those shown in Fig. 10. For n hidden states, the state transition matrix is of order n x n and the emission probability matrix is of order n x 13. The emission probability matrices were initialised with probabilities determined empirically, by subjectively assessing the similarity between hand poses and the closest possible motions; this increases the chances of the HMM converging to the global maximum during training.

After all the parameters described in [26] are initialised as explained above, the emission probabilities and state transition probabilities of the HMM chains are trained using the Baum-Welch algorithm [26, 27], on the gesture database described in Section III-A. After training, a new observation sequence is fed to all the HMM chains, and the chain giving the highest score with the forward-backward algorithm [26] determines the recognised gesture.
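A sketch of the per-gesture chains using hmmlearn, an assumed library choice (the paper does not name one). CategoricalHMM provides the discrete HMM, Baum-Welch fitting and forward-backward scoring described above; the uniform upper-triangular transition initialisation stands in for the empirically chosen priors.

    import numpy as np
    from hmmlearn import hmm   # CategoricalHMM; MultinomialHMM in old releases

    N_SYMBOLS = 13             # 4 directions + 9 stationary poses

    def make_chain(n_states):
        """Left-to-right discrete HMM: start in state 0, no backward moves."""
        model = hmm.CategoricalHMM(n_components=n_states,
                                   init_params="", params="te")
        model.startprob_ = np.eye(n_states)[0]
        model.transmat_ = np.triu(np.ones((n_states, n_states)))
        model.transmat_ /= model.transmat_.sum(axis=1, keepdims=True)
        model.emissionprob_ = np.full((n_states, N_SYMBOLS), 1.0 / N_SYMBOLS)
        return model

    chains = {name: make_chain(3) for name in ("Good Afternoon", "Good Morning")}

    def train(model, sequences):
        """Baum-Welch re-estimation over the captured gesture videos."""
        X = np.concatenate(sequences).reshape(-1, 1)
        model.fit(X, lengths=[len(s) for s in sequences])

    def recognise(observations):
        """Forward log-likelihood; the best-scoring chain names the gesture."""
        obs = np.asarray(observations).reshape(-1, 1)
        return max(chains, key=lambda name: chains[name].score(obs))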
3) Temporal segmentation

The gesture recognition module must be given video segments that correspond to a single gesture; without temporal segmentation, continuous gesture recognition is not possible. We use a simple rule: if the hand goes out of frame, this marks the end of the current gesture, and gesture recognition is performed over the frame sequence obtained so far. When the hand


comes back into the frame, this marks the beginning of a new gesture. Temporal segmentation is achieved using this rule.

IV. EXPERIMENTAL RESULTS

The results discussed in this section were obtained on a personal computer with 8 GB of RAM, an Intel i7 processor with 4 virtual CPUs and a 2 GB NVIDIA GPU, running Ubuntu 16.04.

For better classification of hand poses, an appropriate grid size had to be determined for fragmenting the hand and extracting features. We applied 6 different grid sizes (5x5, 10x10, 10x15, 15x15, 15x20 and 20x20) to extract features from the same training data, and fitted a k-NN classifier to each. Features were then extracted from the testing data using the same grid sizes and classified with the corresponding trained k-NN classifier. An ideal grid size forms distinct clusters of hand poses, so that the k-NN classifier can recognise them with high precision. To determine this ideal grid size, the accuracies of the trained k-NN classifiers on the testing data were compared, with accuracy calculated as specified in equation (4):

\text{Accuracy} = \frac{\text{number of correctly classified test samples}}{\text{total number of test samples}} \qquad (4)

The comparison is plotted in Fig. 13. From the graph, it can be inferred that the 10x10 grid gives the highest accuracy, 99.714%; the 10x10 grid was therefore selected for extracting features from the hand image in our implementation. The average time required to extract features from an image of size 300x300 using a 10x10 grid was found to be about 1 millisecond. The testing data was 30% of the original dataset of hand poses.

Fig. 13. Comparison of the accuracy of k-NN classification on features extracted using various grid sizes on hand poses' data.
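The sweep can be sketched as below, reusing the grid_features function from the Section III-D sketch; masks and labels stand for the segmented dataset, the 70/30 split mirrors the text, and k = 5 is again an assumed value.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def grid_accuracy(masks, labels, grid):
        """Accuracy of k-NN on features from one grid size, per equation (4)."""
        X = np.array([grid_features(m, grid) for m in masks])
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                                  random_state=0)
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
        return clf.score(X_te, y_te)   # correct predictions / total predictions

    for grid in [(5, 5), (10, 10), (10, 15), (15, 15), (15, 20), (20, 20)]:
        print(grid, grid_accuracy(masks, labels, grid))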
Fig. 15 shows the confusion matrix of the k-NN classifier when tested with the hand poses' data. It can be inferred from the confusion matrix that the model is able to distinguish each hand pose accurately. The time taken for each phase of processing a frame is tabulated in TABLE I; our application thus achieves a frame rate of 5.3 fps. After the frames have been processed and the time series has been extracted from the gesture frame sequence, the average time taken by the HMM model (consisting of 12 HMM chains) for gesture classification is 3.7 ms.

TABLE I. COMPUTATIONAL TIME FOR EACH PHASE

Sr. No. | Phase                                                  | Time taken (ms)
1       | Data transfer over WLAN                                | 46.2
2       | Skin colour segmentation and morphological operations  | 12.3
3       | Face detection and elimination                         | 99.9
4       | Object stabilization                                   | 14.8
5       | Feature extraction                                     | 11.2
6       | Hand pose classification                               | 1.7
        | Average time per frame                                 | 186.1

The 12 gestures and their abbreviations considered for testing are as follows:

- After (AFT)
- All the Best (ATB)
- Apple (APL)
- Good Afternoon (GA)
- Good Morning (GM)
- Good Night (GN)
- I Am Sorry (IAS)
- Leader (LDR)
- Please Give Me Your Pen (PGP)
- Strike (STR)
- That is Good (TIG)
- Towards (TWD)

Testing was also done for wrong gestures (WR), i.e. gestures not amongst those listed above. A 10x10 grid and a k-NN classifier were used to recognise the intermediate hand poses, as in hand pose recognition. Fifty real-time trials were performed for each gesture class in good lighting conditions, with the sign demonstrator wearing a full-sleeve shirt. Fig. 14 depicts the confusion matrix generated from these trials; each cell contains the percentage of trials that gave the corresponding output. It clearly shows that correct classification was obtained more than 94% of the time, and 97.2% on average. It can thus be inferred that the hand motion tracking and hand pose classification were accurate enough to generate appropriate time series for the Hidden Markov Models.

Fig. 14. Heatmap of the confusion matrix of ISL gestures.


Fig. 15. Heatmap of the confusion matrix for the k-NN classifier tested on ISL hand poses.

V. THE APPLICATION

The system is implemented as an Android application. The application uses the smartphone's camera to capture the sign language used by the person; frames are captured at a rate of 5 frames per second, and each frame is continuously sent to a remote server. Processing is performed on the server side. After each pose or gesture is classified, the result is sent back to the application, which displays it in the top portion of the screen. Currently, sockets are used for the client-server connection. Fig. 16 shows screenshots of the application.

Fig. 16. (Left) Hand pose 'a' recognised by the application; (Right) gesture 'Good Afternoon' recognised by the application.
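A minimal sketch of the server side of this link. The wire format (a 4-byte length prefix followed by one JPEG-encoded frame) and the port are assumptions, and classify stands for the Section III pipeline.

    import socket
    import struct
    import cv2
    import numpy as np

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("0.0.0.0", 5000))
    srv.listen(1)
    conn, _ = srv.accept()

    def recv_exact(n):
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("client disconnected")
            buf += chunk
        return buf

    while True:
        size = struct.unpack(">I", recv_exact(4))[0]    # length prefix
        jpeg = np.frombuffer(recv_exact(size), np.uint8)
        frame = cv2.imdecode(jpeg, cv2.IMREAD_COLOR)    # ~5 fps from the app
        label = classify(frame)         # placeholder: pose/gesture pipeline
        conn.sendall(label.encode() + b"\n")            # shown in the app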
VI. FUTURE WORK

Currently, only single-handed gestures in ISL were considered for this research. With advanced hand extraction algorithms, the approach can be extended to two-handed gestures as well. With Natural Language Processing algorithms, the system could also be extended to recognise sentences in ISL, by recognising multiple gestures in the same video capture.

Hand extraction is currently dependent on skin colour segmentation, which requires the subject to wear a full-sleeve shirt for accurate recognition. Although the system could help the hearing and speech impaired community, where full-sleeve shirts are frequently worn, it may not work in general conditions. The approach could be extended using object detection techniques to extract the hand region from the image; the main obstacle there is the requirement for a very wide variety of annotated hand samples, so that hands can be detected in almost any position, orientation and background. The current approach also requires the lighting conditions to be optimal, neither too dark nor too bright. Better skin colour segmentation techniques that perform well under a wider variety of lighting conditions would improve segmentation and in turn aid feature extraction.

CONCLUSION

From the results, it can be inferred that the system presented in this paper accurately tracks the hand movements of the sign demonstrator using techniques such as object stabilisation, face elimination, skin colour segmentation and hand extraction. It can classify all 33 hand poses in ISL with an accuracy of 99.7%, and it classifies 12 gestures with an average accuracy of 97.23%. The approach uses an HMM chain for each gesture and a k-NN model to classify each hand pose. The time required for recognition is about 0.2 s per hand pose and 0.0037 s per gesture. It can be concluded that the system recognises ISL hand poses and gestures precisely and in real time, providing higher accuracy and faster recognition than the other approaches discussed in the literature. The approach


discussed here is inspired by the various systems described in the Related Work section and utilises their strengths to make classification more precise. The approach is generic and can be extended to other single-handed and two-handed gestures. The system presented in this paper can also be extended to other sign languages, provided a dataset satisfying its current requirements is available.

REFERENCES

[1] TalkingHands.co.in, "Talking Hands," 2014. [Online]. Available: http://www.talkinghands.co.in/. [Accessed: 21-Jul-2017].
[2] A. Agarwal and M. K. Thakur, "Sign Language Recognition using Microsoft Kinect," Sixth International Conference on Contemporary Computing (IC3), September 2013.
[3] MailOnline, "SignAloud gloves translate sign language gestures into spoken English," 2016. [Online]. Available: http://www.dailymail.co.uk/sciencetech/article-3557362/SignAloud-gloves-translate-sign-language-movements-spoken-English.html. [Accessed: 10-Feb-2018].
[4] A. Tsotsis, "MotionSavvy Is A Tablet App That Understands Sign Language," 2014. [Online]. Available: https://techcrunch.com/2014/06/06/motionsavvy-is-a-tablet-app-that-understands-sign-language/. [Accessed: 10-Feb-2018].
[5] P. Paudyal, A. Banerjee and S. K. S. Gupta, "SCEPTRE: a Pervasive, Non-Invasive, and Programmable Gesture Recognition Technology," Proceedings of the 21st International Conference on Intelligent User Interfaces, pp. 282-293, 2016.
[6] R. Y. Wang and J. Popovic, "Real-Time Hand-Tracking with a Color Glove," ACM Transactions on Graphics (TOG), vol. 28, no. 3, 2009.
[7] R. Akmeliawati, M. P. L. Ooi and Y. C. Kuang, "Real-Time Malaysian Sign Language Translation using Colour Segmentation and Neural Network," Instrumentation and Measurement Technology Conference Proceedings, 2007.
[8] F. S. Chen, C. M. Fu and C. L. Huang, "Hand gesture recognition using a real-time tracking method and hidden Markov models," Image and Vision Computing, vol. 21, pp. 745-758, 2003.
[9] M. A. Rahaman, M. Jasim, M. H. Ali and M. Hasanuzzaman, "Real-Time Computer Vision-Based Bengali Sign Language Recognition," 17th International Conference on Computer and Information Technology (ICCIT), 2014.
[10] S. Padmavathi, M. S. Saipreethy and V. Valliammai, "Indian Sign Language Character Recognition using Neural Networks," IJCA Special Issue on Recent Trends in Pattern Recognition and Image Analysis (RTPRIA), 2013.
[11] A. Chaudhary, J. L. Raheja and S. Raheja, "A Vision based Geometrical Method to find Fingers Positions in Real Time Hand Gesture Recognition," JSW, pp. 861-869, 2012.
[12] A. B. Jmaa, W. Mahdi, Y. B. Jemaa and A. B. Hamadou, "Hand Localization and Fingers Features Extraction: Application to Digit Recognition in Sign Language," International Conference on Intelligent Data Engineering and Automated Learning, pp. 151-159, 2009.
[13] G. Awad, J. Han and A. Sutherland, "A Unified System for Segmentation and Tracking of Face and Hands in Sign Language Recognition," 18th International Conference on Pattern Recognition, 2006.
[14] Z. H. Al-Tairi, R. W. Rahmat, M. I. Saripan and P. S. Sulaiman, "Skin Segmentation Using YUV and RGB Color Spaces," Journal of Information Processing Systems, vol. 10, no. 2, pp. 283-299, June 2014.
[15] H. Lahiani, M. Elleuch and M. Kherallah, "Real Time Hand Gesture Recognition System for Android Devices," 15th International Conference on Intelligent Systems Design and Applications (ISDA), 2015.
[16] C. W. Ng and S. Ranganath, "Real-time gesture recognition system and application," Image and Vision Computing, 2002.
[17] S. C. Agrawal, A. S. Jalal and C. Bhatnagar, "Recognition of Indian Sign Language using Feature Fusion," 4th International Conference on Intelligent Human Computer Interaction (IHCI), 2012.
[18] M. Hrúz, J. Trojanová and M. Železný, "Local Binary Pattern based features for Sign Language Recognition," Pattern Recognition and Image Analysis, 2012.
[19] J. Davis and M. Shah, "Visual gesture recognition," IEE Proceedings - Vision, Image and Signal Processing, 1994.
[20] C. Y. Kao and C. S. Fahn, "A Human-Machine Interaction Technique: Hand Gesture Recognition based on Hidden Markov Models with Trajectory of Hand Motion," Procedia Engineering, vol. 15, pp. 3739-3743, 2011.
[21] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 886-893, 2005. doi: 10.1109/CVPR.2005.177.
[22] B. C. Ennehar, O. Brahim and T. Hicham, "An appropriate color space to improve human skin detection," INFOCOMP Journal of Computer Science, vol. 9, no. 4, pp. 1-10, 2010.
[23] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, November 2008.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel et al., "1.6. Nearest Neighbors," scikit-learn 0.19.1 documentation, 2011. [Online]. Available: http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms. [Accessed: 12-Sep-2017].
[25] C. Vogler and D. Metaxas, "Handshapes and movements: Multiple-channel ASL Recognition," Gesture-Based Communication in Human-Computer Interaction, pp. 247-258, 2004.
[26] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, February 1989.
[27] L. Baum, "An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes," Inequalities III: Proceedings of the Third Symposium on Inequalities, vol. 3, pp. 1-8, 1972.

