
2020 IEEE Pune Section International Conference (PuneCon)
Vishwakarma Institute of Technology, Pune, India, Dec 16-18, 2020
DOI: 10.1109/PuneCon50868.2020.9362383

Indian Sign Language Interpretation and Sentence Formation

Disha Gangadia, Varsha Chamaria, Vidhi Doshi, Jigyasa Gandhi
Computer Engineering, DJ Sanghvi College of Engineering, Mumbai, India
dishagangadia@gmail.com, varshachamaria1999@gmail.com, doshividhi021@gmail.com, jigsgandhi97@gmail.com

Abstract—People with speech and hearing disabilities constitute approximately 1 percent of the total Indian population. A person who is hearing and speech impaired cannot compete or work alongside others in a typical environment because of the lack of a proper communication medium. Sign Language is the most natural and expressive means of communication for the hearing and speech impaired. This paper proposes a method that recognizes Sign Language and converts it to ordinary text and speech for fast and improved communication among the impaired and with others. The focus is specifically on Indian Sign Language (ISL), as there is no substantial work on ISL that meets these requirements. The paper describes a real-time, hands-on system that takes video input of gestures in a specified ROI and performs gesture recognition using various feature extraction techniques and a hybrid CNN model trained on the ISL database we created. The correctly identified gesture tokens are sent to a rule-based grammar and to a web search query to generate candidate sentences, and a multi-headed BERT grammar corrector provides grammatically precise and correct sentences as the final output.

Index Terms—Indian Sign Language (ISL), Hand gesture, Feature extraction, Scale-Invariant Feature Transform (SIFT), Gesture recognition, Dumb and Deaf, Convolutional Neural Network (CNN), Natural Language Processing (NLP).

I. INTRODUCTION

Sign Language is used as a medium of communication by visually impaired, deaf and dumb people, who constitute a significant portion of our population. This work aims to bridge the communication gap by developing a system that takes Indian Sign Language as input and generates meaningful sentences as output using various machine learning and data mining techniques.

Communication plays an important role in the day-to-day lives of human beings, and we have developed this project for the speech-impaired population. The system receives a video from the user through the webcam available on the device. After a series of phases, namely Image Preprocessing, Classification and Sentence Formation, the output is generated as meaningful sentences along with a voice rendering.

The main aim of this work is to develop a real-time interactive system that helps the hearing and speech impaired communicate with others using Indian Sign Language, and to build a scalable project that can be extended to capture the whole vocabulary of ISL through manual and non-manual signs.

The social benefits of such a system are substantial:
• A helping hand for hearing-impaired and speech-impaired students in their early stages of development, and a clearer, more compact and precise way of communicating.
• An improved teaching-learning process.
• Language flexibility that ensures the impaired do not have to learn a new language; a uniform system that can be used universally.
• A revolutionized communication process with the help of non-governmental organizations (NGOs).
• An improved lifestyle and a great aid to the elderly.

The system has two main components, Gesture Recognition and Sentence Generation, which in turn have many sub-components: image preprocessing, removing light variations, reducing noise, motion detection, edge detection, segmentation into frames, extracting gesture labels with a Convolutional Neural Network (CNN), converting the labels to tokens, fitting the tokens into a grammar, using web and corpus results with similarity checks and grammar correction to create the ten most meaningful and relevant sentences, converting the text to speech, and presenting the text and speech output to the user.

II. LITERATURE SURVEY

A vision-based hand gesture recognition model is proposed by Kanchan Dabre and Surekha Dholay. The architecture in [1] has a preprocessing phase comprising background subtraction, blob analysis, brightness normalization and scaling, and a classification phase that uses a Haar Cascade classifier to classify the word and give a textual output.

Reference [2] uses a similar preprocessing phase, after which a condensation algorithm is used for hand tracking and localization, and a Hidden Markov Model (HMM) forward-backward algorithm in conjunction with the Viterbi path is used for recognition, achieving 96.25% accuracy over a batch of 8 gestures.

Xiujuan Chai, Hanjie Wang, Fang Yin and Xilin Chen propose a hierarchical Grassmann Covariance Matrix (GCM) model in [4] that encodes static as well as continuous sequences (frame by frame), followed by a discriminative kernel Support Vector Machine (SVM) for sign classification. For continuous sequences, a probability inference technique is used to assign the labels.

Reference [5] proposed a system comprising three stages: preprocessing, feature extraction and classification. Hand gestures are converted to meaningful sentences using grammar rules and part-of-speech (POS) tagging, and a Look-Ahead LR (LALR) parser generates tags on the framed rules.

Reference [3] proposed a grammar error correction (GEC) model consisting of a sequence labelling phase, in which tokens are given {remain, insert, delete, substitute} labels, and a grammar correction phase that uses a pre-trained BERT model to provide candidate outputs for the masked token inputs, thus performing simple grammar correction of unit span.

Eigenvalues and eigenvectors are considered by Singha J and Das K for the feature extraction stage, and an eigenvalue-weighted Euclidean distance is employed to recognize the sign in [6]. The method deals with bare hands: skin filtering is applied to the input video frames to detect hand gestures, allowing the user to interact with the system in a natural way.

Sampada Wazalwar and Urmila Shrawankar suggested a method in [8] in which the input video of sign language is framed and segmented, and the CamShift algorithm and a pseudo two-dimensional hidden Markov model (P2DHMM) are used for tracking. Signs are identified with a Haar Cascade classifier, POS tagging is done with the WordNet dictionary, and finally an LALR parser is used to generate the sentence.

III. PROPOSED ARCHITECTURE

Fig. 1: System architecture.

The system architecture is shown in Figure 1 and consists of many stages. The preprocessing phase consists of methods that extract important discernible features from the image, such as grayscaling, illumination normalization, noise removal, edge detection, corner detection and thresholding.

The system contains a data set of preprocessed hand gestures, which is used to evaluate the performance of the proposed method. The data set is a collection of around 10000 hand gesture images; for each gesture class, 100 instances are captured. Since an Indian Sign Language (ISL) data set is not available, we created a data set of 100 Indian signs. While creating the data set, the image for each gesture is stored in four different formats: 1) no filter, 2) Features from Accelerated Segment Test (FAST), 3) Canny edge, and 4) Scale-Invariant Feature Transform (SIFT).

Classification uses a hybrid CNN model that is trained and validated on this database. Data augmentation is also applied to the training images to include various transformations (geometric and illumination variance).

The input videos from the user are converted into frames in real time and passed to the trained model; the gesture with the highest probability is selected and passed on to the Natural Language Processing stage. The NLP phase forms sentences by adding relevant words and correcting grammar, and returns a text output that is also converted to audio for the end user. The NLP phase is also called the speech synthesis phase. A minimal sketch of this end-to-end flow is shown below.
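The following sketch is only an illustration of the two-stage flow described above, not the authors' implementation. The ROI, the `generate_sentences` and `correct_grammar` helpers, and the assumption of a single-input classifier are all hypothetical placeholders for the components detailed in Section IV.

```python
import cv2

def recognize_gestures(video_path, model, labels, roi=(0, 0, 200, 200)):
    """Yield (label, probability) for the best gesture in each frame."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        x, y, w, h = roi                                  # specified ROI
        patch = cv2.resize(frame[y:y + h, x:x + w], (200, 200))
        probs = model.predict(patch[None, ...] / 255.0)[0]  # classifier, Sec. IV-C
        yield labels[probs.argmax()], probs.max()
    cap.release()

def interpret(video_path, model, labels):
    # Gesture Recognition stage: frames -> gesture tokens
    tokens = [tok for tok, _ in recognize_gestures(video_path, model, labels)]
    # Sentence Generation stage: grammar + web results + BERT correction
    # (generate_sentences / correct_grammar stand in for Sec. IV-D).
    return correct_grammar(generate_sentences(tokens))
```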

Fig. 2: Gesture for "house" in different modes: (a) no-filter, (b) adaptive thresholding, (c) SIFT.

IV. IMPLEMENTATION

A. Data Set Generation

The data set consists of 10000 hand gesture instances covering 100 unique gestures of Indian Sign Language. These include the alphabets A-Z (for proper nouns) and the digits 0-9; the remaining 64 gestures were selected after a survey of the words most frequently needed in day-to-day life. The data set includes gestures from four different people, and variations in lighting, orientation, motion, image placement, etc. have been included to avoid high bias. While training the model, data augmentation creates transformations of the original data so that the CNN model learns better; a possible setup is sketched below. The image resolution is 200 × 200 pixels. This particular region-of-interest size was selected because most gestures (single-hand and dual-hand) fit well inside the square.
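As one way to realize the geometric and illumination variance mentioned above, a Keras `ImageDataGenerator` could be configured as follows. The exact transform ranges are assumptions, not the paper's reported settings.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=10,             # small orientation changes
    width_shift_range=0.1,         # image-placement variation
    height_shift_range=0.1,
    zoom_range=0.1,                # geometric variance
    brightness_range=(0.7, 1.3),   # illumination variance
)

# flow_from_directory expects one sub-directory per gesture class;
# "isl_dataset/train" is a hypothetical path.
train_batches = augmenter.flow_from_directory(
    "isl_dataset/train", target_size=(200, 200), batch_size=32)
```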
B. Data Preprocessing and Feature Extraction

The images in the data set need to be processed before storing and training. The following thresholding and transformation techniques help make the classification more accurate.

1) No-filter: The images are kept as they are, preserving the RGB values in the order they are extracted; the dimensions thus remain (200, 200, 3). It can be argued that the colour dimension is not required for this particular gesture recognition problem, since shape matters more than colour and the image size is small. The output is shown in Figure 2(a).

2) Adaptive Thresholding: The no-filter images alone do not give accurate results, so each image is converted to grayscale and an adaptive threshold is applied. Simple thresholding uses a single global threshold, which can produce an imbalanced output when lighting differs across areas of the image. The adaptive thresholding algorithm instead determines the threshold for each pixel from a small region surrounding it. The output is shown in Figure 2(b).

d(x, y) = \begin{cases} \text{maxValue} & \text{if } s(x, y) > T(x, y) \\ 0 & \text{otherwise} \end{cases} \quad (1)

T(x, y) = \sum_{i=1}^{b} \sum_{j=1}^{b} G_{ij} \, S_{ij} - c \quad (2)

Here, G_{ij} is a Gaussian weight window of size b × b, c is a constant, s is the source pixel and d is the destination pixel.

3) Feature Extraction: We use FAST key-point extraction, Canny edge detection and SIFT feature extraction, and combine them in a hybrid final model; a sketch of all four storage formats follows the FAST description below.

a) FAST Feature Extraction: Features from Accelerated Segment Test uses a Bresenham circle of radius 3, which takes 16 pixels into consideration. A candidate pixel p is a corner if there is a set S of N contiguous pixels on the circle that satisfies one of the following conditions:

\forall x \in S: \ I_x > I_p + t \quad (3)

\forall x \in S: \ I_x < I_p - t \quad (4)

A high-speed test is applied to improve performance, but it cannot be generalized to N < 12, and its speed depends on the order in which the adjacent pixels are queried; a machine-learning optimization is therefore used. Multiple interest points detected in nearby locations are pruned using non-maximum suppression. The output is shown in Figure 3.

Fig. 3: FAST feature extraction.
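A minimal OpenCV sketch of producing the four stored formats for one image, covering adaptive thresholding and FAST above plus the Canny and SIFT modes described in the next subsections. The parameter values (block size 11, constant c = 2, FAST and Canny thresholds) are illustrative assumptions, and "gesture.png" is a hypothetical file.

```python
import cv2

img = cv2.imread("gesture.png")          # format 1: "NoFilter", (200, 200, 3)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Adaptive Gaussian-weighted thresholding, Eqs. (1)-(2)
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 2)

# Format 2: FAST key-points with non-maximum suppression, Eqs. (3)-(4)
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
fast_kp = fast.detect(gray, None)

# Format 3: Canny edge map (OpenCV runs the multi-stage algorithm internally)
edges = cv2.Canny(gray, 100, 200)

# Format 4: SIFT key-points and 128-D descriptors, computed on the
# adaptively thresholded image as described in the SIFT mode subsection
sift = cv2.SIFT_create()
sift_kp, sift_desc = sift.detectAndCompute(thresh, None)
```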

b) Canny Edge Detector: The Canny edge detector is a popular algorithm that extracts structural information from visual objects while reducing the amount of data to be processed. It is a multi-stage algorithm with the following stages:

1) Noise reduction: the image is first smoothed with a Gaussian filter; here a 5 × 5 kernel is used:

H = \frac{1}{273} \begin{bmatrix} 1 & 4 & 7 & 4 & 1 \\ 4 & 16 & 26 & 16 & 4 \\ 16 & 26 & 41 & 26 & 16 \\ 4 & 16 & 26 & 16 & 4 \\ 1 & 4 & 7 & 4 & 1 \end{bmatrix} * A \quad (5)

2) Intensity gradient: the smoothed image is filtered with a Sobel kernel to get the first derivatives G_x and G_y in both directions of the 2D image:

G = \sqrt{G_x^2 + G_y^2} \quad (6)

\theta = \tan^{-1}(G_y / G_x) \quad (7)

3) Non-maximum suppression: after the gradient calculation, a full scan removes pixels that do not contribute to an edge; each pixel is checked for being the local maximum among its adjacent pixels in the gradient direction. The result is a binary image with "thin edges".

4) Hysteresis thresholding: this stage decides which candidate edges are really edges and which are not. The output is shown in Figure 4.

Fig. 4: Canny edge detection.

c) SIFT Mode: The image obtained from adaptive thresholding is passed to the Scale-Invariant Feature Transform. SIFT is used because it can identify objects and extract local features even in cluttered and partially occluded images, and it is invariant to uniform scaling, orientation and lighting changes. The algorithm finds key-points as the maxima/minima of the Difference of Gaussians (DoG) across scales σ:

D(x, y, \sigma) = L(x, y, \sigma_1) - L(x, y, \sigma_2) \quad (8)

L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \quad (9)

Here, G is the Gaussian blur at scale σ and * denotes convolution. This is followed by key-point localization, which removes unnecessary low-contrast key-points and eliminates edge responses. This requires the principal curvature, given by the eigenvalues of the 2 × 2 Hessian matrix:

H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} \quad (10)

R = \frac{(D_{xx} + D_{yy})^2}{D_{xx} D_{yy} - D_{xy}^2} \quad (11)

\text{reject the key-point if } R > \frac{(r_{th} + 1)^2}{r_{th}} \quad (12)

The descriptor is expressed relative to the computed orientation, which yields rotation invariance:

\theta(x, y) = \tan^{-1} \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)} \quad (13)

Finally, a key-point descriptor is generated: a 4 × 4 grid of neighbourhood cells is considered with 8 orientation bins each, giving a descriptor of dimension 4 × 4 × 8 = 128.

The outputs for SIFT are shown in Figures 2(c) and 5. Thus, a FAST, Canny and SIFT feature database is created for all the training images. These algorithms alone can be used for object recognition, but for this work we save the features from their descriptors and build a hybrid system with the CNN.

Fig. 5: SIFT features extracted from the gesture "moon".

C. Hybrid Model

We compute a CNN over the no-filter images using several convolution layers, max pooling and dropouts. The features extracted by the descriptors above are passed through fully connected layers of 4096, 2048 and 4096 neurons respectively. The results from all these layers are merged, and classification is done by a final 100-neuron layer with the Softmax activation function. The model is shown in Figure 6; a minimal sketch follows.

Fig. 6: Hybrid model.
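The following Keras sketch is one reading of the hybrid model: a small CNN over the raw image merged with dense branches over the hand-crafted feature vectors. Only the 4096/2048/4096 widths and the 100-way Softmax come from the text (the text could equally mean three stacked layers on one branch); the convolution stack, dropout rate, descriptor vector length of 2048 and optimizer are assumptions.

```python
from tensorflow.keras import Input, layers, models

# CNN branch on the raw (200, 200, 3) image
image_in = Input(shape=(200, 200, 3))
x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Dropout(0.25)(x)
x = layers.Flatten()(x)

# One dense branch per descriptor type (FAST / Canny / SIFT), each
# flattened to a fixed-length vector beforehand
branches = []
for width in (4096, 2048, 4096):
    feat_in = Input(shape=(2048,))
    branches.append((feat_in, layers.Dense(width, activation="relu")(feat_in)))

# Merge all branches and classify into the 100 gesture classes
merged = layers.concatenate([x] + [b for _, b in branches])
out = layers.Dense(100, activation="softmax")(merged)

model = models.Model([image_in] + [i for i, _ in branches], out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```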
D. Natural Language Processing

The tokens produced by the hybrid model are now used to build a sentence by adding appropriate prepositions, articles, conjunctions, tenses, etc.

1) Grammar Rules: The words from the database are fit into a detailed context-free grammar (a summary is shown in Figure 7). In addition to the words in the database, which are mostly nouns, verbs and adjectives, another set of words for prepositions, conjunctions and determiners is added to the grammar. The procedure is as follows:
• The 64 tokens are POS-tagged using the Brown corpus.
• The tokens are passed to the grammar.
• The tag with the highest probability is chosen as the final tag.
• The grammar considers the possible sequences by enumerating permutations and combinations.
• The sentences are POS-tagged and unseen words are added to the grammar.
• All the sentences are returned.

Fig. 7: Grammar rules.

2) Web Results: The grammar may not contain all the words required to build a complete sentence structure; this is where web search results are used. Web search queries are observed to be organized in a way that forms an entire sentence, so web search results for the given tokens are extracted and stored at run time.
• The results are converted to tokens.
• All results whose cosine similarity with the array of input tokens is smaller than a threshold value are rejected:

\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \quad (14)

• The sentences, along with the sentences from the grammar, are sent to the grammar correction phase.

3) Grammar Correction: Grammar is checked using Bidirectional Encoder Representations from Transformers (BERT) trained on the Lang-8 Corpus of Learner English. A multi-headed language model is generated that uses BERT as the encoder and a decoder Transformer for grammar correction (excluding spelling correction) with Replace and Range head options. The Transformer is bidirectional and helps in understanding the context of the sentence; we mask 15% of the tokens. The multi-headed BERT gives a suggested correction, which is stored in the final results along with a probability given by

P(X) = \sum_{i=1}^{n} x_i \cdot \cos(\text{Input}_X, \text{Output}_X) \quad (15)

Here, x_i is the probability of the gestures from the hybrid model and cos is the cosine similarity between the input and BERT's suggested correction; a small sketch of this scoring appears after the list below.

Final steps:
• The ten most probable sentences are stored in the final results, and also in the database for future reference.
• A text-to-speech audio output is produced along with the text.
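A minimal sketch of the similarity filter of Eq. (14) and a literal reading of the score in Eq. (15). The paper does not specify how token arrays are vectorized, so plain count vectors are assumed here, and the threshold value is illustrative.

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bags of tokens, Eq. (14)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def filter_candidates(candidates, input_tokens, threshold=0.5):
    """Reject web-result sentences too dissimilar to the input tokens."""
    return [s for s in candidates
            if cosine(input_tokens, s.split()) >= threshold]

def sentence_score(gesture_probs, input_tokens, corrected_tokens):
    """P(X) = sum_i x_i * cos(input, output), Eq. (15), read literally."""
    sim = cosine(input_tokens, corrected_tokens)
    return sum(p * sim for p in gesture_probs)
```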
V. RESULTS

The results for the hybrid model are shown in Table I. A real-time run of the entire system is shown in Figure 8, and Table II shows the probability of the sentences that are actually expected.

Fig. 8: Example input and output: (a) "I"; (b) "Opposite"; (c) "You"; (d) probable sentences from the tokens I, Opposite, You.

TABLE I: Comparison of models for the image processing part

Model type         | Accuracy | Precision | Recall | F-score
SIFT               | 0.841    | 0.90      | 0.85   | 0.87
CNN on no-filter   | 0.852    | 0.89      | 0.88   | 0.88
CNN-SIFT hybrid    | 0.934    | 0.91      | 0.89   | 0.90
Canny-SIFT hybrid  | 0.911    | 0.84      | 0.87   | 0.85
Hybrid model       | 0.942    | 0.92      | 0.88   | 0.88

TABLE II: Examples showing the probability of the most relevant sentence

Tokens               | Most relevant sentence | Probability
IorMe, Opposite, You | I am opposite to you.  | 1.0
IorMe, Abroad        | I am going abroad.     | 0.90
IorMe, Moon          | I see the moon.        | 0.81
IorMe, House         | I live in house.       | 0.82
You, Man             | You are a man.         | 1.0

VI. CONCLUSION

A system is developed that recognizes a set of Indian Sign Language gestures and converts them into meaningful text and speech using various image processing and machine learning/deep learning techniques. It lays the foundation for a scalable project that can be extended to capture the whole vocabulary of Indian Sign Language through manual and non-manual signs. The hybrid model combines the benefits of all the feature extraction techniques, giving substantial accuracy while decreasing the required computation time. The sentences formed by the language model ensure correctness and precision when compared with the expected results.

VII. FUTURE SCOPE

The system can be extended to include phonological and morphological knowledge, and dependency parsing and text similarity can be used if the input is a paragraph. The application can be extended to take speech as input and generate the corresponding gestures. More instances can be added to the data set so that it covers all spheres of communication and provides more variation. A personal assistant that actively interacts with people with disabilities can be developed by extending this system. From the extracted keywords, sentences can be generated in various languages such as Hindi, Marathi and Gujarati, with proper sentence formation schemes to ensure grammatically correct output.

REFERENCES

[1] Kanchan Dabre, Surekha Dholay, "Machine Learning Model for Sign Language Interpretation using Webcam Images", International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA), IEEE, 2014.
[2] Tanatcha Chaikhumpha, Phattanaphong Chomphuwiset, "Real-time two hand gesture recognition with condensation and hidden Markov models", International Workshop on Advanced Image Technology (IWAIT), 2018.
[3] Yiyuan Li, Antonios Anastasopoulos, Alan W. Black, "Towards Minimal Supervision BERT-based Grammar Error Correction", arXiv:2001.03521 [cs], Jan 2020.
[4] Xiujuan Chai, Hanjie Wang, Fang Yin, Xilin Chen, "Communication tool for the hard of hearings: A large vocabulary sign language recognition system", International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2015.
[5] Sumeet R. Agarwal, Sagarkumar B. Agrawal, Akhtar M. Latif, "Sentence Formation in NLP Engine on the Basis of Indian Sign Language using Hand Gestures", International Journal of Computer Applications (0975-8887), Volume 116, No. 17, April 2015.
[6] Singha J, Das K, "Recognition of Indian sign language in live video", International Journal of Computer Applications 70(19):17-22, May 2013.
[7] Mundher Al-Shabi, Wooi Ping Cheah, Tee Connie, "Facial Expression Recognition Using a Hybrid CNN-SIFT Aggregator", arXiv:1608.02833 [cs], 2016.
[8] Sampada Wazalwar, Urmila Shrawankar, "Interpretation of sign language into English using NLP techniques", Journal of Information and Optimization Sciences 38(6):895-910, August 2017.
[9] Sunita Nayak, Sudeep Sarkar, Barbara Loeding, "Automated Extraction of Signs from Continuous Sign Language Sentences using Iterated Conditional Modes", Conference on Computer Vision and Pattern Recognition, IEEE, 2009.
[10] Aradhana Kar, Pinaki Sankar Chatterjee, "An Approach for Minimizing the Time Taken by Video Processing for Translating Sign Language to Simple Sentence in English", International Conference on Computational Intelligence & Networks, IEEE, 2015.
