A
PROJECT PHASE 1
REPORT ON
MULTILINGUAL CONVERSION OF SIGN LANGUAGE TO TEXT
Submitted by
SHRADDHA S J 1AM22IS102
SINDHU M K 1AM22IS110
VIBHA GUNAGA 1AM22IS122
Y G SNEHA 1AM22IS124
Under the Guidance of
MANJULA DEVI P
Assistant Professor
Dept. of ISE, AMCEC
2024-2025
AMC ENGINEERING COLLEGE
DEPARTMENT OF INFORMATION SCIENCE AND
ENGINEERING
Accredited by NAAC A+ and NBA
18 Km, Bannerghatta Road, Bangalore-560083
CERTIFICATE
Certified that the project work entitled “MULTILINGUAL CONVERSION OF SIGN
LANGUAGE TO TEXT” has been successfully completed by SHRADDHA S J
(1AM22IS102), SINDHU M K (1AM22IS110), VIBHA GUNAGA (1AM22IS122), and
Y G SNEHA (1AM22IS124), all bonafide students of AMC Engineering College,
Bengaluru, in partial fulfilment of the requirements for the award of the degree of Bachelor
of Engineering in Information Science and Engineering of Visvesvaraya Technological
University, Belgaum, during the academic year 2024-2025. The Project Phase 1 report has
been approved as it satisfies the academic requirements in respect of project work for the
said degree.
We also declare that, to the best of our knowledge and belief, the work reported here does
not form part of any other dissertation on the basis of which a degree or award was
conferred on an earlier occasion by any other student.
Place:
Date:
SHRADDHA S J 1AM22IS102
SINDHU M K 1AM22IS110
VIBHA GUNAGA 1AM22IS122
Y G SNEHA 1AM22IS124
ACKNOWLEDGEMENT
At the very onset, we would like to place our gratitude on all those people who helped
us in making this project work a successful one.
Completing this project was not easy. Apart from our own effort, the guidance of our very
experienced teachers played a paramount role, as it is they who steered us in the right
direction.
First of all, we would like to thank the Management of AMC Engineering College for
providing such a healthy environment for the successful completion of the project work.
In this regard, we express our sincere gratitude to the Chairman Dr. K Paramahamsa
and the Principal Dr. Yuvaraju.B.N, for providing us all the facilities in this college.
We are extremely grateful to our Professor and Head of the Department of Information
Science and Engineering, Dr. R. Amutha, for having guided us in the right direction with
all her wisdom.
SHRADDHA S J
[1AM22IS102]
SINDHU M K
[1AM22IS110]
VIBHA GUNAGA
[1AM22IS122]
Y G SNEHA
[1AM22IS124]
ABSTRACT
The Multilingual Conversion of Sign Language to Text project aims to bridge the communication
gap between individuals who use sign language and those who use spoken or written languages.
By leveraging advanced machine learning techniques and computer vision, this project seeks to
develop a real-time system capable of recognizing and translating sign language gestures into text
across multiple languages. This system utilizes computer vision and deep learning techniques to
recognize and interpret sign language gestures from video input, converting them into
grammatically accurate textual output. It then employs natural language processing and machine
translation models to render the recognized text into multiple spoken/written languages, enhancing
accessibility and inclusivity. By supporting various regional sign languages and spoken language
translations, this system offers a scalable, real-time communication tool that can be deployed
across platforms such as mobile devices, desktops, and smart cameras.
This system gathers a diverse dataset of sign language gestures from various sign languages such
as Indian Sign Language (ISL) and Kannada Sign Language (KSL). The project will employ techniques
like image processing and neural networks to accurately identify and interpret sign language
gestures from video input. By combining cutting-edge technologies in computer vision and natural
language processing, the project seeks to make communication more accessible, efficient, and
inclusive for people worldwide.
Keywords: Indian Sign Language, Kannada Sign Language, Image Processing, Neural
networks, Deep Learning, Computer Vision.
CONTENTS
TITLE PAGE NO.
ACKNOWLEDGEMENT i
ABSTRACT ii
CONTENTS iii
CHAPTERS
1: INTRODUCTION 1
2: LITERATURE REVIEW 8
4: SYSTEM ANALYSIS 12
5: SYSTEM DESIGN 17
5.2 Modules 20
5.2.7 Language Translation Module 27
REFERENCES 31
Chapter 1
INTRODUCTION
Communication has been defined as the act of conveying intended meanings from one entity or
group to another through the use of mutually understood signs and semiotic rules. It plays a
vital role in the existence and continuity of human society. For an individual to progress in life
and coexist with other individuals, there is a need for effective communication. Effective
communication is an essential skill that enables us to understand and connect with the people
around us. It allows us to build respect and trust, resolve differences, and maintain a sustainable
environment in which problem solving, caring and creative ideas can thrive. Poor communication
is one of the largest contributors to conflict in relationships.
For deaf and mute people, many of whom cannot rely on written language, sign language is
often the only form of communication. However, it is still difficult for them to interact with
hearing people without the aid of a human interpreter, because most members of the general
public are not eager to learn sign language. As a result, the deaf and hard of hearing become
isolated. Nevertheless, advances in technology make it possible to overcome this obstacle and
close the communication gap. Around 300 different sign languages are in use across the globe,
because sign languages were created naturally by individuals from various communities. Even
within India there is no single common sign language.
According to World Health Organization (WHO) statistics, over 430 million people live with
disabling hearing loss (WHO, 2023), which is about 5% of the world population, and it is
estimated that by 2050 over 700 million people, or 1 in every 10, will have disabling hearing
loss. According to Sign Solutions, there are more than 300 sign languages used around the
world.
Different regions of India have their own dialects and lexical variations of Indian Sign
Language (ISL); however, new initiatives have been taken to standardize ISL. It is possible to
train a machine to recognize gestures and translate them into text and voice. To facilitate
communication between deaf-mute and hearing persons, the algorithm must classify hand
gestures effectively and accurately; the name of the identified sign is then displayed and can be
spoken aloud. The development of a sign language to text converter using artificial intelligence
(AI) and machine learning (ML) has the potential to revolutionize communication for people
who are deaf or hard of hearing. By using complex algorithms and deep learning models, AI
and ML can accurately recognize and interpret sign language gestures in real time. This
technology can also improve over time as it is trained on more data and can adapt to variations
in signing styles and dialects. The result is a powerful tool that can convert sign language into
written or spoken language, making it easier for people who do not know sign language to
understand and respond.
We present a novel method using convolutional neural networks (CNN) for hand sign language
recognition and conversion into text to address this problem. The purpose of this project is to
determine whether it is feasible and accurate to translate hand sign gestures into text. The need
for effective and inclusive communication, where deep learning technology is essential, is what
spurs this research.
In our project, we focus on producing a model that can recognize fingerspelling-based hand
gestures and combine individual gestures to form complete words. The current study also
reviews the development of real-time sign language recognition and translation systems and
the various approaches and technologies used. The following sections detail the unique
contributions of these works and highlight the major ideas and improvements they present.
This survey therefore synthesizes a large body of research and sheds light on the transformative
potential and relevance of real-time sign language recognition and translation systems in
encouraging inclusive communication for the deaf and persons with hearing impairments.
Communication is a fundamental human right, yet millions of people with hearing and speech
impairments face daily challenges in expressing themselves to those who do not understand
sign language. This communication gap not only limits their ability to interact socially but
also restricts access to essential services such as education, healthcare, and employment.
Traditional means of bridging this gap—such as human interpreters—are often expensive,
unavailable, or impractical in real-time, everyday situations. As a result, there is an urgent
need for a cost-effective and scalable solution that can enable seamless communication
between sign language users and the general public. While several technologies have
attempted to address sign language translation, most existing systems are limited in
functionality. Many focus on converting gestures to text in a single language, often English,
and fail to accommodate multilingual needs or regional sign language variations.
Furthermore, these systems often lack real-time processing capabilities and struggle with
accurate gesture recognition due to environmental factors such as background noise, lighting,
or partial hand visibility. This reduces their effectiveness and limits their use in practical settings.
Another critical challenge lies in the complexity of sign languages themselves. Unlike spoken
languages, sign languages rely heavily on hand shapes, movement, orientation, and facial
expressions, which must all be interpreted together to derive accurate meaning. Additionally,
each country or region may have its own version of sign language—such as American Sign
Language (ASL), British Sign Language (BSL), or Indian Sign Language (ISL)—making it
essential for a robust system to support multilingual output and dynamic translation.
Addressing these intricacies is key to developing a truly inclusive solution.
The proposed system aims to bridge this communication gap by developing a Multilingual
Conversion of Sign Language to Text application. Using computer vision and machine
learning techniques, the system will detect and recognize hand gestures from video input,
extract relevant features, classify the gestures, and convert them into text in the user’s
preferred language. The system will be structured using a Finite State Machine (FSM) to
ensure reliable and logical state transitions during the recognition and translation process.
This approach guarantees modularity, error handling, and consistent user interaction.
Ultimately, this project seeks to empower the deaf and hard-of-hearing community by
providing them with a tool that translates their sign language gestures into readable,
multilingual text in real-time. Such a system would not only enhance inclusivity and
independence but also improve interactions in educational, professional, and public
environments. By addressing limitations in current technologies and incorporating
multilingual capabilities, this solution strives to make communication more accessible and
equitable for all.
The main objectives of this project are:
• To develop a system that captures and interprets sign language gestures using
computer vision and machine learning techniques.
• To extract meaningful features from the gestures for reliable classification and
recognition.
To maximize impact, the system will be designed to run on commonly used platforms such as
web browsers and Android mobile devices. The scope includes developing a user-friendly
interface that enables deaf users, educators, and the general public to interact with the system
without requiring technical expertise. Accessibility features like voice output and adjustable
text size may also be included in later versions. The project is scoped to serve a wide range of
use cases, including communication in classrooms, workplaces, hospitals, and public service
environments. It can also act as a learning tool for those interested in sign language, promoting
inclusivity and awareness among the general population. By making sign language more
accessible and translatable, the project has the potential to empower deaf individuals and
integrate them more fully into society. Although the initial scope is limited to ISL and a few
output languages, the system is designed with scalability in mind. Future phases may involve
adding support for other sign languages like ASL or BSL, integrating speech synthesis for sign-
to-speech conversion, and recognizing facial expressions and body posture to interpret emotion
and context. This would make the system more comprehensive and closer to natural human
communication. The practical applications of this project are vast, with significant potential to
improve communication between deaf and hearing individuals. In education, healthcare,
customer service, and social settings, this technology can help bridge the communication gap.
Furthermore, real-time sign language translation could assist interpreters and provide a solution
in situations where interpreters are not available. The societal impact of this system is immense,
as it could improve the quality of life for deaf individuals, promote inclusivity, and reduce
communication barriers in diverse environments. This project will focus on both technical
innovation and real-world applicability, with the aim of developing a robust and scalable
solution for multilingual sign language recognition and translation.
Chapter 2
LITERATURE REVIEW
Ramesh M. Kagalkar and Nagaraj H. N., “New Methodology for Translation of Static Sign
Symbol to Words in Kannada Language,” proposed a methodology for translating static sign
symbols into words in the Kannada language. The goal of this paper is to develop a system for
automatic translation of static gestures of alphabets in Kannada sign language. It maps letters,
words and expressions of a particular language to a collection of hand gestures, enabling
individuals to exchange information using hand gestures instead of speech. A system capable
of recognizing signing symbols may be used as a means of communication with hard-of-hearing
people. However, translating dynamic gestures into text output would require further
development. For example, systems based on Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs) have been widely used for gesture recognition, achieving
impressive accuracy in detecting hand gestures. Datasets such as the American Sign Language
(ASL) dataset, RWTH-PHOENIX for German Sign Language, and others have been
instrumental in training these models. However, challenges remain in ensuring the accuracy
and robustness of such systems, especially when dealing with diverse sign language variations
and different signing styles across regions [1].
Vishnu Sai Y and Rathna G N, “Indian Sign Language Gesture Recognition using Image
Processing and Deep Learning,” proposed a real-time hand gesture recognition system based
on data captured by the Microsoft Kinect RGBD camera. Given that there is no one-to-one
mapping between the pixels of the depth camera and the RGB camera, computer vision
techniques such as 3D construction and transformation were used. After achieving one-to-one
mapping, the hand gestures were segmented from the background. Convolutional Neural
Networks (CNNs) were utilised for training 36 static gestures representing Indian Sign
Language (ISL) alphabets and numbers. The model achieved an accuracy of 98.81% when
trained on 45,000 RGB images and 45,000 depth images. Further, Convolutional LSTMs were
used for training 10 ISL dynamic word gestures, and an accuracy of 99.08% was obtained by
training on 1,080 videos. The model showed accurate real-time performance in predicting ISL
static gestures, leaving scope for further research on sentence formation through gestures. The
model also showed competitive adaptability to American Sign Language (ASL) gestures: when
the ISL models' weights were transfer-learned to ASL, an accuracy of 97.71% was obtained [2].
Prof. Radha S. Shirbhate and Mr. Vedant D. Shinde, “Sign Language Recognition Using
Machine Learning Algorithm,” proposed a sign language recognition system based on machine
learning algorithms. Hand gestures differ from one person to another in shape and orientation;
therefore, a problem of linearity arises. Recent systems have come up with various methods
and algorithms to address this problem. Algorithms such as K-Nearest Neighbours (KNN) and
multi-class Support Vector Machines (SVM), as well as experiments using hand gloves, were
previously used to decode hand gesture movements. In this paper, a comparison between the
KNN, SVM, and CNN algorithms is carried out to determine which algorithm offers the best
accuracy. Approximately 29,000 images were split into test and train data and pre-processed to
fit the KNN, SVM, and CNN models so as to obtain maximum accuracy [3].
Kajal Dakhare and Vidhi Wankhede, “A Survey on Recognition and Translation System of
Real-Time Sign Language,” proposed a method for recognising sign language with depth
cameras, processing the depth camera data using computer vision methods. Depth cameras,
such as the Microsoft Kinect or Intel RealSense, capture three-dimensional data, which
such as the Microsoft Kinect or Intel RealSense, capture three-dimensional data, which
provides more detailed information about the position and movement of the hands, arms, and
body compared to traditional RGB cameras. By processing depth data, their method improves
the accuracy and robustness of sign language recognition systems, especially in scenarios
where variations in lighting, background noise, or occlusions can affect the recognition
performance of conventional visual-based systems. The key advantage of using depth cameras
is the ability to detect the spatial relationships between the signer’s body parts in three
dimensions, which helps to distinguish subtle variations in gestures that may be challenging to
recognize using only RGB images [4].
N. Tanibata and N. Shimada, “Extraction of Hand Features for Recognition of Sign Language
Words,” proposed a system that describes a novel method of detecting fingertips and palm
centres in dynamic hand gestures generated by either one or both hands, without using any kind
of sensor or marker. The authors call it natural computing, as no sensor, marker or colour is
placed on the hands to segment skin in the images, so the user can perform operations with a
natural hand. This is done by segmenting and tracking the face and hands using skin colour,
while the elbows are tracked by matching a template of the elbow shape. Hand features such as
the area of the hand and the direction of hand motion are then extracted and input to a Hidden
Markov Model (HMM) [5].
Aditya Das and Shantanu, “Sign Language Recognition Using Deep Learning On Static
Gesture Images,” proposed a system for sign language recognition using deep learning on static
gesture images. The image dataset used consists of static sign gestures captured with an RGB
camera. Preprocessing was performed on the pictures, which then served as cleaned input. The
paper presents results obtained by retraining and testing this sign gesture dataset on a
convolutional neural network model based on Inception v3. The model consists of multiple
convolution filters that are applied to the same input. The paper also reviews the various
attempts that have been made at sign language detection using machine learning and depth data
of images, takes stock of the challenges posed in tackling such a problem, and describes the
future scope as well [6].
Chapter 3
➢ Python: The primary language for implementing machine learning, computer vision,
and NLP models. Python is highly suitable for this project due to its extensive libraries
and frameworks for data processing and model training.
➢ OpenCV: An open-source computer vision library for real-time image processing and
computer vision tasks. It will be used for hand gesture recognition, image processing,
and video frame manipulation.
➢ Media Pipe: A framework from Google that facilitates real-time hand tracking and pose
estimation. It's useful for detecting hand gestures and body movements from video
streams.
➢ OpenCV (DNN Module): Used for integrating pre-trained models or deep learning
algorithms in the computer vision pipeline.
➢ NumPy and Pandas: Essential for handling large datasets, data manipulation, and
numerical computing tasks.
➢ Matplotlib or Seaborn: For visualizing data and performance metrics during training
and evaluation.
➢ Google Translate API: For translating the output text (from the sign language
recognition) into various languages. It supports a wide range of languages and can be
easily integrated with the system.
➢ IDE/Code Editor: Tools like PyCharm, VS Code, or Jupyter Notebook for code
development and testing.
➢ Intel RealSense cameras: These are popular for depth sensing and 3D imaging. They
capture depth data, which allows for more accurate hand gesture recognition by
providing additional spatial context. These cameras also work well in various lighting
conditions and can detect hand and body movements with greater precision, making
them ideal for recognizing complex gestures.
Chapter 4
SYSTEM ANALYSIS
4.1 Existing System
The existing method of using single-handed signs for sign language recognition, particularly in
American Sign Language (ASL), poses certain challenges, both from a linguistic and physical
standpoint. ASL, as you noted, is used primarily in the United States and Canada, where many
sign language recognition systems have been developed to work with single- handed gestures.
However, this approach is not universally applicable to all sign languages, particularly those
used in other regions, such as the Indian Sign Language (ISL) or regional variants. In fact, the
Indian Deaf community typically uses two-handed signs for many of their gestures, which
differs significantly from the one-handed signs common in ASL.
One of the major issues with systems that rely heavily on single-handed gestures for sign
language recognition is the physical strain it places on users. Prolonged use of a single hand for
gesturing can result in hand sprains, muscle fatigue, and potentially lead to long-term injuries,
especially when used continuously for communication. Studies have highlighted the risks of
repetitive strain injuries (RSI) that arise from using one hand in unnatural or sustained positions.
Furthermore, the reliance on single-handed gestures limits the inclusivity of sign language
recognition systems. Sign languages across the world, including Indian Sign Language (ISL),
use a combination of one- and two-handed gestures, with some gestures requiring both hands
to convey meaning. For example, certain signs in ISL, such as alphabet gestures or complex
expressions, cannot be accurately conveyed using a single hand alone. Therefore, ASL-based
sign language recognition methods may not work well for other languages, especially in regions
where the Deaf community uses two-handed gestures as the norm.
From a health perspective, multi-hand gesture recognition systems can help reduce the risk of
repetitive strain injuries by promoting the use of both hands instead of relying on a single hand
for communication. The two-handed approach may allow for more natural and less strenuous
hand positions, reducing the likelihood of injuries associated with prolonged single-handed
gesturing.
The existing method of focusing on single-handed sign language recognition is not without its
limitations, both in terms of its applicability to a wider range of sign languages and its impact
on users' physical health. For a more inclusive, accessible, and ergonomically friendly sign
language recognition system, it is essential to incorporate multi-handed gesture recognition.
This approach would not only accommodate sign languages like ISL but also promote healthier
communication practices by reducing strain on a single hand, ultimately creating a more
effective and user-friendly system for the global Deaf community.
4.2 Proposed System
Once the gesture is captured, the system performs preprocessing and feature extraction, where
the key characteristics of the hand—such as shape, orientation, finger positions, and
movement—are analyzed. These features are used as input for machine learning or deep
learning models like Convolutional Neural Networks (CNNs) or Recurrent Neural Networks
(RNNs), which are trained on large datasets containing labelled gestures from various sign
languages, including Indian Sign Language (ISL) and American Sign Language (ASL). The
classifier determines the most likely match for the gesture and outputs the corresponding word
or phrase. The recognized gesture is then mapped to a textual representation of the sign in a
base language (typically English). To achieve multilingual output, the system uses language
translation services such as the Google Translate API or pre-trained multilingual models to
convert the English text into the user's preferred language, such as Hindi, Kannada, or Tamil.
This adds flexibility to the system, making it suitable for users across different linguistic
regions. The translated text is displayed in real time on the screen, and optionally, it can also
be converted to speech using text-to-speech technology to aid communication with non-
signers.
Overall, this system provides a comprehensive solution to bridge the communication gap
between the Deaf or hard-of-hearing community and the hearing population. By supporting
multiple sign languages and translating the recognized gestures into different spoken
languages, the system becomes more inclusive and culturally adaptive. Additionally, it
addresses ergonomic concerns by supporting two-handed gestures and minimizing repetitive
single-hand usage, making it both health-conscious and user-friendly. The modular design of
the system also allows future enhancements such as emotion detection, facial expression
interpretation, and bidirectional communication through text-to-sign generation.
To ensure smooth and logical progression through each stage of the sign language recognition
process, the system incorporates a Finite State Machine (FSM) architecture. FSM divides
the workflow into distinct states—such as Idle, Gesture Detection, Feature Extraction,
Recognition, Translation, and Output Display. Each state has defined transitions based on
specific triggers, such as successful detection or classification failure. This modular flow
prevents system confusion, enables better error handling, and ensures that each process is
executed only when its preconditions are met. For instance, the system will not proceed to the
Translation state until a valid gesture is successfully recognized. This structured design
significantly improves reliability, especially in real-time scenarios.
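As an illustration, the FSM described above could be sketched in Python as follows. The state
and event names are taken from this section, but the transition table itself is indicative and not
the project's final design:

    from enum import Enum, auto

    class State(Enum):
        IDLE = auto()
        GESTURE_DETECTION = auto()
        FEATURE_EXTRACTION = auto()
        RECOGNITION = auto()
        TRANSLATION = auto()
        OUTPUT_DISPLAY = auto()

    # Allowed transitions: a state only changes when one of its listed events occurs,
    # so Translation cannot be reached until Recognition has succeeded.
    TRANSITIONS = {
        State.IDLE:               {"start": State.GESTURE_DETECTION},
        State.GESTURE_DETECTION:  {"hand_detected": State.FEATURE_EXTRACTION,
                                   "no_hand": State.IDLE},
        State.FEATURE_EXTRACTION: {"features_ready": State.RECOGNITION},
        State.RECOGNITION:        {"recognized": State.TRANSLATION,
                                   "classification_failed": State.GESTURE_DETECTION},
        State.TRANSLATION:        {"translated": State.OUTPUT_DISPLAY},
        State.OUTPUT_DISPLAY:     {"done": State.IDLE},
    }

    def next_state(current, event):
        # Stay in the current state if the event is not valid for it (simple error handling).
        return TRANSITIONS.get(current, {}).get(event, current)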
The user interface (UI) is designed to be intuitive and accessible for users with varying
technical skills. It features real-time video feedback, language selection dropdowns, gesture
capture indicators, and a text display panel for the final translated output. An optional voice
output toggle allows users to enable or disable text-to-speech functionality depending on the
context. The UI ensures responsive interactions and clear feedback at every stage, which is
essential for users who depend on visual cues. Additionally, the system can log gesture inputs
and translation results for performance monitoring or future training purposes, providing a
data-driven approach to improvement.
From a technical standpoint, the system is built with scalability and extensibility in mind.
The modular design allows new gestures, sign languages, and spoken languages to be added
with minimal changes to the core architecture. Gesture models can be updated or retrained as
new datasets become available, improving accuracy over time.
Lastly, the system also considers future enhancements to enrich user interaction and expand
its functional scope. Planned improvements include incorporating facial expression and
emotion detection to support signs that depend on non-manual cues—an important aspect of
natural sign language communication. Another future goal is to enable bidirectional
communication, where not only can sign language be translated into text, but typed or
spoken language can also be converted into animated sign representations using 3D avatars or
gesture synthesis models. These additions would make the system even more interactive,
inclusive, and supportive of broader communication needs in education, healthcare, and
public services.
Chapter 5
SYSTEM DESIGN
5.1 Block Diagram
The diagram represents a systematic flow of a sign language to text and voice conversion
system, typically focused on American Sign Language (ASL), but adaptable to other sign
languages as well. The process begins with image acquisition, where a camera captures live
video footage of a signer performing gestures. This input is critical as the foundation of the
system's operation. The captured frames are then forwarded to the hand detection and tracking
module, where advanced computer vision techniques identify and isolate the hand region in
each frame. This step ensures that only the relevant gesture parts are processed, discarding
unnecessary background elements and improving overall recognition accuracy.
After detecting and tracking the hand, the system proceeds to preprocessing, where the video
data is cleaned and standardized for analysis. This involves operations such as resizing images,
converting them to grayscale, removing noise, and segmenting the hand region. Preprocessing
ensures consistency in the input data, which is essential for accurate recognition. Once the data
is pre-processed, the system moves to feature extraction. In this stage, critical details about the
hand's shape, movement, orientation, and finger positions are extracted. These features
uniquely represent each gesture and act as the basis for classification.
The extracted features are then sent to a training module, where machine learning algorithms
learn to associate these features with specific meanings. A training database is used to store
labelled gesture data that the system uses to learn from. This database contains various
examples of sign gestures and their corresponding text representations. Once trained, the
system moves into the recognition phase, where it matches incoming gesture data against the
trained models to identify the most appropriate label, such as a letter, word, or phrase.
Finally, the recognized gesture is translated into text output, which is displayed to the user. In
addition, the system may also include a text-to-speech (TTS) component to convert the text
into spoken words, making it easier for hearing individuals to understand the message. This
two-fold output—visual and auditory—enhances accessibility and supports smoother
communication between Deaf users and the general public. This end-to-end system effectively
bridges communication gaps and can be enhanced further to support multiple languages and
sign systems for broader societal impact.
Expanding on the system’s capabilities, the modular nature of this design allows for future
enhancements such as multilingual support. Currently, many existing sign language recognition
systems output only in English. However, by integrating language translation services (such as
Google Translate API), this system can be adapted to convert recognized signs into regional or
native languages like Hindi, Kannada, or Tamil. This multilingual feature broadens the
system’s applicability, particularly in a diverse country like India, where users may prefer
outputs in their mother tongue for better understanding and usability.
Additionally, the inclusion of both static and dynamic sign recognition capabilities makes this
system more robust. While static signs (such as letters or numbers) can be recognized from
single frames, dynamic signs (like "thank you" or "good morning") require temporal tracking
across multiple frames. This is where advanced models like Recurrent Neural Networks
(RNNs) or Long Short-Term Memory (LSTM) networks come into play, as they are designed
to learn from sequences. With such enhancements, the system can handle complete sentence
recognition in the future, making it a viable tool for real-time conversations between Deaf and
hearing individuals.
In terms of real-world implementation, this system could be deployed in public service centers,
hospitals, schools, or customer service points to assist communication with Deaf individuals.
It can also be integrated into mobile applications or kiosks, making it portable and easily
accessible. By reducing reliance on human interpreters and enabling independent
communication, such a system supports inclusivity and empowers the Deaf community to
interact confidently in society.
5.2 Modules
The project "Multilingual Conversion of Sign Language to Text" aims to develop a complete
pipeline that captures sign language gestures through a camera, processes the visual data,
recognizes the sign, and converts it into corresponding multilingual text and voice output. To
achieve this, the system is divided into well-structured functional modules, each responsible for
a specific task—from data collection to translation and final output generation. These modules
work together in a sequential manner to ensure accurate gesture recognition and user-friendly
communication.
5.2.1 Data Collection Module
The data collection module is the foundational stage of the system. Its primary purpose is to
gather a comprehensive dataset of sign language gestures to train the machine learning models
effectively. This module can involve collecting images or videos of hand gestures using a live
webcam or importing data from publicly available sign language datasets such as the American
Sign Language (ASL) Alphabet dataset or Indian Sign Language (ISL) datasets. The quality and
diversity of the collected data directly affect the accuracy of gesture recognition.
In a practical implementation, this module may include a user-friendly interface for recording
gestures. Each gesture must be labeled accurately to associate it with the correct word, letter, or
phrase in the target language. Organizing the data in structured folders (e.g., one folder per
label) or maintaining a metadata file like a CSV or JSON file with annotations helps in the later
stages of training. Furthermore, collecting both static (single- frame) and dynamic (multi-frame
sequence) gesture data is important if the system aims to handle alphabet-level and sentence-
level translation. This module should also ensure ethical data usage and user consent if real user
recordings are involved.
Sources:
• Live gesture recordings via webcam.
• ASL/ISL datasets from open repositories
Tasks:
• Capture images or videos of individual signs.
• Label each sample accurately.
• Store them in a structured format (folders or CSVs).
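A minimal sketch of such a capture script is shown below, assuming OpenCV and a standard
webcam; the label name, sample count, and folder layout are placeholder assumptions:

    import cv2, os

    LABEL = "A"                                   # placeholder gesture label
    out_dir = os.path.join("dataset", LABEL)      # one folder per label
    os.makedirs(out_dir, exist_ok=True)

    cap = cv2.VideoCapture(0)
    count = 0
    while count < 200:                            # target number of samples per label
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow("Press 's' to save a sample, 'q' to quit", frame)
        key = cv2.waitKey(1) & 0xFF
        if key == ord("s"):
            cv2.imwrite(os.path.join(out_dir, f"{LABEL}_{count:04d}.jpg"), frame)
            count += 1
        elif key == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()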
5.2.2 Preprocessing Module
Once data is collected, the preprocessing module prepares it for consistent and efficient model
training and recognition. Raw images and videos often contain noise, unnecessary background
objects, and inconsistencies in resolution or lighting, which can affect performance. This
module standardizes the input format, improves clarity, and highlights the region of interest—
the hand. Preprocessing begins with resizing all images to a fixed dimension (e.g., 224x224
pixels) and converting them to grayscale or HSV color space to simplify processing.
Segmentation techniques are applied to isolate the hand from the background. This can be
achieved using skin-color-based detection, background subtraction, or advanced methods like
Media Pipe segmentation. Noise reduction filters (e.g., Gaussian blur) are used to smooth the
images and remove irrelevant details. In dynamic gesture processing, this module may also
perform frame sampling to reduce redundancy. Normalization of pixel values (e.g., scaling to
0–1 range) ensures faster convergence during training. Overall, this step ensures uniformity in
the dataset, improving the learning and generalization capabilities of the model.
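The steps above could be combined into a single helper function, sketched here with OpenCV
and NumPy; the 224x224 size and 0-1 scaling follow the examples in this section, while the
5x5 blur kernel is an assumed value:

    import cv2
    import numpy as np

    def preprocess(frame, size=(224, 224)):
        # Standardize one frame: fixed size, grayscale, denoised, pixel values in 0-1.
        img = cv2.resize(frame, size)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        img = cv2.GaussianBlur(img, (5, 5), 0)
        return img.astype(np.float32) / 255.0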
Background subtraction is crucial for isolating the signer from a possibly cluttered
background. The MOG2 (Mixture of Gaussians) algorithm, available in OpenCV, models the
background using a series of Gaussian distributions for each pixel. It dynamically adapts to
lighting changes and detects moving objects (like hands and arms). This helps in focusing
only on relevant gestures by removing static backgrounds.
Algorithm: cv2.createBackgroundSubtractorMOG2()
To reduce noise in video frames and improve the accuracy of gesture detection, Gaussian Blur
is applied. It smoothens the image by averaging pixel values based on a Gaussian kernel. This
makes the image less sensitive to random noise or sensor inconsistencies, which is especially
helpful in low-light or variable-light conditions.
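A brief sketch combining the two steps above, assuming a live webcam source and default
MOG2 parameters:

    import cv2

    cap = cv2.VideoCapture(0)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                    detectShadows=False)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        blurred = cv2.GaussianBlur(frame, (5, 5), 0)   # smooth before subtraction
        fg_mask = subtractor.apply(blurred)            # white pixels = moving hands/arms
        foreground = cv2.bitwise_and(frame, frame, mask=fg_mask)
        cv2.imshow("foreground", foreground)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()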
Google’s Media Pipe Hands is a powerful framework for real-time hand and finger tracking. It
detects 21 key landmarks on each hand and provides their coordinates in every frame. This
allows precise region-of-interest (ROI) extraction, which is critical for sign language where
finger positioning matters. It significantly simplifies gesture recognition in the later stages.
5.2.3 Hand Detection and Tracking Module
The hand detection and tracking module plays a critical role in recognizing where the hand is
in real-time input frames. It ensures the system can isolate the hand regardless of the user's
position or environment. This module uses real-time object detection techniques to identify
hands within a frame and maintain focus on them as they move. For high accuracy and speed,
pre-trained models such as Media Pipe Hands (by Google) are commonly used. These models
not only detect hands but also return 21 key hand landmarks, including fingertips and joints,
which are essential for feature extraction.
Tracking ensures continuous gesture interpretation for dynamic signs and avoids jitter or signal
loss due to motion blur. If dual-hand recognition is required, this module can be extended to
identify and distinguish between left and right hands. Techniques such as bounding boxes,
landmark matching, or optical flow can be applied. Efficient hand detection and tracking
contribute significantly to the robustness of the entire system, ensuring that subsequent modules
work with clear, focused hand regions for analysis.
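A short sketch of this module using the MediaPipe Hands solution together with OpenCV; the
confidence thresholds are indicative values, not tuned settings:

    import cv2
    import mediapipe as mp

    mp_hands = mp.solutions.hands
    mp_draw = mp.solutions.drawing_utils

    cap = cv2.VideoCapture(0)
    with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5,
                        min_tracking_confidence=0.5) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)      # MediaPipe expects RGB input
            results = hands.process(rgb)
            if results.multi_hand_landmarks:
                for hand in results.multi_hand_landmarks:
                    # 21 landmarks per detected hand, drawn as visual feedback
                    mp_draw.draw_landmarks(frame, hand, mp_hands.HAND_CONNECTIONS)
            cv2.imshow("hand tracking", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    cap.release()
    cv2.destroyAllWindows()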
5.2.4 Feature Extraction Module
In this module, meaningful numerical representations are extracted from the detected hand
regions to identify and classify gestures. Feature extraction converts visual information into a
machine-readable format, which the classification algorithm can interpret. The features may
include geometric attributes such as distances between fingers, angles between joints, hand
orientation, and relative positions of fingertips.
The extracted features are structured into vectors, which serve as input for the classifier. For
static gestures, a single feature vector per frame may be sufficient, while dynamic gestures may
require time-series data (a sequence of vectors). In such cases, temporal features like motion
flow, trajectory, and hand velocity can also be computed. The efficiency of this module depends
on its ability to produce distinct and consistent features for each gesture class, which directly
impacts recognition accuracy.
One of the most effective ways to extract features from sign language videos is using Media
Pipe’s pose and hand landmark detection. Media Pipe provides 2D or 3D coordinates of
specific body parts (like elbows, wrists, shoulders) and fingers (21 landmarks per hand).
These coordinates serve as numerical feature vectors that represent hand shapes, positions,
and movements. These features are crucial for both static and dynamic sign recognition.
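One simple way to turn these landmarks into a feature vector, assuming MediaPipe's
21-landmark output for a single hand, is sketched below; the wrist-relative normalisation is an
assumed design choice rather than something prescribed by the framework:

    import numpy as np

    def landmarks_to_features(hand_landmarks):
        # Flatten 21 (x, y, z) landmarks into a 63-value vector that is
        # roughly invariant to where the hand appears in the frame.
        pts = np.array([[lm.x, lm.y, lm.z] for lm in hand_landmarks.landmark])  # (21, 3)
        pts = pts - pts[0]                        # landmark 0 (wrist) becomes the origin
        scale = np.linalg.norm(pts[9]) or 1.0     # landmark 9 (middle-finger base) sets scale
        return (pts / scale).flatten()            # shape (63,)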
Optical Flow estimates the direction and speed of motion between consecutive frames. In sign
language, hand and arm movements are dynamic, so motion-based features like Optical Flow
help track how pixels (or hand regions) move across frames. It’s especially useful for
detecting dynamic gestures or transitions in continuous signing.
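A minimal sketch of dense optical flow between consecutive webcam frames, using OpenCV's
Farneback implementation; the parameter values are commonly used defaults, taken here as
assumptions:

    import cv2

    cap = cv2.VideoCapture(0)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    for _ in range(100):                           # sample motion over a short window
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Per-pixel (dx, dy) motion between the previous and current frame
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        print("mean motion magnitude:", float(magnitude.mean()))   # crude temporal feature
        prev_gray = gray
    cap.release()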
5.2.5 Training Module
This module is responsible for training the gesture recognition model using the extracted
features. It is a critical component that determines the intelligence of the system. The model
learns from the labeled dataset to identify patterns and associate them with corresponding
gesture labels. Various machine learning and deep learning algorithms can be used here. For
static signs, Convolutional Neural Networks (CNNs) are commonly used due to their ability to
process spatial data. For dynamic gestures, Recurrent Neural Networks (RNNs) or Long Short-
Term Memory(LSTM) networks are more effective as theycan learn fromsequences over time.
During training, the dataset is typically divided into training, validation, and testing sets. The
model’s performance is evaluated using metrics like accuracy, precision, recall, and F1-score.
Techniques such as data augmentation, dropout, and early stopping may be used to prevent
overfitting and improve generalization. Once trained, the model is saved and used by the
recognition module in real-time prediction. This module may also allow re-training or fine-
tuning if the system encounters new gestures or languages.
CNNs are widely used for static sign recognition, such as recognizing individual letters or
numbers in sign language. They excel at learning spatial features from images (e.g., hand
shapes, orientations). A CNN trained on preprocessed frames or hand images can accurately
classify signs.
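A compact Keras sketch of such a CNN for static gestures; the layer sizes and the 26-class
output are illustrative assumptions, and the input shape matches the 224x224 grayscale frames
described earlier:

    from tensorflow.keras import layers, models

    NUM_CLASSES = 26                               # e.g. one class per alphabet letter

    model = models.Sequential([
        layers.Input(shape=(224, 224, 1)),         # preprocessed grayscale frame
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                       # regularisation against overfitting
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()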
LSTMs are a type of recurrent neural network (RNN) ideal for sequence modeling, such as
dynamic signs involving hand movement over time. LSTMs process sequences of features
(e.g., key points across frames) and capture long-term dependencies, like motion patterns or
sign transitions.
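A corresponding Keras sketch for dynamic gestures, assuming fixed-length sequences of the
63-value landmark vectors described above; the sequence length, class count, and layer sizes
are assumptions:

    from tensorflow.keras import layers, models

    SEQ_LEN, NUM_FEATURES, NUM_CLASSES = 30, 63, 10   # assumed values

    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN, NUM_FEATURES)),  # one landmark vector per frame
        layers.LSTM(64),                              # captures motion over the sequence
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])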
Unlike regular CNNs that operate on spatial data, 3D-CNNs extract features from both space
and time by using 3D kernels. They process short clips of video (stacked frames) and are
powerful for spatio-temporal gesture recognition.
HMMs were historically used in sign and speech recognition before deep learning. They
model the probabilistic transitions between gesture states. While not as powerful as modern
neural networks, they are lightweight and interpretable.
5.2.6 Recognition Module
The recognition module uses the trained model to identify hand gestures in real-time. When a
user performs a gesture in front of the camera, the system captures the input, applies
preprocessing, extracts features, and forwards them to the trained model for classification. The
model outputs the predicted label, which corresponds to a word or alphabet. This module must
process input quickly and accurately to provide real-time feedback to users.
In the case of dynamic gesture recognition, the module handles temporal inputs by processing
gesture sequences. The system may also include a confidence score for each prediction to
inform the user about recognition certainty. The predicted label is then passed on to the next
stage for translation and output. The efficiency of this module is essential for making the system
responsive and interactive.
When K-Nearest Neighbours (KNN) is used as the classifier, recognition proceeds as follows
(a brief sketch is given after this list):
1. Input: The user performs a hand gesture, and the system captures the gesture's feature
vector (e.g., hand position, shape, and joint angles).
2. Process: KNN calculates the Euclidean distance between the new feature vector and all the
feature vectors in the training set.
3. Output: The system identifies the k nearest neighbors (most similar gestures) and assigns
the gesture to the class that is most common among these neighbors (e.g., "A" for American
Sign Language or "1" for ISL).
Adaptability: As new gestures are added to the system, KNN can easily adapt by storing
additional feature vectors and making classifications based on them.
No Need for Model Training: KNN does not require an expensive training phase and can
classify gestures as soon as the relevant features are extracted.
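A brief scikit-learn sketch of this KNN-based recognition step; the feature and label files are
hypothetical placeholders for the extracted training data:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.load("features.npy")          # hypothetical: one landmark vector per sample
    y = np.load("labels.npy")            # hypothetical: the matching gesture labels

    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    knn.fit(X, y)                        # KNN only stores the data; no training loop

    def recognize(feature_vector):
        # Return the predicted label and a rough confidence from the neighbour vote.
        probs = knn.predict_proba([feature_vector])[0]
        best = int(np.argmax(probs))
        return knn.classes_[best], float(probs[best])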
5.2.7 Language Translation Module
To fulfill the goal of multilingual communication, this module translates the recognized English
gesture label into a regional language like Hindi, Kannada, Tamil, or any preferred language. It
bridges the gap between gesture recognition and user comprehension. Translation can be
achieved using external APIs such as Google Translate API, or offline methods involving
language mapping dictionaries. This module receives the output from the gesture recognition
system and maps it to its equivalent in the target language.
In more advanced systems, this module can also support contextual or phrase-based translation,
where the recognized words are translated in groups rather than as isolated tokens. This ensures
grammatical correctness and fluency in the target language. It may also provide transliteration
options and support multi-language switching. This module adds significant value to the system
by making it more inclusive and adaptable across regions.
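A minimal sketch of the offline dictionary approach mentioned above; the entries are
illustrative, and an online service such as the Google Translate API could replace the lookup
for open-vocabulary text:

    # Offline mapping from recognized English labels to regional-language text.
    TRANSLATIONS = {
        "hello":     {"hi": "नमस्ते",   "kn": "ನಮಸ್ಕಾರ"},
        "thank you": {"hi": "धन्यवाद", "kn": "ಧನ್ಯವಾದಗಳು"},
    }

    def translate(label, lang="kn"):
        # Fall back to the English label when no mapping exists for the chosen language.
        return TRANSLATIONS.get(label.lower(), {}).get(lang, label)

    print(translate("thank you", "hi"))   # धन्यवाद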
PBSMT models translation as a statistical process that finds the most probable output sentence
in the target language, given an input sentence. It uses a phrase table, where source phrases
(not just words) are mapped to likely target phrases based on probabilities learned from large
bilingual corpora. It also includes a language model (to ensure fluent output) and a decoder (to
find the best match). This approach was used by early systems like Google Translate before
they switched to neural methods.
Example:
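For instance, given the recognized English phrase "thank you", a phrase table learned from an
English-Hindi parallel corpus might assign the highest probability to the Hindi phrase
"धन्यवाद"; the language model then scores candidate orderings of the translated phrases for
fluency, and the decoder outputs the highest-scoring Hindi sentence. (The phrase pair shown
here is purely illustrative; the actual entries depend on the training corpus.)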
5.2.8 Output Generation Module
The output generation module is responsible for presenting the translated gesture in both text
and speech formats. This ensures that the recognized sign language can be understood by
hearing individuals who rely on spoken or written language. The recognized and translated word
or sentence is displayed on the screen in a user-friendly format. For voice output, text-to- speech
(TTS) engines like pyttsx3 (offline) or gTTS (online) can be used to vocalize the result.
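A short sketch of this voice output step with pyttsx3, the offline engine named above; the gTTS
alternative is shown as a comment, and the sample Kannada string is illustrative:

    import pyttsx3

    def speak(text):
        # Vocalize the translated output so non-signers can hear it.
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()

    # Online alternative for regional-language speech (requires internet):
    #   from gtts import gTTS
    #   gTTS("ಧನ್ಯವಾದಗಳು", lang="kn").save("output.mp3")

    speak("Thank you")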
5.2.9 User Interface Module
The user interface (UI) is the layer through which users interact with the system. It integrates
all other modules and provides a seamless and intuitive experience. The UI can be built as a
desktop application using Tkinter or PyQt, or as a web application using React, HTML/CSS,
and JavaScript. The main features include live video feed display, real-time text output,
language selection dropdown, and start/stop recognition buttons.
The UI ensures that users with no technical background can easily use the system. It may also
include tooltips, progress indicators, and error alerts to enhance usability. Accessibility features
like large text, high contrast modes, and multilingual support ensure that the UI can be used by
people of all age groups and abilities. This module is critical in determining how well the system
is received by end-users.
The UI can follow standard patterns such as event-driven programming, in which handlers
respond to user actions and system events, and the Model-View-Controller (MVC) pattern, in
which the Model represents the application data (e.g., the sign language gloss or translated
text), the View renders the video feed and text output, and the Controller coordinates user input
with the underlying modules.
A Finite State Machine (FSM) is a mathematical model used for designing the UI's state
transitions. It is especially useful for applications with distinct modes or states, such as
interactive forms, tutorial steps, or multi-step sign language translation. The FSM ensures that
the UI behaves predictably by defining a set of states and possible transitions based on user
actions.
A Finite State Machine (FSM) provides a structured way to manage the flow of an
application's user interface by defining distinct states and the transitions between them. This
approach is especially beneficial for applications with clear sequential or conditional steps,
such as sign language translation systems. For instance, in such systems, the UI can be broken
into states like Idle, Capturing Gesture, Recognizing Gesture, Translating Text, and
Displaying Output. Transitions between these states are triggered by user interactions or
system events, such as starting the camera, detecting a hand gesture, or completing a
translation. By modeling these interactions explicitly, FSMs help ensure the system behaves
predictably and logically in response to inputs, reducing errors and making the interface easier
to understand and navigate.
FSMs also improve the maintainability and scalability of UI systems. Because each state and
its corresponding transitions are clearly defined, developers can easily isolate and troubleshoot
issues or extend the application with new functionality. In a multilingual sign language
translation app, for example, adding support for a new language might involve inserting new
states or transitions without disrupting the existing flow. FSMs can be implemented manually
or with state management libraries that visualize state transitions, enabling better debugging
and design validation. This structured approach ensures a seamless user experience, especially
in complex, multi-step applications where consistency and correctness are critical.
REFERENCES
[1]. Ramesh M. Kagalkar, Nagaraj H.N, “Methodology for Translation of Static Sign Symbol
to Words in Kannada Language”, International Journal of Recent Technology and Engineering
(IJRTE) -2020.
[2]. Vishnu Sai Y, Rathna G N, “Indian Sign Language Gesture Recognition using Image
Processing and Deep Learning”, International Journal of Scientific Research in Computer
Science (IJSRCSEIT)- 2018.
[3]. Prof. Radha S. Shirbhate, Mr. Vedant D. Shinde, “Sign language Recognition Using
Machine Learning Algorithm.”, IJASRT -2020.
[4]. Kohsheen Tiku, Jayshree Maloo, Aishwarya Ramesh, Indra R, “Real-time Conversion of
Sign Language to Text and Speech,” Second International Conference on Inventive Research
in Computing Applications, Coimbatore, India, 2020
[5]. Zeeshaan W. Siddiqui, Vinay E. Koche, Shubham P. Tapre, Prof. Neema Amish Ukani,
Prof. Sandeep Sonaskar, “Gesture Based Voice Message System for Patient”, International
Research Journal of Engineering and Technology (IRJET) - May 2021.
[6]. Aditya Das, Shantanu, “Sign Language Recognition Using Deep Learning On Static
Gesture Images.”, Institute of Electrical and Electronics Engineers (IEEE) - 2018.
[7]. Ajay S, Ajith Potluri, Gaurav R, Anusri S “Indian Sign Language Recognition Using
Random Forest Classifier” IEEE International Conference on Electronics, Computing and
Communication Technologies (CONECCT) - 2021.
[8]. Pooja B S, Ramesh Ganesh Patgar and Pradheep S, “Smart Gloves for Hand Gesture
Recognition in Kannada.”, International Conference On Circuits, Control, Communication
And Computing – 2022.
• https://www.nidcd.nih.gov/health/american-sign-language.
• https://www.ibm.com/in-en/topics/convolutional-neural-networks.