Capstone Paper
INTRODUCTION:
Related Works:
We first reviewed research papers on speech recognition and analysed how each one approached the problem. Bano, S., Jithendra, P., Niharika, G. L., & Sikhi, Y. (2020) [1] proposed a speech recognition model that converts the speech given by the user as input into text in the user's desired language. The model is built by adding multilingual features to the existing Google Speech Recognition model, drawing on natural language processing principles. The goal of this research is to build a speech recognition model that allows even an illiterate person to easily communicate with a computer system in their regional language. By implementing this model, the authors showed how the SpeechRecognition package can be used to build a speech translation model; such packages give more flexibility in the code and in the output to be displayed. The model can be used for any speech-to-text translation purpose, as illustrated by the sketch below.
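As a rough illustration of the approach described in [1], the following is a minimal sketch using the open-source SpeechRecognition package with Google's Web Speech API. The audio file name and the target language code are assumptions made for the example, not details taken from the paper.

```python
# Minimal sketch: multilingual speech-to-text with the SpeechRecognition package.
# "speech_sample.wav" and the language code "hi-IN" (Hindi) are assumed for
# illustration; any language tag supported by the API could be used instead.
import speech_recognition as sr

recognizer = sr.Recognizer()

# Load the recorded utterance from an audio file.
with sr.AudioFile("speech_sample.wav") as source:
    audio = recognizer.record(source)

try:
    # Send the audio to Google's free Web Speech API and request Hindi output.
    text = recognizer.recognize_google(audio, language="hi-IN")
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Speech was not intelligible.")
except sr.RequestError as error:
    print("Could not reach the recognition service:", error)
```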
We then proceeded to the next part, speech-to-text (STT) conversion. Shivangi Nagdewani and Ashika Jain [2] created a model for STT conversion using a Hidden Markov Model (HMM) and a neural network, as this combination gives the highest accuracy for STT. The HMM is a statistical model used in speech recognition because a speech signal can be viewed as a piecewise stationary or short-time stationary signal. According to their research, the most suitable technique for STT conversion is a combination of a Hidden Markov Model with a deep neural network, which can be implemented in Python using Google's Speech Recognition API module. The system could be further improved by inserting punctuation marks while converting speech to text. A sketch of the HMM side of such a pipeline is given below.
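The paper does not publish its code, so the following is only a minimal sketch of how an HMM acoustic model can be fitted to MFCC features in Python. The use of librosa for feature extraction and hmmlearn for the HMM, the file name, and the number of hidden states are all assumptions made for illustration.

```python
# Minimal sketch: fitting a Gaussian HMM to MFCC features of one utterance.
# librosa/hmmlearn, "utterance.wav", and n_components=5 are assumptions for
# illustration; a real STT system would train one HMM per phoneme or word.
import librosa
from hmmlearn import hmm

# Load the waveform and extract 13 MFCCs per frame (frames are the HMM observations).
signal, rate = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=13).T  # shape: (frames, 13)

# Fit a 5-state HMM with diagonal Gaussian emissions to the frame sequence.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=100)
model.fit(mfcc)

# Log-likelihood of the utterance under this model; in recognition, the model
# with the highest score among the candidates would be chosen.
print("Log-likelihood:", model.score(mfcc))
```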
Later, we read about translation from one language to another. Fathimath Shouna Shayyam C A and Pragisha K [3] demonstrated machine translation models built with the Python programming language together with an open-source library, the Natural Language Toolkit (NLTK), to translate text from one natural language to another. This approach led to good results and performed marginally better than other methods; a brief sketch follows.
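The paper's code is not reproduced here; as a hedged illustration of statistical machine translation with NLTK, the sketch below trains IBM Model 1 on a tiny toy English-French parallel corpus. The corpus sentences and the number of EM iterations are invented for the example.

```python
# Minimal sketch: word-level translation probabilities with NLTK's IBM Model 1.
# The toy English-French bitext and 10 EM iterations are assumptions for
# illustration; a real system would train on a large parallel corpus.
from nltk.translate import AlignedSent, IBMModel1

# A tiny parallel corpus: each AlignedSent pairs a target sentence with its source.
bitext = [
    AlignedSent(["the", "house"], ["la", "maison"]),
    AlignedSent(["the", "book"], ["le", "livre"]),
    AlignedSent(["a", "book"], ["un", "livre"]),
]

# Run 10 iterations of expectation-maximization to estimate translation probabilities.
ibm1 = IBMModel1(bitext, 10)

# Probability that the source word "livre" translates to the target word "book".
print(ibm1.translation_table["book"]["livre"])
```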
After translation, we learnt about lip synchronization and how it is done. K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri and C V Jawahar [4] investigated the problem of lip-syncing a talking-face video of an arbitrary identity to match a target speech segment. They identified the key reasons why existing approaches produce inaccurate lip-sync in this setting and resolved them by learning from a powerful lip-sync discriminator. They also proposed new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on these challenging benchmarks showed that the lip-sync accuracy of the videos generated by their Wav2Lip model was almost as good as that of real synced videos. The released model can be run as sketched below.
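The Wav2Lip authors provide an inference script in their public repository; a minimal sketch of invoking it from Python is shown below. The file names are placeholders, and the flag names are quoted from memory and should be verified against the repository's README rather than treated as definitive.

```python
# Minimal sketch: calling the Wav2Lip inference script on a face video and an
# audio track. File names are placeholders; the flag names follow the public
# Wav2Lip repository but should be checked against its README.
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained weights
        "--face", "input_face.mp4",                          # video of the speaker
        "--audio", "translated_speech.wav",                  # target speech segment
    ],
    check=True,
)
```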
Finally, we studied audio-video synchronization and came across a research paper by Joon Son Chung and Andrew Zisserman [5], in which they determine the audio-video synchronization between mouth motion and speech in a video. They proposed a two-stream ConvNet architecture that enables a joint embedding between the sound and the mouth images to be learnt from unlabelled data. The trained network is used to determine the lip-sync error in a video, and they applied it to two further tasks: active speaker detection and lip reading. A sketch of such a two-stream architecture follows.
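The paper's exact SyncNet architecture is not reproduced here; the following is only a small PyTorch sketch of the two-stream idea, in which one branch embeds a short window of mouth-crop frames, the other embeds the corresponding audio features, and the distance between the two embeddings indicates how well they are synchronized. All layer sizes and input shapes are assumptions made for illustration.

```python
# Minimal sketch of a two-stream audio-visual embedding network (SyncNet-style).
# Layer sizes, input shapes (5 grayscale mouth frames, a 20x13 MFCC window) and
# the 128-d embedding size are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSyncNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Visual stream: 5 stacked mouth-crop frames treated as input channels.
        self.visual = nn.Sequential(
            nn.Conv2d(5, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio stream: an MFCC window treated as a 1-channel image.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, frames, mfcc):
        # Return L2-normalised embeddings; a small distance suggests the
        # mouth motion and the audio are in sync.
        v = F.normalize(self.visual(frames), dim=1)
        a = F.normalize(self.audio(mfcc), dim=1)
        return v, a

# Toy usage: a batch of 4 clips, each with 5 grayscale 112x112 mouth frames
# and a 20x13 MFCC window.
net = TwoStreamSyncNet()
v_emb, a_emb = net(torch.randn(4, 5, 112, 112), torch.randn(4, 1, 20, 13))
print(F.pairwise_distance(v_emb, a_emb))  # per-clip audio-visual distance
```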
REFERENCES:
1) Bano, S., Jithendra, P., Niharika, G. L., & Sikhi, Y. (2020). Speech to Text
Translation enabling Multilingualism. 2020 IEEE International Conference for
Innovation in Technology (INOCON). doi:10.1109/inocon50539.2020.9298.
4) Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (2020). A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild. IIIT Hyderabad, India; University of Bath, England. arXiv:2008.10010v1 [cs.CV], 23 Aug 2020.
5) Chung, J. S., & Zisserman, A. Out of Time: Automated Lip Sync in the Wild. Visual Geometry Group, Department of Engineering Science, University of Oxford.