Capstone Paper


ABSTRACT:

We propose a system for audiovisual translation and dubbing that translates videos from one language to another. The given input video is first separated into its audio and video tracks. The source-language speech is then recognised and transcribed to text using Google's speech recognition Python module, the text is translated into the target language, and target-language speech is automatically synthesised from it. The visual material is translated into the target language by synthesising the speaker's lip motions to match the translated audio with the pre-trained Wav2Lip model, resulting in a continuous audiovisual experience.
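
As a concrete illustration of the first stage, the minimal Python sketch below separates an input video into its audio track and a silent video track. The moviepy library and the file names are assumptions made for illustration; the design does not fix a particular demultiplexing tool.

# A minimal sketch of the first pipeline stage, assuming moviepy.
# The file names are hypothetical.
from moviepy.editor import VideoFileClip

clip = VideoFileClip("input_video.mp4")
clip.audio.write_audiofile("source_audio.wav")            # audio track for recognition
clip.without_audio().write_videofile("silent_video.mp4")  # video track for lip-syncing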

INTRODUCTION:
Related Works:

We first reviewed research papers on speech recognition and analysed how each one approached it. Bano, Jithendra, Niharika, and Sikhi (2020) [1] proposed a speech recognition model that converts the speech given by the user as input into text in the user's desired language. The model is developed by adding multilingual features to the existing Google Speech Recognition model, based on natural language processing principles. The goal of that research is to build a speech recognition model that lets even an illiterate person easily communicate with a computer system in their regional language. Their implementation shows how the SpeechRecognition package can be used to build a speech translation model, and how such packages add flexibility to the code and to the output that is displayed. The model can be used for any task requiring translation of speech to text.
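
As an illustration of this approach, the following minimal sketch transcribes an audio file with the SpeechRecognition package used in [1]; the file name and the language code are illustrative assumptions.

# A minimal sketch of recognition with the SpeechRecognition package.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("source_audio.wav") as source:   # hypothetical WAV file
    audio = recognizer.record(source)              # read the entire file

# recognize_google() sends the audio to Google's free web speech API;
# the language code selects the source language.
text = recognizer.recognize_google(audio, language="en-IN")
print(text)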

We then proceeded to the next part, speech-to-text (STT) conversion. Shivangi Nagdewani and Ashika Jain [2] reviewed models for STT conversion based on the Hidden Markov Model (HMM) and neural networks, which give the highest accuracy for STT. The HMM is a statistical model used in speech recognition because a speech signal can be viewed as a piecewise stationary, or short-time stationary, signal. According to their research, the most suitable technique for STT conversion is a combination of a Hidden Markov Model and a deep neural network, which can be implemented in Python using Google's Speech Recognition API module. This system could be improved further by handling punctuation marks while converting speech to text.
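
To make the HMM view concrete, the toy sketch below runs Viterbi decoding over a two-state model in which hidden states stand for phonemes and observations for acoustic frames. The states and all probabilities are invented purely for illustration and are not taken from [2].

# A toy Viterbi decoder over an invented two-phoneme HMM.
import numpy as np

states = ["ph_a", "ph_b"]              # hypothetical phoneme states
start = np.array([0.6, 0.4])           # initial state probabilities
trans = np.array([[0.7, 0.3],          # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],      # P(observation | state)
                 [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]                     # a short sequence of "frames"

# Viterbi: best-path probability ending in each state, plus backpointers
v = start * emit[:, obs[0]]
back = []
for o in obs[1:]:
    scores = v[:, None] * trans        # extend every path by one step
    back.append(scores.argmax(axis=0)) # remember the best predecessor
    v = scores.max(axis=0) * emit[:, o]

# Trace back the most likely phoneme sequence
path = [int(v.argmax())]
for ptr in reversed(back):
    path.append(int(ptr[path[-1]]))
path.reverse()
print([states[s] for s in path])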

Later, we read about translation from one language to another. Fathimath Shouna Shayyam C A and Pragisha K [3] demonstrated machine translation models built with the Python programming language and the open-source Natural Language Toolkit (NLTK) library to translate text from one natural language to another. This approach produced good results and performed marginally better than other methods.
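
As a minimal illustration of the translation machinery in NLTK, the sketch below trains IBM Model 1, the classic word-alignment model shipped with NLTK, on a tiny invented German-English parallel corpus. It learns word-translation probabilities only; a full sentence-level translator, as in [3], needs further components.

# A minimal sketch of NLTK's IBM Model 1 on an invented parallel corpus.
from nltk.translate import AlignedSent, IBMModel1

bitext = [
    AlignedSent(["das", "haus"], ["the", "house"]),
    AlignedSent(["das", "buch"], ["the", "book"]),
    AlignedSent(["ein", "buch"], ["a", "book"]),
]
ibm1 = IBMModel1(bitext, 5)  # 5 EM iterations over the corpus

# Learned probability that German "buch" translates English "book"
print(round(ibm1.translation_table["buch"]["book"], 3))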

After translation, we learnt about lip synchronization and how it is done. K. R. Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, and C. V. Jawahar [4] investigated the problem of lip-syncing a talking-face video of an arbitrary identity to match a target speech segment. They identified the key reasons why existing approaches fail at this and resolved them by learning from a powerful lip-sync discriminator. Next, they proposed new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on their challenging benchmarks showed that the lip-sync accuracy of videos generated by their Wav2Lip model is almost as good as that of real synced videos.
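
The authors released pre-trained checkpoints together with an inference script. The sketch below invokes that script from Python; the flags follow the public Wav2Lip repository's documented usage, while the input and output file names are assumptions from our pipeline.

# A sketch of running the released Wav2Lip inference script.
import subprocess

subprocess.run([
    "python", "inference.py",
    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pre-trained model
    "--face", "silent_video.mp4",                        # speaker video
    "--audio", "translated_speech.wav",                  # target-language speech
], check=True)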

Finally, we studied audio-video synchronization in more depth through a research paper by Joon Son Chung and Andrew Zisserman [5], in which they determine the audio-video synchronization between mouth motion and speech in a video. They proposed a two-stream ConvNet architecture that enables a joint embedding between the sound and the mouth images to be learnt from unlabelled data. The trained network is used to determine the lip-sync error in a video. They then applied the network to two further tasks: active speaker detection and lip reading.
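
The following compressed PyTorch sketch conveys the two-stream idea: one ConvNet embeds a short window of mouth crops, another embeds the matching audio features, and the distance between the two embeddings measures how well the window is in sync. The layer sizes and input shapes are illustrative, not the exact architecture of [5].

# An illustrative two-stream embedding network, not the paper's exact model.
import torch
import torch.nn as nn

class Stream(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 128),   # shared 128-d embedding space
        )

    def forward(self, x):
        return self.net(x)

visual = Stream(in_channels=5)   # 5 grayscale mouth frames stacked as channels
audio = Stream(in_channels=1)    # MFCC features treated as a 1-channel image

mouths = torch.randn(8, 5, 60, 60)  # batch of mouth-crop windows
mfccs = torch.randn(8, 1, 13, 20)   # batch of matching audio features

# A small distance means audio and mouth motion are in sync for the window
dist = torch.norm(visual(mouths) - audio(mfccs), dim=1)
print(dist.shape)  # torch.Size([8])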

REFERENCES:

1) Bano, S., Jithendra, P., Niharika, G. L., & Sikhi, Y. (2020). Speech to Text Translation enabling Multilingualism. 2020 IEEE International Conference for Innovation in Technology (INOCON). doi:10.1109/inocon50539.2020.9298.

2) Nagdewani, S., & Jain, A. A Review on Methods for Speech-to-Text and Text-to-Speech Conversion. International Research Journal of Engineering and Technology (IRJET). e-ISSN: 2395-0056, p-ISSN: 2395-0072.

3) Fathimath Shouna Shayyam C A, & Pragisha K. Advancements in Machine Translation as a part of Natural Language Processing in Python. International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE).

4) Prajwal, K. R., Mukhopadhyay, R., Namboodiri, V. P., & Jawahar, C. V. (2020). A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild. IIIT Hyderabad and University of Bath. arXiv:2008.10010v1 [cs.CV], 23 Aug 2020.

5) Chung, J. S., & Zisserman, A. Out of Time: Automated Lip Sync in the Wild. Visual Geometry Group, Department of Engineering Science, University of Oxford.
