A Framework For Deepfake V2
Spring 2022
Table of Contents
Abstract
Literature Review
Working of Voice Cloning Technology
Project Assumptions
Project Constraints
Working Principles of Voice Cloning
Table of Possible Token Type Text
Difference between Text and Speech
Survey of Existing Apps and Websites
Scope Management / WBS
Improvements Management
Schedule / Time Management / Milestones
Gantt Chart
Risk and Issue Management
Budget Information
Project Requirements
Use Case Diagram
Sequence Diagram
User Interface
References
Abstract:
Artificial Intelligence, and especially Machine Learning and Deep Learning techniques, increasingly shape today's technological and social landscape. These advances have contributed heavily to the development of Speech Synthesis, also known as Text-To-Speech (TTS), in which speech is artificially produced from text by computer. This is where Voice Cloning comes into play: a technology that generates synthetic speech resembling a targeted human voice. Advances in AI and Deep Learning continue to improve the quality of synthetic speech, and TTS applications are now commonplace: anyone who has interacted with a phone-based Interactive Voice Response system, Apple's Siri, Amazon Alexa, a car navigation system, or any of numerous other voice interfaces has experienced synthetic speech. Historically there have been two approaches to TTS. The first, Concatenative TTS, uses audio recordings to build a library of words and units of sound (phonemes) that can be strung together to form sentences; it lacks the emotion and inflection found in natural human speech, and cloning any individual voice with this method requires an enormous investment. The second, Parametric TTS, uses statistical models of speech to simplify creating a voice, reducing cost and effort compared to concatenation; even so, creating any single voice has historically been expensive, and the results clearly not human. Voice cloning also has beneficial uses, such as education, audiobooks, assistive technology, and cultural films about Saudi Arabia.
Literature Review:
Several works related to the main topic are summarized below.
• Neural Voice Cloning with a Few Samples: introduces a neural voice cloning system that takes a few audio samples as input and studies two approaches, speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model with a few cloning samples. By Sercan Ö. Arık, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou.
• Deepfakes Generation and Detection: State-of-the-art, open challenges, countermeasures, and way forward: provides a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, and of the methodologies used to detect such manipulations, for both audio and visual deepfakes. For each category of deepfake, it discusses manipulation approaches, current public datasets, and key standards for the performance evaluation of deepfake detection techniques, along with their results. By Momina Masood, Mariam Nawaz, Khalid Mahmood Malik, Ali Javed, and Aun Irtaza.
• Data Efficient Voice Cloning for Neural Singing Synthesis: adapts a voice cloning technique to the case of singing synthesis. By leveraging data from many speakers to first create a multi-speaker model, small amounts of target data can then efficiently adapt the model to new unseen voices. By Merlijn Blaauw, Jordi Bonada, and Ryunosuke Daido.
• Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning: presents a multi-speaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that can produce high-quality speech in multiple languages. Moreover, the model can transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples; such transfer works even across distantly related languages such as English and Mandarin. By Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran.
• Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning: presents two state-of-the-art systems, an HMM-based system (HTS-2007, developed by CSTR and the Nagoya Institute of Technology) and a commercial unit-selection system (CereVoice, developed by CereProc). Both systems were used to mimic the voice of George W. Bush (43rd President of the United States) using freely available audio from the web.
Working of Voice Cloning Technology:
Voice cloning allows one voice to be made to sound like someone else's. In one notable incident, an unknown hacking organization used AI voice cloning to make fraudulent calls impersonating an executive of a British energy company and succeeded in defrauding it of 220,000 euros.
From most people's perspective, voice cloning may seem like a harmful technology, but from another perspective it can be an effective aid for many patients in hospitals and students in education. For this, the generated voice must provide:
• sound clarity
• correct pronunciation and language recognition
• sound language resources
Project Assumptions:
1. A Python environment must be installed on the system.
2. The pyttsx3 library must be installed on the system.
3. A Google Colab platform account is available.
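As a minimal sketch of how these assumptions come together, the following uses the pyttsx3 library (assumption 2) to speak text offline. The `chunk_sentences` helper and the chosen speech rate are illustrative choices, not part of the project specification.

```python
# Minimal text-to-speech sketch using pyttsx3 (offline TTS engine).
# The chunk_sentences helper and the rate value are illustrative.
import re

def chunk_sentences(text):
    """Split text into sentences so long inputs can be fed to the engine piecewise."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def speak(text, rate=150):
    """Speak text aloud; requires the pyttsx3 package to be installed."""
    import pyttsx3
    engine = pyttsx3.init()
    engine.setProperty('rate', rate)   # speaking rate in words per minute
    for sentence in chunk_sentences(text):
        engine.say(sentence)
    engine.runAndWait()                # block until all queued speech is played

if __name__ == "__main__":
    speak("Hello. This is a voice cloning demo.")
```

Splitting into sentences before queuing keeps the engine responsive on long inputs; a real implementation would also expose voice selection via `engine.getProperty('voices')`.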
Project Constraints:
1. Scope: the project aims to help certain companies, so not all users will be interested in it.
2. A capable CPU is needed to convert text to speech quickly.
3. Multiple voices must be enabled for the user.
4. The speaker output must be clear and loud.
Working Principles of Voice Cloning:
Text-to-speech synthesis converts text into synthetic speech that is as close to real speech as possible according to the pronunciation norms of a particular language; such systems are called text-to-speech (TTS) systems. The input of a TTS system is text, and its output is synthetic speech. There are two possible cases. When only a limited number of phrases must be pronounced (and their pronunciation does not vary), the necessary speech material can simply be recorded in advance. This approach creates certain problems: text that is not known in advance cannot be voiced, and the pronounced text must be kept in computer memory, which increases the memory required and can cause problems in operation when the amount of information is large. The main approach used in this project is therefore the voicing of previously unknown text based on a specific algorithm.
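The two cases above can be contrasted in a small sketch: a pre-recorded phrase table can only play back what it already stores, while an algorithmic synthesizer can voice unseen text. The phrase table, file names, and the toy character-to-unit mapping below are hypothetical stand-ins, not the project's actual method.

```python
# Contrast of the two cases described above (illustrative data only).

# Case 1: pre-recorded phrases - fails for text not known in advance,
# and every stored phrase consumes memory.
recorded_phrases = {
    "welcome": "welcome.wav",
    "goodbye": "goodbye.wav",
}

def play_prerecorded(text):
    """Return the stored recording, or None if the phrase was never recorded."""
    return recorded_phrases.get(text.lower())

# Case 2: algorithmic synthesis - any input text maps to a unit sequence.
def synthesize(text):
    """Toy 'algorithm': map each letter to a unit name (stands in for real TTS)."""
    return ["unit_" + ch for ch in text.lower() if ch.isalpha()]

# A phrase that was never recorded cannot be played back (case 1) ...
assert play_prerecorded("unknown phrase") is None
# ... but can still be voiced by the algorithmic approach (case 2).
assert synthesize("hi") == ["unit_h", "unit_i"]
```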
Every language has its own unique features. In English, for example, there are certain contradictions between letters and sounds: two letters coming together can sound different from the same letters used separately. The letters (t) and (h) on their own do not sound the same as the digraph (th). The position of a letter also affects whether and how it is pronounced; by the phonetic rules of English, the first letter (k) of the word (know) is not pronounced. Russian likewise has its own pronunciation features: first of all, the letter (o) is not always pronounced as the sound [o].
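The irregularities described above (the digraph "th", the silent "k" in "know") are exactly what letter-to-phoneme rules have to encode. The following is a toy sketch, not the project's actual algorithm; its rule set covers only the two examples from the text.

```python
# Toy letter-to-phoneme converter illustrating the English irregularities above.
# The rule set is deliberately tiny: the 'th' digraph and the silent 'k' in 'kn'.
RULES = [
    ("kn", ["N"]),   # the 'k' in 'kn' is silent, as in "know"
    ("th", ["TH"]),  # 'th' is a single phoneme, not T followed by H
]

def letters_to_phonemes(word):
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for pattern, output in RULES:
            if word.startswith(pattern, i):
                phonemes.extend(output)
                i += len(pattern)
                break
        else:
            # default rule: the letter stands for its own sound
            phonemes.append(word[i].upper())
            i += 1
    return phonemes

print(letters_to_phonemes("know"))  # ['N', 'O', 'W']
print(letters_to_phonemes("this"))  # ['TH', 'I', 'S']
```

Real letter-to-sound systems use hundreds of context-sensitive rules plus an exception dictionary; the longest-match-first scan here is only the skeleton of that idea.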
Two parameters, naturalness of sound and intelligibility of speech, are used to assess the quality of a synthesis system. The naturalness of a speech synthesizer depends on how close its generated sounds are to natural human speech; the intelligibility of a speech synthesizer means how easy the artificial speech is to understand. The ideal speech synthesizer possesses both characteristics, and existing and newly developed speech synthesis systems aim to improve both.
Table of Possible Token Type Text:

Type                 | Text          | Speech
---------------------|---------------|-------------------------------------------
Decimal numbers      | 1.2           | one and two tenths
Ordinal numbers      | 1-st          | first
Roman numerals       | VI, X         | sixth, tenth
Alphanumeric strings | 110           | one a power of a ten
Phone numbers        | +966501068872 | plus, nine, double six, five, zero, one, zero, six, double eight, seven, two
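A few of the token types above can be sketched as simple normalization rules. This is an illustrative fragment, not the project's full text normalizer; it covers only ordinals and phone numbers, and the "double" collapsing of repeated digits follows the phone-number example given here.

```python
# Toy text normalizer for two token types: ordinal numbers ("1-st" -> "first")
# and phone numbers (read digit by digit, with "double" for repeated digits).
ORDINALS = {"1": "first", "2": "second", "3": "third"}
DIGITS = {"+": "plus", "0": "zero", "1": "one", "2": "two", "3": "three",
          "4": "four", "5": "five", "6": "six", "7": "seven",
          "8": "eight", "9": "nine"}

def say_ordinal(token):
    """Normalize a token like '1-st' to its spoken form."""
    num = token.split("-")[0]
    return ORDINALS.get(num, num + "th")

def say_phone(number):
    """Read a phone number digit by digit, collapsing pairs into 'double'."""
    words = []
    i = 0
    while i < len(number):
        ch = number[i]
        if i + 1 < len(number) and number[i + 1] == ch and ch.isdigit():
            words.append("double " + DIGITS[ch])
            i += 2
        else:
            words.append(DIGITS[ch])
            i += 1
    return ", ".join(words)

print(say_ordinal("1-st"))  # first
print(say_phone("+966501068872"))
```

A production normalizer would also need token classification (deciding that "+966..." is a phone number rather than arithmetic), which is the harder part of the problem.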
Difference between Text and Speech:

Text      | Speech
----------|---------------
Paragraph | Phonoparagraph
Sentence  | Utterance
Word      | Phonoword
Syllable  | Diphone
Letter    | Phoneme
Survey of Existing Apps and Websites:
Scope Management / WBS :
Improvements Management:
Application improvement relies on users' feedback, and the next version will be released with enhanced features. We welcome all feedback, whether positive or negative, as it will improve the app.
Gantt Chart:
Budget Information:
The project uses open-source software (Python) and a Google Colab account, so there is no cost to build the project.
Project Requirements:
Use Case Diagram :
Sequence Diagram :
• Sequence Diagram for Login
• Sequence Diagram for Manage Account
• Sequence Diagram for Voice Clone
• Sequence Diagram for Create Voice
• Sequence Diagram for Manage Voice
User Interface :
References:
[1]. Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, and Julian McAuley. "Expressive Neural Voice Cloning." Proceedings of Machine Learning Research 157, 2021. University of California, San Diego.
[2]. Matthew P. Aylett and Junichi Yamagishi. "Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning." Centre for Speech Technology Research, University of Edinburgh, and CereProc Ltd., U.K.
[3]. Boris M. Lobanov and Helena B. Karnevskaya. "TTS-Synthesizer as a Computer Means for Personal Voice 'Cloning'." Institute of Engineering Cybernetics, National Academy of Sciences of Belarus, and Minsk State Linguistic University.
[4]. Mark Y. Liberman and Kenneth W. Church. "Text Analysis and Word Pronunciation in Text-to-Speech Synthesis." AT&T Bell Laboratories, Murray Hill, N.J.
[5]. K. R. Aida-Zade, C. Ardil, and A. M. Sharifova. "The Main Principles of Text-to-Speech Synthesis System."
[6]. The massam.fandom website: list of words the Microsoft speech engines cannot say correctly.
[7]. Eleonora Cavalcante Albano and Agnaldo Antonio Moreira. "Archisegment-Based Letter-to-Phone Conversion for Concatenative Speech Synthesis in Portuguese." LAFAPE-IEL-UNICAMP, Campinas, SP, Brazil.
[8]. Simon King. "An Introduction to Statistical Parametric Speech Synthesis." Sādhanā, Vol. 36, Part 5, October 2011, pp. 837-852. Indian Academy of Sciences. Centre for Speech Technology Research, University of Edinburgh.
[9]. Merlijn Blaauw, Jordi Bonada, and Ryunosuke Daido. "Data Efficient Voice Cloning for Neural Singing Synthesis." Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain, and Sound Processing Group, Yamaha Corporation, Hamamatsu, Japan.