
A Framework for Deepfake Voice Synthesis

Abdullah Altulahi 382102957

Bander Altamimi 382102941

Under the supervision of


Dr. Jayadev Gyani

Department of Computer Science


College of Computer and Information Sciences
Majmaah University

Spring 2022
Table of Contents
Abstract
Literature Review
Working of Voice Cloning Technology
Project Assumptions
Project Constraints
Working Principles of Voice Cloning
Table of Possible Token Type Text
Difference between Text and Speech
Survey of Existing Apps and Websites
Scope Management / WBS
Improvements Management
Schedule / Time Management / Milestones
Gantt Chart
Risk and Issue Management
Information of Budget
Project Requirements
Use Case Diagram
Sequence Diagram
User Interface
References

Abstract:
Artificial Intelligence, and especially Machine Learning and Deep Learning techniques, increasingly populate today's technological and social landscape. These advances have contributed overwhelmingly to the development of Speech Synthesis, also known as Text-To-Speech (TTS), in which speech is artificially produced from text by means of computer technology. This is where Voice Cloning technology comes into play: it generates artificial synthetic speech that resembles a targeted human voice. Today, Artificial Intelligence (AI) and advances in Deep Learning continue to improve the quality of synthetic speech, and applications for TTS are now commonplace: everyone who has interacted with a phone-based Interactive Voice Response system, Apple's Siri, Amazon Alexa, a car navigation system, or numerous other voice interfaces has experienced synthetic speech. Historically, there have been two approaches to TTS. The first, Concatenative TTS, uses audio recordings to create a library of words and units of sound (phonemes) that can be strung together to form sentences; it lacks the emotion and inflection found in natural human speech, and cloning any individual voice with this method requires enormous investment. The second, Parametric TTS, uses statistical models of speech to simplify creating a voice, reducing the cost and effort compared to concatenation; however, the effort of creating any single voice has historically still been expensive, and the results are clearly not human. Voice cloning can be used for good purposes such as education, audiobooks, assistive technology, and cultural films about Saudi Arabia.

Literature Review:
Below we review several works related to the main topic.
• Neural Voice Cloning with a Few Samples: introduces a neural voice cloning system that takes a few audio samples as input and studies two approaches, speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model with a few cloning samples. By Sercan Ö. Arık, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou.

• Deepfakes Generation and Detection: State-of-the-art, Open Challenges, Countermeasures, and Way Forward: provides a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, and of the methodologies used to detect such manipulations, for both audio and visual deepfakes. For each category of deepfake, it discusses manipulation approaches, current public datasets, and key standards for the performance evaluation of deepfake detection techniques, along with their results. By Momina Masood, Mariam Nawaz, Khalid Mahmood Malik, Ali Javed, and Aun Irtaza.

• Data-Efficient Voice Cloning for Neural Singing Synthesis: adapts one such voice cloning technique to the case of singing synthesis. By leveraging data from many speakers to first create a multispeaker model, small amounts of target data can then efficiently adapt the model to new, unseen voices. By Merlijn Blaauw, Jordi Bonada, and Ryunosuke Daido.

• Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning: presents a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works even across distantly related languages, e.g. English and Mandarin. By Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran.

• Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning: presents two state-of-the-art systems, an HMM-based system, HTS-2007, developed by CSTR and the Nagoya Institute of Technology, and a commercial unit-selection system, CereVoice, developed by Cereproc. Both systems have been used to mimic the voice of George W. Bush (43rd president of the United States) using freely available audio from the web. By Matthew P. Aylett and Junichi Yamagishi.

Working of Voice Cloning Technology:
Voice cloning is a technology that makes a synthetic voice sound like someone else's. It can be abused: at a British energy company, an unknown hacking organization used AI voice cloning technology to make fraudulent calls and succeeded in defrauding the company of 220 thousand euros. From the perspective of most people, voice cloning may seem like a bad technology, but from another perspective it can become an effective tool for helping many patients in hospitals and students in education. For this, the cloned voice must have:
• sound clarity
• correct pronunciation and language recognition
• sound language resources

Project Assumptions:
1. A Python environment must be installed on the system.
2. The pyttsx3 library must be installed on the system.
3. A Google Colab platform account is available.
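The assumptions above can be exercised with a minimal sketch. The pyttsx3 library provides offline text-to-speech through the system's speech engine; the helper names, the announcement text, and the speaking rate below are illustrative choices, not part of the project's actual code.

```python
def build_announcement(name, count):
    """Compose the text to be spoken (pure Python, engine-independent)."""
    return f"Hello {name}, you have {count} new messages."

def speak(text):
    """Speak text with pyttsx3 if it is available; return whether audio ran."""
    try:
        import pyttsx3  # offline wrapper over SAPI5 / NSSpeechSynthesizer / eSpeak
        engine = pyttsx3.init()
        engine.setProperty("rate", 150)  # speaking rate in words per minute
        engine.say(text)
        engine.runAndWait()
        return True
    except Exception as exc:  # library missing or no speech driver on this system
        print(f"speech output unavailable: {exc}")
        return False

if __name__ == "__main__":
    speak(build_announcement("Abdullah", 3))
```

Note that Google Colab (assumption 3) has no audio device, so a file-writing TTS library is often substituted there; the sketch above simply degrades to a printed message in that case.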

Project Constraints:
1. Scope constraint: the project aims to help certain companies, so not all users will be interested in it.
2. A good CPU is needed to convert text to speech quickly.
3. Multiple voices should be enabled for the user.
4. The output speech must be clear and loud.

Working Principles of Voice Cloning:
The first step is to determine the language resources for speech synthesis. Text-to-speech synthesis converts text into synthetic speech that is as close to real speech as possible, according to the pronunciation norms of the particular language; such systems are called text-to-speech (TTS) systems. The input of a TTS system is text, and the output is synthetic speech. There are two possible cases. When only a limited number of phrases must be pronounced (and their pronunciation does not vary), the necessary speech material is simply recorded in advance. Certain problems arise in this approach: it cannot voice text that is not known in advance, and the pronounced text has to be kept in computer memory, which increases the amount of memory required for the information content. With much information this places a significant load on computer memory and can create problems in operation. The main approach used in this paper is therefore the voicing of previously unknown text based on a specific algorithm.

Every language has its own unique features. In English, for example, there are certain contradictions between letters and sounds: two letters coming together may sound different than when they are used separately. The letters (t) and (h) on their own do not sound the same as in the combination (th). This is only one of the problems faced in English; in other words, the position of letters affects how they should or should not be pronounced. Thus, according to the phonetic rules of English, the first letter (k) of the word (know) is not pronounced. Russian likewise has its own pronunciation features; first of all, the letter (o) is not always pronounced as the sound [o].
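The contextual effects just described (the (th) digraph, the silent (k) in (know)) can be sketched as ordered rewrite rules. Real letter-to-phone conversion uses much richer rule sets or learned models; the rules and the stand-in sound symbols below are purely illustrative.

```python
# Ordered rewrite rules: more specific multi-letter patterns are tried first.
# "T" and "C" are stand-in symbols for the th/ch sounds, not real phoneme codes.
RULES = [
    ("kn", "n"),  # silent k, as in "know" (applied anywhere here, a simplification)
    ("th", "T"),  # digraph: one sound, not t followed by h
    ("ch", "C"),
]

def letters_to_sounds(word):
    """Greedy left-to-right application of the rewrite rules."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for pattern, sound in RULES:
            if word.startswith(pattern, i):
                out.append(sound)
                i += len(pattern)
                break
        else:  # no rule matched: keep the letter as its own sound
            out.append(word[i])
            i += 1
    return "".join(out)
```

With these rules, "know" maps to "now" and "this" to "Tis", mirroring the two English examples in the text.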

Two parameters, naturalness of sounding and intelligibility of speech, are applied when assessing the quality of a synthesis system. The naturalness of a speech synthesizer depends on how close the generated sounds are to natural human speech. The intelligibility of a speech synthesizer means the ease with which the artificial speech can be understood. The ideal speech synthesizer should possess both characteristics, and existing and developing speech synthesis systems aim to improve both.

Table of Possible Token Type Text:

Type                  Text            Speech
Decimal numbers       1.2             one and two tenths
Ordinal numbers       1-st            first
Roman numerals        VI, X           sixth, tenth
Alphanumeric strings  1^10            one to the power of ten
Phone numbers         +966501068872   plus nine, double six, five, zero, one,
                                      zero, six, double eight, seven, two
Count                 45              forty-five
Date                  29/11/1999      twenty-ninth of November nineteen ninety-nine
Time                  11:15 pm        quarter past eleven post meridiem
Mathematical          5+4=9           five plus four is equal to nine
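A few rows of this table can be sketched as a rule-based text normalizer. The function name and the tiny rule set below are illustrative only; a production normalizer needs full number-to-words expansion (the count rule here spells digits individually rather than producing "forty-five").

```python
import re

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
ORDINALS = {"1": "first", "2": "second", "3": "third", "4": "fourth"}

def normalize_token(token):
    """Map a written token to a spoken form according to its detected type."""
    if token.startswith("+") and token[1:].isdigit():      # phone number
        return "plus " + " ".join(DIGITS[d] for d in token[1:])
    m = re.fullmatch(r"(\d+)-(?:st|nd|rd|th)", token)      # ordinal, e.g. 1-st
    if m and m.group(1) in ORDINALS:
        return ORDINALS[m.group(1)]
    m = re.fullmatch(r"(\d)\+(\d)=(\d+)", token)           # simple equation
    if m:
        a, b, c = m.groups()
        spoken_sum = " ".join(DIGITS[d] for d in c)
        return f"{DIGITS[a]} plus {DIGITS[b]} is equal to {spoken_sum}"
    if token.isdigit():                                    # count (digit by digit)
        return " ".join(DIGITS[d] for d in token)
    return token                                           # unknown type: pass through
```

For example, normalize_token("5+4=9") yields "five plus four is equal to nine", matching the Mathematical row of the table.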

Difference between Text and Speech:

Text and the speech signal each have a clearly defined hierarchical nature. In view of this hierarchical representation, we can conclude that the qualitative construction of a speech synthesis system requires developing a model of the mechanism of speech formation. In such a system, the flow of information should initially proceed according to the scheme presented below:

TEXT        SPEECH
Paragraph   Phonoparagraph
Sentence    Utterance
Word        Phonoword
Syllable    Diphone
Letter      Phoneme
Survey of Existing Apps and Websites:

Web               Characters  Voices  Languages  Time  Download  Free  Speed
ttsmp3            3,000       61      28         no    yes       yes   no
fromtexttospeech  50,000      17      8          yes   yes       yes   yes
ibm               5,000       40      13         yes   no        yes   yes
fakeyou           1,000       1,385   5          yes   yes       yes   yes

iOS/Android         Characters  Voices  Languages  Time  Download  Free    Speed
MOTOREAD            2,500       1       9          no    no        yes/no  no
Voice Dream Reader  5,000       15      11         no    no        no      yes
Voice Aloud Reader  5,500       2       15         yes   no        yes     no
Speech Central      3,000       2       5          no    no        yes/no  yes

• More words take more time to process; fewer words take less.
• Most apps provide voices of both genders.
• Some words are mispronounced, for example:
  o Bowser is pronounced "Boh-zer"
  o Calaway is pronounced "cal-uh-wah-ee purk"
  o Cheespider is pronounced "cheese-pitter"

Scope Management / WBS:

Improvements Management:
Application improvement relies on users' feedback, and the next version will be released with enhanced features. We welcome all feedback, positive or negative, as it will help improve the app.

Schedule / Time Management / Milestones:

Gantt Chart:

Risk and Issue Management:

Information of Budget:
The project uses open-source software (Python) and a Google Colab account, so there is no cost to build the project.

Project Requirements:

Use Case Diagram:

Sequence Diagram:

• Sequence Diagram for Sign Up

• Sequence Diagram for Login

• Sequence Diagram for Manage Account

• Sequence Diagram for Voice Clone

• Sequence Diagram for Create Voice

• Sequence Diagram for Manage Voice

User Interface:

References:
[1]. Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, and Julian McAuley. Expressive Neural Voice Cloning. Proceedings of Machine Learning Research 157, 2021. University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093.
[2]. Matthew P. Aylett and Junichi Yamagishi. Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning. Centre for Speech Technology Research, University of Edinburgh, U.K., and Cereproc Ltd., U.K.
[3]. Boris M. Lobanov and Helena B. Karnevskaya. TTS-Synthesizer as a Computer Means for Personal Voice "Cloning". Institute of Engineering Cybernetics, National Academy of Sciences of Belarus, Minsk, and Minsk State Linguistic University.
[4]. Mark Y. Liberman and Kenneth W. Church. Text Analysis and Word Pronunciation in Text-to-Speech Synthesis. AT&T Bell Laboratories, 600 Mountain Ave., Murray Hill, N.J. 07974.
[5]. K. R. Aida-Zade, C. Ardil, and A. M. Sharifova. The Main Principles of Text-to-Speech Synthesis System.
[6]. The massam.fandom website's list of words the Microsoft speech engines cannot say correctly.
[7]. Eleonora Cavalcante Albano and Agnaldo Antonio Moreira. Archisegment-Based Letter-to-Phone Conversion for Concatenative Speech Synthesis in Portuguese. LAFAPE-IEL-UNICAMP, Campinas, SP, Brazil.
[8]. Simon King. An Introduction to Statistical Parametric Speech Synthesis. Sādhanā, Vol. 36, Part 5, October 2011, pp. 837–852. Indian Academy of Sciences. Centre for Speech Technology Research, University of Edinburgh.
[9]. Merlijn Blaauw, Jordi Bonada, and Ryunosuke Daido. Data-Efficient Voice Cloning for Neural Singing Synthesis. Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain, and Sound Processing Group, Yamaha Corporation, Hamamatsu, Japan.
