
A

Project Report
On

“SPEECH TO EMOTION RECOGNITION”

Submitted in partial fulfillment of
the requirements for the 8th Semester Sessional Examination of

BACHELOR OF TECHNOLOGY
IN

COMPUTER SCIENCE & ENGINEERING

By
D SHIVA SATWIK
20UG010391

Under the esteemed guidance of

Mr. Sitanshu Kar
Dept. of CSE

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


GANDHI INSTITUTE OF ENGINEERING AND TECHNOLOGY
GUNUPUR – 765022
2023 - 24

GIET UNIVERSITY, GUNUPUR
School of Engineering and Technology
Department of Computer Science & Engineering
Approved by Govt. of Odisha

CERTIFICATE

This is to certify that the project work entitled “SPEECH TO EMOTION RECOGNITION” is done by D SHIVA SATWIK, Regd. No. 20UG010391, in partial fulfilment of the requirements for the 8th Semester Sessional Examination of Bachelor of Technology in Computer Science and Engineering during the academic year 2023-24. This work is submitted to the department as a part of the evaluation of the 8th Semester Project.

Mr. Sitanshu Kar
Project Supervisor

Dr. K. Murali Gopal
HoD, CSE

ABSTRACT

Communication is the key to expressing one's thoughts and ideas clearly. Amongst all forms of communication, speech is the most preferred and powerful form of communication among humans. The era of the Internet of Things (IoT) is rapidly advancing, making more intelligent systems available for everyday use. These applications range from simple wearables and widgets to complex self-driving vehicles and automated systems employed in various fields. Intelligent applications are interactive, require minimum user effort to function, and mostly operate on voice-based input.

This creates the necessity for these computer applications to completely comprehend human speech. A speech percept can reveal information about the speaker, including gender, age, language, and emotion. Several existing speech recognition systems used in IoT applications are integrated with an emotion detection system in order to analyze the emotional state of the speaker. The performance of the emotion detection system can greatly influence the overall performance of the IoT application in many ways and can provide many advantages over the functionalities of these applications.

This research presents a speech emotion detection system with improvements over an existing system in terms of data, feature selection, and methodology, aiming to classify speech percepts based on emotion more accurately.

CONTENTS

1. Introduction
2. System Analysis
3. Methodology
4. DFD Diagram
5. Modules
6. Dataset
7. Feature Extraction
8. Algorithms
9. Classification Report
10. System Design
11. Analysis
12. Coding
13. Conclusion
14. References


1. INTRODUCTION
Speech emotion recognition is the task of predicting a human's emotion from their speech, together with a measure of the prediction's accuracy. It enables better human-computer interaction. Although it is difficult to predict a person's emotion, since emotions are subjective and annotating audio is challenging, “Speech Emotion Recognition (SER)” makes this possible.

Animals such as dogs, elephants, and horses use the same cues to understand human emotion. Various cues can be used to predict a person's emotion, including tone, pitch, expression, and behavior. Speech-based analysis typically spans three related tasks:

• Speaker Identification

• Speech Recognition

• Speech Emotion Detection

1.1. PURPOSE
• The primary objective of SER is to improve the man-machine interface.

• It can also be used to monitor the psychophysiological state of a person in lie detectors.

• In recent times, speech emotion recognition has also found applications in medicine and forensics.

1.2. PROJECT SCOPE

This project covers the end-to-end pipeline of a speech emotion recognition system: collecting audio data from the RAVDESS dataset, extracting emotion-relevant features from the speech signal, training machine learning classifiers, and evaluating and comparing their performance to identify the most accurate model.

1.3. EXISTING SYSTEM

The existing speech emotion detection system is implemented as a Machine Learning (ML) model. The steps of implementation are comparable to any other ML project, with additional fine-tuning procedures to make the model function better: data collection, feature engineering, model building, and evaluation. This workflow, summarized in the flowchart (see Figure 1), is described in detail in Section 3 (Methodology).

1.4. PROPOSED SYSTEM

In this study, we present an automatic speech emotion recognition (SER) system that uses machine learning algorithms to classify emotions. The performance of the emotion detection system can greatly influence the overall performance of the application in many ways and can provide many advantages over the functionalities of these applications. This research presents a speech emotion detection system with improvements over an existing system in terms of data, feature selection, and methodology, aiming to classify speech percepts based on emotion more accurately.


2. SYSTEM ANALYSIS
2.1. HARDWARE REQUIREMENTS

Processor Brand : Intel

Processor Type : Core i3

Processor Speed : 2 GHz

Processor Count : 1

RAM Size : 8 GB

Memory Technology : DDR4

Computer Memory Type : DDR4 SDRAM

Hard Drive Size : 160 GB

2.2. SOFTWARE REQUIREMENTS

Operating System : Windows 10 or 11

Application Server : Jupyter Notebook

Frontend : Machine learning using Python

Dataset : RAVDESS dataset


3. METHODOLOGY
The speech emotion detection system is implemented as a Machine Learning (ML) model. The steps of implementation are comparable to any other ML project, with additional fine-tuning procedures to make the model function better. The flowchart represents a pictorial overview of the process (see Figure 1). The first step is data collection, which is of prime importance: the model being developed learns from the data provided to it, and all the decisions and results that the developed model produces are guided by the data. The second step, called feature engineering, is a collection of several machine learning tasks that are executed over the collected data. These procedures address several data representation and data quality issues. The third step is often considered the core of an ML project: an algorithm-based model is developed. This model uses an ML algorithm to learn about the data and train itself to respond to any new data it is exposed to. The final step is to evaluate the functioning of the built model. Very often, developers repeat the steps of developing a model and evaluating it to compare the performance of different algorithms. The comparison results help to choose the ML algorithm most relevant to the problem.
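
To make these four steps concrete, the following is a minimal sketch of the workflow in Python with scikit-learn. The `load_dataset` helper is hypothetical; it stands in for the data collection and feature engineering steps detailed later in this report.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Steps 1 + 2: data collection and feature engineering. `load_dataset` is
# a hypothetical helper returning one feature vector and one emotion
# label per audio file (see the Dataset and Feature Extraction sections).
X, y = load_dataset("ravdess/")

# Hold out a test set so the evaluation uses unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Step 3: model building -- train a classifier on the training split.
model = MLPClassifier(max_iter=500)
model.fit(X_train, y_train)

# Step 4: evaluation -- compare predictions against the held-out labels.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```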


4. DFD DIAGRAM

The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system. A DFD shows how information moves through the system and how it is modified by a series of transformations; it is a graphical technique that depicts information flow and the transformations applied as data moves from input to output. A DFD may be used to represent a system at any level of abstraction and may be partitioned into levels that represent increasing information flow and functional detail.

Figure 3.1: Flow of implementation


5. MODULES
• Speech input Module

• Feature extraction and selection

• Classification

• Recognized emotional output

5.1. MODULE DESCRIPTION


• Speech input Module: The input to the system is speech captured as audio. An equivalent digital representation of the received audio is then produced using a sound file library.

• Feature extraction and selection: There are many emotional states, and emotion relevance is used to select the extracted speech features. The whole procedure, from extracting speech features to selecting those corresponding to emotions, revolves around the speech signal.

• Classification Module: Finding a set of significant emotions for classification is the main concern in a speech emotion recognition system. A typical set of emotions contains various emotional states, which makes classification a complicated task.

• Recognized emotional output: Fear, surprise, anger, joy, disgust, and sadness are the primary emotions, and the naturalness level of the database is the basis for evaluating a speech emotion recognition system.

6. DATASET

6.1. RAVDESS DATASET
(The Ryerson Audio-Visual Database of Emotional Speech and Song)

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database contains recordings of 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16-bit, 48 kHz .wav), Audio-Video (720p H.264, AAC 48 kHz, .mp4), and Video-only (no sound). Note that there are no song files for Actor_18.

• The size of the dataset is large enough for the model to be trained effectively; the more data a model is exposed to, the better it performs.

• All basic emotional categories of data are present. Combinations of these emotions can be used for further research, such as sarcasm and depression detection.

• Data is collected from two different age groups, which will improve the classification.

• The audio files are mono signals, which ensures an error-free conversion with most programming libraries.
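
RAVDESS encodes its labels in the file name itself: each name consists of seven hyphen-separated numeric fields, the third of which is the emotion code. A minimal sketch of turning a file name into an emotion label, following the dataset's documented naming convention:

```python
# RAVDESS emotion codes, per the dataset's file-naming documentation.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(name: str) -> str:
    """The third hyphen-separated field encodes the emotion."""
    return EMOTIONS[name.split("-")[2]]

print(emotion_from_filename("03-01-05-01-02-01-12.wav"))  # angry
```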


7. FEATURE EXTRACTION

7.1. THE PROCESS:

Speech is a varying sound signal. Humans are capable of modifying the sound signal using their vocal tract, tongue, and teeth to pronounce phonemes. Features are a way to quantify data. A better representation of the speech signal, one that extracts the most information from the speech, is obtained by extracting features common among speech signals. Some characteristics of good features include [14]:

• The features should be independent of each other. Most features in a feature vector are correlated with each other; therefore, it is crucial to select a subset of features that are mutually independent.

• The features should be informative in context. Only those features that are more descriptive of the emotional content are to be selected for further analysis.

• The features should be consistent across all data samples. Features that are unique and specific to certain data samples should be avoided.

• The values of the features should be processed. The initial feature selection process can result in a raw feature vector that is unmanageable. The process of feature engineering removes outliers, missing values, and null values.

The features in a speech percept that are relevant to the emotional content can be grouped into two main categories:

• Prosodic features

• Phonetic features.

The prosodic features are energy, pitch, tempo, loudness, formant, and intensity. The phonetic features are mostly related to the pronunciation of words in a given language. Therefore, for the purpose of emotion detection, the analysis is performed on the prosodic features or a combination of them. Pitch and loudness are the features most relevant to the emotional content.


7.2. MEL FREQUENCY CEPSTRUM COEFFICIENTS (MFCC) FEATURES

A subset of features that are used for speech emotion detection is grouped under a category called
the Mel Frequency Cepstrum Coefficients (MFCC) [16]. It can be explained as follows:

• The word Mel represents the scale used in frequency vs. pitch measurement (see Figure 2) [16]. A value f measured on the frequency scale (Hz) can be converted into the Mel scale using the formula m = 2595 · log10(1 + f/700).

• The word Cepstrum represents the Fourier transform of the log spectrum of the speech signal.
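
As an illustration, both the Mel conversion and the MFCC computation are available in the librosa library; a minimal sketch, assuming a mono RAVDESS-style .wav file (the file name is illustrative):

```python
import librosa
import numpy as np

# The Mel formula above, m = 2595 * log10(1 + f/700), is what librosa
# computes with htk=True.
print(librosa.hz_to_mel(1000, htk=True))  # ~1000 Mel for f = 1000 Hz

# Load a mono signal and compute 40 MFCCs per frame, then average over
# time to obtain one fixed-length feature vector per utterance.
y, sr = librosa.load("03-01-05-01-02-01-12.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
print(np.mean(mfcc, axis=1).shape)  # (40,)
```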


8. ALGORITHMS
8.1. MLP CLASSIFIER

MLPClassifier stands for Multi-layer Perceptron classifier, a name that itself points to a neural network. Unlike other classification algorithms such as Support Vector Machines or the Naive Bayes classifier, MLPClassifier relies on an underlying neural network to perform the task of classification.

We will use the confusion matrix to determine the accuracy which is measured as the total
number of correct predictions divided by the total number of predictions.

A multi-layer rather than a single-layer network is required because a single-layer perceptron (SLP) can only compute a linear decision boundary, which is not flexible enough for most realistic learning problems. For a problem that is linearly separable (that is, capable of being perfectly separated by a linear decision boundary), the perceptron convergence theorem guarantees convergence. In its simplest form, SLP training is based on the simple idea of adding or subtracting a pattern from the current weights when the target and predicted classes disagree; otherwise, the weights are unchanged. For a non-linearly separable problem, this simple algorithm can go on cycling indefinitely.

The modification known as least mean square (LMS) algorithm uses a mean squared error
cost function to overcome this difficulty, but since there is only a single perceptron, the decision
boundary is still linear. An MLP is a universal approximator [6] that typically uses the same squared
error function as LMS.

However, the main difficulty with the MLP is that the learning algorithm has a complex error surface, which can become stuck in local minima. Unlike the SLP case, no MLP learning algorithm is guaranteed to converge. The popular MLP back-propagation algorithm has two phases: the first is a forward pass, a forward simulation for the current training pattern that enables the error to be calculated; it is followed by a backward pass, which calculates for each weight in the network how a small change will affect the error function.

The derivative calculation is based on the application of the chain rule, and training typically
proceeds by changing the weights proportional to the derivative.
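
A minimal training sketch with scikit-learn, assuming the feature matrix and train/test split built as in the Methodology section; the hyperparameter values shown are illustrative, not the report's exact settings:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

# One hidden layer of 300 units; alpha is the L2 regularization strength.
mlp = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01,
                    batch_size=256, max_iter=500, random_state=42)
mlp.fit(X_train, y_train)

# Accuracy = correct predictions / total predictions, i.e. the sum of the
# confusion matrix diagonal divided by the sum of all entries.
cm = confusion_matrix(y_test, mlp.predict(X_test))
print("Accuracy:", cm.trace() / cm.sum())
```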


Fig 6.1 MLP Classifier Confusion Matrix.


8.2. XGBOOST CLASSIFIER

• XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework.

• XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

• We will use the confusion matrix to determine the accuracy which is measured as the total
number of correct predictions divided by the total number of predictions.
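
A minimal sketch with the xgboost Python package, assuming the same splits; recent XGBoost versions require integer-encoded class labels, hence the LabelEncoder:

```python
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Encode emotion strings as integers for XGBoost.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)

xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, random_state=42)
xgb.fit(X_train, y_train_enc)

# Decode predictions back to emotion names before scoring.
pred = le.inverse_transform(xgb.predict(X_test))
print("Accuracy:", accuracy_score(y_test, pred))
```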

Fig 6.2 XGBOOST Classifier Confusion Matrix.


8.3. LGBM CLASSIFIER


• LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. Another reason Light GBM is so popular is that it focuses on the accuracy of results. LGBM also supports GPU learning, and data scientists are therefore widely using LGBM for data science application development.

• We will use the confusion matrix to determine the accuracy which is measured as the total
number of correct predictions divided by the total number of predictions.
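
A minimal sketch with the lightgbm package, under the same assumptions as the previous classifiers:

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# LightGBM grows trees leaf-wise, which is the source of its speed.
lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.1, random_state=42)
lgbm.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, lgbm.predict(X_test)))
```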


Fig 6.3 LGBM Classifier Confusion Matrix.

8.4. RANDOMFOREST CLASSIFIER

Random Forest is a classifier that contains a number of decision trees on various subsets of
the given dataset and takes the average to improve the predictive accuracy of that dataset. Random
Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It
can be used for both Classification and Regression problems in ML. It is based on the concept of
ensemble learning, which is a process of combining multiple classifiers to solve a complex problem
and to improve the performance of the model.

The term “Random Forest Classifier” refers to the classification algorithm made up of several decision trees. The algorithm uses randomness to build each individual tree to promote uncorrelated forests, and then uses the forest's predictive powers to make accurate decisions.

Random forest classifiers fall under the broad umbrella of ensemble-based learning methods. They are simple to implement, fast in operation, and have proven to be extremely successful in a variety of domains. The key principle underlying the random forest approach is the construction of many “simple” decision trees in the training stage and a majority vote (mode) across them in the classification stage. Among other benefits, this voting strategy corrects for the undesirable tendency of decision trees to overfit training data. In the training stage, random forests apply the general technique known as bagging to individual trees in the ensemble. Bagging repeatedly selects a random sample with replacement from the training set and fits trees to these samples. Each tree is grown without any pruning. The number of trees in the ensemble is a free parameter which is readily learned automatically using the so-called out-of-bag error.

We will use the confusion matrix to determine the accuracy which is measured as the total number
of correct predictions divided by the total number of predictions.

Much like naïve Bayes– and k-nearest neighbor–based algorithms, random forests are popular in part due to their simplicity on the one hand, and generally good performance on the other. However, unlike the former two approaches, random forests exhibit a degree of unpredictability as regards the structure of the final trained model. This is an inherent consequence of the stochastic nature of tree building. One of the key reasons why this characteristic of random forests can be a problem is regulatory: clinical adoption often demands a high degree of repeatability, not only in terms of the ultimate performance of an algorithm but also in terms of the mechanics of how a specific decision is made.
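
A minimal sketch with scikit-learn under the same assumptions; setting `oob_score=True` exposes the out-of-bag error mentioned above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Bag 300 unpruned trees and classify by majority vote across them.
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=42)
rf.fit(X_train, y_train)

print("Out-of-bag score:", rf.oob_score_)
print("Test accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```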


Fig 6.4 RandomForest Classifier Confusion Matrix.


8.5. KNN CLASSIFIER

K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique. It is simple to implement, robust to noisy training data, and can be more effective when the training data is large.

The concept of the k-nearest neighbor classifier can hardly be described more simply than by the old saying “Tell me who your friends are and I will tell you who you are,” which can be found in many languages and cultures. The idea is part of our everyday judgment: imagine you meet a group of people who are all very young, stylish, and sporty. They talk about their friend Ben, who isn't with them. What is your mental image of Ben? Right, you imagine him as being young, stylish, and sporty as well. If you then learn that Ben lives in a neighborhood where people vote conservative and the average income is above 200,000 dollars a year, and that both his neighbors make even more than 300,000 dollars per year, what do you think of Ben? Most probably, you do not consider him an underdog, and you may suspect him to be a conservative as well.

The principle behind nearest neighbor classification consists of finding a predefined number k of training samples closest in distance to a new sample, which has to be classified. The label of the new sample is determined from these neighbors. k-nearest neighbor classifiers have a fixed, user-defined constant for the number of neighbors to be considered. There are also radius-based neighbor learning algorithms, which have a varying number of neighbors based on the local density of points: all the samples inside a fixed radius. The distance can, in general, be any metric measure; the standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply “remember” all of their training data. Classification is computed by a majority vote of the nearest neighbors of the unknown sample.

The k-NN algorithm is among the simplest of all machine learning algorithms, but despite its simplicity, it has been quite successful in a large number of classification and regression problems, for example character recognition or image analysis.

k-NN is a type of instance-based learning, or lazy learning. In machine learning, lazy learning is understood to be a learning method in which generalization of the training data is delayed until a query is made to the system. Eager learning, in contrast, generalizes the training data before receiving queries. In other words, the function is only approximated locally, and all computations are deferred until the actual classification is performed.

We will use the confusion matrix to determine the accuracy which is measured as the total
number of correct predictions divided by the total number of predictions.
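
A minimal sketch with scikit-learn under the same assumptions; k is the user-defined neighbor count discussed above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Lazy learner: fit() just stores the data; distances are computed at
# prediction time, with a majority vote among the 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```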

Fig 6.5 KNN Classifier Confusion Matrix


9. CLASSIFICATION REPORT

9.1. MLP CLASSIFIER

A classification report is used to measure the quality of predictions from a classification algorithm: how many predictions are true and how many are false. More specifically, true positives, false positives, true negatives, and false negatives are used to compute the metrics of the classification report.
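
scikit-learn produces this report directly from those four counts; a minimal sketch, reusing the trained MLP model sketched in Section 8.1:

```python
from sklearn.metrics import classification_report

# Per-emotion precision, recall, F1-score, and support.
print(classification_report(y_test, mlp.predict(X_test)))
```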

Table 7.1 MLPClassifier Classification Report


9.2. XGBOOST CLASSIFIER

The table below reports these metrics for the XGBoost classifier.

Table 7.2 XGBClassifier Classification Report


9.3. LGBM CLASSIFIER

The table below reports these metrics for the LGBM classifier.

Table 7.3 LGBMClassifier Classification Report


9.4. RANDOMFOREST CLASSIFIER

The table below reports these metrics for the Random Forest classifier.

Table 7.4 Random Forest Classifier Classification Report


9.5. KNN CLASSIFIER

The table below reports these metrics for the KNN classifier.

Table 7.5 KNN Classifier Classification Report


10. SYSTEM DESIGN


10.1. INPUT DESIGN

The input design is the link between the information system and the user. It comprises developing the specifications and procedures for data preparation: the steps necessary to put transaction data into a usable form for processing. This can be achieved by having the computer read data from a written or printed document, or by having people key the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed to provide security and ease of use while retaining privacy. Input design considered the following questions: What data should be given as input? How should the data be arranged or coded? What dialog should guide the operating personnel in providing input? What methods should be used for preparing input validations, and what steps should follow when errors occur?

10.2. OUTPUT DESIGN

A quality output is one which meets the requirements of the end user and presents the information clearly. In any system, the results of processing are communicated to the users and to other systems through outputs. In output design, it is determined how the information is to be displayed for immediate need, as well as the hard copy output. It is the most important and direct source of information for the user. Efficient and intelligent output design improves the system's relationship with the user and helps decision-making. The output of an information system should accomplish one or more of the following objectives: convey information about past activities, current status, or projections of the future; signal important events, opportunities, problems, or warnings; trigger an action; or confirm an action.


11. ANALYSIS
11.1. FINAL REPORT

After analyzing these five classifiers for speech emotion recognition, we find that the MLP classifier gives the highest accuracy. The XGBoost and LGBM classifiers give nearly as high accuracy, as shown in the bar chart below.

Figure 8.1: Final Report Analysis


12. CODING

Loading Libraries:
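
A minimal sketch of the imports such a notebook would need, assuming the libraries referenced throughout this report (librosa and soundfile for audio, scikit-learn for modeling):

```python
import glob
import os

import numpy as np
import librosa      # audio loading and feature extraction
import soundfile    # reading .wav files
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report
```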

Loading an Audio File:
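
A minimal sketch, assuming a RAVDESS .wav file on disk (the path is illustrative); soundfile returns the raw samples and the sampling rate:

```python
import soundfile

# Read the mono signal and its sampling rate (48 kHz in RAVDESS).
with soundfile.SoundFile("Actor_01/03-01-01-01-01-01-01.wav") as f:
    signal = f.read(dtype="float32")
    sample_rate = f.samplerate

print(signal.shape, sample_rate)
```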


Feature Preprocessing:
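
The exact preprocessing used in the notebook is not reproduced here; a plausible minimal sketch is standardizing the feature vectors so that no single feature dominates the classifiers (an assumption, not the report's exact code):

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on training features only, then apply to both splits,
# so no information from the test set leaks into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```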


Feature Extraction
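
A minimal sketch of an `extract_feature` helper in the style of the data-flair tutorial cited in the References, combining MFCC, chroma, and mel features; the exact feature set used in the notebook is an assumption:

```python
import librosa
import numpy as np

def extract_feature(file_name):
    """Return one vector per file: 40 MFCCs + 12 chroma + 128 mel bands."""
    y, sr = librosa.load(file_name, sr=None)
    stft = np.abs(librosa.stft(y))
    mfccs = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.hstack([mfccs, chroma, mel])
```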


Evaluation
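
A minimal evaluation sketch, assuming `model` is any of the classifiers trained in Section 8 and the splits from the Methodology sketch:

```python
from sklearn.metrics import accuracy_score, classification_report

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
```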


Testing
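
A minimal sketch of classifying one new recording, assuming the `extract_feature` helper above and a trained `model` (the file name is illustrative):

```python
# Build the feature vector and reshape to the (1, n_features) form
# scikit-learn expects for a single sample.
features = extract_feature("some_new_recording.wav").reshape(1, -1)
print("Predicted emotion:", model.predict(features)[0])
```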


13. CONCLUSION
The emerging growth and development in the fields of AI and machine learning have led to a new era of automation. Most of these automated devices work based on voice commands from the user. Many advantages can be built over existing systems if, besides recognizing the words, the machines could comprehend the emotion of the speaker (user). Some applications of a speech emotion detection system are computer-based tutorial applications, automated call center conversations, diagnostic tools used for therapy, and automatic translation systems.

In this report, the steps of building a speech emotion detection system were discussed in detail, and some experiments were carried out to understand the impact of each step. Initially, the limited number of publicly available speech databases made it challenging to implement a well-trained model. Next, several novel approaches to feature extraction had been proposed in earlier works, and selecting the best approach required performing many experiments. Finally, the classifier selection involved learning about the strengths and weaknesses of each classifying algorithm with respect to emotion recognition. At the end of the experimentation, it can be concluded that an integrated feature space produces a better recognition rate than a single feature.


14. REFERENCES

• Code for Interview, YouTube channel.

• Soegaard, M. and Friis Dam, R. (2013). The Encyclopedia of Human-Computer Interaction, 2nd ed.

• Deitel, P. J. and Deitel, H. M. Internet & World Wide Web: How to Program.

• Nwe, T. L., Foo, S. W., and De Silva, L. C., “Speech emotion recognition using hidden Markov models,” Speech Communication, vol. 41, no. 4, pp. 603–623, Nov. 2003.

• www.data-flair.training.com

• www.researchgate.net
