3

Computer Methods and Programs in Biomedicine 215 (2022) 106646
Contents lists available at ScienceDirect
Computer Methods and Programs in Biomedicine

journal homepage: www.elsevier.com/locate/cmpb
Automated emotion recognition: Current trends and future

perspectives
M. Maithri a, U. Raghavendra b, Anjan Gudigar b,∗, Jyothi Samanth c, Prabal Datta Barua d,e,
Murugappan Murugappan f, Yashas Chakole b, U. Rajendra Acharya g,h,i
a
Department of Mechatronics, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal 576104, India
b
Department of Instrumentation and Control Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal 576104, India
c
Department of Cardiovascular Technology, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India
d
School of Management & Enterprise, University of Southern Queensland, Toowoomba, QLD 4350, Australia
e
Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW 2007, Australia
f
Department of Electronics and Communication Engineering, Kuwait College of Science and Technology, 13133, Kuwait
g
School of Engineering, Ngee Ann Polytechnic, Clementi 599489, Singapore
h
Department of Biomedical Informatics and Medical Engineering, Asia University, Taichung, Taiwan
i
Department of Biomedical Engineering, School of Science and Technology, SUSS University, Singapore
a r t i c l e i n f o a b s t r a c t
Article history: Background: Human emotions greatly affect the actions of a person. The automated emotion recognition
Received 8 August 2021 has applications in multiple domains such as health care, e-learning, surveillance, etc. The development
Revised 25 December 2021
of computer-aided diagnosis (CAD) tools has led to the automated recognition of human emotions.
Accepted 16 January 2022
Objective: This review paper provides an insight into various methods employed using electroencephalo-
gram (EEG), facial, and speech signals coupled with multi-modal emotion recognition techniques. In this
Keywords: work, we have reviewed most of the state-of-the-art papers published on this topic.
Human emotions Method: This study was carried out by considering the various emotion recognition (ER) models pro-
Electroencephalogram (EEG) posed between 2016 and 2021. The papers were analysed based on methods employed, classifier used
CAD
and performance obtained.
Machine learning
Results: There is a significant rise in the application of deep learning techniques for ER. They have been
Facial
Voice widely applied for EEG, speech, facial expression, and multimodal features to develop an accurate ER
model.
Conclusion: Our study reveals that most of the proposed machine and deep learning-based systems have
yielded good performances for automated ER in a controlled environment. However, there is a need to
obtain high performance for ER even in an uncontrolled environment.
© 2022 Elsevier B.V. All rights reserved.
1. Introduction system signals originating in the human physiological system for

each emotion [2].
Expression of emotion is an important aspect of a human’s life. The conceptualization of emotion was evolved decades ago [3],
Expression portrays perception about the incidents, human inter- however there is no standard thought recognized by the psychol-
action, decision-making, and intelligence. Emotion expresses the ogists. Various researchers portrayed the classification of emotion
psycho-physiological and psychological status of a human being into two methods: (i)segregated the emotion into discrete cate-
[1]. Human emotional state has an important effect on psychologi- gories [4], and (ii) multiple dimensions are used to identify emo-
cal health which is the cause of depression in many. Recognition tions [5]. Broad spectrum of emotions are classified as positive and
of these emotions and their classification is possible by various negative emotions where extreme negative emotions impact hu-
modalities such as electroencephalography (EEG), gestures, facial man life adversely [6].
expressions, speech pattern, etc. Physiological signals like are heart While describing emotions, participants would be provoked for
rate (HR), and skin conductance (SC) depict autonomous nervous demonstrating their emotional expression by the visual, aural, au-
diovisua, and emotional recall stimulations by using depictions,
music, movies, recalling past events, etc. More recently use of com-
∗
Corresponding author. puter games, 3D videos, and creating real environments virtually
E-mail address: anjan.gudigar@manipal.edu (A. Gudigar).
https://doi.org/10.1016/j.cmpb.2022.106646
0169-2607/© 2022 Elsevier B.V. All rights reserved.
M. Maithri, U. Raghavendra, A. Gudigar et al. Computer Methods and Programs in Biomedicine 215 (2022) 106646
Fig. 1. 2-D model for valence – arousal.
are gaining more research interests presenting augmented real- human and machine there is a need for the computer to under-
ity and mixed reality environments [7,8,9]. Generally emotion is stand the human emotion properly so that it can respond correctly
classified as happy, fear, sad, anger, surprise and disgust based on to a given situation [15].
valence-arousal plane- a two dimensional plane model [10]. These There are many ways by which the emotion can be analysed.
expressions are broadly categorized as positive (happy, surprise) Still images of facial expressions are mostly used for ER. How-
and negative emotions (sadness, anger, fear, and disgust) with re- ever, they suffer from limitations such as environmental factors
spect to the human’s state of mind [11]. Fig. 1 depicts the 2-D (light intensity, distance between user to the camera, background
model for valence –arousal [12]. Besides, the emotion can be mod- changes) and it is highly challenging to estimate the hidden or
elled in three dimensions such as valence-arousal-dominance. The unexpressed emotions. Emotion analysis using biosensors such as
dominance state is used to investigate the degree of control ex- EEGs help in understanding the emotions by directly capturing and
erted by a stimulus. analysing the brain electrical activities. But they suffer from noise
Further it is classified as primary emotions reflecting joy, anger, while capturing the signals [16]. Speech ER is also gaining popular-
sadness, fear, surprise, disgust and secondary emotions that show ity. However, the existing speech datasets are small [17]and most
a mental image associated with memory or primary emotions [2]. commonly using a single language. In multimodal-based ER, more
Multidimensional classification of emotion is possible by the scales than one (most commonly two) modes are combined to improve
derived based on arousal and valence state [11]. Arousal dimension the performance of ER and develop more robust models for real-
ranges from not-aroused to excited state whereas the valence di- time applications. A multimodal approach using the combination
mension is demonstrated on whether positive or negative emotion of facial and speech information are gaining popularity due to its
[2,13,11]. applications in health care sector and human computer interaction
Understanding human emotions is always fascinating. Emotions [18]. The other commonly used multimodal combinations are fa-
play a vital role in decision-making in our life. In some circum- cial expressions and EEG signals, EEG – heart rate – Galvanic skin
stances, human beings can restrain themselves or may not be able response [19], Text -audio -video [20]. Many deep learning (DL)
to express their true feelings. A mentally or physically disabled and machine learning (ML) based approaches have been proposed
person may not be able to show his/her true feelings during treat- for automatic recognition of the human emotions in healthy and
ment at the hospital. An automated way of gathering human emo- pathological people.
tions through a computer aided diagnosis (CAD) tool plays a vi- In the world, approximately 280 million people are suffering
tal role in such a situation. It can play a vibrant role in national from depression. Around 3.8% of the world population is affected
defense during the training of soldiers in simulated environments. by depression out of which 5% are adults and among adults,
The tool can be used to assess the mental conditions in combat sit- 5.7% are people aged more than 60 years. Over 70 0,0 0 0 peo-
uations [14]. The introduction of internet of things (IoT) has given ple die due to suicide every year. The suicide rate has increased
more importance to automated human emotion recognition (ER) and is the 4th major cause of death in the age group between
in smart homes, smart hospitals, and smart cities. During the era 15 and 29 years (https://www.who.int/en/news-room/fact-sheets/
of smart human-computer interaction (HCI) it is very important detail/depression). In India, there are around 57 million people are
to have computational tools which can recognize human emotions suffering from depression [21]. The depressed person has a de-
automatically. Also with the increase in the interaction between pressed mood, feels sad, irritated, and lost interest in activities.
2
Depression is most commonly seen in students. A study on depres- as multimodal, multi-modal, emotion recognition, emotion classifi-
sion among Indian universities found that 37.7%, 13.1%, and 2.4% of cation. As a result of this search total of 214 articles were obtained.
the students were suffering from moderate, severe, and extremely The reviewing of reference sections of relevant articles yielded 131
severe depressions, respectively. This study stresses the need for articles. Certain criteria were set for article selection and hence,
mental health support services to these students immediately [22]. finally, 290 articles were considered for review purpose. We fol-
The automated system can help clinicians better understand the lowed the PRISMA guidelines while preparing this systematic re-
student’s mental health. It helps in the early detection and treat- view [23].
ment of mental illness.
To develop an automated system, it is important to train the 2.2. Selection of the studies for the current review
system using a database. A pool of stimuli normally a video will
be selected by researchers who intend to develop a dataset. The The following criteria were considered to select the papers for
selected datasets were made to watch by volunteers and assess this review:
the video content in terms of positive/negative/neutral emotions. A
- If it provided a method for recognition or classification of emo-
manikin-based or score-based self assessment sheet will be shared
tions based on EEG or facial or voice modality or multimodal
with the volunteers and informed to assess the emotions felt while
study on deep learning
watching the videos. An effort will be made to gather balanced rat-
- If it provided a method for recognition or classification of emo-
ings for each video and the top-rated video for particular emotion
tions based on EEG or facial or voice modality or multimodal
is considered. This forms the ground truth. During the data captur-
study on machine learning.
ing stage, a volunteer will be made to wear an EEG signal captur-
- Studies based on any other technique for the recognition of
ing device to collect the EEG signals.
emotions.
- Papers in English language only
2. Review objective and collection of articles
Following publications are considered irrelevant to our review
Humans display numerous emotions as a reaction to any ac- paper:
tion that occurred within him/her, or to external stimulations from
- System on chip/Field Programmable Gate Array (FPGA)-related
other human beings, nature, or even computers. Due to the ad-
articles.
vancement of HCI, it is not only sufficient if a computer is giving
- Review articles
standard reactions for any verbal words or actions. It is also im-
portant for the computer to understand the emotion behind a par- All the relevant articles were initially segregated into 4 folders,
ticular action so that it can respond accordingly. Humans can de- one for each modality and one for multimodal. While segregating,
pict emotions verbally or through facial expression and even there publications were renamed with first author and year for further
can be unexpressed internal emotions that can be captured by EEG analysis. Going through the abstract, they were sorted into deep
signals. This review is conducted to analyze three main modalities learning, machine learning, and other techniques. At every stage of
namely facial, EEG, and voice involved in human ER. Further, the analysis articles which do not meet the above mentioned criteria
objective of this review are as follows: were excluded. The overview of the process is depicted in Fig. 2.
• Selecting studies related to EEG or voice or facial expression- 2.3. Data extraction
based human ER and analysis of the performance using ML and
DL techniques. On further reading of the individual study, the following infor-
• Selecting studies related to multi-modal human ER and analysis mation was extracted and arranged in an excel file.
of the performance using ML and DL techniques. First author name, year of publication, title, proposed method,
• Analysis of the various features used for experimentation under classifiers, results, dataset details [24,25].
each modality This review is organized as follows: Section 2 provides details
about the collection and selection process of articles considered in
2.1. Plan of action for search and selection of articles this review. Section 3 presents the highlight of different techniques
involved for ER in each modality. Section 4 discusses the combina-
In this work, we used popular research publication databases tion of different modalities used for automated ER. The discussion
such as IEEE Xplore, Science Direct, Springer, Google Scholar, and on various search results obtained, limitations of the study, and fu-
PubMed to search for the research articles related to this area. ture scope are presented in Section 5 and finally paper concludes
Using the advanced search tab of these databases the publica- in Section 6.
tion year range was set to 2016 to 2021. Later the search was ex-
tended to early 2022. The search was done for EEG, facial, voice- 3. Methodology
based, and multimodal ER systems on the same platform and set-
tings. The search for EEG was initialized using the keyword EEG Understanding of mood or feeling of a human being can be
and emotion. Since the search result had both technical as well done through speech signal, physical activity, gestures, facial ex-
as medical-related publications the search words were fine-tuned pression or even with the help of physiological signals such as
with combinations of words such as CAD, automatic, recognition, EEG, HR, SC, and electrocardiogram (ECG). This section provides an
emotion detection, emotion classification, machine learning, neu- overview of three modalities used for human ER, and it is repre-
ral network, deep learning, etc. Further, for voice-based emotion sented in Fig. 3.
recognition, the search was conducted with words such as voice,
speech, recognition, database CAD, automatic, recognition, emo- 3.1. ER using EEG signals
tion detection, emotion classification, machine learning, neural net-
work, deep learning, etc. Sometimes a person’s inner emotion may be different from the
In the same manner, facial emotion recognition-related articles external expression of emotion. Subjective evaluation of self-report
were collected from the database. The search for studies on mul- of a person’s emotions can provide a better insight but there ex-
timodal emotion recognition was carried out using keywords such ist issues related to its genuineness. There is a possibility that
3
Fig. 2. Flowchart of PRISMA model used to select the studies in this review.
Fig. 3. Overview of CAD system for human ER using three modalities.
4
Fig. 4. Schematic diagram of EEG- based automated ER approach.
such reports may provide, how in general one would feel in each ing a support vector machine (SVM) classifier. In another study,
situation rather than providing the details of their true feelings. both time and frequency domain, as well as entropy-based fea-
There is a chance of manipulation or suppression of true feeling. tures, were extracted from EEG signals. They conducted experi-
So to attain a better accuracy in assessing true emotions physio- ments with SVM, ANN, and naïve Bayes as classifiers on the DEAP
logical signals can help [26]. As a physiological signal, EEG analy- dataset and concluded that ANN yielded an average accuracy of
sis plays a vital role in ER as these signals are generated by the 97.74%. They also found that entropy being the best feature pro-
central nervous system (CNS). The brain electrical activities trig- vided an average accuracy of 90.53% [32]. The analysis on the DEAP
gered by a stimulus can be accurately captured by EEG signal mak- dataset by Liu &Fu reached SROCC:0.789, PLCC:0.843 with SVM be-
ing it an important area of research for EEG-based human ER. ing used for training the emotions [33]. A study on ER decomposed
Some of the more commonly used datasets include database for the EEG signals into intrinsic mode functions (IMF). Second-order
emotion analysis using physiological signals (DEAP) and SJTU Emo- difference plots (SODP) were used to extract the features. They
tion EEG Dataset (SEED), DREAMER, and GAMEEMO. These datasets used SVM and multilayer perceptron (MLP) for classification. MLP
are publicly available. Access can be obtained upon request for showed better performance than SVM and attained 100% accuracy
DEAP dataset which consists of 32 subjects’ recordings where each for the classification of High and Low Arousal [34]. Firat univer-
subject watched 40 music videos. Out of the 32 subjects, frontal sity logo inspired study was proposed by a set of researchers. They
videos are also available for 22 subjects. DEAP dataset consists of termed their model as fractal Firat pattern (FFP). FFP with tun-
ratings for arousal, valence, and dominance. A detailed description able Q-factor wavelet transform (TQWT) model developed for ER
of DEAP dataset is available in (http://www.eecs.qmul.ac.uk/mmv/ attained maximum accuracy of 99.82% with SVM [35].
datasets/deap/, [27]). SEED database consists of recording from 15
subjects watching 15 Chinese film clips for emotion positive, neg-
3.1.2. Deep learning-based studies
ative, and neutral. This session was repeated 3 times for each par-
Various deep learning models such as deep neural network
ticipant. The additional details of this dataset is available in (https:
(DNN), convolutional neural network (CNN), long short-term mem-
//bcmi.sjtu.edu.cn/home/seed/seed.html, [28]). DREAMER has both
ory (LSTM), and hybrid of CNN-LSTM models were tested on DEAP
EEG as well as ECG signals. An audiovisua arousal, dominance, va-
dataset. The investigation concluded that DNN has higher learn-
lence stimuli was given to 23 participants and their respective self-
ing rate compared to other models and it attained optimal con-
assessment (SA) was collected (https://zenodo.org/record/546113#.
vergence with smaller number of epochs. This study obtained
Ybt2CGhBw2x). Gameemo is 28 subjects dataset. In this dataset,
the best accuracy of 94.17%, for CNN-LSTM model [36]. Another
each subject was made to play four computer games corresponding
study proposed an emotional model which used channel-wise fea-
to four different emotions namely horror, funny, calm, and boring
tures. These features were fed to LSTM model. The two-class clas-
(https://data.mendeley.com/datasets/b3pn4kwpmn/3, [29]).
sification of valence and arousal attained a classification rate of
Multiple human ER algorithms using EEG signals have been
98.93% and 99.10% on DEAP dataset and the three-class classifica-
proposed recently focusing mainly on ML and DL techniques.
tion achieved 99.63% on the SEED dataset [37]. In a model with
Schematic diagram of EEG- based automated ER approaches is
a dynamic graph CNN for the classification of emotions from EEG
given in Fig. 4.
signals using SEED dataset was able to recognize with 90.4% ac-
curacy for subject dependent validation and 79.95% for subject
3.1.1. Machine learning-based studies independent classification. Experiments conducted on DREAMER
The study for extraction of emotions namely genuine, neutral, dataset attained 86.23%, 84.54%, and 85.02% of average accuracies
and fake/acted smile from EEG achieved maximum accuracy of for valence, arousal, and dominance, respectively [38]. A model
94.3% and 84.1% using discrete wavelet transform -empirical mode used the fusion of graph convolutional neural network (GCNN) and
decomposition (DWT-EMD) with artificial neural network (ANN) LSTM to develop a model called emotion recognition deep learn-
classifier for alpha and beta bands respectively [30]. In this study, ing (ERDL). Differential entropy is used as a feature for emotion
researchers developed a dataset with genuine and fake emotional recognition. Their experiment on the DEAP dataset attained the
expressions. In a model named intensive multivariate empirical accuracies of 90.45% and 90.60% for valence and arousal, respec-
mode decomposition (iMEMD) the time and frequency domain in- tively in subject-dependent experiments while 84.81% and 85.27%
formation were obtained from complex continuous wavelet trans- in subject-independent experiments [39]. Using the time-varying
form (CCWT). Differential entropy and mutual information were spectral content of EEG signals parallel cascades of fuzzy logic-
used as features [31]. This work is verified on the SEED and DEAP based systems were modelled and achieved the lowest root mean
datasets and achieved an average classification rate of 96.3% us- square error (RMSE) value of 0.082. They also stressed the role of
5
the theta frequency band in the recognition of emotions from EEG. audio-video data. The emotional classes considered in this dataset
For the experiment, they developed a dataset with fifteen volun- are: happiness, sadness, neutral, and angry [49]. Another com-
teers with both positive and negative musical emotions as stim- monly used voice dataset is Berlin -EmoDB. It is a German lan-
uli [40]. Stratified normalization was used to train the deep neural guage 10 actors database with 7 emotions and 535 utterances
network. The study on the SEED dataset with the multitaper fea- (http://emodb.bilderbar.info/docu/#download, [50]). Other voice-
ture extraction technique reached 91.6% accuracy for two emotion based database includes Ryerson Audio-Visual Database of Emo-
categories (positive and negative) and 79.6% for three classes (pos- tional Speech and Song (RAVDESS) and Surrey Audio visual Ex-
itive, negative, and neutral) [41]. The textual-based features were pressed emotion (SAVEE). RAVDESS has 12 male and 12 female pro-
extracted from time and frequency domain signals from EEG sig- fessional actor’s data. This database consists of sad, fear, happy, an-
nals. Differential entropy features are fed to the 4D convolutional gry, disgust, surprise and calm emotions (https://smartlaboratory.
recurrent neural network. The proposed CRNN model is the combi- org/ravdess/). SAVEE database has 7 emotions acted by 4 differ-
nation of CNN and RNN with LSTM. They achieved the accuracy of ent male acting professionals. The audiovisua performance was
94.74% and 94.22% for their analysis on SEED and DEAP datasets, recorded and quality is evaluated by 10 subjects (http://kahlan.eps.
respectively [42]. surrey.ac.uk/savee/). This dataset is also used for facial expression-
EEG is a non-stationary signal which cannot be analyzed in ei- based ER as well.
ther time or frequency domain techniques alone. Wavelet trans- Various techniques applied for speech emotion recognition are
form (WT) is one of the efficient techniques to analyze non-linear discussed below and overview of the models is shown in Fig. 6.
signals. WT produces the time-frequency representation of a signal
[43]. Empirical mode decomposition (EMD) is another promising 3.2.1. Deep learning-based studies
feature extraction technique. The amplitude and frequency modu- In a study using ladder networks for the recognition of emo-
lated components are obtained using EMD. These components are tions a denoising autoencoder is used for combining the input
called as intrinsic mode functions (.s). The combination of DWT and the intermediate features [51]. The flexibility of this proposed
and EMD has also shown promising results [30]. Reduction of model is exhibited by implementing sentence-level features and
the feature set dimensionality is another important aspect in EEG was evaluated on the cross-corpus. For within-corpus, this model
based recognition systems. Studies using limited number of leads achieved a concordance correlation coefficient (CCC) between 3.0%
or channels are showing improved performance [[36],44]. However, and 3.5% whereas for cross-corpus it was between 16.1% and 74.1%.
a detailed study on this area is needed to find the optimum num- This model was evaluated on USC-IEMOCAP and MSP-IMPROV cor-
ber of leads/channels required to obtain the highest performance pora. A real-time speech emotion recognition (SER) model was pro-
for ER. posed by considering local and global features from a speech sig-
A study extracted entropy and Higuchi’s fractal dimension nal [52]. The proposed 1D dilated CNN model obtained a recogni-
(HFD) features and applied empirical mode decomposition/intrinsic tion rate of 73% and 90% on the IEMOCAP and EMO-DB database,
mode functions (EMD/IMF) and variational mode decomposition respectively. A system named PCNSE consists of parallel convolu-
(VMD) on EEG signals. Their experiment on the DEAP dataset con- tional layers (PCN) integrated with a squeeze-and-excitation net-
firmed CNN is giving better accuracy of 95.20% compared to naïve work (SEnet) combined with self-attention DRN (Dilated Residual
Bayes, k-nearest neighbor (k-NN), and decision tree (DT) [45]. Us- Network) [53]. The experiments conducted on IEMOCAP showed
ing CNN for automatic recognition has notably increased the per- the weighted accuracy (WA) of 73.1% and an unweighted accuracy
formance of the systems. A study using maximum mutual infor- (UA) of 66.3% and the FAU-AEC database depicted UA of 41.1%. A
mation was proposed by Ghosh et al. [46]. By increasing the ef- study utilized combination of CNN and LSTM. This study proposed
ficiency of the search process the mutual information can be in- a model with four local feature learning blocks (LFLBs) along with
creased. Their experiments using DEAP dataset attained 95.87% ac- LSTM. LFLBs learn local features and LSTM is used to learn the re-
curacy for dominance and 82% for 2 class emotional scoring. liance of these features in long term. This model attained an ac-
Fig. 5 shows the year-wise distribution of papers reviewed for curacy of 91.6% on speaker dependent evaluation and 92.9% on
EEG based emotion recognition. The summary of recent state-of- speaker independent cases using Berlin EmoDB database [54]. It
the-art works related to EEG based emotion recognition system us- has been observed that CNN-LSTM based models are gaining pop-
ing machine learning and deep learning approaches are given in ularity as they combined the strength of both and tackles the weak
Table 1 and the summary of the remaining studies are given in point of each other.
Table A1 in appendix.
3.2.2. Machine learning-based studies
3.2. ER using voice signals A model was developed for speech ER in which root mean
square energy (RMS), mel-frequency cepstral coefficients (MFCC),
A speaker can be identified by using the characteristics of voice and zero-crossing rate are used as acoustic features. It also made
which consists of speech rate, pitch, prosody, and also emotion use of features extracted from different pre-trained deep neu-
[47]. Understanding the right emotion in speech is very impor- ral networks to formulate the hybrid feature vectors [48]. Finally,
tant in every conversation as it has both physical and psycholog- the Relief algorithm was used to select the most prominent fea-
ical influences of the person. An automated system for emotion tures for ER. This study was conducted on Ryerson Audio-Visual
detection and ER has several real-time applications such as voice- Database of Emotional Speech and Song (RAVDESS), Belin (EMO-
based surveillance, e-learning, lie detectors, call canter’s, defense, DB), and Interactive Emotional Dyadic Motion Capture (IEMOCAP)
etc. [48]. Voice-based virtual assistance like Siri, Alexa commu- datasets. With the SVM classifier, the proposed model attained an
nicates with humans by using natural language processing tech- accuracy of 90.21% with EMO-DB dataset. In another study an au-
niques. However, the level of these intelligent personal assistant toencoder was used for reducing the dimensionality of the au-
(IPA) devices can be increased by embedding ER in these devices dio files [55]. They evaluated the system on RAVDESS and Toronto
[16]. Speech ER also has application in the medical field in un- emotional speech set (TESS)database using three classifiers namely
derstanding the emotion of a patient under depression etc. One decision tree (DT), CNN, SVM, and attained maximum accuracy of
of the most commonly used speech signal databases is IEMO- 96% for CNN with TESS dataset. A study used a magnitude spec-
CAP (Interactive emotional dyadic motion capture database). This trum for extraction of Mel frequency cepstral coefficient (MFCC)
database consists of 10 English-speaking actors. It consists of 12 hr without discrete cosine transform (DCT) [56]. The evaluation of
6
Fig. 5. Distribution (yearly) of papers published on EEG- based automated ER.
Fig. 6. Schematic diagram of voice- based automated ER approach.
this study was performed on six datasets namely Berlin, RAVDESS, classification purpose, this model used SVM, random forest, k-
SAVEE, EMOVO, eNTERFACE, and Urdu datasets. They reported a nearest neighbor’s algorithm, and neural network classifiers. In a
recognition rate of 95.25% with SVM classifier with Urdu dataset. model with broad learning system, 39-D MFCC features were fed.
A model proposed cryptographic structure called shuffle box for The experiments of this model on CASIA Chinese emotion corpus
the generation of features [57]. This model consists of three main achieved 100% recognition rate [59].
stages: tunable Q wavelet transform, twine-shufpat, and itera- A different type of speaker-specific emotion detection model
tive neighbourhood component analysis. A mixed dataset gener- was proposed [60]. This model inspected emotional and non-
ated from RAVDESS, Berlin, SAVEE, and EMOVO corpora with SVM emotional speech with the help of excitation features present in
classifier achieved 80.05% accuracy. A model used deep convolu- the speech. The similarity between emotion and neutral features
tional neural network (DCNN) with automatic feature selection by was computed by Kullback-Leibler (KL) distance. The study was
correlation-based feature selection (CFS) was proposed [58]. This conducted using IIIT-H Telugu emotional speech database, Berlin
model was implemented on Berlin, SAVEE, IEMOCAP, and RAVDESS emotional speech database and attained neutral vs emotional ac-
and for speaker-dependent SER attained the accuracies of 95.10%, curacy of 91.67% for IIT-H database. Most of the studies have used
82.10%, 83.80%, and 81.30%, respectively using these datasets. For CNN-based network for automated feature extraction. But the CNN
7
Table 1
Summary of state-of-the-art techniques developed for EEG- based automated ER system.
Author &
Sl. No year Methods Classifiers Results Accuracy(ERacc ) Database
Machine Learning-based studies
1 [262] Multi-scale frequency bands SEED IV: ERacc : 82.75%, SEED IV+DEAP
ensemble learning (MSFBEL) 87.87%, and 78.27%, DEAP:
ERacc : 74.22%
2 [31] iMEMD SVM and k-NN classifiers ERacc : 96.3% SEED, and DEAP
3 [40] Fuzzy parallel cascades (FPC) linear regression (LR), RMSE:0.082 PERSONAL:15ppl
support vector regression
(SVR), and
Long–Short-Term-Memory
recurrent neural network
(LSTM-RNN)
4 [34] EMD and its second order SVM and 2-hidden layer ERacc : 100% - high and low DEAP
difference plots (SODP) Multilayer Perceptron arousal
5 [33] Multi-channel feature fusion SVM SROCC:0.789 PLCC:0.843 DEAP
of EEG signal method
6 [267] TQWT tunable wavelet k-NN, SVM, ANN, RF, four ERacc : 93% SEED
transform and RFE rotation different types of the
forest ensemble decision tree (DT)
algorithms
7 [277] Topographic (TOPO-FM) and SVM ERacc : 0.8125 ± 0.0173 – DEAP, SEED, DREAMER,
holographic (HOLO-FM) Valence ERacc : and AMIGOS
representation of EEG signal 0.8510 ± 0.0262 – arousal
characteristics
8 [311] Multiple generator k-NN, SVM ERacc : 84% - SVM SEED
conditional Wasserstein GAN
(MG–CWGAN)
9 [148] Spatio -temporal feature SVM ERacc : 80.52% - Valence ERacc : DEAP
extraction using CNN 75.22% - Arousal
10 [229] Discrete Wavelet Packet k-NN, PNN, RF RF classifier: ERacc : 85.29% - PERSONAL:19:Left Brain
Transform LBD ERacc : 79.54% - RBD Damagel (LBD), 19:Right
ERacc : 79.09% - NC Brain Damage (RBD),
19:Normal Control (NC)
Deep Learning-based studies
1 [39] GCNN+LSTM subject-dependent ERacc : DEAP

90.45% - valence ERacc :
90.60% - arousal
subject-independent ERacc :
85.27% - arousal
2 [141] AlexNet and VGG16 ERacc : 74% PERSONAL:9ppl
3 [199] Transferable attention neural ERacc : 93.34% SEED, SEED iv, MPED
network (TANN)
4 [41] Stratified normalization ERacc : 91.60% SEED
5 [142] Multi-feature deep forest ERacc : 71.05% DEAP
(MFDF)
6 [288] Frame-level distilling neural DEAP: ERacc : 83.85% - DEAP and DREAMER
network (FLDNet) valence ERacc : 78.2% -
arousal ERacc : 77.52% -
dominance DREAMER: ERacc :
87.67% - arousal ERacc :
90.28% - dominance
7 [216] Rhythm-specific Beta-rhythm: ERacc : 98.91% - DEAP, DREAMER, and
multi-channel convolutional LV vs. HV ERacc : 98.45% - LA DASPS
neural network (CNN) vs. HA ERacc : 98.69% - LD vs.
HD Theta-rhythm: ERacc :
98.56% - LV vs. HV ERacc :
98.82% - LA vs. HA ERacc :
98.99% - LD vs. HD
8 [107] DCNN+ConvLSTM ERacc : 87.69% -Arousal ERacc : DEAP
87.84% -Valence
9 [147] Electrodermal activity (EDA) ERacc : 69.33% - Valence ERacc : DEAP
signals and multiscale CNN 71.43% - Arousal
10 [167] Pearson’s Correlation ERacc: 78.22% - valence ERacc : DEAP
Coefficient (PCC) into CNN 74.92% - arousal
cannot effectively map the temporal dependencies in voice signals. by data augmentation techniques. It is also observed that most of
It mainly extracts the translationally invariant features [61]. The the studies are done using SVM for classification purposes as they
high-level features of the spectrogram can be extracted by using yielded promising results.
Deep CNN [62]. This is one of the main reasons for low perfor- However, performance of various studies cannot be directly
mance in speech-based recognition systems. This can be overcome compared because different studies have chosen different datasets
8
Fig. 7. Distribution (yearly) of papers published on speech signal- based automated ER.
which vary in size and the number of emotion classes. Fig. 7 3.3.1. Deep learning-based studies
shows year-wise distribution of papers reviewed for speech signal- For the continuous recognition of human emotions Choi
based ER system. The summary of recent state-of-the-art works and song proposed a CNN-LSTM based regressor. They evalu-
related to speech signal-based ER system using machine learn- ated the semi-supervised learning model on MAHNOB-HCI and
ing and deep learning approaches are given in Table 2 and the AFEW-VA datasets. This model is having RMSE (root of the
summary of the remaining studies are given in Table A2 in mean of squared errors) 0.0385±0.0032, PCC (Pearson Correla-
appendix. tion Coefficient):0.53±0.070 for MAHNOB-HCI, and RMSE:0.2185,
concordance correlation coefficient (CCC):0.541 for AFEW-VA[73]
. Recognition rate of facial emotion can be enhanced by consid-
3.3. ER using facial images
ering multiple features. With this idea, a regional-based multi-
feature similarity (RMFS) technique was proposed for the detection
The facial features consists of forehead, eye, nose, mouth, lips,
of emotions. The webcam captured images are pre-processed and
chin, and skin features. When a face displays an expression most
multiple features were extracted. These features are trained with a
of these features show changes thus exhibiting the emotion. By
neural network. This model achieved 98.4% performance for emo-
identifying these feature changes facial emotion detection can be
tion detection [63]. A transfer learning-based approach for CNN
performed [63]. However, as facial expression consists of several
pre-trained networks used Mobile Net, Inception V3, VGG19, and
features and these features vary from person to person, based on
ResNet50 as pre-trained networks. The experiments on the CK+
gender, age, ethnic etc., researchers are using the combination of
dataset with MobileNet attained the highest accuracy of 96% [74].
these features to increase the recognition rate [64]. Usually hand-
In a study, the input images were pre-processed by the Gamma-HE
crafted features were mainly used for ER, and then these fea-
technique and for the extraction of facial points, they proposed a
tures were fed into the SVM classifier [65]. Nowadays, due to the
pyramid histogram of oriented gradients (PHOG) based supervised
advancement of DL techniques, features are extracted automati-
descent (SMD) [69]. The deep learning neural network-regression
cally yielding an increased recognition rate [66,67], [68].These au-
activation (DR) classifier then classified the emotions. This system
tomated facial ER (FER) are becoming popular owing to their ap-
attained an accuracies of 0.9885 and 0.9727 for the analysis con-
plications in the healthcare domain, human-computer interaction,
ducted on CK+ and JAFFE datasets, respectively. A CNN-based fa-
online education, surveillance, etc. [69]. The most commonly used
cial emotion recognition (FERC) which has two parts was proposed
facial expression dataset is Extended Cohn–Kanade CK+ dataset.
[75]. The background of the picture is removed in the first part
It is a combination of both posed as well as spontaneous emo-
and the second part is used for extraction of the feature vector.
tions. It consists of 593 video sequence from 123 subjects pro-
This model uses an expressional vector (EV) to obtain 5 types of
viding 7 emotional state expressions (https://paperswithcode.com/
facial expressions. The performance of the FERC model was tested
dataset/ck, [70]). Another more commonly used dataset for FER is
on Caltech faces, CMU, and NIST datasets. Using 24 digit EV this
Japanese Female Facial Expression (JAFEE). It is a female expres-
model achieved an accuracy of 96%. It can be observed that most
sion dataset with 7 emotional states. It has 213 posed gray scale
of the researchers have used CNN- based technique for FER. How-
images with a spatial resolution of 256 × 256 (https://zenodo.
ever, using only CNN temporal features cannot be captured. Hence
org/record/3451524#.YbYnOb1Bw2y, [71]). The other dataset used
combing CNN with LSTM can produce improved results.
in FER is FER-2013. It consists of 35,887 images depicting 7
emotions with image size being 48 × 48 (https://datarepository.
wolframcloud.com/resources/FER-2013, [72]). The schematic dia- 3.3.2. Machine learning-based studies
gram of facial features- based automated ER approach is shown in A genetic programming model (GP-FER) for the recognition of
Fig. 8. facial expression was proposed. Geometric features and local bi-
9
Table 2
Summary of state-of-the-art techniques developed for speech signal- based automated ER system.
Author &
Sl. No year Methods Classifiers Results: Accuracy (ERacc ) Database
1 [109] MFCC+CNN SVM and LSTM ERacc : 73.5% RAVDESS

2 [55] Auto encoder for SVM, DT, CNN ERacc : 96% - CNN RAVDESS, TESS
dimensionality reduction
3 [56] Mel frequency magnitude SVM ERacc : 95.25% (urdu) EMO-DB, Ravdess, Savee,
coefficient EMOVO, eNTERFACE and
Urdu databases
4 [57] Twine-shHuf-pat, INCA, and SVM ERacc : 90.09% RAVDESS Speech, EMO-DB,
TQWT techniques. SAVEE, and EMOVO
5 [182] EMD LDA, Naïve Bayes, k-NN, ERacc : 93.3% F1 score: 87.9% TESS
SVM, RF, and Gradient AUC: 0.995
BoostingMachine
6 [48] Hybrid architecture of SVM RAVDESS: ERacc: 79.41% RAVDESS, EMO-DB,
acoustic and deep features EMO-DB: ERacc: 90.21% IEMOCAP
IEMOCAP: ERacc: 85.37%
7 [320] Transfer sparse discriminant SVM ERacc : 43.34% F1 score: IEMOCAP, MSP-Improv
subspace learning (TSDSL) 45.33%
8 [265] SVM, RNN ERacc : 86.36% - SVM CREMA-D, EMO-DB,
RAVDESS, SAVEE
9 [316] deep binaural SVM AFEW5.0: ERacc: 36.29% AFEW5.0, BAUM-1s
representations with CNNs BAUM-1s: ERacc: 44.31%
10 [58] DCNN MLP, SVM, RF, KNN Emo-DB: ERacc : 95.10% - SVM Emo-DB, SAVEE, IEMOCAP,
RAVDESS
1 [197] Spatiotemporal and EMO-DB: IEMOCAP, EMO-DB,

frequential cascaded (Leave-One-Speaker-Out eNTERFACE05, SAVEE
attention network of CNN cross-validation) WA: 83.30%
UA: 82.10%
2 [194] BLSTM-DSA WA:85.95% IEMOCAP, EMO-DB,
3 [52] DCNN EMODB ERacc : 90% IEMOCAP and EMODB
4 [317] Multi-CNNs BAUM-1 s: ERacc : 44.06% AFEW5.0 and BAUM-1
5 [325] SATN WA:68.6% UA:69.5% IEMOCAP
6 [53] PCNSE-SADRN–CTC IEMOCAP: WA: 73.1% IEMOCAP, FAU-AEC
UA:66.3%
7 [128] Dual attention-BLSTM UA: 70.29% IEMOCAP
8 [130] Improved long and EMO-DB: ERacc : 65.29%. EMO-DB, CASIA
short-term memory network CASIA:ERacc : 74.17%
(ILSTM) + Self-Attention
Mechanism
9 [195] Emotional category based EMO-DB:ERacc : 72.19% IEMOCAP, MASC, EMO-DB.
feature weighting (ECFW) IEMOCAP: ERacc : 60.83%
10 [243] Multi-resolution RECOLA: (centering the RAW RECOLA and SEWA
modulation-filtered MMCG) CCC:0.865 – Arousal
cochleagram CCC: 0.524 -Valence
(MMCG) + parallel LSTM
nary patterns (LBP) are combined for recognition. Multiple experi- posed by Kumar et al., a sub-band of stationary wavelet transform
ments were performed on DISFA, DISFA+, CK+, and MUG datasets (SWT) is considered [79]. DCT is performed on the weighted en-
and 98% accuracy is achieved by the CK+ dataset [64]. A model in ergy of these sub-bands. Pearson kernel PCA is used for the di-
which geometric features were used for the selection of key frames mensionality reduction of features. Then the emotion classifica-
was proposed in a study. From each clip, the discriminant features tion is performed by Gaussian membership function fuzzy SVM
were selected by employing k-means clustering. For the evaluation classifier. This model evaluation was done on JAFEE, CK+, and
of person dependent and person independent cases, using RML, FG Net database and attained 98.9% accuracy on CK+ database.
and SAVEE datasets, the SAVEE dataset for person dependent cases The combination of wavelet gradient with SVM is provided max-
attained an accuracy of 98.77% [76]. It is also suggested that CNN- imum accuracy. Fig. 9 represents the year-wise distribution of pa-
based classifiers have performed better than the SVM classifiers. A pers reviewed for Facial modality-based emotion recognition sys-
model was developed using histogram of oriented gradients (HOG) tem. The summary of recent state-of-the-art works related to fa-
and local binary pattern (LBP) feature descriptors to extract facial cial expression-based ER system using machine learning and deep
features whose dimensionality was reduced by deep stacked au- learning approaches are given in Table 3 and the summary of the
toencoders [77]. The experiments for emotion detection and classi- remaining studies are given in Table A3 in appendix.
fication were conducted on CK+ and JAFFE databases using multi-
class SVM attained an accuracy of 97.66% for CK+ dataset. In an- 4. ER using multimodal data
other study geometric features, landmark curvature, and vectorized
landmark were utilized for extraction of features [78]. Their model Emotions can be expressed using various modalities, thus mak-
combined SVM with genetic algorithm and attained an overall ac- ing it difficult for automatic ER [80]. Currently, most of the au-
curacy of 95% on the experiments conducted on CK+ and multi- tomatic ER techniques are based on facial expressions, audio sig-
media understanding group (MUG) dataset. In another system pro- nals, EEG, and ECG signals. Many works have been done to ex-
10
Fig. 8. Schematic diagram of facial features- based automated ER approach.
Fig. 9. Distribution (yearly) of papers published using facial features for automated ER.
plore the emotions using these modalities individually [81]. How- Various studies on multimodal ER are performed using DL and
ever, the researchers are also exploring the possibilities to attain ML approaches are discussed below:
a better recognition rate and develop a more robust ER system
by combining more than one modalities of emotions. Fusion is
4.1. Deep learning-based studies
commonly done in three methods: data level, feature level, and
score level [82]. At the data level, raw data collected from different
The study proposed by Wang et al. considered the fusion
sources are combined and features extracted from these sources
of speech information with facial expression for speech ER. Ini-
are fed into a classifier. In the case of feature level, various fea-
tially, facial ER is achieved by the combination of CNN and RNN.
tures are extracted from input sources, and the relation between
LSTM and CNN are combined by a weighted decision fusion algo-
these features is modelled with the help of a classifier. In the case
rithm to obtain accurate speech ER system. Their study used RML,
of score type of fusion, with the help of a classified data from
AFEW6.0, eNTERFACE’05 database. AFEW6.0 consists of 773 sam-
different sources are sorted into different classes. The scores at-
ples for training, 383 samples for validation and 593 samples for
tained by each class are combined to get the final score [82]. One
testing. The fusion of facial expression with speech signals has in-
of the largest dataset in multimodal sentiment analysis and ER is
creased the SER performance approximately by 5% [17]. In another
CMU multimodal opinion sentiment and emotion intensity (CMU-
study, a method is proposed to integrate EEG and facial expression
MOSEI) dataset. It has 10 0 0 youtube speaker data in 250 different
features to recognize the emotion. Author has proposed a model
topics with six emotions [83]. Another multimodal datset available
with bimodal deep automatic encoder to combine EEG and facial
is enterface’05 which consists of six emotions. It has audio visual
signals coupled with LIBSVM classifier. The dataset was developed
data of 43 subjects [84]. The RML dataset is multimodal dataset
by collecting video clips from different movies and TV shows. Ex-
developed by Ryerson Multimedia Research Lab., Ryerson Univer-
periments on this dataset attained an average ER rate of 85.71%
sity. It consists of 720 samples of audiovisua expressions with six
[81]. A two-stage audiovisua fusion model was proposed in an-
emotions (http://shachi.org/resources/4965).
other study. In this work, spatio-temporal features of visual ex-
11
Table 3
Summary of state-of-the-art techniques developed using facial features for automated ER system.
Author &
Machine Learning-based studies:
1 [64] GP – FER CK+: ERacc: 98% DISFA, DISFA+, CK+ and

MUG
2 [76] CNN-t-SNE SVM, Gaussian SVM Savee: ERacc: 98.77% - person RML and SAVEE
dependent
3 [77] Modified Histogram of multi-class SVM ck+: ERacc: 97.66% CK+, JAFFE
Oriented Gradients (HOG)
and Local Binary Pattern
(LBP) feature descriptor
4 [111] PCA and PSO DT, MLP, RF ERacc: 94.97% J AFFE
5 [205] GA-SVM ERacc: 95% CK+,Multimedia
UnderstandingGroup
(MUG)
6 [301] PCA +GMM+GLCM with an ERacc: 93% FFDM (full-field digital
SVM mammography
7 [214] Triangulation method for k-NN, DT, RF, SVM, ELM, ERacc: 98.17% - RF PERSONAL:85ppl
feature extraction PNN
8 [213] Gabor Wavelets Rank correlation:0.679 PERSONAL: 10 ppl
9 [79] Wavelet gradient transform Gaussian membership ERacc: 98.9% JAFEE, CK +, FG Net
function fuzzy SVM
classifier.
10 [252] Filipino features SVM ERacc: 80.11% PERSONAL
Deep Learning-based studies:
1 [149] GRERN Softmax ERacc: 86.73% CEAR and AFEW

2 [225] CNN JAFFE: ERacc: 93.5% CK+: JAFFE and CK+
ERacc: 96.6%
3 [90] Deep residual network ERacc: :95.39 ± 1.41% PERSONAL: 20ppl
ResNet-50
4 [256] Face-sensitive convolutional ERacc: 95% UMD
neural network (FS-CNN)
5 [74] Pretrained convolutional ERacc: 96% CK +
neural networks of VGG19,
Resnet50, Inception V3 and
MobileNet are used by
transfer learning approaches
6 [69] Modified Monarch Butterfly Deep learning Neural CK+: ERacc: 0.9885 JAFFE: CK+, JAFFE
Optimization network regression ERacc: 0.9727
activation
7 [138] CNN SAVEE: ERacc: 97.5% - AFEW 2016, SAVEE,
Subject-dependent Weight Chonnam Emotion Videos
fusion (CEV)
8 [153] Temporal and spatial CNN ERacc: 95.42% - FER-2013 RML, MMI, BAUM-1 s,
streams and eNTERFACE05, FER-2013
9 [185] ERMOPI (Emotion ERacc: 90% - CMU Multi-PIE AffectNet face database,
Recognition using ERacc: - 68% - AffectNet face CMU Multi-PIE
Metalearning across database
Occlusion, Pose and
Illumination)
10 [158] Multi-Branch Deep RBF ERacc: 0.9949 - CK+ RAF-DB, CK+, JAFFE, FER
Network 2013
pression were computed using 3D-CNN and CNN-RNN was used 4.2. Machine learning-based studies
to compute features from speech signals. These features were then
fused with a mixture of the brain emotional learning (MoBEL) A study proposed 3D-CNN for extracted features from the elec-
model. Their study on the eNterface’05 database attained a recog- troencephalogram (EEG) and face signals. Bagging and stacking
nition accuracy of 81.7% [18]. A study proposed multimodal ap- techniques are used for the fusion of features. The experiments
proach for SER using speech and text modalities. This study pro- performed on the DEAP dataset reported the best accuracy rate of
posed the combination of multi-level multi-head fusion attention 96.13%, and 96.79% for valence, and arousal, respectively. This work
mechanism and RNN. From speech signal MFCC were captured used the data and score fusion method [82]. A model for iden-
and using bidirectional encoder representations from transform- tifying seven different human emotions by integrating facial and
ers (BERT) text data were fused. Their model was tested using speech signal features was proposed. This method utilizes statis-
IEMOCAP, MELD, CMU-MOSEI dataset. They reported the maximum tically significant features and their experiments on the Crema-D
accuracy of 99.19% using CMU-MOSEI dataset compared to other dataset attained an overall correct classification score of 93% with
models [85]. Further, it is also observed that CNN based feature a deep learning classifier [86]. A model with a deep learning ap-
extraction is depicting improved results than handcrafted feature proach consisting of speech and video using huge database was de-
extraction. veloped. In this system, the Mel-spectrogram of the speech signal
12
Fig. 10. Distribution (yearly) of papers published using multimodal features for automated emotion recognition.
which considered as an image were fed to the CNN, and frames extraction techniques, feature selection/reduction, and then emo-
extracted from video signals were fed to another CNN. The out- tion classification.
put of these CNNs were then fused using two consecutive extreme In the recent years, large number of studies have been proposed
learning machines (ELMs) and then fed to SVM for classification. using DL and ML approaches. Feature extraction plays major role in
The feature fusion technique employed in this method attained an the performance of any automated system. Feature extraction can
accuracy of 99.9% for big data and 86.4% accuracy on the eNTER- be categorized into shallow and deep features. Shallow features are
FACE database. The SVM classifier with multi dimension CNN has the handcrafted features. Though, they have a great impact on the
a great impact on recognition rate[102]. performance of a ML based system, extraction of these features is
Fig. 10 shows the year-wise distribution of papers reviewed for cumbersome and requires sound domain knowledge. Increased fea-
multimodal emotion recognition system. The summary of state- ture dimensionality makes this process even more time consuming
of-the-art works related to multimodal ER system using machine and inefficient in identifying optimal features required for the op-
learning and deep learning approaches are given in Table 4 the timum performance of the ML model. In DL-based model, auto-
summary of the remaining studies are given in Table A4 in ap- matic feature extraction is employed. DL-based techniques such as
pendix. CNN, RNN and auto encoders can be used for this purpose [15].
Hence the combination of DL and ML-based approaches are gain-
ing popularity with DL for feature extraction and ML for classifi-
5. Discussion cation. CNN’s are best suited for hierarchical data and mainly used
for extracting unlabelled features from images [87]. On the other
The growth of computational power has inspired researchers to side, the RNN’s are more suitable for sequential data and suited
replicate many human activities. One of the prominent researches for speech signals. LSTM can be used for mapping the input se-
in the current scenario is human ER. Identifying user’s emotions quence into a vector of fixed length. [88] LSTM are the variant of
play a vital role in several applications such as healthcare (psycho- RNN used in classification, processing and prediction of time series
logical counselling, anxiety and stress assessment, pain assessment, data.
etc.), e-learning, neuro marketing, neuro economics, lie detection, In this paper, various modalities namely EEG, facial, Voice as
humanoid development, companion robots, etc. [15,14]. However, well as multi-modal-based ER techniques are extensively reviewed
emotion is a complex and dynamical phenomenon. The perception and presented. In EEG modality, IMFs are generated using EMD
of emotion differs for each user for the same stimuli/situation. It is technique and are used for DL-based classifiers to attain better
always challenging to develop a generalized ER system for all users. recognition rate. In voice modality, neutral network-based stud-
Because the performance of ER system mainly depends on gender, ies have shown better recognition rates. In facial modality, CNN-
race, ethnic, and age. Besides, the previous history of emotional based methods have yielded promising results. However, recogni-
events also influences the user’s emotions while recalling activity. tion rates can be further improved with the combination of CNN
There are several challenges in developing intelligent and more ro- and other DL methods. Multimodality is a technique where the ad-
bust ER system for different users. This field of emotion recognition vantages of different modalities are combined to attain improved
in affective computing is started emerging from last two decades performance. Audio-video is the most commonly used combina-
ago and now it is playing a significant role in developing intelligent tion for human ER. It is also observed that deep neural network-
emotion recognition system for variety of applications. Researchers based methods like CNN were used for features extraction followed
have developed several approaches and methodologies based on by shallow networks like SVM for classification purposes. In few
single or multiple modalities to effectively identify the user’s emo- studies, such combination has yielded high ER performances. We
tional states. The CAD tools used for automated emotion recogni- have observed that, SVM has been widely used ML technique in all
tion mainly deal with pre-processing algorithms, stages of feature modalities for ER.
13
Table 4
Summary of state-of-the-art techniques developed using multimodal features for automated emotion recognition system.
Author
Sl. No & year Methods Classifiers Results Accuracy (ERacc ) Database Modalities considered
Machine Learning-based studies:
1 [115] Late-fusion approach support vector IEMOCAP-SD: CC: 0.564 - IEMOCAP, Acoustic and text
machine (SVM) HSF2 + WE, HSF2 + GloVe MSP-IMPROV information
2 [82] 3D-CNN SVM ERacc :96.13% - valence DEAP, COCO EEG and face
ERacc :96.79% - arousal
3 [86] Speech signal, image SVM, deep learning CC score: 93% Crema-D dataset Speech and face
processing, analysis & classification
features extraction
4 [204] Bimodal deep belief LIBSVM ERacc : 90.89% Friends’ data set Speech and facial
network (BDBN) expression
5 [150] 2D-CNN,3D-CNN SVM RML dataset: ERacc : 96.79%: RML, Speech and visual
eNTERFACE05 ERacc : 98.92% eNTERFACE05, features
BAUM-1 s: Racc : 71.26% BAUM-1 s
6 [129] DS evidence theory LS-SVM ERacc : 85.38% - Valance, Personal: 20ppl EEG, ECG
ERacc : 77.52% - Arousal
7 [250] 1D,2D CNN Autoencoder+SVM ERacc: 71.24% Ck+, RAVDESS and Facial images, audio
SEED-IV signals, EEG
8 [178] Six Linear AUC:0.83 Personal: 17ppl
Discriminant Magnetoencephalography
Analysis (MEG), EEG
9 [250] Wrapper based feature k-NN ERacc: 85.18% - valence Personal:18ppl EEG, GSR, and PPG
selection ERacc: 76.54% - arousal
10 [237] Genetic algorithms Optimized extreme ERacc: 93.53% - CK+ CK+, Enterface’05, Visual,audio
learning machine BAUM-1s
1 [18] CNN+RNN+MoBEL BEL ERacc : 81.7% eNterface’05, Audio, visual

Berlin Emo-DB
2 [108] bidirectional LSTM Visual IEMOCAP: ERacc : 80.38% IEMOCAP, Audio, text and
CMU-MOSI: ERacc : 80.18% CMU-MOSI
3 [20] LSTM MOSI: ERacc : 77.3% MOUD: MOSI, MOUD Text, video and audio
ERacc : 66.1%
4 [259] LSTM CCC:.719-Arousal, RECOLA Audio and visual
CCC:.740-Valence
5 [292] CROSS-modal dynamic CMU-MOSEI: ERacc : 80.5% F1: IEMOCAP,
convolution 80.2 CMU-MOSEI
6 [162] Bi-GRU (Bidirectional CMU-MOSI: precision:78.7% CMU- Textual, visual, and
gated recurrent unit) POM: precision:43.5% MOSI,persuasion audio
and attention fusion opinion
multimodal (POM)
7 [201] CNN+BiLSTM ERacc: 93.20±2.55% - DEAP EEG, Audio
Valence ERacc:
93.18 ± 2.71% - Arousal
8 [264] Hierarchical DNN based ERacc: 81.2% - RAVDEES RAVDESS, SAVEE, Audio-text
classifier on audio ERacc: 81.7% - -SAVEE and IEMOCAP
features ERacc: 74.5% - IEMOCAP
9 [78] C3D-LSTM hybrid Model, ERacc: 81.18% RML Facial, speech
MFCC, CNN
10 [330] CNN ERacc: 67.8% -Valence Personal: 49ppl EEG, Eye, face
ERacc: 77.0%-Arousal
5.1. The major findings related to each modality is discussed below camera, area of eye, nose, mouth covered, variation in poses,
etc. [74]. This might lead to the poor segmentation of the region
EEG modality: of interest thus causing poor recognition accuracy.
• Reducing the number of EEG channels can improve the perfor-
• Most of the current methodologies are not suitable to identify
mance as not all channels provide quality information needed the micro facial emotional expressions. Because the expression
for analyzing the emotions [31]. of emotions through facial actions varies for different users and
• Reducing the number of channels can reduce the iterative noise it mainly depends on several factors such as age, gender, ethnic,
which may otherwise decrease the quality of the network [44]. race, etc.
• Considering all the channels of EEG signals may take a longer
• Database updating must also be considered by adding more
time to train a particular network [31]. emotions. Also, well-balanced facial emotion set need to be
used [90].
Facial Modality:
• Due to the type of image segmentation method adopted, most Voice Modality:
of the systems proposed efficiently work for front-facing im-
ages. But this makes it less reliable for adopting them for de-
• Development and analysis of multilingual datasets are needed
veloping a more robust system [89].
for speech ER [58].
• The challenges involved in facial expression recognition involve
non-uniformity in illumination, distance between the user and
14
Fig. 11. Application of a multimodal system in smart health care.
Multimodal: Gamma bands are more suitable for emotion recognition. Ex-
periments conducted by Li. et al. [95] indicate EEG frequency
• A multimode approach with a combination of facial images, band beta has more effect on mild depression identification.
audio-video signals, EEG signals, and other physiological sig- Hence detailed study on five frequency bands can be performed
nals can give improved recognition rates [89,91,92]. However, to identify the more reliable frequency band for ER.
it is also important to combine them in proper proportions to • Recently, deep learning techniques have been widely used to
yield maximum performance. Too many physiological signals improve the performance of ER. However, the deep neural net-
with many channels may increase the computational complex- works can also be used to investigate the brain functional con-
ity. nectivity patterns during different emotions with graph convo-
lutional neural network (GCNN).
5.2. Limitations of this present study • The use of compact and light weight deep neural network such
as capsule network, 2D capsule network, Siamese networks can
1 We have only reviewed the articles published between 2016 reduce the computational complexity and increase the robust-
and 2021 and early 2022. This review on automated ER can pro- ness of the ER systems.
vide better view on the progress of ER. • The effects of different types of noises and artifacts in EEG data
2 We have only considered the most popular modalities used should be carefully handled to improve the performance of ER
for ER such as EEG, facial expression, and speech. However, system. The common spatial filters (CSP), independent compo-
there are several works which have focused on other physio- nent analysis (ICA), and other similar methods can be employed
logical signals such as electrocardiogram (ECG), electromyogram in EEG analysis.
(EMG), facial electromyogram (FEMG), galvanic skin response • Development and analysis of multilingual dataset are needed
(GSR), and skin temperature (ST) to assess the emotions of sub- for speech ER [58].
jects. • Most of the proposed studies are tested on a database devel-
3 This present study focusses the on existing automated ER tech- oped in a controlled environment. A real-time ER is difficult
niques based on several factors such as methods, classifiers to manage as it is prone to noise and may degrade the per-
and performance obtained. It does not review the applications formance drastically [91,96].
based on automated ER methods. • Usually same dataset is used for training and testing. A model
trained using one dataset must be tested against another
dataset. This type cross dataset testing might increase the ro-
5.3. Future research scopes
bustness of the model. However, this needs a large amount of
training dataset.
• There is a need to develop more robust facial ER system that • Future aspect of a multimode system is shown in Fig. 11. A per-
can efficiently work with side face images and images rotated
son with his/her EEG signal, facial expressions and speech sig-
in any direction [89].
nals are combined and fed to an automatic ER in the cloud. The
• Increasing larger public datasets with more diverse data can
automated ER system will send the outcome to the clinicians at
train the model well and help to obtain accurate results. Be-
the hospital. After manual confirmation with the signals, doc-
sides, if the model is trained with different databases which in-
tors will send the results to the patients. Meanwhile, the test
volve multi-ethnic, different age groups, genders, etc. will help
signals can be used to train the developed ER system.
to develop a robust automated ER model.
• More types of emotions can be considered. Also, better to have
more balanced dataset for each class [90]. 6. Conclusion
• EEG feature extraction was performed on five sub-band fre-
quencies. Findings of Zhang et al. [93] indicate gamma sub- In this paper, we have reviewed various state-of-the-art models
band is more correlated to emotions thus leading to increased proposed for human automated ER. The summary of the same is
classification accuracy. Studies of Li. et al. [94] indicate Beta and prepared based on methods, modalities used and performance ob-
15
tained. Various machine and deep learning techniques have been these systems in real-time automated ER systems which can per-
employed for automated ER using EEG, facial, speech and multi- form well in uncontrolled environment need to be developed.
modal signals. Our study indicates that, there is significant rise
in automated ER using DL techniques and have yielded high per- Declaration of Competing Interest
formance measures under controlled environment. However, there
are very few models proposed for automated ER in dynamic en- There is no conflict of interest in this work.
vironment where the subjects might be in movement, and sud- Appendix
den switch between expressions are considered. Hence to employ
Table A1
Summary of state of art techniques for EEG based emotion recognition system.
Author &
Sl. No year Methods Classifiers Results Accuracy (ERacc ) Database
1 [125] Deep CNN Bagging tree (BT), support CVCNN, GSCNN, STCNN AUC DEAP
vector machine (SVM), values: 1 -FREQNORM
linear discriminant
analysis (LDA), and
Bayesian linear
discriminant analysis
(BLDA)
2 [96] Quadratic time-frequency SVM classifiers with the ERacc : 89.8% - Arousal ERacc : DEAP
distribution (QTFD) RBF kernel 88.9% - valence
3 [274] Single-electrode-level power SVM ERacc : 87.80% - arousal ERacc : DEAP
spectral density 86.91% - valence
4 [186] SVR (Support vector Mean absolute error:0.74 DEAP
regression) and 1.45
5 [314] PNN SVM ERacc: 81.21% - valence, ERacc: DEAP
81.26% - arousal
6 [238] Logistic Regression with Naive Bayes (NB), SVM, ERacc: 77.17% - valence ERacc: DEAP
Gaussian kernel and linear LR with 77.03% - arousal
Laplacian prior L1-regularization (LR_L1),
linear LR with
L2-regularization (LR_L2)
7 [30] Discrete wavelet transforms k-NN, SVM, ANN ERacc: 94.3% - true emotions PERSONAL (available for
(DWT), empirical mode ERacc: 84.1% - fake emotions public):28 ppl
decomposition (EMD)
8 [100] Local cortical activations SVM ERacc: 90.3% - genuine vs PERSONAL:28 ppl
with dynamic functional neutral ERacc: 88.52% -
network patterns neutral vs fake ERacc: 78.82%
-genuine vs fake
9 [226] Black hole algorithm SVM ERacc: 92.56% MAHNOB HCI Tagging
10 [173] DWT and EMD with SVM ERacc: 85.71% BCI Competition 2008
approximate entropy datasets 2b data
11 [118] Time, frequency and wavelet SVM, MLP and K-NN ERacc : 98% - age group PERSONAL:30 ppl
domain features (26–35) - MLP
12 [99] SVM with error-correcting ERacc: 94.79% PERSONAL:18ppl
output code (ECOC)
13 [329] Minimal SVM ERacc: 87.36% PERSONAL:30ppl
redundancy-maximal
relevance
14 [28] Discriminative Graph k-KNN, logistic regression, ERacc: 69.67%- DEAP ERacc: DEAP+SEED
regularized Extreme SVM 91.07% - SEED
Learning Machine
15 [127] Kernel Spectral Regression k-NN, naïve Bayesian, SVM, ERacc: 92.7 ± 2.1 – RF & DEAP
(KSR) random forest KSR
16 [307] Dynamical recursive feature least square support vector ERacc: 0.7896 - arousal DEAP
elimination (D-RFE), machine F1-score:0.7991 - arousal
ERacc: 0.7143 - valence
F1-score: 0.7257 - valence
17 [308] Transfer recursive feature linear least square support ERacc: 0.7867 arousal DEAP
elimination (T-RFE) vector machine F1-score: 0.7526 - arousal
ERacc: 0.7875 - valence
F1-score:0.8077 – valence
18 [206] Sparse Linear Discriminant SVM ERacc: 92.26% Non-neutrality PERSONAL:30 ppl
Analysis vs. neutrality ERacc: 86.63%
Positive vs. negative
19 [95] BestFirst (BF), Greedy BayesNet (BN), SVM, Emo_blockbeta band ERacc: PERSONAL:37 ppl
Stepwise, Genetic Search, Logistic Regression (LR), 92.00% AUC: 0.957
Linear Forward Selection k-NN and Random Forest Neu_block beta band ERacc:
(LFS) and Rank Search (RS) 98.00% AUC: 0.997
based on Correlation
Features Selection (CFS)
(continued on next page)
16
Table A1 (continued)
Sl. No Author & Methods Classifiers Results Accuracy (ERacc ) Database

year
20 [326] Group sparse canonical correlation analysis ERacc: 80.20% SEED

(GSCCA)
21 [315] Wavelet feature extraction SVM, RF ERacc: 82%+3.78% -RF & LPP DEAP
22 [304] Fixed low-rank spatial filter LR-6 Mean: 0.302 Standard PERSONAL:23 ppl
estimation deviation:0.103
23 [280] Statistical Features SVM ERacc: 80% DEAP
24 [198] Relevance vector machine Relevance vector machine RVM: ERacc: 93.33% SVM: DEAP
(RVM). and SVM ERacc: 78.67%
25 [224] Discrete wavelet transforms SVM and k-NN ERacc: 86.75% - arousal ERacc: DEAP
84.05% - valence
26 [220] LPP (late positive potential) SVM SVM: ERacc: 57.9% PERSONAL:21 ppl
27 [183] Bispectral Analysis LS-SVM and an ANN ERacc: 64.84% -Low/High DEAP
Arousal ERacc: 61.17% -
Low/High Valence
28 [170] Stationary wavelet transform SVM, k-NN SVM: ERacc: 92% KNN: ERacc: PERSONAL:19
(SWT) 94%
29 [114] Minimum-Redundancy- SVM ERacc: 60.7% - Arousal ERacc: DEAP
Maximum- 62.33%- Valence
Relevance(mRMR)+kernel
classifiers
30 [106] PSD features SVM ERacc: 85.86%. PERSONAL: 14ppl
31 [187] SVM ERacc: 49.63% - 4 emotions PERSONAL:5 ppl
classification ERacc: 71.75% -
any two emotions
classification ERacc: 73.10%
for positive negative
emotions classification
32 [103] Discrete Wavelet Transform, quadratic discriminant ERacc: 83.87% DEAP
Wavelet Energy analysis k-NN, and SVM
33 [261] MVAR and DTF analyses SVM ERacc: 93.7% ± 1.06% - joyful PERSONAL:19ppl
vs. neutral ERacc: 80.43% ±
1.74% - joyful vs.
melancholic ERacc: 83.04% ±
1.47 - familiar vs. unfamiliar
34 [209] Spectrum centroid and LZC k-NN and SVM ERacc: 86.46% - Valence ERacc: DEAP
in EMD domain with SBS 84.90% - Arousal
feature selection
35 [154] Flexible analytic wavelet RF and SVM ERacc: 90.48% - SEED+DEAP
transform (FAWT) positive/neutral/negative
36 [249] Ensemble model, Decision SEED: ERacc 74.85% DEAP: SEED+DEAP
Tree, k-NN and Random ERacc: 62.63%
Forest
37 [279] Sparse Discriminative k-SVM ERacc : 70.1% – Arousal ERacc : DEAP
Ensemble Learning (SDEL) 77.4% - Valence
38 [144] Minimum spanning tree SVM and quadratic ERacc : 88.28% - arousal ERacc : DEAP
discriminant analysis 81.25% - valence
(QDA)
39 [32] Activation probability of EEG SVM, ANN, NB ANN: ERacc : 97.74% DEAP
channels
40 [45] (EMD/IMF) and variational k-NN, CNN,NB,DT CNN: ERacc : 95.20% DEAP
mode decomposition (VMD)
41 [156] Firefly integrated SVM, k-NN, Bayes, DT, RF FIOA: TPR: 0.9891 PPV: 0.91 DEAP, PERSONAL:20 ppl
optimization algorithm TNR: 0.9167 NPV: 0.99 ACC: for Validation
(FIOA) 0.95
42 [37] Feature level fusion and k-NN, SVM, DT ERacc: 86.98% - k-NN PERSONAL: 86 depressed
genetic algorithm patients 92 normal
43 [171] Magnitude Squared SVR, LR, MLP ERacc: 67.5% - valence DEAP
Coherence Estimate(MSCE)
44 [180] Phase synchronization k-NN, SVM,SRC ERacc: 94% - 2 class DEAP
45 [112] GoogLeNet model-based SVM, k-NN and ELM ERacc: 98.78% -SVM Gameemo
deep-learning approach
46 [7] DWT MLPNN, k-NN, SVM ERacc: 82% - MLPNN – F4 Gameemo
channel
47 [139] Prime pattern and tunable SVM ERacc: 99% DEAP, DREAMER,
q-factor wavelet transform GAMEEMO
48 [273] EMD+ CIF-based filtering multi-class least squares ERacc: 90.63% F1-score: PERSONAL:20ppl
SVM 0.9064 kappa value: 0.8751
49 [35] Fractal Firat pattern LDA, k-NN, SVM ERacc: 99.82% - SVM GAMEEMO
(FFP) + Tunable Q-factor
Wavelet Transform (TQWT)
17
Sl. No Author & Methods Classifiers Results Accuracy (ERacc ) Database

year

Sl.No Author & Methods Classifiers Results Accuracy(ERacc ) Database
year
1 [102] LSTM Dense layer ERacc : 85.65%- arousal ERacc : DEAP
85.45%- valence ERacc :
87.99%-liking
2 [294] CNN-GRU, CNN-LSTM Mahalanobis Mean CRR: 99.90%–100% DEAP
distance-based classifier
3 [146] Echo State Network (ESN) SN-IP SEED: ERacc : 89.01% DEAP+SEED
4 [299] Stack Auto Encoder LSTM ERacc : 81.10% - valence ERacc : DEAP
(SAE)+LSTM-RNN 74.38% - arousal
5 [287] 3-D covariance shift 3-D CNN architecture with ERacc : 73.3% - arousal ERacc : DEAP
adaptation-based CNN batch normalization, dense 72.1% - valence
prediction
6 [222] Multivariate extension k-NN/ANN k-NN: ERacc : 51.01%- arousal DEAP
(MEMD) ERacc : 67%valence ANN:
ERacc : 75%- arousal ERacc :
72.87% - valence
7 [278] DNN, CNN DNN: ERacc : 75.78% - Valence DEAP
CNN: ERacc : 81.41% - Valence
8 [303] SDAE EL-SDAE ERacc : 92% PERSONAL:8 ppl
9 [36] DNN, CNN, L STM, CNN-L STM CNN: ERacc : 90.12 DEAP
CNN-LSTM: ERacc : 94.17%
10 [309] Adaptive Stacked Denoising ERacc : 0.8579 PERSONAL:7 ppl
Auto Encoder (SDAE)
11 [232] LSTM ERacc: 77.68% PERSONAL:20 ppl
12 [286] Broad dynamical graph ERacc: 93.66% SEED
learning system (BDGLS
13 [94] Hierarchical convolutional Beta wave: ERacc: 86.2% SEED
neural network (HCNN) Gamma wave: ERacc: 88.2%
14 [124] 3-class CNN Single path CNN ERacc: DEAP
98.48% - 3-class arousal
ERacc: 97.59% - 3-class
valence 2-path CNN ERacc:
97.58% - 3-class arousal
ERacc: 98.75% - 3-class
valence
15 [37] Channel-wise features and MLP DEAP: ERacc: 98.93% - SEED, DEAP
LSTM valence ERacc: 99.10% -
arousal SEED: ERacc: 99.63% -
positive, neutral, negative
16 [38] DGCNN SEED,: ERacc : 90.4 - subject DREAMER, SEED
dependent ERacc : 79.95% -
subject independent
DREAMER: ERacc : 86.23 -
valence ERacc : 84.54 - arousal
ERacc : 85.02% -dominance
17 [42] 4DCRNN LSTM SEED: ERacc :94.74% DEAP: SEED and DEAP
ERacc :94.22%
18 [302] CNN ERacc : 96.72% - CNN ERacc : DEAP
98.31 – shallow model
19 [253] Kernel Density Estimation multilayer perceptron ERacc: 72% - Calm ERacc: PERSONAL
(KDE) 75% - Fear ERacc: 78% -
Happy ERacc: 74% - Sad
20 [251] Relative power spectral ERacc: 94.63% ± 3.68 SEED
density+CNN
21 [257] LSTM ERacc: 94.12% −10 fold cross DEAP, SEED
validation
22 [176] Modified Differential BiLSTM, MLP IDEA: ERacc: 98.5% - subject IDEA- Intellect Database
Entropy (MD-DE) dependent ERacc: 88.57%.- for Emotion Analysis,
subject independent SEED, DEAP
CRR- correct recognition rate; HA - high arousal; LA - low arousal; HV - high valence; LV - low valence; HEIM - Hybrid Emotion Inference Model; DSNN - Deep Sparse
Neural Network WA - weighted accuracy; UA - unweighted accuracy; CCC - concordance correlation coefficient; UAR-unweighted average recall; PCC -Pearson Correlation
Coefficient CC - correct classification; ρ c - concordance correlation coefficient, mean squared error (MSE), DEAP - database for emotion analysis; using physiological
signals, SEED - SJTU Emotion EEG Dataset.
18
Table A2
Summary of state of art techniques for speech signal-based emotion recognition system.
Author &
Sl. No year Methods Classifiers Results: Accuracy(ERacc ) Database
1 [212] F0 contour, Mel-frequency SVM F-score: 84.11% - Arousal F-score: SEMAINE, Wall Street
cepstral coefficients (MFCCs), 66.92% - Valence Journal based Continuous
zero crossing rate and RMS Speech recognition Corpus
energy. Phase II database.
2 [189] eGeMAPS features Deep Belief Networks FAU-AIBO, EMOCAP,
(DBN), SVM, Sparse EMO-DB, SAVEE, EMOVO
autoencoder
3 [217] Emotion-discriminative and SVM ABC: ERacc : 65.62% FAU-AEC, ABC, EMO-DB
Domain-invariant Feature
Learning Method (EDFLM
4 [291] random Deep belief SVM EMO-DB: 82.32% CASIA: 48.50% EMO-DB, SAVEE, Chinese
networks (RDBN) SAVEE: 53.60% Academy of Sciences
(CASIA)
5 [318] DCNN model + DTPM SVM EMO-DB: WAR: 87.31% UAR: 86.30% EMO-DB, RML,
eNTERFACE05, BAUM-1 s,
6 [328] DBN +SVM ERacc : 95.80% Chinese Academy of
Sciences
7 [104] One-Dimensional deep learning (DL) and ERacc : 99.97% - DL ERacc : 99.7% - SVM Private
Conventional Neural SVM
Network
8 [282] Basic prosodic and spectral SVM ERacc: 96% Toronto emotional speech
feature extraction set (TESS)
9 [324] Recurrent convolutional SVM IEMOCAP: WA: 53.6% - RCL+2-layer TIMIT, IEMOCAP
Layer (RCL) MLP
1 [54] 1D CNN LSTM, 2D CNNLSTM Berlin EmoDB: 2DCNN ERacc : 95.33% IEMOCAP, EMO-DB
- speaker-dependent ERacc : 95.89% -
speaker-independent
2 [97] Domain adversarial neural MSP-IMPROV, USC-IEMOCAP: ERacc : 71.2% - USC-IEMOCAP,
network (DANN) Arousal ERacc : 60.2% - Valence MSP-Podcast
MSP-IMPROV: ERacc : 70.6% ERacc : 75%
3 [110] Multimodal VAD LSTM ERacc : 91.52% PUBLIC
4 [117] CNN ERacc : 84.30% Berlin emotions dataset
5 [151] CNN-KELM (kernel extreme kernel extreme learning Emo-DB: UA:92.45%, WA: 92.90% EMO-DB, IEMOCAP
learning machine) machine (KELM) IEMOCAP: UA: 57.99%, WA:56.55%
6 [174] Deep sparse neural network HEIM:F1 score: 0.7523 Precision: Sogou Voice Assistant4
model (DSNN), hybrid 0.7614 DSNN: F1 score: 0.4355 (Chinese Siri)
emotion inference Precision: 0.4516
model(HEIM)
7 [231] Bi-directional long IEMOCAP: ERacc: 72.25% EMO-DB: IEMOCAP, EMO-DB,
short-term memory ERacc: 85.57% RAVDESS: ERacc: 77.02% RAVDESS
(BiLSTM), redial based
function network (RBFN)
8 [235] Dual exclusive attentive ABC: ERacc: 65:02% EMO-DB: ERacc: Interspeech 2009 Emotion
transfer (DEAT) 67:79% Challenge FAU Aibo
Emotion Corpus, ABC and
Emo-DB
9 [51] Ladder MSP-IMPROV, MSP-PODCAST: CCC: 0.771 – Arousal USC-IEMOCAP, networks
MSP-PODCAST (Lad+MTL+UL)
10 [244] 3D convolutions and IEMOCAP: ERacc: 62.6% MSP-IMPROV: IEMOCAP, MSP-IMPROV,
attention-based sliding ERacc: 55.7%
recurrent neural networks
(ASRNNs)
11 [296] Attentive temporal pooling EMO-DB: WA: 85.73% UA: 82.86% RML, EMO-DB, IEMOCAP
module into a DNN
12 [297] LSTM-TF-at CASIA UAR:92.8% CASIA, eNTERFACE, GEMEP
13 [126] ACRNN IEMOCAP: UAR: 64.74%± 5.44 IEMOCAP, Emo-DB
(average +SD) Emo-DB: UAR:
82.82%± 4.99 (average +SD)
14 [272] CNN+RNN UAR:48.8% EmotAsS (EMOTional
Sensitivity Assistance
System for people with
disabilities
15 [305] Phoneme and spectrogram ERacc : 73.9% IEMOCAP
combined CNN model
16 [323] DEEP CNN Berlin EmoDB: ERacc: 0.9271 – EMO-DB, IEMOCAP
speaker dependent ERacc: 0.9178 –
speaker independent
19
Sl. No Author & Methods Classifiers Results: Accuracy(ERacc ) Database

year
17 [258] DNN WA: 70.1% UA: 60.7% IEMOCAP

18 [131] Deep recurrent neural Connectionist Temporal ERacc : 54% IEMOCAP
network Classification (CTC)
19 [136] Universum autoencoders ABC: UAR::63.3± 1.2% - U-AE Geneva Whispered
Emotion Corpus, ABC,
EMO-DB, Speech Under
Simulated and Actual
Stress (SUSAS)
20 [145] Convolutional Neural LSTM-RNN ERacc : 64.78% IEMOCAP
Networks
21 [190] Multi-task deep BLSTM CLS – Decoded: Development: CCC: RECOLA
-RNN 0.859 – Arousal CCC: 0.596 –
Valence Test: CCC: 0.596 – Arousal
CCC: 0.460 – Valence
22 [203] CNNs and RNNs ERacc : 88.01% EMO-DB
23 [218] Sharing Priors between Related Source and Target ABC: ERacc : 61.54% INTERSPEECH 2009
classes (SPRST) Emotion Challenge
two-class task, ABC,
EMO-DB
24 [223] RNN WA: 63.50% UA: 58.8% IEMOCAP
25 [289] DNNs ELM ERacc : 92.72% Mandarin
26 [62] Deep Retinal Convolution ERacc : 99% IEMOCAP
Neural Networks (DRCNNs)
27 [210] iCST-Voting ERacc : 98.42% gender dataset and the
Deterding dataset
28 [133] ANN Variance: 64.4% - arousal Variance: International Affective
65.4% - valence Digitized Sound (IADS)
29 [230] CLSTM IEMOCAP: ERacc : 75% RAVDESS: IEMOCAP, RAVDESS
ERacc : 80%
30 [293] CNN ERacc : 81% EMO-DB, RAVDESS
31 [306] Modified Cuckoo Search and EMO-DB: ERacc : 87.66% and 87.20% - EMO-DB, IEMOCAP
NSGA-II based feature Speaker-dependent ERacc : 76.80%
selection. and 76.82% - speaker-independent
32 [108] Lightweight CNN EMO-DB ERacc : 92.02% IEMOCAP, EMO-DB
33 [123] GAN IEMOCAP: UAR: 53.6% FEEL-25k: IEMOCAP and FEEL-25k
UAR: 54:6%
34 [200] CNN-BLSTM WA: 81.6% UA:82.8% IEMOCAP
35 [276] Utterance level parametric EMO-DB: ERacc : 76.77% IEMOCAP: EMO-DB and IEMOCAP,
Generative noise model ERacc : 53.35% NOISEX-92
36 [233] Multi-path and group loss- based network (MPGLN) KESDy18_EM, IEMOCAP: (MPGL) KESD, IEMOCAP
WA: 0.718 UA: 0.74 PR: 0.677 F1:
0.693
37 [191] Reduced speech bandwidth ERacc : 82% EMO-DB
and the μ-low companding
procedure
38 [135] Vowel-like regions (VLRs) Extreme Learning Machine EMO-DB: ERacc : 92.74% - VLRs ERacc : EMO-DB, IEMOCAP, FAU
and non-vowel-like regions 81.64% - non - VLRs AIBO
(non-VLRs).
39 [236] Weighted TrBaidu algorithm F1-score - 23% Multimodal Emotion Lines
Dataset (MELD)
40 [59] Mel-Frequency Cepstral Coefficients and broad ERacc : 100%. CASIA Chinese emotion
learning system corpus
41 [321] MT-SHL-DNN UAR: 52.84% ABC, AVIC, Danish
Emotional Speech (DES),
EMO-DB, Enterface, Belfast
Sensitive Artificial Listener
(SAL), SmartKom (Smart)
corpus, SUSAS,VAM
42 [193] Fusion-ConvBERT EMO-DB: WA: 88.43 - EMO-DB, IEMOCAP
Speaker-Independent UA: 86.04%-
Speaker-Independent EMO-DB:
WA:94.23%-Speaker-Dependent,
UA:92.1%- Speaker-Dependent
43 [246] CNN ERacc: 97.06% - 64 × 64 resolution JAFFE, CK+, and MUG
44 [113] 3D CNN-LSTM ERacc: 96.18% - RAVDESS RAVDESS, RML, SAVEE
45 [184] MFCC + GFCC with deep ERacc: 80% RAVDESS
C-RNN
46 [242] CNN Linear Prediction Cepstral ERacc: 93.8931% berlin
Coefficient (LPCC) with
k-NN
ABC - Airplane Behaviour Corpus, AVIC - audiovisua Interest Corpus, IEMOCAP - Emotional Dyadic Motion Capture, SAVEE - Surrey Audio-Visual xpressed Emotion.
20
Table A3
Summary of state of art techniques for facial modality-based emotion recognition system.
Author &
1 [165] Spatio-Temporal Completed SMIC: ERacc: 75.31%- Spontaneous

Local Quantization Patterns Detection ERacc: 64.02% - Micro-expression Corpus
(STCLQP) pos/neg/sur (SMIC), CASME and
CASME2
2 [164] Discriminative CASME: ERacc: 64.33% CASME, CASME2 and SMIC
spatiotemporal local binary CASME2: ERacc: 64.78% SMIC:
pattern ERacc: 63.41%
3 [207] Main Directional Mean SVM CASME: ERacc: 64.07% SMIC, CASME and CASME
Optical flow (MDMO) II
4 [172] Viola Jones Face Detection SVM, RF, k-NN ERacc: 91.545% -k-NN JAFFE
5 [211] Viola Jones with Haar Multiclass SVM Elderly age group: ERacc: Lifespan database
Features 90.32% - neutral ERacc:
84.61% - happiness ERacc:
66.6% - sadness other age
group ERacc: 95.24% - neutral
ERacc: 88.57% - happiness
ERacc: 80% - sadness
6 [221] Log Gabor filter +PCA Euclidean distance ERacc: 93.57% JAFFE
7 [134] Geometrical and SVM ERacc: 89.26% Cohn-Kanade (CK+)
texture-based LBP features
8 [313] Spatially coherent SVM BU-3DFE:ERacc: 87.36% −0° BU-3DFE and SFEW,
feature-learning
9 [119] Neural Networks model Multi-class SVM ERacc: 86.04% CK+
(VGG16)
10 [137] VGG 16 and ResNet50 SVM ERacc: 92.40% Kaggle and KDEF
11 [175] Modified mouthmap and LR, LDA, RNN, CART, NB, KDEF: ERacc: 98.1% - NB KDEF, Oulu-CASIA, CK+
eyemap SVM
12 [228] Viola-Jones using Haar-like k-NN,DT ERacc: 98.03%- k-NN ERacc: PERSONAL:55ppl
features 97.21%-DT
1 [327] Bilinear - CNN ERacc: 89.98% FER-2013

2 [157] A Binary CNN (B-CNN) and VGG-16 network ERacc: 64.60% IAPS subset, Art photo,
an Eight-class CNN (E-CNN). Abstract painting
3 [101] 3DCNN+ConvLSTM Softmax Logistic Classifier ERacc: 97± 05% - CK+ CK+, MMI, DISFA, DynEmo
4 [177] CNNs FER-2013: ERacc: 65% custom FER-2013 and custom
dataset: ERacc: 60% datasets
5 [179] CNN-RNN OMG: CCC: 0.535 – Valence OMG, Aff-Wild and
CCC: 0.365 - Arousal Aff-Wild2
6 [181] CNN Average Precision: 77.16 – EMOTIC
excitement
7 [188] CNN ERacc: 70% FER 2013
8 [169] CNN FERC-2013: ERacc: 70.14% FERC-2013, JAFFE
JAFFE: ERacc: 98.65%
9 [91] PRATIT CNN ERacc: 78.52%. FER2013
10 [241] CNN ERacc: 89.98% FER2013
11 [247] CNN+SJMT ERacc: 84.2% SFEW, Oulu-CASIA,
EmotioNet
12 [248] DCNN ERacc: 98.82% - sad PERSONAL
13 [281] Venturi Architecture of CNN ERacc: 98.87% - training ERacc: Karolinska Directed
86.78% - testing Emotional Faces
14 [93] CNN ERacc: 88.56% Fer-2013, LFW
15 [73] CNN-LSTM-based regressor MAHNOB- MAHNOB-HCI and
HCI:RMSE:0.0385±0.0032 AFEW-VA
PCC: 0.53±0.070 AFEW-VA:
RMSE:0.2185 CCC:0.541
16 [63] Regional Multi Feature ERacc: 98.4%.
Similarity (RMFS)
17 [152] CNN, ResNet and attention Training ERacc: 85.76% - CNN FER
block Testing ERacc: 64.40% -
CNN+ResNet+attention
block
18 [155] Graph mining ERacc: 90.00% Surrey Audio-Visual
Expressed Emotion
(SAVEE)
21
Sl. No Author & Methods Classifiers Results Accuracy(ERacc ) Database

year
19 [161] Local enhanced motion ERacc: 59.02% AFEW, CK+ and MMI
history image (LEMHI)+
CNN-LSTM
20 [89] Viola-Jones face detection, ANN CK+: ERacc: 99.67% JAFFE, CK+, RaFD
HOG, LBP, PCA, ANN
21 [196] Reinforcement learning for RAF-DB: ERacc: 72.84% RAF-DB, ExpW, and
pre-selecting useful images FER2013
(RLPS)
22 [269] Attention Shallow Model CK+: ERacc: 99.1% MMI: ERacc: CK+, MMI, and
(ASModel), ADModel, and 89.88% Oulu-CASIA:ERacc: Oulu-CASIA
MSDModel 87.33%
23 [283] CNN FER ERacc: 97.07% FER2013 and Japanese
female facial expression
(JAFFE)
24 [310] CNN+Gabor filters ERacc: 91–92% JAFFE
25 [121] Convolution block attention CK+: ERacc: 96% CK+, FER-2013
modules (CBAM)+VGG
26 [275] Metric-based emotion CK+: ERacc: 99.64% UNBC-McMaster Shoulder
intensity definition and a Pain Expression Archive
deep hybrid CNN (UNBC), CK+,
27 [239] Spatial and temporal CNN BAUM-1s: ERacc: 46.51% BAUM-1 s, eNTERFACE05
eNTERFACE05: ERacc: 43.72%
28 [208] Conditional convolutional CK+ and JAFFE: ERacc: 99.02% CK+, JAFFE, multi-view
neural network enhanced BU-3DEF and LFW
random forest (CoNERF)
29 [68] Semi-supervised learning RMSE: 0.0451 MAHNOB-HCI,
(SSL) INHA(PERSONAL): 59ppl
30 [75] CNN ERacc: 96% Caltech faces, CMU, NIST,
CK+
31 [245] Temporal Relational Network MLP ERacc: 92.70% DISFA+
(TRN)
32 [255] Transfer learning RMSE = 0.09-valence RMSE AffectNet, AMIGOS
approach-CNN =0.1-arousal
33 [168] DNNs JAFFE: ERacc: 95.23% CK +: CK +, JAFFE
ERacc: 93.24%
34 [260] VGG16 CNN+ Bi-LSTM RNN ERacc +SD: 87.62%± 05.41% - IST-EURECOM Light Field
subject-specific ERacc +SD: Face
80.37%± 05 09.03% -
subject-independent
35 [322] Correlation emotion label ERacc: 84.40% Oulu_CASIA
distribution learning
36 [202] Deep Convolutional Neural ERacc: 92.81% CK+
Network
37 [227] Viola-Jones using Haar-like ELM, PNN ERacc: 88% -ELM ERacc: 92- PERSONAL:55ppl
features PNN
38 [219] Automated marker k-NN, PNN ERacc: 96.94%-PNN PERSONAL:30ppl
placement algorithm
39 [105] Maximum Response-based ELM-RBF CK+ with GSDRS: ERacc: JAFFE, CK+, MUG, SFEW,
Directional Texture Pattern 98.4% (MRDTP) MMI, DISFA, DISFA+
(MRDTP) and a Maximum
Response-based Directional
Number Pattern (MRDNP),
40 [122] FATAUVA-Net, (FATAUVA Reg) MSE:0.1232- FERA2015, BP4D,SEMAINE
valence MSE:0.0954 -
arousal
ELM - Extreme learning machine,PNN-Probabilistic neural network, KDEF - Karolinska Directed Emotional Faces, CART – Classification and regression trees.
22
Table A4
Summary of state of art techniques for multimodal emotion recognition system.
Author & Modalities

Sl. No year Methods Classifiers Results Accuracy(ERacc ) Database considered
1 [81] BDAE feature fusion LIBSVM ERacc : 85.71% Personal: 13ppl EEG signals and
facial expression
2 [160] CNNs and the ELMs SVM Big Data: ERacc : 99.9% Big Data of SPEECH AND
eNTERFACE’05: ERacc : emotion: 50 ppl, VIDEO/IMAGE
86.4% eNTERFACE’05 FRAMES
3 [116] RF, SVM, logistic ERacc : 73.08% - arousal, DEAP Respiratory Belt,
regression (LR) ERacc : 72.18% - valence Photo plethysmog-
raphy,and Fingertip
Temperature from
EEG signal
4 [132] linear regression, ERacc : 78.20% - CNN RAVDEES
DT, RF, SVM, CNN
5 [298] Liblinear, REPTree, ERacc : 96.79% - valence, Personal: 39 ppl EEG and video
XGBoost, MLP, RF, and ERacc : 97.79% - arousal
RBFNetwork
6 [290] Multimodal deep belief LIBSVM, ERacc : 80.89% BioVid EMO- DB EEG and video
network (MDBN)
7 [263] ANN svm RAVDESS: ERacc : 86.36% RAVDESS, VIRI Visible images,
Recall: 0.86 Precision: infrared (IR)
0.88 f-measure: 0.87 images and speech
8 [234] CNN multiclass SVM SAVEE: ERacc : 100% - RF SAVEE, Audio and visual
and RF eNTERFACE’05,
RML
9 [163] STLMBP SVM, 3 Nearest Decision-level (LWF): MAHNOB-HCI Facial, EEG
Neighbour ERacc : 66 0.28% -
Valence, ERacc : 63 0.22%
- Arousal,
10 [215] 2D CNN and 3DCNN SVM Enterface05: ERacc : 92.3% rml, enterface05, Audiovisua
- Arousal, ERacc : 91.8% - baum-1s
Valence,
11 [312] Audio features and EEG Random Forest ERacc: 83.29% APM EEG, Audio
based features
12 [19] Recursive feature SVM D2 dataset-MFE:ERacc : DEAP, EEG, Galvanic skin
elimination (RFE) and 73.69 ± 0.80% F1-score: MAHNOB-HCI response, heart
margin-maximizing 72.07 ± 0.74 rate
feature elimination
(MFE)
13 [266] A-LSTM SVM, k-NN Protocol one: ERacc: Personal: 23ppl EEG, galvanic skin
72.93% Std. deviation: response,
13.19 respiration, ECG
14 [319] CNN+3D-CNN+DBN SVM RML: ERacc : 80.36%, RML, Audio–Visual
eNTERFACE05: ERacc : eNTERFACE05,
85.97% BAUM-1s: ERacc : BAUM-1s
54.57%
15 [143] Neural Aggregation k-NN, RF, SVM, BNB: ERacc : 66.8% - MAHNOB-HCI Visual, EEG
Network with CNN – Bernoulli naive Arousal ERacc : 69.38% -
visual LFCC and SampEn Bayesian (BNB) Valence RF: ERacc : 69.21%
- EEG - Arousal ERacc : 67.15%
-Valence
16 [271] CNN SVM ERacc::83.3% FER-2013, Seed IV, Face, EEG
Personal: 03ppl
Sl. No Author & Methods Classifiers Results Accuracy(ERacc ) Database Modalities

year considered
1 [17] CNN and RNN RML, AFEW6.0, Facial expression
eNTERFACE’05 and speech
2 [16] CNN+LSTM ERacc : 99.81%-Facial ERacc : Personal: 55 ppl Facial Expressions
87.25%-EEG and EEG
3 [85] Multi-Level Multi-Head CMU-MOSEI: ERacc : IEMOCAP, Audio and text.
Fusion attention + RNN 99.19% F1 SCORE: 99.19% MELD,CMU-MOSEI
4 [192] CNN attention model ERacc : 88.89% CMU-MOSEI Speech and Text
5 [80] 1D and 2D CNN ρ c: 0.788 - Arousal ρ c: RECOLA SPEECH AND
0.732 – Valence VISUAL
6 [140] Convolution Bidirectional ERacc : 91.5% Facial and heart
Long Short-term rate
Memory Neural Network
(CBLNN)
23
Sl. No Author & Methods Classifiers Results Accuracy(ERacc ) Database Modalities

year considered
7 [285] Multimodal deep MSE:0.103 –valence LIRIS-ACCEDE Audio and visual

regression Bayesian MSE:0.099-arousal PCC:
network (MMDRBN) 0.375 - valence
PCC:0.330 -arousal
8 [300] 3D Convolutional-Long MOUD:ERacc : 96.75%, MOUD and Visual, textual, and
Short-Term Memory IEMOCAP: ERacc : 78.75% IEMOCAP audio
9 [159] Multi-directional Extreme learning ERacc : 85.06% eNTERFACE Speech, face
regression machine (ELM)
10 [98] Deep multi-task learning F1-score: 78.6%, CMU-MOSEI text, acoustic and
framework Weighted ERacc: 62.8% visual frames) of a
video
11 [254] 2D-CNN and ERacc : 60.59% AEFW audio and visual
3D-CNN+LSTM
12 [295] BLSTM DNNs ERacc : 91% - binary CMUMOSEI, audio, video, and
classification text
13 [270] Spiking neural networks ERacc : 73.15% MAHNOB-HCI Facial, ECG, skin
(SNNs) temperature, skin
conductance,
respiration signal,
mouth length,
pupil size
14 [240] CNN-based multimodal ERacc: 88.56% F1-score: Combined from Audio, video
networks 0.88 AUC: 0.987 multiple sources
15 [284] CNN ERacc::92.6% CK+, FER-2013 Face, EEG
References [17] X. Wang, X. Chen, C. Cao, Human emotion recognition by optimally fusing fa-
cial expression and speech feature, Signal Process. Image Commun. 84 (2020)
[1] J. Kumar, J.A. Kumar, Machine learning approach to classify emotions us- 115831 January, doi:10.1016/j.image.2020.115831.
ing GSR, Adv. Res. Electr. Electron. Eng. 2 (12) (2015) 72–76 https://www. [18] Z. Farhoudi, S. Setayeshi, Fusion of deep learning features with mixture of
krishisanskriti.org/vol_image/18Dec201512125258. Jyotish kumar(electrical) brain emotional learning for audio-visual emotion recognition, Speech Com-
227-231.pdf227-231.pdf (consultado a 02-12-2020). mun. 127 (2021) 92–103 June 2020, doi:10.1016/j.specom.2020.12.001.
[2] M. Ménard, P. Richard, H. Hamdi, B. Daucé, T. Yamaguchi, Emotion recogni- [19] C. Torres-Valencia, M. Álvarez-López, Á. Orozco-Gutiérrez, SVM-based feature
tion based on heart rate and skin conductance, in: PhyCS 2015 - 2nd Inter- selection methods for emotion recognition from multimodal data, J. Multi-
national Conference on Physiological Computing Systems, 2015, pp. 26–32, modal User Interfaces 11 (1) (2017) 9–23, doi:10.1007/s12193- 016- 0222- y.
doi:10.5220/0 0 0524110 0260 032. , Proceedings, January 2015. [20] W. Nie, Y. Yan, D. Song, K. Wang, Multi-modal feature fusion based on multi-
[3] Ekman, P. (1999). Basic emotions. In New York: Sussex U.K.: JohnWiley and layers LSTM for video emotion recognition, Multimedia Tools Appl. 80 (11)
Sons,Ltd (pp. 1–6). 10.1007/978-3-319-28099-8_495-1 (2021) 16205–16214, doi:10.1007/s11042- 020- 08796- 8.
[4] Schmidt, P., Reiss, A., Duerichen, R., & Van Laerhoven, K. (2018). Wearable [21] P.A. Gandhi, J. Kishore, Prevalence of depression and the associated factors
affect and stress recognition: a review. http://arxiv.org/abs/1811.08854 among the software professionals in Delhi: a cross-sectional study, Indian J.
[5] B. Bontchev, Adaptation in affective video games: a literature review, Cybern. Public Health 64 (4) (2020) 413–416, doi:10.4103/ijph.IJPH_568_19.
Inform. Technol. 16 (3) (2016) 3–34, doi:10.1515/cait- 2016- 0032. [22] S. Deb, P.R. Banu, S. Thomas, R.V. Vardhan, P.T. Rao, N. Khawaja, D. Ko-
[6] M. Ali, A.H. Mosa, F.A. Machot, K Kyamakya, Emotion recognition involving morowski, S Pietraszek, Depression among Indian university students and
physiological and speech signals: a comprehensive review, Stud. Syst. Decis. its association with perceived university academic environment, living ar-
Control 18 (7) (2018), doi:10.3390/s18072074. rangements and personal issues, Asian J. Psychiatry 23 (1) (2016) 1–15,
[7] Candra, H. (2017). Emotion recognition using facial expression and electroen- doi:10.1016/j.ajp.2016.07.010.
cephalography features with support vector machine classifier student. [23] D. Moher, A. Liberati, J. Tetzlaff, D.G. Altman, Preferred reporting items
[8] D. Liao, W. Zhang, G. Liang, Y. Li, J. Xie, L. Zhu, X. Xu, L. Shu, Arousal eval- for systematic reviews and meta-analyses: the PRISMA statement, BMJ 339
uation of VR affective scenes based on HR and SAM, IEEE MTT-S 2019 Inter- (7716) (2009) 332–336, doi:10.1136/bmj.b2535.
national Microwave Biomedical Conference, IMBioC 2019, 2019, doi:10.1109/ [24] R. Karthik, R. Menaka, A. Johnson, S. Anand, Neuroimaging and deep learn-
IMBIOC.2019.8777844. ing for brain stroke detection - a review of recent advancements and fu-
[9] A. Gruenewald, D. Kroenert, J. Poehler, R. Brueck, F. Li, J. Littau, K. Schnieber, ture prospects, Comput. Methods Programs Biomed. 197 (2020), doi:10.1016/
A. Piet, M. Grzegorzek, H. Kampling, B. Niehaves, Biomedical data acquisition j.cmpb.2020.105728.
and processing to recognize emotions for affective learning, in: 2018 IEEE [25] S. Layeghian Javan, M.M. Sepehri, H. Aghajani, Toward analyzing and synthe-
18th International Conference on Bioinformatics and Bioengineering, BIBE sizing previous research in early prediction of cardiac arrest using machine
2018, 2018, pp. 126–132, doi:10.1109/BIBE.2018.0 0 031. learning based on a multi-layered integrative framework, J. Biomed. Inform.
[10] A. Goshvarpour, A. Abbasi, A. Goshvarpour, An accurate emotion recognition 88 (2018) 70–89 September, doi:10.1016/j.jbi.2018.10.008.
system using ECG and GSR signals and matching pursuit method, Biomed. J. [26] S.M. Alarcão, M.J. Fonseca, Emotions recognition using EEG signals: a survey,
40 (6) (2017) 355–368, doi:10.1016/j.bj.2017.11.001. IEEE Trans. Affective Comput. 10 (3) (2019) 374–393, doi:10.1109/TAFFC.2017.
[11] A.S. Kanagaraj, A. Shahina, M. Devosh, N. Kamalakannan, EmoMeter: measur- 2714671.
ing mixed emotions using weighted combinational model, in: 2014 Interna- [27] S. Koelstra, I. Patras, Fusion of facial expressions and EEG for implicit affective
tional Conference on Recent Trends in Information Technology, ICRTIT 2014, tagging, Image Vision Comput. 31 (2) (2013) 164–174, doi:10.1016/j.imavis.
2014, pp. 2–7, doi:10.1109/ICRTIT.2014.6996192. 2012.10.002.
[12] J.A. Russell, A circumplex model of affect, J. Pers. Soc. Psychol. 39 (6) (1980) [28] W.L. Zheng, J.Y. Zhu, B.L. Lu, Identifying stable patterns over time for emo-
1161–1178, doi:10.1037/h0077714. tion recognition from eeg, IEEE Trans. Affect. Comput. 10 (3) (2019) 417–429,
[13] H. Dabas, C. Sethi, C. Dua, M. Dalawat, D. Sethia, Emotion classification us- doi:10.1109/TAFFC.2017.2712143.
ing EEG signals, in: ACM International Conference Proceeding Series, 2018, [29] T.B. Alakus, M. Gonen, I. Turkoglu, Database for an emotion recognition sys-
pp. 380–384, doi:10.1145/3297156.3297177. December. tem based on EEG signals and various computer games – GAMEEMO, Biomed.
[14] N.S. Suhaimi, J. Mountstephens, J. Teo, EEG-based emotion recognition: a Signal Process. Control 60 (2020) 101951, doi:10.1016/j.bspc.2020.101951.
state-of-the-art review of current trends and opportunities, Comput. Intell. [30] M. Alex, U. Tariq, F. Al-Shargie, H.S. Mir, H. Al Nashash, Discrimination of gen-
Neurosci. 2020 (2020), doi:10.1155/2020/8875426. uine and acted emotional expressions using EEG signal and machine learning,
[15] J. Zhang, Z. Yin, P. Chen, S. Nichele, Emotion recognition using multi-modal IEEE Access 8 (2020) 191080–191089, doi:10.1109/ACCESS.2020.3032380.
data and machine learning techniques: a tutorial and review, Inform. Fus. 59 [31] A.M. Asghar, M.J. Khan, M. Rizwan, M. Shorfuzzaman, R.M. Mehmood, AI
(2020) 103–126 March 2019, doi:10.1016/j.inffus.2020.01.011. inspired EEG – based spatial feature selection method using multivariate
[16] A. Hassouneh, A.M. Mutawa, M. Murugappan, Development of a real-time empirical mode decomposition for emotion classification, Multimedia Syst.
emotion recognition system using facial expressions and EEG based on ma- (2021) 0123456789, doi:10.10 07/s0 0530- 021- 00782- w.
chine learning and deep neural network methods, Inform. Med. Unlocked 20
(2020) 100372, doi:10.1016/j.imu.2020.100372.
24
[32] M.K. Ahirwal, M.R. Kose, Audio-visual stimulation based emotion classifica- [58] M. Farooq, F. Hussain, N.K. Baloch, F.R. Raja, H. Yu, Y.B. Zikria, Impact of fea-
tion by correlated EEG channels, Health Technol. 10 (1) (2020) 7–23, doi:10. ture selection algorithm on speech emotion recognition using deep convolu-
1007/s12553- 019- 00394- 5. tional neural network, Sensors 20 (21) (2020) 1–18, doi:10.3390/s20216008.
[33] Y. Liu, G. Fu, Emotion recognition by deeply learned multi-channel textual [59] Z. Yang, Y. Huang, Algorithm for speech emotion recognition classification
and EEG features, Future Gen. Comput. Syst. 119 (2021) 1–6, doi:10.1016/j. based on mel-frequency cepstral coefficients and broad learning system, Evol.
future.2021.01.010. Intell. (2021) 0123456789, doi:10.1007/s12065- 020- 00532- 3.
[34] N. Salankar, P. Mishra, L. Garg, Emotion recognition from EEG signals us- [60] S.R. Kadiri, P. Alku, Excitation features of speech for speaker-specific emo-
ing empirical mode decomposition and second-order difference plot, Biomed. tion detection, IEEE Access 8 (2020) 60382–60391, doi:10.1109/ACCESS.2020.
Signal Process. Control 65 (2021) 102389 December 2020, doi:10.1016/j.bspc. 2982954.
2020.102389. [61] R. Jahangir, Y.W. Teh, F. Hanif, G. Mujtaba, Deep learning approaches for
[35] T. Tuncer, S. Dogan, A. Subasi, A new fractal pattern feature generation func- speech emotion recognition: state of the art and research challenges, Multi-
tion based emotion recognition method using EEG, Chaos Solitons Fractals media Tools Appl. 80 (16) (2021) Multimedia Tools and Applications, doi:10.
144 (2021) 110671, doi:10.1016/j.chaos.2021.110671. 1007/s11042- 020- 09874- 7.
[36] Y. Zhang, J. Chen, J.H. Tan, Y. Chen, Y. Chen, D. Li, L. Yang, J. Su, X. Huang, [62] Niu, Y., Zou, D., Niu, Y., He, Z., & Tan, H. (2017). A breakthrough in speech
W. Che, An investigation of deep learning models for EEG-based emotion emotion recognition using deep retinal convolution neural networks. ArXiv,
recognition, Front. Neurosci. 14 (2020) 1–12 December, doi:10.3389/fnins. 1–7.
2020.622759. [63] Dinakaran, K., & Ashokkrishna, E.M. (2020). Efficient regional multi
[37] L. Jin, E.Y. Kim, Interpretable cross-subject EEG-based emotion recognition feature similarity measure based emotion detection system in web
using channel-wise features †, Sensors 20 (2020) 750–762, doi:10.3390/ portal using artificial neural network. Microprocessors Microsyst., 77.
s20236719. 10.1016/j.micpro.2020.103112
[38] T. Song, W. Zheng, P. Song, Z. Cui, EEG Emotion Recognition Using Dynamical [64] H. Ghazouani, A genetic programming-based feature selection and fusion for
Graph Convolutional Neural Networks, in: IEEE TRANSACTIONS ON AFFECTIVE facial expression recognition, Appl. Soft Comput. 103 (2021) 107173, doi:10.
COMPUTING, 11, IEEE, 2020, pp. 532–541. 1016/j.asoc.2021.107173.
[39] Y. Yin, X. Zheng, B. Hu, Y. Zhang, X. Cui, EEG emotion recognition using fusion [65] Y. Ma, W. Chen, X. Ma, J. Xu, X. Huang, R. Maciejewski, A.K.H. Tung, EasySVM:
model of graph convolutional neural networks and LSTM, Appl. Soft Comput. a visual analysis approach for open-box support vector machines, Computat.
100 (2021) 106954, doi:10.1016/j.asoc.2020.106954. Vis. Media 3 (2) (2017) 161–175, doi:10.1007/s41095- 017- 0077- 5.
[40] F. Hasanzadeh, M. Annabestani, S. Moghimi, Continuous emotion recognition [66] H. Jung, S. Lee, J. Yim, S. Park, J. Kim, Joint fine-tuning in deep neural net-
during music listening using EEG signals: a fuzzy parallel cascades model, works for facial expression recognition, in: 2015 International Conference on
Appl. Soft Comput. 101 (2021) 107028, doi:10.1016/j.asoc.2020.107028. Computer Vision, 2015, pp. 2983–2991, doi:10.1109/ICCV.2015.341.
[41] J. Fdez, N. Guttenberg, O. Witkowski, A. Pasquali, Cross-subject EEG-based [67] H. Yang, U. Ciftci, L. Yin, Facial expression recognition by de-expression
emotion recognition through neural networks with stratified normalization, residue learning, in: Proceedings of the IEEE Computer Society Conference on
Front. Neurosci. 15 (2021) February, doi:10.3389/fnins.2021.626277. Computer Vision and Pattern Recognition, 2018, pp. 2168–2177, doi:10.1109/
[42] F. Shen, G. Dai, G. Lin, J. Zhang, W. Kong, H. Zeng, EEG-based emotion recog- CVPR.2018.00231.
nition using 4D convolutional recurrent neural network, Cognit. Neurodyn. 14 [68] D.Y. Choi, B.C. Song, Semi-supervised learning for continuous emotion recog-
(6) (2020) 815–828, doi:10.1007/s11571- 020- 09634- 1. nition based on metric learning, IEEE Access 8 (2020) 113443–113455, doi:10.
[43] D. Komorowski, S. Pietraszek, The use of continuous wavelet transform 1109/ACCESS.2020.3003125.
based on the fast fourier transform in the analysis of multi-channel elec- [69] A.S.D. Devi, C.H. Satyanarayana, An efficient facial emotion recognition sys-
trogastrography recordings, J. Med. Syst. 40 (1) (2016) 1–15, doi:10.1007/ tem using novel deep learning neural network-regression activation classifier,
s10916-015-0358-4. Multimedia Tools Appl. (2021), doi:10.1007/s11042- 021- 10547- 2.
[44] W.L. Zheng, B.L. Lu, Investigating critical frequency bands and channels for [70] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The ex-
EEG-based emotion recognition with deep neural networks, IEEE Trans. Au- tended Cohn-Kanade dataset (CK+): a complete dataset for action unit and
ton. Ment. Dev. 7 (3) (2015) 162–175, doi:10.1109/TAMD.2015.2431497. emotion-specified expression, in: 2010 IEEE Computer Society Conference on
[45] R. Alhalaseh, S. Alasasfeh, Machine-learning-based emotion recognition Computer Vision and Pattern Recognition - Workshops, CVPRW 2010, 2010,
system using EEG signals, Computers 9 (4) (2020) 1–15, doi:10.3390/ pp. 94–101, doi:10.1109/CVPRW.2010.5543262. July.
computers9040095. [71] M. Lyons, “Excavating AI” re-excavated: debunking a fallacious account of the
[46] S.M. Ghosh, S. Bandyopadhyay, D. Mitra, Nonlinear classification of emotion Jaffe dataset, SSRN Electron. J. (2021) 1–20, doi:10.2139/ssrn.3900990.
from EEG signal based on maximized mutual information, Expert Syst. Appl. [72] I.J. Goodfellow, D. Erhan, P. Luc Carrier, A. Courville, M. Mirza, B. Hamner,
185 (2021) 115605 July, doi:10.1016/j.eswa.2021.115605. W. Cukierski, Y. Tang, D. Thaler, D.H. Lee, Y. Zhou, C. Ramaiah, F. Feng, R. Li,
[47] H. Choi, M. Hahn, Sequence-to-sequence emotional voice conversion with X. Wang, D. Athanasakis, J. Shawe-Taylor, M. Milakov, J. Park, Y. Bengio, Chal-
strength control, IEEE Access 9 (2021) 42674–42687, doi:10.1109/ACCESS. lenges in representation learning: a report on three machine learning con-
2021.3065460. tests, Neural Netw. 64 (2015) 59–63 December 2017, doi:10.1016/j.neunet.
[48] M.B. Er, A novel approach for classification of speech emotions based on 2014.09.005.
deep and acoustic features, IEEE Access 8 (2020), doi:10.1109/ACCESS.2020. [73] D.Y. Choi, B.C. Song, Semi-supervised learning for facial expression-based
3043201. emotion recognition in the continuous domain, Multimedia Tools Appl. 79
[49] C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, (37–38) (2020) 28169–28187, doi:10.1007/s11042- 020- 09412- 5.
S. Lee, S.S. Narayanan, IEMOCAP: interactive emotional dyadic motion cap- [74] M.K. Chowdary, T.N. Nguyen, D.J. Hemanth, Deep learning-based facial emo-
ture database, Lang. Resour. Eval. 42 (4) (2008) 335–359, doi:10.1007/ tion recognition for human – computer interaction applications, Neural Com-
s10579- 008- 9076- 6. put. Appl. 8 (2021), doi:10.10 07/s0 0521- 021- 06012- 8.
[50] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, A database [75] N. Mehendale, Facial emotion recognition using convolutional neu-
of German emotional speech, in: 9th European Conference on Speech Com- ral networks (FERC), SN Appl. Sci. 2 (3) (2020) 1–8, doi:10.1007/
munication and Technology, September, 2005, pp. 1517–1520, doi:10.21437/ s42452- 020- 2234- 1.
interspeech.2005-446. [76] N. Hajarolasvadi, E. Bashirov, H. Demirel, Video-based person-dependent and
[51] Parthasarathy, S., Member, S., Busso, C., & Member, S. (2020). Semi-supervised person-independent facial emotion recognition, Signal Image Video Process.
speech emotion recognition. 28, 2697–2709. (2021), doi:10.1007/s11760- 020- 01830- 0.
[52] S.M. Mustaqeem, S. Kwon, MLT-DNet: speech emotion recognition using 1D [77] D. Lakshmi, R. Ponnusamy, Facial emotion recognition using modified HOG
dilated CNN based on multi-learning trick approach, Expert Syst. Appl. 167 and LBP features with deep stacked autoencoders, Microprocessors Microsyst.
(2021) 114177 October 2020, doi:10.1016/j.eswa.2020.114177. 82 (2021) 103834 October 2020, doi:10.1016/j.micpro.2021.103834.
[53] Z. Zhao, Q. Li, Z. Zhang, N. Cummins, H. Wang, J. Tao, W. Schuller, B, Com- [78] D. Liu, L. Chen, L. Wang, Z. Wang, A multi-modal emotion fusion
bining a parallel 2D CNN with a self-attention dilated residual network for classification method combined expression and speech based on atten-
CTC-based discrete speech emotion recognition, Neural Netw. 141 (2021) 52– tion mechanism, Multimedia Tools Appl. (2021) 0123456789, doi:10.1007/
60, doi:10.1016/j.neunet.2021.03.013. s11042- 021- 11260- w.
[54] J. Zhao, X. Mao, L. Chen, Speech emotion recognition using deep 1D & 2D [79] R.J.R. Kumar, M. Sundaram, N. Arumugam, Facial emotion recognition us-
CNN LSTM networks, Biomed. Signal Process. Control 47 (2019) 312–323, ing subband selective multilevel stationary wavelet gradient transform
doi:10.1016/j.bspc.2018.08.035. and fuzzy support vector machine, Vis. Comput. (2020), doi:10.1007/
[55] N. Patel, S. Patel, S.H. Mankad, Impact of autoencoder based compact repre- s00371- 020- 01988- 1.
sentation on emotion detection from audio, J. Ambient Intell. Human. Com- [80] P. Tzirakis, G. Trigeorgis, M.A. Nicolaou, W. Schuller, End-to-end multimodal
put. (2021) 0123456789, doi:10.1007/s12652- 021- 02979- 3. emotion recognition using deep neural networks, IEEE J. Sel. Top. Signal Pro-
[56] J. Ancilin, A. Milton, Improved speech emotion recognition with mel fre- cess. 11 (8) (2017) 1301–1309.
quency magnitude coefficient, Appl. Acoust. 179 (2021) 108046, doi:10.1016/ [81] H. Zhang, Expression-EEG based collaborative multimodal emotion recogni-
j.apacoust.2021.108046. tion using deep autoencoder, IEEE Access 8 (2020) 164130–164143, doi:10.
[57] T. Tuncer, S. Dogan, U.R. Acharya, Automated accurate speech emotion recog- 1109/ACCESS.2020.3021994.
nition system using twine shuffle pattern and iterative neighborhood compo- [82] E.S. Salama, R.A. El-Khoribi, M.E. Shoman, M.A. Wahby Shalaby, A 3D-
nent analysis techniques, Knowl.-Based Syst. 211 (2021) 106547, doi:10.1016/ convolutional neural network framework with ensemble learning techniques
j.knosys.2020.106547. for multi-modal emotion recognition, Egypt. Inform. J. (2021) xxxx, doi:10.
1016/j.eij.2020.07.005.
25
[83] A. Zadeh, P.P. Liang, J. Vanbriesen, S. Poria, E. Tong, E. Cambria, M. Chen, [107] Y. An, N. Xu, Z. Qu, Leveraging spatial-temporal convolutional features for
L.P. Morency, Multimodal language analysis in the wild: CMU-MOSEI dataset EEG-based emotion recognition, Biomed. Signal Process. Control 69 (2021)
and interpretable dynamic fusion graph, in: ACL 2018 - 56th Annual Meeting June, doi:10.1016/j.bspc.2021.102743.
of the Association for Computational Linguistics, Proceedings of the Confer- [108] T. Anvarjon, Mustaqeem, S. Kwon, Deep-net: a lightweight cnn-based speech
ence (Long Papers), 1, 2018, pp. 2236–2246, doi:10.18653/v1/p18-1208. emotion recognition system using deep frequency features, Sensors 20 (18)
[84] O. Martin, I. Kotsia, B. Macq, I. Pitas, in: The eNTERFACE’ 05 Audio-Visual (2020) 1–16, doi:10.3390/s20185212.
Emotion Database - IEEE Conference Publication, 1, 2019, pp. 2–9. https: [109] K.A. Araño, P. Gloor, C. Orsenigo, C. Vercellis, When old meets new: emo-
//ieeexplore.ieee.org/abstract/document/1623803. tion recognition from speech signals, Cognit. Comput. (2021) October 2020,
[85] N.H. Ho, H.J. Yang, S.H. Kim, G. Lee, Multimodal approach of speech emo- doi:10.1007/s12559- 021- 09865- 2.
tion recognition using multi-level multi-head fusion attention-based recur- [110] I. Ariav, I. Cohen, An end-to-end multimodal voice activity detection using
rent neural network, IEEE Access 8 (2020) 61672–61686, doi:10.1109/ACCESS. WaveNet encoder and residual networks, IEEE J. Sel. Top. Signal Process. 13
2020.2984368. (2) (2019) 265–274, doi:10.1109/JSTSP.2019.2901195.
[86] C.P. Loizou, An automated integrated speech and face imageanalysis system [111] M. Arora, M. Kumar, AutoFER: PCA and PSO based automatic facial emotion
for the identification of human emotions, Speech Commun. 130 (2021) 15– recognition, Multimedia Tools Appl. 80 (2) (2021) 3039–3049, doi:10.1007/
26 February, doi:10.1016/j.specom.2021.04.001. s11042- 020- 09726- 4.
[87] D. Roy, P. Panda, K. Roy, Tree-CNN: a hierarchical deep convolutional neural [112] M. Aslan, CNN based efficient approach for emotion recognition, J. King Saud
network for incremental learning, Neural Netw. 121 (2020) 148–160, doi:10. Univ. (2021) xxxx, doi:10.1016/j.jksuci.2021.08.021.
1016/j.neunet.2019.09.010. [113] O. Atila, A. Şengür, Attention guided 3D CNN-LSTM model for accurate speech
[88] Y. Fan, X. Lu, D. Li, Y. Liu, Video-based emotion recognition using CNN- based emotion recognition, Appl. Acoust. 182 (2021), doi:10.1016/j.apacoust.
RNN and C3D hybrid networks, in: ICMI 2016 - Proceedings of the 18th 2021.108260.
ACM International Conference on Multimodal Interaction, 2016, pp. 445–450, [114] J. Atkinson, D. Campos, Improving BCI-based emotion recognition by combin-
doi:10.1145/2993148.2997632. October 2017. ing EEG feature selection and kernel classifiers, Expert Syst. Appl. 47 (2016)
[89] B. Islam, F. Mahmud, A. Hossain, P.B. Goala, M.S. Mia, A facial region seg- 35–41, doi:10.1016/j.eswa.2015.10.049.
mentation based approach to recognize human emotion using fusion of HOG [115] B.T. Atmaja, M. Akagi, Two-stage dimensional emotion recognition by fusing
LBP features and artificial neural network, in: 4th International Conference predictions of acoustic and text networks using SVM, Speech Commun. 126
on Electrical Engineering and Information and Communication Technology, (2021) 9–21 November 2020, doi:10.1016/j.specom.2020.11.003.
ICEEiCT 2018, 2019, pp. 642–646, doi:10.1109/CEEICT.2018.8628140. [116] D. Ayata, Y. Yaslan, M.E. Kamasak, Emotion recognition from multimodal
[90] B. Li, D. Lima, Facial expression recognition via ResNet-50, Int. J. Cognit. Com- physiological signals for emotion aware healthcare systems, J. Med. Biol. Eng.
put. Eng. 2 (2021) 57–64 January, doi:10.1016/j.ijcce.2021.02.002. 40 (2) (2020) 149–157, doi:10.1007/s40846- 019- 00505- 7.
[91] Mungra, D., Agrawal, A., Sharma, P., Tanwar, S., & Obaidat, M.S. (2020). [117] A.M. Badshah, J. Ahmad, N. Rahim, S.W. Baik, Speech emotion recognition
PRATIT: a CNN-based emotion recognition system using histogram equaliza- from spectrograms with deep convolutional neural network, in: 2017 Inter-
tion and data augmentation. Multimedia Tools Appl., 79(3–4), 2285–2307. national Conference on Platform Technology and Service, PlatCon 2017, 2017
10.1007/s11042-019-08397-0 - Proceedings, doi:10.1109/PlatCon.2017.7883728.
[92] H. Becker, J. Fleureau, P. Guillotel, F. Wendling, I. Merlet, L. Albera, Emotion [118] A.M. Bhatti, M. Majid, S.M. Anwar, B. Khan, Human emotion recognition and
recognition based on high-resolution EEG recordings and reconstructed brain analysis in response to audio music using brain signals, Comput. Hum. Behav.
sources, IEEE Trans. Affective Comput. 11 (2) (2020) 244–257, doi:10.1109/ 65 (2016) 267–275, doi:10.1016/j.chb.2016.08.029.
TAFFC.2017.2768030. [119] J.D. Bodapati, N. Veeranjaneyulu, Facial emotion recognition using deep CNN
[93] H. Zhang, A. Jolfaei, M. Alazab, A Face Emotion recognition method using con- based features, Int. J. Innov. Technol. Explor. Eng. 8 (7) (2019) 1928–1931.
volutional neural network and image edge computing, IEEE Access 7 (2019) [120] H. Cai, Z. Qu, Z. Li, Y. Zhang, X. Hu, B. Hu, Feature-level fusion approaches
159081–159089, doi:10.1109/ACCESS.2019.2949741. based on multimodal EEG data for depression recognition, Inform. Fus. 59
[94] J. Li, Z. Zhang, H. He, Hierarchical convolutional neural networks for EEG- (2020) 127–138 March 2019, doi:10.1016/j.inffus.2020.01.008.
based emotion recognition, Cognit. Comput. 10 (2) (2018) 368–380, doi:10. [121] W. Cao, Z. Feng, D. Zhang, Y. Huang, Facial expression recognition via a CBAM
1007/s12559- 017- 9533- x. embedded network, Proc. Comput. Sci. 174 (2020) 463–477, doi:10.1016/j.
[95] Xiaowei Li, B. Hu, S. Sun, H. Cai, EEG-based mild depressive detection us- procs.2020.06.115.
ing feature selection methods and classifiers, Comput. Methods Programs [122] W.Y. Chang, S.H. Hsu, J.H. Chien, FATAUVA-net: an integrated deep learning
Biomed. 136 (2016) 151–161 November, doi:10.1016/j.cmpb.2016.08.010. framework for facial attribute recognition, action unit detection, and valence-
[96] R. Alazrai, R. Homoud, H. Alwanni, M.I. Daoud, EEG-based emotion recogni- arousal estimation, IEEE Computer Society Conference on Computer Vision
tion using quadratic time-frequency distribution, Sensors 18 (8) (2018) 1–32, and Pattern Recognition Workshops, 2017 2017-July 1963–1971, doi:10.1109/
doi:10.3390/s18082739. CVPRW.2017.246.
[97] M. Abdelwahab, C. Busso, Domain adversarial for acoustic emotion recogni- [123] A. Chatziagapi, G. Paraskevopoulos, D. Sgouropoulos, G. Pantazopou-
tion, IEEE/ACM Trans. Audio Speech Lang. Process. 26 (12) (2018) 2423–2435, los, M. Nikandrou, T. Giannakopoulos, A. Katsamanis, A. Potamianos,
doi:10.1016/b978- 0- 12- 804490- 2.0 0 0 05-1. S. Narayanan, Data augmentation using GANs for speech emotion recogni-
[98] M.S. Akhtar, D.S. Chauhan, D. Ghosal, S. Poria, A. Ekbal, P. Bhattacharyya, tion, in: Proceedings of the Annual Conference of the International Speech
Multi-task learning for multi-modal emotion recognition and sentiment anal- Communication Association, INTERSPEECH, 2019-Septe, 2019, pp. 171–175,
ysis, in: NAACL HLT 2019 - 2019 Conference of the North American Chapter doi:10.21437/Interspeech.2019-2561.
of the Association for Computational Linguistics: Human Language Technolo- [124] K.H. Cheah, H. Nisar, V.V. Yap, C.Y. Lee, Short-time-span EEG-based person-
gies - Proceedings of the Conference, 1, 2019, pp. 370–379, doi:10.18653/v1/ alized emotion recognition with deep convolutional neural network, in: Pro-
n19-1034. ceedings of the 2019 IEEE International Conference on Signal and Image Pro-
[99] F. Al-shargie, T.B. Tang, N. Badruddin, M. Kiguchi, Towards multilevel mental cessing Applications, ICSIPA 2019, 2019, pp. 78–83, doi:10.1109/ICSIPA45851.
stress assessment using SVM with ECOC: an EEG approach, Med. Biol. Eng. 2019.8977786.
Comput. 56 (1) (2018) 125–136, doi:10.1007/s11517-017-1733-8. [125] J.X. Chen, P.W. Zhang, Z.J. Mao, Y.F. Huang, D.M. Jiang, Y.N. Zhang, Accu-
[100] F. Al-Shargie, U. Tariq, M. Alex, H. Mir, H. Al-Nashash, Emotion recognition rate EEG-based emotion recognition on combined features using deep con-
based on fusion of local cortical activations and dynamic functional networks volutional neural networks, IEEE Access 7 (2019) 44317–44328, doi:10.1109/
connectivity: an EEG study, IEEE Access 7 (2019) 143550–143562, doi:10.1109/ ACCESS.2019.2908285.
ACCESS.2019.2944008. [126] M. Chen, X. He, J. Yang, H. Zhang, 3-D convolutional recurrent neural net-
[101] D.A. AL CHANTI, A. Caplier, Deep learning for spatio-temporal modeling of works with attention model for speech emotion recognition, IEEE Signal Pro-
dynamic spontaneous emotions, IEEE Trans. Affect. Comput. 3045(c) (2018) cess Lett. 25 (10) (2018) 1440–1444, doi:10.1109/LSP.2018.2860246.
1–14, doi:10.1109/TAFFC.2018.2873600. [127] Chen, P., & Zhang, J. (2017). Performance comparison of machine learning al-
[102] S. Alhagry, A.A. Fahmy, R.A. El-Khoribi, Emotion Recognition based on EEG gorithms for EEG-signal-based emotion recognition. Lecture Notes in Com-
using LSTM Recurrent Neural Network, Int. J. Adv. Comput. Sci. Appl. 8 (10) puter Science (Including Subseries Lecture Notes in Artificial Intelligence and
(2017) 8–11, doi:10.14569/ijacsa.2017.081046. Lecture Notes in Bioinformatics), 10613 LNCS, 208–216. 10.1007/978-3-319-
[103] M. Ali, A.H. Mosa, F. Al Machot, K Kyamakya, EEG-based emotion recognition 68600-4_25
approach for e-healthcare applications, in: International Conference on Ubiq- [128] Q. Chen, G. Huang, A novel dual attention-based BLSTM with hybrid features
uitous and Future Networks, ICUFN, 2016, pp. 946–950, doi:10.1109/ICUFN. in speech emotion recognition, Eng. Appl. Artif. Intell. 102 (2021) 104277
2016.7536936. 2016-Augus. April, doi:10.1016/j.engappai.2021.104277.
[104] R.S. Alkhawaldeh, DGR: gender recognition of human speech using one- [129] T. Chen, H. Yin, Emotion recognition based on fusion of long short-term
dimensional conventional neural network, Sci. Program. 2019 (2019), doi:10. memory networks and SVMs, Digital Signal Process. 1 (2021) 1–10, doi:10.
1155/2019/7213717. 1016/j.dsp.2021.103153.
[105] A.S. Alphonse, D. Dharma, Novel directional patterns and a generalized su- [130] Chen, X., Huang, R., Li, X., Xiao, L., Zhou, M., & Zhang, L. (2021). A novel user
pervised dimension reduction system (GSDRS) for facial emotion recog- emotional interaction design model using long and short-term memory net-
nition, Multimedia Tools Appl. 77 (8) (2018) 9455–9488, doi:10.1007/ works and deep learning. 12(April), 1–13. 10.3389/fpsyg.2021.674853
s11042- 017- 5141- 8. [131] Chernykh, V., & Prikhodko, P. (2018). Emotion recognition from speech with
[106] M. Alsolamy, A. Fattouh, Emotion estimation from EEG signals during lis- recurrent neural networks. ArXiv.
tening to Quran using PSD features, in: CSIT 2016: 2016 7th International [132] A. Christy, S. Vaithyasubramanian, A. Jesudoss, M.D.A. Praveena, Multimodal
Conference on Computer Science and Information Technology, 2016, pp. 3– speech emotion recognition and classification using convolutional neural net-
7, doi:10.1109/CSIT.2016.7549457.
26
work techniques, Int. J. Speech Technol. 23 (2) (2020) 381–388, doi:10.1007/ [160] M.S. Hossain, G. Muhammad, Emotion recognition using deep learning ap-
s10772- 020- 09713- y. proach from audio–visual emotional big data, Inform. Fus. 49 (2019) 69–78
[133] S. Cunningham, H. Ridley, J. Weinel, R. Picking, Supervised machine learn- November 2017, doi:10.1016/j.inffus.2018.09.008.
ing for audio emotion recognition: enhancing film sound design using audio [161] M. Hu, H. Wang, X. Wang, J. Yang, R. Wang, Video facial emotion recogni-
features, regression models and artificial neural networks, Pers. Ubiquitous tion based on local enhanced motion history image and CNN-CTSLSTM net-
Comput. (2020), doi:10.10 07/s0 0779- 020- 01389- 0. works, J. Visual Commun. Image Represent. 59 (2019) 176–185, doi:10.1016/j.
[134] S. Datta, D. Sen, R. Balasubramanian, Integrating geometric and textural fea- jvcir.2018.12.039.
tures for facial emotion classification using SVM frameworks, in: Proceedings [162] R.H. Huan, J. Shu, S.L. Bao, R.H. Liang, P. Chen, K.K. Chi, Video multimodal
of International Conference on Computer Vision and Image Processing, 78, emotion recognition based on Bi-GRU and attention fusion, Multimedia Tools
2017, pp. 10287–10323, doi:10.1007/s11042- 018- 6537- 9. Appl. 80 (6) (2021) 8213–8240, doi:10.1007/s11042- 020- 10030- 4.
[135] S. Deb, S. Dandapat, Emotion classification using segmentation of vowel– [163] Huang, X., Kortelainen, J., Zhao, G., Li, X., Moilanen, A., Seppänen, T., &
like and non-vowel-like regions, IEEE Trans. Affective Comput. 10 (3) (2019) Pietikäinen, M. (2016). Multi-modal emotion analysis from facial expressions
360–373. and electroencephalogram. 147, 114–124. 10.1016/j.cviu.2015.09.015
[136] J. Deng, X. Xu, Z. Zhang, S. Fruhholz, B. Schuller, Universum autoencoder- [164] X. Huang, S.J. Wang, X. Liu, G. Zhao, X. Feng, M. Pietikainen, Discrimina-
based domain adaptation for speech emotion recognition, IEEE Signal Process tive spatiotemporal local binary pattern with revisited integral projection for
Lett. 24 (4) (2017) 500–504, doi:10.1109/LSP.2017.2672753. spontaneous facial micro-expression recognition, IEEE Trans. Affect. Comput.
[137] P. Dhankhar, N. Delhi, ResNet-50 and VGG-16 for recognizing facial emotions, 10 (1) (2019) 32–47, doi:10.1109/TAFFC.2017.2713359.
Int. J. Innov. Eng. Technol. 13 (4) (2019) 126–130. [165] X. Huang, G. Zhao, X. Hong, W. Zheng, M. Pietikäinen, Spontaneous facial
[138] L.N. Do, H.J. Yang, H.D. Nguyen, S.H. Kim, G.S. Lee, I.S. Na, Deep neural micro-expression analysis using spatiotemporal completed local quantized
network-based fusion model for emotion recognition using visual data, J. Su- patterns, Neurocomputing 175 (2016) 564–578 PartA, doi:10.1016/j.neucom.
percomput. (2021) 0123456789, doi:10.1007/s11227- 021- 03690- y. 2015.10.096.
[139] A. Dogan, M. Akay, P. Barua, M. Baygin, S. Dogan, T. Tuncer, A. Dogru, [166] M.G. Huddar, S.S. Sannakki, V.S. Rajpurohit, Attention-based multimodal con-
U. Acharya, PrimePatNet87: prime pattern and tunable q-factor wavelet trans- textual fusion for sentiment and emotion classification using bidirectional
form techniques for automated accurate EEG emotion recognition, Comput. LSTM, Multimedia Tools Appl. 80 (9) (2021) 13059–13076, doi:10.1007/
Biol. Med. 138 (2021). s11042- 020- 10285- x.
[140] G. Du, Z. Wang, B. Gao, S. Mumtaz, K.M. Abualnaja, C. Du, A convolution [167] M.R. Islam, M.M. Islam, M.M. Rahman, C. Mondal, S.K. Singha, M. Ahmad,
bidirectional long short-term memory neural network for driver emotion A. Awal, M.S. Islam, M.A. Moni, EEG Channel Correlation Based Model for
recognition, IEEE Trans. Intell. Transp. Syst. (2020) 1–9, doi:10.1109/tits.2020. Emotion Recognition, Comput. Biol. Med. 136 (2021) 104757 August, doi:10.
3007357. 1016/j.compbiomed.2021.104757.
[141] M.B. Er, H. Çiğ, İ.B. Aydilek, A new approach to recognition of human emo- [168] D.K. Jain, P. Shamsolmoali, P. Sehdev, Extended deep neural network for facial
tions using brain signals and music stimuli, Appl. Acoust. 175 (2021), doi:10. emotion recognition, Pattern Recognit. Lett. 120 (2019) 69–74, doi:10.1016/j.
1016/j.apacoust.2020.107840. patrec.2019.01.008.
[142] Y. Fang, H. Yang, X. Zhang, H. Liu, B. Tao, Multi-feature input deep forest for [169] A. Jaiswal, A. Krishnama Raju, S. Deb, Facial emotion detection using deep
EEG-based emotion recognition, Front. Neurorobot. 14 (2021) 1–11 January, learning, in: 2020 International Conference for Emerging Technology, INCET
doi:10.3389/fnbot.2020.617531. 2020, 2020, pp. 1–5, doi:10.1109/INCET49848.2020.9154121.
[143] Y. Fang, R. Rong, J. Huang, Hierarchical fusion of visual and physiological [170] A. Jalilifard, E.B. Pizzolato, M.K. Islam, Emotion classification using single-
signals for emotion recognition, Multidimension. Syst. Signal Process. 32 (4) channel scalp-EEG recording, in: Proceedings of the Annual International
(2021) 1103–1121, doi:10.1007/s11045- 021- 00774- z. Conference of the IEEE Engineering in Medicine and Biology Society, EMBS,
[144] S. Farashi, R. Khosrowabadi, EEG based emotion recognition using minimum 2016, pp. 845–849, doi:10.1109/EMBC.2016.7590833. 2016-Octob.
spanning tree, Phys. Eng. Sci. Med. 43 (3) (2020) 985–996, doi:10.1007/ [171] M. Javidan, M. Yazdchi, Z. Baharlouei, A. Mahnam, Feature and channel se-
s13246- 020- 00895- y. lection for designing a regression-based continuous-variable emotion recog-
[145] H.M. Fayek, M. Lech, L. Cavedon, Evaluating deep learning architectures for nition system with two EEG channels, Biomed. Signal Process. Control 70
speech emotion recognition, Neural Netw. 92 (2017) 60–68, doi:10.1016/j. (2021) 102979 July, doi:10.1016/j.bspc.2021.102979.
neunet.2017.02.013. [172] J. Jayalekshmi, Tessy Mathew, Facial expression recognition and emotion clas-
[146] R. Fourati, B. Ammar, J. Sanchez-Medina, A.M. Alimi, Unsupervised learning sification system for sentiment analysis, 2017 International Conference on
in reservoir computing for EEG-based emotion recognition, IEEE Trans. Affect. Networks & Advances in Computational Technologies (NetACT), IEEE, 2017.
Comput. 3045 (c) (2020) 1–13, doi:10.1109/TAFFC.2020.2982143. [173] N. Ji, L. Ma, H. Dong, X. Zhang, EEG signals feature extraction based on DWT
[147] N. Ganapathy, Y.R. Veeranki, H. Kumar, R Swaminathan, Emotion recognition and EMD combined with approximate entropy, Brain Sci. 9 (8) (2019), doi:10.
using electrodermal activity signals and multiscale deep convolutional neural 3390/brainsci9080201.
network, J. Med. Syst. 45 (49) (2021) 1–10, doi:10.1007/s10916- 020- 01676- 6. [174] J. Jia, S. Zhou, Y. Yin, B. Wu, W. Chen, F. Meng, Y. Wang, Inferring Emotions
[148] Qiang Gao, Y. Yang, Q. Kang, Z. Tian, Y Song, EEG-based emotion recognition from Large-Scale Internet Voice Data, IEEE Trans. Multimedia 21 (7) (2019)
with feature fusion networks, Int. J. Mach. Learn. Cybern. (2021) 0123456789, 1853–1866, doi:10.1109/TMM.2018.2887016.
doi:10.1007/s13042- 021- 01414-5. [175] A. Joseph, P. Geetha, Facial emotion detection using modified eyemap–
[149] Qinquan Gao, H. Zeng, G. Li, T. Tong, Graph reasoning-based emotion recog- mouthmap algorithm on an enhanced image and classification with tensor-
nition network, IEEE Access 9 (2021) 6488–6497, doi:10.1109/ACCESS.2020. flow, Vis. Comput. 36 (3) (2020) 529–539, doi:10.10 07/s0 0371- 019- 01628- 3.
3048693. [176] V.M. Joshi, R.B. Ghongade, IDEA: intellect database for emotion analysis using
[150] C. Guanghui, Z. Xiaoping, Multi-modal emotion recognition by fusing corre- EEG signal, J. King Saud Univ. (2020) xxxx, doi:10.1016/j.jksuci.2020.10.007.
lation features of speech-visual, IEEE Signal Process Lett. 28 (2021) 533–537, [177] P. Kaviya, T. Arumugaprakash, Group facial emotion analysis system using
doi:10.1109/LSP.2021.3055755. convolutional neural network, in: Proceedings of the Fourth International
[151] L. Guo, L. Wang, J. Dang, Z. Liu, H. Guan, Exploration of complementary fea- Conference on Trends in Electronics and Informatics (ICOEI 2020), IEEE, 2020,
tures for speech emotion recognition based on kernel extreme learning ma- pp. 643–647.
chine, IEEE Access 7 (2019) 75798–75809, doi:10.1109/ACCESS.2019.2921390. [178] M. Kheirkhah, S. Brodoehl, L. Leistritz, T. Götz, P. Baumbach, R. Huonker,
[152] A. Gupta, S. Arunachalam, R. Balakrishnan, Deep self-attention network for O.W. Witte, C.M. Klingner, Automated emotion classification in the early
facial emotion recognition, Proc. Comput. Sci. 171 (2019) (2020) 1527–1534, stages of cortical processing: an MEG study, Artif. Intell. Med. 115 (2021) July
doi:10.1016/j.procs.2020.04.163. 2019, doi:10.1016/j.artmed.2021.102063.
[153] R. Gupta, L.K. Vishwamitra, Facial expression recognition from videos using [179] D. Kollias, S.P. Zafeiriou, Exploiting multi-CNN features in CNN-RNN based
CNN and feature aggregation, Mater. Today (2021) xxxx, doi:10.1016/j.matpr. dimensional emotion recognition on the OMG in-the-wild dataset, IEEE Trans.
2020.11.795. Affect. Comput. 3045(c) (2020) 1–12, doi:10.1109/TAFFC.2020.3014171.
[154] V. Gupta, M.D. Chopda, R.B. Pachori, Cross-subject emotion recognition using [180] W. Kong, X. Song, J. Sun, Emotion recognition based on sparse representa-
flexible analytic wavelet transform from EEG signals, IEEE Sensors J. 19 (6) tion of phase synchronization features, Multimedia Tools Appl. 80 (14) (2021)
(2019) 2266–2274, doi:10.1109/JSEN.2018.2883497. 21203–21217, doi:10.1007/s11042- 021- 10716- 3.
[155] A.K. Hassan, S.N. Mohammed, A novel facial emotion recognition scheme [181] R. Kosti, J.M. Alvarez, A. Recasens, A. Lapedriza, Context based emotion recog-
based on graph mining, Defence Technol. 16 (5) (2020) 1062–1072, doi:10. nition using EMOTIC dataset, IEEE Trans. Pattern Anal. Mach. Intell. 42 (11)
1016/j.dt.2019.12.006. (2020) 2755–2766.
[156] H. He, Y. Tan, J. Ying, W. Zhang, Strengthen EEG-based emotion recogni- [182] P.T. Krishnan, A.N. Joseph Raj, V. Rajangam, Emotion classification from
tion using firefly integrated optimization algorithm, Appl. Soft Comput. J. 94 speech signal based on empirical mode decomposition and non-linear
(2020) 106426, doi:10.1016/j.asoc.2020.106426. features, Complex Intell. Syst. 7 (4) (2021) 1919–1934, doi:10.1007/
[157] X. He, W. Zhang, Emotion recognition by assisted learning with convolutional s40747-021-00295-z.
neural networks, Neurocomputing 291 (2018) 187–194, doi:10.1016/j.neucom. [183] N. Kumar, K. Khaund, S.M. Hazarika, Bispectral analysis of EEG for emotion
2018.02.073. recognition, Proc. Comput. Sci. 84 (2016) 31–35, doi:10.1016/j.procs.2016.04.
[158] F. Hernández-Luquin, H.J. Escalante, Multi-branch deep radial basis func- 062.
tion networks for facial emotion recognition, Neural Comput. Appl. (2021) [184] U. Kumaran, S. Radha Rammohan, S.M. Nagarajan, A. Prathik, Fusion of mel
0123456789, doi:10.10 07/s0 0521-021-06420-w. and gammatone frequency cepstral coefficients for speech emotion recog-
[159] M.S. Hossain, G. Muhammad, Audio-visual emotion recognition using multi- nition using deep C-RNN, Int. J. Speech Technol. 24 (2) (2021) 303–314,
directional regression and Ridgelet transform, Journal on Multimodal User In- doi:10.1007/s10772- 020- 09792- x.
terfaces 10 (4) (2016) 325–333, doi:10.1007/s12193- 015- 0207- 2.
27
[185] S. Kuruvayil, S. Palaniswamy, Emotion recognition from facial images with si- [210] I. Livieris, E. Pintelas, P. Pintelas, Gender recognition by voice using an im-
multaneous occlusion, pose and illumination variations using meta-learning, proved self-labeled algorithm, Mach. Learn. Knowl. Extract. 1 (1) (2019) 492–
J. King Saud Univ. (2021) xxxx, doi:10.1016/j.jksuci.2021.06.012. 503, doi:10.3390/make1010030.
[186] Z. Lan, G.R. Muller-Putz, L. Wang, Y. Liu, O. Sourina, R. Scherer, Using sup- [211] N. Lopes, A. Silva, S.R. Khanal, A. Reis, J. Barroso, V. Filipe, J. Sampaio, Fa-
port vector regression to estimate valence level from EEG, in: 2016 IEEE In- cial emotion recognition in the elderly using a SVM classifier, 2018 - 2nd
ternational Conference on Systems, Man, and Cybernetics, SMC 2016, 2016, International Conference on Technology and Innovation in Sports, Health and
pp. 2558–2563, doi:10.1109/SMC.2016.7844624. - Conference Proceedings. Wellbeing (TISHW), 2018, doi:10.1109/TISHW.2018.8559494.
[187] Z. Lan, O. Sourina, L. Wang, Y. Liu, Real-time EEG-based emotion monitor- [212] R. Lotfian, C. Busso, Lexical dependent emotion detection using synthetic
ing using stable features, Vis. Comput. 32 (3) (2016) 347–358, doi:10.1007/ speech reference, IEEE Access 7 (2019) 22071–22085, doi:10.1109/ACCESS.
s00371- 015- 1183- y. 2019.2898353.
[188] I. Lasri, A.R. Solh, M.E. Belkacemi, Facial emotion recognition of students us- [213] Lyons, M.J., Kamachi, M., & Gyoba, J. (2020). Coding facial expressions with
ing convolutional neural network, in: 2019 Third International Conference on Gabor wavelets (IVC special issue). ArXiv, 1–13. 10.5281/zenodo.4029679
Intelligent Computing in Data Sciences (ICDS), 2019, pp. 1–6. [214] M Murugappan, A Mutawa, Facial geometric feature extraction based emo-
[189] S. Latif, R. Rana, S. Younis, J. Qadir, J. Epps, Transfer learning for improving tional expression classification using machine learning algorithms, PLoS ONE
speech emotion classification accuracy, in: Proceedings of the Annual Confer- 16 (2) (2021) e0247131, doi:10.1371/journal.pone.0247131.
ence of the International Speech Communication Association, INTERSPEECH, [215] Y. Ma, Y. Hao, M. Chen, J. Chen, P. Lu, A. Košir, Audio-visual emotion fusion
2018, pp. 257–261, doi:10.21437/Interspeech.2018-1625. 2018-Septe. (AVEF): a deep efficient weighted approach, Inform. Fus. 46 (2019) 184–192
[190] D. Le, Z. Aldeneh, E.M. Provost, Discretized continuous speech emotion recog- June 2018, doi:10.1016/j.inffus.2018.06.003.
nition with multi-task deep recurrent neural network, in: Proceedings of the [216] Maheshwari, D., Ghosh, S.K., Tripathy, R.K., Sharma, M., & Acharya, U.R.
Annual Conference of the International Speech Communication Association, (2021). Automated accurate emotion recognition system using rhythm-
INTERSPEECH, 2017, pp. 1108–1112, doi:10.21437/Interspeech.2017-94. 2017- specific deep convolutional neural network technique with multi-channel
Augus. EEG signals. April.
[191] M. Lech, M. Stolar, C. Best, R. Bolia, Real-time speech emotion recognition [217] Q. Mao, G. Xu, W. Xue, J. Gou, Y. Zhan, Learning emotion-discriminative and
using a pre-trained image classification network: effects of bandwidth reduc- domain-invariant features for domain adaptation in speech emotion recogni-
tion and companding, Front. Comput. Sci. 2 (2020) 1–14 May, doi:10.3389/ tion, Speech Commun. 93 (2017) 1–10, doi:10.1016/j.specom.2017.06.006.
fcomp.2020.0 0 014. [218] Q. Mao, W. Xue, Q. Rao, F. Zhang, Zhan Yongzhao, Yongzhao, Domain adap-
[192] Lee, C., Song, K., Jeong, J., & Choi, W. (2019). Convolutional attention net- tation for speech emotion recognition by sharing priors between related
works for multimodal emotion recognition from speech and text data. 28–34. source and target classes, in: 2016 IEEE International Conference on Acous-
10.18653/v1/w18-3304 tics, Speech and Signal Processing (ICASSP), IEEE, China, 2016, pp. 2608–2612.
[193] S. Lee, D.K. Han, H. Ko, Fusion-convbert: parallel convolution and Bert fusion [219] V. Maruthapillai, M. Murugappan, Optimal geometrical set for automated
for speech emotion recognition, Sensors 20 (22) (2020) 1–19, doi:10.3390/ marker placement to virtualized real-Time facial emotions, PLoS ONE 11 (2)
s20226688. (2016) 1–18, doi:10.1371/journal.pone.0149003.
[194] D. Li, J. Liu, Z. Yang, L. Sun, Z. Wang, Speech emotion recognition using recur- [220] R.M. Mehmood, H.J. Lee, A novel feature extraction method based on late pos-
rent neural networks with directional self-attention, Expert Syst. Appl. 173 itive potential for emotion recognition in human brain signal patterns, Com-
(2021) 114683 September 2019, doi:10.1016/j.eswa.2021.114683. put. Electr. Eng. 53 (2016) 444–457, doi:10.1016/j.compeleceng.2016.04.009.
[195] D. Li, Y. Zhou, Z. Wang, D. Gao, Exploiting the potentialities of features for [221] N. Mehta, S. Jadhav, Facial emotion recognition using log Gabor filter and
speech emotion recognition, Inform. Sci. 548 (2021) 328–343, doi:10.1016/j. PCA, 2nd International Conference on Computing, Communication, Control
ins.2020.09.047. and Automation, ICCUBEA 2016, 2017, doi:10.1109/ICCUBEA.2016.7860054.
[196] H. Li, H. Xu, Deep reinforcement learning for robust emotional classifica- [222] A. Mert, A. Akan, Emotion recognition from EEG signals by using multivari-
tion in facial expression recognition, Knowl.-Based Syst. 204 (2020) 106172, ate empirical mode decomposition, Pattern Anal. Appl. 21 (1) (2018) 81–89,
doi:10.1016/j.knosys.2020.106172. doi:10.1007/s10044- 016- 0567- 6.
[197] S. Li, X. Xing, W. Fan, B. Cai, P. Fordson, X. Xu, Spatiotemporal and frequential [223] S. Mirsamadi, E. Barsoum, C. Zhang, Automatic speech emotion recognition
cascaded attention networks for speech emotion recognition, Neurocomput- using recurrent neural networks with local attention, in: IEEE International
ing 448 (2021) 238–248, doi:10.1016/j.neucom.2021.02.094. Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2017, IEEE,
[198] Xin Li, X.Q. Sun, X.Y. Qi, X.F. Sun, Relevance vector machine based EEG emo- Richardson USA, 2017, pp. 2227–2231, doi:10.1016/j.specom.2019.09.002.
tion recognition, in: 2016 6th International Conference on Instrumentation [224] Z. Mohammadi, J. Frounchi, M. Amiri, Wavelet-based emotion recognition
and Measurement, Computer, Communication and Control, IMCCC 2016, 2016, system using EEG signal, Neural Comput. Appl. 28 (8) (2017) 1985–1990,
pp. 293–297, doi:10.1109/IMCCC.2016.106. doi:10.10 07/s0 0521- 015- 2149- 8.
[199] Li, Yang, Fu, B., Li, F., Shi, G., & Zheng, W. (2021). A novel transferability [225] G. Muhammad, M.S. Hossain, Emotion recognition for cognitive edge comput-
attention neural network model for EEG emotion recognition. 447, 92–101. ing using deep learning, IEEE Internet Things J. 4662 (c) (2021), doi:10.1109/
10.1016/j.neucom.2021.02.048 JIOT.2021.3058587.
[200] Y. Li, T. Zhao, T. Kawahara, Improved end-to-end speech emotion recogni- [226] R. Munoz, R. Olivares, C. Taramasco, R. Villarroel, R. Soto, T.S. Barcelos,
tion using self attention mechanism and multitask learning, in: Proceed- E. Merino, M.F. Alonso-Sánchez, Using black hole algorithm to improve EEG-
ings of the Annual Conference of the International Speech Communication based emotion recognition, Comput. Intell. Neurosci. 2018 (2018), doi:10.
Association, INTERSPEECH, 2019-Septe, 2019, pp. 2803–2807, doi:10.21437/ 1155/2018/3050214.
Interspeech.2019-2594. [227] M. Murugappan, V. Maruthapillai, W. Khariunizam, A.M. Mutawa, S. Sruthi,
[201] Z. Li, G. Zhang, J. Dang, L. Wang, J. Wei, Multi-modal emotion recognition C.W. Yean, Virtual markers based facial emotion recognition using ELM and
based on deep learning of eeg and audio signals, in: Proceedings of the In- PNN classifiers, in: 2020 16th IEEE International Colloquium on Signal Pro-
ternational Joint Conference on Neural Networks, 2021-July, 2021, pp. 1–6, cessing and Its Applications, CSPA 2020, CSPA, 2020, pp. 261–265, doi:10.
doi:10.1109/IJCNN52387.2021.9533663. 1109/CSPA48992.2020.9068708.
[202] D.Y. Liliana, Emotion recognition from facial expression using deep con- [228] M. Murugappan, A.M. Mutawa, S. Sruthi, A. Hassouneh, A. Abdulsalam, S. Jer-
volutional neural network, J. Phys. Conf. Ser. 1193 (1) (2019), doi:10.1088/ ritta, R. Ranjana, Facial expression classification using KNN and decision tree
1742-6596/1193/1/012004. classifiers, in: 4th International Conference on Computer, Communication and
[203] W. Lim, D. Jang, T. Lee, Speech emotion recognition using convolutional re- Signal Processing, ICCCSP 2020, 2020, pp. 15–17, doi:10.1109/ICCCSP49186.
current neural networks and spectrograms, in: 2016 Asia-Pacific Signal and 2020.9315234.
Information Processing Association Annual Summit and Conference (APSIPA), [229] M. Murugappan, B.S. Zheng, W. Khairunizam, Recurrent quantification
2016, pp. 1–4, doi:10.1109/CCECE47787.2020.9255752. analysis-based emotion classification in stroke using electroencephalo-
[204] D. Liu, L. Chen, Z. Wang, G. Diao, Speech expression multimodal emotion gram signals, Arab. J. Sci. Eng. 46 (10) (2021) 9573–9588, doi:10.1007/
recognition based on deep belief network, J. Grid Comput. (2021) 2/2021, s13369- 021- 05369- 1.
doi:10.1007/s10723- 021- 09564- 0. [230] S.M. Mustaqeem, S. Kwon, CLSTM: deep feature-based speech emotion recog-
[205] X. Liu, X. Cheng, K. Lee, GA-SVM-based facial emotion recognition, IEEE Sen- nition using the hierarchical convlstm network, Mathematics 8 (12) (2020)
sors J. 21 (10) (2021) 11532–11542. 1–19, doi:10.3390/math8122133.
[206] Y.J. Liu, M. Yu, G. Zhao, J. Song, Y. Ge, Y. Shi, Real-time movie-induced dis- [231] S.M. Mustaqeem, Md. Sajjad, S. Kwon, Clustering-based speech emotion
crete emotion recognition from EEG signals, IEEE Trans. Affect. Comput. 9 (4) recognition by incorporating learned features and deep BiLSTM, IEEE Access
(2018) 550–562, doi:10.1109/TAFFC.2017.2660485. 8 (2020) 79861–79875, doi:10.1109/APSIPA.2016.7820699.
[207] Y.J. Liu, J.K. Zhang, W.J. Yan, S.J. Wang, G. Zhao, X. Fu, A main directional [232] B. Nakisa, M.N. Rastgoo, A. Rakotonirainy, F. Maire, V. Chandran, Long short
mean optical flow feature for spontaneous micro-expression recognition, IEEE term memory hyperparameter optimization for a neural network based emo-
Trans. Affect. Comput. 7 (4) (2016) 299–310, doi:10.1109/TAFFC.2015.2485205. tion recognition framework, IEEE Access 6 (2018) 49325–49338, doi:10.1109/
[208] Y. Liu, X. Yuan, X. Gong, Z. Xie, F. Fang, Z. Luo, Conditional convolution neu- ACCESS.2018.2868361.
ral network enhanced random forest for facial expression recognition, Pattern [233] K.J. Noh, C.Y. Jeong, S. Jiyoun, C. Seungeun, K. Gague, L. Jeong Mook, J. Hyun-
Recognit. 84 (2018) 251–261, doi:10.1016/j.patcog.2018.07.016. tae, Multi-path and group-loss-based network for speech emotion, Sensors
[209] Z.-T. Liu, Q. Xie, W. Min, W.-H. Cao, D.-Y. Li, Li, Si-Han, Electroencephalogram (2021) 1–18, doi:10.3390/s21051579.
emotion recognition based on empirical mode decomposition and optimal [234] F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, G. Anbarjafari, Audio-visual
feature selection, IEEE Trans Cognit. Deve. Syst. 11 (2019) 517–526 dECEM- emotion recognition in video clips, IEEE Trans. Affect. Comput. 10 (1) (2019)
BER http://link.springer.com/10.1007/978- 3- 030- 03243- 2_299- 1 . 60–75, doi:10.1109/TAFFC.2017.2713783.
28
[235] E.N.N. Ocquaye, Q. Mao, H. Song, G. Xu, Y. Xue, Dual exclusive attentive trans- [261] H. Shahabi, S. Moghimi, Toward automatic detection of brain responses to
fer for unsupervised deep convolutional domain adaptation in speech emo- emotional music through analysis of EEG effective connectivity, Comput.
tion recognition, IEEE Access 7 (2019) 93847–93857, doi:10.1109/ACCESS.2019. Hum. Behav. 58 (2016) 231–239, doi:10.1016/j.chb.2016.01.005.
2924597. [262] F. Shen, Y. Peng, W. Kong, G. Dai, Multi-scale frequency bands ensemble
[236] J. Oliveira, I. Praca, On the usage of pre-trained speech recognition deep lay- learning for EEG-based emotion recognition, Sensors 21 (4) (2021) 1–20,
ers to detect emotions, IEEE Access 9 (2021) 9699–9705, doi:10.1109/ACCESS. doi:10.3390/s21041262.
2021.3051083. [263] M.F.H. Siddiqui, A.Y. Javaid, A multimodal facial emotion recognition frame-
[237] B. Pan, K. Hirota, Z. Jia, L. Zhao, X. Jin, Y. Dai, Multimodal emotion recog- work through the fusion of speech with visible and infrared images, Multi-
nition based on feature selection and extreme learning machine in video modal Technol. Interact. 4 (3) (2020) 1–21, doi:10.3390/mti4030046.
clips, J. Ambient Intell. Human. Comput. (2021) 0123456789, doi:10.1007/ [264] P. Singh, R. Srivastava, K.P.S. Rana, V. Kumar, A multimodal hierarchical ap-
s12652- 021- 03407- 2. proach to speech emotion recognition from audio and text[formula pre-
[238] C. Pan, C. Shi, H. Mu, J. Li, X. Gao, EEG-based emotion recognition using lo- sented], Knowl.-Based Syst. 229 (2021) 107316, doi:10.1016/j.knosys.2021.
gistic regression with gaussian kernel and laplacian prior and investigation of 107316.
critical frequency bands, Appl. Sci. 10 (5) (2020), doi:10.3390/app10051619. [265] R. Singh, H. Puri, N. Aggarwal, V. Gupta, An efficient language-independent
[239] X. Pan, W. Guo, X. Guo, W. Li, J. Xu, J. Wu, Deep temporal-spatial aggregation acoustic emotion classification system, Arab. J. Sci. Eng. 45 (4) (2020) 3111–
for video-based facial expression recognition, Symmetry 11 (1) (2019), doi:10. 3121, doi:10.1007/s13369- 019- 04293- 9.
3390/sym11010052. [266] T. Song, W. Zheng, C. Lu, Y. Zong, X. Zhang, Z. Cui, MPED: a multi-modal phys-
[240] Y.R. Pandeya, J. Lee, Deep learning-based late fusion of multimodal informa- iological emotion database for discrete emotion recognition, IEEE Access 7
tion for emotion classification of music video, Multimedia Tools Appl. 80 (2) (2019) 12177–12191, doi:10.1109/ACCESS.2019.2891579.
(2021) 2887–2905, doi:10.1007/s11042- 020- 08836- 3. [267] A. Subasi, T. Tuncer, S. Dogan, D. Tanko, U. Sakoglu, EEG-based emotion recog-
[241] R. Pathar, A. Adivarekar, A. Mishra, A. Deshmukh, Human emotion recogni- nition using tunable Q wavelet transform and rotation forest ensemble clas-
tion using convolutional neural network in real time, in: Proceedings of 1st sifier, Biomed. Signal Process. Control 68 (2021) 102648 April, doi:10.1016/j.
International Conference on Innovations in Information and Communication bspc.2021.102648.
Technology, ICIICT 2019, 2019, doi:10.1109/ICIICT1.2019.8741491. [268] G. Subramanian, N. Cholendiran, K. Prathyusha, N. Balasubramanain, J. Ar-
[242] M.D. Pawar, R.D. Kokate, Convolution neural network based automatic speech avinth, Multimodal emotion recognition using different fusion techniques, in:
emotion recognition using mel-frequency cepstrum coefficients, Multimedia Proceedings of 2021 IEEE 7th International Conference on Bio Signals, Images
Tools Appl. (2021), doi:10.1007/s11042- 020- 10329- 2. and Instrumentation, 2021 ICBSII 2021, doi:10.1109/ICBSII51839.2021.9445146.
[243] Z. Peng, J. Dang, M. Unoki, M. Akagi, Multi-resolution modulation-filtered [269] X. Sun, P. Xia, F. Ren, Multi-attention based deep neural network with hybrid
cochleagram feature for LSTM-based dimensional emotion recognition from features for dynamic sequential facial expression recognition, Neurocomput-
speech, Neural Netw. 140 (2021) 261–273, doi:10.1016/j.neunet.2021.03.027. ing (2020) xxxx, doi:10.1016/j.neucom.2019.11.127.
[244] Z. Peng, X. Li, Z. Zhu, M. Unoki, J. Dang, M. Akagi, Speech emotion recognition [270] C. Tan, G. Ceballos, N. Kasabov, N.P. Subramaniyam, Fusionsense: emotion
using 3D convolutions and attention-based sliding recurrent networks with classification using feature fusion of multimodal data and deep learning
auditory front-ends, IEEE Access 8 (2020) 16560–16572, doi:10.1109/ACCESS. in a brain-inspired spiking neural network, Sensors 20 (18) (2020) 1–27,
2020.2967791. doi:10.3390/s20185328.
[245] A. Pise, H. Vadapalli, I. Sanders, Facial emotion recognition using temporal re- [271] Y. Tan, Z. Sun, F. Duan, J. Solé-Casals, C.F. Caiafa, A multimodal emotion
lational network: an application to E-learning, Multimedia Tools Appl. (2020), recognition method based on facial expressions and electroencephalogra-
doi:10.1007/s11042- 020- 10133- y. phy, Biomed. Signal Process. Control 70 (2021) March, doi:10.1016/j.bspc.2021.
[246] D.A. Pitaloka, A. Wulandari, T. Basaruddin, D.Y. Liliana, Enhancing CNN with 103029.
preprocessing stage in automatic emotion recognition, Proc. Comput. Sci. 116 [272] D. Tang, J. Zeng, M. Li, An end-to-end deep learning framework with
(2017) 523–529, doi:10.1016/j.procs.2017.10.038. speech emotion recognition of atypical individuals, in: Proceedings of the
[247] Pons, G., Masip, D., & Member, S. (2020). Multitask, multilabel, and multido- Annual Conference of the International Speech Communication Associa-
main learning with convolutional networks for emotion recognition. 1–8. tion, INTERSPEECH, 2018-Septe(September), 2018, pp. 162–166, doi:10.21437/
[248] E. Pranav, S. Kamal, S.C. Chandran, M.S H, Facial emotion recognition using Interspeech.2018-2581.
deep convolutional neural network, in: 6th International Conference on Ad- [273] S. Taran, V. Bajaj, Emotion recognition from single-channel EEG signals using
vanced Computing & Communication Systems (ICACCS), 2020, pp. 317–320. a two-stage correlation and instantaneous frequency-based filtering method,
[249] C. Qing, R. Qiao, X. Xu, Y. Cheng, Interpretable emotion recognition using EEG Comput. Methods Programs Biomed. 173 (2019) 157–165, doi:10.1016/j.cmpb.
signals, IEEE Access 7 (2019) 94160–94170, doi:10.1109/ACCESS.2019.2928691. 2019.03.015.
[250] A. Raheel, M. Majid, S.M. Anwar, DEAR-MULSEMEDIA: dataset for emotion [274] N. Thammasan, K. Moriyama, K. Fukui, ichi, M. Numao, Familiarity effects
analysis and recognition in response to multiple sensorial media, Inform. Fus. in EEG-based emotion recognition, Brain Inform. 4 (1) (2017) 39–50, doi:10.
65 (2021) 37–49 July 2020, doi:10.1016/j.inffus.2020.08.007. 1007/s40708- 016- 0051- 5.
[251] M.A. Rahman, A. Anjum, M.M.H. Milu, F. Khanam, M.S. Uddin, M.N. Mollah, [275] S. Thuseethan, S. Rajasegarar, J. Yearwood, Emotion intensity estimation from
Emotion recognition from EEG-based relative power spectral topography us- video frames using deep hybrid convolutional neural networks, in: Pro-
ing convolutional neural network, Array 11 (2021) 10 0 072 June, doi:10.1016/ ceedings of the International Joint Conference on Neural Networks, 2019-
j.array.2021.10 0 072. July(July), 2019, pp. 1–10, doi:10.1109/IJCNN.2019.8852365.
[252] A.L. Ramos, B.G. Dadiz, A.B.G Santos, Classifying Emotion based on Facial Ex- [276] U. Tiwari, M. Soni, R. Chakraborty, A. Panda, S.K. Kopparapu, Multi-
pression Analysis using Gabor Filter : a Basis for Adaptive Effective Teaching conditioning and data augmentation using generative noise model for speech
Strategy, Springer, Singapore, 2020, doi:10.1007/978- 981- 15- 0058- 9. emotion recognition in noisy conditions, in: ICASSP, IEEE International Con-
[253] N.I.M. Razi, M. Othman, H. Yaacob, EEG-based emotion recognition in the ference on Acoustics, Speech and Signal Processing - Proceedings, 2020-May,
investment activities, in: 6th International Conference on Information and 2020, pp. 7194–7198, doi:10.1109/ICASSP40776.2020.9053581.
Communication Technology for the Muslim World, ICT4M 2016, 2010, 2016, [277] A. Topic, M. Russo, Emotion recognition based on EEG feature maps through
pp. 325–329, doi:10.1109/ICT4M.2016.65. deep learning network, Eng. Sci. Technol. Int. J. (2021) xxxx, doi:10.1016/j.
[254] M. Ren, W. Nie, A. Liu, Y. Su, Multi-modal Correlated Network for emotion jestch.2021.03.012.
recognition in speech, Visual Inform. 3 (3) (2019) 150–155, doi:10.1016/j. [278] S. Tripathi, S. Acharya, R.D. Sharma, S. Mittal, S. Bhattacharya, Using deep and
visinf.2019.10.003. convolutional neural networks for accurate emotion classification on DEAP
[255] M. Rescigno, M. Spezialetti, S. Rossi, Personalized models for facial emo- dataset, in: Proceedings of the Twenty-Ninth AAAI Conference on Innovative
tion recognition through transfer learning, Multimedia Tools Appl. 79 (47–48) Applications (IAAI-17)), 2017, pp. 4746–4752.
(2020) 35811–35828, doi:10.1007/s11042- 020- 09405- 4. [279] H. Ullah, M. Uzair, A. Mahmood, M. Ullah, S.D. Khan, F.A. Cheikh, Internal
[256] Y. Said, M. Barr, Human emotion recognition based on facial expressions emotion classification using EEG signal with sparse discriminative ensemble,
via deep learning on high-resolution images, Multimedia Tools Appl. (2021), IEEE Access 7 (2019) 40144–40153, doi:10.1109/ACCESS.2019.2904400.
doi:10.1007/s11042- 021- 10918- 9. [280] Y. Velchev, S. Radeva, S. Sokolov, D. Radev, Automated estimation of human
[257] A. Sakalle, P. Tomar, H. Bhardwaj, D. Acharya, A. Bhardwaj, A LSTM based deep emotion from II, in: 2016 Digital Media Industry & Academic Forum (DMIAF),
learning network for recognizing emotions using wireless brainwave driven 2016, pp. 40–42.
system, Expert Syst. Appl. 173 (2021) 114516 January, doi:10.1016/j.eswa.2020. [281] A. Verma, P. Singh, J. Sahaya, R. Alex, Modified convolutional neural net-
114516. work architecture analysis for facial emotion recognition, in: 2019 Interna-
[258] M. Sarma, P. Ghahremani, D. Povey, N.K. Goel, K.K. Sarma, N. Dehak, Emotion tional Conference on Systems, Signals and Image Processing (IWSSIP), 2019,
identification from raw speech signals using DNNs, in: Proceedings of the pp. 169–173. https://ieeexplore.ieee.org/abstract/document/8787215.
Annual Conference of the International Speech Communication Association, [282] D. Verma, D. Mukhopadhyay, Age driven automatic speech emotion recogni-
INTERSPEECH, 2018-Septe, 2018, pp. 3097–3101, doi:10.21437/Interspeech. tion system, in: IEEE International Conference on Computing, Communication
2018-1353. and Automation, ICCCA 2016, 2016, pp. 1005–1010, doi:10.1109/CCAA.2016.
[259] L. Schoneveld, A. Othmani, H. Abdelkawy, Leveraging recent advances in deep 7813862.
learning for audio-visual emotion recognition, Pattern Recognit. Lett. 146 [283] G. Verma, H. Verma, Hybrid-deep learning model for emotion recognition us-
(2021) 1–7, doi:10.1016/j.patrec.2021.03.007. ing facial expressions, Rev. Socionetw. Strat. 14 (2) (2020) 171–180, doi:10.
[260] A. Sepas-moghaddam, A. Etemad, F. Pereira, P.L. Correia, Facial Emotion 1007/s12626- 020- 00061- 6.
Recognition Using Light Field Images with Deep Attention-Based Bidirectional [284] M. Wang, Z. Huang, Y. Li, L. Dong, H. Pan, Maximum weight multi-modal in-
LSTM, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, formation fusion algorithm of electroencephalographs and face images for
Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 3367–3371.
29
emotion recognition, Comput. Electr. Eng. 94 (2021) 107319 April, doi:10. [308] Z. Yin, Y. Wang, L. Liu, W. Zhang, J. Zhang, Cross-subject EEG feature selection
1016/j.compeleceng.2021.107319. for emotion recognition using transfer recursive feature elimination, Front.
[285] S. Wang, L. Hao, Q. Ji, Knowledge-augmented multimodal deep regression Neurorobot. 11 (2017) 1–16 APR, doi:10.3389/fnbot.2017.0 0 019.
bayesian networks for emotion video tagging, IEEE Trans. Multimedia 22 (4) [309] Z. Yin, J. Zhang, Cross-session classification of mental workload levels using
(2020) 1084–1097, doi:10.1109/TMM.2019.2934824. EEG and an adaptive deep learning model, Biomed. Signal Process. Control 33
[286] X.H. Wang, T. Zhang, X.M. Xu, L. Chen, X.F. Xing, C.L.P. Chen, EEG emotion (2017) 30–47, doi:10.1016/j.bspc.2016.11.013.
recognition using dynamical graph convolutional neural networks and broad [310] M.M.T. Zadeh, M. Imani, B. Majidi, Fast facial emotion recognition using con-
learning system, in: 2018 IEEE International Conference on Bioinformatics volutional neural networks and Gabor filters, in: 2019 IEEE 5th Conference on
and Biomedicine, BIBM 2018, 2018, pp. 1240–1244, doi:10.1109/BIBM.2018. Knowledge Based Engineering and Innovation, KBEI 2019, 2019, pp. 577–581,
8621147. doi:10.1109/KBEI.2019.8734943.
[287] Y. Wang, Z. Huang, B. McCane, P. Neo, EmotioNet: a 3-D convolutional neu- [311] A. Zhang, L. Su, Y. Zhang, Y. Fu, L. Wu, S. Liang, EEG data augmentation for
ral network for EEG-based emotion recognition, in: Proceedings of the Inter- emotion recognition with a multiple generator conditional Wasserstein GAN,
national Joint Conference on Neural Networks, 2018-July, 2018, doi:10.1109/ Complex Intell. Syst. (2021) 0123456789, doi:10.1007/s40747- 021- 00336- 7.
IJCNN.2018.8489715. [312] Fan Zhang, H. Meng, M. Li, Emotion extraction and recognition from mu-
[288] Z. Wang, T. Gu, Y. Zhu, D. Li, H. Yang, W. Du, FLDNet: frame level distilling sic, in: 2016 12th International Conference on Natural Computation, Fuzzy
neural network for EEG emotion recognition, IEEE J. Biomed. Health Inform. Systems and Knowledge Discovery, ICNC-FSKD 2016, 2016, pp. 1728–1733,
2194(c) (2021) 1–12, doi:10.1109/JBHI.2021.3049119. doi:10.1109/FSKD.2016.7603438.
[289] Z.-q. Wang, I. Tashev, Learning utterance-level representations for speech [313] F. Zhang, Q. Mao, X. Shen, Y. Zhan, M. Dong, Spatially coherent feature learn-
emotion and age /gender recognition using deep neural networks depart- ing for pose-invariant facial expression recognition, ACM Trans. Multimed.
ment of computer science and engineering, in: Proceedings 42nd IEEE In- Comput. Commun. Appl. 14 (1s) (2018), doi:10.1145/3176646.
ternational Conference on Acoustics, Speech, and Signal Processing, ICASSP [314] J. Zhang, M. Chen, S. Hu, Y. Cao, R. Kozma, PNN for EEG-based emotion
2017, The Ohio State University, USA Microsoft Research, One Microsoft Way, recognition, in: 2016 IEEE International Conference on Systems, Man, and Cy-
Redmond, USA, 2017, pp. 5150–5154. bernetics, SMC 2016, 2016, pp. 2319–2323, doi:10.1109/SMC.2016.7844584. -
[290] Zhongmin Wang, X. Zhou, W. Wang, C. Liang, Emotion recognition us- Conference Proceedings.
ing multimodal deep learning in multiple psychophysiological signals and [315] J. Zhang, P. Chen, S. Nichele, A. Yazidi, Emotion recognition using time-fre-
video, Int. J. Mach. Learn. Cybern. 11 (4) (2020) 923–934, doi:10.1007/ quency analysis of EEG signals and machine learning, in: IEEE Symposium
s13042- 019- 01056- 8. Series on Computational Intelligence (SSCI), 2019, pp. 404–409.
[291] G. Wen, H. Li, J. Huang, D. Li, E. Xun, Random deep belief networks for rec- [316] S. Zhang, A. Chen, W. Guo, Y. Cui, X. Zhao, L. Liu, Learning deep binau-
ognizing emotions from speech signals, Comput. Intell. Neurosci. 2017 (2017), ral representations with deep convolutional neural networks for spontaneous
doi:10.1155/2017/1945630. speech emotion recognition, IEEE Access 8 (2020) 23496–23505, doi:10.1109/
[292] H. Wen, S. You, Y. Fu, Cross-modal dynamic convolution for multi-modal ACCESS.2020.2969032.
emotion recognition, J. Vis. Commun. Image Represent. 78 (2021) 103178 [317] S. Zhang, X. Tao, Y. Chuang, X. Zhao, Learning deep multimodal affective
June, doi:10.1016/j.jvcir.2021.103178. features for spontaneous speech emotion recognition, Speech Commun. 127
[293] L. Wijayasingha, J.A. Stankovic, Robustness to noise for speech emotion clas- (2021) 73–81 May 2020, doi:10.1016/j.specom.2020.12.009.
sification using CNNs and attention mechanisms, Smart Health 19 (2021) [318] S. Zhang, S. Zhang, T. Huang, W. Gao, Speech emotion recognition using deep
100165 December 2020, doi:10.1016/j.smhl.2020.100165. convolutional neural network and discriminant temporal pyramid matching,
[294] Wilaiprasitporn, T., Ditthapron, A., Matchaparn, K., Tongbuasirilai, T., Ban- IEEE Trans. Multimedia 20 (6) (2018) 1576–1590.
luesombatkul, N., & Chuangsuwanich, E. (2018). Affective EEG-based person [319] S. Zhang, S. Zhang, T. Huang, W. Gao, Q. Tian, Learning affective features
identification using the deep learning approach. ArXiv, 12(3), 486–496. with a hybrid deep model for audio-visual emotion recognition, IEEE Trans.
[295] Williams, J., Kleinegesse, S., Comanescu, R., & Radu, O. (2018). Recog- Circ. Syst. Video Technol. 28 (10) (2018) 3030–3043, doi:10.1109/TCSVT.2017.
nizing emotions in video using multimodal DNN feature fusion. 11–19. 2719043.
10.18653/v1/w18-3302 [320] W. Zhang, P. Song, Transfer sparse discriminant subspace learning for cross-
[296] X. Xia, D. Jiang, H. Sahli, Learning salient segments for speech emotion recog- corpus speech emotion recognition, IEEE/ACM Trans. Audio Speech Lang. Pro-
nition using attentive temporal pooling, IEEE Access 8 (2020) 151740–151752, cess. 28 (2020) 307–318, doi:10.1109/TASLP.2019.2955252.
doi:10.1109/ACCESS.2020.3014733. [321] Y. Zhang, Y. Liu, F. Weninger, B. Schuller, Multi-task deep neural network
[297] Y. Xie, R. Liang, Z. Liang, C. Huang, C. Zou, B. Schuller, Speech emotion clas- with shared hidden layers: breaking down the wall between emotion rep-
sification using attention-based LSTM, IEEE/ACM Trans. Audio Speech Lang. resentations, in: ICASSP, IEEE International Conference on Acoustics, Speech
Process. 27 (11) (2019) 1675–1685, doi:10.1109/TASLP.2019.2925934. and Signal Processing, 645378, 2017, pp. 4990–4994, doi:10.1109/ICASSP.2017.
[298] B. Xing, H. Zhang, K. Zhang, L. Zhang, X. Wu, X. Shi, S. Yu, S. Zhang, Exploit- 7953106. - Proceedings.
ing EEG signals and audiovisual feature fusion for video emotion recognition, [322] Z. Zhang, C. Lai, H. Liu, Y.F. Li, Infrared facial expression recognition via
IEEE Access 7 (1) (2019) 59844–59861, doi:10.1109/ACCESS.2019.2914872. Gaussian-based label distribution learning in the dark illumination environ-
[299] X. Xing, Z. Li, T. Xu, L. Shu, B. Hu, X. Xu, SAE+LSTM: a new framework for ment for human emotion detection, Neurocomputing 409 (2020) 341–350,
emotion recognition from multi-channel EEG, Front. Neurorobot. 13 (2019) doi:10.1016/j.neucom.2020.05.081.
1–14 June, doi:10.3389/fnbot.2019.0 0 037. [323] J. Zhao, X. Mao, L. Chen, Learning deep features to recognise speech emotion
[300] G. Xu, W. Li, J. Liu, A social emotion classification approach using multi-model using merged deep CNN, IET Signal Proc. 12 (6) (2018) 713–721, doi:10.1049/
fusion, Future Gen. Comput. Syst. 102 (2020) 347–356, doi:10.1016/j.future. iet-spr.2017.0320.
2019.07.007. [324] Y. Zhao, X. Jin, X. Hu, Recurrent convolutional neural network for speech
[301] Yadav, S.P. (2021). Emotion recognition model based on facial expressions. processing, in: 2017 IEEE International Conference on Acoustics, Speech and
April 2020. Signal Processing (ICASSP), 2017, pp. 5300–5304, doi:10.1109/ICASSP.2017.
[302] M. Yanagimoto, C. Sugimoto, Recognition of persisting emotional valence 7953168.
from EEG using convolutional neural networks, in: 2016 IEEE 9th Inter- [325] Z. Zhao, K. Wang, Z. Bao, Z. Zhang, N. Cummins, S. Sun, H. Wang, J. Tao,
national Workshop on Computational Intelligence and Applications, IWCIA B.W. Schuller, Self-attention transfer networks for speech emotion recogni-
2016, 2016, pp. 27–32, doi:10.1109/IWCIA.2016.7805744. - Proceedings. tion, Virtual Real. Intell. Hardw. 3 (1) (2021) 43–54, doi:10.1016/j.vrih.2020.
[303] S. Yang, Z. Yin, Y. Wang, W. Zhang, Y. Wang, J. Zhang, Assessing cognitive 12.002.
mental workload via EEG signals and an ensemble deep learning classifier [326] W. Zheng, Multichannel EEG-based emotion recognition via group sparse
based on denoising autoencoders, Comput. Biol. Med. 109 (2019) 159–170 canonical correlation analysis, IEEE Trans. Cognit. Dev. Syst. 9 (3) (2017) 281–
January, doi:10.1016/j.compbiomed.2019.04.034. 290, doi:10.1109/TCDS.2016.2587290.
[304] K. Yano, T. Suyama, Fixed low-rank EEG spatial filter estimation for emotion [327] F. Zhou, S. Kong, C.C. Fowlkes, T. Chen, B. Lei, Fine-grained facial expression
recognition induced by movies, in: PRNI 2016 - 6th International Workshop analysis using dimensional emotion model, Neurocomputing 392 (2020) 38–
on Pattern Recognition in Neuroimaging, 2, 2016, pp. 3–6, doi:10.1109/PRNI. 49, doi:10.1016/j.neucom.2020.01.067.
2016.7552327. [328] L. Zhu, L. Chen, D. Zhao, J. Zhou, W. Zhang, Emotion recognition from chinese
[305] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, J. Vepa, Speech emo- speech for smart affective services using a combination of SVM and DBN,
tion recognition using spectrogram & phoneme embedding, in: Proceed- Sensors 17 (7) (2017), doi:10.3390/s17071694.
ings of the Annual Conference of the International Speech Communication [329] N. Zhuang, Y. Zeng, K. Yang, C. Zhang, L. Tong, B. Yan, Investigating patterns
Association, INTERSPEECH, 2018, pp. 3688–3692, doi:10.21437/Interspeech. for self-induced emotion recognition from EEG signals, Sensors 18 (3) (2018)
2018-1811. 2018-Septe(September). 1–22, doi:10.3390/s18030841.
[306] S. Yildirim, Y. Kaya, F. Kılıç, A modified feature selection method based on [330] Wang Kay Ngai, Haoran Xie, Di Zou, Kee-Lee Chou, Emotion recognition based
metaheuristic algorithms for speech emotion recognition, Appl. Acoust. 173 on convolutional neural networks and heterogeneous bio-signal data sources,
(2021) 107721, doi:10.1016/j.apacoust.2020.107721. Information Fusion 77 (2022) 107–117, doi:10.1016/j.inffus.2021.07.007.
[307] Z. Yin, L. Liu, L. Liu, J. Zhang, Y. Wang, Dynamical recursive feature elimina-
tion technique for neurophysiological signal-based emotion recognition, Cog-
nit. Technol. Work 19 (4) (2017) 667–685, doi:10.1007/s10111- 017- 0450- 2.
30

3

Uploaded by

Copyright:

Available Formats

3

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

3

Uploaded by

Copyright:

Available Formats

Computer Methods and Programs in Biomedicine 215 (2022) 106646

Contents lists available at ScienceDirect

Computer Methods and Programs in Biomedicine

Automated emotion recognition: Current trends and future

1. Introduction system signals originating in the human physiological system for

Fig. 1. 2-D model for valence – arousal.

Fig. 3. Overview of CAD system for human ER using three modalities.

Fig. 4. Schematic diagram of EEG- based automated ER approach.

Fig. 5. Distribution (yearly) of papers published on EEG- based automated ER.

Fig. 6. Schematic diagram of voice- based automated ER approach.

Machine Learning-based studies

Deep Learning-based studies

1 [39] GCNN+LSTM subject-dependent ERacc : DEAP

Machine Learning-based studies

1 [109] MFCC+CNN SVM and LSTM ERacc : 73.5% RAVDESS

Deep Learning-based studies

1 [197] Spatiotemporal and EMO-DB: IEMOCAP, EMO-DB,

Fig. 8. Schematic diagram of facial features- based automated ER approach.

Machine Learning-based studies:

1 [64] GP – FER CK+: ERacc: 98% DISFA, DISFA+, CK+ and

Deep Learning-based studies:

1 [149] GRERN Softmax ERacc: 86.73% CEAR and AFEW

Machine Learning-based studies:

Deep Learning-based studies:

1 [18] CNN+RNN+MoBEL BEL ERacc : 81.7% eNterface’05, Audio, visual

Fig. 11. Application of a multimodal system in smart health care.

Machine Learning-based studies

Sl. No Author & Methods Classiﬁers Results Accuracy (ERacc ) Database

20 [326] Group sparse canonical correlation analysis ERacc: 80.20% SEED

Sl. No Author & Methods Classiﬁers Results Accuracy (ERacc ) Database

Deep Learning-based studies

Machine Learning-based studies

Deep Learning-based studies

Sl. No Author & Methods Classiﬁers Results: Accuracy(ERacc ) Database

17 [258] DNN WA: 70.1% UA: 60.7% IEMOCAP

Machine Learning-based studies

1 [165] Spatio-Temporal Completed SMIC: ERacc: 75.31%- Spontaneous

Deep Learning-based studies:

1 [327] Bilinear - CNN ERacc: 89.98% FER-2013

Sl. No Author & Methods Classiﬁers Results Accuracy(ERacc ) Database

Author & Modalities

Machine Learning-based studies

Deep Learning-based studies

Sl. No Author & Methods Classiﬁers Results Accuracy(ERacc ) Database Modalities

Sl. No Author & Methods Classiﬁers Results Accuracy(ERacc ) Database Modalities

7 [285] Multimodal deep MSE:0.103 –valence LIRIS-ACCEDE Audio and visual

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.