Multimodal Emotion Recognition in Deep Learning: A Survey

Abstract—Emotion recognition has been an active research issue in recent decades. It can be divided into implicit emotion recognition based on physiological signals and direct emotion recognition based on non-physiological signals. Direct emotion recognition, which recognizes emotion by processing different modal information, is the emphasis of this paper. Firstly, we analyze the framework and research methods of video emotion recognition and music emotion recognition in detail, especially the research methods based on deep learning. Then, the fusion technologies used in multimodal emotion recognition combining video and audio are compared. On the basis of the above contents, this paper reviews the research status of emotion recognition based on deep learning. Finally, according to the current research situation, we put forward some suggestions for future research.

Keywords—video emotion recognition, music emotion recognition, multimodal emotion recognition

I. INTRODUCTION

Emotion recognition, as an important branch of affective computing, plays a crucial role in the development of artificial intelligence. The purpose of affective computing is to create computing systems that can recognize, perceive and understand human emotions, and give reasonable responses. Within affective computing, sentiment analysis and emotion recognition are the two major tasks. Sentiment analysis is coarse-grained affective computing, which divides emotion into two groups: positive and negative. Emotion recognition is fine-grained affective computing, which classifies emotion in a more specific way, such as happy, excited and so on.

Currently, emotion recognition can be divided into two categories: implicit emotion recognition and direct emotion recognition. The former uses human physiological responses such as the electrocardiogram and electroencephalogram for emotion recognition [1]. The latter uses different modal information such as text, image, video and audio to recognize emotion.

Video emotion recognition (VER), as the main method of understanding video content, is not limited to facial information, but focuses more on the emotional information conveyed by the video. VER is currently used in many fields such as video retrieval, video summarization and emotional video advertising. At present, VER faces the following problems: only a few frames in a video contain rich emotional information, which makes VER susceptible to noise interference; and emotion is abstract, high-level semantic information, which requires a variety of visual information to express.

Audio emotion recognition includes emotion recognition of speech and music. Due to the particularity of music, music emotion recognition (MER) has been a key research topic and has been widely applied in music information retrieval. The particularity of music is that people can resonate with music to regulate mood and relieve stress. At present, MER faces the following problems: it is difficult to build benchmark databases and to determine effective music emotion features.

Due to the limitations of single-modal emotion recognition, researchers have turned to multimodal emotion recognition to improve the recognition effect. It can achieve better results than single-modal emotion recognition by exploiting the complementarity of multiple modal information. In multimodal emotion recognition, the combination of modalities mainly includes video-audio, text-video and text-audio. In this paper, we pay more attention to multimodal emotion recognition based on the video-audio combination.

In the development of VER, MER and multimodal emotion recognition, researchers have summarized the existing technologies from different perspectives. Wang et al. [1] summarized direct and implicit VER. Baveye et al. [2] studied emotion recognition of movie-type videos. Panda et al. [3] gave a detailed overview of the music features used in MER. Drakopoulos et al. [4] studied emotion recognition from speech signals in detail. Yang et al. [5] and Joseph et al. [6] conducted relatively comprehensive research on MER. Liu et al. [7] and Poria et al. [8] analyzed multimodal emotion recognition of text, video and audio, with an emphasis on facial and speech information. The above reviews had certain limitations and only introduced emotion recognition based on traditional machine learning methods. Therefore, this paper concentrates on emotion recognition based on deep learning, and makes a comprehensive review and summary of this line of research.

The rest of this paper is organized as follows. In Section II, the emotional expression models and datasets are introduced. Section III and Section IV introduce VER and MER respectively, including the framework of emotion recognition and the development and status quo of research methods. Section V reviews multimodal emotion recognition combining video and audio, and Section VI concludes the paper.
II. EMOTIONAL EXPRESSION MODELS AND DATASETS

A. Emotional expression model

Psychologists often use two models to describe emotions: categorical emotion states (CES) and dimensional emotion space (DES). CES divides emotions into several categories, such as the six emotional categories proposed by Ekman [9]: anger, disgust, fear, happiness, sadness and surprise. Although this categorical emotion model is widely used in research, it cannot reflect the complexity of emotion well [10]. In DES, emotions are represented by coordinates, which can reflect the continuity of emotion changes. The Valence-Arousal (VA) model proposed by Thayer [11] is a commonly used DES at present. In the VA model, emotions are expressed along the two dimensions of valence and arousal. Although DES is smoother in emotional expression than CES and more in line with human emotions, it is difficult to build datasets with it, because valence and arousal values determined by subjective feelings cannot be unified.
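To make the DES representation concrete, the sketch below maps a (valence, arousal) coordinate to a coarse CES-style label. It is only an illustrative quadrant simplification written for this survey; the quadrant labels are not taken from [9] or [11].

```python
# Illustrative only: map a Valence-Arousal (VA) coordinate to a coarse
# emotion category. The four quadrant labels are a common simplification
# of the VA plane, not a mapping defined by the surveyed papers.
def va_to_category(valence: float, arousal: float) -> str:
    """valence and arousal are assumed to lie in [-1, 1]."""
    if valence >= 0 and arousal >= 0:
        return "happy/excited"      # high valence, high arousal
    if valence >= 0 and arousal < 0:
        return "calm/content"       # high valence, low arousal
    if valence < 0 and arousal >= 0:
        return "angry/fearful"      # low valence, high arousal
    return "sad/depressed"          # low valence, low arousal

if __name__ == "__main__":
    print(va_to_category(0.7, 0.6))    # -> happy/excited
    print(va_to_category(-0.4, -0.8))  # -> sad/depressed
```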
B. Datasets

Because different emotional expression models are used in the construction of datasets, datasets can be divided into two types, discrete and dimensional, corresponding to the CES and DES emotional expression models respectively.

Among the video datasets, the Video Emotion Dataset and the Ekman Dataset are two discrete benchmark datasets, both collected from social video websites. The Video Emotion Dataset [12] divides emotions into eight categories: anger, anticipation, disgust, fear, joy, sadness, surprise and trust. In the Ekman Dataset [13], emotions are divided into six categories according to Ekman's theory. The LIRIS-ACCEDE dataset is a commonly used dimensional dataset, and its video content is mainly movie clips.

Different from video datasets, music produced by professionals is copyrighted, so it is hard to build public music datasets [14]. At present, discrete music datasets include the CAL500 Dataset, the Soundtracks Dataset and so on. The Emotion in Music Database [15] is a commonly used dimensional music dataset. It contains 1000 songs, which are labeled with the VA model.

Among multimodal datasets, discrete datasets such as the Video Emotion Dataset and the Ekman Dataset are commonly used, since they include audio information by construction. A dimensional example is the AVEC dataset [16], which collected human-to-human interactions in the wild and annotated valence and arousal values every 100 milliseconds.

III. VIDEO EMOTION RECOGNITION

A. Introduction

In VER, the research methods for different types of videos are different. At present, there are mainly two kinds of videos: movie clips and user-generated videos (UGVs). Before 2014, researchers mainly concentrated on movie-type videos [1]. With the rapid development of the network, the number of user-generated videos has grown rapidly, and research attention has shifted to UGVs, which have the following characteristics:

• Sparse emotional content. Unlike movie clips, UGVs usually have only a few video frames that express emotion directly [17], [18]. Therefore, it is crucial to pick out the video clips that contribute the most to the overall emotion of the video. Xu et al. [13] referred to this task as emotion attribution and to those video clips as emotional frames.

• Rich content. The diversity of video content increases the difficulty of emotion recognition [17]. A single feature is not enough to represent comprehensive visual information.

• Low quality. UGVs are mostly shot with non-professional equipment. In addition, users generally lack shooting skills, so their footage is usually of low quality [19], [20]. As a result, video clips usually contain a lot of noise and lack aesthetic quality.

This section mainly reviews and summarizes the research on VER of UGVs. The framework of VER mainly includes three parts: emotional frames calculation, feature extraction and the recognition network. Emotional frames calculation obtains the video frames that contribute most to the emotion, feature extraction obtains emotional features from the video frames, and the recognition network maps the video features to the emotion space. There are two research methods for VER: traditional machine learning and deep learning.

B. Traditional machine learning method

VER based on traditional machine learning often uses key frames extraction to calculate the emotional frames of a video. Key frames extraction uses different methods to extract some frames and discard redundant frames, thereby accomplishing emotional frames calculation. Wei et al. [17] summarized the key frames extraction methods. Key frames extraction only retains some video frames, which has an advantage in running speed but loses some video information.

Among traditional machine learning methods, movie features and image features are often used. Movie features [1] include rhythm, lighting, color and others. Image features include Histograms of Oriented Gradients (HOG), the Scale-Invariant Feature Transform (SIFT) and so on. These two kinds of features were often used in early emotion recognition research on movie clips, so they can be collectively referred to as early features. Early features are usually extracted manually; they do not contain semantic information, have limited representation ability, and cannot effectively express emotion. Support Vector Machines (SVM), Hidden Markov Models (HMM) and so on are mostly used as the traditional recognition networks.
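As a rough illustration of key frames extraction (a hypothetical sketch written for this survey, not the pipeline of any cited work), the snippet below describes each frame by a coarse color histogram, clusters the histograms with k-means, and keeps the frame nearest to each cluster centre as a key frame.

```python
# Hypothetical sketch of clustering-based key-frame selection.
# `frames` is assumed to be a uint8 array of shape (T, H, W, 3).
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(frames, n_key=8):
    # Describe every frame by a coarse RGB histogram (8 bins per channel).
    feats = []
    for f in frames:
        hist = [np.histogram(f[..., c], bins=8, range=(0, 255))[0] for c in range(3)]
        h = np.concatenate(hist).astype(np.float32)
        feats.append(h / (h.sum() + 1e-8))
    feats = np.stack(feats)

    # Cluster the histograms and keep the frame nearest to each centroid.
    km = KMeans(n_clusters=n_key, n_init=10, random_state=0).fit(feats)
    key_idx = []
    for c in km.cluster_centers_:
        key_idx.append(int(np.argmin(np.linalg.norm(feats - c, axis=1))))
    return sorted(set(key_idx))
```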
C. Deep learning method

In the deep learning method, emotional frames are obtained by calculating the weight of each video frame. The core idea is to use the attention mechanism to calculate the weight of each video frame [18], [21], [22], [23]; the video frames with larger weights also contribute more to the video emotion. Qiu et al. [21] used the attention mechanism to propose a time-series focus network, which focuses on the time dimension to find the segments that best represent emotions. Compared with key frames extraction, the video frames weight method not only retains the global information of the video as a time series, but also realizes the dynamic change of the weights through the attention mechanism.
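A minimal sketch of this frame-weighting idea is shown below; it is our own simplified illustration under assumed feature dimensions, not the architecture of [21] or [23]. Each frame feature receives a scalar attention weight, and the weighted sum serves as the video-level emotional representation.

```python
# Minimal sketch of attention-based frame weighting (illustrative only).
import torch
import torch.nn as nn

class FrameAttentionPool(nn.Module):
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        # Small scoring MLP: one scalar score per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, T, feat_dim), e.g. CNN features of T frames.
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (batch, T, 1)
        # Frames with larger weights contribute more to the video emotion.
        return (weights * frame_feats).sum(dim=1)                # (batch, feat_dim)

video_feat = FrameAttentionPool()(torch.randn(2, 30, 512))  # -> (2, 512)
```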
In feature extraction, due to the limitations of early features, researchers use deep learning networks to extract video features, which include attribute features and content features.

In 2014, Jiang et al. [12] used attribute features for video emotion recognition. Attribute features include three parts: Classemes, ObjectBank and SentiBank. Among them, SentiBank contains 1200 adjective-noun pairs; the adjectives are strongly related to emotions, and the nouns correspond to objects and scenes that are expected to be automatically detected. Reference [24] utilized deep learning to extract SentiBank features and named the result DeepSentiBank, which is widely used [17], [25].

In 2016, Chen et al. [26] used three high-level semantic features for video emotion recognition: event, object and scene features. Xu et al. [27] put forward the concept of content features based on reference [26], including action, scene and object features. Action features: human behavior is highly correlated with emotions. Scene features: emotion is also related to the background in the video. Object features: objects in videos usually contain important action recognition clues.

Attribute features and content features are extracted by deep learning networks, so they are also called deep features. These deep features contain rich semantic information and contextual content, and can effectively express video emotion.

In VER using deep learning, the recognition network is divided into single networks and combined networks. A single network uses a Convolutional Neural Network (CNN) to complete both video feature extraction and the mapping of features to the emotional space [25], [27]. At present, the effect of emotion recognition using a CNN alone is not significant. The main reason is that a CNN is not suitable for processing sequence information and ignores the spatial and temporal information of video. Improved CNNs such as 3DCNN can extract temporal and spatial information, but rely too much on the sequence of video frames [18], so the effect of video emotion recognition has not been significantly improved. A combined network first uses a CNN to extract video features, and then uses a Recurrent Neural Network (RNN) or traditional machine learning to map the features to the emotion space, for instance the CNN-RNN combined networks [18], [21], [22] and the CNN-SVM combined networks [23], [28].
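The sketch below outlines the general CNN-RNN pattern, assuming a ResNet-18 backbone and a single-layer LSTM of our own choosing; it is a simplified illustration, not the exact model of [18], [21] or [22]. A CNN encodes each frame, an LSTM models the temporal sequence, and a linear layer maps the result to emotion categories.

```python
# Illustrative CNN-LSTM combined network for video emotion recognition.
# Assumes torchvision is available; the backbone choice is ours, not [21]'s.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CnnLstmVER(nn.Module):
    def __init__(self, num_classes=6, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d frame embedding
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):
        # clips: (batch, T, 3, 224, 224) video frames.
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # (b, T, 512)
        _, (h, _) = self.lstm(feats)                          # h: (1, b, hidden)
        return self.head(h[-1])                               # (b, num_classes)

logits = CnnLstmVER()(torch.randn(2, 8, 3, 224, 224))  # -> (2, 6)
```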
D. Analysis of research results

In VER, we first review the research methods of some classic works, and then sort out the research results in recent years (2014-2021), as shown in Table I. Finally, we summarize and analyze the current research.

TABLE I. ACCURACY ON DISCRETE DATASETS

Reference   Emotional frames      Features            Net          Video Emotion Dataset   Ekman Dataset
[26]        /                     content features    CNN          50.60%                  51.80%
[19]        /                     attribute features  CNN-SVM      52.50%                  54.40%
[27]        /                     content features    CNN          51.48%                  55.62%
[23]        video frames weight   deep features       3DCNN        53.60%                  54.50%
[21]        video frames weight   content features    CNN-LSTM     53.34%                  57.37%
[17]        key frames            deep features       CNN-RF-SVM   52.85%                  59.51%

Zhang et al. [19] used ResNet to extract attribute features and used SVM as the recognition network. Although there is no emotional frames calculation, the sparse representation method is used to denoise the videos, and good results were achieved. Zhao et al. [23] utilized auxiliary image datasets and 3DCNN for emotion recognition. Qiu et al. [21] used video frame weights to obtain emotional frames, used a CNN to extract content features, and used Long Short-Term Memory (LSTM) for emotion recognition. In order to further improve the recognition performance, Wei et al. [17] utilized three network models (CNN, SVM and Random Forest (RF)) and adopted a hybrid fusion mechanism of score fusion and Top-K decision fusion, obtaining the best result so far: an accuracy of 59.51% on the Ekman Dataset.

Due to the sparse emotional frames of UGVs, the use of emotional frames calculation is beneficial for further processing. Among the calculation methods, the key frames method based on clustering often requires large auxiliary image datasets, while the video frames weight method based on the attention mechanism is more complex but achieves a better processing effect. The deep features, including attribute and content features, contain rich semantic information and can effectively express emotion. Overall, a better effect can be achieved by using a CNN-RNN combined network as the emotion recognition network.

IV. MUSIC EMOTION RECOGNITION

A. Introduction

Compared with speech signals, music signals contain more audio features, so our research focuses on music emotion recognition (MER).

Unlike user-generated videos, music has a certain rhythm, melody and expressiveness and requires special performance skills, so music production is more professional. In addition, the emotion of a user-generated video is single, while music contains rich emotion, and it is difficult to unify the emotion expressed by different segments of a piece of music. Therefore, researchers also pay attention to the dynamic changes of music emotions: they perform emotion recognition or prediction on short-term segments of each piece of music. This research is called music emotion variation detection (MEVD). Since the research method of MEVD [29] is also based on MER research, this section reviews and summarizes MER research.
The framework of MER mainly includes two parts: feature extraction and the recognition network. Feature extraction obtains emotional features from the music, and the recognition network maps the music features to the emotional space. There are two research methods for MER: traditional machine learning and deep learning.

B. Traditional machine learning method

In traditional machine learning methods, dimensional features are often used; they are divided according to the different dimensions of music. Reference [3] divided music features according to eight dimensions of music, namely: Melody, Harmony, Rhythm, Dynamics, Tone Color, Expressivity, Texture and Form. At present, tone color features are often used in research, including the zero crossing rate [30], spectrograms, spectral flux [31], etc. The spectrograms include the Mel-spectrogram (MFCC) [32], [33], [34], [35] and the Cochlear-spectrogram [29], [32]. The MFCC carries the time-frequency domain information of music and is based on human speech and hearing; it has strong recognition ability and strong noise resistance [33]. The Cochlear-spectrogram contains rich texture information. In traditional machine learning, classical frameworks or toolkits such as Marsyas, the MIR Toolbox and PsySound are often applied to extract features. However, the manual design of musical features requires professional knowledge of acoustics and musicology, and the process of manually designing and selecting features is cumbersome. Traditional recognition networks mostly use SVM, RF, etc.
SVM, RF, etc. research.
C. Deep learning method Er et al. [40] took chromaticity spectrogram as input, used
CNN to extract deep features, and used data enhancement
In deep learning methods, in addition to dimensional technology to increase the amount of data. The accuracy rate on
features, the commonly used features include two features: Soundtracks Dataset reached 75.5%. Du et al. [32] took Mel-
mid-level perceptual features and deep features. spectrogram and cochlear-spectrogram as inputs, extracted
In 2018, Aljanaki et al. [36] proposed and verified the features through CNN respectively, and finally used LSTM for
effectiveness of seven mid-level perceptual features, namely: emotion recognition. References [38] and [39] utilized the
Melodiousness, Tonal stability, Rhythmic stability, Modality, original audio as the input of network, and used CNN-RNN
Rhythmic complexity, Dissonance and Articulation. These combined network to carry out emotion recognition.
mid-level perceptual features increase the interpretability of At present, the latest research inputs the original audio
emotion recognition research. waveform as a feature into CNN-RNN combined network,
Although the mid-level perceptual features are highly which avoids the tedious process of obtaining spectrograms or
explanatory, the spectrogram is still used as the input [37]. features, and proves the effectiveness of this research method.
Spectrogram retain the time domain and frequency domain However, the effect of this method is not significant at present,
information of music signal, but ignore the amplitude and phase which needs further research.
information.
V. MULTIMODAL EMOTION RECOGNITION
References [34], [38] and [39] used the original audio
A. Introduction
waveform as input and extracted deep features. Taking the
original audio waveform as the input of the network can Multimodal learning can realize the transformation between
effectively retain the amplitude and phase information of the different modal information by jointly representing different
music signal [34], while using deep learning can automatically modal data and capturing the internal correlation between
extract features, avoiding the tedious process of manual design different modes [42]. The essential problem of multimodal
and feature selection. On different music datasets, these studies learning is to associate different modes, so it is particularly
achieved good results, which proved the effectiveness of this important to build a shared representation space. There are two
method.
Multimodal fusion combines different modal information to add information, while multimodal alignment is responsible for finding the correspondence between different modal information. In the research of emotion recognition, multimodal fusion technology is used to fuse video information and audio information in different ways so as to improve the recognition accuracy.

B. Multimodal emotion recognition

At present, the fusion methods in emotion recognition mainly include feature-level fusion, decision-level fusion and model-level fusion.

Feature-level fusion: the features extracted from different modes are directly concatenated. Feature-level fusion is simple and easy to implement, but it is difficult for it to learn the relationship between video and audio. Yi et al. [22] improved feature-level fusion by learning the relationship between different modes: adaptive methods are used to automatically adjust the weights of the feature vectors of the different modes.

Decision-level fusion: the final recognition result is obtained by combining the recognition results of different modes. Decision-level fusion is flexible and can be realized by various decision-making methods, such as the weighted sum and the average. However, decision-level fusion recognizes emotions as if the modes were independent of each other, and therefore cannot reflect the complementarity between modes well.
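The toy example below contrasts feature-level and decision-level fusion under simple assumptions; the random vectors stand in for real video and audio branches, and the fusion weights are arbitrary illustrative values.

```python
# Toy contrast of feature-level vs. decision-level fusion (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
video_feat, audio_feat = rng.normal(size=512), rng.normal(size=128)
video_scores = rng.dirichlet(np.ones(6))   # per-class probabilities from a video model
audio_scores = rng.dirichlet(np.ones(6))   # per-class probabilities from an audio model

# Feature-level fusion: concatenate modality features before a shared classifier.
fused_feature = np.concatenate([video_feat, audio_feat])       # (640,)

# Decision-level fusion: combine per-modality decisions, e.g. a weighted sum.
w_video, w_audio = 0.6, 0.4
fused_scores = w_video * video_scores + w_audio * audio_scores
print(fused_feature.shape, int(np.argmax(fused_scores)))
```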
Model-level fusion: the recognition results are obtained through a model that can combine multimodal information. Model-level fusion can better learn the multimodal interactions within the model and build a shared representation space. The realization of model-level fusion mainly depends on the model used, such as the Deep Boltzmann Machine (DBM) and the Transformer [43].
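As one possible sketch of model-level fusion (our own simplified illustration using cross-modal attention, not a reimplementation of any cited model), the module below lets audio features attend to video features, so that the shared representation is learned inside the model rather than by concatenation or score averaging.

```python
# Simplified model-level fusion via cross-modal attention (illustrative only).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, audio_seq, video_seq):
        # audio_seq: (batch, Ta, dim), video_seq: (batch, Tv, dim).
        # Audio queries attend to video keys/values, building a shared representation.
        fused, _ = self.attn(query=audio_seq, key=video_seq, value=video_seq)
        fused = self.norm(fused + audio_seq)          # residual connection
        return self.head(fused.mean(dim=1))           # (batch, num_classes)

out = CrossModalFusion()(torch.randn(2, 20, 256), torch.randn(2, 30, 256))
```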
C. Analysis of research results

In multimodal emotion recognition, we first review the research methods of some classic works, and then sort out the research results in recent years (2014-2021), as shown in Table III. Finally, we summarize and analyze the current research.

TABLE III. ACCURACY ON DISCRETE DATASETS

Reference   Fusion Methods         Video Emotion Dataset   Ekman Dataset
[12]        feature-level fusion   46.10%                  /
[45]        model-level fusion     49.90%                  /
[46]        model-level fusion     51.10%                  /
[47]        feature-level fusion   46.70%                  50.40%
[23]        feature-level fusion   54.50%                  55.30%

Pandeya et al. [34] adopted a multi-channel system: they used 3DCNN and I3D networks to extract video features, utilized CNNs to extract MFCC features and original audio waveform features, and adopted decision-level fusion for modal fusion; the recognition accuracy reached 88.568% on a self-constructed dataset. Huang et al. [44] completed model-level fusion through a Transformer and used LSTM to focus on temporal context information, conducting experiments on the AVEC dataset. The concordance correlation coefficients (CCC) of the Arousal and Valence dimensions were 0.654 and 0.708, respectively.
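For completeness, the CCC used in such dimensional evaluations can be computed as follows; this is the standard population formulation, written here only for reference.

```python
# Concordance correlation coefficient (CCC) between predictions and labels.
import numpy as np

def ccc(pred, gold):
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covar = np.mean((pred - pred_mean) * (gold - gold_mean))
    return float(2 * covar /
                 (pred.var() + gold.var() + (pred_mean - gold_mean) ** 2))

# Example: CCC of perfect agreement is 1.0.
x = np.linspace(-1.0, 1.0, 100)
print(round(ccc(x, x), 3))  # -> 1.0
```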
VI. CONCLUSION

In this paper, the research on emotion recognition is summarized, with emphasis on work based on deep learning methods. Firstly, we list the two research directions of emotion recognition, direct emotion recognition and implicit emotion recognition, with emphasis on direct emotion recognition from modal information. Secondly, we illustrate the significance of emotion recognition and review the main emotion expression models and related datasets. Finally, we concentrate on the research methods of VER, MER and multimodal emotion recognition, and analyze the current situation of the research.

In addition, we give some suggestions on emotion recognition research. In VER, video frame weighting based on the attention mechanism can effectively calculate emotional frames, and content features that contain rich semantic information can express emotions well. In MER, a CNN is usually used to extract tone color features, for example with a spectrogram as the input for feature extraction; but we think that deep features based directly on the audio waveform have greater prospects, since using the audio waveform as the feature is not only easy to operate but also retains more music information. As video and music are sequential information, the CNN-RNN combined network can effectively identify emotions, and improvements to this network are also critical. In the research of multimodal emotion recognition, fusion technology is relatively critical. Among the fusion methods, feature-level fusion is easy to implement, so it is still widely used; decision-level fusion technology is relatively mature, but more decision-making strategies still need to be developed; and the models used in model-level fusion need further verification of their effectiveness, so it is particularly important to explore other models.

REFERENCES

[1] S. Wang and Q. Ji, "Video Affective Content Analysis: A Survey of State-of-the-Art Methods," in IEEE Transactions on Affective Computing, vol. 6, no. 4, pp. 410-430, Oct.-Dec. 2015.
[2] Y. Baveye, C. Chamaret, E. Dellandréa and L. Chen, "Affective Video Content Analysis: A Multidisciplinary Insight," in IEEE Transactions on Affective Computing, vol. 9, no. 4, pp. 396-409, Oct.-Dec. 2018.
[3] R. Panda, R. M. Malheiro and R. P. Paiva, "Audio Features for Music Emotion Recognition: A Survey," in IEEE Transactions on Affective Computing, 2020.
[4] G. Drakopoulos, et al. "Emotion Recognition from Speech: A Survey," WEBIST. 2019.
[5] X. Yang, Y. Dong and J. Li. "Review of data features-based music emotion recognition methods," Multimedia Systems, vol. 24, no. 4, pp. 365-389. 2018.
[6] C. Joseph and S. Lekamge, "Machine Learning Approaches for Emotion Classification of Music: A Systematic Literature Review," 2019 International Conference on Advancements in Computing (ICAC), pp. 334-339. 2019.
[7] J. Liu, P. Zhang, Y. Liu, W. Zhang and J. Fang. "Summary of Multi-modal Sentiment Analysis Technology," Journal of Frontiers of Computer Science and Technology. 2021. (Chinese)
[8] S. Poria, et al. "A review of affective computing: From unimodal analysis to multimodal fusion," Information Fusion, vol. 37, pp. 98-125. 2017.
[9] P. Ekman, W. V. Friesen and P. Ellsworth. Emotion in the Human Face: Guidelines for Research and an Integration of Findings, vol. 11. Elsevier. 2013.
[10] S. Huang, L. Zhou, Z. Liu, S. Ni and J. He, "Empirical Research on a Fuzzy Model of Music Emotion Classification Based on Pleasure-Arousal Model," 2018 37th Chinese Control Conference (CCC), pp. 3239-3244. 2018.
[11] R. E. Thayer, The Biopsychology of Mood and Arousal. Oxford University Press. 1990.
[12] Y.-G. Jiang, B. Xu and X. Xue. "Predicting emotions in user-generated videos," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28, no. 1. 2014.
[13] B. Xu, Y. Fu, Y. Jiang, B. Li and L. Sigal, "Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization," in IEEE Transactions on Affective Computing, vol. 9, no. 2, pp. 255-270, April-June 2018.
[14] M. B. Er and I. B. Aydilek. "Music emotion recognition by using chroma spectrogram and deep visual features," International Journal of Computational Intelligence Systems, vol. 12, no. 2, pp. 1622-1634. 2019.
[15] M. Soleymani, et al. "1000 songs for emotional analysis of music," Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia. 2013.
[16] F. Ringeval, et al. "AVEC 2017: Real-life depression, and affect recognition workshop and challenge," Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. 2017.
[17] J. Wei, X. Yang and Y. Dong. "User-generated video emotion recognition based on key frames," Multimedia Tools and Applications, vol. 80, pp. 14343-14361. 2021.
[18] C. Li, Y. Shi and X. Yi. "Video emotion recognition based on Convolutional Neural Networks," Journal of Physics: Conference Series, vol. 1738, no. 1, IOP Publishing, 2021.
[19] H. Zhang and M. Xu, "Recognition of Emotions in User-Generated Videos With Kernelized Features," in IEEE Transactions on Multimedia, vol. 20, no. 10, pp. 2824-2835, Oct. 2018.
[20] X. Gu, et al. "Sentiment key frame extraction in user-generated micro-videos via low-rank and sparse representation," Neurocomputing, vol. 410, pp. 441-453. 2020.
[21] H. Qiu, L. He and F. Wang, "Dual Focus Attention Network For Video Emotion Recognition," 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6. 2020.
[22] Y. Yi, H. Wang and Q. Li, "Affective Video Content Analysis With Adaptive Fusion Recurrent Network," in IEEE Transactions on Multimedia, vol. 22, no. 9, pp. 2454-2466, Sept. 2020.
[23] S. Zhao, et al. "An End-to-End visual-audio attention network for emotion recognition in user-generated videos," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020.
[24] T. Chen, et al. "DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks," arXiv preprint arXiv:1410.8586. 2014.
[25] J. Gao, et al. "Frame-transformer emotion classification network," Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. 2017.
[26] C. Chen, Z. Wu and Y.-G. Jiang. "Emotion in context: Deep semantic feature fusion for video emotion recognition," Proceedings of the 24th ACM International Conference on Multimedia. 2016.
[27] B. Xu, Y. Zheng, H. Ye, C. Wu, H. Wang and G. Sun, "Video Emotion Recognition with Concept Selection," 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, pp. 406-411. 2019.
[28] R. Panda, R. Malheiro and R. P. Paiva, "Novel Audio Features for Music Emotion Recognition," in IEEE Transactions on Affective Computing, vol. 11, no. 4, pp. 614-626, Oct.-Dec. 2020.
[29] M. Russo, et al. "Cochleogram-based approach for detecting perceived emotions in music," Information Processing & Management, vol. 57, no. 5, p. 102270. 2020.
[30] F. Zhang, H. Meng and M. Li, "Emotion extraction and recognition from music," 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pp. 1728-1733. 2016.
[31] Y. Liu, "Neural Network Technology in Music Emotion Recognition," International Journal of Frontiers in Sociology, vol. 3, no. 1. 2021.
[32] P. Du, X. Li and Y. Gao, "Dynamic Music emotion recognition based on CNN-BiLSTM," 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), pp. 1372-1376. 2020.
[33] F. H. Rachman, R. Sarno and C. Fatichah, "Music Emotion Detection using Weighted of Audio and Lyric Features," 2020 6th Information Technology International Seminar (ITIS), pp. 229-233. 2020.
[34] Y. R. Pandeya and J. Lee. "Deep learning-based late fusion of multimodal information for emotion classification of music video," Multimedia Tools and Applications, vol. 80, no. 2, pp. 2887-2905. 2021.
[35] R. Sarkar, et al. "Recognition of emotion in music based on deep convolutional neural network," Multimedia Tools and Applications, vol. 79, no. 1, pp. 765-783. 2020.
[36] A. Aljanaki and M. Soleymani. "A data-driven approach to mid-level perceptual musical feature modeling," arXiv preprint arXiv:1806.04903. 2018.
[37] S. Chowdhury, et al. "Towards explainable music emotion recognition: The route via mid-level features," arXiv preprint arXiv:1907.03572. 2019.
[38] N. He and S. Ferguson, "Multi-view Neural Networks for Raw Audio-based Music Emotion Recognition," 2020 IEEE International Symposium on Multimedia (ISM), pp. 168-172. 2020.
[39] R. Orjesek, R. Jarina, M. Chmulik and M. Kuba, "DNN Based Music Emotion Recognition from Raw Audio Signal," 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA), pp. 1-4. 2019.
[40] M. B. Er and I. B. Aydilek. "Music emotion recognition by using chroma spectrogram and deep visual features," International Journal of Computational Intelligence Systems, vol. 12, no. 2, pp. 1622-1634. 2019.
[41] M. Malik, et al. "Stacked convolutional and recurrent neural networks for music emotion recognition," arXiv preprint arXiv:1706.02292. 2017.
[42] Y. Sun, Z. Jia and H. Zhu. "Survey of multimodal deep learning," Computer Engineering and Applications, vol. 56, no. 21, pp. 1-10. 2020. (Chinese)
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., "Attention is all you need," Neural Information Processing Systems (NIPS), 2017.
[44] J. Huang, J. Tao, B. Liu, Z. Lian and M. Niu, "Multimodal Transformer Fusion for Continuous Emotion Recognition," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3507-3511. 2020.
[45] L. Pang and C. Ngo. "Multimodal learning with deep Boltzmann machine for emotion prediction in user generated videos," Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. 2015.
[46] L. Pang, S. Zhu and C. Ngo, "Deep Multimodal Learning for Affective Analysis and Retrieval," in IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2008-2020, Nov. 2015.
[47] B. Xu, et al. "Video emotion recognition with transferred deep feature encodings," Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. 2016.