A Study for the Effectiveness of the Deep Feature of EOG in Emotion Recognition
Minchao Wu, Ping Li, Zhao Lv, Cunhang Fan, Shengbing Pei, Xiangping Gao, Fan Li, Wen Liang
• VMD and a deep feature extractor are employed to decode emotional EOG signals.
• vEOG generally outperforms hEOG, especially for classifying negative emotions.
A Study for the Effectiveness of the Deep Feature of EOG in Emotion Recognition
Minchao Wu^{a,∗}, Ping Li^{a,∗}, Zhao Lv^{a,b,∗∗}, Cunhang Fan^{a}, Shengbing Pei^{a}, Xiangping Gao^{a}, Fan Li^{b} and Wen Liang^{c}
^{a} Anhui Province Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, 230601, China
^{b} Civil Aviation Flight University of China, Guanghan, 618307, China
^{c} Google Research, California, 94043, United States
Keywords: Emotion recognition; Human-computer interface (HCI); Electro-oculography (EOG); Variational mode decomposition (VMD); Deep feature extraction

[...] human-computer interface (HCI) system, which makes the HCI more intelligent. To date, emotion recognition has mainly been achieved by decoding electroencephalogram (EEG), speech, and other signals, and hardly ever by electro-oculography (EOG).

New method: In this study, we propose a deep learning model, VCLSTM, which combines variational mode decomposition (VMD), a convolutional neural network (CNN), and long short-term memory (LSTM) to perform EOG-based emotion recognition. More concretely, we employ VMD to calculate the intrinsic mode functions (IMFs) of the EOG signals and to emphasize their different frequency information. We then sequentially utilize the CNN and the LSTM to extract deep temporal features and frequency-sequential features from the IMFs, respectively.

Results: Two publicly available datasets, DEAP and OVPD-II, were employed to evaluate the performance of VCLSTM. For DEAP, we achieved a mean accuracy of 92.09% for four classes (high/low valence and arousal); for OVPD-II, which consists of a traditional video pattern and an olfactory-enhanced pattern, we achieved mean accuracies of 93.74% and 92.09% for the two patterns, respectively, in classifying positive, neutral, and negative emotions.

Conclusion: We found that the vertical EOG generally outperformed the horizontal EOG. The experimental results also indicated that olfaction can affect the ability of vision to perceive emotional information, especially for negative emotions.
[...] such as brain-computer interface (BCI) and rehabilitation medicine (Wiem and Lachiri, 2017). Indeed, emotion interaction is one of the essentials of an HCI system. That is to say, the HCI can recognize humans' emotions and make more reliable decisions, and thereby enhance the interaction quality. [...]

[...] corresponding to EEG-based emotion recognition accounts for a large part of the proportion. The reasons for this phenomenon may be: 1) the generation of emotions is closely related to brain mechanisms, and EEG is to some extent the medium for studying brain mechanisms; 2) similar [...]
[...] compared with the EEG-related equipment. For example, the EOG electrodes can be embedded in virtual glasses, which can be used more conveniently in practical applications.

In the present work, we adopted the EOG signals to perform the emotion recognition task on two public emotional datasets, i.e., the DEAP dataset and the OVPD-II dataset. The main contributions of this work can be summarized as follows: (1) Based on the proposed model, we validate the effectiveness of the deep features extracted from EOG signals for the emotion recognition task; (2) According to the experimental results on the two public emotional datasets, we find that the vertical EOG (vEOG) generally outperforms the horizontal EOG (hEOG), especially for recognizing negative emotions; (3) For the OVPD-II dataset, which is based on multi-sensory stimulation, we found that the involvement of olfaction could help to better identify negative emotions, which may indicate that the senses can influence each other's ability to perceive external emotions.

The layout of this paper is organized into the following sections: In Section II, a brief introduction of the related works is presented. The framework and corresponding detailed information of the proposed model are described in Section III. Experimental results, analysis, and discussion for the two public datasets are presented in Section IV. Finally, Section VI presents the conclusions of this work.

2. Related Works

2.1. Affective Models
In the research of affective computing, a great deal of studies have been conducted on affective models. Generally, affective models can be roughly categorized into discrete models and dimensional models. [...]

[...] emotions model (Ekman, 1992) and Panksepp's four-basic emotions model (Panksepp, 1982). Another part of basic emotion theorists believe that basic emotions are psychologically irreducible; they argue that the basic emotions are those with basic inducing conditions and that the basic emotions have no other emotions as components. [...]

[...] certain dimensional space composed of several dimensions, and different emotions are distributed at different positions in the dimensional space according to their properties in the different dimensions. To date, the valence-arousal model (Russell, 1980) and Plutchik's Wheel of Emotions (Plutchik, 2001) are the relatively frequently used two-dimensional and three-dimensional emotion models, respectively.

2.2. EOG for Emotion Recognition
Research has indicated that since about 70% of the sensory information acquired by humans comes from vision (Welch, DutionHurt and Warren, 1986), the EOG signals that can effectively record ocular activity are capable of characterizing the emotional states of humans. Thus, some studies have indicated that EOG signals are useful for performing affective computing.

Paul et al. (Paul, Banerjee and Tibarewala, 2017) collected EOG signals under audio-visual stimuli, extracted the Hjorth and discrete wavelet transform (DWT) features, and achieved accuracies of 81% and 79.85% for positive, neutral, and negative emotions using the horizontal and vertical EOG, respectively. Cai et al. (Cai, Liu, Jiang, Ni, Zhou and Cangelosi, 2021) extracted the power spectral density (PSD), approximate coefficients of the wavelet transform, and statistical features from EOG to recognize emotions in the valence and arousal dimensions. Wang et al. (Wang, Lv and Zheng, 2018) extracted time-frequency features and eye movement features such as saccade duration to achieve the EOG-based emotion recognition task. Both Niu et al. (Niu, Todd, Kyan and Anderson, 2012) and Hamed et al. (R.-Tavakoli, Atyabi, Rantanen, Laukka, Nefti-Meziani and Heikkilä, 2015) indicated the feasibility of employing eye movement information for valence perception by studying subjects in positive, neutral, and negative emotions. Lu et al. (Lu, Zheng, Li and Lu, 2015) investigated sixteen eye movements and identified the intrinsic patterns of these eye movements corresponding to three emotional states, i.e., [...]

[...] 2023) systematically investigated the effect of the length of the time window on the EOG-based emotion recognition task, and achieved best accuracies of 77.21% and 78.28% with a K-nearest neighbor (KNN) classifier for the arousal and valence dimensions, respectively.

All of these previous works indicated the effectiveness of EOG signals for emotion recognition. However, almost [...] the CNN and LSTM to extract the deep temporal features and achieve the emotion recognition task for EOG.
[...] information about the signal, where the VMD method is a common technique to extract the IMFs.

Recently, VMD (Dragomiretskiy and Zosso, 2013) has been widely used in emotion recognition. Pandey et al. (Pandey and Seeja, 2022) extracted the PSD and the first difference of the IMFs calculated by VMD, and fed these features into an SVM and a deep neural network (DNN) for EEG-based emotion recognition. Taran and Bajaj (Taran and Bajaj, 2019) combined empirical mode decomposition (EMD) and VMD to decompose EEG signals, and employed an SVM to classify four basic emotions, where the best accuracy reached 90.63%. Liu et al. (Liu, Hu, She, Yang and Xu, 2023) combined differential entropy (DE) and short-time energy (STE) features extracted from the IMFs after VMD, and presented an XGBoost classifier to classify emotions. Zhang et al. (Zhang, Yeh and Shi, 2023) developed a variational phase-amplitude coupling method to quantify the rhythmic nesting structure of emotional EEG signals. Khare and Bajaj (Khare and Bajaj, 2020) introduced an optimized variational mode decomposition (OVMD) to decide the optimal parameters of VMD for emotional EEG. Mishra et al. (Mishra, Warule and Deb, 2023) fused the MFCC, mel-spectrogram, approximate entropy (ApEn), and permutation entropy (PrEn) extracted from each VMD mode, and achieved classification accuracies of 91.59% and 80.83% on two public datasets, respectively. Rudd et al. (Hason Rudd, Huo and Xu, 2023) proposed the VGG-optiVMD algorithm for speech emotion recognition, achieving a state-of-the-art accuracy of 96.09% for seven discrete emotions. Li et al. (Li, Zhang, Li and Huang, 2023) proposed the VMD-Teager-Mel (VTMel) spectrogram to emphasize the high-frequency components of speech signals, and developed a CNN-DBM network to recognize emotions.

In summary, VMD is an effective technique to help recognize emotions. However, the existing works mainly concentrated on EEG-based and speech-based emotion recognition, and hardly any work has reported applying VMD to EOG-based emotion recognition. Therefore, in this work, we discuss for the first time the feasibility of VMD for EOG-based emotion recognition.

3. Materials and Methods

3.1. Variational Mode Decomposition
[...] Therefore, VMD establishes a constrained optimization problem according to the component narrowband condition, estimates the center frequency of each signal component, and reconstructs the corresponding component. The components extracted by VMD are called IMFs. The solving process of VMD mainly includes two constraints: (1) the sum of the bandwidths of the IMF components around their central frequencies is required to be minimal; (2) the sum of all the IMFs is equal to the original signal. The extracted IMFs can be expressed as

$u_k(t) = A_k(t)\cos(\phi_k(t)) \qquad (1)$

where $A_k(t)$ is the envelope amplitude of $u_k(t)$, $\phi_k(t)$ is the instantaneous phase, and $k$ denotes the $k$-th IMF. Similar to empirical mode decomposition (EMD), the IMFs in Eq. (1) satisfy two constraint conditions: (1) the difference between the number of extrema and the number of zero-crossings must be equal to 0 or 1; (2) the mean of the upper and lower envelopes must be 0 at every time point.

Typically, compared with EMD, one of the strengths of VMD is that it can specify the number of IMFs, $K$. However, how to choose the value of $K$ is one of the essential problems for VMD. In this work, we adopted an enumeration strategy and set $K$ from 5 to 12 with a step of 1.

Figure 1: The IMF results of VMD for the hEOG signals (subject #1 in the DEAP dataset when watching the first video).

Fig. 1 shows the results for a segment of hEOG after performing VMD with $K$ set to 5. As shown in the figure, there are obvious eye movement modes, such as saccades, in the low-frequency IMFs, while the high-frequency IMFs still contain some eye movement information. Therefore, VMD is an effective technique to separate the EOG signals into different IMFs that carry different information, and this information may be related to different emotional states.
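To make this step concrete, the following is a minimal sketch of decomposing one EOG trial into $K$ IMFs while enumerating $K$ from 5 to 12, as described above. It assumes the third-party vmdpy package as the VMD solver and uses illustrative values for the penalty, tolerance, and sampling-rate settings; none of these are taken from the paper.

```python
# Minimal sketch of the VMD step (assumes the third-party `vmdpy` package;
# alpha/tau/tol values below are illustrative defaults, not the paper's settings).
import numpy as np
from vmdpy import VMD

fs = 128                          # DEAP EOG sampling rate after downsampling (Hz)
heog = np.random.randn(60 * fs)   # stand-in for one 60-second hEOG trial

alpha, tau, DC, init, tol = 2000, 0.0, 0, 1, 1e-7   # assumed solver settings

# Enumerate K = 5..12 as in the paper and keep the IMFs for each setting.
imfs_per_k = {}
for K in range(5, 13):
    u, u_hat, omega = VMD(heog, alpha, tau, K, DC, init, tol)
    imfs_per_k[K] = u             # u has shape (K, len(heog))

print({k: v.shape for k, v in imfs_per_k.items()})
```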
[...] one-dimensional convolution layer to learn the temporal patterns from each IMF signal. The convolution layer is formulated as

$h_k^n = W_n u_k, \quad n = 1, 2, \dots, N,\; k = 1, 2, \dots, K \qquad (2)$
where $u_k$ is the $k$-th IMF signal of the original EOG signals. Then, a ReLU layer and a max-pooling layer are applied to the features extracted by the convolution layer. Finally, the features are flattened into a one-dimensional vector denoted as $h_k$.
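A minimal PyTorch sketch of this per-IMF temporal feature extractor is given below, using the layer counts, kernel size, and channel widths reported in Table 1. The pooling size, padding, and input length are assumptions, since they are not stated in the paper; the module is applied independently to each IMF $u_k$ and returns the flattened vector $h_k$.

```python
# Sketch of the deep temporal feature extractor applied to one IMF (assumptions:
# pooling size 2, padding 1, and an arbitrary input length; layer counts, kernel
# size, and channel widths follow Table 1).
import torch
import torch.nn as nn

class IMFTemporalCNN(nn.Module):
    def __init__(self, channels=(16, 32, 64, 128, 256, 512, 1024)):
        super().__init__()
        layers, in_ch = [], 1                      # each IMF enters as a 1-channel series
        for out_ch in channels:                    # 7 conv / ReLU / max-pool blocks
            layers += [
                nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool1d(kernel_size=2),
            ]
            in_ch = out_ch
        self.blocks = nn.Sequential(*layers)
        self.flatten = nn.Flatten()                # h_k: one vector per IMF

    def forward(self, u_k):                        # u_k: (batch, 1, time)
        return self.flatten(self.blocks(u_k))

h_k = IMFTemporalCNN()(torch.randn(8, 1, 7680))    # e.g. a 60 s trial at 128 Hz
print(h_k.shape)
```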
3.3. Frequency sequential feature extraction
As shown in Fig. 1, the calculated IMFs correspond to different center frequencies, which carry different emotion-related information. Therefore, long short-term memory (LSTM) was employed to learn the frequency-sequential dependencies and extract more discriminating emotion-related features. LSTM is an effective technique for processing sequential data in emotion recognition tasks. As shown in Fig. 2, an LSTM consists of a forget gate, an input gate, and an output gate. The three gates are formulated as

$f_t = \sigma(W_f[C_{k-1}, \tilde{h}_{k-1}, h_k] + b_f) \qquad (3)$

$i_t = \sigma(W_i[C_{k-1}, \tilde{h}_{k-1}, h_k] + b_i) \qquad (4)$

$o_t = \sigma(W_o[C_{k-1}, \tilde{h}_{k-1}, h_k] + b_o) \qquad (5)$

where $f_t$, $i_t$, and $o_t$ represent the outputs of the forget gate, input gate, and output gate, respectively. $h_k$ is the input of the LSTM at step $k$; $\tilde{h}_{k-1}$ and $\tilde{h}_k$ are the outputs of the LSTM at steps $k-1$ and $k$, respectively. $b_f$, $b_i$, and $b_o$ are the bias terms of the three gates. $C_{k-1}$ and $C_k$ are the cell states, and $\tilde{C}_k$ is a candidate vector that is used to update $C_k$. $\sigma$ denotes the sigmoid activation function.

Figure 2: The structure of LSTM.
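To connect the two extractors, the sketch below treats the $K$ per-IMF vectors $h_1, \dots, h_K$ as a sequence along the frequency axis and passes them through an LSTM with 200 hidden cells (Table 1), keeping the last hidden state as the frequency-sequential feature. The linear projection that reduces each flattened $h_k$ to a fixed input size is an assumption, since the paper does not state how the CNN output dimension is matched to the LSTM input.

```python
# Sketch of the frequency-sequential feature extractor (assumption: a linear
# projection maps each flattened h_k to a fixed LSTM input size of 256).
import torch
import torch.nn as nn

class FrequencySequentialLSTM(nn.Module):
    def __init__(self, h_dim, proj_dim=256, hidden=200):
        super().__init__()
        self.proj = nn.Linear(h_dim, proj_dim)          # assumed dimension matching
        self.lstm = nn.LSTM(proj_dim, hidden, batch_first=True)

    def forward(self, h_seq):                           # h_seq: (batch, K, h_dim)
        out, _ = self.lstm(torch.relu(self.proj(h_seq)))
        return out[:, -1, :]                            # last step summarizes the K IMFs

K, h_dim = 12, 61440                                    # e.g. K=12 IMFs from the CNN above
feat = FrequencySequentialLSTM(h_dim)(torch.randn(4, K, h_dim))
print(feat.shape)                                       # (4, 200)
```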
[...] module, the learned features of the two streams are concatenated into a one-dimensional vector, which is employed to classify the different emotion states. To recognize emotions, a fully connected layer with softmax as the activation function is utilized. The implementation details of the parameters of the VCLSTM are shown in Table 1.

Table 1
The parameter details of VCLSTM

Parameters | Values
# of convolution layers | 7
# of ReLU layers | 7
# of max-pooling layers | 7
Kernel size | 3
# of kernels | [16, 32, 64, 128, 256, 512, 1024]
# of hidden-layer cells of LSTM | 200
Proportion of dropout | 0.5
Learning rate | 0.0001
Batch size | 64
Max epochs | 200

[...] precision, recall, and F1-score. The calculation formulas of [...]

$F_1\text{-score} = \dfrac{2 \times Precision \times Recall}{Precision + Recall} \qquad (9)$
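Putting the pieces together, the sketch below concatenates the features of the two streams, applies dropout and a softmax-activated fully connected layer (dropout rate from Table 1), and includes a tiny helper for Eq. (9). Interpreting the "two streams" as the hEOG and vEOG branches is our reading, consistent with the VCLSTM_h/VCLSTM_v ablation models rather than an explicit statement in the paper; the number of emotion classes is likewise an assumption based on the DEAP experiments.

```python
# Sketch of the fusion and classification head (assumptions: the two streams are
# the hEOG and vEOG CNN-LSTM branches, and there are 4 emotion classes as in the
# DEAP experiments; dropout 0.5 follows Table 1).
import torch
import torch.nn as nn

class VCLSTMHead(nn.Module):
    def __init__(self, stream_dim=200, n_classes=4, p_drop=0.5):
        super().__init__()
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(2 * stream_dim, n_classes)

    def forward(self, feat_h, feat_v):                 # each: (batch, 200)
        fused = torch.cat([feat_h, feat_v], dim=1)     # concatenate the two streams
        return torch.softmax(self.fc(self.drop(fused)), dim=1)

def f1_score(precision, recall):
    """Eq. (9): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

probs = VCLSTMHead()(torch.randn(4, 200), torch.randn(4, 200))
print(probs.shape, f1_score(0.90, 0.88))
```

The learning rate (0.0001), batch size (64), and maximum epochs (200) in Table 1 would then drive training; the optimizer itself is not named in this excerpt.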
Figure 3: The framework of the proposed method for EOG-based emotion recognition
[...] dataset, which is used for analyzing emotions based on a dimensional emotion model. The dataset simultaneously recorded several different physiological signals, such as EEG, EOG, electromyography (EMG), and other peripheral physiological signals. 32 healthy subjects (16 males and 16 females) with a mean age of 26.9 were invited to participate in the experiment. Each subject was asked to watch 40 one-minute-long music video clips (i.e., 40 trials), and after each trial, the subject needed to make a self-report that evaluated the emotion in four dimensions, i.e., valence, arousal, dominance, and liking. The ratings for each dimension range from 1 to 9. The recorded EOG signals were downsampled to 128 Hz and segmented into 60-second trials with a 3-second pre-trial baseline. Furthermore, the EOG contained two channels, i.e., the hEOG and vEOG channels. The hEOG and vEOG were extracted by: [...]

In this work, we only considered the valence-arousal space, and a 4-class emotion recognition task was performed, i.e., HAHV (high arousal and high valence), HALV (high arousal and low valence), LAHV (low arousal and high valence), and LALV (low arousal and low valence), where the basis for dividing the four classes is as follows:

$1 \le LA < 5 \;\&\; 5 \le HA \le 9 \qquad (12)$

$1 \le LV < 5 \;\&\; 5 \le HV \le 9 \qquad (13)$
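In code, Eqs. (12)-(13) amount to thresholding each 1-9 self-assessment rating at 5, as in the minimal sketch below; the function and example ratings are illustrative rather than taken from the dataset.

```python
# Sketch of the 4-class labelling rule of Eqs. (12)-(13): ratings in [1, 5) are
# "low", ratings in [5, 9] are "high" for both arousal and valence.
def label_quadrant(valence: float, arousal: float) -> str:
    v = "HV" if 5 <= valence <= 9 else "LV"     # 1 <= LV < 5, 5 <= HV <= 9
    a = "HA" if 5 <= arousal <= 9 else "LA"     # 1 <= LA < 5, 5 <= HA <= 9
    return a + v                                # HAHV, HALV, LAHV, or LALV

assert label_quadrant(valence=7.2, arousal=3.0) == "LAHV"
assert label_quadrant(valence=4.9, arousal=5.0) == "HALV"
```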
4.1.2. The OVPD-II dataset
To investigate the interactive effects between the senses when perceiving emotional information, we developed an emotional dataset with olfactory-enhanced videos as the stimulus material, i.e., the OVPD-II (Olfactory-enhanced Video Multi-modal Emotion Dataset). The OVPD-II is also a multi-modal signal dataset that contains EEG and EOG. Different from the DEAP, the OVPD-II adopted both film video clips and odors as the emotion-evoking materials, which can synchronously stimulate subjects' visual, auditory, and olfactory senses. 16 healthy subjects (8 males and 8 females) with a mean age of 23.5 participated in the experiment. For the stimuli, 10 film video clips and 10 odors were utilized to evoke subjects' emotions, where 4 video clips and 4 odors were used for eliciting positive emotions, 2 video clips and 2 odors for eliciting neutral emotions, and 4 video clips and 4 odors for eliciting negative emotions. [...] the traditional video pattern and the olfactory-enhanced video pattern. Each subject was stimulated by an intersection of the two patterns, and there are 30 trials, where each trial contained the two patterns and lasted 2 minutes. Similarly, after each trial, the subject was asked to make a self-report in the valence-arousal dimensions, and then the emotions were classed into three states, i.e., positive, neutral, and negative. The corresponding protocol is shown in Fig. 2. The sampling rate was 250 Hz, and the EEG with 28 channels and the EOG with 4 channels were recorded (EEG recording: FP1/2, F3/4, F7/8, Fz, FC1/2, FC5/6, FT9/10, C3/4, T7/8, Cz, CP1/2, CP5/6, TP9/10, P7/8, Pz, Oz; EOG recording: hEOG1, hEOG2, vEOG1, vEOG2), and the signals were segmented into 60-second trials for each pattern.

4.2. Experiment on the DEAP dataset
In this section, the DEAP dataset was employed to test the proposed model. The results of the mean accuracies and standard deviations, precisions, recalls, and F1-scores are summarized in Table 2. As shown in the table, as the parameter $K$ of VMD becomes larger, the values of the four classification metrics of VCLSTM become larger as well, and the classification performance becomes more stable; the best accuracy, precision, recall, and F1-score are 90.21%, 90.09%, 89.90%, and 90.00%, respectively. Besides, it can be found that when $K$ is less than 10, the classification performance of the model increases rapidly, by around 3% on average, while when $K$ is greater than 10, the growth rate of the model performance is relatively slow, increasing by almost 1%.

Table 2
The results of accuracies (mean/std), precision, recall, and F1-score achieved by VCLSTM with different values of K for the DEAP dataset (%)

K-value | Accuracy | Precision | Recall | F1-score
5 | 73.94/0.63 | 73.41 | 73.17 | 73.29
6 | 75.76/0.84 | 75.70 | 74.61 | 75.15
7 | 78.21/1.62 | 78.22 | 77.34 | 77.78
8 | 82.95/1.62 | 82.93 | 82.34 | 82.63
9 | 85.86/1.18 | 85.89 | 85.68 | 85.48
10 | 88.21/1.66 | 88.08 | 87.91 | 87.99
11 | 89.25/0.71 | 89.21 | 88.76 | 88.99
12 | 90.21/0.66 | 90.09 | 89.90 | 90.00

To further assess the effectiveness of VCLSTM, we extracted two handcrafted features, i.e., the DE (Zheng and Lu, 2015) and sample entropy (SE) (Richman and Moorman, [...]), and an SVM with a [...] kernel function was employed to perform the classification task. The grid search method was used to find the optimal regularization parameter and kernel coefficient from the parameter pool $P = \{0.01k, 0.1k, k \mid k = 1, 2, \dots, 9\}$. DE and SE are used to measure the uncertainty of series data. DE was employed by Zheng et al. (Zheng and Lu, 2015) to decode emotional EEG signals and has been validated to be superior for the EEG-based emotion recognition task. SE is also one of the features usually utilized to recognize emotions. The comparative results are shown in Table 3. [...]

Table 3
The comparative results of accuracies for DE, SE, and the proposed deep features when K=12 (%)

Feature | Accuracy (mean/std)
DE (SVM) | 50.32/0.45
SE (SVM) | 43.56/0.72
Our model | 90.21/0.66
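As an illustration of this baseline, the sketch below computes a simple differential-entropy feature and grid-searches an SVM over the stated parameter pool with scikit-learn. The Gaussian form of DE and the use of an RBF kernel are assumptions (the paper's kernel choice is cut off in this excerpt), and the windows and labels are random stand-ins.

```python
# Sketch of the handcrafted-feature baseline: DE features + SVM with grid search
# over P = {0.01k, 0.1k, k | k = 1..9} for both C and the kernel coefficient.
# Assumptions: Gaussian-form DE, an RBF kernel, and random stand-in data.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def differential_entropy(x):
    """DE of a (roughly Gaussian) window: 0.5 * ln(2 * pi * e * var)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x) + 1e-12)

rng = np.random.default_rng(0)
windows = rng.standard_normal((200, 2, 128))                # 200 windows, 2 EOG channels
X = np.apply_along_axis(differential_entropy, 2, windows)   # (200, 2) DE features
y = rng.integers(0, 4, size=200)                            # stand-in 4-class labels

pool = [0.01 * k for k in range(1, 10)] + [0.1 * k for k in range(1, 10)] + list(range(1, 10))
grid = GridSearchCV(SVC(kernel="rbf"), {"C": pool, "gamma": pool}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```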
4.3. Experiment on the OVPD-II dataset
In this section, the OVPD-II dataset was employed to test the proposed model, and the corresponding results of the mean accuracies and standard deviations, precisions, recalls, and F1-scores under the different stimulus patterns are displayed in Tables 4 and 5, respectively. As shown in the two tables, we can also conclude that as the value of $K$ becomes larger and larger, the four classification metrics achieve better performance for both stimulus patterns. What's more, for the traditional video stimuli, the best accuracy, precision, recall, and F1-score reach 93.74%, 93.71%, 93.51%, and 93.62%, respectively, while for the olfactory-enhanced video stimuli, the best accuracy, precision, recall, and F1-score reach 92.09%, 92.09%, 92.13%, and 92.11%, respectively. In addition, it can be found that, whatever the value of $K$, the performances of VCLSTM are similar for the two stimulus patterns.

Table 4
The results of accuracies (mean/std), precision, recall, and F1-score achieved by VCLSTM with different values of K in the traditional video stimuli for the OVPD-II dataset (%)

K-value | Accuracy | Precision | Recall | F1-score
5 | 71.46/2.85 | 71.92 | 69.99 | 70.94
6 | 74.56/0.84 | 74.10 | 73.63 | 73.86
7 | 78.78/1.62 | 80.29 | 77.45 | 78.85
8 | 85.72/1.62 | 85.58 | 85.40 | 85.49
9 | 88.02/1.18 | 87.88 | 87.71 | 87.80
10 | 84.49/1.66 | 84.11 | 84.89 | 84.50
11 | 90.48/0.71 | 90.60 | 89.98 | 90.29
12 | 93.74/0.66 | 93.71 | 93.51 | 93.62

Table 5
The results of accuracies (mean/std), precision, recall, and F1-score achieved by VCLSTM with different values of K in the olfactory-enhanced video stimuli for the OVPD-II dataset (%)

K-value | Accuracy | Precision | Recall | F1-score
5 | 72.01/2.24 | 72.34 | 72.17 | 72.26
6 | 75.56/1.72 | 75.86 | 75.64 | 75.75
7 | 78.63/1.44 | 78.82 | 78.58 | 78.70
8 | 84.45/1.66 | 84.81 | 84.52 | 84.66
9 | 86.26/1.32 | 86.38 | 86.20 | 86.29
10 | 87.82/4.20 | 88.11 | 87.86 | 87.98
11 | 90.22/2.14 | 90.25 | 90.20 | 90.23
12 | 92.09/1.52 | 92.09 | 92.13 | 92.11

However, we also calculated the classification accuracies of the SVM classifier using the DE and SE features with the same parameter settings as in Section 4.2. The corresponding results are displayed in Table 6, where the DE feature achieved mean accuracies of 61.83% and 60.70% for the traditional video pattern and the olfactory-enhanced video pattern, respectively, while the SE feature achieved 69.52% and 69.24% for the two stimulus patterns, respectively. Compared with the SVM model using the DE and SE features, the VCLSTM model achieves a 23% to 32% increase for both stimulus patterns in classifying the positive, neutral, and negative emotional states.

Table 6
The comparative results of accuracies (mean/std) for DE, SE, and the proposed deep features on the OVPD-II dataset (%)

Feature | Traditional video | Olfactory-enhanced video
DE (SVM) | 61.83/0.93 | 60.70/0.71
SE (SVM) | 69.52/0.84 | 69.24/0.93
Our model | 93.74/0.66 | 92.09/1.52

[...] model, we performed ablation experiments, where the following experiments were designed:

• VMD+CNN (VCNN): In this model, we only extract the deep temporal features by the CNN with the same parameters as VCLSTM, while the LSTM is not employed to extract the frequency features;

• VMD+LSTM (VLSTM): In this model, we only extract the frequency-sequential features from the original IMFs by the LSTM with the same parameters as [...];

• [...] the emotion recognition task with the same parameters as VCLSTM.

The corresponding ablation experimental results for all five models (i.e., VCNN, VLSTM, VCLSTM_h, VCLSTM_v, and VCLSTM), validated on the DEAP dataset and the OVPD-II dataset, are shown in Fig. 4. From the general trend over the value of $K$, it can be clearly found that the proposed model always achieved the best accuracies compared with the other four models, except for the case when $K$=5 on the OVPD-II dataset for the traditional video pattern. Besides, the VLSTM model always obtained the worst performance on both datasets, where the average accuracies were around 35% and 48%, respectively. This indicates, to some extent, the effectiveness of the deep feature extractor built from the convolution layers. More concretely, the VLSTM model utilized the original VMD results as the sequence features, while the proposed model sequentially extracted the deep temporal features and the sequence features from the VMD signals by the convolution layers and the LSTM. Moreover, the VCNN model achieved average accuracies of around 65% and 75% for the two datasets, respectively, which indicates the effectiveness of the sequence feature extractor, since the VCNN only employed the temporal feature extractor. In addition, compared with the VLSTM model, the VCNN model achieved an increase of around 30% in classification accuracy. This illustrates the importance of extracting features from the original signal instead of directly using only the original signal as a feature. Similar conclusions can also be drawn by comparing Table 3, Table 6, and Fig. 4, where, for each dataset, the SVM model with the DE or SE features still achieves higher accuracies than the VLSTM model when $K$=12.

Figure 4: The ablation experimental results. (a) The ablation experiment tested on the DEAP dataset; (b) the ablation experiment tested on the OVPD-II dataset for the traditional video pattern; (c) the ablation experiment tested on the OVPD-II dataset for the olfactory-enhanced video pattern.

Furthermore, as shown in Fig. 4, we can also find that for almost all values of $K$, the VCLSTM model is superior to both the VCLSTM_h model and the VCLSTM_v model; for the DEAP dataset, VCLSTM increases the average accuracy over all values of $K$ by 13.51% and 11.78% relative to the two single-channel models, respectively; for the traditional video stimulation pattern of the [...] Among all values of $K$, the performance of the VCLSTM_v model is generally better than that of the VCLSTM_h model, with the average accuracies for the DEAP dataset and for the traditional video and olfactory-enhanced video stimulation patterns of the OVPD-II dataset improved by 1.71%, 1.88%, and 2.79%, respectively. Especially, when $K$=12, the difference is [...]

[...] be one of the cores for building the HCI system. A variety of methods have been developed to recognize different emotional states and to study the affective mechanisms. However,
in the subconscious mind, most people generally tend to subjectively perceive positive emotions and resist negative emotions (Tamir, 2009; Larsen, 2000). Actually, many works have reported the consistent finding that most of the developed emotion recognition models perform better at recognizing positive emotions than negative emotions. Zheng et al. (Zheng and Lu, 2015) employed the DBN model and achieved a highest accuracy of 100% for recognizing positive emotions, but only 81% for negative emotions, in one experiment of one participant; W. Zheng (Zheng, 2016) proposed a novel group sparse canonical correlation analysis (GSCCA) method and achieved classification accuracies of 88.55% and 78.11% for recognizing positive and negative emotions, respectively. Li et al. (Li, Liu, Si, Li, Li, Zhu, Huang, Zeng, Yao, Zhang et al., 2019) achieved the highest accuracy on the HVHA dimension, followed by the HVLA dimension, the LVLA dimension, and last the LVHA dimension, where positive emotions are always categorized as high-valence emotions, while negative emotions are mainly categorized as low-valence emotions (Russell, 1980).

Indeed, the same finding can also be drawn from Fig. 5, in which the average confusion matrices of the VCLSTM_h, the VCLSTM_v, and the VCLSTM for the two datasets among all $K$ values are displayed. As shown in Fig. 5(a) and 5(b), the accuracies for the positive emotion are generally higher than those for the negative emotions; and in Fig. 5(c), the accuracies for the HVHA dimension are always higher than those for the LVHA dimension, where the positive discrete emotions are mainly concentrated in the HVHA dimension while the negative discrete emotions are mainly concentrated in the LVHA dimension. Furthermore, as displayed in all three subfigures of Fig. 5, compared with the hEOG signals, the features extracted from the vEOG signals were more discriminative for classifying the negative [...]

[...] the physiological signal, and the low signal-to-noise ratio (SNR) of these signals, the data used in emotion recognition research are mostly collected from subjects following an experimental paradigm in a laboratory environment, where the emotional materials are mainly pictures or videos with different emotional labels. That is to say, in these experiments, subjects usually perceive the external emotional information by audition or vision. In the OVPD-II dataset, in contrast, we employed olfaction as a third sense to ensure that different emotions are evoked in the subjects more fully and efficiently, and we made a preliminary study of the effect of olfaction on the visual perception of emotional information. According to Fig. 5(a) and 5(b), we find an interesting phenomenon: whatever the classification model (i.e., the VCLSTM_h, the VCLSTM_v, or the VCLSTM), the accuracies for recognizing positive and negative emotions were generally higher when employing the olfactory-enhanced videos as the stimuli rather than the traditional videos. On the one hand, this finding is consistent with our previous works (Wu, Teng, Fan, Pei, Li and Lv, 2023; Xue, Wang, Hu, Bi and Lv, 2022) and other researchers' studies (Raheel, Majid and Anwar, 2019; Murray, Lee, Qiao and Muntean, 2014). All these studies took EEG as the analysis object to verify that smell can enhance subjects' emotional perception ability; more concretely, compared with the traditional video mode, the smell-enhanced video mode can make the model achieve higher classification accuracy, especially for negative emotions. On the other hand, similar to the views of multi-sensory integration, which hold that the interactive synergy among different senses can effectively enhance the physiological salience of a stimulus (Stein and Stanford, 2008; Benssassi and Ye, 2021), this finding may provide further support that olfaction does affect the ability of vision to perceive emotional information and express the corresponding emotions.
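For reference, the average confusion matrices discussed around Fig. 5 can be reproduced by row-normalizing the confusion matrix of each run and averaging over the eight values of $K$. The sketch below shows this aggregation with scikit-learn on stand-in labels for the three OVPD-II classes; the label encoding and sample counts are placeholders.

```python
# Sketch of the Fig. 5 aggregation: row-normalized confusion matrices averaged
# over the K = 5..12 runs of one model (stand-in labels; 3 OVPD-II classes).
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["positive", "neutral", "negative"]
rng = np.random.default_rng(0)

per_k = []
for K in range(5, 13):                       # one evaluation per K setting
    y_true = rng.integers(0, 3, size=300)
    y_pred = rng.integers(0, 3, size=300)    # replace with real model outputs
    per_k.append(confusion_matrix(y_true, y_pred, labels=[0, 1, 2], normalize="true"))

avg_cm = np.mean(per_k, axis=0)              # average confusion matrix across K
print(np.round(avg_cm, 3))
```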
Figure 5: The confusion matrices for the VCLSTM_h, the VCLSTM_v, and VCLSTM on the DEAP dataset and the OVPD-II dataset among all $K$ values. For each subfigure, from left to right: the VCLSTM_h, the VCLSTM_v, and VCLSTM. (a) The average confusion matrices for the traditional video pattern of the OVPD-II dataset; (b) the average confusion matrices for the olfactory-enhanced video pattern of the OVPD-II dataset; (c) the average confusion matrices for the DEAP dataset.

[...] confusion matrix results, we showed that the model based on vEOG is generally superior to the one based on hEOG on both datasets; and for OVPD-II, we tentatively concluded that olfaction could affect the ability of vision to perceive emotional information, especially for negative [...]

[...] Writing - original draft. Zhao Lv: Conceptualization, Methodology, Supervision, Writing - review & editing. Cunhang Fan: Methodology, Software, Formal analysis, Writing - review & editing. Shengbing Pei: Methodology, Software, Formal analysis, Writing - review & editing. Xiangping Gao: Conceptualization, Methodology - review & editing. Fan Li: Conceptualization, Methodology - review & editing. Wen Liang: Conceptualization, Methodology - review & editing.

[...] supported by the National Natural Science Foundation of China (NSFC) (No. 61972437), the Excellent Youth Foundation of Anhui Scientific Committee (No. 2208085J05), the National Key Research and Development Program of China (No. 2021ZD0201502), and the Special Fund for the Key Program of Science and Technology of Anhui Province (No. [...]

References
[...] and negative emotion in the human brain: EEG spectral analysis. Neuropsychologia 23, 745–755.
Benssassi, E.M., Ye, J., 2021. Investigating multisensory integration in emotion recognition through bio-inspired computational models. IEEE Transactions on Affective Computing.
Cai, H., Liu, X., Jiang, A., Ni, R., Zhou, X., Cangelosi, A., 2021. Combination of EOG and EEG for emotion recognition over different window sizes, in: 2021 IEEE 2nd International Conference on Human-Machine Systems (ICHMS), IEEE. pp. 1–6.
Cai, H., Liu, X., Ni, R., Song, S., Cangelosi, A., 2023. Emotion recognition through combining EEG and EOG over relevant channels with optimal windowing. IEEE Transactions on Human-Machine Systems.
Cruz, A., Garcia, D., Pires, G., Nunes, U., 2015. Facial expression recognition based on EOG toward emotion detection for human-robot interaction, in: Biosignals, pp. 31–37.
Dragomiretskiy, K., Zosso, D., 2013. Variational mode decomposition. IEEE Transactions on Signal Processing 62, 531–544.
Ekman, P., 1992. An argument for basic emotions. Cognition & Emotion 6, 169–200.
Frijda, N.H., 1986. The emotions. Cambridge University Press.
Hason Rudd, D., Huo, H., Xu, G., 2023. An extended variational mode decomposition algorithm developed speech emotion recognition performance, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer. pp. 219–231.
Izard, C.E., 2007. Basic emotions, natural kinds, emotion schemas, and a new paradigm. Perspectives on Psychological Science 2, 260–280.
Khare, S.K., Bajaj, V., 2020. An evolutionary optimized variational mode decomposition for emotion recognition. IEEE Sensors Journal 21, 2035–2042.
Larsen, R.J., 2000. Toward a science of mood regulation. Psychological Inquiry 11, 129–141.
Li, J., Zhang, X., Li, F., Huang, L., 2023. Speech emotion recognition based on optimized deep features of dual-channel complementary spectrogram. Information Sciences 649, 119649.
Li, P., Liu, H., Si, Y., Li, C., Li, F., Zhu, X., Huang, X., Zeng, Y., Yao, D., Zhang, Y., et al., 2019. EEG based emotion recognition by combining functional connectivity network and local activations. IEEE Transactions on Biomedical Engineering 66, 2869–2881.
Liu, Z.T., Hu, S.J., She, J., Yang, Z., Xu, X., 2023. Electroencephalogram emotion recognition using combined features in variational mode decomposition domain. IEEE Transactions on Cognitive and Developmental Systems.
Lu, Y., Zheng, W.L., Li, B., Lu, B.L., 2015. Combining eye movements and EEG to enhance emotion recognition, in: IJCAI, Buenos Aires. pp. 1170–1176.
Mishra, S.P., Warule, P., Deb, S., 2023. Variational mode decomposition based acoustic and entropy features for speech emotion recognition. Applied Acoustics 212, 109578.
Murray, N., Lee, B., Qiao, Y., Muntean, G.M., 2014. Multiple-scent enhanced multimedia synchronization. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 11, 1–28.
Niu, Y., Todd, R.M., Kyan, M., Anderson, A.K., 2012. Visual and emotional salience influence eye movements. ACM Transactions on Applied [...]
Salovey, P., Mayer, J.D., 1990. Emotional intelligence. Imagination, Cognition and Personality 9, 185–211.
Stein, B.E., Stanford, T.R., 2008. Multisensory integration: current issues from the perspective of the single neuron. Nature Reviews Neuroscience 9, 255–266.
Tamir, M., 2009. What do people want to feel and why? Pleasure and utility in emotion regulation. Current Directions in Psychological Science 18, 101–105.
Taran, S., Bajaj, V., 2019. Emotion recognition from single-channel EEG signals using a two-stage correlation and instantaneous frequency-based filtering method. Computer Methods and Programs in Biomedicine 173, 157–165.
Wang, Y., Lv, Z., Zheng, Y., 2018. Automatic emotion perception using eye movement information for e-healthcare systems. Sensors 18, 2826.
Welch, R.B., DutionHurt, L.D., Warren, D.H., 1986. Contributions of audition and vision to temporal rate perception. Perception & Psychophysics 39, 294–300.
Wiem, M.B.H., Lachiri, Z., 2017. Emotion classification in arousal valence model using MAHNOB-HCI database. International Journal of Advanced Computer Science and Applications 8.
Wu, M., Teng, W., Fan, C., Pei, S., Li, P., Lv, Z., 2023. An investigation of olfactory-enhanced video on EEG-based emotion recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering 31, 1602–1613.
Xue, J., Wang, J., Hu, S., Bi, N., Lv, Z., 2022. OVPD: Odor-video elicited physiological signal database for emotion recognition. IEEE Transactions on Instrumentation and Measurement 71, 1–12.
Zhang, C., Yeh, C.H., Shi, W., 2023. Variational phase-amplitude coupling characterizes signatures of anterior cortex under emotional processing. IEEE Journal of Biomedical and Health Informatics 27, 1935–1945.
Zheng, W., 2016. Multichannel EEG-based emotion recognition via group sparse canonical correlation analysis. IEEE Transactions on Cognitive and Developmental Systems 9, 281–290.
Zheng, W.L., Lu, B.L., 2015. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development 7, 162–175.