
Biomedical Signal Processing and Control 84 (2023) 104835

Contents lists available at ScienceDirect

Biomedical Signal Processing and Control


journal homepage: www.elsevier.com/locate/bspc

EEG emotion recognition using attention-based convolutional transformer neural network
Linlin Gong, Mingyang Li, Tao Zhang, Wanzhong Chen ∗
College of Communication Engineering, Jilin University, Changchun, 130012, China

ARTICLE INFO

Keywords:
Electroencephalogram (EEG)
Emotion recognition
Attention mechanism
Convolutional neural network (CNN)
Transformer

ABSTRACT

EEG-based emotion recognition has become an important task in affective computing and intelligent interaction. However, how to effectively combine the spatial, spectral, and temporal distinguishable information of EEG signals to achieve better emotion recognition performance is still a challenge. In this paper, we propose a novel attention-based convolutional transformer neural network (ACTNN), which effectively integrates the crucial spatial, spectral, and temporal information of EEG signals, and cascades convolutional neural network and transformer in a new way for the emotion recognition task. We first organize EEG signals into spatial–spectral–temporal representations. To enhance the distinguishability of features, spatial and spectral attention masks are learned for the representation of each time slice. Then, a convolutional module is used to extract local spatial and spectral features. Finally, we concatenate the features of all time slices, and feed them into the transformer-based temporal encoding layer to use multi-head self-attention for global feature awareness. The average recognition accuracy of the proposed ACTNN on two public datasets, namely SEED and SEED-IV, is 98.47% and 91.90% respectively, outperforming the state-of-the-art methods. Besides, to explore the underlying reasoning process of the model and its neuroscience relevance to emotion, we further visualize the spatial and spectral attention masks. The attention weight distribution shows that the activities of the prefrontal lobe and lateral temporal lobe of the brain, and the gamma band of EEG signals, might be more related to human emotion. The proposed ACTNN can be employed as a promising framework for EEG emotion recognition.

1. Introduction

Affective computing [1], a new research area on the perception and measurement of human emotions, is regarded as an exciting prospect in the field of human–computer interaction (HCI) [2], and has received more and more attention in recent years. Its purpose is to let the machine understand and recognize the human emotional state, and then give appropriate feedback to respond to, evaluate or regulate human emotions [3]. At present, some related research with great potential and practical value has appeared, for example, affective brain–computer interface (aBCI) technology [4], emotion detection of patients with disorders of consciousness (DoC) [5], auxiliary diagnosis of affective disorders (such as autism [6], attention deficit hyperactivity disorder [7], depression [8], etc.), emotion detection of drivers [9], cognitive load assessment [10] and so on. However, because the neural mechanism of emotion is not clear, affective computing still faces many difficulties and challenges [11].

Emotion recognition is still the most important research topic in the field of affective computing. Some studies recognize emotions based on external manifestations such as facial expressions [12] or body movements [13]. However, although external manifestations can express emotions intuitively, they are easy to hide or disguise, and physiological signals help to capture the potential reaction of emotional changes more objectively [14]. These include brain signals from the central nervous system, as well as skin electrical responses, heart rate and so on from the autonomic nervous system. Many neuroscience studies have shown that emotion is closely related to the amygdala, prefrontal cortex and other regions of the brain [15,16]. Some studies have also verified the relationship between EEG signals and emotion [17]. In addition, EEG is non-invasive, inexpensive, and has good time resolution. Therefore, in recent years, EEG-based emotion recognition has become an effective method.

In the past decades, to achieve more robust and accurate classification models, researchers have made great efforts in emotional feature extraction and classification algorithms based on EEG signals. Many researchers adopt machine learning methods and extract features mainly from the time domain, frequency domain, or time–frequency domain. Time domain features generally include higher order crossings (HOC) features [18–20], Hjorth features [21,22], etc.

∗ Corresponding author.
E-mail address: chenwz@jlu.edu.cn (W. Chen).

https://doi.org/10.1016/j.bspc.2023.104835
Received 28 November 2022; Received in revised form 17 February 2023; Accepted 5 March 2023
Available online 10 March 2023
1746-8094/© 2023 Elsevier Ltd. All rights reserved.

Frequency domain features generally include power spectral density (PSD) features [23]. For time–frequency domain features, wavelet transforms are often used, including the discrete wavelet transform (DWT) [24,25], tunable Q wavelet transform (TQWT) [26–28], dual-tree complex wavelet transform (DT-CWT) [29,30], etc. In addition, some studies used features based on entropy measures. In the field of EEG emotion recognition, the differential entropy (DE) [31] feature is widely used and has proved to be robust. Entropy-based features also include sample entropy [32,33], energy entropy [34], approximate entropy [35,36], etc. For feature classification, the classifiers used generally include the support vector machine (SVM) [31,36], k-nearest neighbor (KNN) [19,21], random forest (RF) [34], decision tree (DT) [25,37], and so on, as well as ensemble models [24] of multiple classifiers.

Recently, with the continuous improvement and superior performance of deep learning algorithms, EEG emotion recognition methods based on deep learning frameworks have been effectively applied and have achieved better performance. Zheng et al. [38] designed a classification model based on the deep belief network (DBN), and discussed the key frequency bands and channels more suitable for emotion classification tasks. Maheshwari et al. [39] proposed a deep convolutional neural network (Deep CNN) EEG emotion classification method. Considering the spatial information of adjacent and symmetric channels of EEG, Cui et al. [40] proposed an end-to-end regional asymmetric convolutional neural network (RACNN), wherein the temporal, regional, and asymmetric feature extractors in the model are all composed of convolution structures. Xing et al. [41] used the stacked autoencoder (SAE) to decompose the EEG source signal, and then used the long short-term memory recurrent neural network (LSTM-RNN) framework for emotion classification.

In addition, some researchers also use mixed deep models to carry out experiments. For example, Iyer et al. [42] proposed a hybrid model based on CNN and LSTM, and an integrated model combining CNN, LSTM and hybrid models. Li et al. [43] presented a hybrid model based on CNN and RNN (CRNN), which constructs scalograms as the input of the model after continuous wavelet transform of EEG signals. Zhang et al. [44] designed an end-to-end hybrid network based on CNN and LSTM (CNN-LSTM), which directly takes the original EEG signal as the input. All hybrid frameworks showed better classification results than using a single model.

However, there are still some challenges and problems worthy of improvement in the area of EEG-based emotion recognition.

Firstly, as mentioned above, most studies often extract features from the time domain, frequency domain or time–frequency domain of EEG signals. In fact, EEG also includes the spatial information of each channel, because an emotional state involves large-scale network interaction across the entire neural axis [45]. To effectively use the spatial information of EEG signals, Song et al. [46] proposed to use a dynamical graph convolutional neural network (DGCNN) to carry out EEG emotion recognition; in their method, each EEG channel is taken as a vertex of the graph, and the adjacency matrix is dynamically updated during training. Subsequently, some models based on graph neural networks were proposed gradually [47,48]. Another method using spatial information has recently attracted attention, that is, the EEG signal is processed as a two-dimensional matrix. Yang et al. [49] were the first to propose this integration method. Later, CRNN [43], HCNN [50], PCNN [51], etc., also used similar construction methods.

Secondly, CNN has been extensively applied to EEG emotion recognition tasks. However, there is a temporal context relation between frames; the convolution kernel in CNN can perceive locally, but it may break these relations. The Transformer [52] has strong global awareness due to the design of the multi-head self-attention mechanism. Therefore, we hope to combine the local perception ability of CNN and the global perception ability of the Transformer, and to design a novel EEG-based emotion recognition model with better performance.

In this paper, we propose a novel multi-channel EEG emotion recognition model (ACTNN), which cascades the CNN and transformer frameworks. First, we use a non-overlapping window with a length of T seconds to intercept EEG signals after removing noise and artifacts. Then, we divide each window into T 1-second segments. For each segment, we extract the DE features in the δ, θ, α, β, γ frequency bands, and then map the features in space according to the positions of the electrodes. Subsequently, to enhance the critical spatial and spectral information and suppress invalid information, we introduce a new parallel spatial and spectral attention mechanism. Next, we use the convolution module to extract the local spatial and spectral features of each time slice. In the temporal encoding part, we concatenate the features from the time slices and apply multi-head self-attention for global awareness through three temporal encoding layers. Finally, the classifier is composed of a fully-connected layer and a softmax layer to predict emotion labels.

We carry out a series of experiments through ACTNN. Firstly, statistical analysis of DE features is carried out using one-way ANOVA. Secondly, the overall performance of ACTNN and the comparison of results under different attention conditions are reported. Thirdly, we compare the recognition performance when the input is the raw EEG signals or the DE features. Fourthly, the spatial and spectral attention masks are visualized to explore the model's potential reasoning process and interpretability. Finally, an ablation experiment is conducted to investigate the contribution of the key components of ACTNN to recognition performance.

The main contributions of this paper are as follows:

(1) We propose a novel attention-based convolutional transformer neural network, named ACTNN. It cascades a convolutional neural network and a transformer in an innovative way to deal with EEG emotion recognition tasks, which effectively utilizes the advantages of the local awareness of CNN and the global awareness of the transformer; the combination of the two forms a powerful model.

(2) We introduce a new attention mechanism to effectively enhance the spatial, spectral, and temporal distinguishability of EEG signals, and achieve satisfactory results. Moreover, we apply a more lightweight spatial and spectral attention layout, which overcomes the high computational complexity caused by common attention mechanism layouts, saves computational consumption and ensures better recognition accuracy.

(3) The average recognition accuracy of the proposed ACTNN model on the SEED and SEED-IV datasets is 98.47% and 91.90% respectively, outperforming the state-of-the-art methods. Besides, to explore the underlying reasoning process of the model and its neuroscience relevance to emotion, we analyze the attention masks, and the weight distribution shows that the activities of the prefrontal and lateral temporal lobes of the brain and the gamma band of EEG signals might be more related to human emotion.

The rest of this paper is arranged as follows: Section 2 introduces two public databases and the proposed methods in detail. Section 3 reports and analyzes the experimental results. Section 4 discusses the noteworthy points and the points to be improved in our work according to the results. Section 5 elaborates the conclusion of our work.

2. Materials and methods

2.1. Dataset

We conduct extensive experiments on the SEED [31,38] and SEED-IV [53] datasets to evaluate our model.1 The main details of the two datasets are summarized in Table 1.

The SEED dataset [31,38] is a public EEG emotion dataset, which is mainly oriented to discrete emotion models. The experimental flow of the SEED dataset is shown in Fig. 1, which is similar to that of the SEED-IV dataset. It includes 15 subjects (7 males and 8 females, age range: 23.27±2.37). Each subject did three experiments at intervals of about one week.

1 https://bcmi.sjtu.edu.cn/~seed/index.html


In each experiment, subjects watched 15 film clips about 4 min long. The EEG acquisition cap includes 62 electrodes. EEG signals are downsampled to 200 Hz and band-pass filtered to 0–75 Hz. There are three emotional tasks in the SEED dataset: positive, neutral, and negative.

The SEED-IV dataset [53] is also a public emotion dataset, which recorded EEG signals and eye movement signals. It includes 15 subjects, each of whom watched 24 film clips about 2 min long in one experiment. Each subject did this experiment three times. The EEG acquisition cap includes 62 electrodes. EEG signals are downsampled to 200 Hz and band-pass filtered to 1–75 Hz. There are four emotional tasks in the SEED-IV dataset: neutral, sad, fear, and happy.

Table 1
Details of SEED and SEED-IV datasets.

Item                    SEED       SEED-IV
Subjects                15         15
Trials/Film clips       15         24
Each clip duration      4-min      2-min
Sessions/experiments    3          3
EEG electrodes          62         62
Sampling rate           200 Hz     200 Hz
Emotion category        3 class    4 class

Fig. 1. The experiment flow of one subject in SEED dataset.

2.2. The proposed model

The framework of the proposed ACTNN is shown in Fig. 2. It mainly consists of the following parts: EEG signal acquisition, preprocessing and segmentation, feature extraction, spatial projection, spatial and spectral attention branch, spatial–spectral convolution part, temporal encoding part, and classifier.

For the EEG signals induced by the emotional stimulus of each subject in the dataset, we first intercept non-overlapping T-second EEG signals and divide them into T time slices with a length of 1 s. Then, we extract DE features in five frequency bands (i.e., δ, θ, α, β, γ rhythms) from each slice and map them to the spatial matrix. In the attention stage, we introduce a parallel spatial and spectral attention branch to adaptively allocate attention weights to the spatial and spectral dimensions. Next, we use the spatial–spectral convolution module to extract local features from each time slice. After concatenating the features of each time slice, the temporal encoding layer is used to further extract temporal features from the global scope. Finally, a fully-connected layer and a softmax layer are used to predict the emotional state of the subjects. The following is a detailed introduction to the specific implementation of each part.

2.3. Preprocessing and feature extraction

For the preprocessed EEG signals in the SEED and SEED-IV datasets, we used a non-overlapping window with a length of T seconds to intercept EEG signals. Then, we divided each window into T 1-second segments. For each segment, we extracted the differential entropy (DE) features in the δ (1–4 Hz), θ (4–8 Hz), α (8–13 Hz), β (13–31 Hz), and γ (31–50 Hz) frequency bands, respectively. The calculation formula of the DE feature is

DE(X) = − ∫_X f(x) log f(x) dx    (1)

where X represents the EEG sequence and f(x) represents its probability density function. Shi et al. [54] proved that when band-pass filtering is carried out in 2 Hz steps from 2 Hz to 44 Hz, the EEG signals of each subband approximately follow a Gaussian distribution, namely X ∼ N(μ, σ²). Therefore, formula (1) can be further written as

DE(X) = − ∫_{−∞}^{+∞} (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)) log[(1/√(2πσ²)) exp(−(x − μ)²/(2σ²))] dx = (1/2) log(2πeσ²)    (2)

where π and e are constants, and σ² represents the variance of the EEG time series.
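To make this step concrete, the sketch below computes DE features for one 1-second, multi-channel segment under the Gaussian assumption of Eq. (2). It is illustrative only: the paper does not specify the filter design, so the 4th-order Butterworth band-pass filter and the helper names used here are our own assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 200  # SEED / SEED-IV sampling rate (Hz)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 31), "gamma": (31, 50)}

def de_features(segment: np.ndarray) -> np.ndarray:
    """Differential entropy of one 1-s segment of shape (channels, samples).

    Under the Gaussian assumption of Eq. (2), DE reduces to
    0.5 * log(2 * pi * e * variance) per channel and band.
    """
    feats = []
    for low, high in BANDS.values():
        b, a = butter(4, [low / (FS / 2), high / (FS / 2)], btype="band")
        band = filtfilt(b, a, segment, axis=-1)   # band-pass filter the segment
        feats.append(0.5 * np.log(2 * np.pi * np.e * band.var(axis=-1)))
    return np.stack(feats)                        # shape (5, channels)
```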

2.4. Spatial projection of EEG features

To preserve the relative relationship between the placement positions of the EEG electrodes on the head, the EEG features were spatially projected. Similar to previous work [49–51], the EEG features of each band were first mapped into a 2D matrix, and then organized into a 3D structure according to the frequency bands. Specifically, as shown in Fig. 3, the yellow circles in the left figure represent the 62 acquisition electrodes, which were mapped to the right figure according to their relative positions to form a two-dimensional matrix. The electrode positions were filled with features and the rest of the positions were filled with 0. The shape of the 2D matrix was set as H × W, where H = W = 9. Then all frequency bands (i.e., δ, θ, α, β, γ rhythms) were combined to obtain E_i ∈ R^{B×(H×W)}, where B = 5. Finally, the E_i of each second in the T seconds were arranged in chronological order to construct the 4D structure X = [E_1, E_2, …, E_T] ∈ R^{T×B×(H×W)}.
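The spatial projection can be sketched as follows. The electrode-to-grid coordinates shown are hypothetical placeholders (only a few electrodes are listed); the actual 62-channel layout follows the SEED cap and is not reproduced here.

```python
import numpy as np

H = W = 9   # size of the 2D electrode grid
B = 5       # number of frequency bands

# Hypothetical (row, col) grid positions for a few of the 62 electrodes.
ELECTRODE_POS = {"FP1": (0, 3), "FPZ": (0, 4), "FP2": (0, 5),
                 "F7": (2, 0), "F8": (2, 8)}

def project(de: np.ndarray, channel_names: list[str]) -> np.ndarray:
    """Map per-channel DE features, shape (B, 62), onto a (B, H, W) grid.

    Grid cells without an electrode stay zero, as described in Section 2.4.
    """
    grid = np.zeros((B, H, W), dtype=np.float32)
    for ch, name in enumerate(channel_names):
        if name in ELECTRODE_POS:
            r, c = ELECTRODE_POS[name]
            grid[:, r, c] = de[:, ch]
    return grid

# Stacking T consecutive 1-s slices yields the 4D input X of shape (T, B, H, W).
```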
2.5. Spatial–spectral attention branch

The three-dimensional structure E_i ∈ R^{B×(H×W)} contains important spatial and spectral information, where i represents the i-th time slice, i = 1, …, T. We introduced parallel spatial and spectral attention branches, which adaptively capture the brain regions and frequency bands that are more critical for the tasks. Inspired by the convolutional attention module scSE [55], which was initially used in the field of medical image segmentation, we designed attention branches suitable for our tasks, as shown in the lower-left corner of Fig. 2.

2.5.1. Spatial attention branch

The spatial attention branch aimed to capture the crucial brain regions and corresponding electrodes involved in emotional activities, using the method of spectral squeeze and spatial excitation. In detail, the three-dimensional structure E_i of each time slice was written as E_i = [e_{1,1}, e_{1,2}, …, e_{H,W}], where e_{i,j} ∈ R^{B×(1×1)}. The spectral squeeze was mainly realized by a 3D convolution that used a convolution kernel with the size of B × 1 × 1 and an output channel of 1, which was represented by

K_i = W_k ⊗ E_i    (3)

where W_k ∈ R^{1×B×1×1} represents the learned matrix, and K_i ∈ R^{1×H×W} is the spatial scores tensor. Next, the sigmoid function (represented by σ(⋅)) was applied to normalize each element k_{m,n} (m = 1, 2, …, H; n = 1, 2, …, W) of K_i to the range of [0,1], which gives the spatial attention scores. Finally, using the spatial attention scores to recalibrate the original three-dimensional structure E_i, we can get

E_{i,spatial} = [σ(k_{1,1})e_{1,1}, σ(k_{1,2})e_{1,2}, …, σ(k_{H,W})e_{H,W}]    (4)


Fig. 2. The framework diagram of the attention-based convolutional transformer neural network (ACTNN) for EEG emotion recognition.

Fig. 3. The spatial projection of EEG features.

2.5.2. Spectral attention branch

The spectral attention branch was designed to capture the more valuable EEG bands: the proportion of attention is increased for the more important frequency bands and suppressed otherwise. It mainly adopted the method of spatial squeeze and spectral excitation. Here, the 3D structure E_i was represented as E_i = [e_1, e_2, …, e_B], where e_k ∈ R^{H×W} and k = 1, 2, …, B. The spatial squeeze was obtained through global average pooling, and the formula is as follows

z_k = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} e_k(i, j)    (5)

where k = 1, 2, …, B, and all z_k are combined into Z = [z_1, z_2, …, z_B]. Then, we performed a two-layer 3D convolution operation on Z; the convolution kernel size was 1 × 1 × 1, and the output channels were 5 and 1, respectively. The activation function was the Gaussian error linear unit (GELU). The specific formula is

Ẑ = W_2 ⊗ (GELU(W_1 ⊗ Z))    (6)

wherein W_1, W_2 ∈ R^{1×1×1} represent the learnable matrices of the convolution operations. The sigmoid function (represented by σ(⋅)) was applied to normalize each element ẑ_k (k = 1, 2, …, B) of Ẑ to the range of [0,1], which gives the spectral attention scores. Finally, we applied all spectral attention scores to E_i. The specific formula is

E_{i,spectral} = [σ(ẑ_1)e_1, σ(ẑ_2)e_2, …, σ(ẑ_B)e_B]    (7)

Finally, we added E_{i,spatial} and E_{i,spectral} together to get the output Ê_i with the spatial and spectral attention. The formula is as follows

Ê_i = E_{i,spatial} + E_{i,spectral}    (8)
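A minimal PyTorch sketch of the two parallel branches described by Eqs. (3)–(8) is given below. It follows our reading of the text (e.g., the two 1 × 1 × 1 excitation convolutions with 5 and 1 output channels); the layer names and tensor layout are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpatialSpectralAttention(nn.Module):
    """Sketch of the parallel spatial and spectral attention branches (Eqs. (3)-(8)).

    The input is a batch of time slices of shape (N, 1, B, H, W), with
    B = 5 frequency bands and an H x W = 9 x 9 electrode grid.
    """

    def __init__(self, bands: int = 5):
        super().__init__()
        # Spectral squeeze: a B x 1 x 1 kernel collapses the bands into one
        # spatial score map (Eq. (3)).
        self.spectral_squeeze = nn.Conv3d(1, 1, kernel_size=(bands, 1, 1))
        # Spatial squeeze (global average pooling, Eq. (5)) followed by the
        # two-layer 1 x 1 x 1 excitation with 5 and 1 output channels (Eq. (6)).
        self.pool = nn.AdaptiveAvgPool3d((bands, 1, 1))
        self.excite = nn.Sequential(
            nn.Conv3d(1, bands, kernel_size=1), nn.GELU(),
            nn.Conv3d(bands, 1, kernel_size=1),
        )

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # Spatial attention scores of shape (N, 1, 1, H, W), broadcast over bands (Eq. (4)).
        e_spatial = e * torch.sigmoid(self.spectral_squeeze(e))
        # Spectral attention scores of shape (N, 1, B, 1, 1), broadcast over space (Eq. (7)).
        e_spectral = e * torch.sigmoid(self.excite(self.pool(e)))
        # Sum of the two recalibrated tensors (Eq. (8)).
        return e_spatial + e_spectral
```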

2.6. Spatial–spectral convolution module

For each 3D tensor Ê_i with spatial and spectral attention, we used the spatial–spectral convolution module to further extract local features, as shown in the middle part of Fig. 2. The spatial–spectral convolution module was composed of three continuous convolution layers; Fig. 4 shows its specific structure. Each convolution layer of the spatial–spectral convolution module adopted three-dimensional (3D) convolution. The convolution kernel sizes were 1 × 3 × 3, 5 × 1 × 1, and 1 × 3 × 3, respectively. The output channels were 16, 32, and 64, respectively.

The general formula of 3D convolution can be expressed as

Conv(X, k^c)_{(1,b,h,w)} = a_c + Σ_{r=1}^{k_B} Σ_{p=1}^{k_H} Σ_{q=1}^{k_W} X_{b+r−1, h+p−1, w+q−1} k^c_{r,p,q}    (9)

where X is the input of the convolution layer, and X ∈ R^{1×B×H×W}. For a point (1, b, h, w) in X, b ∈ [1, s_1 + 1, …, k_B − s_1 + 1], h ∈ [1, s_2 + 1, …, k_H − s_2 + 1], w ∈ [1, s_3 + 1, …, k_W − s_3 + 1], where s_1, s_2, s_3 represent the strides. k^c represents the convolution kernel, k^c ∈ R^{k_B×k_H×k_W}, where c is the specified output channel, and a_c denotes the learnable bias. We use this general formula to describe the calculation of each layer of the spatial–spectral convolution module:

B_{i,1} = f(Conv(Ê_i, k^{c1})), k^{c1} ∈ R^{1×3×3}    (10)

B_{i,2} = f(Conv(B_{i,1}, k^{c2})), k^{c2} ∈ R^{5×1×1}    (11)


Y_i = f(Conv(B_{i,2}, k^{c3})), k^{c3} ∈ R^{1×3×3}    (12)

where f(∙) denotes the ReLU activation function, and k^{c1}, k^{c2}, k^{c3} represent the convolution kernels of the three convolution layers respectively. The input of the spatial–spectral convolution module was the matrix Ê_i, i = 1, 2, which represented the feature matrices of the 1st and 2nd second respectively. The final output of the spatial–spectral convolution module was Y_i, i = 1, 2.

Fig. 4. The structure diagram of spatial–spectral convolution module.

Afterward, the spatial–spectral features of all time slices were concatenated in chronological order, which is expressed as

Y = [Y_1 ‖ Y_2 ‖ ⋯ ‖ Y_T]    (13)

where ‖ represents the concatenation of the features of the time slices. Then, we flattened the output Y and passed it through a linear layer. Thus, we obtain the final spatial–spectral features Ȳ, namely,

Ȳ = Linear(flatten(Y))    (14)

Finally, Ȳ was sent to the transformer-based temporal encoding layer to capture effective information in the temporal dimension.
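The three-layer module and the concatenation of Eqs. (10)–(14) could be sketched as follows. The kernel sizes and channel counts follow the paper; padding, stride and the output dimension of the final linear layer are left as assumptions supplied by the caller.

```python
import torch
import torch.nn as nn

class SpatialSpectralConv(nn.Module):
    """Sketch of the three-layer 3D convolution module (Eqs. (10)-(12))."""

    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(1, 3, 3)), nn.ReLU(),   # Eq. (10)
            nn.Conv3d(16, 32, kernel_size=(5, 1, 1)), nn.ReLU(),  # Eq. (11)
            nn.Conv3d(32, 64, kernel_size=(1, 3, 3)), nn.ReLU(),  # Eq. (12)
        )

    def forward(self, e_hat: torch.Tensor) -> torch.Tensor:
        # e_hat: one recalibrated time slice, shape (N, 1, 5, 9, 9).
        return self.block(e_hat)

def temporal_concat(slices, conv: SpatialSpectralConv, linear: nn.Linear):
    """Eqs. (13)-(14): concatenate the per-slice features in chronological
    order, flatten them, and project with a linear layer."""
    y = torch.cat([conv(s).flatten(start_dim=1) for s in slices], dim=1)
    return linear(y)
```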
2.7. Transformer-based temporal encoding part

The transformer-based temporal encoding layer is shown in the upper right corner of Fig. 2. There were three temporal encoding layers, each of which was mainly composed of a multi-head self-attention (MHSA) module and a feed-forward network (FFN) module; Fig. 5 shows the specific structure. Meanwhile, residual connection layers and layer normalization (LN) were also used with MHSA and FFN. In MHSA, the attention vector of each head was calculated according to the following formula,

A_h(Q_1, K_1, V_1) = softmax(Q_1 ⋅ K_1^T / √d_{K_1}) ⋅ V_1    (15)

where h represents the attention head, and h = 1, 2, …, H; in our model, H = 6. Q_1, K_1, and V_1 represent the query vector, key vector, and value vector, respectively, and d_{K_1} denotes the dimension of the vector K_1. They were obtained from linear mappings of the output of the previous stage. For the first temporal encoding layer, the output of the previous part was Ȳ, from which we can get

Q_1 = linear(Ȳ), K_1 = linear(Ȳ), V_1 = linear(Ȳ)    (16)

The output A_h of each head was concatenated to obtain the output of MHSA, and then passed through a fully-connected layer, that is

MHSA(Ȳ) = FC([A_1 ‖ A_2 ‖ ⋯ ‖ A_H])    (17)

Then, according to the residual connection layer and the LN operation, we can obtain

F = LN(MHSA(Ȳ) + Ȳ)    (18)

The FFN layer consisted of two linear mapping layers; through the residual connection layer and the LN operation, we can get

Y′ = LN(Linear(GELU(Linear(F))) + F)    (19)

Afterward, we sent the obtained output Y′ to the next temporal encoding layer.

The classifier consisted of a fully-connected layer and a softmax layer, as shown in the lower right corner of Fig. 2, which was used to predict the emotion label from the final output of the temporal encoding layer.
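A compact sketch of one temporal encoding layer (Eqs. (15)–(19)) built on PyTorch's nn.MultiheadAttention is shown below. The model dimension d_model, the FFN width and the classifier input size are illustrative assumptions; only the number of heads (6), the GELU activation and the residual-plus-LayerNorm arrangement come from the paper.

```python
import torch
import torch.nn as nn

class TemporalEncodingLayer(nn.Module):
    """Sketch of one temporal encoding layer (Eqs. (15)-(19)):
    multi-head self-attention and a feed-forward network, each followed
    by a residual connection and layer normalization."""

    def __init__(self, d_model: int = 240, n_heads: int = 6, d_ff: int = 480):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (N, T, d_model); queries, keys and values are linear maps of y (Eqs. (15)-(16)).
        attn, _ = self.mhsa(y, y, y)
        f = self.norm1(attn + y)             # Eq. (18)
        return self.norm2(self.ffn(f) + f)   # Eq. (19)

# Three stacked layers, then a fully-connected softmax classifier as in Fig. 2.
# (Softmax gives class probabilities at prediction time; training typically
# feeds the pre-softmax logits to the cross-entropy loss.)
encoder = nn.Sequential(*[TemporalEncodingLayer() for _ in range(3)])
classifier = nn.Sequential(nn.Flatten(), nn.Linear(2 * 240, 3), nn.Softmax(dim=-1))
```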


Fig. 5. The structure diagram of the temporal encoding layer.

3. Experiments and results

We conducted extensive experiments on the SEED and SEED-IV datasets to evaluate the proposed method. As described in Section 2.4, the input of the model was X = [E_1, E_2, …, E_T] ∈ R^{T×B×H×W}. Specifically, we set T to 2 s. As listed in Table 2, in the SEED dataset, 1692 samples of this 4D structure X can be obtained under session1, session2, and session3, respectively, and in the SEED-IV dataset, the number of 4D structure samples X in session1, session2, and session3 was 1711, 1677, and 1655, respectively.

Table 2
Sample size in SEED and SEED-IV datasets.

Dataset    session1/session2/session3
SEED       1692/1692/1692
SEED-IV    1711/1677/1655

3.1. Experiment setup

We trained and tested our model on an NVIDIA RTX 1080Ti GPU, and implemented it with the PyTorch framework. The optimizer was Adam and the learning rate was set to 1e−5. The loss function was the cross-entropy function. We conducted experiments on each subject and each session, and applied 10-fold cross-validation to evaluate the performance of the model. The average performance over all subjects and sessions was considered to be the final result of the model. The batch size was 32 and the number of learning epochs was 30 and 50 in the SEED and SEED-IV datasets respectively. Dropout with a rate of 0.7 and 0.6 was used in the SEED and SEED-IV datasets respectively to prevent overfitting. We summarize all the hyper-parameter settings in Table 3. The code is released on GitHub at https://github.com/LGong666/EEG_ACTNN.git.

Table 3
Hyper-parameter setting.

Hyper-parameter      Value or type
Optimizer            Adam
Learning rate        1e−5
Loss function        cross entropy
Batch size           32
Number of epochs     30 (SEED) / 50 (SEED-IV)
Dropout              0.7 (SEED) / 0.6 (SEED-IV)

3.2. Model implementation

The spatial–spectral convolution module included three convolution layers, wherein the kernel sizes were 1 × 3 × 3, 5 × 1 × 1, and 1 × 3 × 3 respectively, and the output channels were 16, 32, and 64 respectively. In the temporal encoding layer, the number of MHSA heads was set to 6, and LayerNorm and the GELU activation function were applied. The simple classifier consisted of a fully-connected layer and a softmax layer.
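Under the settings of Table 3 and Section 3.1, a single training fold might look like the following sketch; the data loading and model construction are omitted, and the helper names are ours.

```python
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

def train_one_fold(model, train_loader, epochs=30, lr=1e-5, device="cuda"):
    """One training fold with the Table 3 hyper-parameters (SEED values)."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()   # applied to the pre-softmax logits
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

# 10-fold cross-validation over the samples of one subject and session.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
# for train_idx, test_idx in kfold.split(samples):
#     ...build loaders from the index splits and call train_one_fold...
```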

3.3. Statistical analysis of DE features

We used one-way ANOVA to investigate the significance of the differences of the DE features under different emotions at the 99% confidence level. The average normalized value of the DE feature extracted from the EEG signals was the dependent variable, and the emotional state was the independent variable. Tables 4 and 5 report the F statistic and p-value obtained by one-way ANOVA in SEED and SEED-IV respectively. It can be seen that the p-value of all subjects in SEED and SEED-IV was less than 0.01, which indicates that the extracted DE features have significant differences under different emotions. This can be intuitively verified from the boxplots of Fig. 6, that is, the average DE feature values of different emotions have different distributions.

Table 4
The F statistic and p-value obtained by one-way ANOVA in SEED.

Subject    F          p-value
1          436.99     1.40E−169
2          118.4      1.97E−50
3          101.98     9.76E−44
4          238.4      1.34E−97
5          422.8      1.15E−164
6          1769.72    0
7          903.47     3.01E−315
8          1386.73    0
9          935.79     0
10         227.4      2.13E−93
11         960.23     0
12         65.01      1.96E−28
13         573.44     3.01E−215
14         1060.74    0
15         781.07     1.00E−279

Table 5
The F statistic and p-value obtained by one-way ANOVA in SEED-IV.

Subject    F          p-value
1          227.51     1.28E−134
2          250.23     7.59E−147
3          286.65     4.90E−166
4          496.03     9.90E−268
5          64.03      2.80E−40
6          37.03      1.52E−23
7          160.74     1.72E−97
8          166.12     1.51E−100
9          74.84      6.92E−47
10         52.90      2.03E−33
11         168.25     9.35E−102
12         329.79     4.09E−185
13         129.28     2.42E−79
14         388.92     1.95E−217
15         67.18      3.31E−42

Fig. 6. Boxplots of DE features under different emotions on SEED (including negative, neutral, and positive emotions) and SEED-IV (including neutral, sad, fear, and happy emotions). The left figure is from subject #1 in the SEED dataset, and the right figure is from subject #4 in the SEED-IV dataset.
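The per-subject test described above can be reproduced with SciPy's one-way ANOVA; the data layout assumed here (one array of averaged DE values per emotion) is our own convention.

```python
from scipy.stats import f_oneway

def anova_for_subject(de_by_emotion: dict):
    """One-way ANOVA over the averaged, normalized DE values of one subject.

    `de_by_emotion` maps each emotion label to a 1-D array of samples;
    significance is judged at p < 0.01 (99% confidence).
    """
    f_stat, p_value = f_oneway(*de_by_emotion.values())
    return f_stat, p_value
```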


3.4. Results and analysis

As spatial, spectral, and temporal attention are used in ACTNN, we conducted experiments under different attention mechanisms to compare their effects on the EEG emotion recognition task. The attention situations we set included: w/o any attention, with only spatial attention, with only spectral attention, with spatial–spectral attention, with only temporal attention, and with all attention.

Table 6
The components of the different attention situations.

Component                          Spatial attention   Spectral attention   Spatial–spectral convolution module   Temporal encoding layer
                                                                                                                  Attention (MHSA)   FFN
W/O any attention                  ×                   ×                    ✓                                     ×                  ✓
With only spatial attention        ✓                   ×                    ✓                                     ×                  ✓
With only spectral attention       ×                   ✓                    ✓                                     ×                  ✓
With spatial–spectral attention    ✓                   ✓                    ✓                                     ×                  ✓
With only temporal attention       ×                   ×                    ✓                                     ✓                  ✓
With all attention                 ✓                   ✓                    ✓                                     ✓                  ✓

Table 7
The average accuracy and standard deviation (acc/std (%)) of the ACTNN model in different attention situations.

Attention                          SEED                                        SEED-IV
                                   session1     session2     session3          session1     session2     session3
W/O any attention                  91.57/7.11   92.24/4.15   91.86/6.72        66.98/6.66   63.04/9.25   67.69/7.86
With only spatial attention        94.43/4.90   95.92/2.80   95.51/4.72        74.28/7.80   70.55/8.26   73.94/8.37
With only spectral attention       94.59/4.73   95.99/2.83   95.50/4.73        74.20/8.32   70.62/8.62   74.07/8.15
With spatial–spectral attention    96.31/3.11   97.21/2.17   97.06/3.49        77.24/7.62   73.59/7.47   76.27/8.83
With only temporal attention       97.21/2.66   97.48/2.86   97.35/2.58        89.71/4.72   84.13/8.24   87.37/9.63
With all attention                 98.21/1.71   98.47/1.73   98.72/1.71        93.55/2.33   90.93/5.51   91.21/8.46

Fig. 7. The recognition accuracy of each subject under six attention situations in three sessions of the SEED dataset. They are w/o any attention (dark blue square), with only spatial attention (green circle), with only spectral attention (light blue triangle), with spatial–spectral attention (purple pentagon), with only temporal attention (orange diamond), and with all attention (red star), respectively.

Table 6 lists the components of each attention situation. Table 7 gives the average accuracy and standard deviation obtained under the different attention mechanisms. Figs. 7 and 8 show the results of all sessions for all subjects in SEED and SEED-IV respectively.

When no attention was used (dark blue square in Figs. 7 and 8), all subjects in the SEED dataset could still perform relatively well, but the subjects in the SEED-IV dataset were greatly affected. Compared with not using any attention, the results obtained by adding only spatial attention (green circle) or only spectral attention (light blue triangle) showed a similar increase, which may be due to the parallel structure of the spatial and spectral attention mechanisms. When the spectral and spatial attention mechanisms were combined (purple pentagon), the accuracy was further improved. Adding only temporal attention (orange diamond) gave the largest improvement compared with the previous situations. This may be because it captures context from the global scope of the time slices, which makes the input more discriminative.

We can make a quantitative comparison from Table 7. Compared with no attention, the maximum improvement was achieved by adding temporal attention, which increased the accuracy by at least 5.24% and 19.68% in SEED and SEED-IV, respectively. By comparison, adding spatial attention increased the average accuracy by at least 2.86% and 6.25% respectively, adding spectral attention increased it by at least 3.02% and 6.38% respectively, and adding spatial–spectral attention increased it by at least 4.74% and 8.58% respectively.

To sum up, in our model, temporal attention achieved better performance than spatial or spectral attention. Due to the designed structure, spatial attention and spectral attention brought similar improvements, and they played a better role when combined.

For the overall performance of ACTNN, Figs. 9 and 10 report the accuracy of all subjects on the SEED and SEED-IV datasets. As can be seen from Fig. 9, the proposed ACTNN achieves a satisfactory classification result for all subjects in SEED. The average recognition accuracy of all subjects in the three sessions was 98.21%, 98.47%, and 98.72% respectively, and the corresponding standard deviation was 1.71%, 1.73%, and 1.71% respectively. This shows that ACTNN had better stability and superiority on the SEED dataset.

For most subjects in the SEED-IV dataset (see Fig. 10), the proposed ACTNN achieves good results; the average recognition accuracy of all subjects in the three sessions was 93.55%, 90.93%, and 91.21% respectively, and the standard deviation was 2.33%, 5.51%, and 8.46% respectively. There were a few exceptions, such as subject #9 in session 3 (75.4%) and subject #11 in session 2 (77.16%) and session 3 (67.31%), which may be due to the difference between the emotional label marked for the EEG signals and the emotions actually induced in the subject.

In addition, to illustrate the ability of ACTNN to distinguish various emotional states, Fig. 11 shows the confusion matrices obtained by ACTNN on the SEED and SEED-IV datasets respectively. As shown in Fig. 11(a), for the SEED dataset, ACTNN achieves the best classification for positive emotions, followed by neutral emotions. For the SEED-IV dataset (see Fig. 11(b)), sad was the most easily distinguished emotional state and fear seemed to be the least recognizable.

3.5. Comparative analysis between raw EEG signals and DE features

It has been shown above that when the extracted DE features were used as the input of ACTNN, good emotion recognition performance can be obtained.


Fig. 8. The recognition accuracy of each subject under six attention situations in three sessions of SEED-IV dataset. They are w/o any attention (dark blue square), with only
spatial attention (green circle), with only spectral attention (light blue triangle), with spatial–spectral attention (purple pentagon), with only temporal attention (orange diamond),
and with all attention (red star), respectively.

Fig. 9. The overall performance of the proposed ACTNN on SEED dataset.

Fig. 10. The overall performance of the proposed ACTNN on SEED-IV dataset.

Fig. 11. The confusion matrix of the proposed ACTNN on SEED and SEED-IV dataset.


In this section, we continue to explore the performance when the raw EEG signals are input into ACTNN. To maintain the inherent framework of the model, we still first projected the raw EEG signals of each sampling point in space. Since the sampling rate of the EEG signals in SEED and SEED-IV was 200 Hz, each second of EEG signal can be organized as 200 × 9 × 9. In addition, because the input dimension of the raw EEG signal was larger, we appropriately adjusted the convolution kernel sizes of the spatial–spectral convolution module to 8 × 3 × 3, 5 × 1 × 1, and 5 × 3 × 3 respectively. The parameters used for model training, including batch size, epochs, etc., were consistent with those used for the DE features. The raw EEG signals had been preprocessed and smoothed to eliminate noise points.

Table 8
Comparison of recognition performance (average accuracy and standard deviation) with raw EEG signals and DE features as input in SEED.

Session    Raw EEG signals        DE features
           Acc (%)    Std (%)     Acc (%)    Std (%)
1          94.70      5.77        98.21      1.71
2          96.78      2.28        98.47      1.73
3          95.86      2.81        98.72      1.71
Average    95.78      3.62        98.47      1.72

Table 9
Comparison of recognition performance (average accuracy and standard deviation) with raw EEG signals and DE features as input in SEED-IV.

Session    Raw EEG signals        DE features
           Acc (%)    Std (%)     Acc (%)    Std (%)
1          89.36      5.88        93.55      2.33
2          87.65      4.45        90.93      5.51
3          87.34      5.05        91.21      8.46
Average    88.12      5.13        91.90      5.43

Tables 8 and 9 report the average accuracy and standard deviation of each session in the two datasets. In Table 8, the average accuracy on SEED with the raw signal as the input was 95.78%, which was 2.69% lower than with the DE features. Similarly, Table 9 shows that the average accuracy obtained by using the raw EEG signals in the SEED-IV dataset was 88.12%, which was 3.78% lower than with the DE features. Therefore, although extracting the DE features seems to increase the complexity compared with using the raw EEG signals, the final recognition results show that the DE features obtain better performance within an acceptable range of complexity, as they contain more effective emotional information.

3.6. Analysis of spatial and spectral attention mask

To further understand the underlying reasoning process of our proposed method, we visualized the spatial and spectral attention masks in the model. These attention masks are a set of data-driven attention weights that are dynamically assigned to critical electrodes or frequency bands after training.

To describe the weight distribution of the spatial attention masks more intuitively, we captured the updated masks in the last iteration of the model and mapped them to the brain topographic map. Figs. 12 and 13 show the spatial attention masks captured for the SEED and SEED-IV datasets respectively. The redder the color, the higher the assigned weight. It can be seen that the attention weights of all emotions were mainly distributed in the prefrontal lobe and lateral temporal lobe, which indicates that these brain regions may be more closely related to emotional activation and information processing in the brain; this is consistent with the observations of neurobiological studies [56,57]. It should be noted that the spatial attention mask was obtained by compressing the frequency bands, that is, we used the convolutional kernel with the size of 5 × 1 × 1 on the spectral dimension, so it contains the comprehensive information of the five frequency bands. As shown in Figs. 12 and 13, since we set T to 2, the first and second brain topographic maps of each emotion represent the attention masks captured for the 1st and 2nd second, respectively. It can be seen that the weight distribution changes within a small range over time. Due to the limited space, we only list the results of subject #4 in SEED and subject #3 in SEED-IV as examples. The spatial attention brain topographic maps of the other subjects are attached at the end of the paper (see supplemental material).

Fig. 12. Brain topographic map of the spatial attention mask adaptively assigned by ACTNN for subject #4 in the SEED dataset, where the first and second maps represent the attention weights assigned for the 1st and 2nd second of input data, respectively.

Fig. 13. Brain topographic map of the spatial attention mask adaptively assigned by ACTNN for subject #3 in the SEED-IV dataset, where the first and second maps represent the attention weights assigned for the 1st and 2nd second of input data, respectively.

For the spectral attention masks, we computed the average spectral attention masks of all subjects after training, which represent the common importance of the different frequency bands and explain the contribution of each frequency band to emotion recognition. We plot the average weights of the spectral attention masks in Fig. 14. We can see that all the attention mask values are between 0 and 1. In both the SEED and SEED-IV datasets, the model allocated the maximum attention weight to the gamma band. Since the attention weight is data-driven, this indicates that the features of the gamma band may provide more valuable discriminative information for emotion recognition tasks, and that the gamma band of EEG may be more related to human emotion, which is consistent with existing research [58]. Thus, the features of the gamma band were continuously enhanced after recalibration, improving the overall recognition performance.

Fig. 14. The average spectral attention mask of all subjects in five frequency bands (namely, delta, theta, alpha, beta, and gamma bands) in the SEED and SEED-IV datasets respectively.
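As a rough illustration of how a captured spatial mask could be inspected outside of the brain topographic maps, the snippet below renders a 9 × 9 spatial score map as a heat map; the paper itself interpolates the mask onto a scalp topography, which is not reproduced here.

```python
import matplotlib.pyplot as plt

def plot_spatial_mask(mask, title):
    """Render one captured 9 x 9 spatial attention score map as a heat map.

    `mask` is assumed to be the sigmoid-normalized map of Eq. (4) for a
    single time slice, converted to a NumPy array of shape (9, 9).
    """
    fig, ax = plt.subplots()
    im = ax.imshow(mask, cmap="Reds", vmin=0.0, vmax=1.0)  # redder = higher weight
    fig.colorbar(im, ax=ax)
    ax.set_title(title)
    ax.axis("off")
    fig.savefig(f"{title}.png", dpi=200)
```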


3.7. Method comparison

To verify the effectiveness of our model, we compared the proposed model with the state-of-the-art methods; a brief introduction of each method is listed as follows.

(1) DBN (2015) [38]: It proposed an EEG-based emotion recognition method using a deep belief network.
(2) SVM (2018) [53]: It used a support vector machine with a linear kernel for EEG emotion recognition.
(3) DGCNN (2018) [46]: It organized DE features into the form of a graph structure, and dynamically learned the adjacency matrix to explore the connection relationships between EEG electrodes and extract more discriminative emotional features.
(4) BiHDM (2019) [59]: It considered the differences between the left and right hemispheres of the brain in emotional expression, and its framework is mainly composed of directed recurrent neural networks and a pairwise subnetwork.
(5) GCB-net+BLS (2019) [48]: It released a graph convolutional broad network including graph convolution and regular convolution, and also combined a broad learning system to improve performance by expanding feature nodes.
(6) RGNN (2020) [47]: It proposed a regularized graph neural network, which considered the local anatomical connectivity and global channel relations to design the adjacency matrix, and released two regularizers, NodeDAT and EmotionDL.
(7) 4D-CRNN (2020) [60]: It developed a 4D convolutional recurrent neural network, which combined CNN and a recurrent neural network with LSTM units to effectively organize the spectral, spatial and temporal information of EEG signals.
(8) SST-EmotionNet (2020) [61]: It reorganized the EEG signals into 3D spatial–spectral and 3D spatial–temporal representations respectively, and designed a 3D dense network with spatial–spectral and spatial–temporal attention modules.
(9) 3DCNN&PST (2021) [62]: It introduced a 3DCNN-based method, which included a parallel positional–spectral–temporal attention module, and can adaptively capture more discriminative and stable mask patterns.
(10) EeT (2021) [63]: It released a transformer-based EEG emotion recognition method, which divided the 4D EEG representation into multiple non-overlapping regions, and used the self-attention mechanism to capture the spatial and temporal information.
(11) JDAT (2021) [64]: It proposed a joint-dimension-aware transformer model, which used the continuous wavelet transform to process EEG signals, and designed an integrated adaptive self-attention mechanism.
(12) 4D-aNN (2022) [65]: It proposed a 4D attention-based neural network, which mainly contains four convolutional blocks and an attention-based Bi-LSTM, and each convolution block is embedded with a concatenated spatial and spectral attention module.
(13) MDGCN-SRCNN (2022) [66]: It used a dynamic graph convolutional network as a shallow layer to extract spatial features, and a style-based recalibration CNN module to learn deep abstract features.
(14) HCRNN (2022) [67]: It raised a hybrid convolutional recurrent neural network, where the EEG signals are divided into several subbands by the tunable Q-factor wavelet transform, and the mean absolute value and DE features of the subbands are extracted.

Table 10 reports the recognition performance comparison of all the methods above and the proposed ACTNN on the SEED and SEED-IV datasets. In general, the average recognition accuracy of the proposed ACTNN on the SEED and SEED-IV datasets was 98.47% and 91.90% respectively, and the standard deviation was 1.72% and 5.43% respectively, which is superior to the most advanced methods. As shown in Table 10, the average recognition accuracy of the proposed ACTNN on the SEED and SEED-IV datasets was at least 1.17% and 5.13% higher than that of the listed methods respectively. Although the standard deviation of HCRNN [67] on the SEED dataset was slightly lower than that of the proposed ACTNN, by a 0.33% difference, for the more crucial average recognition accuracy our method was 3.14% higher than HCRNN [67].

It is worth mentioning that we introduced a new spatial and spectral attention mechanism that was originally used for medical image segmentation, and used a lighter and more effective attention mechanism layout. Among the methods listed, spatial and spectral attention mechanisms were also designed in SST-EmotionNet [61], 3D-CNN&PST [62], and 4D-aNN [65]. Different from them, the proposed method only applies spatial and spectral attention at the input of the convolution module, rather than embedding the attention mechanism into multiple convolutional blocks, which would increase the computation of the model. The proposed attention layout has a more lightweight design, and the recognition results show that this attention mechanism layout is effective and ensures better performance.

4. Discussion

Decoding human emotions from EEG signals is a growing concern. In this paper, we carried out extensive experiments using ACTNN on two public datasets and achieved satisfactory performance. The following discusses the noteworthy points of the proposed ACTNN and the points to be improved.

On the one hand, the better EEG emotion recognition performance of ACTNN may be largely due to the effective combination of the CNN-based and transformer-based modules, because CNN explores from a local perceptive field, while the transformer has good global perception. To further verify this viewpoint, we conducted ablation experiments. As shown in Fig. 15, by removing the key parts of ACTNN, the spatial–spectral convolution part and the temporal encoding part, we obtained ACTNN-T and ACTNN-C respectively. ACTNN-T was composed of the temporal encoding part and the fully-connected layer. ACTNN-C mainly included the spatial–spectral convolution part, which was composed of the spatial–spectral attention branch, the spatial–spectral convolution module, and the fully-connected layer. Under the same experimental conditions, the corresponding results are shown in Fig. 16.

The average recognition accuracy of ACTNN-T on SEED and SEED-IV was 96.31% and 82.62% respectively, which was better than the 94.69% and 71.55% obtained by ACTNN-C respectively. It is noteworthy that although the spatial–spectral convolution module in ACTNN-C has only three shallow convolution layers, it can still obtain relatively good recognition results. Also, compared with using ACTNN-T alone, adding the spatial–spectral convolution part improved the average accuracy by 2.16% and 9.28% on SEED and SEED-IV respectively. The temporal encoding part had a greater contribution to the improvement: compared with using ACTNN-C alone, adding the temporal encoding part brought 3.78% and 20.35% gains on the SEED and SEED-IV datasets respectively. In general, ACTNN cascades the CNN-based and Transformer-based modules, combines their advantages and obtains the best performance, which also verifies the rationality and effectiveness of the cascade.

On the other hand, the attention mechanism was beneficial to improving the performance of the model. In the proposed ACTNN, we applied spatial, spectral, and temporal attention mechanisms. The spatial attention was obtained by compressing the frequency bands and performing spatial excitation. The spectral attention was obtained by global average pooling, convolution, and excitation in the spectral dimension. The temporal attention was realized by the multi-head self-attention mechanism. According to the line charts (Figs. 7 and 8) of the different attention situations in Section 3.4, we can see that, in the proposed ACTNN, the temporal attention mechanism played a more important role than the spatial or spectral attention, and the improvement effects of spatial and spectral attention were similar. When spectral and spatial attention were combined, the recognition performance of the model was further improved. When all attention was used, the model achieved the best recognition performance.


Table 10
Performance comparison between the baseline methods and the proposed ACTNN on the SEED and SEED-IV datasets.

Methods                Year   Evaluation methods     SEED                   SEED-IV
                                                     Acc (%)    Std (%)     Acc (%)    Std (%)
DBN [38]               2015   Trial (9:6)            86.08      8.34        –          –
SVM [53]               2018   Trial (16:8)           –          –           70.58      17.01
DGCNN [46]             2018   Trial (9:6)            90.40      8.49        –          –
BiHDM [59]             2019   Trial (9:6)/(16:8)     93.12      6.06        74.35      14.09
GCB-net+BLS [48]       2019   Trial (9:6)            94.24      6.70        –          –
RGNN [47]              2020   Trial (9:6)/(16:8)     94.24      5.95        79.37      10.54
4D-CRNN [60]           2020   5-fold CV              94.74      2.32        –          –
SST-EmotionNet [61]    2020   Shuffle (6:4)          96.02      2.17        84.92      6.66
3D-CNN&PST [62]        2021   Shuffle (9:6)          95.76      4.98        82.73      8.96
EeT [63]               2021   5-fold CV              96.28      4.39        83.27      8.37
JDAT [64]              2021   10-fold CV             97.30      1.74        –          –
4D-aNN [65]            2022   5-fold CV              96.25      1.86        86.77      7.29
MDGCN-SRCNN [66]       2022   Trial (9:6)/(16:8)     95.08      6.12        85.52      11.58
HCRNN [67]             2022   5 times 10-fold CV     95.33      1.39        –          –
ACTNN (this paper)     2022   10-fold CV             98.47      1.72        91.90      5.43

The best results are indicated in bold fonts.

Fig. 15. The structure diagram of ACTNN-C and ACTNN-T.

Fig. 16. The overall performance of three methods: ACTNN-T, ACTNN-C, and ACTNN, on the SEED and SEED-IV datasets respectively.

Moreover, through the analysis of the spatial and spectral attention masks in Section 3.6, the weight distribution showed that the activities of the prefrontal and lateral temporal lobes of the brain and the gamma band of EEG signals might be more related to human emotion.

Despite its advantages, the generalization performance of our method on cross-subject and cross-session tasks deserves further exploration. We carried out extensive experiments on the EEG data of each subject and session. However, different subjects have different emotional activation states for the same stimulus material, and even the same subject does at different times, which makes the distribution of the collected EEG signals differ. Therefore, in future work, we will conduct subject-independent and cross-session experiments to improve the generalization ability of the model.

5. Conclusion

As an important part of affective computing, EEG-based emotion recognition has attracted extensive attention recently. However, how to effectively combine the spatial, spectral, and temporal distinguishable information of EEG signals to achieve better emotion recognition performance is still a challenge. In this paper, we propose a novel attention-based convolutional transformer neural network (ACTNN), which effectively integrates the crucial spatial, spectral, and temporal information of EEG signals, and cascades the CNN-based and Transformer-based modules in a new way for emotion recognition tasks.

Firstly, we organize DE feature frames into spatial–spectral–temporal representations. Secondly, to enhance the distinguishability of the input features, we use a parallel spatial and spectral attention branch to learn the spatial and spectral attention masks for each time slice, which are used to recalibrate the original input. Thirdly, the spatial–spectral convolution module is used to extract local spatial and spectral features. Fourthly, to realize global feature perception using multi-head self-attention, we concatenate the features of all time slices and feed them into the transformer-based temporal encoding layer. Finally, a classifier is used to predict the discrete emotion classes of the input features. The average recognition accuracy of the proposed ACTNN on the SEED and SEED-IV datasets is 98.47% and 91.90% respectively, which is superior to the most advanced methods.

Additionally, according to the captured spatial and spectral attention masks, we learn that the higher weights of the spatial attention mask are mainly distributed over the electrode positions of the prefrontal lobe and the lateral temporal lobe of the brain, and the higher weight of the spectral attention mask is mainly distributed in the gamma band, which indicates that the activities of these brain regions and this EEG band may be more related to human emotions.


CNN-based and Transformer-based modules, and the results show that [10] J. Yedukondalu, L.D. Sharma, Cognitive load detection using circulant singular
the temporal encoding module has a relatively larger contribution to spectrum analysis and Binary Harris Hawks Optimization based feature selection,
Biomed. Signal Process. Control 79 (2023) 104006.
the improvement of recognition performance.
The proposed ACTNN provides new insight into human emotion decoding based on EEG signals, and it can also be readily applied to other EEG classification tasks, such as sleep stage classification and motor imagery. In future work, we will explore the performance of ACTNN in subject-independent and cross-session tasks to improve the generalization ability of the model.
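A common protocol for the subject-independent setting mentioned above is leave-one-subject-out cross-validation; the minimal sketch below assumes hypothetical train_fn and eval_fn helpers and a per-sample array of subject indices, and is not part of the paper.

    # Hypothetical leave-one-subject-out loop for a subject-independent evaluation;
    # train_fn/eval_fn and the subject_ids array are placeholders, not from the paper.
    import numpy as np

    def leave_one_subject_out(features, labels, subject_ids, train_fn, eval_fn):
        """features: (N, ...) array, labels: (N,), subject_ids: (N,) subject index per sample."""
        accuracies = []
        for held_out in np.unique(subject_ids):
            test_idx = subject_ids == held_out
            model = train_fn(features[~test_idx], labels[~test_idx])      # fit on the other subjects
            accuracies.append(eval_fn(model, features[test_idx], labels[test_idx]))
        return float(np.mean(accuracies)), float(np.std(accuracies))      # mean/std accuracy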
CRediT authorship contribution statement

Linlin Gong: Conceptualization, Methodology, Software, Writing – original draft. Mingyang Li: Writing – review & editing, Methodology. Tao Zhang: Investigation, Validation. Wanzhong Chen: Supervision, Formal analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

Acknowledgments

We sincerely appreciate all the editors and reviewers for their insightful comments and constructive suggestions. This work was supported by the Natural Science Foundation of Jilin Province, China (Grant No. 20210101178JC), Scientific Research Project of Education Department of Jilin Province, China (Grant No. JJKH20221009KJ), Interdisciplinary Integration and Innovation Project of JLU, China (Grant No. JLUXKJC2021ZZ02), and National Natural Science Foundation of China (Grant No. 62203183).

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.bspc.2023.104835.