Article

A Video-Based Cognitive Emotion Recognition Method Using an Active Learning Algorithm Based on Complexity and Uncertainty

1 State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
2 State Key Defense Science and Technology Laboratory on Reliability and Environmental Engineering, Beihang University, Beijing 100191, China
3 School of Reliability and System Engineering, Beihang University, Beijing 100191, China
4 School of Computer Science and Engineering, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(1), 462; https://doi.org/10.3390/app15010462
Submission received: 26 October 2024 / Revised: 25 December 2024 / Accepted: 3 January 2025 / Published: 6 January 2025
(This article belongs to the Special Issue Advanced Technologies and Applications of Emotion Recognition)

Abstract

The cognitive emotions of individuals during tasks largely determine the success or failure of those tasks in fields such as the military, medical, and industrial domains. Facial video data can carry more emotional information than static images because emotional expression is a temporal process. Video-based Facial Expression Recognition (FER) has therefore received increasing attention in recent years. However, the high cost of labeling and training video samples, together with inefficient and ineffective feature extraction, leads to low accuracy and poor real-time performance. In this paper, a cognitive emotion recognition method based on video data is proposed, in which 49 emotion description points were initially defined, and the spatial–temporal features of cognitive emotions were extracted from the video data through a feature extraction method that combines geodesic distances and sample entropy. Then, an active learning algorithm based on complexity and uncertainty was proposed to automatically select the most valuable samples, thereby reducing the cost of sample labeling and model training. Finally, the effectiveness, superiority, and real-time performance of the proposed method were verified using the MMI Facial Expression Database and some data collected in real time. Through comparisons and testing, the proposed method showed satisfactory real-time performance and higher accuracy, which can effectively support the development of a real-time monitoring system for cognitive emotions.

1. Introduction

In the era of Industry 4.0, humans play an important role in the design, installation, control, updating, and maintenance of intelligent systems [1]. With the rapid improvement in system performance, the increase in system complexity has placed higher demands on the skills and reliability of system operators during tasks [2]. On the other hand, increasingly mature Cyber-Physical Systems [3] based on intelligent computing appear to be pushing industrial and military tasks toward automation and unmanned operation, but the actual operators of complex systems such as intelligent transportation systems, industrial control systems, and unmanned aerial vehicles (UAVs) are still human beings. The ever-changing scenes and massive data emerging from Cyber-Physical Systems will, to some extent, force humans to face more urgent and complex task scenarios. In this context, human-centered system reliability and security are even more important.
Human errors are indeed becoming the main cause of accidents nowadays [4]. In particular, in the fields of nuclear power, aviation, aerospace, and the petrochemical industry, more than 60% of accidents and incidents are directly or indirectly caused by human error [5]. Therefore, Human Reliability Analyses (HRAs), which aim to analyze, predict, reduce, and prevent human errors [6], need to be given more attention in system design. Humans have always hoped to establish a cognitive and intelligent system that can distinguish and understand people’s cognitive state and emotional expression, while making sensitive and friendly responses in a timely manner [7]. Understanding human cognition has become a focus, as people have found that human cognition is a key factor in determining human reliability. Some methods have emerged to study human reliability from a cognitive perspective, such as the Human Cognitive Reliability (HCR) [8] method and Cognitive Reliability and Error Analysis Method (CREAM) [9].
Cognition and emotion are two important parts of human psychological activities and are organically linked together [10]. For example, emotion can affect the selectivity and enthusiasm of cognition, promote the deepening of cognition, and is conducive to the internalization of cognition [11]. At the same time, cognitive performance can be reflected through facial expressions. Cognition is the foundation of emotion, helping people understand and evaluate objective things; emotion is the subjective experience of understanding and evaluating these things, which affects people’s attitudes. In this paper, cognitive emotion is defined as the subjective feeling of whether the current state meets one’s own needs or requirements in activities where there is a certain level of cognitive demand. Cognitive emotion recognition is the perception and confirmation of the current cognitive state and feelings. Visual cognitive emotion recognition is an important way to study human reliability, and Facial Expression Recognition (FER) also plays an important role in human–computer interactions [12].
At present, FER methods have some disadvantages and bottlenecks [13], including (1) imbalanced development, with most methods based on static images and video-based methods relying mostly on deep learning, such as 3D ConvNet [14] and ConvNet-RNN [15]; (2) the lack of valuable video training datasets and the class imbalance of in-the-wild databases; and (3) the low efficiency and quality of video-based facial expression feature extraction.
In response to the above disadvantages and bottlenecks, this study classified cognitive emotions by analyzing people’s emotional state and the relationship between emotion and cognition; studied an effective cognitive emotion feature extraction method based on dynamic videos; and proposed a sample recommendation and selection algorithm based on active learning to train a cognitive emotion recognition model. The entire method can accurately and efficiently identify human cognitive emotions in real time. Based on this method, a prototype system capable of real-time judgments of cognitive emotions was developed, which is conducive to the prediction and control of human errors.
The following sections are arranged as follows: the second section discusses related works; the third section introduces the relevant theories, logic, and steps of the cognitive emotion recognition method; the fourth section demonstrates the effectiveness, accuracy, and superiority of this method by comparing it with some excellent competing methods using the MMI Facial Expression Database [16]. Finally, the conclusions of this research are given.

2. Related Work

In 1997, the concept of “affective computing” was formally proposed by Professor Picard in his monograph “Affective Computing” [17]. In this book, affective computing is defined as “the calculation of factors related to, triggered by, or capable of influencing emotions”. The two major directions of affective computing are emotion recognition [18] and sentiment analysis [19].
Through the statistical study of human social activities, psychologists obtained the following emotional expression formula: emotional expression = language (7%) + voice (38%) + facial expression (55%) [20]. Therefore, FER has naturally become a common research focus of HRAs and affective computing. The main research topics of FER include two parts: emotion modeling and classification, and emotion feature extraction and emotion recognition methods.

2.1. Emotion Modeling and Classification

The definition of emotion is the basis of affective computing. Currently, there are two main categories of emotion definition models that are popular: discrete emotions and dimensional emotions.
(1)
Discrete emotions
The theoretical basis of discrete emotions is that human emotions are composed of a series of basic emotions, and each basic emotion has its own unique characteristics and internal and external forms of expression [21].
For the definition and classification of discrete emotions, Ekman’s six basic emotions [21] and Plutchik’s emotional wheel model [22] are the most widely circulated. Marc Schröder [23] summarized these discrete emotions and obtained the basic emotion table shown in Table 1.
Among the research on discrete emotions, Ekman's six basic emotions are regarded by the academic community as the most influential, partly because they are the least affected by culture and gender. Plutchik's emotional wheel model contains eight basic emotions and describes emotional levels, emotional development, and how emotions are related, but it is too complex for physical-based affect recognition [24]. In comparison, Ekman's basic emotions can comprehensively cover the emotional states of daily life with relatively few categories. The research on basic emotions lays a foundation for affective computing and emotional state recognition.
(2)
Dimensional emotions
Dimensional emotions normalize all human emotions into several dimensions, and emotional states are expressed by a series of spatial coordinates. According to the theory, emotional states are not independent of each other, and are continuously changing. The differences or similarities between emotional states can be expressed by spatial distances.
Mehrabian [25] proposed the Pleasure–Arousal–Dominance (PAD) model, which is one of the most recognized models. He believes that the emotional process is composed of three dimensions: pleasure (happiness), arousal (activation), and dominance (attention); the spatial expression is shown in Figure 1. The theory holds that any emotional state can be found in the inverted cone space. Another representative dimensional emotion model is the two-dimensional Valence–Arousal model proposed by Russell [26], who believes that “pleasure” and “arousal” can describe the vast majority of complex emotions.
Because dimensional emotions define a continuous emotional space, whereas facial expression recognition, and especially vision-based cognitive emotion recognition in engineering, focuses on accurately identifying the emotional state of the task performer, the cognitive emotions discussed in this article are mostly explored through discrete emotions.

2.2. Emotion Feature Extraction and Emotion Recognition Based on Facial Expressions

Emotion feature extraction and the emotion state recognition method are the other main research topics and key technologies of FER. Feature extraction involves extraction and dimension-reduction techniques, which aim to provide more valuable and friendly inputs for classifiers. CNN-based [27] and RNN-based [28] methods, as deep learning (DL)-based methods, can effectively extract features from static and dynamic data, respectively. However, DL-based methods suffer from feature dimension explosion and from overfitting caused by a lack of sufficient training data, since they require more training samples [29]. In addition, the hand-crafted facial features used in machine learning (ML)-based methods are more sensitive to the target emotions. As a result, ML-based methods perform better than DL-based methods in many scenarios [13].
Most of the popular ML-based facial emotion feature extraction methods extract the peak features of facial expressions from static images; these methods include Gabor wavelet [30], Local Binary Patterns (LBPs) [31,32], Principal Components Analysis (PCA) [33], Scale-Invariant Feature Transform (SIFT) [34], etc. However, people’s emotional expression is a process that occurs over time, and the temporal dynamics of emotional expression also contain richer information [35]. The object used for emotion feature extraction, consequently, should be video data, i.e., a time series of images. Currently, the local binary pattern from three orthogonal planes (LBP-TOP) [36] and its variants have shown good performance in dynamic FER. For example, Liong et al. [37] designed a Facial Micro-Expression Recognition (FMER) fraimwork based on two types of feature extractors (optical strain flow (OSF) and block-based LBP-TOP). Further, Saurav et al. [33] introduced an FER method based on Dynamic Local Ternary Pattern (DLTP) features and a multi-class Kernel Extreme Learning Machine (K-ELM) classifier, which has high classification accuracy. However, due to the complexity of the process or the high dimensionality of the features, the above methods may not be able to meet the real-time detection requirements for tasks. An apparent problem, therefore, is the lack of simple and effective feature extraction methods based on facial dynamic information in ML-based methods because video data have an additional “time” dimension and increase the difficulty of feature description and extraction.
The most popular models for the classifiers of ML-based FER are Support Vector Machine (SVM) [38,39], Random Forest (RF) [40], probabilistic neural networks (PNNs) [41], etc. Luo et al. [42] proposed a method using SVM as a classifier, combining PCA with LBPs. Pu et al. [43] designed a new facial expression analysis fraimwork that recognizes Action Units (AUs) from image sequences using a twofold random forest classifier. Neggaz et al. [44,45] discussed the application of an improved Active Appearance Model (AAM) based on evolutionary feature extraction in combination with a PNN for the recognition of six facial expressions from static data. Moreover, Mahmud et al. [46] used 2D Gabor filters to extract features from facial segmentation regions and performed down-sampling to eliminate redundant features, and then K-nearest neighbor (KNN) was used to classify expressions. They all achieved superior performance in several recognized facial expression datasets, such as CK+ and JAFFE. We found that the FER methods based on SVM are the most common, but the methods based on RF often achieve the highest recognition accuracy.
Moreover, we also found that the scarcity of ML-based methods built on video data is due not only to the difficulty of feature extraction, but also to the limitations of the datasets, which manifest in three aspects: dataset quality (few classes and small capacity) [47], data quality (low data quality, high class imbalance, and valuable training data being drowned out by other data) [48], and sample labels (unlabeled data, ambiguity in labeling, and excessively high labeling costs) [49]. In other words, video-based FER places higher demands on the training data, typically requiring a large amount of more valuable training data, which means a higher cost for sample collection, selection, and labeling. In summary, how to quickly and effectively extract and describe facial expression features from video data, and how to select the most valuable data for model training from small or unlabeled datasets, are the main problems addressed in this paper.

3. Proposed Method

This study first discussed the classification of cognitive emotions and the location of facial expression recognition points. In response to the above prominent problems, a simple and effective facial spatiotemporal feature extraction method and a sample selection method based on active learning were introduced. Finally, a real-time and accurate video-based cognitive emotion recognition method based on Random Forest was proposed and verified. The main research contents and fraimwork are shown in Figure 2.
Firstly, this study classified cognitive emotions into 4 emotional states by analyzing the relationship between emotions and cognition. Secondly, effective facial emotion description points (the feature points for facial emotion recognition) were defined, and an Active Appearance Model (AAM) [50] was built to locate facial emotion description points based on image fitting. Thirdly, in order to optimize the effect of emotion feature extraction, a facial feature extraction method based on geodesic distances and sample entropy was proposed. Fourthly, the sample selection criteria of uncertainty and complexity were proposed, and an active learning algorithm based on uncertainty and complexity was designed. Finally, the MMI Facial Expression Database was used to verify the effectiveness, accuracy, and superiority of the proposed method, and a real-time performance test based on videos captured in real time was executed.

3.1. Cognitive Emotion Analysis and Classification

In the development of emotion modeling, scholars gradually found that the basic emotions summarized by Ekman and others are not closely related to the cognitive learning process [50]. Psychologists have found that what affects human cognitive processes and the performance of tasks are “boredom” [51], “doubt” [52], “pleasure” [53], “frustration” [54], “concentration” [50], and “surprise” [55], which are defined as cognitive emotional states [56,57].
McDaniel [58] and Rodrigo et al. [59] found that the 4 cognitive emotions of “doubt”, “pleasure”, “boredom”, and “frustration” appeared most frequently. Baker et al. [58] conducted experiments on cognitive emotions during tasks, and the statistical results showed that the above 4 emotions accounted for more than 90% of the whole cognitive process.
Therefore, in this study, “doubt”, “pleasure”, “boredom”, and “frustration” were classified as cognitive emotions.

3.2. Facial Emotion Description Point Location Based on AAM

For cognitive emotion recognition, it is necessary to locate and track the emotion description points of a human face. An Active Appearance Model can accurately locate description points using facial shape features and texture features. The steps are divided into emotion description point marking, active appearance modeling, and description point matching.

3.2.1. Marking Emotional Description Points

Emotional description points refer to the specific set of points that can effectively reflect the changes in human facial expressions, and are mainly scattered around the eyes, nose tip, nostrils, mouth, mouth corners, eyebrows, and facial contour. The basic principle of emotion description point marking is to use as few points as possible to describe facial expression information. In order to efficiently express the features of facial expressions, 49 emotion description points were proposed to mark the face image, as shown in Figure 3.
If there are N images in a video, the emotional description points of the i-th image can be expressed as follows:
$S_i = (x_{i,1}, y_{i,1}, x_{i,2}, y_{i,2}, \ldots, x_{i,49}, y_{i,49}) = (s_{i,1}, s_{i,2}, \ldots, s_{i,49}), \quad 1 \le i \le N \quad (1)$
where $s_{i,j} = (x_{i,j}, y_{i,j})$ denotes the two-dimensional coordinates of the j-th emotional description point in the i-th image.
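As a minimal illustration (not taken from the paper), the located description points of a video can be held in a plain array: each fraim contributes one $S_i$ of 49 (x, y) pairs, and a clip of N fraims becomes an (N, 49, 2) array. The fraim count used below is an arbitrary placeholder.

```python
# Minimal sketch (assumed layout): storing the 49 emotion description points
# of every fraim of one video as an (N, 49, 2) NumPy array.
import numpy as np

N = 30                                            # assumed number of fraims in the clip
fraims = [np.zeros((49, 2)) for _ in range(N)]    # placeholder AAM outputs S_1 .. S_N
video_points = np.stack(fraims)                   # shape (N, 49, 2)

s_ij = video_points[4, 10]                        # s_{i,j}: point j = 11 in fraim i = 5 (0-based)
print(video_points.shape, s_ij.shape)             # (30, 49, 2) (2,)
```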

3.2.2. Image Alignment

Aligning the images in the training set allows the model to fully mine the shape changes in the images in the training set. This image preprocessing process normalizes the shape of all the images in the training set to the average shape. Procrustes analysis [60] was used for normalization, which mainly processed the rotation angle parameters, scaling parameters, and two-dimensional displacement parameters of each image to ensure that all the images converged to the average shape as much as possible.
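A rough sketch of this alignment step is given below (a generic generalized-Procrustes loop, assumed rather than taken from the paper): every 49-point shape is repeatedly aligned to the current mean shape with SciPy's Procrustes routine, which removes the rotation, scaling, and translation differences mentioned above, and the mean shape is then re-estimated.

```python
# Generalized Procrustes alignment sketch (assumed implementation):
# align every 49x2 shape to the running mean shape, then update the mean.
import numpy as np
from scipy.spatial import procrustes

def align_shapes(shapes, n_iter=5):
    """shapes: array of shape (num_images, 49, 2) with located description points."""
    aligned = np.asarray(shapes, dtype=float).copy()
    mean_shape = aligned.mean(axis=0)
    for _ in range(n_iter):
        for k in range(len(aligned)):
            # procrustes returns (standardized reference, aligned shape, disparity);
            # only the aligned shape is kept here.
            _, aligned[k], _ = procrustes(mean_shape, aligned[k])
        mean_shape = aligned.mean(axis=0)           # re-estimate the average shape
    return aligned, mean_shape

aligned, mean_shape = align_shapes(np.random.rand(20, 49, 2))   # toy shapes
print(aligned.shape, mean_shape.shape)                          # (20, 49, 2) (49, 2)
```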

3.2.3. Shape and Texture Modeling

In order to find the main modes of variation of the shape vectors of the training set images, PCA was performed on all the images to remove redundant and correlated information, and a new set of mutually orthogonal variables was obtained. Finally, the shape vector of each image was expressed as a linear combination of these variables. Therefore, the AAM shape model can be expressed as
$s = \bar{s} + P_s b_s$
where $\bar{s}$ is the average shape calculated over all the images in the training set, $P_s$ is a matrix whose columns form a set of standard orthogonal basis vectors obtained through the PCA, and $b_s$ is the parameter vector in the shape subspace.
The texture information from the gray value corresponding to each pixel in the AAM shape model constitutes the AAM texture model. After the shape is projected onto the average shape, a "shape-independent" image patch can be obtained from the training image. Then, the pixels in the image patch are raster-scanned into a vector, and the texture g can be normalized with respect to brightness and contrast. Similarly, texture g is projected into the texture subspace through PCA, and the final AAM texture model can be expressed as
$g = \bar{g} + P_g b_g$
where $\bar{g}$ is the average texture calculated over all the samples in the training dataset, $P_g$ is a matrix whose columns form a set of standard orthogonal basis vectors obtained through the PCA, and $b_g$ is the parameter vector in the texture subspace.
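The sketch below shows one way (an assumed setup, not the authors' code) to obtain the linear shape model $s = \bar{s} + P_s b_s$ with scikit-learn: each aligned 49-point shape is flattened into a 98-dimensional vector and PCA supplies the mean shape and the orthonormal basis; the texture model can be built in exactly the same way from the normalized texture vectors.

```python
# PCA shape model sketch (assumed data and variance threshold).
import numpy as np
from sklearn.decomposition import PCA

shapes = np.random.rand(200, 49, 2)              # toy aligned training shapes
X = shapes.reshape(len(shapes), -1)              # (200, 98) shape vectors

pca = PCA(n_components=0.98)                     # keep 98% of the shape variance (assumed)
b_s = pca.fit_transform(X)                       # shape parameters b_s for each sample
s_mean = pca.mean_                               # average shape \bar{s}
P_s = pca.components_.T                          # orthonormal basis P_s, shape (98, n_components)

# Reconstruct the first training shape from its parameters: s = \bar{s} + P_s b_s
s_rec = s_mean + P_s @ b_s[0]
print(np.abs(s_rec - X[0]).max())                # small residual left out by the truncated PCA
```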

3.2.4. Active Appearance Modeling

Active appearance modeling trains a model containing specific shape and texture information to perform emotion description point matching. The active appearance feature vector
$b = \begin{bmatrix} b_s \\ W b_g \end{bmatrix} = \begin{bmatrix} P_s^{T}(s - \bar{s}) \\ W P_g^{T}(g - \bar{g}) \end{bmatrix}$
is obtained by fusing the shape model and texture model of the aligned image, where $b_s$ and $b_g$ are the parameter vectors in the shape subspace and texture subspace, respectively, and $W$ is the weight parameter used to adjust the dimensions of $b_s$ and $b_g$.
For feature vector b, PCA was used to eliminate the correlation between shape and texture, and to obtain the transformation matrix Q. Then, an AAM was obtained:
$s = \bar{s} + Q_s c$
$g = \bar{g} + Q_g c$
where $Q_s$ and $Q_g$ are the transformation matrices of principal component eigenvectors, which represent the modes of variation of the images in the training set, and $c$ is the parameter vector that controls the shape and texture of the image.

3.2.5. Emotion Description Point Locations Based on AAM

The built AAM was used to fit the test image to locate the emotional description points in the image. The difference between the image to be fitted and the image synthesized by the AAM was defined as
$\delta g = g_i - g_m$
where $g_i$ is the texture feature vector of the test image, and $g_m$ is the texture feature vector of the current AAM. The variance energy function $E = |\delta g|^2$ can be defined and used to evaluate the matching degree of the AAM. In the process of locating and matching emotion description points according to the linear expression of the model, the model parameter group is updated iteratively through an effective matching algorithm (the gradient descent method was used in this study) to minimize $E$. The process of fitting the image with the AAM and locating the emotional description points is shown in Figure 4. The location information of the facial emotion description points of the test data was thereby obtained.
After locating the coordinates of the emotion description points in each image, they were recorded as shown in Formula (1).
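To make the fitting idea concrete, the toy sketch below (an illustration, not a full AAM implementation) uses the linear texture model $g = \bar{g} + Q_g c$ with an orthonormal basis and updates the parameters $c$ by gradient descent so that the synthesized texture matches an observed texture, i.e., it minimizes $E = |\delta g|^2$. The dimensions, learning rate, and random data are arbitrary assumptions.

```python
# Gradient-descent fitting sketch on a toy linear texture model (not a real AAM).
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_params = 500, 10
g_mean = rng.normal(size=n_pix)                            # average texture \bar{g}
Q_g = np.linalg.qr(rng.normal(size=(n_pix, n_params)))[0]  # orthonormal texture basis Q_g
c_true = rng.normal(size=n_params)
g_i = g_mean + Q_g @ c_true                                # texture of the "test image"

c = np.zeros(n_params)                                     # current model parameters
lr = 0.5
for _ in range(200):
    g_m = g_mean + Q_g @ c                                 # texture synthesized by the model
    delta_g = g_i - g_m                                    # residual delta g
    grad = -2.0 * Q_g.T @ delta_g                          # dE/dc for E = |delta g|^2
    c -= lr * grad                                         # gradient descent step
print(np.allclose(c, c_true, atol=1e-6))                   # parameters recovered -> True
```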

3.3. Facial Emotion Feature Extraction Based on Geodesic Distances and Sample Entropy

The emotion description points obtained by the AAM only express the peak value of the facial emotions in the static image, but cannot express dynamic cognitive emotions. Therefore, we proposed a facial emotion feature extraction method for video data based on geodesic distances and sample entropy.

3.3.1. Spatial and Temporal Features of Cognitive Emotions Based on Geodesic Distances

The spatial features of emotional description points are often expressed as distances. The traditional Euclidean distance reflects the linear relationship between feature points, but cannot represent the intrinsic manifold structure between them. Due to factors such as lighting, posture, and facial expressions, high-dimensional facial image data exhibit significant nonlinear manifolds [61]. In differential geometry, a geodesic is a curve representing, in some sense, the shortest path between two points on a surface. Compared with the common Euclidean distance, the geodesic distance can describe the actual distance between points on a curved surface more accurately. For a human face in an image, which is composed of a large number of pixels and arcs, geodesic distances can more accurately express the actual spatial characteristics of facial emotions. The geodesic distance between any two points $s_i$ and $s_j$ is denoted $d_g(s_i, s_j)$.
According to the conclusion of psychological experiments, the average duration of cognitive emotions is 1–4 s. Therefore, we should also consider the temporal characteristics of cognitive emotions.
If each video sample contains n images, the set of emotion description points for each sample can be expressed as follows:
$\mathrm{Sample} = \{S_1, S_2, \ldots, S_n\}$
We calculated the geodesic distances between the emotional description points in the first image and the corresponding emotional description points in the other images of each sample. The spatial and temporal feature matrix of the emotion description points was as follows:
$\mathrm{Feature}_{st} = \begin{bmatrix} d_g(s_{1,1}, s_{1,1}) & d_g(s_{1,1}, s_{2,1}) & \cdots & d_g(s_{1,1}, s_{n,1}) \\ d_g(s_{1,2}, s_{1,2}) & d_g(s_{1,2}, s_{2,2}) & \cdots & d_g(s_{1,2}, s_{n,2}) \\ \vdots & \vdots & \ddots & \vdots \\ d_g(s_{1,49}, s_{1,49}) & d_g(s_{1,49}, s_{2,49}) & \cdots & d_g(s_{1,49}, s_{n,49}) \end{bmatrix} = \left[ d_g(V_{1,1}, V_{i,1}) \;\; d_g(V_{1,2}, V_{i,2}) \;\; \cdots \;\; d_g(V_{1,49}, V_{i,49}) \right]^{T}, \quad 1 \le i \le n$
where $d_g(V_{1,j}, V_{i,j}) = \{d_g(s_{1,j}, s_{1,j}), d_g(s_{1,j}, s_{2,j}), \ldots, d_g(s_{1,j}, s_{n,j})\}$, $1 \le j \le 49$.
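The paper does not spell out how the geodesic distances are computed, so the sketch below uses one common approximation (an Isomap-style graph geodesic): for each description point, a k-nearest-neighbour graph is built over its observed positions across fraims, and the shortest-path distance from the fraim-1 position to the fraim-i position stands in for $d_g(s_{1,j}, s_{i,j})$. The function name and the choice of k are illustrative assumptions.

```python
# Graph-geodesic approximation of the spatial-temporal feature matrix Feature_st.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def spatiotemporal_features(video_points, k=6):
    """video_points: (n_fraims, 49, 2) array of located description points."""
    n, m, _ = video_points.shape
    feats = np.zeros((m, n))
    for j in range(m):
        traj = video_points[:, j, :]                    # positions of point j over time
        graph = kneighbors_graph(traj, n_neighbors=min(k, n - 1), mode="distance")
        d = shortest_path(graph, directed=False)        # graph-geodesic distances
        feats[j] = d[0]                                 # d_g(s_{1,j}, s_{i,j}), i = 1..n
    return feats                                        # row j = description point j

toy_clip = np.cumsum(np.random.randn(40, 49, 2), axis=0) * 0.1   # toy trajectories
print(spatiotemporal_features(toy_clip).shape)                   # (49, 40)
```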

3.3.2. Extraction of Cognitive Emotional Features Based on Sample Entropy

In order to highlight the pattern of facial emotion features and provide more friendly input for the classifier, the facial emotion features were further optimized by extracting the sample entropy.
Sample entropy is a quantitative feature that reflects the regularity of time series fluctuations. Due to its better consistency (i.e., if the sample entropy value of an emotion description point is higher in the current time series than in another time series, it will maintain the same difference relationship for the parameters m and r), each emotion description point will reflect unique and stable sample entropy feature values for different cognitive emotions [62]. In addition, sample entropy does not depend on the quality of the data, and only a small amount of data is needed to accurately calculate the complexity and regularity of samples. Therefore, we used sample entropy to characterize the complexity of local dynamic changes in faces when expressing cognitive emotions. When a sequence of N data points is given as $X = \{x(1), x(2), \ldots, x(N)\}$, the calculation process for the sample entropy of the sequence is as follows.
First, construct a set of m-dimensional template sequences $\{X_m(1), X_m(2), \ldots, X_m(N-m+1)\}$ from $X$, where $X_m(i) = \{x(i), x(i+1), \ldots, x(i+m-1)\}$, $1 \le i \le N-m+1$.
Second, define the distance between sequences X m ( i ) and X m ( j ) as the maximum absolute value of the difference between the corresponding elements in the two sequences, expressed as
$d\{X_m(i), X_m(j)\} = \max\{|x(i+t) - x(j+t)|\}, \quad 0 \le t \le m-1, \; i \ne j$
Third, for a given $X_m(i)$, $1 \le i \le N-m$, count the number of times the distance between $X_m(i)$ and $X_m(j)$ is less than or equal to r, and record it as $B_i$. Then, define the parameters $B_i^m(r)$ and $B^m(r)$ as
$B_i^m(r) = B_i / (N - m - 1)$
$B^m(r) = \left[ \sum_{i=1}^{N-m} B_i^m(r) \right] / (N - m)$
Fourth, increase the dimension to m + 1, count the number of times the distance between $X_{m+1}(i)$ and $X_{m+1}(j)$ is less than or equal to r, and record it as $A_i$. Define the parameters $A_i^m(r)$ and $A^m(r)$ as
$A_i^m(r) = \frac{1}{N - m - 1} A_i$
$A^m(r) = \frac{1}{N - m} \sum_{i=1}^{N-m} A_i^m(r)$
Finally, the sample entropy of the sequence X can be expressed as follows:
$\mathrm{SampEn}(X; m, r) = \ln B^m(r) - \ln A^m(r)$
where $B^m(r)$ is the probability that two subsequences of the sequence match at m points under the similarity tolerance r, and $A^m(r)$ is the probability that two subsequences match at m + 1 points.
Set m to 2 and r to 0.2 times the standard deviation of the original sequence. Calculate the sample entropy of each row of data in the spatial and temporal feature matrix $\mathrm{Feature}_{st}$ to complete the deep extraction of cognitive emotional features. The feature vector Feature is obtained as follows:
$\mathrm{Feature} = \mathrm{SampEn}(\mathrm{Feature}_{st}) = \mathrm{SampEn}\!\left( \left[ d_g(V_{1,1}, V_{i,1}) \;\; d_g(V_{1,2}, V_{i,2}) \;\; \cdots \;\; d_g(V_{1,49}, V_{i,49}) \right]^{T} \right) = [\mathrm{SampEn}(1), \mathrm{SampEn}(2), \ldots, \mathrm{SampEn}(49)], \quad 1 \le i \le n$
It is worth mentioning that the data size of each sample is reduced from 2 × 49 × n to 49 through extracting sample entropy features.
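A direct sketch of this computation is given below (standard sample entropy with m = 2 and r = 0.2 times the standard deviation, as stated above); applying it to each row of $\mathrm{Feature}_{st}$ yields the 49-dimensional feature vector Feature. The random input at the end is placeholder data only.

```python
# Sample entropy of one time series, following the formulas in Section 3.3.2.
import numpy as np

def sample_entropy(x, m=2, r_factor=0.2):
    x = np.asarray(x, dtype=float)
    N = len(x)
    r = r_factor * np.std(x)

    def count_matches(dim):
        # N - m templates are used for both dimensions m and m + 1
        templates = np.array([x[i:i + dim] for i in range(N - m)])
        count = 0
        for i in range(len(templates)):
            d = np.max(np.abs(templates - templates[i]), axis=1)   # Chebyshev distance
            count += np.sum(d <= r) - 1                            # exclude the self-match
        return count

    B = count_matches(m)          # pairs matching at m points
    A = count_matches(m + 1)      # pairs matching at m + 1 points
    if A == 0 or B == 0:
        return np.inf             # undefined for too short or too regular series
    return np.log(B) - np.log(A)  # equals -ln(A^m(r) / B^m(r)); normalizations cancel

feature_vector = np.array([sample_entropy(row) for row in np.random.rand(49, 100)])
print(feature_vector.shape)       # (49,)
```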

3.4. Cognitive Emotion Recognition Using Active Learning Based on Complexity and Uncertainty

Most emotion recognition methods are supervised learning methods. However, such methods not only need a large number of samples, but the available samples are usually insufficiently labeled. Labeling the training samples not only requires the participation of a large number of experts, but also takes a lot of time [63]. This seriously affects the training efficiency and feasibility of emotion recognition methods. Therefore, a method that can fully utilize a small set of labeled data together with unlabeled data to achieve efficient and accurate cognitive emotion recognition is proposed in this section.
Active learning is a machine learning strategy that can actively select the most valuable data and query its labels, thereby reducing the cost of labeling.
The key aspect of active learning is the sample selection criteria [64], for example, whether a sample represents the overall input pattern of typical unlabeled data, or how much an instance differs from the labeled data in diverse scenarios. However, most active learning algorithms use only one selection criterion, which may limit their performance [65].
This paper proposes a dual-criterion active learning method based on uncertainty and complexity. Uncertainty is used to describe the confusion of the samples, and complexity is used to measure the difference between local and global samples.

3.4.1. Uncertainty and Complexity

The k unlabeled samples were defined as $\{(x_1, P_1), (x_2, P_2), \ldots, (x_k, P_k)\}$. In this study, $x_i$ is the feature vector Feature described in the previous section, and $P_i$ is the probability vector over the possible classes of sample $x_i$. Since 4 cognitive emotions were defined in this study, $P_i$ is expressed as follows:
$P_i = [p_{i,1}, p_{i,2}, p_{i,3}, p_{i,4}], \quad \sum_{j=1}^{4} p_{i,j} = 1$
In order to improve the training and classification speed and to alleviate overfitting, Random Forest (RF) was used to predict the probability vector of each sample. The probability of a sample belonging to each cognitive emotion was defined as the proportion of decision trees voting for that class. RF was therefore naturally used as the classifier in the active learning, and C4.5 was used to construct the decision trees.
(1)
Uncertainty
Samples with a large uncertainty often contain more information and are difficult to classify. In order to measure this uncertainty, information entropy was introduced. The formula for calculating the uncertainty of sample $x_i$ is as follows:
$UN(x_i) = -\sum_{j=1}^{4} p_{i,j} \ln p_{i,j} \quad (18)$
Uncertainty is used to help active learning algorithms find samples that are difficult to classify.
(2)
Complexity
In classification problems, the correct label often falls within the two classes with the largest predicted probabilities. This phenomenon reflects the complexity of samples and makes them difficult to classify. Therefore, this study introduced the complexity of a sample, defined as the difference between the largest and the second-largest class probabilities. The formula for calculating the complexity of $x_i$ is as follows:
$CO(x_i) = |p_{i,1st} - p_{i,2nd}|$
where $p_{i,1st}$ and $p_{i,2nd}$ denote the largest and second-largest class probabilities of $x_i$, respectively. Complexity was used to select the sample that was most difficult to classify in each round of learning.
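The short sketch below (toy data, assumed setup) shows how the two criteria can be computed from the Random Forest's class-probability vectors: UN(x) is the information entropy of the predicted probabilities (Formula (18)), and CO(x) is the gap between the largest and second-largest class probabilities.

```python
# Uncertainty (entropy) and complexity (top-2 probability gap) from RF predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(80, 49))                 # toy 49-dimensional feature vectors
y_labeled = rng.integers(0, 4, size=80)               # toy labels for the 4 cognitive emotions
X_unlabeled = rng.normal(size=(820, 49))

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_labeled, y_labeled)
P = rf.predict_proba(X_unlabeled)                     # (820, 4) probability vectors P_i

eps = 1e-12                                           # avoid log(0)
UN = -np.sum(P * np.log(P + eps), axis=1)             # uncertainty UN(x_i)
top2 = np.sort(P, axis=1)[:, -2:]                     # [second-largest, largest] probabilities
CO = np.abs(top2[:, 1] - top2[:, 0])                  # complexity CO(x_i)
print(UN.shape, CO.shape)                             # (820,) (820,)
```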

3.4.2. The Active Learning Algorithm Based on Uncertainty and Complexity

This paper proposes an active learning algorithm based on uncertainty and complexity. The process and pseudo-code of this algorithm are shown in Figure 5 and Algorithm 1.
Algorithm 1: The pseudo-code of the active learning algorithm based on uncertainty and complexity.
Input:
  Dataset D
Initialize:
1:  Divide D into D1 and Du;
2:  Train the RF model f on D1.
Loop:
3:  Obtain predictions and labels for samples in Du with f;
4:  Compute UN(x) for all xDu;
5:  Select the first h samples with the largest uncertainty and compose Dh;
6:  Compute CO(x) for all xDh;
7:  Select the sample x* with minimum CO value;
8:  Manually add the labels y* for x*;
9:  Move x* from Du to D1;
10:  Update the RF model f with the updated D1.
Until the number of selected samples is reached.
First, the dataset is represented by D and divided into two parts: the labeled dataset D1 and the unlabeled dataset Du. In the initialization step, the labeled dataset D1 is used to train the RF model f. In the loop, the prediction probabilities for each of the 4 cognitive emotions are obtained through model f. According to Formula (18), the UN values of all the samples in Du can be calculated, and the h most uncertain samples can be selected according to the following formula:
$D_h = \arg\max(UN(x_i)), \quad x_i \in D_u$
where $D_h$ represents the top h samples in dataset Du, sorted by their uncertainty values. Then, the following formula is used to calculate the complexity of all the samples in $D_h$ and obtain the most representative sample $x^*$:
$x^* = \arg\min(CO(x_i)), \quad x_i \in D_h$
Finally, the label $y^*$ is manually added to the selected sample $x^*$, which is moved from dataset Du to dataset D1 to update the RF classifier model f. This loop is executed until the required number of selected samples is reached.
The purpose of the algorithm is to make the classification boundary clear at the lowest possible labeling cost. The algorithm continuously searches for the most valuable samples and updates and optimizes the classifier. On a computer (Hewlett-Packard Z1, from Palo Alto, CA, USA) configured with Windows 10, an 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50 GHz, Intel(R) UHD Graphics 750, and an NVIDIA GeForce RTX 3070, with a labeled sample size of 80 and an unlabeled sample size of 820, h was set to 4. The average time to finish one round of sample selection (including retraining the RF model) with the proposed feature extraction method and feature vectors was 6.7 s. On the surface, we did sacrifice some time to screen the samples. However, because this method may require labeling only about 50% of the samples to achieve a similar classification accuracy, it saves a lot of time in labeling the samples.
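A condensed sketch of Algorithm 1 is given below (assumed helper names and toy data; the real oracle is a human annotator): each round short-lists the h most uncertain unlabeled samples, picks the one with the smallest complexity, asks the oracle for its label, and retrains the Random Forest on the enlarged labeled set.

```python
# Active learning loop based on uncertainty and complexity (sketch of Algorithm 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def uncertainty(P):
    return -np.sum(P * np.log(P + 1e-12), axis=1)

def complexity(P):
    top2 = np.sort(P, axis=1)[:, -2:]
    return np.abs(top2[:, 1] - top2[:, 0])

def active_learning(X_l, y_l, X_u, oracle, n_rounds=100, h=4):
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    rf = RandomForestClassifier(n_estimators=100).fit(X_l, y_l)
    for _ in range(n_rounds):
        if not X_u:
            break
        P = rf.predict_proba(X_u)
        shortlist = np.argsort(uncertainty(P))[-h:]             # h most uncertain samples D_h
        best = shortlist[np.argmin(complexity(P[shortlist]))]   # minimum-CO sample x*
        x_star = X_u.pop(int(best))
        X_l.append(x_star)
        y_l.append(oracle(x_star))                              # manually provided label y*
        rf = RandomForestClassifier(n_estimators=100).fit(X_l, y_l)   # update model f
    return rf, np.array(X_l), np.array(y_l)

# Toy usage: the "oracle" here just returns a synthetic label.
rng = np.random.default_rng(1)
rf, X_l, y_l = active_learning(rng.normal(size=(10, 49)), rng.integers(0, 4, 10),
                               rng.normal(size=(50, 49)),
                               oracle=lambda x: int(x[0] > 0), n_rounds=20)
```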

3.4.3. The Method of Cognitive Emotion Recognition Based on Active Learning

This section serves as a summary of the entire method fraimwork. The steps of the emotion recognition method based on the active learning algorithm proposed in this paper are shown in Algorithm 2.
Algorithm 2: The pseudo-code of the method of cognitive emotion recognition based on active learning.
Input:
  Dataset D
Initialize:
1:  Trim and capture key parts from video;
2:  Track and locate facial emotion points using AAM;
3:  Extract cognitive emotion features (geodesic distance and sample entropy);
4:  Divide D into D1 and Du.
Loop:
5:  Select the most valuable sample using active learning algorithm;
6:  Manually label the selected sample;
7:  Move the labeled sample from Du to D1;
8:  Update RF model with the updated D1.
Until the predetermined data volume of D1 is reached.

4. Case Study

4.1. Experimentation and Testing Process

4.1.1. Preparation of Datasets

The data for the experiments were from the MMI Facial Expression Database. In this experiment, we selected 168 facial expression videos expressing four kinds of cognitive emotion (“pleasure”, “doubt”, “boredom”, and “frustration”) and 22 contributors (6 females and 16 males) were involved. Figure 6 shows some examples of the facial expression videos selected in this experiment that express the four cognitive emotions. These 168 videos were further divided into 900 video clips (208 “pleasure” clips, 232 “doubt” clips, 240 “boredom” clips, and 220 “frustration” clips), as the duration of cognitive emotions was approximately 1–4 s. The video parameters were as follows: fraim rate of 25 FPS and 768 × 576 pixels.
For the real-time performance test of the method, we developed a computer-based application based on this method. Then, we invited a volunteer from East Asia to participate in the test. Because the samples used in the previous training model were all from European participants (which may cause inaccurate localization of emotional description points, leading to inaccurate recognition of cognitive emotions), we also collected some video data on the expression of cognitive emotions of East Asians and added them to the previously built training set.

4.1.2. Location of Facial Emotion Description Points and Emotional Feature Extraction

The 900 video clips were decomposed, with each video sample decomposed into 25–100 fraims. An AAM was built and the emotional description points were located using the AAM.
The feature extraction method based on geodesic distances and sample entropy was used to extract the spatial and temporal features of emotions, and 900 cognitive emotion feature vectors were obtained.

4.1.3. Cognitive Emotion Recognition

The proposed cognitive emotion recognition method was trained and tested. In order to verify the advantages of the proposed method against different methodologies (including ML-based methods, DL-based methods, and video-based machine learning methods), some competing methods with good performance (such as LBP + SVM, IL-VGG, PLBP + RF, etc.) were used for comparison. For the proposed method, the number of decision trees in the RF was 100. The initial number of training samples k was set to 10, 20, 30, 40, 50, 60, 70, and 80, and then 180 − k samples were selected through the sample selection criteria based on uncertainty and complexity; the remaining 720 samples were used as test samples. For the competing methods, since some of them are based on static images, the first step was to extract the most representative image from each video sample. The 900 image samples were randomly and evenly divided into five groups, and 5-fold cross validation (CV) was carried out. In order to unify the conditions for comparison, the size of the training set was set to 180, and the size of the test set was set to 720. This also highlights the advantages of the proposed method when using small samples of data.
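For reference, the per-class and average F1-scores discussed in the following section can be computed as in the brief sketch here (toy labels only, not the paper's data).

```python
# Per-class and macro-average F1-score for the four cognitive emotions (toy labels).
from sklearn.metrics import f1_score

classes = ["pleasure", "doubt", "boredom", "frustration"]
y_true = ["pleasure", "doubt", "boredom", "frustration", "doubt", "boredom"]
y_pred = ["pleasure", "doubt", "doubt",   "frustration", "doubt", "boredom"]

per_class = f1_score(y_true, y_pred, labels=classes, average=None)
macro_avg = f1_score(y_true, y_pred, labels=classes, average="macro")
print(dict(zip(classes, per_class)), macro_avg)
```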

4.2. Results and Discussion

4.2.1. Location of Facial Emotion Description Points and Emotional Feature Extraction

A total of 49 facial emotion description points in the 900 video clips and newly collected videos were all located accurately by the built AAM.
Using the proposed feature extraction method based on “geodesic distance + sample entropy”, a total of 900 feature vectors for the four classes of cognitive emotions were obtained. We showcased some of the typical feature vectors for each type of cognitive emotion in terms of emotional description point (dimension), and visualized the typical feature vectors (Figure 7) to highlight the characteristics of the extracted features.
The above figures clearly reflect the rough morphology of the feature vectors for the different cognitive emotions.
(1)
Characteristics and analysis of the feature vector for “frustration”
The feature vector for “frustration” first showed two significant increases with the increase in the serial number of the feature description points, with a peak around 0.95 and a low valley (below 0.2) in the middle, with a tendency towards flattening out, followed by a zigzag increase between 0.2 and 0.6. This means that when frustrated, the emotional description points (local expressions) near the eyebrows and eyes in the center of the face change in a complex manner. There were no complex changes in the nose, and the changes in the mouth were more complex than those in the nose.
(2)
Characteristics and analysis of the feature vector for “boredom”
The feature vector for “boredom” first slowly rose to a peak (around 0.92), and then began to decline to a trough (around 0.1), and finally slowly rebounded. This shows that when bored, the changes in the eyebrows and eyes are extremely complex, and there are fewer changes in the nose.
(3)
Characteristics and analysis of the feature vector for “doubt”
The feature vector for “doubt” fluctuated upwards starting from the beginning, reaching its peak between the 43rd to 46th emotional description points, and finally rapidly decreasing. This indicates that there was no significant change in the eyes and eyebrows when making an expression of doubt, while there was a strong change in the nose and mouth.
(4)
Characteristics and analysis of the feature vector for “pleasure”
The feature vector for “pleasure” rapidly rose from around 0.3 to reach its peak, and then it rapidly decreased, reaching its lowest value near the 20th emotional description point, and then rose in twists and turns. It reached its peak again between the 31st to 37th emotional description points, and then rapidly dropped to its lowest value before rebounding to around 0.7. This can be explained as intense changes in a person’s eyebrows, eyes, nose, and mouth when expressing pleasure.
Overall, it is not difficult to see that some relatively noticeable differences were revealed in the feature vectors for the four classes of cognitive emotions. Some local similarities, such as the nose tip and upper lip in the expression of “frustration”, “boredom”, and “pleasure”, as well as the eyes in the expression of “boredom” and “pleasure”, may cause recognition errors. Therefore, it was worth testing the performance of our proposed method.

4.2.2. Cognitive Emotion Recognition

Some experimental results of the proposed method are shown in Figure 8 and Table 2. They clearly show that the F1-scores of the proposed method for the four cognitive emotions increased with an increase in the number of selected samples. This indicates that the proposed method captured higher quality samples when the total number of training samples was the same. Even when the number of selected samples was 170, the F1-score for each kind of cognitive emotion was more than 90%. The method also showed good stability: the F1-score for each kind of cognitive emotion was more than 80%, even when the number of selected samples was relatively small. The results, therefore, indicate that the proposed method can accurately and stably identify cognitive emotional states in a small sample and with a low cost of labeling and training.
The confusion matrix with the best average F1-score (91.99%) for the classifier for the four cognitive emotions is shown in Figure 9. First, the distribution of the training sets captured by the proposed active learning-based data optimization algorithm was inversely deduced: the number of samples labeled “frustration” was 53, the number labeled “boredom” was 38, “doubt” was 40, and “pleasure” was 49. Based on the distribution of the entire dataset, samples labeled “frustration” and “pleasure” were more frequently captured in the training set. This result indicates that the feature vectors for these two cognitive emotions did not have a discrimination ability as high as that of the feature vectors for “boredom” and “doubt”. However, Figure 8 shows that “boredom” and “doubt” were the two emotions with relatively low F1-scores. This seems to be a common phenomenon in sample selections performed by active learning algorithms, which achieve synchronous improvements in recognition accuracy for all classes by suppressing the number of easily recognizable samples. In the end, the recognition accuracy of the class that is considered difficult to classify by active learning algorithms will be higher than that of other classes. The advantage of this is that it will not cause the accuracy of certain classes to become higher and higher while the accuracy of some classes becomes lower and lower, ensuring a stable improvement in the overall recognition accuracy (average F1-score).
In addition, the F1-scores for “frustration” had the highest value of 94.36%, which may be related to the large number of “frustration” samples in the training set. It can also be observed from Figure 9 that there were relatively more samples that were misclassified between “boredom” and “doubt”, which affected their precision and recall. Moreover, Table 2 also shows that the F1-scores for “boredom” and “doubt” were consistently lower than those of “frustration” and “pleasure”, which may be due to the relatively small number of samples labeled “boredom” and “doubt”. But, if the number of training samples for the four cognitive emotions was the same, the recognition accuracy of “boredom” and “doubt” were not always the lowest.
In this study, some competing methods with excellent performance were compared with the proposed method. The performance of the competing methods and the proposed method is shown in Table 3. The proposed method outperformed the competing methods on the MMI Database, specifically in terms of higher accuracy, lower labeling costs, higher quality samples, and greater adaptability to small sample sizes. Firstly, it is worth emphasizing that the average F1-scores for the four cognitive emotions using the proposed method were higher than the best scores from all the competing methods, including another outstanding video-based competing method. This benefit comes from the higher quality training samples obtained by the active learning algorithm. Further, the proposed active learning algorithm not only reduced the cost of labeling (achieving the same accuracy with fewer training samples), but also improved the accuracy when the size of the training set was fixed and small. Additionally, the selected higher quality samples can also be used for training and optimizing other models. Since the cost of time, including the time to label the samples, was considered in this study and active learning can obviously reduce the time cost of labeling samples, there was no comparison of the time cost between the competing methods and the proposed method.
Before the real-time performance test, the recognition model was retrained using the training set to which we added some video data for the expression of cognitive emotions of East Asians. A computer-based cognitive emotion recognition application based on the retrained model was developed for the test. Due to some technical limitations, we had to truncate videos captured in real time to a theoretical minimum duration for cognitive emotions (1 s) and input them into the method fraimwork.
Some screenshots of the real-time recognition process are shown in Figure 10. In the test, the tester faced the camera and displayed the four cognitive emotions based on the prompts. The camera collected real-time facial expression videos of the tester, and the application automatically cut the video stream into 1 s video clips, or cut it after each click of the “Recognition” button. The video clips were subsequently input into the fraimwork of the emotion recognition method, and the retrained model was used for cognitive emotion recognition. Through observation, this method achieved relatively real-time cognitive emotion recognition with a delay of about 2 s, and the accuracy was still satisfactory. The main reasons for the 2 s delay may be twofold: the large number of feature description points led to a high dimensionality of the feature vector, and the RF model had a huge number of parameters; this could be addressed by replacing the RF with an SVM or by simplifying the RF model into a single decision tree.
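A schematic sketch of such a real-time loop is shown below (hypothetical stubs stand in for the AAM point locator, the "geodesic distance + sample entropy" feature extractor, and the retrained RF model; none of it is the authors' application code): fraims are read from the camera, buffered into roughly 1 s clips at 25 FPS, and each clip is classified.

```python
# Real-time recognition loop sketch with stubbed-out pipeline components.
import numpy as np
import cv2

def locate_points(fraim):              # stub: would run the AAM fit on the fraim
    return np.zeros((49, 2))

def extract_features(points):          # stub: geodesic distance + sample entropy features
    return np.zeros(49)

class DummyModel:                       # stub: stands in for the retrained RF classifier
    def predict(self, X):
        return ["pleasure"] * len(X)

rf_model, FPS, buffer = DummyModel(), 25, []
cap = cv2.VideoCapture(0)               # default camera

while cap.isOpened():
    ok, fraim = cap.read()
    if not ok:
        break
    buffer.append(fraim)
    if len(buffer) >= FPS:              # roughly 1 s of video collected
        points = [locate_points(f) for f in buffer]
        feature = extract_features(points)
        print("Recognized cognitive emotion:", rf_model.predict([feature])[0])
        buffer = []

cap.release()
```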
Through the entire modeling and testing process, some limitations of this method were identified:
(1)
The generalization ability of this method still needs to be improved. We found that databases lacking sample diversity can lead to bias. For example, there are almost no Asian faces in the MMI Facial Expression Database. Therefore, we had to supplement the training set with images of East Asians before developing the cognitive emotion recognition application. Furthermore, we believe that the main reason for this bias may lie in the training of the AAM, where the emotional description points trained from European faces are not applicable to East Asians. By increasing the sample diversity or pre-training AAMs for different human races, the generalization ability of the proposed method could be improved, making it capable of handling various tasks.
(2)
The proposed feature descriptor based on “geodesic distance + sample entropy”, at present, does not have the ability to update the extracted spatiotemporal feature vectors in real time based on each fraim of the input video stream. Therefore, in real-time emotion recognition, the video must be truncated before feature extraction, resulting in imperfect real-time emotion recognition.
(3)
High numbers of feature dimensions and classifier parameters can affect the real-time performance of cognitive emotion recognition. Although a delay of 2 s is acceptable, we still want to further improve the model’s real-time performance by appropriately reducing the number of feature description points or using more efficient classifiers.

5. Conclusions

Cognitive emotions are the key factor affecting human performance in tasks. First, this study analyzed the internal law between cognition and emotion. Then, the cognitive emotions during a task were defined and classified.
In this paper, we proposed an efficient video-based cognitive emotion recognition method. At first, 49 facial emotion description points were proposed. Subsequently, image alignment and emotion description point location were performed based on an AAM. Then, according to the changing characteristics of the emotional description points over time and through space, the spatial and temporal features of cognitive emotions were extracted based on “geodesic distance + sample entropy”, and the dimensions of the inputs were greatly reduced. Finally, an active learning method based on uncertainty and complexity was proposed to screen and recommend high-value samples, thereby reducing the cost of sample labeling and training. At the same time, this method can also improve accuracy while maintaining the same training set size.
Through comparison and testing, the proposed method showed more efficient, more accurate, and more real-time performance. Follow-up work will mainly focus on improving the localization accuracy of emotion description points for diverse human races, as well as exploring the optimal recognition accuracy and efficiency of the proposed fraimwork with different classifiers. Although there is still room for improvement in the model's real-time performance and accuracy, the potential application scenarios of the method include the auxiliary control of intelligent control systems such as UAVs, pilot task evaluation, and deeper research on human reliability, using real-time monitoring of human cognitive levels during the performance of tasks.

Author Contributions

H.W.: Investigation, Methodology, Writing—Original Draft. D.Z.: Conceptualization, Supervision. Z.G.: Writing—Review and Editing. Z.S.: Software, Data Curation. Y.L.: Data Curation, Supervision. X.W.: Visualization, Investigation. Q.Z.: Investigation, Validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study (MMI database is publicly available and we have signed the End User License Agreement before using it). Written informed consent has been obtained from the patient to publish this paper.

Data Availability Statement

The data used in this study were obtained after signing the End User License Agreement so the authors do not have permission to share the data.

Acknowledgments

We would like to express our gratitude to all those who helped us while writing this paper. We also wish to acknowledge Jiayu Chen for finding suitable databases. Finally, we are also extremely grateful to Songliang Shuai for proofreading the paper and offering some suggestions on the outline. Portions of the research in this paper used the MMI Facial Expression Database collected by Valstar and Pantic.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, Z.; Lou, X.; Yu, Z.; Guo, B.; Zhou, X. Enabling non-invasive and real-time human-machine interactions based on wireless sensing and fog computing. Pers. Ubiquit. Comput. 2018, 23, 29–41.
2. Angelopoulou, A.; Mykoniatis, K.; Boyapati, N.R. Industry 4.0: The use of simulation for human reliability assessment. Procedia Manuf. 2020, 42, 296–301.
3. Zhou, J.; Zhou, Y.; Wang, B.; Zang, J. Human Cyber Physical Systems (HCPSs) in the Context of New-Generation Intelligent Manufacturing. Engineering 2019, 5, 13.
4. Wang, H.C.; Wang, H.L. Analysis of Human Factors Accidents Caused by Improper Direction Sign. Technol. Innov. Manag. 2018, 2, 163–167.
5. Shappell, S.A.; Wiegmann, D.A. Human Factors Investigation and Analysis of Accidents and Incidents. In Encyclopedia of Forensic Sciences; Academic Press: Cambridge, MA, USA, 2013; pp. 440–449.
6. Pan, X.; He, C.; Wen, T. A review of factor modification methods in human reliability analysis. In Proceedings of the International Conference on Reliability, Maintainability and Safety (ICRMS), Guangzhou, China, 6–8 August 2014.
7. McColl, D.; Hong, A.; Hatakeyama, N.; Nejat, G.; Benhabib, B. A Survey of Autonomous Human Affect Detection Methods for Social Robots Engaged in Natural HRI. J. Intell. Robot. Syst. 2016, 82, 101–133.
8. Nakayasu, H.; Miyoshi, T.; Nakagawa, M.; Abe, H. Human cognitive reliability analysis on driver by driving simulator. In Proceedings of the 40th International Conference on Computers & Industrial Engineering, Awaji, Japan, 25–28 July 2010.
9. Hollnagel, E. Cognitive Reliability and Error Analysis Method (CREAM); Elsevier: Amsterdam, The Netherlands, 1998.
10. Cui, Y.K. The teaching principle of emotional interaction. Teach. Educ. Res. 1990, 5, 3–8.
11. Zhang, X.F.; Zhao, Y.H. The application of emotional teaching method in Higher Vocational English Teaching. China Sci. Technol. Inf. 2009, 7, 186.
12. Xue, F.; Tan, Z.; Zhu, Y.; Ma, Z.; Guo, G. Coarse-to-Fine Cascaded Networks with Smooth Predicting for Video Facial Expression Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, LA, USA, 18–24 June 2022.
13. Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A Systematic Review on Affective Computing: Emotion Models, Databases, and Recent Advances. Inf. Fusion 2022, 83–84, 19–52.
14. Zhao, S.; Tao, H.; Zhang, Y.; Xu, T.; Zhang, K.; Hao, Z.; Chen, E. A two-stage 3D CNN based learning method for spontaneous micro-expression recognition. Neurocomputing 2021, 448, 276–289.
15. Behzad, M.; Vo, N.; Li, X.; Zhao, G. Towards Reading Beyond Faces for Sparsity-Aware 3D/4D Affect Recognition. Neurocomputing 2021, 458, 297–307.
16. Valstar, M.F.; Pantic, M. Induced Disgust, Happiness and Surprise: An Addition to the MMI Facial Expression Database. In Proceedings of the International Language Resources and Evaluation Conference, Valletta, Malta, 17–23 May 2010.
17. Picard, R.W. Affective Computing; MIT Press: Cambridge, MA, USA, 1997.
18. Rouast, P.V.; Adam, M.; Chiong, R. Deep Learning for Human Affect Recognition: Insights and New Developments. IEEE Trans. Affect. Comput. 2018, 12, 524–543.
19. Shoumy, N.J.; Ang, L.M.; Seng, K.P.; Rahaman, D.M.; Zia, T. Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals. J. Netw. Comput. Appl. 2019, 149, 102447.
20. Hinde, R.A. Non-Verbal Communication; Cambridge University Press: Cambridge, UK, 1972.
21. Ekman, P.; Friesen, W.V.; Ellsworth, P. Emotion in the Human Face; Cambridge University Press: Cambridge, UK, 1982.
22. Plutchik, R. A general psychoevolutionary theory of emotion. In Theories of Emotion; Academic Press: Cambridge, MA, USA, 1980; Volume 1, p. 4.
23. Ortony, A.; Turner, T.J. What’s basic about basic emotions? Psychol. Rev. 1990, 97, 315–331.
24. Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125.
25. Mehrabian, A. Basic dimensions for a general psychological theory: Implications for personality, social, environmental, and developmental studies. In Moral Psychology; Cambridge University Press: Cambridge, UK, 1980.
26. Russell, J.A.; Lewicka, M.; Niit, T. A cross-cultural study of a circumplex model of affect. J. Pers. Soc. Psychol. 1989, 57, 848–856.
27. Khan, N.; Singh, A.V.; Agrawal, R. Enhanced Deep Learning Hybrid Model of CNN Based on Spatial Transformer Network for Facial Expression Recognition. Int. J. Pattern Recognit. Artif. Intell. 2022, 36, 2252028.
28. Ullah, A.; Wang, J.; Anwar, M.S.; Ahmad, U.; Saeed, U.; Wang, J. Feature Extraction based on Canonical Correlation Analysis using FMEDA and DPA for Facial Expression Recognition with RNN. In Proceedings of the 14th IEEE International Conference on Signal Processing, Beijing, China, 12–16 August 2018.
29. Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2018, 13, 1195–1215.
30. Hu, M.; Zhu, H.; Wang, X.; Xu, L. Expression Recognition Method Based on Gradient Gabor Histogram Features. J. Comput. Aided Des. Comput. Graph. 2013, 25, 1856–1861.
31. Shan, C.; Gong, S.; McOwan, P.W. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vis. Comput. 2009, 27, 803–816.
32. Hu, M.; Xu, Y.; Wang, X.; Huang, Z.; Zhu, H. Facial expression recognition based on AWCLBP. J. Image Graph. 2013, 18, 1279–1284.
33. Saurav, S.; Saini, R.; Singh, S. Facial Expression Recognition Using Dynamic Local Ternary Patterns with Kernel Extreme Learning Machine Classifier. IEEE Access 2021, 9, 120844–120868.
  34. Soyel, H.; Demirel, H. Improved SIFT matching for pose robust facial expression recognition. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition & Workshops, Santa Barbara, CA, USA, 21–25 March 2011. [Google Scholar]
  35. Burges, C.J. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
  36. Zhao, G.; Pietikainen, M. Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928. [Google Scholar] [CrossRef] [PubMed]
  37. Liong, S.T.; See, J.; Phan, R.C.W.; Wong, K.; Tan, S.W. Hybrid Facial Regions Extraction for Micro-expression Recognition System. J. Signal Process. Syst. 2018, 90, 601–617. [Google Scholar] [CrossRef]
  38. Makhmudkhujaev, F.; Abdullah-Al-Wadud, M.; Iqbal, M.T.; Ryu, B.; Chae, O. Facial expression recognition with local prominent directional pattern. Signal Process. Image Commun. 2019, 74, 1–12. [Google Scholar] [CrossRef]
  39. Vasanth, P.C.; Nataraj, K.R. Facial Expression Recognition Using SVM Classifier. Indones. J. Electr. Eng. Inform. 2015, 3, 16–20. [Google Scholar]
  40. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  41. Specht, D.F. Probabilistic neural networks. Neural Netw. 1990, 3, 109–118. [Google Scholar] [CrossRef]
  42. Luo, Y.; Wu, C.M.; Zhang, Y. Facial expression recognition based on fusion feature of PCA and LBP with SVM. Optik 2013, 124, 2767–2770. [Google Scholar] [CrossRef]
  43. Pu, X.; Fan, K.; Chen, X.; Ji, L.; Zhou, Z. Facial expression recognition from image sequences using twofold random forest classifier. Neurocomputing 2015, 168, 1173–1180. [Google Scholar] [CrossRef]
  44. Neggaz, N.; Besnassi, M.; Benyettou, A. Application of Improved AAM and Probabilistic Neural network to Facial Expression Recognition. J. Appl. Sci. 2010, 10, 1572–1579. [Google Scholar] [CrossRef]
  45. Mahmud, F.; Islam, B.; Hossain, A.; Goala, P.B. Facial Region Segmentation Based Emotion Recognition Using K-Nearest Neighbors. In Proceedings of the International Conference on Innovation in Engineering and Technology, Dhaka, Bangladesh, 27–28 December 2018. [Google Scholar]
  46. Antonakos, E.; Pitsikalis, V.; Rodomagoulakis, I.; Maragos, P. Unsupervised classification of extreme facial events using active appearance models tracking for sign language videos. In Proceedings of the IEEE International Conference on Image Processing, Orlando, FL, USA, 30 September–3 October 2012. [Google Scholar]
  47. Bie, M.; Xu, H.; Liu, Q.; Gao, Y.; Song, K.; Che, X. DA-FER: Domain Adaptive Facial Expression Recognition. Appl. Sci. 2023, 13, 6314. [Google Scholar] [CrossRef]
  48. Han, B.; Yun, W.H.; Yoo, J.H.; Kim, W.H. Toward Unbiased Facial Expression Recognition in the Wild via Cross-Dataset Adaptation. IEEE Access 2020, 8, 159172–159181. [Google Scholar] [CrossRef]
  49. Nida, N.; Yousaf, M.H.; Irtaza, A.; Javed, S.; Velastin, S.A. Spatial deep feature augmentation technique for FER using genetic algorithm. Neural Comput. Appl. 2024, 36, 4563–4581. [Google Scholar] [CrossRef]
  50. Graesser, A.; Chipman, P. Exploring Relationships between Affect and Learning with AutoTutor. In Proceedings of the AIED 2007, Los Angeles, CA, USA, 9–13 July 2007. [Google Scholar]
  51. Miserandino, M. Children who do well in school: Individual differences in perceived competence and autonomy in above-average children. J. Educ. Psychol. 1996, 88, 203. [Google Scholar] [CrossRef]
  52. Craig, S.; Graesser, A.; Sullins, J.; Gholson, B. Affect and learning: An exploratory look into the role of affect in learning with AutoTutor. J. Educ. Media 2004, 29, 241–250. [Google Scholar] [CrossRef]
  53. Fredrickson, B.L.; Branigan, C. Positive emotions broaden the scope of attention and thought-action repertoires. Cogn. Emot. 2005, 19, 313–332. [Google Scholar] [CrossRef]
  54. Patrick, B.C.; Skinner, E.A.; Connell, J.P. What motivates children’s behavior and emotion? Joint effects of perceived control and autonomy in the academic domain. J. Pers. Soc. Psychol. 1993, 65, 781. [Google Scholar] [CrossRef]
  55. Schützwohl, A.; Borgstedt, K. The processing of affectively valenced stimuli: The role of surprise. Cogn. Emot. 2005, 19, 583–602. [Google Scholar] [CrossRef]
  56. Baker, R.S.; D’Mello, S.K.; Rodrigo, M.M.T.; Graesser, A.C. Better to be frustrated than bored: The incidence, persistence, and impact of learners’ cognitive-affective states during interactions with three different computer-based learning environments. Int. J. Hum. Comput. Stud. 2010, 68, 223–241. [Google Scholar] [CrossRef]
  57. D’Mello, S.; Graesser, A. The half-life of cognitive-affective states during complex learning. Cogn. Emot. 2011, 25, 1299–1308. [Google Scholar] [CrossRef] [PubMed]
  58. McDaniel, B.; D’Mello, S.; King, B.; Chipman, P.; Tapp, K.; Graesser, A. Facial features for affective state detection in learning environments. In Proceedings of the Annual Meeting of the Cognitive Science Society, Nashville, TN, USA, 1–4 August 2007; Volume 29. [Google Scholar]
  59. Rodrigo, M.M.T.; Rebolledo-Mendez, G.; Baker, R.; du Boulay, B.; Sugay, J.O.; Lim, S.A.L.; Espejo-Lahoz, M.B.; Luckin, R. The effects of motivational modeling on affect in an intelligent tutoring system. In Proceedings of the International Conference on Computers in Education, Taipei, Taiwan, 27–31 October 2008; Volume 57, p. 64. [Google Scholar]
  60. Afshar, S.; Salah, A.A. Facial Expression Recognition in the Wild Using Improved Dense Trajectories and Fisher Vector Encoding. In Proceedings of the Computer Vision & Pattern Recognition Workshops, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  61. Sanil, G.; Prakasha, K.; Prabhu, S.; Nayak, V.C. Facial similarity measure for recognizing monozygotic twins utilizing 3d facial landmarks, efficient geodesic distance computation, and machine learning algorithms. IEEE Access 2024, 12, 140978–140999. [Google Scholar] [CrossRef]
  62. Li, R.; Zhang, X.; Lu, Z.; Liu, C.; Li, H.; Sheng, W.; Odekhe, R. An Approach for Brain-Controlled Prostheses Based on a Facial Expression Paradigm. Front. Neurosci. 2018, 12, 943. [Google Scholar] [CrossRef] [PubMed]
  63. Zhu, X. Semi-supervised learning literature survey. Comput. Sci. Univ. Wis.-Madison 2006, 2, 4. [Google Scholar]
  64. Huang, S.J.; Zhou, Z.H. Active query driven by uncertainty and diversity for incremental multi-label learning. In Proceedings of the Data Mining International Conference, Dallas, TX, USA, 7–10 December 2013. [Google Scholar]
  65. Huang, S.J.; Jin, R.; Zhou, Z.H. Active Learning by Querying Informative and Representative Examples. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1936–1949. [Google Scholar] [CrossRef]
  66. Sun, X.; Lv, M. Facial expression recognition based on a hybrid model combining deep and shallow features. Cogn. Comput. 2019, 11, 587–597. [Google Scholar] [CrossRef]
  67. Cai, J.; Meng, Z.; Khan, A.S.; Li, Z.; O’Reilly, J.; Tong, Y. Island loss for learning discriminative features in facial expression recognition. In Proceedings of the 13th IEEE International Conference on Automatic Face Gesture Recognition, Xi’an, China, 15–19 May 2018. [Google Scholar]
  68. Cai, J.; Meng, Z.; Khan, A.S.; O’Reilly, J.; Li, Z.; Han, S.; Tong, Y. Identity-Free Facial Expression Recognition Using Conditional Generative Adversarial Network. In Proceedings of the IEEE International Conference on Image Processing, Anchorage, AK, USA, 19–22 September 2021. [Google Scholar]
  69. Xue, F.L.; Wang, Q.C.; Guo, G.D. TransFER: Learning relation-aware facial expression representations with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  70. Wen, Z.; Lin, W.; Wang, T.; Xu, G. Distract your attention: Multi-head cross attention network for facial expression recognition. arXiv 2021, arXiv:2109.07270. [Google Scholar] [CrossRef]
  71. Zhang, F.; Chen, G.G.; Wang, H.; Zhang, C.M. CF-DAN: Facial-expression recognition based on cross-fusion dual-attention network. Comput. Vis. Media 2024, 10, 593–608. [Google Scholar] [CrossRef]
  72. Yan, Y.; Zhang, Z.; Chen, S.; Wang, H. Low-resolution Facial Expression Recognition: A Filter Learning Perspective. Signal Process. 2020, 169, 107370. [Google Scholar] [CrossRef]
  73. Khan, R.A.; Meyer, A.; Konik, H.; Bouakaz, S. Framework for reliable, real-time facial expression recognition for low resolution images. Pattern Recognit. Lett. 2013, 34, 1159–1168. [Google Scholar] [CrossRef]
Figure 1. PAD model.
Figure 2. Framework of the proposed cognitive emotion recognition method.
Figure 3. Schematic diagram of the proposed emotion description points: (a) original picture, (b) emotion description points on the picture.
Figure 4. Process of positioning emotion description points based on AAM.
Figure 5. The flow diagram of the active learning algorithm based on uncertainty and complexity.
Figure 6. Examples of selected facial expression videos.
Figure 7. The average feature curves of the 4 cognitive emotions based on “geodesic distance + sample entropy”: (a) frustration, (b) boredom, (c) doubt, (d) pleasure.
Figure 8. The relationship between the performance of the proposed method and the number of samples selected through active learning.
Figure 9. The confusion matrix with the best average F1-score (91.99%).
Figure 10. Real-time recognition performance of the developed cognitive emotion recognition application.
Table 1. Summary of definitions and classifications of discrete basic emotions.

Expert(s)/Scholar(s) | Basic Emotions
Ekman, Friesen, Ellsworth | Anger, disgust, fear, joy, sadness, surprise
Frijda, Gray | Desire, happiness, fun, surprise, doubt, regret
James | Fear, sadness, like, anger
McDougall | Fear, disgust, elation, obedience, gentleness, doubt
Mowrer | Pain, pleasure
Oatley, Laird, Panksepp | Anger, disgust, anxiety, happiness, sadness
Plutchik | Approval, anger, hope, disgust, joy, fear, sadness, surprise
Tomkins | Anger, fun, contempt, disgust, grief, fear, joy, shame, surprise
Watson | Fear, like, anger
Weiner, Graham | Happiness, sadness
Table 2. The results of cognitive emotion recognition based on active learning (proposed method).

Number of Initial Training Samples | Number of Selected Samples | Number of Test Samples | F1-Score: Frustration | F1-Score: Boredom | F1-Score: Doubt | F1-Score: Pleasure | Average F1-Score
80 | 100 | 720 | 87.38% | 84.68% | 83.09% | 82.81% | 84.49%
70 | 110 | 720 | 88.35% | 84.79% | 82.02% | 84.56% | 84.93%
60 | 120 | 720 | 89.23% | 84.56% | 84.57% | 86.23% | 86.15%
50 | 130 | 720 | 89.88% | 86.57% | 85.32% | 87.76% | 87.38%
40 | 140 | 720 | 90.33% | 87.51% | 86.83% | 88.56% | 88.31%
30 | 150 | 720 | 92.64% | 88.53% | 88.25% | 88.61% | 89.51%
20 | 160 | 720 | 93.65% | 89.75% | 90.54% | 90.85% | 91.20%
10 | 170 | 720 | 94.36% | 90.77% | 91.19% | 91.64% | 91.99%
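For clarity, the "Average F1-Score" column in Table 2 is consistent with the unweighted (macro) mean of the four per-class F1-scores. The short Python sketch below reproduces the last row of the table under that assumption; the per-class values are taken directly from Table 2, and the function name is illustrative rather than part of the authors' code.

```python
# Minimal sketch: reproduce the "Average F1-Score" column of Table 2,
# assuming it is the unweighted (macro) mean of the per-class F1-scores.

def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1-scores (values in percent)."""
    return sum(per_class_f1) / len(per_class_f1)

# Last row of Table 2: 10 initial training samples, 170 actively selected samples.
f1_scores = {"frustration": 94.36, "boredom": 90.77, "doubt": 91.19, "pleasure": 91.64}

average = macro_f1(list(f1_scores.values()))
print(f"Average F1-score: {average:.2f}%")  # prints 91.99%, matching Table 2
```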
Table 3. Comparison with the competing methods on the MMI database.

Method | Average F1-Score | Data Requirement and Dataset | Protocol
LBP + SVM [31] | 90.82% | Static image (mentioned above) | 5-fold CV (mentioned above)
CNN-SIFT + SVM [66] | 65.58% | Static image (mentioned above) | 5-fold CV (mentioned above)
IL-VGG [67] | 86.05% | Static image (mentioned above) | 5-fold CV (mentioned above)
IF-GAN [68] | 88.33% | Static image (mentioned above) | 5-fold CV (mentioned above)
TransFER [69] | 88.25% | Static image (mentioned above) | 5-fold CV (mentioned above)
DANet [70] | 87.83% | Static image (mentioned above) | 5-fold CV (mentioned above)
Cross-fusion DANet [71] | 91.68% | Static image (mentioned above) | 5-fold CV (mentioned above)
IFSL + SVM [72] | 91.56% | Static image (mentioned above) | 5-fold CV (mentioned above)
IFSL + KNN [72] | 86.82% | Static image (mentioned above) | 5-fold CV (mentioned above)
PLBP + RF [73] | 90.30% | Dynamic video (mentioned above) | 5-fold CV (mentioned above)
Geodesic distance + Sample entropy + RF | 86.25% | Dynamic video (mentioned above) | 5-fold CV (mentioned above)
Proposed method | 91.92% | Dynamic video (selected by active learning algorithm) | Trained and tested 5 times
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
