
Received: October 13, 2020. Revised: December 24, 2020.
International Journal of Intelligent Engineering and Systems, Vol. 14, No. 2, 2021. DOI: 10.22266/ijies2021.0430.21

Engagement Detection through Facial Emotional Recognition Using a Shallow Residual Convolutional Neural Network

Michael Moses Thiruthuvanathan1*, Balachandran Krishnan1, Madhavi Rangaswamy2

1 Department of Computer Science and Engineering, School of Engineering and Technology, CHRIST (Deemed to be University), Bangalore, India
2 Department of Psychology, School of Social Sciences, CHRIST (Deemed to be University), Bangalore, India
* Corresponding author's Email: Michael.moses@christuniversity.in

Abstract: Online teaching and learning have recently become the order of the day, with the majority of learners undergoing courses and training in this new environment. Learning through these platforms has created a need to understand whether the learner is interested or not. Detecting learner engagement has therefore attracted increased attention, with the aim of creating learner-centric models that can enhance the teaching and learning experience. Over a period of time on the platform, the learner tends to expose various emotions such as engaged, bored, frustrated, confused and angry, along with other cues that can be classified as engaged or disengaged. This paper proposes a Convolutional Neural Network (CNN) equipped with residual connections that enhance the learning of the network and improve classification on three Indian datasets that predominantly target classroom engagement models. The proposed network performs well due to the introduction of residual learning, which carries additional learning from the previous batch of layers into the next batch, an Optimized Hyper Parametric (OHP) setting, increased image dimensions for higher data abstraction, and the reduction of vanishing-gradient problems, which in turn helps manage overfitting. The residual network introduced consists of a shallow depth of 50 layers and produces an accuracy of 91.3% on the ISED & iSAFE data, while it achieves 93.4% accuracy on the Daisee dataset. The average agreement achieved by the classification network is 0.825 according to Cohen's Kappa measure.
Keywords: Student engagement detection, Residual networks, Convolutional neural network, Emotion detection, Facial expression recognition.

1. Introduction

Recognition of user interaction becomes highly important in a digital environment that is filled with information about consumers. It is important for applications to be "aware" of the user's presence when delivering information. Affect is a psychological process used to define a feeling and its external presentation. Affective computing aims at designing systems and tools capable of detecting, reading and simulating human affects across various sources such as the face, speech and biological signals. Due to the availability of online resources to learn during these pandemic times, students are involved in attending online classes. Virtual learners take part in numerous instructional events including reading, writing, viewing video lessons, virtual exams and online meetings. They display different degrees of interaction during involvement in these educational events, such as fatigue, annoyance, excitement, indifference, uncertainty and advantage in learning. To provide customized pedagogical support through online learner initiatives, it is critical that online educators reliably and effectively detect the state of involvement of their online learners. In the context of online learning, this paper provides a study of the state of the art in interaction identification. While students take part in these online classes, computing is required to estimate valence and arousal to analyse the engagement of the users by capturing facial features. Enormous efforts have been made to develop reliable automated Facial Expression Recognition (FER) systems for use in machines and devices that are aware of affect. These programs can more easily grasp individual feelings and communicate with the consumer. Current technologies, however, are yet to achieve the maximum emotional and social capacities needed to create a rich and stable Human Machine Interaction (HMI). This is primarily due to the reality that HMI devices need to communicate with humans in an unregulated atmosphere (aka a wild setting), where scene illumination, camera orientation, picture size, landscape, user head posture, gender and ethnicity can differ considerably.

Furthermore, there are inadequate variations and annotated samples in the data that drive the creation of affective computing systems, and particularly of FER systems. Recent studies in psychology showed that people basically convey their feelings externally. The study of facial expression is an essential part of genuinely rich Man-Machine Interaction (MMI) frameworks, as it uses nonverbal signs throughout to measure the user's enthusiasm. This work emphasises the creation of neural networks to detect user emotions and user engagement by utilising the most commonly used datasets. In the recent literature on user engagement detection, Convolutional Neural Networks (CNN) and Residual Networks (ResNet) are considerably used to improve the outcomes of emotion detection. Residual networks utilize skip connections between layers to enhance the learning pattern of the network. The rest of the paper is organised as follows: Section 2 describes the related work and Section 3 explains the datasets. Section 4 explains the residual network. The details of the experimentation are provided in Section 5. The detailed results are in Section 6. Finally, Section 7 concludes the paper.

2. Related works

In the literature, there are several models to measure emotional behaviours: 1) categorical models that select the emotion or affect from a list of affect categories, including the six specific emotions identified by Ekman; 2) dimensional models, where meaning, such as valence and arousal, is selected over a continuous emotional scale; 3) Facial Action Coding Systems, where all potential facial behaviours are identified in terms of Action Units (AUs); and 4) tagged emotions, which are emotions grouped together based on Ekman's categorical model to create emotional tags as combinational outcomes [1]. The authors have grouped the emotions into four primary categories for learning environments: 1) boredom, 2) engaged, 3) frustrated and 4) confused. Few authors have used combinational methods explicitly; combining two emotions proved unable to interpret mixed feelings into a limited collection of words sufficiently [2]. On the other hand, the dimensional model of affect can discern between slightly different displays of affect and represent minor changes in the intensity of each emotion on a continuous scale, such as valence and arousal. Valence reflects how positive or negative an event is, and arousal reflects whether an event is exciting, restless or calm.

In the continuous domain, dimensional perception of affect encompasses the strength and specific types of emotion. However, comparatively fewer studies have been performed to establish automatic algorithms for calculating affect using a continuous dimensional model (e.g. valence and arousal). One of the key reasons for this is that building a massive database to cover the entire continuous space of valence and arousal is costly, and there are very few annotated face databases in the continuous domain. Facial Expression Recognition (FER) systems for different domains use supervised or semi-supervised learning methods for automated affective computing. They require labelled datasets for training and testing; these datasets are generally created from posed actions by subjects and from expressions extracted from videos enacted by various actors.

Recently, the education sector has begun to impart knowledge through online portals. These methods have made it challenging to analyse the engagement levels of students while teaching and learning are conducted through online portals [2]. Facial expression and affect datasets in the wild have been receiving a lot of attention recently. These datasets are either collected from movies or the world wide web, are well labelled [3-5], and have varied dimensions [6]. However, they cover just one model of affect, have a small range of subjects, or include few instances of certain emotions like disgust and sadness. A broad archive with a substantial quantity of subject variations in wild conditions, covering numerous affect models, is therefore a requirement. Though there are several models of affect computing for emotional recognition in videos or single images, object localization and continuous emotional analysis have always been challenging tasks due to face detection, posture recognition, segmentation, human pose, object association and affective state classification using facial expressions in cluttered environments. For the better growth of Massive Open Online Courses (MOOCs), there is a need to design smart interfaces that can simulate the interactions between the instructor and pupil.
The principal disadvantage of existing e-learning systems is that they cannot take direct input from students (or instructors) in real time during the delivery of the content; compared with traditional classroom instruction, MOOCs have a 91-93% dropout rate [7]. Understanding user engagement at different junctures of the e-learning experience can help design intuitive interfaces that support students' better absorption of knowledge and personalize learning. Understanding the user's affective state is an important computer vision sub-area, centred for a long time on datasets of the seven basic terms: neutral, happiness, sadness, anger, disgust, surprise and contempt [8]. In recent years, data collection has been extended to cover affective states in terms of dimensional representations [9-11], but the vast subtleties in affective states allow datasets for particular goals to be established. This strategy, which is backed by recent developments, tends to promote measurable outcomes. It has been found that in e-learning and classroom settings students often prefer to communicate only a few affective conditions.

These include the 7 fundamental emotions and a few emotions that are concentrated on learning. Distinct works focus on hand gestures [12], facial recognition and affective states; however, there are very few works elaborating the available datasets for assessing the various cognitive levels of understanding students' emotional state for engagement and distraction. There are distinctive doubts about the curation and usage of facial data in facial recognition [13]. A few researchers have captured emotions under controlled environments while the subjects watch videos of different emotions [14-18]. While such methods are able to collect a vast number of frames, the variety of such repositories is restricted by the number of participants, head orientations and environmental exposures [21, 22].

Some of the works carried out on the ISED dataset have predominantly used a CNN as the crux, while some modifications of the network are incorporated to enhance the accuracy of the algorithm [25-26]. Feature extraction methods involving local prominent directional patterns and local directional structural patterns have been used; however, these methods lack efficient classification accuracy when compared to CNNs [25-27]. Many authors use modified CNNs in order to achieve greater results by adding multiple deep layers that enhance the performance of the system [29-31]. CNNs are prone to issues of vanishing gradients that lead to accuracy loss by turning the training into an expanding memory requirement. In all the major works mentioned in this section, there is a need to improve the scope of the detection rate and the precision of detection for individual emotions. A system that can reduce the error rates while training is important, as the connections established in each layer contribute to the weight updating and approximation that improve detection on the data.

This paper aims at analysing the residual network's performance with respect to a parameterized study that can establish the significance of the classification model on three established datasets. This work also compares the results with existing models in terms of accuracy.

3. Datasets

Data for any work is pivotal, and all the experiments are based on the data. There are critical datasets that emulate emotions and help in creating models for detecting emotions for various applications. Classification problems to detect emotions have recently been a field of study across various prominent datasets that contribute to the understanding of emotions. Though there are several datasets that help in analysing the face for emotions, there are few datasets of Indian-origin faces for a classroom environment. Learning environments are not only limited to the basic emotions; they can also be extended to various classes of classification that influence the accurate measurement of engagement in a class. This work focusses on elaborating the emotions by using the available datasets of Indian origin, combining basic classes, engagement recognition and learning-centered emotions. The datasets used for this study are the DAISEE, iSAFE and ISED databases. Table 1 lists the details of the available affect datasets.

4. The residual network

In this work the priority is to elaborate the feature extraction process by creating a space where each particular emotion exposed by a human is discrete. This is achieved by extending the emotions into 10 classes and establishing a model that can eliminate the bias of learning and detection. Fig. 1 shows the network that was used for the emotional analysis. This network is grouped into convolutional layers on the left of the figure and the skip network on the right. The middle layers represent the connection between the residual connections and the convolution layers. S1U1CONV1 represents one single convolutional unit and S1U1BN1 represents the batch normalization layer. These layers are grouped into units, each consisting of two convolutional and batch normalization layers.

Table 1. Dataset details for facial emotional analysis

| Sl no | Author | Dataset | Database details | Affective states | Emotions enlisted |
|---|---|---|---|---|---|
| 1 | Setty et al. [11] | IMFDB dataset | 100 movie videos | Posed | Fundamental emotions |
| 2 | Dhall et al. [14] | AFEW database | 957 videos | Temporal data | Fundamental emotions |
| 3 | Happy et al. [15] | ISED database | 428 videos from 50 participants | Collected from the wild | Fundamental emotions |
| 4 | Sapinski et al. [16] | Multimodal database | 560 images with 16 subjects | Learning centered | Learning based emotions |
| 5 | Bian et al. [17] | Spontaneous expression database | 30184 images from 82 students | Online learning | Learning based emotions |
| 6 | Gupta et al. [18] | DAISEE database | 9068 videos from 112 users | Engagement recognition | Learning based emotions |
| 7 | Lyons et al. [19] | JAFFE | 7 different expressions consisting of 213 images | Acted expressions | Fundamental emotions |
| 8 | Goodfellow et al. [20] | FER-2013 | 35685 images | Collected from the wild | Fundamental emotions |
| 9 | Singh et al. [21] | iSAFE | 395 videos from 44 volunteers | Acted expressions | Fundamental emotions |
| 10 | Kaur et al. [22] | Student Engagement Database | 78 volunteers with 5 min videos | Head pose and eye gaze | Behavioural cues |
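The paper's pipeline was built in MATLAB (Section 5); purely as a hedged illustration of how these databases could be staged for training, the sketch below loads them as labelled image folders in PyTorch. The directory paths and folder layout are hypothetical, and the 128 x 128 x 3 input size is the one reported later in Section 5.

```python
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Frames resized to the 128 x 128 x 3 input reported in Section 5.
transform = T.Compose([T.Resize((128, 128)), T.ToTensor()])

# Hypothetical layout: ISED + iSAFE merged into one 7-class folder tree,
# DAiSEE kept as a separate 4-class tree (boredom/confusion/engaged/frustrated).
ised_isafe = ImageFolder("data/ised_isafe", transform=transform)
daisee = ImageFolder("data/daisee", transform=transform)

# shuffle=True reshuffles every epoch, matching the 'every-epoch' setting of Table 2.
loader = DataLoader(ised_isafe, batch_size=32, shuffle=True)
```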

Figure 1. The residual network

The residual connections are routed after the second batch of layers into the third batch, and similarly the second connection arises from the fourth batch into the fifth batch of layers. The ReLU layers are the interconnection layers providing the activation from the previous group of layers into the next group of layers. Additionally, the input layer has a convolution layer that channels the input images into the network. The final layers are equipped with a pooling layer, a SoftMax layer and the classification layer.

The pooling layer pools all the weights from the various distributions provided by the preceding layers to help in dimensionality reduction, and the SoftMax layer converts the weights into a normalized probability distribution. Based on the distribution, the classification layer provides the class in which the image is classified.

Measuring learners' engagement requires scalability and appearance accessibility. The facial appearance training and the extraction of the features are essential to enhance the understanding of the learner's engagement. This work uses a lightweight ResNet for feature extraction from the faces. Feature extraction and distinction are important tasks for analysing the emotions from the face. This work is divided into three phases: i) data preparation, ii) emotion understanding through the residual network, and iii) validation of results. The fundamental understanding of emotions lies in the orientation change of the facial muscles. The facial muscles controlled by the facial nerve express the emotions of an individual. The human brain transmits the data through the interconnections among the neurons that control the muscle movements used to express emotions.

CNNs generally perform well as shallow networks; as the network grows deeper, the vanishing gradient problem becomes quite obvious, and optimizing the data and parameters in the network becomes quite a tedious task. CNNs with deep layers have, over the decade, contributed to prominent results in the image processing arena. However, issues like excessive training time, gradient vanishing in deep networks and the enormous number of parameters obtained while training are persistent. For these reasons, an upgraded model that especially keeps the vanishing gradients in check is required, and hence the residual network was introduced. Connecting the trained weights of previous layers into the next layers using shortcut connections has achieved great impact in improving accuracy and reducing the time taken for training. In this work a lightweight network using ResNets is used. Fig. 1 explains the model that is used to learn meaningful interpretations from images. Every residual block has a 3x3 convolution layer followed by a batch normalisation layer followed by a ReLU activation. This is further accompanied by a 3x3 convolution layer and a batch normalisation layer. The skip link skips both of these layers and attaches immediately before the ReLU activation. These residual blocks are replicated in order to form a residual network. Deep networks are prone to degradation problems due to vanishing gradients and fitting issues that lead to worse training errors. Intuitively, deep networks need not be "harder" to fit: if there are N layers with maximum data precision, then the identity-mapping layers M(x) following N can simply learn to pass the mapping through, and the network would retain the performance of the efficient N layers.

Figure 2. A residual function

However, it is not simple to drive weights in such a way that they exactly yield an identity mapping. The idea of residual learning leads to a residual function R(x) − x. This is interpreted as a stack of layers that computes the mapping y = R(x) + x, as shown in Fig. 2. To learn y = R(x), the learning can happen directly through F(x), where

F(x) = R(x) − x (1)

Therefore, our underlying mapping is y = M(x), such that the network learns the added weights from the previous layer, where y is represented as F(x) + x:

y = R(x) − x + x (2)

Identity mapping is made convenient by allocating all the weights to 0, so that R(x) = 0 and F(x) = −x, such that y = x is trained. ResNet is defined by its building block, denoted by

y = F(x, {Wi}) + x (3)

where F can be multiple layers. A shortcut operation and element-wise addition are done by the '+' operation, as seen in Fig. 2. No new parameters are added by the shortcut link, and hence training the network in this way does not increase training time through additional parameters that must be trained. But if the dimensionalities are distinct, this is not feasible. To handle the dimensionality approximation, a projection matrix Ws associates x with the same space as F(x). Hence, Eq. (3) can be modified as

y = F(x, {Wi}) + Ws x (4)

Degradation of gradients can be controlled by adding the identity mapping coefficients, and Ws is used for matching the dimensions of the previous layer with the next layer. Multiple convolutional layers are represented with the function F(x, {Wi}). The element-wise addition is performed on two feature maps, channel by channel.
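The paper describes, rather than lists, its MATLAB implementation; the following PyTorch sketch is therefore only our hedged rendering of the block defined by Eqs. (1)-(4): two 3x3 convolution + batch normalization layers, a skip that attaches immediately before the second ReLU, and a projection Ws (realised here, by assumption, as a 1x1 convolution) when dimensions differ.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One group of two conv + batch-norm layers with a skip connection,
    computing y = F(x, {Wi}) + x (Eq. 3) or y = F(x, {Wi}) + Ws*x (Eq. 4)."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        if stride != 1 or in_ch != out_ch:
            # Projection shortcut Ws of Eq. (4), assumed to be a 1x1 convolution.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            self.shortcut = nn.Identity()  # identity mapping of Eq. (3)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # conv-BN-ReLU
        out = self.bn2(self.conv2(out))           # conv-BN, ReLU deferred
        out = out + self.shortcut(x)              # skip attaches before the ReLU
        return self.relu(out)
```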
Table 2 illustrates the details of the various parameter settings used in the network. Eq. (5) explains the CNN process, in which n is the size of the image, p is the padding size, f is the size of the filter, nc is the number of channels, nf is the number of filters and s is the stride. Convolutions are carried out on each image based on n, nc and f, preserving the relation between pixels and creating a matrix of feature maps; s is used to shift the filter over the image pixels with padding p.

[n, n, nc] ∗ [f, f, nc] = [ ⌊(n + 2p − f)/s + 1⌋, ⌊(n + 2p − f)/s + 1⌋, nf ] (5)

Table 2. Parameter setting details

| Parameter name | Value |
|---|---|
| Network layers | 50 |
| InitialLearnRate (Lr) | 1.00E-04 |
| Regularization function | l2Regularization |
| GradientThreshold | 'l2norm' |
| MaxEpochs | 40 |
| MiniBatchSize (Bs) | 32 |
| Verbose frequency | 50 |
| Validation frequency (Vf) | 500 |
| Shuffle | 'every-epoch' |
| Padding direction | 'right' (1,1) |
| Filter size | (3 x 3) |
| Stride | (1,1) |
| Number of filters (per layer) | 8, 16, 32, 64 |
| Optimiser | SGDM |
| Learning rate scheduler | Piecewise |
| Image size | 192 x 192 x 3 |
| Learning drop rate | 60 iterations |
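For concreteness, Eq. (5) can be checked in a few lines of Python; the values plugged in below are the Table 2 settings (192 x 192 x 3 input, 3 x 3 filters, stride 1, 8 filters in the first group, and an assumed padding of 1).

```python
def conv_output_shape(n, nc, f, nf, p, s):
    """Eq. (5): an [n, n, nc] image convolved with nf filters of shape
    [f, f, nc], with padding p and stride s, yields a feature map of shape
    [(n + 2p - f)/s + 1, (n + 2p - f)/s + 1, nf]."""
    m = (n + 2 * p - f) // s + 1
    return (m, m, nf)

print(conv_output_shape(n=192, nc=3, f=3, nf=8, p=1, s=1))  # -> (192, 192, 8)
```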
After two layers of convolution, Eq. (4) is introduced as a shortcut layer to carry information from the previous layers into the next layer. The residual connections help in associating the prediction values that were estimated by the previous layers as an input into the next layer. The residual function computes and matches the actual value with the predicted value; if the value of x is equal to the actual value, then the residual function is zero, resulting in a higher derivative. Along with the residual connections, batch normalization is also carried out in the block to normalize the values to a threshold where the derivatives are not too small to be removed due to least significance. All the layers and parameters mentioned are the outcome of OHP. This tuning helps in assembling the required number of layers to extract meaningful interpretations of mid-level and high-level features after each iteration, creating a pool of weighted probabilities. These probabilities are used to classify the images during validation and testing.

5. Experimentation

The residual network was used on the ISED, iSAFE and Daisee datasets. The network is trained with images of faces of Indian origin. Since these images were created by the authors for an e-learning environment, the same trained network was used to make observations on the online classes being conducted during these pandemic years. All the experiments were carried out on Intel Xeon E3 based workstations with an NVIDIA GeForce GTX graphics card and 32 GB of RAM, and Matlab 2019b was used as the platform to train and validate the network. Images from the testing data are drawn at random and fed into the network, and these images are tested for true positives, false positives, true negatives and false negatives. The images are resized to 128 x 128 x 3 dimensions and fed into the residual network.

For training, 508 images from the ISED and iSAFE datasets train the data for 7 classes. Similarly, 5295 images are used as training data from the Daisee dataset. The network was created from scratch and used to improve the efficiency of emotional understanding. Every time training is carried out on the datasets, the learning rate of the network is set to 1.00E-04; this helps the network learn features from the inception. The network uses a piecewise learning rate scheduler, which enhances learning by periodically decreasing the learning rate and optimizing the network for a higher degree of weight vector distribution. The data is shuffled after each epoch and the mini-batch size was fixed at 32. Fig. 1 shows the details of the network, where two skip connections are introduced; the plain network had 40 layers, while the introduction of the residual layers has increased the number of layers to 50.
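The reported runs used MATLAB's training configuration; an approximate PyTorch counterpart of the Table 2 settings might look as follows. The momentum and step-decay factor are our assumptions, since the paper only names SGDM, an initial rate of 1e-4, L2 regularization, and a piecewise drop every 60 iterations.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in model; the 50-layer shallow residual network sketched in
# Section 4 would be used in its place.
model = nn.Sequential(nn.Flatten(), nn.Linear(128 * 128 * 3, 7))

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9,
                      weight_decay=1e-4)  # SGDM + L2; both factors assumed
# Piecewise schedule: drop the rate every 60 steps (the gamma is assumed).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

MAX_EPOCHS, BATCH_SIZE, VALIDATION_FREQ = 40, 32, 500  # Table 2 values
```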

The number of layers is still smaller than in prominent networks like ResNet 32, ResNet 50 and ResNet 101. The pooling layers help in reducing the dimensionality of the extracted features, while ReLU layers were used in the network as activation functions. The layers were not chosen at random; they were precisely placed after several iterations and changes to the entire network, based on the performance under OHP. Deep networks with a higher number of layers and residual connections took a long time to reach an optimum solution for this data, and their training accuracy was less than 70%. This was due to the increased feature pooling requirement, and the gradient descent issue was a common problem.

The intent is to create this optimized network for emotional understanding across multiple classes. The data available is imbalanced and hence the F1 score is used to determine the efficiency of the algorithm. The training data validation was conducted after every 500 iterations, which provides information about the training progress; in parallel, the loss function was also measured. The details of the training are provided in Fig. 3.

Figure 3. Training & validation accuracy for the Daisee dataset

The Stochastic Gradient Descent for the function at x is

f(x) = (1/n) ∑_{i=1}^{n} fi(x) (6)

Stochastic gradient descent (SGD) reduces the computational cost at each iteration by using an unbiased estimate of the gradient ∇f(x). At each iteration, SGD uniformly samples an index i ∈ {1, …, n} at random, and the gradient ∇fi(x) is computed to update x:

x ← x − η∇fi(x) (7)

Replacing η with a learning rate η(t) that is time dependent improves control of the convergence rate of the algorithm and should produce optimized outcomes. The piecewise constant for a time-based SGD is defined as in Eq. (8); η(t) reduces the learning rate whenever the progress towards optimization is not improving.

η(t) = ηi if ti ≤ t ≤ ti+1 (8)
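To make Eqs. (6)-(8) concrete, here is a self-contained toy run (the objective and the rates are made up, not from the paper): one random index i is sampled per step as in Eq. (7), and the learning rate is dropped once, piecewise, as in Eq. (8).

```python
import random

# Minimise f(x) = (1/n) * sum_i (x - a_i)^2, Eq. (6), whose minimum is the
# mean of a; each step uses the gradient of a single sampled term f_i.
a = [1.0, 2.0, 3.0, 4.0]             # f is minimised at x = 2.5

def grad_fi(x, i):                   # gradient of f_i(x) = (x - a_i)^2
    return 2.0 * (x - a[i])

x, eta = 0.0, 0.1
for t in range(1, 201):
    if t == 100:                     # piecewise drop of eta(t), Eq. (8)
        eta = 0.01
    i = random.randrange(len(a))     # uniform sample of i in {1, ..., n}
    x = x - eta * grad_fi(x, i)      # update rule of Eq. (7)
print(round(x, 2))                   # close to 2.5
```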
Eq. (9) is used to calculate the efficiency of the classifier, based on the confusion matrix generated by the classifier on the test data. The Kappa statistic is a measure that calculates how closely the instances classified by the algorithm match the data labelled as ground truth, while controlling for the accuracy of a random classifier as measured by the expected accuracy.

Kappa (κ) = (po − pe) / (1 − pe) (9)

where po is the observed agreement and pe is the expected agreement. It reports the performance of the classifier relative to a classification generated merely by guessing at random according to the frequency of each class. Accuracy, precision, recall, sensitivity, specificity and F1 score are used as metrics to gauge the performance of the classifier for each class on the three datasets. The iSAFE and ISED datasets are combined into a single database with 7 classes, and the Daisee dataset uses 4 classes. Cohen's Kappa is also used to measure the efficiency of the classifier since the data is multiclass and imbalanced.
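Eq. (9) is straightforward to compute directly from a confusion matrix; the sketch below (with a made-up 3-class matrix) derives the observed agreement po from the diagonal and the chance agreement pe from the row and column marginals.

```python
import numpy as np

def cohens_kappa(cm):
    """Cohen's kappa (Eq. 9) from a confusion matrix: observed agreement
    po versus the agreement pe expected from a random classifier that
    follows the class frequencies."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    po = np.trace(cm) / total                       # observed agreement
    pe = (cm.sum(0) * cm.sum(1)).sum() / total**2   # expected agreement
    return (po - pe) / (1.0 - pe)

# Made-up 3-class confusion matrix, rows = actual, columns = predicted:
print(round(cohens_kappa([[50, 2, 3], [4, 40, 6], [5, 5, 45]]), 3))
```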

6. Results and discussions

The training samples are shuffled and taken at random after every epoch to reduce over-fitting or under-fitting issues. The Training Accuracy (T) and Validation Accuracy (V) are plotted and visualized as the efficiency index of the model after every iteration; Fig. 3 illustrates the training and validation results acquired using the residual network model. The validation frequency (Vf) is an important parameter as it tests the model's training efficiency at regular intervals; the value of Vf is set to every 500 iterations for this model. A weighted average of the scores independently derived from individual layers using the posterior class probabilities is accumulated to improve the learning rate and reduce the validation loss.

These weights are then trained for face images on a cross dataset, which helps in reducing the blindness of the model to newer data. Comparing with the other methods mentioned in Table 4, EmoNet performs with an accuracy higher than the majority of the earlier work. In many of the datasets' individual classifications there have been instances where emotions were wrongly classified; however, the individualistic and holistic approach has enhanced the model's performance. To shed light on this classification model, the metrics accuracy, precision, recall, specificity and F1-score are calculated for EmoNet on the datasets, and the results are furnished in Table 3.

Table 3. Metric table

| Data | Class | Acc | Pre | Sen | Spe | F1 |
|---|---|---|---|---|---|---|
| ISED | Anger | 95.55 | 0.91 | 0.83 | 0.99 | 0.87 |
| | Disgust | 96.53 | 0.86 | 0.92 | 0.97 | 0.89 |
| | Fear | 95.63 | 0.91 | 0.91 | 0.99 | 0.91 |
| | Happy | 97.67 | 0.93 | 0.93 | 0.99 | 0.93 |
| | Neutral | 98.03 | 1.00 | 0.87 | 1.00 | 0.93 |
| | Sad | 94.34 | 0.86 | 1.00 | 0.97 | 0.92 |
| | Surprise | 96.67 | 0.93 | 0.93 | 0.99 | 0.93 |
| Daisee | Boredom | 96.61 | 0.88 | 0.96 | 0.97 | 0.92 |
| | Confusion | 94.21 | 0.93 | 0.85 | 0.98 | 0.89 |
| | Engaged | 96.61 | 0.92 | 0.97 | 0.96 | 0.94 |
| | Frustrated | 99.13 | 1.00 | 0.97 | 1.00 | 0.98 |

Acc = Accuracy, Pre = Precision, Sen = Sensitivity, Spe = Specificity, F1 = F1-score.
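The per-class columns of Table 3 follow the usual one-vs-rest definitions; as a hedged sketch of how such numbers are derived from a multiclass confusion matrix (rows assumed to be actual classes, columns predicted):

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class Accuracy, Precision, Sensitivity (recall), Specificity
    and F1 from a multiclass confusion matrix, one-vs-rest."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class but actually other
    fn = cm.sum(axis=1) - tp          # actually class but predicted other
    tn = cm.sum() - tp - fp - fn
    acc = (tp + tn) / cm.sum()
    pre = tp / (tp + fp)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    f1 = 2 * pre * sen / (pre + sen)
    return acc, pre, sen, spe, f1
```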

The number of epochs used to train the network was set to 40. The network underwent numerous trial-and-error runs to fix the hyperparameters. Moreover, decisive conclusions on the number of iterations were based on the loss function. Cross validation was performed on the dataset, where the data was split into 70% training and 30% validation data, and the progress of the network was tracked to evaluate overfitting. At each Vf the network validates on the held-out validation data.
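As an illustration of this split (not the authors' code), a 70% / 30% partition can be produced as below; the list stands in for the hypothetical image dataset sketched in Section 3, and the seed is our assumption, used only to make the shuffle reproducible.

```python
import torch
from torch.utils.data import random_split

dataset = list(range(1000))  # stand-in for the ImageFolder sketched earlier

n_train = int(0.7 * len(dataset))
train_set, val_set = random_split(
    dataset, [n_train, len(dataset) - n_train],
    generator=torch.Generator().manual_seed(42))
print(len(train_set), len(val_set))  # 700 300
```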
The network is consistent, and there is a close correlation between the validation accuracy and the training accuracy. Based on this training, the model achieved 84.97% validation accuracy on the Daisee dataset and 88.76% on the ISED & iSAFE datasets combined. Residual connections establish a prominent learning framework for the emotions in these datasets.

Figure 4. Confusion matrix for the ISED & iSAFE dataset

Figure 5. Confusion matrix for the Daisee dataset

Fig. 4 and Fig. 5 provide the details of the confusion matrices (CM) acquired over the two datasets, derived as an outcome of the classifier on images that were not used while training the network. In addition to the CM, the accuracy, precision, recall, sensitivity and F1 score of each class are calculated. The results prove that the residual connections introduced to the CNN have created a vital impact in improving the outcomes for detecting and recognizing emotions. The matrices also display the percentage of false positives and false negatives for each class, which provides a deep understanding of the performance on each class in the data.

Table 4 displays the findings obtained on the ISED, iSAFE and Daisee databases by the different models used for emotional identification. Compared to recent works using CNNs and conventional methods, the new model has been effective. The F1-measure used to assess the classifier is the harmonic mean of precision and recall. These metrics, beyond the conventional method of analysing accuracy from the classification data, also elicit the performance of the classifiers on different data. This model is also compared with a traditional CNN model with 40 layers: the accuracy on the ISED data was recorded as 90.53% and on the iSAFE database as 91.78%. Improvement in the results was found when adding residual functions to the existing EmoNet model.

The proposed model has a good ranking score based on Table 4.

Table 4. Performance comparison of various available models

| Sl no | Method | Dataset | Acc (%) |
|---|---|---|---|
| 1 | CNN [23] | ISED | 51.6 |
| 2 | CNN [23] | ISED | 59.3 |
| 3 | Inception V3 [23] | ISED | 47.9 |
| 4 | EmotionNet 2 [23] | ISED | 21.0 |
| 5 | EmotionalDAN [23] | ISED | 62.0 |
| 6 | CNN [24] | ISED | 82.9 |
| 7 | Local prominent directional patterns [25] | ISED | |
| | i. LBP (Local Binary Pattern) | | 76.47 |
| | ii. LDP (Local Directional Pattern) | | 74.61 |
| | iii. LDN (Local Directional Number) | | 75.85 |
| | iv. LPTP (Local Directional Ternary Pattern) | | 72.46 |
| | v. PTP (Positional Ternary Pattern) | | 76.16 |
| | vi. HOG (Histogram of Gradients) | | 76.75 |
| | vii. LPDP (Local Prominent Directional Pattern) | | 77.80 |
| | viii. LPDPf (Local Prominent Directional Pattern f) | | 78.32 |
| 8 | Landmark Detection [26] | ISED | 34 |
| 9 | Local directional-structural pattern [27] | ISED | 77.78 |
| 10 | LDP+KPCA [28] | Daisee | 90.89 |
| 11 | Hybrid CNN [29] | Daisee | 86 |
| 12 | Deep Engagement Recognition Network [30] | Daisee | 57.9 |
| 13 | Very Deep Convolutional Network [31] | Daisee | 92.33 |
| 14 | Proposed Model | ISED & iSAFE | 91.3 |
| 15 | Proposed Model | Daisee | 93.44 |

All of the other models use state-of-the-art deep learning networks or traditional methods such as Inception V3, CNNs, and the well-known local directional patterns; the proposed model exceeds the precision of all models that have used the ISED database. Conventional methods, CNNs, hybrid CNNs and many more recent works have been prominent in enhancing the efficiency of emotional analysis, especially in the field of student engagement and attention estimation. Based on Eq. (9), the classifier's accuracy on both datasets is calculated from the confusion matrix. In Table 3, the various parametric results show that the network was able to achieve a significant outcome using the residual connections. The average results for the combined ISED and iSAFE dataset are: accuracy 91.3%, error 8.7%, sensitivity 91.34%, specificity 98.56%, precision 91.35%, a false positive rate of 1.44%, and a Kappa coefficient of 0.65, which marks a substantial classifier for the database. Similarly, on the Daisee dataset the overall results are: accuracy 93.44%, error rate 6.56%, sensitivity 93.63%, specificity 97.82%, precision 93.37%, a false positive rate of 2.18%, and a Kappa coefficient of 0.825, which marks a near-perfect classifier for the database. The proposed method has attained promising results due to the large number of images used for training and the optimised layers that enrich the shallow network to learn, understand and capture features. The features learned using the network make it sophisticated for classification with higher accuracy. The feature maps created after each layer that uses the activation functions enhance the network by reducing the vanishing gradient and overfitting issues that are prominent in traditional CNNs. The network is compared with deeper, commonly used trained networks: Table 4 lists the methods used by various authors and the accuracy each has achieved on the datasets. Though there are different classes in each of the datasets, our model has attained a significant improvement in all parameters used to measure the tangibility of introducing residual connections into a conventional CNN.

There are three observations about the network's performance. Firstly, degradation issues have rapidly decreased due to the lower training error rate, which was observed to be 0.265; the reduced training error improves the efficiency of the learning due to the optimum depth of the network. Secondly, the identity connections shown in Fig. 1 have helped significantly decrease the time complexity for training and validation by 30%. Thirdly, the network uses the SGD solver and is able to find good solutions. Though the network is shallow, the gradient descent algorithm works on batches of smaller sizes, and this enables the network to train on smaller batches and create multiple layers of features. These features are the crux of the classification unit, creating probabilities on the weighted layers by accurately turning on the exact neurons to provide precise and accurate results.

Two-fold cross validation provides visibility into the network's validation outcome during the training phase. This provides closer detail on the performance of the network. Incorporating the Optimized Hyper Parametric settings that are discussed in Section 4 has cushioned the network to be customized for better performance.

7. Conclusions

The proposed work evaluates the effect of including residual layers in an existing CNN model. CNN models are prone to vanishing gradients and loss in accuracy as the networks grow deep. The residual connections on the shallow network are designed from scratch specifically for the purpose of detecting students' engagement on an e-learning platform. Both behavioural classes like boredom, engaged, frustrated and confused, along with emotional classes, are considered in this work. This work utilizes students' facial features to predict and classify images in the wild to calculate the accuracy of the proposed approach. In this model, the network uses residual connections from previous layers into the next layers to improve the learning and classification response of the system. Two-fold cross validation is used to understand the capability of the model. The network is trained on three Indian datasets indigenously, to be able to detect the emotional and behavioural intent of the students. The total number of layers used in the network is 50. The shallow network yields an efficient learning model that is able to validate images at an average of 86.87% during training. The model is tested for detection efficiency with test data and compared with state-of-the-art models built with a CNN as the primary network. The usage of residual connections and Optimized Hyper Parametric settings has considerably enhanced the performance in creating and using the network for an Indian-face-based emotional classification model.

The furnished results in Table 3 and Table 4, based on Fig. 4 and Fig. 5, show that the model outperforms the other state-of-the-art techniques. The classifier's performance is also evaluated with the help of the Kappa score: the network performs well and the classifier is diligently able to perform at 82.51%. The proposed model is evaluated using standard evaluation metrics and shows improvements close to 2% in various parameters. As a future enhancement, the proposed model will be tested for group engagement detection, evaluating the valence and arousal of the group.

Conflicts of Interest

The authors declare no conflict of interest.

Author Contributions

The contributions by the authors for this research article are as follows: "conceptualization, methodology, formal analysis and writing—original draft preparation, Michael Moses Thiruthuvanathan; result validation, resources, formal analysis, writing—review and editing, and supervision, Balachandran Krishnan; data curation, result validation and ethical inference, Madhavi Rangaswamy".

Acknowledgments

The authors wish to acknowledge the technical and infrastructure help rendered by the faculty members of the Department of Computer Science and Engineering, CHRIST (Deemed to be University), Bangalore, India.

References

[1] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild", IEEE Transactions on Affective Computing, Vol. 10, No. 1, pp. 18-31, 2019.
[2] T. Ashwin and R. Guddeti, "Affective Database for E-Learning and Classroom Environments using Indian Students' Faces, Hand Gestures and Body Postures", Future Generation Computer Systems, Vol. 108, No. 1, pp. 334-348, 2020.
[3] A. Dhall, R. Goecke, J. Joshi, M. Wagner, and T. Gedeon, "Emotion Recognition in the Wild Challenge 2013", In: Proc. of 15th International Conf. on Multimodal Interaction, ACM, pp. 509-516, 2013.
[4] I. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, and H. Lee, "Challenges in Representation Learning: A Report on Three Machine Learning Contests", In: Proc. of International Conf. on Neural Information Processing, Vol. 64, No. 1, pp. 59-63, 2015.
[5] A. Mollahosseini, B. Hasani, M. J. Salvador, H. Abdollahi, D. Chan, and M. H. Mahoor, "Facial Expression Recognition from World Wild Web", In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, Vol. 1, pp. 168-195, 2016.
[6] S. Zafeiriou, A. Papaioannou, I. Kotsia, M. Nicolaou, and G. Zhao, "Facial Affect 'In-the-Wild': A Survey and a New Database", In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1487-1498, 2016.

Wild”: A Survey and a New Database”, In: Proc. [17] C. Bian, Y. Zhang, F. Yang, W. Bi, and W. Lu,
of IEEE Conf. on Computer Vision and Pattern “Spontaneous Facial Expression Database for
Recognition Workshops (CVPRW), pp. 1487- Academic Emotion Inference in Online
1498, 2016. Learning”, IET Computer Vision. Vol. 13, No. 3,
[7] L. Rothkrantz, “Dropout Rates of Regular pp. 329–337, 2018.
Courses and MOOCs”, In: Proc. of International [18] A. Gupta, A. D’Cunha, K. Awasthi, and V.
Conf. on Computer Supported Education, Rome, Balasubramanian, “Daisee: Towards User
pp. 25-46, 2016. Engagement Recognition in the Wild”, arXiv
[8] M. Li, H. Xu, X. Huang, Z. Song, X. Liu and X. preprint arXiv: 1609.01885, 2016.
Li, “Facial Expression Recognition with Identity [19] M. J. Lyons, S. Akamatsu, M. Kamachi, J.
and Emotion Joint Learning”, IEEE Gyoba, J. Budynek, “The Japanese female facial
Transactions on Affective Computing, Vol. 4, expression (JAFFE) database”, In: Proc. of
No. 8, pp. 411-416, 2018. Third International Conf. on Automatic Face
[9] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. and Gesture Recognition, pp. 14–16, 1998.
Trinh, and J. F. Cohn, “Disfa: A Spontaneous [20] I. Goodfellow, D. Erhan, P. L. Carrier, A.
Facial Action Intensity Database”, IEEE Courville, M. Mirza, B. Hamner, W. Cukierski,
Transactions on Affective Computing, Vol. 4, Y. Tang, D. Thaler, D.-H. Lee, Y. Zhou, C.
No. 2, pp. 151–160, 2013. Ramaiah, F. Feng, R. Li, X. Wang, D.
[10] T. Ashwin and R. Guddeti, “Unobtrusive Athanasakis, J. Shawe-Taylor, M. Milakov, J.
Students’ Engagement Analysis in Computer Park, R. Ionescu, M. Popescu, C. Grozea, J.
Science Laboratory Using Deep Learning Bergstra, J. Xie, L. Romaszko, B. Xu, Z. Chuang,
Techniques”, In: Proc. of IEEE 18th and Y. Bengio. “Challenges in representation
International Conf. on Advanced Learning learning: A report on three machine learning
Technologies (ICALT), pp. 436–440, 2018. contests”, Neural Information Processing, Vol.
[11] S. Setty, M. Husain, P. Beham, J. Gudavalli, M. 8228, pp. 117-124, 2013.
Kandasamy, R. Vaddi, V.Hemadri, J. Karure, R. [21] S. Singh and S. Benedict, “Indian Semi-Acted
Raju, and B. Rajan, “Indian Movie Face Facial Expression (iSAFE) Dataset for Human
Database: A Benchmark for Face Recognition Emotions Recognition”, Advances in Signal
under Wide Variations”, In: Proc. of Fourth Processing and Intelligent Recognition Systems.
National Conf. on Computer Vision, Pattern SIRS Communications in Computer and
Recognition, Image Processing and Graphics Information Science, Vol 1209, No. 1, pp. 150-
(NCVPRIPG), IEEE, 2013, pp. 1–5, 2013. 162, 2019.
[12] S. Patwardhan and G. M. Knapp, “Affect [22] A. Kaur, A. Mustafa, L. Mehta and A. Dhall,
Intensity Estimation Using Multiple Modalities”, “Prediction and Localization of Student
In: Proc. of Florida Artificial Intelligence Engagement in the Wild”, Digital Image
Research Society Conf., pp. 130-133, 2014. Computing: Techniques and Applications
[13] R. Noorden, “The ethical questions that haunt (DICTA), Canberra, Australia, pp. 1-8, 2018.
facial-recognition research”, Nature. Vol. 587: [23] I. Tautkute, T. Trzcinski, and A. Bielski, “I
pp. 354-358, 2020. Know How You Feel: Emotion Recognition
[14] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, with Facial Landmarks”, In: Proc. of IEEE Conf.
“Collecting Large, Richly Annotated Facial- on Computer Vision and Pattern Recognition
Expression Databases from Movies”, IEEE Workshops (CVPRW), pp. 1959-1974, 2018.
MultiMedia, Vol. 19, No. 3, pp. 34-41, 2012. [24] S. Gonzalez-Lozoya, J. Calleja and L. Pellegrin,
[15] S. Happy, P. Patnaik, A. Routray, and R. Guha, H.Escalante, Ma. Medina and A. Benitez-Ruiz,
“The Indian Spontaneous Expression Database “Recognition of Facial Expressions based on
for Emotion Recognition”, IEEE Transactions CNN Features”, Multimedia Tools &
on Affective Computing, Vol. 8, No. 1, pp. 131- Application, Vol. 79, pp. 13987–14007, 2020.
142, 2017. [25] F. Makhmudkhujaev, M. Abdullah-Al-Wadud,
[16] T.Sapinski, D. Kaminska, A. Pelikant, C. M. Iqbal, B. Ryu, and O. Chae, “Facial
Ozcinar, E.Avots, and G. Anbarjafari. Expression Recognition with Local Prominent
“Multimodal Database of Emotional Speech, Directional Pattern”, Signal Processing: Image
Video and Gestures”, In: Proc. of International Communication, Vol. 74, No. 1, pp, 1-12, 2019.
Conf. on Pattern Recognition Information [26] S. Engoor, S. SendhilKumar, C. Hepsibah
Forensics, pp. 153–163, 2018. Sharon, and G. S. Mahalakshmi, “Occlusion-
aware Dynamic Human Emotion Recognition
International Journal of Intelligent Engineering and Systems, Vol.14, No.2, 2021 DOI: 10.22266/ijies2021.0430.21
Received: October 13, 2020. Revised: December 24, 2020. 247

[27] A. Rivera, J. Rojas Castillo, and O. Oksam Chae, "Local Directional Number Pattern for Face Analysis: Face and Expression Recognition", IEEE Transactions on Image Processing, Vol. 22, No. 5, pp. 1740-1752, 2013.
[28] M. Dewan, F. Lin, D. Wen, M. Murshed, and Z. Uddin, "A Deep Learning Approach to Detecting Engagement of Online Learners", In: Proc. of IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation, pp. 1895-1902, 2018.
[29] T. Ashwin and R. Guddeti, "Automatic Detection of Students' Affective States in Classroom Environment Using Hybrid Convolutional Neural Networks", Education and Information Technologies, Vol. 25, No. 1, pp. 1387-1415, 2020.
[30] O. Nezami, M. Dras, L. Hamey, D. Richards, S. Wan, and C. Paris, "Automatic Recognition of Student Engagement Using Deep Learning and Facial Expression", In: Proc. of Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Artificial Intelligence, Springer, Part III, Vol. 11908, pp. 273-289, 2020.
[31] T. Huang, Y. Mei, H. Zhang, S. Liu, and H. Yang, "Fine-grained Engagement Recognition in Online Learning Environment", In: Proc. of IEEE 9th International Conf. on Electronics Information and Emergency Communication (ICEIEC), Beijing, China, pp. 338-341, 2019.
