
Automatic Recognition of Student Engagement using Deep Learning and Facial Expression

Omid Mohamad Nezami1,2, Mark Dras1, Len Hamey1, Deborah Richards1,
Stephen Wan2, and Cécile Paris2

1 Macquarie University, Sydney, NSW, Australia
omid.mohamad-nezami@hdr.mq.edu.au
{mark.dras,len.hamey,deborah.richards}@mq.edu.au
2 CSIRO's Data61, Sydney, NSW, Australia
{stephen.wan,cecile.paris}@data61.csiro.au

Abstract. Engagement is a key indicator of the quality of learning experience, and one that plays a major role in developing intelligent educa-
tional interfaces. Any such interface requires the ability to recognise the
level of engagement in order to respond appropriately; however, there is
very little existing data to learn from, and new data is expensive and dif-
ficult to acquire. This paper presents a deep learning model to improve
engagement recognition from images that overcomes the data sparsity
challenge by pre-training on readily available basic facial expression data,
before training on specialised engagement data. In the first of two steps,
a facial expression recognition model is trained to provide a rich face rep-
resentation using deep learning. In the second step, we use the model’s
weights to initialize our deep learning based model to recognize engage-
ment; we term this the engagement model. We train the model on our
new engagement recognition dataset with 4627 engaged and disengaged
samples. We find that the engagement model outperforms effective deep
learning architectures that we apply for the first time to engagement
recognition, as well as approaches using histogram of oriented gradients
and support vector machines.

Keywords: Engagement · Deep Learning · Facial Expression.

1 Introduction
Engagement is a significant aspect of human-technology interactions and is de-
fined differently for a variety of applications such as search engines, online gaming
platforms, and mobile health applications [28]. According to Monkaresi et al. [25],
most definitions describe engagement as attentional and emotional involvement
in a task.
This paper deals with engagement during learning via technology. Investigat-
ing engagement is vital for designing intelligent educational interfaces in different
learning settings including educational games [14], massively open online courses
(MOOCs) [18], and intelligent tutoring systems (ITSs) [1]. For instance, if stu-
dents feel frustrated and become disengaged (see disengaged samples in Fig. 1),

Fig. 1. Engaged (left) and disengaged (right) samples collected in our studies. We
blurred the children's eyes for ethical reasons, even though we had their parents'
consent at the time.

the system should intervene in order to bring them back to the learning process.
However, if students are engaged and enjoying their tasks (see engaged samples
in Fig. 1), they should not be interrupted even if they are making some mistakes
[19]. In order for the learning system to adapt the learning setting and provide
proper responses to students, we first need to automatically measure engage-
ment. This can be done by, for example, using context performance [1], facial
expression [35] and heart rate [25] data. Recently, engagement recognition us-
ing facial expression data has attracted special attention because of widespread
availability of cameras [25].
This paper aims at quantifying and characterizing engagement using facial
expressions extracted from images. In this domain, engagement detection models
usually use typical facial features which are designed for general purposes, such
as Gabor features [35], histogram of oriented gradients [18] and facial action
units [4]. To the best of the authors’ knowledge, there is no work in the litera-
ture investigating the design of specific and high-level features for engagement.
Therefore, providing a rich engagement representation model to distinguish en-
gaged and disengaged samples remains an open problem (Challenge 1). Training
such a rich model requires a large amount of data which means extensive effort,
time, and expense would be required for collecting and annotating data due to
the complexities [3] and ambiguities [28] of the engagement concept (Challenge
2).
To address the aforementioned challenges, we design a deep learning model
which includes two essential steps: basic facial expression recognition, and en-
gagement recognition. In the first step, a convolutional neural network (CNN)
is trained on the dataset of the Facial Expression Recognition Challenge 2013
(FER-2013) to provide a rich facial representation model, achieving state-of-
the-art performance. In the next step, the model is applied to initialize our
engagement recognition model, designed using a separate CNN, learned on our
newly collected dataset in the engagement recognition domain. As a solution to
Challenge 1, we train a deep learning-based model that provides our representa-
tion model specifically for engagement recognition. As a solution to Challenge 2,
we use the FER-2013 dataset, which is around eight times larger than our col-
lected dataset, as external data to pre-train our engagement recognition model
and compensate for the shortage of engagement data (our code and trained models are publicly available from https://github.com/omidmnezami/Engagement-Recognition). The contributions of this
work are threefold:

– To the authors’ knowledge, the work in this paper is the first time a rich face
representation model has been used to capture basic facial expressions and
initialize an engagement recognition model, resulting in positive outcomes.
This shows the effectiveness of applying basic facial expression data in order
to recognize engagement.
– We have collected a new dataset we call the Engagement Recognition (ER)
dataset to facilitate research on engagement recognition from images. To
handle the complexity and ambiguity of the engagement concept, our data is
annotated in two steps, separating the behavioral and emotional dimensions
of engagement. The final engagement label in the ER dataset is the combi-
nation of the two dimensions.
– To the authors’ knowledge, this is the first study which models engagement
using deep learning techniques. The proposed model outperforms a compre-
hensive range of baseline approaches on the ER dataset.

2 Related Work

2.1 Facial Expression Recognition

As a form of non-verbal communication, facial expressions convey attitudes,
affects, and intentions of people. They are the result of movements of muscles and
facial features [11]. Study of facial expressions was started more than a century
ago by Charles Darwin [10], leading to a large body of work in recognizing basic
facial expressions [11,31]. Much of the work uses a framework of six ‘universal’
emotions [9]: sadness, happiness, fear, anger, surprise and disgust, with a further
neutral category.
Deep learning models have been successful in automatically recognizing facial
expressions in images [15,23,24,30,37,38,39]. They learn hierarchical structures
from low- to high-level feature representations thanks to the complex, multi-
layered architectures of neural networks. Kahou et al. [17] applied convolutional
neural networks (CNNs) to recognize facial expressions and won the 2013 Emo-
tion Recognition in the Wild (EmotiW) Challenge. Another CNN model, fol-
lowed by a linear support vector machine, was trained to recognize facial expres-
sions by Tang et al. [34]; this won the 2013 Facial Expression Recognition (FER)
challenge [12]. Kahou et al. [16] applied CNNs for extracting visual features ac-
companied by audio features in a multi-modal data representation. Nezami et
al. [27] used a CNN model to recognize facial expressions, where the learned
representation is used in an image captioning model; the model embedded the
recognized facial expressions to generate more human-like captions for images in-
cluding human faces. Yu et al. [37] employed a CNN model that was pre-trained
on the FER-2013 dataset [12] and fine-tuned on the Static Facial Expression in
the Wild (SFEW) dataset [8]. They applied a face detection method to detect
faces and remove noise in their target data samples. Mollahosseini et al. [24]
trained CNN models across different well-known FER datasets to enhance the
generalizability of recognizing facial expressions. They applied face registration
processes, extracting and aligning faces, to achieve better performances. Kim et
al. [20] measured the impact of combining registered and unregistered faces in
this domain. They used the unregistered faces when the facial landmarks of the
faces were not detectable. Zhang et al. [38] applied CNNs to capture spatial infor-
mation from video frames. The spatial information was combined with temporal
information to recognize facial expressions. Pramerdorfer et al. [29] employed a
combination of modern deep architectures such as VGGnet [32] on the FER-2013
dataset. They also achieved the state-of-the-art result on the FER-2013 dataset.

2.2 Engagement Recognition


Engagement has been detected in three different time scales: the entire video
of a learning session, 10-second video clips and images. In the first category,
Grafsgaard et al. [13] studied the relation between facial action units (AUs) and
engagement in learning contexts. They collected videos of web-based sessions
between students and tutors. After finishing the sessions, they requested each
student to fill out an engagement survey used to annotate the student’s engage-
ment level. Then, they used linear regression methods to find the relationship
between different levels of engagement and different AUs. However, their ap-
proach does not characterize engagement in fine-grained time intervals which
are required for making an adaptive educational interface.
As an attempt to solve this issue, Whitehill et al. [35] applied linear support
vector machines (SVMs) and Gabor features, as the best approach in this work,
to classify four engagement levels: not engaged at all, nominally engaged, en-
gaged in task, and very engaged. In this work, the dataset includes 10-second
videos annotated into the four levels of engagement by observers, who are analyz-
ing the videos. Monkaresi et al. [25] used heart rate features in addition to facial
features to detect engagement. They used a face tracking engine to extract facial
features and WEKA (a classification toolbox) to classify the features into en-
gaged or not engaged classes. They annotated their dataset, including 10-second
videos, using self-reported data collected from students during and after their
tasks. Bosch et al. [4] detected engagement using AUs and Bayesian classifiers.
The generalizability of the model was also investigated across different times,
days, ethnicities and genders [5]. Furthermore, in interacting with intelligent
tutoring systems (ITSs), engagement was investigated based on a personalized
model including appearance and context features [1]. Engagement was consid-
ered in learning with massively open online courses (MOOCs) as an e-learning
environment [7]. In such settings, data are usually annotated by observing video
clips or filling self-reports. However, the engagement levels of students can change
during 10-second video clips, so assigning a single label to each clip is difficult
and sometimes inaccurate.
In the third category, HOG features and SVMs have been applied to classify
images using three levels of engagement: not engaged, nominally engaged and
very engaged [18]. This work is based on the experimental results of Whitehill et
al. [35] in preparing engagement samples. Whitehill et al. [35] showed that en-
gagement patterns are mostly recorded in images. Bosch et al. [4] also confirmed
that video clips could not provide extra information by reporting similar perfor-
mances using different lengths of video clips in detecting engagement. However,
competitive performances are not reported in this category.
We focus on the third category to recognize engagement from images. To do
so, we collected a new dataset annotated by Psychology students, who can po-
tentially better recognize the psychological phenomenon of engagement, because
of the complexity of analyzing student engagement. To assist them with recog-
nition, brief training was provided prior to commencing the task and delivered
in a consistent manner via online examples and descriptions. We did not use
crowdsourced labels, which can result in less effective outcomes, as in the work of
Kamath et al. [18]. Furthermore, we captured more effective labels by following
an annotation process to simplify the engagement concept into the behavioral
and the emotional dimensions. We requested annotators to label the dimensions
for each image and make the overall annotation label by combining these. Our
aim is for this dataset to be useful to other researchers interested in detecting
engagement from images. Given this dataset, we introduce a novel model to
recognize engagement using deep learning. Our model includes two important
phases. First, we train a deep model to recognize basic facial expressions. Sec-
ond, the model is applied to initialize the weights of our engagement recognition
model trained using our newly collected dataset.

3 Facial Expression Recognition from Images

3.1 Facial Expression Recognition Dataset

In this section, we use the Facial Expression Recognition 2013 (FER-2013) dataset [12].
The dataset includes images labeled with happiness, anger, sadness, surprise, fear,
disgust, and neutral. It contains 35,887 samples (28,709 for the training set, 3589
for the public test set and 3589 for the private test set), collected by the Google
search API. The samples are in grayscale at the size of 48-by-48 pixels (Fig. 2).

Fig. 2. Examples from the FER-2013 dataset [12] including seven basic facial expres-
sions.

We split the training set into two parts after removing 11 completely black
samples: 3589 for validating and 25,109 for training our facial expression recog-
nition model. To compare with related work [20,29,37], we do not use the public
test set for training or validation, but use the private test set for performance
evaluation of our facial expression recognition model.

3.2 Facial Expression Recognition using Deep Learning


We train the VGG-B model [32], using the FER-2013 dataset, with one less
Convolutional (Conv.) block as shown in Fig. 3. This results in eight Conv. and
three fully connected layers. We also have a max pooling layer after each Conv.
block with stride 2. We normalize each FER-2013 image so that the image has a
mean 0.0 and a norm 100.0 [34]. Moreover, for each pixel position, the pixel value
is normalized to mean 0.0 and standard-deviation 1.0 using our training part.
Our model has a performance similar to the work of Pramerdorfer et al. [29],
which reports the state-of-the-art result on the FER-2013 dataset. The model's output layer
has a softmax function generating the categorical distribution probabilities over
seven facial expression classes in FER-2013. We aim to use this model as a part
of our engagement recognition model.
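
For concreteness, the normalization above can be sketched as follows. This is an illustrative NumPy snippet rather than the authors' released code; it assumes the FER-2013 images are stored in a float array of shape (N, 48, 48) and that the per-pixel statistics come from the training part only.

```python
import numpy as np

def normalize_images(images, pixel_mean=None, pixel_std=None):
    """Per-image scaling to mean 0 / norm 100, then per-pixel standardization."""
    out = images.astype(np.float32).copy()
    for i in range(out.shape[0]):
        out[i] -= out[i].mean()                 # zero mean per image
        norm = np.linalg.norm(out[i])
        if norm > 0:
            out[i] *= 100.0 / norm              # norm 100 per image [34]
    if pixel_mean is None:                      # statistics from the training split
        pixel_mean = out.mean(axis=0)
        pixel_std = out.std(axis=0) + 1e-8
    return (out - pixel_mean) / pixel_std, pixel_mean, pixel_std

# train_n, mean, std = normalize_images(train_images)
# test_n, _, _ = normalize_images(test_images, mean, std)
```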

4 Engagement Recognition from Images


4.1 Engagement Recognition Dataset
Data Collection To recognize engagement from face images, we construct a new
dataset that we call the Engagement Recognition (ER) dataset. The data sam-
ples are extracted from videos of students, who are learning scientific knowledge
and research skills using a virtual world named Omosa [14]. Samples are taken at
a fixed rate instead of random selections, making our dataset samples represen-
tative, spread across both subjects and time. In the interaction with Omosa, the

Fig. 3. The architecture of our facial expression recognition model adapted from VGG-
B framework [32]. Each rectangle is a Conv. block including two Conv. layers. The max
pooling layers are not shown for simplicity.

goal of students is to determine why a certain animal kind is dying out by talking
to characters, observing the animals and collecting relevant information (Fig. 4,
top). After collecting notes and evidence, students are required to complete a
workbook (Fig. 4, bottom).
The videos of students were captured from our studies in two public sec-
ondary schools involving twenty students (11 girls and 9 boys) from Years 9
and 10 (aged 14–16), whose parents agreed to their participation in our ethics-
approved studies. We collected the videos from twenty individual sessions of
students recorded at 20 frames per second (fps), resulting in twenty videos and
totalling around 20 hours. After extracting video samples, we applied a con-
volutional neural network (CNN) based face detection algorithm [21] to select
samples including detectable faces. The face detection algorithm could not detect
faces in a small number of samples (less than 1%) due to high face occlusion
(Fig. 5). We removed the occluded samples from the ER dataset.

Data Annotation We designed custom annotation software to request annota-
tors to independently label 100 samples each. The samples are randomly selected
from our collected data and are displayed in different orders for different anno-

Fig. 4. The interactions of a student with Omosa [14], captured in our studies.

Fig. 5. Examples without detectable faces because of high face occlusions.

tators. Each sample is annotated by at least six annotators (the Fleiss' kappa of the six annotators is 0.59, indicating a high inter-coder agreement). Following ethics
approval, we recruited undergraduate Psychology students to undertake the an-
notation task, who received course credit for their participation. Before starting
the annotation process, annotators were provided with definitions of behavioral
and emotional dimensions of engagement, which are defined in the following
paragraphs, inspired by the work of Aslan et al. [2].
Behavioral dimension:
– On-Task : The student is looking towards the screen or looking down to the
keyboard below the screen.
– Off-Task : The student is looking elsewhere, has their eyes completely closed,
or has turned their head away.
– Can’t Decide: If you cannot decide on the behavioral state.
Emotional dimension:
– Satisfied : If the student is not having any emotional problems during the
learning task. This can include all positive states of the student from being
neutral to being excited during the learning task.
– Confused : If the student is getting confused during the learning task. In some
cases, this state might include some other negative states such as frustration.

Fig. 6. An example of our annotation software where the annotator is requested to
specify the behavioral and emotional dimensions of the displayed sample.

Table 1. The adapted relationship between the behavioral and emotional dimensions
from Woolf et al. [36] and Aslan et al. [2].

Behavioral   Emotional   Engagement
On-task      Satisfied   Engaged
On-task      Confused    Engaged
On-task      Bored       Disengaged
Off-task     Satisfied   Disengaged
Off-task     Confused    Disengaged
Off-task     Bored       Disengaged

– Bored : If the student is feeling bored during the learning task.
– Can’t Decide: If you cannot decide on the emotional state.

During the annotation process, we show each data sample followed by two
questions indicating the engagement’s dimensions. The behavioral dimension can
be chosen among on-task, off-task, and can’t decide options and the emotional
dimension can be chosen among satisfied, confused, bored, and can’t decide op-
tions. In each annotation phase, annotators have access to the definitions to
label each dimension. A sample of the annotation software is shown in Fig. 6. In
the next step, each sample is categorized as engaged or disengaged by combin-
ing the dimensions’ labels using Table 1. For example, if a particular annotator
labels an image as on-task and satisfied, the category for this image from this
annotator is engaged. Then, for each image we use the majority of the engaged
and disengaged labels to specify the final overall annotation. If a sample receives

Fig. 7. Randomly selected images of the ER dataset, including engaged and disengaged samples.

the label of can't decide more than twice (either for the emotional or behavioral
dimensions) from different annotators, it is removed from the ER dataset. Labeling
such samples is a difficult task for annotators, notwithstanding the good
level of agreement that was achieved, and finding ways to reduce this difficulty
remains a future direction of our work. Using this approach, we have
created the ER dataset, consisting of 4627 annotated images: 2290 engaged
and 2337 disengaged.
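
The mapping in Table 1 and the majority vote can be sketched as follows; this is an illustration of the labeling logic only, with the can't-decide filtering omitted.

```python
from collections import Counter

def combine(behavioral, emotional):
    """Table 1: a sample is engaged only when it is on-task and not bored."""
    if behavioral == "on-task" and emotional in ("satisfied", "confused"):
        return "engaged"
    return "disengaged"

def final_label(annotations):
    """annotations: list of (behavioral, emotional) pairs, one per annotator."""
    votes = Counter(combine(b, e) for b, e in annotations)
    return votes.most_common(1)[0][0]

# final_label([("on-task", "satisfied"), ("on-task", "confused"),
#              ("off-task", "bored")])  ->  "engaged"
```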

Dataset Preparation We apply the CNN based face detection algorithm to detect
the face of each ER sample. If there is more than one face in a sample, we
choose the largest face. Then, the face is transformed to grayscale
and resized into 48-by-48 pixels, which is an effective resolution for engagement
detection [35]. Fig. 7 shows some examples of the ER dataset. We split the ER
dataset into training (3224), validation (715), and testing (688) sets, which are
subject-independent (the samples in these three sets are from different subjects).
Table 2 demonstrates the statistics of these three sets.

Table 2. The statistics of ER dataset and its partitions.

State        Total  Train  Valid  Test
Engaged       2290   1589    392   309
Disengaged    2337   1635    323   379
Total         4627   3224    715   688
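
The face extraction step described under Dataset Preparation can be sketched with dlib's CNN face detector [21] and OpenCV; the detector weights file (mmod_human_face_detector.dat) and the exact cropping details are assumptions here, not the authors' exact pipeline.

```python
import cv2
import dlib

# Publicly available dlib CNN face detector weights (assumed path).
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def extract_face(image_bgr):
    """Return a 48x48 grayscale crop of the largest face, or None if no face."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    detections = detector(rgb, 1)
    if len(detections) == 0:
        return None                                   # occluded sample, discarded
    rect = max((d.rect for d in detections),
               key=lambda r: r.width() * r.height())  # keep the biggest face
    face = image_bgr[max(rect.top(), 0):rect.bottom(),
                     max(rect.left(), 0):rect.right()]
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (48, 48))
```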

4.2 Engagement Recognition using Deep Learning


We define two Convolutional Neural Network (CNN) architectures as baselines:
one architecture that we designed and one that is similar in structure to VGGnet [32].
The key model of interest in this paper is a version of the latter baseline that
incorporates facial expression recognition. For completeness, we also include an-
other baseline that is not based on deep learning, but rather uses support vector
machines (SVMs) with histogram of oriented gradients (HOG) features. For all
the models, every sample of the ER dataset is normalized so that it has a zero
mean and a norm equal to 100.0. Furthermore, for each pixel location, the pixel
values are normalized to mean zero and standard deviation one using all ER
training data.

HOG+SVM We trained a method using the histogram of oriented gradients
(HOG) features extracted from ER samples and a linear support vector machine
(SVM), which we call the HOG+SVM model. The model is similar to that
of Kamath et al. [18] for recognizing engagement from images and is used as
a baseline model in this work. HOG [6] applies gradient directions or edge ori-
entations to express objects in local regions of images. For example, in facial
expression recognition tasks, HOG features can represent the forehead’s wrin-
kling by horizontal edges. A linear SVM is usually used to classify HOG features.
In our work, C, which controls the penalty on misclassified training samples in
the SVM objective, is tuned to the value of 0.1 using the validation set of the ER
dataset.
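
As an illustration of this baseline, the sketch below pairs scikit-image HOG features with a linear SVM. Only C = 0.1 comes from the text; the HOG parameters and the placeholder arrays X_train, y_train, X_val, y_val are assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    """images: iterable of 48x48 grayscale arrays."""
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2), block_norm="L2-Hys")
                     for img in images])

# clf = LinearSVC(C=0.1)                    # C tuned on the ER validation set
# clf.fit(hog_features(X_train), y_train)
# print(clf.score(hog_features(X_val), y_val))
```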

Convolutional Neural Network We use the training and validation sets of the
ER dataset to train a Convolutional Neural Network (CNN) for this task from
scratch (the CNN model); this constitutes another of the baseline models in
this paper. The model’s architecture is shown in Fig. 8. The model contains two
convolutional (Conv.) layers, followed by two max pooling (Max.) layers with
stride 2, and two fully connected (FC) layers, respectively. A rectified linear unit
(ReLU) activation function [26] is applied after all Conv. and FC layers. The
last step of the CNN model includes a softmax layer of two neurons, indicating the
engaged and disengaged classes, followed by a cross-entropy loss.
To overcome model over-fitting, we apply a dropout layer [33] after every Conv.
and hidden FC layer. Local response normalization [22] is used after the first
Conv. layer. As the optimizer algorithm, stochastic gradient descent with mini-
batching and a momentum of 0.9 is used. Using Equation 1, the learning rate at
step t (a_t) is decayed by the rate (r) of 0.8 in the decay step (s) of 500. The total
number of iterations from the beginning of the training phase is global step (g).

$a_t = a_{t-1} \times r^{g/s} \qquad (1)$
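
As a small worked example, Equation 1 can be read as TensorFlow-style exponential decay computed from the initial rate; this interpretation is an assumption on our part.

```python
def decayed_learning_rate(a0, global_step, decay_rate=0.8, decay_steps=500):
    """Exponential decay: rate r = 0.8 every s = 500 steps, g = global step."""
    return a0 * decay_rate ** (global_step / decay_steps)

# With an initial rate of 0.002, after 1000 iterations:
# decayed_learning_rate(0.002, 1000) == 0.002 * 0.8**2 == 0.00128
```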

Very Deep Convolutional Neural Network Using the ER dataset, we train a deep
model which has eight Conv. and three FC layers similar to VGG-B architecture
[32], but with two fewer Conv. layers. The model is trained using two different

Fig. 8. The architecture of the CNN Model. We denote convolutional, max-pooling,
and fully-connected layers with “Conv”, “Max”, and “FC”, respectively.

scenarios. Under the first scenario, the model is trained from scratch initial-
ized with random weights; we call this the VGGnet model (Fig. 9), and this
constitutes the second of our deep learning baseline models. Under the second
scenario, which uses the same architecture, the model’s layers, except the soft-
max layer, are initialized by the trained model of Section 3.2, the goal of which
is to recognize basic facial expressions; we call this the engagement model
(Fig. 10), and this is the key model of interest in our paper. In this model, all
layers’ weights are updated and fine-tuned to recognize engaged and disengaged
classes in the ER dataset. For both VGGnet and engagement models, after
each Conv. block, we have a max pooling layer with stride 2. In the models,
the softmax layer has two output units (engaged and disengaged), followed by
a cross-entropy loss. Similar to the CNN model, we apply a rectified linear
unit (ReLU) activation function [26] and a dropout layer [33] after all Conv. and
hidden FC layers. Furthermore, we apply local response normalization after the
first Conv. block. We use the same approaches to optimization and learning rate
decay as in the CNN model.
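
The two scenarios can be sketched in Keras as below. The builder is a simplified stand-in with only two Conv. blocks rather than the full architecture of Fig. 3, and the optimizer settings are illustrative; the part of interest is copying every layer's weights except the softmax layer and then fine-tuning all layers.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_network(num_classes):
    """Simplified VGG-style network; only the softmax layer differs between
    the facial expression model and the engagement model."""
    return models.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

# Step 1: facial expression model trained on FER-2013 (7 classes).
fer_model = build_network(num_classes=7)
# fer_model.compile(...); fer_model.fit(fer_train_x, fer_train_y, ...)

# Step 2: engagement model (2 classes) initialized from the FER weights;
# every layer except the softmax is copied, then all layers are fine-tuned.
eng_model = build_network(num_classes=2)
for src, dst in zip(fer_model.layers[:-1], eng_model.layers[:-1]):
    dst.set_weights(src.get_weights())
eng_model.compile(optimizer=tf.keras.optimizers.SGD(0.002, momentum=0.9),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# eng_model.fit(er_train_x, er_train_y, ...)
```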

5 Experiments
5.1 Evaluation Metrics
In this paper, the performances of all models are reported on both the validation
and test splits of the ER dataset. We use three performance metrics including
classification accuracy, F1 measure and the area under the ROC (receiver oper-
ating characteristics) curve (AUC). In this work, classification accuracy is the
proportion of positive (engaged) and negative (disengaged) samples that are
correctly classified among all test samples (Equation 2).

$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2)$

where $TP$, $TN$, $FP$, and $FN$ are true positive, true negative, false positive, and
false negative, respectively. The F1 measure is calculated using Equation 3.

$F_1 = 2 \times \frac{p \times r}{p + r} \qquad (3)$

Fig. 9. The architecture of the VGGnet model on ER dataset. "Conv" and "FC" are
convolutional and fully connected layers.

Fig. 10. Our facial expression recognition model on FER-2013 dataset (left). The
engagement model on ER dataset (right).

where p is precision defined as $\frac{TP}{TP+FP}$ and r is recall defined as $\frac{TP}{TP+FN}$. AUC
is a popular metric in the engagement recognition task [4,25,35]; it is an unbiased
assessment of the area under the ROC curve. An AUC score of 0.5 corresponds
to chance performance by the classifier, and AUC 1.0 represents the best possible
result.
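
For reference, all three metrics are available in scikit-learn; the sketch below uses toy arrays, with 1 denoting the engaged (positive) class.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true  = [1, 1, 0, 0, 1, 0]               # toy ground-truth labels
y_pred  = [1, 0, 0, 1, 1, 0]               # toy predicted labels
y_score = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]   # toy P(engaged) from the model

print(accuracy_score(y_true, y_pred))      # Equation 2
print(f1_score(y_true, y_pred))            # Equation 3
print(roc_auc_score(y_true, y_score))      # AUC
```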

5.2 Implementation Details


In the training phase, for data augmentation, input images are randomly flipped
along their width and randomly cropped to 48-by-48 pixels (after applying zero-padding,
since the samples are already at this size). Furthermore, they are randomly
rotated by up to a specified maximum angle. We set the learning rate to 0.001 for the
VGGnet model and to 0.002 for the other models. The batch size is set to 32 for the
engagement model and 28 for other models. The best model on the validation
set is used to estimate the performance on the test partition of the ER dataset
for all models in this work.
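
A minimal sketch of this augmentation is given below, assuming NumPy/SciPy; the padding width and maximum rotation angle are assumed values, since the paper does not state them.

```python
import numpy as np
from scipy.ndimage import rotate

def augment(img, pad=4, max_angle_deg=10.0):
    """img: 48x48 grayscale array; pad and max_angle_deg are assumed values."""
    if np.random.rand() < 0.5:                   # random horizontal flip
        img = img[:, ::-1]
    padded = np.pad(img, pad, mode="constant")   # zero-padding
    y = np.random.randint(0, 2 * pad + 1)
    x = np.random.randint(0, 2 * pad + 1)
    crop = padded[y:y + 48, x:x + 48]            # random 48x48 crop
    angle = np.random.uniform(-max_angle_deg, max_angle_deg)
    return rotate(crop, angle, reshape=False, mode="constant")
```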

5.3 Results
Overall Metrics We summarize the experimental results on the validation set of
the ER dataset in Table 3 and on the test set of the ER dataset in Table 4. On the

Table 3. The results of our models (%) on the validation set of ER dataset.

Method       Accuracy    F1      AUC
HOG+SVM        67.69    75.40    65.50
CNN            72.03    74.94    71.56
VGGnet         68.11    70.69    67.85
engagement     77.76    81.18    76.77

Table 4. The results of our models (%) on the test set of ER dataset.

Method       Accuracy    F1      AUC
HOG+SVM        59.88    67.38    62.87
CNN            65.70    71.01    68.27
VGGnet         66.28    70.41    68.41
engagement     72.38    73.90    73.74

validation and test sets, the engagement model substantially outperforms all
baseline models using all evaluation metrics, showing the effectiveness of using a
trained model on basic facial expression data to initialize an engagement recog-
nition model. All deep models including CNN, VGGnet, and engagement
models perform better than the HOG+SVM method, showing the benefit of
applying deep learning to recognize engagement. On the test set, the engage-
ment model achieves 72.38% classification accuracy, which outperforms VG-
Gnet by 5%, and the CNN model by more than 6%; it is also 12.5% better than
the HOG+SVM method. The engagement model achieved 73.90% F1 mea-
sure which is around 3% improvement compared to the deep baseline models and
6% better performance than the HOG+SVM model. Using the AUC metric,
as the most popular metric in engagement recognition tasks, the engagement
model achieves 73.74% which improves the CNN and VGGnet models by
more than 5% and is around 10% better than the HOG+SVM method. There
are similar improvements on the validation set.

Confusion Matrices We show the confusion matrices of the HOG+SVM, CNN,
VGGnet, and engagement models on the ER test set in Table 5, Table 6,
Table 7, and Table 8, respectively. The tables show the proportions of predicted
classes with respect to the actual classes, allowing an examination of recall per
class. It is interesting that the effectiveness of deep models comes through their
ability to recognize disengaged samples compared to the HOG+SVM model.
Disengaged samples have a wider variety of body postures and facial ex-
pressions than engaged samples (e.g., Fig. 1). Due to their complex structures, deep
learning models are more powerful in capturing these wider variations. The VG-
Gnet model, which has a more complex architecture compared to the CNN
model, can also detect disengaged samples with a higher probability. Since we
pre-trained the engagement model on basic facial expression data including
considerable variations of samples, this model is the most effective approach to
recognize disengaged samples, achieving 60.42% recall, which is around a 27%
improvement in comparison with the HOG+SVM model.
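
The row-normalized matrices in Tables 5-8 can be reproduced from model predictions as in the sketch below, using toy labels with 1 denoting engaged.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]          # toy ground-truth labels
y_pred = [1, 1, 0, 0, 1, 0]          # toy model predictions

cm = confusion_matrix(y_true, y_pred, labels=[1, 0]).astype(float)
cm_percent = 100.0 * cm / cm.sum(axis=1, keepdims=True)  # each actual class sums to 100%
print(cm_percent)  # rows: actual engaged/disengaged; columns: predicted
```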

6 Conclusion
Reliable models that can recognize engagement during a learning session, partic-
ularly in contexts where there is no instructor present, play a key role in allowing

Table 5. Confusion matrix of the HOG+SVM model (%).

                        predicted
actual         Engaged   Disengaged
Engaged          92.23         7.77
Disengaged       66.49        33.51

Table 6. Confusion matrix of the CNN model (%).

                        predicted
actual         Engaged   Disengaged
Engaged          93.53         6.47
Disengaged       56.99        43.01

Table 7. Confusion matrix of the VGGnet model (%).

                        predicted
actual         Engaged   Disengaged
Engaged          89.32        10.68
Disengaged       52.51        47.49

Table 8. Confusion matrix of the engagement model (%).

                        predicted
actual         Engaged   Disengaged
Engaged          87.06        12.94
Disengaged       39.58        60.42

learning systems to intelligently adapt to facilitate the learner. There is a short-
age of data for training systems to do this; the first contribution of the paper is
a new dataset, labelled by annotators with expertise in psychology, that we hope
will facilitate research on engagement recognition from visual data. In this paper,
we have used this dataset to train models for the task of automatic engagement
recognition, including for the first time deep learning models. The next contri-
bution has been the development of a model, called the engagement model, that
can address the shortage of engagement data to train a reliable deep learning
model. The engagement model has two key steps. First, we pre-train the model
using basic facial expression data, which is relatively abundant. Second, we
train the model to produce a rich deep learning based representation for en-
gagement, instead of the features and classification methods commonly used in this
domain. We have evaluated this model with respect to a comprehensive range
of baseline models to demonstrate its effectiveness, and have shown that it leads
to a considerable improvement against the baseline models using all standard
evaluation metrics.

References
1. Alyuz, N., Okur, E., Oktay, E., Genc, U., Aslan, S., Mete, S.E., Arnrich, B., Esme,
A.A.: Semi-supervised model personalization for improved detection of learner’s
emotional engagement. In: ICMI. pp. 100–107. ACM (2016)
2. Aslan, S., Mete, S.E., Okur, E., Oktay, E., Alyuz, N., Genc, U.E., Stanhill, D.,
Esme, A.A.: Human expert labeling process (HELP): Towards a reliable higher-order
user state labeling process and tool to assess student engagement. Educational
Technology pp. 53–59 (2017)
3. Bosch, N.: Detecting student engagement: Human versus machine. In: UMAP. pp.
317–320. ACM (2016)

4. Bosch, N., D’Mello, S., Baker, R., Ocumpaugh, J., Shute, V., Ventura, M., Wang,
L., Zhao, W.: Automatic detection of learning-centered affective states in the wild.
In: IUI. pp. 379–388. ACM (2015)
5. Bosch, N., D’mello, S.K., Ocumpaugh, J., Baker, R.S., Shute, V.: Using video to
automatically detect learner affect in computer-enabled classrooms. ACM Trans-
actions on Interactive Intelligent Systems 6(2), 17 (2016)
6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR. vol. 1, pp. 886–893. IEEE (2005)
7. D’Cunha, A., Gupta, A., Awasthi, K., Balasubramanian, V.: Daisee: Towards user
engagement recognition in the wild. arXiv preprint arXiv:1609.01885 (2016)
8. Dhall, A., Goecke, R., Lucey, S., Gedeon, T.: Static facial expression analysis in
tough conditions: Data, evaluation protocol and benchmark. In: ICCV. pp. 2106–
2112 (2011)
9. Ekman, P.: Basic emotions. In: Dalgleish, T., Power, T. (eds.) The Handbook of
Cognition and Emotion, pp. 45–60. John Wiley & Sons, Sussex, UK (1999)
10. Ekman, P.: Darwin and facial expression: A century of research in review. Ishk
(2006)
11. Fasel, B., Luettin, J.: Automatic facial expression analysis: a survey. Pattern recog-
nition 36(1), 259–275 (2003)
12. Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B.,
Cukierski, W., Tang, Y., Thaler, D., Lee, D.H., et al.: Challenges in representation
learning: A report on three machine learning contests. In: ICONIP. pp. 117–124.
Springer (2013)
13. Grafsgaard, J., Wiggins, J.B., Boyer, K.E., Wiebe, E.N., Lester, J.: Automatically
recognizing facial expression: Predicting engagement and frustration. In: Educa-
tional Data Mining 2013 (2013)
14. Jacobson, M.J., Taylor, C.E., Richards, D.: Computational scientific inquiry with
virtual worlds and agent-based models: new ways of doing science to learn science.
Interactive Learning Environments 24(8), 2080–2108 (2016)
15. Jung, H., Lee, S., Yim, J., Park, S., Kim, J.: Joint fine-tuning in deep neural
networks for facial expression recognition. In: ICCV. pp. 2983–2991 (2015)
16. Kahou, S.E., Bouthillier, X., Lamblin, P., Gulcehre, C., Michalski, V., Konda, K.,
Jean, S., Froumenty, P., Dauphin, Y., Boulanger-Lewandowski, N., et al.: Emonets:
Multimodal deep learning approaches for emotion recognition in video. Journal on
Multimodal User Interfaces 10(2), 99–111 (2016)
17. Kahou, S.E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R.,
Vincent, P., Courville, A., Bengio, Y., Ferrari, R.C., et al.: Combining modality
specific deep neural networks for emotion recognition in video. In: ICMI. pp. 543–
550. ACM (2013)
18. Kamath, A., Biswas, A., Balasubramanian, V.: A crowdsourced approach to stu-
dent engagement recognition in e-learning environments. In: WACV. pp. 1–9. IEEE
(2016)
19. Kapoor, A., Mota, S., Picard, R.W., et al.: Towards a learning companion that
recognizes affect. In: AAAI Fall symposium. pp. 2–4 (2001)
20. Kim, B.K., Dong, S.Y., Roh, J., Kim, G., Lee, S.Y.: Fusing aligned and non-aligned
face information for automatic affect recognition in the wild: A deep learning ap-
proach. In: CVPR Workshops. pp. 48–57. IEEE (2016)
21. King, D.E.: Dlib-ml: A machine learning toolkit. Journal of Machine Learning
Research 10(Jul), 1755–1758 (2009)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: NIPS. pp. 1097–1105 (2012)

23. Liu, P., Han, S., Meng, Z., Tong, Y.: Facial expression recognition via a boosted
deep belief network. In: CVPR. pp. 1805–1812 (2014)
24. Mollahosseini, A., Chan, D., Mahoor, M.H.: Going deeper in facial expression
recognition using deep neural networks. In: WACV. pp. 1–10. IEEE (2016)
25. Monkaresi, H., Bosch, N., Calvo, R.A., D’Mello, S.K.: Automated detection of
engagement using video-based estimation of facial expressions and heart rate. IEEE
Transactions on Affective Computing 8(1), 15–28 (2017)
26. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann ma-
chines. In: ICML. pp. 807–814 (2010)
27. Nezami, O.M., Dras, M., Anderson, P., Hamey, L.: Face-cap: Image captioning us-
ing facial expression analysis. In: Joint European Conference on Machine Learning
and Knowledge Discovery in Databases. pp. 226–240. Springer (2018)
28. O’Brien, H.: Theoretical perspectives on user engagement. In: Why Engagement
Matters, pp. 1–26. Springer (2016)
29. Pramerdorfer, C., Kampel, M.: Facial expression recognition using convolutional
neural networks: State of the art. arXiv preprint arXiv:1612.02903 (2016)
30. Rodriguez, P., Cucurull, G., Gonzalez, J., Gonfaus, J.M., Nasrollahi, K., Moeslund,
T.B., Roca, F.X.: Deep pain: Exploiting long short-term memory networks for facial
expression classification. IEEE Transactions on Cybernetics (99), 1–11 (2017)
31. Sariyanidi, E., Gunes, H., Cavallaro, A.: Automatic analysis of facial affect: A sur-
vey of registration, representation, and recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence 37(6), 1113–1133 (2015)
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
33. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: A simple way to prevent neural networks from overfitting. The Jour-
nal of Machine Learning Research 15(1), 1929–1958 (2014)
34. Tang, Y.: Deep learning using linear support vector machines. arXiv preprint
arXiv:1306.0239 (2013)
35. Whitehill, J., Serpell, Z., Lin, Y.C., Foster, A., Movellan, J.R.: The faces of en-
gagement: Automatic recognition of student engagement from facial expressions.
IEEE Transactions on Affective Computing 5(1), 86–98 (2014)
36. Woolf, B., Burleson, W., Arroyo, I., Dragon, T., Cooper, D., Picard, R.: Affect-
aware tutors: recognising and responding to student affect. International Journal
of Learning Technology 4(3-4), 129–164 (2009)
37. Yu, Z., Zhang, C.: Image based static facial expression recognition with multiple
deep network learning. In: ICMI. pp. 435–442. ACM (2015)
38. Zhang, K., Huang, Y., Du, Y., Wang, L.: Facial expression recognition based on
deep evolutional spatial-temporal networks. IEEE Transactions on Image Process-
ing 26(9), 4193–4203 (2017)
39. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning social relation traits from face
images. In: ICCV. pp. 3631–3639 (2015)
