Automatic Recognition of Student Engagement Using Deep Learning and Facial Expression
1 Introduction
Engagement is a significant aspect of human-technology interactions and is de-
fined differently for a variety of applications such as search engines, online gaming
platforms, and mobile health applications [28]. According to Monkaresi et al. [25],
most definitions describe engagement as attentional and emotional involvement
in a task.
This paper deals with engagement during learning via technology. Investigat-
ing engagement is vital for designing intelligent educational interfaces in different
learning settings including educational games [14], massively open online courses
(MOOCs) [18], and intelligent tutoring systems (ITSs) [1]. For instance, if stu-
dents feel frustrated and become disengaged (see disengaged samples in Fig. 1),
Fig. 1. Engaged (left) and disengaged (right) samples collected in our studies. We have blurred the children’s eyes for ethical reasons, even though we had their parents’ consent at the time.
the system should intervene in order to bring them back to the learning process.
However, if students are engaged and enjoying their tasks (see engaged samples
in Fig. 1), they should not be interrupted even if they are making some mistakes
[19]. In order for the learning system to adapt the learning setting and provide
proper responses to students, we first need to automatically measure engage-
ment. This can be done by, for example, using context performance [1], facial
expression [35] and heart rate [25] data. Recently, engagement recognition us-
ing facial expression data has attracted special attention because of widespread
availability of cameras [25].
This paper aims at quantifying and characterizing engagement using facial
expressions extracted from images. In this domain, engagement detection models
usually use typical facial features which are designed for general purposes, such
as Gabor features [35], histogram of oriented gradients [18] and facial action
units [4]. To the best of the authors’ knowledge, there is no work in the litera-
ture investigating the design of specific and high-level features for engagement.
Therefore, providing a rich engagement representation model to distinguish en-
gaged and disengaged samples remains an open problem (Challenge 1). Training
such a rich model requires a large amount of data, and collecting and annotating such data demands extensive effort, time, and expense due to the complexities [3] and ambiguities [28] of the engagement concept (Challenge 2).
To address the aforementioned challenges, we design a deep learning model
which includes two essential steps: basic facial expression recognition, and engagement recognition. Our main contributions are as follows:
– To the authors’ knowledge, this paper is the first to use a rich face representation model to capture basic facial expressions and initialize an engagement recognition model, with positive results.
This shows the effectiveness of applying basic facial expression data in order
to recognize engagement.
– We have collected a new dataset we call the Engagement Recognition (ER)
dataset to facilitate research on engagement recognition from images. To
handle the complexity and ambiguity of the engagement concept, our data is
annotated in two steps, separating the behavioral and emotional dimensions
of engagement. The final engagement label in the ER dataset is the combi-
nation of the two dimensions.
– To the authors’ knowledge, this is the first study which models engagement
using deep learning techniques. The proposed model outperforms a compre-
hensive range of baseline approaches on the ER dataset.
2 Related Work
Kahou et al. [17] applied convolutional neural networks (CNNs) to recognize facial expressions and won the 2013 Emotion Recognition in the Wild (EmotiW) Challenge. Another CNN model, followed by a linear support vector machine, was trained to recognize facial expressions by Tang [34]; this won the 2013 Facial Expression Recognition (FER)
challenge [12]. Kahou et al. [16] applied CNNs for extracting visual features ac-
companied by audio features in a multi-modal data representation. Nezami et
al. [27] used a CNN model to recognize facial expressions, where the learned
representation is used in an image captioning model; the model embedded the
recognized facial expressions to generate more human-like captions for images in-
cluding human faces. Yu et al. [37] employed a CNN model that was pre-trained
on the FER-2013 dataset [12] and fine-tuned on the Static Facial Expression in
the Wild (SFEW) dataset [8]. They applied a face detection method to detect
faces and remove noise in their target data samples. Mollahosseini et al. [24]
trained CNN models across different well-known FER datasets to enhance the
generalizability of recognizing facial expressions. They applied face registration processes, extracting and aligning faces, to achieve better performance. Kim et
al. [20] measured the impact of combining registered and unregistered faces in
this domain. They used the unregistered faces when the facial landmarks of the
faces were not detectable. Zhang et al. [38] applied CNNs to capture spatial infor-
mation from video frames. The spatial information was combined with temporal
information to recognize facial expressions. Pramerdorfer et al. [29] employed a combination of modern deep architectures, such as VGGnet [32], on the FER-2013 dataset and achieved the state-of-the-art result on it.
In the second category, Bosch et al. [4] detected engagement using facial action units (AUs) and Bayesian classifiers.
The generalizability of the model was also investigated across different times,
days, ethnicities and genders [5]. Furthermore, in interacting with intelligent
tutoring systems (ITSs), engagement was investigated based on a personalized
model including appearance and context features [1]. Engagement has also been considered in learning with massively open online courses (MOOCs) as an e-learning environment [7]. In such settings, data are usually annotated by observing video clips or through self-reports. However, the engagement levels of students can change
during 10-second video clips, so assigning a single label to each clip is difficult
and sometimes inaccurate.
In the third category, HOG features and SVMs have been applied to classify
images using three levels of engagement: not engaged, nominally engaged and
very engaged [18]. This work is based on the experimental results of Whitehill et al. [35] in preparing engagement samples. Whitehill et al. [35] showed that engagement patterns are largely captured in still images. Bosch et al. [4] also found that video clips provide little additional information, reporting similar performance for different clip lengths when detecting engagement. However, competitive performance has not been reported in this category.
We focus on the third category to recognize engagement from images. To do
so, we collected a new dataset annotated by psychology students, who are potentially better able to recognize the psychological phenomenon of engagement, given the complexity of analyzing student engagement. To assist them with recognition, brief training was provided prior to commencing the task and delivered in a consistent manner via online examples and descriptions. We did not use crowdsourced labels, which can produce less reliable annotations, as in the work of Kamath et al. [18]. Furthermore, we captured more effective labels by following
an annotation process to simplify the engagement concept into the behavioral
and the emotional dimensions. We asked annotators to label each dimension for each image; the overall annotation label is obtained by combining these. Our
aim is for this dataset to be useful to other researchers interested in detecting
engagement from images. Given this dataset, we introduce a novel model to
recognize engagement using deep learning. Our model includes two important
phases. First, we train a deep model to recognize basic facial expressions. Sec-
ond, the model is applied to initialize the weights of our engagement recognition
model trained using our newly collected dataset.
In this section, we use the facial expression recognition 2013 (FER-2013) dataset [12].
The dataset includes images labeled with one of seven basic expressions: happiness, anger, sadness, surprise, fear,
disgust, and neutral. It contains 35,887 samples (28,709 for the training set, 3589
for the public test set and 3589 for the private test set), collected by the Google
search API. The samples are in grayscale at the size of 48-by-48 pixels (Fig. 2).
Fig. 2. Examples from the FER-2013 dataset [12] including seven basic facial expres-
sions.
We split the training set into two parts after removing 11 completely black
samples: 3589 for validating and 25,109 for training our facial expression recog-
nition model. To compare with related work [20,29,37], we do not use the public
test set for training or validation, but use the private test set for performance
evaluation of our facial expression recognition model.
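As an illustration, this split can be reproduced along the following lines; this is a minimal sketch assuming the standard fer2013.csv release (columns emotion, pixels, and Usage), with the file path and random seed as placeholders rather than the authors' exact procedure.

import numpy as np
import pandas as pd

# Load the standard FER-2013 release (columns: emotion, pixels, Usage);
# the file path is a placeholder.
df = pd.read_csv("fer2013.csv")

# Decode the space-separated pixel strings into 48x48 grayscale images.
df["image"] = df["pixels"].apply(
    lambda s: np.array(s.split(), dtype=np.uint8).reshape(48, 48))

# Drop completely black samples (the paper removes 11 of them).
df = df[df["image"].apply(lambda im: im.max() > 0)]

# The private test set is reserved for evaluation; a validation set of
# 3589 samples is carved out of the original training portion.
train_all = df[df["Usage"] == "Training"]
private_test = df[df["Usage"] == "PrivateTest"]
valid = train_all.sample(n=3589, random_state=0)  # seed is a placeholder
train = train_all.drop(valid.index)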
Fig. 3. The architecture of our facial expression recognition model, adapted from the VGG-B architecture [32]. Each rectangle is a Conv. block including two Conv. layers. The max
pooling layers are not shown for simplicity.
In Omosa [14], a virtual world learning environment, the goal of students is to determine why a certain animal species is dying out by talking to characters, observing the animals, and collecting relevant information (Fig. 4, top). After collecting notes and evidence, students are required to complete a workbook (Fig. 4, bottom).
The videos of students were captured from our studies in two public sec-
ondary schools involving twenty students (11 girls and 9 boys) from Years 9
and 10 (aged 14–16), whose parents agreed to their participation in our ethics-
approved studies. We collected the videos from twenty individual sessions of
students recorded at 20 frames per second (fps), resulting in twenty videos and
totalling around 20 hours. After extracting video samples, we applied a con-
volutional neural network (CNN) based face detection algorithm [21] to select
samples with detectable faces. The face detection algorithm could not detect faces in a small number of samples (less than 1%) due to high face occlusion (Fig. 5); we removed these occluded samples from the ER dataset.
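A sketch of such a filtering step, using dlib's CNN (MMOD) face detector as one possible instantiation of a CNN-based detector built with the dlib toolkit [21]; the weights file name is the one distributed with dlib, and frame_paths is a placeholder.

import cv2
import dlib

# dlib's CNN (MMOD) face detector; the weights file below is distributed
# with dlib and is used here as an assumed stand-in.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def detect_faces(frame_path):
    """Return dlib rectangles for the faces detected in one extracted frame."""
    img = cv2.imread(frame_path)
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return [d.rect for d in detector(rgb, 1)]  # upsample once for small faces

# frame_paths is a placeholder list of extracted frame files; frames whose
# face cannot be detected (e.g. heavy occlusion) are discarded.
kept = [p for p in frame_paths if detect_faces(p)]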
Fig. 4. The interactions of a student with Omosa [14], captured in our studies.
Table 1. The relationship between the behavioral and emotional dimensions, adapted from Woolf et al. [36] and Aslan et al. [2].
During the annotation process, we show each data sample followed by two questions, one for each dimension of engagement. The behavioral dimension is chosen from the options on-task, off-task, and can’t decide, and the emotional dimension from the options satisfied, confused, bored, and can’t decide. In each annotation phase, annotators have access to the definitions to
label each dimension. A sample of the annotation software is shown in Fig. 6. In
the next step, each sample is categorized as engaged or disengaged by combin-
ing the dimensions’ labels using Table 1. For example, if a particular annotator
labels an image as on-task and satisfied, the category for this image from this
annotator is engaged. Then, for each image we use the majority of the engaged
and disengaged labels to specify the final overall annotation. If a sample receives
the label can’t decide more than twice (for either the behavioral or emotional dimension) from different annotators, it is removed from the ER dataset. Labeling such samples is difficult for annotators, notwithstanding the good level of agreement that was achieved, and finding ways to reduce this difficulty remains a direction for future work. Using this approach, we created the ER dataset, consisting of 4627 annotated images: 2290 engaged and 2337 disengaged.
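The aggregation just described can be summarized as follows; this is a minimal sketch in which the TABLE_1 mapping is a placeholder standing in for Table 1 (only the on-task/satisfied case is stated explicitly in the text, so the remaining entries are illustrative assumptions), and the handling of the can't-decide rule is one possible reading.

from collections import Counter

# Placeholder for Table 1: only the first entry is stated in the text;
# the remaining combinations are illustrative assumptions.
TABLE_1 = {
    ("on-task", "satisfied"): "engaged",
    ("on-task", "confused"): "engaged",
    ("on-task", "bored"): "disengaged",
    ("off-task", "satisfied"): "disengaged",
    ("off-task", "confused"): "disengaged",
    ("off-task", "bored"): "disengaged",
}

def aggregate(annotations):
    """annotations: one (behavioral, emotional) pair per annotator.
    Returns 'engaged', 'disengaged', or None if the sample is removed."""
    # Remove samples labeled "can't decide" (on either dimension) more than twice.
    cant = sum(b == "can't decide" or e == "can't decide" for b, e in annotations)
    if cant > 2:
        return None
    # Map each annotator's pair to engaged/disengaged and take the majority.
    votes = Counter(TABLE_1[(b, e)] for b, e in annotations
                    if "can't decide" not in (b, e))
    return votes.most_common(1)[0][0] if votes else None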
Dataset Preparation We apply the CNN-based face detection algorithm to detect the face in each ER sample. If there is more than one face in a sample, we choose the largest one. The face is then converted to grayscale and resized to 48-by-48 pixels, which is an effective resolution for engagement detection [35]. Fig. 7 shows some examples from the ER dataset. We split the ER dataset into training (3224), validation (715), and testing (688) sets, which are subject-independent (the samples in the three sets come from different subjects). Table 2 shows the statistics of these sets.
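A sketch of this preparation step, assuming OpenCV for the image operations and dlib-style rectangles from the detector sketched earlier.

import cv2

def prepare_sample(img, rects):
    """Crop the largest detected face, convert to grayscale, resize to 48x48."""
    # Keep the largest face when more than one is detected.
    r = max(rects, key=lambda rc: rc.width() * rc.height())
    face = img[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (48, 48))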
Convolutional Neural Network We use the training and validation sets of the
ER dataset to train a convolutional neural network (CNN) for this task from
scratch (the CNN model); this constitutes another of the baseline models in
this paper. The model’s architecture is shown in Fig. 8. The model contains two
convolutional (Conv.) layers, followed by two max pooling (Max.) layers with
stride 2, and two fully connected (FC) layers, respectively. A rectified linear unit
(ReLU) activation function [26] is applied after all Conv. and FC layers. The
last step of the CNN model includes a softmax layer, followed by a cross-entropy
loss, which consists of two neurons indicating engaged and disengaged classes.
To overcome model over-fitting, we apply a dropout layer [33] after every Conv.
and hidden FC layer. Local response normalization [22] is used after the first
Conv. layer. As the optimizer, we use stochastic gradient descent with mini-batches and a momentum of 0.9. Following Equation 1, the learning rate at step t, a_t, is decayed by the rate r = 0.8 every decay step s = 500, where the global step g is the total number of iterations since the beginning of training.
\[
a_t = a_{t-1} \times r^{g/s} \qquad (1)
\]
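For illustration, a minimal Keras sketch of this baseline and its decay schedule; the filter counts, kernel sizes, dropout rates, and initial learning rate are placeholders, since the exact values come from Fig. 8 and are not reproduced here, and Keras' ExponentialDecay applies the decay to the initial rate rather than literally to a_{t-1}.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_baseline(initial_lr=0.01):  # initial_lr is a placeholder value
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),
        # First Conv. layer, followed by local response normalization,
        # dropout, and max pooling with stride 2.
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Lambda(tf.nn.local_response_normalization),
        layers.Dropout(0.5),
        layers.MaxPooling2D(pool_size=2, strides=2),
        # Second Conv. layer with dropout and max pooling.
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Dropout(0.5),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        # Hidden FC layer with dropout, then a two-way softmax
        # (engaged / disengaged) trained with cross-entropy loss.
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),
    ])
    # Decay schedule in the spirit of Equation 1: the rate is multiplied by
    # r = 0.8 every s = 500 steps of the global step g.
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=initial_lr, decay_steps=500, decay_rate=0.8)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    return model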
Very Deep Convolutional Neural Network Using the ER dataset, we train a deep
model which has eight Conv. and three FC layers, similar to the VGG-B architecture
[32], but with two fewer Conv. layers. The model is trained using two different
scenarios. Under the first scenario, the model is trained from scratch initial-
ized with random weights; we call this the VGGnet model (Fig. 9), and this
constitutes the second of our deep learning baseline models. Under the second
scenario, which uses the same architecture, the model’s layers, except the soft-
max layer, are initialized by the trained model of Section 3.2, the goal of which
is to recognize basic facial expressions; we call this the engagement model
(Fig. 10), and this is the key model of interest in our paper. In this model, all
layers’ weights are updated and fine-tuned to recognize engaged and disengaged
classes in the ER dataset. For both VGGnet and engagement models, after
each Conv. block, we have a max pooling layer with stride 2. In the models,
the softmax layer has two output units (engaged and disengaged), followed by
a cross-entropy loss. Similar to the CNN model, we apply a rectified linear
unit (ReLU) activation function [26] and a dropout layer [33] after all Conv. and
hidden FC layers. Furthermore, we apply local response normalization after the
first Conv. block. We use the same approaches to optimization and learning rate
decay as in the CNN model.
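The two scenarios differ only in how the weights are initialized; a minimal sketch, assuming a hypothetical build_vgg_b_variant() helper that constructs the eight-Conv/three-FC architecture with a configurable number of softmax outputs, and placeholder training arrays.

# build_vgg_b_variant is a hypothetical helper constructing the eight-Conv /
# three-FC VGG-B variant described above; num_classes sets the softmax size.
fer_model = build_vgg_b_variant(num_classes=7)   # basic facial expressions
fer_model.fit(fer_x, fer_y, validation_data=(fer_val_x, fer_val_y))  # placeholders

# Scenario 1 (VGGnet baseline): train on ER data from random initialization.
vggnet = build_vgg_b_variant(num_classes=2)
vggnet.fit(er_x, er_y)  # er_x / er_y are placeholder ER training arrays

# Scenario 2 (engagement model): initialize every layer except the softmax
# layer from the trained facial expression model, then fine-tune all layers.
engagement = build_vgg_b_variant(num_classes=2)
for src, dst in zip(fer_model.layers[:-1], engagement.layers[:-1]):
    dst.set_weights(src.get_weights())
engagement.fit(er_x, er_y)  # all layers remain trainable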
5 Experiments
5.1 Evaluation Metrics
In this paper, the performance of all models is reported on both the validation and test splits of the ER dataset. We use three performance metrics: classification accuracy, F1 measure, and the area under the ROC (receiver operating characteristic) curve (AUC). Classification accuracy is the number of positive (engaged) and negative (disengaged) samples that are correctly classified, divided by the total number of test samples (Equation 2):
\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2)
\]
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. The F1 measure is calculated using Equation 3:
\[
F1 = 2 \times \frac{p \times r}{p + r} \qquad (3)
\]
where p and r denote precision and recall, respectively.
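These metrics can be computed with scikit-learn; a minimal sketch, where y_true, y_pred, and y_score (the predicted probability of the engaged class) are placeholders for the model outputs on the ER test split.

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# y_true: ground-truth labels (1 = engaged, 0 = disengaged); y_pred: hard
# predictions; y_score: predicted probability of the engaged class.
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)         # 2 * p * r / (p + r)
auc = roc_auc_score(y_true, y_score)  # area under the ROC curve
print(f"Accuracy={accuracy:.4f}  F1={f1:.4f}  AUC={auc:.4f}")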
Fig. 9. The architecture of the VGGnet model on the ER dataset. “Conv” and “FC” are convolutional and fully connected layers.
Fig. 10. Our facial expression recognition model on the FER-2013 dataset (left). The engagement model on the ER dataset (right).
5.3 Results
Overall Metrics We summarize the experimental results on the validation set of
the ER dataset in Table 3 and on the test set of the ER dataset in Table 4. On the
Table 3. The results of our models (%) on the validation set of the ER dataset.
Table 4. The results of our models (%) on the test set of the ER dataset.
validation and test sets, the engagement model substantially outperforms all
baseline models using all evaluation metrics, showing the effectiveness of using a
trained model on basic facial expression data to initialize an engagement recog-
nition model. All deep models including CNN, VGGnet, and engagement
models perform better than the HOG+SVM method, showing the benefit of
applying deep learning to recognize engagement. On the test set, the engagement model achieves 72.38% classification accuracy, outperforming VGGnet by 5%, the CNN model by more than 6%, and the HOG+SVM method by 12.5%. The engagement model achieves a 73.90% F1 measure, which is around a 3% improvement over the deep baseline models and 6% better than the HOG+SVM model. Using AUC, the most popular metric in engagement recognition tasks, the engagement model achieves 73.74%, which improves on the CNN and VGGnet models by more than 5% and is around 10% better than the HOG+SVM method. There are similar improvements on the validation set.
6 Conclusion
Reliable models that can recognize engagement during a learning session, partic-
ularly in contexts where there is no instructor present, play a key role in allowing
Table 5. Confusion matrix of the HOG+SVM model (%).

                          predicted
                    Engaged   Disengaged
actual  Engaged      92.23       7.77
        Disengaged   66.49      33.51

Table 6. Confusion matrix of the CNN model (%).

                          predicted
                    Engaged   Disengaged
actual  Engaged      93.53       6.47
        Disengaged   56.99      43.01
Table 7. Confusion matrix of the VGGnet model (%).

                          predicted
                    Engaged   Disengaged
actual  Engaged      89.32      10.68
        Disengaged   52.51      47.49

Table 8. Confusion matrix of the engagement model (%).

                          predicted
                    Engaged   Disengaged
actual  Engaged      87.06      12.94
        Disengaged   39.58      60.42
References
1. Alyuz, N., Okur, E., Oktay, E., Genc, U., Aslan, S., Mete, S.E., Arnrich, B., Esme,
A.A.: Semi-supervised model personalization for improved detection of learner’s
emotional engagement. In: ICMI. pp. 100–107. ACM (2016)
2. Aslan, S., Mete, S.E., Okur, E., Oktay, E., Alyuz, N., Genc, U.E., Stanhill, D.,
Esme, A.A.: Human expert labeling process (HELP): Towards a reliable higher-order
user state labeling process and tool to assess student engagement. Educational
Technology pp. 53–59 (2017)
3. Bosch, N.: Detecting student engagement: Human versus machine. In: UMAP. pp.
317–320. ACM (2016)
4. Bosch, N., D’Mello, S., Baker, R., Ocumpaugh, J., Shute, V., Ventura, M., Wang,
L., Zhao, W.: Automatic detection of learning-centered affective states in the wild.
In: IUI. pp. 379–388. ACM (2015)
5. Bosch, N., D’Mello, S.K., Ocumpaugh, J., Baker, R.S., Shute, V.: Using video to
automatically detect learner affect in computer-enabled classrooms. ACM Trans-
actions on Interactive Intelligent Systems 6(2), 17 (2016)
6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR. vol. 1, pp. 886–893. IEEE (2005)
7. D’Cunha, A., Gupta, A., Awasthi, K., Balasubramanian, V.: DAiSEE: Towards user
engagement recognition in the wild. arXiv preprint arXiv:1609.01885 (2016)
8. Dhall, A., Goecke, R., Lucey, S., Gedeon, T.: Static facial expression analysis in
tough conditions: Data, evaluation protocol and benchmark. In: ICCV. pp. 2106–
2112 (2011)
9. Ekman, P.: Basic emotions. In: Dalgleish, T., Power, T. (eds.) The Handbook of
Cognition and Emotion, pp. 45–60. John Wiley & Sons, Sussex, UK (1999)
10. Ekman, P.: Darwin and facial expression: A century of research in review. Ishk
(2006)
11. Fasel, B., Luettin, J.: Automatic facial expression analysis: a survey. Pattern recog-
nition 36(1), 259–275 (2003)
12. Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B.,
Cukierski, W., Tang, Y., Thaler, D., Lee, D.H., et al.: Challenges in representation
learning: A report on three machine learning contests. In: ICONIP. pp. 117–124.
Springer (2013)
13. Grafsgaard, J., Wiggins, J.B., Boyer, K.E., Wiebe, E.N., Lester, J.: Automatically
recognizing facial expression: Predicting engagement and frustration. In: Educa-
tional Data Mining 2013 (2013)
14. Jacobson, M.J., Taylor, C.E., Richards, D.: Computational scientific inquiry with
virtual worlds and agent-based models: new ways of doing science to learn science.
Interactive Learning Environments 24(8), 2080–2108 (2016)
15. Jung, H., Lee, S., Yim, J., Park, S., Kim, J.: Joint fine-tuning in deep neural
networks for facial expression recognition. In: ICCV. pp. 2983–2991 (2015)
16. Kahou, S.E., Bouthillier, X., Lamblin, P., Gulcehre, C., Michalski, V., Konda, K.,
Jean, S., Froumenty, P., Dauphin, Y., Boulanger-Lewandowski, N., et al.: Emonets:
Multimodal deep learning approaches for emotion recognition in video. Journal on
Multimodal User Interfaces 10(2), 99–111 (2016)
17. Kahou, S.E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R.,
Vincent, P., Courville, A., Bengio, Y., Ferrari, R.C., et al.: Combining modality
specific deep neural networks for emotion recognition in video. In: ICMI. pp. 543–
550. ACM (2013)
18. Kamath, A., Biswas, A., Balasubramanian, V.: A crowdsourced approach to stu-
dent engagement recognition in e-learning environments. In: WACV. pp. 1–9. IEEE
(2016)
19. Kapoor, A., Mota, S., Picard, R.W., et al.: Towards a learning companion that
recognizes affect. In: AAAI Fall symposium. pp. 2–4 (2001)
20. Kim, B.K., Dong, S.Y., Roh, J., Kim, G., Lee, S.Y.: Fusing aligned and non-aligned
face information for automatic affect recognition in the wild: A deep learning ap-
proach. In: CVPR Workshops. pp. 48–57. IEEE (2016)
21. King, D.E.: Dlib-ml: A machine learning toolkit. Journal of Machine Learning
Research 10(Jul), 1755–1758 (2009)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: NIPS. pp. 1097–1105 (2012)
23. Liu, P., Han, S., Meng, Z., Tong, Y.: Facial expression recognition via a boosted
deep belief network. In: CVPR. pp. 1805–1812 (2014)
24. Mollahosseini, A., Chan, D., Mahoor, M.H.: Going deeper in facial expression
recognition using deep neural networks. In: WACV. pp. 1–10. IEEE (2016)
25. Monkaresi, H., Bosch, N., Calvo, R.A., D’Mello, S.K.: Automated detection of
engagement using video-based estimation of facial expressions and heart rate. IEEE
Transactions on Affective Computing 8(1), 15–28 (2017)
26. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann ma-
chines. In: ICML. pp. 807–814 (2010)
27. Nezami, O.M., Dras, M., Anderson, P., Hamey, L.: Face-cap: Image captioning us-
ing facial expression analysis. In: Joint European Conference on Machine Learning
and Knowledge Discovery in Databases. pp. 226–240. Springer (2018)
28. O’Brien, H.: Theoretical perspectives on user engagement. In: Why Engagement
Matters, pp. 1–26. Springer (2016)
29. Pramerdorfer, C., Kampel, M.: Facial expression recognition using convolutional
neural networks: State of the art. arXiv preprint arXiv:1612.02903 (2016)
30. Rodriguez, P., Cucurull, G., Gonzalez, J., Gonfaus, J.M., Nasrollahi, K., Moeslund,
T.B., Roca, F.X.: Deep pain: Exploiting long short-term memory networks for facial
expression classification. IEEE Transactions on Cybernetics (99), 1–11 (2017)
31. Sariyanidi, E., Gunes, H., Cavallaro, A.: Automatic analysis of facial affect: A sur-
vey of registration, representation, and recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence 37(6), 1113–1133 (2015)
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
33. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: A simple way to prevent neural networks from overfitting. The Jour-
nal of Machine Learning Research 15(1), 1929–1958 (2014)
34. Tang, Y.: Deep learning using linear support vector machines. arXiv preprint
arXiv:1306.0239 (2013)
35. Whitehill, J., Serpell, Z., Lin, Y.C., Foster, A., Movellan, J.R.: The faces of en-
gagement: Automatic recognition of student engagement from facial expressions.
IEEE Transactions on Affective Computing 5(1), 86–98 (2014)
36. Woolf, B., Burleson, W., Arroyo, I., Dragon, T., Cooper, D., Picard, R.: Affect-
aware tutors: recognising and responding to student affect. International Journal
of Learning Technology 4(3-4), 129–164 (2009)
37. Yu, Z., Zhang, C.: Image based static facial expression recognition with multiple
deep network learning. In: ICMI. pp. 435–442. ACM (2015)
38. Zhang, K., Huang, Y., Du, Y., Wang, L.: Facial expression recognition based on
deep evolutional spatial-temporal networks. IEEE Transactions on Image Process-
ing 26(9), 4193–4203 (2017)
39. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning social relation traits from face
images. In: ICCV. pp. 3631–3639 (2015)