ABSTRACT
Emotion detection (ED) involves the identification and understanding of an
individual’s emotional state through various cues such as facial expressions, voice
tones, physiological changes, and behavioral patterns. In this context, behavioral
analysis is employed to observe actions and behaviors for emotional interpretation.
This work specifically employs behavioral metrics like drawing and handwriting to
determine a person’s emotional state, recognizing these actions as physical functions
integrating motor and cognitive processes. The study proposes an attention-based
transformer model as an innovative approach to identify emotions from handwriting
and drawing samples, thereby advancing the capabilities of ED into the domains of
fine motor skills and artistic expression. The initial data obtained provides a set of
points that correspond to the handwriting or drawing strokes. Each stroke point is
subsequently delivered to the attention-based transformer model, which embeds it
into a high-dimensional vector space. The model builds a prediction about the
emotional state of the person who generated the sample by integrating the most
important components and patterns in the input sequence using self-attention processes. The proposed approach possesses a distinct advantage in its enhanced capacity to capture long-range correlations compared to conventional recurrent neural networks (RNNs). This characteristic makes it particularly well-suited for the precise identification of emotions from samples of handwriting and drawing, signifying a notable advancement in the field of emotion detection. The proposed method produced state-of-the-art results of 92.64% on the benchmark dataset known as EMOTHAW (Emotion Recognition via Handwriting and Drawing).

Subjects Brain-Computer Interface, Computer Vision, Data Mining and Machine Learning, Natural Language and Speech, Sentiment Analysis
Keywords Emotional state recognition, Handwriting/Drawing analysis, Behavioral biometrics, Emotion detection, Human-computer interaction, Emotional intelligence, Transformer model
INTRODUCTION
Emotion detection (ED) is the process of recognizing and evaluating the emotional states
and feelings of individuals using a variety of techniques. Accurately understanding and
interpreting human emotions is the ultimate objective of ED, which has a variety of
applications in areas including mental health, user experience, education, marketing, and
security (Acheampong, Wenyu & Nunoo-Mensah, 2020). Emotions are one’s reactions, and
they can differ widely among individuals. Defining universal patterns or guidelines for
detection might be challenging since individuals may show the same emotion in various
ways. Although emotion science has made considerable strides, there is still much to learn
about the subtleties and complexity of human emotions. It is still difficult to create
comprehensive representations that adequately depict the entire spectrum of emotions
(Zad et al., 2021). Machine learning (ML) techniques are used to develop intelligent systems that assist physicians at the point of care. Such systems can support conventional clinical examinations for the assessment of Parkinson's disease (PD) by detecting its early signs and symptoms. In patients with PD, previously learned motor skills, including handwriting, are frequently impaired, which makes handwriting a potent marker for building automated diagnostic systems (Impedovo, Pirlo & Vessio, 2018). Deep learning (DL) models have shown promising outcomes in ED, notably those built on RNNs and convolutional
neural networks (CNNs). These models can recognize temporal connections and learn
complicated patterns from the emotional data (Pranav et al., 2020).
The study conducted by Kedar et al. (2015) analyzed handwriting features such as
baseline, slant, pen pressure, dimensions, margin, and boundary to estimate an individual’s
emotional levels. The study concludes that it will aid in identifying those individuals who
are emotionally disturbed or sad and require psychiatric assistance to deal with such
unpleasant emotions. Additionally, Gupta et al. (2019) examined electroencephalogram (EEG) signals from the user's brain to determine their emotional state. The study uses a correlation-finding approach for text improvement to change the words that match the observed emotion; the accuracy of the resulting sentence was then verified using a language modeling framework built on long short-term memory (LSTM) networks. In a dataset with 25 subjects, an accuracy of 74.95% was achieved when classifying five emotional states from EEG signals. Based on handwriting kinetics and quantified EEG analysis, a computerized, non-invasive, and rapid detection technique for mild cognitive impairment (MCI) was proposed in Chai et al. (2023). They employed a classification model built on dual-feature fusion created for medical decision-making, used a support vector machine (SVM) with a radial basis function (RBF) kernel as the base classifier, and achieved a high classification rate of 96.3% for the aggregated features.
Existing ED research has mostly concentrated on a small number of fundamental
emotions, such as happiness, sadness, and anger. The complexity and variety of emotional
expressions make it difficult to adequately depict the entire emotional range. The
performance of ED systems is improved by identifying significant characteristics across
several modalities and creating suitable representations (Zakraoui et al., 2023). The
capacity to recognize emotions via routine activities like writing and drawing could open up new, unobtrusive avenues for emotion detection.
PROPOSED METHODOLOGY
The proposed study introduces an attention-based transformer model designed to generate
a more comprehensive feature map from handwriting and drawing samples. This model
aims to accurately identify both handwritten information and emotional content. The
model assumes that writing and drawing are influenced by one's emotional state and are connected to behavior. The proposed method involves gathering an individual's writing through an electronic device and analyzing it to determine their emotional state. The suggested model is founded on the transformer architecture, which makes use of attention mechanisms to provide a more detailed feature map of the data. The goal is to enhance the model's capacity to recognize and understand the information and feelings represented in handwriting and drawing. The suggested study offers a thorough assessment of numerous experiments on two benchmark datasets.
Emotion models
The core of ED systems is emotional models used to represent individual feelings. It is
crucial to establish the model of emotion to be used before beginning any ED-related
activity.
Experimental design
The implementation of the proposed work is done using Jupyter Notebook with a five-fold
cross-validation training strategy. Both the EMOTHAW and SemEval datasets were used
during training. The learning rate is set to 0.0001, and the model is trained over 25 epochs.
A weight decay of 0.05 is applied to control overfitting, and the Adam optimizer is
employed for efficient parameter updates. The loss function utilized for training is cross-entropy.
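As an illustration of this configuration, the following is a minimal PyTorch sketch of one training fold under the stated hyperparameters; the model class, data loader, and variable names are hypothetical placeholders rather than the authors' released code.

import torch
from torch import nn
from sklearn.model_selection import KFold

def train_fold(model, loader, epochs=25, lr=1e-4, weight_decay=0.05):
    # Adam optimizer with the learning rate and weight decay reported above
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()  # cross-entropy training loss
    model.train()
    for _ in range(epochs):  # 25 epochs
        for features, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()   # backpropagation
            optimizer.step()  # Adam parameter update
    return model

# Five-fold cross-validation over the pooled samples (placeholder arrays):
# for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(samples):
#     train_fold(model, make_loader(samples[train_idx]))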
Datasets
EMOTHAW dataset
The EMOTHAW database (Likforman-Sulem et al., 2017) contains samples from 129
individuals (aged between 21–32) whose emotional states, including anxiety, depression,
and stress, were measured using the Depression Anxiety Stress Scales (DASS) assessment.
Due to the dearth of publicly accessible labeled data in this field, this database itself is a
helpful resource. The participants comprise 58 men and 71 women. The age range has been constrained to decrease the experiment's inter-subject variation. Seven
activities are recorded using a digitizing tablet: drawing pentagons and houses,
handwriting words, drawing circles and clocks, and copying a phrase in cursive writing.
The writing and drawing activities used to get the measures are well-researched exercises
that are also used to assess a person’s handwriting and drawing abilities. Records include
pen azimuth, altitude, pressure, time stamp, and positions (both on paper and in the air). The generated files have the .svc file extension and are produced by a Wacom device.
Figure 1 shows a system overview of the proposed method.
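For readers unfamiliar with the file format, the sketch below shows one plausible way to load such a record in Python. It assumes the seven-column point layout commonly used for .svc handwriting files (coordinates, time stamp, pen status, azimuth, altitude, pressure) preceded by a point count; the exact layout should be verified against the dataset documentation.

import numpy as np

def read_svc(path):
    # Assumed layout: first line holds the number of points; each following
    # line holds "x y timestamp pen_status azimuth altitude pressure",
    # where pen_status distinguishes on-paper from in-air samples.
    with open(path) as f:
        n_points = int(f.readline())
        data = np.loadtxt(f)
    assert data.shape == (n_points, 7)
    return {
        "x": data[:, 0], "y": data[:, 1],
        "timestamp": data[:, 2],
        "pen_status": data[:, 3].astype(int),
        "azimuth": data[:, 4], "altitude": data[:, 5],
        "pressure": data[:, 6],
    }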
SemEval dataset
The Semantic Evaluations (SemEval) dataset (Rosenthal, Farra & Nakov, 2019) includes
news headlines in Arabic and English that were taken from reputable news sources
including the BBC, CNN, Google News, and other top newspapers. There are 1,250 total
data points in the dataset. The database contains a wealth of emotional information that
may be used to extract emotions, and the data is labeled according to the six emotional
categories proposed by Ekman (1999) (happiness, sadness, fear, surprise, anger, and
disgust).
Feature extraction
Drawing and handwriting signals are examples of time series data, which is displayed as a
collection of data points gathered over time. In this study, we extract characteristics from
the handwritten and drawing signals in the time domain, frequency domain, and statistical
domain. The signal’s changing amplitude over time is used to extract time-domain
characteristics, which contain the signal’s mean and standard deviation. The frequency
content of the signal is extracted to get frequency-domain characteristics, in which spectral
entropy, spectral density, and spectral centroid are included. The signal’s statistical
characteristics are used to extract statistical features. Examples consist of mutual
information, cross-correlation, and auto-correlation.
Mean
The average value of the signal over a particular time interval is calculated to get an idea
about the overall level of the signal. In this work, the mean value of a handwriting signal
$x[n]$ over a time interval of $N$ samples is calculated as:

$\mu = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$  (1)

where $\mu$ is the mean value of the signal, $n$ ranges from 0 to $N-1$, and $\sum$ denotes the sum of all values.
Standard deviation
The standard deviation of the signal over a particular time interval is calculated to measure
the amount of variability in the signal. In this work, the standard deviation of a
handwriting signal $x[n]$ over a time interval of $N$ samples is calculated as:

$\sigma = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}\left(x[n]-\mu\right)^2}$  (2)
Spectral density
Spectral density is used to measure the power distribution of a handwriting signal in the
frequency domain. In this work, the spectral density of a handwriting signal is calculated
as:
$S(f) = |F(f)|^2$  (3)

where $S(f)$ represents the spectral density, and $F(f)$ denotes the Fourier transform of the signal.
Spectral entropy
Spectral entropy measures the irregularity or flatness of the signal's power distribution across frequencies. In this work, the spectral entropy of a handwriting signal is calculated from the normalized power spectrum as:

$SE = -\sum_{i=1}^{N} P_i \log P_i$  (4)

where $SE$ represents the spectral entropy, $P_i$ shows the power at the $i$-th bin of the power spectrum, and $N$ denotes the number of frequency bins.
Autocorrelation
Autocorrelation measures the degree of similarity between the handwriting signal and a
delayed version of itself. It can be used to identify patterns or repeating features in the
signal. In this work, the autocorrelation of a handwriting signal $x[n]$ over a time interval of $N$ samples is calculated as:

$R[k] = \frac{1}{N}\sum_{n=0}^{N-1} x[n]\, x[n-k]$  (5)

where $R[k]$ is the autocorrelation of the signal at lag $k$, and $n$ ranges from 0 to $N-1$.
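To make these definitions concrete, the following NumPy sketch computes the features of Eqs. (1)-(5) for a single one-dimensional signal; the small constant inside the logarithm and the choice of logarithm base are implementation assumptions, not taken from the paper.

import numpy as np

def extract_features(x, lag=1):
    N = len(x)
    mu = x.sum() / N                             # Eq. (1): mean
    sigma = np.sqrt(((x - mu) ** 2).sum() / N)   # Eq. (2): standard deviation

    S = np.abs(np.fft.rfft(x)) ** 2              # Eq. (3): spectral density
    P = S / S.sum()                              # normalized power per bin
    se = -np.sum(P * np.log2(P + 1e-12))         # Eq. (4): spectral entropy

    freqs = np.fft.rfftfreq(N)
    centroid = np.sum(freqs * P)                 # spectral centroid

    r = (x[lag:] * x[:N - lag]).sum() / N        # Eq. (5): autocorrelation at lag k
    return {"mean": mu, "std": sigma, "spectral_entropy": se,
            "spectral_centroid": centroid, "autocorr": r}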
Classification
An attention-based transformer model is used to classify the features of handwriting and
drawing signals. During training, the pre-processed data is fed into the attention-based transformer model, and its parameters are adjusted to minimize the discrepancy between the model's predictions and the actual labels assigned to the data. In this study, the model is trained by employing strategies
like gradient descent and backpropagation. The capacity to pay attention to various aspects
of the incoming data is one of the main characteristics of an attention-based transformer
model. This is accomplished through the use of an attention mechanism, which enables the
model to concentrate on the input’s key characteristics at each stage of the classification
process. The model can acquire the ability to recognize patterns and traits that are crucial
for differentiating between various classes of handwriting and drawing by paying attention
to different components of the input. Additionally, this architecture often has several
processing levels. Every layer of the model is made to learn more intricate representations
of the input data, enabling the model to identify subtler characteristics and patterns.
Typically, a softmax function is used to generate a probability distribution across all
feasible classes of handwriting and drawing using the output of the last layer.
EXPERIMENTS
The initial stage in the experiments is to process the data from each dataset to identify the
appropriate features for emotion identification from handwriting and drawing examples.
An attention-based transformer model for emotion recognition from handwriting and
drawing samples is trained and evaluated through a series of experiments. For this
experiment, we use the EMOTHAW and SemEval benchmark datasets. These datasets
include several types of handwritten and drawn samples as well as the associated emotion labels.

Input embedding: In this work, the input sequence of stroke points is first embedded into a high-dimensional vector space:

$X = \{x_1, x_2, \ldots, x_n\} \rightarrow E = \{e_1, e_2, \ldots, e_n\}$,  (6)

where $E_i = f\left(\sum_i W_i x_i + b\right)$.
The Softmax attention mechanism entails matrix multiplication, where the dot product
is computed between each feature vector in $q$ and the transpose of $k$. This dot product is then divided by the scaling factor $\sqrt{d_k}$ before being subjected to a softmax function. The Softmax attention mechanism relies on the dot product operation, which considers both the angle and magnitude of vectors when computing similarity.
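The operation described above corresponds to standard scaled dot-product attention; a generic sketch (not code released with the article) is:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: tensors of shape (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # dot product scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)            # attention weights per position
    return weights @ v                             # weighted sum of values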
Multi-head attention: To capture multiple aspects of the input writing signal in this
work, the self-attention layer is extended to include multiple heads.
$Z_i = \text{self-attention}_i(E) = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$  (9)

where $i$ represents the head index, and $Q_i$, $K_i$, and $V_i$ denote the query, key, and value matrices for the $i$-th head.
Feed-forward networks: In this work, the output of the self-attention layer is passed
through a feed-forward network to further refine the representation.
$Y = f(\text{feedforward}(Z))$  (10)

where $f$ represents a non-linear activation function. Here the feed-forward network consists of two linear transformations with a non-linear activation function between them.
Output prediction: In this work, the final output of the transformer model is obtained
by passing the refined representation through a linear transformation and a Softmax
function.
$P = \text{softmax}(W_i Y + b)$  (11)

where $P$ represents the predicted probability distribution over the emotion categories.
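Putting Eqs. (6)-(11) together, a minimal PyTorch sketch of such a classifier is given below. The layer sizes, head count, and pooling strategy are illustrative assumptions, since the article reports tuning these hyperparameters rather than fixing them here; this is not the authors' released implementation.

import torch
from torch import nn

class EmotionTransformer(nn.Module):
    def __init__(self, in_dim=7, d_model=128, n_heads=4,
                 n_layers=2, n_classes=3):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)      # Eq. (6): input embedding
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,          # Eq. (9): multi-head attention
            dim_feedforward=4 * d_model,             # Eq. (10): feed-forward network
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, n_classes)     # Eq. (11): linear projection

    def forward(self, x):
        # x: (batch, seq_len, in_dim) sequence of stroke-point features
        z = self.encoder(self.embed(x))
        return self.out(z.mean(dim=1))               # mean-pool, then class logits

# probs = torch.softmax(EmotionTransformer()(batch), dim=-1)  # Eq. (11): softmax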
Evaluation metrics
The assessment criteria employed in emotion identification from handwriting and drawing
samples employing an attention-based transformer model largely rely on the particular
task and dataset being used. The following is an explanation of the assessment metrics
utilized in this study.
Accuracy
This measures the proportion of correctly classified emotions to the total number of
emotions used in the dataset.
$\text{Accuracy} = \frac{T_p + T_N}{T_p + T_N + F_p + F_N}$  (12)

where $T_p$ represents the number of true positive samples (samples that were correctly classified as the target emotion), $T_N$ denotes the number of true negative samples (samples that were correctly classified as not belonging to the target emotion), $F_p$ the number of false positive samples, and $F_N$ the number of false negative samples.
F1 score
The F1 score, which offers a balanced evaluation of the model's performance, is the harmonic mean of precision and recall. Precision assesses the proportion of true positive predictions among all positive predictions, whereas recall assesses the proportion of true positive predictions among all positive samples.
$\text{Precision} = \frac{T_p}{T_p + F_p}$  (13)

$\text{Recall} = \frac{T_p}{T_p + F_N}$  (14)

$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$  (15)
F1 represents a weighted average of precision and recall, with a maximum value of 1 and
a minimum value of 0.
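Equivalently, Eqs. (12)-(15) can be computed with scikit-learn; the labels below are hypothetical placeholders, not results from the paper.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 1, 0, 2]  # hypothetical ground-truth emotion labels
y_pred = [0, 1, 2, 0, 0, 2]  # hypothetical model predictions

acc = accuracy_score(y_true, y_pred)                    # Eq. (12)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)   # Eqs. (13)-(15)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")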
In the second experiment, the study examined text-based emotion detection using the SemEval dataset, which includes a variety of text samples with emotional annotations, such as tweets, blog posts, and news articles. The features are categorized into three distinct emotional states: angry, happy, and sad. On the test dataset, the proposed model produced state-of-the-art results, particularly excelling at sad-state identification with a remarkable F1 score of 87.06%, as shown in Table 2. For happy-state detection, the model obtained its highest F1 score of 79.73%, and for angry-state detection, an F1 score of 83.12%, reaffirming the versatility of the model across various emotional dimensions. To sum up, the SemEval dataset, coupled with the proposed model, opens new horizons in text-based emotion detection. The robust performance across different emotional states, especially the outstanding identification of sadness, highlights the adaptability and efficacy of the proposed work. These outcomes not only contribute to the academic discourse but also pave the way for practical applications in sentiment analysis across diverse textual genres.
Figure 2 showcases the robust performance of the proposed model. This high level of
accuracy indicates the model’s proficiency in learning from the training data and
effectively generalizing it to new, unseen data during testing. The sustained elevation of
both lines above 90% emphasizes the reliability and effectiveness of the proposed model in
accurately predicting emotional states based on the provided features. To specifically
address color differentiation issues, we have manually introduced circles to represent
training accuracy and rectangles for testing accuracy. This visual distinction aims to
enhance clarity and inclusivity for a diverse audience, mitigating potential challenges
associated with color perception.
Results comparison
The section conducts a comprehensive analysis of the results, drawing comparisons with
existing datasets and state-of-the-art approaches. Table 3 provides a detailed examination
of the model’s performance using the EMOTHAW dataset, highlighting its accuracy in
emotion state recognition. Moving forward, Table 4 extends the comparison to the
SemEval dataset, offering insights into how the proposed model fares in diverse text
samples like tweets, blog posts, and news articles. Additionally, Table 5 positions the
obtained results in the broader context of state-of-the-art methodologies, underlining the
competitiveness and advancements achieved by the proposed study.
Table 3 shows the comparison of the results using the EMOTHAW dataset. The study
conducted by Likforman-Sulem et al. (2017) used a random forest (RF) technique to
analyze and categorize the collection of characteristics collected from the EMOTHAW
dataset, which is a machine learning algorithm that uses a group of decision trees and a
feature ranking mechanism. They achieved their best accuracy of 72.8% for depression detection using drawing features; similarly, they achieved 60.50% for anxiety detection and 60.20% for stress detection. The work conducted by Rahman & Halim (2023) used a combination of temporal, spectral, and Mel-Frequency Cepstral Coefficient (MFCC) approaches to extract characteristics from each signal and discover a link between the signal and the emotional states of stress, anxiety, and sadness. They classified the vectors of the generated characteristics using a Bidirectional Long Short-Term Memory (BiLSTM) network. They obtained their best accuracy of 89.21% for depression detection using writing features, 80.03% for anxiety detection using both drawing and writing features, and 75.39% for stress detection using drawing features. The study conducted in
Nolazco-Flores et al. (2021) used the fast correlation-based filtering approach to choose the
optimal characteristics. The retrieved features were then supplemented by introducing a
Threats to analysis
The proposed approach for emotion detection from handwriting and drawing samples
exhibits promising results, although certain inherent limitations merit consideration.
Firstly, the model’s performance is contingent on the quality and representativeness of the
training dataset, with potential biases affecting generalizability. Additionally, the limited
diversity in handwriting and drawing styles within the training data may impact the
model’s adaptability to extreme variations in individual expression. Cultural nuances in
emotional expression pose another challenge, as the model’s performance may vary across
diverse cultural contexts. Dependency on machine translation tools for languages beyond
the training set introduces potential errors, and the predefined set of emotions in focus
might not capture the full spectrum of human emotional expression. Ethical
considerations regarding privacy and consent in deploying emotion detection technologies
add another layer of complexity. While these limitations are acknowledged, the proposed approach remains a promising step toward reliable emotion detection from handwriting and drawing samples.
Discussion
Text-based ED is focused on the feelings that lead people to write down particular words at specific moments. According to the results, multimodal ED channels, such as voice, body language, and facial expressions, receive more attention than their text-based counterparts. This dearth has mostly been caused by the fact that, unlike multimodal approaches, texts may not exhibit distinctive indications of emotions, making the identification of emotions from texts significantly more challenging than other methods. The absence of facial expressions and vocal modulations in handwritten text makes emotion detection a difficult challenge. The purpose of this work was to
ascertain the level of interest in the field of handwritten text emotion recognition. To
perform classification and analysis tasks, handwriting and drawing signals are processed
using the feature extraction steps to isolate significant and informative attributes. In this
study, we found that individual variances in handwriting characteristics were caused by
their emotional moods. When more characteristics, such as pressure and speed, were incorporated into the input data, the attention-based transformer model achieved high accuracy, and we observed that adding further features can enhance the model's performance even more.
CONCLUSION
The drawing and handwriting signals are instances of time series data, which is shown as a
collection of data points acquired over time. In this work, we extract time-domain,
frequency-domain, and statistical-domain features from the handwriting and drawing
signals. The proposed model has the benefit of being able to capture long-range
relationships in the input data, which is especially beneficial for handwriting and drawing
samples that contain sequential and spatial information. The model’s attention mechanism
also enables it to concentrate on relevant components and structures in the input data,
which may enhance its capacity to recognize minor emotional signals. The
hyperparameters that are adjusted during the testing of the model include the number of
layers, attention heads, the dimensionality of embeddings, learning rate, and batch size.
Concerning accuracy and F1 scores, the attention-based transformer model used in this
study excelled on two benchmark datasets.
In the future, transfer learning techniques could be used to pre-train the attention-based
transformer model on large datasets and fine-tune it for specific emotion detection tasks,
which could potentially improve the model’s performance on smaller datasets.
Funding
This research is funded by the Researchers Supporting Project Number (RSPD2024R947),
King Saud University, Riyadh, Saudi Arabia. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests
Khursheed Aurangzeb is an Academic Editor for PeerJ.
Author Contributions
Zohaib Ahmad Khan conceived and designed the experiments, performed the
experiments, performed the computation work, prepared figures and/or tables, and
approved the final draft.
Yuanqing Xia analyzed the data, performed the computation work, authored or reviewed
drafts of the article, and approved the final draft.
Khursheed Aurangzeb analyzed the data, performed the computation work, authored or
reviewed drafts of the article, and approved the final draft.
Fiza Khaliq conceived and designed the experiments, performed the computation work,
authored or reviewed drafts of the article, and approved the final draft.
Mahmood Alam performed the experiments, prepared figures and/or tables, and
approved the final draft.
Javed Ali Khan conceived and designed the experiments, authored or reviewed drafts of
the article, and approved the final draft.
Muhammad Shahid Anwar performed the experiments, prepared figures and/or tables,
and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The code is available in the Supplemental File.
The datasets used in this study are available at:
- The “EMOTHAW” dataset: https://www.psicologia.unicampania.it/the-lab/our-activities (Laurence Likforman-Sulem, Anna Esposito, Marcos Faundez-Zanuy, Stephan Clemencon, and Gennaro Cordasco).
- The “SemEval” dataset: https://alt.qcri.org/semeval2016/task4/index.php?id=data-and-tools (Sara Rosenthal, Noura Farra, and Preslav Nakov).
Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/
peerj-cs.1887#supplemental-information.