ABSTRACT
The volume of Arabic posts on many social networks has increased significantly, providing a rich source for analysis. Arabic Natural Language Processing (ANLP) exploits this source to extract insights that are valuable but not immediately visible. This paper presents a review of recent studies on the techniques used in the ANLP field, in order to identify the challenges faced and the emerging trends. The articles selected for the review are primarily studies on ANLP techniques: we collected and analysed a set of journal papers published in this field between 2018 and 2022. Based on the analysis, we extracted the various ANLP steps and investigated the techniques used in each step. The article also outlines the current trends in the phases and steps of the ANLP process, and thus gives an insight into the current state of research.
Keywords: Arabic Natural Language Processing, Systematic Literature Review, Data Collection,
Tokenisation, Embedding.
Journal of Theoretical and Applied Information Technology
15th February 2023. Vol.101. No 3
© 2023 Little Lion Scientific
and resource poverty. Thus, ANLP has attracted the attention of researchers, especially after the significant advances of NLP for other languages, mainly English, and has been the subject of many research papers. This work aims to investigate the challenges of natural language processing for Arabic (ANLP) and the latest techniques being used to overcome them, in order to keep up with the most recent technologies in this field.

The remainder of this paper is organised as follows. Section 2 presents the general context and deals with the application of NLP techniques to non-Latin character languages such as Arabic, the challenges faced while performing ANLP, and the process followed. Section 3 presents the methodology followed during the research. Section 4 presents the results related to each step of the ANLP process. Finally, Section 5 discusses the results and offers some insights into future research directions.

2. GENERAL CONTEXT AND PROBLEM STATEMENT

2.1 NLP and non-Latin alphabet languages
When we mention NLP, we invoke the various existing languages, whether they are written using the Latin alphabet, such as English, French, and Spanish, or non-Latin alphabets, such as Arabic, Chinese, Urdu, and Thai. Various works have been carried out in NLP for these languages, taking into consideration the characteristics of each.

Figure 1: Alphabets Of Various Non-Latin Alphabets Languages

NLP, like the other techniques in AI, has focused heavily on the English language, and therefore we find various works and research in this direction. This does not exclude the fact that NLP is also oriented towards other languages, including languages with non-Latin characters. However, this orientation is not entirely free of challenges.

In the case of the Indian language, and according to [1] and [2], it is noted that it is a morphologically rich and free-order language, unlike English, which adds complexity to the processing of user-generated content, not to mention the variety of dialects that exist, such as Bengali, Gujarati, Tamil, Malayalam, and Telugu. Taking for example the Urdu dialect, [3] and [4] highlighted its specificity compared to other languages due to its morphological structure, as it is written from right to left.

Moving on to the Chinese language: compared to English, NLP in Chinese is more difficult, as its vocabulary and semantics are more complex. In addition, the semantics of Chinese texts are more context-dependent [5]. Whereas English has regular delimiters, Chinese has neither formal delimiters nor a strict division method, and sometimes the division method even depends on the context.

As for the Thai language, [6] and [7] noted that words are written without word or sentence delimiters and are placed continuously without spaces in sentences. Other problems faced by the Thai language are word variants (the use of different words to express the same idea) and polysemy (the ability of words to express several ideas depending on the context), in addition to the lack of explicit word boundaries.

Arabic is also one of the non-Latin alphabet languages that require additional effort due to the challenges associated with it, namely morphological richness, orthographic ambiguity, dialectal variations, and orthographic noise. Another challenge that is relevant to most non-Latin languages, including Arabic, is the lack of resources for training and evaluating Machine Learning (ML) models. In this article, the focus will be on the Arabic language, to identify the challenges related to ANLP and the studies done so far to resolve them.

2.2 Arabic language
The Arabic language has a unique history in that it has remained unchanged for more than sixteen centuries (before The Holy Quran in 609 CE). Arabic is a Semitic language with an inextricable link to Islam and Arabic culture, serving as the Quranic language for all Muslims (more than 1.62 billion people).
Moreover, it is the native language of more than 422 million speakers. In addition to the uniqueness of its historical and cultural background, the nature and structure of the Arabic language differ from those of other languages such as English. For example, Arabic is written from right to left. It comprises 12 million words and 28 letters, including three vowels, as well as diacritics (short vowel symbols inscribed atop regular letters). Diacritics may affect the semantics and syntax of a word [8]. There is no letter capitalization in Arabic; however, the shapes of Arabic letters change according to their positions in a word. Arabic is a structured language with an abundance of vocabulary, wherein morphology plays an important role. Furthermore, words are often constructed in a complex manner.

Figure 2: Arabic Alphabets

Moreover, there are three main Arabic varieties. Classical Arabic (CA), the language usually used in religious and literary contexts, is fully structured and vowelized. Modern Standard Arabic (MSA), which is the official language used in education and formal communications across the different Arabic-speaking countries, is based on CA's syntax and morphology, but it tends to have a more modern vocabulary. Finally, the Arabic colloquial dialects (AD), the language used in daily informal conversations, have no orthographic standards, so one word can be written in different forms. Additionally, the AD variety can vary from one region to another across the Arabic countries. AD is mostly divided into six main groups: (1) Egyptian, (2) Levantine, (3) Gulf, (4) Iraqi, (5) Maghrebi, and (6) Others, which contains the remaining dialects [9].

2.3 Problem statement
Because of the complexity of the Arabic language, several challenges are faced when dealing with it. Figure 3 presents the ANLP process, which is the same as the generic NLP process.

Figure 3: ANLP Process

In this section, we'll go through each step of the process and discuss some of the techniques used.

As Arabic is one of the non-Latin languages mentioned earlier, the following challenges are faced while processing Arabic texts:

Morphological richness: Arabic words have many forms that result from rich inflection. These include gender, number, person, aspect, mood, case, and some attachable clitic features. As a result, it is not uncommon to find a single Arabic word that translates into a five-word English phrase: وسيلومونه (wa+sa+ya-lum-una+hou) 'and they will blame him'. This richness creates a challenge for ML models by increasing the number of unique vocabulary types compared to English [10]. This challenge is generally faced during the steps of stemming and tokenization.

Orthographic ambiguity: Arabic letters use optional diacritics to represent short vowels and other phonological information that is important in distinguishing words from each other. Outside religious texts and children's literature, these symbols are rarely used, which leads to a high degree of ambiguity. Educated Arabs usually have no problem reading Arabic without diacritics, but it is a challenge for both learners of Arabic and computers. For example, 'ذهبت' without diacritics could be read as 'ذَهَبْتُ' (dahabtou), which means "I went", or as 'ذَهَبَتْ' (dahabat), which means "she went". Referring to the generic process followed in NLP, this challenge appears while performing the steps of stemming and embedding.

Dialect variations: As we mentioned earlier, there are various dialects in the Arabic language, which enriches its corpus but creates an obstacle when processing Arabic texts. Each dialect has its own grammar and lexicon, which differ from those of the other dialects and from Standard Arabic. This results in variation and complexity that are faced during the ANLP process, especially during tokenisation, stemming, and embedding.

Resource poverty: In NLP, data is the most important resource; this is true for rule-based approaches, which require carefully constructed lexicons and rules, as well as ML approaches, which require corpora and annotated corpora. Although there are many corpora of unannotated
Arabic texts, morphological analysers, Arabic lexicons, and annotated corpora are not available. In addition, annotations other than news and dialects are limited. The importance of data availability is evident from the process mentioned above, as data collection is the baseline for the work to follow.

Based on these challenges, our aim is to investigate their impact during each step of the process described in Figure 3. How could the mentioned challenges influence the data collection, normalisation, stemming, tokenisation and embedding steps? Which step is more influenced than the others?

"Arabic natural language processing and word embedding", "Arabic natural language processing and word embedding", "Arabic Sentiment Analysis", and "Arabic Sentiment Analysis and word embedding".

After we retrieved the articles from the online databases, the titles and abstracts of the articles were screened using predefined selection criteria. An article was considered suitable for inclusion in this research if it met all the following inclusion criteria:
• It deals with Arabic NLP,
• It focuses on at least one step of the NLP process,
extracted different challenges faced in each step. In this section, we'll discuss each step of the general process.

4.1. Data Collection
As mentioned before, data is the main basis for a high-performance NLP model. Unfortunately, the availability of good and reliable Arabic resources was one of the major issues, as reported by [11]. [12], [13], and [14] brought up the issue of unbalanced data. For [13], the collected corpus consisted of 5K airline service-related tweets in Arabic. In [15], the authors mentioned the lack of available corpora on the web for Arabic sentiment analysis for standard and dialect variations and created the SANA corpus, which is a collection of comments from three Algerian newspapers. In [16], the authors raised the lack of work and resources in the Algerian dialect and collected the “Ar_corpus1” from Facebook. It is worth mentioning that all the extracted data were from social media platforms such as Facebook and Twitter, and are therefore generally short user reviews and comments with a limited number of words.

From the reviewed articles, we extracted some of the open-access datasets that can be used in future work. We mention the SANAD dataset, which is a large collection of Arabic news articles that can be used in different ANLP tasks such as Text Classification and Word Embedding. We also note the Arabic Sentiment Tweets Dataset (ASTD), which contains over 10k Arabic sentiment tweets classified into four classes: subjective positive, subjective negative, subjective mixed, and objective. The Arabic Influencer Twitter Dataset (AITD) is also available in open access, in addition to the Arabic Social Media News Dataset (ASND). There is also the Open-Source Arabic Corpora (OSAC), a large standard dataset for text categorization. Finally, we note ArSarcasm, a new Arabic sarcasm detection dataset containing 10,547 tweets in multiple Arabic dialects, 1,682 (16%) of which are sarcastic. Moreover, [17] faced the lack of resources to detect Arabic fake news and constructed their own Arabic dataset based on news sentences from an Arabic Twitter dataset and by performing web scraping.

4.2. Data annotation
In NLP, data should be annotated according to the targeted task, such as sentiment analysis. This point was raised in [18], [19], [20] and [21], where the collected data was annotated manually. Both [22] and [23] manually annotated data related to violence in order to classify it. [24] accomplished the annotation by turning to crowdsourcing to obtain multiple annotations. Moreover, in [15] the MATTER (Model, Annotate, Test, Train, Evaluate, Revise) approach, which is a general methodology for creating annotation and ML tasks of all different types, was used to annotate the data. It is worth mentioning that the annotation depends on the field of study, as in the case of [24], where the authors requested the assistance of psychologists to complete this task.

4.3. Word normalisation
Normalisation helps to reduce the amount of different information that the computer must deal with, and therefore improves efficiency. It is also one of the mandatory steps in NLP and ANLP. The authors in [11] used three tools, namely MADAMIRA, Farasa, and the Stanford toolkit, in order to handle Arabic morphological and text processing. The MADAMIRA tool is “a fast, comprehensive tool for morphological analysis and disambiguation of Arabic” [25]. As for Farasa, it is “a fast and accurate text processing toolkit for Arabic text” [26]. Finally, the Stanford toolkit is “an extensible pipeline that provides core natural language analysis” [27]. Farasa was also used by [28] and [29] to segment the input data. Moreover, it was used with the XLNet model in [30] and achieved good results.

4.4. Data tokenization
Word tokenization is the process of splitting a large sample of text into words. This is a requirement in NLP tasks where each word needs to be captured and subjected to further analysis. [8] addressed this concept and proposed a tokenisation algorithm to fragment Arabic text into words based on the spaces between words and punctuation. [29] used the BERT tokenizer to perform the tokenization of the Arabic corpus. Since the Arabic language does not have capitalisation, further techniques need to be used to break down the sentence into tokens. In [31], the WordPiece tokenizer of BERT was used to perform the tokenisation. [30] also investigated the WordPiece tokenizer and the SentencePiece tokenizer of XLNet on an Arabic corpus and achieved an accuracy of 94.78% when used with XLNet. We also noted that [17] used its own tokenizer model based on BERT models to deal with the sentence’s meaning.

4.5. Data stemming
Data stemming consists of producing morphological variants of a root/base word. For this reason, various techniques are used, as [32] noted.
• Root-based approach: The main goal of these stemmers is to extract the root of words,
• Stem-based approach: These stemmers identify the stem of words. As an example of an Arabic stemmer, we cite Farasa,
• Light-stem-based approach: These stemmers are used to eliminate suffixes and prefixes from stems. Light10 is the most widely used light stemmer,
• Arabic morphological analyser: It can identify the different forms of words (root, stem, and light stem).

In [8], the authors proposed a stemmer module that provides a solution to the challenges resulting from the complexity of social media Arabic words and aims to fulfil two objectives: (1) to help understand the meaning of the word by providing its root, and (2) to determine whether the word is a noun, a stop word, or a non-standard Arabic word (in case of not finding its root), that is, a dialect word, an error, or a non-Arabic word written using the Arabic script. [12] also investigated the impact of stemming combined with word embedding for Text Categorization and obtained an F1 score of 97.96% by combining the light stemmer, Word2Vec as the embedding technique, and Att-GRU as the classification model. Moreover, [33] proposed a broken plural rule (BPR) algorithm introducing new solutions to solve the problem.

4.6. Word Embedding
One of the critical steps in all models or tasks is word embedding. This step consists of generating distributed word vector representations and representing the words or sentences of a text by vectors of real numbers. In other words, this step consists of converting the textual data into machine-interpretable numerical data. Different techniques were used in the Arabic context. This step is one of the mandatory steps to succeed in an NLP project, whether in an Arabic context or in other languages. From the reviewed articles, we found that [12] investigated word embedding with Word2Vec for Text Categorization, which resulted in an F1 score of 97.96% when combined with Att-GRU as a classification model. [11] implemented the GloVe technique to extract the embeddings and obtained an accuracy of 94.80% when combined with CNN. In [32], the authors investigated the AraVec embeddings, which achieved an accuracy of 87.51% with NuSVC. FastText was used in [16] and reached an accuracy of 80% with CNN. We also find the AraBERT model used in [34] on a multi-dialect dataset for the embedding step, with an accuracy of 89.6%. The QARIB model was implemented as an embedding model and a classifier in [35] and achieved an accuracy of 70% on a multi-dialect dataset. Moreover, mBERT was also used for the embedding task in [34], combined with itself as the classifier, to achieve an accuracy of 93.8%.

4.7. Task performed
Once all the previous steps have been applied, we move on to the ultimate objective we want to achieve. The objectives vary according to the orientation and vision of the article and the researcher. The aim of the article could be related to the classification or the prediction of humans’ sentiments, emotions, appraisals, attitudes, or opinions toward products, issues, events, or services. In our case, we extracted articles that aim to analyse sentiments, as seen in [15], [16], [18], [19], [20], [26], [36], [29], [32], [37], [49], [38], [39], [40] and [41]. We also found articles interested in emotion detection, as stated in [42], [41] and [43]. Irony and sarcasm detection was the topic of [15], [37] and [44]. In [21] and [45] the objective was text categorization. Detection of hate speech was the subject of [46] and [47], detection of Arabic health information was treated in [48], document classification was the aim in [16], and dialect identification was the goal of [34].

Several articles addressed one or more steps of ANLP, as detailed in Table I.
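To make the pre-processing steps of Sections 4.3 to 4.5 more concrete, the following is a minimal sketch of a normalisation, tokenisation and light-stemming chain. It is not the method of any reviewed article: the function names (normalise, tokenise, light_stem, preprocess), the punctuation set and the short affix lists are assumptions chosen for illustration, and the affix lists are only loosely in the spirit of light stemmers, not a reproduction of Light10 or Farasa.

```python
import re
from typing import List

# Arabic diacritics (tanween, short vowels, shadda, sukun) and tatweel.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Whitespace plus a small, illustrative set of Latin and Arabic punctuation.
SEPARATORS = re.compile(r"[\s.,;:!?\u060C\u061B\u061F()\[\]]+")

# Illustrative affix lists (assumption); longer prefixes are tried first.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def normalise(text: str) -> str:
    """Remove diacritics and unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    return text.replace("\u0649", "\u064A")                # alef maqsura -> ya


def tokenise(text: str) -> List[str]:
    """Split on spaces and punctuation, dropping empty fragments."""
    return [tok for tok in SEPARATORS.split(text) if tok]


def light_stem(token: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_len:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_len:
            token = token[: -len(suffix)]
            break
    return token


def preprocess(text: str) -> List[str]:
    """Normalise, tokenise, then light-stem every token."""
    return [light_stem(tok) for tok in tokenise(normalise(text))]
```

With this sketch, the two diacritised readings ذَهَبْتُ and ذَهَبَتْ both normalise to the same undiacritised form ذهبت, which is the orthographic ambiguity described in Section 2.3: normalisation reduces vocabulary size at the price of merging distinct words. The minimum-length guard in light_stem is a design choice that avoids stripping affixes from very short words.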
… Static Word Embeddings for Arabic Sarcasm Detection and Sentiment Analysis | … embedding
[15] | SANA: Sentiment analysis on newspapers comments in Algeria | 2019 | Data collection
[16] | A Semi-supervised Approach for Sentiment Analysis of Arab(ic+izi) Messages: Application to the Algerian Dialect | 2021 | Data collection, embedding
[18] | A Multilingual System for Cyberbullying Detection: Arabic Content Detection using Machine Learning | 2018 | Data annotation
[19] | Sentiment Analysis of Users on Social Networks: Overcoming the challenge of the Loose Usages of the Algerian Dialect | 2018 | Data annotation
[20] | Deep learning approaches for Arabic sentiment analysis | 2019 | Data annotation
[21] | ArCovidVac: Analyzing Arabic Tweets About COVID-19 Vaccination | 2022 | Data annotation
[22] | Sentiment Analysis of Arabic Tweets about Violence Against Women using Machine Learning | 2021 | Data annotation
[23] | Violence Detection over Online Social Networks: An Arabic Sentiment Analysis Approach | 2021 | Data annotation
[24] | ArSentD-LEV: A Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets | 2019 | Data annotation
[11] | Leveraging Arabic sentiment classification using an enhanced CNN-LSTM approach and effective Arabic text preparation | 2021 | Data normalisation
[29] | Arabic Sentiment Analysis Using BERT Model | 2021 | Tokenisation
[8] | Preprocessing Arabic text on social media | 2021 | Tokenisation
[32] | Comparative study of Arabic stemming algorithms for topic identification | 2019 | Stemming, embedding
[34] | Deep Multi-Task Model for Sarcasm Detection and Sentiment Analysis in Arabic Language | 2021 | Tokenisation, embedding
[35] | Benchmarking Transformer-based Language Models for Arabic Sentiment and Sarcasm Detection | 2021 | Tokenisation, embedding
[30] | AraXLNet: pre-trained language model for sentiment analysis of Arabic | 2022 | Tokenisation
[31] | TCE at Qur'an QA 2022: Arabic Language Question Answering Over Holy Qur'an Using a Post-Processed Ensemble of BERT-based Models | 2022 | Tokenisation
[17] | Arabic fake news detection based on deep contextualized embedding models | 2022 | Data collection, tokenization, embedding
[33] | BPR algorithm: New broken plural rules for an Arabic stemmer | 2022 | Stemming
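As an illustration of the embedding step of Section 4.6, that is, representing words by vectors of real numbers, the sketch below builds simple count-based co-occurrence vectors and compares them with cosine similarity. This is a deliberately simplified stand-in and not Word2Vec, GloVe, AraVec, FastText or any BERT-based model used in the reviewed articles; the function names and the window size are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_vectors(sentences: List[List[str]], window: int = 2) -> Dict[str, List[float]]:
    """Map each word to a vector of co-occurrence counts over the vocabulary."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the word itself
                    counts[word][sentence[j]] += 1
    return {word: [float(counts[word][ctx]) for ctx in vocab] for word in vocab}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Applied to tokenised and stemmed Arabic sentences, words that appear in similar contexts receive similar vectors. The learned embeddings cited above follow the same distributional idea but produce dense, low-dimensional vectors, and contextual models such as AraBERT additionally make the vector depend on the surrounding sentence.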
5. RESULTS DISCUSSION
The main objective of the study differs from one article to another, depending on the motivation of the researcher and the need expressed. In our review, we noted a multitude of objectives targeted by the articles. The most recurrent objective is Arabic sentiment analysis (SA), with 63% of the articles, followed by emotion detection with 11.1% and sarcasm and irony detection, also with 11.1%, as illustrated in Figure 5.
articles’ goals, we extracted the three main performed tasks, which are binary classification, ternary classification and multi-class classification.

The first point for any implementation and study in NLP is the data, its nature and type. In the sample studied, we find that Standard Arabic (MSA) was the most used in implementation, with a percentage of 47.1%, followed by the Egyptian and Algerian dialects with 11.8% and 9.8% respectively. However, the Moroccan dialect accounts for a smaller share of the studies, with only 1.8%, as illustrated in Figure 6.

and highlighted the newest and efficient techniques. Based on this research, we found that the field of ANLP still requires more attention to become mature. Moreover, we have clearly shown the importance of the pre-processing steps for a well-performing model. There are several tasks in natural language processing for Arabic (ANLP) that have not been thoroughly researched, including dialect identification, detecting hate speech, and detecting sarcasm and irony. Additionally, certain tasks have been applied more to some Arabic dialects than others, such as sentiment analysis in the Moroccan dialect. This literature review and classification of recent research in ANLP has allowed us to identify areas that still require more research. Our focus in the future is to investigate the application of these tasks to the Moroccan dialect and to help other researchers learn about current trends in ANLP and begin working in this field.
REFERENCES:
Sci., vol. 2, no. 6, pp. 1–17, 2021, doi: 10.1007/s42979-021-00775-6.
[8] M. O. Hegazi, Y. Al-Dossari, A. Al-Yahy, A. Al-Sumari, and A. Hilal, ‘Preprocessing Arabic text on social media’, Heliyon, vol. 7, no. 2, p. e06191, 2021, doi: 10.1016/j.heliyon.2021.e06191.
[9] I. Guellil, H. Saâdane, F. Azouaou, B. Gueni, and D. Nouvel, ‘Arabic natural language processing: An overview’, J. King Saud Univ. - Comput. Inf. Sci., vol. 33, no. 5, pp. 497–507, 2021, doi: 10.1016/j.jksuci.2019.02.006.
[10] K. Darwish et al., ‘A panoramic survey of natural language processing in the Arab world’, Commun. ACM, vol. 64, no. 4, pp. 72–81, 2021, doi: 10.1145/3447735.
[11] A. M. Alayba and V. Palade, ‘Leveraging Arabic sentiment classification using an enhanced CNN-LSTM approach and effective Arabic text preparation’, J. King Saud Univ. - Comput. Inf. Sci., no. xxxx, 2021, doi: 10.1016/j.jksuci.2021.12.004.
[12] H. A. Almuzaini and A. M. Azmi, ‘Impact of Stemming and Word Embedding on Deep Learning-Based Arabic Text Categorization’, IEEE Access, vol. 8, pp. 127913–127928, 2020, doi: 10.1109/ACCESS.2020.3009217.
[13] M. A. Mohammed, S. Muazzam Ahmed, and N. Farrukh, ‘Pre-trained Word Embeddings for Arabic Aspect-Based Sentiment Analysis of Airline Tweets’, in Advances in Intelligent Systems and Computing 1058 Proceedings of the International Conference on Advanced Intelligent Systems and Informatics, 2020, no. September, pp. 7–9. [Online]. Available: http://link.springer.com/10.1007/978-3-030-31129-2.
[14] A. I. Alharbi and M. Lee, ‘Multi-task Learning Using a Combination of Contextualised and Static Word Embeddings for Arabic Sarcasm Detection and Sentiment Analysis’, Proc. Sixth Arab. Nat. Lang. Process. Work., pp. 318–322, 2021, [Online]. Available: https://aclanthology.org/2021.wanlp-1.39.
[15] H. Rahab, A. Zitouni, and M. Djoudi, ‘SANA: Sentiment analysis on newspapers comments in Algeria’, J. King Saud Univ. - Comput. Inf. Sci., vol. 33, no. 7, pp. 899–907, 2021, doi: 10.1016/j.jksuci.2019.04.012.
[16] I. Guellil et al., ‘A Semi-supervised Approach for Sentiment Analysis of Arab(ic+izi) Messages: Application to the Algerian Dialect’, SN Comput. Sci., vol. 2, no. 2, pp. 1–18, 2021, doi: 10.1007/s42979-021-00510-1.
[17] A. B. Nassif, A. Elnagar, O. Elgendy, and Y. Afadar, ‘Arabic fake news detection based on deep contextualized embedding models’, Neural Comput. Appl., vol. 4, 2022, doi: 10.1007/s00521-022-07206-4.
[18] B. Haidar, M. Chamoun, and A. Serhrouchni, ‘A Multilingual System for Cyberbullying Detection: Arabic Content Detection using Machine Learning’, no. December, 2017, doi: 10.25046/aj020634.
[19] A. Soumeur, M. Mokdadi, A. Guessoum, and A. Daoud, ‘Sentiment Analysis of Users on Social Networks: Overcoming the challenge of the Loose Usages of the Algerian Dialect’, Procedia Comput. Sci., vol. 142, pp. 26–37, 2018, doi: 10.1016/j.procs.2018.10.458.
[20] A. Mohammed and R. Kora, ‘Deep learning approaches for Arabic sentiment analysis’, Soc. Netw. Anal. Min., vol. 9, no. 1, pp. 1–12, 2019, doi: 10.1007/s13278-019-0596-4.
[21] H. Mubarak, S. Hassan, S. A. Chowdhury, and F. Alam, ‘ArCovidVac: Analyzing Arabic Tweets About COVID-19 Vaccination’, 2022.
[22] M. Zyout and N. Hassan, ‘Sentiment Analysis of Arabic Tweets about Violence Against Women using Machine Learning’, 2021, no. May.
[23] M. Khalafat, J. S. Alqatawna, R. Al-Sayyed, M. Eshtay, and T. Kobbaey, ‘Violence Detection over Online Social Networks: An Arabic Sentiment Analysis Approach’, Int. J. Interact. Mob. Technol., vol. 15, no. 14, pp. 90–110, 2021, doi: 10.3991/ijim.v15i14.23029.
[24] R. Baly, A. Khaddaj, H. Hajj, W. El-Hajj, and K. B. Shaban, ‘ArSentD-LEV: A Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets’, no. September, 2019, [Online]. Available: http://arxiv.org/abs/1906.01830.
[25] A. C. Stubbs, ‘A Methodology for Using Professional Knowledge in Corpus Annotation’, 2013.
[26] A. Pasha et al., ‘MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic’, in The International Conference on Language Resources and Evaluation, 2014, pp. 1094–1101.
[27] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, ‘Farasa: A fast and furious segmenter for arabic’, NAACL-HLT 2016 - 2016 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. Proc. Demonstr. Sess., vol. 2016, pp. 11–16, 2016, doi: 10.18653/v1/n16-3003.
[28] Abuzayed and H. Al-Khalifa, ‘Sarcasm and Sentiment Detection In Arabic Tweets Using BERT-based Models and Data Augmentation’, Proc. Sixth Arab. Nat. Lang. Process. Work., pp. 312–317, 2021, [Online]. Available: https://www.aclweb.org/anthology/2021.wanlp-1.38.
[29] H. Chouikhi, H. Chniter, and F. Jarray, ‘Arabic Sentiment Analysis Using BERT Model’, Commun. Comput. Inf. Sci., vol. 1463, no. September, pp. 621–632, 2021, doi: 10.1007/978-3-030-88113-9_50.
[30] A. Alduailej and A. Alothaim, ‘AraXLNet: pre-trained language model for sentiment analysis of Arabic’, J. Big Data, vol. 9, no. 1, 2022, doi: 10.1186/s40537-022-00625-z.
[31] M. ElKomy and A. M. Sarhan, ‘TCE at Qur’an QA 2022: Arabic Language Question Answering Over Holy Qur’an Using a Post-Processed Ensemble of BERT-based Models’, 2022, [Online]. Available: http://arxiv.org/abs/2206.01550.
[32] M. Naili, A. H. Chaibi, and H. H. Ben Ghezala, ‘Comparative study of Arabic stemming algorithms for topic identification’, Procedia Comput. Sci., vol. 159, pp. 794–802, 2019, doi: 10.1016/j.procs.2019.09.238.
[33] H. Alshalabi, S. Tiun, N. Omar, E. abdulwahab Anaam, and Y. Saif, ‘BPR algorithm: New broken plural rules for an Arabic stemmer’, Egypt. Informatics J., no. xxxx, 2022, doi: 10.1016/j.eij.2022.02.006.
[34] A. El Mahdaouy, A. El Mekki, K. Essefar, N. El Mamoun, I. Berrada, and A. Khoumsi, ‘Deep Multi-Task Model for Sarcasm Detection and Sentiment Analysis in Arabic Language’, 2021, [Online]. Available: http://arxiv.org/abs/2106.12488.
[35] I. A. Farha and W. Magdy, ‘Benchmarking Transformer-based Language Models for Arabic Sentiment and Sarcasm Detection’, Arab. Nat. Lang. Process. Work., pp. 21–31, 2021.
[36] H. El Moubtahij, H. Abdelali, and E. B. Tazi, ‘AraBERT transformer model for Arabic comments and reviews analysis’, IAES Int. J. Artif. Intell., vol. 11, no. 1, pp. 379–387, 2022, doi: 10.11591/ijai.v11.i1.pp379-387.
[37] I. Kaibi and E. H. Nfaoui, ‘A Comparative Evaluation of Word Embeddings Techniques for Twitter Sentiment Analysis’, 2019 Int. Conf. Wirel. Technol. Embed. Intell. Syst., pp. 1–4, 2019.
[38] A. H. Ombabi, W. Ouarda, and A. M. Alimi, ‘Deep learning CNN – LSTM framework for Arabic sentiment analysis using textual information shared in social networks’, Soc. Netw. Anal. Min., pp. 1–13, 2020, doi: 10.1007/s13278-020-00668-1.
[39] A. Alwehaibi, M. Bikdash, M. Albogmi, and K. Roy, ‘A study of the performance of embedding methods for Arabic short-text sentiment analysis using deep learning approaches’, J. King Saud Univ. - Comput. Inf. Sci., no. xxxx, 2021, doi: 10.1016/j.jksuci.2021.07.011.
[40] K. Ibrahim, N. El Habib, and Hassan Satori, ‘Sentiment Analysis Approach Based on Combination of Word Embedding Techniques’, 2019.
[41] L. Moudjari, F. Benamara, and K. Akli-Astouati, ‘Multi-level embeddings for processing Arabic social media contents’, Comput. Speech Lang., vol. 70, p. 101240, 2021, doi: 10.1016/j.csl.2021.101240.
[42] B. Naaima, E. Soumia, F. Rdouan, and O. H. T. Rachid, ‘Exploring the Use of Word Embedding and Deep Learning in Arabic Sentiment Analysis’, in Advances in Intelligent Systems and Computing, 2020, vol. 1105 AISC, pp. 149–156. doi: 10.1007/978-3-030-36674-2_16.
[43] R. A. Salama, A. Youssef, and A. Fahmy, ‘Morphological Word Embedding for Arabic’, Procedia Comput. Sci., vol. 142, pp. 83–93, 2018, doi: 10.1016/j.procs.2018.10.463.
[44] H. Chouikhi, H. Chniter, and F. Jarray, ‘Stacking BERT based Models for Arabic Sentiment Analysis’, no. January, pp. 144–150, 2021, doi: 10.5220/0010648400003064.
[45] M. Baali and N. Ghneim, ‘Emotion analysis of Arabic tweets using deep learning approach’, J. Big Data, vol. 6, no. 1, 2019, doi: 10.1186/s40537-019-0252-x.
[46] A. Al-Hassan and H. Al-Dossari, ‘Detection of hate speech in Arabic tweets using deep learning’, Multimed. Syst., no. 0123456789, 2021, doi: 10.1007/s00530-020-00742-w.
[47] I. Aljarah et al., ‘Intelligent detection of hate speech in Arabic social network: A machine learning approach’, J. Inf. Sci., vol. 47, no. 4, pp. 483–501, 2021, doi: 10.1177/0165551520917651.
[48] H. Elfaik and E. H. Nfaoui, ‘Combining Context-Aware Embeddings and an Attentional Deep Learning Model for Arabic Affect Analysis on Twitter’, IEEE Access, vol. 9, pp. 111214–111230, 2021, doi: 10.1109/ACCESS.2021.3102087.