
Applied Computer Systems

ISSN 2255-8691 (online)


ISSN 2255-8683 (print)
June 2022, vol. 27, no. 1, pp. 30–42
https://doi.org/10.2478/acss-2022-0004
https://content.sciendo.com

Urdu Sentiment Analysis


Iffraah Rehman1*, Tariq Rahim Soomro2
1,2 CCSIS, Institute of Business Management (IoBM), Karachi, Pakistan

Abstract – The world is moving towards increasingly digitalized data, and a significant growth is therefore observed in the number of active social media users with each passing day. Each post and comment can give an insight into valuable information about a certain topic or issue, a product or a brand, etc. The process of uncovering the underlying information from the opinion that a person holds about any entity is called sentiment analysis. The analysis can be carried out through two main approaches, i.e., either lexicon-based methods or machine learning algorithms. A significant amount of work has been done in numerous languages and domains for sentiment analysis, but minimal research has been conducted on the national language of Pakistan, which is Urdu. Twitter users who are familiar with Urdu post their tweets in two different textual formats, either in Urdu script (Nastaleeq) or in Roman Urdu. Thus, the paper is an attempt to perform sentiment analysis on the Urdu language by extracting tweets (both Nastaleeq and Roman Urdu) from Twitter using the Tweepy API. A machine learning-based approach has been adopted for this study, and the tool opted for the purpose is WEKA. The best algorithm was identified based on evaluation metrics, which comprise the number of correctly and incorrectly classified instances, accuracy, precision, and recall. SMO was found to be the most suitable machine learning algorithm for performing sentiment analysis on Urdu (Nastaleeq) tweets, while for Roman Urdu the Random Forest algorithm was identified as the best one.

Keywords – Machine learning algorithms, sentiment analysis, Tweepy, WEKA.

* Corresponding author's e-mail: std_23371@iobm.edu.pk

©2022 Iffraah Rehman, Tariq Rahim Soomro. This is an open access article licensed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0).

I. INTRODUCTION

Sentiment analysis can be defined as the identification of humans' views and emotional behaviour, expressed in textual form, towards any product, organisation, service, or individual. It is also a kind of data mining, which involves the extraction of human opinions through natural language processing (NLP) and text analysis. Sentiment analysis is often termed "Opinion Mining", for which the source of data extraction is usually the Web [1]. In this era, opinionated data of humans are stored in digital form and are received from different discussion forums, reviews, blogs, and social media sites. Individuals seek the opinions of others while making any decision; this is not restricted to individuals only, as organisations also plan their marketing strategies based on the reviews of their customers. Thus, these opinions of others/customers are not limited to friends and family, as there are many public forums available on the Internet [2]. Sentiment analysis involves the categorization of positive, negative, and at times neutral facts from the textual data available at hand. It is affecting a wide range of domains and applications, such as enhancing sales and improving the marketing of a brand/product, identification of political strategies and analysis of trends, and forecasting the movement of the stock market from world news or financial reports available online [3]. Over time, the concept of sentiment analysis has grown beyond the detection of polarity; different techniques are now available which can detect human emotions such as anger, happiness, and sadness. There are two techniques through which sentiment analysis can be performed. The first technique is machine learning-based, which uses a well-labelled training dataset to which an automatic classifier is applied. The performance of the classifier is evaluated by how well it predicts the polarity of the sentiment(s) supplied to it through a test dataset. Machine learning can further be divided into supervised and unsupervised learning [4]. The second approach that can be adopted for sentiment analysis is lexicon-based, which assigns polarity to the sentences of a document based on the words available in its dictionary. In this case, a dictionary is a collection of positive and negative words that are either collected manually or extracted from already shared resources, such as SentiWordNet [5].

Apart from the English language, sentiment analysis has also been performed for the Arabic language. The Arabic language is written from right to left and involves diacritical marks. It has 28 letters, 25 consonants, and 3 vowels. Several works have been done on Arabic datasets, in which machine learning algorithms, i.e., SVM and Naïve Bayes, have been applied to datasets of Arabic tweets. Attempts have also been made to perform aspect-based sentiment analysis, opinion holder extraction, and opinion spam detection [6]. One of the approaches to sentiment analysis relies on a sentiment lexicon, which can be either a word or a phrase that conveys a strong positive or a strong negative emotion towards something, be it a product or an event. A similar study has been performed for the Arabic language, in which tweets regarding one brand were extracted. An Arabic sentiment ontology (ASO) was built, consisting of 24 words that express sentiments. Hence, positive and negative emotions were evaluated based on the ASO in that dataset of Arabic tweets [7].

Social media is an Internet-based platform, which is central to user-generated content and provides public membership. Nowadays it is the most convenient way to propagate information to others in the blink of an eye. There are several social network sites available, which are widely used by people around the globe, among which the most used are Facebook, Twitter, Instagram, YouTube, LinkedIn, etc. These sites can be categorized into dimensions such as conversations, information openness, connectivity, community formation, and participation [8]. Social media sites generate unstructured data in bulk amounts, on which sentiment analysis can be applied to gain valuable insights. The data of all public posts on social media sites can be extracted through their respective APIs (Application Program Interfaces); for example, Twitter has its user timeline API, streaming API, and REST API [9]. Similarly, Facebook is another large platform for people to express their thoughts and ideas about areas such as politics, business, government, health, etc. These data in textual format can be extracted by using its Graph API [10]. Twitter was launched on 15 July 2006. It was first given the name Twttr; however, about six months after its launch, its name was changed to Twitter. The adoption of Twitter among social media users started to increase in March 2007, as it was a micro-blogging site, which caught the attention of the users [11]. Similarly, since the first quarter of the year 2017, rapid growth in the number of active Twitter users has been observed [12], as shown in Fig. 1.

Fig. 1. Twitter statistics for active number of users from Q1 2017 to Q2 2020.

The time spent by users on Twitter is 3.39 minutes per session on average, on a daily basis. Moreover, the number of tweets sent each day equates to 5787 tweets per second [13]. Initially, Twitter allowed tweets with a maximum of 140 characters only, which required its users to be brief and creative with a post. It is termed a micro-blogging site; it allows the users to post and share short posts, which are termed tweets [14]. Currently, Twitter supports 34 languages, the English language being the topmost used language [15]. Urdu is the national language of Pakistan. It is an Indo-Aryan language having its roots in the Persian and Arabic languages. It is a bidirectional language, i.e., the text is written from right to left, while the numbers are written from left to right. The Urdu character set consists of 38 letters; it has 7 diacritic marks, which are optional, and 7 punctuation marks [16].

The main aim of this study is to perform sentiment analysis specifically on the Urdu language (both Urdu script and Roman Urdu). As far as social media platforms are concerned, their users who can speak, understand, and write in Urdu update their status or write comments in both forms, i.e., the Urdu script (Nastaleeq) and Roman Urdu. Hence, this study is an attempt to perform sentiment analysis on both Nastaleeq and Roman Urdu by extracting the data from Twitter using its API called Tweepy. This study focuses on experimenting with sentiment analysis for the Urdu language by using one of its approaches, i.e., machine learning. Twitter is used to gather the data for Urdu and Roman Urdu tweets. A lot of work has been done on sentiment analysis in the English language, and fewer studies have been performed on sentiment analysis in the Arabic language. However, detailed research has not been conducted to date on sentiment analysis for Urdu (Nastaleeq) and Roman Urdu. There are certain limitations that are faced while performing sentiment analysis on the Urdu language, such as a lack of lexicons, stop word lists, etc. Considering social media and its users in Pakistan, they update their status on social media apps or write comments in Roman Urdu (using the English alphabet to write Urdu), as well as in actual Urdu (Nastaleeq), i.e., writing from right to left in Urdu language fonts. There is no denying the fact that handling the Urdu script is an uphill task using the existing tools. The present paper is organised as follows. Section II covers a literature review to see what work has been done for the Urdu language; Section III discusses the material and methods used in this study; Section IV highlights the results and findings; and Section V discusses the study along with future recommendations.

II. LITERATURE REVIEW

Content over social media has transformed over the years; it has become a rich source for getting people's opinions and views regarding specific topics, brands, products, social issues, etc. Similarly, new research and techniques for performing text mining and semantic analysis are emerging with each passing day. This gives a path to gain valuable insights that exist within the data generated by social networking sites in all aspects of life, whether it is healthcare, politics, etc. In [17], an attempt has been made by the authors to analyse and predict the stock movement of Microsoft by extracting data from Twitter and applying sentiment analysis to it. The data, consisting of tweets comprising the keywords #MSFT, #Microsoft, #Windows, and other Microsoft brands, were collected for a certain date range of the year 2016. Also, the stock prices (opening and closing) of Microsoft were obtained from Yahoo. The data are first pre-processed through three steps: tokenization, stop word removal, and regex cleaning. Extracted tweets are further categorized as positive, negative, and neutral. Two methods, n-grams and word2vec, have been used for textual representation. This process further passes through model training and is analysed for the correlation of price and sentiment. Thus, it has been found that strong relationships exist between people's reviews about a company, expressed through tweets on Twitter, and the rise or fall of the company's stock price.

The social media platform Twitter is also widely used by sports fans; hence, a winner can be predicted in advance through user thoughts and reviews posted on Twitter. Similarly, in [18], the winner of a premier football league has been predicted. Real-time data have been extracted from the Twitter API, consisting of all the tweets having hashtags related to the league, and stored in the Central Sport system, which is a tool for Twitter collection and sentiment analysis. Match results are obtained from ESPN.com. Every tweet is analysed on two axes, tone and opinion finder. Separate models are built based on objectively classified tweets, all positive tweets, and all negative tweets. Social media, being an open platform for all of its users around the globe, gives people freedom of speech, whether it is people protesting against some issue or supporting a good cause; all of these thoughts can be viewed and analysed on social media. Back in 2015, there was a refugee crisis in Europe and different views were coming from different people. Sentiment analysis on this crisis has also been performed in [19]. For this purpose, tweets in two different languages, i.e., English and German, were gathered. These data, after being pre-processed and filtered, undergo a natural language processing tool called Linguistic Inquiry and Word Count (LIWC), which further classifies the tweets into emotions like positive, negative, anger, and anxiety. The sentiment score has been calculated for each category and in each language. The dataset obtained from Twitter is mostly used for viewpoint analysis of users on different things. The views can be positive, neutral, or negative. One of the many ways to perform this analysis is to use a cluster-based method, such as in [20]. A new method, CSK, has been proposed by combining k-means and cuckoo search. This method takes Twitter data as input and clusters them in three phases: (1) pre-processing of data, which removes all the noise from the data such as URLs and hash (#) characters; (2) feature extraction, i.e., words are removed according to a dictionary of stop words and acronyms; and (3) hybrid clustering using k-means. Then, the tweets are converted into feature vectors by segregating them into positive, negative, and neutral emoji, positive and negative exclamations and negation, and positive, negative, and neutral words. The results obtained from the proposed method are then compared with cuckoo search, improved cuckoo search, SVM-tri, and Naïve Bayes-tri based on a statistical comparison of accuracy, computational time, and fitness function value. Thus, the proposed method has been found most effective for the viewpoint analysis of Twitter data.

Twitter, having no language barrier, comprises millions of tweets every single day. These tweets are not restricted to the English language, as users update their Twitter accounts in their native languages as well. Similarly, thousands of users are found to tweet in the Arabic language. Several attempts have already been made by different authors/researchers to perform sentiment analysis in Arabic. However, Arabic is a very challenging language for sentiment analysis, as it is written from right to left and may or may not have special characters, which can also be termed diacritical marks. A hybrid approach has been proposed in [21], in which manually labelling data for the machine learning classifier SVM has been tried. Twitter data have been collected using the Tweet Archivist tool, based on the hashtag of a social issue in Saudi Arabia. Of the 500 000 crawled tweets collected, only 1103 tweets were left after data pre-processing and filtration. These data are passed through a lexicon-based classifier, which gives an output of labelled data, which is further passed through SVM to build a classifier model using the WEKA tool. These labelled classes are based on the negative and positive polarity of the tweets. The results are based on the accuracy and the confusion matrix. This study has proven that neither of the individual approaches (the lexicon-based classifier or SVM alone) could achieve the accuracy of 84.01 % reached by this hybrid approach.

An attempt has been made in [22] to identify which machine learning algorithm best classifies tweets regarding healthcare as either positive or negative. The data from Twitter were extracted through the Twitter API, using keywords related to healthcare in Arabic. Initially, 126 959 tweets were extracted based on keywords such as ‫ﻣﺴﺘﺸﻔﻰ_ﺗﻐﻠﻖ_اﻟﺼﺤﺔ‬# (Closing Hospital), ‫اﻟﺼﺤﺔ_ﯾﻌﺎﻟﺞ_ﻣﻦ‬# (Solving Health), ‫اﻟﺼﺤﺔ_ﺗﺤﺴﯿﻦ_ﻧﺘﻨﺘﻈﺮ‬# (Improving Health), and ‫اﻟﺼﺤﯿﺔ_ﺑﺎﻟﺨﺪﻣﺎت_رأﯾﻚ‬# (Opinions about Health). Then, the data were pre-processed and the noise was removed; the total number of tweets left was 2026. Further, a deep learning method was compared with machine learning classification algorithms such as naïve Bayes, logistic regression, and SVM, based on the calculated accuracy. Thus, SVM was found to be the best among all methods, with accuracy between 85 % and 91 %. Apart from the application of linear models for sentiment analysis on Arabic tweet data, research has now also started on the application of deep learning models. Two deep learning models, a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) model, have been combined to make an ensemble model [23]. This model then predicts the sentiment of Arabic tweets. All the tweets have a fixed-size vector. CNN and LSTM were trained with different hyper-parameters. The data were divided into training (70 %), testing (10 %), and validation (20 %). Thus, the ensemble model gave an accuracy of 65.05 % and an F1 score of 64.46 %. Most of the work on sentiment analysis has been done in the English language, while other languages hold minimal importance in the world of social media, such as Urdu, which is the national language of Pakistan. Twitter allows tweeting short blogs in Urdu as well. However, Urdu on Twitter can be found in two different ways, i.e., either in actual Urdu (Nastaliq) written from right to left (using Urdu language fonts) or written in Roman Urdu using the English alphabet. Very little research and practical work have been done on sentiment analysis of Twitter data in the Urdu language.

In [24], data based on 12 different hashtags were extracted, #CrimeMinisterNawaz, #JiyeBhutto, #MaryamMeriAwaz, and #PakStandsWithSC being a few of them. These tweets were first extracted in the Urdu language through the Twitter API and then translated through a translator API. The actual dataset had 100 000 tweets; after pre-processing, 6250 tweets were left. These tweets were first analysed through three different analysers, namely TextBlob, SentiWord, and W-WSD.
TextBlob had the highest number of positive tweets, i.e., 3380 (54.08 %), while SentiWord gave the highest number of negative tweets, 3054 (48.46 %). To verify the polarity obtained from these sentiment analysers, the data were further passed through two machine learning algorithms, i.e., Naïve Bayes and SVM. The W-WSD analyser was found to give the highest number of correct instances, with 79.00 % for the given data using Naïve Bayes, while TextBlob gave an accuracy of 62.67 % using SVM. Another attempt was made in [25], in which sentiment analysis was performed on a bilingual dataset, i.e., data consisting of tweets in English as well as in Roman Urdu. The data were extracted based on hashtags related to the General Election of Pakistan. 89 000 tweets were collected, out of which 82 224 were further classified as political and 6847 as non-political. A bi-lingual sentiment lexicon (BSL) was developed using WordNet, SentiStrength, and a bilingual glossary. The sole purpose of the BSL was to calculate the frequency of terms used in tweets in each language in order to give them strength, also known as sentiment scores. Thus, English tweets had a strength lexicon of 2600 words and Roman Urdu had 3200 words. Results were analysed by evaluating the measures of recall, precision, accuracy, and F-measure.

Many tweets are posted in Urdu related to the different political situations prevailing in Pakistan; similarly, the polarity of these views can be measured by sentiment analysis. An attempt was made in [26], in which data of 2703 tweets in Roman Urdu were extracted from Twitter based on the hashtags of cellular companies, such as Ufone, Zong, Mobilink, Warid, and Telenor, by using the Twitter API. The statistical calculations were done using the R language. The polarity of each tweet was analysed by giving it a weighted score through SentiWordNet. Six machine learning algorithms, i.e., Bagging, SVM, Boosting, Forest Tree, MaxEnt, and Naïve Bayes, were compared based on precision, recall, and F-score. Thus, bagging was found to predict the best results of sentiments among all. The results of sentiment analysis may differ in different languages; in [27], Twitter data were extracted based on six different keywords in Urdu and English. The keywords were Pakistan, Imran Khan, Nawaz Sharif, Bhutto, Justice, and Dam. First, the sentiment analysis was performed on English tweets of the keywords and then on Urdu tweets. In English, the keyword "Pakistan" had the highest strength of 42 %, while the keyword "Justice" had the lowest strength of 14 %. The keyword "Imran Khan" had a polarity of 9:7, i.e., 9 positive and 7 negative. However, for Urdu the keyword "ڈیم" (Dam) had the highest strength of 20 % and "نواز شریف" (Nawaz Sharif) had 0 % strength. Also, the keyword "Imran Khan" had a sentiment of 11:3, i.e., 11 positive and 3 negative, unlike the result obtained in English for the same keyword, which was 9:2. It has been observed that Roman Urdu differs in many ways in terms of how people write it. Everyone has a different style of spelling words; thus, it does not have any standard for writing. An attempt was made by the authors in [28] to perform sentiment analysis on Roman Urdu. The data extracted were based on people's reviews of different products on an e-commerce site known as Daraz.pk. Data of 80 285 reviews were collected and stored as vector space models, which were then labelled as negative, positive, and neutral. Later, the support vector machine algorithm was applied with the linear and cubic kernels. To make SVM classify multiple classes, two approaches, one-vs-one (OVO) and one-vs-all (OVA), were adopted. The algorithms were applied directly to the complete data, and it was found that the cubic kernel was marginally better than the linear one, as both of them showed high accuracy for sentiment N, i.e., Neutral, while sentiment analysis for E, i.e., negative, remains a challenge. The technique of deep learning was also applied for sentiment analysis in Roman Urdu. The author of the paper [29] used a deep neural network long short-term memory (LSTM) model on a Roman Urdu dataset. The training model was built by using four layers: the input (embedding) layer, the hidden layer, the LSTM layer, and the output layer. A cross-validation approach was applied with 10 folds, in which 90 % of the data was used for training while 10 % was used for validation testing. The obtained results were compared with the machine learning-based classifiers Naïve Bayes, random forest, and SVM. Thus, it was found that the proposed deep learning model was able to give better results, with the highest accuracy of 0.95180 among the other three machine learning classifiers. Similarly, in [30] it was found that around 64 million people speak the Urdu language and know its Roman script. In that study, a deep learning model was proposed which was able to mine the emotions and feelings of people in Roman Urdu. Two types of classification were carried out, binary classification and ternary classification. It was concluded that the RCNN model performed exceptionally better than the baseline models, with an accuracy of 0.652 for binary classification and 0.572 for ternary classification.

III. MATERIAL AND METHODS

This section illustrates the methodology that has been adopted to perform sentiment analysis of the Urdu language. There are two ways through which sentiment analysis can be performed: one is lexicon-based, and the second is machine learning-based. Since the Urdu language is very rich in terms of morphology, creating lexicons in Urdu is a challenging task. Therefore, a machine learning approach has been selected for this study. First, the medium used for data collection is Twitter, a well-known social media site. The tweets are extracted in the Urdu language, i.e., both in Roman Urdu and written in the Urdu script (Nastaleeq). Access to Twitter data has been gained using one of Twitter's APIs, called the Tweepy API. Second, the extracted data are pre-processed. Third, the pre-processed data are translated into the English language. Next, polarity is assigned to each tweet by using the NLTK and TextBlob libraries of Python. In the next step, the obtained data are prepared for training and testing and are further passed through several machine learning algorithms available in the WEKA tool.

A. Data Collection and Tools

Tweets that are posted in the Urdu language (Nastaleeq and Roman) on Twitter have been extracted using the Tweepy API. Python is used as the programming language for the API calls and data extraction. Lastly, the Sublime Text editor has been used as an editor for writing the code.

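The extraction workflow described above can be illustrated with a short script. The sketch below is not the authors' original code; it is a minimal example assuming Tweepy 4.x (where the search endpoint is search_tweets), placeholder API credentials, and one of the hashtags listed in the next subsection.

```python
import csv
import tweepy

# Placeholder credentials - replace with real Twitter API keys.
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Collect Urdu-language tweets for one hashtag; lang="ur" mirrors the
# condition used in the study to discard tweets in other languages.
rows = []
for tweet in tweepy.Cursor(api.search_tweets,
                           q="#CoronaVirusPakistan -filter:retweets",
                           lang="ur",
                           tweet_mode="extended").items(500):
    text = tweet.full_text.replace("\n", " ").strip()
    if text:  # basic cleaning before saving
        rows.append([tweet.id_str, text])

with open("urdu_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["tweet_id", "tweet_text"])
    writer.writerows(rows)
```

Running the same loop over each hashtag, at different time intervals, yields the raw CSV files that are cleaned and labelled in the following subsection.
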
B. Data Analysis Model

The extraction of tweets from Twitter begins by selecting the hashtags. The selected hashtags serve as the baseline for collecting the tweets related to one specific domain. In this study, several hashtags have been used that are related to the prevailing coronavirus situation in Pakistan only:

#CoronaVirusPakistan #stayhomestaysafe #LockdownExtended #CoronaInPakistan #COVIDisnotFlu #StayStrongImranKhan #sealeddown #PakistanFightsCorona #PMIKFightsCorona #COVID-19 #CoronaFreePakistan #WeStandWithPMImranKhan #StaySafeStayHome #TigerForce #StayAtHomeChallenge #LockdownEnd #LiftingRestrictions #lockdownpakistan #CoronaVirus #PMIKCoronaFund #PakistanUnitedAgainstCorona #PunjabFightsCorona #COVID19outbreak

The extraction took place at different time intervals, depending on whether any of the mentioned hashtags was trending on Twitter. Once a topic is trending on Twitter, the chances of getting a maximum number of tweets become higher. The Tweepy API supports the Urdu language; similarly, Python code was used along with the API to extract the tweets only in the Urdu language for the given hashtags. A condition was added in the code to discard all the tweets that were in any language other than Urdu. During this extraction, pre-processing of the data was also done so as to save the clean tweets in a CSV file.

The total number of tweets gathered was 2300. In this dataset, there were some tweets which only included unwanted text followed by a link, such as "ویڈیو دیکھنے کے لیے نیچے لنک پر کلک کریں" ("click on the link below to watch the video"). Such tweets were removed, and the final dataset that was obtained had 1914 tweets, which were further used to perform the sentiment analysis. The pre-processing steps were then performed.

The process to gather the tweets in Roman Urdu is similar to the extraction process for tweets in the Urdu script (Nastaleeq), except for the fact that Roman Urdu is similar to the English language in terms of the alphabet and letters used for it. Consequently, the identification of tweets written in Roman Urdu, as distinct from actual English, was a challenging task in the research. To overcome this challenge, the hashtags selected for this purpose were solely related to ongoing political issues in Pakistan. Also, a condition was added in the Python code to exclude all the tweets that were found either in English or in Urdu script, leaving behind only the tweets that were in Roman Urdu for the selected hashtags.

The hashtags used are provided below. A total of 1127 tweets in Roman Urdu were gathered after the extraction and basic pre-processing. The pre-processing steps were then performed.

#CorruptionDushmanARY #Patwari #LanatARY #calloffuniversityexams #SindhDushmanPPP #BilawalBhuttoTheLeader #SindhDushmanARY #supportARY #ARYNewsUrdu #ARYNews #PMIK #chorPPP #BoycottARY #ARYLanat #WeWillExposeARY #We_love_ARY #iqrarulhassan #PMImranKhan #PPP #GoNaiziGo #ImranKhan #extendlockdown #TeamSareAam #imrankhanPTI #SaeedGhani #BilawalBhuttoJawabDo #ThankYouPMIK #SindhGovt #We_love_iqrar #kaptan #ARYNiazi #bhuttozardari #PEMRAShouldBanIqrarARY

The next step after the tweet extraction was to get the polarity of each tweet in order to obtain labelled data. The labelled data were used to implement the machine learning approach of sentiment analysis. Python provides packages which can automatically detect the sentiment of a sentence, i.e., whether the sentence is positive, negative, or neutral. These packages are NLTK (Natural Language Toolkit) and TextBlob. NLTK takes a string as input, tokenizes it into words, removes the stop words, and then performs stemming. It also tags each word of a string with its part of speech. These words are then matched with the words available in a bag of words to get the sentiment scores. In NLTK, the bag of words that is used is called SentiWordNet. This is how NLTK works and assigns the polarity of each sentence in a dataset. TextBlob, in turn, is built upon NLTK, as it uses the NLTK corpora only. TextBlob identifies the polarity by calculating the number of positive and negative words in a sentence [31]. The two mentioned Python libraries do not provide support for the Urdu language. Also, there is no Python library which has tokens for Urdu words, Urdu part-of-speech tags, or a complete and authentic list of positive and negative Urdu words. For this reason, the polarity of the extracted Urdu tweets cannot be detected directly. To overcome this issue, the Urdu tweets were translated into English using the Google translator API accessed through Python. After the translation, the polarity was assigned using TextBlob and NLTK. A CSV file containing the Urdu tweets was passed through the Python code with the Google translator API. Once the tweets were translated into English, they were again saved into a CSV file. In the next step, the translated tweets were assigned a polarity by using NLTK and TextBlob. The tweets underwent some basic pre-processing again, as mentioned earlier, to avoid unwanted noise that might have occurred during the translation. If the polarity of a tweet was greater than zero (0), it was declared positive; if it was less than zero (0), it was declared negative. Moreover, if the polarity of a tweet was exactly zero (0), it was declared neutral. Table I depicts the polarity assignment to an Urdu tweet as an example.

Roman Urdu is written using English letters and the English alphabet, but it differs from the English language. Similarly, the Python libraries NLTK and TextBlob cannot be used for obtaining the polarity of Roman Urdu tweets. Therefore, in this study, the collected Roman Urdu tweets from Twitter were converted into the English language by using the online website https://www.ijunoon.com/. After the tweet conversion from Roman Urdu to English, the tweets were passed through NLTK and TextBlob to assign a polarity to each tweet. Table II depicts the polarity assignment to a Roman Urdu tweet as an example.

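The translate-then-label step can be sketched as follows. This is not the study's original script: the translation client (googletrans) and the file names are assumptions, and only the TextBlob polarity thresholds (greater than 0 positive, less than 0 negative, exactly 0 neutral) are taken directly from the description above.

```python
import csv
from textblob import TextBlob
from googletrans import Translator  # assumed client for the Google translation step

translator = Translator()

def label_polarity(score):
    # Thresholds as described in the text: >0 positive, <0 negative, 0 neutral.
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

with open("urdu_tweets.csv", encoding="utf-8") as src, \
     open("urdu_tweets_labelled.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["tweet_urdu", "tweet_english", "polarity", "label"])
    for row in reader:
        english = translator.translate(row["tweet_text"], src="ur", dest="en").text
        polarity = TextBlob(english).sentiment.polarity
        writer.writerow([row["tweet_text"], english, polarity, label_polarity(polarity)])
```

For Roman Urdu the same labelling applies, except that the Roman Urdu to English conversion was done through the ijunoon website rather than a translation API.
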
Stop words can be defined as those words in a natural language that carry little or no information at all. These are words that occur frequently in documents but are less informative. Removal of such words from the text helps in significant information retrieval. The removal of stop words usually takes place in the pre-processing step. Thus, the elimination of such uninformative words from a sentence or document plays an important role in gaining better efficiency and accuracy on textual data [32]. Several works and studies were performed in the past for the development of stop word lists in the Urdu language, for both Nastaleeq and Roman Urdu. In [33], an algorithm was proposed which removed the stop words from Urdu documents. The methodology of this algorithm was defined by using a deterministic finite automaton (DFA). However, no stable list of stop words was obtained by using the proposed algorithm. The authors of the paper [34] attempted to perform text classification in the Urdu language, for which stop word removal was an essential step in pre-processing. Hence, a list of stop words was built manually based upon the data used. That list also included Arabic words.

In this study, while considering the stop words for the Urdu language (Nastaleeq), a list of stop words has been compiled from three different sources. The sources, along with the number of words obtained from each source, are mentioned below:
• Ranks NL: This is an online website that provides stop words in 40+ languages. A list consisting of 548 Urdu stop words has been obtained from this site. The source can be accessed at https://www.ranks.nl/stopwords/urdu. However, in the provided list some words have been found which do not belong to the Urdu language, and some have been duplicated. Hence, after the elimination process, 286 stop words have been obtained from this source.
• The second source, which has been used to get the Urdu stop words, is the paper [35]. An algorithm was designed that identified the most frequently used words in each document. This algorithmic approach was implemented on the publicly available dataset of BBC Urdu news articles. The authors were able to get 93 different stop words for the Urdu language.
• The third source used in this study is the paper [36]. The authors of the paper have attempted to present a list of 783 Urdu stop words. These words have been obtained by translating the available list of English-language stop words into Urdu.
Similarly, after the stop word collection from the above-mentioned sources, the duplicates were removed to get a unique set of words. The final list that has been obtained consists of 1003 stop words in the Urdu language (Nastaleeq script).

Urdu written in the Roman script is called Roman Urdu. Currently, there is no proper standard defined for writing Roman Urdu. It also has many irregularities in spelling; therefore, every user on social media updates their posts and tweets, or writes comments, in their own style of Roman Urdu. As far as this study is concerned, a list of stop words for Roman Urdu has been prepared manually. The stop words were extracted directly from the data used for the sentiment analysis, i.e., the tweets extracted in Roman Urdu from Twitter. The final list that has been obtained consists of 815 stop words of Roman Urdu.

C. WEKA

WEKA stands for Waikato Environment for Knowledge Analysis. This tool is easily available online and has a public license. WEKA supports several machine learning tasks, such as classification, clustering, regression, attribute selection, visualization, and, most importantly, data pre-processing [37].

The testing and training dataset has been divided in the ratio of 70:30 (refer to Table III). The training set comprises 70 % of the tweets, while the remaining 30 % is used for testing. The total number of tweets in the Urdu language is 1914, out of which 1414 are reserved for the training purpose and the remaining 500 for testing. As for Roman Urdu, the dataset consists of 1127 tweets in total. Out of the total count, 789 tweets are used for training the machine learning algorithms, while 338 tweets are used for testing purposes.

The dataset is prepared in .arff file format, which will be loaded into WEKA. The attributes of the dataset are the following:
@relation Urdu_SA;
@attribute Tweet_Urdu string;
@attribute 'Class' {'Positive','Negative','Neutral'}.

Pre-processing is the most important step that has to be performed before the application of any machine learning algorithm to the dataset. Similarly, in this research, the training set is first loaded into WEKA using the Explorer tab for pre-processing. Feature extraction is the first step in pre-processing. Feature extraction can be described as an initial reduction of a dataset without the loss of important and relevant information. It extracts the important features or variables from the dataset through different techniques, thereby reducing the machine learning algorithm's effort in building a classifier. It also helps in improving the accuracy of the selected classifier for the dataset. There are different methods available for feature extraction, such as bag of words, word embeddings, TF-IDF, and count vectors. In this study TF-IDF has been used [38]. The StringToWordVector filter, which is available in WEKA, was applied.

Along with feature extraction, some other pre-processing tasks have also been carried out through the StringToWordVector filter, as listed below:
• TF-IDF stands for Term Frequency-Inverse Document Frequency. It helps in identifying how important a certain word is to a document or a sentence. If a word has a high occurrence in a document, then its importance also increases, while this importance is offset by its frequency in the corpus [39]. For example, words such as "کی", "ہے", and "ہیں" (the Urdu equivalents of "of", "is", and "are") may appear very frequently in a document as well as in the complete corpus, while a word such as "خوبصورت" ("beautiful") may occur frequently in a single document but not in the whole corpus; thus, it would be considered relevant and important information for the analysis.
• LowerCaseToken. This parameter was used during the pre-processing of Roman Urdu only. All the words were transformed into lower-case letters so that all the words were in the same format for further analysis.
• StopwordsHandler. In this parameter, a file was given as input which consisted of the compiled stop words for Urdu (Nastaleeq) and Roman Urdu. The stop words were used in the respective pre-processing of Urdu and Roman Urdu.
• Tokenizer. This splits the words present in a string into single tokens. The tokens are identified by WEKA using different delimiters. The delimiters used by the tool are \r\n\t,,::\\"()?!"
• WordsToKeep. This parameter allows the user to set the possible limit of words per class. It helps in keeping a maximum number of words, even those with smaller frequency, which may be useful when building a classifier for the given dataset. In this study, the limit was set to 10 000.

Other available parameters of StringToWordVector were not utilized in this study; hence, they were left at their default settings. The dataset was thus pre-processed as required with the above-mentioned parameters. The last attribute of a dataset loaded in WEKA is considered the class. However, in this case, after the StringToWordVector filter was applied to the datasets of Urdu and Roman Urdu tweets, the class was found to be the very first token. Therefore, the dataset was edited in WEKA, and the class column was declared as "Attribute as Class" and moved to the last position.

In the next step, another filter, "NumericToBinary", was applied. This filter is used to normalize the attributes by converting each attribute into binary form, thereby assigning it the value either 0 or 1. After that, another filter, "AttributeSelection", was applied. The reason for applying this filter was to find the importance of the attributes: only the applicable attributes are involved in the data mining process, instead of processing all the attributes. In this way the performance of the mining task is increased, and the processing time is reduced as well. Thus, attribute selection algorithms are applied before implementing other data mining tasks. Attribute selection can be performed in two steps: the first step is subset generation, and the second step is ranking [40]. In this case, InfoGainAttributeEval was used, followed by the Ranker search. The training and testing datasets were passed through the previously mentioned pre-processing steps. Table IV represents the number of attributes obtained after the successful execution of the pre-processing phase on the testing and training datasets.

The next step is to train these datasets with different machine learning algorithms in WEKA and then test the trained models.


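The study performs this pipeline entirely inside WEKA's StringToWordVector filter. For readers who prefer a scripted view, the sketch below reproduces the same ideas (tokenization, lowercasing, stop-word removal, TF-IDF weighting, and a capped vocabulary analogous to WordsToKeep) with scikit-learn; it is an illustrative equivalent under those assumptions, not the tool chain used in the paper, and the file name and sample tweets are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stop-word list compiled as described above (1003 Urdu or 815 Roman Urdu entries).
with open("stopwords_roman_urdu.txt", encoding="utf-8") as f:
    stop_words = [line.strip() for line in f if line.strip()]

vectorizer = TfidfVectorizer(
    lowercase=True,         # mirrors the LowerCaseToken parameter (Roman Urdu only)
    stop_words=stop_words,  # mirrors the StopwordsHandler file
    max_features=10000,     # mirrors WordsToKeep = 10 000
)

tweets = ["ARY corruption ke khelaf awaz otane ka shukeria",
          "Lanat ho aise jhoti media p"]
X = vectorizer.fit_transform(tweets)  # sparse TF-IDF matrix: tweets x vocabulary
print(X.shape, len(vectorizer.vocabulary_))
```
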
TABLE I
EXAMPLE OF POLARITY ASSIGNMENT TO EXTRACTED URDU TWEETS

Urdu Tweet | English Translated Tweet | Pre-Processed English Tweet | Obtained Polarity | Polarity Label
دنیا کے مختلف حصوں میں کرونا کی صورتحال بہتر ہونے لگی | Corona's situation began to improve in different parts of the world | Corona's situation began to improve in different parts world | 0.6 | Positive
کرونا نے دنیا کی ساڑھے تین سو سال کی ترقی کو لپیٹ کر رکھ دیا | Corona wrapped up the world's three-and-a-half-year development | Corona wrapped world three half-year development | −0.166666667 | Negative
یا اللہ اس کرونا کی وبا کو پاکستان سے ختم کر دے آمین | Oh Allah eradicate this epidemic from Pakistan. Amen | Allah eradicate epidemic Pakistan Amen | 0 | Neutral

TABLE II
EXAMPLE OF POLARITY ASSIGNMENT TO EXTRACTED ROMAN URDU TWEETS

Roman Urdu Tweet | English Translated Tweet | Pre-Processed English Tweet | Obtained Polarity | Polarity Label
ARY corruption ke khelaf awaz otane ka shukeria | ARY, thank you for speaking out against corruption | ARY thank you for speaking against corruption | 0.7 | Positive
Lanat ho aise jhoti media p | Damn such false media | Damn such false media | −0.4 | Negative
Corrupt logon sy sawal tu hoga | Corrupt people will be questioned | Corrupt people questioned | 0 | Neutral

TABLE III
TESTING AND TRAINING DATASET OF URDU (NASTALEEQ) AND ROMAN URDU

 | Positive Tweets | Negative Tweets | Neutral Tweets | Total | Training Dataset | Testing Dataset
Urdu (Nastaleeq) | 586 | 379 | 949 | 1914 | 1414 | 500
Roman Urdu | 378 | 300 | 449 | 1127 | 789 | 338

TABLE IV
NUMBER OF ATTRIBUTES OBTAINED AFTER PRE-PROCESSING IN WEKA TESTING AND TRAINING DATASETS

 | Training Dataset: Number of Instances | Training Dataset: Number of Attributes | Testing Dataset: Number of Instances | Testing Dataset: Number of Attributes
Urdu (Nastaleeq) | 1414 | 3184 | 500 | 2158
Roman Urdu | 789 | 3872 | 338 | 2193

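The 70:30 split and the .arff preparation described in Section III-C can be scripted in a few lines. The sketch below is an illustration, not the authors' code; it assumes the labelled CSV produced earlier and simply writes the two-attribute header quoted above (@relation Urdu_SA, a string attribute and a nominal class) before the data rows.

```python
import csv
import random

def write_arff(path, rows):
    # Header as given in the paper: a string attribute plus a nominal class.
    with open(path, "w", encoding="utf-8") as f:
        f.write("@relation Urdu_SA\n\n")
        f.write("@attribute Tweet_Urdu string\n")
        f.write("@attribute Class {Positive,Negative,Neutral}\n\n")
        f.write("@data\n")
        for text, label in rows:
            escaped = text.replace("'", r"\'")
            f.write(f"'{escaped}',{label}\n")

with open("urdu_tweets_labelled.csv", encoding="utf-8") as f:
    data = [(r["tweet_urdu"], r["label"]) for r in csv.DictReader(f)]

random.seed(42)
random.shuffle(data)
split = int(0.7 * len(data))  # 70:30 split, e.g., 1414/500 for the Urdu dataset
write_arff("urdu_train.arff", data[:split])
write_arff("urdu_test.arff", data[split:])
```
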

After the training and testing datasets were prepared, the next step was the selection of a machine learning algorithm to be used for the sentiment analysis. Currently, the following are the two approaches that exist for machine learning:
• Supervised Machine Learning: Supervised learning trains the machine with the help of a well-labelled training dataset. The training dataset already has the instances tagged with the correct answers. Once the model is built, a testing set is supplied with unlabelled instances, and the trained model then produces the predicted outcome for the testing dataset. Supervised learning can further be divided into classification, in which the output label is a class (such as rainy day/sunny day or patient/non-patient), and regression, in which the output label is a real numeric value, such as price, weight, etc. [41].
• Unsupervised Machine Learning is based on the logic of building the model with training data that do not have any labelled class. It works by making groups of similar data, analysing the patterns, deriving rules, etc. Unsupervised learning is further classified into clustering and association [41].

Thus, in this study a supervised machine learning approach was adopted. Similarly, for this purpose classification algorithms were needed. WEKA provides a number of classification algorithms. All of these algorithms were first trained by using the pre-processed training dataset. After that, the trained models were re-evaluated by supplying the test dataset.

Evaluation of a machine learning model is as important as building it on the desired dataset. Evaluation metrics identify the algorithm or model that is most suitable and robust for providing the solution to a given problem. Similarly, in this study, four parameters have been selected to find the algorithm closest to the best for sentiment analysis of Urdu (Nastaleeq) and Roman Urdu.

• Accuracy defines how effective a model is on the provided dataset. The ratio of correct predictions made to the total number of predictions is termed accuracy. The higher the accuracy of the classifier, the better the model is [42]:

Accuracy = Number of correct predictions / Total number of predictions made.

In other words, it can also be said that the ratio of true positives and true negatives to the total number of assessments is called accuracy. It is the proportion of true results in a sample:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where
TP = True Positive (the actual class is positive and so is the predicted class);
TN = True Negative (the actual class is negative and so is the predicted class);
FP = False Positive (the actual class is negative, but the predicted class is positive);
FN = False Negative (the actual class is positive, but the predicted class is negative).

• Precision is measured by dividing the number of true positive predictions that were also correct in the actual result by the total number of predicted positives (true positives plus false positives). Precision has a range between 0 and 1, where 0 indicates low precision and 1 indicates high precision [43]:

Precision = TP / (TP + FP).

• Recall is defined as the ratio of correctly identified positive values to the total number of actual positive values. Like precision, it also has a range between 0 and 1, where 1 is the highest value and 0 is the lowest. The closer the recall value gets to 1, the better the results the model is considered to yield [43]:

Recall = TP / (TP + FN).

• F-Measure is a combined measure of precision and recall, as it maintains the balance between both. It also takes into account both FN and FP, as they should have minimum cost in the model to gain high accuracy [43]:

F-Measure = 2 × (Precision × Recall) / (Precision + Recall).

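The four metrics above follow directly from the confusion-matrix counts. A minimal sketch (not tied to WEKA's implementation) that computes them for one class treated as "positive" is shown below; the example counts are hypothetical.

```python
def evaluate(tp, tn, fp, fn):
    """Compute the four evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_measure

# Hypothetical counts for one class of a test set.
print(evaluate(tp=40, tn=45, fp=5, fn=10))
```
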
IV. RESULTS

This section of the research gives a detailed view of the experiments carried out for the sentiment analysis of the Urdu language (Nastaleeq) and Roman Urdu. The results obtained from the different classifiers on the dataset extracted from Twitter are discussed.

A. Urdu (Nastaleeq)

All the classifiers existing in WEKA, as mentioned previously, were applied to the training dataset of Urdu (Nastaleeq) tweets to perform the sentiment analysis. The total number of classifiers is 31, and each of them was applied to the dataset. The execution of each classifier on the dataset extracted from Twitter played a significant role in identifying the most appropriate machine learning algorithm for sentiment analysis, specifically in the Urdu language. During the experiment, it was observed that results were obtained for 17 classifiers out of 31, while 14 classifiers were unable to give any results using WEKA.

The total number of instances for the training dataset was 1414, while for the testing dataset it was 500. Once the classification model was built on the training data, the test data were provided to further predict the sentiment of the Urdu tweets. The most suitable algorithm found in this experiment was SMO (Sequential Minimal Optimization). This algorithm was proposed in 1998 by John Platt. SMO is used for training a Support Vector Machine (SVM). The core purpose of SVM is to create boundaries between the classes involved in the training and testing datasets, thereby considering the error risk involved. To achieve this goal, SVM requires the solution of large quadratic programming (QP) problems, which in turn requires a large amount of time [44]. SMO supports SVM by dividing the large QP problem into a series of smaller QP sub-problems and then solving them analytically. This ability of SMO also reduces the time needed for training the SVM. Also, SMO can handle very large datasets easily, as the amount of memory it requires is linear in the provided training data [45]. SMO had the highest accuracy, 83.6 %, among the classifiers: 418 instances out of 500 were correctly classified by SMO, while only 82 instances were classified incorrectly. Based on the classification performed by SMO, it also satisfied the passing criteria of the evaluation metrics, as it had a precision value of 0.837, a recall of 0.836, and an F-measure of 0.833. The results obtained by all other classifiers are shown in Table V.

TABLE V
RESULTS OF MACHINE LEARNING ALGORITHMS ON THE DATASET OF URDU (NASTALEEQ) TWEETS FOR SENTIMENT ANALYSIS

Classifier in WEKA | Total Test Instances | Correctly Classified Instances | Incorrectly Classified Instances | Total Accuracy % | Precision | Recall | F-Measure
Random Forest | 500 | 413 | 87 | 82.6 | 0.849 | 0.826 | 0.820
Random Tree | 500 | 386 | 114 | 77.2 | 0.770 | 0.772 | 0.769
REP Tree | 500 | 283 | 217 | 56.6 | 0.555 | 0.566 | 0.521
LMT | 500 | N/A | N/A | N/A | N/A | N/A | N/A
J48 | 500 | 359 | 141 | 71.8 | 0.725 | 0.718 | 0.707
Hoeffding Tree | 500 | 389 | 111 | 77.8 | 0.779 | 0.778 | 0.774
Decision Stump | 500 | N/A | N/A | N/A | N/A | N/A | N/A
BayesNet | 500 | 392 | 108 | 78.4 | 0.798 | 0.784 | 0.775
NaiveBayes | 500 | N/A | N/A | N/A | N/A | N/A | N/A
NaiveBayesMultinomialText | 500 | N/A | N/A | N/A | N/A | N/A | N/A
NaiveBayesUpdateable | 500 | N/A | N/A | N/A | N/A | N/A | N/A
IBK | 500 | 393 | 107 | 78.6 | 0.793 | 0.786 | 0.780
K-Star | 500 | N/A | N/A | N/A | N/A | N/A | N/A
LWL | 500 | N/A | N/A | N/A | N/A | N/A | N/A
Logistic | 500 | N/A | N/A | N/A | N/A | N/A | N/A
MultilayerPerceptron | 500 | N/A | N/A | N/A | N/A | N/A | N/A
SimpleLogisticRegression | 500 | 408 | 92 | 81.6 | 0.819 | 0.816 | 0.814
SMO | 500 | 418 | 82 | 83.6 | 0.837 | 0.836 | 0.833
AdaBoostM1 | 500 | 263 | 237 | 52.6 | ? | 0.526 | ?
Bagging | 500 | 341 | 159 | 68.2 | 0.694 | 0.682 | 0.664
AttributeSelectedClassifier | 500 | N/A | N/A | N/A | N/A | N/A | N/A
ClassificationViaRegression | 500 | 259 | 241 | 51.8 | ? | 0.518 | ?
CVParameterSelection | 500 | 259 | 241 | 51.8 | ? | 0.518 | ?
FilteredClassifier | 500 | 359 | 141 | 71.8 | 0.725 | 0.718 | 0.707
IterativeClassifierOptimizer | 500 | 285 | 215 | 57.0 | 0.562 | 0.570 | 0.527
LogitBoost | 500 | 285 | 215 | 57.0 | 0.562 | 0.570 | 0.527
MultiScheme | 500 | 259 | 241 | 51.8 | ? | 0.518 | ?
MultiClassClassifierUpdateable | 500 | N/A | N/A | N/A | N/A | N/A | N/A
RandomCommittee | 500 | N/A | N/A | N/A | N/A | N/A | N/A
RandomizableFilteredClassifier | 500 | N/A | N/A | N/A | N/A | N/A | N/A
RandomSubSpace | 500 | N/A | N/A | N/A | N/A | N/A | N/A

It is evident from the graphical representation of the accuracy in Fig. 2 that the SMO algorithm had the highest accuracy compared to the other classifiers used in this study on the Urdu tweet dataset. It was also able to classify the maximum number of instances correctly out of 500, while three classifiers, i.e., ClassificationViaRegression, CVParameterSelection, and MultiScheme, classified the least number of instances correctly (259) and had the highest number of incorrectly classified instances (241). SMO also satisfied the evaluation parameters: the results obtained for precision, recall, and F-measure were near to the best for the supplied dataset.

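The study ran these classifiers through the WEKA Explorer GUI. An equivalent run can also be launched from a script; the sketch below is only an illustration of that option, assuming weka.jar is available locally and the .arff files prepared earlier are used as the -t (training) and -T (test) inputs.

```python
import subprocess

# Train SMO on the training split and evaluate it on the held-out test split.
# weka.classifiers.functions.SMO is WEKA's sequential minimal optimization classifier.
cmd = [
    "java", "-cp", "weka.jar",
    "weka.classifiers.functions.SMO",
    "-t", "urdu_train.arff",   # training set
    "-T", "urdu_test.arff",    # test set
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)  # includes accuracy, per-class precision/recall and the confusion matrix
```
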
Fig. 2. Accuracy obtained for each classifier for Urdu tweets.

B. Roman Urdu

Considering the results of the sentiment analysis for Roman Urdu, training models were built with the 31 different classifiers available in WEKA. The training dataset comprised 789 tweets with a labelled sentiment class, i.e., Positive, Negative, or Neutral, while the testing dataset had 338 tweets. However, out of these 31 classifiers, training models were successfully executed for 27 classifiers only. The remaining four classification algorithms were K-Star, LWL, MultilayerPerceptron, and AttributeSelectedClassifier. These algorithms took a much longer time in WEKA, up to 60 minutes, but failed to execute successfully on the given dataset, and hence no training model was obtained. During the experiment, it was also observed that precision and F-measure were not calculated by WEKA for the classifiers which had an accuracy of less than 40 %. Such classifiers are Decision Stump, NaiveBayesMultinomialText, AdaBoostM1, ClassificationViaRegression, CVParameterSelection, and MultiScheme. Among all the applied classifiers, Random Forest was found to be the best for Roman Urdu. Random forest (RF) was proposed by Leo Breiman and Adèle Cutler in 2006. It is based on the technique of ensemble learning. Before random forest, there were only two widely used methods that worked on the logic of ensemble learning, i.e., boosting and bagging [46]. During the training phase, RF builds several decision trees, also known as a forest of decision trees. Thus, the greater the forest, the higher the accuracy gained by the classifier. When it comes to predicting the data in the test dataset, it works on the mechanism of voting, i.e., each of the created decision trees gives a class prediction vote for the test instance, and random forest assigns the class or category that has most of the votes [47]. It achieved an accuracy of 89.0533 %. Also, it was able to classify 301 instances correctly out of 338 in the test dataset. The precision was 0.895, the recall was 0.891, and the F-measure was 0.891. Table VI is the tabular representation of the results obtained for the Roman Urdu sentiment analysis by WEKA.

TABLE VI
RESULTS OF MACHINE LEARNING ALGORITHMS ON THE DATASET OF ROMAN URDU TWEETS FOR SENTIMENT ANALYSIS

Classifier in WEKA | Total Test Instances | Correctly Classified Instances | Incorrectly Classified Instances | Total Accuracy % | Precision | Recall | F-Measure
Random Forest | 338 | 301 | 37 | 89.0533 | 0.895 | 0.891 | 0.891
Random Tree | 338 | 290 | 48 | 85.7988 | 0.860 | 0.858 | 0.859
REP Tree | 338 | 176 | 162 | 52.0710 | 0.531 | 0.521 | 0.515
LMT | 338 | 292 | 46 | 86.3905 | 0.864 | 0.864 | 0.864
J48 | 338 | 229 | 109 | 67.7515 | 0.692 | 0.678 | 0.675
Hoeffding Tree | 338 | 261 | 77 | 77.2189 | 0.808 | 0.772 | 0.771
Decision Stump | 338 | 126 | 212 | 37.2781 | ? | 0.373 | ?
BayesNet | 338 | 269 | 69 | 79.5858 | 0.823 | 0.796 | 0.796
NaiveBayes | 338 | 261 | 77 | 77.2189 | 0.808 | 0.772 | 0.771
NaiveBayesMultinomialText | 338 | 118 | 220 | 34.9112 | ? | 0.349 | ?
NaiveBayesUpdateable | 338 | 261 | 77 | 77.2189 | 0.808 | 0.772 | 0.771
IBK | 338 | 295 | 43 | 87.2781 | 0.882 | 0.873 | 0.874
K-Star | 338 | N/A | N/A | N/A | N/A | N/A | N/A
LWL | 338 | N/A | N/A | N/A | N/A | N/A | N/A
Logistic | 338 | 290 | 48 | 85.7988 | 0.861 | 0.858 | 0.858
MultilayerPerceptron | 338 | N/A | N/A | N/A | N/A | N/A | N/A
SimpleLogisticRegression | 338 | 273 | 65 | 80.7692 | 0.811 | 0.808 | 0.808
SMO | 338 | 297 | 41 | 87.8698 | 0.879 | 0.879 | 0.879
AdaBoostM1 | 338 | 126 | 212 | 37.2781 | ? | 0.373 | ?
Bagging | 338 | 238 | 100 | 70.4142 | 0.715 | 0.704 | 0.702
AttributeSelectedClassifier | 338 | N/A | N/A | N/A | N/A | N/A | N/A
ClassificationViaRegression | 338 | 118 | 220 | 34.9112 | ? | 0.349 | ?
CVParameterSelection | 338 | 118 | 220 | 34.9112 | ? | 0.349 | ?
FilteredClassifier | 338 | 229 | 109 | 67.7515 | 0.692 | 0.678 | 0.675
IterativeClassifierOptimizer | 338 | 138 | 200 | 40.8284 | 0.536 | 0.408 | 0.321
LogitBoost | 338 | 171 | 167 | 50.5917 | 0.542 | 0.506 | 0.487
MultiScheme | 338 | 118 | 220 | 34.9112 | ? | 0.349 | ?
MultiClassClassifierUpdateable | 338 | 298 | 40 | 88.1657 | 0.885 | 0.882 | 0.823
RandomCommittee | 338 | 290 | 48 | 85.7988 | 0.861 | 0.858 | 0.858
RandomizableFilteredClassifier | 338 | 288 | 50 | 85.2701 | 0.854 | 0.852 | 0.852
RandomSubSpace | 338 | 215 | 123 | 63.6095 | 0.695 | 0.636 | 0.626

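The majority-voting behaviour of random forest described before Table VI can be made concrete with a few lines of scikit-learn. This is only an illustration of the mechanism on toy data, not a reproduction of the WEKA experiment; the feature matrix and labels are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy TF-IDF-like feature matrix and sentiment labels (hypothetical values).
X = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.8], [0.4, 0.5], [0.5, 0.4]])
y = np.array(["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"])

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

test_point = np.array([[0.7, 0.2]])
# Each fitted tree casts one "vote"; map its encoded output back to the class name.
votes = [forest.classes_[int(tree.predict(test_point)[0])] for tree in forest.estimators_]
print("individual tree votes:", votes)
# scikit-learn averages the trees' class probabilities, which here plays the role
# of the majority vote described in the text.
print("forest prediction:", forest.predict(test_point)[0])
```
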

Two algorithms, i.e., Random Forest and MultiClassClassifierUpdateable, achieved the highest accuracy, with results very close to each other, but Random Forest ranked first among all classifiers with an accuracy of 89.0533 %, slightly higher than MultiClassClassifierUpdateable at 88.1657 %. As can be seen in Fig. 3, four algorithms out of 31 produced the worst results in terms of accuracy, which was 34.9112 %. These classifiers are NaiveBayesMultinomialText, ClassificationViaRegression, CVParameterSelection, and MultiScheme.

Fig. 3. Accuracy obtained for each classifier for Roman Urdu tweets.

Random Forest yields the best results by classifying the highest number of tweets correctly in the testing dataset compared to the other classifiers.

V. DISCUSSION

Sentiment analysis has been performed in many languages to extract valuable information about products, brands, politics, etc. Such languages are very rich in sentiment analysis resources; the most important resource, in this case, is the availability of lexicons, while other resources include stop word lists and techniques for tokenization and stemming. Unlike these languages, Urdu has poor resources for performing sentiment analysis: it lacks lexicons, and the stemming of Urdu words is more complicated. The most widely used Python packages for sentiment analysis (TextBlob and NLTK) work for 21 different languages, but they do not provide support for Urdu. This is because Urdu (Nastaleeq script) is very rich in morphology, and its grammar differs from English and other languages. Moreover, stop word removal is a very important task in either approach to sentiment analysis, lexicon-based or machine learning; however, there is no authentic resource of compiled Urdu stop words available.

VI. CONCLUSION

This study aimed to perform the sentiment analysis of the national language of Pakistan (Urdu). Urdu is found on all platforms of social media, and a wide range of its users use their national language (Urdu script and Roman Urdu) to express their thoughts on different topics and issues going around the globe. Therefore, in this study, Twitter was chosen as the medium for data extraction in the form of Urdu (Nastaleeq and Roman Urdu) tweets, and the data were extracted in different time intervals using the Tweepy API. Urdu is a resource-deprived language when it comes to sentiment analysis compared to other, richer languages, on which a wide variety of research has already been done for sentiment extraction using the lexicon-based approach. In particular, Roman Urdu has many irregularities in its spelling and allows for free-style writing. Thus, the Urdu language (Nastaleeq and Roman Urdu) lacks lexicons and the associated components such as stemming, tokenization, and lemmatization; the creation of such lexicons, which might yield better results than the currently available approaches, can be another step in future research on sentiment analysis for Urdu. Because of this lack of resources, a machine learning approach was adopted to perform the sentiment analysis in this study. Once the data were successfully extracted and pre-processed, each tweet was assigned a polarity and 31 different classifiers were applied to the obtained dataset. The best one was chosen for Urdu and for Roman Urdu based on the evaluation metrics: SMO was found best for the sentiment analysis on Urdu (Nastaleeq), while Random Forest ranked number one for Roman Urdu. In addition, a stop word list was compiled for Urdu script from three different sources, while stop words for Roman Urdu were extracted manually from the dataset; the training and subsequent testing of the classifiers took place in the WEKA tool. These stop word lists can play a significant role in future research on the Urdu language. Moreover, in this study, some of the 31 classifiers were unable to give any results in the WEKA tool on the provided dataset. Thus, in the future, a different approach and different preprocessing techniques can be adopted on the dataset so that valuable results can be obtained from the leftover machine learning algorithms as well.
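As a final illustration of where the compiled stop word lists mentioned above fit into pre-processing, the sketch below filters tokenised Urdu and Roman Urdu text against such a list. The entries shown are assumed sample words only; they are not the lists compiled in this study.

# Illustrative stop word filtering for tokenised Urdu / Roman Urdu text.
# The entries below are assumed samples, not the compiled lists of this study.
urdu_stopwords = {"کا", "کی", "اور", "ہے"}          # assumed Urdu-script samples
roman_urdu_stopwords = {"ka", "ki", "aur", "hai"}   # assumed Roman Urdu samples
stopwords = urdu_stopwords | roman_urdu_stopwords

def remove_stopwords(tokens):
    """Drop any token that appears in the combined stop word set."""
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords(["match", "aur", "team", "ki", "jeet"]))
# Expected output: ['match', 'team', 'jeet']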

REFERENCES

[1] J. Serrano-Guerrero, J. A. Olivas, F. P. Romero, and E. Herrera-Viedma, “Sentiment analysis: A review and comparative analysis of web,” Information Sciences, vol. 311, pp. 18–38, Aug. 2015. https://doi.org/10.1016/j.ins.2015.03.040
[2] L. Zhang, S. Wang, and B. Liu, “Deep learning for sentiment analysis: A survey,” WIREs Data Mining and Knowledge Discovery, vol. 8, no. 4, July 2018. https://doi.org/10.1002/widm.1253
[3] M. Giatsoglou, M. G. Vozalis, K. Diamantaras, A. Vakali, G. Sarigiannidis, and K. C. Chatzisavvas, “Sentiment analysis leveraging emotions and word embeddings,” Expert Systems with Applications, vol. 69, pp. 214–224, Mar. 2017. https://doi.org/10.1016/j.eswa.2016.10.043
[4] K. K. Mohbey, B. Bakariya, and V. Kalal, “A study and comparison of sentiment analysis techniques using demonetization: Case study,” in Sentiment Analysis and Knowledge Discovery in Contemporary Business, 2018, pp. 1–14. https://doi.org/10.4018/978-1-5225-4999-4.ch001
[5] C. S. Khoo and S. B. Johnkhan, “Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons,” Journal of Information Science, vol. 44, no. 4, pp. 491–511, 19 Apr. 2017. https://doi.org/10.1177/0165551517703514
[6] N. Boudad, R. Faizi, R. O. Haj Thami, and R. Chiheb, “Sentiment analysis in Arabic: A review of the literature,” Ain Shams Engineering Journal, vol. 9, no. 4, pp. 2479–2490, Dec. 2018. https://doi.org/10.1016/j.asej.2017.04.007
[7] S. Tartir and I. A. Nabi, “Semantic sentiment analysis in Arabic social media,” Journal of King Saud University – Computer and Information Sciences, vol. 29, no. 2, pp. 229–223, Apr. 2017. https://doi.org/10.1016/j.jksuci.2016.11.011
[8] A. K. Rathore, V. Ilavarasan, and Y. K. Dwivedi, “Social media content and product co-creation: An emerging paradigm,” Journal of Enterprise Information Management, vol. 29, no. 1, pp. 7–18, Feb. 2016. https://doi.org/10.1108/JEIM-06-2015-0047
[9] J. L. Sheela, “A review of sentiment analysis in Twitter data using Hadoop,” International Journal of Database Theory and Application, vol. 9, no. 1, pp. 77–86, 2016. https://doi.org/10.14257/ijdta.2016.9.1.07
[10] S. A. Salloum, M. Al-Emran, A. A. Monem, and K. Shaalan, “A survey of text mining in social media: Facebook and Twitter perspectives,” Advances in Science, Technology and Engineering Systems, vol. 2, no. 1, pp. 127–133, 2017. https://doi.org/10.25046/aj020115
[11] “Twitter launches,” A&E Television Networks, 14 July 2020. [Online]. Available: https://www.history.com/this-day-in-history/twitter-launches. Accessed on: Aug. 2020.
[12] “Number of monetizable daily active Twitter users (mDAU) worldwide from 1st quarter 2017 to 2nd quarter 2020,” 23 July 2020. [Online]. Available: https://www.statista.com/statistics/970920/monetizable-daily-active-twitter-users-worldwide/. Accessed on: Aug. 2020.
[13] Y. Lin, “10 Twitter statistics every marketer should know in 2022 [infographic],” 30 July 2019. [Online]. Available: https://www.oberlo.com/blog/twitter-statistics. Accessed on: Oct. 2019.
[14] D. Hattem and L. Lomicka, “What the Tweets say: A critical analysis of Twitter research in language learning from 2009 to 2016,” E-Learning and Digital Media, vol. 13, pp. 5–23, Oct. 2019. https://doi.org/10.1177/2042753016672350
[15] Twitter Inc., “Twitter for websites-supported languages,” 2019. [Online]. Available: https://developer.twitter.com/en/docs/twitter-for-websites/twitter-for-websites-supported-languages/overview. Accessed on: 2019.
[16] H. B. Zaya, A. A. Raza, and A. Ather, “Urdu word segmentation using conditional random fields (CRFs),” in Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico: Association for Computational Linguistics, 2018, pp. 2562–2569.
[17] V. S. Pagolu, K. N. R. Challa, and G. Panda, “Sentiment analysis of Twitter data for predicting stock market movements,” in International Conference on Signal Processing, Communication, Power and Embedded System, Paralakhemundi, India, Oct. 2016, pp. 1345–1350. https://doi.org/10.1109/SCOPES.2016.7955659
[18] R. P. Schumaker, A. T. Jarmoszko, and J. L. S. Chester, “Predicting wins and spread in the Premier League using a sentiment analysis of twitter,” Decision Support Systems, vol. 88, pp. 76–84, Aug. 2016. https://doi.org/10.1016/j.dss.2016.05.010
[19] D. Pope and J. Griffith, “An analysis of online Twitter sentiment surrounding the European,” in 8th International Conference on Knowledge Discovery and Information Retrieval, Porto, Portugal, 2016, pp. 299–306. https://doi.org/10.5220/0006051902990306
[20] A. C. Pandey, D. S. Rajpoot, and M. Saraswat, “Twitter sentiment analysis using hybrid cuckoo search method,” Information Processing & Management, vol. 53, no. 4, pp. 764–779, July 2017. https://doi.org/10.1016/j.ipm.2017.02.004
[21] H. K. Aldayel and A. M. Azmi, “Arabic tweets sentiment analysis – a hybrid scheme,” Journal of Information Science, vol. 42, no. 6, pp. 782–797, Oct. 2016. https://doi.org/10.1177/0165551515610513
[22] A. M. Alayba, V. Palade, M. England, and R. Iqbal, “Arabic language sentiment analysis on health services,” in 2017 1st International Workshop on Arabic Script Analysis and Recognition (ASAR), Nancy, France, Apr. 2017, pp. 114–118. https://doi.org/10.1109/ASAR.2017.8067771
[23] M. Heikal, M. Torki, and N. El-Makky, “Sentiment analysis of Arabic Tweets using deep learning,” Procedia Computer Science, vol. 142, pp. 114–122, 2018. https://doi.org/10.1016/j.procs.2018.10.466
[24] A. Hassan, S. Moin, A. Karim, and S. Shamshirband, “Machine learning-based sentiment analysis for Twitter accounts,” Mathematical and Computational Applications, vol. 23, no. 1, Feb. 2018. https://doi.org/10.3390/mca23010011
[25] I. Javed, H. Afzal, A. Majeed, and B. Khan, “Towards creation of linguistic resources for bilingual sentiment analysis of Twitter data,” in International Conference on Applications of Natural Language to Data Bases/Information Systems, Jun. 2018. https://doi.org/10.1007/978-3-319-07983-7_32
[26] S. Ahmed, S. Hina, and R. Asif, “Detection of sentiment polarity of unstructured multi-language text from social media,” International Journal of Advanced Computer Science and Applications, vol. 9, no. 7, pp. 199–203, 2019. https://doi.org/10.14569/IJACSA.2018.090728
[27] T. R. Soomro and S. M. Ghulam, “Current status of Urdu on Twitter,” Sukkur IBA Journal of Computing and Mathematical Sciences, vol. 3, no. 1, pp. 28–33, 2019. https://doi.org/10.30537/sjcms.v3i1.397
[28] F. Noor, M. Bakhtyar, and J. Baber, “Sentiment analysis in E-commerce using SVM on Roman Urdu text,” in International Conference for Emerging Technologies in Computing, Jul. 2019. https://doi.org/10.1007/978-3-030-23943-5_16
[29] H. Ghulam, F. Zeng, W. Li, and Y. Xiao, “Deep learning-based sentiment analysis for Roman Urdu text,” in 2018 International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2018, vol. 147, 2018, pp. 131–135. https://www.sciencedirect.com/journal/procedia-computer-science/vol/147/suppl/C
[30] Z. Mehmood et al., “Deep sentiments in Roman Urdu text using recurrent convolutional neural network model,” Information Processing and Management, vol. 57, no. 4, Feb. 2020, Art no. 102233. https://doi.org/10.1016/j.ipm.2020.102233
[31] V. Bonta, N. Kumaresh, and J. N, “A comprehensive study on lexicon based approaches for sentiment analysis,” Asian Journal of Computer Science and Technology, vol. 8, no. S2, pp. 1–6, Mar. 2019. https://doi.org/10.51983/ajcst-2019.8.S2.2037
[32] S. Sarica and J. Luo, “Stopwords in technical language processing,” PLoS ONE, vol. 16, no. 8, Aug. 2021, Art no. e0254937. https://doi.org/10.1371/journal.pone.0254937
[33] K. S. Dar, A. B. Shafat, and H. U. Muhammad, “An efficient stop word elimination algorithm for Urdu language,” in 2017 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), Phuket, Thailand, Jun. 2017. https://doi.org/10.1109/ECTICon.2017.8096386
[34] M. Usman, S. Ayub, Z. Shafique, and K. Malik, “Urdu text classification using majority voting,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 8, pp. 265–273, 2016. https://doi.org/10.14569/IJACSA.2016.070836
[35] K. Riaz and D. Becker, “Stopword identification in an Urdu corpus”.
[36] A. Burney, B. Sami, N. Mahmood, Z. Abbas, and K. Rizwan, “Urdu text summarizer using sentence weight algorithm for word processors,” International Journal of Computer Applications, vol. 46, no. 19, pp. 38–43, May 2012.
[37] E. D. P. Kaur and E. P. Singh, “A comparative research of rule based classification on dataset using WEKA TOOL,” International Research Journal of Engineering and Technology (IRJET), vol. 6, no. 9, Sep. 2019. https://www.irjet.net/archives/V6/i9/IRJET-V6I9345.pdf
[38] R. Ahuja, A. Chug, S. Kohli, S. Gupta, and P. Ahuja, “The impact of features extraction on the sentiment analysis,” in International Conference on Pervasive Computing Advances and Applications, vol. 152, 2019, pp. 341–348. https://www.sciencedirect.com/journal/procedia-computer-science/vol/152/suppl/C
[39] B. Stecanella, “What is TF-IDF?” May 2019. [Online]. Available: https://monkeylearn.com/blog/what-is-tf-idf/. Accessed on: July 2020.
[40] S. Gnanambal, M. Thangaraj, V. T. Meenatchi, and V. Gayathri, “Classification algorithms with attribute selection: an evaluation study using WEKA,” International Journal of Advanced Networking and Applications, vol. 9, no. 6, pp. 3640–3644, May 2018.
[41] M. Desai and M. A. Mehta, “Techniques for sentiment analysis of Twitter data: A comprehensive survey,” in International Conference on Computing, Communication and Automation, Greater Noida, India, Apr. 2016, pp. 149–154. https://doi.org/10.1109/CCAA.2016.7813707
[42] S. Yıldırım, “How to best evaluate a classification model,” 17 March 2020. [Online]. Available: https://towardsdatascience.com/how-to-best-evaluate-a-classification-model-2edb12bcc587.
[43] P. Subedi, “Machine learning – The different ways to evaluate your classification models and choose the best one!” 18 August 2020. [Online]. Available: https://medium.com/kharpann/machine-learning-the-different-ways-to-evaluate-your-classification-models-and-choose-the-best-1281542432c. Accessed on: July 2020.
[44] M. Ghosh and G. Sanyal, “An ensemble approach to stabilize the features for multi-domain sentiment analysis using supervised machine learning,” Journal of Big Data, vol. 5, Nov. 2018, Art no. 44. https://doi.org/10.1186/s40537-018-0152-5
[45] V. Chaurasia and S. Pal, “A novel approach for breast cancer detection using data mining techniques,” International Journal of Innovative Research in Computer and Communication Engineering (An ISO 3297: 2007 Certified Organization), vol. 2, no. 1, pp. 2456–2465, Jul. 2017. https://www.researchgate.net/publication/259979477_A_Novel_Approach_for_Breast_Cancer_Detection_using_Data_Mining_Techniques
[46] Y. A. Amrani, M. Lazaar, and K. E. E. Kadiri, “Random forest and support vector machine based hybrid approach to sentiment analysis,” in The First International Conference on Intelligent Computing in Data Sciences, vol. 127, 2018, pp. 511–520. https://www.sciencedirect.com/journal/procedia-computer-science/vol/127/suppl/C
[47] M. A. Fauzi, “Random forest approach for sentiment analysis in Indonesian language,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 12, no. 1, pp. 46–50, Oct. 2018. https://doi.org/10.11591/ijeecs.v12.i1.pp46-50

Iffraah Rehman received B.Sc. and M.Sc. degrees in Computer Science from the Institute of Business Management (IoBM), Karachi, Sindh, Pakistan. She is currently serving as a Software Developer, ISD at State Bank of Pakistan. She is an experienced Research Database Officer with a demonstrated history of working in the facilities services industry, skilled in PHP, JavaScript, MySQL & SQL Server, Web Application, C# / ASP.NET, Oracle 11g Forms / Reports, SQL/PLSQL, WordPress and Microsoft Office. She is a strong information technology professional in Web and Database.
E-mail: std_23371@iobm.edu.pk

Tariq Rahim Soomro (Senior Member, IEEE) received the B.Sc. (Hons.) and M.Sc. degrees in Computer Science from the University of Sindh, Jamshoro, Sindh, Pakistan, and the PhD degree in Computer Applications from Zhejiang University, Hangzhou, Zhejiang, China. He is currently a Professor of Computer Science, the Dean of the College of Computer Science and Information Systems, and an Acting Rector with the Institute of Business Management (IoBM), Karachi, Sindh, Pakistan. He has more than 28 years of extensive and diverse experience as an administrator, computer programmer, researcher, and teacher. As an administrator, he served as a coordinator, the head of the department, the head of the faculty, the dean of the faculty, the head of the academic affairs, an acting rector, and has wide experience in accreditation related matters, including ABET, USA, NCEAC, HEC Pakistan, and the Ministry of Higher Education and Scientific Research, United Arab Emirates (UAE).
He is also an IEEE Computer Society Distinguished Visitor (2021–2023). He is a member of the Task Force on Arabic Script IDNs by the Middle East Strategy Working Group (MESWG) of ICANN. He has been a member of the Project Management Institute (PMI) since 2007, and ACM since 2019. He has been a Senior Member of the IEEE Computer Society since 2005, and the International Association of the Computer Science and Information Technology (IACSIT) since 2012. He has been a Life Member of the Computer Society of Pakistan (CSP) since 1999, and a Global Member of the Internet Society (ISOC), USA since 2006. He has been an Active Member of the IEEE Karachi Section (Region 10). He is also serving as the member of the Executive Committee (ExCom) (2017–2022), the IEEE R10 Southern Area Coordinator Computer Society (2020–2022), and the IEEE R10 Education Activity Committee (EAC) (2021). He received the ISOC Fellowship to the IETF for the 68th Internet Engineering Task Force (IETF) Meeting. He served as the Secretary for the Karachi Section (2018–2019), the Chair for the GOLD Affinity Group (2002), a member for the Executive Committee, and a branch councillor (2002 & 2016). He is currently serving as the Vice-Chair Karachi Section (2020–2023).
E-mail: tariq.soomro@iobm.edu.pk
ORCID iD: https://orcid.org/0000-0002-7119-0644
