BiSAL – A bilingual sentiment analysis lexicon to analyze Dark Web forums for
cyber security
Majed Alrubaian
King Saud University
Khalid Al-Rowaily (a), Muhammad Abulaish, SMIEEE (b,*), Nur Al-Hasan Haldar (c), Majed Al-Rubaian (a)
(a) College of Computer and Information Sciences, King Saud University, Riyadh, KSA
(b) Department of Computer Science, Jamia Millia Islamia (A Central University), New Delhi, India
(c) Center of Excellence in Information Assurance, King Saud University, Riyadh, KSA
In this paper, we present the development of a Bilingual Sentiment Analysis Lexicon (BiSAL) for the cyber security domain, which
consists of a Sentiment Lexicon for ENglish (SentiLEN) and a Sentiment Lexicon for ARabic (SentiLAR) that can be used to
develop opinion mining and sentiment analysis systems for bilingual textual data from dark web forums. For SentiLEN, a list
of 279 sentiment-bearing English words related to cyber threats, radicalism, and conflicts is identified, and a process is
devised to unify their sentiment scores obtained from four different sentiment data sets. For SentiLAR, sentiment-bearing
Arabic words are identified from a collection of 2000 message posts from the Alokab Web forum, which contains radical content.
SentiLAR provides a list of 1019 sentiment-bearing Arabic words related to cyber threats, radicalism, and conflicts, along with
their morphological variants and sentiment polarity. For polarity determination, a semi-automated analysis is performed by three Arabic
language experts, and their ratings are aggregated by averaging. A Web interface is developed
to access both the lexicons (SentiLEN and SentiLAR) of BiSAL data set online, and a beta version of the same is available at
Keywords: Sentiment analysis lexicon, Sentiment lexicon for English, Sentiment lexicon for Arabic, Cyber security, Dark Web
Table 4: A partial list of cyber crime related Arabic seed words and their frequency count identified from the message posts of Alokab forum

Words        Frequency
ﺑﻘﻮة        1113
واﻟﺘﻄﺮف        171
اﻟﻮﺣﺸﻲ        435
إرھﺎﺑﯿﺔ        445
وﻋﺬاب        208
اﻟﺤﻘﺪ        445
اﻟﻔﺎﺣﺶ        376
اﻟﻔﺎﺟﺮ        264
اﻏﺘﺼﺎب        423
اﻹﺟﺮاﻣﯿﺔ        172

Table 5: A sample list of root words and their morphological variants

Root word   Morphological variant   Affix
Act         Acts                    -s
Act         Action                  -ion
Act         Actions                 -ions
Act         Acting                  -ing
Bomb        Bombs                   -s
Bomb        Bombers                 -ers
Bomb        Bombing                 -ing
Bomb        Bombings                -ings
Terror      Terrorism               -ism
Terror      Terrorist               -ist
Terror      Terrorists              -ists
Rich        Enrich                  en-
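Frequency counts like those in Table 4 can be obtained by tokenizing the posts, discarding stop-words, and counting what remains. The following is a minimal Python sketch of that step, together with the sentence lookup an annotator would need to see a word in context. The whitespace tokenizer, the period-based sentence splitting, the sample posts, the stop-word set, and the function name build_index are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

def build_index(posts, stop_words):
    """Count non-stop-word tokens and remember which sentences contain them."""
    freq = Counter()
    contexts = defaultdict(list)
    for post in posts:
        # Assumption: sentences end with a period, tokens are whitespace-separated.
        for sentence in post.split("."):
            sentence = sentence.strip()
            if not sentence:
                continue
            tokens = [t for t in sentence.split() if t not in stop_words]
            freq.update(tokens)
            # Record each sentence once per distinct word it contains.
            for token in set(tokens):
                contexts[token].append(sentence)
    return freq, contexts

# Hypothetical English stand-ins for forum posts.
posts = ["the attack was planned. the attack failed", "a bomb was found"]
stop_words = {"the", "a", "was"}
freq, contexts = build_index(posts, stop_words)
ranked = freq.most_common()  # words in decreasing order of frequency
```

Here ranked gives the order in which words would be presented for annotation, and contexts[word] gives the sentences to display for a selected word.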
respectively. These parts cannot be further divided into any other meaningful units. Here, the morpheme attack is called a free morpheme as it can stand alone as a word, whereas the other morphemes are called bound morphemes as they can only occur as parts of a word.

In addition to being free or bound, morphemes can also be classified as inflectional or derivational. Derivational morphology involves prefixing as well as suffixing to form a new but morphologically related word, which often belongs to a different syntactic class. For example, the suffix “ation” converts the verb normalize into the noun normalization, whereas the suffix “ize” converts the noun crystal into the verb crystalize. Similarly, subgroup, inactivate, and deactivate are examples of derivational variants generated by prefixing the root words. Some derivational prefixes, like “non” and “un”, negate a word, as in nonscientific and unable. It should be noted that all prefixes used in English are derivational, whereas suffixes can be derivational as well as inflectional.

On the other hand, inflectional morphology changes the form of a root word while keeping its syntactic class unchanged; it indicates grammatical properties such as degree of comparison, tense, and number. For example, the root verb prove has the inflectional variants proves, proved, proving, and proven, all of which are also verbs.

A sample list of morphological variants identified while processing BiSAL is shown in Table 5. For English words, various databases are searched and all relevant morphological variants of each word are compiled and stored in SentiLEN, whereas the morphological variants of Arabic words are identified by three Arabic language experts independently and stored in SentiLAR.

2.3. Sentiment Polarity Determination
In this section, we present the proposed technique to determine the polarity of the sentiment-bearing words identified for BiSAL. For English words, we have used four publicly available sentiment corpora, AFFIN [13], SentiWordNet [10], General Inquirer [15], and SentiStrength [17], which assign polarity scores to sentiment-bearing words in different ranges. Since there is no such sentiment corpus for Arabic words, we have applied a semi-automated analysis by Arabic language experts to assign polarity scores in a systematic manner. Further details about polarity determination for English and Arabic words are provided in the following sub-subsections.

2.3.1. Sentiment Polarity Determination of English Words
As discussed earlier, SentiLEN (Sentiment Lexicon for English) consists of a list of 279 unique words related to the various categories of cyber crime. For polarity determination, these words are searched in four different sentiment lexicons, AFFIN [13], SentiWordNet [10], General Inquirer [15], and SentiStrength [17], and their scores are unified using an aggregate function. The AFFIN data set [13] consists of a total of 2477 words, and each word is assigned a polarity score between -5 and +5 based on its sentiment intensity. In case a word has no match in the AFFIN data set, its polarity score is assigned as 0.

Table 6: Discretization of General Inquirer sentiment annotations

General Inquirer annotation
Positive   Negative   Strong   Hostile   Score
-          Yes        Yes      Yes       -4
-          -          Yes      Yes       -4
-          -          -        Yes       -3
-          Yes        -        Yes       -3
-          Yes        Yes      -         -2
-          Yes        -        -         -1
Yes        -          -        -         +1
Yes        -          Yes      -         +2

Table 7: Exemplar SentiLEN words and their polarity scores from different sentiment data sets

Word     AFFIN Score   SentiWordNet Score   GI Score   SentiStrength Score
         [-4,+2]       [-0.88, +0.63]       [-4,+2]    [-4,+3]
Attack   -1            -0.25                -4         -3
Harm     -2            -0.42                -4         -3
Bomb     -1            -0.19                -4         -2
Secret   0             +0.13                -1         -1
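The rules of Table 6 amount to a lookup keyed on the four General Inquirer flags, with 0 as the fallback for any combination that is not listed. A minimal Python sketch is given below; the function name gi_score and the boolean encoding of the annotations are illustrative assumptions.

```python
# Keys are (positive, negative, strong, hostile) flags; values are Table 6 scores.
GI_RULES = {
    (False, True,  True,  True):  -4,
    (False, False, True,  True):  -4,
    (False, False, False, True):  -3,
    (False, True,  False, True):  -3,
    (False, True,  True,  False): -2,
    (False, True,  False, False): -1,
    (True,  False, False, False): +1,
    (True,  False, True,  False): +2,
}

def gi_score(positive=False, negative=False, strong=False, hostile=False):
    """Discretize General Inquirer annotations; unlisted combinations score 0."""
    return GI_RULES.get((positive, negative, strong, hostile), 0)
```

For instance, a word annotated as negative, strong, and hostile maps to -4, while a word annotated only as strong matches no row and maps to 0.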
The second sentiment data set used in our experiment is SentiWordNet [10], a lexical resource for sentiment analysis based on WordNet [12]. It assigns each synset of WordNet three sentiment scores: positive, negative, and objective. Each word is assigned a number between 0 and 1 along with its polarity (positive or negative). As with AFFIN, English words are searched in the SentiWordNet data set and the corresponding polarity score is assigned when there is a hit; otherwise, 0 is taken as the polarity score. On analysis, we found that some words have multiple meanings depending on the context, and each meaning has a different sentiment score. In such cases, the senses describing conflict activities are gathered together and their average sentiment score is assigned to the word. The web interface of SentiWordNet [10] is used to compare the multiple meanings (if any) of a particular word.

The third sentiment data set used in our experiment is General Inquirer [15], which does not have any numeric score for English words. However, based on the words' sentiment intensity, the General Inquirer data set categorizes them as positive, negative, strong, hostile, etc. It is observed that no word is annotated as positive if it is hostile in nature, whereas a hostile word is regarded as bearing the maximum negative sentiment. If a word has a match in the General Inquirer data set, its annotations are discretized in the range of [-4, +2] using the rules given in Table 6. A word that does not satisfy any combination listed in Table 6 is assigned a polarity score of 0. As in SentiWordNet, a word in the General Inquirer data set may appear in multiple forms having different meanings. In this case, only those word forms having some crime or conflict sense are considered. For example, the word shoot has three different entries in the General Inquirer data set, of which two are related to violence and terror, and the third has no violence sense as it is used as shoot up, which means “to grow or rise rapidly”. Moreover, a word may have several meanings that all have some violence sense. In this case, the score of greatest magnitude calculated using Table 6 is considered. For example, the word arm appears in the General Inquirer data set in three different forms (arm#1, arm#2, and arm#3). The first form (arm#1) describes a body part and has no sense related to terror or violence. However, the other two forms (arm#2 and arm#3) have a violence sense, and the latter (arm#3) expresses the violence more intensely than arm#2. Therefore, the score of arm#3, which is annotated as negative, strong, and hostile, is considered as the final score for arm.

The fourth sentiment data set used in our experiment is SentiStrength [17], in which words are available in stemmed form with sentiment scores in the range of [-5, +5]. Therefore, before searching a word in this data set, the word is stemmed using the Porter stemmer. In case of a hit, the word is assigned the corresponding sentiment score in the range of [-5, +5], whereas 0 is assigned in case of a miss.

In this way, each word of SentiLEN is assigned four independent polarity scores based on its matches in the sentiment data sets mentioned above. Table 7 shows the polarity scores of some exemplar words. After searching the sentiment data sets for all 279 words of SentiLEN, it is found that the scores obtained from the AFFIN, SentiWordNet, General Inquirer, and SentiStrength data sets lie in the ranges [-4, +2], [-0.875, +0.625], [-4, +2], and [-4, +3], respectively.

Therefore, the next task is to map the four different scores of each word to a unified scale by normalizing them to the range [-1, +1]. To this end, min-max normalization, given in Equation 1, is applied. In Equation 1, P_n(w_i, d_j) represents the new polarity value of word w_i in data set d_j, P_o(w_i, d_j) represents the original polarity value of w_i in d_j, min(d_j) and max(d_j) are the lowest and highest scores of a word in data set d_j, respectively, and newMin = -1 and newMax = +1 are the bounds of the target range. After normalization, the mean score of each word w_i, δ(w_i), is calculated using Equation 2, where m is the number of data sets. The normalized and mean scores of the exemplar words of Table 7 are shown in Table 8.

$$P_n(w_i, d_j) = \frac{P_o(w_i, d_j) - \min(d_j)}{\max(d_j) - \min(d_j)} \times (\mathit{newMax} - \mathit{newMin}) + \mathit{newMin} \tag{1}$$
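As a check on Equation 1, the following Python sketch normalizes a raw score to [-1, +1] using the observed score ranges reported above. The additive offset is taken to be the lower bound of the target range (newMin = -1), since that choice reproduces the normalized values listed in Table 8. The function name min_max and the RANGES dictionary are illustrative assumptions.

```python
# Observed score ranges of the four data sets over the SentiLEN words.
RANGES = {
    "AFFIN": (-4.0, 2.0),
    "SentiWordNet": (-0.875, 0.625),
    "GeneralInquirer": (-4.0, 2.0),
    "SentiStrength": (-4.0, 3.0),
}

def min_max(score, lo, hi, new_min=-1.0, new_max=1.0):
    """Map score from [lo, hi] onto [new_min, new_max]."""
    return (score - lo) / (hi - lo) * (new_max - new_min) + new_min

# Raw scores of "attack" taken from Table 7.
attack = {"AFFIN": -1, "SentiWordNet": -0.25,
          "GeneralInquirer": -4, "SentiStrength": -3}
normalized = {name: min_max(s, *RANGES[name]) for name, s in attack.items()}
```

Rounded to two decimals, the four values in normalized are 0, -0.17, -1.0, and -0.71, matching the Attack row of Table 8.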
Table 8: Normalized and mean scores of the exemplar words of Table 7

Word (w_i)   AFFIN (d_1)                   SentiWordNet (d_2)              General Inquirer (d_3)        SentiStrength (d_4)           δ(w_i)
             P_o(w_i,d_1)   P_n(w_i,d_1)   P_o(w_i,d_2)     P_n(w_i,d_2)   P_o(w_i,d_3)   P_n(w_i,d_3)   P_o(w_i,d_4)   P_n(w_i,d_4)
             [-4,+2]        [-1,+1]        [-0.88,+0.63]    [-1,+1]        [-4,+2]        [-1,+1]        [-4,+3]        [-1,+1]        [-1,+1]
Attack       -1             0              -0.25            -0.17          -4             -1.0           -3             -0.71          -0.47
Harm         -2             -0.33          -0.42            -0.39          -4             -1.0           -3             -0.71          -0.61
Bomb         -1             0              -0.19            -0.08          -4             -1.0           -2             -0.43          -0.38
Secret       0              +0.33          +0.13            +0.33          -1             0              -1             -0.14          +0.13
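The δ(w_i) column of Table 8 can be reproduced by averaging the four normalized scores of a word, which is how the text describes Equation 2. A short Python sketch under that reading follows; the function name mean_score is an illustrative assumption, and the inputs are the rounded P_n values from Table 8.

```python
def mean_score(normalized_scores):
    """Average the normalized scores of one word over the m data sets."""
    return sum(normalized_scores) / len(normalized_scores)

# Rounded P_n values: AFFIN, SentiWordNet, General Inquirer, SentiStrength.
table8 = {
    "attack": [0.0, -0.17, -1.0, -0.71],
    "harm":   [-0.33, -0.39, -1.0, -0.71],
    "bomb":   [0.0, -0.08, -1.0, -0.43],
    "secret": [0.33, 0.33, 0.0, -0.14],
}
delta = {word: round(mean_score(scores), 2) for word, scores in table8.items()}
```

The resulting values are -0.47, -0.61, -0.38, and +0.13, in agreement with the last column of Table 8.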
$$\delta(w_i) = \frac{1}{m} \sum_{j=1}^{m} P_n(w_i, d_j) \tag{2}$$

Δ(w_i) = (3)

$$\rho(w_i) = \delta(w_i) + \Delta(w_i) \tag{4}$$

$$\eta(w_i) = \frac{\rho(w_i) - \min[\rho(w)]}{\max[\rho(w)] - \min[\rho(w)]} \times (\mathit{newMax} - \mathit{newMin}) + \mathit{newMin} \tag{5}$$

In order to take majority voting into consideration when determining the positive or negative sentiment of a word w_i, an additive factor Δ(w_i) is introduced and defined using Equation 3, where p and n represent the number of positive and negative scores of the word w_i, respectively. The additive factor Δ(w_i) is added to the mean score δ(w_i) to get an intermediate polarity score ρ(w_i), which is finally normalized using min-max normalization (Equation 5) to get the final polarity score η(w_i) of the word w_i. Table 9 shows the final sentiment scores of the exemplar words considered in Tables 7 and 8.

2.3.2. Sentiment Polarity Determination of Arabic Words
As stated earlier, unlike for English, there is no existing sentiment lexicon for Arabic. Therefore, we have adopted a semi-automated analysis to create an Arabic sentiment lexicon based on the views of three domain experts taken independently. To start with, a collection of 2000 message posts from the Alokab dark web forum [1] is processed using various NLP techniques. To facilitate annotation by the domain experts, a GUI-based annotation application is developed in Java to parse documents and count the frequency of each word that is not a member of the stop-word list identified for Arabic. A total of 7061 distinct words are compiled from the 2000 message posts. The GUI presents the words in decreasing order of their frequency count and, when a word is clicked, provides a list of all sentences containing it. This helps the experts to know the context in which a word is used by the users of the Alokab forum. Figure 1 presents a list of exemplar Arabic words extracted from the message posts, and Figure 2 shows the list of sentences from the message posts retrieved on clicking the first word in Figure 1.

The annotation application was given to three Arabic language experts to annotate the words as positive, negative, strong, and/or hostile. Scores in the range of [0, 1] for positivity and negativity are assigned to each word by the experts. In case a word is always used in a positive sense, its positive polarity is set to 1 and its negative polarity to 0. Similarly, if a word is always used in a negative sense, its negative polarity is set to 1 and its positive polarity to 0. For other words that are used in both positive and negative senses, positive and negative polarity scores are assigned in the range of 0 to 1 in such a way that their sum is 1. Depending on its degree of positivity, a word is also marked as strong, and similarly, depending on its degree of negativity, a word is marked as strong and/or hostile. A sample list of Arabic words and the polarity scores assigned by the three domain experts is shown in Tables 10, 11, and 12. Words given a total score of 0 by every expert are filtered out of the list as non-sentiment-bearing words. As a result, a total of 1019 words are retained as sentiment-bearing words.

To aggregate the experts' scores, the average of each polarity category is calculated, as shown in Table 13. For a given word w_i, the larger of the average positive score and the average negative score is considered as the initial polarity score δ(w_i) and the corre-
Table 12: Polarity scores assigned to Arabic words by Expert-3
Word Positive Negative Strong Hostile
ﺑﻘﻮة 0.0 0.0 0 0
راﺷﺪة 1.0 0.0 1 0
وإﻧﻘﺎذ 0.0 0.0 0 0
واﻟﺘﻄﺮف 0.0 1.0 0 1
اﺳﺘﻄﺎﻋﺖ 0.0 0.0 0 0
اﺑﻌﺪ 0.4 0.6 0 0
َاﻟﻈﱠﺎﻟِﻤُﻮن 0.0 1.0 0 1
ﻟﻠﻈﺎﻟﻤﯿﻦ 0.0 1.0 0 1
وﺿﯿﻖ 0.0 0.0 0 0
ﻣﺘﻔﻘﻮن 0.0 0.0 0 0
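The aggregation of expert ratings described for the Arabic words can be sketched in Python as follows. Only the Expert-3 pairs below come from Table 12 (rows 1, 6, and 2); the ratings attributed to the other two experts are hypothetical placeholders, and the names aggregate, word_a, word_b, and word_c are illustrative. The sketch drops words that every expert scored as 0, averages the positive and negative scores across experts, and keeps the larger average as the initial polarity score. Returning a polarity label alongside the score and resolving ties as positive are assumptions.

```python
def aggregate(ratings):
    """ratings: list of (positive, negative) pairs, one pair per expert."""
    if all(pos + neg == 0 for pos, neg in ratings):
        return None  # no expert assigned any sentiment: word is filtered out
    avg_pos = sum(pos for pos, _ in ratings) / len(ratings)
    avg_neg = sum(neg for _, neg in ratings) / len(ratings)
    # Assumption: ties are resolved as positive; the text does not cover them.
    if avg_pos >= avg_neg:
        return avg_pos, "positive"
    return avg_neg, "negative"

# Pairs are ordered (Expert-1, Expert-2, Expert-3).
# Only the third pair of each word is taken from Table 12; the rest is made up.
sample = {
    "word_a": [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],
    "word_b": [(0.5, 0.5), (0.0, 1.0), (0.4, 0.6)],
    "word_c": [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)],
}
initial = {word: aggregate(r) for word, r in sample.items()}
```

With these inputs, word_a is filtered out, word_b receives a negative initial score of about 0.7, and word_c receives a positive initial score of 1.0.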
Table 15: A partial list containing 20 entries from SentiLEN data set
Words Morphological variations Sentiment polarity
ambush ambush, ambushed, ambushes, ambushing, ambuscade -0.86
arm arm, armed, arms, arming -0.54
assassin assassination, assassinate, assassinating, assassinated, assassin, assassinator, assassinators -0.95
assault assault, assaulting, assaulted, assaulter, assaultive, assaulters -0.84
attack attack, attacked, attacking, attacks, attacker, attackers -0.90
belief belief, beliefs, believe, believer, believing, believed, believes, believable, believingly, believing +0.68
blast blast, blasts, blasting, blaster, blasted -0.61
blow blow, blows, blew, blown, blower, blowing, blowy -0.75
blood blood, bloody, blooded, blooding, bloods +0.08
body body, bodies, bodily, bodied, bodying +0.08
bomb bomb, bombs, bombing, bombings, bomber, bombers, bombed -0.85
burn burn, burning, burns, burned, burner, burnable -0.69
business business, businesses, businessman +0.73
bust bust, busts, buster, busty, busting, busted -0.53
camp camp, camps, camper, camping, campy, camply, camped -0.52
capture capture, captured, captures, capturing, capturer, capturers -0.35
casualty casualty, casualties -0.85
change change, changes, changing, changed +0.68
checkpoint checkpoint, checkpoints +0.08
chief chief, chiefs +0.73
support support, supports, supporter, supporters, supporting, supported, supportive +0.95
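One way a lexicon such as SentiLEN might be consumed is to index every morphological variant under the polarity of its entry and then average the polarities of the matched tokens in a message. The Python sketch below uses three entries from Table 15; the averaging rule, the whitespace tokenizer, and the names ENTRIES, POLARITY, and message_polarity are assumptions made for illustration and are not part of BiSAL.

```python
# Three SentiLEN entries copied from Table 15: variants and sentiment polarity.
ENTRIES = {
    "attack": (["attack", "attacked", "attacking", "attacks",
                "attacker", "attackers"], -0.90),
    "bomb": (["bomb", "bombs", "bombing", "bombings",
              "bomber", "bombers", "bombed"], -0.85),
    "support": (["support", "supports", "supporter", "supporters",
                 "supporting", "supported", "supportive"], 0.95),
}

# Map every variant to the polarity of its entry.
POLARITY = {v: score for variants, score in ENTRIES.values() for v in variants}

def message_polarity(text):
    """Average polarity of lexicon words in text; None if nothing matches."""
    hits = [POLARITY[t] for t in text.lower().split() if t in POLARITY]
    return sum(hits) / len(hits) if hits else None
```

Storing the variants alongside each entry means that no stemming is needed at lookup time: a token such as "bombing" is matched directly.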
Table 16: A partial list containing 20 entries from SentiLAR data set
Words Morphological variants Sentiment
اﺗﻔﻖ اﺗﻔﻖ, ﯾﺘﻔﻖ, اﺗﻔﻖ, ﻣﺘﻔﻖ, ﻣﺘﻔﻘﺔ, ﺗﺘﻔﻖ اﻻﺗﻔﺎﻗﯿﺔ +0.67
اﺣﺘﺮام اﻻﺣﺘﺮام, ﻣﺤﺘﺮﻣﺔ, ﯾﺤﺘﺮم,ﻣﺤﺘﺮم,اﺣﺘﺮاﻣﻲ +0.65
اﺣﯿﻰ اﻹﺣﯿﺎء, اﺣﯿﺎء, ﺗﺤﯿﻲ, ﻣﺤﯿﯿﺔ, ﻣﺤﯿﻲ, اﺣﯿﻲ,ﯾﺤﯿﻲ +0.67
ادى ﺳﯿﺆدي, ﻣﺆدى, ﺗﺆدي, ﻣﺆدﯾﺔ, ﻣﺆد, أد,ﺳﯿﺆدي +0.67
اﺳﺘﻐﻞ ﻻﺳﺘﻐﻼل, اﺳﺘﻐﻼل, ﺗﺴﺘﻐﻞ, ﻣﺴﺘﻐﻠﺔ, ﻣﺴﺘﻐﻞ, اﺳﺘﻐﻞ,ﯾﺴﺘﻐﻞ +0.67
اﺳﺘﻘﺖ ﻣﺴﺘﻘﻞ, اﺳﺘﻘﻼل, ﺗﺴﺘﻘﻞ, ﻣﺴﺘﻘﻠﺔ, ﻣﺴﺘﻘﻞ, اﺳﺘﻘﻞ,ﯾﺴﺘﻘﻞ -0.08
اﺳﺘﻤﺮ ﻣﺴﺘﻤﺮ, اﺳﺘﻤﺮار, ﺗﺴﺘﻤﺮ, ﻣﺴﺘﻤﺮة, ﻣﺴﺘﻤﺮ, اﺳﺘﻤﺮ,ﯾﺴﺘﻤﺮ -0.08
اﺳﺘﻮطﻦ اﻻﺳﺘﯿﻄﺎﻧﻲ, اﺳﺘﯿﻄﺎن, ﺗﺴﺘﻮطﻦ, ﻣﺴﺘﻮطﻨﺔ, ﻣﺴﺘﻮطﻦ, اﺳﺘﻮطﻦ,ﯾﺴﺘﻮطﻦ +0.67
اﻻﺋﻢ , أﻷم, اﻟﻠﺆم, ﻟﺆﻣﺎن, ﻣﻼﻣﺎن, ﻟﺆﻣﺎء, ﻟﻮﻣﺎ, ﯾﻠﺆم, ﻣﻼﻣﺔ, ﻻﻣﺔ,ﺑﻼم, ﻻم, ﻟﺆﻣﺖ, ﻻﻣﺎ,ﻻﺋﻢ -0.83
ﺑﻸﻣﮭﻤﺎ, اﻟﻤﻸم, ﻣﻶم, اﺳﺘﻸم,اﻵﻣﺎ
اﻹﺳﺘﺮاﺗﯿﺠﯿﺔ اﻹﺳﺘﺮاﺗﯿﺠﯿﺔ +0.67
اﻷﻣﻞ َ ﺗﺄﻣﯿﻼ,َ أﻣﻼ, ﯾﺄﻣﻠﮫ, آﻣﻠﮫ, أﻣﻠﺘﮫ, آﻣﺎل,اﻵﻣﺎل +0.67
اﻟﺘﺼﺪﯾﺔ , ﺗﺼﺪه,َ ﺻﺪﯾﺪا, ﯾﺘﺼﺪد, ﺻﺪاه, ﺗﺼﺪﯾﺖ, ﯾﺼﺪون, اﻟﺼّﺪى, ﯾﺼﺪد, ﯾﺼﺪي, ﺻﺪي,اﻟﺼﺪ -0.67
اﻟﺠﺒﯿﻦ ﺟﺒﺎن, ﺟﺒﻦ, أﺟﺒﻨﮫ, أﺟﺒﻦ, اﻟﺠﺒﯿﻨﺎن,اﻟﺠﺒﻨﺎء -0.67
اﻟﺠﺮم , أﺟﺮﻣﺖ, اﻟﻤﺠﺮﻣﯿﻦ, ﺟﺎرم, اﺟﺘﺮﻣﮫ, ﻣﺠﺮم, ﺟﺮﯾﻢ, ﺟﺮوم, أﺟﺮام,َ ﺟﺮاﻣﺎ, ﺟﺮﻣﮫ,ﺟﺮﯾﻤﺔ -1.00
اﺟﺮم,َ ﺟﺮﻣﺎ, ﯾﺠﺮﻣﻨﻚ,ﺟﺎرم
اﻟﺠﻤﺎل إﺟﻤﺎل, أﺟﻤﻠﺖ, ﺟُﻤﻠﺔ, ﺟﻤﺎﻻت, أﺟﻤﻠﺖ,َ ﺟﻤﺎﻻ, اﻟﻤﺠﻤﻞ, ﺟﻤﺎل, ﺟﻤﯿﻞ,أﺟﻤﻞ +0.83
اﻟﺮﺣﻤﺔ , رﺣﻤﻦ, اﻟﺮﺣﻢ, اﻷرﺣﺎم, اﻟﺮﺣﻤﻦ, رﺣﻤﺘﻨﺎ, رﺣﻤﺖ,َ رﺣﻤﺎ, رﺣﻤﺔ, اﻟﻤﺮﺣﻤﺔ,اﻟﺮّﺣﯿﻢ +0.67
اﻟﺴﻄﻮة ﯾﺴﻄﻮ, ﯾﺴﻄﻮان,َ ﺳﻄﻮا,َ ﺳﻄﺎ, اﻟﺴﻄﻮات,ﺳﻄﻮة -0.67
اﻟﺸﺮﯾﺮ ﺷﺮﯾﺮﯾﻦ, أﺷﺮار, اﻟﺸﺮار, أﺷﺮاء, أﺷﺮ, ﺷﺮارة,َ ﺷﺮا, ﺷﺮور, ﯾﺸﺮ,اﻟﺸﺮ -1.00
اﻟﺼﺮاط اﻟﺼﺮاط +0.67
اﻟﻐﺎﻟﯿﺔ اﻟﻐﺎﻟﯿﺔ +0.67