Comparative Fraud App
Keywords— Natural Language Processing, Sentiment Analysis, Sentiment Lexicon, Sentiment Score.
ISSN: 0975 – 6760| NOV 12 TO OCT 13 | VOLUME – 02, ISSUE – 02 Page 313
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
COMPUTER ENGINEERING
them with the data to determine polarity. They assign sentiment scores to the opinion words, describing how positive, negative and objective the words contained in the dictionary are.
The objective of this paper is to explore the concept of sentiment analysis in the field of Natural Language Processing and to present a comparative analysis of its techniques. The paper is organized as follows: Section 2 provides an overview of the most commonly used techniques in sentiment analysis. Section 3 discusses the analysis and comparison of sentiment analysis techniques. Section 4 concludes the manuscript.

2. SENTIMENT ANALYSIS TECHNIQUES
There are two main techniques for sentiment analysis: machine learning based and lexicon based. A few research studies have also combined these two methods and gained relatively better performance.

1) Machine learning based techniques
The machine learning approaches applicable to sentiment analysis mostly belong to supervised classification. In machine learning based techniques, two sets of documents are needed: a training set and a test set. The training set is used by an automatic classifier to learn the differentiating characteristics of documents, and the test set is used to check how well the classifier performs. A number of machine learning techniques have been adopted to classify reviews. Techniques such as Naive Bayes (NB), maximum entropy (ME), and support vector machines (SVM) have achieved great success in sentiment analysis. Machine learning starts with collecting a training dataset. The next step is to train a classifier on the training data. Once a supervised classification technique is selected, an important decision to make is feature selection, since the features determine how documents are represented. The most commonly used features in sentiment classification are introduced below.
• Term presence and their frequency:
These features include uni-grams or n-grams and their frequency or presence. These features have been widely and successfully used in sentiment classification. Pang et al. [1] claim that uni-grams give better results than bi-grams in movie review sentiment analysis, but Dave et al. [6] report that bi-grams and tri-grams give better product-review polarity classification.
• Part of speech information:
POS is used to disambiguate sense, which in turn is used to guide feature selection [11]. In POS tagging, each term in a sentence is assigned a label which represents its position/role in the grammatical context. For example, with POS tags we can identify adjectives and adverbs, which are usually used as sentiment indicators [2].
• Negations:
Negation is also an important feature to take into account since it has the potential of reversing a sentiment [11].
• Opinion words and phrases:
Opinion words and phrases are words and phrases that express positive or negative sentiments. The main approaches to identifying the semantic orientation of an opinion word are statistical-based or lexicon-based. Hu and Liu [4] use WordNet to determine whether an extracted adjective has a positive or negative polarity.
Pang et al. [1] compared the performance of three classifiers (Naïve Bayes, Maximum Entropy and Support Vector Machines) in sentiment classification at the document level on different feature sets: unigrams only, bigrams only, a combination of both, unigrams combined with parts of speech, adjectives only, and unigrams combined with position information. The results showed that feature presence is more important than feature frequency, and that when the feature set is small, Naïve Bayes performs better than SVM, whereas SVMs perform better when the feature space is increased. When the feature space is increased, Maximum Entropy may perform better than Naïve Bayes, but it may also suffer from overfitting. Abbasi et al. [12] proposed sentiment analysis techniques for the classification of hate/extremist web forum postings in multiple languages (English and Arabic) by utilizing stylistic and syntactic features. They introduced a new algorithm, the entropy weighted genetic algorithm (EWGA), which is a hybrid genetic algorithm that uses the information gain heuristic to improve feature selection. They use 14 categories of English and Arabic feature sets as the initial set. All features with an information gain greater than 0.0025 are selected. They used a Support Vector Machine (SVM) with 10-fold cross-validation and bootstrapping to classify sentiments in all experiments. When using both syntactic and stylistic features, they achieved 95.55% accuracy in 10-fold cross-validation.

2) Lexicon based techniques
In the unsupervised technique, classification is done by comparing the features of a given text against sentiment lexicons whose sentiment values are determined prior to their use. A sentiment lexicon contains lists of words and expressions used to express people's subjective feelings and opinions. For example, start with positive and negative word lexicons and analyze the document whose sentiment needs to be found. If the document contains more words from the positive lexicon, it is positive; otherwise it is negative. The lexicon based approach to sentiment analysis is unsupervised because it does not require prior training in order to classify the data.
The basic steps of the lexicon based techniques are outlined below [9]:
1. Preprocess each text (i.e. remove HTML tags, noisy characters).
2. Initialize the total text sentiment score: s ← 0.
3. Tokenize text. For each token, check if it is present in a sentiment dictionary.
   (a) If token is present in dictionary,
       i. If token is positive, then s ← s + w.
       ii. If token is negative, then s ← s − w.
4. Look at total text sentiment score s,
   (a) If s > threshold, then classify the text as positive.
   (b) If s < threshold, then classify the text as negative.
There are three methods to construct a sentiment lexicon: manual construction, corpus-based methods and dictionary-based methods. The manual construction of a sentiment lexicon is a difficult and time-consuming task.
In dictionary based techniques the idea is to first collect a small set of opinion words manually with known orientations, and then to grow this set by searching the WordNet dictionary for their synonyms and antonyms. The newly found words are added to the seed list, and the next iteration starts. The iterative process stops when no more new words are found [8]. Opinion words share the same orientation as their synonyms and the opposite orientation to their antonyms. Hu and Liu [4] use this technique to find the semantic orientation of adjectives. They used 30 adjectives as the seed list. A limitation of the dictionary based approach is that it cannot find opinion words with domain specific orientations [8].
Corpus based techniques rely on syntactic patterns in large corpora. Corpus-based methods can produce opinion words with relatively high accuracy. Most of these corpus based methods need very large labeled training data. This approach has a major advantage that the dictionary-based approach does not have: it can help find domain specific opinion words and their orientations.
The most prominent work done using unsupervised methods for opinion mining and sentiment detection is by Turney [2]. He uses "poor" and "excellent" as seed words, since they appear frequently on the web, to calculate the semantic orientation of phrases, where orientation is measured by pointwise mutual information (PMI).

SO(phrase) = PMI(phrase, "excellent") − PMI(phrase, "poor")

The sentiment of a document is calculated as the average semantic orientation of all such phrases. He was able to achieve 66% accuracy for the movie review domain. Ting-Chun Peng and Chia-Chun Shih [13] use part-of-speech (POS) patterns to extract the sentiment phrases of each review. They use each unknown sentiment phrase as a query term and retrieve the top-N relevant phrases from a search engine. Next, the sentiments of unknown sentiment phrases are computed based on the sentiments of nearby known relevant phrases using lexicons. Gang Li and Fei Liu [10] developed an approach for clustering documents into a positive group and a negative group based on the k-means clustering algorithm. After applying the TF-IDF (term frequency - inverse document frequency) technique to the raw data, a voting mechanism is used to extract a more stable clustering result. This result is obtained from multiple runs of the clustering process; the term score is then used to further improve the clustering result. A. Khan et al. [5] proposed a rule based, domain independent method of sentiment classification at the sentence level. They first classify sentences as objective or subjective and check their semantic scores using SentiWordNet. The final weight of each individual sentence is calculated after considering the whole sentence structure, contextual information and word sense disambiguation. Their method achieves an accuracy of 86.6% at the sentence level. Zhang et al. [7] proposed the Weakness Finder system, which can help manufacturers find their product weaknesses from Chinese reviews by using aspect based sentiment analysis. In their method they first identify the implicit and explicit features for each aspect, and then they determine the sentiments about the aspects. They found the product weaknesses by comparing the results for each aspect of a specific product with the aspects of different products. They found the aspects of each review by using the explicit and implicit feature grouping method, and the sentiments about them via sentence based sentiment analysis. They use the PMI method to find implicit feature words; for explicit features, the morpheme sharing method and the HowNet similarity measure are used. They achieved an overall precision of 82.62%, a recall of 85.26% and an F1-measure of about 83.92%.

3) Hybrid Techniques
A few research studies have indicated that combining the machine learning and lexicon based approaches improves sentiment classification performance. Mudinas et al. [15] present a concept-level sentiment analysis system, pSenti, which is developed by combining lexicon-based and learning-based approaches. The main advantage of their hybrid approach using a lexicon/learning symbiosis is to attain the best of both worlds: stability as well as readability from a carefully designed lexicon, and high accuracy from a powerful supervised learning algorithm. Their system uses a sentiment lexicon constructed from public resources for initial sentiment detection. Currently the sentiment lexicon consists of 7048 sentiment words, including words with wildcards, and sentiment values are marked in the range from −3 to +3. They used sentiment words as features in the machine learning method. The weight of such a feature is the sum of the sentiment values in the given review. For adjectives which are not in the sentiment lexicon, their occurrence frequencies are used as their initial values. Their hybrid approach pSenti achieved 82.30% accuracy.
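To make the idea of feeding lexicon knowledge into a learner more concrete, the following is a minimal sketch in Python. It is not the implementation of pSenti or of any other surveyed system. The toy lexicon, the tokenizer and the function names are assumptions made purely for illustration. It computes a plain lexicon score for a review, and a feature dictionary in which the weight of each sentiment word is the sum of its sentiment values in the review. The handling of adjectives missing from the lexicon is omitted because it would require a POS tagger.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (an assumption for this sketch): word -> value in [-3, +3].
TOY_LEXICON = {"excellent": 3, "good": 2, "poor": -2, "terrible": -3}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Lexicon-only score: sum of the sentiment values of all tokens found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def lexicon_features(text, lexicon=TOY_LEXICON):
    """Feature weights for a learner: each sentiment word maps to the
    sum of its sentiment values over all its occurrences in the review."""
    counts = Counter(tokenize(text))
    return {word: lexicon[word] * n for word, n in counts.items() if word in lexicon}


review = "Good camera, good lens, but terrible battery."
print(lexicon_score(review))     # 2 + 2 - 3 = 1
print(lexicon_features(review))  # {'good': 4, 'terrible': -3}
```

In a hybrid setting, feature dictionaries of this kind would be turned into vectors and passed, together with labeled reviews, to a supervised classifier such as an SVM. That training step is left out here because its details differ between the surveyed systems.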
Fang et al. [16] incorporate not only a general purpose sentiment lexicon but also domain specific sentiment lexicons into SVM learning, and use this method for identifying both product aspects and their associated polarities. Experimental results show that while a general purpose sentiment lexicon provides only a minor accuracy improvement, incorporating domain specific dictionaries leads to a more significant improvement. Their system performs a two step classification. In step 1, a classifier is trained to predict the camera aspect being discussed. In step 2, a classifier is trained to predict the sentiment associated with that camera aspect. Finally, the two step prediction results are aggregated to produce the final prediction. In both steps, the lexicon knowledge is incorporated into conventional SVM learning. They achieved 66.8% polarity accuracy. Zhang et al. [14] employ an augmented lexicon-based method for entity level sentiment analysis. They first extract some additional opinionated indicators (e.g. words and tokens) through the Chi-square test on the results of the lexicon-based method. With the help of the new opinionated indicators, additional opinionated tweets can be identified. Afterwards, a sentiment classifier is trained to assign sentiment polarities to entities in the newly identified tweets. The training data for the classifier is the result of the lexicon-based method. Thus, the whole process requires no manual labeling. They achieved an accuracy of 85.4%.

3. ANALYSIS AND COMPARISON
Supervised machine learning techniques have shown relatively better performance than the unsupervised lexicon based methods. However, the unsupervised methods are important too, because supervised methods demand large amounts of labeled training data that are very expensive, whereas the acquisition of unlabelled data is easy. Most domains except movie reviews lack labeled training data; in such cases unsupervised methods are very useful for developing applications.
Most of the researchers reported that Support Vector Machines (SVM) have higher accuracy than other algorithms. The main limitation of supervised learning is that it generally requires large expert-annotated training corpora to be created from scratch, specifically for the application at hand, and may fail when training data are insufficient.
The opinion words that are included in the dictionary are very important for the lexicon based approach. If the dictionary contains too few words or is overly exhaustive, one risks under- or over-analyzing the results, leading to a decrease in performance. Another significant challenge to this approach is that the polarity of many words is domain and context dependent. For example, 'funny movie' is positive in the movie domain and 'funny taste' is negative in the food domain. Such words are associated with sentiment only in a particular domain. Current sentiment lexicons do not capture such domain and context sensitivities of sentiment expressions. Without a comprehensive lexicon, the sentiment analysis results will suffer. The lexicon-based approach can result in low recall for sentiment analysis.
The main advantage of the hybrid approach using a lexicon/learning combination is to attain the best of both worlds: high accuracy from a powerful supervised learning algorithm and stability from the lexicon based approach. Table 1 presents a summary of the precision of sentiment analysis using different techniques, according to the data reported by the authors.

Paper                 Dataset                                     Technique (precision, %)
Pang et al. [1]       IMDB                                        NB (81.5), ME (81.0), SVM (82.9)
Turney [2]            Epinions                                    PMI (66)
Dave et al. [6]       Amazon, CNET                                SVM (85.8-87.2), NB (81.9-87.0)
Hu and Liu [4]        Amazon, CNET                                Lexicon (84.0)
Abbasi et al. [12]    U.S. & Middle Eastern web forum postings    SVM (95.55)
A. Khan et al. [5]    IMDB, Skytrax, Tripadvisor                  Lexicon (86.6)
Zhang et al. [7]      Luce, Yoka                                  Lexicon (82.62)
Fang et al. [16]      Multi-Domain Sentiment Dataset              ML + Lexicon (66.8)
Zhang et al. [14]     Twitter                                     ML + Lexicon (85.4)
Mudinas et al. [15]   CNET, IMDB                                  ML + Lexicon (82.30)

Table 1: Precision of sentiment analysis using different techniques according to the data reported by authors.

4. CONCLUSION
Applying sentiment analysis to mine the huge amount of unstructured data has become an important research problem. Business organizations and academics are now putting their efforts into finding the best system for sentiment analysis. Although some of the algorithms used in sentiment analysis give good results, no technique can yet resolve all the challenges. Most of the researchers reported that Support Vector Machines (SVM) have higher accuracy than other algorithms, but they also have limitations. More work is needed on further improving the performance of sentiment classification. There is a huge need in the industry for such applications because every company wants to
know how consumers feel about their products and services and those of their competitors. Different types of techniques should be combined in order to overcome their individual drawbacks and benefit from each other's merits, and enhance the sentiment classification performance.

5. REFERENCES
[1] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up?: sentiment classification using machine learning techniques," Proceedings of the ACL-02 conference on Empirical methods in natural language processing, vol. 10, 2002, pp. 79-86.
[2] P. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews," Proceedings of the Association for Computational Linguistics (ACL), 2002, pp. 417-424.
[3] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, "Lexicon-based methods for sentiment analysis," Computational Linguistics, vol. 37, 2011, pp. 267-307.
[4] M. Hu and B. Liu, "Mining and summarizing customer reviews," Proceedings of the tenth ACM international conference on Knowledge discovery and data mining, Seattle, 2004, pp. 168-177.
[5] A. Khan, B. Baharudin, K. Khan, "Sentiment Classification from Online Customer Reviews Using Lexical Contextual Sentence Structure," ICSECS 2011: 2nd International Conference on Software Engineering and Computer Systems, Springer, 2011, pp. 317-331.
[6] K. Dave, S. Lawrence, and D. M. Pennock, "Mining the peanut gallery: Opinion extraction and semantic classification of product reviews," Proceedings of WWW, 2003, pp. 519-528.
[7] W. Zhang, H. Xu, W. Wan, "Weakness Finder: Find product weakness from Chinese reviews by using aspects based sentiment analysis," Expert Systems with Applications, Elsevier, vol. 39, 2012, pp. 10283-10291.
[8] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.
[9] M. Annett, G. Kondrak, "A comparison of sentiment analysis techniques: Polarizing movie Blogs," In Canadian Conference on AI, 2008, pp. 25-35.
[10] G. Li, F. Liu, "A Clustering-Based Approach on Sentiment Analysis," IEEE International Conference on Intelligent System and Knowledge Engineering, Hangzhou, China, vol. 2010 / 1, pp. 331-337.
[11] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval 2(1-2), 2008, pp. 1-135.
[12] A. Abbasi, H. Chen, and A. Salem, "Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums," In ACM Transactions on Information Systems, vol. 26, issue 3, 2008, pp. 1-34.
[13] T. Peng, C. Shih, "An Unsupervised Snippet-Based Sentiment Classification Method for Chinese Unknown Phrases without Using Reference Word Pairs," Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology, 2010, pp. 243-248.
[14] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu, and B. Liu, "Combining Lexicon-based and Learning-based Methods for Twitter Sentiment Analysis," Technical report, HP Laboratories, 2011.
[15] A. Mudinas, D. Zhang, M. Levene, "Combining lexicon and learning based approaches for concept-level sentiment analysis," Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining, ACM, New York, NY, USA, Article 5, 2012, pp. 1-8.
[16] J. Fang and B. Chen, "Incorporating Lexicon Knowledge into SVM Learning to Improve Sentiment Classification," In Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP), 2011, pp. 94-100.