Survey On Aspect-Level Sentiment Analysis: Kim Schouten and Flavius Frasincar

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO.
3, MARCH 2016 813
Survey on Aspect-Level Sentiment Analysis

Kim Schouten and Flavius Frasincar
Abstract—The field of sentiment analysis, in which sentiment is gathered, analyzed, and aggregated from text, has seen a lot of
attention in the last few years. The corresponding growth of the field has resulted in the emergence of various subareas, each
addressing a different level of analysis or research question. This survey focuses on aspect-level sentiment analysis, where the goal
is to find and aggregate sentiment on entities mentioned within documents or aspects of them. An in-depth overview of the current
state-of-the-art is given, showing the tremendous progress that has already been made in finding both the target, which can be an entity
as such, or some aspect of it, and the corresponding sentiment. Aspect-level sentiment analysis yields very fine-grained sentiment
information which can be useful for applications in various domains. Current solutions are categorized based on whether they provide
a method for aspect detection, sentiment analysis, or both. Furthermore, a breakdown based on the type of algorithm used is provided.
For each discussed study, the reported performance is included. To facilitate the quantitative evaluation of the various proposed
methods, a call is made for the standardization of the evaluation methodology that includes the use of shared data sets. Semantically-
rich concept-centric aspect-level sentiment analysis is discussed and identified as one of the most promising future research direction.
Index Terms—Text mining, linguistic processing, machine learning, text analysis, sentiment analysis, aspects
1 INTRODUCTION
T HE digital age, also referred to as the information soci-

ety, is characterized by ever growing volumes of infor-
mation. Driven by the current generation of web
product reviews, which have been shown to influence buy-
ing behavior [1]. Moreover, information provided by indi-
viduals on the web is regarded as more trustworthy than
applications, the nearly limitless connectivity, and an insa- information provided by the vendor [1]. From a producers
tiable desire for sharing information, in particular among point of view, every person is a potential customer. Hence,
younger generations, the volume of user-generated social knowing their likes and dislikes can be of great help in
media content is growing rapidly and likely to increase developing new products [2], as well as managing and
even more in the near future. People using the Web are con- improving existing ones [3]. Furthermore, understanding
stantly invited to share their opinions and preferences with how the information in, for example, product reviews inter-
the rest of the world, which has led to an explosion of opin- acts with the information provided by companies enables
ionated blogs, reviews of products and services, and com- the latter to take advantage of these reviews and improve
ments on virtually anything. This type of web-based sales [4]. In fact, opinions on the Web have become a
content is more and more recognized as a source of data resource to be harnessed by companies, just like the
that has added value for multiple application domains. traditional word-of-mouth [5]. In addition to this traditional
producer/consumer model, sentiment analysis is also
1.1 Applications important for other economic areas, like for example finan-
For ages, governments and mercantile organizations alike cial markets [6].
have been struggling to determine the opinions of their tar-
get communities and audiences. Now, for the first time, 1.2 Definitions
people voluntarily publish their opinions on the World This survey will start with a quick summary of the defini-
Wide Web, for anyone to see. This social web allows for tions for aspect-level sentiment analysis set forth by Pang
almost immediate feedback on products, stocks, policies, and Lee [3]. The field of sentiment analysis operates at the
etc., and many of the desired data, which was hard to come intersection of information retrieval, natural language proc-
by in the past, is now readily available. This is in stark con- essing, and artificial intelligence. This has led to the use of
trast with the traditional surveys and questionnaires that different terms for similar concepts. A term often used is
often reluctant participants had to fill without any personal ‘opinion mining’, a denotation coming from the data mining
motivation to do so, resulting in sub-optimal information. and information retrieval community. The main goal of
Many individuals are influenced by the opinionated opinion mining is to determine the opinions of a group of
materials they find on the web. This is especially true for people regarding some topic. The term ‘sentiment analysis’
is also used quite often. It comes from the natural language
The authors are with the Erasmus School of Econommics, Erasmus
processing domain, and the focus lies on determining the
University Rotterdam, The Netherlands. sentiment expressed in text. The term subjectivity analysis
E-mail: {schouten, frasincar}@ese.eur.nl. is sometimes regarded as encompassing opinion mining
Manuscript received 16 July 2014; revised 1 Sept. 2015; accepted 20 Sept. and sentiment analysis, as well as related tasks [7], but also
2015. Date of publication 1 Oct. 2015; date of current version 2 Feb. 2016. as a sub-task of opinion mining and sentiment analysis [8].
Recommended for acceptance by E. Lim. Nevertheless, all these terms, even though possibly used for
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below. slightly different tasks or different angles, represent the
Digital Object Identifier no. 10.1109/TKDE.2015.2485209 same area of research. This field of research, labeled as
Authorized licensed use limited to: St. Xavier's Catholic College of Engineering. Downloaded on May 31,2024 at 11:03:06 UTC from IEEE Xplore. Restrictions apply.
1041-4347 ß 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
814 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. 3, MARCH 2016
opinion mining, sentiment analysis, or subjectivity analysis, other hand, “I could not sleep because of the noise” is an
studies the phenomena of opinion, sentiment, evaluation, example that illustrates an implicit sentiment with an
appraisal, attitude, and emotion [8]. For ease of reference implicit target: one expects to be able to sleep well, but
these terms are often simply referred to as opinion or senti- according to the sentence, this expectation was not met,
ment, even though they are technically not the same. which is why this sentence can be seen as illustrating a
An opinion can be defined as a “judgment or belief not negative sentiment.
founded on certainty or proof” [9]. In this sense, it is the Last, since the set of human emotions is very large [11],
opposite of a fact. Hence, statements expressing an opinion sentiment polarity is often used instead. Polarity describes
are subjective, while factual statements are objective. Senti- the direction of the sentiment and it is either positive, nega-
ment is orthogonal to this [10], as it is closely related to atti- tive, or neutral [3]. Some algorithms only perform a binary
tude and emotion, used to convey an evaluation of the topic classification, distinguishing solely between positive and
under discussion. Because of this orthogonality, there are negative polarity.
four quadrants a sentence can fall in. It can be subjective or
objective, as well as with or without sentiment. For example, 1.3 Outline of Aspect-Level Sentiment Analysis
people may have varying opinions on what color a certain
In general, three processing steps can be distinguished
dress is1 in “Others think it looks like a blue and black dress,
when performing aspect-level sentiment analysis: identifi-
but to me it is a white and gold dress.”, without expressing
cation, classification, and aggregation [7]. While in practice,
any sentiment. In contrast, the statement “Some persons
not every method implements all three steps or in this exact
looked at the dress and saw a blue and black one, others
order, they represent major issues for aspect-level sentiment
were convinced it was white with gold instead” is purely
analysis. The first step is concerned with the identification
objective and also without sentiment. Statements conveying
of sentiment-target pairs in the text. The next step is the clas-
sentiment can be both subjective and objective as well. For
sification of the sentiment-target pairs. The expressed senti-
example “The blue and black dress is the most beautiful” is
ment is classified according to a predefined set of sentiment
a subjective statement with sentiment, while “My favorite
values, for instance positive and negative. Sometimes the
dress is sold out” is an objective statement with sentiment.
target is classified according to a predefined set of aspects
In light of the above discussion, we will use the term senti-
as well. At the end, the sentiment values are aggregated for
ment analysis throughout this survey, as it best captures the
each aspect to provide a concise overview. The actual pre-
research area under investigation.
sentation depends on the specific needs and requirements
With the above discussion in mind, finding sentiment
of the application.
can be formally defined as finding the quadruple
Besides these core elements of aspect-level sentiment
ðs; g; h; tÞ [8], where s represents the sentiment, g represents
analysis, there are additional concerns: robustness, flexibil-
the target object for which the sentiment is expressed, h
ity, and speed. Robustness is needed in order to cope with
represents the holder (i.e., the one expressing the senti-
the informal writing style found in most user-generated
ment), and t represents the time at which the sentiment
content. People often make lots of errors in spelling and
was expressed. Note that most approaches focus only on
grammar, not to mention the slang language, emoticons,
finding the pair ðs; gÞ. The target can be an entity, such as
and other constructions that are used to voice a certain
the overall topic of the review, or an aspect of an entity,
sentiment. Flexibility is the ability to deal with multiple
which can be any characteristic or property of that entity.
domains. An application may be performing very well on a
This decision is made based on the application domain at
certain domain, but very poorly on another, or just medio-
hand. For example, in product reviews, the product itself
cre on all domains. Last, an aspect-level sentiment analysis
is usually the entity, while all things related to that product
solution ideally is accessible using a Web interface due
(e.g., price, quality, etc.) are aspects of that product.
to Web ubiquity, underlining the need for high speed
Aspect-level sentiment analysis is concerned, not just with
performance.
finding the overall sentiment associated with an entity, but
also with finding the sentiment for the aspects of that
entity that are discussed. Some approaches use a fixed, 1.4 Focus of This Survey
predetermined list of aspects, while others freely discover To allow for a proper level of depth, we focus this survey on
aspects from the text. a particular sub-field of sentiment analysis. As discussed
Both sentiment and target can be expressed explicitly or in [8], sentiment analysis has been studied mainly at three
remain implicit, independent of each other. When explic- levels of classification. Sentiment is classified on either the
itly mentioned, a sentiment or target is literally in the text, document level, the sentence level, or the entity or aspect
while implicit expressions of sentiment or target have to be level. A focus on the first level assumes that the whole docu-
inferred from the text, which sometimes even requires ment expresses sentiment about only one topic. Obviously,
additional context or domain knowledge. For example, this is not the case in many situations. A focus on the second
“This hotel is fantastic” is an example of a sentence with level comes with a similar assumption in that one sentence
an explicit entity and an explicit sentiment, while “The ser- should only contain sentiment about one topic. Within
vice is great” expresses a similar explicit sentiment, but the same sentence, it is often the case that multiple entities
with an explicit aspect of an entity as its target. On the are compared or that certain sentiment carrying opinions
are contrasted. At both the document level and the sentence
1. See, for instance, http://www.wired.com/2015/02/science-one- level, the computed sentiment values are not directly associ-
agrees-color-dress/ ated with the topics (i.e., entities or aspects of entities)
SCHOUTEN AND FRASINCAR: SURVEY ON ASPECT-LEVEL SENTIMENT ANALYSIS 815
discussed in the text. In a similar manner, sentiment can be 2 EVALUATION METHODOLOGY

computed over any arbitrary piece of text, even a complete
Any maturing research area has to arrive at a common eval-
corpus (e.g., a corpus of microblog entries, where each post
uation methodology that is generally accepted in the field.
is considered a document).
For aspect-level sentiment analysis, this is not yet the case,
In contrast, aspect-level sentiment analysis aims to find
as evidenced by the wide variety of used evaluation meas-
sentiment-target pairs in a given text (i.e., this could range
ures and data sets.
from sentences or smaller textual units, to complete corpora
In recent years, the International Workshop on Semantic
containing many documents). Within aspect-level sentiment
Evaluation has embraced the task of aspect-level sentiment
analysis, the overall sentiment would generally refer to the
analysis [13], [14], providing a controlled evaluation meth-
entity, while aspect-level sentiment analysis would refer to
odology and shared data sets for all participants. All com-
the sentiment associated with aspects of the entity being dis-
peting systems get the same unannotated test data, which
cussed. This allows for a more detailed analysis that utilizes
they will have to annotate with aspect tags and sentiment
more of the information provided by the textual review.
tags. This is sent to the organization which will perform a
Therefore, this survey will focus on aspect-level analysis
controlled evaluation of the provided data using the same
and its various sub-tasks. This also allows us to cover more
procedures for each competing system. The result is an
recent developments, instead of repeating established
overview of approaches that can be directly compared
insights that can be found in other surveys [3], [7], [8], [12].
against each other.
A good survey and introduction into the field of senti-
Likewise, the GERBIL framework [15] also has the goal of
ment analysis is Pang and Lee’s publication from
directly comparing approaches with the same, controlled,
2008 [3]. Not only are various techniques and applications
evaluation methodology. To that end, it combines multiple
discussed, but also ethical, practical, and theoretical con-
data sets and many implementations of existing algorithms
siderations are covered by their article. However, the cov-
to compare against. Furthermore, the exact experimental
erage of the survey is restricted mostly to document-level
setting is permanently stored and can be referred to so that
machine learning approaches. There is a smaller survey
readers can exactly see how the evaluation is performed.
by Tang et al. [12] from 2009, and while it mainly focuses
Unfortunately, at the time of writing, this system is only
on document-level machine learning approaches as well,
available for the task of entity annotation. However, the
it specifically addresses the domain of consumer reviews.
concept is applicable to many tasks, including aspect-level
Tsytsarau and Palpanas [7] published a survey in 2011
sentiment analysis.
that, while still focusing on document-level sentiment
Of course, many problems arise when research field
analysis, distinguishes between four different approaches
standards are developed. For instance, the annotations
for identifying the sentiment value of words: machine
needed differ for the various approaches since some meth-
learning, dictionary-based, statistical, and semantic. These
ods classify sentiment in only positive or negative, while
four labels mainly describe how the sentiment value of a
others use a five-star rating. In other cases, the specific focus
single word is determined. In 2012, Liu published a sur-
of an evaluation may not be aspect-level sentiment analysis,
vey [8], with an updated overview of the entire field of
like in [16] where the task of selecting comprehensive
sentiment analysis. The chapter dealing with aspect-level
reviews is evaluated. The focus on different tasks also solic-
sentiment analysis is organized as a list of sub-problems
its the use of a wide variety of evaluation metrics.
that one encounters when implementing an actual solu-
tion: from definitions to aspect extraction, including vari-
2.1 Evaluation Measures
ous challenges that can be defined as part of aspect-level
Currently, most of the surveyed work uses accuracy, preci-
sentiment analysis, like dealing with implicit and explicit
sion, recall, and F1 to measure quantitative performance,
sentiment and entities, to how aspects and sentiment val-
but some less common metrics are in use as well. To facili-
ues can be identified and linked to one another. However,
tate the proper interpretation of the reported performances,
a systematic classification of approaches and reports of
we will briefly discuss these less common metrics and pres-
their accuracy are missing, a gap that the current survey
ent the general way of computing them.
is aiming to fill.
For sentiment classification, multiple measures are in use:
1.5 Organization Ranking Loss, Mean Absolute Error, and Mean Squared
Error (MSE). All of them assume that the sentiment value is
This survey is organized as follows. First, we discuss the
at least an interval type variable. This assumption can be rea-
evaluation methodology for aspect-level sentiment analysis.
sonable, even though in practice this is usually not the case.
Then, we present various approaches for aspect detection
Ranking Loss [17], used in [18], measures the average
and sentiment analysis in isolation as well as joint aspect
distance between the true rank and the predicted rank. For
detection and sentiment analysis approaches. After that,
a sentiment classification problem with m sentiment classes
we discuss some interesting related problems that most
(e.g., on a scale from one to five) and n test instances, Rank-
approaches encounter and present some solutions dedicated
ing Loss is defined in Equation (1) as the average deviation
to solve these issues. Then, the problem of aggregating senti-
between the actual sentiment value y for instance i and the
ment scores is discussed, as well as the presentation of the
predicted sentiment value y^ for that instance,
aspect sentiment scores. We conclude the paper with an
informed outlook on the field of aspect-level sentiment anal- X
n
jyi y^i j
ysis and highlight some of the most promising directions for Ranking Loss ¼ : (1)
i¼1
mn
future research.
An alternative to Ranking Loss is the macro-averaged exact situation, for example whether the probability distri-
Mean Absolute Error, which is particularly robust to imbal- butions are continuous or discrete. Characteristic for the
ance in data sets. Used in [19], it is computed as KL-divergence, compared to other measures is that it is not
a true metric, since it is not symmetrical: the KL-divergence
1X m
1 X of A compared to B is different than the KL-divergence of B
MAEM ðy; y
^Þ ¼ jyi y^i j; (2)
m j¼1 jyj j y 2y compared to A.
i j
^ is the vector
where y is the vector of true sentiment values, y 3 CORE SOLUTIONS
of predicted sentiment values, yj ¼ fyi : yi 2 y; yi ¼ jg , and
To provide insight into the large number of proposed meth-
m is the number of unique sentiment classes in y. ods for aspect-level sentiment analysis, a task-based top-
A similar measure is Least Absolute Errors (LAE), or L1 level categorization is made, dividing all approaches into
error, which is used in [20] to measure sentiment classifica- the following three categories: methods focusing on aspect
tion error. It is computed as detection, methods focusing on sentiment analysis, methods
X
n for joint aspect detection and sentiment analysis. Within
LAE ¼ j^
yi yi j; (3) each task, a method-based categorization is made that is
i¼1 appropriate for that task (e.g., supervised machine learning,
frequency-based, etc.). For each task, a table outlining all
where ^ y is the vector of n sentiment predictions and y is the surveyed methods that cover that task is given. Each table
vector of true sentiment values. lists the work describing the method, its domain (i.e., what
Related to this is the Mean Squared Error, or the mean L2 kind of data it is evaluated on), a short description of the
error, used in [21] to evaluate the sentiment prediction error task that is evaluated, and the performance as reported by
of the proposed method. This is a widely used metric, espe- the authors. For the methods that perform sentiment analy-
cially for regression, which is computed as sis, the number of sentiment classes is also reported. Note
that since evaluation scores are taken from the original
1X n
MSE ¼ yi yi Þ2 ;
ð^ (4) papers, experimental settings will be different for each
n i¼1 work and as a consequence the methods cannot be com-
pared using these evaluation scores. When multiple var-
where, again, ^ y is the vector of n sentiment predictions and
iants of an approach are evaluated and compared, we
y is the vector of true sentiment values.
report only the results of the variant that yields the best per-
For aspect detection, some algorithms return a ranked list
formance. When the same method is evaluated over multi-
of aspects. To compare rankings, multiple measures exist,
ple data sets, the results are presented as the average or as a
one of which, the normalized Discounted Cumulative Gain
range.
(nDCG), is used when reporting performance scores for the
Note that work describing both a method for aspect
discussed work.
detection and a different method for sentiment analysis
The normalized Discounted Cumulative Gain [22], also
appears twice: the aspect detection method is discussed in
used in [21], is particularly useful to evaluate relevance for
Section 3.1, while the sentiment analysis method is dis-
lists of returned aspects. Furthermore, relevance does not
cussed in Section 3.2. A tree overview of the classification
have to be binary. The regular Discounted Cumulative Gain
system is shown in Fig. 1, which is inspired by the organiza-
is computed as
tion of approaches that is used in the tutorial of Moghad-
Xk dam & Ester [24].
2relðiÞ 1
DCG@k ¼ ; (5)
i¼1
log 2 ði þ 1Þ 3.1 Aspect Detection
where k represents the top k returned aspects that will be All methods featuring an aspect detection method of inter-
evaluated, and relðiÞ is the relevance score of aspect i. To est are discussed in this section. A division is made between
normalize this score, and allow cross-query evaluation, the frequency-based, syntax-based (sometimes referred to as
DCG score is divided by the ideal DCG. This is the DCG relation-based methods), supervised machine learning,
that would have been returned by a perfect algorithm. For unsupervised machine learning, and hybrid approaches.
most of the discussed approaches, nDCG cannot be com- All the discussed approaches, together with their reported
puted, since it does not return a ranked list. However, if an performance can be found in Table 1.
algorithm produces rankings of aspects, for instance, based
on how much these are discussed in a review, nDCG is an 3.1.1 Frequency-Based Methods
effective way of summarizing the quality of these rankings. It has been observed that in reviews, a limited set of words
When dealing with generative probabilistic models, like is used much more often than the rest of the vocabulary.
topic models, where the full joint probability distribution These frequent words (usually only single nouns and com-
can be generated, it is also possible to use the KullbackLei- pound nouns are considered) are likely to be aspects. This
bler divergence [23], or KL-divergence for short. This meas- straightforward method turns out to be quite powerful, a
ures the difference between two probability distributions, fact demonstrated by the significant number of approaches
where one distribution is the one generated by the model using this method for aspect detection. Clear shortcomings
and the other is the distribution that represents the true are the fact that not all frequent nouns are actually referring
data. How the KL-divergence is computed depends on the to aspects. Some nouns in consumer reviews, such as
particular study, the goal is to find reviews that are most

comprehensive with respect to a certain aspect, so that senti-
ment analysis can be performed on reviews that have a thor-
ough discussion on that aspect.
Only explicit aspects are detected in [25], but [27]
employs association rule mining to find implicit aspects as
well. By restricting sentiment words to appear as rule ante-
cedents only, and aspect words to appear as rule conse-
quents, the generated association rules can now be used to
find aspects based on already found sentiment words. Last,
a major difference between the two methods is that,
while [25] generates the frequent item sets from a transac-
tion file, [27] generates its rules from the co-occurrence
matrix of the bipartite of sentiment words and explicit
aspects. One should note that these explicit features must
therefore first be detected, before implicit features can
be found using this method. The two methods can thus be
thought of as complementary.
Similar to [25] is [28], where a supervised form of
association rule mining is used to detect aspects. Instead
of the full review text, [28] targets pros and cons that are
separately specified on some Web sites. Since pros and
cons are known to be rich in aspect descriptions, this
task is allegedly simpler than detecting aspects in the
Fig. 1. Taxonomy for aspect-level sentiment analysis approaches using full text, and the obtained results are obviously better
the main characteristic of the proposed algorithm. than those of [25].
A major shortcoming of most frequency-based methods
‘dollar’ or ‘bucks’, are just frequently used. On the other is the fact that nouns and noun phrases that naturally have
hand, aspects that are not frequently mentioned, like very a high frequency are mistakenly seen as aspects. Red Opal,
specific aspects that most people do not discuss, will be a system introduced in [29], aims to address this issue by
missed by frequency-based methods. To offset these prob- comparing the frequency of a prospective aspect with base-
lems, frequency-based methods can be supplemented with line statistics gathered from a corpus of 100 million words
a set of rules to account for some of these issues. However, of spoken and written conversational English. To be consid-
these manually crafted rules often come with parameters ered as an aspect, a word or bigram has to appear more
which have to be tuned. often in a review than is likely given its baseline frequency.
The most well-known approach featuring a frequency- This improves feature extraction and reduces the number of
based method for aspect detection is [25]. The same authors non-features because these non-features are usually often
describe the matching sentiment analysis method in [26], occurring words that would be above a fixed threshold but
which will be explained in Section 3.2.1. The aspect detec- are filtered out when using baseline statistics. As part of the
tion method described in [25] only considers single nouns evaluation, a small scale survey was conducted to assess the
and compound nouns as possible aspects. First, the fre- actual helpfulness of the extracted features, which sug-
quency of each combination of nouns is retrieved. For this, gested that users prefer bigram features over unigram fea-
the nouns do not have to be next to each other, they should tures and specific features over more generic features. The
just appear in the same sentence. This helps to find aspects same concept of baseline statistics is used in [30], where it
like ‘screen size’ when it is phrased as ‘size of the screen’. used to filter the list of high-frequency noun phrases. Addi-
The noun combinations that occur in at least 1 percent of tionally, a part-of-speech pattern filter is also applied, such
the sentences are considered as aspects. Two rules are used that every aspect needs to be followed by an adjective (note
to prune the result in order to lower the number of false that this filter is designed to work with Chinese texts).
positives. The first aims to remove combinations where the
nouns never appear closely together, while the second aims 3.1.2 Syntax-Based Methods
to remove single-word aspects which appear only as part of Instead of focusing on frequencies to find aspects, syntax-
a multi-word aspect. When a sentence does not contain a based methods find aspects by means of the syntactical rela-
frequent aspect but does contain one or more sentiment tions they are in. A very simple relation is the adjectival modi-
words, as indicated by the sentiment analysis method used fier relation between a sentiment word and an aspect, as in
in conjunction with the current approach, then the noun or ‘fantastic food’, where ‘fantastic’ is an adjective modifying the
compound noun nearest to the sentiment word is extracted aspect ‘food’. A strong point of syntax-based methods is that
as an infrequent aspect. This process, while sensitive to gen- low-frequency aspects can be found. However, to get good
erating false positives, is able to increase the recall of the coverage, many syntactical relations need to be described.
method. An improvement to this process can be found To mitigate the low recall problem, a generalization step
in [16], where grammatical dependencies are employed to for syntactic patterns using a tree kernel function is pro-
find infrequent aspects instead of word distance. In this posed in [31]. Given a labeled data set, the syntactic patterns
TABLE 1
Approaches for Aspect Detection
domain evaluation task performance

frequency-based
Hu & Liu (2004) [25] product reviews aspect detection precision: 72%
recall: 80%
Long et al. (2010) [16] hotel reviews comprehensive review selection F1 : 70.6% - 93.3%
Hai et al. (2011) [27] cell phone reviews implicit aspect detection precision: 76.29%
recall: 72.71%
Liu et al. (2005) [28] product reviews aspect detection (pros/cons) precision: 88.9% / 79.1%
recall: 90.2% / 82.4%
Scaffidi et al. (2007) [29] product reviews aspect detection precision: 85%-90%
complexity: O(n)
Li et al. (2009) [30] product reviews aspect detection F1 : 74.07%
syntax-based
Zhao et al. (2010) [31] car, camera, and phone aspect detection precision: 73%, 66%, and 76%
reviews [32] recall: 63%, 67%, and 68%
Qiu et al. (2009) [33] product reviews aspect detection precision: 88%
(data from [26]) recall: 83%
Zhang et al (2010) [34] cars & mattress reviews aspect detection precision: 78% / 77%
recall: 56% / 64%
phone & LCD forum posts precision: 68% / 66%
recall: 44% / 55%
supervised machine learning
Jakob & Gurevych (2010) [35] data from [36][37][38] opinion target extraction precision: 61.4% - 74.9%
recall: 41.4% - 66.1%
unsupervised machine learning
Titov & McDonald (2008) [39] product, hotel, and aspect detection no quantitative evaluation
restaurant reviews
Lu et al. (2011) [20] hotel [40] & restaurant [41] sentence labeling accuracy: 79.4%
reviews F1 : 71.4% - 85.6%
Lakkaraju et al. (2011) [42] product reviews aspect detection (2/5-class) precision: 83.33% / 82.52%
recall: 81.12% / 80.72%
Zhan & Li (2011) [43] hotel [44] & restaurant [41] not available (graphs only)
reviews
Wang et al. (2011) [21] hotel [40] & mp3 player aspect rating prediction MSE: 1.234
reviews nDCG: 0.901
Moghaddam & Ester (2013) [45] product reviews [46][40][47] item categorization (cold) accuracy: 79%-86%
item categorization (default) accuracy: 95%-97%
Hai et al. (2014) [48] product reviews aspect detection no quantitative analysis
hybrid
Popescu & Etzioni (2005) [49] product reviews aspect detection precision: 87.84%
(data from [26]) recall: 77.6%
Yu et al. (2011) [50] product reviews aspect detection F1 : 70.6% - 76.0%
Raju et al. (2009) [51] product reviews aspect detection precision: 92.4%
(incl. partial matches) recall: 62.7%
Blair-Goldensohn et al. (2008) [52] restaurant & hotel reviews static aspect detection precision: 70.5%–94.6%
recall: 47.1%–82.2%
of all the annotated aspects are extracted. Then, for the and with additional known aspect words, more sentiment
unseen data, syntax trees of all sentences are obtained. words can be found, etc. The algorithm continues this pro-
Instead of directly trying to find an exact match between the cess until no more extra sentiment words or targets can be
aspect pattern and the syntax tree, both are split into several found. To find sentiment words based on known aspect
different substructures. Then the similarity between the pat- words, and the other way around, a set of rules based on
tern and a sentence can be measured as the number of grammatical relations from the employed dependency
matching substructures. The common convolution tree ker- parser, is constructed. In this way, more sentiment-aspect
nel is used to compute similarity scores for each pair of sub- combinations can be found and classified in a given text than
structures, with a threshold determining whether a pair is a with previous approaches. A big advantage of this method is
match or not. that it only needs a small seed set to work properly compared
In [33] (an extended version was published later in [53]), to the large corpus most trained classifiers require.
and its extension [34], aspect detection and sentiment lexicon
expansion are seen as interrelated problems for which a dou- 3.1.3 Supervised Machine Learning Methods
ble propagation algorithm is proposed, featuring parallel There are not many supervised machine learning methods
sentiment word expansion and aspect detection. With each for aspect detection that are purely machine learning meth-
extra known sentiment word, extra aspects can be found, ods. Since the power of supervised approaches lies in the
features that are used, feature construction often consists of of topic, MG-LDA models topics on two levels: global and
other methods (e.g., frequency-based methods) in order to local. The idea is to have a fixed set of global topics and a
generate more salient features that generalize better than dynamic set of local topics, from which the document is
simple bag-of-words or part-of-speech features. sampled. To find the local topics, a document is modeled as
In [35], aspect detection is cast as a labeling problem, a set of sliding windows where each window covers a cer-
which is solved by using a linear chain Conditional Random tain number of adjacent sentences. These windows overlap,
Field (CRF), common in natural language processing, to causing one particular word to be allowed to be sampled
process a whole sequence (e.g., a sentence) of words. This from multiple windows. This also solves the problem of too
automatically takes the context of a word into account when few co-occurrences: the bags of words are not too small in
assigning it a label. Multiple features are used when deter- this case. The set of global topics act in a similar way to the
mining the best label for a word, including the actual word, background topic of [58] in Section 3.3.3, increasing the
its part-of-speech tag, whether a direct dependency relation accuracy of the local topics that should represent the sought
exists between this word and a sentiment expression, aspects.
whether this word is in the noun phrase that is closest to a A similar notion is demonstrated in [20] where a distinc-
sentiment expression, and whether this word is in a sen- tion is made between global and local topics. Instead of the
tence that actually has a sentiment expression. The ground- more complex construction of sliding windows, LDA is sim-
truth from a subset of the used data sets [36], [37], [38] is ply performed on the sentence level, with the exception that
used to train the model. Four domains are covered in these the document topics are modeled in conjunction with the
review data sets: movies, web-services, cars, and cameras. sentence topics. In this way, the sentence topics can model
the aspects with all non-relevant words modeled as a docu-
3.1.4 Unsupervised Machine Learning ment topic.
In general, this class of models operates unsupervised, While finding both global and local topics is useful to get
requiring only labeled data to test and validate the model. coherent local topics that actually describe aspects, a differ-
Nevertheless, a large amount of data is generally needed to ent option is shown in [42], where LDA is combined with a
successfully train these type of models. Most of the Hidden Markov Model (HMM) to distinguish between
approaches in this section use LDA, which is a topic model aspect-words and background words. This distinction is
proposed in [54]. Each document is viewed as a mixture of drawn by incorporating syntactic dependencies between
topics that could have generated that document. It is similar aspect and sentiment. The same idea can be found in [59], a
to probabilistic Latent Semantic Analysis [55] but it utilizes CRF model discussed in Section 3.3.2, although in [42], it is
a Dirichlet prior for the topic distribution instead of a uni- employed in an unsupervised, generative manner.
form topic distribution. One of the main drawbacks of LDA Another way of adding syntactic dependencies is shown
is that the generated topics are unlabeled, preventing a in [43], where the topic model employs two vocabularies to
direct correspondence between topics and specific aspects pick words from. One vocabulary holds the nouns, while
or entities. And while sometimes a quick glance at the the other holds all the words that are dependent on the
words associated with a topic is enough to deduce which nouns (e.g., adjectives, adjectival verbs, etc.). These pairs
aspect it is referring to, not all topics are that clear cut. are extracted from the dependency tree as generated by a
Because LDA utilizes a bag of words approach when parser.
modeling documents and topics, the contents of a topic (i.e., In [21], the issue of coverage (cf. [16] in Section 3.1.1) is
the words associated with it) are not required to be semanti- addressed by estimating the emphasis placed on each
cally related: it might be impossible to characterize a topic, aspect by the reviewer. This is done by modeling the overall
making it much less suitable for interpretation. rating of the product as the weighted sum of the aspect rat-
Since LDA was designed to operate on the document ings. The inferred weights for the aspect can then be used as
level, employing it for the much finer-grained aspect-level a measure of emphasis. However, where [16] returns the
sentiment analysis is not straightforward. Some critical reviews which describe a certain aspect most comprehen-
issues that arise when implementing an LDA-based method sively based on how much the reviewer is writing about
for aspect-level sentiment analysis have been discussed it, [21] determines the emphasis on a certain aspect in a
in [39]. The main argument is that since LDA uses a bag of review by its influence on the overall rating. This is an
words approach on the document level, it will discover important difference, as the former will show the user
topics on the document level as well. This is good when the reviews that talk much about a certain aspect, even when it
goal is to find the document topic (i.e., this could be the is of no consequence to the overall rating, while the latter
entity, or some category), but not as useful when one is can output a list of reviews where a certain aspect greatly
looking for aspects. The topics that LDA returns are simply influences the rating, even when it is barely discussed.
too global in scope to catch the more locally defined aspects. Since LDA models are trained on a per-item basis, a sig-
One way to counter this would be to apply LDA on the sen- nificant number of data points is needed to infer reliable
tence level, but the authors argue that this would be prob- distributions. However, many products on the web have
lematic since the bag of words would be too small, leading only a limited number of reviews. Continuing the work on
to improper behavior of the LDA model (cf. [56]). Although aspect-level sentiment analysis and LDA models, a method
some solutions exist to this problem in the form of topic to deal with this so-called cold start problem is proposed
transitions [57], the authors deem those computationally in [45]. In addition to modeling aspects and sentiment val-
too expensive. Instead an extension to LDA is proposed ues for products, it also incorporates product categories
called Multi-grain LDA (MG-LDA). Besides the global type and the reviewers into the model. By grouping similar
TABLE 2
Approaches for Sentiment Analysis
domain classes evaluation task performance

dictionary-based
Hu & Liu (2004) [26] product reviews binary sentiment classification accuracy: 84.2%
Moghaddam & Ester (2010) [18] product reviews 5-star rating sentiment classification Ranking Loss: 0.49
Zhu et al. (2009) [62] restaurant reviews ternary aspect-sentiment extraction precision: 75.5%
Blair-Goldensohn et al. (2008) [52] restaurant & hotel binary sentiment classification precision: 68.0% / 77.2%
reviews (pos/neg) recall: 90.7% / 86.3%
Yu et al. (2011) [50] product reviews binary sentiment classification F1 : 71.7%–85.1%
Choi & Cardie (2008) [63] MPQA corpus [64] binary sentiment classification accuracy: 90.70%
Lu et al. (2011) [20] restaurant [41] & 5-star rating sentiment classification LAE: 0.560–0.790
hotel [40] reviews
Titov & McDonald (2008) [39] product, hotel, and binary sentiment classification Ranking Loss: 0.669
restaurant reviews
Popescu & Etzioni (2005) [49] product reviews ternary sentiment extraction precision: 76.68%
(data from [26]) recall: 77.44%
sentiment classification precision: 84.8%
recall: 89.28%
products into categories, aspects are associated to product same aspect, and [50] which targets pros and cons to find
categories instead of the individual products. Then instead aspects using frequent nouns and noun phrases, feeding
of a distribution over all aspects, for each product, only a those into an SVM classifier to make the final decision
distribution over the aspects in the product category will whether it is an aspect or not.
have to be derived from the data. Furthermore, this distri- Contrary to the above, a form of parallel hybridization
bution is influenced by the model of the reviewer, which is can be found in [52], where a MaxEnt classifier is used to
a distribution over the aspects this reviewer comments on find the frequent aspects, for which there is ample data, and
mostly, and with what rating. Hence, a more accurate pre- a rule-based method that uses frequency information and
diction can be made for products with little or no data. syntactic patterns to find the less frequent ones. In this way,
In [48], a supervised joint aspect and sentiment model is available data is used to drive aspect detection, with a rule-
proposed to determine the helpfulness of reviews on aspect based method that acts as back-up for cases where there is
level. The proposed model is a supervised probabilistic not enough data available.
graphical model, similar to supervised Latent Dirichlet Allo-
cation. Just like similar LDA models in Section 3.3.3, this 3.2 Sentiment Analysis
model separately and simultaneously models both aspect The second part of aspect-level sentiment analysis is
and sentiment words, to improve the quality of the found the actual sentiment analysis, which is the task of assigning
aspect topics. While the model is unsupervised with respect a sentiment score to each aspect. The first proposed app-
to aspect detection, it uses the helpfulness ratings provided roaches generally use a dictionary to find the sentiment
for each review as supervision. Unfortunately, because the scores for the individual words followed by an aggregation
focus of this work is on the helpfulness prediction, the aspect and/or association step to assign the sentiment of the sur-
detection part is not quantitatively evaluated. rounding words to the aspect itself. The later approaches
are all based on machine learning, either supervised or
3.1.5 Hybrid Methods unsupervised. All the approaches that are discussed in this
Every classification system has its exceptions, and the classi- section can be found in Table 2, where their reported perfor-
fication system used in this survey is no different. This sec- mance is also shown.
tion showcases work that falls in more than one of the
above categories. When two types of methods are used, 3.2.1 Dictionary-Based
they are called hybrid methods and they come in two fla- In [26], a sentiment dictionary is obtained by propagating
vors: serial hybridization, where the output of one phase the known sentiment of a few seed words through the
(e.g., frequency information) forms the input for the next WordNet synonym/antonym graph. Only adjectives are
phase (e.g., a classifier or clustering algorithm), and parallel considered as sentiment words here. Each adjective in a sen-
hybridization, where two or more methods are used to find tence will be assigned a sentiment class (i.e., positive or neg-
complementary sets of aspects. ative) from the generated sentiment dictionary. When a
Serial hybridization can be found in [49], where Point- negation word appears within a word distance of five
wise Mutual Information [60] is used to find possible words starting from the sentiment word, its polarity is
aspects, which are then fed into a Na€ıve Bayes classifier to flipped. Then, a sentiment class is determined for each sen-
output a set of explicit aspects. Other examples of serial tence using majority voting. Hence, the same sentiment
hybridization include [51], where the Dice similarity mea- class is assigned to each aspect within that sentence. How-
sure [61] is used to cluster noun phrases that are about the ever, when the number of positive and negative words is
the same, a different procedure is used. In that case, each sentiment analysis, the expressions (i.e., short phrases
sentiment bearing adjective is associated with the closest expressing one sentiment on one aspect or entity) are given
aspect within the sentence, in terms of word distance. Then for this approach. The proposed method is a binary senti-
majority voting is used among all sentiment words that are ment classifier based on an SVM. But while basic SVM
associated with the same aspect. In this case, having multi- approaches model the text using a simple bag-of-words
ple polarities within the same sentence is a possibility. model, the authors argue that such a model is too simple to
In contrast to other dictionary methods, [18] uses a set of represent an expression effectively. To solve this, the
adjectives provided by Epinions.com, where each adjective authors used the principle of compositional semantics,
is mapped to a certain star rating. The unknown sentiment which states that the meaning of an expression is a function
word, if it is not in this set, is then located in the WordNet of the meaning of its parts and the syntactic rules by which
synonymy graph. Employing a breadth-first search on the these are combined. Applying this principle, a two-step pro-
WordNet synonymy graph starting at the adjective with the cess is proposed in which the polarities of the parts are
unknown sentiment with a maximum depth of 5, the two determined first, and then these polarities are combined
closest adjectives which appear in the rated list of Epinions. bottom-up to form the polarity of the expression as a whole.
com are found. Then, using a distance-weighted nearest- However, instead of using a manually-defined rule set to
neighbor algorithm, it assigns the weighted average of the combine the various parts and their polarities, a learning
ratings of the two nearest neighbors as the estimated rating algorithm is employed to cope with the irregularities and
to the current adjective. complexities of natural language.
When performing sentiment analysis, some approaches, The learning algorithm of the previous approach consists
like the previously discussed [26], compute one sentiment of a compositional inference model using rules incorpo-
score for each sentence and then associate that sentiment rated into the SVM update method and a set of hidden vari-
with all the aspects that are mentioned in that sentence. ables to encode words being positive, negative, negator, or
However, this makes it impossible to properly deal with none of these types. The negator class includes both func-
sentences that contain aspects with varying sentiment. A tion-negators and content-negators. While function-nega-
solution is proposed in [62], where all sentences are seg- tors are only a small set of words like “not” and “never”,
mented with each segment being assigned to one of the content-negators are words like “eliminated” and “solve”,
aspects found in the sentence. Then, using a sentiment lexi- which also reverse the polarity of their surroundings. As
con, the polarity of each segment is determined and an machine learning approaches allow many features, they
aspect-polarity pair is generated that reflects the overall combine multiple lexicons, adding sentiment information
polarity for this aspect within a particular review. from both the General Inquirer lexicon as well as from the
polarity lexicon from Wilson et al. [65]. With some simple
3.2.2 Supervised Machine Learning heuristics and less sophisticated versions of the proposed
While the methods in the previous section all use a dictio- method as a baseline, the above solution is evaluated on the
nary as the main source for information, supervised MPQA corpus [64]. Experiments show that using composi-
machine learning methods usually learn many of their tional inference is more beneficial than using a learning
parameters from the data. However, since it is relatively approach, but incorporating both clearly results in the high-
easy to incorporate lexicon information as features into a est accuracy.
supervised classifier, many of them employ one or more Instead of a binary sentiment classifier, as is used in the
sentiment lexicons. In [52], the raw score from the senti- above two methods [50], [63], a Support Vector Regression
ment lexicon and some derivative measures (e.g., a mea- model is employed in [20] to find the sentiment score for an
sure called purity that reflects the fraction of positive to aspect. This allows the sentiment score to be modeled as a
negative sentiment, thus showing whether sentiment is real number in the zero to five interval, which is reminiscent
conflicted or uniform) are used as features for a MaxEnt of the widely used discrete five-star rating system.
classifier. When available, the overall star rating of the In [39], a perceptron-based online learning method called
review is used as an additional signal to find the sentiment PRanking [17], is used to perform the sentiment analysis,
of each aspect (cf. [29]). given the topic clusters that have been detected by an LDA-
In [50], the short descriptions in the ‘pros’ and ‘cons’ sec- like model. The input consists of unigrams, bigrams, and
tion of a review are mined for sentiment terms. These senti- frequent trigrams, plus binary features that describe the
ment terms are found using a dictionary [65], with the LDA clusters. For each sentence, a feature vector x is con-
location (i.e., either the ‘pros’ or ‘cons’ section) denoting structed consisting of binary features that signal the absence
their sentiment in that specific context. This information is or presence of a certain word-topic-probability combination,
then used to train a Support Vector Machine (SVM) that is with probabilities being grouped into buckets (e.g., ‘steak’,
able to classify sentiment terms as positive or negative. ‘food’, and ‘0.3-0.4’). The PRanking algorithm then takes the
Given a free text review, for each aspect, the expression that inner product of this vector (x) and a vector of learned
contains its sentiment is found, which should be within a weights (w) to arrive at a number, which is checked against
distance of five steps in the parse tree. Then, the SVM is a set of boundary values that divide the range a score can
used to determine the sentiment for that aspect. have into five separate ranges such that each range corre-
While not exactly an aspect-level sentiment analysis sponds to a sentiment value (e.g., one to five). In the training
method, [63] is still interesting as it performs sentiment phase, each misclassified instance will trigger an update
analysis on very short expressions, which can be associated where both the weights and the boundary values are
to aspects (cf. [62]). Since this method focuses solely on changed. For example, if an instance is given a sentiment
value which is too low, it will both increase weights and sentiment originates at a certain word, and is transferred by
decrease threshold values. other words (e.g., verbs) to the aspect word. A good exam-
ple would be the sentence “The automatic zoom prevents
3.2.3 Unsupervised Machine Learning blurry pictures”, where negative sentiment originates at
‘blurry’ and is reversed by the verb ‘prevents’, transferring
Another option is the use of an unsupervised machine
the now reversed sentiment to the aspect ‘automatic zoom’.
learning method. In [49], each explicit aspect is used to find
Because the described relations are very specific, the result
a potential sentiment phrase by looking for an sentiment
is a typical high-precision low-recall approach that, there-
phrase in its vicinity, where vicinity is measured using the
fore, works best on large volumes of data.
parsed syntactic dependencies. Each potential sentiment
While most of the previously described approaches focus
phrase is then examined, and only the ones that show a pos-
on product reviews, in [36], an aspect-level sentiment analy-
itive or negative sentiment are retained. The semantic orien-
sis approach is proposed for the movie review domain. This
tation, or polarity, is determined using an unsupervised
approach employs a lexicon for both the aspect detection
technique from the computer vision area called relaxation
and the sentiment analysis part. While the latter is common
labeling [66]. The task is to assign a polarity label to each
practice, the former is more of an exception. The intuition
sentiment phrase, while adhering to a set of constraints.
behind this is that a lexicon can capture all the domain spe-
These constraints arise for example from conjunctions and
cific cues for aspects. For example, this aspect lexicon
disjunctions [67]. The final output is a set of sentiment
includes a list of names of people involved in the movie that
phrases with their most likely polarity label, be it positive or
is under review. Dependency patterns that link the aspect
negative.
and the sentiment word are used to find aspect-sentiment
pairs. However, the described relations only cover the most
3.3 Joint Aspect Detection and Sentiment Analysis
frequent relations, so less frequent ones are missed.
Methods
All approaches discussed until now either have a method or 3.3.2 Supervised Machine Learning
model dedicated to either aspect detection or sentiment
An evident problem is that in general, machine learning
analysis. Since the two problems are not independent, mul-
methods excel in classifying instances in a given number of
tiple approaches have been proposed that both extract the
classes. Since the number of possible aspects and the differ-
aspects and determine their sentiment. The main advantage
ent words that can represent an aspect is practically
is that combining these two tasks allows one to use senti-
unbounded, a default classification algorithm cannot be
ment information to find aspects and aspects to find senti-
applied in a straightforward manner. In [69], both aspect
ment information. Some methods explicitly model this
detection and sentiment classification are cast as a binary
synergy, while others use it in a more implicit way. We dis-
classification problem. First, using a lexicon, all prospective
tinguish between syntax-based, supervised machine learn-
aspect and sentiment words are tagged. Then, the problem
ing, unsupervised machine learning, and hybrid methods.
of which aspect belongs to which sentiment word is solved
In Table 3, all approaches discussed in this section are
using a binary classification tournament model. Each round
shown, together with their reported performance.
of the tournament, two aspects are compared and the one
that best matches the sentiment word proceeds to the next
3.3.1 Syntax-Based Methods round. In this way, no direct relation between the aspect
Given the observation that it is much easier to find senti- and sentiment is needed. The drawback is that no additional
ment words than aspect words, syntax-based methods are aspects can be found by exploiting this relation, but an
generally designed to first detect sentiment words, and then advantage is that this method can effectively deal with ellip-
by using the grammatical relation between a sentiment sis, a linguistic phenomenon where the aspect is not linked
word and the aspect it is about, to find the actual aspect. A to the sentiment because it is either implicit or referred to
major advantage of this method is that low-frequency using a co-reference. According to the authors, as much as
aspects can also be found, as the key factor here is the gram- 30 percent of the sentences feature ellipsis.
matical relation between the aspect and its sentiment word To address the issue of long-range dependencies, [59]
(s). This is also its greatest shortcoming, since patterns have encodes both syntactic dependencies between words and
to be defined that describe the set of possible relations conjunctions between words into a CRF model. By introduc-
between an aspect and a sentiment word. Unfortunately, a ing more dependencies between the hidden nodes in the
very specific set of relations will miss a lot of aspects leading CRF model, words that are not directly adjacent in the linear
to high precision, but low recall, while a more general set of chain CRF, can now influence each other. Sentiment values
relations will yield more aspects but also many more words and their targets are linked simply by minimizing the word
that are not aspects, leading to low precision, but high distance and are extracted simultaneously. The model is
recall. Additionally, the extraction of grammatical relations then used to generate a list of sentiment-entity pairs as a
(usually) requires parsing the text, which is both slow and summary of the set of texts, which are product and movie
usually not error-free. reviews in this case, grouped as positive or negative.
An early syntax-based method is [68], where a shallow A strong limitation of the previous work is that each sen-
parser and an extensive set of rules is used to detect aspects tence is assumed to have only one aspect. In [19], a CRF
and sentiment. The lexicon describes not just the sentiment model is proposed that is able to deal with multiple aspects
for a given word, but also gives transfer patterns stating per sentence. Furthermore, when multiple aspects are men-
which words are affected by the sentiment. In this way, tioned in the same sentence, it is likely that they influence
TABLE 3
Approaches for Joint Aspect Detection and Sentiment Analysis
domain classes evaluation task performance

syntax-based
Nasukawa & Yi (2003) [68] general & binary combined (general/camera) precision: 94.3% / 94.5%
camera reviews recall: 28.6% / 24%
Zhuang et al. (2006) [36] movie reviews binary aspect-sentiment pair mining F1 : 52.9%
Kobayashi et al. (2006) [69] product reviews binary sentiment extraction precision: 67.7%
recall: 50.7%
aspect-sentiment pair mining precision: 76.6%
recall: 75.1%
sentiment classification precision: 82.2%
recall: 66.2%
Li et al. (2010) [59] product & binary combined (movies/products) precision: 82.6% / 86.6%
movie reviews recall: 76.2% / 69.3%
Marcheggiani et al. (2014) [19] hotel reviews ternary aspect detection F1 : 48.5%
(annotated subset sentiment classification MAEM : 0.5
of [40])
Jin et al. (2009) [70] camera reviews binary aspect extraction F1 : 78.8% - 82.7%
sentiment sentence extraction F1 : 84.81% - 88.52%
sentiment classification F1 : 70.59% - 77.15%
Zirn et al. (2011) [71] product reviews binary combined (pos/neg) precision: 66.38%/72.02%
recall: 72.94%/65.34%
Mei et al. (2007) [58] weblogs binary sentiment model (pos / neg) KL-divergence: 21 / 19
Titov & McDonald (2008) [72] hotel reviews 5-star rating combined avg. prec.: 74.5% - 87.6%
Moghaddam & Ester (2011) [73] product reviews 5-star rating aspect detection Rand Index: 0.83
sentiment classification Rand Index: 0.73
Jo & Oh (2011) [74] restaurant & binary sentiment classification accuracy: 84% - 86%
product reviews
Wang et al. (2011) [21] mp3 player & 5-star rating aspect rating prediction MSE: 1.234
hotel [40] reviews nDCG: 0.901
Sauper & Barzilay (2013) [75] restaurant reviews binary aspect cluster prediction precision: 74.3%
recall: 86.3%
sentiment classification accuracy: 82.5%
medical summaries aspect cluster prediction precision: 89.1%
recall: 93.4%
hybrid machine learning
Zhao et al. (2010) [76] restaurant [77] & ternary + aspect identification avg. F1 : 70.5%
hotel [44] reviews ‘conflicted’ sentiment identification precision @ 5: 82.5%
precision @ 10: 70.0%
Mukherjee & Liu (2012) [78] product reviews binary sentiment classification precision: 78%
recall: 73%
each other via certain discourse elements, which has an likely to appear in the training corpus, their values have to
effect on the sentiment score for each aspect. Therefore, the be guessed instead of counted. Furthermore, computational
model explicitly incorporates the relations between aspect- complexity increases exponentially when using higher n-
specific sentiments within one sentence. Last, the overall grams. This is the reason that in [70] only unigrams are
score of the review, which is often supplied by the users used. While this prevents the above mentioned problems, it
themselves, is taken into account as well. To do that, a hier- also deprives the model of any context-sensitivity. To
archical model is proposed that simultaneously predicts the account for it, the part-of-speech of a word is also modeled,
overall rating and the aspect ratings. This new model has an and in a way that makes it dependent on both the previous
additional variable for the overall sentiment score, and pair- and the next part-of-speech tag, thereby introducing some
wise factors that model the influence between the overall form of context-awareness. A bootstrapping approach is
sentiment score and each aspect’s sentiment score. A ran- proposed to make the model self-learn a lot of training
dom subset of 369 hotel reviews from the TripAdvisor data examples, mitigating the dependence on labeled training
set [40] is manually annotated for aspects to train and test data to some extent. The additional examples learned in this
the model. way proved to be beneficial when evaluating this approach,
An example of a method based on a lexicalized HMM improving F1 -score for both aspect detection and sentiment
is [70]. With HMM’s, the context of a word can easily be classification.
taken into consideration by using n-grams. However, sim- A Markov logic chain is employed as the main learning
ply using higher n-grams (e.g., bigrams, trigrams, etc.) method in [71]. Within the Markov chain, multiple lexicons
poses some problems. Because a lot of these n-grams are not are incorporated, as well as discourse relations. The latter
are acquired using the HILDA [79] discourse parser which by the reviewer. Last, when put to the test against a MaxEnt
returns a coarse-grained set of discourse segments as classifier, a supervised method, the proposed method per-
defined in [80], which are based on the Rhetorical Structure formed only slightly worse.
Theory [81]. Since sentiment classification is done on the The main improvement of [73], compared to previous
level of discourse segments, it is assumed each segment topic models is that the sentiment class of an aspect is
only expresses one sentiment, which is almost always the explicitly linked to the aspect itself. This makes the senti-
case. Entities, however, are not extracted in this method. ment analysis more context-aware: in this way, a word that
The proposed classification in [71] is binary, which, accord- is positive for one aspect can be negative for another. The
ing to the authors, results in problems with some segments latter is generally true for models that couple the sentiment
that have no clear polarity. Their findings concerning the nodes to the aspect nodes in the graphical model, and this
use of discourse elements were that using general structures same idea is demonstrated in both [21] and [74].
that can be found in the text systematically improves the In [74], aspects are detected as topics by constraining the
results. The fact that a certain discourse relation describes a model to only one aspect-sentiment combination per sen-
contrasting relation was encoded specifically, as it was tence. By assuming that each sentence is about only one
expected to correlate with the reversing of polarity of the aspect and conveys only one sentiment, the model is able to
various segments it connects to. However, this correlation find meaningful topics. This is a relatively simple solution
turned out to be not as strong as was expected beforehand. compared to for example the sliding windows technique [39]
This means, according to the authors, that the classical dis- or injecting syntactic knowledge into the topic model [42].
course relations might not be the best choice to represent Evaluation of the constructed topics revealed another inter-
the general structure of the text when performing sentiment esting fact: in one particular case there were three topics
analysis. Nevertheless, the same authors believe that focus- that conveyed negative sentiment for the same aspect.
ing on cue words to find discourse connectives in order to While this may not seem ideal at first (i.e., one unique topic
predict polarity reversals might still be worth investigating. per aspect-sentiment combination is more logical), close
inspection revealed that the three topics revealed three dis-
3.3.3 Unsupervised Machine Learning tinct reasons why the reviewers were negative about that
The class of unsupervised machine learning approaches aspect (i.e., the screen was too small, the screen was too
may be especially interesting, since these models are able to reflective, and the screen was easily covered with finger-
perform both aspect detection and sentiment analysis with- prints or dirt). This level of detail goes further than regular
out the use of labeled training data. The first topic mixture aspect-level sentiment analysis, providing not only the sen-
model [58] is based on probabilistic Latent Semantic Index- timent of the reviewers, but also the arguments and reasons
ing (PLSI) [55], a model similar to LDA, that is however why that sentiment is associated to that aspect.
more prone to overfitting and is not as statistically sound as In [75], a probabilistic model is presented that performs
LDA. In [58], not only topics that correspond to aspects are joint aspect detection and sentiment analysis for the restau-
modeled, but also a topic for all background words, causing rant reviews domain and aspect detection alone for the medi-
the retrieved topics to better correspond to the actual cal domain. For the restaurant domain, it models the aspects
aspects. Furthermore, the topics that correspond to aspects in such a way that they are dependent on the entity (i.e., the
are again mixtures of sentiment topics. In this way, the end restaurant), instead of having a global word distribution for
result is that both aspects and their sentiment are deter- aspects like previous models. This allows the model to have
mined simultaneously with the same model. Leveraging a different aspects for different kind of restaurants. For exam-
sentiment lexicon to better estimate the sentiment priors ple, a steak house has different aspects than an Italian ice
increases the accuracy of the sentiment classification. cream place and while the sentiment word distribution is
In [72], Titov and McDonald extend the model they pro- global (i.e., the same sentiment words are used for all types
pose in [39] by including sentiment analysis for the found of restaurants), a separate distribution that is different for
aspects. An additional observed variable is now added to the each restaurant is used to model the link between aspects
model, namely the aspect ratings provided by the author of and sentiment words. Furthermore, an HMM-based transi-
the review. With the assumption that the text is predictive of tion function is employed to model the fact that aspects and
the rating provided by the author, this information can be sentiment words often appear in a certain order. Last, a back-
leveraged to improve the predictions of the model. A strong ground word distribution is determined on a global level to
point is that the model does not rely on this information get rid of words that are irrelevant. A variant of the model is
being present, but when present, it is used to improve the used to process dictated patient summaries. Since the set of
model’s predictions. Besides utilizing the available aspect relevant aspects is expected to be shared across all summa-
ratings, the model can extract other aspects from the text as ries, the aspects are modeled as global word distribution.
well, and assign a sentiment score to them. While at least a The previous method operates in an unsupervised fashion,
certain amount of provided aspect ratings is needed for this requiring only a set of sentiment seed words to bias the senti-
model to truly benefit from them, perhaps the biggest advan- ment topics into a specific polarity. Furthermore, the pro-
tage is that the found aspects can be linked to actual aspects posed model admits an efficient inference procedure.
in the text. As mentioned earlier, generative models produce
unlabeled clusters that are not associated with any particular 3.3.4 Hybrid Machine Learning
aspect. This problem is solved by incorporating these aspect While LDA is designed to work with plain text, the above
ratings into the LDA model, providing a link between the methods have shown that the right preprocessing can sig-
words in the document and the concrete aspects as annotated nificantly improve the results of the generative model. This
can be extended a bit further by already optimizing some of First, various categories of comparative sentences are
the input for the topic model by using a supervised discrim- defined and, for each category, it is shown how to process
inative method. Both methods presented in this section fea- them. When possible, a comparator is reduced to its base
ture a MaxEnt classifier that optimizes some of the input for form, and its sentiment is found using the sentiment word
the LDA model. list generated from WordNet [26]. The comparators whose
The first method [76] uses a MaxEnt component to enrich polarity cannot be determined in this way are labeled as
the LDA model with part-of-speech information. In this context-dependent and are processed differently. For that,
way, the generative model can better distinguish between information in the pros and cons section is leveraged to
sentiment words, aspect words, and background words. compute an asymmetric version of the Pointwise Mutual
The MaxEnt classifier is trained using a relatively small set Information association score between the comparative
of labeled training data, and the learned weights are now words and the words in the pros and cons. A set of rules
input for a hidden node in the topic model. This is done then essentially combines the information about the entities,
before training the LDA model, so while training the LDA comparative words, and aspects being compared into one
model, the weights of the MaxEnt classifier remain fixed. coherent outcome: either a positive or negative sentiment
The second method [78] that combines an LDA model about the preferred entity.
with a MaxEnt classifier, uses the MaxEnt classifier to opti- A remaining problem in [84] is that when something is
mize the word priors that influence the generative process more positive than something else, the first is assumed to
of drawing words. Again, part-of-speech information is a have a positive sentiment. This is not always the case. Also
major feature for the MaxEnt component. The fact that problematic is the negation of comparators, as stated by the
external information can be integrated into the generative authors themselves. Their example of “not longer” not nec-
process of an LDA model makes it a very powerful and essarily being the same as “shorter” is illustrative. While the
popular method for aspect-level sentiment analysis. proposed method currently perceives the second entity as
the preferred one when encountering negations, the authors
4 RELATED ISSUES admit that it could also be the case that the user did not
specify any preference.
While finding aspects and determining their sentiment
value is the core of aspect-level sentiment analysis, there are 4.1.2 Conditional Sentences
more issues that play a role in developing an effective tool
As discussed in the previous section, conditional sentences
for aspect-level sentiment analysis. This section discusses
pose a problem in that it is hard to determine whether they
some of these related issues. First, a set of sub-problems will
actually express some sentiment on something or not.
be discussed, including how to deal with comparative opin-
In [82], an approach dedicated to conditional sentences was
ions, conditional sentences, and negations and other modi-
proposed, which can be seen as an extension of the existing
fiers. Then, a short discussion on aggregation of sentiment
line of research based on [26]. First, the various types of con-
scores is given, followed by a concise exposition on presen-
ditionals were grouped into four categories, each with part-
tation of aspect-level sentiment analysis results.
of-speech patterns for both the condition and the conse-
quent in that category. Around 95 percent of the targeted
4.1 Sub-Problems
sentences is covered by these patterns. The sentences found
Processing natural language in general, and performing are then classified as either positive, negative, or neutral
aspect-level sentiment analysis, specifically, is a very com- with respect to some topic in that sentence. For this study,
plex endeavor. Therefore, it has been proposed, for example the topic is assumed to be known beforehand. In contrast to
in [82], that instead of focusing on a one-size-fits-all solu- previously described research, the authors chose to use an
tion, researchers should focus on the many sub-problems. SVM to classify these sentences as having either a positive
By solving enough of the sub-problems, the problem as a or negative polarity.
whole can eventually be solved as well. This line of thought Features used for the SVM are the basic ones like senti-
has given rise to work specifically targeting a certain sub- ment words and part-of-speech information, but also some
problem in sentiment analysis, which is discussed below. common phrases and a list of words that imply the lack of
The presented approaches are not solutions for aspect-level sentiment. Also covered are negations by adding a list of
sentiment analysis and are therefore not in the tables negation keywords. This is, however, still based on a simple
together with the previously discussed approaches. How- word distance metric. Other notable features are the fact
ever, when aspect-level sentiment analysis methods take whether the topic is in the conditional or the consequent
the issues presented below into account (and some do to and the length of both the condition and consequent
some extent), performance will increase. phrases. Last, the sentiment words were weighted accord-
ing to the inverse of their distance to the topic.
4.1.1 Comparative Opinions Multiple ways of training were proposed in [82], but
In comparative sentences, one entity or aspect is usually using the whole sentence instead of only the conditional or
compared with another entity or aspect by preferring one consequent part turned out to be the most successful. Inter-
over the other. Detecting comparative sentences and finding estingly, while the whole-sentence classifier gave the best
the entities and aspects that are compared, as well as the results, the consequent-only classifier gave much better
comparative words themselves is very useful [83]. However, results than the conditional-only classifier, even approach-
for sentiment analysis, one really needs to know which entity ing the results of the whole-sentence classifier, suggesting
or aspect is preferred, a problem that is discussed in [84]. that most useful information to classify conditionals is in
the consequent and not in the conditional part. The classifier module (25 percent). Errors made by the valence shifter
was trained on a set of product reviews which were manu- module can roughly be attributed to three reasons: either
ally annotated and tested on both a binary and a ternary the polarity reading was ambiguous (10 percent), more
classification problem. world knowledge was required (19 percent), or the polarity
For the binary classification, the consequent-only classi- was modulated by phenomena more closely related to prag-
fier and the whole-sentence classifier yielded a similar per- matics than semantics (5 percent).
formance while for the ternary classification, the whole- While Polyani and Zaenen did not really discuss the
sentence approach performed clearly better. According to scope of a negation, this is actually a very important topic.
the authors, this signifies that to classify something as neu- Most approaches to sentiment analysis have at least some
tral, information from both the conditional and the conse- handling of negations, but they usually employ only a sim-
quent are needed. The best result the authors reported is an ple word distance metric to determine which words are
accuracy of 75.6 percent for the binary classification and affected by a negation keyword (cf. [87] for a comparison
66.0 percent for the ternary classification. Unfortunately, no of different word distances). In [88], the concept of the
baseline was defined to compare these results against. scope of a negation term is further developed. For each
negation term, its scope is found by using a combination
4.1.3 Negations and Other Modifiers of parse tree information and a set of rules. The general
From amongst the set of modifiers that change the polarity idea is to use the parse tree to find the least common
or strength of some sentiment, negations are implemented ancestor of the negation word and the word immediately
most. This comes to no surprise given the effect negations following it in the sentence. Then all leaves descending
can have on the sentiment of an aspect, sentence, or docu- from that ancestor that are to the right of the negation term
ment. A theoretical discussion by Polyani and Zaenen [85] are in the scope of the negation. This scope is then further
proposes some foundational considerations when dealing delimited and updated by the set of rules to cover some
with these contextual valence shifters as they are sometimes exceptions to this general rule.
called. The authors distinguish between sentence-based When looking at informal texts, such as microblog posts,
contextual valence shifters and discourse-based ones. additional modifiers need to be taken into account[89]. Lexi-
Negations and intensifiers, which belong to the sentence- cal variants that intensify the expressed sentiment include
based group, are mostly single words influencing the polar- the use of repeated punctuation with exclamation marks and
ity of words that are within their scope. Negations flip the using repeated characters inside a word (e.g., ‘haaaaaappy’).
polarity of a sentiment, while intensifiers either increase or Other sources of sentiment that are employed in informal
decrease the sentiment value. Other sentence-based contex- texts are emoticons, for which a custom list of emoticons
tual valence shifters are: modals, where a context of possi- with their sentiment score is usually needed.
bility or necessity is created as opposed to real events
(e.g., “if she is such a brilliant person, she must be socially 4.2 Aggregation
incapable.”); presuppositional items which represent cer- Several of the discussed approaches aggregate sentiment
tain expectation that are met or not (e.g., “this is barely over aspects, usually to show an aspect-based sentiment
sufficient”); and irony in which overly positive or negative summary. Most methods aggregate sentiment by simply
phrases are turned on themselves to create a sentence with averaging or taking a majority vote. In contrast, methods
the opposite valence or polarity (e.g., “the solid and trust- that employ topic models, for example [72], aggregate nat-
worthy bank turned to robbing their own customers”). urally over the whole corpus, thereby computing senti-
The category of discourse-based contextual valence ment for each topic or aspect based on all the reviews. A
shifters is more complex in nature. While one group, the different approach is shown [40], where the topic model
discourse connectors, are linked to some particular words, does not return the aggregated aspect ratings, but instead
all other categories are much harder to identify. We will presents the aspect ratings for each individual review, as
therefore only briefly discuss these discourse connectors, well as the relative weight placed on that aspect by the
and refer the interested reader to [85] for more categories. reviewer. The authors discuss that this enables advanced
Discourse connectors are words that connect two or more methods of aggregation, where aspect ratings can be
phrases in such a way that the combination is different weighted according to the emphasis placed on it by each
in terms of sentiment than simply the sum of its parts. reviewer.
An example to illustrate this is “while he is grumpy each In [90], multiple methods for aggregating sentiment
day, he is not a bad person”, where we can see that the scores are investigated. Even though this work focuses on
connector ‘while’ mitigates the effects of ‘grumpy’, resulting combining sentence-level sentiment scores into a docu-
in an overall positive sentence. ment-level sentiment score, the ideas can be naturally trans-
An implementation of the above framework was lated into the domain of aspect-level sentiment analysis.
described in [86], where many of the ideas of Polyani and Next to a series of heuristic methods, a formally defined
Zaenen are encoded in rules. The resulting pipeline, which method for aggregation based on the Dempster-Shafer The-
also included a part-of-speech tagger and a parser, was ory of Evidence [91] is proposed. This is a theory of uncer-
evaluated to analyze where errors do occur. The results are tainty that can be used to quantify the amount of evidence a
rather interesting, as about two-thirds of the errors occur certain source contributes to some proposition. In this case,
before the valence shifting module. Large contributions to the sources of evidence are the sentence sentiment scores,
errors are made by the parser and the tagger (around 14 per- and the proposition to which these sources of evidence con-
cent each) and the lack of a word sense disambiguation tribute is the final document-level sentiment score.
The following methods of aggregation are tested: ran- field is transcending its early stages. While in some cases, a
domly picking a sentence sentiment as the document senti- holistic approach is presented that is able to jointly perform
ment, simply averaging all sentence sentiment scores, aspect detection and sentiment analysis, in others dedicated
taking the absolute maximum score (e.g., when the stron- algorithms for each of those two tasks are provided. Most
gest positive sentence is þ5 and the strongest negative sen- approaches that are described in this survey are using
tence is 4, the overall sentiment will be þ5), summing the machine learning to model language, which is not surpris-
two maximum scores (e.g., in the previous example, sum- ing given the fact that language is a non-random, very com-
ming þ5 and 4 would result in a þ1 document-level senti- plex phenomenon for which a lot of data is available. The
ment), scaled rate which is the fraction of positive sentiment latter is especially true for unsupervised models, which are
words out of all sentiment words, and the discussed Demp- very well represented in this survey.
ster-Shafer method. As shown in [90], the proposed method We would like to stress that transparency and standardi-
clearly outperforms all heuristics. It is argued that this is zation is needed in terms of evaluation methodology and
caused by the fact that the Dempster-Shafer method takes data sets in order to draw firm conclusions about the cur-
all pieces of evidence into account, and the fact that it con- rent state-of-the-art. Benchmark initiatives like SemEval [13],
siders maximal agreements among the pieces of evidence. [14] or GERBIL [15] that provide a controlled testing envi-
Of interest is the fact that this method is tested on two data ronment are a shining example of how this can be achieved.
sets that have also been used for already discussed methods When considering the future of aspect-level sentiment
that perform aspect-level sentiment analysis (cf. Tables 1, 2, analysis, we foresee a move from traditional word-based
and 3). Hence, methods for aspect-level sentiment analysis approaches, towards semantically rich concept-centric
should be able to benefit from this research. aspect-level sentiment analysis [95]. For example, in “This
phone doesn’t fit in my pocket”, it is feasible to determine
4.3 Presentation that the discussed aspect is the size of the phone. However,
As a final step in the process of aspect-level sentiment analy- the negative sentiment conveyed by this sentence, related to
sis, the results should be presented to the user. This can be the fact that phones are supposed to fit in one’s pocket,
done in several ways, the first of which is simply showing the seems extremely hard to find for word-based methods.
numbers. In this case, for a certain product, a list of detected Related to this problem, pointing to the need for reasoning
aspects is shown, together with the aggregated sentiment functionality, is the still open research question of irony.
scores for each aspect. One can also imagine a table with the In [96], a conceptual model is presented that explicitly
scores for multiple products in order to easily compare them. models expectations, which is necessary to effectively detect
In [28], a visual format is advocated that shows bars that irony. This is also a step away from the traditional word-
denote the sentiment scores. Clicking the bar would show based approach towards a semantic model for natural
more details, including relevant snippets of reviews. In this language processing. While concept-centric, semantic
way, a user can quickly inspect the traits of several products approaches have only recently begun to emerge (e.g., ontol-
and compare them, without getting overwhelmed by a table ogies are being used to improve aspect detection [97]), they
full of numbers. When the timestamp of each review is should be up to this challenge, since semantic approaches
available, a timeline [92] could also be generated to show naturally integrate common sense knowledge, general
the change in sentiment over time. This is important for world knowledge, and domain knowledge.
services, which can change over time, or product character- Combining concept-centric approaches with the power
istics which may only show after prolonged use. of machine learning will give rise to algorithms that are able
Another possibility is to generate a summary of all the to reason with language and concepts at a whole new level.
analyzed reviews. When done right, this will produce a This will allow future applications to deal with complex
readable review that incorporates all the available informa- language structures and to leverage the available human-
tion spread over all reviews. In [93], an ontology is used to created knowledge bases. Additionally, this will enable
organize all the aspects into aspect categories and all senten- many application domains to benefit from the knowledge
ces that express sentiment on an aspect are linked to the obtained from aspect-level sentiment analysis.
aspects in the ontology as well. Two methods for summary
generation are tested: the first is to select representative sen- ACKNOWLEDGMENTS
tences from the ontology, the second is to generate senten- The authors of this paper are supported by the Dutch
ces with a language generator based on the aspects and national program COMMIT. The authors would like to
their known sentiment scores. While the sentence selection thank the reviewers for their invaluable insights. Further-
method yields more variation in the language being used in more, they are grateful for the constructive comments given
the summary as well as more details, the sentence genera- by Franciska de Jong and Rommert Dekker.
tion provides a better sentiment overview of the product. A
variation of this method is contrastive summarization [94], REFERENCES
where the summary consists of pairs of sentences that
[1] B. Bickart and R. M. Schindler, “Internet forums as influential
express opposing sentiment on the same aspect. sources of consumer information,” J. Interactive Marketing, vol. 15,
no. 3, pp. 31–40, 2001.
[2] E. van Kleef, H. C. M. van Trijp, and P. Luning, “Consumer
5 CONCLUSIONS research in the early Stages of new product development: A criti-
From the overview of the state-of-the-art in aspect-level sen- cal review of methods and techniques,” Food Quality Preference,
vol. 16, no. 3, pp. 181–201, 2005.
timent analysis presented in this survey, it is clear that the
[3] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” [28] B. Liu, M. Hu, and J. Cheng, “Opinion observer: Analyzing and
Found. Trends Inf. Retrieval, vol. 2, no. 1-2, pp. 1–135, 2008. comparing opinions on the web,” in Proc. 14th Int. Conf. World
[4] Y. Chen and J. Xie, “Online consumer review: Word-of-Mouth as a Wide Web, 2005, pp. 342–351.
new element of marketing communication mix,” Manage. Sci., [29] C. Scaffidi, K. Bierhoff, E. Chang, M. Felker, H. Ng, and C. Jin,
vol. 54, no. 3, pp. 477–491, 2008. “Red opal: Product-feature scoring from reviews,” in Proc. 8th
[5] R. E. Goldsmith and D. Horowitz, “Measuring motivations for ACM Conf. Electron. Commerce, 2007, pp. 182–191.
online opinion seeking,” J. Interactive Advertising, vol. 6, no. 2, [30] Z. Li, M. Zhang, S. Ma, B. Zhou, and Y. Sun, “Automatic extrac-
pp. 3–14, 2006. tion for product feature words from comments on the web,” in
[6] I. Arnold and E. Vrugt, “Fundamental uncertainty and stock mar- Proc. 5th Asia Inf. Retrieval Symp. Inf. Retrieval Technol., 2009,
ket volatility,” Appl. Financial Econ., vol. 18, no. 17, pp. 1425–1440, pp. 112–123.
2008. [31] Y. Zhao, B. Qin, S. Hu, and T. Liu, “Generalizing syntactic struc-
[7] M. Tsytsarau and T. Palpanas, “Survey on mining subjective data tures for product attribute candidate extraction,” in Proc. Conf.
on the web,” Data Mining Knowl. Discovery, vol. 24, no. 3, pp. 478– North Am. Chapter Assoc. Comput. Linguistics: Human Lang. Tech-
514, 2012. nol., 2010, pp. 377–380.
[8] B. Liu, Sentiment Analysis and Opinion Mining (series Synthesis [32] J. Zhao, H. Xu, X. Huang, S. Tan, K. Liu, and Q. Zhang, “Overview
Lectures on Human Language Technologies). vol. 16. San Mateo, of chinese opinion analysis evaluation 2008,” in Proc. 1st Chinese
CA, USA: Morgan, 2012. Opinion Anal. Eval., 2008, pp. 1–21.
[9] Collins English Dictionary Complete and Unabridged. HarperCollins [33] G. Qiu, B. Liu, J. Bu, and C. Chen, “Expanding domain sentiment
Publishers, Opinion [Online]. Available: http://www.thefreedic- lexicon through double propagation,” in Proc. 21st Int. Joint Conf.
tionary.com Artif. Intell., 2009, pp. 1199–1204.
[10] S.-M. Kim and E. Hovy, “Determining the sentiment of opinions,” [34] L. Zhang, B. Liu, S. H. Lim, and E. O’Brien-Strain, “Extracting and
in Proc. 20th Int. Conf. Comput. Linguistics, 2004, pp. 1367–1373. ranking product features in opinion documents,” in Proc. 23rd Int.
[11] R. Plutchik, Emotion, A Psychoevolutionary Synthesis. New York, Conf. Comput. Linguistics, 2010, pp. 1462–1470.
NY, USA: Harper & Row, 1980. [35] N. Jakob and I. Gurevych, “Extracting opinion targets in a single-
[12] H. Tang, S. Tan, and X. Cheng, “A survey on sentiment detection and cross-domain setting with conditional random fields,” in
of reviews,” Expert Syst. Appl., vol. 36, no. 7, pp. 10 760–10 773, Proc. Conf. Empirical Methods Natural Lang. Process., 2010, pp. 1035–
2009. 1045.
[13] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. [36] L. Zhuang, F. Jing, and X.-Y. Zhu, “Movie review mining and
Androutsopoulos, and S. Manandhar, “Semeval-2014 task 4: summarization,” in Proc. 15th ACM Int. Conf. Inf. Knowl. Manage.,
Aspect based sentiment analysis,” in Proc. 8th Int. Workshop vol. 6, no. 11, 2006, pp. 43–50.
Semantic Eval., 2014, pp. 27–35. [37] C. Toprak, N. Jakob, and I. Gurevych, “Sentence and expression
[14] M. Pontiki, D. Galanis, H. Papageorgiou, S. Manandhar, and I. level annotation of opinions in user-generated discourse,” in Proc.
Androutsopoulos, “SemEval-2015 task 12: Aspect based sentiment 48th Annu. Meet. Assoc. Comput. Linguistics, 2010, pp. 575–584.
analysis,” in Proc. 9th Int. Workshop Semantic Eval., 2015, pp. 486– [38] J. S. Kessler and N. Nicolov, “Targeting sentiment expressions
495. through supervised ranking of linguistic configurations,” in Proc.
[15] R. Usbeck, M. R€ oder, A.-C. Ngonga Ngomo, C. Baron, A. Both, M. 3rd Int. AAAI Conf. Weblogs Social Media, 2009, pp. 90–97.
Br€ummer, D. Ceccarelli, M. Cornolti, D. Cherix, B. Eickmann, [39] I. Titov and R. McDonald, “Modeling online reviews with multi-
P. Ferragina, C. Lemke, A. Moro, R. Navigli, F. Piccinno, G. Rizzo, grain topic models,” in Proc. 17th Int. Conf. World Wide Web, 2008,
H. Sack, R. Speck, R. Troncy, J. Waitelonis, and L. Wesemann, pp. 111–120.
“GERBIL: General entity annotator benchmarking framework,” in [40] H. Wang, Y. Lu, and C. Zhai, “Latent aspect rating analysis on
Proc. 24th Int. Conf. World Wide Web, 2015, pp. 1133–1143. review text data: A rating regression approach,” in Proc. 16th
[16] C. Long, J. Zhang, and X. Zhut, “A review selection approach for ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010,
accurate feature rating estimation,” in Proc. 23rd Int. Conf. Comput. pp. 783–792.
Linguistics, 2010, pp. 766–774. [41] G. Ganu, N. Elhadad, and A. Marian, “Beyond the stars: Improv-
[17] K. Crammer and Y. Singer, “Pranking with ranking,” in Proc. Adv. ing ratig predictions using review content,” in Proc. 12th Int. Work-
Neural Inf. Process. Syst. 14, 2001, pp. 641–647. shop Web Databases, 2009.
[18] S. Moghaddam and M. Ester, “Opinion digger: An unsupervised [42] H. Lakkaraju, C. Bhattacharyya, I. Bhattacharya, and S. Merugu,
opinion miner from unstructured product reviews,” in Proc. 19th “Exploiting coherence for the simultaneous discovery of latent
ACM Int. Conf. Inf. Knowl. Manage., 2010, pp. 1825–1828. facets and associated sentiments,” in Proc. SIAM Int. Conf. Data
[19] D. Marcheggiani, O. T€ackstr€ om, A. Esuli, and F. Sebastiani, Mining, 2011, pp. 498–509.
“Hierarchical multi-label conditional random fields for aspect-ori- [43] T.-J. Zhan and C.-H. Li, “Semantic dependent word pairs genera-
ented opinion mining,” in Proc. 36th Eur. Conf. Inf. Retrieval, 2014, tive model for fine-grained product feature mining,” in Proc. 15th
pp. 273–285. Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2011,
[20] B. Lu, M. Ott, C. Cardie, and B. K. Tsou, “Multi-aspect sentiment pp. 460–475.
analysis with topic models,” in Proc. IEEE 11th Int. Conf. Data Min- [44] S. Baccianella, A. Esuli, and F. Sebastiani, “Multi-facet rating of
ing Workshops, 2011, pp. 81–88. product reviews,” in Proc. 31th Eur. Conf. IR Res. Adv. Inf. Retrieval,
[21] H. Wang, Y. Lu, and C. Zhai, “Latent aspect rating analysis with- 2009, pp. 461–472.
out aspect keyword supervision,” in Proc. 17th ACM SIGKDD Int. [45] S. Moghaddam and M. Ester, “The FLDA model for aspect-based
Conf. Knowl. Discovery Data Mining, 2011, pp. 618–626. opinion mining: Addressing the cold start problem,” in Proc. 22nd
[22] K. J€
arvelin and J. Kek€ al€ainen, “Cumulated gain-based evaluation Int. Conf. World Wide Web, 2013, pp. 909–918.
of IR techniques,” ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446, [46] N. Jindal and B. Liu, “Opinion spam and analysis,” in Proc. Int.
2002. Conf. Web Search Web Data Mining, 2008, pp. 219–230.
[23] S. Kullback and R. A. Leibler, “On information and sufficiency,” [47] S. Moghaddam and M. Ester, “On the design of LDA models for
Ann. Math. Statist., vol. 22, no. 1, pp. 79–86, 1951. aspect-based opinion mining,” in Proc. 21st ACM Int. Conf. Inf.
[24] S. Moghaddam, and M. Ester. (2013). Tutorial at WWW 2013: Knowl. Manag., 2012, pp. 803–812.
‘Opinion mining in online reviews: Recent trends’ [Online]. Avail- [48] Z. Hai, G. Cong, K. Chang, W. Liu, and P. Cheng, “Coarse-to-fine
able: http://www.cs.sfu.ca/ ester/papers/WWW2013.Tutorial. review selection via supervised joint aspect and sentiment mod-
Final.pdf el,” in Proc. 37th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval,
[25] M. Hu and B. Liu, “Mining opinion features in customer reviews,” 2014, pp. 617–626.
in Proc. 19th Nat. Conf. Artif. Intell., 2004, pp. 755–760. [49] A.-M. Popescu and O. Etzioni, “Extracting product features
[26] M. Hu and B. Liu, “Mining and summarizing customer reviews,” and opinions from reviews,” in Proc. Conf. Human Lang. Tech-
in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discovery Data Min- nol. Conf. Empirical Methods Natural Lang. Process., 2005, pp.
ing, 2004, pp. 168–177. 339–346.
[27] Z. Hai, K. Chang, and J.-J. Kim, “Implicit feature identification [50] J. Yu, Z.-J. Zha, M. Wang, and T.-S. Chua, “Aspect ranking: Identi-
via co-occurrence association rule mining,” in Proc. 12th Int. fying important product aspects from online consumer reviews,”
Conf. Comput. Linguistics Intell. Text Process. 2011, vol. 6608, in Proc. 49th Annu. Meet. Assoc. Comput. Linguistics: Human Lang.
pp. 393–404. Technol, 2011, pp. 1496–1505.
[51] S. Raju, P. Pingali, and V. Varma, “An unsupervised approach to [74] Y. Jo and A. H. Oh, “Aspect and sentiment unification model for
product attribute extraction,” in Proc. 31th Eur. Conf. IR Res. Adv. online review analysis,” in Proc. 4th Int. Conf. Web Search Web Data
Inf. Retrieval, 2009, pp. 796–800. Mining, 2011, pp. 815–824.
[52] S. Blair-Goldensohn, T. Neylon, K. Hannan, G. A. Reis, [75] C. Sauper and R. Barzilay, “Automatic aggregation by joint
R. Mcdonald, and J. Reynar, “Building a sentiment summa- modeling of aspects and values,” J. Artif. Intell. Res., vol. 46, no. 1,
rizer for local service reviews,” in Proc. Workshop NLP Inf. pp. 89–127, 2013.
Explosion, 2008. [76] W. X. Zhao, J. Jiang, H. Yan, and X. Li, “Jointly modeling aspects
[53] G. Qiu, B. Liu, J. Bu, and C. Chen, “Opinion word expansion and and opinions with a MaxEnt-LDA hybrid,” in Proc. Conf. Empirical
target extraction through double propagation,” Comput. Linguis- Methods Natural Lang. Process., 2010, pp. 56–65.
tics, vol. 37, no. 1, pp. 9–27, 2011. [77] S. Brody and N. Elhadad, “An unsupervised aspect-senti-
[54] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet ment model for online reviews,” in Proc. Conf. North Am.
allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003. Ch. Assoc. Computat. Linguistics: Human Lang. Technol., 2010,
[55] T. Hofmann, “Learning the similarity of documents: An pp. 804–812.
information-geometric approach to document retrieval and [78] A. Mukherjee and B. Liu, “Modeling review comments,” in
categorization,” in Proc. Adv. Neural Inf. Process. Syst., 2000, Proc. 50th Annu. Meeting Assoc. Comput. Linguistics, 2012,
pp. 914–920. pp. 320–329.
[56] O. Jin, N. N. Liu, K. Zhao, Y. Yu, and Q. Yang, “Transferring [79] D. A. duVerle and H. Prendinger, “A novel discourse parser
topical knowledge from auxiliary long texts for short text based on support vector machine classification,” in Proc. Joint
clustering,” in Proc. 20th ACM Int. Conf. Inf. Knowl. Manage., Conf. 47th Annu. Meet. ACL 4th Int. Joint Conf. Natural Lang. Pro-
2011, pp. 775–784. cess., 2009, pp. 665–673.
[57] D. M. Blei and P. J. Moreno, “Topic segmentation with an aspect [80] R. Soricut and D. Marcu, “Sentence level discourse parsing
hidden Markov model,” in Proc. 24th Annu. Int. ACM SIGIR Conf. using syntactic and lexical information,” in Proc. Conf. North
Res. Develop. Inf. Retrieval, 2001, pp. 343–348. Am. Ch. Assoc. Comput. Linguistics Human Lang. Technol., 2003,
[58] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, “Topic sentiment pp. 149–156.
mixture: Modeling facets and opinions in weblogs,” in Proc. 16th [81] W. C. Mann and S. A. Thompson, “Rhetorical structure theory:
Int. Conf. World Wide Web, 2007, pp. 171–180. Toward a functional theory of text organization,” Text, vol. 8,
[59] F. Li, C. Han, M. Huang, X. Zhu, Y.-J. Xia, S. Zhang, and H. Yu, no. 3, pp. 243–281, 1998.
“Structure-aware review mining and summarization,” in Proc. [82] R. Narayanan, B. Liu, and A. Choudhary, “Sentiment analysis of
23rd Int. Conf. Comput. Linguistics, 2010, pp. 653–661. conditional sentences,” in Proc. Conf. Empirical Methods Natural
[60] K. W. Church and P. Hanks, “Word association norms, mutual Lang. Process., 2009, pp. 180–189.
information, and lexicography,” Comput. Linguistics, vol. 16, no. 1, [83] N. Jindal and B. Liu, “Identifying comparative sentences in text
pp. 22–29, 1990. documents,” in Proc. 29th Annu. Int. ACM SIGIR Conf. Res. Develop.
[61] L. R. Dice, “Measures of the amount of ecologic association Inf. Retrieval, 2006, pp. 244–251.
between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945. [84] M. Ganapathibhotla and B. Liu, “Mining opinions in comparative
[62] J. Zhu, H. Wang, B. K. Tsou, and M. Zhu, “Multi-aspect opinion sentences,” in Proc. 22nd Int. Conf. Comput. Linguistics, 2008,
polling from textual reviews,” in Proc. 18th ACM Conf. Inf. Knowl. pp. 241–248.
Manage., 2009, pp. 1799–1802. [85] L. Polanyi and A. Zaenen, “Contextual valence shifters,” in Com-
[63] Y. Choi and C. Cardie, “Learning with compositional semantics puting Attitude and Affect in Text: Theory and Applications (series
as structural inference for subsentential sentiment analysis,” in The Information Retrieval Series). New York, NY, USA: 2006,
Proc. Conf. Empirical Methods Natural Lang. Process., 2008, vol. 20, pp. 1–10.
pp. 793–801. [86] K. Moilanen and S. Pulman, “Sentiment composition,” in Proc.
[64] J. Wiebe, T. Wilson, and C. Cardie, “Annotating expressions of Recent Adv. Natural Lang. Process., 2007, pp. 378–382.
opinions and emotions in language,” Lang. Resources Eval., vol. 39, [87] A. Hogenboom, P. van Iterson, B. Heerschop, F. Frasincar, and U.
no. 2, pp. 165–210, 2005. Kaymak, “Determining negation scope and strength in sentiment
[65] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual analysis,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2011,
polarity in phrase-level sentiment analysis,” in Proc. Conf. Human pp. 2589–2594.
Lang. Technol. Empirical Methods Natural Lang. Process., 2005, [88] L. Jia, C. Yu, and W. Meng, “The effect of negation on sentiment
pp. 347–354. analysis and retrieval effectiveness,” in Proc. 18th ACM Conf. Inf.
[66] R. A. Hummel and S. W. Zucker, “On the foundations of relaxa- Knowl. Manage., 2009, pp. 1827–1830.
tion labeling processes,” IEEE Trans. Pattern Anal. Mach. Intell., [89] M. Thelwall, K. Buckley, and G. Paltoglou, “Sentiment strength
vol. 5, no. 3, pp. 267–287, May 1983. detection for the social web,” J. Am. Soc. Inf. Sci. Technol., vol. 63,
[67] V. Hatzivassiloglou and K. R. McKeown, “Predicting the semantic no. 1, pp. 163–173, 2012.
orientation of adjectives,” in Proc. 35th Annu. Meet. Assoc. Comput. [90] M. E. Basiri, A. R. Naghsh-Nilchi, and N. Ghasem-Aghaee,
Linguistics, 8th Conf. Eur. Chapter Assoc. Comput. Linguistics, 1997, “Sentiment prediction based on Dempster-Shafer theory of
pp. 174–181. evidence,” Math. Problems Eng., vol. 2014, article 361201, p. 13,
[68] T. Nasukawa and J. Yi, “Sentiment analysis: Capturing favorabil- 2014.
ity using natural language processing,” in Proc. 2nd Int. Conf. [91] G. Shafer, A Mathematical Theory of Evidence, vol. 1. Princeton, NJ,
Knowl. Capture, 2003, pp. 70–77. USA: Princeton Univ. Press, 1976.
[69] N. Kobayashi, R. Iida, K. Inui, and Y. Matsumoto, “Opinion mining [92] L.-W. Ku, Y.-T. Liang, and H.-H. Chen, “Opinion extraction, sum-
on the web by extracting subject-aspect-evaluation relations,” in marization and tracking in news and blog corpora,” in Proc. AAAI
Proc. AAAI Spring Symp.: Comput. Approaches Anal. Weblogs, 2006, Spring Symp.: Comput. Approaches Anal. Weblogs, 2006, pp. 100–107.
pp. 86–91. [93] G. Carenini, R. T. Ng, and A. Pauls, “Multi-document summariza-
[70] W. Jin, H. H. Ho, and R. K. Srihari,, “OpinionMiner: A novel tion of evaluative text,” in Proc. 11th Conf. Eur. Ch. ACL, 2006,
machine learning system for web opinion mining and extraction,” pp. 305–312.
in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Min- [94] H. D. Kim and C. Zhai, “Generating comparative summaries of
ing, 2009, pp. 1195–1204. contradictory opinions in text,” in Proc. 18th ACM Conf. Inf. Knowl.
[71] C. Zirn, M. Niepert, H. Stuckenschmidth, and M. Strube, “Fine- Manage., 2009, pp. 385–394.
grained sentiment analysis with structural features,” in Proc. 5th [95] E. Cambria, B. Schuller, Y. Xia, and C. Havasi, “New avenues in
Int. Joint Conf. Natural Lang. Process., 2011, pp. 336–344. opinion mining and sentiment analysis,” IEEE Intell. Syst., vol. 28,
[72] I. Titov and R. McDonald, “A joint model of text and aspect no. 2, pp. 15–21, Mar. 2013.
ratings for sentiment summarization,” in Proc 46th Annu. Meet. [96] B. C. Wallace, “Computational irony: A survey and new
Assoc. Comput. Linguistics: Human Lang. Technol., 2008, pp. 308– perspectives,” Artif. Intell. Rev., vol. 43, no. 4, pp. 467–483,
316. 2015.
[73] S. Moghaddam and M. Ester, “ILDA: Interdependent LDA model [97] I. Pe~nalver-Martinez, F. Garcia-Sanchez, R. Valencia-Garcia, M.
for learning latent aspects and their ratings from online product
Angel Rodrıguez-Garcıa, V. Moreno, A. Fraga, and J. L. Sanchez-
reviews,” in Proc. 34th Int. ACM SIGIR Conf. Res. Develop. Inf. Cervantes, “Feature-based opinion mining through ontologies,”
Retrieval, 2011, pp. 665–674. Expert Syst. Appl., vol. 41, no. 13, pp. 5995–6008, 2014.
Kim Schouten is currently working toward Flavius Frasincar is an assistant professor in

the PhD degree at the Erasmus University information systems at Erasmus University Rot-
Rotterdam, focusing on aspect-level sentiment terdam, the Netherlands. He has been published
analysis and its application, implicitness of in numerous conferences and journals in the
aspects and sentiment, and how to move towards areas of databases, web information systems,
a more semantics-oriented form of sentiment personalization, and the semantic web. He is a
analysis. Other topics of interest include the member of the editorial boards of the Interna-
application of language technology within an tional Journal of Web Engineering and Technol-
economic framework, and language in relation to ogy and Decision Support Systems.
artificial intelligence.
" For more information on this or any other computing topic,

please visit our Digital Library at www.computer.org/publications/dlib.

Survey On Aspect-Level Sentiment Analysis: Kim Schouten and Flavius Frasincar

Uploaded by

Copyright:

Available Formats

Survey On Aspect-Level Sentiment Analysis: Kim Schouten and Flavius Frasincar

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Survey On Aspect-Level Sentiment Analysis: Kim Schouten and Flavius Frasincar

Uploaded by

Copyright:

Available Formats

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO.

3, MARCH 2016 813

Survey on Aspect-Level Sentiment Analysis

T HE digital age, also referred to as the information soci-

discussed in the text. In a similar manner, sentiment can be 2 EVALUATION METHODOLOGY

particular study, the goal is to find reviews that are most

domain evaluation task performance

domain classes evaluation task performance

domain classes evaluation task performance

Kim Schouten is currently working toward Flavius Frasincar is an assistant professor in

" For more information on this or any other computing topic,

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.