Topic and Keyword Re-ranking for LDA-based Topic Modeling

Yangqiu Song†, Shimei Pan§, Shixia Liu†, Michelle X. Zhou†, Weihong Qian†
† IBM China Research Lab, Beijing, China; § IBM T. J. Watson Research Center, Hawthorne, NY, USA
† {yqsong,liusx,mxzhou,qianwh}@cn.ibm.com; § shimei@us.ibm.com
semantics and thus the importance of a topic. The native order of topic keywords produced by LDA may not be ideal for users to understand the semantics of a topic. For example, when LDA is applied to a financial news corpus, common words such as Dow, Jones, Wall, Street, etc., are normally ranked high in many topics because they are relevant to all of the topics. These words, however, are not useful in helping users identify interesting topics, since all of the topics are about finance. To better help people identify salient information, we thus re-rank the LDA-derived topic keywords to refine the topic definitions.

Inspired by the term re-weighting techniques used in information retrieval (IR) [13, 8], we have experimented with two LDA versions of TFIDF-like scores:

KR_1 = \frac{\hat{\varphi}_{k,i}}{\sum_{k=1}^{K} \hat{\varphi}_{k,i}},   (1)

and

KR_2 = \hat{\varphi}_{k,i} \cdot \log \frac{\hat{\varphi}_{k,i}}{\big( \prod_{k=1}^{K} \hat{\varphi}_{k,i} \big)^{1/K}},   (2)

where the native weight ϕ̂_{k,i} generated by LDA corresponds to the term frequency. The topic proportion sum and the topic proportion product are used in KR_1 and KR_2, respectively, to simulate the inverse document frequency and re-weight the native weights. In fact, KR_2 is the same as the re-weighting technique used in [2].
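As a concrete illustration of the two scores, the following sketch re-weights a K × V matrix of LDA topic-word weights. The function name, the `eps` smoothing constant, and the use of NumPy are our own illustrative choices rather than details from the paper.

```python
import numpy as np

def rerank_keywords(phi, eps=1e-12):
    """Compute the two TFIDF-like keyword scores of Eqs. (1) and (2).

    phi : (K, V) array; phi[k, i] is the LDA weight of word i in topic k.
    Returns (kr1, kr2), each (K, V); sort each row in descending order
    to re-rank the keywords of the corresponding topic.
    """
    phi = np.asarray(phi, dtype=float) + eps        # guard against log(0) and /0
    kr1 = phi / phi.sum(axis=0, keepdims=True)      # Eq. (1): topic-proportion sum
    geo = np.exp(np.log(phi).mean(axis=0, keepdims=True))  # (prod_k phi)^(1/K)
    kr2 = phi * np.log(phi / geo)                   # Eq. (2): topic-proportion product
    return kr1, kr2

# Example: top-10 keywords of topic k under KR1:
#   top_ids = np.argsort(-kr1[k])[:10]
```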
3.2 Topic Re-ranking

The LDA-derived topics come out in an arbitrary order and may not be equally important to a user. It is thus useful to order the topics so that the most important ones can be shown first. In general, the definition of importance may vary from one user to another. For example, a user may prefer to see the most talked-about topics, i.e., topics that cover more documents. In this case, the rank of a topic would be higher if it covers more document content in the corpus. In contrast, a user may be interested in a set of distinct topics that have the least amount of content overlap with one another. In this case, we will rank topics based on their content uniqueness. Next, we describe a few application-independent topic re-ranking methods that compute topic ranks based on different ranking criteria.

3.2.1 Weighted Topic Coverage and Variation

The first topic re-ranking method assumes that topics that cover a significant portion of the corpus content are more important than those covering little content. However, we consider topics that appear in all the documents (e.g., a topic on disclaimers derived from a legal collection) to be too generic to be interesting, although they have significant content coverage. Thus we rank such topics lower. As a result, our first topic ranking metric is a combination of both content coverage and topic variance. More precisely, we define:

\mu(z_k) = \sum_{m=1}^{M} N_m \hat{\theta}_{m,k} \Big/ \sum_{m=1}^{M} N_m,   (3)

and

\sigma(z_k) = \sqrt{ \sum_{m=1}^{M} N_m \big( \hat{\theta}_{m,k} - \mu(z_k) \big)^2 \Big/ \sum_{m=1}^{M} N_m },   (4)

where the weight N_m is the document length. Then the rank of a topic is defined as:

TR_k^{c.v.} \triangleq \big( \mu(z_k) \big)^{\lambda_1} \cdot \big( \sigma(z_k) \big)^{\lambda_2},   (5)

where \lambda_1 and \lambda_2 are control parameters. Specifically, if \lambda_1 = 1 and \lambda_2 = 0, the ranking is determined purely by topic coverage [14]. In contrast, if \lambda_1 = 0 and \lambda_2 = 1, the rank is determined purely by topic variance, a criterion similar to principal component analysis [1].
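A compact sketch of this metric, assuming the doc-topic proportions θ̂ and the document lengths N_m are available as arrays (the function and argument names are ours):

```python
import numpy as np

def coverage_variance_score(theta, doc_lens, lam1=1.0, lam2=1.0):
    """Weighted topic coverage/variation score TR^{c.v.} of Eqs. (3)-(5).

    theta    : (M, K) array of doc-topic proportions theta_hat[m, k].
    doc_lens : (M,) array of document lengths N_m, used as weights.
    Returns a (K,) array of scores; a higher score means a higher rank.
    """
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(doc_lens, dtype=float)
    w = w / w.sum()                            # N_m / sum_m N_m
    mu = w @ theta                             # Eq. (3): weighted coverage
    sigma = np.sqrt(w @ (theta - mu) ** 2)     # Eq. (4): weighted deviation
    return mu ** lam1 * sigma ** lam2          # Eq. (5)

# lam1=1, lam2=0 ranks purely by coverage; lam1=0, lam2=1 purely by variance.
```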
3.2.2 Laplacian Score

While the topic variance used in the first method reflects a topic's representative power, the Laplacian score of a topic represents its power in discriminating documents from different classes [6]. Our second topic re-ranking method is motivated by the observation that two similar documents are probably related to the same topic, while documents that are dissimilar probably belong to different topics. Since the Laplacian score of a topic reflects its power in discriminating documents from different classes while preserving the local structure of a document collection, we develop a Laplacian score-based topic ranking method that assigns high ranks to topics with high discriminating power. It consists of five main steps (a code sketch follows Remark 1):
1. Represent each document d_m as a node in a graph. Its features are represented by θ̂_m.
2. Construct the T-nearest-neighbor graph based on a similarity matrix S, where S_{ij} = \exp\{-d_{ij}^2 / 2\sigma^2\}. Here, d_{ij} can be either the Euclidean distance or the Hellinger distance [2].
3. Compute the graph Laplacian L = D - S, where D is a diagonal matrix and D_{ii} = \sum_{j=1}^{M} S_{ij} is the degree of the i-th vertex.
4. For each topic t_k = (\hat{\theta}_{1,k}, \hat{\theta}_{2,k}, \ldots, \hat{\theta}_{M,k})^T \in \mathbb{R}^M, let \tilde{t}_k = t_k - \frac{t_k^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}, where \mathbf{1} = [1, 1, \ldots, 1]^T.
5. Compute the Laplacian score of the k-th topic:

L_k = \frac{\tilde{t}_k^T L \tilde{t}_k}{\tilde{t}_k^T D \tilde{t}_k}.   (6)

Remark 1: To find the T nearest neighbors of a document, we keep a heap of size T. For each document, we compute its distances to all the other documents and check whether each should be inserted into the heap. The main time complexity is therefore in the graph Laplacian construction, which is O(M^2 K + M^2 \log T).
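Below is a dense NumPy sketch of these five steps using Euclidean distances; names and defaults are ours. One caveat we flag explicitly: in the feature-selection literature [6], a smaller Laplacian score indicates a feature that better preserves local structure, so this sketch ranks topics in ascending score order; the paper does not restate the direction, so treat that choice as an assumption.

```python
import numpy as np

def laplacian_scores(theta, n_neighbors=10, sigma2=1.0):
    """Laplacian score L_k of each topic (Eq. 6), computed densely.

    theta : (M, K) doc-topic proportions; row m is the feature vector
            of document d_m (step 1).
    """
    theta = np.asarray(theta, dtype=float)
    M, K = theta.shape
    # Step 2: T-nearest-neighbor graph with Gaussian similarities.
    # (A heap-based neighbor search gives the O(M^2 K + M^2 log T) cost
    # of Remark 1; a full argsort is used here for brevity.)
    d2 = ((theta[:, None, :] - theta[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-d2 / (2.0 * sigma2))
    nn = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]   # column 0 is self
    keep = np.zeros((M, M), dtype=bool)
    keep[np.arange(M)[:, None], nn] = True
    S = np.where(keep | keep.T, S, 0.0)                 # symmetric kNN graph
    # Step 3: graph Laplacian L = D - S.
    deg = S.sum(axis=1)
    L = np.diag(deg) - S
    # Steps 4-5: degree-weighted centering, then the Rayleigh-quotient score.
    scores = np.empty(K)
    for k in range(K):
        t = theta[:, k]
        t_c = t - (t @ deg) / deg.sum()                 # t_tilde of step 4
        scores[k] = (t_c @ L @ t_c) / ((t_c ** 2) @ deg)
    return scores

# ranked = np.argsort(laplacian_scores(theta))  # ascending; see caveat above
```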
3.2.3 Pairwise Mutual Information

We have also developed a topic re-ranking approach based on the pairwise mutual information of two topics. This metric computes the information that two topics share; it measures how much knowing one of the topics reduces our uncertainty about the other. Using this metric, we can rank topics by measuring the amount of pairwise mutual information between them. Specifically, we use the following procedure [12] to determine the rank of each topic (a code sketch follows Remark 2 below):
1. For all i, j, first compute MI(t_i, t_j) based on the doc-topic distributions of t_i and t_j. Then construct a complete graph G where the weight of the edge e_{t_i,t_j} is MI(t_i, t_j).
2. Build the maximal spanning tree MST of the complete graph G : (V, E), where V and E are the vertex and edge sets.
3. Define the relevant topic set T^{rel} = {t_1, t_2, ..., t_K} and the corresponding edges in the MST.
4. While |T^{rel}| > 0,
4.1. if there exists a node v with t_v ∈ T^{rel} that is not connected to the others in T^{rel}, remove this topic t_v (T^{rel} ← T^{rel} − t_v);
4.2. otherwise, remove the least-weighted edge in T^{rel}.
5. Rank the topics according to the order in which they were removed, ranking the last removed topic the highest.

Remark 2: We use Prim's algorithm to construct the MST. Computing the pairwise mutual information for topic re-ranking thus needs O(K^2 M) time. Using a heap as the priority queue, we can build the MST in O(|E| log |V|) = O(K^2 log K) time.
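Here is a minimal NumPy/SciPy sketch of this procedure. Two points are our assumptions rather than the paper's specification: MI between two topics is estimated by discretizing their doc-topic proportion vectors into a 2-D histogram, and SciPy's `minimum_spanning_tree` on negated weights stands in for an explicit Prim's implementation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def pairwise_mi(x, y, bins=10):
    """MI between two topics' doc-topic proportion vectors.
    The 2-D histogram discretization is our assumption."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mi_rank(theta, bins=10):
    """Rank topics by pruning a maximum spanning tree of the MI graph;
    returns topic indices from highest rank to lowest."""
    K = theta.shape[1]
    mi = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            mi[i, j] = mi[j, i] = pairwise_mi(theta[:, i], theta[:, j], bins)
    # Step 2: a maximum spanning tree is an MST over negated weights.
    mst = minimum_spanning_tree(-mi).toarray()
    adj = (mst != 0) | (mst.T != 0)
    removed, alive = [], set(range(K))
    while alive:
        # Step 4.1: remove topics with no surviving MST edge ...
        lonely = [v for v in alive
                  if not any(adj[v, u] for u in alive if u != v)]
        if lonely:
            v = lonely[0]
            alive.discard(v)
            removed.append(v)
            continue
        # Step 4.2: ... otherwise cut the least-weighted surviving edge.
        edges = [(mi[i, j], i, j) for i in alive for j in alive
                 if i < j and adj[i, j]]
        _, i, j = min(edges)
        adj[i, j] = adj[j, i] = False
    return removed[::-1]          # step 5: last removed ranks highest
```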
3.2.4 Topic Similarity

The last, similarity-based re-ranking method is developed to maximize topic diversity and minimize redundancy. While all the above methods use topic-document relationships, here we employ topic-word distributions to compute topic similarity. We have slightly modified the algorithm proposed in [11] to derive a topic rank (a sketch follows Remark 3):
1. For all i, j, compute the similarity s_{ij} between ϕ_i and ϕ_j based on the maximal information compression index [11].
2. Sort the similarities for each topic.
3. Define the reduced topic set T^{red} = {ϕ_1, ϕ_2, ..., ϕ_K}.
4. While |T^{red}| > 0, remove the ϕ_j in T^{red} that satisfies j = arg max_i max_j s_{ij}.
5. The rank of a topic is determined by the topic removal order. The last removed topic is ranked the highest.

Remark 3: In this algorithm, constructing the similarity scores needs O(K^2 M) time and sorting the scores needs O(K^2 log K) time.
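A sketch of this greedy removal, assuming the maximal information compression index (MICI) of [11], i.e. the smaller eigenvalue of the 2×2 covariance matrix of two topic-word vectors. MICI is 0 for linearly dependent vectors, so we negate it to obtain a similarity; that sign convention, like the function names, is our assumption.

```python
import numpy as np

def mici(x, y):
    """Maximal information compression index: the smaller eigenvalue of
    the covariance matrix of (x, y); small values mean high similarity."""
    return np.linalg.eigvalsh(np.cov(x, y))[0]

def similarity_rank(phi):
    """Greedily remove the most redundant topic first (steps 3-5);
    phi is (K, V), one topic-word distribution per row. Returns topic
    indices from highest rank to lowest."""
    K = phi.shape[0]
    s = np.full((K, K), -np.inf)     # diagonal stays -inf (never picked)
    for i in range(K):
        for j in range(i + 1, K):
            s[i, j] = s[j, i] = -mici(phi[i], phi[j])
    removed, alive = [], list(range(K))
    while len(alive) > 1:
        sub = s[np.ix_(alive, alive)]
        _, j = np.unravel_index(np.argmax(sub), sub.shape)  # most similar pair
        removed.append(alive.pop(j))  # drop one endpoint (step 4)
    removed.extend(alive)             # the survivor is removed last
    return removed[::-1]              # step 5: last removed ranks highest
```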
4. EXPERIMENTS

We have tested our topic and keyword re-ranking techniques on two different data sets in a series of experiments.

The first data set is a personal email collection, dated from February to December 2008, with 8,326 email messages. Each email is associated with a set of metadata such as sender, receiver, time, subject, body, and reply counts. Only the subject and the body of each email were used to train the topic model. We pre-processed each email to remove irrelevant information such as email signatures, and also removed stop words. After pre-processing, the email collection contained 958,069 word tokens in total.

The second data set is an online document collection containing text retrieved by a search engine. These documents came from various news, blog, and forum web sites. The search engine used "AIG insurance" as the query and retrieved 34,690 documents from January 2008 to April 2009. After pre-processing, the final AIG collection contained 11,491,246 word tokens in total.

To test our methods, we ran LDA five times and report both the average and the standard deviation over the five runs. We adopted an LDA algorithm that was trained with optimized hyper-parameters. For each run of LDA, we set the maximum number of iterations to 1000. The initial model parameters were set to the default values in the Mallet LDA toolkit [10]. We empirically set the topic numbers for the email and AIG data sets to 18 and 20, respectively.

4.1 Topic Re-ranking

For each data set, we asked an expert to annotate the topics and the corresponding topic keywords learned by our methods. The annotation was repeated for each of the five LDA runs on each data set.

4.1.1 Annotation Criteria

For the email data set, the person who owned the email collection helped us annotate the results. She was asked to label each topic as "very important", "somewhat important", or "unimportant". In addition, for each topic, she was also asked to label each of the top 50 keywords as either "relevant" or "irrelevant". When asked how she ranked these topics, the email owner summarized her criteria as follows: (1) A "very important" topic clearly describes a major project in which the email owner was heavily involved. (2) A "somewhat important" topic focuses on a specific event, such as writing a paper. (3) An "unimportant" topic either lacks a clear focus or is about very general work-related activities.

For the AIG news data set, we asked a person who was familiar with the recent AIG-related events to help us annotate the topics and keywords. This person determined the importance of a topic as follows: (1) A "very important" topic clearly describes an event directly related to AIG, e.g., the AIG bonus controversy. (2) A "somewhat important" topic focuses on background events such as the 2008 presidential election or the financial market crisis. (3) An "unimportant" topic is either confusing or irrelevant, e.g., a topic about various advertisements.

Given the annotated topics and keywords, we compared the automatic topic and keyword re-ranking results with the human-provided results using the F1-measure, a criterion commonly used in information retrieval (IR). Following the IR tradition, in our analysis we categorized the expert-annotated topics as either "relevant" or "irrelevant": the "relevant" topics are those labeled either "very important" or "somewhat important", while the "irrelevant" topics are those labeled "unimportant". Similarly, based on our topic or keyword re-ranking methods, each topic or keyword can be categorized as either "retrieved" or "not retrieved", depending on the assigned rank and the cut-off threshold used in each evaluation metric (e.g., "top 5" means only the top five items are retrieved).
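For clarity, this is how such an F1 score can be computed for one method at one cutoff; the paper does not give its exact scoring code, so this is a minimal sketch of the standard definition.

```python
def f1_at_k(ranked, relevant, k):
    """F1 between the top-k retrieved items of a ranking and the
    expert-labeled relevant set.

    ranked   : list of item ids, ordered by a re-ranking method.
    relevant : set of item ids the annotator labeled relevant
               ("very important" or "somewhat important").
    """
    retrieved = set(ranked[:k])
    tp = len(retrieved & relevant)      # relevant AND retrieved
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)
```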
4.1.2 Re-ranking Results

The keyword re-ranking results are shown in Tables 2 and 4. In these tables, KR_0 is the baseline that uses the LDA-estimated parameters ϕ̂_{k,i} directly; KR_1 and KR_2 are defined in Section 3.1. We can see that KR_1 performed better than KR_0 and KR_2. This shows that, for the two selected data sets, the topic proportion sum is better than the topic proportion product for re-weighting the keyword proportions.

The topic re-ranking results for the email data set are shown in Table 3. In the table, "C.V." stands for Weighted Topic Coverage and Variation, "L.S." for Laplacian Score, "M.I." for Pairwise Mutual Information, and "T.S." for Topic Similarity. Our results show that the Laplacian score-based method outperformed the other methods. In particular, all the top five retrieved topics were labeled by the email owner as relevant.

The topic re-ranking results for the AIG data set are shown in Table 5. The Laplacian score-based method also outperformed all the other methods on this data set. Overall, the Laplacian score-based re-ranking method seems to capture the essence of an important topic the best. Table 1 shows the details of the Laplacian score-based topic re-ranking results for the AIG data set; it contains the top 10 keywords of all 20 topics.
Table 1: AIG topics ranked by Laplacian score (bus. means business; contr. means controversy; gov. means government; int'l means international).

Topics 1-10:
| job related | market info | int'l market | unclear focus | insurance bus. | retreat contr. | general bus. | bonus contr. | ads | Spitzer probe |
| aig | crude | hbos | rating | quote | retreat | aia | bonus | gps | spitzer |
| ng | stock | bank | insurer | farm | spa | insurer | taxpayer | diamond | settlement |
| jobcircle | oil | japan | business | travel | regis | policyholders | payments | shirts | investigation |
| description | price | european | subsidiaries | progressive | resort | company | compensation | cellular | fraud |
| employer | percent | asia | assets | car | bancorp | exposure | executive | garden | greenberg |
| resume | trading | london | reserve | cheap | executives | clients | employees | jeans | ceo |
| location | investors | brothers | monday | homeowners | committee | capital | ceo | ipod | executive |
| apply | dollar | cash | capital | insurance | lawmakers | income | administration | shoes | chairman |
| experience | gold | lehman | federal | life | bailout | commercial | bailout | ringtones | products |
| title | financial | crisis | american | medical | event | million | dollars | silver | services |

Topics 11-20:
| insurance bus. | election | credit swap | no focus | financial company | no focus | web related | gov. and crisis | ads | congress |
| quotation | republican | swap | thing | stanley | amp | story | capitalism | movie | fortune |
| universal | mccain | mortgage | remember | morgan | quot | post | paulson | clip | voted |
| health | voters | hedge | pretty | jpmorgan | nbsp | comment | bernanke | video | legislation |
| policy | obama | derivatives | idea | merrill | ldquo | google | economy | drug | republicans |
| protection | presidential | sell | people | goldman | message | thread | war | game | leaders |
| care | election | bonds | save | bank | www | click | crisis | sitemap | democrats |
| coverage | candidate | fund | hard | lehman | public | article | growth | crack | lawmakers |
| benefits | clinton | risk | work | lynch | read | blog | freddie | music | bush |
| medical | bush | investment | place | bankruptcy | investments | read | debt | torrent | congress |
| life | economic | credit | problem | wall | business | daily | congress | porn | paulson |
Table 2: Email keyword re-ranking results (F1, mean ± standard deviation over five LDA runs).
| Method | Top 10 | Top 20 |
| KR0 | 0.535 ± 0.055 | 0.442 ± 0.048 |
| KR1 | 0.701 ± 0.067 | 0.600 ± 0.043 |
| KR2 | 0.616 ± 0.078 | 0.551 ± 0.049 |

Table 3: Email topic re-ranking results (F1, mean ± standard deviation over five LDA runs).
| Method | Top 5 | Top 10 |
| C.V. | 0.800 ± 0.000 | 0.620 ± 0.028 |
| L.S. | 1.000 ± 0.000 | 0.780 ± 0.028 |
| M.I. | 0.760 ± 0.106 | 0.740 ± 0.035 |
| T.S. | 0.440 ± 0.057 | 0.480 ± 0.028 |

Table 4: News (AIG) keyword re-ranking results (F1, mean ± standard deviation over five LDA runs).
| Method | Top 10 | Top 20 |
| KR0 | 0.466 ± 0.111 | 0.445 ± 0.083 |
| KR1 | 0.662 ± 0.094 | 0.614 ± 0.054 |
| KR2 | 0.403 ± 0.079 | 0.343 ± 0.048 |

Table 5: News (AIG) topic re-ranking results (F1, mean ± standard deviation over five LDA runs).
| Method | Top 5 | Top 10 |
| C.V. | 0.640 ± 0.057 | 0.680 ± 0.028 |
| L.S. | 0.760 ± 0.057 | 0.760 ± 0.035 |
| M.I. | 0.760 ± 0.057 | 0.740 ± 0.035 |
| T.S. | 0.720 ± 0.069 | 0.700 ± 0.045 |