Topic and Keyword Re-ranking for LDA-based Topic Modeling

Yangqiu Song†, Shimei Pan§, Shixia Liu†, Michelle X. Zhou†, Weihong Qian†
† IBM China Research Lab, Beijing, China; § IBM T. J. Watson Research Center, Hawthorne, NY, USA
† {yqsong,liusx,mxzhou,qianwh}@cn.ibm.com; § shimei@us.ibm.com
semantics and thus the importance of a topic. The native order of topic keywords produced by LDA may not be ideal for users to understand the semantics of a topic. For example, when LDA is applied to a financial news corpus, common words such as Dow, Jones, Wall, Street, etc., are normally ranked high in many topics because they are relevant to all of the topics. These words, however, are not useful in helping users identify interesting topics, since all of the topics are about finance. To better help people identify salient information, we thus re-rank the LDA-derived topic keywords to refine the topic definitions.

Inspired by the term re-weighting techniques used in information retrieval (IR) [13, 8], we have experimented with two LDA versions of TFIDF-like scores:

KR_1 = \frac{\hat{\varphi}_{k,i}}{\sum_{k=1}^{K} \hat{\varphi}_{k,i}},   (1)

and

KR_2 = \hat{\varphi}_{k,i} \cdot \log \frac{\hat{\varphi}_{k,i}}{\big( \prod_{k=1}^{K} \hat{\varphi}_{k,i} \big)^{1/K}},   (2)

where the native weight ϕ̂_{k,i} generated by LDA corresponds to the term frequency. The topic proportion sum and the topic proportion product are used in KR_1 and KR_2, respectively, to simulate the inverse document frequency and re-weight the native weights. In fact, KR_2 is the same as the re-weighting technique used in [2].
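As a concrete illustration of the two scores, the following sketch re-weights a K × V matrix of LDA topic-word weights. The function name, the `eps` smoothing constant, and the use of NumPy are our own illustrative choices rather than details from the paper.

```python
import numpy as np

def rerank_keywords(phi, eps=1e-12):
    """Compute the two TFIDF-like keyword scores of Eqs. (1) and (2).

    phi : (K, V) array; phi[k, i] is the LDA weight of word i in topic k.
    Returns (kr1, kr2), each (K, V); sort each row in descending order
    to re-rank the keywords of the corresponding topic.
    """
    phi = np.asarray(phi, dtype=float) + eps        # guard against log(0) and /0
    kr1 = phi / phi.sum(axis=0, keepdims=True)      # Eq. (1): topic-proportion sum
    geo = np.exp(np.log(phi).mean(axis=0, keepdims=True))  # (prod_k phi)^(1/K)
    kr2 = phi * np.log(phi / geo)                   # Eq. (2): topic-proportion product
    return kr1, kr2

# Example: top-10 keywords of topic k under KR1:
#   top_ids = np.argsort(-kr1[k])[:10]
```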
3.2 Topic Re-ranking

The LDA-derived topics come out in an arbitrary order and may not be equally important to a user. It is thus useful to order the topics so that the most important ones can be shown first. In general, the definition of importance may vary from one user to another. For example, a user may prefer to see the most talked-about topics, i.e., topics that cover more documents. In this case, the rank of a topic would be higher if it covers more document content in the corpus. In contrast, a user may be interested in a set of distinct topics that have the least amount of content overlap with one another. In this case, we will rank topics based on their content uniqueness. Next, we describe a few application-independent topic re-ranking methods that compute topic ranks based on different ranking criteria.

3.2.1 Weighted Topic Coverage and Variation

The first topic re-ranking method assumes that topics that cover a significant portion of the corpus content are more important than those covering little content. However, we consider topics that appear in all the documents (e.g., a topic on disclaimers derived from a legal collection) to be too generic to be interesting, although they have significant content coverage. Thus we rank such topics lower. As a result, our first topic ranking metric is a combination of both content coverage and topic variance. More precisely, we define:

\mu(z_k) = \sum_{m=1}^{M} N_m \hat{\theta}_{m,k} \Big/ \sum_{m=1}^{M} N_m,   (3)

and

\sigma(z_k) = \sqrt{ \sum_{m=1}^{M} N_m \big( \hat{\theta}_{m,k} - \mu(z_k) \big)^2 \Big/ \sum_{m=1}^{M} N_m },   (4)

where the weight N_m is the document length. Then the rank of a topic is defined as:

TR_k^{c.v.} \triangleq \big( \mu(z_k) \big)^{\lambda_1} \cdot \big( \sigma(z_k) \big)^{\lambda_2},   (5)

where \lambda_1 and \lambda_2 are control parameters. Specifically, if \lambda_1 = 1 and \lambda_2 = 0, the ranking is determined purely by topic coverage [14]. In contrast, if \lambda_1 = 0 and \lambda_2 = 1, the rank is determined purely by topic variance, a criterion similar to principal component analysis [1].
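A compact sketch of this metric, assuming the doc-topic proportions θ̂ and the document lengths N_m are available as arrays (the function and argument names are ours):

```python
import numpy as np

def coverage_variance_score(theta, doc_lens, lam1=1.0, lam2=1.0):
    """Weighted topic coverage/variation score TR^{c.v.} of Eqs. (3)-(5).

    theta    : (M, K) array of doc-topic proportions theta_hat[m, k].
    doc_lens : (M,) array of document lengths N_m, used as weights.
    Returns a (K,) array of scores; a higher score means a higher rank.
    """
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(doc_lens, dtype=float)
    w = w / w.sum()                            # N_m / sum_m N_m
    mu = w @ theta                             # Eq. (3): weighted coverage
    sigma = np.sqrt(w @ (theta - mu) ** 2)     # Eq. (4): weighted deviation
    return mu ** lam1 * sigma ** lam2          # Eq. (5)

# lam1=1, lam2=0 ranks purely by coverage; lam1=0, lam2=1 purely by variance.
```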
3.2.2 Laplacian Score

While the topic variance used in the first method reflects a topic's representative power, the Laplacian score of a topic represents its power in discriminating documents from different classes [6]. Our second topic re-ranking method is motivated by the observation that two similar documents are probably related to the same topic, while documents that are dissimilar probably belong to different topics. Since the Laplacian score of a topic reflects its power in discriminating documents from different classes while preserving the local structure of a document collection, we develop a Laplacian score-based topic ranking method that assigns high ranks to topics with high discriminating power. It consists of five main steps (a code sketch follows Remark 1):
1. Represent each document d_m as a node in a graph. Its features are represented by θ̂_m.
2. Construct the T-nearest-neighbor graph based on a similarity matrix S, where S_{ij} = \exp\{-d_{ij}^2 / 2\sigma^2\}. Here, d_{ij} can be either the Euclidean distance or the Hellinger distance [2].
3. Compute the graph Laplacian L = D - S, where D is a diagonal matrix and D_{ii} = \sum_{j=1}^{M} S_{ij} is the degree of the i-th vertex.
4. For each topic t_k = (\hat{\theta}_{1,k}, \hat{\theta}_{2,k}, \ldots, \hat{\theta}_{M,k})^T \in \mathbb{R}^M, let \tilde{t}_k = t_k - \frac{t_k^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}, where \mathbf{1} = [1, 1, \ldots, 1]^T.
5. Compute the Laplacian score of the k-th topic:

L_k = \frac{\tilde{t}_k^T L \tilde{t}_k}{\tilde{t}_k^T D \tilde{t}_k}.   (6)

Remark 1: To find the T nearest neighbors of a document, we keep a heap of size T. For each document, we compute its distances to all the other documents and check whether each should be inserted into the heap. The main time complexity is therefore in the graph Laplacian construction, which is O(M^2 K + M^2 \log T).
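Below is a dense NumPy sketch of these five steps using Euclidean distances; names and defaults are ours. One caveat we flag explicitly: in the feature-selection literature [6], a smaller Laplacian score indicates a feature that better preserves local structure, so this sketch ranks topics in ascending score order; the paper does not restate the direction, so treat that choice as an assumption.

```python
import numpy as np

def laplacian_scores(theta, n_neighbors=10, sigma2=1.0):
    """Laplacian score L_k of each topic (Eq. 6), computed densely.

    theta : (M, K) doc-topic proportions; row m is the feature vector
            of document d_m (step 1).
    """
    theta = np.asarray(theta, dtype=float)
    M, K = theta.shape
    # Step 2: T-nearest-neighbor graph with Gaussian similarities.
    # (A heap-based neighbor search gives the O(M^2 K + M^2 log T) cost
    # of Remark 1; a full argsort is used here for brevity.)
    d2 = ((theta[:, None, :] - theta[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-d2 / (2.0 * sigma2))
    nn = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]   # column 0 is self
    keep = np.zeros((M, M), dtype=bool)
    keep[np.arange(M)[:, None], nn] = True
    S = np.where(keep | keep.T, S, 0.0)                 # symmetric kNN graph
    # Step 3: graph Laplacian L = D - S.
    deg = S.sum(axis=1)
    L = np.diag(deg) - S
    # Steps 4-5: degree-weighted centering, then the Rayleigh-quotient score.
    scores = np.empty(K)
    for k in range(K):
        t = theta[:, k]
        t_c = t - (t @ deg) / deg.sum()                 # t_tilde of step 4
        scores[k] = (t_c @ L @ t_c) / ((t_c ** 2) @ deg)
    return scores

# ranked = np.argsort(laplacian_scores(theta))  # ascending; see caveat above
```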
3.2.3 Pairwise Mutual Information

We have also developed a topic re-ranking approach based on the pairwise mutual information of two topics. This metric computes the information that two topics share; it measures how much knowing one of the topics reduces our uncertainty about the other. Using this metric, we can rank topics by measuring the amount of pairwise mutual information between them. Specifically, we use the following procedure [12] to determine the rank of each topic (a code sketch follows Remark 2 below):
1. For all i, j, first compute MI(t_i, t_j) based on the doc-topic distributions of t_i and t_j. Then construct a complete graph G where the weight of the edge e_{t_i,t_j} is MI(t_i, t_j).
2. Build the maximal spanning tree MST of the complete graph G : (V, E), where V and E are the vertex and edge sets.
3. Define the relevant topic set T^{rel} = {t_1, t_2, ..., t_K} and the corresponding edges in the MST.
4. While |T^{rel}| > 0,
4.1. if there exists a node v with t_v ∈ T^{rel} that is not connected to the others in T^{rel}, remove this topic t_v (T^{rel} ← T^{rel} − t_v);
4.2. otherwise, remove the least-weighted edge in T^{rel}.
5. Rank the topics according to the order in which they were removed, ranking the last removed topic the highest.

Remark 2: We use Prim's algorithm to construct the MST. Computing the pairwise mutual information for topic re-ranking thus needs O(K^2 M) time. Using a heap as the priority queue, we can build the MST in O(|E| log |V|) = O(K^2 log K) time.
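Here is a minimal NumPy/SciPy sketch of this procedure. Two points are our assumptions rather than the paper's specification: MI between two topics is estimated by discretizing their doc-topic proportion vectors into a 2-D histogram, and SciPy's `minimum_spanning_tree` on negated weights stands in for an explicit Prim's implementation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def pairwise_mi(x, y, bins=10):
    """MI between two topics' doc-topic proportion vectors.
    The 2-D histogram discretization is our assumption."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mi_rank(theta, bins=10):
    """Rank topics by pruning a maximum spanning tree of the MI graph;
    returns topic indices from highest rank to lowest."""
    K = theta.shape[1]
    mi = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            mi[i, j] = mi[j, i] = pairwise_mi(theta[:, i], theta[:, j], bins)
    # Step 2: a maximum spanning tree is an MST over negated weights.
    mst = minimum_spanning_tree(-mi).toarray()
    adj = (mst != 0) | (mst.T != 0)
    removed, alive = [], set(range(K))
    while alive:
        # Step 4.1: remove topics with no surviving MST edge ...
        lonely = [v for v in alive
                  if not any(adj[v, u] for u in alive if u != v)]
        if lonely:
            v = lonely[0]
            alive.discard(v)
            removed.append(v)
            continue
        # Step 4.2: ... otherwise cut the least-weighted surviving edge.
        edges = [(mi[i, j], i, j) for i in alive for j in alive
                 if i < j and adj[i, j]]
        _, i, j = min(edges)
        adj[i, j] = adj[j, i] = False
    return removed[::-1]          # step 5: last removed ranks highest
```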
3.2.4 Topic Similarity

The last, similarity-based re-ranking method is developed to maximize topic diversity and minimize redundancy. While all the above methods use topic-document relationships, here we employ topic-word distributions to compute topic similarity. We have slightly modified the algorithm proposed in [11] to derive a topic rank (a sketch follows Remark 3):
1. For all i, j, compute the similarity s_{ij} between ϕ_i and ϕ_j based on the maximal information compression index [11].
2. Sort the similarities for each topic.
3. Define the reduced topic set T^{red} = {ϕ_1, ϕ_2, ..., ϕ_K}.
4. While |T^{red}| > 0, remove the ϕ_j in T^{red} that satisfies j = arg max_i max_j s_{ij}.
5. The rank of a topic is determined by the topic removal order. The last removed topic is ranked the highest.

Remark 3: In this algorithm, constructing the similarity scores needs O(K^2 M) time and sorting the scores needs O(K^2 log K) time.
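A sketch of this greedy removal, assuming the maximal information compression index (MICI) of [11], i.e. the smaller eigenvalue of the 2×2 covariance matrix of two topic-word vectors. MICI is 0 for linearly dependent vectors, so we negate it to obtain a similarity; that sign convention, like the function names, is our assumption.

```python
import numpy as np

def mici(x, y):
    """Maximal information compression index: the smaller eigenvalue of
    the covariance matrix of (x, y); small values mean high similarity."""
    return np.linalg.eigvalsh(np.cov(x, y))[0]

def similarity_rank(phi):
    """Greedily remove the most redundant topic first (steps 3-5);
    phi is (K, V), one topic-word distribution per row. Returns topic
    indices from highest rank to lowest."""
    K = phi.shape[0]
    s = np.full((K, K), -np.inf)     # diagonal stays -inf (never picked)
    for i in range(K):
        for j in range(i + 1, K):
            s[i, j] = s[j, i] = -mici(phi[i], phi[j])
    removed, alive = [], list(range(K))
    while len(alive) > 1:
        sub = s[np.ix_(alive, alive)]
        _, j = np.unravel_index(np.argmax(sub), sub.shape)  # most similar pair
        removed.append(alive.pop(j))  # drop one endpoint (step 4)
    removed.extend(alive)             # the survivor is removed last
    return removed[::-1]              # step 5: last removed ranks highest
```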
4. EXPERIMENTS

We have tested our topic and keyword re-ranking techniques on two different data sets in a series of experiments.

The first data set is a personal email collection, dated from February to December 2008, with 8,326 email messages. Each email is associated with a set of metadata such as sender, receiver, time, subject, body, and reply counts. Only the subject and the body of each email were used to train the topic model. We pre-processed each email to remove irrelevant information such as email signatures, and also removed stop words. After pre-processing, the email collection contained 958,069 word tokens in total.

The second data set is an online document collection containing text retrieved by a search engine. These documents came from various news, blog, and forum web sites. The search engine used "AIG insurance" as the query and retrieved 34,690 documents from January 2008 to April 2009. After pre-processing, the final AIG collection contained 11,491,246 word tokens in total.

To test our methods, we ran LDA five times and report both the average and the standard deviation over the five runs. We adopted an LDA algorithm that was trained with optimized hyper-parameters. For each run of LDA, we set the maximum number of iterations to 1000. The initial model parameters were set to the default values in the Mallet LDA toolkit [10]. We empirically set the topic numbers for the email and AIG data sets to 18 and 20, respectively.

4.1 Topic Re-ranking

For each data set, we asked an expert to annotate the topics and the corresponding topic keywords learned by our methods. The annotation was repeated for each of the five LDA runs on each data set.

4.1.1 Annotation Criteria

For the email data set, the person who owned the email collection helped us annotate the results. She was asked to label each topic as "very important", "somewhat important", or "unimportant". In addition, for each topic, she was also asked to label each of the top 50 keywords as either "relevant" or "irrelevant". When asked how she ranked these topics, the email owner summarized her criteria as follows: (1) A "very important" topic clearly describes a major project in which the email owner was heavily involved. (2) A "somewhat important" topic focuses on a specific event, such as writing a paper. (3) An "unimportant" topic either lacks a clear focus or is about very general work-related activities.

For the AIG news data set, we asked a person who was familiar with the recent AIG-related events to help us annotate the topics and keywords. This person determined the importance of a topic as follows: (1) A "very important" topic clearly describes an event directly related to AIG, e.g., the AIG bonus controversy. (2) A "somewhat important" topic focuses on background events such as the 2008 presidential election or the financial market crisis. (3) An "unimportant" topic is either confusing or irrelevant, e.g., a topic about various advertisements.

Given the annotated topics and keywords, we compared the automatic topic and keyword re-ranking results with the human-provided results using the F1-measure, a criterion commonly used in information retrieval (IR). Following the IR tradition, in our analysis we categorized the expert-annotated topics as either "relevant" or "irrelevant": the "relevant" topics are those labeled either "very important" or "somewhat important", while the "irrelevant" topics are those labeled "unimportant". Similarly, based on our topic or keyword re-ranking methods, each topic or keyword can be categorized as either "retrieved" or "not retrieved", depending on the assigned rank and the cut-off threshold used in each evaluation metric (e.g., "top 5" means only the top five items are retrieved).
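For clarity, this is how such an F1 score can be computed for one method at one cutoff; the paper does not give its exact scoring code, so this is a minimal sketch of the standard definition.

```python
def f1_at_k(ranked, relevant, k):
    """F1 between the top-k retrieved items of a ranking and the
    expert-labeled relevant set.

    ranked   : list of item ids, ordered by a re-ranking method.
    relevant : set of item ids the annotator labeled relevant
               ("very important" or "somewhat important").
    """
    retrieved = set(ranked[:k])
    tp = len(retrieved & relevant)      # relevant AND retrieved
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)
```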
4.1.2 Re-ranking Results

The keyword re-ranking results are shown in Tables 2 and 4. In these tables, KR_0 is the baseline that uses the LDA-estimated parameters ϕ̂_{k,i} directly; KR_1 and KR_2 are defined in Section 3.1. We can see that KR_1 performed better than KR_0 and KR_2. This shows that, for the two selected data sets, the topic proportion sum is better than the topic proportion product for re-weighting the keyword proportions.

The topic re-ranking results for the email data set are shown in Table 3. In the table, "C.V." stands for Weighted Topic Coverage and Variation, "L.S." for Laplacian Score, "M.I." for Pairwise Mutual Information, and "T.S." for Topic Similarity. Our results show that the Laplacian score-based method outperformed the other methods. In particular, all the top five retrieved topics were labeled by the email owner as relevant.

The topic re-ranking results for the AIG data set are shown in Table 5. The Laplacian score-based method also outperformed all the other methods on this data set. Overall, the Laplacian score-based re-ranking method seems to capture the essence of an important topic the best. Table 1 shows the details of the Laplacian score-based topic re-ranking results for the AIG data set; it contains the top 10 keywords of all 20 topics.
Table 1: AIG topics ranked by Laplacian score (bus. means business; contr. means controversy; gov. means government; int'l means international).

Topics 1-10:
| job related | market info | int'l market | unclear focus | insurance bus. | retreat contr. | general bus. | bonus contr. | ads | Spitzer probe |
| aig | crude | hbos | rating | quote | retreat | aia | bonus | gps | spitzer |
| ng | stock | bank | insurer | farm | spa | insurer | taxpayer | diamond | settlement |
| jobcircle | oil | japan | business | travel | regis | policyholders | payments | shirts | investigation |
| description | price | european | subsidiaries | progressive | resort | company | compensation | cellular | fraud |
| employer | percent | asia | assets | car | bancorp | exposure | executive | garden | greenberg |
| resume | trading | london | reserve | cheap | executives | clients | employees | jeans | ceo |
| location | investors | brothers | monday | homeowners | committee | capital | ceo | ipod | executive |
| apply | dollar | cash | capital | insurance | lawmakers | income | administration | shoes | chairman |
| experience | gold | lehman | federal | life | bailout | commercial | bailout | ringtones | products |
| title | financial | crisis | american | medical | event | million | dollars | silver | services |

Topics 11-20:
| insurance bus. | election | credit swap | no focus | financial company | no focus | web related | gov. and crisis | ads | congress |
| quotation | republican | swap | thing | stanley | amp | story | capitalism | movie | fortune |
| universal | mccain | mortgage | remember | morgan | quot | post | paulson | clip | voted |
| health | voters | hedge | pretty | jpmorgan | nbsp | comment | bernanke | video | legislation |
| policy | obama | derivatives | idea | merrill | ldquo | google | economy | drug | republicans |
| protection | presidential | sell | people | goldman | message | thread | war | game | leaders |
| care | election | bonds | save | bank | www | click | crisis | sitemap | democrats |
| coverage | candidate | fund | hard | lehman | public | article | growth | crack | lawmakers |
| benefits | clinton | risk | work | lynch | read | blog | freddie | music | bush |
| medical | bush | investment | place | bankruptcy | investments | read | debt | torrent | congress |
| life | economic | credit | problem | wall | business | daily | congress | porn | paulson |
Table 2: Email keyword re-ranking results (F1, mean ± standard deviation over five LDA runs).
| Method | Top 10 | Top 20 |
| KR0 | 0.535 ± 0.055 | 0.442 ± 0.048 |
| KR1 | 0.701 ± 0.067 | 0.600 ± 0.043 |
| KR2 | 0.616 ± 0.078 | 0.551 ± 0.049 |

Table 3: Email topic re-ranking results (F1, mean ± standard deviation over five LDA runs).
| Method | Top 5 | Top 10 |
| C.V. | 0.800 ± 0.000 | 0.620 ± 0.028 |
| L.S. | 1.000 ± 0.000 | 0.780 ± 0.028 |
| M.I. | 0.760 ± 0.106 | 0.740 ± 0.035 |
| T.S. | 0.440 ± 0.057 | 0.480 ± 0.028 |

Table 4: News (AIG) keyword re-ranking results (F1, mean ± standard deviation over five LDA runs).
| Method | Top 10 | Top 20 |
| KR0 | 0.466 ± 0.111 | 0.445 ± 0.083 |
| KR1 | 0.662 ± 0.094 | 0.614 ± 0.054 |
| KR2 | 0.403 ± 0.079 | 0.343 ± 0.048 |

Table 5: News (AIG) topic re-ranking results (F1, mean ± standard deviation over five LDA runs).
| Method | Top 5 | Top 10 |
| C.V. | 0.640 ± 0.057 | 0.680 ± 0.028 |
| L.S. | 0.760 ± 0.057 | 0.760 ± 0.035 |
| M.I. | 0.760 ± 0.057 | 0.740 ± 0.035 |
| T.S. | 0.720 ± 0.069 | 0.700 ± 0.045 |