the topic distribution to avoid contradictions. On the other hand, a speaker can decide whether to keep or change the topic for themselves based on the utterances of other speakers. That is, topic maintenance and switching in a conversation proceed under inter-speaker interaction. We employ a graph neural network to reason over the speaker graph and integrate intra-speaker and inter-speaker dependencies among utterances. The graph encoder and the sequence encoder cooperate to adequately capture the hierarchical structure of the conversation. The learned representations of the graph encoder are incorporated into the topic modeling process. Considering the structural properties of the conversation, we make reasonable assumptions on the topic distribution. First, to prevent the confusion caused by modeling the entire conversation with a single topic, we perform fine-grained topic modeling by assuming that each utterance complies with a specific topic distribution. These distributions are mutually influenced across multiple turns. Additionally, the topic distribution of each utterance is assumed to rely on both global and local topic information. We assign each speaker a global topic distribution as a specific role. The local topic information in each utterance is then extracted and interacts with the global role information to produce the final topic distribution. Based on the novel graphical model of ConvNTM, corresponding neural variational inference methods are carried out for model learning. Furthermore, to improve topic coherence, we leverage word co-occurrence information as a new training objective, which can be jointly trained with the original objective of neural variational inference. By grasping the word co-occurrence relationship, ConvNTM tends to cluster related words into the same topic, which helps to obtain higher-quality topic-word distributions.

We run experiments on the public benchmark conversational datasets DailyDialog and EmpatheticDialogues. Our proposed ConvNTM achieves the best performance on topic modeling in terms of topic coherence and quality metrics, which indicates that ConvNTM has better topic interpretability on dialogue corpora compared against general NTMs. Furthermore, we also conduct experiments on typical downstream tasks for dialogues based on the discovered topics, including dialogue act classification and response generation. The experimental results indicate that, with the help of the topics discovered by ConvNTM, performance is prominently boosted compared against baselines without topic information and against existing topic-aware dialogue methods.

Our overall contributions are summarized as follows:
• To the best of our knowledge, we propose ConvNTM, the first neural topic model designed in particular for the conversational scenario, formulating the multi-turn structure in dialogues to discover topics.
• Considering the multi-role interactions (speakers and addressees) in conversations, we perform utterance-level fine-grained topic modeling and fuse global and local topic information to determine topic distributions.
• We also leverage the word co-occurrence relationship to constrain the topic-word distribution, which can be coordinated and jointly trained with the neural variational inference objective to further improve topic coherence.

Related Work

Topic Model

Topic modeling has always been a catalyst for other research areas in Natural Language Processing (NLP) (Panwar et al. 2020; Jin et al. 2021; Srivastava and Sutton 2016). A classic statistical topic model is Latent Dirichlet Allocation (LDA), which is based on Gibbs sampling to extract topics from documents (Blei, Ng, and Jordan 2003). The development of deep generative models has led to the study of neural topic models (NTMs) (Miao, Grefenstette, and Blunsom 2017; Zhu, Feng, and Li 2018; Wang, Zhou, and He 2019). The Variational Autoencoder (VAE) (Kingma and Welling 2013) is the most widely used framework for NTMs. GSM (Miao, Grefenstette, and Blunsom 2017) replaces the prior with a Gaussian softmax function. ProdLDA (Srivastava and Sutton 2017) constructs a Laplace approximation to the Dirichlet prior. ETM (Dieng, Ruiz, and Blei 2020) shares the embedding space between words and topics. GNTM (Shen et al. 2021) adds a document graph into the generative process of topic modeling. With the growth of social platforms (e.g., Microblog and Twitter), application-oriented NTMs keep emerging. LeadLDA (Li et al. 2016) considers the tree structure formed by re-posts and replying relations. ForumLDA (Chen and Ren 2017) cooperatively models the evolution of a root post, as well as its relevant and irrelevant response posts, to detect topics. In such posts, people usually discuss a single hot topic, whereas in our target conversation scenario, speakers with different roles may switch topics across multiple turns.

Multi-Turn Dialogue

Simple concatenation of multi-turn dialogue contexts performs poorly because it ignores the latent dialogue structure. Abundant works suggest that multi-turn dialogue requires specific modeling methods (Qiu et al. 2020a,b). Serban et al. devise a hierarchical LSTM to encode the structure and generate responses. DialoFlow (Li et al. 2021) is another solution, which views the dialogue as a dynamic flow and designs three objectives to capture the information dynamics. Moreover, the speaker is also considered a pivotal factor in dialogue. He et al. incorporate the turn changes among speakers to capture the fine-grained semantics of dialogue. Gu et al. introduce a speaker-aware disentanglement strategy to tackle entangled dialogues and improve the performance of multi-turn dialogue response selection. Topic-aware models take advantage of related topics to make conversational modeling more consistent. Liu et al. propose two topic-aware contrastive learning objectives to handle the information scattering challenge in dialogue summarization. Zhu et al. propose a topic-driven knowledge-aware Transformer for emotion detection in dialogue. We hope that our ConvNTM can further facilitate the development of topic-aware methods.
Figure 1: The model overview of ConvNTM: a) the conversation sequence encoder for modeling the multi-turn conversation contexts; b) the multi-role graph encoder for formulating the intra-speaker and inter-speaker dependencies; c) the topic modeling module to reconstruct utterance-level BoWs based on the fusion of global and local topic information.
Conversational Neural Topic Model

In this section, we describe the modules and training objectives of ConvNTM in detail. The model overview of ConvNTM is illustrated in Figure 1.

Hierarchical Conversation Encoder

To fully extract the semantic information in the multi-turn conversation and help topic modeling, we use a hierarchical framework in which a sequence encoder and a graph encoder cooperatively encode the conversation contexts to better handle cross-utterance dependencies.

Conversation sequence encoder. To capture the multi-turn structure of the conversation, we employ a sequence encoder that models the conversation contexts from the word level to the utterance level. Suppose that a conversation session c has J speakers, and that speaker j has n_j utterances \{u_1^{(j)}, u_2^{(j)}, \cdots, u_{n_j}^{(j)}\}. The words in the k-th utterance u_k^{(j)} are first encoded as e_k^{(j)} through an embedding layer f_e. A two-layer Transformer encoder f_{trm} is then used to further process e_k^{(j)} and obtain the utterance-level representation s_k^{(j)} from the [CLS] token. In order to enhance the contextual relationship among the multi-turn utterances, we feed the Transformer outputs into a bidirectional LSTM f_{rnn} and a standard self-attention layer f_{attn} successively. Finally, we denote the learned utterance representations for speaker j as \{h_1^{(j)}, h_2^{(j)}, \cdots, h_{n_j}^{(j)}\}. The encoding process of the sequence encoder can be formulated as:

e_k^{(j)} = f_e(u_k^{(j)}),  (1)
s_k^{(j)} = f_{trm}(e_k^{(j)})_{[CLS]},  (2)
\tilde{s}_k^{(j)} = f_{rnn}(s_1^{(j)}, s_2^{(j)}, \cdots, s_{n_j}^{(j)})_k,  (3)
h_k^{(j)} = f_{attn}(\tilde{s}_1^{(j)}, \tilde{s}_2^{(j)}, \cdots, \tilde{s}_{n_j}^{(j)})_k.  (4)
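To make the pipeline of Eqs. (1)-(4) concrete, here is a minimal PyTorch sketch of the sequence encoder applied to the utterances of one speaker. The layer sizes, the use of position 0 as the [CLS] slot, and the module names are our assumptions for illustration; this is a sketch, not the released implementation.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """Word-level to utterance-level encoder, sketching Eqs. (1)-(4)."""
    def __init__(self, vocab_size, d_model=64):
        super().__init__()
        self.f_e = nn.Embedding(vocab_size, d_model)                      # Eq. (1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.f_trm = nn.TransformerEncoder(layer, num_layers=2)           # Eq. (2)
        self.f_rnn = nn.LSTM(d_model, d_model // 2, bidirectional=True,
                             batch_first=True)                            # Eq. (3)
        self.f_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                            batch_first=True)             # Eq. (4)

    def forward(self, utterances):
        # utterances: (n_j, max_len) word ids of one speaker; position 0 plays the [CLS] role
        e = self.f_e(utterances)                       # (n_j, max_len, d)
        s = self.f_trm(e)[:, 0, :]                     # [CLS]-position vector per utterance
        s_tilde, _ = self.f_rnn(s.unsqueeze(0))        # contextualize across the speaker's turns
        h, _ = self.f_attn(s_tilde, s_tilde, s_tilde)  # self-attention over the turns
        return h.squeeze(0)                            # (n_j, d): the h_k^{(j)}

# toy usage: 3 utterances, 10 tokens each, vocabulary of 100 words
enc = SequenceEncoder(vocab_size=100)
h = enc(torch.randint(0, 100, (3, 10)))
print(h.shape)  # torch.Size([3, 64])
```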
Multi-role graph encoder. Considering the impact of speaker information in a conversation, we construct a graph for the conversation to describe the multi-role interactions. We denote each utterance representation h_k^{(j)} as a node, and two types of edges between nodes reflect the intra-speaker and inter-speaker dependencies. First, the individual role of each speaker in the dialogue has a significant impact on the continuation of the conversation. A speaker tends to organize what he/she has said in previous utterances to determine the topic of the current utterance. Therefore, we consider the intra-speaker dependency to keep the topic consistent and avoid contradictions. For speaker j, we add a bidirectional edge between h_{k_1}^{(j)} and h_{k_2}^{(j)} only if |k_1 - k_2| \le K_s, where K_s indicates the window size for aggregating contextual utterances from the same speaker. Second, a speaker will give feedback on the utterance contents of other speakers, and then decide whether to keep or shift the current topic. It is therefore also necessary to construct the inter-speaker dependency in the graph to simulate these dynamic interactions. For two speakers j_1 and j_2, we add a bidirectional edge between h_{k_{j_1}}^{(j_1)} and h_{k_{j_2}}^{(j_2)} only if |k_{j_1} - k_{j_2}| \le K_c, where K_c indicates the absolute distance window size of two utterances in the conversation. Taking Figure 1 as an example, the second speaker has three utterances interspersed with the first speaker's four utterances. In this graph, the intra-speaker edges are in grey while the inter-speaker edges are in black. We utilize a graph convolutional network (GCN) f_{gcn} to update the utterance representations under the multi-role interaction relations. The learned utterance representation \tilde{h}_k^{(j)} is given by:

\tilde{h}_k^{(j)} = f_{gcn}(h_k^{(j)}).  (5)
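The intra-speaker and inter-speaker edges can be assembled into an adjacency matrix roughly as follows, with a single symmetrically normalized graph-convolution step standing in for f_gcn. The interpretation of the windows (per-speaker positions for K_s, absolute turn distance for K_c), the self-loops, and the normalization are assumptions of this sketch, not a specification of the authors' exact graph encoder.

```python
import torch

def build_multirole_adjacency(speakers, K_s=2, K_c=2):
    """speakers: list of speaker ids in utterance order, e.g. [0, 1, 0, 1, 0]."""
    n = len(speakers)
    A = torch.eye(n)  # self-loops
    for i in range(n):
        for j in range(i + 1, n):
            if speakers[i] == speakers[j]:
                # intra-speaker edge: within K_s turns of the same speaker
                same = [t for t in range(n) if speakers[t] == speakers[i]]
                if abs(same.index(i) - same.index(j)) <= K_s:
                    A[i, j] = A[j, i] = 1.0
            elif abs(i - j) <= K_c:
                # inter-speaker edge: within an absolute distance of K_c turns
                A[i, j] = A[j, i] = 1.0
    return A

def gcn_layer(H, A, W):
    """One graph-convolution step: symmetric normalization of A, then a linear map."""
    deg = A.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    return torch.relu(D_inv_sqrt @ A @ D_inv_sqrt @ H @ W)

# toy usage: 5 utterances alternating between two speakers, 64-d representations
speakers = [0, 1, 0, 1, 0]
H = torch.randn(5, 64)                    # h_k^{(j)} from the sequence encoder
W = torch.randn(64, 64) * 0.1
H_tilde = gcn_layer(H, build_multirole_adjacency(speakers), W)  # Eq. (5), sketched
print(H_tilde.shape)  # torch.Size([5, 64])
```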
Topic Modeling

Based on the speaker-oriented utterance representations from the graph encoder, we then introduce our techniques for topic modeling.

Topic distribution assumption. Given a general document, the generative process of existing NTMs is mainly divided into three steps: 1) sample a topic distribution \theta for the document or for each sentence; 2) sample a topic assignment z_t for each word w_t from the topic distribution \theta; 3) generate each word w_t independently from the corresponding topic-word distribution \beta_{z_t}. However, a conversation contains multiple turns of utterances; the topics of the utterances follow their respective topic distributions and are related to each other. The roles of different speakers also influence the topic determination. Thus, we need to adapt the original assumptions on the topic distribution according to the unique properties of the conversation. Specifically, we assume that each speaker j in the conversation session c holds global topic information \theta_c^{(j)}, and that each utterance k has local topic information \theta_k^{(j)}, which is fused with the corresponding global topic to determine the eventual topic distribution \tilde{\theta}_k^{(j)}.

NTM framework. We process the n_j utterances of each speaker j into bag-of-words (BoW) representations \{x_1^{(j)}, x_2^{(j)}, \cdots, x_{n_j}^{(j)}\}, where x_k^{(j)} is a |V|-dimensional multi-hot encoded vector for the k-th utterance and V is the BoW vocabulary. Note that each g_* mentioned below represents a multilayer perceptron (MLP). We first normalize the BoW vector x_k^{(j)} and then use g_x to extract the representation \tilde{x}_k^{(j)}:

\tilde{x}_k^{(j)} = g_x\left( x_k^{(j)} / \sum_{v=1}^{|V|} (x_k^{(j)})_v \right).  (6)

In order to introduce multi-role interactions into topic modeling, we concatenate \tilde{x}_k^{(j)} with the node representation \tilde{h}_k^{(j)} given by the graph encoder. Then, we obtain the local topic information \theta_k^{(j)} of the utterance through g_s:

\theta_k^{(j)} = g_s(\tilde{x}_k^{(j)} \oplus \tilde{h}_k^{(j)}).  (7)

Next, all the utterances of each speaker j are integrated to derive the global speaker-aware representation h_c^{(j)}, which is used to estimate the prior variables \mu_c^{(j)} and \log \sigma_c^{(j)} via two separate networks g_\mu and g_\sigma:

h_c^{(j)} = \tanh\left( \sum_{k=1}^{n_j} g_c(\tilde{x}_k^{(j)} \oplus \tilde{h}_k^{(j)}) \cdot \theta_k^{(j)} \right),  (8)
\mu_c^{(j)} = g_\mu(h_c^{(j)}), \quad \log \sigma_c^{(j)} = g_\sigma(h_c^{(j)}).  (9)

With the reparameterisation trick (Kingma and Welling 2013), we can sample a latent variable z_c^{(j)} \sim \mathcal{N}(\mu_c^{(j)}, \sigma_c^{(j)}). Then we use g_\theta to generate the global topic distribution \theta_c^{(j)}:

\theta_c^{(j)} = \mathrm{softmax}(g_\theta(z_c^{(j)})).  (10)

Finally, we use g_f to fuse the local and global topic information and derive the eventual topic distribution \tilde{\theta}_k^{(j)}:

\tilde{\theta}_k^{(j)} = g_f(\theta_k^{(j)} \oplus \theta_c^{(j)}).  (11)

Assuming that the number of topics is K, all the above topic distributions are K-dimensional vectors. To reconstruct the BoWs for each utterance in the conversation, we leverage a weighted matrix \beta \in \mathbb{R}^{K \times |V|} to represent the K topic-word distributions. The reconstructed utterance BoW can be derived as:

\hat{x}_k^{(j)} = \mathrm{softmax}(\tilde{\theta}_k^{(j)} \beta).  (12)
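A compact sketch of the NTM framework in Eqs. (6)-(12) for one speaker: normalize the utterance BoWs, fuse them with the graph-encoder node vectors, estimate the Gaussian parameters, reparameterize, and reconstruct each BoW through β. The concrete MLP shapes and the reading of the product in Eq. (8) as an element-wise weighting summed over utterances are assumptions of this sketch; the paper only states that each g_* is an MLP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNTMHead(nn.Module):
    """Utterance-level topic inference and BoW reconstruction, sketching Eqs. (6)-(12)."""
    def __init__(self, vocab_size, hidden=64, n_topics=20):
        super().__init__()
        mlp = lambda i, o: nn.Sequential(nn.Linear(i, o), nn.Tanh())
        self.g_x = mlp(vocab_size, hidden)            # Eq. (6)
        self.g_s = mlp(2 * hidden, n_topics)          # Eq. (7): local topic information
        self.g_c = mlp(2 * hidden, n_topics)          # Eq. (8)
        self.g_mu = nn.Linear(n_topics, n_topics)     # Eq. (9)
        self.g_sigma = nn.Linear(n_topics, n_topics)  # Eq. (9)
        self.g_theta = nn.Linear(n_topics, n_topics)  # Eq. (10)
        self.g_f = mlp(2 * n_topics, n_topics)        # Eq. (11)
        self.beta = nn.Parameter(torch.randn(n_topics, vocab_size) * 0.01)  # topic-word matrix

    def forward(self, x_bow, h_graph):
        # x_bow: (n_j, |V|) multi-hot BoWs of one speaker; h_graph: (n_j, hidden) node vectors
        x_tilde = self.g_x(x_bow / x_bow.sum(dim=1, keepdim=True).clamp(min=1))
        fused = torch.cat([x_tilde, h_graph], dim=-1)
        theta_local = self.g_s(fused)
        h_c = torch.tanh((self.g_c(fused) * theta_local).sum(dim=0))         # Eq. (8), one reading
        mu, log_sigma = self.g_mu(h_c), self.g_sigma(h_c)
        z = mu + torch.randn_like(mu) * log_sigma.exp()                      # reparameterization
        theta_global = F.softmax(self.g_theta(z), dim=-1)                    # Eq. (10)
        theta = self.g_f(torch.cat([theta_local,
                                    theta_global.expand_as(theta_local)], dim=-1))  # Eq. (11)
        x_hat = F.softmax(theta @ self.beta, dim=-1)                         # Eq. (12)
        return x_hat, mu, log_sigma

# toy usage: one speaker with 3 utterances over a 100-word vocabulary
model = ConvNTMHead(vocab_size=100)
x_hat, mu, log_sigma = model(torch.randint(0, 2, (3, 100)).float(), torch.randn(3, 64))
print(x_hat.shape)  # torch.Size([3, 100])
```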
Generative process. Based on the above definitions, we summarize the generative process of ConvNTM as follows.
1. For each speaker j in the conversation session c:
   i) sample the latent variable z_c^{(j)} \sim \mathcal{N}(\mu_c^{(j)}, \sigma_c^{(j)});
   ii) draw \theta_c^{(j)} = \mathrm{softmax}(g_\theta(z_c^{(j)})) as the global topic distribution.
2. For each utterance u_k^{(j)} of speaker j:
   i) draw \theta_k^{(j)} as the local topic information;
   ii) draw \tilde{\theta}_k^{(j)} by fusing \theta_c^{(j)} and \theta_k^{(j)};
   iii) for each word w in the utterance u_k^{(j)}: draw w \sim \mathrm{softmax}(\tilde{\theta}_k^{(j)} \beta).

The Joint Training Objective

Neural variational inference objective. Under the generative process of ConvNTM, the marginal likelihood of the conversation session c is decomposed as:

p(c|\mu, \sigma, \beta) = \prod_{j=1}^{J} \int_{\theta_c^{(j)}} p(\theta_c^{(j)}|\mu_c^{(j)}, \sigma_c^{(j)}) \left( \prod_{k=1}^{n_j} \prod_{w} p(w|\beta, \theta_c^{(j)}) \right) d\theta_c^{(j)}.  (13)

Inspired by the success of VAE-based NTMs (Miao, Grefenstette, and Blunsom 2017; Dieng, Ruiz, and Blei 2020), we also employ a VAE framework for the utterance-level BoW reconstruction process. The posterior global topic distribution p(\theta_c^{(j)}) for each speaker j can be approximated by the inference network q(\theta_c^{(j)}|\mu_c^{(j)}, \sigma_c^{(j)}). We can formulate parameter updates from the variational evidence lower bound (ELBO). From the perspective of the ELBO, the training objective for the log-likelihood of the conversation consists of two terms. The first term minimizes the cross entropy between the input normalized BoW and the reconstructed BoW, and the second, a Kullback–Leibler (KL) divergence term, minimizes the distance between the variational posterior and the true posterior of the latent variables. This part of the training loss can be formulated as:

L_c^{(j)} = -\sum_{k=1}^{n_j} E_{q(\theta_c^{(j)}|\mu_c^{(j)}, \sigma_c^{(j)})}\left[ \sum_{w} \log p(w|\theta_c^{(j)}, \beta) \right] + w_{kl} \cdot D_{KL}\left( q(\theta_c^{(j)}|\mu_c^{(j)}, \sigma_c^{(j)}) \,\|\, p(\theta_c^{(j)}) \right),  (14)

where w_{kl} is the hyper-parameter for the weight of the KL term.
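The loss of Eq. (14) then amounts to a BoW cross-entropy plus a weighted KL term. The short sketch below assumes a standard-normal prior and the usual closed-form Gaussian KL, which the text does not spell out.

```python
import torch

def elbo_loss(x_bow, x_hat, mu, log_sigma, w_kl=0.01):
    """Eq. (14), sketched: BoW cross-entropy + weighted KL to an assumed standard-normal prior."""
    # reconstruction term: negative log-likelihood of the observed words under x_hat
    recon = -(x_bow * torch.log(x_hat + 1e-10)).sum()
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form
    kl = -0.5 * (1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp()).sum()
    return recon + w_kl * kl

# toy usage with tensors shaped like those produced by the ConvNTMHead sketch above
x_bow = torch.randint(0, 2, (3, 100)).float()
x_hat = torch.softmax(torch.randn(3, 100), dim=-1)
mu, log_sigma = torch.zeros(20), torch.zeros(20)
print(elbo_loss(x_bow, x_hat, mu, log_sigma))
```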
Controllable word co-occurrence objective. In addition to the ELBO commonly used in general NTMs, we further leverage the word co-occurrence information of the training corpus to improve the topic quality. For the topic-word distribution matrix \beta \in \mathbb{R}^{K \times |V|}, the i-th row represents a multinomial distribution of the i-th topic over the vocabulary V. We expect the top words in each topic to be highly correlated and to co-occur in the same real conversations. Thus, we count the co-occurrence frequencies of all word pairs over all conversations in the training corpus, and construct a co-occurrence matrix M \in \mathbb{R}^{|V| \times |V|}. Next, we add a constraint on \beta, which can be described as the following loss:

L_{co} = -\sum_{w_1=1}^{|V|} \sum_{w_2=1}^{|V|} M_{w_1, w_2} \log(\beta^T \beta)_{w_1, w_2}.  (15)

Intuitively, we make the \beta-derived matrix as close as possible to the reference co-occurrence matrix M. We set a target co-occurrence distance d_{co}, and then design a controllable weight w_{co} for the trade-off between L_c and L_{co}. Suppose that there are C conversations in the training set; the overall training loss of ConvNTM is given by:

L = (1 - w_{co}) \sum_{c=1}^{C} \sum_{j=1}^{J} L_c^{(j)} + w_{co} L_{co}.  (16)

The controllable factor w_{co} is dynamically adjusted as:

w_{co} = \begin{cases} 0, & L_{co} \le d_{co}, \\ \min\left(1, \dfrac{L_{co} - d_{co}}{W_{co}}\right), & L_{co} > d_{co}, \end{cases}  (17)

where W_{co} is another hyper-parameter, the correcting factor for the proportional signal.
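A sketch of this machinery: count conversation-level co-occurrences into M, score β against it with Eq. (15), and gate the term with the schedule of Eq. (17). Counting word presence per conversation and parameterizing the rows of β through a softmax are assumptions of this sketch; the paper only states that co-occurrence frequencies of all word pairs are counted.

```python
import torch

def cooccurrence_matrix(conversations, vocab_size):
    """M[w1, w2] = number of training conversations in which both words appear."""
    M = torch.zeros(vocab_size, vocab_size)
    for conv_bow in conversations:          # conv_bow: (|V|,) word counts of one conversation
        present = conv_bow.clamp(max=1)
        M += torch.outer(present, present)
    return M

def cooccurrence_loss(beta, M, eps=1e-10):
    """Eq. (15): encourage beta^T beta to align with the observed co-occurrences."""
    topic_word = torch.softmax(beta, dim=1)          # rows as topic-word distributions (assumed)
    return -(M * torch.log(topic_word.t() @ topic_word + eps)).sum()

def controllable_weight(L_co, d_co=32.0, W_co=0.05):
    """Eq. (17): the co-occurrence term only kicks in once L_co exceeds the target d_co."""
    if L_co <= d_co:
        return 0.0
    return min(1.0, (L_co - d_co) / W_co)

# toy usage: 2 conversations, 10-word vocabulary, 5 topics
convs = [torch.randint(0, 3, (10,)).float() for _ in range(2)]
M = cooccurrence_matrix(convs, vocab_size=10)
beta = torch.randn(5, 10)
L_co = cooccurrence_loss(beta, M)
print(L_co.item(), controllable_weight(L_co.item()))
```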
Implementation details. For the multi-role interaction
Experiments graph, we set the window sizes Ks and Kc to 2. The BoW
Experimental Setup dictionary size is set to 6,500 in DailyDialog and 7,533 in
Datasets. We conduct the experiments on two widely EmpatheticDialogues. The embedding size and hidden size
used multi-turn dialogue datasets, DailyDialog1 and Empa- of the Transformer, LSTM and GCN are all set to 64. For the
theticDialogues2 . DailyDialog (Li et al. 2017) totally con- loss function, wkl and Wco are set to 0.01 and 0.05, while the
tains 13,118 high-quality open-domain daily conversations, value of dco is determined by the number of topics and the
and covers various topics about daily life. It has 7.9 av- dataset. In our main results, dco is recommended to be set to
erage speaker turns per conversation, and each speaker 32 in DailyDialog and 31.375 in EmpatheticDialogues. The
has enough utterances for multi-turn modeling. We use training process has 100 epoches using the Adam optimizer
the official splits, i.e., 11,118/1,000/1,000. EmpatheticDia- with the base learning rate of 0.001. We implement the ex-
logues (Rashkin et al. 2019) contains about 25k personal periments on a Nvidia A40 GPU.3
conversations with rich emotional expressions and topic sit-
uations. Speakers discuss emotional topics and tend to inter- Main Results
act with empathy. We also employ the official splits data, i.e. For all baselines, one conversation is treated as one docu-
19,533/2,770/2,547 for train/val/test respectively. ment for topic modeling. Here we set the number of topics
to 20, and analyze the impact of the number of topics later.
Evaluation metrics. To evaluate the quality of topics gen- To properly evaluate the learned topics, we follow the previ-
erated by topic models, we adopt topic coherence (TC) ous works (Kim et al. 2012; Shen et al. 2021) and select the
and topic diversity (TD) metrics. TC measures the seman- top 10 words with the highest probability under each topic as
tic consistency of top words within each topic. A higher the representative word list to calculate topic quality metrics.
TC metric indicates more relevant keywords within each The comparison results are available in Table 1. Our Con-
topic and better topic interpretability. Following the pre- vNTM outperforms all baselines on two TC metrics (i.e. CV
vious work (Shen et al. 2021), we choose two TC mea- and NPMI) on two datasets, which indicates that with the
surements, CV and normalized pointwise mutual informa- help of formulating the specific multi-turn and multi-role in-
tion (NPMI), to provide a robust evaluation. The NPMI of formation in the conversation, the topics discovered by Con-
the word pair (wi , wj ) is calculated as equation (18). CV vNTM have the best topic interpretability. GNTM achieves
score stands for a widely used Content Vector-based coher- the highest on TD, while ConvNTM is slightly behind. This
ence metric, adopted by (Röder, Both, and Hinneburg 2015). reason may be that GNTM generates words and edges based
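For reference, here is a small self-contained sketch of how NPMI (Eq. 18), topic diversity, and the resulting topic quality score can be computed from top-word lists and per-document word sets. It mirrors the definitions above rather than the gensim routines the experiments actually use, and mean pairwise NPMI stands in for the CV score.

```python
import itertools
import numpy as np

def npmi(w_i, w_j, doc_words, eps=1e-12):
    """Eq. (18): normalized pointwise mutual information over document-level co-occurrence."""
    n = len(doc_words)
    p_i = sum(w_i in d for d in doc_words) / n
    p_j = sum(w_j in d for d in doc_words) / n
    p_ij = sum(w_i in d and w_j in d for d in doc_words) / n
    return np.log((p_ij + eps) / (p_i * p_j + eps)) / -np.log(p_ij + eps)

def topic_coherence(topics, doc_words):
    """Mean NPMI over all top-word pairs of every topic (a stand-in for the CV/NPMI scores)."""
    pairs = [(wi, wj) for t in topics for wi, wj in itertools.combinations(t, 2)]
    return float(np.mean([npmi(wi, wj, doc_words) for wi, wj in pairs]))

def topic_diversity(topics):
    """Percentage of unique words among all top words."""
    all_words = [w for t in topics for w in t]
    return len(set(all_words)) / len(all_words)

# toy usage: 2 topics with 3 top words each, 4 tiny "documents" (conversations)
topics = [["food", "eat", "dinner"], ["work", "job", "office"]]
docs = [{"food", "eat", "dinner", "job"}, {"work", "job", "office"},
        {"food", "dinner"}, {"work", "office", "eat"}]
tc, td = topic_coherence(topics, docs), topic_diversity(topics)
print(tc, td, tc * td)  # topic quality TQ = TC * TD
```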
Baselines. We compare our model with mainstream and state-of-the-art topic models as baselines: 1) LDA (Blei, Ng, and Jordan 2003), the most representative statistical topic model, using Gibbs sampling; 2) GSM (Miao, Grefenstette, and Blunsom 2017), a VAE-based NTM introducing a Gaussian softmax for generating latent variables; 3) ProdLDA (Srivastava and Sutton 2017), an NTM constructing a Laplace approximation to the Dirichlet prior; 4) ETM (Dieng, Ruiz, and Blei 2020), an NTM projecting topics and words into the same embedding space; 5) GNTM (Shen et al. 2021), a recent NTM designing a document graph and introducing it into the generative process of topic modeling. For all baselines, we employ their officially reported parameter settings.

Implementation details. For the multi-role interaction graph, we set the window sizes K_s and K_c to 2. The BoW dictionary size is set to 6,500 for DailyDialog and 7,533 for EmpatheticDialogues. The embedding size and hidden size of the Transformer, LSTM and GCN are all set to 64. For the loss function, w_{kl} and W_{co} are set to 0.01 and 0.05, while the value of d_{co} is determined by the number of topics and the dataset; in our main results, d_{co} is set to 32 for DailyDialog and 31.375 for EmpatheticDialogues. The training process runs for 100 epochs using the Adam optimizer with a base learning rate of 0.001. We implement the experiments on an Nvidia A40 GPU. Our code and data are available at https://github.com/ssshddd/ConvNTM.
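Putting the objectives together, the toy loop below combines a reconstruction term, the co-occurrence loss of Eq. (15), and the schedule of Eq. (17) into the joint loss of Eq. (16), using the reported W_co and Adam learning rate. It optimizes free θ and β tensors instead of the full encoder stack and omits the KL term, so it only illustrates how the terms interact, not the actual training setup.

```python
import torch
import torch.nn.functional as F

V, K = 50, 5
x_bow = torch.randint(0, 2, (8, V)).float()        # 8 toy utterance BoWs
M = x_bow.t() @ x_bow                              # toy co-occurrence counts
theta = torch.randn(8, K, requires_grad=True)      # per-utterance topic logits (stand-in)
beta = torch.randn(K, V, requires_grad=True)       # topic-word logits (stand-in)
optimizer = torch.optim.Adam([theta, beta], lr=0.001)
d_co, W_co = 32.0, 0.05                            # reported values for DailyDialog

for step in range(100):
    x_hat = F.softmax(F.softmax(theta, dim=-1) @ beta, dim=-1)                   # Eq. (12)-like
    L_c = -(x_bow * torch.log(x_hat + 1e-10)).sum()                              # reconstruction
    topic_word = F.softmax(beta, dim=-1)
    L_co = -(M * torch.log(topic_word.t() @ topic_word + 1e-10)).sum()           # Eq. (15)
    w_co = 0.0 if L_co.item() <= d_co else min(1.0, (L_co.item() - d_co) / W_co)  # Eq. (17)
    loss = (1 - w_co) * L_c + w_co * L_co                                        # Eq. (16)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(round(loss.item(), 2), w_co)  # with these toy sizes L_co starts above d_co, so w_co saturates at 1
```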
Main Results

For all baselines, one conversation is treated as one document for topic modeling. Here we set the number of topics to 20, and analyze the impact of the number of topics later. To properly evaluate the learned topics, we follow previous works (Kim et al. 2012; Shen et al. 2021) and select the top 10 words with the highest probability under each topic as the representative word list to calculate the topic quality metrics. The comparison results are available in Table 1. Our ConvNTM outperforms all baselines on the two TC metrics (i.e., CV and NPMI) on both datasets, which indicates that, with the help of formulating the specific multi-turn and multi-role information in the conversation, the topics discovered by ConvNTM have the best topic interpretability.
Dataset        DailyDialog                          EmpatheticDialogues
Method         TD      CV       NPMI      TQ        TD      CV       NPMI     TQ
LDA            0.390   0.4308   -0.0083   0.1680    0.510   0.4230   0.0011   0.2158
GSM            0.445   0.4931   -0.0040   0.2194    0.530   0.4486   0.0055   0.2378
ProdLDA        0.720   0.5363   -0.0007   0.3861    0.736   0.4610   0.0173   0.3393
ETM            0.690   0.5688   0.0364    0.3925    0.713   0.4690   0.0130   0.3342
GNTM           0.810   0.5916   0.0588    0.4792    0.812   0.4809   0.0289   0.3905
ConvNTM        0.750   0.6542   0.0831    0.4907    0.790   0.5136   0.0495   0.4057

Table 1: Comparison results of topic quality (TD, CV, NPMI, TQ) on DailyDialog and EmpatheticDialogues with the number of topics set to 20.
GNTM achieves the highest TD, while ConvNTM is slightly behind. The reason may be that GNTM generates words and edges based on topics at the same time, which may indirectly increase the sparsity among topic proportions. ETM and ProdLDA also have moderate TC metrics, but their TD is relatively low, which is prone to producing redundant topics on the conversation datasets. Comprehensively considering the impact of TC and TD, our ConvNTM, which integrates multiple turns and speaker roles, achieves state-of-the-art performance on the TQ score.

Ablation Study

In order to verify the effectiveness of the key modules of our model, we compare ConvNTM with the following four model variants: 1) ConvNTM (w/o contexts) removes the conversation sequence encoder used to model multi-turn dialogue contexts; 2) ConvNTM (w/o graph) removes the multi-role graph encoder used to model interactions between speakers; 3) ConvNTM (w/o speaker) sets the number of speakers to 1, completely ignoring the effect of the roles; 4) ConvNTM (w/o L_co) removes the loss term L_co for the word co-occurrence objective.

Method                     TD      TC       NPMI     TQ
ConvNTM (w/o contexts)     0.715   0.6240   0.0619   0.4462
ConvNTM (w/o graph)        0.705   0.6282   0.0657   0.4429
ConvNTM (w/o speaker)      0.650   0.6099   0.0548   0.3964
ConvNTM (w/o L_co)         0.780   0.6237   0.0645   0.4865
ConvNTM                    0.750   0.6542   0.0831   0.4907

Table 2: Ablation results for ConvNTM on DailyDialog.

Table 2 shows the comparison results of these ablation methods on DailyDialog. Compared with the full model, both ConvNTM (w/o contexts) and ConvNTM (w/o graph) decrease on TC and TD, indicating that both the multi-turn context structure and the multi-role interaction information of the conversation have a significant impact on topic quality. The performance of ConvNTM (w/o speaker) is further degraded when the speakers' roles are not modeled and the utterances in the conversation are treated as sentences in a general document. This reflects the superiority of ConvNTM over general NTMs for topic modeling on the unique properties of the conversation. In addition, when the word co-occurrence training objective is removed, ConvNTM (w/o L_co) improves slightly on TD, while it drops more significantly on TC, making the overall topic quality worse. This means that considering word co-occurrence information helps improve the coherence and interpretability of the learned topics.

Analysis on Discovered Topic Examples

We also perform a qualitative analysis on the discovered topics, comparing ConvNTM and the strong baseline GNTM. Figure 2 shows several representative topics learned by ConvNTM and GNTM. We display the top 10 words under each topic per line. For our ConvNTM, we can see that the top words in each line have strong associations and focus on a certain topic, which means that each learned topic has good internal coherence. The selected 4 topics can be summarized as food, family & friends, work, and traffic accidents. Meanwhile, ConvNTM has fewer repeated words, indicating less redundancy in the learned topics. For GNTM, in contrast, the topic words are mixed together, and some non-topic words are repeated across different topics. For instance, "people" is shown in multiple topics, and "work" and "family" appear in the same topic in GNTM, which harms the topic diversity, coherence and interpretability.

Figure 2: Visualization of an example of discovered topics (one topic per line). Repeated words are in bold.

Analysis on Number of Topics

Since the number of topics is an important factor of a topic model, we compare the topic quality performance of ConvNTM and several strong baselines with the number of topics varying from 10 to 100; the comparison results are shown in Figure 3. Our ConvNTM achieves the highest TC and TQ under all numbers of topics, which indicates the robustness of our method with respect to topic quality. All models have high topic quality when the number of topics is between 20 and 50. When the number of topics exceeds
Figure 3: Comparison results of the varying number of topics on DailyDialog.
Acknowledgements

This work was supported by National Natural Science Foundation of China (NSFC Grant No. 62122089 and No. 61876196), Beijing Outstanding Young Scientist Program No. BJJWZYJH012019100020098, and Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the "Double-First Class" Initiative, Renmin University of China. We also wish to acknowledge the support provided and contribution made by Public Policy and Decision-making Research Lab of RUC. Rui Yan is supported by Beijing Academy of Artificial Intelligence (BAAI) and CCF-Tencent Rhino-Bird Open Research Fund.

References

Adiwardana, D.; Luong, M.-T.; So, D. R.; Hall, J.; Fiedel, N.; Thoppilan, R.; Yang, Z.; Kulshreshtha, A.; Nemade, G.; Lu, Y.; et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan): 993–1022.

Chen, C.; and Ren, J. 2017. Forum latent Dirichlet allocation for user interest discovery. Knowledge-Based Systems, 126: 1–7.

Cheng, X.; Yan, X.; Lan, Y.; and Guo, J. 2014. BTM: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering, 26(12): 2928–2941.

Dieng, A. B.; Ruiz, F. J.; and Blei, D. M. 2020. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8: 439–453.

Dieng, A. B.; Wang, C.; Gao, J.; and Paisley, J. 2017. TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency. In International Conference on Learning Representations.

Dziri, N.; Kamalloo, E.; Mathewson, K.; and Zaiane, O. 2019. Augmenting Neural Response Generation with Context-Aware Topical Attention. In Proceedings of the First Workshop on NLP for Conversational AI, 18–31. Florence, Italy: Association for Computational Linguistics.

Gu, J.-C.; Li, T.; Liu, Q.; Ling, Z.-H.; Su, Z.; Wei, S.; and Zhu, X. 2020. Speaker-aware BERT for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2041–2044.

He, Z.; Tavabi, L.; Lerman, K.; and Soleymani, M. 2021a. Speaker Turn Modeling for Dialogue Act Classification. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2150–2157. Punta Cana, Dominican Republic: Association for Computational Linguistics.

He, Z.; Tavabi, L.; Lerman, K.; and Soleymani, M. 2021b. Speaker Turn Modeling for Dialogue Act Classification. arXiv preprint arXiv:2109.05056.

Hofmann, T. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50–57.

Holtgraves, T.; Srull, T. K.; and Socall, D. 1989. Conversation memory: The effects of speaker status on memory for the assertiveness of conversation remarks. Journal of Personality and Social Psychology, 56(2): 149.

Jin, Y.; Zhao, H.; Liu, M.; Du, L.; and Buntine, W. 2021. Neural Attention-Aware Hierarchical Topic Model. arXiv preprint arXiv:2110.07161.

Kim, H.; Sun, Y.; Hockenmaier, J.; and Han, J. 2012. ETM: Entity topic models for mining documents associated with entities. In 2012 IEEE 12th International Conference on Data Mining, 349–358. IEEE.

Kim, T.; and Vossen, P. 2021. EmoBERTa: Speaker-aware emotion recognition in conversation with RoBERTa. arXiv preprint arXiv:2108.12009.

Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Lang, K. 1995. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, 331–339. Elsevier.

Larochelle, H.; and Lauly, S. 2012. A neural autoregressive topic model. Advances in Neural Information Processing Systems, 25.

Li, J.; Liao, M.; Gao, W.; He, Y.; and Wong, K.-F. 2016. Topic Extraction from Microblog Posts Using Conversation Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2114–2123. Berlin, Germany: Association for Computational Linguistics.

Li, R.; Lin, C.; Collinson, M.; Li, X.; and Chen, G. 2019. A Dual-Attention Hierarchical Recurrent Neural Network for Dialogue Act Classification. In CoNLL.

Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; and Niu, S. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017).

Li, Z.; Zhang, J.; Fei, Z.; Feng, Y.; and Zhou, J. 2021. Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 128–138. Online: Association for Computational Linguistics.

Lin, T.; Hu, Z.; and Guo, X. 2019. Sparsemax and relaxed Wasserstein for topic sparsity. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 141–149.

Liu, J.; Zou, Y.; Zhang, H.; Chen, H.; Ding, Z.; Yuan, C.; and Wang, X. 2021. Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, 1229–1243. Punta Cana, Dominican Republic: Association for Computational Linguistics.

Ma, X.; Zhang, Z.; and Zhao, H. 2021. Enhanced Speaker-aware Multi-party Multi-turn Dialogue Comprehension. arXiv preprint arXiv:2109.04066.

Miao, Y.; Grefenstette, E.; and Blunsom, P. 2017. Discovering discrete latent topics with neural variational inference. In International Conference on Machine Learning, 2410–2419. PMLR.

Panwar, M.; Shailabh, S.; Aggarwal, M.; and Krishnamurthy, B. 2020. TAN-NTM: Topic attention networks for neural topic modeling. arXiv preprint arXiv:2012.01524.

Qiu, L.; Zhao, Y.; Shi, W.; Liang, Y.; Shi, F.; Yuan, T.; Yu, Z.; and Zhu, S.-C. 2020a. Structured Attention for Unsupervised Dialogue Structure Induction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1889–1899. Online: Association for Computational Linguistics.

Qiu, L.; Zhao, Y.; Shi, W.; Liang, Y.; Shi, F.; Yuan, T.; Yu, Z.; and Zhu, S.-C. 2020b. Structured attention for unsupervised dialogue structure induction. arXiv preprint arXiv:2009.08552.

Rashkin, H.; Smith, E. M.; Li, M.; and Boureau, Y.-L. 2019. Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset. In ACL.

Rehurek, R.; and Sojka, P. 2011. Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, 3(2).

Röder, M.; Both, A.; and Hinneburg, A. 2015. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399–408.

Serban, I. V.; García-Durán, A.; Gulcehre, C.; Ahn, S.; Chandar, S.; Courville, A.; and Bengio, Y. 2016a. Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 588–598. Berlin, Germany: Association for Computational Linguistics.

Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2016b. Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, 3776–3783. AAAI Press.

Shen, D.; Qin, C.; Wang, C.; Dong, Z.; Zhu, H.; and Xiong, H. 2021. Topic Modeling Revisited: A Document Graph-based Neural Network Perspective. Advances in Neural Information Processing Systems, 34: 14681–14693.

Srivastava, A.; and Sutton, C. 2016. Neural variational inference for topic models. arXiv Preprint, 1(1): 1–12.

Srivastava, A.; and Sutton, C. 2017. Autoencoding Variational Inference For Topic Models. In International Conference on Learning Representations.

Sun, Y.; Loparo, K.; and Kolacinski, R. 2020. Conversational structure aware and context sensitive topic model for online discussions. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 85–92. IEEE.

Wallace, B. C.; Trikalinos, T. A.; Laws, M. B.; Wilson, I. B.; and Charniak, E. 2013. A generative joint, additive, sequential model of topics and speech acts in patient-doctor communication. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1765–1775.

Wang, R.; Zhou, D.; and He, Y. 2019. ATM: Adversarial-neural topic model. Information Processing & Management, 56(6): 102098.

Wang, W.; Huang, M.; Xu, X.-S.; Shen, F.; and Nie, L. 2018. Chat more: Deepening and widening the chatting topic via a deep model. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 255–264.

Xing, C.; Wu, W.; Wu, Y.; Liu, J.; Huang, Y.; Zhou, M.; and Ma, W.-Y. 2017. Topic Aware Neural Response Generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, 3351–3357. AAAI Press.

Zhang, Y.; Sun, S.; Galley, M.; Chen, Y.-C.; Brockett, C.; Gao, X.; Gao, J.; Liu, J.; and Dolan, B. 2019. DialoGPT: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.

Zhao, H.; Phung, D.; Huynh, V.; Le, T.; and Buntine, W. 2021. Neural Topic Model via Optimal Transport. In International Conference on Learning Representations.

Zhu, L.; Pergola, G.; Gui, L.; Zhou, D.; and He, Y. 2021. Topic-Driven and Knowledge-Aware Transformer for Dialogue Emotion Detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1571–1582. Online: Association for Computational Linguistics.

Zhu, Q.; Feng, Z.; and Li, X. 2018. GraphBTM: Graph enhanced autoencoded variational inference for biterm topic model. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).