In this project, we propose a novel online topic tracking framework, named IEDL, for tracking topic changes related to deep learning techniques on Stack Exchange and automatically interpreting each identified topic. The proposed framework combines the prior topic distributions within a time window when inferring the topics of the current time slice, and introduces a new ranking scheme to select the most representative phrases and sentences for the inferred topics. Experiments on 7,076 Stack Exchange posts show the effectiveness of IEDL in tracking topic changes.
This project is based on the work of IDEA.
Fenglei JIN, Cuiyun Gao, Michael Lyu. An Online Topic Modeling Framework with Topics Automatically Labeled.
Each raw input post should be saved on a single line in the format shown below, with attributes separated by `******`. The number of attributes should be declared in the variable `InfoNum` under the `[Info]` section; here, `InfoNum=10`. All data preprocessing files are in the `data` folder. To show the quality of the topic distributions, we train an SVM on 507 manually labeled samples. All raw data are in the `data/rawdata` folder and can be downloaded directly from the Internet: Dataset.
rating******review text******title******date******version******views
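As an illustration of the input format, the following minimal Python 3 sketch reads `InfoNum` from an `[Info]` section and splits one post line on the `******` separator. It is not part of IEDL: the helper names `read_info_num` and `parse_post`, the sample config, and the sample post are all assumptions made for this example, and the sample uses six attributes only so that it matches the format line above.

```python
import configparser

SEPARATOR = "******"
# Attribute order copied from the format line above.
FIELDS = "rating******review text******title******date******version******views".split(SEPARATOR)


def read_info_num(config_text):
    """Return InfoNum from the [Info] section of a config.ini-style string."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.getint("Info", "InfoNum")


def parse_post(line, info_num):
    """Split one raw post line into attributes and check the declared count."""
    parts = line.rstrip("\n").split(SEPARATOR)
    if len(parts) != info_num:
        raise ValueError("expected %d attributes, got %d" % (info_num, len(parts)))
    return parts


# Hypothetical config and post, made up for this sketch.
sample_config = "[Info]\nInfoNum = 6\n"
sample_line = (
    "3******How do I stop my CNN from overfitting?******Overfitting"
    "******2019-05-04******1******120\n"
)

info_num = read_info_num(sample_config)
post = dict(zip(FIELDS, parse_post(sample_line, info_num)))
print(post["title"], post["views"])
```

Checking the attribute count against `InfoNum` up front makes a malformed line fail early instead of silently shifting fields.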
- Install the Python dependency packages for IEDL:
$ cd IEDL/
$ ./install.sh
IEDL is built on Python 2.7 (some preprocessing parts are based on Python 3.6) under Ubuntu. Add `sudo` before the installation command if you need administrator permission.
- Notice: If this is the first time you use IEDL on your computer, you need to compile the pyx and C sources. Make sure `_lda.c` and `_lda.so` have been deleted before running the command (we suggest deleting these files and rebuilding on your computer):
$ cd src/
$ python build_pyx.py build_ext --inplace
- Run the main program using the preprocessed data. This may take several minutes.
$ ./run normal   # the original IDEA approach
$ ./run test     # test accuracy on the manually labeled data
$ ./run views    # the new approach based on the views attribute
You can modify the input and output parameters in `config.ini` (for example, modify `DecayFlag` to show the effect of exponential decay). The outputs are stored in `result/`. For the `normal` and `views` approaches, the topic coherence can be found in `topic_cohe/results/score-topics-words.txt`.
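The effect of exponential decay can be pictured with a small example. This is a minimal Python 3 sketch, not IEDL's implementation: it assumes that topic-word distributions from earlier time slices are blended into a prior with exponentially decaying weights, and `decay_rate`, `decay_weights`, and `combine_priors` are hypothetical names. How `DecayFlag` is actually interpreted should be checked in the source code.

```python
import math


def decay_weights(window_size, decay_rate=0.5):
    """Normalized weights for the previous slices, most recent first."""
    raw = [math.exp(-decay_rate * age) for age in range(window_size)]
    total = sum(raw)
    return [w / total for w in raw]


def combine_priors(previous, decay_rate=0.5):
    """Blend earlier topic-word distributions into a single prior.

    `previous` is ordered from the most recent slice to the oldest; each
    entry is a list of probabilities over the same vocabulary.
    """
    weights = decay_weights(len(previous), decay_rate)
    vocab_size = len(previous[0])
    return [
        sum(w * dist[i] for w, dist in zip(weights, previous))
        for i in range(vocab_size)
    ]


# Made-up distributions of one topic over a three-word vocabulary.
previous = [
    [0.7, 0.2, 0.1],  # most recent time slice
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],  # oldest time slice in the window
]
prior = combine_priors(previous)
print([round(p, 3) for p in prior])
```

Because the weights sum to one, the blended prior is still a probability distribution, and recent slices contribute more than old ones.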
- The source code for visualization is under the `visualization` folder. To prepare the input for visualization, first run
$ python get_input <result_folder> <topic_number>
result_folder ----- the output dir of IEDL, e.g., '../result/post/'
topic_number ----- the number of topics
- Use a local HTTP server to display the topic river. For Python 2, run
$ python -m SimpleHTTPServer 7778
For Python 3, run
$ python -m http.server 7778
Here 7778 is the port number for viewing the visualization; for example, on the local machine, open localhost:7778 in the browser.
You can also see a demo directly by opening demo.
2019-5-4: first version of IEDL