
IKM at SemEval-2017 Task 8: Convolutional Neural Networks for Stance Detection and Rumor Verification

Yi-Chin Chen, Zhao-Yang Liu, Hung-Yu Kao


Department of Computer Science and Information Engineering
National Cheng Kung University
Tainan, Taiwan, ROC
{kimberycc,kjes89011}@gmail.com
hykao@mail.ncku.edu.tw

Abstract

    This paper describes our approach for SemEval-2017 Task 8. We aim at detecting the stance of tweets and determining the veracity of a given rumor. We use a convolutional neural network for short-text categorization with multiple filter sizes. Our approach beats the baseline classifiers on different event data with good F1 scores, and the best of our submitted runs ranks first among all scores on subtask B.

1 Introduction

Rumors in social networks attract wide attention because of the broad reach of online social media. Unconfirmed rumors usually spark discussion before they are verified, creating costs for society and panic among people. Rather than relying on human observers to identify trending rumors, it would be helpful to detect them automatically and limit the damage immediately. However, identifying false rumors early is hard without sufficient evidence such as responses, retweets and fact-checking sites. At this early stage, context-level patterns are more apparent and useful for identifying rumors than propagation structure – in particular, the different patterns of stances among participants (Qazvinian et al., 2011).

Recent research has proposed a four-way classification task to encompass the different kinds of reactions to rumors (Zubiaga et al., 2016). This schema of supporting, denying, querying and commenting (SDQC) is applied in SemEval-2017 Task 8.

In this paper, we describe a system for stance classification and rumor verification in tweets. For the first task, we are given tree-structured conversations in which replies are triggered by a source tweet, and we must categorize each reply into one of the SDQC categories based on the reply–source pair. The second task is rumor verification. Our system targets the closed variant, which means the veracity of a rumor has to be predicted without external data.

This is a challenging NLP task. Statements containing sarcasm, irony and metaphor often require personal experience to infer their broader context (Kreuz and Caucci, 2007). Furthermore, a great deal of background knowledge is required for fact checking (Reichel and Lendvai, 2016).

We develop convolutional neural network models for both tasks. Our system relies on a supervised classifier using text features from different word representation methods: word embeddings learned during training and pre-trained word embedding models such as GloVe (Pennington et al., 2014). The experiment section presents our results and discusses the performance of our work.

2 Related Work

Rumor verification from online social media has developed into a popular subject in recent years. The most common features were proposed by Castillo et al. (2011), who grouped useful features into four categories: message-based, user-based, topic-based, and propagation-based features. However, this approach suffers from data skew because false rumors are less common. Thus, most existing approaches attempt to classify truthfulness using information beyond the content of the posts – propagation structure, for example. Wu et al. (2015) proposed a novel message propagation pattern based on the users who transmit a message. But most of these features become available only after many users have responded to a rumor. Our task, in contrast, is to perform the initial classification using content features, which are available much earlier.

3 System Overview

Our system employs a convolutional neural network mainly inspired by Kim (2014). We chose models by their leave-one-out (LOO) validation performance: we test on each conversation thread after retraining the model on all the other threads. The following sections briefly explain our data preprocessing and our CNN tweet model.

3.1 Data Preprocessing

Before applying the models, we need to normalize the irregular input text. First, we remove URLs and '@' usernames, which do not contribute to sentiment analysis; under the closed variant they are noise that cannot be resolved without external data. We also convert all letters to lower case. Besides these removals, it is worth mentioning that we keep important clues such as hashtags and some special characters. Question marks and exclamation marks, for example, have proven helpful (Zhao et al., 2015).

3.2 Convolutional Model

There are two steps for encoding tweets into the matrices that are passed to the input layer. This model is illustrated in Figure 1.

[Figure 1: Architecture of the Word-Embedding Convolutional Model.]

First, we use word embedding to convert each word in the tweet into a vector. We randomly initialize the word embedding matrix, in which each row is the vector of one vocabulary word, and learn the embedding weights during training. Second, we concatenate these word vectors to produce a matrix representing the sentence. In this matrix, each row represents one word in the tweet:

    t_m = [wv_1; wv_2; ...; wv_n] ∈ R^(n×d)    (1)

where t_m is the word matrix formed by the concatenation of the word vectors wv_1, ..., wv_n.

In the convolutional layer, we use t_m as input and select a window of size y to slide over the matrix. To extract local features in the region of the window, a filter matrix f_m ∈ R^(y×d) is applied – element-wise multiplication followed by a non-linear operation – at every window position:

    el_i = g(f_m · [wv_i; ...; wv_(i+y-1)] + b)    (2)

where f_m is the filter matrix, whose values are learned by the CNN during training, b is the bias term, g is the non-linear function, and el_i is one element of a local feature vector. After sliding the window over the whole matrix, we obtain a local feature vector of the input tweet:

    f_v = [el_1, el_2, ..., el_(n-y+1)]    (3)

where f_v ∈ R^(n-y+1) is a local feature vector with n − y + 1 elements.

To handle sequences of contiguous words that may carry a special meaning in NLP (e.g. "Boston Globe"), we use multiple window sizes to produce different feature vectors; applying different window sizes to capture features is thus similar to using n-grams. Meanwhile, we use different filter matrices to extract
different local features of the tweet in each window.

A pooling layer is then used to condense the output of the convolutional layer. We extract the maximum value from each local feature vector to form a condensed representation: for every local feature vector, only the most important feature is kept and noise is discarded. After the max-pooling operation, we concatenate the maximum values as follows:

    v_t = [max(f_v_1); ...; max(f_v_m)]    (4)

where v_t is the global feature vector representing the tweet.

Through the pooling layer, using the same window sizes and filter matrices on different tweets guarantees a fixed global feature size.

For classification, we feed the global feature vector of the tweet into a fully connected layer to compute the probability distribution, with a softmax activation:

    P(y = i | v_t, b) = exp(w_i^T v_t + b_i) / Σ_{i'} exp(w_{i'}^T v_t + b_{i'})    (5)

where v_t is the input vector and w_{i'} is the i'-th column of the weight matrix W. Given the probabilities over the four classes, we take the class with the maximum value as the label of the input tweet.

4 Tasks and Model Training

During the training phase, our CNN model automatically learns the values of its filters from the task.

In task A, the tweets are classified into four categories: supporting, denying, querying and commenting. We define the ground-truth vector p as a one-hot vector. The word embedding dimension d is 128, the number of filters in the convolutional layers is 128, and the dropout probability is set to 0.5. The Adam optimization algorithm is used to minimize the network's loss function. Moreover, there are three filter region sizes in our system – 2, 3 and 4 – each of which has 2 filters.

To deal with the class imbalance in the data, we applied balanced mini-batching: more than 64% of the instances belong to the commenting class, so we randomly chose 16 instances of each class from the training set, giving 64 instances per batch.

A voting scheme is applied to decrease the uncertainty of training on randomly selected samples. We trained 5 models to predict the same testing data and took a vote for the final prediction. Performing training multiple times independently gives more robust results.

In subtask B, most of the parameter settings are the same as in task A. Because the output classes are rumor and non-rumor, we discard the label "unverified". In addition, we use the probability from Section 3.2 to define the credibility c of our answer. The credibility, normalized to the interval [0, 1], is:

    c = max_{i∈{0,1}} P(y = i | v_t, b) / Σ_{i∈{0,1}} P(y = i | v_t, b)    (6)

5 Evaluation

We conduct experiments using the rumor datasets annotated for stance (Zubiaga et al., 2016). The statistics of the datasets are shown in Table 1. For subtask B, conversation threads are not available to the participants and the use of external data is forbidden in the closed variant.

    Subtask A
    Stance     Support    Deny      Query      Comment
    Training   841 (20%)  333 (8%)  330 (8%)   2734 (65%)
    Testing    94 (9%)    71 (7%)   106 (10%)  778 (74%)

    Subtask B
    Veracity   True       False     Unverified
    Training   127 (47%)  50 (18%)  95 (35%)
    Testing    8 (40%)    12 (60%)  0

    Table 1: Statistics of the datasets for subtasks A and B.

5.1 Baselines

We compare our results with Lukasik et al. (2016) in Table 2. We follow their LOO settings and test on the same dataset. The report includes accuracy (Acc) and the macro-average of F1 scores across all labels (F1) from Lukasik et al.'s baselines.

The results show that our deep learning model is the best method in terms of F1 score; in particular, the CNN model beats all the other methods, while the RNN does not perform well on this task. Another issue is the GloVe embedding: the pre-trained model sometimes lacks vocabulary from new events. Nevertheless, GloVe is still competitive with the CNN method on the Ferguson event.
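To make the pipeline of equations (1)–(5) concrete, the following is a minimal NumPy sketch of encoding and classifying a single tweet. The shapes, the random weights, and the choice of ReLU as the non-linearity g are illustrative assumptions, not the authors' TensorFlow implementation (d is shrunk from 128 to 16 for readability).

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, d = 7, 16          # tweet length n and embedding size d (paper uses d = 128)
window_sizes = [3, 4, 5]    # best combination found in Section 5.2
n_classes = 4               # support / deny / query / comment

# Eq. (1): stack the word vectors into an n x d tweet matrix t_m
t_m = rng.normal(size=(n_words, d))

def relu(x):
    # stand-in for the unspecified non-linearity g in Eq. (2)
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Eqs. (2)-(4): slide one filter per window size, then max-pool
pooled = []
for y in window_sizes:
    f_m = rng.normal(size=(y, d))   # filter matrix, learned during training
    b = 0.1                         # bias term
    # Eq. (2)-(3): local feature vector with n - y + 1 elements
    f_v = np.array([relu(np.sum(f_m * t_m[i:i + y]) + b)
                    for i in range(n_words - y + 1)])
    pooled.append(f_v.max())        # Eq. (4): keep only the strongest feature
v_t = np.array(pooled)              # global feature vector, fixed size

# Eq. (5): fully connected layer + softmax over the four stance classes
W = rng.normal(size=(len(window_sizes), n_classes))
b_out = np.zeros(n_classes)
probs = softmax(v_t @ W + b_out)
label = int(np.argmax(probs))
```

In the real system the embeddings, filter values and softmax weights are learned with Adam rather than sampled randomly, and one model uses many filters per window size instead of one.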
    Event         Ottawa          Ferguson
                  Acc     F1      Acc     F1
    GP            62.28   42.41   64.31   32.9
    Lang. model   53.2    42.66   49.56   34.35
    NB            61.76   40.64   62.05   31.29
    HP Approx.    67.77   32.29   68.44   25.99
    HP Grad.      63.43   42.4    63.23   33.14
    CNN           61.74   44.9    62.31   36.49
    CNN (GloVe)   59.61   38.87   63.03   39.48
    RNN (GloVe)   52.49   38.66   51.49   32.52

    Table 2: Accuracy and F1 scores for different methods across datasets. The upper rows of results are our baselines.

5.2 Window Sizes for Filters

Table 3 lists the results of using different window sizes for the filters in the tweet encoding process. The experiment was performed with the same settings as in Section 5.1 on the Ottawa event. We obtain the best performance with the window-size combination (3, 4, 5). Window sizes 2, 3 and 4 correspond to encoding the bigrams, trigrams and four-grams of the tweets, respectively. We can see that performance decreases slightly as the combination moves away from (3, 4, 5): too few grams can lose features, while too many grams can introduce noise.

    Window sizes   Precision   Recall   F1
    3              0.39        0.42     0.40
    3,4            0.43        0.42     0.43
    2,3,4          0.43        0.40     0.42
    3,4,5          0.45        0.45     0.45
    2,3,4,5        0.44        0.45     0.44

    Table 3: Results of using different window sizes.

5.3 Official Results

Results and task details can be found at http://alt.qcri.org/semeval2017/task8/.

Our submission for subtask A achieves an accuracy of 0.701. The per-class details are given in Table 4. We notice that the comment stance is the easiest to detect, since it makes up a large part of the data. The number of query stances is similar to support and deny, yet query has much better precision and recall because its features are more obvious. Likewise, there are some negative words in the deny stance that serve as features. Extracting features of the supporting stance, however, is challenging, which results in poorer performance.

    Stance    Precision   Recall   Accuracy
    Support   0.19        0.20     0.20
    Deny      0.31        0.07     0.07
    Query     0.58        0.45     0.45
    Comment   0.78        0.85     0.85

    Table 4: Results on the test data for subtask A.

The ranking for subtask B is summarized in Table 5. As we can see, our model performs best among the official scores. Our code is available on GitHub for anyone interested in further exploration (https://github.com/kimber-chen/Twitter-stance-classification-by-TensorFlow).

    Team       Score   RMSE
    DFKI DKT   0.393   0.845
    ECNU       0.464   0.736
    IITP       0.286   0.807
    IKM        0.536   0.763
    NileTMRG   0.536   0.672

    Table 5: Ranking on the test data for subtask B.

6 Conclusion

In this paper, we developed a convolutional neural network system for Twitter stance detection and rumor veracity determination. Compared with the baseline approaches, our system obtains good results on stance detection. In addition, on the test set of SemEval-2017 Task 8B, we ranked 2nd in the official evaluation run.

References

Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. "Information credibility on Twitter." Proceedings of the 20th International Conference on World Wide Web. ACM, 2011.

Ke Wu, Song Yang, and Kenny Q. Zhu. "False rumors detection on Sina Weibo by propagation structures." Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 2015.

Yoon Kim. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).

Michal Lukasik, P. K. Srijith, Duy Vu, Kalina Bontcheva, Arkaitz Zubiaga, and Trevor Cohn. "Hawkes processes for continuous time sequence classification: an application to rumour stance classification in Twitter." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. ACL, 2016.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." In EMNLP, volume 14, pages 1532–1543, 2014.

Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. "Rumor has it: Identifying misinformation in microblogs." Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1589–1599, 2011.

Roger J. Kreuz and Gina M. Caucci. "Lexical influences on the perception of sarcasm." Proceedings of the Workshop on Computational Approaches to Figurative Language. ACL, 2007.

Uwe D. Reichel and Piroska Lendvai. "Veracity computing from lexical cues and perceived certainty trends." arXiv preprint arXiv:1611.02590 (2016).

Zhe Zhao, Paul Resnick, and Qiaozhu Mei. "Enquiring minds: Early detection of rumors in social media from enquiry posts." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.

Arkaitz Zubiaga, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Peter Tolmie. "Analysing how people orient to and spread rumours in social media by looking at conversational threads." PLoS ONE 11.3 (2016): e0150989.

Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, and Michal Lukasik. "Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations." arXiv preprint arXiv:1609.09028 (2016).
