

International Conferences ICT, Society, and Human Beings 2021;

Web Based Communities and Social Media 2021;


and e-Health 2021

USE OF TWO TOPIC MODELING METHODS TO INVESTIGATE COVID VACCINE HESITANCY

Phillip Ma1, Qing Zeng-Treitler2 and Stuart J. Nelson3


George Washington University, SMHS, Biomedical Informatics Center
2600 Virginia Ave, NW, First Floor, Washington DC 20037, USA
1MD
2PhD
3MD, FACP, FACMI

ABSTRACT
COVID vaccine hesitancy in the face of a pandemic is a concern for public health researchers and policy makers who aim
to achieve herd immunity. We investigated the COVID vaccine hesitancy by analyzing Twitter posts (tweets) using two
topic modeling methods: Latent Dirichlet Allocation (LDA) and Top2Vec. Of the two methods, Top2Vec was able to reveal
topics which directly discussed Vaccine Hesitancy and thus offered more utility for this research topic. Common reasons
for vaccine hesitancy found in the dataset included concerns about recent (at the time of tweet collection) news regarding
side effects associated with the COVID vaccines, and a mixture of scientific and government skepticism related to vaccine
development and distribution.

KEYWORDS
Vaccine Hesitancy, Topic Modeling, Twitter

1. INTRODUCTION
Since the approval of the first COVID vaccines, their production and distribution have been expanded to meet
global needs. Initially, vaccinations in many countries were limited to the most vulnerable and frequently
exposed populations (e.g., nursing home residents and healthcare workers) given the limited supply. As a larger
proportion of the population became eligible for the vaccine in the US, the issue of vaccine hesitancy began to
draw increasing attention. People eligible for the vaccine who refuse to receive it or are hesitant about vaccination
pose a challenge to achieving herd immunity. Considerable time and resources have been devoted to
determining the reasons behind this hesitancy and which approaches would be most effective
against it.1
Twitter provides a resource for researchers to explore public discourse on a wide range of topics through
tracking users’ interactions on the social media platform. As of 2021, Twitter had 191 million daily active
users2 and in 2012 its users produced 340 million tweets per day3. Although it is possible to post media such
as images and videos, users still primarily tweet text; researchers therefore commonly use Natural
Language Processing to categorize, analyze, and interpret large sets of tweets. Twitter also has an easily
accessible API that can be used to retrieve live and historical tweets.
Tweets are short texts that were originally limited to 140 characters, but more recently have been expanded
to accommodate 280 characters; the short length poses a challenge for text analysis. Syed et al. (2017) reported
finding higher quality topics with greater coherence in models trained on larger documents (full papers as
opposed to abstracts). Another challenge is that tweets are not written or reviewed through the same rigorous process
as academic papers or news articles. Like other types of social media data, the text of tweets often contains
colloquial language, abbreviations, references to other sources, and otherwise difficult-to-interpret content.

1 https://www.nejm.org/doi/full/10.1056/nejmms2101220
2 https://www.statista.com/statistics/970920/monetizable-daily-active-twitter-users-worldwide/
3 https://blog.twitter.com/official/en_us/a/2012/twitter-turns-six.html

ISBN: 978-989-8704-30-6 © 2021

Topic modeling offers a way to select and analyze a subset of tweets that are determined to be pertinent to the
study.
Researchers have been utilizing social-media datasets throughout the course of the COVID-19 pandemic.
He et al. (2021) ran a variety of sentiment analyses (VADER, TextBlob, Stanford NLP, and Linguistic Inquiry
and Word Count) and keyword filtering on social media datasets to track public discourse regarding
mask-wearing during the COVID-19 pandemic. Ordun et al. (2020) trained an LDA topic model and applied
it to categorize tweets and track discourse regarding COVID-19 over time; they were able to correlate
Twitter activity with specific news events such as press conferences. Importantly, the researchers used
Term Frequency-Inverse Document Frequency (TF-IDF) to vectorize documents and used Uniform
Manifold Approximation and Projection for Dimension Reduction (UMAP) to plot these document vectors to
visualize topics. Jelodar et al. (2020) detailed a method utilizing LDA and Recurrent Neural Networks to find
topics in tweets, and to classify the sentiments of tweets within a given topic.
Latent Dirichlet Allocation (LDA) is a commonly used method for Topic Modeling and has many well
supported libraries that make its implementation relatively simple. LDA is a statistical model for determining
underlying topics across a corpus of documents (Blei et al., 2003). It assumes a generative model where each
document consists of a mixture of topics, and words from the topic distributions are chosen to compose the
document until the predetermined number of words in the document is reached. LDA also assumes that
documents are probability distributions over topics, θ, and that topics are probability distributions over words,
φ. The three hyperparameters used are K, the number of topics, and the two Dirichlet priors, α and β. A high α
implies that each document is likely to contain many topics and a high β implies that each topic contains a
mixture of many words.
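The generative process described above can be made concrete with a small simulation. The following is a minimal sketch using only the Python standard library and is not the code used in this study; the function names, the toy vocabulary, and the parameter values passed to it are illustrative assumptions.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:
        # Very small concentrations can underflow; fall back to a one-hot vector.
        idx = rng.randrange(size)
        return [1.0 if i == idx else 0.0 for i in range(size)]
    return [d / total for d in draws]

def generate_corpus(vocab, num_topics, alpha, beta, num_docs, doc_length, seed=0):
    """Simulate the LDA generative process (all parameter values are illustrative)."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over words, drawn with prior beta.
    topics = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Each document is a probability distribution over topics, drawn with prior alpha.
        theta = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_length):
            # Choose a topic for this word position, then a word from that topic.
            topic = rng.choices(range(num_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=topics[topic])[0])
        docs.append(words)
    return topics, docs
```

Lower values of alpha concentrate each simulated document on fewer topics, and lower values of beta concentrate each simulated topic on fewer words, which mirrors the interpretation of the two priors given above.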
Using LDA to find topics in Twitter datasets presents unique challenges. Yang et al. (2014) offered an
insider’s view of Twitter’s internal Topic Modeling system and addressed the shortcomings of models such as
LDA in discovering topics from a corpus composed of tweets, given their small size and paucity
of topics. Resnik et al. (2015) detailed various approaches to address the shortcomings of LDA in finding
coherent topics in Twitter datasets, including supervised and supervised-nested LDA models, which
appeared to produce more coherent topics than ‘vanilla’ LDA. Weng et al. (2010) described a similar problem
in producing coherent topics using LDA on tweets, and combined tweets from the same
user into a single document to explore topics that individual users address often. Steinskog et al. (2017)
built on that work by comparing methods that aggregate tweets from single users into individual documents with
methods that aggregate tweets sharing the same hashtag into individual documents. Their results were
promising for LDA models trained on these aggregated documents. Surian et al. (2016) compared
LDA head-to-head with Dirichlet Multinomial Modeling (DMM) and reported appreciably more interpretable
results with DMM on their Twitter dataset examining sentiment related to the HPV vaccine.
An alternative to the LDA method is the Top2Vec model (Angelov, 2020) that utilizes Distributed Bag of
Words (DBOW) in doc2vec to create jointly embedded document and word vectors. Angelov assumes that the
semantic space created is a continuous representation of topics. In this semantic space the document vector
represents the topic of the document, with the word vectors nearest to the document vectors being the most
representative of that document’s topic. As such, documents that are clustered together are assumed to be of
the same topic and a centroid of these documents can be calculated. Therefore, the number of ‘dense’ document
vector areas represents the number of prominent topics. In order to overcome issues related to computational
load and the probability that documents will not cluster densely enough, the model first performs dimension
reduction of the document vectors with Uniform Manifold Approximation and Projection for Dimension
Reduction (UMAP). The model uses Hierarchical Density-Based Spatial Clustering of Applications with Noise
(HDBSCAN) to find these dense areas and calculate the subsequent topic vectors.
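The relationship between a cluster of document vectors, its centroid, and the nearest word vectors can be illustrated with a toy example. The sketch below assumes that clustering has already been performed (the UMAP and HDBSCAN steps are not reproduced), uses hypothetical two-dimensional vectors, and is not the Top2Vec implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topic_vector(doc_vectors):
    """Centroid (arithmetic mean) of the document vectors in one cluster."""
    n = len(doc_vectors)
    return [sum(column) / n for column in zip(*doc_vectors)]

def nearest_words(topic_vec, word_vectors, top_n=3):
    """Words ranked by cosine similarity to the topic vector, most similar first."""
    ranked = sorted(
        word_vectors,
        key=lambda w: cosine_similarity(topic_vec, word_vectors[w]),
        reverse=True,
    )
    return ranked[:top_n]

# Hypothetical jointly embedded vectors for one dense cluster of documents.
cluster = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
words = {"clot": [1.0, 0.1], "blood": [0.9, 0.2], "fda": [0.1, 1.0], "the": [0.5, 0.5]}

centroid = topic_vector(cluster)
top_words = nearest_words(centroid, words, top_n=2)
```

In this toy space the two words closest to the centroid are "clot" and "blood", while a word such as "the", placed between clusters, ranks lower.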
Words that appear in many documents are necessarily placed in a region of semantic space that is roughly equidistant
from all the documents that contain them, and are assumed to lie between document clusters rather than
within them. These words are often stop words; thus, Top2Vec does not require stop-word removal.
Similarly, words that are semantically similar, in that they are used in similar frequencies in their associated
topics, will appear close to one another in semantic space; this includes different stems of the same word. Thus,
Top2Vec also does not require stemming or lemmatizing before the model is trained.
In this study, we have applied LDA and Top2Vec to a set of tweets to discover topics related to COVID
vaccine hesitancy. The results from the two methods were manually reviewed and compared.


2. BODY

2.1 Methods
The dataset consists of the full text of tweets (including hashtags) and was obtained from the
COVID-19-TweetIDs repository4 created and maintained by Chen et al. who utilized Twitter’s API to stream
tweets which were related to the COVID-19 pandemic and stored them in a repository as lists of tweet IDs.
The researchers provided a ‘rehydrator’ script to retrieve tweets by tweet ID using the Twarc5
library, which provides access to Twitter’s official API through a Python wrapper. Our study used a subset of
tweets from January 1, 2021 to March 14, 2021. The dataset was filtered using keywords to obtain tweets
related to vaccination: ‘vaccine’, ‘moderna’, ‘pfizer’, ‘astrazeneca’, ‘astra’, ‘zeneca’, ‘j&j’, ‘jj’, and ‘johnson’
which yielded 3,403,166 tweets. Of those tweets, the same random sample of 900,000 tweets was used to train
both models; models using larger datasets failed during training due to computing constraints. Aside from
timestamps, the remainder of the associated metadata was discarded in accordance with Twitter’s Privacy
Policies6 regarding publicly available tweets and identifying data.
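The filtering and sampling steps can be sketched as follows. This is a minimal illustration rather than the code used in the study: the matching rule (case-insensitive substring match), the function names, and the seed value are assumptions. Fixing the seed is one way to ensure that the same random sample is supplied to both models.

```python
import random

# Keywords listed in the text above.
KEYWORDS = ["vaccine", "moderna", "pfizer", "astrazeneca", "astra",
            "zeneca", "j&j", "jj", "johnson"]

def mentions_vaccine(text, keywords=KEYWORDS):
    """Case-insensitive substring match (an assumed matching rule)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

def filter_and_sample(tweets, sample_size, seed=42):
    """Keep tweets that match a keyword, then draw a reproducible random sample."""
    matched = [t for t in tweets if mentions_vaccine(t)]
    if len(matched) <= sample_size:
        return matched
    rng = random.Random(seed)  # a fixed seed yields the same sample on every run
    return rng.sample(matched, sample_size)
```

Substring matching is permissive: a short keyword such as "jj" may also match unrelated text, so a stricter token-level rule may be preferable in practice.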
An LDA model was trained on the selected data. The data were first ‘cleaned’ through removal of
punctuation, stop words, links, and usernames. The dataset was then run through NLTK’s lemmatization
method, which is necessary in LDA models so that different ‘stems’ of the same root word do not appear as
separate words. A smaller subset of the data was used to facilitate hyperparameter tuning;
a grid search was performed in which 60 different combinations of hyperparameters (K, α, and β) were used as
inputs, and the combination that yielded the highest coherence score was selected. The number of topics (K)
that resulted in the highest coherence score was 20, so the final LDA model was trained to yield 20
topics. The α and β used in the final model were 0.01 and 0.31,
respectively.
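The cleaning step can be sketched with the Python standard library. The example below is an illustration under stated assumptions and not the pipeline used in the study: the stop-word list is a small hypothetical stand-in, the regular expressions are assumed patterns for links and usernames, and the NLTK lemmatization step is omitted because it requires external data.

```python
import re
import string

# Small illustrative stop-word list; an assumption, not the list used in the study.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "my", "i", "for"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_PATTERN = re.compile(r"@\w+")                # usernames
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Remove links, usernames, punctuation, and stop words; return lowercase tokens."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    return [token for token in text.split() if token not in STOP_WORDS]
```

Links and usernames are removed before punctuation so that the patterns can still recognize them. Because "#" is treated as punctuation, the text of a hashtag is retained as an ordinary token.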
Coherence score calculation is a feature of the Gensim Topic Modeling library7, which supports
a variety of coherence measures across a selection of its supported topic models. The default measure
is Cv, which acts as a proxy for topic quality and measures the semantic similarity between the top words in
each topic generated by the LDA model; the assumption being that words with similar meaning tend to occur
together in similar contexts, and that a topic representing this accurately would contain semantically similar
words among its top scored representatives. Cv is commonly used as a coherence score as it has performed
well in comparison to other coherence models in scoring how interpretable topics are likely to be by human
readers as described by Röder et al. in 2015 and Syed et al. in 2017.
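The intuition behind coherence scoring, that the top words of a good topic tend to occur together, can be shown with a simplified measure. The sketch below computes the mean normalized pointwise mutual information (NPMI) over pairs of top words using document-level co-occurrence. It is a stand-in for illustration only and is not Cv, which additionally relies on sliding windows and an indirect similarity step; the function name and toy documents are assumptions.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of top words, using document-level co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every one of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1.0:
            scores.append(1.0)   # the pair co-occurs in every document: maximum NPMI
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    if not scores:
        raise ValueError("at least two top words are required")
    return sum(scores) / len(scores)
```

Scores range from -1 (the words never co-occur) to 1 (the words always co-occur). In a grid search of the kind described above, a score of this type would be computed for each trained model and the combination with the highest value retained.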
A Top2Vec model was trained on the same selected dataset which initially yielded 3918 topics. The package
includes methods for reducing the number of topics. In this use case, it was helpful to reduce the number of
topics in order to make inspecting the tweets more feasible. Topic reduction is not permanent; the original
structure is preserved and can be queried to identify which original topics comprise the reduced topics for
further inspection. The package combines topics into the semantically most similar topic until the desired
number of topics is reached.
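The merging procedure can be sketched as a simple loop. The example below assumes that the smallest topic is merged first, that the merged topic vector is a size-weighted average, and that similarity is measured by cosine similarity; these details, the function names, and the toy values are assumptions for illustration, and the sketch is not the package's implementation. Membership lists are retained so that the original topics behind each reduced topic remain available for inspection.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reduce_topics(topic_vectors, topic_sizes, num_topics):
    """Merge topics until num_topics remain; return (vectors, sizes, members)."""
    if num_topics < 1:
        raise ValueError("num_topics must be at least 1")
    vectors = [list(v) for v in topic_vectors]
    sizes = list(topic_sizes)
    members = [[i] for i in range(len(vectors))]  # original topic ids per reduced topic
    while len(vectors) > num_topics:
        # Pick the smallest topic and its most similar remaining topic.
        smallest = min(range(len(sizes)), key=sizes.__getitem__)
        others = [i for i in range(len(vectors)) if i != smallest]
        target = max(others, key=lambda i: cosine_similarity(vectors[smallest], vectors[i]))
        # Merge: size-weighted average of the two topic vectors.
        total = sizes[smallest] + sizes[target]
        vectors[target] = [
            (a * sizes[target] + b * sizes[smallest]) / total
            for a, b in zip(vectors[target], vectors[smallest])
        ]
        sizes[target] = total
        members[target].extend(members[smallest])
        for seq in (vectors, sizes, members):
            del seq[smallest]
    return vectors, sizes, members
```

The inputs are copied rather than modified, which reflects the point made above that reduction does not destroy the original topic structure.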

2.2 Results
We manually identified four topics obtained from the LDA model that were the most relevant to vaccine hesitancy.
A sample of topics is shown in Table 1.

4 https://github.com/echen102/COVID-19-TweetIDs
5 https://github.com/DocNow/twarc
6 https://developer.twitter.com/en/developer-terms/agreement-and-policy
7 https://radimrehurek.com/gensim/models/ldamodel.html


Table 1. Four examples of topics related to vaccine hesitancy found using the LDA model. The first column lists labels
assigned by the researchers as human interpretations of the topics. The second column lists the top 20 words in order of
descending probability

Topic | Top 20 most probable words
AZ/Oxford Vaccine Side Effects | vaccine, johnson, covid, astrazeneca, country, biden, european, help, news, thank, suspend, ensure, blood, emergency, clot, coronavirus, dose, germany, million, france
Effectiveness vs. New Strains | vaccine, covid, variant, country, pandemic, south, africa, global, african, strain, access, need, develop, effective, mutation, coronavirus, virus, nation, world, vaccination
Side Effects/Vaccine Reactions | johnson, vaccine, covid, effective, reaction, single, severe, shot, prevent, boris, safe, report, death, coronavirus, people, adverse, allergic, disease, pfizer, case
Long-term effects of Vaccine | vaccine, covid, effect, long, mrna, term, people, chance, risk, know, catch, year, protect, short, think, death, like, virus, time, wait

We then manually identified eight topics obtained from the Top2Vec model that were the most relevant to vaccine
hesitancy. A sample of topics is shown in Table 2.
Table 2. Eight examples of topics related to vaccine hesitancy found using the Top2Vec model. The first column lists
labels assigned by the researchers as human interpretations of the topics. The second column lists the top 20 words in
order of descending probability

Topic | Top 20 most probable words
Alternative COVID treatments, rushed development | hcq, ivermectin, cheap, vitamin, treatment, cure, experimental, treat, proven, pharma, untested, drugs, eua, drug, treating, rushed, widely, pushing, expensive
Belief that vaccine is experimental/untested | traditional, guinea, genetic, technology, mrna, rna, gene, untested, experimental, cells, modified, vector, liability, proven, injected, therapy, toxic
Low Personal Risk | survival, rate, experimental, chances, yourself, subject, chance, untested, unknown, catching, liability, term, higher, mortality, pose, therapy, gene, protect, healthy, injuries
AZ/Oxford Vaccine Side Effects | clots, clot, blood, suspended, clotting, astrazeneca, suspends, denmark, suspend, norway, reports, netherlands, european, temporarily, france, halted, evidence, fears, germany, italy
Skepticism based in semantics over FDA Approval vs EUA | fda, approved, authorization, emergency, drug, authorized, use, eua, johnson, approval, liability, approves, licensed, trials, authorizes, cleared, single, regulator, prevent, regulators
Side Effects related to autoimmune reactions | autoimmune, trigger, disease, protein, cells, immune, causes, mrna, clotting, rna, mmr, lung, cause, diseases, spike, cov, anaphylaxis, evidence, condition, sars
Belief that vaccines are a bioweapon | realise, ffs, bio, weapon, sound, transmit, yourself, lockdowns, mutates, eliminate, toxic, logic, forever, humans, cure, unknown, gonna, understand, biotech, viruses
Belief that vaccine is experimental/untested | guinea, pigs, poison, thailand, indians, trial, traditional, spreads, theirs, adverse, clinical, netherlands, experimental, omg, southern, forced, rushed, untested, covaxin, trials

2.3 Discussion
In this study, we demonstrate the use of two different topic models to identify topics related to vaccine
hesitancy in a Twitter dataset. Our results demonstrate that Top2Vec was able to extract topics from the
Twitter dataset that were more relevant to the needs of this particular study.
The LDA model revealed four topics (Table 1) determined by the researchers to be relevant to vaccine
hesitancy. The first of these topics references the AstraZeneca/Oxford and the Johnson and Johnson vaccines,
their supposed side effects, and subsequent suspension in some countries8. The second of these topics addresses
the effectiveness of vaccines against emerging strains of COVID9. The last two topics are concerned with side
effects in the short and long term, respectively. Although the topics identified could all address reasons for
hesitancy among Twitter users, the topics did not appear to represent overt hesitance or hostility towards
vaccination; they may represent discussion of news topics. Notably, the fourth topic includes the word ‘wait’ as
the 20th most probable word in the topic, which may suggest that users were hesitant to receive a vaccine because
of long-term side effects, but it is not as obviously about hesitancy as the topics discussed later.

8 https://www.bmj.com/content/372/bmj.n699
9 https://jamanetwork.com/journals/jama/fullarticle/2777785


The Top2Vec model revealed eight topics (Table 2) determined by the researchers to be most relevant to
vaccine hesitancy. One of these topics was similar to a topic found by the LDA model, referencing the
AstraZeneca/Oxford vaccine side effects and suspension. The remainder of the topics were determined to
represent tweets which were more overtly hesitant towards COVID vaccination. One topic (Alternative COVID
treatments…) contains words such as “untested”, “experimental” and “rushed” in addition to mentioning
“ivermectin”, “hcq” and “vitamin”. This was interpreted as skepticism of vaccines due to their expedited
approval process, combined with the promotion of alternative treatments for COVID. Two topics referenced “guinea”, “pig” and
“experimental”, which was interpreted as a feeling that Twitter users would be used as experimental subjects to test safety
or efficacy by receiving the vaccine. One topic (Low Personal Risk) appeared to reference survival rates or low
mortality rates of COVID infections; its authors appeared to believe that, because of their perceived low risk,
receiving a vaccine was not necessary or not worth the risk. There were multiple topics (only one is shown in
Table 2) which referenced the approval status of the vaccines; these tweets claimed that an Emergency Use Authorization
(EUA) does not constitute FDA approval, and their authors were therefore hesitant to receive the vaccine. One topic
(Bioweapon) appears to level the accusation that vaccines are “Bioweapons” and are “toxic”.

2.4 Limitations
The collected dataset was a customized subset of the COVID-19 Tweets dataset. It would have been preferable
to stream tweets directly from the Twitter API using a custom filter. Starting in December of 2020, Twitter began
censoring tweets related to COVID vaccine skepticism10,11. As a result, a difficult-to-quantify number
of tweets and amount of Twitter traffic surrounding COVID vaccine skepticism was likely not collected, which possibly
lowers the proportion of vaccine skepticism represented in this dataset. Additionally, the topics which
represented negative sentiment on COVID vaccination were manually selected and were limited by the
subjectivity of human interpretation.

3. CONCLUSION
Top2Vec offers a novel method of topic modeling which can aid researchers in particular use cases. In this
instance, Top2Vec was able to provide appreciably differentiated topics when compared to LDA and was able
to identify the topic of interest to the researchers. Additionally, Top2Vec offers a simplified data-cleaning pipeline
by obviating the need for stemming and stop-word removal. Vaccine uptake is an important factor in combating
the COVID-19 pandemic, but concerns remain about populations who are hesitant to receive the vaccine even
when it is available. By using Topic Modeling on a large Twitter dataset, the study was able to identify tweets
which were most likely to express vaccine hesitancy, and to explore the topics that most closely represented
those tweets. Common reasons for vaccine hesitancy found in the dataset through examination of
negative-appearing topics included concerns about safety, skepticism about the development and approval of the
vaccines, vaccine efficacy, and judgements of personal risk from COVID infection. Efforts aimed at building
trust around COVID vaccination may do well to address these concerns.

ACKNOWLEDGEMENT
We would like to acknowledge the leadership and staff of the Biomedical Informatics Center at the School of
Medicine and Health Sciences at George Washington University.

10 https://blog.twitter.com/en_us/topics/company/2020/covid19-vaccine.html
11 https://blog.twitter.com/en_us/topics/company/2021/updates-to-our-work-on-covid-19-vaccine-misinformation.html


REFERENCES
Angelov, D. (2020). Top2Vec: Distributed Representations of Topics. http://arxiv.org/abs/2008.09470
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3,
993–1022.
Chen, E., Lerman, K., & Ferrara, E. (2020). Tracking Social Media Discourse About the COVID-19 Pandemic:
Development of a Public Coronavirus Twitter Data Set. JMIR Public Health and Surveillance, 6(2), e19273.
https://doi.org/10.2196/19273
He, L., He, C., Reynolds, T. L., Bai, Q., Huang, Y., Li, C., Zheng, K., & Chen, Y. (2021). Why do people oppose mask
wearing? A comprehensive analysis of U.S. tweets during the COVID-19 pandemic. Journal of the American Medical
Informatics Association. https://doi.org/10.1093/jamia/ocab047
Jelodar, H., Wang, Y., Orji, R., & Huang, S. (2020). Deep Sentiment Classification and Topic Discovery on Novel
Coronavirus or COVID-19 Online Discussions: NLP Using LSTM Recurrent Neural Network Approach. IEEE Journal
of Biomedical and Health Informatics, 24(10), 2733–2742. https://doi.org/10.1109/JBHI.2020.3001216
Ordun, C., Purushotham, S., & Raff, E. (2020). Exploratory Analysis of Covid-19 Tweets using Topic Modeling, UMAP,
and DiGraphs. http://arxiv.org/abs/2005.03082
Resnik, P., Armstrong, W., Claudino, L., Nguyen, T., Nguyen, V.-A., & Boyd-Graber, J. (2015). Beyond LDA: Exploring
Supervised Topic Modeling for Depression-Related Language in Twitter. Proceedings of the 2nd Workshop on
Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, 99–107.
https://doi.org/10.3115/v1/W15-1212
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the Space of Topic Coherence Measures. Proceedings of the
Eighth ACM International Conference on Web Search and Data Mining, 399–408.
https://doi.org/10.1145/2684822.2685324
Steinskog, A., Therkelsen, J., & Gambäck, B. (2017). Twitter Topic Modeling by Tweet Aggregation. Proceedings of the
21st Nordic Conference of Computational Linguistics, May, 77–86.
http://cran.uvigo.es/web/packages/stm/vignettes/stmVignette.pdf
Surian, D., Nguyen, D. Q., Kennedy, G., Johnson, M., Coiera, E., & Dunn, A. G. (2016). Characterizing Twitter Discussions
About HPV Vaccines Using Topic Modeling and Community Detection. Journal of Medical Internet Research, 18(8),
e232. https://doi.org/10.2196/jmir.6045
Syed, S., & Spruit, M. (2017). Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet
Allocation. 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 165–174.
https://doi.org/10.1109/DSAA.2017.61
Weng, J., Lim, E.-P., Jiang, J., & He, Q. (2010). TwitterRank: Finding topic-sensitive influential Twitterers. Proceedings
of the Third ACM International Conference on Web Search and Data Mining - WSDM ’10, 261.
https://doi.org/10.1145/1718487.1718520
Yang, S. H., Kolcz, A., Schlaikjer, A., & Gupta, P. (2014). Large-scale high-precision topic modeling on twitter.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1907–1916.
https://doi.org/10.1145/2623330.2623336
