Proceedings of the Eighth AAAI Conference on Human Computation and Crowdsourcing (HCOMP-20)

Modeling Annotator Perspective and Polarized Opinions to Improve Hate Speech Detection
Sohail Akhtar, Valerio Basile, Viviana Patti
Computer Science Department
University of Turin, Italy
sohail.akhtar@edu.unito.it, {basile, patti}@di.unito.it

Abstract

In this paper we propose an approach to exploit the fine-grained knowledge expressed by individual human annotators during a hate speech (HS) detection task, before the aggregation of single judgments in a gold standard dataset eliminates non-majority perspectives. We automatically divide the annotators into groups, aiming at grouping them by similar personal characteristics (ethnicity, social background, culture, etc.). To adopt a multi-lingual perspective, we performed classification experiments on three different Twitter datasets in English and Italian. We created different gold standards, one for each group, and trained a state-of-the-art deep learning model on each of them, showing that supervised models informed by different perspectives on the target phenomena outperform a baseline represented by models trained on fully aggregated data. Finally, we implemented an ensemble approach that combines the single perspective-aware classifiers into an inclusive model. The results show that this strategy further improves the classification performance, especially with a significant boost in the recall of HS prediction.

Introduction

Hate Speech is a special type of abusive language. It has specific targets which are victimized based on their personal characteristics or demographic background, such as race, ethnicity, religion, color, sexual orientation, or other similar factors (Nobata et al. 2016). Researchers who recently started tackling hate speech detection from a natural language processing perspective are designing operational frameworks for HS, annotating corpora with several semantic frameworks, and building automatic classifiers based on supervised machine learning models (Fortuna and Nunes 2018; Schmidt and Wiegand 2017; Poletto et al. 2020).

Most datasets for HS detection are annotated by humans, often relying on crowdsourcing, and typically no background information about the workers is provided. Given the highly subjective nature of HS, such datasets tend to exhibit low agreement by traditional measures. Moreover, aggregation by majority makes it difficult to model the different perspectives of the annotators.

We propose a methodology to automatically model the different perspectives that annotators may adopt towards certain highly subjective phenomena, i.e., abusive language and hate speech. In our method, supervised machine learning models are trained to learn different points of view of the human annotators on the same data, in order to subsequently take them into account at prediction time. In this study, we try to answer the following research questions: (RQ1) Does an automatic partition of the annotators based on the polarization of their judgments reflect different perspectives on hate speech? (RQ2) Are models trained to represent such perspectives effective in HS detection tasks? In order to test these research questions, we experimented on three datasets in English and Italian.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Work

Hate speech is a complex phenomenon which depends on the relationships between various communities and social groups. Poletto et al. (2020) mention several definitions of hate speech, although there is no consensus on one formal definition (Ross et al. 2016). It is therefore difficult to develop automatic systems that determine whether a message contains any fragment of hate speech. A recent literature survey on hate speech detection (Fortuna and Nunes 2018) addresses many issues faced by researchers, including the scarcity of high-quality datasets available as benchmarks for hate speech detection tasks.

There are approaches that measure the level of controversy by analyzing user opinions on controversial topics. Soberón et al. (2013) highlight the importance of disagreement in data annotation, treating it as a useful resource rather than noise in gold standard data. The majority of computational approaches to hate speech detection are based on supervised machine learning, including deep learning, but also Support Vector Machines, Random Forest, Logistic Regression, and Decision Trees (Fortuna and Nunes 2018; Schmidt and Wiegand 2017). In particular, the state of the art is represented by deep learning models based on Transformer networks, pre-trained on large amounts of unlabelled data and fine-tuned on task-specific annotated corpora.

Recently, neural language models have gained popularity. These models have been effectively applied to many NLP tasks, showing substantial performance improvements (Peters et al. 2018). Some of these pre-training-based language models follow a feature-based approach, in which the pre-trained representations are used as extra features by task-specific architectures, as in ELMo (Peters et al. 2018).

Howard and Ruder (2018) proposed the ULMFiT model for text classification tasks, achieving state-of-the-art performance on several benchmarks.

BERT (Devlin, Chang, and Toutanova 2019) is one of the best known Transformer-based models, employing a bidirectional approach that achieved state-of-the-art performance in many NLP tasks, in particular text classification (Yu, Jindian, and Luo 2019). BERT trains bidirectional language representations from unlabelled text, considering both left and right contexts in a layered architecture (Munikar, Shakya, and Shrestha 2019).

Method

Our proposed method is based on the assumption that a set of annotators can be divided into groups based on characteristics such as cultural background, common social behaviour, and other similar factors. The idea is to investigate how these characteristics can influence the opinions that annotators express while annotating HS data. The method works in two steps, and it is applied to an annotated dataset for which the single, pre-aggregated annotations are known:

1. We divide the annotators into groups (two, in this iteration of the study) by using a numeric index measuring the polarization of the judgments.

2. Different gold standard datasets are compiled following the division of the annotators, each used to train a different classifier.

The original and group-based models are tested against the same test set for comparison. The steps of the method are detailed in the rest of this section.

Division of the Annotators into Groups

The first step of our method consists of automatically dividing the set of annotators into groups. The group split is found by an exhaustive search of the possible annotator partitions, selecting the partition which maximizes the average Polarization index (P-index). This metric, introduced in (Akhtar, Basile, and Patti 2019), leverages the information at the level of single annotations, measuring the polarization of the annotations on each instance individually. The measurement of the P-index of a message is a three-step process. First, the annotators are divided into groups (in this study, we limit the possible partitions to two groups). Second, the agreement between the annotations of each group on each instance is measured using the normalized χ² statistic, which measures how far their distribution deviates from a uniform distribution. Finally, the P-index is computed as a function of the overall agreement and each of the per-group agreement values.
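To make the procedure concrete, the sketch below implements the exhaustive bi-partition search on a tweets-by-annotators matrix of binary labels (a data layout we assume here). The exact formula combining overall and per-group agreement is the one from Akhtar, Basile, and Patti (2019); since this paper does not restate it, the polarization function below (mean per-group agreement minus overall agreement) is a simplified stand-in, not the published definition.

```python
from itertools import combinations
import numpy as np

def norm_chi2(labels):
    """Normalized chi-squared agreement on one instance: 0 for an even
    split of binary judgments, 1 for a unanimous group."""
    counts = np.bincount(labels, minlength=2)
    expected = len(labels) / 2.0
    chi2 = ((counts - expected) ** 2 / expected).sum()
    return chi2 / len(labels)  # chi2 is at most len(labels) for two classes

def polarization(row, group_a, group_b):
    """Simplified P-index of one instance: high when each group agrees
    internally while overall agreement is low (stand-in for the published formula)."""
    within = (norm_chi2(row[group_a]) + norm_chi2(row[group_b])) / 2.0
    return within - norm_chi2(row)

def best_bipartition(matrix):
    """Exhaustively search the two-group splits of the annotators and
    return the one maximizing the average P-index over all instances."""
    n = matrix.shape[1]
    best_split, best_score = None, -np.inf
    for size in range(1, n // 2 + 1):
        for combo in combinations(range(n), size):
            group_a = list(combo)
            group_b = [i for i in range(n) if i not in combo]
            score = np.mean([polarization(row, group_a, group_b)
                             for row in matrix])
            if score > best_score:
                best_split, best_score = (group_a, group_b), score
    return best_split, best_score
```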
Once the annotator bi-partition that maximizes the average P-index is found, we create two new gold standard datasets, one for each individual group, by aggregating the annotations with a standard procedure of majority voting.
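Continuing the sketch above, the per-group aggregation is plain majority voting over each group's columns of the annotation matrix (here, ties in even-sized groups fall to the negative class, an arbitrary choice on our part):

```python
def group_gold_standard(matrix, group):
    """Majority-vote label of one annotator group for every instance."""
    votes = matrix[:, group]                       # this group's annotations only
    return (2 * votes.sum(axis=1) > len(group)).astype(int)

(group_a, group_b), _ = best_bipartition(matrix)   # matrix as in the sketch above
gold_a = group_gold_standard(matrix, group_a)      # gold standard for group 1
gold_b = group_gold_standard(matrix, group_b)      # gold standard for group 2
```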
Supervised Classification

We employ Bidirectional Encoder Representations from Transformers (BERT) (Devlin, Chang, and Toutanova 2019) as the prediction framework for the binary classification task of hate speech detection.

In the pre-training step, the model is trained on different tasks on large unlabelled datasets. During the fine-tuning step, the pre-trained parameters are adjusted according to the requirements of a specific task, and all the parameters are fine-tuned with labelled data from a downstream task. In the second step of our method, we fine-tune BERT models on the group-based gold standard datasets obtained in the previous step, in order to learn different points of view on the perception of the same phenomenon (HS) on the same data. By contrast, the model trained on the original dataset encodes all the possible points of view of the annotators.

Many BERT pre-trained models are available for multiple and individual languages, trained on text from different genres and domains (Nozza, Bianchi, and Hovy 2020). In this work, we use the uncased base English model provided by Google (uncased L-12 H-768 A-12). For Italian, we use AlBERTo (Polignano et al. 2019), a model pre-trained on Twitter data, with similar specifications to the BERT English base model.
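For illustration, both checkpoints can be loaded through the Hugging Face transformers library; this tooling is our choice for the sketch, not necessarily what the authors used, and the AlBERTo hub identifier is an assumption that should be checked against the model card published by Polignano et al. (2019).

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# English: Google's uncased BERT base (12 layers, 768 hidden units, 12 heads).
en_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
en_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)             # binary HS detection head

# Italian: AlBERTo, BERT-base specifications, pre-trained on Italian tweets
# (hub identifier assumed, see the caveat above).
alberto = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"
it_tokenizer = AutoTokenizer.from_pretrained(alberto)
it_model = AutoModelForSequenceClassification.from_pretrained(
    alberto, num_labels=2)
```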
Data

We test our methodology on three datasets in English and Italian. The first and second datasets, in English, are taken from previous work by Waseem (2016). The original dataset contains 6,909 messages from Twitter, annotated in a multi-label fashion with four labels: sexism, racism, both, and neither. We separated the corpus into two binary datasets, namely Sexism and Racism. We were able to retrieve a smaller dataset containing 6,361 tweets. There are 5,551 negative instances of HS and 810 positive in the Sexism dataset, and 6,261 negative and 100 positive instances of HS in the Racism dataset. The third dataset is in Italian, containing 1,859 tweets on topics related to the LGBT community (1,635 negative and 224 positive instances).

We compiled the Sexism dataset as a binary classification dataset, mapping the labels sexism and both in the original dataset to the positive class, and the labels racism and neither to the negative class. The resulting Sexism dataset has 810 positive (sexist) tweets out of 6,361 (12.7%). The original dataset was annotated by experts (feminist and anti-racism activists) and by workers on a crowd-sourcing platform¹. The guidelines developed by Waseem and Hovy (2016) were used to annotate the dataset, and majority voting was used to create a gold standard. After dividing the annotators in two by following the method introduced in the Method section, we report an overall agreement (Fleiss' Kappa among all annotators) of 0.58. The intra-group agreement (Cohen's Kappa only between the annotators of a group) is 0.53 for group one and 0.64 for group two.

We applied the same scheme to the Racism dataset, except for the labels: racism and both were mapped to the positive class, whereas sexism and neither were mapped to the negative class. The final dataset comprises 100 positive (racist) tweets out of 6,361 (1.57%). The overall agreement between all annotators, measured with Fleiss' Kappa, is 0.23, which indicates high disagreement between the annotators. We measured the intra-group agreement for the two groups as 0.22 and 0.25 respectively.

¹https://www.figure-eight.com/
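A minimal sketch of this binarization, assuming the corpus sits in a pandas DataFrame with a label column holding the original four-way annotation (the file and column names are hypothetical):

```python
import pandas as pd

def binarize(corpus: pd.DataFrame, positive: set) -> pd.Series:
    """1 if the original label belongs to the positive set, else 0."""
    return corpus["label"].isin(positive).astype(int)

corpus = pd.read_csv("waseem_2016.csv")                  # hypothetical file name
corpus["sexism"] = binarize(corpus, {"sexism", "both"})  # Sexism dataset labels
corpus["racism"] = binarize(corpus, {"racism", "both"})  # Racism dataset labels
```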
The Homophobia dataset, in Italian, is an output of the ACCEPT European research project². It consists of tweets annotated with hate speech against the LGBT+ community. The original dataset was annotated in a multi-class fashion by five volunteers with four categories: homophobic, not homophobic, doubtful, or neutral. We mapped not homophobic, doubtful, and neutral to the negative class (not homophobic), and the label homophobic to the positive class. The volunteers were hired by Arcigay³, the main Italian non-profit organization for LGBT+ rights, and were selected so as to cover different demographic features, such as age, education, and personal views on LGBT+ stances. Some members of the LGBT+ community also volunteered to annotate the dataset. The overall agreement measured with Fleiss' Kappa is 0.35, a rather low value according to common interpretation. The values of intra-group agreement for the two groups are 0.40 and 0.39 respectively.

²http://accept.arcigay.it/
³https://www.arcigay.it/en/
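The agreement figures reported above can be computed with standard tooling; a sketch on the same annotation matrix as in the Method section (we read "Cohen's Kappa between the annotators of a group" as the mean over all annotator pairs, which is an assumption on our part):

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def overall_agreement(matrix):
    """Fleiss' Kappa among all annotators (instances-by-annotators matrix)."""
    counts, _ = aggregate_raters(matrix)   # per-instance category counts
    return fleiss_kappa(counts)

def intra_group_agreement(matrix, group):
    """Mean pairwise Cohen's Kappa between the annotators of one group."""
    return np.mean([cohen_kappa_score(matrix[:, i], matrix[:, j])
                    for i, j in combinations(group, 2)])
```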

Evaluation

The datasets presented in the Data section are employed to experiment with the method introduced in the Method section. For all datasets, the training set contains 80% of the data, whereas the remaining 20% constitutes the test set. We fine-tuned the BERT models on the training sets, keeping the test sets fixed for each dataset for fair comparison. After a preliminary study, we fixed the sequence length at 128 words. The batch size was set to 12 for English and 8 for Italian, also due to memory limitations. The learning rate is 1e-5. We repeated each experiment five times, in order to average out the variance induced by the random initialization of the network.
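A sketch of a single fine-tuning run with the hyperparameters above, using the Hugging Face Trainer API (our tooling choice; train_texts and train_labels are assumed to come from one of the gold standards built earlier, and the number of epochs is not reported in the paper):

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class TweetDataset(torch.utils.data.Dataset):
    """Wraps tokenized tweets and binary labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Paper settings: 128-token sequences, batch size 12 (English), learning rate 1e-5.
encodings = tokenizer(train_texts, truncation=True, padding="max_length",
                      max_length=128, return_tensors="pt")
args = TrainingArguments(output_dir="bert-hs", learning_rate=1e-5,
                         per_device_train_batch_size=12,
                         num_train_epochs=3)       # epoch count is our guess
Trainer(model=model, args=args,
        train_dataset=TweetDataset(encodings, train_labels)).train()
```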
The classification performance on the gold standards created by majority voting from the original datasets (before partition) is reported as the baseline. We then test the performance of the two models trained on gold standard training sets created by only considering one group of annotators at a time (Group 1 and Group 2).

We also include the results obtained by a straightforward ensemble classifier which considers an instance positive if either of the Group 1 and Group 2 classifiers (or both) considers it positive. We call this ensemble "Inclusive". The rationale behind this ensemble is that hate speech is a sparse and subjective phenomenon, where each personal background induces a perspective that leads to different perceptions of what constitutes hate. This classifier includes all these perspectives in its decision process. The Inclusive classifier will naturally have a bias towards the positive class, by construction.
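The Inclusive ensemble is a logical OR of the two group-based predictions; a minimal sketch, with scikit-learn computing the positive-class metrics reported in the tables (the prediction and gold arrays are assumed to be 0/1 vectors over the test set):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def inclusive(pred_group1, pred_group2):
    """Label an instance positive if either group-based classifier does."""
    return np.logical_or(pred_group1, pred_group2).astype(int)

pred_ens = inclusive(pred_g1, pred_g2)   # pred_g1, pred_g2: assumed 0/1 arrays
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, pred_ens, average="binary", pos_label=1)
```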
Tables 1, 2, and 3 show the results of the experiments. They report the arithmetic mean of the evaluation metrics across five runs, along with their standard deviation, showing an improvement over our baseline on all datasets.

Table 1: Results of the prediction on the Sexism dataset. Averages of 5 runs with standard deviation in parenthesis.

Classifier   Prec. (1)    Rec. (1)     F1 (1)
Baseline     .812 (.034)  .711 (.044)  .756 (.015)
Group 1      .745 (.048)  .764 (.045)  .752 (.008)
Group 2      .720 (.019)  .907 (.018)  .802 (.008)
Inclusive    .665 (.033)  .939 (.009)  .778 (.020)

Table 2: Results of the prediction on the Racism dataset. Averages of 5 runs with standard deviation in parenthesis.

Classifier   Prec. (1)    Rec. (1)     F1 (1)
Baseline     .852 (.159)  .194 (.059)  .312 (.085)
Group 1      .654 (.154)  .424 (.140)  .488 (.104)
Group 2      .571 (.175)  .412 (.198)  .419 (.076)
Inclusive    .532 (.141)  .612 (.136)  .542 (.091)

Table 3: Results of the prediction on the Homophobia dataset. Averages of 5 runs with standard deviation in parenthesis.

Classifier   Prec. (1)    Rec. (1)     F1 (1)
Baseline     .415 (.146)  .231 (.079)  .273 (.038)
Group 1      .302 (.038)  .471 (.154)  .355 (.040)
Group 2      .531 (.112)  .178 (.031)  .262 (.033)
Inclusive    .302 (.039)  .502 (.142)  .367 (.035)

It is important to note that the improvement on the positive class is particularly important in this setting, since this binary classification task is actually a detection task. For the Sexism and Racism datasets, the overall improvement is mainly due to a better recall on the positive class. Precision drops less substantially, leading to better F1 scores. For the Homophobia dataset, group-based classifiers obtain an even greater improvement over the baseline, with higher precision, recall, and F1 scores for the positive class.

The baseline results on the Racism and Homophobia datasets show substantially low recall values, which is expected given the highly skewed class distribution. Group-based classifiers largely correct this problem, although they introduce some false positives (hence the lower precision on the positive class).

Finally, the results of the Inclusive ensemble classifier show that including multiple perspectives in the learning process is beneficial to the classification performance on all the datasets, albeit at the cost of lower precision.

Conclusion and Future Work

In this paper, we presented a method to divide the annotators of a dataset into groups based on their annotation behaviour, under the hypothesis that such a partition reflects characteristics such as cultural background, common social behaviour, and similar factors. We experimented with three social media datasets in English and Italian, reporting improvements over the baseline across all the datasets. The implementation of an "inclusive" classifier further boosts the classification performance by strongly increasing the recall on hateful messages.

Although the method boosts the hate speech classification performance, there are limitations which are important to consider. First, for the methodology to work, we need pre-aggregated data, which is often not available. Another issue is epistemological: our methodology and the subsequent empirical evaluation show that there is a great deal of information that is effectively wiped out by the aggregation step employed in the standard procedure to create benchmark datasets. This consideration motivates us to strongly promote the publication of datasets in pre-aggregated form, and to develop new paradigms of evaluation that take all the perspectives due to different backgrounds into account.

We plan to apply the methodology presented in this paper to other abusive language phenomena such as cyberbullying, radicalization, and extremism. We are also interested in testing the method on sentiment analysis tasks applied to specific domains such as political debates.

Finally, we plan to investigate the effect of dividing the annotators into more than two groups, and how to find an optimal number of partitions. In this direction, unsupervised clustering of the annotators based on their annotations with standard methods (e.g., agglomerative clustering, as sketched below) may be a solution both to the issue of the unavailability of background information on the annotators, and to the problem of computational complexity and scalability of the exhaustive search approach.
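As a pointer in this direction, hierarchical agglomerative clustering over the annotators' label vectors is available off the shelf; a sketch with SciPy (the Hamming distance and average linkage are illustrative choices, and we again assume a complete instances-by-annotators matrix):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Each annotator is represented by their column of the annotation matrix.
vectors = matrix.T                               # (n_annotators, n_instances)
distances = pdist(vectors, metric="hamming")     # pairwise disagreement rates
tree = linkage(distances, method="average")      # agglomerative clustering
groups = fcluster(tree, t=2, criterion="maxclust")   # cut the tree into 2 groups
```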
References

Akhtar, S.; Basile, V.; and Patti, V. 2019. A new measure of polarization in the annotation of hate speech. In Alviano, M.; Greco, G.; and Scarcello, F., eds., AI*IA 2019 – Advances in Artificial Intelligence, 588–603. Cham: Springer International Publishing.

Devlin, J.; Chang, M.-W.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. ACL.

Fortuna, P., and Nunes, S. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys 51:1–30.

Howard, J., and Ruder, S. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 328–339. ACL.

Munikar, M.; Shakya, S.; and Shrestha, A. 2019. Fine-grained sentiment classification using BERT. In 2019 Artificial Intelligence for Transforming Business and Society (AITB), volume 1, 1–5. IEEE.

Nobata, C.; Tetreault, J.; Thomas, A.; Mehdad, Y.; and Chang, Y. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, 145–153.

Nozza, D.; Bianchi, F.; and Hovy, D. 2020. What the [MASK]? Making sense of language-specific BERT models. CoRR abs/2003.02912.

Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237. New Orleans, Louisiana: Association for Computational Linguistics.

Poletto, F.; Basile, V.; Sanguinetti, M.; Bosco, C.; and Patti, V. 2020. Resources and benchmark corpora for hate speech detection: A systematic review. Language Resources and Evaluation. To appear.

Polignano, M.; Basile, P.; de Gemmis, M.; Semeraro, G.; and Basile, V. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481 of CEUR Workshop Proceedings. Bari, Italy: CEUR-WS.org.

Ross, B.; Rist, M.; Carbonell, G.; Cabrera, B.; Kurowsky, N.; and Wojatzki, M. 2016. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. In Beißwenger, M.; Wojatzki, M.; and Zesch, T., eds., Proceedings of NLP4CMC III: 3rd Workshop on Natural Language Processing for Computer-Mediated Communication, 6–9.

Schmidt, A., and Wiegand, M. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 1–10. Valencia, Spain: Association for Computational Linguistics.

Soberón, G.; Aroyo, L.; Welty, C.; Inel, O.; Lin, H.; and Overmeen, M. 2013. Measuring crowd truth: Disagreement metrics combined with worker behavior filters. In Proceedings of the 1st International Conference on Crowdsourcing the Semantic Web, volume 1030 of CEUR Workshop Proceedings, 45–58. CEUR-WS.org.

Waseem, Z., and Hovy, D. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, 88–93. San Diego, California: ACL.

Waseem, Z. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, 138–142. Austin, Texas: ACL.

Yu, S.; Jindian, S.; and Luo, D. 2019. Improving BERT-based text classification with auxiliary sentence and domain knowledge. IEEE Access PP:1–1.
