
American Political Science Review Vol. 110, No. 2, May 2016
doi:10.1017/S0003055416000058 © American Political Science Association 2016

Crowd-sourced Text Analysis: Reproducible and Agile Production of Political Data
KENNETH BENOIT London School of Economics and Trinity College
DREW CONWAY New York University
BENJAMIN E. LAUDERDALE London School of Economics and Political Science
MICHAEL LAVER New York University
SLAVA MIKHAYLOV University College London

Empirical social science often relies on data that are not observed in the field, but are transformed
into quantitative variables by expert researchers who analyze and interpret qualitative raw sources.
While generally considered the most valid way to produce data, this expert-driven process is
inherently difficult to replicate or to assess on grounds of reliability. Using crowd-sourcing to distribute
text for reading and interpretation by massive numbers of nonexperts, we generate results comparable to
those using experts to read and interpret the same texts, but do so far more quickly and flexibly. Crucially,
the data we collect can be reproduced and extended transparently, making crowd-sourced datasets
intrinsically reproducible. This focuses researchers’ attention on the fundamental scientific objective of
specifying reliable and replicable methods for collecting the data needed, rather than on the content of
any particular dataset. We also show that our approach works straightforwardly with different types of
political text, written in different languages. While findings reported here concern text analysis, they have
far-reaching implications for expert-generated data in the social sciences.

Kenneth Benoit is Professor, London School of Economics and Trinity College, Dublin (kbenoit@lse.ac.uk). Drew Conway, New York University. Benjamin E. Lauderdale is Associate Professor, London School of Economics. Michael Laver is Professor, New York University. Slava Mikhaylov is Senior Lecturer, University College London.

An earlier draft of this article, with much less complete data, was presented at the third annual Analyzing Text as Data conference at Harvard University, 5–6 October 2012. A very preliminary version was presented at the 70th annual Conference of the Midwest Political Science Association, Chicago, 12–15 April 2012. We thank Joseph Childress and other members of the technical support team at CrowdFlower for assisting with the setup of the crowd-sourcing platform. We are grateful to Neal Beck, Joshua Tucker, and five anonymous journal referees for comments on an earlier draft of this article. This research was funded by the European Research Council grant ERC-2011-StG 283794-QUANTESS.

Political scientists have made great strides toward greater reproducibility of their findings since the publication of Gary King's influential article Replication, Replication (King 1995). It is now standard practice for good professional journals to insist that authors lodge their data and code in a prominent open access repository. This allows other scholars to replicate and extend published results by reanalyzing the data, rerunning and modifying the code. Replication of an analysis, however, sets a far weaker standard than reproducibility of the data, which is typically seen as a fundamental principle of the scientific method. Here, we propose a step towards a more comprehensive scientific replication standard in which the mandate is to replicate data production, not just data analysis. This shifts attention from specific datasets as the essential scientific objects of interest, to the published and reproducible method by which the data were generated. We implement this more comprehensive replication standard for the rapidly expanding project of analyzing the content of political texts.

Traditionally, a lot of political data are generated by experts applying comprehensive classification schemes to raw sources in a process that, while in principle repeatable, is in practice too costly and time-consuming to reproduce. Widely used examples include[1] the Polity dataset, rating countries on a scale "ranging from −10 (hereditary monarchy) to +10 (consolidated democracy)"[2]; the Comparative Parliamentary Democracy data, with indicators of the "number of inconclusive bargaining rounds" in government formation and "conflictual" government terminations[3]; the Comparative Manifesto Project (CMP), with coded summaries of party manifestos, notably a widely used left-right score[4]; and the Policy Agendas Project, which codes text from laws, court decisions, and political speeches into topics and subtopics (Jones and Baumgartner 2013). In addition to the issue of reproducibility, the fixed nature of these schemes and the considerable infrastructure required to implement them discourages change and makes it harder to adapt them to specific needs, as the data are designed to fit general requirements rather than a particular research question.

[1] Other examples of coded data include expert judgments on party policy positions (Benoit and Laver 2006; Hooghe et al. 2010; Laver and Hunt 1992), and democracy scores from Freedom House and corruption rankings from Transparency International.
[2] http://www.systemicpeace.org/polity/polity4.htm
[3] http://www.erdda.se/cpd/data archive.html
[4] https://manifesto-project.wzb.eu/

Here, we demonstrate a method of crowd-sourced text annotation for generating political data that are both reproducible in the sense of allowing the data generating process to be quickly, inexpensively, and reliably repeated, and agile in the sense of being capable of flexible design according to the needs of a specific research project. The notion of agile research is borrowed from recent approaches to software development, and incorporates not only the flexibility of design, but also the ability to iteratively test, deploy, verify, and, if necessary, redesign data generation through feedback in the production process. In what follows, we apply this method to a common measurement problem in political science: locating political parties on policy dimensions using text as data. Despite the lower expertise of crowd workers compared to experts, we show that properly deployed crowd-sourcing generates results indistinguishable from expert approaches. Given the millions of available workers online, crowd-sourced data collection can also be repeated as often as desired, quickly and with low cost. Furthermore, our approach is easily tailored to specific research needs, for specific contexts and time periods, in sharp contrast to large "canonical" data generation projects aimed at maximizing generality. For this reason, crowd-sourced data generation may represent a paradigm shift for data production and reproducibility in the social sciences. While, as a proof of concept, we apply our particular method for crowd-sourced data production to the analysis of political texts, the core problem of specifying a reproducible data production process extends to all subfields of political science.

In what follows, we first review the theory and practice of crowd-sourcing. We then deploy an experiment in content analysis designed to evaluate crowd-sourcing as a method for reliably and validly extracting meaning from political texts, in this case party manifestos. We compare expert and crowd-sourced analyses of the same texts, and assess external validity by comparing crowd-sourced estimates with those generated by completely independent expert surveys. In order to do this, we design a method for aggregating judgments about text units of varying complexity, by readers of varying quality,[5] into estimates of latent quantities of interest. To assess the external validity of our results, our core analysis uses crowd workers to estimate party positions on two widely used policy dimensions: "economic" policy (right-left) and "social" policy (liberal-conservative). We then use our method to generate "custom" data on a variable not available in canonical datasets, in this case party policies on immigration. Finally, to illustrate the general applicability of crowd-sourced text annotation in political science, we test the method in a multilingual and technical environment to show that crowd-sourced text analysis is effective for texts other than party manifestos and works well in different languages.

[5] In what follows we use the term "reader" to cover a person, whether expert, crowd worker, or anyone else, who is evaluating a text unit for meaning.

HARVESTING THE WISDOM OF CROWDS

The intuition behind crowd-sourcing can be traced to Aristotle (Lyon and Pacuit 2013) and later Galton (1907), who noticed that the average of a large number of individual judgments by fair-goers of the weight of an ox is close to the true answer and, importantly, closer to this than the typical individual judgment (for a general introduction see Surowiecki 2004). Crowd-sourcing is now understood to mean using the Internet to distribute a large package of small tasks to a large number of anonymous workers, located around the world and offered small financial rewards per task. The method is widely used for data-processing tasks such as image classification, video annotation, data entry, optical character recognition, translation, recommendation, and proofreading. Crowd-sourcing has emerged as a paradigm for applying human intelligence to problem-solving on a massive scale, especially for problems involving the nuances of language or other interpretative tasks where humans excel but machines perform poorly.

Increasingly, crowd-sourcing has also become a tool for social scientific research (Bohannon 2011). In sharp contrast to our own approach, most applications use crowds as a cheap alternative to traditional subjects for experimental studies (e.g., Horton et al. 2011; Lawson et al. 2010; Mason and Suri 2012; Paolacci et al. 2010). Using subjects in the crowd to populate experimental or survey panels raises obvious questions about external validity, addressed by studies in political science (Berinsky et al. 2012), economics (Horton et al. 2011), and general decision theory and behavior (Chandler et al. 2014; Goodman et al. 2013; Paolacci et al. 2010). Our method for using workers in the crowd to label external stimuli differs fundamentally from such applications. We do not care at all about whether our crowd workers represent any target population, as long as different workers, on average, make the same judgments when faced with the same information. In this sense our method, unlike online experiments and surveys, is a canonical use of crowd-sourcing as described by Galton.[6]

[6] We are interested in the weight of the ox, not in how different people judge the weight of the ox.

All data production by humans requires expertise, and several empirical studies have found that data created by domain experts can be matched, and sometimes improved at much lower cost, by aggregating judgments of nonexperts (Alonso and Baeza-Yates 2011; Alonso and Mizzaro 2009; Carpenter 2008; Hsueh et al. 2009; Ipeirotis et al. 2013; Snow et al. 2008). Provided crowd workers are not systematically biased in relation to the "true" value of the latent quantity of interest, and it is important to check for such bias, the central tendency of even erratic workers will converge on this true value as the number of workers increases. Because experts are axiomatically in short supply while members of the crowd are not, crowd-sourced solutions also offer a straightforward and scalable way to address reliability in a manner that expert solutions cannot. To improve confidence, simply employ more crowd workers. Because data production is broken down into many simple specific tasks, each performed by many different exchangeable workers, it tends to wash out biases that might affect a single worker, while also making it possible to estimate and correct for worker-specific effects using the type of scaling model we employ below.
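To make this aggregation logic concrete, the following minimal Python simulation (an illustrative sketch added here, not code from the original study) shows the mean of increasingly many noisy but unbiased judgments converging on a notional true value; the true value, noise level, and crowd sizes are arbitrary assumptions.

```python
import random
import statistics

random.seed(42)

TRUE_POSITION = 1.2  # notional "true" value of the quantity being judged

def noisy_judgment(true_value: float, noise_sd: float = 1.0) -> float:
    """One worker's judgment: the true value plus independent, unbiased noise."""
    return true_value + random.gauss(0.0, noise_sd)

for n_workers in (1, 5, 20, 100, 1000):
    judgments = [noisy_judgment(TRUE_POSITION) for _ in range(n_workers)]
    estimate = statistics.mean(judgments)
    print(f"{n_workers:5d} workers -> mean judgment {estimate:+.3f} "
          f"(absolute error {abs(estimate - TRUE_POSITION):.3f})")
```

The same logic underpins the claim that confidence can be improved simply by employing more crowd workers, provided their errors are not systematically biased.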

Crowd-sourced data generation inherently requires a method for aggregating many small pieces of information into valid measures of our quantities of interest.[7] Complex calibration models have been used to correct for worker errors on particular difficult tasks, but the most important lesson from this work is that increasing the number of workers reduces error (Snow et al. 2008). Addressing statistical issues of "redundant" coding, Sheng et al. (2008) and Ipeirotis et al. (2014) show that repeated coding can improve the quality of data as a function of the individual qualities and number of workers, particularly when workers are imperfect and labeling categories are "noisy." Ideally, we would benchmark crowd workers against a "gold standard," but such benchmarks are not always available, so scholars have turned to Bayesian scaling models borrowed from item-response theory (IRT) to aggregate information while simultaneously assessing worker quality (e.g., Carpenter 2008; Raykar et al. 2010). Welinder and Perona (2010) develop a classifier that integrates data difficulty and worker characteristics, while Welinder et al. (2010) develop a unifying model of the characteristics of both data and workers, such as competence, expertise, and bias. A similar approach is applied to rater evaluation in Cao et al. (2010) where, using a Bayesian hierarchical model, raters' judgments are modeled as a function of a latent item trait and rater characteristics such as bias, discrimination, and measurement error. We build on this work below, applying both a simple averaging method and a Bayesian scaling model that estimates latent policy positions while generating diagnostics on worker quality and sentence difficulty. We find that estimates generated by our more complex model match simple averaging very closely.

[7] Of course aggregation issues are no less important when combining any multiple judgments, including those of experts. Procedures for aggregating nonexpert judgments may influence both the quality of data and convergence on some underlying "truth," or trusted expert judgment. For an overview, see Quoc Viet Hung et al. (2013).

A METHOD FOR REPLICABLE CODING OF POLITICAL TEXT

We apply our crowd-sourcing method to one of the most wide-ranging research programs in political science, the analysis of political text, and in particular text processing by human analysts that is designed to extract meaning systematically from some text corpus, and from this to generate valid and reliable data. This is related to, but quite distinct from, spectacular recent advances in automated text analysis that in theory scale up to unlimited volumes of political text (Grimmer and Stewart 2013). Many automated methods involve supervised machine learning and depend on labeled training data. Our method is directly relevant to this enterprise, offering a quick, effective, and, above all, reproducible way to generate labeled training data. Other, unsupervised, methods intrinsically require a posteriori human interpretation that may be haphazard and is potentially biased.[8]

[8] This human interpretation can be reproduced by workers in the crowd, though this is not our focus in this article.

Our argument here speaks directly to more traditional content analysis within the social sciences, which is concerned with problems that automated text analysis cannot yet address. This involves the "reading" of text by real humans who interpret it for meaning. These interpretations, if systematic, may be classified and summarized using numbers, but the underlying human interpretation is fundamentally qualitative. Crudely, human analysts are employed to engage in natural language processing (NLP), which seeks to extract "meaning" embedded in the syntax of language, treating a text as more than a bag of words. NLP is another remarkable growth area, though it addresses a fundamentally difficult problem and fully automated NLP still has a long way to go. Traditional human experts in the field of inquiry are of course highly sophisticated natural language processors, finely tuned to particular contexts. The core problem is that they are in very short supply. This means that text processing by human experts simply does not scale to the huge volumes of text that are now available. This in turn generates an inherent difficulty in meeting the more comprehensive scientific replication standard to which we aspire. Crowd-sourced text analysis offers a compelling solution to this problem. Human workers in the crowd can be seen, perhaps rudely, as generic and very widely available "biological" natural language processors. Our task in this article is now clear: design a system for employing generic workers in the crowd to analyze text for meaning in a way that is as reliable and valid as if we had used finely tuned experts to do the same job.

By far the best known research program in political science that relies on expert human readers is the long-running Manifesto Project (MP). This project has analyzed nearly 4,000 manifestos issued since 1945 by nearly 1,000 parties in more than 50 countries, using experts who are country specialists to label sentences in each text in their original languages. A single expert assigns every sentence in every manifesto to a single category in a 56-category scheme devised by the project in the mid-1980s (Budge et al. 1987; Budge et al. 2001; Klingemann et al. 1994; Klingemann et al. 2006; Laver and Budge 1992).[9] This has resulted in a widely used "canonical" dataset that, given the monumental coordinated effort of very many experts over 30 years, is unlikely ever to be recollected from scratch and in this sense is unlikely to be replicated. Despite low levels of interexpert reliability found in experiments using the MP's coding scheme (Mikhaylov et al. 2012), a proposal to re-process the entire manifesto corpus many times, using many independent experts, is in practice a nonstarter. Large canonical datasets such as this, therefore, tend not to satisfy the deeper standard of reproducible research that requires the transparent repeatability of data generation. This deeper replication standard can, however, be satisfied with the crowd-sourced method we now describe.

[9] https://manifesto-project.wzb.eu/

FIGURE 1. Hierarchical Coding Scheme for Two Policy Domains with Ordinal Positioning

A simple coding scheme for economic and social policy

We assess the potential for crowd-sourced text analysis using an experiment in which we serve up an identical set of documents, and an identical set of text processing tasks, to both a small set of experts (political science faculty and graduate students) and a large and heterogeneous set of crowd workers located around the world. To do this, we need a simple scheme for labeling political text that can be used reliably by workers in the crowd. Our scheme first asks readers to classify each sentence in a document as referring to economic policy (left or right), to social policy (liberal or conservative), or to neither. Substantively, these two policy dimensions have been shown to offer an efficient representation of party positions in many countries.[10] They also correspond to dimensions covered by a series of expert surveys (Benoit and Laver 2006; Hooghe et al. 2010; Laver and Hunt 1992), allowing validation of estimates we derive against widely used independent estimates of the same quantities. If a sentence was classified as economic policy, we then ask readers to rate it on a five-point scale from very left to very right; those classified as social policy were rated on a five-point scale from liberal to conservative. Figure 1 shows this scheme.[11]

[10] See Chapter 5 of Benoit and Laver (2006) for an extensive empirical review of this for a wide range of contemporary democracies.
[11] Our instructions—fully detailed in the Online Appendix (Section 6)—were identical for both experts and nonexperts, defining the economic left-right and social liberal-conservative policy dimensions we estimate and providing examples of labeled sentences.

We did not use the MP's 56-category classification scheme, for two main reasons. The first is methodological: the complexity of the MP scheme and uncertain boundaries between many of its categories were major sources of unreliability when multiple experts applied this scheme to the same documents (Mikhaylov et al. 2012). The second is practical: it is impossible to write clear and precise instructions, to be understood reliably by a diverse, globally distributed set of workers in the crowd, for using a detailed and complex 56-category scheme quintessentially designed for highly trained experts. This highlights an important trade-off. There may be data production tasks that cannot feasibly be explained in clear and simple terms, requiring sophisticated instructions that can only be understood and implemented by highly trained experts. Sophisticated instructions are designed for a more limited pool of experts who can understand and implement them and, for this reason, imply less scalable and replicable data production. Such tasks may not be suitable for crowd-sourced data generation and may be more suited to traditional methods. The striking alternative now made available by crowd-sourcing is to break down complicated data production tasks into simple small jobs, as happens when complex consumer products are manufactured on factory production lines. Over and above the practical need to have simple instructions for crowd workers, furthermore, the scheme in Figure 1 is motivated by the observation that most scholars using manifesto data actually seek simple solutions, typically estimates of positions on a few general policy dimensions; they do not need estimates of these positions in a 56-dimensional space.
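As a concrete illustration of the scheme in Figure 1, the sketch below shows one way a single judgment might be represented in code. This is our own illustration: the class and field names are hypothetical, and the −2..+2 integer encoding of the five-point scales is an assumption rather than a specification from the article.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Domain(Enum):
    NEITHER = "neither"
    ECONOMIC = "economic"   # positioned from very left to very right
    SOCIAL = "social"       # positioned from very liberal to very conservative

@dataclass
class SentenceJudgment:
    """One reader's judgment of one sentence under the two-domain scheme."""
    sentence_id: int
    reader_id: str
    domain: Domain
    position: Optional[int] = None  # -2..+2 five-point scale; None when domain is NEITHER

    def __post_init__(self) -> None:
        if self.domain is Domain.NEITHER:
            if self.position is not None:
                raise ValueError("sentences labeled 'neither' carry no positional score")
        elif self.position not in (-2, -1, 0, 1, 2):
            raise ValueError("positions run from -2 (left/liberal) to +2 (right/conservative)")

# Example: a reader labels sentence 17 as economic policy, moderately right.
example = SentenceJudgment(sentence_id=17, reader_id="crowd_0042",
                           domain=Domain.ECONOMIC, position=1)
```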

Text corpus

While we extend this in work we discuss below, our baseline text corpus comprises 18,263 natural sentences from British Conservative, Labour and Liberal Democrat manifestos for the six general elections held between 1987 and 2010. These texts were chosen for two main reasons. First, for systematic external validation, there are diverse independent estimates of British party positions for this period, from contemporary expert surveys (Benoit 2005, 2010; Laver 1998; Laver and Hunt 1992) as well as MP expert codings of the same texts. Second, there are well-documented substantive shifts in party positions during this period, notably the sharp shift of Labour towards the center between 1987 and 1997. The ability of crowd workers to pick up this move is a good test of external validity.

In designing the breakdown and presentation of the text processing tasks given to both experts and the crowd, we made a series of detailed operational decisions based on substantial testing and adaptation (reviewed in the Appendix). In summary, we used natural sentences as our fundamental text unit. Recognizing that most crowd workers dip into and out of our jobs and would not stay online to code entire documents, we served target sentences from the corpus in a random sequence, set in a two-sentence context on either side of the target sentence, without identifying the text from which the sentence was drawn. Our coding experiments showed that these decisions resulted in estimates that did not significantly differ from those generated by the classical approach of reading entire documents from beginning to end.

SCALING DOCUMENT POLICY POSITIONS FROM CODED SENTENCES

Our aim is to estimate the policy positions of entire documents: not the code value of any single sentence, but some aggregation of these values into an estimate of each document's position on some meaningful policy scale while allowing for reader, sentence, and domain effects. One option is simple averaging: identify all economic scores assigned to sentences in a document by all readers, average these, and use this as an estimate of the economic policy position of a document. Mathematical and behavioral studies on aggregations of individual judgments imply that simpler methods often perform as well as more complicated ones, and often more robustly (e.g., Ariely et al. 2000; Clemen and Winkler 1999). Simple averaging of individual judgments is the benchmark when there is no additional information on the quality of individual coders (Armstrong 2001; Lyon and Pacuit 2013; Turner et al. 2014). However, this does not permit direct estimation of misclassification tendencies by readers who for example fail to identify economic or social policy "correctly," or of reader-specific effects in the use of positional scales.
of the economic policy position of a document. Math-  
ematical and behavioral studies on aggregations of in- 1
p (none) = ,
dividual judgments imply that simpler methods often 1 + exp(μ∗ij 1 ) + exp(μ∗ij 2 )
perform as well as more complicated ones, and often
more robustly (e.g., Ariely et al. 2000; Clemen and  
exp(μ∗ij 1 )
Winkler 1999). Simple averaging of individual judg- p (econ; scale) =
ments is the benchmark when there is no additional 1 + exp(μ∗ij 1 ) + exp(μ∗ij 2 )
information on the quality of individual coders (Arm-     
strong 2001; Lyon and Pacuit 2013; Turner et al. 2014). × logit−1 ξscale − μ∗ij 3 − logit−1 ξscale−1 − μ∗ij 3 ,
However, this does not permit direct estimation of mis-
classification tendencies by readers who for example
fail to identify economic or social policy “correctly,”
or of reader-specific effects in the use of positional
scales. 12 By treating these as independent, and using the logit, we are

An alternative is to model each sentence as contain- assuming independence between the choices and between the so-
ing information about the document, and then scale cial and economic dimensions (IIA). It is not possible to identify a
more general model that relaxes these assumptions without asking
these using a measurement model. We propose a model additional questions of readers.
based on item response theory (IRT), which accounts 13 Each policy domain has five scale points, and the model assumes
for both individual reader effects and the strong pos- proportional odds of being in each higher scale category in response
sibility that some sentences are intrinsically harder to to the sentence’s latent policy positions θ3 and θ4 and the coder’s
interpret. This approach has antecedents in psycho- sensitivities to this association. The cutpoints ξ for ordinal scale re-
sponses are constrained to be symmetric around zero and to have
metric methods (e.g., Baker and Kim 2004; Fox 2010; the same cutoffs in both social and economic dimensions, so that the
Hambleton et al. 1991; Lord 1980), and has been used latent scales are directly comparable to one another and to the raw
to aggregate crowd ratings (e.g., Ipeirotis et al. 2014; scales. Thus, ξ2 = ∞, ξ1 = −ξ−2 , ξ0 = −ξ−1 , and ξ−3 = −∞.

282
American Political Science Review Vol. 110, No. 2
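To show how the response probabilities above fit together, the following sketch evaluates them for a single reader-sentence pair. It is an illustration only: the function name, the specific cutpoint values, and the parameter values are our own assumptions, not estimates from the article; the check simply confirms that the 11 response probabilities sum to one.

```python
import math

def inv_logit(x: float) -> float:
    """Logistic function, with explicit handling of the infinite cutpoints."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 1.0 / (1.0 + math.exp(-x))

def judgment_probabilities(theta, chi, psi, xi_neg2=-2.0, xi_neg1=-0.75):
    """Probabilities of the 11 possible responses (neither, or a domain plus a
    five-point position) for one reader-sentence pair.

    theta: sentence attributes on dimensions 1-4; chi, psi: the reader's
    sensitivities and biases. Cutpoints are symmetric around zero, mirroring
    the identification constraints described in the text.
    """
    mu = [chi[d] * (theta[d] + psi[d]) for d in range(4)]
    denom = 1.0 + math.exp(mu[0]) + math.exp(mu[1])

    # cutpoints xi indexed by scale value s = -2..+2, with xi[s-1] below xi[s]
    xi = {-3: -math.inf, -2: xi_neg2, -1: xi_neg1,
          0: -xi_neg1, 1: -xi_neg2, 2: math.inf}

    probs = {("neither", None): 1.0 / denom}
    for domain, dom_idx, pos_idx in (("economic", 0, 2), ("social", 1, 3)):
        p_domain = math.exp(mu[dom_idx]) / denom
        for s in (-2, -1, 0, 1, 2):
            p_scale = inv_logit(xi[s] - mu[pos_idx]) - inv_logit(xi[s - 1] - mu[pos_idx])
            probs[(domain, s)] = p_domain * p_scale
    return probs

# Arbitrary example values: a right-leaning economic sentence, a neutral reader.
p = judgment_probabilities(theta=[1.0, -1.0, 1.5, 0.0],
                           chi=[1.0, 1.0, 1.0, 1.0],
                           psi=[0.0, 0.0, 0.0, 0.0])
assert abs(sum(p.values()) - 1.0) < 1e-9
```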

The primary quantities of interest are not sentence-level attributes, θ_jd, but rather aggregates of these for entire documents, represented by the θ̄_{k,d} for each document k on each dimension d. Where the ε_jd are distributed normally with mean zero and standard deviation σ_d, we model these latent sentence-level attributes θ_jd hierarchically in terms of corresponding latent document-level attributes:

\theta_{jd} = \bar{\theta}_{k(j),d} + \epsilon_{jd}.

As at the sentence level, two of these (d = 1, 2) correspond to the overall frequency (importance) of economic and social dimensions relative to other topics, and the remaining two (d = 3, 4) correspond to aggregate left-right positions of documents on economic and social dimensions.

This model enables us to generate estimates of not only our quantities of interest for the document-level policy positions, but also a variety of reader- and sentence-level diagnostics concerning reader agreement and the "difficulty" of domain and positional coding for individual sentences. Simulating from the posterior also makes it straightforward to estimate Bayesian credible intervals indicating our uncertainty over document-level policy estimates.[14]

Posterior means of the document level θ̄_kd correlate very highly with those produced by the simple averaging methods discussed earlier: 0.95 and above, as we report below. It is therefore possible to use averaging methods to summarize results in a simple and intuitive way that is also invariant to shifts in mean document scores that might be generated by adding new documents to the coded corpus. The value of our scaling model is to estimate reader and sentence fixed effects, and correct for these if necessary. While this model is adapted to our particular classification scheme, it is general in the sense that nearly all attempts to measure policy in specific documents will combine domain classification with positional coding.

[14] We estimate the model by MCMC using the JAGS software, and provide the code, convergence diagnostics, and other details of our estimations in Section 2 of the Online Appendix.

BENCHMARKING A CROWD OF EXPERTS

Our core objective is to compare estimates generated by workers in the crowd with analogous estimates generated by experts. Since readers of all types will likely disagree over the meaning of particular sentences, an important benchmark for our comparison of expert and crowd-sourced text coding concerns levels of disagreement between experts. The first stage of our empirical work therefore employed multiple (four to six)[15] experts to independently code each of the 18,263 sentences in our 18-document text corpus, using the scheme described above. The entire corpus was processed twice by our experts. First, sentences were served in their natural sequence in each manifesto, to mimic classical expert content analysis. Second, about a year later, sentences were processed in random order, to mimic the system we use for serving sentences to crowd workers. Sentences were uploaded to a custom-built, web-based platform that displayed sentences in context and made it easy for experts to process a sentence with a few mouse clicks. In all, we harvested over 123,000 expert evaluations of manifesto sentences, about seven per sentence. Table 1 provides details of the 18 texts, with statistics on the overall and mean numbers of evaluations, for both stages of expert processing as well as the crowd processing we report below.

[15] Three of the authors of this article, plus three senior PhD students in Politics from New York University, processed the six manifestos from 1987 and 1997. One author of this article and four NYU PhD students processed the other 12 manifestos.

External validity of expert evaluations

Figure 2 plots two sets of estimates of positions of the 18 manifestos on economic and social policy: one generated by experts processing sentences in natural sequence (vertical axis); the other generated by completely independent expert surveys (horizontal axis).[16] Linear regression lines summarizing these plots show that expert text processing predicts independent survey measures very well for economic policy (R = 0.91), somewhat less well for the noisier dimension of social policy (R = 0.81). To test whether coding sentences in their natural sequence affected results, our experts also processed the entire text corpus taking sentences in random order. Comparing estimates from sequential and random-order sentence processing, we found almost identical results, with correlations of 0.98 between scales.[17] Moving from "classical" expert content analysis to having experts process sentences served at random from anonymized texts makes no substantive difference to point estimates of manifesto positions. This reinforces our decision to use the much more scalable random sentence sequencing in the crowd-sourcing method we specify.

[16] These were Laver and Hunt (1992); Laver (1998) for 1997; Benoit and Laver (2006) for 2001; Benoit (2005, 2010) for 2005 and 2010.
[17] Details provided in the Online Appendix, Section 5.

TABLE 1. Texts and Sentences Coded: 18 British Party Manifestos

Manifesto | Total Sentences in Manifesto | Mean Expert Evaluations: Natural Sequence | Mean Expert Evaluations: Random Sequence | Total Expert Evaluations | Mean Crowd Evaluations | Total Crowd Evaluations
Con 1987 | 1,015 | 6.0 | 2.4 | 7,920 | 44 | 36,594
LD 1987 | 878 | 6.0 | 2.3 | 6,795 | 22 | 24,842
Lab 1987 | 455 | 6.0 | 2.3 | 3,500 | 20 | 11,087
Con 1992 | 1,731 | 5.0 | 2.4 | 11,715 | 6 | 28,949
LD 1992 | 884 | 5.0 | 2.4 | 6,013 | 6 | 20,880
Lab 1992 | 661 | 5.0 | 2.3 | 4,449 | 6 | 23,328
Con 1997 | 1,171 | 6.0 | 2.3 | 9,107 | 20 | 11,136
LD 1997 | 873 | 6.0 | 2.4 | 6,847 | 20 | 5,627
Lab 1997 | 1,052 | 6.0 | 2.3 | 8,201 | 20 | 4,247
Con 2001 | 748 | 5.0 | 2.3 | 5,029 | 5 | 3,796
LD 2001 | 1,178 | 5.0 | 2.4 | 7,996 | 5 | 5,987
Lab 2001 | 1,752 | 5.0 | 2.4 | 11,861 | 5 | 8,856
Con 2005 | 414 | 5.0 | 2.3 | 2,793 | 5 | 2,128
LD 2005 | 821 | 4.1 | 2.3 | 4,841 | 5 | 4,173
Lab 2005 | 1,186 | 4.0 | 2.4 | 6,881 | 5 | 6,021
Con 2010 | 1,240 | 4.0 | 2.3 | 7,142 | 5 | 6,269
LD 2010 | 855 | 4.0 | 2.4 | 4,934 | 5 | 4,344
Lab 2010 | 1,349 | 4.0 | 2.3 | 7,768 | 5 | 6,843
Total | 18,263 | 91,400 (total) | 32,392 (total) | 123,792 | | 215,107

FIGURE 2. British Party Positions on Economic and Social Policy 1987–2010

[Figure omitted: two scatterplots, "Manifesto Placement: Economic" (r = 0.91) and "Manifesto Placement: Social" (r = 0.82), plotting expert coding estimates (vertical axis) against expert survey placements (horizontal axis) for each manifesto.]

Notes: Sequential expert text processing (vertical axis) and independent expert surveys (horizontal). Labour red, Conservatives blue, Liberal Democrats yellow, labeled by last two digits of year.

Internal reliability of expert coding

Agreement between experts. As might be expected, agreement between our experts was far from perfect. Table 2 classifies each of the 5,444 sentences in the 1987 and 1997 manifestos, all of which were processed by the same six experts. It shows how many experts agreed the sentence referred to economic, or social, policy. If experts are in perfect agreement on the policy content of each sentence, either all six label each sentence as dealing with economic (or social) policy, or none do. The first data column of the table shows a total of 4,125 sentences which all experts agree have no social policy content. Of these, there are 1,193 sentences all experts also agree have no economic policy content, and 527 that all experts agree do have economic policy content. The experts disagree about the remaining 2,405 sentences: some but not all experts label these as having economic policy content.

The shaded boxes show sentences for which the six experts were in unanimous agreement—on economic policy, social policy, or neither. There was unanimous expert agreement on about 35 percent of the labeled sentences. For about 65 percent of sentences, there was disagreement, even about the policy area, among trained experts of the type usually used to analyze political texts.

TABLE 2. Domain Classification Matrix for 1987 and 1997 Manifestos: Frequency with which Sentences were Assigned by Six Experts to Economic and Social Policy Domains

Rows: number of experts assigning the economic domain; columns: number of experts assigning the social policy domain.

Economic \ Social | 0 | 1 | 2 | 3 | 4 | 5 | 6 | Total
0 | 1,193 | 196 | 67 | 59 | 114 | 190 | 170 | 1,989
1 | 326 | 93 | 19 | 11 | 9 | 19 | – | 477
2 | 371 | 92 | 15 | 15 | 5 | – | – | 498
3 | 421 | 117 | 12 | 7 | – | – | – | 557
4 | 723 | 68 | 10 | – | – | – | – | 801
5 | 564 | 31 | – | – | – | – | – | 595
6 | 527 | – | – | – | – | – | – | 527
Total | 4,125 | 597 | 123 | 92 | 128 | 209 | 170 | 5,444

Note: Shaded boxes: perfect agreement between experts.

TABLE 3. Interexpert Scale Reliability Analysis for the Economic Policy Scale, Generated by Aggregating All Expert Scores for Sentences Judged to have Economic Policy Content

Item | N | Sign | Item-scale Correlation | Item-rest Correlation | Cronbach's Alpha
Expert 1 | 2,256 | + | 0.89 | 0.76 | 0.95
Expert 2 | 2,137 | + | 0.89 | 0.76 | 0.94
Expert 3 | 1,030 | + | 0.87 | 0.74 | 0.94
Expert 4 | 1,627 | + | 0.89 | 0.75 | 0.95
Expert 5 | 1,979 | + | 0.89 | 0.77 | 0.95
Expert 6 | 667 | + | 0.89 | 0.81 | 0.93
Overall | | | | | 0.95
κ (policy domain) | | | | | 0.93

Scale reliability. Despite substantial disagreement among experts about individual sentences, we saw above that we can derive externally valid estimates of party policy positions if we aggregate the judgments of all experts on all sentences in a given document. This happens because, while each expert judgment on each sentence is a noisy realization of some underlying signal about policy content, the expert judgments taken as a whole scale nicely—in the sense that in aggregate they are all capturing information about the same underlying quantity. Table 3 shows this, reporting a scale and coding reliability analysis for economic policy positions of the 1987 and 1997 manifestos, derived by treating economic policy scores for each sentence allocated by each of the six expert coders as six sets of independent estimates of economic policy positions.

Despite the variance in expert coding of the policy domains as seen in Table 2, overall agreement as to the policy domain of sentences was 0.93 using Fleiss' kappa, a very high level of inter-rater agreement (as κ ranges from 0 to 1.0).[18] A far more important benchmark of reliability, however, focuses on the construction of the scale resulting from combining the coders' judgments, which is of more direct interest than the codes assigned to any particular fragment of text. Scale reliability, as measured by a Cronbach's alpha of 0.95, is "excellent" by any conventional standard.[19] We can therefore apply our model to aggregate the noisy information contained in the combined set of expert judgements at the sentence level to produce coherent estimates of policy positions at the document level. This is the essence of crowd-sourcing. It shows that our experts are really a small crowd.

[18] Expert agreement for the random order coding as to the precise scoring of positions within the policy domains had κ = 0.56 for a polarity scale (left, neutral, right) and κ = 0.41 for the full five-point scale. For position scoring, agreement rates can be estimated only roughly, however, as sentences might have been assigned different policy domains by different raters, and therefore be placed using a different positional scale.

[19] Conventionally, an alpha of 0.70 is considered "acceptable." Nearly identical results for social policy are available in the Online Appendix (Section 1d). Note that we use Cronbach's alpha as a measure of scale reliability across readers, as opposed to a measure of inter-reader agreement (in which case we would have used Krippendorff's alpha).
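For readers who want to run this kind of reliability check on their own coder-by-unit score matrix, a minimal sketch of Cronbach's alpha is given below; the simulated data and the function name are our own illustrative assumptions, standing in for the real expert scores.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an items-by-observations matrix of scores.

    Rows are 'items' (here, individual coders); columns are the units scored.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[0]
    item_variances = scores.var(axis=1, ddof=1)
    total_variance = scores.sum(axis=0).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances.sum() / total_variance)

# Toy example: six simulated coders scoring ten units that share a latent position.
rng = np.random.default_rng(1)
latent = rng.uniform(-2, 2, size=10)
coders = latent + rng.normal(0, 0.5, size=(6, 10))  # each coder adds independent noise
print(round(cronbach_alpha(coders), 2))
```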

DEPLOYING CROWD-SOURCED TEXT CODING

CrowdFlower: A crowd-sourcing platform with multiple channels

Many online platforms now distribute crowd-sourced microtasks (Human Intelligence Tasks, or "HITs") via the Internet. The best known is Amazon's Mechanical Turk (MT), an online marketplace for serving HITs to workers in the crowd. Workers must often pass a pretask qualification test, and maintain a certain quality score from validated tasks that determines their status and qualification for future jobs. However, MT has for legal reasons become increasingly difficult to use for non-U.S. researchers and workers, with the result that a wide range of alternative crowd-sourcing channels has opened up. Rather than relying on a single crowd-sourcing channel, we used CrowdFlower, a service that consolidates access to dozens of channels.[20] CrowdFlower not only offers an interface for designing templates and uploading tasks that look the same on any channel but, crucially, also maintains a common training and qualification system for potential workers from any channel before they can qualify for tasks, as well as cross-channel quality control while tasks are being completed.

[20] See http://www.crowdflower.com.

Quality control

Excellent quality assurance is critical to all reliable and valid data production. Given the natural economic motivation of workers in the crowd to finish as many jobs in as short a time as possible, it is both tempting and easy for workers to submit bad or faked data. Workers who do this are called "spammers." Given the open nature of the platform, it is vital to prevent them from participating in a job, using careful screening and quality control (e.g., Berinsky et al. 2014; Eickhoff and de Vries 2012; Kapelner and Chandler 2010; Nowak and Rüger 2010). Conway used coding experiments to assess three increasingly strict screening tests for workers in the crowd (Conway 2013).[21] Two findings directly inform our design. First, using a screening or qualification test substantially improves the quality of results; a well-designed test can screen out spammers and bad workers who otherwise tend to exploit the job. Second, once a suitable test is in place, increasing its difficulty does not improve results. It is vital to have a filter on the front end to keep out spammers and bad workers, but a tougher filter does not necessarily lead to better workers.

[21] There was a baseline test with no filter, a "low-threshold" filter where workers had to code 4/6 sentences correctly, and a "high-threshold" filter that required 5/6 correct labels. A "correct" label means the sentence is labeled as having the same policy domain as that provided by a majority of expert coders. The intuition here is that tough tests also tend to scare away good workers.

The primary quality control system used by CrowdFlower relies on completion of "gold" HITs: tasks with unambiguous correct answers specified in advance.[22] Correct performance on "gold" tasks, which are both used in qualification tests and randomly sprinkled through the job, is used to monitor worker quality and block spammers and bad workers. We specified our own set of gold HITs as sentences for which there was unanimous expert agreement on both policy area (economic, social, or neither) and policy direction (left or right, liberal or conservative), and seeded each job with the recommended proportion of about 10% "gold" sentences. We therefore used "natural" gold sentences occurring in our text corpus, but could also have used "artificial" gold, manufactured to represent archetypical economic or social policy statements. We also used a special type of gold sentences called "screeners" (Berinsky et al. 2014). These contained an exact instruction on how to label the sentence,[23] set in a natural two-sentence context, and are designed to ensure coders pay attention throughout the coding process.

[22] For CrowdFlower's formal definition of gold, see https://success.crowdflower.com/hc/en-us/articles/201855809-Guide-to-Test-Question-Data.
[23] For example, "Please code this sentence as having economic policy content with a score of very right."

Specifying gold sentences in this way, we implemented a two-stage process of quality control. First, workers were only allowed into the job if they correctly completed 8 out of 10 gold tasks in a qualification test.[24] Once workers are on the job and have seen at least four more gold sentences, they are given a "trust" score, which is simply the proportion of correctly labeled gold. If workers get too many gold HITs wrong, their trust level goes down. They are ejected from the job if their trust score falls below 0.8. The current trust score of a worker is recorded with each HIT, and can be used to weight the contribution of the relevant piece of information to some aggregate estimate. Our tests showed this weighting made no substantial difference, however, mainly because trust scores all tended to range in a tight interval around a mean of 0.84.[25] Many more potential HITs than we use here were rejected as "untrusted," because the workers did not pass the qualification test, or because their trust score subsequently fell below the critical threshold. Workers are not paid for rejected HITs, giving them a strong incentive to perform tasks carefully, as they do not know which of these have been designated as gold for quality assurance. We have no hesitation in concluding that a system of thorough and continuous monitoring of worker quality is necessary for reliable and valid crowd-sourced text analysis.

[24] Workers giving wrong labels to gold questions are given a short explanation of why they are wrong.
[25] Our Online Appendix (Section 4) reports the distribution of trust scores from the complete set of crowd codings by country of the worker and channel, in addition to results that scale the manifesto aggregate policy scores by the trust scores of the workers.
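The gold-based screening just described can be summarized in a few lines of code. The sketch below is our own illustration (class and function names are hypothetical); it encodes the qualification rule of 8 correct out of 10 gold tasks and the ejection rule of a trust score below 0.8 once a worker has seen a handful of in-job gold sentences.

```python
from dataclasses import dataclass

TRUST_THRESHOLD = 0.8    # workers are ejected once their share of correct gold falls below this
QUALIFICATION_PASS = 8   # 8 of the 10 gold items in the entry quiz must be answered correctly
MIN_GOLD_SEEN = 4        # trust is only evaluated after a few in-job gold items

def passes_qualification(correct_in_quiz: int) -> bool:
    return correct_in_quiz >= QUALIFICATION_PASS

@dataclass
class Worker:
    worker_id: str
    gold_correct: int = 0
    gold_seen: int = 0

    def record_gold(self, correct: bool) -> None:
        self.gold_seen += 1
        self.gold_correct += int(correct)

    @property
    def trust(self) -> float:
        return self.gold_correct / self.gold_seen if self.gold_seen else 0.0

    @property
    def active(self) -> bool:
        return self.gold_seen < MIN_GOLD_SEEN or self.trust >= TRUST_THRESHOLD

# Example: a worker who gets 3 of 5 in-job gold sentences right drops below 0.8 and is ejected.
w = Worker("crowd_0007")
for outcome in (True, True, False, False, True):
    w.record_gold(outcome)
print(w.trust, w.active)  # 0.6 False
```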

Deployment

We set up an interface on CrowdFlower that was nearly identical to our custom-designed expert web system and deployed this in two stages. First, we oversampled all sentences in the 1987 and 1997 manifestos, because we wanted to determine the number of judgments per sentence needed to derive stable estimates of our quantities of interest. We served up sentences from the 1987 and 1997 manifestos until we obtained a minimum of 20 judgments per sentence. After analyzing the results to determine that our estimates of document scale positions converged on stable values once we had five judgments per sentence—in results we report below—we served the remaining manifestos until we reached five judgments per sentence. In all, we gathered 215,107 judgments by crowd workers of the 18,263 sentences in our 18 manifestos, employing a total of 1,488 different workers from 49 different countries. About 28 percent of these came from the United States, 15 percent from the United Kingdom, 11 percent from India, and 5 percent each from Spain, Estonia, and Germany. The average worker processed about 145 sentences; most processed between 10 and 70 sentences, 44 workers processed over 1,000 sentences, and four processed over 5,000.[26]

[26] Our final crowd-coded dataset was generated by deploying through a total of 26 CrowdFlower channels. The most common was Neodev (Neobux) (40%), followed by Mechanical Turk (18%), Bitcoinget (15%), Clixsense (13%), and Prodege (Swagbucks) (6%). Opening up multiple worker channels also avoided the restriction imposed by Mechanical Turk in 2013 to limit the labor pool to workers based in the United States and India. Full details, along with the range of trust scores for coders from these platforms, are presented in the Online Appendix (Section 4).

FIGURE 3. Expert and Crowd-sourced Estimates of Economic and Social Policy Positions

CROWD-SOURCED ESTIMATES OF PARTY POLICY POSITIONS

Figure 3 plots crowd-sourced estimates of the economic and social policy positions of British party manifestos against estimates generated from analogous expert text processing.[27] The very high correlations of aggregate policy measures generated by crowd workers and experts suggest both are measuring the same latent quantities. Substantively, Figure 3 also shows that crowd workers identified the sharp rightwards shift of Labour between 1987 and 1997 on both economic and social policy, a shift identified by expert text processing and independent expert surveys. The standard errors of crowd-sourced estimates are higher for social than for economic policy, reflecting both the smaller number of manifesto sentences devoted to social policy and higher coder disagreement over the application of this policy domain.[28] Nonetheless, Figure 3 summarizes our evidence that the crowd-sourced estimates of party policy positions can be used as substitutes for the expert estimates, which is our main concern in this article.

[27] Full point estimates are provided in the Online Appendix, Section 1.
[28] An alternative measure of correlation, Lin's concordance correlation coefficient (Lin 1989, 2000), measures correspondence as well as covariation, if our objective is to match the values on the identity line, although for many reasons here it is not. The economic and social measures for Lin's coefficient are 0.95 and 0.84, respectively.

Our scaling model provides a theoretically well-grounded way to aggregate all the information in our expert or crowd data, relating the underlying position of the political text both to the "difficulty" of a particular sentence and to a reader's propensity to identify the correct policy domain, and position within domain.[29] Because positions derived from the scaling model depend on parameters estimated using the full set of coders and codings, changes to the text corpus can affect the relative scaling. The simple mean of means method, however, is invariant to rescaling and always produces the same results, even for a single document.

[29] We report more fully on diagnostic results for our coders on the basis of the auxiliary model quantity estimates in the Online Appendix (Section 1e).

FIGURE 4. Expert and Crowd-sourced Estimates of Economic and Social Policy Codes of Individual Sentences, all Manifestos

[Figure omitted: two scatterplots, "Economic Domain" and "Social Domain," plotting expert mean codes (vertical axis) against crowd mean codes (horizontal axis).]

Note: Fitted line is the principal components or Deming regression line.

Comparing crowd-sourced estimates from the scaling model to those produced by a simple averaging of the mean of mean sentence scores, we find correlations of 0.96 for the economic and 0.97 for the social policy positions of the 18 manifestos. We present both methods as confirmation that our scaling method has not "manufactured" policy estimates. While this model does allow us to take proper account of reader and sentence fixed effects, it is also reassuring that a simple mean of means produced substantively similar estimates.

We have already seen that noisy expert judgments about sentences aggregate up to reliable and valid estimates for documents. Similarly, crowd-sourced document estimates reported in Figure 3 are derived from crowd-sourced sentence data that are full of noise. As we already argued, this is the essence of crowd-sourcing. Figure 4 plots mean expert against mean crowd-sourced scores for each sentence. The scores are highly correlated, though crowd workers are substantially less likely to use the extremes of the scales than experts. The first principal component and associated confidence intervals show a strong and significant statistical relationship between crowd-sourced and expert assessments of individual manifesto sentences, with no evidence of systematic bias in the crowd-coded sentence scores.[30] Overall, despite the expected noise, our results show that crowd workers systematically tend to make the same judgments about individual sentences as experts.

[30] Lack of bias is indicated by the fact that the fitted line crosses the origin.

Calibrating the number of crowd judgments per sentence

A key question for our method concerns how many noisier crowd-based judgments we need to generate reliable and valid estimates of fairly long documents such as party manifestos. To answer this, we turn to evidence from our oversampling of the 1987 and 1997 manifestos. Recall that we obtained a minimum of 20 crowd judgments for each sentence in each of these manifestos, allowing us to explore what our estimates of the position of each manifesto would have been, had we collected fewer judgments. Drawing random subsamples from our oversampled data, we can simulate the convergence of estimated document positions as a function of the number of crowd judgments per sentence. We did this by bootstrapping 100 sets of subsamples for each of the subsets of n = 1 to n = 20 workers, computing manifesto positions in each policy domain from aggregated sentence position means, and computing standard deviations of these manifesto positions across the 100 estimates. Figure 5 plots these for each manifesto as a function of the increasing number of crowd workers per sentence, where each point represents the empirical standard error of the estimates for a specific manifesto. For comparison, we plot the same quantities for the expert data in red.
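The bootstrap just described can be sketched as follows. The data here are simulated stand-ins for the oversampled 1987 and 1997 codings, and the function name, noise levels, and number of sentences are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated stand-in for the oversampled data: 1,000 sentences x 20 crowd scores each,
# centred on sentence-specific latent positions.
latent = rng.normal(0.3, 0.8, size=1000)
scores = latent[:, None] + rng.normal(0.0, 1.0, size=(1000, 20))

def bootstrap_se(scores: np.ndarray, n_codes: int, n_boot: int = 100) -> float:
    """Std. deviation of the mean-of-means document estimate when only
    n_codes randomly drawn codes per sentence are used."""
    estimates = []
    for _ in range(n_boot):
        cols = rng.integers(0, scores.shape[1], size=(scores.shape[0], n_codes))
        subsample = np.take_along_axis(scores, cols, axis=1)
        estimates.append(subsample.mean(axis=1).mean())
    return float(np.std(estimates))

for n in (1, 5, 10, 20):
    print(n, round(bootstrap_se(scores, n), 4))
```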

FIGURE 5. Standard Errors of Manifesto-level Policy Estimates as a Function of the Number of Workers, for the Oversampled 1987 and 1997 Manifestos

[Figure omitted: two panels, "Economic" and "Social," plotting the standard error of bootstrapped manifesto estimates against the number of crowd codes per sentence.]

Note: Each point is the bootstrapped standard deviation of the mean of means aggregate manifesto scores, computed from sentence-level random n subsamples from the codes.

crowd-sourced sentence judgments led to convergence with our expert panel of five to six coders at around 15 crowd coders. However, the steep decline in the uncertainty of our document estimates leveled out at around five crowd judgments per sentence, at which point the absolute level of error is already low for both policy domains. While increasing the number of unbiased crowd judgments will always give better estimates, we decided on cost-benefit grounds for the second stage of our deployment to continue coding in the crowd until we had obtained five crowd judgments per sentence.

This may seem a surprisingly small number, but there are a number of important factors to bear in mind in this context. First, the manifestos comprise about 1000 sentences on average; our estimates of document positions aggregate codes for these. Second, sentences were randomly assigned to workers, so each sentence score can be seen as an independent estimate of the position of the manifesto on each dimension.31 With five scores per sentence and about 1000 sentences per manifesto, we have about 5000 “little” estimates of the manifesto position, each a representative sample from the larger set of scores that would result from additional worker judgments about each sentence in each document. This sample is big enough to achieve a reasonable level of precision, given the large number of sentences per manifesto. While the method we use here could be used for much shorter documents, the results we infer here for the appropriate number of judgments per sentence might well not apply, and would likely be higher. But, for large documents with many sentences, we find that the number of crowd judgments per sentence that we need is not high.

31 Coding a sentence as referring to another dimension is a null estimate.
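To make the uncertainty calculation behind Figure 5 concrete, the sketch below bootstraps the “mean of means” document score from n randomly drawn crowd codes per sentence. It is a minimal illustration in Python, not our replication code; the data layout (one array of observed crowd codes per sentence) and the simulated codes are assumptions made purely for the example.

import numpy as np

def bootstrap_se(codes_by_sentence, n_codes, n_boot=200, seed=1):
    # codes_by_sentence: one 1-D array per sentence, holding the crowd codes
    # observed for that sentence (e.g., -1/0/+1 on the economic policy scale).
    # For each bootstrap replicate, draw n_codes judgments per sentence (with
    # replacement, treating the observed codes as the pool of potential crowd
    # judgments), average within sentences, then average across sentences
    # ("mean of means"); return the standard deviation of the document scores.
    rng = np.random.default_rng(seed)
    doc_scores = np.empty(n_boot)
    for b in range(n_boot):
        sentence_means = [rng.choice(codes, size=n_codes, replace=True).mean()
                          for codes in codes_by_sentence]
        doc_scores[b] = np.mean(sentence_means)
    return doc_scores.std(ddof=1)

# Simulated manifesto: 1000 sentences with 20 observed crowd codes each.
rng = np.random.default_rng(42)
simulated = [rng.choice([-1, 0, 1], size=20) for _ in range(1000)]
for n in (1, 5, 10, 15, 20):
    print(n, round(bootstrap_se(simulated, n), 4))

With independent simulated codes the decline in the standard error follows roughly a one-over-square-root-of-n pattern, which is consistent with the steep initial drop and subsequent flattening described in the text.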

CROWD-SOURCING DATA FOR SPECIFIC PROJECTS: IMMIGRATION POLICY

FIGURE 6. Correlation of Combined Immigration Crowd Codings with Benoit (2010) Expert Survey Position on Immigration

[Scatterplot of estimated immigration positions: crowd estimates (vertical axis, “Crowd”) against expert survey positions (horizontal axis, “Expert Survey”) for the BNP, UKIP, Lab, Con, PC, LD, SNP, and Greens; r = 0.96.]

A key problem for scholars using “canonical” datasets, over and above the replication issues we discuss above, is that the data often do not measure what a modern researcher wants to measure. For example, the widely used MP data, using a classification scheme designed in the 1980s, do not measure immigration policy, a core concern in the party politics of the 21st century (Ruedin 2013; Ruedin and Morales 2012). Crowd-sourcing data frees researchers from such “legacy” problems and allows them more flexibly to collect information on their precise quantities of interest. To demonstrate this, we designed a project tailored to measure British parties’ immigration policies during the 2010 election. We analyzed the manifestos of eight parties, including smaller parties with more extreme positions on immigration, such as the British National Party (BNP) and the UK Independence Party (UKIP). Workers were asked to label each sentence as referring to immigration policy or not. If a sentence did cover immigration, they were asked to rate it as pro- or anti-immigration, or neutral. We deployed a job with 7,070 manifesto sentences plus 136 “gold” questions and screeners devised specifically for this purpose. For this job, we used an adaptive sentence sampling strategy which set a minimum of
five crowd-sourced labels per sentence, unless the first three of these were unanimous in judging a sentence not to concern immigration policy. This is efficient when coding texts with only “sparse” references to the matter of interest; in this case most manifesto sentences (approximately 96%) were clearly not about immigration policy. Within just five hours, the job was completed, with 22,228 codings, for a total cost of $360.32

32 The job set 10 sentences per “task” and paid $0.15 per task.
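The adaptive rule just described is easy to state in code. The sketch below illustrates only the stopping logic; the label values and the simulation of worker responses are our own assumptions, and the actual deployment ran on the crowd-sourcing platform’s own scheduler.

import random

def needs_more_labels(labels, minimum=5, screen_after=3):
    # Stop early if the first `screen_after` labels unanimously say the
    # sentence is not about immigration; otherwise collect `minimum` labels.
    if len(labels) >= screen_after and all(
            lab == "not_immigration" for lab in labels[:screen_after]):
        return False
    return len(labels) < minimum

def simulate_sentence(p_not_immigration, seed=None):
    # Toy simulation of crowd responses for a single sentence.
    rng = random.Random(seed)
    labels = []
    while needs_more_labels(labels):
        if rng.random() < p_not_immigration:
            labels.append("not_immigration")
        else:
            labels.append(rng.choice(["pro", "anti", "neutral"]))
    return labels

print(simulate_sentence(1.0, seed=1))   # off-topic sentence: stops at 3 labels
print(simulate_sentence(0.1, seed=1))   # immigration sentence: collects 5 labels

Under this rule, sentences that are clearly off-topic (roughly 96% of manifesto sentences in the immigration job) cost three judgments rather than five, which is where most of the efficiency gain comes from.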
We assess the external validity of our results using independent expert surveys by Benoit (2010) and the Chapel Hill Expert Survey (Bakker et al. 2015). Figure 6 compares the crowd-sourced estimates to those from expert surveys. The correlation with the Benoit (2010) estimates (shown) was 0.96, and 0.94 with independent expert survey estimates from the Chapel Hill survey.33 To assess whether this data production exercise was as reproducible as we claim, we repeated the entire exercise with a second deployment two months after the first, with identical settings. This new job generated another 24,551 pieces of crowd-sourced data and was completed in just over three hours. The replication generated nearly identical estimates, detailed in Table 4, correlating at the same high levels with external expert surveys, and correlating at 0.93 with party position estimates from the original crowd coding.34 With just hours from deployment to dataset, and for very little cost, crowd sourcing enabled us to generate externally valid and reproducible data related to our precise research question.

33 CHES included two highly correlated measures, one aimed at “closed or open” immigration policy, another aimed at policy toward asylum seekers and whether immigrants should be integrated into British society. Our measure averages the two. Full numerical results are given in the Online Appendix, Section 3.
34 Full details are in the Online Appendix, Section 7.
TABLE 4. Comparison Results for Replication of Immigration Policy Crowd Coding

                                                   Wave
                                               Initial   Replication   Combined
Total Crowd Codings                             24,674        24,551     49,225
Number of Coders                                    51            48         85
Total Sentences Coded as Immigration               280           264        283
Correlation with Benoit Expert Survey (2010)      0.96          0.94       0.96
Correlation with CHES 2010                        0.94          0.91       0.94
Correlation of Results between Waves                            0.93

CROWD-SOURCED TEXT ANALYSIS IN OTHER CONTEXTS AND LANGUAGES

As carefully designed official statements of a party’s policy stances, election manifestos tend to respond well to systematic text analysis. In addition, manifestos are written for popular consumption and tend to be easily understood by nontechnical readers. Much political information, however, can be found in texts generated from hearings, committee debates, or legislative speeches on issues that often refer to technical provisions, amendments, or other rules of procedure that might prove harder to analyze. Furthermore, a majority of the world’s political texts are not in English. Other widely studied political contexts, such as the European Union, are multilingual environments where researchers using automated methods designed for a single language must make hard choices. Schwarz et al. (forthcoming) applied unsupervised scaling methods to a multilingual debate in the Swiss parliament, for
instance, but had to ignore a substantial number of French and Italian speeches in order to focus on the majority German texts. In this section, we demonstrate that crowd-sourced text analysis, with appropriately translated instructions, offers the means to overcome these limitations by working in any language.

Our corpus comes from a debate in the European Parliament, a multilanguage setting where the EU officially translates every document into 24 languages. To test our method in a context very different from party manifestos, we chose a fairly technical debate concerning a Commission report proposing an extension to a regulation permitting state aid to uncompetitive coal mines. This debate concerned not only the specific proposal, involving a choice of letting the subsidies expire in 2011, permitting a limited continuation until 2014, or extending them until 2018 or even indefinitely.35 It also served as a debating platform for arguments supporting state aid to uncompetitive industries, versus the traditionally liberal preference for the free market over subsidies. Because a vote was taken at the end of the debate, we also have an objective measure of whether the speakers supported or objected to the continuation of state aid.

35 This was the debate from 23 November 2010, “State aid to facilitate the closure of uncompetitive coal mines.” http://bit.ly/EP-Coal-Aid-Debate

We downloaded all 36 speeches from this debate, originally delivered by speakers from 11 different countries in 10 different languages. Only one of these speakers, an MEP from the Netherlands, spoke in English, but all speeches were officially translated into each target language. After segmenting this debate into sentences, devising instructions and representative test sentences, and translating these into each language, we deployed the same text analysis job in English, German, Spanish, Italian, Polish, and Greek, using crowd workers to read and label the same set of texts, but using the translation into their own language. Figure 7 plots the score for each text against the eventual vote of the speaker. It shows that our crowd-sourced scores for each speech perfectly predict the voting behavior of each speaker, regardless of the language. In Table 5, we show correlations between our crowd-sourced estimates of the positions of the six different language versions of the same set of texts. The results are striking, with interlanguage correlations ranging between 0.92 and 0.96.36 Our text measures from this technical debate produced reliable measures of the very specific dimension we sought to estimate, and the validity of these measures was demonstrated by their ability to predict the voting behavior of the speakers. Not only are these results straightforwardly reproducible, but this reproducibility is invariant to the language in which the speech was written. Crowd-sourced text analysis does not work only in English.

36 Lin’s concordance coefficient has a similar range of values, from 0.90 to 0.95.
FIGURE 7. Scored Speeches from a Debate over State Subsidies by Vote, from Separate Crowd-sourced Text Analysis in Six Languages

[Mean crowd score of each speech, from the German, English, Spanish, Greek, Italian, and Polish versions, plotted by the speaker’s eventual vote: For (n=25) versus Against (n=6).]

Note: Aggregate scores are standardized for direct comparison.
TABLE 5. Summary of Results from EP Debate Coding in Six Languages

Correlations of 35 Speaker Scores

Language              English   German   Spanish   Italian   Greek   Polish
German                 0.96       –        –         –        –        –
Spanish                0.94      0.95      –         –        –        –
Italian                0.92      0.94     0.92       –        –        –
Greek                  0.95      0.97     0.95      0.92      –        –
Polish                 0.96      0.95     0.94      0.94     0.93      –
Sentence N              414       455      418       349      454     437
Total Judgments        3,545     1,855    2,240     1,748    2,396   2,256
Cost                  $109.33   $55.26   $54.26    $43.69   $68.03  $59.25
Elapsed Time (hrs)        1         3        3         7        2       1
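For readers who want to reproduce comparisons like those in Table 5, the sketch below computes both the Pearson correlation and Lin’s concordance correlation coefficient (mentioned in footnote 36) for pairs of language-specific speaker scores. The scores used here are simulated stand-ins, not the EP estimates.

import numpy as np

def lin_ccc(x, y):
    # Lin's concordance correlation coefficient: agreement around the
    # 45-degree line, penalizing location and scale differences as well as
    # imperfect correlation.
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()               # population variances (ddof=0)
    sxy = ((x - mx) * (y - my)).mean()      # population covariance
    return 2 * sxy / (vx + vy + (mx - my) ** 2)

# scores[lang]: vector of speaker scores from the job run in that language.
# Simulated data for illustration: six noisy versions of one latent position.
rng = np.random.default_rng(0)
latent = rng.normal(size=35)
scores = {lang: latent + rng.normal(scale=0.25, size=35)
          for lang in ["English", "German", "Spanish", "Italian", "Greek", "Polish"]}

langs = list(scores)
for i, a in enumerate(langs):
    for b in langs[i + 1:]:
        r = np.corrcoef(scores[a], scores[b])[0, 1]
        print(f"{a}-{b}: r = {r:.2f}, CCC = {lin_ccc(scores[a], scores[b]):.2f}")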

CONCLUSIONS

We have illustrated across a range of applications that crowd-sourced text analysis can produce valid political data of a quality indistinguishable from traditional expert methods. Unlike traditional methods, however, crowd-sourced data generation offers several advantages. Foremost among these is the possibility of meeting a replication standard far stronger than the current practice of facilitating reproducible analysis. By offering a published specification for feasibly replicating the process of data generation, the methods demonstrated here go much farther towards meeting a more stringent standard of reproducibility that is the hallmark of scientific inquiry. All of the data used in this article are of course available in a public archive for any reader to reanalyze at will. Crowd-sourcing our data allows us to do much more than this, however. Any reader can take our publicly available crowdsourcing code and deploy this code to reproduce our data collection process and collect a completely new dataset. This can be done many times over, by any researcher, anywhere in the world. This, to our minds, takes us significantly closer to a true scientific replication standard.

Another key advantage of crowd-sourced text analysis is that it can form part of an agile research process, precisely tailored to a specific research question rather than reflecting the grand compromise at the heart of the large canonical datasets so commonly deployed by political scientists. Because the crowd’s resources can be tapped in a flexible fashion, text-based data on completely new questions of interest can be processed only for the contexts, questions, and time periods required. Coupled with the rapid completion time of crowd-sourced tasks and their very low marginal cost, this opens the possibility of valid text processing to researchers with limited resources, especially graduate students. For those with more ambition or resources, its inherent scalability means that crowd-sourcing can tackle large projects as well. In our demonstrations, our method worked as well for hundreds of judgments as it did for hundreds of thousands.

Of course, retooling for any new technology involves climbing a learning curve. We spent considerable time pretesting instruction wordings, qualification tests, compensation schemes, gold questions, and a range of other detailed matters. Starting a new crowd-sourcing project is by no means cost-free, though these costs are mainly denominated in learning time and effort spent by the researcher, rather than research dollars. Having paid the inevitable fixed start-up costs that apply to any rigorous new data collection project, whether or not this involves crowd-sourcing, the beauty of crowd-sourcing arises from two key features of the crowd. The pool of crowd workers is to all intents and purposes inexhaustible, giving crowd-sourcing projects a scalability and replicability unique among projects employing human workers. And the low marginal cost of adding more crowd workers to any given project puts ambitious high quality data generation projects in the realistic grasp of a wider range of researchers than ever before. We are still in the early days of crowd-sourced
data generation in the social sciences. Other scholars will doubtless find many ways to fortify the robustness and broaden the scope of the method. But, whatever these developments, we now have a new method for collecting political data that allows us to do things we could not do before.

APPENDIX: METHODOLOGICAL DECISIONS ON SERVING POLITICAL TEXT TO WORKERS IN THE CROWD

Text units: Natural sentences. The MP specifies a “quasisentence” as the fundamental text unit, defined as “an argument which is the verbal expression of one political idea or issue” (Volkens 2001). Recoding experiments by Däubler et al. (2012), however, show that using natural sentences makes no statistically significant difference to point estimates, but does eliminate significant sources of both unreliability and unnecessary work. Our dataset therefore consists of all natural sentences in the 18 UK party manifestos under investigation.37

37 Segmenting “natural” sentences, even in English, is never an exact science, but our rules matched those from Däubler et al. (2012), treating (for example) separate clauses of bullet-pointed lists as separate sentences.

Text unit sequence: Random. In “classical” expert text coding, experts process sentences in their natural sequence, starting at the beginning and ending at the end of a document. Most workers in the crowd, however, will never reach the end of a long policy document. Processing sentences in natural sequence, moreover, creates a situation in which one sentence coding may well affect priors for subsequent sentence codings, so that summary scores for particular documents are not aggregations of independent coder assessments.38 An alternative is to randomly sample sentences from the text corpus for coding—with a fixed number of replacements per sentence across all coders—so that each coding is an independent estimate of the latent variable of interest. This has the big advantage in a crowdsourcing context of scalability. Jobs for individual coders can range from very small to very large; coders can pick up and put down coding tasks at will; every little piece of coding in the crowd contributes to the overall database of text codings. Accordingly, our method for crowd-sourced text coding serves coders sentences randomly selected from the text corpus rather than in naturally occurring sequence. Our decision to do this was informed by coding experiments reported in the Online Appendix (Section 5), and confirmed by results reported above. Despite higher variance in individual sentence codings under random sequence coding, there is no systematic difference between point estimates of party policy positions depending on whether sentences were coded in natural or random sequence.

38 Coded sentences do indeed tend to occur in “runs” of similar topics, and hence codes; however, to ensure appropriate statistical aggregation it is preferable if the codings of those sentences are independent.
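A minimal sketch of this random serving scheme is given below, in Python. The parameter values (five codes per sentence, ten sentences per task) follow the deployments described in the main text but are illustrative rather than prescriptive, and the function is not the platform’s actual scheduler.

import random

def build_random_tasks(sentence_ids, codes_per_sentence=5, task_size=10, seed=1):
    # Replicate each sentence a fixed number of times, shuffle the pool, and
    # deal it out in fixed-size tasks, so sentences reach coders in random
    # order rather than in their natural document sequence. (A production
    # system would also prevent the same worker from seeing a sentence twice.)
    rng = random.Random(seed)
    pool = [sid for sid in sentence_ids for _ in range(codes_per_sentence)]
    rng.shuffle(pool)
    return [pool[i:i + task_size] for i in range(0, len(pool), task_size)]

# Example: 1,000 sentences with five codes each -> 500 tasks of ten sentences.
tasks = build_random_tasks(range(1000))
print(len(tasks), tasks[0])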
Text authorship: Anonymous. In classical expert coding, coders typically know the authorship of the document they are coding. Especially in the production of political data, coders likely bring nonzero priors to coding text units. Precisely the same sentence (“we must do all we can to make the public sector more efficient”) may be coded in different ways if the coder knows this comes from a right- rather than a left-wing party. Codings are typically aggregated into document scores as if coders had zero priors, even though we do not know how much of the score given to some sentence is the coder’s judgment about the content of the sentence, and how much a judgment about its author. In coding experiments reported in the Online Appendix (Section 5), semiexpert coders coded the same manifesto sentences both knowing and not knowing the name of the author. We found slight systematic coding biases arising from knowing the identity of the document’s author. For example, we found coders tended to code precisely the same sentences from Conservative manifestos as more right wing if they knew these sentences came from a Conservative manifesto. This informed our decision to withhold the name of the author of sentences deployed in crowd-sourcing text coding.

Context units: +/− two sentences. Classical content analysis has always involved coding an individual text unit in light of the text surrounding it. Often, it is this context that gives a sentence substantive meaning, for example because many sentences contain pronoun references to surrounding text. For these reasons, careful instructions for drawing on context have long formed part of coder instructions for content analysis (see Krippendorff 2013). For our coding scheme, on the basis of prerelease coding experiments, we situated each “target” sentence within a context of the two sentences on either side in the text. Coders were instructed to code the target sentence, not the context, but to use the context to resolve any ambiguity they might feel about the target sentence.
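The context rule is equally simple to express in code; the sketch below returns a target sentence together with up to two sentences on either side. The example sentences are invented (apart from the target, which borrows the example quoted above), and how the target was visually distinguished in the actual coding interface is not reproduced here.

def context_window(sentences, i, width=2):
    # Return the target sentence plus up to `width` sentences of surrounding
    # context on each side, mirroring the +/- two-sentence rule.
    lo, hi = max(0, i - width), min(len(sentences), i + width + 1)
    return {"before": sentences[lo:i],
            "target": sentences[i],
            "after": sentences[i + 1:hi]}

# Example
manifesto = ["We will cut taxes.", "Spending will be reviewed.",
             "We must do all we can to make the public sector more efficient.",
             "Waste will not be tolerated.", "Budgets will be balanced."]
print(context_window(manifesto, 2))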

SUPPLEMENTARY MATERIAL

To view supplementary material for this article, please visit http://dx.doi.org/10.1017/S0003055416000058.

REFERENCES

Alonso, O., and R. Baeza-Yates. 2011. “Design and Implementation of Relevance Assessments Using Crowdsourcing.” In Advances in Information Retrieval, eds. P. Clough, C. Foley, C. Gurrin, G. Jones, W. Kraaij, H. Lee, and V. Mudoch. Berlin: Springer.
Alonso, O., and S. Mizzaro. 2009. “Can We Get Rid of TREC Assessors? Using Mechanical Turk for Relevance Assessment.” Paper read at Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation.
Ariely, D., W. T. Au, R. H. Bender, D. V. Budescu, C. B. Dietz, H. Gu, and G. Zauberman. 2000. “The Effects of Averaging Subjective Probability Estimates between and within Judges.” Journal of Experimental Psychology: Applied 6 (2): 130–47.
Armstrong, J. S., ed. 2001. Principles of Forecasting: A Handbook for Researchers and Practitioners. New York: Springer.
Baker, Frank B., and Seock-Ho Kim. 2004. Item Response Theory: Parameter Estimation Techniques. Boca Raton: CRC Press.
Bakker, Ryan, Catherine de Vries, Erica Edwards, Liesbet Hooghe, Seth Jolly, Gary Marks, Jonathan Polk, Jan Rovny, Marco Steenbergen, and Milada Vachudova. 2015. “Measuring Party Positions in Europe: The Chapel Hill Expert Survey Trend File, 1999–2010.” Party Politics 21 (1): 143–52.
Benoit, Kenneth. 2005. “Policy Positions in Britain 2005: Results from an Expert Survey.” London School of Economics.
Benoit, Kenneth. 2010. “Expert Survey of British Political Parties.” Trinity College Dublin.
Benoit, Kenneth, and Michael Laver. 2006. Party Policy in Modern Democracies. London: Routledge.
Berinsky, A., G. Huber, and G. Lenz. 2012. “Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk.” Political Analysis 20 (3): 351–68.
Berinsky, A., M. Margolis, and M. Sances. 2014. “Separating the Shirkers from the Workers? Making Sure Respondents Pay Attention on Self-Administered Surveys.” American Journal of Political Science.
Bohannon, J. 2011. “Social Science for Pennies.” Science 334: 307.
Budge, Ian, Hans-Dieter Klingemann, Andrea Volkens, Judith Bara, Eric Tannenbaum, Richard Fording, Derek Hearl, Hee Min Kim, Michael McDonald, and Silvia Mendes. 2001. Mapping Policy Preferences: Estimates for Parties, Electors and Governments 1945–1998. Oxford: Oxford University Press.
Budge, Ian, David Robertson, and Derek Hearl. 1987. Ideology, Strategy and Party Change: Spatial Analyses of Post-War Election Programmes in 19 Democracies. Cambridge, UK: Cambridge University Press.
Cao, J., S. Stokes, and S. Zhang. 2010. “A Bayesian Approach to Ranking and Rater Evaluation: An Application to Grant Reviews.” Journal of Educational and Behavioral Statistics 35 (2): 194–214.
Carpenter, B. 2008. “Multilevel Bayesian Models of Categorical Data Annotation.” Unpublished manuscript.
Chandler, Jesse, Pam Mueller, and Gabriel Paolacci. 2014. “Nonnaïveté among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers.” Behavior Research Methods 46 (1): 112–30.
Clemen, R., and R. Winkler. 1999. “Combining Probability Distributions From Experts in Risk Analysis.” Risk Analysis 19 (2): 187–203.
Conway, Drew. 2013. “Applications of Computational Methods in Political Science.” Department of Politics, New York University.
Däubler, Thomas, Kenneth Benoit, Slava Mikhaylov, and Michael Laver. 2012. “Natural Sentences as Valid Units for Coded Political Text.” British Journal of Political Science 42 (4): 937–51.
Eickhoff, C., and A. de Vries. 2012. “Increasing Cheat Robustness of Crowdsourcing Tasks.” Information Retrieval 15: 1–17.
Fox, Jean-Paul. 2010. Bayesian Item Response Modeling: Theory and Applications. New York: Springer.
Galton, F. 1907. “Vox Populi.” Nature (London) 75: 450–1.
Goodman, Joseph, Cynthia Cryder, and Amar Cheema. 2013. “Data Collection in a Flat World: Strengths and Weaknesses of Mechanical Turk Samples.” Journal of Behavioral Decision Making 26 (3): 213–24.
Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97.
Hambleton, Ronald K., Hariharan Swaminathan, and H. Jane Rogers. 1991. Fundamentals of Item Response Theory. Thousand Oaks, CA: Sage.
Hooghe, Liesbet, Ryan Bakker, Anna Brigevich, Catherine de Vries, Erica Edwards, Gary Marks, Jan Rovny, Marco Steenbergen, and Milada Vachudova. 2010. “Reliability and Validity of Measuring Party Positions: The Chapel Hill Expert Surveys of 2002 and 2006.” European Journal of Political Research 49 (5): 687–703.
Horton, J., D. Rand, and R. Zeckhauser. 2011. “The Online Laboratory: Conducting Experiments in a Real Labor Market.” Experimental Economics 14: 399–425.
Hsueh, P., P. Melville, and V. Sindhwani. 2009. “Data Quality from Crowdsourcing: A Study of Annotation Selection Criteria.” Paper read at Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing.
Ipeirotis, Panagiotis G., Foster Provost, Victor S. Sheng, and Jing Wang. 2014. “Repeated Labeling Using Multiple Noisy Labelers.” Data Mining and Knowledge Discovery 28 (2): 402–41.
Jones, Bryan D., and Frank R. Baumgartner. 2013. “Policy Agendas Project.”
Kapelner, A., and D. Chandler. 2010. “Preventing Satisficing in Online Surveys: A ‘Kapcha’ to Ensure Higher Quality Data.” Paper read at The World’s First Conference on the Future of Distributed Work (CrowdConf 2010).
King, Gary. 1995. “Replication, Replication.” PS: Political Science & Politics 28 (03): 444–52.
Klingemann, Hans-Dieter, Richard I. Hofferbert, and Ian Budge. 1994. Parties, Policies, and Democracy. Boulder: Westview Press.
Klingemann, Hans-Dieter, Andrea Volkens, Judith Bara, Ian Budge, and Michael McDonald. 2006. Mapping Policy Preferences II: Estimates for Parties, Electors, and Governments in Eastern Europe, European Union and OECD 1990–2003. Oxford: Oxford University Press.
Krippendorff, Klaus. 2013. Content Analysis: An Introduction to Its Methodology. 3rd ed. Thousand Oaks, CA: Sage.
Laver, M. 1998. “Party Policy in Britain 1997: Results from an Expert Survey.” Political Studies 46 (2): 336–47.
Laver, Michael, and Ian Budge. 1992. Party Policy and Government Coalitions. New York: St. Martin’s Press.
Laver, Michael, and W. Ben Hunt. 1992. Policy and Party Competition. New York: Routledge.
Lawson, C., G. Lenz, A. Baker, and M. Myers. 2010. “Looking Like a Winner: Candidate Appearance and Electoral Success in New Democracies.” World Politics 62 (4): 561–93.
Lin, L. 1989. “A Concordance Correlation Coefficient to Evaluate Reproducibility.” Biometrics 45: 255–68.
Lin, L. 2000. “A Note on the Concordance Correlation Coefficient.” Biometrics 56: 324–5.
Lord, Frederic. 1980. Applications of Item Response Theory to Practical Testing Problems. New York: Routledge.
Lyon, Aidan, and Eric Pacuit. 2013. “The Wisdom of Crowds: Methods of Human Judgement Aggregation.” In Handbook of Human Computation, ed. P. Michelucci. New York: Springer.
Mason, W., and S. Suri. 2012. “Conducting Behavioral Research on Amazon’s Mechanical Turk.” Behavior Research Methods 44 (1): 1–23.
Mikhaylov, Slava, Michael Laver, and Kenneth Benoit. 2012. “Coder Reliability and Misclassification in Comparative Manifesto Project Codings.” Political Analysis 20 (1): 78–91.
Nowak, S., and S. Rüger. 2010. “How Reliable are Annotations via Crowdsourcing? A Study about Inter-Annotator Agreement for Multi-Label Image Annotation.” Paper read at The 11th ACM International Conference on Multimedia Information Retrieval, 29–31 March 2010, Philadelphia.
Paolacci, Gabriel, Jesse Chandler, and Panagiotis Ipeirotis. 2010. “Running Experiments on Amazon Mechanical Turk.” Judgement and Decision Making 5: 411–9.
Quoc Viet Hung, Nguyen, Nguyen Thanh Tam, Lam Ngoc Tran, and Karl Aberer. 2013. “An Evaluation of Aggregation Techniques in Crowdsourcing.” In Web Information Systems Engineering – WISE 2013, eds. X. Lin, Y. Manolopoulos, D. Srivastava, and G. Huang. Berlin: Springer.
Raykar, V. C., S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. 2010. “Learning from Crowds.” Journal of Machine Learning Research 11: 1297–322.
Ruedin, Didier. 2013. “Obtaining Party Positions on Immigration in Switzerland: Comparing Different Methods.” Swiss Political Science Review 19 (1): 84–105.
Ruedin, Didier, and Laura Morales. 2012. “Obtaining Party Positions on Immigration from Party Manifestos.” Paper presented at the Elections, Public Opinion and Parties (EPOP) conference, Oxford, 7 Sept 2012.
Schwarz, Daniel, Denise Traber, and Kenneth Benoit. Forthcoming. “Estimating Intra-Party Preferences: Comparing Speeches to Votes.” Political Science Research and Methods.
Sheng, V., F. Provost, and Panagiotis Ipeirotis. 2008. “Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers.” Paper read at Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Snow, R., B. O’Connor, D. Jurafsky, and A. Ng. 2008. “Cheap and Fast—But is it Good?: Evaluating Non-expert Annotations for Natural Language Tasks.” Paper read at Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Surowiecki, J. 2004. The Wisdom of Crowds. New York: W.W. Norton & Company, Inc.
Turner, Brandon M., Mark Steyvers, Edgar C. Merkle, David V. Budescu, and Thomas S. Wallsten. 2014. “Forecast Aggregation via Recalibration.” Machine Learning 95 (3): 261–89.
Volkens, Andrea. 2001. “Manifesto Coding Instructions, 2nd revised ed.” Discussion Paper. Berlin: Wissenschaftszentrum Berlin für Sozialforschung gGmbH (WZB), p. 96.
Welinder, P., S. Branson, S. Belongie, and P. Perona. 2010. “The Multidimensional Wisdom of Crowds.” Paper read at Advances in Neural Information Processing Systems 23 (NIPS 2010).
Welinder, P., and P. Perona. 2010. “Online Crowdsourcing: Rating Annotators and Obtaining Cost-effective Labels.” Paper read at IEEE Conference on Computer Vision and Pattern Recognition Workshops (ACVHL).
Whitehill, J., P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan. 2009. “Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise.” Paper read at Advances in Neural Information Processing Systems 22 (NIPS 2009).