Quality Signals in Generated Stories

Manasvi Sagarkar∗ John Wieting† Lifu Tu‡ Kevin Gimpel‡



∗University of Chicago, Chicago, IL, 60637, USA
†Carnegie Mellon University, Pittsburgh, PA, 15213, USA
‡Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA
manasvi@uchicago.edu, jwieting@cs.cmu.edu, {lifu,kgimpel}@ttic.edu

Abstract

We study the problem of measuring the quality of automatically-generated stories. We focus on the setting in which a few sentences of a story are provided and the task is to generate the next sentence ("continuation") in the story. We seek to identify what makes a story continuation interesting, relevant, and of high overall quality. We crowdsource annotations along these three criteria for the outputs of story continuation systems, design features, and train models to predict the annotations. Our trained scorer can be used as a rich feature function for story generation, a reward function for systems that use reinforcement learning to learn to generate stories, and as a partial evaluation metric for story generation.

1 Introduction

We study the problem of automatic story generation in the climate of neural network natural language generation methods. Story generation (Mani, 2012; Gervás, 2012) has a long history, beginning with rule-based systems in the 1970s (Klein et al., 1973; Meehan, 1977). Most story generation research has focused on modeling the plot, characters, and primary action of the story, using simplistic methods for producing the actual linguistic form of the stories (Turner, 1993; Riedl and Young, 2010). More recent work learns from data how to generate stories holistically, without a clear separation between content selection and surface realization (McIntyre and Lapata, 2009), with a few recent methods based on recurrent neural networks (Roemmele and Gordon, 2015; Huang et al., 2016).

We follow the latter style and focus on a setting in which a few sentences of a story are provided (the context) and the task is to generate the next sentence in the story (the continuation). Our goal is to produce continuations that are both interesting and relevant given the context.

Neural networks are increasingly employed for natural language generation, most often with encoder-decoder architectures based on recurrent neural networks (Cho et al., 2014; Sutskever et al., 2014). However, while neural methods are effective for generating individual sentences conditioned on some context, they struggle with coherence when used to generate longer texts (Kiddon et al., 2016). In addition, it is challenging to apply neural models to less constrained generation tasks with many valid solutions, such as open-domain dialogue and story continuation.

The story continuation task is difficult to formulate and evaluate because there can be a wide variety of reasonable continuations for typical story contexts. This is also the case in open-domain dialogue systems, in which common evaluation metrics like BLEU (Papineni et al., 2002) are only weakly correlated with human judgments (Liu et al., 2016). Another problem with metrics like BLEU is their dependence on a gold standard. In story generation and open-domain dialogue, there can be several equally good continuations for any given context, which suggests that the quality of a continuation should be computable without reliance on a gold standard.
In this paper, we study the question of identifying the characteristics of a good continuation for a given context. We begin by building several story generation systems that generate a continuation from a context. We develop simple systems based on recurrent neural networks and similarity-based retrieval and train them on the ROC story dataset (Mostafazadeh et al., 2016). We use crowdsourcing to collect annotations of the quality of the continuations without revealing the gold standard. We ask annotators to judge continuations along three distinct criteria: overall quality, relevance, and interestingness. We collect multiple annotations for 4586 context/continuation pairs. These annotations permit us to compare methods for story generation and to study the relationships among the criteria. We analyze our annotated dataset by developing features of the context and continuation and measuring their correlation with each criterion.

We combine these features with neural networks to build models that predict the human scores, thus attempting to automate the process of human quality judgment. We find that our predicted scores correlate well with human judgments, especially when using our full feature set. Our scorer can be used as a rich feature function for story generation or a reward function for systems that use reinforcement learning to learn to generate stories. It can also be used as a partial evaluation metric for story generation.[1] Examples of contexts, generated continuations, and quality predictions from our scorer are shown in Table 3. The annotated data and trained scorer are available at the authors' websites.

[1] However, since our scorer does not use a gold standard, it is possible to "game" the metric by directly optimizing the predicted score, so if used as an evaluation metric, it should still be validated with a small-scale manual evaluation.

2 Related Work

Research in automatic story generation has a long history, with early efforts driven primarily by hand-written rules (Klein et al., 1973; Meehan, 1977; Dehn, 1981; Lebowitz, 1985; Turner, 1993), often drawing from theoretical analyses of stories (Propp, 1968; Schank and Abelson, 1975; Thorndyke, 1977; Wilensky, 1983). Later methods were based on planning techniques from artificial intelligence (Theune et al., 2003; Oinonen et al., 2006; Riedl and Young, 2010) or on commonsense knowledge resources (Liu and Singh, 2002; Winston, 2014). A detailed summary of this earlier work is beyond our scope; for surveys, please see Mani (2012), Gervás (2012), or Gatt and Krahmer (2017).

More recent work in story generation has focused on data-driven methods (McIntyre and Lapata, 2009, 2010; McIntyre, 2011; Elson, 2012; Daza et al., 2016; Roemmele, 2016). The generation problem is often constrained via anchoring to some other input, such as a topic or list of keywords (McIntyre and Lapata, 2009), a sequence of images (Huang et al., 2016), a set of loosely-connected sentences (Jain et al., 2017), or settings in which a user and agent take turns adding sentences to a story (Swanson and Gordon, 2012; Roemmele and Gordon, 2015; Roemmele, 2016).

Our annotation criteria (relevance, interestingness, and overall quality) are inspired by those from prior work. McIntyre and Lapata (2009) similarly obtain annotations for story interestingness. They capture coherence in generated stories using an automatic method based on sentence shuffling. We discuss the relationship between relevance and coherence below in Section 3.2.

Roemmele et al. (2017) use automated linguistic analysis to evaluate story generation systems. They explore the various factors that affect the quality of a story by measuring feature values for different story generation systems, but they do not obtain any quality annotations as we do here.

Since there is little work on automatic evaluation of story generation, we can turn to the related task of open-domain dialogue. Evaluation of dialogue systems often uses perplexity or metrics like BLEU (Papineni et al., 2002), but Liu et al. (2016) show that most common evaluation metrics for dialogue systems are correlated very weakly with human judgments. Lowe et al. (2017) develop an automatic metric for dialogue evaluation by training a model to predict crowdsourced quality judgments. While this idea is very similar to our work, one key difference is that their annotators were shown both system outputs and the gold standard for each context. We fear this can bias the annotations by turning them into a measure of similarity to the gold standard, so we do not show the gold standard to annotators.

Wang et al. (2017) use crowdsourcing (upvotes on Quora) to obtain quality judgments for short stories and train models to predict them. One difference is that we obtain annotations for three distinct criteria, while they only use upvotes. Another difference is that we collect annotations for both manually-written continuations and a range of system-generated continuations, with the goal of using our annotations to train a scorer that can be used within training.

3 Data Collection

Our goal is to collect annotations of the quality of a sentence in a story given its preceding sentences. We use the term context to refer to the preceding sentences and continuation to refer to the next sentence being generated and evaluated. We now describe how we obtain ⟨context, continuation⟩ pairs from automatic and human-written stories for crowdsourcing quality judgments.
We use the ROC story corpus (Mostafazadeh et al., 2016), which contains 5-sentence stories about everyday events. We use the initial data release of 45,502 stories. The first 45,002 stories form our training set (TRAIN) for story generation models and the last 500 stories form our development set (DEV) for tuning hyperparameters while training story generation models. For collecting annotations, we compile a dataset of 4586 context-continuation pairs, drawing contexts from DEV as well as the 1871-story validation set from the ROC Story Cloze task (Mostafazadeh et al., 2016).

For contexts, we use 3- and 4-sentence prefixes from the stories in this set of 4586. We use both 3- and 4-sentence contexts because we do not want our annotated dataset to include only story endings (for the 4-sentence contexts, the original 5th sentence is the ending of the story) but also more general instances of story continuation. We did not use 1- or 2-sentence contexts because we consider the space of possible continuations for such short contexts to be too unconstrained, which would make the task difficult for both systems and annotators.

We generated continuations for each context using a variety of systems (described in Section 3.1) as well as by simply taking the human-written continuation from the original story. We then obtained annotations for the continuation with its context via crowdsourcing, described in Section 3.2.

3.1 Story Continuation Systems

In order to generate a dataset with a range of qualities, we consider six ways of generating the continuation of the story, four based on neural sequence-to-sequence models and two using human-written sentences. To lessen the possibility of annotators seeing the same context multiple times, which could bias the annotations, we used at most two of the six methods for generating the continuation for a particular context.

3.1.1 Sequence-to-Sequence Models

We used a standard sequence-to-sequence (SEQ2SEQ) neural network model (Sutskever et al., 2014) to generate continuations given contexts. We trained the models on TRAIN and tuned on DEV. We generated 180,008 ⟨context, continuation⟩ pairs from TRAIN, where the continuation is always a single sentence and the context consists of all previous sentences in the story.

We trained a 3-layer bidirectional SEQ2SEQ model, with each layer having hidden vector dimensionality 1024. The size of the vocabulary was 31,220. We used scheduled sampling (Bengio et al., 2015), using the previous ground truth word in the decoder with probability 0.5^t, where t is the index of the mini-batch processed during training. We trained the model for 20,000 epochs with a batch size of 100. We began training the model on consecutive sentence pairs (so the context was only a single sentence), then shifted to training on full story contexts.

We considered four different methods for the decoding function of our SEQ2SEQ model:

• SEQ2SEQ-GREEDY: return the highest-scoring output under greedy (arg max) decoding.

• SEQ2SEQ-DIV: return the kth-best output using a diverse beam search (Vijayakumar et al., 2016) with beam size k = 10.

• SEQ2SEQ-SAMPLE: sample words from the distribution over output words at each step using a temperature parameter τ = 0.4.

• SEQ2SEQ-REVERSE: reverse the input sequence (at test time only) and use greedy decoding.

Each decoding rule contributes one eighth of the total data generated for annotation, so the SEQ2SEQ models account for one half of the ⟨context, continuation⟩ pairs to be annotated.
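As an illustration of the SEQ2SEQ-SAMPLE decoding rule above, the sketch below draws each word from a temperature-scaled softmax. The step_logits function stands in for the trained decoder and the end-of-sentence handling is our own simplification; the paper does not release this code.

```python
import numpy as np

def sample_continuation(step_logits, vocab, eos_id, tau=0.4, max_len=20, seed=0):
    """Sample a continuation word by word with temperature tau (SEQ2SEQ-SAMPLE)."""
    rng = np.random.default_rng(seed)
    prefix = []
    for _ in range(max_len):
        logits = np.asarray(step_logits(prefix), dtype=np.float64)
        scaled = logits / tau                      # tau = 0.4 sharpens the distribution
        probs = np.exp(scaled - scaled.max())      # numerically stable softmax
        probs /= probs.sum()
        word_id = rng.choice(len(vocab), p=probs)  # sample rather than take the arg max
        if word_id == eos_id:
            break
        prefix.append(word_id)
    return [vocab[i] for i in prefix]
```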
3.1.2 Human Generated Outputs

For human generated continuations, we use two methods. The first is simply the gold standard continuation from the ROC stories dataset, which we call HUMAN. The second finds the most similar context in the ROC training corpus, then returns the continuation for that context. To compute similarity between contexts, we use the sum of two similarity scores: BLEU score (Papineni et al., 2002) and the overall sentence similarity described by Li et al. (2006). Since this method is similar to an information retrieval-based story generation system, we refer to it as RETRIEVAL. HUMAN and RETRIEVAL each contribute a fourth of the total data generated for annotation.
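A rough sketch of the RETRIEVAL method, assuming the training set is available as tokenized (context, continuation) pairs. For brevity it scores candidates with sentence-level BLEU only; the sentence similarity of Li et al. (2006), which the system also adds in, is omitted here, so this is an approximation rather than the exact system.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def retrieve_continuation(query_context, train_pairs):
    """Return the stored continuation of the training context most similar to the query."""
    smooth = SmoothingFunction().method1   # avoid zero BLEU for short overlaps
    best_score, best_continuation = float("-inf"), None
    for context_tokens, continuation_tokens in train_pairs:
        score = sentence_bleu([context_tokens], query_context,
                              smoothing_function=smooth)
        if score > best_score:
            best_score, best_continuation = score, continuation_tokens
    return best_continuation
```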
3.2 Crowdsourcing Annotations

We used Amazon Mechanical Turk to collect annotations of continuations paired with their contexts. We collected annotations for 4586 context-continuation pairs, eliciting judgments of the following three criteria for each pair:

• Overall quality (O): a subjective judgment by the annotator of the quality of the continuation, i.e., roughly how much the annotator thinks the continuation adds to the story.

• Relevance (R): a measure of how relevant the continuation is to the context. This addresses the question of whether the continuation fits within the world of the story.

• Interestingness (I): a measure of the amount of new (but still relevant) information added to the story. We use this to measure whether the continuation makes the story more interesting.

Our criteria follow McIntyre and Lapata (2009), who used interestingness and coherence as two quality criteria for story generation. Our notion of relevance is closely related to coherence; when thinking of judging a continuation, we believed that it would be more natural for annotators to judge the relevance of the continuation to its context, rather than judging the coherence of the resulting story. That is, coherence is a property of a discourse, while relevance is a property of a continuation (in relation to the context).

Our overall quality score was intended to capture any remaining factors that determine human quality judgment. In preliminary annotation experiments, we found that the overall score tended to capture a notion of fluency/grammaticality, hence we decided not to annotate this criterion separately. We asked annotators to forgive minor ungrammaticalities in the continuations and rate them as long as they could be understood. If annotators could not understand the continuation, we asked them to assign a score of 0 for all criteria.

We asked the workers to rate the continuations on a scale of 1 to 10, with 10 being the highest score. We obtained annotations from two distinct annotators for each pair and for each criterion, adding up to a total of 4586 × 2 × 3 = 27,516 judgments. We asked annotators to annotate all three criteria for a given pair simultaneously in one HIT.[2] We required workers to be located in the United States, to have a HIT approval rating greater than 97%, and to have had at least 500 HITs approved. We paid $0.08 per HIT. Since task duration can be difficult to estimate from HIT times (due to workers becoming distracted or working on multiple HITs simultaneously), we report the top 5 modes of the time duration data in seconds. For pairs with 3 sentences in the context, the most frequent durations are 11, 15, 14, 17, and 21 seconds. For 4 sentences, the most frequent durations are 18, 20, 19, 21, and 23 seconds.

[2] In a preliminary study, we experimented with asking for each criterion separately to avoid accidental correlation of the criteria, but found that it greatly reduced cumulative cognitive load for each annotator to do all three together.

We required each worker to annotate no more than 150 continuations so as not to bias the data collected. After collecting all annotations, we adjusted the scores to account for how harshly or leniently each worker scored the sentences on average. We did this by normalizing each score by the absolute value of the difference between the worker's mean score and the average mean score of all workers for each criterion. We only normalized scores of workers who annotated more than 10 pairs in order to ensure reliable worker means. We then averaged the two adjusted sets of scores for each pair to get a single set of scores.

4 Dataset Analysis

Table 1 shows means and standard deviations for the three criteria. The means are similar across the three, though interestingness has the lowest, which aligns with our expectations of the ROC stories. For measuring inter-annotator agreement, we consider the mean absolute difference (MAD) of the two judgments for each pair.[3] Table 1 shows the MADs for each criterion and the corresponding standard deviations (SDAD). Overall quality and interestingness showed slightly lower MADs than relevance, though all three criteria are similar.

Criterion         Mean  Std.  IA MAD  IA SDAD
Overall           5.2   2.5   2.1     1.6
Relevance         5.2   3.0   2.3     1.8
Interestingness   4.6   2.5   2.1     1.9

Table 1: Means and standard deviations for each criterion, as well as inter-annotator (IA) mean absolute differences (MAD) and standard deviations of absolute differences (SDAD).

[3] Cohen's Kappa is not appropriate for our data because, while we obtained two annotations for each pair, they were not always from the same pair of annotators. In this case, an annotator-agnostic metric like MAD (and its associated standard deviation) is a better measure of agreement.
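For concreteness, the inter-annotator statistics reported in Table 1 can be computed as follows (a small sketch in our own notation; the paper does not publish this code):

```python
import numpy as np

def agreement_stats(judgments):
    """Return (MAD, SDAD) for one criterion.

    judgments: array of shape (num_pairs, 2) holding the two annotators'
    scores for each context/continuation pair.
    """
    judgments = np.asarray(judgments, dtype=float)
    abs_diff = np.abs(judgments[:, 0] - judgments[:, 1])
    return abs_diff.mean(), abs_diff.std()

# Toy example on the 1-10 scale:
# agreement_stats([[5, 7], [8, 8], [3, 6]])  ->  (approx. 1.67, approx. 1.25)
```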
The average scores for each data source are shown in Table 2. The ranking of the systems is consistent across criteria. Human-written continuations are best under all three criteria. The HUMAN relevance average is higher than interestingness. This matches our intuitions about the ROC corpus: the stories were written to capture commonsense knowledge about everyday events rather than to be particularly surprising or interesting stories in their own right. Nonetheless, we do find that the HUMAN continuations have higher interestingness scores than all automatic systems.

System            #     O     R     I
SEQ2SEQ-GREEDY    596   4.18  4.09  3.81
SEQ2SEQ-DIV       584   3.36  3.50  3.00
SEQ2SEQ-SAMPLE    578   3.69  3.70  3.42
SEQ2SEQ-REVERSE   577   4.61  4.39  4.02
RETRIEVAL         1086  5.68  4.93  5.15
HUMAN             1165  7.22  8.05  6.33

Table 2: Average criteria scores for each system (O = overall, R = relevance, I = interestingness).

The RETRIEVAL system actually outperforms all SEQ2SEQ systems on all criteria, though the gap is smallest on relevance. We found that the SEQ2SEQ systems often produced continuations that fit topically within the world suggested by the context, though they were often generic or merely topically relevant without necessarily moving the story forward. We found S2S-GREEDY produced outputs that were grammatical and relevant but tended to be more mundane, whereas S2S-REVERSE tended to produce slightly more interesting outputs that were still grammatical and relevant on average. The sampling and diverse beam search outputs were frequently ungrammatical and therefore suffer under all criteria.

We show sample outputs from the different systems in Table 3. We also show predicted criteria scores from our final automatic scoring model (see Section 6 for details). We show predicted rather than annotated scores here because for a given context, we did not obtain annotations for all continuations for that context. We can see some of the characteristics of the different models and understand how their outputs differ. The RETRIEVAL outputs are sometimes more interesting than the HUMAN outputs, though they often mention new entities that were not contained in the context, or they may be merely topically related to the context without necessarily resulting in a coherent story. This affects interestingness as well, as a continuation must first be relevant in order to be interesting.
Context 1: Tripp wanted to learn how to put a topspin on his serve . He was a more advanced tennis player . He sought out a tennis pro to help him . He finally perfected his topspin .

System        Continuation                                                            O     R     I
S2S-GREEDY    He won the game .                                                       4.12  4.99  3.45
S2S-DIV       Now he had the game.                                                    4.25  5.04  3.60
S2S-SAMPLE    Now , he is able to play his .                                          4.48  4.88  3.94
S2S-REVERSE   He took a few minutes .                                                 4.92  5.77  4.22
RETRIEVAL     Lyn now has a great backhand .                                          6.05  7.14  5.34
HUMAN         His game improved even more.                                            5.05  6.16  4.36

Context 2: Neil had just entered the country of Oman . He found the desert land to be enchanting . The women dressed beautifully and the men were friendly . Neil felt very comfortable in Oman .

System        Continuation                                                            O     R     I
S2S-GREEDY    Neil decided to try the best man.                                       5.47  6.13  4.74
S2S-DIV       They were days and decided                                              4.66  5.65  3.96
S2S-SAMPLE    Neil Neil the trip trip of the trip of the trip                         4.95  5.66  4.72
S2S-REVERSE   He took a tour of the city.                                             3.97  4.83  3.64
RETRIEVAL     Neil saw that South Koreans were a very kind people !                   6.26  6.94  5.66
HUMAN         He wished he could stay forever!                                        6.24  7.22  5.58

Context 3: Ed and Emma were twins and wanted to have matching Halloween costumes . But they couldn 't agree on a costume ! Ed wanted to be a superhero and Emma wanted to be a mermaid .

System        Continuation                                                            O     R     I
S2S-GREEDY    He took out and could make to work .                                    4.60  5.11  4.11
S2S-DIV       So , s ' and they would learn .                                         4.71  5.41  4.18
S2S-SAMPLE    They decided went their great time and they their family . s house .   4.86  5.50  4.58
S2S-REVERSE   They decided to try to their local home .                               4.74  5.21  4.22
RETRIEVAL     Then their mom offered a solution to please them both .                 5.59  6.11  5.05
HUMAN         Then their mom said she could make costumes that 'd please them both .  6.17  6.71  5.69

Table 3: Sample system outputs for different contexts. The final three columns show predicted scores from our trained scorer (see Section 6 for details).

4.1 Relationships Among Criteria

                    Corr(O,R)  Corr(O,I)  Corr(R,I)
HUMAN               0.70       0.63       0.44
RETRIEVAL           0.68       0.52       0.47
HUMAN + RET.        0.76       0.61       0.53
SEQ2SEQ-ALL         0.72       0.70       0.59
Overall > 7.5       0.46       0.47       0.34
5 < Overall < 7.5   0.44       0.31       0.24
2.5 < Overall < 5   0.38       0.35       0.38
Overall < 2.5       0.41       0.41       0.38
Overall > 2.5       0.76       0.69       0.59

Table 4: Pearson correlations between criteria for different subsets of the annotated data.

Table 4 shows correlations among the criteria for different sets of outputs. RETRIEVAL outputs show a lower correlation between overall score and interestingness than HUMAN outputs. This is likely because the RETRIEVAL outputs with high interestingness scores frequently contained more surprising content, such as new character names or new actions/events that were not found in the context. Therefore, a high interestingness score was not as strongly correlated with overall quality as with HUMAN outputs, for which interesting continuations were less likely to contain erroneous new material.

HUMAN continuations have a lower correlation between relevance and interestingness than the RETRIEVAL or SEQ2SEQ models. This is likely because nearly all HUMAN outputs are relevant, so their interestingness does not depend on their relevance. For SEQ2SEQ, the continuations can only be interesting if they are first somewhat relevant to the context; nonsensical output was rarely annotated as interesting. Thus the SEQ2SEQ relevance and interestingness scores have a higher correlation than for HUMAN or RETRIEVAL.

The lower rows show correlations for different levels of overall quality. For stories whose overall quality is greater than 7.5, the correlations between the overall score and the other two criteria are higher than when the overall quality is lower. The correlation between relevance and interestingness is not as high (0.34). The stories at this quality level are already at least somewhat relevant and understandable, hence, like HUMAN outputs, the interestingness score is not as dependent on the relevance score. For stories with overall quality below 2.5, the stories are often not understandable, so annotators assigned low scores to all three criteria, leading to higher correlation among them.

4.2 Features

We also analyze our dataset by designing features of the ⟨context, continuation⟩ pair and measuring their correlation with each criterion.

4.2.1 Shallow Features

We consider simple features designed to capture surface-level characteristics of the continuation:

• Length: the number of tokens in the continuation.

• Relative length: the length of the continuation divided by the length of the context.

• Language model: perplexity from a 4-gram language model with modified Kneser-Ney smoothing estimated using KenLM (Heafield, 2011) from the Personal Story corpus (Gordon and Swanson, 2009), which includes about 1.6 million personal stories from weblogs.

• IDF: the average of the inverse document frequencies (IDFs) across all tokens in the continuation. The IDFs are computed using Wikipedia sentences as "documents".
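A minimal sketch of the length and IDF features above (the IDF smoothing is our own choice, and the KenLM perplexity feature is omitted since it requires a separately trained language model):

```python
import math

def shallow_features(context_tokens, continuation_tokens, doc_freq, num_docs):
    """Length, relative length, and average IDF of the continuation.

    doc_freq maps a token to the number of 'documents' (here, Wikipedia
    sentences) containing it; num_docs is the total number of documents.
    """
    length = len(continuation_tokens)
    relative_length = length / max(len(context_tokens), 1)
    idfs = [math.log(num_docs / (1 + doc_freq.get(tok, 0)))  # add-one smoothing (our choice)
            for tok in continuation_tokens]
    avg_idf = sum(idfs) / max(len(idfs), 1)
    return {"length": length,
            "relative_length": relative_length,
            "avg_idf": avg_idf}
```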
4.2.2 PMI Features

We use features based on pointwise mutual information (PMI) of word pairs in the context and continuation. We take inspiration from methods developed for the Choice of Plausible Alternatives (COPA) task (Roemmele et al., 2011), in which a premise is provided with two alternatives. Gordon et al. (2011) obtained strong results by using PMIs to compute a score that measures the causal relatedness between a premise and its potential alternatives. For a ⟨context, continuation⟩ pair, we compute the following score (Gordon et al., 2011):

s_pmi = ( Σ_{u ∈ context} Σ_{v ∈ continuation} PMI(u, v) ) / (N_context × N_continuation)

where N_context and N_continuation are the numbers of tokens in the context and continuation. We create 6 versions of the above score, combining three window sizes (10, 25, and 50) with both standard PMI and positive PMI (PPMI). To compute PMI/PPMI, we use the Personal Story corpus.[4] For efficiency and robustness, we only compute the PMI/PPMI of a word pair if the pair appears more than 10 times in the corpus using the particular window size.

[4] We use Wikipedia for IDFs and the Personal Story corpus for PMIs. IDF is a simpler statistic which is presumed to be similar across a range of large corpora for most words; we use Wikipedia because it has broad coverage in terms of vocabulary. PMIs require computing word pair statistics and are therefore expected to be more data-dependent, so we chose the Personal Story corpus due to its effectiveness for related tasks (Gordon et al., 2011).
4.2.3 Entity Mention Features

We compute several features to capture how relevant the continuation is to the input. In order to compute these features we use the part-of-speech tagging, named entity recognition (NER), and coreference resolution tools in Stanford CoreNLP (Manning et al., 2014):

• Has old mentions: a binary feature that returns 1 if the continuation has "old mentions," i.e., mentions that are part of a coreference chain that began in the context.

• Number of old mentions: the number of old mentions in the continuation.

• Has new mentions: a binary feature that returns 1 if the continuation has "new mentions," i.e., mentions that are not part of any coreference chain that began in the context.

• Number of new mentions: the number of new mentions in the continuation.

• Has new names: if the continuation has new mentions, this binary feature returns 1 if any of the new mentions is a name, i.e., if the mention is a person named entity from the NER system.

• Number of new names: the number of new names in the continuation.
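The sketch below shows one way to derive these mention features once a coreference system has been run over the context plus continuation. The (sentence_index, is_person_name) chain layout is our own simplification of CoreNLP's output, and singleton mentions are assumed to appear as one-element chains.

```python
def mention_features(num_context_sentences, coref_chains):
    """Old/new mention features for the continuation (the last sentence)."""
    old_mentions = new_mentions = new_names = 0
    for chain in coref_chains:
        started_in_context = any(s < num_context_sentences for s, _ in chain)
        for sent_idx, is_person_name in chain:
            if sent_idx < num_context_sentences:
                continue                      # only count mentions in the continuation
            if started_in_context:
                old_mentions += 1             # chain began in the context
            else:
                new_mentions += 1             # entirely new entity
                new_names += int(is_person_name)
    return {
        "has_old_mentions": int(old_mentions > 0),
        "num_old_mentions": old_mentions,
        "has_new_mentions": int(new_mentions > 0),
        "num_new_mentions": new_mentions,
        "has_new_names": int(new_names > 0),
        "num_new_names": new_names,
    }
```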
4.3 Comparing Features

Feature                   O       R       I
Length                    0.007   0.055   0.071
Relative length           0.018   0.020   0.060
Language model            0.025   0.034   0.058
IDF                       0.418   0.316   0.408
PPMI (w = 10)             0.265   0.321   0.224
PPMI (w = 25)             0.289   0.341   0.249
PPMI (w = 50)             0.299   0.351   0.259
Has old mentions          0.050   0.151   0.023
Number of old mentions    0.057   0.146   0.049
Has new mentions          -0.048  -0.115  -0.026
Number of new mentions    -0.052  -0.119  -0.029
Has new names             -0.005  -0.129  0.017
Number of new names       -0.005  -0.130  0.017
Is HUMAN?                 0.56    0.62    0.50
Is HUMAN ∪ RETRIEVAL?     0.60    0.49    0.56

Table 5: Spearman correlations between features and annotations. The final two rows are "oracle" binary features that return 1 for continuations from those sets.

Table 5 shows Spearman correlations between our features and the criteria.[5] The length features have small positive correlations with all three criteria, showing highest correlation with interestingness. Language model perplexity shows weak correlation for all three measures, with its highest correlation for interestingness. The SEQ2SEQ models output very common words, which lets them have relatively low perplexities even with occasional disfluencies, while the human-written outputs contain more rare words.

[5] These use the combined training and validation sets; we describe splitting the data below in Section 6.

The IDF feature shows highest correlation with overall and interestingness, and lower correlation with relevance. This is intuitive since the IDF feature will be largest when many rare words are used, which is expected to correlate with interestingness more than relevance. We suspect IDF correlates so well with overall because SEQ2SEQ models typically generate common words, so this feature may partially separate the SEQ2SEQ outputs from HUMAN/RETRIEVAL.

Unlike IDF, the PPMI scores (with window sizes w shown in parentheses) show highest correlations with relevance. This is intuitive, since PPMI will be highest when topical coherence is present in the discourse. Higher correlations are found when using larger window sizes.[6]

[6] We omit full results for brevity, but the PPMI features showed slightly higher correlations than PMI features.

The old mentions features have the highest correlation with relevance, as expected. A continuation that continues coreference chains is more likely to be relevant. The new mention/name features have negative correlations with relevance, which is also intuitive: introducing new characters makes the continuation less relevant.

To explore the question of separability between machine- and human-written continuations, we measured correlations of "oracle" features that simply return 1 if the output was generated by humans and 0 if it was generated by a system. Such features are highly correlated with all three criteria, as seen in the final two rows of Table 5. This suggests that human annotators strongly preferred human-generated stories over our models' outputs. Some features may correlate with the annotated criteria if they separate human- and machine-generated continuations (e.g., IDF).
5 Methods for Score Prediction

We now consider ways to build models to predict our criteria. We define neural networks that take as input representations of the context/continuation pair ⟨b, c⟩ and our features and output a continuous value for each predicted criterion.

We experiment with two ways of representing the input based on the embeddings of b and c, which we denote v_b and v_c respectively. The first ("cont") uses only the continuation embedding without any representation of the context or the similarity between the context and continuation: x_cont = ⟨v_c⟩. The second ("sim+cont") also contains the elementwise multiplication of the context and continuation embeddings concatenated with the absolute difference: x_sim+cont = ⟨v_b ⊙ v_c, |v_b − v_c|, v_c⟩.

To compute representations v, we use the average of character n-gram embeddings (Huang et al., 2013; Wieting et al., 2016), fixing the output dimensionality to 300. We found this to outperform other methods. In particular, the next best method used gated recurrent averaging networks (GRANs; Wieting and Gimpel, 2017), followed by long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), and followed finally by word averaging.

The input, whether x_cont or x_sim+cont, is fed to one fully-connected hidden layer with 300 units, followed by a rectified linear unit (ReLU) activation. Our manually computed features (Length, IDF, PMI, and Mention) are concatenated prior to this layer. The output layer follows and uses a linear activation.

We use mean absolute error as our loss function during training. We train to predict the three criteria jointly, so the loss is actually the sum of mean absolute errors over the three criteria. We found this form of multi-task learning to significantly outperform training separate models for each criterion. When tuning, we tune based on the average Spearman correlation across the three criteria on our validation set. We train all models for 25 epochs using Adam (Kingma and Ba, 2014) with a learning rate of 0.001.
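To make the architecture concrete, here is a compact sketch of the sim+cont scorer in PyTorch (the paper does not specify a framework; the number of hand-crafted features and the embedding function are assumptions left abstract here):

```python
import torch
import torch.nn as nn

class ContinuationScorer(nn.Module):
    """Jointly predicts overall, relevance, and interestingness from the sim+cont input."""

    def __init__(self, emb_dim=300, num_hand_features=10, hidden=300):
        super().__init__()
        # input: [v_b * v_c, |v_b - v_c|, v_c] concatenated with hand-crafted features
        self.hidden = nn.Linear(3 * emb_dim + num_hand_features, hidden)
        self.out = nn.Linear(hidden, 3)   # linear output, one unit per criterion

    def forward(self, v_b, v_c, hand_features):
        x = torch.cat([v_b * v_c, (v_b - v_c).abs(), v_c, hand_features], dim=-1)
        return self.out(torch.relu(self.hidden(x)))

def multitask_mae(pred, gold):
    # sum of per-criterion mean absolute errors (the multi-task loss described above)
    return (pred - gold).abs().mean(dim=0).sum()

# Training would use torch.optim.Adam(model.parameters(), lr=0.001) for 25 epochs,
# with v_b and v_c computed as averaged character n-gram embeddings.
```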
6 Experiments

After averaging the two annotator scores to get our dataset of 4586 context/continuation pairs, we split the data randomly into 600 pairs for validation, 600 for testing, and used the rest (3386) for training. For our evaluation metric, we use Spearman correlation between the scorer's predictions and the annotated scores.

6.1 Feature Ablation

Table 6 shows results as features are either removed from the full set or added to the featureless model, all when using the "cont" input schema. Each row corresponds to one feature ablation or addition, except for the final row, which corresponds to adding two feature sets that are efficient to compute: IDF and Length. The Mention and PMI features are the most useful for relevance, which matches the pattern of correlations in Table 5, while the IDF and Length features are most helpful for interestingness. All feature sets contribute in predicting overall quality, with the Mention features showing the largest drop in correlation when they are ablated.

Features         O     R     I
All features     57.3  53.4  49.6
- PMI            56.3  50.4  48.6
- IDF            56.6  53.6  46.0
- Mention        54.8  50.3  48.6
- Length         56.1  55.9  45.3
No features      51.9  44.9  43.8
+ PMI            54.5  50.9  44.9
+ IDF            54.3  46.7  46.3
+ Mention        53.8  48.8  46.0
+ Length         51.9  43.1  44.9
+ IDF, Length    54.6  46.5  47.3

Table 6: Ablation experiments with several feature sets (Spearman correlations on the validation set).

6.2 Final Results

Table 7 shows our final results on the validation and test sets. The highest correlations on the test set are achieved by using the sim+cont model with all features. While interestingness can be predicted reasonably well with just the IDF and Length features, the prediction of relevance is improved greatly with the full feature set.

model      features    val O  val R  val I  test O  test R  test I
cont       none        51.9   44.9   43.8   53.3    46.0    50.5
cont       IDF, Len.   54.6   46.5   47.3   51.6    40.6    50.2
cont       all         57.3   53.4   49.6   57.1    54.3    52.8
sim+cont   none        51.6   43.7   44.3   52.2    45.0    48.4
sim+cont   IDF, Len.   54.2   45.6   47.7   56.0    46.8    53.0
sim+cont   all         55.1   54.8   47.4   58.7    55.8    52.9

Table 7: Correlations (Spearman's ρ × 100) on validation and test sets for best models with three feature sets.

Using our strongest models, we computed the average predicted criterion scores for each story generation system on the test set. Overall, the predicted rankings are strongly correlated with the rankings yielded by the aggregated annotations shown in Table 2, especially in terms of distinguishing human-written and machine-generated continuations.
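The ranking comparison in the previous paragraph amounts to averaging predicted and annotated scores per system and checking their rank correlation, for example as below (a sketch with a hypothetical record layout; SciPy is one of several libraries that provide Spearman's ρ):

```python
from collections import defaultdict
from scipy.stats import spearmanr

def system_ranking_correlation(rows, criterion="overall"):
    """rows: dicts with keys "system", "predicted", "annotated" (criterion -> score)."""
    pred, gold = defaultdict(list), defaultdict(list)
    for row in rows:
        pred[row["system"]].append(row["predicted"][criterion])
        gold[row["system"]].append(row["annotated"][criterion])
    systems = sorted(pred)
    pred_avgs = [sum(pred[s]) / len(pred[s]) for s in systems]
    gold_avgs = [sum(gold[s]) / len(gold[s]) for s in systems]
    rho, _ = spearmanr(pred_avgs, gold_avgs)   # rank agreement between the two orderings
    return rho
```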
While the PMI features are very helpful for predicting relevance, they do have demanding space requirements due to the sheer number of word pairs with nonzero counts in large corpora. We attempted to replace the PMI features with similar features based on word embedding similarity, following the argument that skip-gram embeddings with negative sampling form an approximate factorization of a PMI score matrix (Levy and Goldberg, 2014). However, we were unable to match the same performance by doing so; the PMI scores were still superior.

For the automatic scores shown in Table 3, we used the sim+cont model with the IDF and Length features. Since this model does not require PMIs or NLP analyzers, it is likely to be the one used most in practice by other researchers within training/tuning settings. We release this trained scorer as well as our annotated data to the research community.

7 Conclusion

We conducted a manual evaluation of neural sequence-to-sequence and retrieval-based story continuation systems along three criteria: overall quality, relevance, and interestingness. We analyzed the annotations and identified features that correlate with each criterion. These annotations also provide a new story understanding task: predicting the quality scores of generated continuations. We took initial steps toward solving this task by developing an automatic scorer that uses features, compositional architectures, and multi-task training. Our trained continuation scorer can be used as a rich feature function for story generation or a reward function for systems that use reinforcement learning to learn to generate stories. The annotated data and trained scorer are available at the authors' websites.

Acknowledgments

We thank Melissa Roemmele and the anonymous reviewers for their comments. We also thank NVIDIA Corporation for donating GPUs used in this research.
References

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems. pages 1171–1179.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1724–1734.

Angel Daza, Hiram Calvo, and Jesús Figueroa-Nazuno. 2016. Automatic text generation by learning from literary structures. In Proceedings of the Fifth Workshop on Computational Linguistics for Literature. pages 9–19.

Natalie Dehn. 1981. Story generation after TALE-SPIN. In Proceedings of the 9th International Joint Conference on Artificial Intelligence (IJCAI). pages 16–18.

David K. Elson. 2012. Modeling Narrative Discourse. Ph.D. thesis, Columbia University.

Albert Gatt and Emiel Krahmer. 2017. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. arXiv preprint arXiv:1703.09902.

Pablo Gervás. 2012. Story generator algorithms. The Living Handbook of Narratology 19.

Andrew S. Gordon, Cosmin Adrian Bejan, and Kenji Sagae. 2011. Commonsense causal reasoning using millions of personal stories. In 25th Conference on Artificial Intelligence (AAAI-11).

Andrew S. Gordon and Reid Swanson. 2009. Identifying personal stories in millions of weblog entries. In Third International Conference on Weblogs and Social Media, Data Challenge Workshop.

Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8).

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, pages 2333–2338.

Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 1233–1239.

Parag Jain, Priyanka Agrawal, Abhijit Mishra, Mohak Sukhwani, Anirban Laha, and Karthik Sankaranarayanan. 2017. Story generation from sequence of independent short descriptions. In Workshop on Machine Learning for Creativity, at the 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017).

Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 329–339.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Sheldon Klein, John F. Aeschlimann, David F. Balsiger, Steven L. Converse, Claudine Court, Mark Foster, Robin Lao, John D. Oakley, and Joel Smith. 1973. Automatic novel writing: A status report. Technical Report 186, University of Wisconsin-Madison.

Michael Lebowitz. 1985. Story-telling as planning and learning. Poetics 14(6):483–502.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27.

Yuhua Li, David McLean, Zuhair A. Bandar, James D. O'Shea, and Keeley Crockett. 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering 18(8):1138–1150.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 2122–2132.

Hugo Liu and Push Singh. 2002. MAKEBELIEVE: Using commonsense knowledge to generate stories. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI). pages 957–958.

Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Inderjeet Mani. 2012. Computational modeling of narrative. Synthesis Lectures on Human Language Technologies 5(3):1–142.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations). pages 55–60.

Neil McIntyre and Mirella Lapata. 2009. Learning to tell tales: A data-driven approach to story generation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. pages 217–225.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. pages 1562–1572.

Neil Duncan McIntyre. 2011. Learning to Tell Tales: Automatic Story Generation from Corpora. Ph.D. thesis, The University of Edinburgh.

James R. Meehan. 1977. TALE-SPIN, an interactive program that writes stories. In Proceedings of the 5th International Joint Conference on Artificial Intelligence (IJCAI). pages 91–98.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 839–849.

K. M. Oinonen, Mariet Theune, Antinus Nijholt, and J. R. R. Uijlings. 2006. Designing a story database for use in automatic story generation. Springer Verlag, pages 298–301. Lecture Notes in Computer Science.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pages 311–318.

Vladimir Propp. 1968. Morphology of the Folktale, volume 9. University of Texas Press.

Mark O. Riedl and Robert Michael Young. 2010. Narrative planning: Balancing plot and character. Journal of Artificial Intelligence Research 39:217–268.

Melissa Roemmele. 2016. Writing stories with help from recurrent neural networks. In Thirtieth AAAI Conference on Artificial Intelligence. pages 4311–4312.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of Plausible Alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning. Stanford University.

Melissa Roemmele and Andrew S. Gordon. 2015. Creative help: a story writing assistant. In International Conference on Interactive Digital Storytelling. Springer, pages 81–92.

Melissa Roemmele, Andrew S. Gordon, and Reid Swanson. 2017. Evaluating story generation systems using automated linguistic analyses. In Workshop on Machine Learning for Creativity, at the 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017).

Roger C. Schank and Robert P. Abelson. 1975. Scripts, plans, and knowledge. In Proceedings of the 4th International Joint Conference on Artificial Intelligence (IJCAI). pages 151–157.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. pages 3104–3112.

Reid Swanson and Andrew S. Gordon. 2012. Say anything: Using textual case-based reasoning to enable open-domain interactive storytelling. ACM Trans. Interact. Intell. Syst. 2(3):16:1–16:35.

Mariët Theune, Sander Faas, Anton Nijholt, and Dirk Heylen. 2003. The virtual storyteller: story creation by intelligent agents. In Proceedings of the 1st International Conference on Technologies for Interactive Digital Storytelling and Entertainment. Springer, pages 204–215.

Perry W. Thorndyke. 1977. Cognitive structures in comprehension and memory of narrative discourse. Cognitive Psychology 9(1):77–110.

Scott R. Turner. 1993. Minstrel: a computer model of creativity and storytelling. Ph.D. thesis, University of California at Los Angeles.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.

Tong Wang, Ping Chen, and Boyang Li. 2017. Predicting the quality of short narratives from social media. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI).

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 1504–1515.

John Wieting and Kevin Gimpel. 2017. Revisiting recurrent networks for paraphrastic sentence embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 2078–2088.

Robert Wilensky. 1983. Story grammars versus story points. Behavioral and Brain Sciences 6(4):579–591.

Patrick Henry Winston. 2014. The Genesis story understanding and story telling system: A 21st century step toward artificial intelligence. Memo 019, Center for Brains, Minds and Machines, MIT.
