Quality Signals in Generated Stories
Table 1: Means and standard deviations for each criterion, as well as inter-annotator (IA) mean absolute differences (MAD) and standard deviations of absolute differences (SDAD).

i.e., roughly how much the annotator thinks the continuation adds to the story.
• Relevance (R): a measure of how relevant the continuation is to the context. This addresses the question of whether the continuation fits within the world of the story.
• Interestingness (I): a measure of the amount of new (but still relevant) information added to the story. We use this to measure whether the continuation makes the story more interesting.
Our criteria follow McIntyre and Lapata (2009) who used interestingness and coherence as two quality criteria for story generation. Our notion of relevance is closely related to coherence; when thinking of judging a continuation, we believed that it would be more natural for annotators to judge the relevance of the continuation to its context, rather than judging the coherence of the resulting story. That is, coherence is a property of a discourse, while relevance is a property of a continuation (in relation to the context).
Our overall quality score was intended to capture any remaining factors that determine human quality judgment. In preliminary annotation experiments, we found that the overall score tended to capture a notion of fluency/grammaticality, hence we decided not to annotate this criterion separately. We asked annotators to forgive minor ungrammaticalities in the continuations and rate them as long as they could be understood. If annotators could not understand the continuation, we asked them to assign a score of 0 for all criteria.
We asked the workers to rate the continuations on a scale of 1 to 10, with 10 being the highest score. We obtained annotations from two distinct annotators for each pair and for each criterion, adding up to a total of 4586×2×3 = 27516 judgments. We asked annotators to annotate all three criteria for a given pair simultaneously in one HIT.[2] We required workers to be located in the United States, to have a HIT approval rating greater than 97%, and to have had at least 500 HITs approved. We paid $0.08 per HIT. Since task duration can be difficult to estimate from HIT times (due to workers becoming distracted or working on multiple HITs simultaneously), we report the top 5 modes of the time duration data in seconds. For pairs with 3 sentences in the context, the most frequent durations are 11, 15, 14, 17, and 21 seconds. For 4 sentences, the most frequent durations are 18, 20, 19, 21, and 23 seconds.
We required each worker to annotate no more than 150 continuations so as not to bias the data collected. After collecting all annotations, we adjusted the scores to account for how harshly or leniently each worker scored the sentences on average. We did this by normalizing each score by the absolute value of the difference between the worker’s mean score and the average mean score of all workers for each criterion. We only normalized scores of workers who annotated more than 10 pairs in order to ensure reliable worker means. We then averaged the two adjusted sets of scores for each pair to get a single set of scores.
4 Dataset Analysis
Table 1 shows means and standard deviations for the three criteria. The means are similar across the three, though interestingness has the lowest, which aligns with our expectations of the ROC stories. For measuring inter-annotator agreement, we consider the mean absolute difference (MAD) of the two judgments for each pair.[3] Table 1 shows the MADs for each criterion and the corresponding standard deviations (SDAD). Overall quality and interestingness showed slightly lower MADs than relevance, though all three criteria are similar.
The average scores for each data source are shown in Table 2. The ranking of the systems is

[2] In a preliminary study, we experimented with asking for each criterion separately to avoid accidental correlation of the criteria, but found that it greatly reduced cumulative cognitive load for each annotator to do all three together.
[3] Cohen’s Kappa is not appropriate for our data because, while we obtained two annotations for each pair, they were not always from the same pair of annotators. In this case, an annotator-agnostic metric like MAD (and its associated standard deviation) is a better measure of agreement.
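
The MAD and SDAD agreement measures can be illustrated with a short Python sketch that works on the two judgments collected per pair for one criterion. The function name `mad_sdad` and the example scores are illustrative only, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def mad_sdad(first_scores, second_scores):
    """Return (MAD, SDAD) for two lists of judgments over the same pairs."""
    if len(first_scores) != len(second_scores):
        raise ValueError("each pair needs exactly two judgments")
    # Absolute difference between the two judgments of every pair.
    abs_diffs = [abs(a - b) for a, b in zip(first_scores, second_scores)]
    # Assumption: population standard deviation (pstdev), not the sample one.
    return mean(abs_diffs), pstdev(abs_diffs)


# Hypothetical relevance scores given to three pairs by their two annotators.
mad, sdad = mad_sdad([1, 5, 10], [3, 5, 6])
print(f"MAD={mad:.2f} SDAD={sdad:.2f}")
```

Because only the two scores attached to each pair are compared, the computation does not require the same two annotators to have judged every pair.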
System | # | O | R | I
SEQ2SEQ-GREEDY | 596 | 4.18 | 4.09 | 3.81
SEQ2SEQ-DIV | 584 | 3.36 | 3.50 | 3.00
SEQ2SEQ-SAMPLE | 578 | 3.69 | 3.70 | 3.42
SEQ2SEQ-REVERSE | 577 | 4.61 | 4.39 | 4.02
RETRIEVAL | 1086 | 5.68 | 4.93 | 5.15
HUMAN | 1165 | 7.22 | 8.05 | 6.33
Table 2: Average criteria scores for each system (O = overall, R = relevance, I = interestingness).

consistent across criteria. Human-written continuations are best under all three criteria. The HUMAN relevance average is higher than interestingness. This matches our intuitions about the ROC corpus: the stories were written to capture commonsense knowledge about everyday events rather than to be particularly surprising or interesting stories in their own right. Nonetheless, we do find that the HUMAN continuations have higher interestingness scores than all automatic systems.
The RETRIEVAL system actually outperforms all SEQ2SEQ systems on all criteria, though the gap is smallest on relevance. We found that the SEQ2SEQ systems often produced continuations that fit topically within the world suggested by the context, though they were often generic or merely topically relevant without necessarily moving the story forward. We found S2S-GREEDY produced outputs that were grammatical and relevant but tended to be more mundane whereas S2S-REVERSE tended to produce slightly more interesting outputs that were still grammatical and relevant on average. The sampling and diverse beam search outputs were frequently ungrammatical and therefore suffer under all criteria.
We show sample outputs from the different systems in Table 3. We also show predicted criteria scores from our final automatic scoring model (see Section 6 for details). We show predicted rather than annotated scores here because for a given context, we did not obtain annotations for all continuations for that context. We can see some of the characteristics of the different models and understand how their outputs differ. The RETRIEVAL outputs are sometimes more interesting than the HUMAN outputs, though they often mention new entities that were not contained in the context, or they may be merely topically related to the context without necessarily resulting in a coherent story. This affects interestingness as well, as a continuation must first be relevant in order to be interesting.
4.1 Relationships Among Criteria
Table 4 shows correlations among the criteria for different sets of outputs. RETRIEVAL outputs show a lower correlation between overall score and interestingness than HUMAN outputs. This is likely because the RETRIEVAL outputs with high interestingness scores frequently contained more surprising content such as new character names or new actions/events that were not found in the context. Therefore, a high interestingness score was not as strongly correlated with overall quality as with HUMAN outputs, for which interesting continuations were less likely to contain erroneous new material.
HUMAN continuations have a lower correlation between relevance and interestingness than the RETRIEVAL or SEQ2SEQ models. This is likely because nearly all HUMAN outputs are relevant, so their interestingness does not depend on their relevance. For SEQ2SEQ, the continuations can only be interesting if they are first somewhat relevant to the context; nonsensical output was rarely annotated as interesting. Thus the SEQ2SEQ relevance and interestingness scores have a higher correlation than for HUMAN or RETRIEVAL.
The lower rows show correlations for different levels of overall quality. For stories whose overall quality is greater than 7.5, the correlations between the overall score and the other two criteria are higher than when the overall quality is lower. The correlation between relevance and interestingness is not as high (0.34). The stories at this quality level are already at least somewhat relevant and understandable, hence, like HUMAN outputs, the interestingness score is not as dependent on the relevance score. For stories with overall quality below 2.5, the stories are often not understandable, so annotators assigned low scores to all three criteria, leading to higher correlation among them.
4.2 Features
We also analyze our dataset by designing features of the ⟨context, continuation⟩ pair and measuring their correlation with each criterion.
4.2.1 Shallow Features
We consider simple features designed to capture surface-level characteristics of the continuation:
• Length: number of tokens in the continuation.
• Relative length: the length of the continuation divided by the length of the context.
Context 1: Tripp wanted to learn how to put a topspin on his serve . He was a more advanced tennis player . He sought out a tennis pro to help him . He finally perfected his topspin .
System | Continuation | O | R | I
S2S-GREEDY | He won the game . | 4.12 | 4.99 | 3.45
S2S-DIV | Now he had the game. | 4.25 | 5.04 | 3.60
S2S-SAMPLE | Now , he is able to play his . | 4.48 | 4.88 | 3.94
S2S-REVERSE | He took a few minutes . | 4.92 | 5.77 | 4.22
RETRIEVAL | Lyn now has a great backhand . | 6.05 | 7.14 | 5.34
HUMAN | His game improved even more. | 5.05 | 6.16 | 4.36
Context 2: Neil had just entered the country of Oman . He found the desert land to be enchanting . The women dressed beautifully and the men were friendly . Neil felt very comfortable in Oman .
S2S-GREEDY | Neil decided to try the best man. | 5.47 | 6.13 | 4.74
S2S-DIV | They were days and decided | 4.66 | 5.65 | 3.96
S2S-SAMPLE | Neil Neil the trip trip of the trip of the trip | 4.95 | 5.66 | 4.72
S2S-REVERSE | He took a tour of the city. | 3.97 | 4.83 | 3.64
RETRIEVAL | Neil saw that South Koreans were a very kind people ! | 6.26 | 6.94 | 5.66
HUMAN | He wished he could stay forever! | 6.24 | 7.22 | 5.58
Context 3: Ed and Emma were twins and wanted to have matching Halloween costumes . But they couldn ’t agree on a costume ! Ed wanted to be a superhero and Emma wanted to be a mermaid .
S2S-GREEDY | He took out and could make to work . | 4.60 | 5.11 | 4.11
S2S-DIV | So , s ’ and they would learn . | 4.71 | 5.41 | 4.18
S2S-SAMPLE | They decided went their great time and they their family . s house . | 4.86 | 5.50 | 4.58
S2S-REVERSE | They decided to try to their local home . | 4.74 | 5.21 | 4.22
RETRIEVAL | Then their mom offered a solution to please them both . | 5.59 | 6.11 | 5.05
HUMAN | Then their mom said she could make costumes that ’d please them both . | 6.17 | 6.71 | 5.69
Table 3: Sample system outputs for different contexts. Final three columns show predicted scores from our trained scorer (see Section 6 for details).
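
The per-subset Pearson correlations among criteria that Section 4.1 discusses (Table 4) could be computed along the following lines. This is a minimal sketch, not the authors' code: the row layout (a dict with `system`, `O`, `R`, and `I` keys), the helper names, and the example scores are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; undefined (ZeroDivisionError) for constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select(rows, systems=None, low=None, high=None):
    """Keep rows from `systems` whose overall score lies strictly in (low, high)."""
    kept = []
    for row in rows:
        if systems is not None and row["system"] not in systems:
            continue
        if low is not None and not row["O"] > low:
            continue
        if high is not None and not row["O"] < high:
            continue
        kept.append(row)
    return kept


def criteria_correlations(rows):
    """Pairwise Pearson correlations between the O, R and I scores of `rows`."""
    o = [row["O"] for row in rows]
    r = [row["R"] for row in rows]
    i = [row["I"] for row in rows]
    return {
        "Corr(O,R)": pearson(o, r),
        "Corr(O,I)": pearson(o, i),
        "Corr(R,I)": pearson(r, i),
    }


# Hypothetical averaged annotations, for illustration only.
rows = [
    {"system": "HUMAN", "O": 8.0, "R": 9.0, "I": 6.0},
    {"system": "HUMAN", "O": 6.0, "R": 7.5, "I": 6.5},
    {"system": "HUMAN", "O": 9.0, "R": 9.5, "I": 8.0},
    {"system": "RETRIEVAL", "O": 5.0, "R": 4.0, "I": 6.0},
    {"system": "RETRIEVAL", "O": 3.0, "R": 2.0, "I": 3.5},
]
print(criteria_correlations(select(rows, systems={"HUMAN"})))
print(criteria_correlations(select(rows, low=2.5)))
```

Selecting by system corresponds to the upper rows of Table 4, and selecting by a range of overall scores corresponds to the lower rows.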
Subset | Corr(O,R) | Corr(O,I) | Corr(R,I)
HUMAN | 0.70 | 0.63 | 0.44
RETRIEVAL | 0.68 | 0.52 | 0.47
HUMAN + RET. | 0.76 | 0.61 | 0.53
SEQ2SEQ-ALL | 0.72 | 0.70 | 0.59
Overall > 7.5 | 0.46 | 0.47 | 0.34
5 < Overall < 7.5 | 0.44 | 0.31 | 0.24
2.5 < Overall < 5 | 0.38 | 0.35 | 0.38
Overall < 2.5 | 0.41 | 0.41 | 0.38
Overall > 2.5 | 0.76 | 0.69 | 0.59
Table 4: Pearson correlations between criteria for different subsets of the annotated data.

• Language model: perplexity from a 4-gram language model with modified Kneser-Ney smoothing estimated using KenLM (Heafield, 2011) from the Personal Story corpus (Gordon and Swanson, 2009), which includes about 1.6 million personal stories from weblogs.
• IDF: the average of the inverse document frequencies (IDFs) across all tokens in the continuation. The IDFs are computed using Wikipedia sentences as “documents”.
4.2.2 PMI Features
We use features based on pointwise mutual information (PMI) of word pairs in the context and continuation. We take inspiration from methods developed for the Choice of Plausible Alternatives (COPA) task (Roemmele et al., 2011), in which a premise is provided with two alternatives. Gordon et al. (2011) obtained strong results by using PMIs to compute a score that measures the causal relatedness between a premise and its potential alternatives. For a ⟨context, continuation⟩ pair, we compute the following score (Gordon et al., 2011):
\[
s_{\mathrm{pmi}} = \frac{\sum_{u \in \mathrm{context}} \sum_{v \in \mathrm{continuation}} \mathrm{PMI}(u, v)}{N_{\mathrm{context}} \, N_{\mathrm{continuation}}}
\]
where N_context and N_continuation are the numbers of tokens in the context and continuation. We create 6 versions of the above score, combining three window sizes (10, 25, and 50) with both standard PMI and positive PMI (PPMI). To compute PMI/PPMI, we use the Personal Story corpus.[4] For efficiency and robustness, we only compute PMI/PPMI of a word pair if the pair appears more than 10 times in the corpus using the particular window size.
4.2.3 Entity Mention Features
We compute several features to capture how relevant the continuation is to the input. In

[4] We use Wikipedia for IDFs and the Personal Story corpus for PMIs. IDF is a simpler statistic which is presumed to be similar across a range of large corpora for most words; we use Wikipedia because it has broad coverage in terms of vocabulary. PMIs require computing word pair statistics and are therefore expected to be more data-dependent, so we chose the Personal Story corpus due to its effectiveness for related tasks (Gordon et al., 2011).
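
The PMI-based score of Section 4.2.2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the way probabilities are estimated from windowed co-occurrence counts, the choice to let infrequent pairs contribute 0, the function names, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window):
    """Count words, and unordered word pairs at most `window` tokens apart."""
    word_counts, pair_counts = Counter(), Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                pair_counts[frozenset((u, v))] += 1
    return word_counts, pair_counts


def make_pmi(word_counts, pair_counts, min_count=10, positive=True):
    """Build a PMI (or PPMI, if `positive`) lookup function from corpus counts."""
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    def pmi(u, v):
        count = pair_counts.get(frozenset((u, v)), 0)
        # Assumption: unseen pairs and pairs seen at most `min_count` times
        # contribute 0 to the score.
        if count == 0 or count <= min_count:
            return 0.0
        p_uv = count / total_pairs
        p_u = word_counts[u] / total_words
        p_v = word_counts[v] / total_words
        value = math.log(p_uv / (p_u * p_v))
        return max(value, 0.0) if positive else value

    return pmi


def s_pmi(context, continuation, pmi):
    """Sum of PMI over all token pairs, divided by the product of lengths."""
    if not context or not continuation:
        return 0.0
    total = sum(pmi(u, v) for u in context for v in continuation)
    return total / (len(context) * len(continuation))


# Toy corpus standing in for the Personal Story corpus (illustrative only).
corpus = [["he", "played", "tennis"], ["he", "won", "the", "tennis", "game"]]
words, pairs = cooccurrence_counts(corpus, window=10)
ppmi = make_pmi(words, pairs, min_count=0, positive=True)
print(s_pmi(["tennis"], ["game"], ppmi))
```

Calling `cooccurrence_counts` with windows of 10, 25, and 50 and `make_pmi` with `positive` set to `False` or `True` would give the six score variants described in the text.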
Feature | O | R | I
Length | 0.007 | 0.055 | 0.071
Relative length | 0.018 | 0.020 | 0.060
Language model | 0.025 | 0.034 | 0.058
IDF | 0.418 | 0.316 | 0.408
PPMI (w = 10) | 0.265 | 0.321 | 0.224
PPMI (w = 25) | 0.289 | 0.341 | 0.249
PPMI (w = 50) | 0.299 | 0.351 | 0.259
Has old mentions | 0.050 | 0.151 | 0.023
Number of old mentions | 0.057 | 0.146 | 0.049
Has new mentions | -0.048 | -0.115 | -0.026
Number of new mentions | -0.052 | -0.119 | -0.029
Has new names | -0.005 | -0.129 | 0.017
Number of new names | -0.005 | -0.130 | 0.017
Is HUMAN? | 0.56 | 0.62 | 0.50
Is HUMAN ∪ RETRIEVAL? | 0.60 | 0.49 | 0.56
Table 5: Spearman correlations between features and annotations. The final two rows are “oracle” binary features that return 1 for continuations from those sets.

order to compute these features we use the part-of-speech tagging, named entity recognition (NER), and coreference resolution tools in Stanford CoreNLP (Manning et al., 2014):
• Has old mentions: a binary feature that returns 1 if the continuation has “old mentions,” i.e., mentions that are part of a coreference chain that began in the context.
• Number of old mentions: the number of old mentions in the continuation.
• Has new mentions: a binary feature that returns 1 if the continuation has “new mentions,” i.e., mentions that are not part of any coreference chain that began in the context.
• Number of new mentions: the number of new mentions in the continuation.
• Has new names: if the continuation has new mentions, this binary feature returns 1 if any of the new mentions is a name, i.e., if the mention is a person named entity from the NER system.
• Number of new names: the number of new names in the continuation.
4.3 Comparing Features
Table 5 shows Spearman correlations between our features and the criteria.[5] The length features have small positive correlations with all three criteria, showing highest correlation with interestingness. Language model perplexity shows weak correlation for all three measures, with its highest correlation for interestingness. The SEQ2SEQ models output very common words, which lets them have relatively low perplexities even with occasional disfluencies, while the human-written outputs contain more rare words.
The IDF feature shows highest correlation with overall and interestingness, and lower correlation with relevance. This is intuitive since the IDF feature will be largest when many rare words are used, which is expected to correlate with interestingness more than relevance. We suspect IDF correlates so well with overall because SEQ2SEQ models typically generate common words, so this feature may partially separate the SEQ2SEQ from HUMAN/RETRIEVAL.
Unlike IDF, the PPMI scores (with window sizes w shown in parentheses) show highest correlations with relevance. This is intuitive, since PPMI will be highest when topical coherence is present in the discourse. Higher correlations are found when using larger window sizes.[6]
The old mentions features have the highest correlation with relevance, as expected. A continuation that continues coreference chains is more likely to be relevant. The new mention/name features have negative correlations with relevance, which is also intuitive: introducing new characters makes the continuation less relevant.
To explore the question of separability between machine- and human-written continuations, we measured correlations of “oracle” features that simply return 1 if the output was generated by humans and 0 if it was generated by a system. Such features are highly correlated with all three criteria as seen in the final two rows of Table 5. This suggests that human annotators strongly preferred human-generated stories over our models’ outputs. Some features may correlate with the annotated criteria if they separate human- and machine-generated continuations (e.g., IDF).
5 Methods for Score Prediction
We now consider ways to build models to predict our criteria. We define neural networks that take as input representations of the context/continuation pair ⟨b, c⟩ and our features and output a continuous value for each predicted criterion.
We experiment with two ways of representing the input based on the embeddings of b and c,

[5] These use the combined training and validation sets; we describe splitting the data below in Section 6.
[6] We omit full results for brevity, but the PPMI features showed slightly higher correlations than PMI features.
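
Spearman correlation is used both for the feature analysis in Table 5 and as the evaluation metric in Section 6. The sketch below shows one standard way to compute it, as the Pearson correlation of rank-transformed values with tied values sharing their average rank. The helper names and the example numbers are hypothetical, and the text does not say which implementation or tie handling the authors used.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (ZeroDivisionError) for constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical IDF feature values and annotated interestingness scores.
idf_feature = [2.1, 3.4, 1.0, 4.2, 3.4]
interestingness = [4.0, 6.5, 3.0, 7.0, 5.0]
print(round(100 * spearman(idf_feature, interestingness), 1))
```

The final line multiplies by 100 only to match the scale used in the result tables.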
which we denote v_b and v_c respectively. The first (“cont”) uses only the continuation embedding without any representation of the context or the similarity between the context and continuation: x_cont = ⟨v_c⟩. The second (“sim+cont”) also contains the elementwise multiplication of the context and continuation embeddings concatenated with the absolute difference: x_sim+cont = ⟨v_b ⊙ v_c, |v_b − v_c|, v_c⟩.
To compute representations v, we use the average of character n-gram embeddings (Huang et al., 2013; Wieting et al., 2016), fixing the output dimensionality to 300. We found this to outperform other methods. In particular, the next best method used gated recurrent averaging networks (GRANs; Wieting and Gimpel, 2017), followed by long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), and followed finally by word averaging.
The input, whether x_cont or x_sim+cont, is fed to one fully-connected hidden layer with 300 units, followed by a rectified linear unit (ReLU) activation. Our manually computed features (Length, IDF, PMI, and Mention) are concatenated prior to this layer. The output layer follows and uses a linear activation.
We use mean absolute error as our loss function during training. We train to predict the three criteria jointly, so the loss is actually the sum of mean absolute errors over the three criteria. We found this form of multi-task learning to significantly outperform training separate models for each criterion. When tuning, we tune based on the average Spearman correlation across the three criteria on our validation set. We train all models for 25 epochs using Adam (Kingma and Ba, 2014) with a learning rate of 0.001.
6 Experiments
After averaging the two annotator scores to get our dataset of 4586 context/continuation pairs, we split the data randomly into 600 pairs for validation, 600 for testing, and used the rest (3386) for training. For our evaluation metric, we use Spearman correlation between the scorer’s predictions and the annotated scores.
6.1 Feature Ablation
Table 6 shows results as features are either removed from the full set or added to the featureless model, all when using the “cont” input schema. Each row corresponds to one feature ablation or addition, except for the final row which corresponds to adding two feature sets that are efficient to compute: IDF and Length. The Mention and PMI features are the most useful for relevance, which matches the pattern of correlations in Table 5, while IDF and Length features are most helpful for interestingness. All feature sets contribute in predicting overall quality, with the Mention features showing the largest drop in correlation when they are ablated.

Features | O | R | I
All features | 57.3 | 53.4 | 49.6
- PMI | 56.3 | 50.4 | 48.6
- IDF | 56.6 | 53.6 | 46.0
- Mention | 54.8 | 50.3 | 48.6
- Length | 56.1 | 55.9 | 45.3
No features | 51.9 | 44.9 | 43.8
+ PMI | 54.5 | 50.9 | 44.9
+ IDF | 54.3 | 46.7 | 46.3
+ Mention | 53.8 | 48.8 | 46.0
+ Length | 51.9 | 43.1 | 44.9
+ IDF, Length | 54.6 | 46.5 | 47.3
Table 6: Ablation experiments with several feature sets (Spearman correlations on the validation set).

6.2 Final Results
Table 7 shows our final results on the validation and test sets. The highest correlations on the test set are achieved by using the sim+cont model with all features. While interestingness can be predicted reasonably well with just IDF and the Length features, the prediction of relevance is improved greatly with the full feature set.

model | features | validation O | validation R | validation I | test O | test R | test I
cont | none | 51.9 | 44.9 | 43.8 | 53.3 | 46.0 | 50.5
cont | IDF, Len. | 54.6 | 46.5 | 47.3 | 51.6 | 40.6 | 50.2
cont | all | 57.3 | 53.4 | 49.6 | 57.1 | 54.3 | 52.8
sim+cont | none | 51.6 | 43.7 | 44.3 | 52.2 | 45.0 | 48.4
sim+cont | IDF, Len. | 54.2 | 45.6 | 47.7 | 56.0 | 46.8 | 53.0
sim+cont | all | 55.1 | 54.8 | 47.4 | 58.7 | 55.8 | 52.9
Table 7: Correlations (Spearman’s ρ × 100) on validation and test sets for best models with three feature sets.

Using our strongest models, we computed the average predicted criterion scores for each story generation system on the test set. Overall, the predicted rankings are strongly correlated with the rankings yielded by the aggregated annotations shown in Table 2, especially in terms of distinguishing human-written and machine-generated continuations.
While the PMI features are very helpful for pre-
