Quality Signals in Generated Stories
• Overall quality (O): a subjective judgment by the annotator of the quality of the continuation, i.e., roughly how much the annotator thinks the continuation adds to the story.

• Relevance (R): a measure of how relevant the continuation is to the context. This addresses the question of whether the continuation fits within the world of the story.

• Interestingness (I): a measure of the amount of new (but still relevant) information added to the story. We use this to measure whether the continuation makes the story more interesting.

Our criteria follow McIntyre and Lapata (2009), who used interestingness and coherence as two quality criteria for story generation. Our notion of relevance is closely related to coherence; when thinking of judging a continuation, we believed that it would be more natural for annotators to judge the relevance of the continuation to its context, rather than judging the coherence of the resulting story. That is, coherence is a property of a discourse, while relevance is a property of a continuation (in relation to the context).

Our overall quality score was intended to capture any remaining factors that determine human quality judgment. In preliminary annotation experiments, we found that the overall score tended to capture a notion of fluency/grammaticality, hence we decided not to annotate this criterion separately. We asked annotators to forgive minor ungrammaticalities in the continuations and rate them as long as they could be understood. If annotators could not understand the continuation, we asked them to assign a score of 0 for all criteria.

We asked the workers to rate the continuations on a scale of 1 to 10, with 10 being the highest score. We obtained annotations from two distinct annotators for each pair and for each criterion, adding up to a total of 4586 × 2 × 3 = 27516 judgments. We asked annotators to annotate all three criteria for a given pair simultaneously in one HIT.2 We required workers to be located in the United States, to have a HIT approval rating greater than 97%, and to have had at least 500 HITs approved. We paid $0.08 per HIT. Since task duration can be difficult to estimate from HIT times (due to workers becoming distracted or working on multiple HITs simultaneously), we report the top 5 modes of the time durations in seconds. For pairs with 3 sentences in the context, the most frequent durations are 11, 15, 14, 17, and 21 seconds. For 4 sentences, the most frequent durations are 18, 20, 19, 21, and 23 seconds.

2 In a preliminary study, we experimented with asking for each criterion separately to avoid accidental correlation of the criteria, but found that annotating all three together greatly reduced the cumulative cognitive load for each annotator.

We required each worker to annotate no more than 150 continuations so as not to bias the collected data. After collecting all annotations, we adjusted the scores to account for how harshly or leniently each worker scored the sentences on average. We did this by normalizing each score by the absolute value of the difference between the worker's mean score and the average mean score of all workers for each criterion. We only normalized scores of workers who annotated more than 10 pairs in order to ensure reliable worker means. We then averaged the two adjusted sets of scores for each pair to get a single set of scores.
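The arithmetic of this per-worker adjustment is only loosely specified above, so the following is a minimal sketch of one plausible reading (shifting each prolific worker's scores by the gap between their mean and the grand mean, then averaging the two judgments per pair). All function and variable names are illustrative, not taken from the released code.

```python
from collections import defaultdict

def adjust_and_average(judgments, min_pairs=10):
    """judgments: list of (pair_id, worker_id, score) tuples for one criterion.
    Returns {pair_id: single score averaged over its two adjusted judgments}.
    Mean-centering toward the grand mean is one plausible reading of the
    adjustment described above, not necessarily the exact procedure used."""
    by_worker = defaultdict(list)
    for _, worker, score in judgments:
        by_worker[worker].append(score)

    worker_means = {w: sum(s) / len(s) for w, s in by_worker.items()}
    grand_mean = sum(worker_means.values()) / len(worker_means)

    adjusted = defaultdict(list)
    for pair, worker, score in judgments:
        # Only adjust workers with enough annotations for a reliable mean.
        if len(by_worker[worker]) > min_pairs:
            score -= worker_means[worker] - grand_mean
        adjusted[pair].append(score)

    # Average the two (adjusted) judgments per pair.
    return {pair: sum(s) / len(s) for pair, s in adjusted.items()}
```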
4 Dataset Analysis

Table 1: Means and standard deviations for each criterion, as well as inter-annotator (IA) mean absolute differences (MAD) and standard deviations of absolute differences (SDAD).

Table 1 shows means and standard deviations for the three criteria. The means are similar across the three, though interestingness has the lowest, which aligns with our expectations of the ROC stories. For measuring inter-annotator agreement, we consider the mean absolute difference (MAD) of the two judgments for each pair.3 Table 1 shows the MADs for each criterion and the corresponding standard deviations (SDAD). Overall quality and interestingness showed slightly lower MADs than relevance, though all three criteria are similar.

3 Cohen's Kappa is not appropriate for our data because, while we obtained two annotations for each pair, they were not always from the same pair of annotators. In this case, an annotator-agnostic metric like MAD (and its associated standard deviation) is a better measure of agreement.
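For concreteness, MAD and SDAD for a single criterion are simple statistics over the paired judgments; a minimal NumPy sketch (array names are illustrative):

```python
import numpy as np

def agreement_stats(scores_a, scores_b):
    """scores_a, scores_b: the two judgments for each pair, in the same
    pair order, for a single criterion."""
    abs_diff = np.abs(np.asarray(scores_a) - np.asarray(scores_b))
    mad = abs_diff.mean()        # inter-annotator mean absolute difference
    sdad = abs_diff.std(ddof=1)  # standard deviation of the absolute differences
    return mad, sdad
```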
The average scores for each data source are shown in Table 2. The ranking of the systems is consistent across criteria. Human-written continuations are best under all three criteria. The HUMAN relevance average is higher than interestingness. This matches our intuitions about the ROC corpus: the stories were written to capture commonsense knowledge about everyday events rather than to be particularly surprising or interesting stories in their own right. Nonetheless, we do find that the HUMAN continuations have higher interestingness scores than all automatic systems.

System            #     O     R     I
SEQ2SEQ-GREEDY    596   4.18  4.09  3.81
SEQ2SEQ-DIV       584   3.36  3.50  3.00
SEQ2SEQ-SAMPLE    578   3.69  3.70  3.42
SEQ2SEQ-REVERSE   577   4.61  4.39  4.02
RETRIEVAL         1086  5.68  4.93  5.15
HUMAN             1165  7.22  8.05  6.33

Table 2: Average criteria scores for each system (O = overall, R = relevance, I = interestingness).

The RETRIEVAL system actually outperforms all SEQ2SEQ systems on all criteria, though the gap is smallest on relevance. We found that the SEQ2SEQ systems often produced continuations that fit topically within the world suggested by the context, though they were often generic or merely topically relevant without necessarily moving the story forward. We found S2S-GREEDY produced outputs that were grammatical and relevant but tended to be more mundane, whereas S2S-REVERSE tended to produce slightly more interesting outputs that were still grammatical and relevant on average. The sampling and diverse beam search outputs were frequently ungrammatical and therefore suffer under all criteria.

We show sample outputs from the different systems in Table 3. We also show predicted criteria scores from our final automatic scoring model (see Section 6 for details). We show predicted rather than annotated scores here because, for a given context, we did not obtain annotations for all continuations of that context. We can see some of the characteristics of the different models and understand how their outputs differ. The RETRIEVAL outputs are sometimes more interesting than the HUMAN outputs, though they often mention new entities that were not contained in the context, or they may be merely topically related to the context without necessarily resulting in a coherent story. This affects interestingness as well, as a continuation must first be relevant in order to be interesting.

Context 1: Tripp wanted to learn how to put a topspin on his serve . He was a more advanced tennis player . He sought out a tennis pro to help him . He finally perfected his topspin .

System        Continuation                                      O     R     I
S2S-GREEDY    He won the game .                                 4.12  4.99  3.45
S2S-DIV       Now he had the game.                              4.25  5.04  3.60
S2S-SAMPLE    Now , he is able to play his .                    4.48  4.88  3.94
S2S-REVERSE   He took a few minutes .                           4.92  5.77  4.22
RETRIEVAL     Lyn now has a great backhand .                    6.05  7.14  5.34
HUMAN         His game improved even more.                      5.05  6.16  4.36

Context 2: Neil had just entered the country of Oman . He found the desert land to be enchanting . The women dressed beautifully and the men were friendly . Neil felt very comfortable in Oman .

S2S-GREEDY    Neil decided to try the best man.                        5.47  6.13  4.74
S2S-DIV       They were days and decided                               4.66  5.65  3.96
S2S-SAMPLE    Neil Neil the trip trip of the trip of the trip          4.95  5.66  4.72
S2S-REVERSE   He took a tour of the city.                              3.97  4.83  3.64
RETRIEVAL     Neil saw that South Koreans were a very kind people !    6.26  6.94  5.66
HUMAN         He wished he could stay forever!                         6.24  7.22  5.58

Context 3: Ed and Emma were twins and wanted to have matching Halloween costumes . But they couldn 't agree on a costume ! Ed wanted to be a superhero and Emma wanted to be a mermaid .

S2S-GREEDY    He took out and could make to work .                                   4.60  5.11  4.11
S2S-DIV       So , s ' and they would learn .                                        4.71  5.41  4.18
S2S-SAMPLE    They decided went their great time and they their family . s house .   4.86  5.50  4.58
S2S-REVERSE   They decided to try to their local home .                              4.74  5.21  4.22
RETRIEVAL     Then their mom offered a solution to please them both .                5.59  6.11  5.05
HUMAN         Then their mom said she could make costumes that 'd please them both . 6.17  6.71  5.69

Table 3: Sample system outputs for different contexts. The final three columns show predicted scores from our trained scorer (see Section 6 for details).

4.1 Relationships Among Criteria

Table 4 shows correlations among the criteria for different sets of outputs. RETRIEVAL outputs show a lower correlation between overall score and interestingness than HUMAN outputs. This is likely because the RETRIEVAL outputs with high interestingness scores frequently contained more surprising content, such as new character names or new actions/events that were not found in the context. Therefore, a high interestingness score was not as strongly correlated with overall quality as with HUMAN outputs, for which interesting continuations were less likely to contain erroneous new material.

                    Corr(O,R)  Corr(O,I)  Corr(R,I)
HUMAN               0.70       0.63       0.44
RETRIEVAL           0.68       0.52       0.47
HUMAN + RET.        0.76       0.61       0.53
SEQ2SEQ-ALL         0.72       0.70       0.59
Overall > 7.5       0.46       0.47       0.34
5 < Overall < 7.5   0.44       0.31       0.24
2.5 < Overall < 5   0.38       0.35       0.38
Overall < 2.5       0.41       0.41       0.38
Overall > 2.5       0.76       0.69       0.59

Table 4: Pearson correlations between criteria for different subsets of the annotated data.

HUMAN continuations have a lower correlation between relevance and interestingness than the RETRIEVAL or SEQ2SEQ models. This is likely because nearly all HUMAN outputs are relevant, so their interestingness does not depend on their relevance. For SEQ2SEQ, the continuations can only be interesting if they are first somewhat relevant to the context; nonsensical output was rarely annotated as interesting. Thus the SEQ2SEQ relevance and interestingness scores have a higher correlation than for HUMAN or RETRIEVAL.

The lower rows of Table 4 show correlations for different levels of overall quality. For stories whose overall quality is greater than 7.5, the correlations between the overall score and the other two criteria are higher than when the overall quality is lower. The correlation between relevance and interestingness is not as high (0.34). The stories at this quality level are already at least somewhat relevant and understandable, hence, as with HUMAN outputs, the interestingness score is not as dependent on the relevance score. For stories with overall quality below 2.5, the stories are often not understandable, so annotators assigned low scores to all three criteria, leading to higher correlation among them.
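The entries of Table 4 are ordinary Pearson correlations computed over subsets of the annotated pairs; a sketch using SciPy, where the pandas DataFrame layout is an assumption for illustration:

```python
from scipy.stats import pearsonr

def criteria_correlations(df):
    """df: a pandas DataFrame with one row per annotated pair and columns 'O', 'R', 'I'."""
    return {
        "Corr(O,R)": pearsonr(df["O"], df["R"])[0],
        "Corr(O,I)": pearsonr(df["O"], df["I"])[0],
        "Corr(R,I)": pearsonr(df["R"], df["I"])[0],
    }

# For example, the "Overall > 7.5" row of Table 4 corresponds to
# criteria_correlations(df[df["O"] > 7.5]) on the annotated data.
```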
4.2 Features

We also analyze our dataset by designing features of the ⟨context, continuation⟩ pair and measuring their correlation with each criterion.

4.2.1 Shallow Features

We consider simple features designed to capture surface-level characteristics of the continuation (a sketch of their computation follows the list):

• Length: the number of tokens in the continuation.

• Relative length: the length of the continuation divided by the length of the context.

• Language model: perplexity from a 4-gram language model with modified Kneser-Ney smoothing, estimated using KenLM (Heafield, 2011) from the Personal Story corpus (Gordon and Swanson, 2009), which includes about 1.6 million personal stories from weblogs.

• IDF: the average of the inverse document frequencies (IDFs) across all tokens in the continuation. The IDFs are computed using Wikipedia sentences as "documents".
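A minimal sketch of these shallow features, assuming the KenLM Python bindings for the 4-gram model and a precomputed token-to-IDF table built from Wikipedia sentences; the model path and the default IDF for unseen tokens are illustrative assumptions:

```python
import kenlm  # Python bindings for KenLM (Heafield, 2011)

lm = kenlm.Model("personal_stories.4gram.arpa")  # hypothetical path to the trained model

def shallow_features(context_toks, cont_toks, idf_table, default_idf=0.0):
    """context_toks, cont_toks: token lists; idf_table: {token: idf} computed
    with Wikipedia sentences treated as 'documents'."""
    length = len(cont_toks)
    rel_length = length / max(len(context_toks), 1)
    perplexity = lm.perplexity(" ".join(cont_toks))
    avg_idf = sum(idf_table.get(t, default_idf) for t in cont_toks) / max(length, 1)
    return {"length": length, "rel_length": rel_length,
            "lm_perplexity": perplexity, "avg_idf": avg_idf}
```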
4.2.2 PMI Features

We use features based on pointwise mutual information (PMI) of word pairs in the context and continuation. We take inspiration from methods developed for the Choice of Plausible Alternatives (COPA) task (Roemmele et al., 2011), in which a premise is provided with two alternatives. Gordon et al. (2011) obtained strong results by using PMIs to compute a score that measures the causal relatedness between a premise and its potential alternatives. For a ⟨context, continuation⟩ pair, we compute the following score (Gordon et al., 2011):

s_pmi = ( Σ_{u ∈ context} Σ_{v ∈ continuation} PMI(u, v) ) / (N_context · N_continuation)

where N_context and N_continuation are the numbers of tokens in the context and continuation. We create 6 versions of the above score, combining three window sizes (10, 25, and 50) with both standard PMI and positive PMI (PPMI). To compute PMI/PPMI, we use the Personal Story corpus.4 For efficiency and robustness, we only compute the PMI/PPMI of a word pair if the pair appears more than 10 times in the corpus using the particular window size.

4 We use Wikipedia for IDFs and the Personal Story corpus for PMIs. IDF is a simpler statistic which is presumed to be similar across a range of large corpora for most words; we use Wikipedia because it has broad coverage in terms of vocabulary. PMIs require computing word pair statistics and are therefore expected to be more data-dependent, so we chose the Personal Story corpus due to its effectiveness for related tasks (Gordon et al., 2011).
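A direct implementation of s_pmi given a precomputed PMI (or PPMI) lookup; representing the thresholded statistics as a dictionary and treating unlisted pairs as contributing 0 are assumptions of this sketch:

```python
def pmi_score(context_toks, cont_toks, pmi_table):
    """pmi_table: {(u, v): PMI value} for word pairs occurring more than 10
    times in the Personal Story corpus under a given window size."""
    total = 0.0
    for u in context_toks:
        for v in cont_toks:
            total += pmi_table.get((u, v), 0.0)  # below-threshold pairs contribute 0
    return total / (max(len(context_toks), 1) * max(len(cont_toks), 1))

# Six feature values per pair: window sizes {10, 25, 50} x {PMI, PPMI} tables.
```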
4.2.3 Entity Mention Features

We compute several features to capture how relevant the continuation is to the input. In order to compute these features we use the part-of-speech tagging, named entity recognition (NER), and coreference resolution tools in Stanford CoreNLP (Manning et al., 2014); a sketch follows the list:

• Has old mentions: a binary feature that returns 1 if the continuation has "old mentions," i.e., mentions that are part of a coreference chain that began in the context.

• Number of old mentions: the number of old mentions in the continuation.

• Has new mentions: a binary feature that returns 1 if the continuation has "new mentions," i.e., mentions that are not part of any coreference chain that began in the context.

• Number of new mentions: the number of new mentions in the continuation.

• Has new names: if the continuation has new mentions, this binary feature returns 1 if any of the new mentions is a name, i.e., if the mention is a person named entity from the NER system.

• Number of new names: the number of new names in the continuation.
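A sketch of the six mention features, assuming the CoreNLP coreference output has already been parsed into chains of mentions, with each mention recording its sentence index and whether NER tagged it as a PERSON, and with singleton mentions represented as length-one chains; this data layout is an assumption of the sketch, not CoreNLP's native output format:

```python
def mention_features(chains, num_context_sents):
    """chains: list of coreference chains, each a list of mentions like
    {"sent": int, "is_person_name": bool}. Sentences 0..num_context_sents-1
    form the context; the final sentence is the continuation."""
    old_mentions = new_mentions = new_names = 0
    for chain in chains:
        starts_in_context = any(m["sent"] < num_context_sents for m in chain)
        for m in chain:
            if m["sent"] < num_context_sents:
                continue  # only count mentions that appear in the continuation
            if starts_in_context:
                old_mentions += 1   # chain began in the context
            else:
                new_mentions += 1   # chain did not begin in the context
                if m["is_person_name"]:
                    new_names += 1
    return {"has_old": int(old_mentions > 0), "num_old": old_mentions,
            "has_new": int(new_mentions > 0), "num_new": new_mentions,
            "has_new_names": int(new_names > 0), "num_new_names": new_names}
```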
4.3 Comparing Features

Table 5 shows Spearman correlations between our features and the criteria.5 The length features have small positive correlations with all three criteria, showing the highest correlation with interestingness. Language model perplexity shows weak correlation for all three measures, with its highest correlation for interestingness. The SEQ2SEQ models output very common words, which lets them have relatively low perplexities even with occasional disfluencies, while the human-written outputs contain more rare words.

5 These use the combined training and validation sets; we describe splitting the data below in Section 6.

Feature                  O       R       I
Length                   0.007   0.055   0.071
Relative length          0.018   0.020   0.060
Language model           0.025   0.034   0.058
IDF                      0.418   0.316   0.408
PPMI (w = 10)            0.265   0.321   0.224
PPMI (w = 25)            0.289   0.341   0.249
PPMI (w = 50)            0.299   0.351   0.259
Has old mentions         0.050   0.151   0.023
Number of old mentions   0.057   0.146   0.049
Has new mentions         -0.048  -0.115  -0.026
Number of new mentions   -0.052  -0.119  -0.029
Has new names            -0.005  -0.129  0.017
Number of new names      -0.005  -0.130  0.017
Is HUMAN?                0.56    0.62    0.50
Is HUMAN ∪ RETRIEVAL?    0.60    0.49    0.56

Table 5: Spearman correlations between features and annotations. The final two rows are "oracle" binary features that return 1 for continuations from those sets.

The IDF feature shows the highest correlation with overall and interestingness, and a lower correlation with relevance. This is intuitive since the IDF feature will be largest when many rare words are used, which is expected to correlate with interestingness more than relevance. We suspect IDF correlates so well with overall quality because SEQ2SEQ models typically generate common words, so this feature may partially separate the SEQ2SEQ outputs from HUMAN/RETRIEVAL.

Unlike IDF, the PPMI scores (with window sizes w shown in parentheses) show the highest correlations with relevance. This is intuitive, since PPMI will be highest when topical coherence is present in the discourse. Higher correlations are found when using larger window sizes.6

6 We omit full results for brevity, but the PPMI features showed slightly higher correlations than PMI features.

The old mentions features have the highest correlation with relevance, as expected. A continuation that continues coreference chains is more likely to be relevant. The new mention/name features have negative correlations with relevance, which is also intuitive: introducing new characters makes the continuation less relevant.

To explore the question of separability between machine- and human-written continuations, we measured correlations of "oracle" features that simply return 1 if the output was generated by humans and 0 if it was generated by a system. Such features are highly correlated with all three criteria, as seen in the final two rows of Table 5. This suggests that human annotators strongly preferred human-generated stories over our models' outputs. Some features may correlate with the annotated criteria simply because they separate human- and machine-generated continuations (e.g., IDF).

5 Methods for Score Prediction

We now consider ways to build models to predict our criteria. We define neural networks that take as input representations of the context/continuation pair ⟨b, c⟩ and our features, and output a continuous value for each predicted criterion.

We experiment with two ways of representing the input based on the embeddings of b and c, which we denote v_b and v_c respectively. The first ("cont") uses only the continuation embedding, without any representation of the context or of the similarity between the context and continuation: x_cont = ⟨v_c⟩. The second ("sim+cont") also contains the elementwise multiplication of the context and continuation embeddings, concatenated with their absolute difference: x_sim+cont = ⟨v_b ⊙ v_c, |v_b − v_c|, v_c⟩.
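Both encodings are simple vector operations over the two sentence embeddings; a NumPy sketch (v_b and v_c are the 300-dimensional context and continuation embeddings):

```python
import numpy as np

def make_input(v_b, v_c, variant="sim+cont"):
    """Build the model input from the context embedding v_b and the
    continuation embedding v_c."""
    if variant == "cont":
        return v_c  # x_cont = <v_c>
    # x_sim+cont = <v_b * v_c, |v_b - v_c|, v_c>, i.e., 3 x 300 = 900 dimensions
    return np.concatenate([v_b * v_c, np.abs(v_b - v_c), v_c])
```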
To compute the representations v, we use the average of character n-gram embeddings (Huang et al., 2013; Wieting et al., 2016), fixing the output dimensionality to 300. We found this to outperform other methods. In particular, the next best method used gated recurrent averaging networks (GRANs; Wieting and Gimpel, 2017), followed by long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), and followed finally by word averaging.

The input, whether x_cont or x_sim+cont, is fed to one fully-connected hidden layer with 300 units, followed by a rectified linear unit (ReLU) activation. Our manually computed features (Length, IDF, PMI, and Mention) are concatenated prior to this layer. The output layer follows and uses a linear activation.

We use mean absolute error as our loss function during training. We train to predict the three criteria jointly, so the loss is actually the sum of mean absolute errors over the three criteria. We found this form of multi-task learning to significantly outperform training separate models for each criterion. When tuning, we tune based on the average Spearman correlation across the three criteria on our validation set. We train all models for 25 epochs using Adam (Kingma and Ba, 2014) with a learning rate of 0.001.
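A minimal PyTorch sketch of this architecture and training setup: the sentence representation and the hand-crafted features are concatenated, passed through a single 300-unit ReLU layer, and mapped linearly to the three criteria, which are trained jointly with a summed mean-absolute-error loss under Adam. Anything not stated in the text (batching, feature dimensionality) is a placeholder.

```python
import torch
import torch.nn as nn

class ContinuationScorer(nn.Module):
    def __init__(self, repr_dim, feat_dim):
        super().__init__()
        self.hidden = nn.Linear(repr_dim + feat_dim, 300)  # one fully-connected layer, 300 units
        self.out = nn.Linear(300, 3)                       # O, R, I with a linear activation

    def forward(self, x_repr, x_feats):
        h = torch.relu(self.hidden(torch.cat([x_repr, x_feats], dim=-1)))
        return self.out(h)

def train(model, loader, epochs=25, lr=0.001):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x_repr, x_feats, targets in loader:  # targets: (batch, 3) gold scores
            pred = model(x_repr, x_feats)
            # sum over the three criteria of the per-criterion mean absolute error
            loss = (pred - targets).abs().mean(dim=0).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
```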
6 Experiments

After averaging the two annotator scores to get our dataset of 4586 context/continuation pairs, we split the data randomly into 600 pairs for validation, 600 for testing, and used the rest (3386) for training. For our evaluation metric, we use the Spearman correlation between the scorer's predictions and the annotated scores.
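The split and the metric are straightforward to reproduce; a sketch using NumPy and SciPy (the random seed is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import spearmanr

def split_pairs(num_pairs=4586, seed=0):
    """Randomly split the pairs into 600 validation, 600 test, and the rest (3386) for training."""
    idx = np.random.RandomState(seed).permutation(num_pairs)
    return idx[1200:], idx[:600], idx[600:1200]  # train, validation, test indices

def evaluate(predicted, annotated):
    """Spearman correlation between predicted and annotated scores for one criterion."""
    return spearmanr(predicted, annotated).correlation
```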
6.1 Feature Ablation

Table 6 shows results as features are either removed from the full set or added to the featureless model, all when using the "cont" input schema. Each row corresponds to one feature ablation or addition, except for the final row, which corresponds to adding the two feature sets that are efficient to compute: IDF and Length. The Mention and PMI features are the most useful for relevance, which matches the pattern of correlations in Table 5, while the IDF and Length features are most helpful for interestingness. All feature sets contribute to predicting overall quality, with the Mention features showing the largest drop in correlation when they are ablated.

                 O     R     I
All features     57.3  53.4  49.6
 - PMI           56.3  50.4  48.6
 - IDF           56.6  53.6  46.0
 - Mention       54.8  50.3  48.6
 - Length        56.1  55.9  45.3
No features      51.9  44.9  43.8
 + PMI           54.5  50.9  44.9
 + IDF           54.3  46.7  46.3
 + Mention       53.8  48.8  46.0
 + Length        51.9  43.1  44.9
 + IDF, Length   54.6  46.5  47.3

Table 6: Ablation experiments with several feature sets (Spearman correlations on the validation set).

6.2 Final Results

Table 7 shows our final results on the validation and test sets. The highest correlations on the test set are achieved by using the sim+cont model with all features. While interestingness can be predicted reasonably well with just the IDF and Length features, the prediction of relevance is improved greatly with the full feature set.

                         validation          test
model      features      O     R     I      O     R     I
cont       none          51.9  44.9  43.8   53.3  46.0  50.5
cont       IDF, Len.     54.6  46.5  47.3   51.6  40.6  50.2
cont       all           57.3  53.4  49.6   57.1  54.3  52.8
sim+cont   none          51.6  43.7  44.3   52.2  45.0  48.4
sim+cont   IDF, Len.     54.2  45.6  47.7   56.0  46.8  53.0
sim+cont   all           55.1  54.8  47.4   58.7  55.8  52.9

Table 7: Correlations (Spearman's ρ × 100) on validation and test sets for the best models with three feature sets.

Using our strongest models, we computed the average predicted criterion scores for each story generation system on the test set. Overall, the predicted rankings are strongly correlated with the rankings yielded by the aggregated annotations shown in Table 2, especially in terms of distinguishing human-written and machine-generated continuations.

While the PMI features are very helpful for predicting relevance, they do have demanding space requirements due to the sheer number of word pairs with nonzero counts in large corpora. We attempted to replace the PMI features with similar features based on word embedding similarity, following the argument that skip-gram embeddings with negative sampling form an approximate factorization of a PMI score matrix (Levy and Goldberg, 2014). However, we were unable to match the same performance by doing so; the PMI scores were still superior.

For the automatic scores shown in Table 3, we used the sim+cont model with IDF and Length features. Since this model does not require PMIs or NLP analyzers, it is likely to be the one used most in practice by other researchers within training/tuning settings. We release this trained scorer as well as our annotated data to the research community.
7 Conclusion

We conducted a manual evaluation of neural sequence-to-sequence and retrieval-based story continuation systems along three criteria: overall quality, relevance, and interestingness. We analyzed the annotations and identified features that correlate with each criterion. These annotations also provide a new story understanding task: predicting the quality scores of generated continuations. We took initial steps toward solving this task by developing an automatic scorer that uses features, compositional architectures, and multi-task training. Our trained continuation scorer can be used as a rich feature function for story generation or a reward function for systems that use reinforcement learning to learn to generate stories. The annotated data and trained scorer are available at the authors' websites.

Acknowledgments

We thank Melissa Roemmele and the anonymous reviewers for their comments. We also thank NVIDIA Corporation for donating GPUs used in this research.
References

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Angel Daza, Hiram Calvo, and Jesús Figueroa-Nazuno. 2016. Automatic text generation by learning from literary structures. In Proceedings of the Fifth Workshop on Computational Linguistics for Literature, pages 9–19.

Natalie Dehn. 1981. Story generation after TALE-SPIN. In Proceedings of the 9th International Joint Conference on Artificial Intelligence (IJCAI), pages 16–18.

David K. Elson. 2012. Modeling Narrative Discourse. Ph.D. thesis, Columbia University.

Albert Gatt and Emiel Krahmer. 2017. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. arXiv preprint arXiv:1703.09902.

Pablo Gervás. 2012. Story generator algorithms. The Living Handbook of Narratology 19.

Andrew S. Gordon, Cosmin Adrian Bejan, and Kenji Sagae. 2011. Commonsense causal reasoning using millions of personal stories. In 25th Conference on Artificial Intelligence (AAAI-11).

Andrew S. Gordon and Reid Swanson. 2009. Identifying personal stories in millions of weblog entries. In Third International Conference on Weblogs and Social Media, Data Challenge Workshop.

Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8).

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338.

Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239.

Parag Jain, Priyanka Agrawal, Abhijit Mishra, Mohak Sukhwani, Anirban Laha, and Karthik Sankaranarayanan. 2017. Story generation from sequence of independent short descriptions. In Workshop on Machine Learning for Creativity, at the 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017).

Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 329–339.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Sheldon Klein, John F. Aeschlimann, David F. Balsiger, Steven L. Converse, Claudine Court, Mark Foster, Robin Lao, John D. Oakley, and Joel Smith. 1973. Automatic novel writing: A status report. Technical Report 186, University of Wisconsin-Madison.

Michael Lebowitz. 1985. Story-telling as planning and learning. Poetics 14(6):483–502.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27.

Yuhua Li, David McLean, Zuhair A. Bandar, James D. O'Shea, and Keeley Crockett. 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering 18(8):1138–1150.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Hugo Liu and Push Singh. 2002. MAKEBELIEVE: Using commonsense knowledge to generate stories. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI), pages 957–958.

Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Inderjeet Mani. 2012. Computational modeling of narrative. Synthesis Lectures on Human Language Technologies 5(3):1–142.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55–60.

Neil McIntyre and Mirella Lapata. 2009. Learning to tell tales: A data-driven approach to story generation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 217–225.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1562–1572.

Neil Duncan McIntyre. 2011. Learning to tell tales: automatic story generation from corpora. Ph.D. thesis, The University of Edinburgh.

James R. Meehan. 1977. TALE-SPIN, an interactive program that writes stories. In Proceedings of the 5th International Joint Conference on Artificial Intelligence (IJCAI), pages 91–98.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849.

K. M. Oinonen, Mariet Theune, Antinus Nijholt, and J. R. R. Uijlings. 2006. Designing a story database for use in automatic story generation. Lecture Notes in Computer Science, Springer Verlag, pages 298–301.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Vladimir Propp. 1968. Morphology of the Folktale, volume 9. University of Texas Press.

Mark O. Riedl and Robert Michael Young. 2010. Narrative planning: Balancing plot and character. Journal of Artificial Intelligence Research 39:217–268.

Melissa Roemmele. 2016. Writing stories with help from recurrent neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, pages 4311–4312.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of Plausible Alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University.

Melissa Roemmele and Andrew S. Gordon. 2015. Creative help: A story writing assistant. In International Conference on Interactive Digital Storytelling, Springer, pages 81–92.

Melissa Roemmele, Andrew S. Gordon, and Reid Swanson. 2017. Evaluating story generation systems using automated linguistic analyses. In Workshop on Machine Learning for Creativity, at the 23rd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017).

Robert Wilensky. 1983. Story grammars versus story points. Behavioral and Brain Sciences 6(4):579–591.

Patrick Henry Winston. 2014. The Genesis story understanding and story telling system: A 21st century step toward artificial intelligence. Memo 019, Center for Brains Minds and Machines, MIT.