STORYEVAL: An Empirical Evaluation Framework for Narrative Generation

Copyright © 2009, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Research in intelligent narrative technologies has recently experienced a significant resurgence. As the field matures, devising principled evaluation methodologies will become increasingly important to ensure continued progress. Because of the complexities of narrative phenomena, as well as the inherent subjectivity of narrative experiences, effectively evaluating intelligent narrative technologies poses significant challenges. In this paper, we present STORYEVAL, an evaluation framework for empirically studying computational models of narrative generation. Drawing on evaluation methodologies from cognitive science, human-computer interaction, and natural language processing, as well as techniques that have begun to emerge in the narrative technologies community, STORYEVAL consists of four complementary tools for evaluating both interactive and non-interactive narrative generation: Narrative Metrics, Cognitive-Affective Studies, Director-centric Studies, and Extrinsic Narrative Evaluations. We discuss the benefits and limitations of each family of techniques and illustrate their application with example narrative generators drawn from the field.

Introduction

Recent years have seen significant growth in research on intelligent narrative technologies. Much of this work has focused on narrative generation (Turner 1994; Riedl and Young 2005), and work on non-interactive narrative generators has sought to capture an array of complex narrative phenomena, ranging from character intentionality (Riedl and Young 2005) to models of the creative process (Turner 1994). Interactive narrative systems have used similar generative techniques to balance well-formed story experiences with significant player agency. Their capacity to dynamically construct and revise story plans in response to users' actions has shown promise for applications in education (Zoll et al. 2006; Mott and Lester 2006), training (Si, Marsella, and Pynadath 2005), entertainment (Cavazza, Charles, and Mead 2002; Riedl, Saretto, and Young 2003; Magerko 2007), and art (Mateas and Stern 2005). These systems have leveraged a variety of computational approaches to narrative generation and drama management, including adversarial search (Weyhrauch 1997), planning (Cavazza, Charles, and Mead 2002; Riedl, Saretto, and Young 2003), decision-theoretic approaches (Si, Marsella, and Pynadath 2005; Mott and Lester 2006), and Markov decision processes (Nelson et al. 2006; Roberts et al. 2006).

Interactive narrative systems have won accolades for novel story generation technologies (Mateas and Stern 2005), they have been embraced by international audiences of hundreds of thousands of users (Johnson and Beal 2005; Mateas and Stern 2005; Zoll et al. 2006), and they have been used as effective educational tools across many domains (Johnson and Beal 2005; McQuiggan et al. 2008). Despite these successes, progress in the field has proven difficult to measure for several reasons. First, narrative experiences are intrinsically subjective, making critical assessment notoriously unreliable. Second, narratives are typically authored for specific domains and content, which makes it difficult to systematically compare alternative computational models and the systems that embody them. Third, narratives are enormously complex, multi-dimensional constructs. Despite thousands of years of storytelling, refinement, and analysis, there does not (and perhaps cannot) exist any canonical theory of narrative. For scientists and engineers seeking to advance the field's state of the art, these factors pose significant challenges.

In this paper, we address these issues by proposing an empirical evaluation framework for studying computational models of narrative generation. Our hope is that by highlighting the most effective approaches for assessing narrative generation, we can better explore the role of empirical evaluation in intelligent narrative technologies. Further, by analyzing the techniques used to evaluate narrative generators, we can obtain a clearer view of the community's progress, the methods necessary for measuring it, and insights into the most promising approaches for accelerating advancements in the field.

We begin with a brief overview of narrative and the challenges inherent in evaluating narrative generators. We then present STORYEVAL, an empirical narrative generation evaluation framework that draws on methodologies from cognitive science, human-computer interaction, and natural language processing, as well as from the narrative technologies community itself. We describe each of STORYEVAL's four components (Narrative Metrics, Cognitive-Affective Studies, Director-centric Studies, and Extrinsic Narrative Evaluations), discuss their benefits and limitations, and suggest how they might be applied to specific narrative generators.

Characteristics of Intelligent Narrative Technologies

Intelligent narrative technologies encompass a broad range of techniques for story understanding, generation, interaction, and authoring. In this paper, we focus primarily on narrative generators. Following an overview of narrative and the philosophical perspectives that have informed work to date on narrative generation, we discuss the challenges of narrative technology evaluation.

Narrative Generation

Although narrative is generally defined as the representation of one or more events, such a simple definition fails to convey the complexity that characterizes narrative phenomena. For example, narrative events often have intricate temporal and causal relationships, maintain one or more continuant subjects, and constitute a well-formed whole (Prince 2003). Critical analysis must also consider issues of dramatic tension, plot structure, and character. Narratologists distinguish three components of narrative: fabula, sjuzet, and medium. The fabula is the story, consisting of the "set of narrated situations and events in their chronological sequence" (Prince 2003). The sjuzet is the discourse, the "set of narrated situations and events in the order of their presentation to the receiver" (Prince 2003). The medium is the delivery mechanism used to present the discourse to an audience, such as text, oral storytelling, animation, or film. Although an author may not consciously consider narrative in these terms while crafting a story, every complete narrative implements the components in some form.

Such reductionist approaches have had a significant influence on work in intelligent narrative technologies. Different projects have focused on different components of narrative, each accounting for specific sets of goals, requirements, and constraints. Further, as researchers have begun to build systems that incorporate interactivity into narrative, system goals and priorities have experienced a corresponding shift (Riedl, Saretto, and Young 2003). Introducing interactivity affects issues of fabula, sjuzet, and medium. It places additional demands on character, and it carries important implications for constructing dramatic and well-structured stories. Consequently, interactive narratives and non-interactive narratives pose their own distinct evaluation challenges.

Further complicating matters, intelligent narrative technologies have adopted several different philosophical approaches to constructing narrative. For example, researchers have distinguished between story-centric, author-centric, and world-centric approaches to narrative generation (Bailey 1999). Story-centric approaches view narrative as an abstract artifact with intrinsic characteristics that guide generation, an approach perhaps best exemplified by story grammars. Author-centric narrative generation attempts to explicitly model human authors' story creation processes. World-centric approaches populate story worlds with autonomous characters and then allow narrative to emerge from character interactions (Bailey 1999). Each approach has its own strengths and weaknesses, and its own standards for assessment.

This complex landscape of alternative models and modalities calls for an amalgamation of complementary techniques for effectively assessing intelligent narrative technologies. Unfortunately, traditional human-computer interaction approaches are often ill suited to evaluating narrative phenomena. In the next section, we discuss why this is the case and introduce techniques that are effective for narrative evaluation.

Evaluation Challenges

Evaluating narrative generators differs substantially from evaluating traditional software and AI systems. Classical AI models (e.g., theorem provers, planners) often use objective measures such as soundness, completeness, and optimality to assess a model's performance on a given task (Russell and Norvig 2003). Although narrative generators have a specific task, namely to construct a story, the subjectivity and complexity inherent in narrative, as well as the sheer space of possible narratives, render analyses of optimality and completeness difficult, if not impossible. Evaluation methodologies must instead consider which components (fabula, sjuzet, or medium) the generator targets, how dependent the resulting narrative is on the hand-authored specifications provided to the generator and how sensitive it is to them, what measures of "goodness" are appropriate for the stories generated, and for what aim or purpose the generated stories were created. Computational properties such as time and space complexity, robustness, and the space of possible stories are also important for assessing a generator, but these issues are usually overshadowed by concerns about the internal structure and surface presentation of the generated narratives.

Assessing interactive narrative generators also differs from more traditional software evaluation (e.g., database systems, word processors, e-mail clients). Several of the prominent assessment techniques used by the human-computer interaction (HCI) community employ analyses such as cognitive walkthroughs, heuristic evaluations, and model-based techniques (Dix et al. 2004). These approaches must make certain assumptions about the software being evaluated: the software enables the completion of some well-defined task(s); its goals include ease of use, efficiency, and learnability; it behaves deterministically in response to user input; and existing cognitive models accurately reflect mental processing during user interaction. While these assumptions are appropriate for a wide range of systems, many of them break down when applied to narrative technologies. Interactive narratives often exhibit mixed-initiative, stochastic behavior; they may seek to intentionally prolong or frustrate a user for narrative effect; and emotion often influences user behavior as much as cognition does, something GOMS and keystroke-level models do not account for. Further, factors such as character believability, plot coherence, dramatic tension, and narrative structure, all irrelevant to traditional software systems, are central to any heuristic analysis of a generated narrative. For these reasons, purely analytical approaches are often of limited value for assessing intelligent narrative technologies.

Another major approach to evaluation used by the HCI community is the user participant study. Currently, empirical approaches offer more promise for assessing narrative generators than purely analytical techniques. Unfortunately, human participant studies can be expensive. They also raise many practical issues regarding choice of participants, experimental design, logistics of laboratory and field studies, and statistical analysis of results. Nevertheless, empirical evaluation addresses many of the shortcomings associated with analytical approaches, so it is a widely used approach for assessing intelligent narrative technologies.

The issues that distinguish the evaluation of intelligent narrative technologies from other types of evaluation are reminiscent of those encountered by other AI sub-disciplines. Natural language processing is one example. Language is inextricably tied to narrative: it shares narrative's complex and multi-faceted nature, and its assessment is often subjective. These properties exacerbate the problems of evaluation. Work on embodied conversational agents (ECAs) has also raised challenging evaluation issues. The complexity of evaluating ECAs' natural language and dialogue behavior, as well as their capacity for expressive multimodal communication, complicates assessment. Fortunately, both fields have made significant progress in developing principled evaluation methodologies (Walker et al. 1997; Cassell et al. 2000; Belz and Reiter 2006), a cause for optimism for narrative generation evaluation.

An Empirical Evaluation Framework

Because narrative generators are complex systems, multiple methodologies must be employed to successfully evaluate the full scope of their functionalities and the stories and interactive experiences they create. To this end, we propose STORYEVAL, an empirical evaluation framework for computational models of narrative generation. The STORYEVAL framework consists of four complementary tools for empirically assessing interactive and non-interactive narrative generators:
• Narrative Metrics: By measuring specific characteristics of a generated narrative, narrative metrics can be used to evaluate the product of a narrative generator.
• Cognitive-Affective Studies: By gauging audience response to a narrative experience, cognitive-affective studies assess the impact of a narrative generator through human participant experiments.
• Director-centric Studies: By evaluating the computational performance of a director agent or drama manager, director-centric studies assess the effectiveness of a narrative generator.
• Extrinsic Narrative Evaluations: By assessing the performance of the application in which a narrative generator is embedded, extrinsic narrative evaluations measure the degree to which a narrative generator contributes to the application's overall effectiveness.
STORYEVAL integrates these four broad approaches into a single framework for comprehensively assessing narrative generators. The proposed methodology is neither automated nor algorithmic, but it provides a set of assessment techniques that can be adapted to individual interactive and non-interactive systems. We discuss each of STORYEVAL's four families of evaluation in turn.

Narrative Metrics

Narrative metrics focus assessment on the results produced by narrative generators. Unfortunately, there are no accepted, objective metrics for evaluating narrative artifacts; if there were, film and literary critics would be out of jobs. Instead, narrative metrics can leverage simple heuristics or user participant studies for assessment.

Because of the lack of objective measures, and because there are too many variables associated with a comparison of machine-generated stories and human-generated stories, it can be difficult to measure machine-generated narrative against human standards. Instead, experimental designs can compare machine-generated stories to other machine-generated stories and determine the effects of various architectural components on narrative generation. Factors that have been assessed using this type of analysis include character believability (Riedl and Young 2005) and narrative prose quality (Callaway and Lester 2001).

Narrative metrics can leverage empirically grounded theories from the social sciences. For example, Riedl and Young (2005) use a novel experimental approach that takes advantage of a well-grounded psychological model of story understanding, QUEST (Graesser et al. 1991). The system being evaluated, Fabulist, uses a variant of partial-order causal link planning to produce narratives that account for character intentionality. Riedl and Young conducted a human participant study that compared two versions of the Fabulist system: one uses an advanced planner (IPOCL), and the other uses a traditional partial-order causal link planner. The two conditions compared participant judgments on generated narrative question-answer pairs against assessments provided by the QUEST models. The experiment assessed participants' comprehension of character intentionality and Fabulist's ability to motivate character actions through the IPOCL-enhanced narrative generator. The investigators concluded that the enhanced narrative generator more effectively supports reader comprehension of character intentionality, although the novel and complex evaluation approach introduced some experimental design issues into the assessment.

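To make this style of comparison concrete, the following is a minimal sketch of a QUEST-style analysis under invented data: for each planner condition, participant goodness-of-answer ratings for question-answer pairs are correlated with the ratings a QUEST model of the same story predicts. The condition names, rating values, and the hand-rolled Pearson correlation are illustrative assumptions, not Riedl and Young's actual materials.

```python
# Hypothetical sketch of a QUEST-style analysis: correlate participant
# goodness-of-answer ratings with QUEST-model-predicted ratings, per condition.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length rating sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

# Each pair holds (participant ratings, QUEST-predicted ratings) for a set of
# question-answer pairs; all numbers below are invented placeholders.
conditions = {
    "IPOCL": ([4, 3, 4, 2, 4, 3], [4, 3, 4, 3, 4, 3]),
    "POCL":  ([3, 2, 4, 2, 3, 2], [4, 3, 4, 3, 4, 3]),
}

for name, (participant, quest) in conditions.items():
    print(f"{name}: human-QUEST correlation = {pearson(participant, quest):.2f}")
```

A higher correlation in one condition would suggest that readers' comprehension of that condition's narratives aligns more closely with the psychological model's predictions.
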
Metrics such as style, readability, grammar, and diction can be used to evaluate narratives expressed in natural language. For example, AUTHOR is a narrative generator that combines story generation facilities and deep natural language generation to construct high-quality narrative prose (Callaway and Lester 2001). AUTHOR is composed of five principal components: a discourse history, sentence planner, reviser, lexical choice component, and surface realizer. A human participant experiment was conducted to assess the system's generative performance. The system was provided with two different story plans and was run on each with various architectural components removed. Conditions included no reviser, no lexical choice component, no discourse history, all three components working, and all three components disabled. This resulted in ten generated narratives, which were read and quantitatively graded by a pool of readers on narrative metrics such as style, readability, grammar, and diction. The study led the authors to conclude that the discourse history and revision components were particularly important to resulting narrative quality, with results concerning lexical choice being less conclusive. While this type of study could not compare machine-generated narrative against human standards, it was able to determine which sub-processes of narrative generation were important for producing quality results.

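Analytically, an ablation study of this kind reduces to aggregating reader grades by architectural condition and metric. The sketch below assumes hypothetical grades and condition names; it is not the study's actual dataset.

```python
# A minimal sketch of an ablation-style analysis: reader grades for generated
# narratives are averaged per architectural condition. All values are invented.
from collections import defaultdict
from statistics import mean

# (condition, metric, reader grades on a 1-10 scale)
grades = [
    ("full_system",  "style",       [8, 7, 9]),
    ("no_reviser",   "style",       [5, 6, 4]),
    ("no_lexical",   "style",       [7, 6, 7]),
    ("no_discourse", "readability", [4, 5, 5]),
    ("full_system",  "readability", [8, 8, 7]),
]

by_condition = defaultdict(dict)
for condition, metric, scores in grades:
    by_condition[condition][metric] = mean(scores)

for condition, metrics in by_condition.items():
    summary = ", ".join(f"{m}={v:.1f}" for m, v in metrics.items())
    print(f"{condition}: {summary}")
```
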
In addition to narrative metrics' post-hoc use in evaluation, they can also be incorporated directly into narrative generators. For example, the drama managers presented in Weyhrauch (1997) and Nelson et al. (2006) have used narrative metrics in the form of objective evaluation functions to assess candidate narrative directions. The functions were designed to declaratively encode authors' aesthetic preferences, against which narratives are judged. The evaluation functions combine several measurements that are hypothesized to reflect authorial goals, such as spatial locality of action, topical locality of action, and the degree to which plot points are motivated by prior events. By combining common authorial goals into a single comprehensive, weighted measure, an objective evaluation function can be used during the optimization process that guides narrative decision making. This approach is useful for making rapid, simple assessments about the quality of a narrative or narrative experience, and it is particularly attractive for generators that use machine learning or other optimization-based approaches. Unfortunately, simple evaluation functions are limited in their ability to measure many of narrative's most fundamental components. Further, the generality of the assumptions that associate particular narrative features with actual narrative "goodness" may be questionable. While more sophisticated, automated techniques could be implemented, the associated computational costs may violate real-time performance requirements.

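The following is a minimal sketch of such a weighted objective evaluation function. The plot-point representation, the three feature scorers, and the weights are hypothetical stand-ins for an author's declaratively encoded aesthetic preferences, in the spirit of Weyhrauch (1997) rather than a reproduction of Moe's actual function.

```python
# A sketch of a weighted evaluation function over a candidate plot sequence.
# Feature definitions, weights, and the plot representation are hypothetical.
def spatial_locality(plot):
    """Fraction of adjacent plot points occurring in the same location."""
    pairs = list(zip(plot, plot[1:]))
    return sum(a["place"] == b["place"] for a, b in pairs) / len(pairs)

def topical_locality(plot):
    """Fraction of adjacent plot points sharing at least one topic."""
    pairs = list(zip(plot, plot[1:]))
    return sum(bool(a["topics"] & b["topics"]) for a, b in pairs) / len(pairs)

def motivation(plot):
    """Fraction of plot points motivated by some earlier plot point."""
    seen, motivated = set(), 0
    for point in plot:
        motivated += bool(point["motivated_by"] & seen)
        seen.add(point["id"])
    return motivated / len(plot)

WEIGHTS = {spatial_locality: 0.3, topical_locality: 0.3, motivation: 0.4}

def evaluate(plot):
    """Weighted sum of authorial-goal features for one candidate narrative."""
    return sum(w * f(plot) for f, w in WEIGHTS.items())

plot = [
    {"id": 1, "place": "library", "topics": {"amulet"}, "motivated_by": set()},
    {"id": 2, "place": "library", "topics": {"amulet", "ghost"}, "motivated_by": {1}},
    {"id": 3, "place": "crypt", "topics": {"ghost"}, "motivated_by": {2}},
]
print(f"aesthetic score: {evaluate(plot):.2f}")
```

Because the function is a cheap, deterministic scalar, an optimizer or search procedure can call it thousands of times per decision, which is precisely what makes this family of metrics attractive inside a generator.
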
Cognitive-Affective Studies

The quality of a narrative is inseparably tied to an audience's response to it. Cognitive-affective studies shift the focus away from narrative artifacts and toward the cognitive-affective states fostered by a narrative experience. While some experiments, such as those discussed above, can assess audiences' cognitive responses to a generated narrative, their focus is primarily on narrative-dependent metrics such as prose quality and character believability rather than on participants' emotional and attentional states.

Because one of the most powerful effects of narrative is the sense of being transported into a story (Gerrig 1993; Green and Brock 2000), narrative is well suited to fostering high levels of audience engagement and presence (Kelso, Weyhrauch, and Bates 1993; Rowe, McQuiggan, and Lester 2007). The fundamental premise motivating cognitive-affective studies of narrative is that if stories produced by narrative generators elicit responses reminiscent of those resulting from human-generated stories, the narrative generator is capable of producing quality narratives.

Unfortunately, it can be difficult to observe and assess cognitive-affective states. Many experiments request periodic emotion self-reports throughout a narrative experience (Lee, McQuiggan, and Lester 2007) or administer validated questionnaires following the completion of the intervention (McQuiggan, Rowe, and Lester 2008). Unfortunately, both of these techniques are highly subjective. Self-reports can jarringly interrupt a narrative experience, and post-experience surveys take measurements long after cognitive-affective responses actually occur. Alternative techniques include facial expression analysis (Ekman 2003) and monitoring physiological measures such as heart rate and galvanic skin response (Lee, McQuiggan, and Lester 2007). However, most physiological measures provide only indirect indicators of cognitive-affective state. Despite their limitations, user participant studies hold much appeal for narrative generation evaluation.

Research on interactive narrative generators has long been interested in presence, informally defined as a user's sense of "being there" when interacting with a mediated environment (Schubert, Friedmann, and Regenbrecht 1999; Insko 2003). Experiments investigating interactive narrative generators have yielded a number of surprising and interesting presence-related results. In some of the Oz group's earliest work, Kelso et al. (1993) investigated the notion of dramatic presence by observing a user participating in an interactive drama populated with live actors. They concluded that by being an active participant in the narrative, rather than a passive observer, the interactor "found interactive drama more powerful, easily causing immediate, personal emotions, not the traditional vicarious empathy for other characters" (Kelso, Weyhrauch, and Bates 1993). These experiments informed the work pursued by the Oz group over subsequent years. Unfortunately, the expense associated with using live actors makes these types of experiments difficult to reproduce or to run with many participants.

McQuiggan et al. (2008) conducted a pair of experiments investigating the relationship between character behavior and user presence in an implemented interactive narrative. The experiments compared two versions of CRYSTAL ISLAND, an interactive 3D science mystery in which students learn about microbiology as they simultaneously discover the source of a mysterious illness plaguing the island. Both versions featured the same narrative, characters, world, and content, but one included a small subset of characters who engaged users in short empathetic exchanges. Using a validated instrument for measuring presence, Witmer and Singer's PQ (1998), the studies found an increase in presence among students in the empathetic character condition. This result was produced across two populations, middle school and high school students, and it suggested that simple variations in character behavior can yield significant gains in user presence.

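Analytically, such a two-condition presence study reduces to comparing questionnaire scores across groups. A hedged sketch with invented PQ scores rather than the study's data:

```python
# Compare presence questionnaire scores between two conditions with Welch's
# t-test. The score lists below are invented placeholders.
from scipy import stats

pq_empathetic = [112, 118, 105, 121, 116, 109, 119, 114]
pq_control = [101, 98, 107, 95, 104, 100, 97, 103]

t, p = stats.ttest_ind(pq_empathetic, pq_control, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p:.4f}")
```
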
However, the relationship between presence and engagement in interactive narrative generators is not entirely clear, as evidenced by work from Dow et al. (2007) on an augmented reality version of Façade. A human participant experiment compared an augmented reality version of Façade (AR Façade) with traditional desktop versions of the popular interactive drama. It was found that AR Façade elicited higher levels of presence than the desktop versions. However, qualitative interviews conducted after the intervention found that the enhanced presence experienced in AR Façade did not correspond to increased levels of engagement. Some participants actually preferred the desktop version of Façade and indicated that they would rather "portray a character on the screen, rather than literally be in the situation" (Dow et al. 2007). The investigators hypothesized that the augmented reality interface made users feel "too close" to the socially uncomfortable scenario that Façade implements. This work suggests that while presence and engagement are important variables for assessing narrative experiences, narrative's objective is not necessarily a simple optimization of the two factors.

Integrally related to presence and engagement are assessments of the emotional experiences fostered by narrative events. Many of the most powerful narrative experiences are defined by the affective responses they evoke: the horror genre seeks to elicit fear, comedies elicit joy, and the action genre elicits excitement. In recognition of the centrality of affective response in narrative, numerous human participant studies have been conducted to model and assess emotional responses to narrative interventions. For example, experiments with CRYSTAL ISLAND have combined emotional self-report data with physiological measures of heart rate and galvanic skin response to accurately model and assess emotional states during a narrative interaction (Lee, McQuiggan, and Lester 2007). Other work on CRYSTAL ISLAND has focused on the transitions between different emotional states. Experimental evidence suggests that different types of empathetic character behaviors during a narrative interaction result in different emotion transition responses (McQuiggan, Robison, and Lester 2008).

Director-centric Studies

Evaluation that centers on director agents and drama managers constitutes the third technique for evaluating narrative generation. Director agents themselves must perform significant narrative evaluation in the course of generating narratives. Director agents seek to provide well-formed narrative experiences and, in interactive narratives, to provide significant player agency (Riedl, Saretto, and Young 2003). To accomplish this objective, director agents should ideally consider the full scope of narrative, including story elements such as plot, discourse, media, character, and drama, as well as expected user cognitive-affective responses, and then use the results to guide narrative decision making. Currently, most director agents perform a subset of these analyses, leveraging automated narrative metrics and cognitive-affective models to determine appropriate courses of action and intervention. It should be noted that narrative generation tasks need not be performed by a single centralized agent; they can be realized in a distributed manner, as is done in character-centric narrative generation (Cavazza, Charles, and Mead 2002). In this case, individual agents perform an additional form of metacognitive processing of their own actions (e.g., assessing emotional impact on others) and use this information to further guide behavior (Aylett and Louchart 2008). Regardless of approach, it is incumbent upon director agents to effectively and automatically evaluate current and potential narrative directions, and then use this information to manage the interactive narrative experience.

Director-centric studies can be used to assess the efficacy of particular strategies for balancing player agency and narrative structure, such as proactive intervention (Magerko 2007), reactive intervention (Riedl, Saretto, and Young 2003), and computational models of narrative rationality (Mott and Lester 2006). Some of the earliest evaluation work centered on drama manager performance was conducted to assess the Moe architecture, which investigated three variants of adversarial search as mechanisms for informing a drama manager's narrative decision making (Weyhrauch 1997). Moe was run against nine different classes of simulated users, each varying in skill and cooperative tendency. For each user model, the simulations compared search-enhanced interactive drama experiences against a version lacking a drama manager. The search-enhanced manager's resulting narrative distribution was found to be significantly superior to that of the version lacking the manager, as measured by an aesthetic evaluation function. However, Weyhrauch noted important limitations of his assessment: the evaluation focuses on the performance of the drama manager's search algorithm rather than on the interactive drama as a whole, and it does not provide findings that can inform design improvements for a single experience. Moreover, the study did not include judgments provided by human users.

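The simulated-user methodology can be illustrated with a small, self-contained toy. In the sketch below, simulated user classes that vary in cooperativeness play through a story world with and without a simple drama manager, and each completed trajectory is scored by a stand-in aesthetic evaluation function. None of this reproduces Moe's actual search architecture; the plot points, user model, and scorer are all invented.

```python
# A toy director-centric study: simulated users generate story trajectories
# with and without a drama manager; trajectories are scored and compared.
import random
from statistics import mean

PLOT_POINTS = ["meet", "clue", "confront", "reveal", "resolve"]  # ideal order

def aesthetic_score(story):
    """Stand-in evaluation function: fraction of adjacent pairs in ideal order."""
    ordered = sum(PLOT_POINTS.index(a) < PLOT_POINTS.index(b)
                  for a, b in zip(story, story[1:]))
    return ordered / (len(story) - 1)

def simulate_episode(cooperativeness, managed, rng):
    """One playthrough: a simulated user picks plot points; a manager may steer."""
    remaining = list(PLOT_POINTS)
    story = []
    while remaining:
        preferred = min(remaining, key=PLOT_POINTS.index)  # manager's target
        if managed or rng.random() < cooperativeness:
            choice = preferred
        else:
            choice = rng.choice(remaining)
        remaining.remove(choice)
        story.append(choice)
    return story

rng = random.Random(0)
for cooperativeness in (0.2, 0.5, 0.8):  # three simulated user classes
    for managed in (False, True):
        scores = [aesthetic_score(simulate_episode(cooperativeness, managed, rng))
                  for _ in range(200)]
        print(f"coop={cooperativeness}, managed={managed}: mean {mean(scores):.2f}")
```

Even a toy like this exposes the interaction between user cooperativeness and manager benefit that Weyhrauch's nine simulated user classes were designed to probe.
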
Although the director-centric evaluation of the Moe architecture did not include judgments solicited from human participants, the technique of comparing different narrative director implementations using simulated users is a promising one for preliminary assessment. Work at Georgia Tech by Nelson et al. (2006) and Roberts et al. (2006) has continued this line of research by comparing alternative optimization-based approaches for drama management. These projects have modeled the task of finding effective drama management strategies as reinforcement learning and targeted trajectory distribution MDP problems, respectively. Emphasizing the goal of affording significant user agency, their work highlights the importance of optimizing for a distribution of different, high-quality stories, rather than merely focusing on policies that direct users toward a small set of highly rated narrative experiences. Determining the most effective strategies for achieving this goal remains an open research question.

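One way to quantify this distributional goal is to compare the distribution of trajectories a drama manager actually produces against an author-specified target distribution, for example with KL divergence or total variation distance. A sketch with invented trajectories and probabilities:

```python
# Compare an achieved story-trajectory distribution against a target one.
# Trajectory labels and probabilities are invented placeholders.
import math

target = {"A-B-C": 0.5, "A-C-B": 0.3, "B-A-C": 0.2}
achieved = {"A-B-C": 0.7, "A-C-B": 0.2, "B-A-C": 0.1}

kl = sum(p * math.log(p / achieved[t]) for t, p in target.items())
tv = 0.5 * sum(abs(p - achieved[t]) for t, p in target.items())
print(f"KL(target || achieved) = {kl:.3f}, total variation = {tv:.3f}")
```
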
Real-time performance constraints are another important consideration when evaluating narrative director agents. Often, the narrative decision-making problems presented to a director agent are computationally intractable. Weyhrauch (1997) addressed this problem by limiting Moe's search depth, as well as through memoization strategies during online search. Techniques used at Georgia Tech have simply moved the optimization process off-line (Nelson et al. 2006; Roberts et al. 2006). Mott and Lester's (2006) U-Director system implements a decision-theoretic approach to narrative management, a technique that poses a compute-intensive Bayesian inference problem during each narrative decision-making cycle. To address this issue, they empirically investigated a number of different approximation techniques for Bayesian inference, the techniques' associated performance within the domain, and the effectiveness of their resulting decisions for guiding users through the narrative. Although U-Director's empirical evaluation was limited in scope, the findings underscored the importance of evaluating computational efficiency and its tradeoffs with narrative effectiveness.

Extrinsic Narrative Evaluation

The first three families of evaluation methodologies operate with an "inward facing" focus: they do not consider narratives' larger motivating contexts. To round out the evaluation framework's assessment methodologies, the final technique, extrinsic narrative evaluation, operates with an "outward facing" focus. Most narratives do not merely aim to recount a sequence of events; rather, they are used to entertain, communicate an idea, or serve some other external purpose. For example, MINSTREL (Turner 1994) generates stories that communicate a theme or moral lesson. Façade (Mateas and Stern 2005) seeks to deliver an artistically complete, conversation-driven, dramatic experience. A number of interactive narrative generators aim to balance user agency and narrative coherence solely for the purpose of entertainment (Cavazza, Charles, and Mead 2002; Riedl, Saretto, and Young 2003). CRYSTAL ISLAND (McQuiggan, Rowe, and Lester 2008), FearNot! (Zoll et al. 2006), and the Tactical Language and Culture Training System (Johnson and Beal 2005) use narrative to contextualize learning and problem-solving scenarios. Extrinsic narrative evaluation is needed to assess narrative generation with an eye toward a narrative's purpose. The distinction between "inward facing" and "outward facing" techniques plays a role analogous to the distinction between intrinsic and extrinsic evaluation in natural language processing (Jurafsky and Martin 2008). Intrinsic evaluations measure models independently of any particular application, while extrinsic evaluations assess models within an application and gauge the application's overall effectiveness. We discuss several evaluation approaches that measure narrative generators by their ability to produce narratives that support some extrinsic goal.

Extrinsic evaluation is critical for narrative generators used in the service of education and training. Educational narratives naturally lend themselves to extrinsic evaluation. These applications provide measurable variables, such as learning gains, that can be used to assess the overall performance of a narrative system. Recently, intelligent narrative technology research teams have begun to collaborate with colleagues in the learning sciences to conduct user participant studies. For example, laboratory studies investigating the CRYSTAL ISLAND narrative-centered learning environment have shown significant learning gains among eighth graders after a single interaction with the science mystery (McQuiggan et al. 2008). Field studies involving the FearNot! narrative learning environment, which targets social education about bullying, have investigated changes in students' empathetic characteristics after completing the narrative scenario (Zoll et al. 2006). Researchers building the Tactical Language and Culture Training System have completed a number of iterative usability and learning evaluations in conjunction with the US Army (Johnson and Beal 2005).

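When pre- and post-test scores are available, a common extrinsic measure is the normalized learning gain: the fraction of possible improvement a learner actually realizes. A minimal sketch with invented scores, not data from the studies above:

```python
# Normalized (Hake-style) learning gain from pre-test to post-test scores.
# All scores below are invented placeholders.
from statistics import mean

def normalized_gain(pre, post, max_score):
    """Fraction of the possible improvement that was realized."""
    return (post - pre) / (max_score - pre)

pre_scores = [9, 11, 8, 13, 10, 12]
post_scores = [14, 15, 12, 17, 13, 16]
MAX_SCORE = 20

gains = [normalized_gain(pre, post, MAX_SCORE)
         for pre, post in zip(pre_scores, post_scores)]
print(f"mean normalized gain = {mean(gains):.2f}")
```
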
A narrative generator need not serve an educational purpose to benefit from extrinsic evaluation. For example, Mehta et al. (2007) performed a qualitative evaluation of Façade's conversational system within the context of its larger dramatic objective. The authors ran several human participants through Façade and focused their attention on points where the system's conversational facilities failed. They concluded that Façade was relatively successful at maintaining user engagement and a sense of drama. Curiously, users would often interpret conversational breakdowns as natural features of the narrative, and they inferred that conversational cues and character responses generated by Façade were an important part of these experiences.

Major drawbacks associated with extrinsic evaluation include the expense of embedding narrative technologies into full applications and the difficulty of conducting large, controlled human participant studies with appropriate populations. Clearly, extrinsic evaluation requires the existence of reasonably mature systems. Nevertheless, when extrinsic evaluation is possible, it can be an effective means for assessing narrative technologies.

Discussion and Conclusions

The STORYEVAL framework represents a first step toward an integrated evaluation methodology for computational models of narrative generation. By employing narrative metrics, cognitive-affective studies, director-centric studies, and extrinsic narrative evaluations, we can systematically assess precisely which aspects of a narrative generator most effectively contribute to its successful performance. STORYEVAL offers a promising beginning for a comprehensive narrative generation evaluation framework, but it does not address the evaluation of related narrative tasks such as story understanding or narrative authoring. Nevertheless, it highlights a number of central issues for evaluating intelligent narrative technologies:
• The complexity inherent in intelligent narrative technologies calls for a sophisticated, multi-faceted approach to evaluation.
• While narrative generation evaluation methodologies can draw on techniques from cognitive science, human-computer interaction, and natural language processing, the assessment of narrative generation raises issues that are fundamentally different from those found in other types of software design.
• Narrative metrics, cognitive-affective studies, director-centric studies, and extrinsic narrative evaluations are integrally interrelated, and each has its own benefits and limitations.
• Narrative evaluation is not merely important for empirical validation; its techniques can also form the basis for computational models of narrative generation.

As evidenced by progress in natural language processing, adopting effective evaluation methodologies can facilitate the rapid advancement of a field (Belz and Reiter 2006; Walker et al. 1997), as well as provide empirical support for identifying the community's most promising approaches. By promoting vigorous discussion of evaluation issues such as experimental methodologies, automated assessments, and shared tasks, the intelligent narrative technologies community can continue to grow and develop principled approaches for assessing and improving computational models of narrative.

Acknowledgements

The authors would like to thank the other members of the IntelliMedia Center for Intelligent Systems at North Carolina State University for useful discussions and support. This research was supported by the National Science Foundation under Grants REC-0632450, IIS-0757535, DRL-0822200, and IIS-0812291. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Aylett, R., and Louchart, S. 2008. If I were you: Double appraisal in affective agents. Proc. of the 7th International Conference on AAMAS, 1233-1236, Estoril, Portugal.

Bailey, P. 1999. Searching for storiness: Story-generation from a reader's perspective. Working Notes of the AAAI Fall Symposium on Narrative Intelligence, 157-163, Cape Cod, MA.

Belz, A., and Reiter, E. 2006. Comparing automatic and human evaluation of NLG systems. Proc. of the 11th Conf. of EACL, 313-320, Trento, Italy.

Callaway, C., and Lester, J. 2001. Evaluating the effects of natural language generation techniques on reader satisfaction. Proc. of the 23rd Annual Conference of the Cognitive Science Society, 164-169, Edinburgh, UK.

Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (Eds.) 2000. Embodied Conversational Agents. Boston, MA: MIT Press.

Cavazza, M., Charles, F., and Mead, S.J. 2002. Planning characters' behaviour in interactive storytelling. Journal of Visualization and Computer Animation 13: 121-131.

Dix, A., Finlay, J., Abowd, G., and Beale, R. 2004. Human-Computer Interaction. Harlow, England: Pearson Education.

Dow, S., Mehta, M., Harmon, E., MacIntyre, B., and Mateas, M. 2007. Presence and engagement in an interactive drama. Proc. of CHI, 1475-1484, San Jose, CA.

Ekman, P. 2003. Emotions Revealed. New York: Henry Holt.

Gerrig, R. 1993. Experiencing Narrative Worlds: On the Psychological Activities of Reading. New Haven: Yale University Press.

Graesser, A.C., Lang, K.L., and Roberts, R.M. 1991. Question answering in the context of stories. Journal of Experimental Psychology: General, 120(3), 254-277.

Green, M., and Brock, T. 2000. The role of transportation in the persuasiveness of public narratives. Journal of Personality and Social Psychology, 79(5), 701-721.

Insko, B.E. 2003. Measuring presence: Subjective, behavioral and physiological methods. In G. Riva, F. Davide, and W.A. IJsselsteijn (Eds.), Being There: Concepts, Effects and Measurements of User Presence in Synthetic Environments. Amsterdam: IOS Press. 109-119.

Johnson, W.L., and Beal, C. 2005. Iterative evaluation of a large-scale, intelligent game for language learning. Proc. of the 12th Intl Conference on Artificial Intelligence in Education, 290-297, Amsterdam, The Netherlands.

Jurafsky, D., and Martin, J. 2008. Speech and Language Processing. Upper Saddle River, NJ: Pearson Education.

Kelso, M., Weyhrauch, P., and Bates, J. 1993. Dramatic presence. Presence: The Journal of Teleoperators and Virtual Environments, 2(1), 1-15.

Lee, S., McQuiggan, S., and Lester, J. 2007. Inducing user affect recognition models for task-oriented environments. Proc. of the 11th Intl Conf. on User Modeling, 380-384, Corfu, Greece.

Magerko, B. 2007. Evaluating preemptive story direction in the Interactive Drama Architecture. Journal of Game Development, 2(3).

Mateas, M., and Stern, A. 2005. Structuring content in the Façade interactive drama architecture. Proc. of AIIDE, 93-98, Marina del Rey, CA.

McQuiggan, S., Robison, J., and Lester, J. 2008. Affective transitions in narrative-centered learning environments. Proc. of the 9th Intl Conf. on ITS, 490-499, Montreal, Canada.

McQuiggan, S., Rowe, J., and Lester, J. 2008. The effects of empathetic virtual characters on presence in narrative-centered learning environments. Proc. of CHI, 1511-1520, Florence, Italy.

McQuiggan, S., Rowe, J., Lee, S., and Lester, J. 2008. Story-based learning: The impact of narrative on learning experiences and outcomes. Proc. of the 9th Intl Conference on ITS, 530-539, Montreal, Canada.

Mehta, M., Dow, S., Mateas, M., and MacIntyre, B. 2007. Evaluating a conversation-centered interactive drama. Proc. of the 6th Intl Conf. on AAMAS, 1-8, Honolulu, HI.

Mott, B., and Lester, J. 2006. U-Director: A decision-theoretic narrative planning architecture for storytelling environments. Proc. of the 5th Intl Conf. on AAMAS, 977-984, Hakodate, Japan.

Nelson, M., Mateas, M., Roberts, D., and Isbell, C. 2006. Declarative optimization-based drama management in the interactive fiction Anchorhead. IEEE Computer Graphics and Applications, 26(3): 32-41.

Prince, G. 2003. Dictionary of Narratology (Revised Edition). Lincoln, NE: University of Nebraska Press.

Riedl, M., and Young, R.M. 2005. An objective character believability evaluation procedure for multi-agent story generation systems. Proc. of IVA, 278-291, Kos, Greece.

Riedl, M., Saretto, C.J., and Young, R.M. 2003. Managing interaction between users and agents in a multi-agent storytelling environment. Proc. of the 2nd Intl Conf. on AAMAS, 741-748, Melbourne, Australia.

Roberts, D., Nelson, M., Isbell, C., Mateas, M., and Littman, M. 2006. Targeting specific distributions of trajectories in MDPs. Proc. of the 21st AAAI, Boston, MA.

Rowe, J., McQuiggan, S., and Lester, J. 2007. Narrative presence in intelligent learning environments. Working Notes of the 2007 AAAI Fall Symposium on Intelligent Narrative Technologies, 126-133, Washington, D.C.

Russell, S., and Norvig, P. 2003. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Pearson Education.

Schubert, T., Friedmann, F., and Regenbrecht, H. 1999. Embodied presence in virtual environments. In R. Paton and I. Neilson (Eds.), Visual Representations and Interpretations. London: Springer. 269-278.

Si, M., Marsella, S.C., and Pynadath, D.V. 2005. THESPIAN: An architecture for interactive pedagogical drama. Proc. of the 12th Intl Conf. on Artificial Intelligence in Education, 21-28, Amsterdam, The Netherlands.

Turner, S. 1994. The Creative Process: A Computer Model of Storytelling and Creativity. Hillsdale, NJ: Lawrence Erlbaum Associates.

Walker, M., Litman, D., Kamm, C., and Abella, A. 1997. PARADISE: A framework for evaluating spoken dialogue agents. Proc. of ACL, 271-280, Madrid, Spain.

Weyhrauch, P. 1997. Guiding Interactive Drama. Ph.D. diss., Dept. of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Witmer, B., and Singer, M. 1998. Measuring presence in virtual environments: A presence questionnaire. Presence: Teleoperators and Virtual Environments, 7(3), 225-240.

Zoll, C., Enz, S., Schaub, H., Aylett, R., and Paiva, A. 2006. Fighting bullying with the help of autonomous agents in a virtual school environment. Proc. of the 7th Intl Conf. on Cognitive Modeling, Trieste, Italy.