Movie Script Summarization as Graph-Based Scene Extraction
synopses and loglines, identify main characters and their stories, or facilitate browsing (e.g., "show me every scene where there is a shooting"). In this paper we explore whether current NLP technology can be used to address some of these tasks. Specifically, we focus on script summarization, which we conceptualize as the process of generating a shorter version of a screenplay, ideally encapsulating its most informative scenes. The resulting summaries can be used to enhance script browsing, give readers a rough idea of the script's content and plotline, and speed up reading time.

So, what makes a good script summary? According to modern film theory, "all films are about nothing — nothing but character" (Monaco, 1982). Beyond characters, a summary should also highlight major scenes representative of the story and its progression. With this in mind, we define a script summary as a chain of scenes which conveys a narrative and smooth transitions from one scene to the next. At the same time, a good chain should incorporate some diversity (i.e., avoid redundancy) and focus on important scenes and characters. We formalize the problem of selecting a good summary chain using a graph-theoretic approach. We represent scripts as (directed) bipartite graphs with vertices corresponding to scenes and characters, and edge weights to their strength of correlation. Intuitively, if two scenes are connected, a random walk starting from one would reach the other frequently. We find a chain of highly connected scenes by jointly optimizing logical progression, diversity, and importance.

Our contributions in this work are three-fold: we introduce a novel summarization task on a new text genre, and formalize scene selection as the problem of finding a chain that represents a film's story; we propose several novel methods for analyzing script content (e.g., identifying important characters and their interactions); and we perform a large-scale human evaluation study using a question-answering task. Experimental results show that our method produces summaries which are more informative compared to several competitive baselines.

2 Related Work

Computer-assisted analysis of literary text has a long history, with the first studies dating back to the 1960s (Mosteller and Wallace, 1964). More recently, the availability of large collections of digitized books and works of fiction has enabled researchers to observe cultural trends, address questions about language use and its evolution, study how individuals rise to and fall from fame, perform gender studies, and so on (Michel et al., 2010). Most existing work focuses on low-level analysis of word patterns, with a few notable exceptions. Elson et al. (2010) analyze 19th century British novels by constructing a conversational network with vertices corresponding to characters and weighted edges corresponding to the amount of conversational interaction. Elsner (2012) analyzes characters and their emotional trajectories, whereas Nalisnick and Baird (2013) identify a character's enemies and allies in plays based on the sentiment of their utterances. Other work (Bamman et al., 2013, 2014) automatically infers latent character types (e.g., villains or heroes) in novels and movie plot summaries.

Although we are not aware of any previous approaches to summarizing screenplays, the field of computer vision is rife with attempts to summarize video (see Reed 2004 for an overview). Most techniques are based on visual information and rely on low-level cues such as motion, color, or audio (e.g., Rasheed et al. 2005). Movie summarization is a special type of video summarization which poses many challenges due to the large variety of film styles and genres. A few recent studies (Weng et al., 2009; Lin et al., 2013) have used concepts from social network analysis to identify lead roles and role communities in order to segment movies into scenes (containing one or more shots) and create more informative summaries. A surprising fact about this line of work is that it does not exploit the movie script in any way. Characters are typically identified using face recognition techniques, and scene boundaries are presumed unknown and are detected automatically. A notable exception is Sang and Xu (2010), who generate video summaries for movies while taking into account character interaction features which they estimate from the corresponding screenplay.

Our own approach is inspired by work in egocentric video analysis. An egocentric video offers a first-person view of the world and is captured from a wearable camera focusing on the user's activities, […]
Genre      # Movies   AvgLines   AvgScenes   AvgChars
Drama      665        4484.53    79.77       60.94
Thriller   451        4333.10    91.84       52.59
Comedy     378        4303.02    66.13       57.51
Action     288        4255.56    101.82      59.99

Figure 2: ScriptBase corpus statistics. Movies can have multiple genres, thus numbers do not add up to 1,276.
Figure 4: Example of a bipartite graph, connecting a movie's scenes with participating characters.

Scene-to-scene Progression The first term in the objective is responsible for selecting chains representing a logically coherent story. Intuitively, this means that if our chain includes a scene where a character commits an action, then scenes involving affected parties or follow-up actions should also be included. We operationalize this idea of progression in a story in terms of how strongly the characters in a selected scene s_i influence the transition to the next scene s_{i+1}:

    P(S') = \sum_{i=0}^{|S'|-1} \sum_{c \in C_i} INF(s_i, s_{i+1} \mid c)    (3)

We represent screenplays as weighted, bipartite graphs connecting scenes and characters:

    B = (V, E) : V = C \cup S
    E = \{(s, c, w_{s,c}) \mid s \in S, c \in C, w_{s,c} \in [0,1]\} \cup \{(c, s, w_{c,s}) \mid c \in C, s \in S, w_{c,s} \in [0,1]\}

The set of vertices V corresponds to the union of characters C and scenes S. We therefore add to the bipartite graph one node per scene and one node per character, and two directed edges for each scene-character and character-scene pair. An example of a bipartite graph is shown in Figure 4. We further assume that two scenes s_i and s_{i+1} are tightly connected in such a graph if a random walk with restart (RWR; Tong et al. 2006; Kim et al. 2014) which starts in s_i has a high probability of ending in s_{i+1}.

In order to calculate the random walk stationary distributions, we must estimate the weights between a character and a scene. We are interested in how important a character is generally in the movie, and specifically in a particular scene. For w_{c,s}, we consider the probability of a character being important, i.e., of them belonging to the set of main characters:

    w_{c,s} = P(c \in main(M)), \forall (c, s, w_{c,s}) \in E    (4)

where P(c \in main(M)) is some probability score associated with c being a main character in script M. For w_{s,c}, we take the number of interactions a character is involved in relative to the total number of interactions in a specific scene as indicative of the character's importance in that scene. Interactions refer to conversational interactions as well as relations between characters (e.g., who does what to whom):

    w_{s,c} = \frac{\sum_{c' \in C_s} inter(c, c')}{\sum_{c_1, c_2 \in C_s} inter(c_1, c_2)}, \forall (s, c, w_{s,c}) \in E    (5)

We defer discussion of how we model the probability P(c \in main(M)) and obtain interaction counts to Section 5. Weights w_{s,c} and w_{c,s} are normalized:

    w_{s,c} = \frac{w_{s,c}}{\sum_{(s, c', w'_{s,c})} w'_{s,c}}, \forall (s, c, w_{s,c}) \in E    (6)

    w_{c,s} = \frac{w_{c,s}}{\sum_{(c, s', w'_{c,s})} w'_{c,s}}, \forall (c, s, w_{c,s}) \in E    (7)

We calculate the stationary distributions of a random walk on a transition matrix T, enumerating over all vertices v (i.e., characters and scenes) in the bipartite graph B:

    T(i, j) = \begin{cases} w_{i,j} & \text{if } (v_i, v_j, w_{i,j}) \in E_B \\ 0 & \text{otherwise} \end{cases}    (8)

We measure the influence individual characters have on scene-to-scene transitions as follows. The stationary distribution r_k for a RWR walker starting at node k is a vector that satisfies:

    r_k = (1 - \varepsilon) T r_k + \varepsilon e_k    (9)

where T is the transition matrix of the graph, e_k is a seed vector with all elements 0 except for element k, which is set to 1, and \varepsilon is a restart probability parameter. In practice, our vectors r_k and e_k are indexed by the scenes and characters in a movie, i.e., they have length |S| + |C|, and their nth element corresponds either to a known scene or character.
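To make Equations (4)–(8) concrete, the following Python sketch assembles T from per-scene interaction counts and main-character probabilities. The input names inter_counts and p_main are illustrative assumptions on our part; how these quantities are actually obtained is described in Section 5.

    import numpy as np

    def transition_matrix(scenes, p_main, inter_counts):
        """Sketch of Equations (4)-(8): a bipartite scene/character graph.

        p_main[c]       -- estimate of P(c in main(M))              (Eq. (4))
        inter_counts[s] -- dict {(c1, c2): count} of interactions in scene s
        """
        characters = sorted({c for s in scenes
                               for pair in inter_counts[s] for c in pair})
        nodes = list(scenes) + characters
        index = {v: i for i, v in enumerate(nodes)}
        T = np.zeros((len(nodes), len(nodes)))
        for s in scenes:
            total = float(sum(inter_counts[s].values())) or 1.0
            participants = {c for pair in inter_counts[s] for c in pair}
            for c in participants:
                # Eq. (5): c's share of all interactions in scene s.
                w_sc = sum(n for pair, n in inter_counts[s].items()
                           if c in pair) / total
                T[index[s], index[c]] = w_sc
                # Eq. (4): the character->scene weight is c's overall importance.
                T[index[c], index[s]] = p_main[c]
        # Eqs. (6)-(7): normalize the outgoing edge weights of every vertex.
        rows = T.sum(axis=1, keepdims=True)
        T = np.divide(T, rows, out=np.zeros_like(T), where=rows > 0)
        return T, index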
In cases where graphs are relatively small, we can compute r directly⁴ by solving:

    r_k = \varepsilon (I - (1 - \varepsilon) T)^{-1} e_k

[…] we define d_sen, the sentiment overlap between two scenes, as: […]

⁴ […] preferable for large graphs, since the performed matrix inversion is computationally expensive.

[…] characters is an empirical question in its own right. For our purposes, we assume that this relation holds.
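Both routes can be sketched in a few lines of Python, reusing the matrix built above. The closed form follows by rearranging Equation (9); power iteration avoids the matrix inversion flagged in the footnote. The transpose encodes the usual convention that probability mass flows along outgoing edges.

    import numpy as np

    def rwr(T, k, eps=0.15, iters=100, direct=False):
        """Stationary distribution r_k of a random walk with restart (Eq. (9)).

        T is the row-normalized transition matrix; k indexes the seed vertex.
        """
        n = T.shape[0]
        e = np.zeros(n)
        e[k] = 1.0                      # seed vector e_k
        M = T.T                         # walk along outgoing edges
        if direct:
            # Closed form of Eq. (9): r = eps * (I - (1 - eps) M)^-1 e.
            return eps * np.linalg.solve(np.eye(n) - (1 - eps) * M, e)
        r = e.copy()
        for _ in range(iters):          # power iteration, cheap for large graphs
            r = (1 - eps) * M @ r + eps * e
        return r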
[…] subject to their weights λ. We add a constraint corresponding to the compression rate, i.e., the number of scenes to be selected, and enforce their linear order by disallowing non-consecutive combinations. We use GLPK⁶ to solve the linear problem.

⁶ https://www.gnu.org/software/glpk/
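Since the surrounding derivation of the objective is not reproduced here, the following is only a minimal sketch of such an ILP in Python (using PuLP with the GLPK backend). The per-scene scores stand in for the actual λ-weighted combination of progression, diversity, and importance terms; the chain's linear order follows from reading the selected subset in script order.

    import pulp

    def select_chain(scene_scores, k):
        """Minimal ILP sketch: pick k scenes maximizing a per-scene score.

        scene_scores stands in for the lambda-weighted combination of the
        objective terms; the full linearization is not reproduced here.
        """
        n = len(scene_scores)
        prob = pulp.LpProblem("scene_chain", pulp.LpMaximize)
        x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]
        prob += pulp.lpSum(s * xi for s, xi in zip(scene_scores, x))
        prob += pulp.lpSum(x) == k            # compression-rate constraint
        prob.solve(pulp.GLPK_CMD(msg=False))  # GLPK backend, as in the paper
        return [i for i in range(n) if x[i].value() == 1]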
5 Implementation

In this section we discuss several aspects of the implementation of the model presented in the previous section. We explain how interactions are extracted and how sentiment is calculated. We also present our method for identifying main characters and estimating the weights w_{s,c} and w_{c,s} in the bipartite graph.

Interactions The notion of interaction underlies many aspects of the model defined in the previous section. For instance, interaction counts are required to estimate the weights w_{s,c} in the bipartite graph of the progression term (see Equation (5)) and to define diversity (see Equations (15)–(17)). As we shall see below, interactions are also important for identifying main characters in a screenplay.

We use the term interaction to refer to conversations between two characters, as well as their relations (e.g., if a character kills another). For conversational interactions, we simply need to identify the speaker generating an utterance and the listener. Speaker attribution comes for free in our case, as speakers are clearly marked in the text (see Figure 1). Listener identification is more involved, especially when there are multiple characters in a scene. We rely on a few simple heuristics. We assume that the previous speaker in the same scene, who is different from the current speaker, is the listener. If there is no previous speaker, we assume that the listener is the closest character mentioned in the speaker's utterance (e.g., via a coreferring proper name or a pronoun). In cases where we cannot find a suitable listener, we assume the current speaker is the listener.
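These heuristics amount to a simple cascade, sketched below in Python; the representation of utterances and mentions is an assumption on our part.

    def identify_listener(utterances, idx, mentioned):
        """Heuristic listener identification for utterance idx in a scene.

        utterances: list of (speaker, text) pairs in scene order.
        mentioned:  characters mentioned in the current utterance, nearest
                    first (e.g., coreference-resolved names and pronouns).
        """
        speaker = utterances[idx][0]
        # 1. The previous speaker in the scene, if different, is the listener.
        for prev_speaker, _ in reversed(utterances[:idx]):
            if prev_speaker != speaker:
                return prev_speaker
        # 2. Otherwise, the closest character mentioned in the utterance.
        for c in mentioned:
            if c != speaker:
                return c
        # 3. Fall back to the current speaker.
        return speaker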
We obtain character relations from the output of a semantic role labeler. Relations are denoted by verbs whose ARG0 and ARG1 roles are character names. We extract relations from the dialogue but also from scene descriptions. For example, in Figure 1 the description Suddenly, [...] he clubs her over the head contains the relation clubs(MAN, CATHERINE). Pronouns are resolved to their antecedent using the Stanford coreference resolution system (Lee et al., 2011).

Sentiment We labeled lexical items in screenplays with sentiment values using the AFINN-96 lexicon (Nielsen, 2011), which is essentially a list of words scored with sentiment strength within the range [−5, +5]. The list also contains obscene words (which are often used in movies) and some Internet slang. By summing over the sentiment scores of individual words, we can work out the sentiment of an interaction between two characters, the sentiment of a scene (see Equation (17)), and even the sentiment between characters (e.g., who likes or dislikes whom in the movie in general).
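Word-level summing of this kind is a one-liner; in the sketch below, afinn is assumed to be the AFINN-96 word-to-score dictionary.

    def sentiment(text, afinn):
        """Sum AFINN scores (each in [-5, +5]) over a span of text.

        Applied to an utterance, a scene, or all exchanges between two
        characters, this yields the corresponding sentiment value.
        """
        return sum(afinn.get(w, 0) for w in text.lower().split())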
Main Characters The progress term in our summarization objective crucially relies on characters and their importance (see the weight w_{c,s} in Equation (4)). Previous work (Weng et al., 2009; Lin et al., 2013) extracts social networks where nodes correspond to roles in the movie, and edges to their co-occurrence. Leading roles (and their communities) are then identified by measuring their centrality in the network (i.e., the number of edges terminating in a given node).

It is relatively straightforward to obtain a social network from a screenplay. Formally, for each movie we define a weighted and undirected graph:

    G = \{C, E\} : C = \{c_1, \ldots, c_n\}
    E = \{(c_i, c_j, w) \mid c_i, c_j \in C, w \in \mathbb{N}_{>0}\}

where vertices correspond to movie characters⁷ and edges denote character-to-character interactions. Figure 5 shows an example of a social network for "The Silence of the Lambs". Due to lack of space, only main characters are displayed; however, the actual graph contains all characters (42 in this case). Importantly, edge weights are not normalized, but directly reflect the strength of association between different characters.

We do not solely rely on the social network to identify main characters. We estimate P(c ∈ main(M)), the probability of c being a leading character in movie M, using a Multi Layer Perceptron […]

⁷ We assume one node per speaking role in the script.
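A sketch of this construction in Python, with weighted degree standing in for the centrality measure described above; the input format is an assumption.

    from collections import Counter

    def character_network(interaction_pairs):
        """Weighted, undirected character graph G (one node per speaking role).

        interaction_pairs: iterable of (c1, c2) interaction pairs; edge
        weights are raw interaction counts, deliberately left unnormalized.
        """
        return Counter(tuple(sorted(pair)) for pair in interaction_pairs)

    def weighted_degree(edges):
        """Centrality as the summed weight of edges incident to each node."""
        deg = Counter()
        for (c1, c2), w in edges.items():
            deg[c1] += w
            deg[c2] += w
        return deg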
[Figure 5: Social network for "The Silence of the Lambs", showing the main characters (Dr. Lecter, Clarice, Crawford, Mr. Gumb, Catherine) connected by edges weighted with raw interaction counts.]

6 Experimental Setup

Gold Standard Chains The development and tuning of the chain extraction model presented in Section 4 necessitates access to a gold standard of key scene chains representing the movie's most important […]
[Example comprehension questions for one of the movies in the test set:]

1. Why does Trevor leave New York and where does he move to?
2. What is KOS, who is their leader, and why is he attending high school?
3. What happened to Cesar's finger, and how did he eventually die?
4. Who killed Benny and how does Ellen find out?
5. Who is Rita and what becomes of her?

          10%    20%    30%    40%    50%
MaxOv     0.40   0.50   0.58   0.64   0.71
MinOv     0.13   0.27   0.40   0.53   0.66
SceneSum  0.23   0.37   0.50   0.60   0.68
Random    0.10   0.20   0.30   0.40   0.50

Table 2: Model performance on automatically generated gold standard (test set) at different compression rates.
          Beginning   Middle   End
MaxOv     33.95       34.89    31.16
MinOv     34.30       33.91    31.80
SceneSum  35.30       33.54    31.16
Random    34.30       33.91    31.80

Table 3: Average percentage of scenes taken from the beginning, middle, and end of movies, on the automatic gold standard test set.

Movies               MaxOv   MinOv   SceneSum   Random
Nightmare 3          69.18   74.49   60.24      56.33
Little Athens        34.92   31.75   36.90      33.33
Living in Oblivion   40.95   35.00   60.00      30.24
Mumford              72.86   60.00   30.00      54.29
One Eight Seven      47.30   38.89   67.86      30.16
Anniversary Party    45.39   56.35   62.46      37.62
We Own the Night     28.57   32.14   52.86      28.57
While She Was Out    72.86   75.71   85.00      45.71
All Questions        51.51   50.54   56.91      39.53
Five Questions       51.00   53.13   57.38      36.88
Plot Question        60.00   56.88   73.75      55.00
Characters Question  45.54   37.34   37.75      31.29

Table 4: Percentage of questions answered correctly.

[…] the average percentage of scenes selected from the beginning, middle, and end of the movie (based on an equal division of the number of scenes in the screenplay). As can be seen, the number of selected scenes tends to be evenly distributed across the entire movie. SceneSum has a slight bias towards the beginning of the movie, which is probably natural, since leading characters appear early on, as well as important scenes introducing essential story elements (e.g., setting, points of view).
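The position analysis is straightforward to reproduce: divide a script's scenes into three equal bins and count where the selected scenes fall. A small sketch, assuming 0-based scene indices:

    def position_distribution(selected, n_scenes):
        """Percentage of selected scenes in the beginning/middle/end thirds."""
        bins = [0, 0, 0]
        for i in selected:
            bins[min(3 * i // n_scenes, 2)] += 1
        total = len(selected) or 1
        return [100.0 * b / total for b in bins]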
The results of our human evaluation study are summarized in Table 4. We observe that SceneSum summaries are overall more informative compared to those created by the baselines. In other words, AMT participants are able to answer more questions regarding the story of the movie when reading SceneSum summaries. In two instances ("A Nightmare on Elm Street 3" and "Mumford"), the overlap models score better; however, in these cases the movies largely consist of scenes with the same characters and relatively little variation ("A Nightmare on Elm Street 3"), or the camera follows the main lead in his interactions with other characters ("Mumford"). Since our model is not so character-centric, it might be thrown off by non-character-based terms in its objective, leading to the selection of unfavorable scenes. Table 4 also presents a breakdown of the different types of questions answered by our participants. Again, we see that in most cases a larger percentage is answered correctly when reading SceneSum summaries.

Overall, we observe that SceneSum extracts chains which encapsulate important movie content across the board. We should point out that although our movies are broadly classified as comedies and thrillers, they have very different structure and content. For example, "Little Athens" has a very loose plotline, "Living in Oblivion" has multiple dream sequences, whereas "While She Was Out" contains only a few characters and a series of important scenes towards the end. Despite this variety, SceneSum performs consistently better in our task-based evaluation.

8 Conclusions

In this paper we have developed a graph-based model for script summarization. We formalized the process of generating a shorter version of a screenplay as the task of finding an optimal chain of scenes which are diverse, important, and exhibit logical progression. A large-scale evaluation based on a question-answering task revealed that our method produces more informative summaries compared to several baselines. In the future, we plan to explore model performance in a wider range of movie genres, as well as its applicability to other NLP tasks (e.g., book summarization or event extraction). We would also like to automatically determine the compression rate, which should presumably vary according to the movie's length and content. Finally, our long-term goal is to be able to generate loglines as well as movie plot summaries.

Acknowledgments We would like to thank Rik Sarkar, Jon Oberlander, and Annie Louis for their valuable feedback. Special thanks to Bharat Ambati, Lea Frermann, and Daniel Renshaw for their help with system evaluation.

References
Bamman, David, Brendan O'Connor, and Noah A. Smith. 2013. Learning Latent Personas of Film Characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, pages 352–361.

Bamman, David, Ted Underwood, and Noah A. Smith. 2014. A Bayesian Mixed Effects Model of Literary Character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD, USA, pages 370–379.

Björkelund, Anders, Love Hafdell, and Pierre Nugues. 2009. Multilingual Semantic Role Labeling. In Proceedings of the 13th Conference on Computational Natural Language Learning: Shared Task. Boulder, Colorado, pages 43–48.

Clarke, James and Mirella Lapata. 2010. Discourse Constraints for Document Compression. Computational Linguistics 36(3):411–441.

Elsner, Micha. 2012. Character-based Kernels for Novelistic Plot Structure. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France, pages 634–644.

Elson, David K., Nicholas Dames, and Kathleen R. McKeown. 2010. Extracting Social Networks from Literary Fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, pages 138–147.

Kim, Jun-Seong, Jae-Young Sim, and Chang-Su Kim. 2014. Multiscale Saliency Detection Using Random Walk With Restart. IEEE Transactions on Circuits and Systems for Video Technology 24(2):198–210.

Lee, Heeyoung, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford's Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In Proceedings of the 15th Conference on Computational Natural Language Learning: Shared Task. Portland, OR, USA, pages 28–34.

Lin, C., C. Tsai, L. Kang, and Weisi Lin. 2013. Scene-Based Movie Summarization via Role-Community Networks. IEEE Transactions on Circuits and Systems for Video Technology 23(11):1927–1940.

Lu, Zheng and Kristen Grauman. 2013. Story-Driven Summarization for Egocentric Video. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA, pages 2714–2721.

Mani, Inderjeet, Gary Klein, David House, Lynette Hirschman, Therese Firmin, and Beth Sundheim. 2002. SUMMAC: A Text Summarization Evaluation. Natural Language Engineering 8(1):43–68.

Manning, Christopher, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. pages 55–60.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2010. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331(6014):176–182.

Monaco, James. 1982. How to Read a Film: The Art, Technology, Language, History and Theory of Film and Media. OUP, New York, NY, USA.

Morris, A., G. Kasper, and D. Adams. 1992. The Effects and Limitations of Automated Text Condensing on Reading Comprehension Performance. Information Systems Research 3(1):17–35.

Mosteller, Frederick and David Wallace. 1964. Inference and Disputed Authorship: The Federalists. Addison-Wesley, Boston, MA, USA.

Nalisnick, Eric T. and Henry S. Baird. 2013. Character-to-Character Sentiment Analysis in Shakespeare's Plays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria, pages 479–483.

Nelken, Rani and Stuart Shieber. 2006. Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics. Trento, Italy, pages 161–168.
Nielsen, Finn Årup. 2011. A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big Things Come in Small Packages. Heraklion, Crete, pages 93–98.

Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab. Previous number SIDL-WP-1999-0120.

Rasheed, Z., Y. Sheikh, and M. Shah. 2005. On the Use of Computable Features for Film Classification. IEEE Transactions on Circuits and Systems for Video Technology 15(1):52–64.

Reed, Todd, editor. 2004. Digital Image Sequence Processing. Taylor & Francis.

Sang, Jitao and Changsheng Xu. 2010. Character-based Movie Summarization. In Proceedings of the International Conference on Multimedia. Firenze, Italy, pages 855–858.

Tong, Hanghang, Christos Faloutsos, and Jia-Yu Pan. 2006. Fast Random Walk with Restart and Its Applications. In Proceedings of the Sixth International Conference on Data Mining. Hong Kong, pages 613–622.

Weng, Chung-Yi, Wei-Ta Chu, and Ja-Ling Wu. 2009. RoleNet: Movie Analysis from the Perspective of Social Networks. IEEE Transactions on Multimedia 11(2):256–271.