
A Dataset for Movie Description

Anna Rohrbach¹, Marcus Rohrbach², Niket Tandon¹, Bernt Schiele¹

¹ Max Planck Institute for Informatics, Saarbrücken, Germany
² UC Berkeley EECS and ICSI, Berkeley, CA, United States
arXiv:1501.02530v1 [cs.CV] 12 Jan 2015

Abstract

Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned to full length HD movies. In addition we also collected the aligned movie scripts which have been used in prior work and compare the two different sources of descriptions. In total the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is shown rather than what should happen according to the scripts created prior to movie production.

Figure 1: Audio descriptions (DVS - descriptive video service) and movie scripts (scripts) from the movie "Ugly Truth".
DVS: "Abby gets in the basket." | "Mike leans over and sees how high they are." | "Abby clasps her hands around his face and kisses him passionately."
Script: "After a moment a frazzled Abby pops up in his place." | "Mike looks down to see they are now fifteen feet above the ground." | "– For the first time in her life, she stops thinking and grabs Mike and kisses the hell out of him."

1. Introduction

Audio descriptions (DVS - descriptive video service) make movies accessible to millions of blind or visually impaired people (Footnote 1). DVS provides an audio narrative of the "most important aspects of the visual information" [58], namely actions, gestures, scenes, and character appearance, as can be seen in Figures 1 and 2. DVS is prepared by trained describers and read by professional narrators. More and more movies are audio transcribed, but it may take up to 60 person-hours to describe a 2-hour movie [42], with the result that only a small subset of movies and TV programs are available for the blind. Consequently, automating this would be a noble task.

Footnote 1: In this work we refer for simplicity to "the blind" to account for all blind and visually impaired people who benefit from DVS, knowing the variety of visual impairments and that DVS is not accessible to all.

In addition to the benefits for the blind, generating descriptions for video is an interesting task in itself, requiring one to understand and combine core techniques of computer vision and computational linguistics. To understand the visual input one has to reliably recognize scenes, human activities, and participating objects. To generate a good description one has to decide what part of the visual information to verbalize, i.e. recognize what is salient.

Large datasets of objects [18] and scenes [68, 70] have had an important impact in the field and significantly improved our ability to recognize objects and scenes in combination with CNNs [38]. To be able to learn how to generate descriptions of visual content, parallel datasets of visual content paired with descriptions are indispensable [56]. While several large datasets providing images with descriptions have recently been released [51, 29, 47], video description datasets focus on short video snippets only and are limited in size [12] or not publicly available [52]. TACoS Multi-Level [55] and YouCook [16] are exceptions in that they provide multiple sentence descriptions and longer videos; however, they are restricted to the cooking scenario. In contrast, the data available with DVS provides realistic, open domain video paired with multiple sentence descriptions. It even goes beyond this by telling a story, which makes it possible to study how to extract plots and understand long-term semantic dependencies and human interactions from the visual and textual data.

Figures 1 and 2 show examples of DVS and compare them to movie scripts.
Figure 2: Audio descriptions (DVS - descriptive video service) and movie scripts (scripts) from the movies "Harry Potter and the prisoner of azkaban", "This is 40", and "Les Miserables". For each movie the figure shows a row of aligned DVS sentences and the corresponding script sentences; typical mistakes contained in the scripts are marked in red italics. For example, for "Harry Potter and the prisoner of azkaban" the DVS reads "Buckbeak rears and attacks Malfoy. Hagrid lifts Malfoy up. As Hagrid carries Malfoy away, the hippogriff gently nudges Harry.", while the script reads "In a flash, Buckbeak's steely talons slash down. Malfoy freezes. Looks down at the blood blossoming on his robes. Buckbeak whips around, raises its talons and - seeing Harry - lowers them."

Scripts have been used for various tasks [43, 14, 49, 20, 46], but so far not for video description. The main reason for this is that automatic alignment frequently fails due to the discrepancy between the movie and the script. Even when perfectly aligned to the movie, a script frequently is not as precise as the DVS because it is typically produced prior to the shooting of the movie; see, e.g., the mistakes marked in red in Figure 2. A typical case is that part of the sentence is correct, while another part contains irrelevant information.

In this work we present a novel dataset which provides transcribed DVS, aligned to full length HD movies. For this we retrieve the audio streams from Blu-ray HD disks, segment out the sections of the DVS audio and transcribe them via a crowd-sourced transcription service [2]. As the audio descriptions are not fully aligned to the activities in the video, we manually align each sentence to the movie. Therefore, in contrast to the (non public) corpus used in [59, 58], our dataset provides alignment to the actions in the video, rather than just to the audio track of the description. In addition we also mine existing movie scripts, pre-align them automatically, similar to [43, 14], and then manually align the sentences to the movie.

We benchmark different approaches to generate descriptions. The first is nearest neighbour retrieval using state-of-the-art visual features [67, 70, 30], which does not require any additional labels but retrieves sentences from the training data. Second, we propose to use semantic parsing of the sentences to extract training labels for the recently proposed translation approach [56] for video description.

The main contribution of this work is a novel movie description dataset which provides transcribed and aligned DVS and script data sentences. We will release sentences, alignments, video snippets, and intermediate computed features to foster research in different areas including video description, activity recognition, visual grounding, and understanding of plots.

As a first study on this dataset we benchmark several approaches for movie description. Besides sentence retrieval, we adapt the approach of [56] by automatically extracting the semantic representation from the sentences using semantic parsing. This approach achieves competitive performance on the TACoS Multi-Level corpus [55] without using its annotations and outperforms the retrieval approaches on our novel movie description dataset. Additionally we present an approach to semi-automatically collect and align DVS data, and analyse the differences between DVS and movie scripts.
2. Related Work

We first discuss recent approaches to video description and then the existing works using movie scripts and DVS.

In recent years there has been an increased interest in automatically describing images [23, 39, 40, 50, 45, 40, 41, 34, 61, 22] and videos [37, 27, 8, 28, 32, 62, 16, 26, 64, 55] with natural language. While recent works on image description show impressive results by learning the relations between images and sentences and generating novel sentences [41, 19, 48, 56, 35, 31, 65, 13], the video description works typically rely on retrieval or templates [16, 63, 26, 27, 37, 39, 62] and frequently use a separate language corpus to model the linguistic statistics. A few exceptions exist: [64] uses a pre-trained model for image description and adapts it to video description. [56, 19] learn a translation model; however, these approaches rely on a strongly annotated corpus with aligned videos, annotations, and sentences. The main reason for video description lagging behind image description seems to be a missing corpus to learn and understand the problem of video description. We try to address this limitation by collecting a large, aligned corpus of video snippets and descriptions. To handle the setting of having only videos and sentences, without annotations for each video snippet, we propose an approach which adapts [56] by extracting annotations from the sentences. Our extraction of annotations has similarities to [63], but we try to extract the senses of the words automatically by using semantic parsing, as discussed in Section 5.

Movie scripts have been used for automatic discovery and annotation of scenes and human actions in videos [43, 49, 20]. We rely on the approach presented in [43] to align movie scripts using the subtitles. [10] attacks the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. They also rely on a semantic parser (SEMAFOR [15]) trained on the FrameNet database [7]; however, they limit the recognition to only two frames. [11] aims to localize individual short actions in longer clips by exploiting ordering constraints as weak supervision.

DVS has so far mainly been studied from a linguistic perspective. [58] analyses the language properties on a non-public corpus of DVS from 91 films. Their corpus is based on the original sources used to create the DVS and contains different kinds of artifacts not present in the actual description, such as dialogs and production notes. In contrast, our text corpus is much cleaner as it consists only of the actual DVS. With respect to word frequency they identify that especially actions, objects, and scenes, as well as the characters, are mentioned. The analysis of our corpus reveals statistics similar to theirs.

The only work we are aware of which uses DVS in connection with computer vision is [59]. The authors try to understand which characters interact with each other. For this they first segment the video into events by detecting dialogue, exciting, and musical events using audio and visual features. Then they rely on the dialogue transcription and DVS to identify when characters occur together in the same event, which allows them to infer interaction patterns. In contrast to our dataset, their DVS is not aligned, and they try to resolve this with a heuristic that moves the event, which is not quantitatively evaluated. Our dataset will allow studying the quality of automatic alignment approaches, given annotated ground truth alignment.

There are some initial works to support DVS production using scripts as source [42] and by automatically finding scene boundaries [25]. However, we believe that our dataset will allow learning much more advanced multi-modal models, using recent techniques in visual recognition and natural language processing.

Semantic parsing has received much attention in computational linguistics recently; see, for example, the tutorial [6] and the references given there. Although aiming at general-purpose applicability, it has so far been successful rather for specific use-cases such as natural-language question answering [9, 21] or understanding temporal expressions [44].

3. The Movie Description dataset

Despite the potential benefit of DVS for computer vision, it has not been used so far apart from [25, 42], who study how to automate DVS production. We believe the main reason for this is that it is not available in text format, i.e. transcribed. We tried to get access to DVS transcripts from description services as well as movie and TV production companies, but they were not ready to provide or sell them. While script data is easier to obtain, large parts of it do not match the movie, and it has to be "cleaned up". In the following we describe our semi-automatic approach to obtain DVS and scripts and to align them to the video.

3.1. Collection of DVS

We search for Blu-ray movies with DVS in the "Audio Description" section of the British Amazon [1] and select a set of 46 movies of diverse genres (Footnote 2). As DVS is only available in audio format, we first retrieve the audio stream from the Blu-ray HD disk (Footnote 3).

Footnote 2: 2012, Bad Santa, Body Of Lies, Confessions Of A Shopaholic, Crazy Stupid Love, 27 Dresses, Flight, Gran Torino, Harry Potter and the deathly hallows Disk One, Harry Potter and the Half-Blood Prince, Harry Potter and the order of phoenix, Harry Potter and the philosophers stone, Harry Potter and the prisoner of azkaban, Horrible Bosses, How to Lose Friends and Alienate People, Identity Thief, Juno, Legion, Les Miserables, Marley and me, No Reservations, Pride And Prejudice Disk One, Pride And Prejudice Disk Two, Public Enemies, Quantum of Solace, Rambo, Seven pounds, Sherlock Holmes A Game of Shadows, Signs, Slumdog Millionaire, Spider-Man1, Spider-Man3, Super 8, The Adjustment Bureau, The Curious Case Of Benjamin Button, The Damned united, The devil wears prada, The Great Gatsby, The Help, The Queen, The Ugly Truth, This is 40, TITANIC, Unbreakable, Up In The Air, Yes man.
Footnote 3: We use [3] to extract a Blu-ray into an .mkv file, then [5] to select and extract the audio streams from it.
                      Before alignment          After alignment
              Movies     Words          Words   Sentences   Avg. length   Total length
DVS               46   284,401        276,676      30,680      4.1 sec.        34.7 h.
Movie script      31   262,155        238,889      23,396      3.4 sec.        21.7 h.
Total             72   546,556        515,565      54,076      3.8 sec.        56.5 h.

Table 1: Movie Description dataset statistics. Discussion see Section 3.3.

Then we semi-automatically segment out the sections of the DVS audio (which is mixed with the original audio stream) with the approach described below. The audio segments are then transcribed by a crowd-sourced transcription service [2] that also provides us with the time-stamps for each spoken sentence. As the DVS is added to the original audio stream between the dialogs, there might be a small misalignment between the time of speech and the corresponding visual content. Therefore, we manually align each sentence to the movie in-house.

Semi-automatic segmentation of DVS. We first estimate the temporal alignment difference between the DVS and the original audio (which is part of the DVS), as they might be off by a few time frames. The precise alignment is important to compute the similarity of both streams. Both steps (alignment and similarity) are computed on the spectrograms of the audio streams, which we obtain with the Fast Fourier Transform (FFT). If the difference between both audio streams is larger than a given threshold, we assume the DVS contains audio description at that point in time. We smooth this decision over time using a minimum segment length of 1 second. The threshold was picked on a few sample movies, but has to be adjusted for each movie due to the different mixing of the audio description stream, different narrator voice levels, and the movie sound.
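The paper does not include code for this step; the following is only a minimal sketch of the segmentation heuristic just described, assuming the DVS mix and the original soundtrack are available as mono arrays at a common sample rate (function and variable names are ours, not the authors').

```python
import numpy as np
from scipy.signal import correlate, spectrogram

def dvs_segments(dvs_mix, original, rate, threshold, min_len_s=1.0):
    """Guess which time spans of the DVS audio mix contain added narration.

    dvs_mix, original: mono float arrays at the same sample rate `rate`.
    threshold: per-frame spectral difference above which we assume narration
               (has to be tuned per movie, as noted in the text).
    """
    # 1) Estimate the global offset between the two streams by
    #    cross-correlating their amplitude envelopes.
    lag = np.argmax(correlate(np.abs(dvs_mix), np.abs(original), mode="full"))
    lag -= len(original) - 1
    if lag > 0:
        dvs_mix = dvs_mix[lag:]
    elif lag < 0:
        original = original[-lag:]
    n = min(len(dvs_mix), len(original))

    # 2) Compare the two spectrograms frame by frame.
    _, times, spec_mix = spectrogram(dvs_mix[:n], fs=rate)
    _, _, spec_orig = spectrogram(original[:n], fs=rate)
    diff = np.abs(spec_mix - spec_orig).mean(axis=0)   # one value per frame
    active = diff > threshold

    # 3) Keep only runs of "active" frames lasting at least min_len_s.
    segments, start = [], None
    for i, flag in enumerate(np.append(active, False)):
        if flag and start is None:
            start = times[i]
        elif not flag and start is not None:
            end = times[i] if i < len(times) else times[-1]
            if end - start >= min_len_s:
                segments.append((start, end))
            start = None
    return segments
```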
3.2. Collection of script data

In addition we mine script web resources (Footnote 4) and select 26 movie scripts (Footnote 5). As a starting point we use the movies featured in [49] that have the highest alignment scores. We are also interested in comparing the two sources (movie scripts and DVS), so we look for scripts labeled as "Final", "Shooting", or "Production Draft" where DVS is also available. We found that this "overlap" is quite narrow, so we analyze 5 such movies (Footnote 6) in our dataset. This way we end up with 31 movie scripts in total. We follow existing approaches [43, 14] to automatically align scripts to movies. First we parse the scripts, extending the method of [43] to handle scripts which deviate from the default format. Second, we extract the subtitles from the Blu-ray disks (Footnote 7). Then we use the dynamic programming method of [43] to align scripts to subtitles and infer the time-stamps for the description sentences. We select the sentences with a reliable alignment score (the ratio of matched words in the near-by monologues) of at least 0.5. The obtained sentences are then manually aligned to the video in-house.

Footnote 4: http://www.weeklyscript.com, http://www.simplyscripts.com, http://www.dailyscript.com, http://www.imsdb.com
Footnote 5: Amadeus, American Beauty, As Good As It Gets, Casablanca, Charade, Chinatown, Clerks, Double Indemnity, Fargo, Forrest Gump, Gandhi, Get Shorty, Halloween, It is a Wonderful Life, O Brother Where Art Thou, Pianist, Raising Arizona, Rear Window, The Crying Game, The Graduate, The Hustler, The Lord Of The Rings The Fellowship Of The Ring, The Lord Of The Rings The Return Of The King, The Lost Weekend, The Night of the Hunter, The Princess Bride.
Footnote 6: Harry Potter and the prisoner of azkaban, Les Miserables, Signs, The Ugly Truth, This is 40.
Footnote 7: We extract .srt from .mkv with [4]. It also allows for subtitle alignment and spellchecking.
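To make the alignment score above concrete: it is the fraction of a script sentence's words that can be matched in the near-by subtitle monologues. The following is our own simplified illustration, not the exact matching procedure of [43].

```python
import re

def alignment_score(script_sentence, nearby_subtitle_lines):
    """Ratio of script words that also occur in the near-by subtitles."""
    tokenize = lambda text: re.findall(r"[\w']+", text.lower())
    script_words = tokenize(script_sentence)
    subtitle_words = set(tokenize(" ".join(nearby_subtitle_lines)))
    if not script_words:
        return 0.0
    return sum(w in subtitle_words for w in script_words) / len(script_words)

# Sentences are kept only if this score is at least 0.5 (see above).
```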
3.3. Statistics and comparison to other datasets

During the manual alignment we filter out: a) sentences describing the movie introduction/ending (production logo, cast, etc.); b) texts read from the screen; c) irrelevant sentences describing something not present in the video; d) sentences related to audio/sounds/music. Table 1 presents statistics on the number of words before and after the alignment to video. One can see that for the movie scripts the reduction in the number of words is about 8.9%, while for DVS it is 2.7%. In the case of DVS the filtering mainly happens due to initial/ending movie intervals and transcribed dialogs (when shown as text). For the scripts it is mainly attributed to irrelevant sentences. Note that in cases when the sentences are "alignable" but have minor mistakes we still keep them.

We end up with a parallel corpus of over 50K video-sentence pairs and a total length of over 56 hours. We compare our corpus to other existing parallel corpora in Table 2. The main limitations of existing datasets are a single domain [16, 54, 55] or a limited number of video clips [26]. We fill this gap with a large dataset featuring realistic open domain videos, which also provides high quality (professional) sentences and allows for multi-sentence description.

3.4. Visual features

We extract video snippets from the full movie based on the aligned sentence intervals. We also uniformly extract 10 frames from each video snippet.
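One possible implementation of the snippet and frame extraction is sketched below, assuming OpenCV and sentence intervals given in seconds; the paper does not state which tools were actually used.

```python
import cv2
import numpy as np

def sample_frames(movie_path, start_s, end_s, num_frames=10):
    """Uniformly sample `num_frames` frames from the interval [start_s, end_s)."""
    cap = cv2.VideoCapture(movie_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    first = int(start_s * fps)
    last = max(first, int(end_s * fps) - 1)
    frames = []
    for idx in np.linspace(first, last, num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```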
Dataset                     multi-sentence   domain    sentence source    clips    videos   sentences
YouCook [26]                      x          cooking   crowd                  -        88       2,668
TACoS [54, 56]                    x          cooking   crowd              7,206       127      18,227
TACoS Multi-Level [55]            x          cooking   crowd             14,105       273      52,593
MSVD [12]                                    open      crowd              1,970         -      70,028
Movie Description (ours)          x          open      professional      54,076        72      54,076

Table 2: Comparison of video description datasets. Discussion see Section 3.3.

As discussed above, DVS and scripts describe activities, objects, and scenes (as well as emotions, which we do not explicitly handle with these features, but they might still be captured, e.g. by the context or activities). In the following we briefly introduce the visual features computed on our data, which we will also make publicly available.

DT. We extract the improved dense trajectories compensated for camera motion [67]. For each feature (Trajectory, HOG, HOF, MBH) we create a codebook with 4,000 clusters and compute the corresponding histograms. We apply L1 normalization to the obtained histograms and use them as features.

LSDA. We use the recent large scale object detection CNN [30] which distinguishes 7,604 ImageNet [18] classes. We run the detector on every second extracted frame (due to computational constraints). Within each frame we max-pool the network responses for all classes, then do mean-pooling over the frames within a video snippet and use the result as a feature.

PLACES and HYBRID. Finally, we use the recent scene classification CNNs [70] featuring 205 scene classes. We use both available networks: Places-CNN and Hybrid-CNN, where the first is trained on the Places dataset [70] only, while the second is additionally trained on the 1.2 million images of ImageNet (ILSVRC 2012) [57]. We run the classifiers on all the extracted frames of our dataset. We mean-pool over the frames of each video snippet, using the result as a feature.
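The pooling described above is simple enough to spell out; the sketch below assumes the per-frame network scores are already computed (the array layouts are our assumption).

```python
import numpy as np

def lsda_snippet_feature(per_frame_detection_scores):
    """Max-pool detection scores within each frame, then mean-pool over frames.

    per_frame_detection_scores: list of (num_detections, num_classes) arrays,
    one entry per processed frame of the snippet.
    """
    per_frame = [scores.max(axis=0) for scores in per_frame_detection_scores]
    return np.mean(per_frame, axis=0)                  # shape: (num_classes,)

def scene_snippet_feature(per_frame_scene_scores):
    """PLACES/HYBRID feature: mean of the per-frame scene scores."""
    return np.asarray(per_frame_scene_scores).mean(axis=0)
```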

4. Approaches to video description

In this section we describe the approaches to video description that we benchmark on our proposed dataset.

Nearest neighbor. We retrieve the closest sentence from the training corpus using the L1-normalized visual features introduced in Section 3.4 and the intersection distance.
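Since the features are L1-normalized histograms, the intersection distance reduces to an element-wise minimum; a minimal sketch of the retrieval step (names are illustrative):

```python
import numpy as np

def retrieve_nearest_sentence(query_feature, train_features, train_sentences):
    """Return the training sentence whose snippet is closest to the query.

    query_feature: (d,) L1-normalized feature of the test snippet.
    train_features: (n, d) L1-normalized features of the training snippets.
    """
    # Histogram intersection similarity; maximizing it minimizes the
    # intersection distance.
    similarity = np.minimum(train_features, query_feature).sum(axis=1)
    return train_sentences[int(np.argmax(similarity))]
```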
SMT. We adapt the two-step translation approach of [56], which uses an intermediate semantic representation (SR), modeled as a tuple, e.g. <cut, knife, tomato>. As the first step it learns a mapping from the visual input to the semantic representation (SR), modeling pairwise dependencies in a CRF with visual classifiers as unaries. The unaries are trained using an SVM on dense trajectories [66]. In the second step, [56] translates the SR to a sentence using Statistical Machine Translation (SMT) [36]. For this the approach concatenates the SR as the input language, e.g. "cut knife tomato", and the natural sentences as the output language, e.g. "The person slices the tomato." While we cannot rely on an annotated SR as in [56], we automatically mine the SR from the sentences using semantic parsing, which we introduce in the next section. In addition to dense trajectories we use the features described in Section 3.4.

SMT Visual words. As an alternative to the potentially noisy labels extracted from the sentences, we try to directly translate visual classifiers and visual words to a sentence. We model the essential components by relying on activity, object, and scene recognition. For objects and scenes we rely on the pre-trained models LSDA and PLACES. For activities we rely on the state-of-the-art activity recognition feature DT. We cluster the DT histograms into 300 visual words using k-means; the index of the closest cluster center is chosen as the activity label. To build our tuple we take the highest scoring class labels of the object detector and the scene classifier. More specifically, for the object detector we consider the two highest scoring classes, used as subject and object. Thus we obtain the tuple <SUBJECT, ACTIVITY, OBJECT, SCENE> = <argmax1(LSDA), DT cluster index, argmax2(LSDA), argmax(PLACES)>, for which we learn a translation to a natural sentence using the SMT approach discussed above.
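The tuple construction for this baseline can be written down directly; the sketch below assumes the classifier scores, class name lists and the k-means centers are given (all names here are illustrative, not from the authors' code).

```python
import numpy as np

def visual_words_tuple(lsda_scores, dt_histogram, places_scores,
                       dt_centers, lsda_classes, places_classes):
    """Build the <SUBJECT, ACTIVITY, OBJECT, SCENE> tuple described above."""
    # The two highest-scoring object classes serve as subject and object.
    top_two = np.argsort(lsda_scores)[::-1][:2]
    subject, obj = lsda_classes[top_two[0]], lsda_classes[top_two[1]]
    # Activity: index of the closest of the 300 DT visual words.
    activity = "dt_word_%d" % np.argmin(
        np.linalg.norm(dt_centers - dt_histogram, axis=1))
    # Scene: highest-scoring scene class.
    scene = places_classes[int(np.argmax(places_scores))]
    return subject, activity, obj, scene
```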
5. Semantic parsing

Learning from a parallel corpus of videos and sentences without having annotations is challenging. In this section we introduce our approach to exploiting the sentences using semantic parsing. The proposed method aims to extract annotations from the natural sentences and thus makes it possible to avoid the tedious annotation task. Later in the section we evaluate our method on a corpus where annotations are available, in the context of a video description task.

5.1. Semantic parsing approach

We lift the words in a sentence to a semantic space of roles and WordNet [53, 24] senses by performing SRL (Semantic Role Labeling) and WSD (Word Sense Disambiguation). For an example, refer to Table 3: the expected outcome of semantic parsing on the input sentence "He shot a video in the moving bus" is "Agent: man, Action: shoot, Patient: video, Location: bus". Additionally, the role fillers are disambiguated.

Phrase            WordNet mapping   VerbNet mapping      Expected frame
the man           man#1             Agent.animate        Agent: man#1
begin to shoot    shoot#2           shoot#vn#2           Action: shoot#2
a video           video#1           Patient.solid        Patient: video#1
in                                  PP.in
the moving bus    bus#1             NP.Location.solid    Location: moving bus#1

Table 3: Semantic parse for "He began to shoot a video in the moving bus". Discussion see Section 5.1.

We use the ClausIE tool [17] to decompose sentences into their respective clauses. For example, "he shot and modified the video" is split into the two phrases "he shot the video" and "he modified the video". We then use the OpenNLP tool suite (Footnote 8) for chunking the text of each clause. In order to link the words in a sentence to their WordNet sense mappings, we rely on a state-of-the-art WSD system, IMS [69]. The WSD system, however, works at the word level; we enable it to work at the phrase level. For every noun phrase, we identify and disambiguate its head word (e.g. "the moving bus" to "bus#1", where "bus#1" refers to the first sense of the word bus). We link verb phrases to the proper sense of their head word in WordNet (e.g. "begin to shoot" to "shoot#2").

Footnote 8: http://opennlp.sourceforge.net/

In order to obtain word role labels, we link verbs to VerbNet [60, 33], a manually curated high-quality linguistic resource for English verbs. VerbNet is already mapped to WordNet, thus we map to VerbNet via WordNet. We perform two levels of matching in order to obtain role labels. First is the syntactic match: every VerbNet verb sense comes with a syntactic frame, e.g. for shoot the syntactic frame is NP V NP. We first match the sentence's verb against the VerbNet frames; the matches become candidates for the next step. Second, we perform the semantic match: VerbNet also provides role restrictions on the arguments, e.g. for shoot (sense "killing") the restriction is Agent.animate V Patient.animate PP Instrument.solid, while for the other sense of shoot (sense "snap") the restriction is Agent.animate V Patient.solid. We only accept candidates from the syntactic match that satisfy the semantic restriction.

VerbNet contains over 20 roles and not all of them are general or can be recognized reliably. Therefore, we further group them to obtain the SUBJECT, VERB, OBJECT and LOCATION roles. We explore two approaches to obtaining the labels based on the output of the semantic parser. The first is to use the extracted text chunks directly as labels. The second is to use the corresponding senses as labels (and therefore group multiple text labels). In the following we refer to these as text- and sense-labels. Thus from each sentence we extract a semantic representation in the form of (SUBJECT, VERB, OBJECT, LOCATION); see Figure 3a for an example. Using WSD allows us to identify different senses (WordNet synsets) for the same verb (Figure 3b) and the same sense for different verbs (Figure 3c).

Figure 3: Semantic parsing example, see Section 5.1. (a) Semantic representation extracted from the sentence "Someone puts the tools back in the shed.": someone - SUBJECT - 100007846 {person, individual, someone, ...}; put back - VERB - 201308381 {replace, put back}; the tools - OBJECT - 104451818 {tool}; the shed - LOCATION - 104187547 {shed}. (b) Same verb, different senses: "The van pulls into the forecourt." (sense of {pull}: move into a certain direction) vs. "Someone pulls the purse imperceptibly closer to himself." (sense of {pull, draw, force}: cause to move by pulling); "People play a fast and furious game." (sense of {play}: participate in games or sport) vs. "At one end of the room an orchestra is playing." (sense of {play}: play on an instrument). (c) Different verbs, same sense: "Someone leaps onto a bench by a couple hugging." / "Someone drops to his knee to embrace his son." (sense of {hug, embrace}: squeeze (someone) tightly in your arms, usually with fondness); "Someone spins and grabs his car-door handle." / "And someone takes hold of her hand." (sense of {grab, take hold of}: take hold of so as to seize or restrain or stop the motion of).
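To make the difference between the two labelings concrete, the example of Figure 3a would yield roughly the following SRs (the synset identifiers are those shown in the figure; the dictionary layout is ours):

```python
# Sentence: "Someone puts the tools back in the shed."  (cf. Figure 3a)

# Text-labels: the surface text chunks are used directly.
text_labels = {
    "SUBJECT": "someone",
    "VERB": "put back",
    "OBJECT": "the tools",
    "LOCATION": "the shed",
}

# Sense-labels: WordNet synsets, so different wordings of the same
# concept collapse onto a single label.
sense_labels = {
    "SUBJECT": "100007846",   # {person, individual, someone, ...}
    "VERB": "201308381",      # {replace, put back}
    "OBJECT": "104451818",    # {tool}
    "LOCATION": "104187547",  # {shed}
}
```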
5.2. Applying parsing to the TACoS Multi-Level corpus

We apply the proposed semantic parsing to the TACoS Multi-Level [55] parallel corpus. We extract the SR from the sentences as described above and use those as annotations. Note that this corpus is annotated with the tuples (ACTIVITY, OBJECT, TOOL, SOURCE, TARGET) and the subject is always the person. Therefore we drop the SUBJECT role and only use (VERB, OBJECT, LOCATION) as our SR. Then, similar to [55], we train visual classifiers for the labels proposed by the parser, using only those that appear at least 30 times. Next we train a CRF with 3 nodes for verbs, objects and locations, using the visual classifier responses as unaries. We follow the translation approach of [56] and train the SMT on the Detailed Descriptions part of the corpus using our labels. Finally, we translate the SR predicted by our CRF to generate the sentences. Table 4 shows the results, comparing our method to [56] and [55], who use manual annotations to train their models. As we can see, the sense-labels perform better than the text-labels as they provide a better grouping of the labels. Our method produces a competitive result which is only 0.9% below the result of [56]. At the same time, [55] uses more training data, additional color SIFT features, and recognizes the dish prepared in the video; all these points, if added to our approach, would also improve the performance.

Approach                        BLEU
SMT [56]                        24.9
SMT [55]                        26.9
SMT with our text-labels        22.3
SMT with our sense-labels       24.0

Table 4: BLEU@4 in % on sentences of the Detailed Descriptions of the TACoS Multi-Level [55] corpus, see Section 5.2.

We analyze the labels selected by our method in Table 5. It is clear that our labels are still imperfect, i.e. different labels might be assigned to similar concepts. However, the number of extracted labels is quite close to the number of manual labels. Note that the annotations were created prior to the sentence collection, so some verbs used by humans in the sentences might not be present in the annotations.

Annotations         activity   tool   object   source   target
Manual [55]               78     53      138       69       49

Annotations             verb   object   location
Our text-labels          145      260         85
Our sense-labels         158      215         85

Table 5: Label statistics from our semantic parser on the TACoS Multi-Level [55] corpus, see Section 5.2.

From this experiment we conclude that the output of our automatic parsing approach can serve as a replacement for manual annotations and allows us to achieve competitive results. In the following we apply this approach to our movie description dataset.

6. Evaluation

In this section we provide more insights about our movie description dataset. First we compare DVS to movie scripts and then we benchmark the approaches to video description introduced in Section 4.

6.1. Comparison DVS vs script data

We compare the DVS and script data using 5 movies from our dataset where both are available (see Section 3.2). For these movies we select the overlapping time intervals with an intersection over union overlap of at least 75%, which results in 126 sentence pairs. We ask humans via Amazon Mechanical Turk (AMT) to compare the sentences with respect to their correctness and relevance to the video, using both video intervals as a reference (one at a time, resulting in 252 tasks). Each task was completed by 3 different human subjects. Table 6 presents the results of this evaluation. DVS is ranked as more correct and relevant in over 60% of the cases, which supports our intuition that scripts contain mistakes and irrelevant content even after being cleaned up and manually aligned.

                  Correctness   Relevance
DVS                      63.0        60.7
Movie scripts            37.0        39.3

Table 6: Human evaluation of DVS and movie scripts: which sentence is more correct/relevant with respect to the video, in %. Discussion in Section 6.1.

6.2. Semantic parser evaluation

Table 7 reports the accuracy of the different components of the semantic parsing pipeline. The components are clause splitting (Clause), POS tagging and chunking (NLP), semantic role labeling (Labels) and word sense disambiguation (WSD). We manually evaluate the correctness on a randomly sampled set of sentences using human judges. It is evident that the poorest performing parts are the NLP and WSD components. Some of the NLP mistakes arise due to incorrect POS tagging. WSD is considered a hard problem, and when the dataset contains less frequent words the performance is severely affected. Overall we see that the movie description corpus is more challenging than TACoS Multi-Level, but the drop in performance is reasonable given the significantly larger variability.

Corpus                      Clause    NLP   Labels    WSD
TACoS Multi-Level [55]        0.96   0.86     0.91   0.75
Movie Description (ours)      0.89   0.62     0.86   0.70

Table 7: Semantic parser accuracy for TACoS Multi-Level and our new corpus. Discussion in Section 6.2.
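For completeness, the 75% temporal-overlap criterion used to pair DVS and script sentences in Section 6.1 is an intersection-over-union test on time intervals; a minimal helper (not the authors' code) looks like this:

```python
def interval_iou(a, b):
    """Intersection over union of two time intervals a = (start, end), b = (start, end)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0

# A DVS/script sentence pair is kept when interval_iou(dvs_span, script_span) >= 0.75.
```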
6.3. Video description

As the collected text data comes from the movie context, it contains a lot of information specific to the plot, such as the names of the characters. We pre-process each sentence in the corpus, transforming names and other person-related information (such as "a young woman") to "someone" or "people". The transformed version of the corpus is used in all the experiments below. We will release both the transformed and the original corpus.

We use the 5 movies mentioned before (see Section 3.2) as a test set for the video description task, while all the others (67) are used for training. Human judges were asked to rank multiple sentence outputs with respect to their correctness, grammar and relevance to the video.

Table 8 summarizes the results of the human evaluation on 250 randomly selected test video snippets, showing the mean rank, where lower is better. In the top part of the table we show the nearest neighbor results based on multiple visual features. When comparing the different features, we notice that the pre-trained features (LSDA, PLACES, HYBRID) perform better than DT, with HYBRID performing best. Next is the translation approach with the visual words as labels, performing worst overall. The next two blocks correspond to the translation approach when using the labels from our semantic parser. After extracting the labels we select the ones which appear at least 30 or 100 times as our visual attributes. As 30 results in a much higher number of attributes (see Table 9), predicting the SR turns into a more difficult recognition task, resulting in worse mean rankings. "All 100" refers to combining all the visual features as unaries in the CRF. Finally, the last "Movie script/DVS" block refers to the actual test sentences from the corpus and, not surprisingly, ranks best.

                             Correctness   Grammar   Relevance
Nearest neighbor
  DT                                 7.6       5.1         7.5
  LSDA                               7.2       4.9         7.0
  PLACES                             7.0       5.0         7.1
  HYBRID                             6.8       4.6         7.1
SMT Visual words                     7.6       8.1         7.5
SMT with our text-labels
  DT 30                              6.9       8.1         6.7
  DT 100                             5.8       6.8         5.5
  All 100                            4.6       5.0         4.9
SMT with our sense-labels
  DT 30                              6.3       6.3         5.8
  DT 100                             4.9       5.7         5.1
  All 100                            5.5       5.7         5.5
Movie script/DVS                     2.9       4.2         3.2

Table 8: Comparison of approaches. Mean ranking (1-12). Lower is better. Discussion in Section 6.3.

Annotations         subject   verb   object   location
text-labels 30           24    380      137         71
sense-labels 30          47    440      244        110
text-labels 100           8    121       26          8
sense-labels 100          8    143       51         37

Table 9: Label statistics from our semantic parser on the movie description corpus. 30 and 100 indicate the minimum number of label occurrences in the corpus, see Section 6.3.

Overall we can observe three main tendencies: (1) Using our parsing with SMT outperforms the nearest neighbor baselines and SMT Visual words. (2) In contrast to the kitchen dataset, the sense-labels perform slightly worse than the text-labels, which we attribute to the errors made in the WSD. (3) The actual movie script/DVS sentences are ranked on average significantly better than any of the automatic approaches. These tendencies are also reflected in Figure 4, which shows example outputs of all the evaluated approaches for a single movie snippet. Examining more qualitative examples, which we provide on our web page, indicates that it is possible to learn relevant information from this corpus.

7. Conclusions

In this work we presented a novel dataset of movies with aligned descriptions sourced from movie scripts and DVS (audio descriptions for the blind). We present first experiments on this dataset using state-of-the-art visual features, combined with a recent video description approach from [56]. We adapt the approach for this dataset to work without annotations, relying instead on semantic parsing of labels. We show competitive performance on the TACoS Multi-Level dataset and promising results on our movie description data. We compare DVS with previously used script data and find that DVS tends to be more correct and relevant to the movie than script sentences. Beyond our first study on single sentences, the dataset opens new possibilities to understand stories and plots across multiple sentences in an open domain scenario at large scale, something no other video or image description dataset can offer as of now.

8. Acknowledgements

Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD).
Nearest neighbor
DT People stand with a happy group, including someone.
LSDA The hovering Dementors chase the group into the lift.
HYBRID Close by, a burly fair-haired someone in an orange jumpsuit runs down a dark street.
PLACES Someone is on his way to look down the passage way between the houses.
SMT Visual words Someone in the middle of the car pulls up ahead
SMT with our text-labels
DT 30 Someone opens the door to someone
DT 100 Someone, the someone, and someone enters the room
All 100 Someone opens the door and shuts the door, someone and his someone
SMT with our sense-labels
DT 30 Someone, the someone, and someone enters the room
DT 100 Someone goes over to the door
All 100 Someone enters the room
Movie script/DVS Someone follows someone into the leaky cauldron

Figure 4: Qualitative comparison of different video description methods. Discussion in Section 6.3. More examples on our
web page.

References

[1] British amazon. http://www.amazon.co.uk/, 2014.
[2] Castingwords transcription service. http://castingwords.com/, 2014.
[3] Makemkv. http://www.makemkv.com/, 2014.
[4] Subtitle edit. http://www.nikse.dk/SubtitleEdit/, 2014.
[5] Xmedia recode. http://www.xmedia-recode.de/, 2014.
[6] Y. Artzi, N. FitzGerald, and L. S. Zettlemoyer. Semantic parsing with combinatory categorial grammars. In ACL (Tutorial Abstracts), 2013.
[7] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, ACL '98, pages 86–90, Stroudsburg, PA, USA, 1998. Association for Computational Linguistics.
[8] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang. Video in sentences out. In UAI, 2012.
[9] J. Berant, A. Chou, R. Frostig, and P. Liang. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533–1544, 2013.
[10] P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Finding actors and actions in movies. In Communications of the ACM, 2013.
[11] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Weakly supervised action labeling in videos under ordering constraints. In Proc. ECCV, 2014.
[12] D. Chen and W. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2011.
[13] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. arXiv:1411.5654, 2014.
[14] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar. Movie/script: Alignment and parsing of video and text transcription. In Proceedings of the European Conference on Computer Vision (ECCV), 2008.
[15] D. Das, A. F. Martins, and N. A. Smith. An exact dual decomposition algorithm for shallow semantic parsing with constraints. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 209–217. Association for Computational Linguistics, 2012.
[16] P. Das, C. Xu, R. Doell, and J. Corso. Thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[17] L. Del Corro and R. Gemulla. ClausIE: Clause-based open information extraction. In Proc. WWW 2013, WWW '13, pages 355–366, 2013.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[19] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389, 2014.
[20] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Automatic annotation of human actions in video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009.
[21] A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. In KDD '14, pages 1156–1165, New York, NY, USA, 2014. ACM.
[22] H. Fang, S. Gupta, F. N. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. arXiv:1411.4952, 2014.
[23] A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In Proceedings of the European Conference on Computer Vision (ECCV), 2010.
[24] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. The MIT Press, 1998.
[25] L. Gagnon, C. Chapdelaine, D. Byrns, S. Foucher, M. Heritier, and V. Gupta. A computer-vision-assisted system for videodescription scripting. 2010.
[26] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
[27] A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[28] P. Hanckmann, K. Schutte, and G. J. Burghouts. Automated textual descriptions for a wide range of video events with 48 human actions. 2012.
[29] P. Hodosh, A. Young, M. Lai, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.
[30] J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems (NIPS), 2014.
[31] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306, 2014.
[32] M. U. G. Khan, L. Zhang, and Y. Gotoh. Human focused video description. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011.
[33] K. Kipper, A. Korhonen, N. Ryant, and M. Palmer. Extending VerbNet with novel verb classes. In Proceedings of the Fifth International Conference on Language Resources and Evaluation – LREC'06 (http://verbs.colorado.edu/~mpalmer/projects/verbnet.html), 2006.
[34] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In Proceedings of the International Conference on Machine Learning (ICML), 2014.
[35] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
[36] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2007.
[37] A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision (IJCV), 2002.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.
[39] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[40] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 359–368. Association for Computational Linguistics, 2012.
[41] P. Kuznetsova, V. Ordonez, T. L. Berg, U. C. Hill, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10):351–362, 2014.
[42] Lakritz and Salway. The semi-automatic generation of audio description from screenplays. Technical report, Dept. of Computing Technical Report, University of Surrey, 2006.
[43] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[44] K. Lee, Y. Artzi, J. Dodge, and L. Zettlemoyer. Context-dependent semantic parsing for time expressions. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1437–1447, Baltimore, Maryland, June 2014. ACL.
[45] S. Li, G. Kulkarni, T. Berg, A. Berg, and Y. Choi. Composing simple image descriptions using web-scale N-grams. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL), pages 220–228. Association for Computational Linguistics, 2011.
[46] C. Liang, C. Xu, J. Cheng, and H. Lu. TVParser: An automatic TV video parsing method. In CVPR, 2011.
[47] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.
[48] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632, 2014.
[49] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2009.
[50] M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. Han, A. Mensch, A. C. Berg, T. L. Berg, and H. D. III. Midge: Generating image descriptions from computer vision detections. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.
[51] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems (NIPS), 2011.
[52] P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, B. Shaw, A. F. Smeaton, and G. Quéenot. TRECVID 2012 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2012. NIST, USA, 2012.
[53] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity: measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, pages 38–41. Association for Computational Linguistics, 2004.
[54] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. Grounding action descriptions in videos. 2013.
[55] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. September 2014.
[56] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
[57] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
[58] A. Salway. A corpus-based analysis of audio description. Media for all: Subtitling for the deaf, audio description and sign language, pages 151–174, 2007.
[59] A. Salway, B. Lehane, and N. E. O'Connor. Associating characters with events in films. pages 510–517. ACM, 2007.
[60] K. K. Schuler, A. Korhonen, and S. W. Brown. VerbNet overview, extensions, mappings and applications. In HLT-NAACL (Tutorial Abstracts), pages 13–14, 2009.
[61] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. 2:207–218, 2014.
[62] C. C. Tan, Y.-G. Jiang, and C.-W. Ngo. Towards textually describing complex video contents with audio-visual concept classifiers. In MM, 2011.
[63] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, 2014.
[64] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv:1412.4729, 2014.
[65] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv:1411.4555, 2014.
[66] H. Wang, A. Kläser, C. Schmid, and C. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (IJCV), 2013.
[67] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 2013.
[68] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 0:3485–3492, 2010.
[69] Z. Zhong and H. T. Ng. It Makes Sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83. Association for Computational Linguistics, 2010.
[70] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. Advances in Neural Information Processing Systems (NIPS), 2014.
