
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 25, 2023

Multimodal-Based and Aesthetic-Guided Narrative Video Summarization

Jiehang Xie, Xuanbai Chen, Tianyi Zhang, Graduate Student Member, IEEE, Yixuan Zhang, Shao-Ping Lu, Member, IEEE, Pablo Cesar, Senior Member, IEEE, and Yulu Yang

Abstract—Narrative videos usually convey their main content through multiple narrative channels such as audio, video frames and subtitles. Existing video summarization approaches rarely consider these multiple narrative inputs, or they ignore the impact of artistic shot assembly when directly applied to narrative videos. This paper introduces a multimodal-based and aesthetic-guided narrative video summarization method. Our method leverages multimodal information, including visual content, subtitles and audio, through dedicated key shots selection, subtitle summarization, and highlight extraction components. Furthermore, under the guidance of cinematographic aesthetics, we design a novel shots assembly module that ensures shot content completeness and then assembles the selected shots into the desired summary. Besides, our method provides a flexible specification for shots selection: it automatically selects semantically related shots according to user-designed text. Through extensive quantitative evaluations and user studies, we demonstrate that our method effectively preserves the important narrative information of the original video and rapidly produces high-quality, aesthetic-guided narrative video summaries.

Index Terms—Narrative video summarization, multimodal information, aesthetic guidance.

Manuscript received 22 September 2021; revised 5 April 2022; accepted 31 May 2022. Date of publication 15 June 2022; date of current version 30 October 2023. This work was supported by NSFC under Grants 62132012 and 61972216. The Associate Editor coordinating the review of this manuscript and approving it for publication was Prof. Ngai-Man Cheung. (Corresponding author: Shao-Ping Lu.)

Jiehang Xie, Xuanbai Chen, Yixuan Zhang, Shao-Ping Lu, and Yulu Yang are with the TKLNDST, CS, Nankai University, Nankai 300071, China (e-mail: jiehangxie@mail.nankai.edu.cn; 1711314@mail.nankai.edu.cn; 2011432@mail.nankai.edu.cn; slu@nankai.edu.cn; yangyl@nankai.edu.cn). Tianyi Zhang and Pablo Cesar are with Centrum Wiskunde and Informatica, 098 XG Amsterdam, Netherlands (e-mail: tianyi.zhang@cwi.nl; p.s.cesar@cwi.nl).

This article has supplementary downloadable material available at https://doi.org/10.1109/TMM.2022.3183394, provided by the authors. Digital Object Identifier 10.1109/TMM.2022.3183394

I. INTRODUCTION

NARRATIVE videos, such as documentaries, movies and scientific explainers, share immersive visual information along with narrated story-telling subtitles, voiceover and background music (BGM) [1], [2]. As huge numbers of narrative videos are uploaded to various online social platforms, there is an urgent need to create narrative video summaries that help viewers browse and understand the content quickly, and to present them on knowledge popularization platforms and in many other applications [3]–[5].

Creating a high-quality narrative video summary is currently a very challenging issue. A common workflow for creating such a summary manually begins with the outline writing stage [6], in which the user (video producer) writes a story outline by repeatedly watching the given video. The story outline records the narrative threads and the sequence of important events. The next stage is usually shots selection, where the user chooses the shots matching the outline content from the given video and decides the cut points and timing (beginning and ending time points, and shot length) for each selected shot. Experienced users also capture highlight and preference/personalized shots that are not mentioned in the outline, to ensure that the generated summary is diverse enough. The last stage is shots assembly, where the user composes the selected shots into a summary in a reasonable order. In this stage, experienced users try to balance multiple criteria based on video editing conventions and aesthetic guidelines (e.g., avoiding too-short shots, including complete shot content, and ensuring color continuity between adjacent shots). Thus, manually creating a narrative video summary is labor-intensive, even for experienced users.

In this context, various automatic video summarization approaches have been introduced in the research community [7]–[9]. However, automatically producing summaries that are both short and coherent for long narrative videos remains extremely difficult [10]. Conventional methods relying solely on visual features struggle to capture narrative threads and highlights, let alone present personalized visual content [11], [12]. Moreover, many works rely on the strong encoding ability of neural networks to skip the outline writing stage, and the shots selection stage is usually completed by constructing a classifier. However, these works all assume that the selected shots conform to aesthetic guidelines, so their shots assembly is merely a process of stitching the shots together in chronological order. In practice, there are often cases where a model cannot obtain a good or complete shot in the shots selection stage. For instance, the selected shots are either too short to cover a complete voiceover or too long to hold viewers' interest, which limits the quality of the generated summaries. Therefore, in this work, we propose a multimodal-based and aesthetic-guided narrative video summarization framework, named MANVS, to generate high-quality video summaries. This framework selects meaningful and personalized shots based on multimodal information, and considers the completeness of shot content and the aesthetics of the selected shots in the shots assembly stage, to assemble them into a video with smooth visual transitions while preserving an overall pleasing aesthetic.


We assume that the user desires to generate a condensed version of the given narrative video, which allows viewers to quickly acquire a comprehensive overview of it. Moreover, the time-aligned subtitles related to the given video are assumed to be available, gathered from online or personal resources. However, several technical challenges need to be addressed in our approach. Firstly, the method should determine, based on the multimodal information, which shots contain important content and need to be selected. Secondly, if a user wants to select personalized visual content from the input video and preserve it in the summary, how does the model locate this personalized content? Finally, professional videos usually exhibit a group of features in accordance with the conventions and aesthetic guidelines of video editing, which can be used to distinguish them from amateur ones and make videos more visually attractive. Thus, when assembling the selected shots into a video summary, both the cinematographic aesthetics and the completeness of shot content need to be considered jointly. Note that even if the completeness of the shot content is well preserved, the shot may not be visually attractive enough. Similarly, an engaging shot does not necessarily contain a complete voiceover.

To address the challenges above, our proposed MANVS consists of two main modules. The first is the multimodal-based shots selection module, which integrates visual, audio and time-aligned subtitle information into the shots selection process to capture meaningful narrative content from consecutive sequences of shots. Furthermore, this module provides a flexible way to acquire the shots that users are interested in, allowing users to choose a shot by inputting text. Next, the aesthetic-guided shots assembly module first filters repetitive and low-quality shots, and then automatically checks whether the selected shots are content-complete. If a shot fails these checks, we follow cinematographic aesthetic guidelines and apply a series of shot completion strategies to achieve a good trade-off between aesthetics and shot completeness, and finally assemble these shots into a video with smooth visual transitions.

In summary, the contributions of this paper are as follows.
• We design an aesthetic-guided shots assembly module, which establishes a series of strategies to preserve shot content completeness and aesthetics. To the best of our knowledge, we are the first to consider aesthetic guidelines in the shots assembly stage.
• We present a multimodal-based shots selection module that comprehensively analyzes subtitle, image, and audio information to capture narrative, representative, and highlight shots. Besides, we provide a flexible way for shots selection so that users can choose the shots they desire to watch.
• We conduct extensive quantitative evaluations and user studies to evaluate the effectiveness of MANVS. The results demonstrate that our method can generate video summaries with quality comparable to those produced by experts while being much less time-consuming.

II. RELATED WORK

In this section, we briefly introduce techniques for video summarization, text summarization, language video localization, highlight detection and computational cinematography, which are most relevant to our work.

Video summarization: The main objective of general video summarization is to produce a shortened video containing the most representative visual information of a given one [14], [15]. A typical video summarization solution usually begins with selecting key frames or video segments [16], [17], and can mainly be divided into supervised and unsupervised styles. The former assumes human annotations of key frames in the original videos [7]. It is noticeable that constructing datasets with sufficient labels is very difficult in practice, so unsupervised learning based approaches have also appeared [18], [19]. However, although the aforementioned methods can obtain some important visual information from original videos, they share common disadvantages. For example, some image information is considered only by searching for shot boundaries [20] in the shots selection process, where the switched shots are regarded as important content and the multimodal information of the original video is ignored. Consequently, the generated video summary loses a lot of information, which makes it look like a truncated version of the original video without coherent narrative information. Current research shows that human cognition is a process of cross-media information interaction [21]. Therefore, multimodal information such as video frames, audio, and subtitles should be leveraged to select crucial shots and provide viewers with vivid and comprehensive content. Which shots should be chosen, and how to assemble them into a video with smooth visual transitions while preserving overall good aesthetics, deserve our attention.

Text summarization: Approaches to text summarization can be briefly classified into two categories: abstractive and extractive [22]. The former usually generates new sentences to express the crucial information [23]. However, state-of-the-art methods of this class are likely to generate abstracts that are not fluent or that introduce grammatical errors [24]. In contrast, extractive methods focus on selecting a subset of sentences containing important contextual information from the source text and assembling them to form a text summary [25]. The main advantage of this kind of method is that grammatical errors can be avoided in the sentences of the generated text summary. Pavel et al. [26] use crowdsourcing to make a text summary for the original video, and find the corresponding crucial video frames according to the content of the text summary. However, crowdsourcing is not only time-consuming and laborious, but also cannot ensure the overall consistency of summaries generated by different workers. Inspired by the above methods, we further consider that the subtitles and video shots are semantically relevant [27], and our work concentrates on automatically dividing the subtitle document to extract the text summary, which effectively helps generate topically coherent video summaries.

Language video localization: The goal of language video localization is to locate a matching shot from the video that semantically corresponds to the language query [28], [29], which has attracted the attention of a large number of researchers [30], [31].

Fig. 1. The proposed MANVS framework for narrative video summarization. The upper left part is a simplified flow chart and the upper right part is the detailed
processes corresponding to the annotated components on the left. The detect-summarize network is established based on [13]. Best viewed in color.

Xu et al. [31] propose a multilevel model that performs text feature injection, modulating the processing of query sentences at the word level in a recurrent neural network. Li et al. [32] propose a deep collaborative embedding method for multiple image understanding tasks; it is the first approach attempting to solve this issue under the framework of deep factor analysis and achieves fruitful results. Chen et al. [30] introduce a semantic activity proposal which utilizes the semantic information of sentence queries to obtain discriminative activity proposals. To fill the gap between question-answering and language localizing, a query-guided highlighting approach was proposed in [33]. In our work, we design a semantic similarity component to reduce redundant and irrelevant shots, and then leverage a language video localization method to select personalized shots for every user in the shots selection stage.

Highlight detection: The purpose of highlight detection is to select video clips that can attract viewers' attention [34], [35]. In past research, audio information has been widely used in this task, as audio-based modeling is computationally easier than video-based modeling when representing remarkable contextual semantics [36]. Highlight shots in sports videos, e.g., goals in soccer games and home runs in baseball games, provoke excited voices from sports commentators and cheers from the audience [37]. In narrative videos such as movies and documentaries, highlight shots also cause changes in the audio information [38]. The louder and denser the video sounds in a period of time, the more likely that period is to be a highlight [39]. Based on this observation, we take the change of video sounds into consideration for our shots selection.

Computational cinematography: A desired narrative video needs not only fluent narration, but also shots of high quality and artistic expression [40]. Niu et al. [41] discuss the aesthetic differences between professional and amateur videos by modeling image attributes, including color and noise, and video attributes, such as camera motion and shot duration. Huber et al. [42] make several observations by analyzing a large number of Vlogs on YouTube: shots for story-telling are usually A-roll, it is reasonable to insert B-roll when the speaker stops for a long time, and B-roll is usually distributed uniformly over the whole video. Wang et al. [43] utilize text information to connect multiple shots, and the well-preserved attributes of videos ensure both color continuity and the avoidance of shot movement. Hu et al. [44] propose to generate an informative and interesting summary using a set of aesthetic features, such as saturation and brightness, to select shots; without training or human annotation, it is convenient for processing videos. Our work mainly focuses on introducing computational cinematography to automatically improve the quality of the selected shots in the shots assembly stage.

III. OVERVIEW

Fig. 1 shows the overall architecture of our MANVS. The input of MANVS includes an original video, the corresponding time-aligned subtitles and the user-designed procedural text. In particular, the time-aligned subtitles include the subtitle text sequence and the corresponding time maps onto the video shots. We use the term procedural text [45] to emphasize that it should focus on describing a specific visual content that users are interested in. In the shots selection stage, our method employs the multimodal-based shots selection module, which consists of subtitle summarization, visual semantic matching, highlight extraction, and key shots selection components to seek the narrative, personalized, highlight, and key shots.

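For illustration only, the following minimal sketch shows one possible way to organize this two-stage flow around simple time intervals. All identifiers (Subtitle, Shot, summarize, and the stubbed component functions) are hypothetical and stand in for the modules described in Sections IV and V; this is not the authors' implementation.

```python
from typing import List, Optional, Tuple

Subtitle = Tuple[str, float, float]   # (text, begin_seconds, end_seconds)
Shot = Tuple[float, float]            # (begin_seconds, end_seconds) of a selected shot

# Stubs standing in for the four selection components (Section IV) and the
# aesthetic-guided assembly module (Section V); each works on shot intervals.
def subtitle_summarization(subs: List[Subtitle]) -> List[Shot]: return []
def key_shots_selection(video_path: str) -> List[Shot]: return []
def highlight_extraction(video_path: str) -> List[Shot]: return []
def visual_semantic_matching(video_path: str, subs: List[Subtitle], text: str) -> List[Shot]: return []
def aesthetic_guided_assembly(video_path: str, shots: List[Shot], subs: List[Subtitle]) -> List[Shot]:
    return sorted(shots)   # placeholder: keep chronological order

def summarize(video_path: str, subs: List[Subtitle],
              procedural_text: Optional[str] = None) -> List[Shot]:
    """Two-stage MANVS-style flow: multimodal shots selection, then assembly."""
    candidates = (subtitle_summarization(subs)
                  + key_shots_selection(video_path)
                  + highlight_extraction(video_path))
    if procedural_text is not None:
        candidates += visual_semantic_matching(video_path, subs, procedural_text)
    return aesthetic_guided_assembly(video_path, candidates, subs)
```

The interval representation mirrors the paper's convention that every component outputs shots as beginning and ending time point pairs.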

In the shots assembly stage, our method passes the shots obtained by the shots selection module to the aesthetic-guided shots assembly module to obtain a good summary. Note that shots selection is an ill-posed problem, and incomplete or repeated shots can be introduced in this process. If incomplete or repeated shots appear, the overall quality degrades significantly even if the visual aesthetic of the summary itself is satisfactory. Similarly, an engaging shot does not necessarily contain a complete voiceover. We thus design our aesthetic-guided shots assembly module so that it preserves both the completeness of the shot content and the overall aesthetics. Additionally, a user-designed voiceover text is allowed as an optional input to the post-processing component, to provide customized effects for the generated summary, such as a custom voiceover. Finally, MANVS outputs a high-quality video summary.

IV. MULTIMODAL-BASED SHOTS SELECTION

In this section, we describe the multimodal-based shots selection module, which is composed of four components. All the components presented in this section output a series of shots and corresponding timelines given by beginning and ending time point pairs.

A. Subtitle Summarization

In order to increase the narration capacity of the video summary, we design a subtitle summarization component. It comprises two cascaded components: a chapter clustering component and a text summarization component.

Fig. 2 illustrates the whole pipeline of the subtitle summarization component. The input data include the given narrative video and the corresponding time-aligned subtitles. Firstly, the chapter clustering component, which utilizes term frequency-inverse document frequency (TF-IDF) [46] and the K-means algorithm, automatically organizes the storytelling structure of the input subtitle text sequence and divides it into different chapters. Next, the text summarization component is applied to every single chapter to generate a text summary. Finally, our method extracts narrative shots from the input video according to the time mapping corresponding to every generated text summary.

Fig. 2. The pipeline of the subtitle summarization component. The green, yellow, and blue blocks represent different chapters. The text summarization component is established based on [47].

Formally, the time-aligned subtitles of a narrative video are represented by a two-tuple (S, T), where S = {s_1, s_2, . . ., s_n} denotes the subtitle text sequence, s_i represents the i-th subtitle in the text, T = {t_1, t_2, . . ., t_n} is the time mapping, t_i = (b_i, e_i), b_i and e_i respectively denote the start and end time points of s_i, and n is the total number of sentences in S. The objective of the chapter clustering component is to divide S into a chapter sequence D = {d_1, d_2, . . ., d_m}. Here m is the total number of chapters d, which can be defined as follows:

D = ς(ϑ(S)),  s.t.  ⋃_{h=0}^{m} d_h = ⋃_{i=0}^{n} s_i,  m ≪ n,   (1)

where ϑ(·) represents the TF-IDF similarity score and ς(·) denotes K-means clustering. In this process, we use the NLTK toolkit to exclude 'or,' 'the,' and other stop words. Our chapter clustering component ensures that the text of every chapter describes the coherent stories of that chapter.

After that, a pretrained extractive text summarization method [47] is employed as the backbone of our text summarization component, aiming to obtain representative contextual semantic information and meaningful narrative clues. Concretely, this component leverages the robust transformer encoder [48] to map a chapter to a semantic feature space. Then, it uses an auto-regressive pointer network decoder with an attention mechanism [49] to extract a subset of sentences from each chapter to form the text summary. This process and objective function can be formulated as follows:

r = {s_b, s_o, . . ., s_q} = λ̄(ε(d)),   (2)

P(r | d) = argmax_{s_q} ∏_{b}^{q} P(s_q | {s_b, . . ., s_{q−1}}, d),   (3)

where {s_b, s_o, . . ., s_q} denotes the sentences that form a text summary r, the first and last sentences of r are represented by s_b and s_q, respectively, λ̄(·) means the transformer encoder, ε(·) denotes the pointer network decoder, and P is the probability.

Finally, our subtitle summarization component searches for the narrative shots corresponding to the time mappings t in the given video based on r, ensuring that the video summary effectively covers the significant narrative information of the input video.

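As a rough illustration of the chapter clustering step in Eq. (1), the sketch below groups subtitle sentences with TF-IDF features and K-means, using scikit-learn and NLTK stop words. The function name cluster_chapters and the fixed chapter count m are simplifying assumptions, not the authors' exact procedure.

```python
from nltk.corpus import stopwords              # requires nltk.download('stopwords') once
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_chapters(subtitle_sentences, m=5):
    """Group subtitle sentences into m chapters via TF-IDF + K-means (Eq. (1) sketch)."""
    stop_words = stopwords.words("english")                      # excludes 'or', 'the', etc.
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    features = vectorizer.fit_transform(subtitle_sentences)      # ϑ(S): TF-IDF representation
    labels = KMeans(n_clusters=m, n_init=10, random_state=0).fit_predict(features)  # ς(·)
    chapters = [[] for _ in range(m)]
    for sentence, label in zip(subtitle_sentences, labels):
        chapters[label].append(sentence)                         # every sentence ends up in exactly one chapter
    return chapters

# Example usage:
# chapters = cluster_chapters([text for text, _, _ in subtitles], m=4)
```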

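The final retrieval step of this component, mapping the sentences kept in r back to shot intervals through the time mapping T, can be sketched as follows; the tuple-based Subtitle alias and the function name are illustrative assumptions.

```python
from typing import List, Tuple

# Each time-aligned subtitle is a (text, begin_seconds, end_seconds) triple.
Subtitle = Tuple[str, float, float]

def narrative_shot_intervals(summary_sentences: List[str],
                             subtitles: List[Subtitle]) -> List[Tuple[float, float]]:
    """Map each sentence kept in the text summary r back to its (b_i, e_i)
    interval through the time mapping T, yielding the narrative shots."""
    selected = set(summary_sentences)
    return [(begin, end) for text, begin, end in subtitles if text in selected]

# Example:
# subs = [("The reef wakes at dawn.", 12.0, 15.5), ("Sharks patrol the drop-off.", 40.2, 44.0)]
# narrative_shot_intervals(["Sharks patrol the drop-off."], subs)  -> [(40.2, 44.0)]
```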
B. Visual Semantic Matching

To find specific shots that users might be delighted to watch, we construct a visual semantic matching component that searches for the shots matching the user-designed procedural text. Our visual semantic matching component consists of two cascaded components: a text semantic similarity component and a visual semantic localizing component.

Fig. 3 presents the workflow of the visual semantic matching component. The input data include the given video, the time-aligned subtitles and the user-designed procedural text. By calculating the semantic similarity weight between the procedural text and the text of all subtitles, the text semantic similarity component obtains the subtitles semantically related to the procedural text. This component then creates a sub-video by extracting the corresponding shots from the input video based on the time mapping of these subtitles. The semantic similarity component is designed to reduce redundant and irrelevant shots. Next, the sub-video and the procedural text are fed into the visual semantic localizing component. Finally, in the generated sub-video, this component locates the shot that corresponds to the scene described by the procedural text.

Fig. 3. The workflow of the visual semantic matching component. The generated sub-video and the located shots are outlined in green and yellow, respectively. The visual semantic localizing is established based on [50].

The flow of our text semantic similarity component is as follows. Firstly, it computes the word co-occurrence between the procedural text x and every subtitle text y_i, where i ∈ {1, 2, . . ., n}; this measures the number of shared words for every text-subtitle pair. Secondly, it selects the top k_1 subtitles with relatively high word co-occurrence rates. Then, an LSTM is employed to extract semantic features for the text x and the k_1 subtitles, and the semantic similarity weight for every selected text-subtitle pair is calculated. This allows us to find the k_2 subtitles with the highest weights among the k_1, and we take them as the final candidates. Subsequently, a sub-video is created by assembling the video shots extracted from the input video corresponding to the k_2 subtitle candidates. In order to increase the story completeness of the sub-video, we lengthen the duration of each subtitle: some content before the beginning (for l_1 seconds) and after the end (for l_2 seconds) is also included in the sub-video. This enables us to involve some supplementary shots (B-roll), and in our work l_1 and l_2 are both set to 6 according to the detailed B-roll analysis provided in [42]. Formally, our text semantic similarity component is defined as:

SC = T_{k_2}( Γ_{i=0}^{g} [ LSTM(x) ∗ LSTM(y_i) / ( ‖LSTM(x)‖ ‖LSTM(y_i)‖ ) ] ),   (4)

where SC denotes the final subtitle candidates, T_{k_2}(·) represents selecting the top k_2 subtitles that are similar to the procedural text x, Γ_{i=0}^{g} is the value range of i from 0 to g, the operator ∗ means dot product, and ‖·‖ denotes the norm of the vector. Here g is the number of selected subtitles G, which are obtained by

G = T_{k_1}( Γ_{j=0}^{n} W(x, y_j) ),   (5)

where T_{k_1}(·) represents selecting the top k_1 subtitles that are similar to the procedural text x, n is the total number of subtitles in the whole narrative video, and W(·, ·) is the word co-occurrence computation. k_1 and k_2 are set to 12 and 6, respectively.

Since the procedural text is a combination of semantic phrases F(·), its different levels of semantic information can be used to match the visual features of the created sub-video, and the visual content can be located by the text. Therefore, the objective of our localization task is defined as maximizing an expected log-likelihood:

θ* = argmax_θ E{ log p_θ[ SV(C) | F(P) ] },   (6)

where θ denotes the parameters that need to be optimized, SV denotes our generated sub-video, C represents a time interval of the target region within SV, and P is the procedural text.

Our visual semantic localizing component is based on the state-of-the-art temporal language grounding method LGI [50]. This method first uses a sequential query attention network [51] to explore the sequence relationships between sentences. Next, it performs video-text interaction at three different levels, where the first level is segment-level modality fusion; this encourages segment features relevant (or irrelevant) to the semantic phrase features to be highlighted (or suppressed). The second-level interaction is for local context modeling, where the neighbors of individual segments are considered by leveraging the Hadamard product [52]. The last-level interaction deals with contextual and temporal relations between semantic phrases by employing the non-local block presented in [53]. After that, the shots that are most semantically related to the procedural text are located and selected from the sub-video.

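A rough sketch of the two-step filtering in Eqs. (4)–(5) is given below. For simplicity it scores semantic similarity with externally supplied sentence-embedding vectors; the embed argument stands in for the paper's LSTM feature extractor, and all function names are illustrative assumptions.

```python
import numpy as np

def word_cooccurrence(a: str, b: str) -> int:
    """W(x, y): number of shared words between two texts (Eq. (5))."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def select_candidate_subtitles(procedural_text, subtitle_texts, embed, k1=12, k2=6):
    """Two-stage selection: top-k1 by word co-occurrence, then top-k2 by
    embedding cosine similarity (sketch of Eqs. (4)-(5))."""
    # Stage 1: coarse filtering by word co-occurrence, Eq. (5).
    ranked = sorted(range(len(subtitle_texts)),
                    key=lambda j: word_cooccurrence(procedural_text, subtitle_texts[j]),
                    reverse=True)[:k1]
    # Stage 2: fine ranking by semantic similarity, Eq. (4).
    query_vec = embed(procedural_text)
    scored = [(cosine(query_vec, embed(subtitle_texts[j])), j) for j in ranked]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k2]]   # indices of the k2 final subtitle candidates
```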

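The sub-video construction step, which pads each candidate subtitle by l_1 = l_2 = 6 seconds to pull in neighboring B-roll, could look like the following sketch; the function name and the clamping to the video duration are assumptions added for illustration.

```python
from typing import List, Tuple

Subtitle = Tuple[str, float, float]   # (text, begin_seconds, end_seconds)

def sub_video_intervals(subtitles: List[Subtitle], candidate_indices: List[int],
                        video_duration: float, l1: float = 6.0, l2: float = 6.0):
    """Pad each selected subtitle by l1 s before and l2 s after (clamped to the
    video), giving the shot intervals that are concatenated into the sub-video."""
    intervals = []
    for j in candidate_indices:
        _, begin, end = subtitles[j]
        intervals.append((max(0.0, begin - l1), min(video_duration, end + l2)))
    return sorted(intervals)
```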
C. Crucial Shots Selection

To effectively capture representative and highlight shots, we design the key shots selection and highlight extraction components.

Key shots selection: Here we use a pretrained detect-summarize network [13] to achieve this purpose, with only the given video as input. Temporal consistency is formulated to ground the representative contents of the given video in this approach. In detail, our method first samples a series of temporal interest proposals with different interval scales. After that, long-range temporal features are extracted both for predicting important shots and for selecting the relatively representative ones. Finally, a set of correlated consecutive frames within a temporal slot is considered for shots selection.

Highlight extraction: Audio carries a remarkable representation of the corresponding semantic content, and its processing is computationally easier than that of video, so it has been widely used for highlight extraction. Following these observations, we utilize the fluctuation of the sound energy as a supervised prior to extract highlight shots, where the sound energy is determined by the volume level of the audio track in the video. The input data include the given video and the audio extracted from it. Firstly, we divide the audio into clips of the same length and then compute the sound energy of those clips. Then, the clips with larger audio energy are selected. Finally, the desired shots are extracted from the video based on the time mapping corresponding to the selected audio clips. Formally, it is constructed as follows:

HS = T_x( Γ_{k=0}^{l} Σ_{i=0}^{w} E²_{k+i} ),   (7)

where HS is the set of desired highlight shots to be selected, T_x(·) denotes the top x (we fix it as 10) percent of the calculated sound energies of all audio clips, Γ_k means the value range of k, and l is the duration of the video. Suppose E_k is the value of the audio signal at time k; for each audio clip from time k to k + w, where w is the clip size (e.g., 5 seconds), Σ_{i=0}^{w} E²_{k+i} is the sound energy of that clip. Fig. 4 shows an example of highlight extraction.

Fig. 4. An example of highlight extraction. The audio segment with higher sound energy and the corresponding highlight shot are marked with a red block and line, respectively.

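A minimal sketch of the audio-energy criterion in Eq. (7) is given below. It assumes the audio track has already been extracted to a mono sample array (e.g., with ffmpeg or librosa) and simply keeps the top 10 percent of fixed-length clips by summed squared amplitude; the function name is illustrative.

```python
import numpy as np

def highlight_clips(samples: np.ndarray, sample_rate: int,
                    clip_seconds: float = 5.0, top_percent: float = 10.0):
    """Return (begin, end) times of audio clips whose energy
    sum(E_{k+i}^2) falls in the top `top_percent` percent (Eq. (7) sketch)."""
    clip_len = int(clip_seconds * sample_rate)
    n_clips = len(samples) // clip_len
    energies = np.array([np.sum(samples[i * clip_len:(i + 1) * clip_len] ** 2.0)
                         for i in range(n_clips)])
    threshold = np.percentile(energies, 100.0 - top_percent)
    return [(i * clip_seconds, (i + 1) * clip_seconds)
            for i, e in enumerate(energies) if e >= threshold]
```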

V. AESTHETIC-GUIDED SHOTS ASSEMBLY

Although some works successfully apply aesthetic guidelines to the video summarization task [44], they only take the brightness and saturation features of frames as a supervisory signal to select shots in the shots selection stage, while neglecting the completeness of the shot content and the aesthetic guidelines in the shots assembly stage. Thus, existing methods cannot be directly applied to solve the following issues: (1) whether the aesthetic quality of a selected shot itself (e.g., shot length and shot stability) meets the cinematographic aesthetic guidelines, or further optimization is needed; (2) whether there is an overlap between the timelines of the selected shots, or filtration is needed; (3) whether the timeline of a selected shot, given by a beginning and ending time point pair, is perfectly in accordance with a complete narrative subtitle or content, or completion is needed; and (4) how to solve the above situations while preserving an overall pleasing aesthetic feeling as well as smooth visual transitions among consecutive shots.

We believe that the quality of each single shot is highly correlated with the overall quality of the summary, i.e., both the aesthetic and the content-integrity factors of each shot affect the overall quality. For example, for a narrative video summary, most of the attention should be paid to the narrative shots and the voiceover. If the meaningful narrative shots and voiceover are incomplete (cut off in spoken sentences) or repetitive, the overall quality degrades significantly even if the visual aesthetic of the shots themselves is satisfactory. Similarly, shots that are complete but do not conform to aesthetic guidelines (such as overly long shots) can also limit the quality of the generated summary. We thus establish the aesthetic-guided shots assembly module to effectively solve the issues above. To the best of our knowledge, we are the first to apply aesthetic guidelines and keep the content completeness of the selected shots in the shots assembly stage. Specifically, this module takes the shots selected by the previous module as input, and refers to classic cinematographic aesthetic guidelines and the time-aligned subtitles to preserve aesthetic quality and content integrity. The guidelines provide a set of predefined constraints for ensuring the aesthetic quality, while the time-aligned subtitles serve as reference cutting points for selecting complete narrative shots.

Firstly, we make a simple aesthetic evaluation of the selected shots. Though converting the shot aesthetic evaluation into a regression task is a straightforward solution, it brings great difficulties in training the model: determining whether a shot is aesthetic is subjective, and annotations are generally not available. To simplify the problem, our method automatically checks whether the selected shots meet some simple predefined aesthetic constraints, such as shot stability and opposite movement. If a shot fails these simple checks, we abandon it to avoid introducing low-quality shots. Secondly, our method checks the content redundancy and completeness of the remaining shots. Generally, if there is an overlap between the timelines of multiple selected shots, they are considered redundant. If the shot timeline, with its beginning and ending time points, can be aligned with a complete subtitle, the content of the shot is regarded as complete, which brings the viewer an enjoyable audio-visual experience; otherwise, incomplete shots can cut off spoken sentences appearing in the summary. Once a shot fails the above checks, our method analyses its eight possible situations and leverages the corresponding shot complement strategies to remove redundant shots and ensure the completeness of the content. Meanwhile, our method automatically expands or filters shots so that the selected shots satisfy predefined aesthetic constraints such as color continuity and shot length. In this way, MANVS achieves a good trade-off between ensuring the completeness of shots and preserving an excellent aesthetic quality of the generated video. Our method preserves smooth visual transitions among consecutive shots by automatically adjusting the saturation and brightness of adjacent shots.

A. Cinematographic Aesthetic Constraints

Considering that our focus is to generate summaries of professional narrative videos, some cinematography rules are not always applicable to our task. For instance, for some specific artistic expressions or for authenticity, shots with low saturation and brightness do not necessarily make a low-quality video. Therefore, we select several classical aesthetic guidelines that are suitable for our task, and explain below how we apply them in our setting.

Shot Stability: High-quality video shots should move smoothly and stably. The lower the local acceleration of the shot content, the more stable the shot, and conversely for higher acceleration [54]. In our setting, we calculate the local shake value of a shot according to the homography transformation matrices computed from consecutive sampled frames in the shot:

F(f_i) = ‖ H(f, f′) p_{f′}(i) − p_{f″}(i) − H(f, f″) p_f(i) ‖_2,   (8)

F_SS(s) = − (1 / (4 N_s^f)) Σ_{f∈s} Σ_i F(f_i),   (9)

where f, f′ and f″ represent three consecutive frames in the shot s, the symbol f_i is the corner i of f, p_f(i) is the position of f_i in pixels, where i = 1, . . ., 4, and H(f, f′) denotes the homography transformation matrix between f and f′. F(f_i) is the local shake value of f_i. The shot stability F_SS is computed as the negative average of the local shake values over time, N_s^f is the number of sampled frames in shot s, and the sampling step is 8.

Opposite Movements: Adjacent shots with opposite camera movements may result in a terrible viewing experience for the audience [43]. We avoid this situation by calculating the two-dimensional motion of the shots:

F_om(f_l, f_f) = Σ_{i=1}^{4} [ ρ(f_l, i) · ρ(f_f, i) ] / [ |ρ(f_l, i)| · |ρ(f_f, i)| ],   (10)

ρ(f_l, i) = p_{f_l}(i) − H(f_l, f_f) p_{f_l}(i),   (11)

where f_l and f_f respectively represent the last and the first frame of two adjacent shots. F_om denotes the cosine distance between the frame-corner movement vectors of f_l and f_f, ρ(f_l, i) estimates the two-dimensional movement of the i-th corner of f_l, and H(f_l, f_f) is the homography transformation matrix between f_l and f_f.

Shots length: Long shots without interesting content may easily cause the viewer to lose attention, and, conversely, overly short shots may affect visual smoothness. To avoid these extreme shots, we set the duration of individual shots to 3 to 8 seconds, inspired by [43].

B-roll selection: In professional narrative videos, A-roll and B-roll shots are reasonably mixed [42]. For telling stories, most shots are usually A-roll, and B-roll is usually placed at the speaker's natural pauses to support the A-roll in visual experience. In this work, we take the shots with and without subtitles as A-roll and B-roll, respectively. It is noticeable that frequent insertion of B-roll would easily distract the audience, while too little B-roll may make the story much less interesting. Therefore, we set the interval between two B-rolls to 9 seconds, motivated by [42].

Color continuity: Color continuity is often the most representative feature for identifying professional videos [41]. Therefore, preserving the continuity of saturation and brightness in a shot is crucial to improving the viewing experience. In our work, the color continuity between two adjacent frames is measured by the histogram difference of the saturation and brightness:

F_cc(e, b) = (1/2) ( Ψ(η_S(e), η_S(b)) + Ψ(η_L(e), η_L(b)) ),   (12)

where F_cc(·) denotes the tonal difference between two shots, e and b respectively represent the last and the first frame of the adjacent shots, η_S(·) and η_L(·) represent the S-channel and L-channel histograms of the frame, both quantized to 256 bins, and Ψ is the chi-square measure.

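The color-continuity term of Eq. (12) can be approximated with OpenCV's HLS histograms and chi-square comparison, as in this sketch; the channel indexing, normalization, and function names are assumptions layered on top of the paper's description.

```python
import cv2
import numpy as np

def channel_hist(frame_bgr, channel: int) -> np.ndarray:
    """256-bin normalized histogram of one HLS channel (0=H, 1=L, 2=S)."""
    hls = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HLS)
    hist = cv2.calcHist([hls], [channel], None, [256], [0, 256])
    return cv2.normalize(hist, hist).flatten()

def color_continuity(last_frame_prev_shot, first_frame_next_shot) -> float:
    """F_cc(e, b): average chi-square distance of the S- and L-channel
    histograms of the two frames meeting at a cut (Eq. (12) sketch)."""
    chi_s = cv2.compareHist(channel_hist(last_frame_prev_shot, 2),
                            channel_hist(first_frame_next_shot, 2), cv2.HISTCMP_CHISQR)
    chi_l = cv2.compareHist(channel_hist(last_frame_prev_shot, 1),
                            channel_hist(first_frame_next_shot, 1), cv2.HISTCMP_CHISQR)
    return 0.5 * (chi_s + chi_l)   # larger values mean a harsher tonal jump between shots
```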

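The shot stability and opposite-movement constraints of Eqs. (8)–(11) likewise rest on frame-to-frame homographies of the four frame corners. The sketch below estimates such a homography with OpenCV feature matching; the corner bookkeeping is simplified and the extra mid_a/mid_b frames are an added assumption, so treat this as an illustrative approximation rather than the authors' implementation.

```python
import cv2
import numpy as np

def frame_homography(frame_a, frame_b):
    """Estimate the homography H(a, b) between two frames with ORB features."""
    orb = cv2.ORB_create(1000)
    kp_a, des_a = orb.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, des_b = orb.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

def corner_motion(frame_a, frame_b):
    """Movement vectors of the four frame corners under H(a, b) (spirit of Eq. (11))."""
    h, w = frame_a.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    mapped = cv2.perspectiveTransform(corners, frame_homography(frame_a, frame_b))
    return (mapped - corners).reshape(4, 2)

def opposite_movement(last_frame_prev_shot, first_frame_next_shot, mid_a, mid_b):
    """Cosine similarity of corner motions across a cut (Eq. (10) sketch).
    mid_a / mid_b are the next frames inside each shot, used to measure motion."""
    ma = corner_motion(last_frame_prev_shot, mid_a)
    mb = corner_motion(first_frame_next_shot, mid_b)
    cos = (ma * mb).sum(axis=1) / (np.linalg.norm(ma, axis=1) * np.linalg.norm(mb, axis=1) + 1e-8)
    return float(cos.sum())   # strongly negative values suggest opposite camera movements
```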
B. Shots Assembly and Post-Processing

Based on the aforementioned classical cinematographic aesthetic guidelines, it becomes practical for us to appropriately place the beginning and ending points of the selected shots, remove repeated shots and extend incomplete ones. Specifically, our method automatically checks whether the selected shots meet the stability and opposite-movement constraints, and then preserves the shots that pass these checks. Furthermore, our method checks whether each preserved shot is long enough to contain the duration of a complete voiceover based on the time-aligned subtitles. We summarize three types, adding up to eight possible situations, and the corresponding shot complement strategies, as shown in Fig. 5, to make the preserved shots contain complete narrative content while satisfying aesthetic constraints such as shot length and color continuity. We now provide more details of these strategies.

Fig. 5. Eight possible shot situations and corresponding shot complement strategies. The long bar outlined in black is the timeline of the video.

1) Incomplete and non-overlapping shots: In this type, the selected shot is within either an A-roll or a B-roll, and the timelines of the selected shots do not overlap with each other. Therefore, there are two situations. (I) Incomplete A-roll (ξ_A): it should be lengthened such that its timeline is equal to that of a complete A-roll (δ_A), or we directly remove it once the timeline ratio of ξ_A to δ_A is smaller than a constant threshold (e.g., 0.5). (II) Incomplete B-roll (ξ_B): we extend this kind of shot according to the color continuity. For each iteration, if the difference in color continuity between the beginning frame and its previous frame is not large, the beginning time point is updated to the previous frame. Similarly, the end time point is processed by comparing the end frame and its next frame. The extreme case is extending ξ_B to a complete B-roll (δ_B).

2) Across-boundary and non-overlapping shots: For each beginning and ending time point pair of such shots, the pair is located within two complete shots, and the shots do not overlap with each other. This type can be simply divided into three cases: (III) ξ_A ∪ ξ_B, (IV) δ_A ∪ ξ_B, and (V) ξ_A ∪ δ_B. However, in all these cases the selected shots can be decomposed into separate sub-shots that exactly follow (I) or (II).

3) Overlapping shots: This type contains the following cases: (VI) the overlapping parts are within a complete shot, (VII) at least one overlapping shot crosses the boundary of complete shots, and (VIII) at least one overlapping shot contains a complete shot. For these shots, we first remove the repeated parts, and then merge the consecutive shots into a single one. This merged shot then conforms to the processing of type 1) or 2) above.

Similarly, a beginning and ending time point pair that is located within more than two complete shots can first be separated at the complete shot boundaries, and then complemented by leveraging the corresponding strategies. Finally, we select the qualified shots according to the aforementioned aesthetic constraints, and the resulting shots are assembled to obtain the video summary. In particular, if the color continuity between two shots is quite different, we employ fade-in and fade-out effects to ensure visual smoothness.

Post processing: We further apply a series of post-processing effects, of which operations 2) and 3) are optional: 1) we leverage Spleeter [55] to extract the original BGM and keep only the human voice; 2) we automatically select a coherent audio clip from the extracted BGM that matches the style of the generated summary, and we also allow users to select external BGM for the summary; 3) we allow users to manually design a voiceover text and select the start and end times at which they desire to insert it, and we leverage the text-to-speech method ESPnet [56] to implement it.

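For the overlapping cases in type 3), the first step, removing repeated parts and merging consecutive overlapping timelines into a single shot, is essentially interval merging; a minimal sketch is given below (the tuple-based shot representation and function name are illustrative).

```python
from typing import List, Tuple

def merge_overlapping(shots: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Merge overlapping (begin, end) shot timelines into single shots,
    removing the repeated parts (pre-processing for type 3))."""
    merged: List[Tuple[float, float]] = []
    for begin, end in sorted(shots):
        if merged and begin <= merged[-1][1]:            # overlaps the previous shot
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged

# Example: [(3.0, 9.0), (7.5, 12.0), (20.0, 24.0)] -> [(3.0, 12.0), (20.0, 24.0)]
```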

VI. EXPERIMENTS

In this section, we describe the datasets, evaluation metrics, baselines and experimental designs, and evaluate the effectiveness of our approach by answering the following research questions (RQs):

RQ1: Compared to traditional video summarization methods, can our method provide a higher-quality summary?

RQ2: Does every shots selection component and aesthetic constraint have a positive influence on our final result?

RQ3: In contrast to professional manual summarization, what are the advantages and disadvantages of MANVS?

A. Datasets

We employ several existing and collected datasets to conduct extensive experiments. These datasets are described below.

1) TVsum [57]: A widely used and manually annotated video summarization dataset, which contains 50 multimodal videos from public websites. The topics of the videos include how to change tires for off-road vehicles, paper wasp removal, etc., with durations varying from 2 minutes to 10 minutes. In our experiments, the Aliyun interface¹ is used to automatically generate subtitle documents for these videos.

2) MPII movie description (MPII) [58]: A popular video description dataset in the language video localization field, containing 94 movies and 68,375 manually written description sentences of movie plots.

3) Documentary description: We collected 72 documentaries with different themes, such as earth, travel, adventure, food, history and science, from public websites, and annotated 21,643 temporal descriptions of plot, like [58].

4) Movie-documentary (M-D): A video repository consisting of the movies in MPII and the documentaries in the documentary description dataset. We further collect their corresponding subtitles from public websites. Unlike TVsum, M-D does not provide frame-level importance scores.

¹ [Online]. Available: https://www.aliyun.com/

B. Baselines

We compared the performance of our model with several advanced video summarization methods. 1) Random: we label importance scores for each frame randomly to generate summaries that are independent of the video content. 2) DR [18]: it proposes an encoder-decoder framework based on reinforcement learning to predict probabilities for every video frame; a diversity and a representativeness reward function are designed for training the framework to generate summaries. 3) DR-Sup [13]: an ablation version of DR with the same model backbone, but it only utilizes a representativeness reward function in the training process. 4) VAS [7]: it proposes a self-attention based network, which performs the entire sequence-to-sequence transformation in a single feed-forward pass and a single backward pass during training. 5) DSN-AB [13]: it first samples a series of temporal interest proposals with different interval scales; after that, long-range temporal features are extracted both for predicting important frames and for selecting the relatively representative ones. 6) DSN-AF [13]: a variant of DSN-AB; the feature extraction and key shots selection steps are the same as those of DSN-AB, but the frame-level importance scores are converted into shot-level scores. 7) HSA [8]: it proposes a two-layer framework; the first layer locates the shot boundaries in the video and generates their visual features, and the second layer predicts which shots are most representative of the video content. 8) FCN [59]: it adapts semantic segmentation models based on fully convolutional sequence networks for video summarization. 9) HMT [60]: it proposes a hierarchical transformer model based on audio and visual information, which can capture long-range dependency information among frames and shots. 10) VSN [61]: it proposes a deep learning framework to learn video summarization from unpaired data.

C. Experimental Designs

In order to make the comparison as fair as possible, we evaluate our method both with and without manual operation, denoted MANVS and MANVS-auto, respectively. Specifically, the manual operation includes the visual semantic matching component and the user-designed voiceover and BGM in the post-processing component. Next, the following experiments are designed to answer the aforementioned RQs.

Comparison to traditional video summarization methods (RQ1): We compared MANVS-auto with the baselines in two experimental designs. R1a: we conduct comparison experiments on the TVsum dataset, quantitatively evaluating the performance of the proposed method and the comparison methods. R1b: we conduct a user study on the comparison experiments for the TVsum and M-D datasets, and further explore the statistical significance of the user data.

Ablation study (RQ2): We conduct two types of ablation studies. R2a: we conduct a quantitative experiment on the MPII and documentary description datasets to objectively evaluate the performance of our visual semantic matching component. In this experiment, we compare the employed method LGI and an alternative method VSL [33] for visual semantic localizing. Besides, we compare them against variants without the text semantic similarity (TSS) component, without the visual semantic localizing component, or without both. R2b: we evaluate the effect of every single component and aesthetic constraint on the quality of our generated video summaries on the TVsum and M-D datasets. For the former dataset, the experiments include quantitative evaluations and a user study, while for the latter, due to the lack of annotations, only a user study is conducted. Specifically, for TVsum, we apply MANVS without a specific individual component or aesthetic constraint: without subtitle summarization (w/o SS), without key shots selection (w/o KSS), without highlight extraction (w/o HE), without shots length (w/o SL), without B-roll selection (w/o BS), without color continuity (w/o CC), without shot stability (w/o SSA), without opposite movements (w/o OM) and without post processing (w/o PP). Since the videos in TVsum are not labeled with temporal descriptions, we do not use the VSM component in this experiment on TVsum. Similarly, we apply MANVS with the same settings on the M-D dataset.

Comparison to professional manual editing (RQ3): R3: we compare the editing time and the quality of the generated summaries among MANVS, MANVS-auto and an experienced video producer who creates a video summary manually. Here the manual editing tool is the commonly used Adobe Premiere©, and the producer is asked to make a video summary for the test videos. The summary should also contain a coherent BGM, and some simple splicing and fading effects may be used to make the summary visually appealing. To make a fair comparison, we only counted the active human time during production.

Implementation details: In the user studies of R1b, R2b and R3, we randomly select 10 movies, 10 documentaries and 10 videos from the M-D and TVsum datasets to generate summaries. In order to keep the consistency of the evaluations, we refer to [43] and randomly pick a movie (Big Fish), a documentary (Planet Earth Season I Episode 2) and a video from TVsum (Poor Man's Meals: Spicy Sausage Sandwich) for demonstrating results. In the quantitative experiments, the TVsum, MPII and documentary description datasets are randomly divided into training and test sets with a ratio of 8:2. For the extractive text summarization method [47] leveraged in the text summarization component, we adapted the model pretrained on CNN/DailyMail [62]; the official code and pretrained model are available online.² This model employs a pretrained uncased base model of BERT as the transformer encoder, and utilizes a pointer network with an attention mechanism as the decoder. The numbers of transformer blocks and self-attention heads are both 12, the hidden layer size is 768, the maximum input sequence length is 512, the batch size is 32 and the vocabulary size is 30,000. For the temporal language grounding method [50] leveraged in the visual semantic localizing component, the official code is available online.³ We follow the official training setup and train the model on the Charades dataset [63], which is composed of 12,408 and 3,720 time interval and text query pairs in the training and test sets, respectively. This model utilizes I3D [64] to extract segment features for the training data, while fixing their parameters during a training step. The feature dimension is set to 512. This method uniformly samples 128 segments from each video and uses the Adam optimizer to learn models with a mini-batch of 100 video-query pairs and a fixed learning rate of 0.0004. Then, we fine-tune the pretrained model on the MPII movie and documentary description datasets; the parameters of the pretrained model are fixed during training. For the detect-summarize network in the key shots selection component, we use the pretrained anchor-based model provided by [13]. This model includes a multi-head self-attention layer with 8 heads, a layer normalization, a fully-connected layer with a dropout layer and a tanh activation function, followed by two output fully-connected layers. In our setting, all the hyperparameters of the implemented methods are kept the same as the official ones. We evaluate the performance of the implemented methods with the same evaluation criteria as the official ones, and their performance is also consistent with the official results. The supplementary material provides some of the generated summaries, and we encourage readers to watch these videos.

² [Online]. Available: https://github.com/maszhongming/Effective_Extractive_Summarization
³ [Online]. Available: https://github.com/JonghwanMun/LGI4temporalgrounding

D. Evaluation Metrics

In order to comprehensively evaluate the performance of our framework, we adopt different evaluation metrics for the different experimental designs. A detailed version can be found in the supplementary material.

1) In the quantitative evaluations of R1a and R2b on TVsum, the commonly used [9] Precision, Recall and F-score are utilized as the evaluation metrics to evaluate the quality of the generated summaries.

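For the frame-level Precision/Recall/F-score protocol used on TVsum, a minimal sketch over binary frame-selection vectors might look like the following; the exact matching protocol of [9] may differ, so this is an assumption-laden illustration rather than the official evaluation code.

```python
import numpy as np

def f_score(pred_frames: np.ndarray, gt_frames: np.ndarray) -> float:
    """Harmonic mean of precision and recall between two binary
    frame-selection vectors (1 = frame included in the summary)."""
    overlap = float(np.logical_and(pred_frames, gt_frames).sum())
    if overlap == 0:
        return 0.0
    precision = overlap / pred_frames.sum()
    recall = overlap / gt_frames.sum()
    return 2 * precision * recall / (precision + recall)

# Example:
# f_score(np.array([1, 1, 0, 0, 1]), np.array([0, 1, 1, 0, 1]))  # -> 0.666...
```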
TABLE I
THE STATISTICS FOR SUMMARY ATTRIBUTES OF SUMMARIES GENERATED BY BASELINES AND MANVS-AUTO. THE IC, CON, TNS AND TAL RESPECTIVELY
REPRESENT INFORMATION COVERAGE, CONSISTENCY, THE NUMBER OF SHOTS IN THE SUMMARY AND THE AVERAGE LENGTH OF SHOTS

we set n as 1 and μ ∈ {0.3, 0.5}, and when reporting mIoU, we


set μ ∈ {0.1, 0.3, 0.5, 0.7}.
3) In the user study of R1b and R3, 100 participants were invited to watch the involved video summaries, and they were asked to rate every video on a 7-point Likert scale (1 = poor, 7 = excellent), taking visual attraction (VA) and narrative completeness (NC) into account. NC reflects viewers' subjective perception of the narrative coherence and content integrity of the generated summaries, while VA reflects viewers' viewing experience. Before rating, we explained the whole procedure of our model and presented to participants some example videos that contain incomplete shots and do not meet the aesthetic guidelines. When displaying these videos, we explained the definition of NC and VA to each participant for a more precise understanding. The videos that participants were required to rate are not included in these examples, and the concepts were emphasized again when they started rating. Videos were displayed in full-screen mode on calibrated 27-inch LED monitors (Dell P2717H). Viewing conditions are in accordance with the guidelines of international standard procedures for multimedia subjective testing [65]. The subjects are all university undergraduate or graduate students with at least two years' experience in image processing, and they reported browsing videos frequently. About 40% of the subjects are female, and all subjects are aged from 20 to 27 years. Before giving the final rating, we allowed participants to watch each video multiple times. Besides, we took the time needed to make video summaries with the different methods into account in R3. Similarly, we asked 100 participants to rate each video according to the given requirements in the user study of R2b.

4) To further investigate the quality of the generated summaries, we analyse some of their desirable attributes in experiment R1b, following [17]. Specifically, our study asked 100 participants to first read a text summary of the input video. The text summaries of the TVsum videos are made manually, while those of the documentaries and movies are taken from Wikipedia. Subsequently, participants were asked to watch the generated video summaries and determine whether a particular plot or entity described by the text summary appeared in the video, answering "Yes" if they were certain it was present and "No" if it was absent. Afterwards, we compute the information coverage (IC), i.e., the ratio of the selected entities or plots to the total number of representative ones; this measures how much representative information from the input video is retained in the summary. Furthermore, we record the ratio of complete sentences to emerging sentences read by the voiceover in the summary, so as to obtain consistency (CON); intuitively, videos with high consistency provide viewers with an enjoyable viewing experience. Finally, we count the number of shots in the summary (TNS) and the average length of shots (TAL), which measure shot-switching frequency and content completeness; videos with too many or too few shot switches degrade the viewer experience and satisfaction.
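As a hedged illustration of how these four attributes can be tallied, the short sketch below assumes simple per-summary bookkeeping: participants' yes/no answers for IC, a per-sentence completeness flag for CON, and shot boundaries in seconds for TNS and TAL. It is not the authors' analysis script, only one plausible reading of the definitions above.

from typing import List, Tuple

def information_coverage(answers: List[bool]) -> float:
    """IC: fraction of representative entities/plots judged present ("Yes")."""
    return sum(answers) / len(answers)

def consistency(sentence_complete: List[bool]) -> float:
    """CON: ratio of completely read sentences among the sentences
    emerging in the summary's voiceover."""
    return sum(sentence_complete) / len(sentence_complete)

def shot_stats(shots: List[Tuple[float, float]]) -> Tuple[int, float]:
    """TNS and TAL: number of shots and average shot length in seconds."""
    lengths = [end - start for start, end in shots]
    return len(shots), sum(lengths) / len(lengths)

# Toy usage for one generated summary.
ic = information_coverage([True, True, False, True])           # 0.75
con = consistency([True, True, True])                          # 1.0
tns, tal = shot_stats([(0.0, 4.2), (4.2, 9.8), (9.8, 13.0)])   # 3 shots
print(ic, con, tns, tal)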

Fig. 6. User study questions and results for individual component and aesthetic constraint.

E. Results and Analysis

Fig. 6 and Tables I–VII show our results. Among them, Table I to Table IV show the comparison in objective and subjective evaluations between MANVS-auto and several baseline methods. Table V, Table VI and Fig. 6 show the results of the different kinds of ablation studies. Table VII shows the comparison of the time spent and the quality evaluation of the summaries between the manual editing method and ours.

TABLE II: The user study of the visual attraction (VA) and narrative completeness (NC) of video summaries generated by state-of-the-art methods and MANVS-auto.
TABLE III: The statistical significance (P-value) of the improvement of MANVS-auto over the other methods in terms of VA and NC (Wilcoxon test). Numbers 1 to 10 correspond to the methods from Random to VSN in Table II, respectively.
TABLE IV: The quantitative evaluations of MANVS-auto and the baseline methods on the TVsum dataset.
TABLE V: Ablation study of different components in VSM on the documentary and MPII movie description datasets.
TABLE VI: Ablation study of different components in MANVS on the TVsum dataset.
TABLE VII: Comparison between professional manual editing and ours.

RQ1: Table I tabulates the video attributes of the generated summaries. We clearly observe that our method achieves the best performance. Specifically, our method performs best on the CON and IC attributes. Especially with the support of the shot complement strategies, the consistency attribute is far superior to that of the other methods and can reach 100%. In addition, the TAL of the summaries generated by our method conforms to the cinematographic aesthetic guidelines.

From the user study in Table II, we observe that our method achieves the highest scores on both visual attraction and narrative completeness, which demonstrates that, by leveraging multimodal information and aesthetic guidance, our method can produce high-quality summaries. For visual attraction, the appropriate shot length helps improve the professionalism of the entire video summary, neither boring the audience nor affecting visual continuity, whereas the shots selected by the other methods are either too long or too short. For narrative completeness, an incomplete voiceover and low information coverage greatly reduce the audience's understanding of the video content.

To validate the performance of our method on both VA and NC from a statistical perspective, a Wilcoxon test was performed on the user study data. The P-values between MANVS-auto and every comparison method were calculated to assess statistical significance. The results shown in Table III suggest that our method is significantly better than the other methods under the P-value threshold of 0.01 in terms of both VA and NC.
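The reported significance test can be reproduced with a standard Wilcoxon signed-rank test on paired per-video ratings. The sketch below uses SciPy and assumes two equal-length arrays of ratings for MANVS-auto and one comparison method; the specific numbers are hypothetical and the pairing scheme is our assumption rather than the authors' exact protocol.

import numpy as np
from scipy.stats import wilcoxon

# Paired VA (or NC) ratings for the same videos: MANVS-auto vs. one baseline.
# The values below are hypothetical 7-point Likert scores.
ours = np.array([6.1, 5.8, 6.4, 5.9, 6.2, 6.0, 5.7, 6.3])
baseline = np.array([4.9, 5.0, 5.2, 4.6, 5.1, 4.8, 4.7, 5.0])

# One-sided test: is MANVS-auto rated significantly higher than the baseline?
stat, p_value = wilcoxon(ours, baseline, alternative="greater")
print(f"W={stat:.1f}, p={p_value:.4f}, significant at 0.01: {p_value < 0.01}")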

Fig. 7. Summarization examples (on TVsum). The five bars below each video represent the results generated by DR-Sup, VAS, DSN-AB, MANVS-auto and the ground truth, respectively. The long gray bars and the short colored bars are the time stream of the video and the selected key frames, respectively.

According to the quantitative evaluation on TVsum in Table IV, our method clearly outperforms the baselines. Comparing the F-scores of HSA and DR, we find that our method improves the performance significantly. Compared with DR-Sup, the F-score of MANVS-auto has a 5% gain. Besides, Fig. 7 illustrates some summarization examples generated on TVsum, where we compare the durations of the key frames selected by MANVS-auto with those of DR-Sup, VAS, DSN-AB and the ground truth. The results show that, compared with the baselines, the shots selected by our method are closer to the ground truth.
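For reference, the keyframe-level F-score typically reported on TVsum can be computed as sketched below over binary per-frame selection masks. This is a generic illustration under that representation assumption, not the exact evaluation protocol or code used in the paper.

import numpy as np

def keyframe_fscore(selected: np.ndarray, ground_truth: np.ndarray) -> float:
    """F-score between a predicted and a reference binary key-frame mask."""
    selected = selected.astype(bool)
    ground_truth = ground_truth.astype(bool)
    if selected.sum() == 0 or ground_truth.sum() == 0:
        return 0.0
    overlap = np.logical_and(selected, ground_truth).sum()
    precision = overlap / selected.sum()
    recall = overlap / ground_truth.sum()
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Toy example: a 10-frame video, 1 marks frames included in the summary.
pred = np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 1])
gt = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
print(f"F-score: {keyframe_fscore(pred, gt):.3f}")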
It is worth noting that the quantitative results in Table IV are inconsistent with the user study results in Table II. For example, the F-score of HSA in Table IV is lower than that of DSN-AF and VAS, but its visual attraction and narrative completeness in Table II are higher than both. The reason may be that the annotations only record the importance of each frame, and the objective evaluation metrics in Table IV calculate the scores based on the selected frames, so they cannot evaluate the overall quality of a video summary [17]. Combining the results of Table I, we notice that the average shot length of the former six methods is too short, which results in frequent scene changes and makes the video summaries look like a selection of several frames out of some consecutive frames. Though they select most of the representative entities or plots, these appear intermittently or only for a few seconds in the video summaries. Therefore, such a summary cannot tell a fluent story and has a low consistency ratio, which means that for many of the sentences emerging in the summary only a few words are read by the voiceover; this dramatically reduces the narrative completeness and visual attraction of the generated summary, especially for strongly narrative videos such as documentaries and movies. In contrast, the performance of HSA in Table II is relatively better than that of the former six baselines, possibly because of its longer shot length and higher consistency ratio. However, compared to our method, it also suffers from a lower ratio of complete sentences, less information coverage and an overlong average shot length, which is tedious to watch. These results prove that our method can achieve the best results in both the subjective and objective evaluations.

RQ2: Table V shows the experimental results of ablation study R2a. We see that the performance significantly increases when leveraging methods with VL and TSS. In particular, our method achieves 48.56% and 41.55% mIoU on the Documentary Description and MPII datasets respectively, which is the best among all approaches. Besides, we find that the performance of LG4 without TSS is slightly lower than that of the alternative VSL, whereas LG4 + TSS outperforms the alternative VSL + TSS by a large margin. The reason may be that, after utilizing TSS, the generated sub-video is much shorter than the original one: VSL performs better than LG4 in dealing with long videos, while LG4 is better at localizing language in short videos. Fig. 8 shows some shot localization results of our component and the alternatives, and indicates that our VSM component can acquire the shots that users are interested in. The above results demonstrate that the performance of our VSM component is better than that of the alternatives and that each cascaded component plays a vital role.

Fig. 6 and Table VI present the subjective and objective results of ablation study R2b, respectively. For the shots selection components, we find that w/o SS causes serious impacts on the narrative completeness of the video summaries compared with the original method. For instance, w/o SS scores only 3.5 on the documentary in the user study, 1.8 lower than our result. Besides, the F-score of w/o SS on TVsum also suffers from low accuracy, with a value of only 30.60%. These results show that narrative information is an important factor affecting the quality of the generated summary, and that the SS component is able to capture narrative details. In addition, w/o VSM achieves performance close to our result in Fig. 6. This is in line with our expectation, since the only difference in this group is the shots matching the procedural text. Besides, the subjective evaluation results indicate that KSS and HE each play a positive role in their corresponding aspects. For the aesthetic constraints and the post-processing component, there are some score discrepancies between w/o SL, w/o BS,
w/o CC and our original method in the user study; the scores are lower than our result by 0.5, 0.7 and 1.1, respectively, for the movie summary. This is possibly because they improve the viewing experience by providing shots of an appropriate length, supplementing visual content and preserving color smoothness, respectively. However, w/o SSA and w/o OM perform nearly the same as the original method. This is reasonable, since most of our videos are professional videos and hardly suffer from the issues of shot stability and opposite movement. Apart from that, w/o PP performs the worst, with a score of only 2.7 for the documentary summary, 3.3 lower than our result. This is because w/o PP significantly reduces the attraction by directly assembling the incoherent BGM clips together. For the objective evaluation, removing an individual aesthetic constraint causes only a slight score discrepancy, and the score stays the same for w/o PP because the function of PP lies in producing a coherent BGM and user-designed voiceover rather than selecting frames. These results verify the effectiveness of our multimodal-based shots selection and aesthetic-guided shots assembly modules.

Fig. 8. Some visual semantic matching examples on the documentary and MPII Movie datasets. Procedural texts are listed atop the thumbnails. Our results are outlined in green.

RQ3: As shown in Table VII, our results are close to the quality of the video summaries generated by experienced video producers. For instance, MANVS is only lower than the manual editing result on the documentary by 0.4 and 0.7 in visual attraction and narrative completeness, respectively. More importantly, manually editing a documentary summary costs 2 hours 28 minutes 12 seconds, whereas MANVS only costs 12 minutes 34 seconds. Furthermore, MANVS-auto does not cost any human production time, such as writing procedural text or providing user-designed voiceover and background music, because it is completely automatic. Therefore, the results of MANVS-auto are slightly inferior to those of MANVS in narrative completeness and visual attraction as a whole. After finishing the task, we invited the producers to comment. From these comments, we can draw the following conclusions. Even if the fast-forward function is leveraged to browse the video quickly, a lot of time is spent watching and cutting it when dealing with an unfamiliar video. Besides, these videos involve different places and seasons and the visual content changes frequently, so it costs a lot of time to assemble the appropriate video shots and add the necessary fade-in and fade-out effects. In contrast, users only need to spend a little time writing procedural text with our method (MANVS), or do not need to spend any time at all (MANVS-auto). This illustrates that our method can rapidly produce a high-quality video summary.

VII. DISCUSSION AND FUTURE WORK

We have proposed MANVS, a novel method for generating narrative video summaries with aesthetic appeal. By inputting a narrative video and the corresponding subtitle text, MANVS automatically selects representative video shots and assembles them into a visually appealing video summary based on cinematographic aesthetics guidelines. Our method also allows users to 1) design procedural text to find the desired video shots, and 2) provide a customized voiceover to create a personalized video summary. Both the experiments and the user study show that MANVS can significantly save editing time and help users generate satisfactory narrative video summaries. However, our current MANVS still has some limitations, which also point out directions for future work.

Multimodal Feature Fusion: To the best of our knowledge, there are currently no multimodal datasets available for the video summarization task. Learning-based multimodal feature fusion heavily depends on a large amount of well-labelled audio, subtitle and visual data. In the future, constructing multimodal datasets and training models suitable for video summarization may motivate more extensive applications.

Consistency for Subtitle Summarization: Our subtitle summarization component relies on an extractive text summarization method. However, if the subtitle text contains too many pronouns, directly connecting multiple sentences into a summary may produce inconsistency of subjects, which has a negative impact on understanding the video. Although other components in the shots selection module may make up for this defect to some extent, semi-automatic generation of the subtitle summary combined with script information and user interaction may bring better performance.

Detailed Expression of Film Art: Our aesthetic constraints are set according to general film rules, but sometimes, in order to reflect a special artistic style, the photography conventions may be deliberately broken. Our shots assembly component may face challenges when dealing with videos that use special narrative methods such as flashback. In the future, we will combine more interactions with users to provide various artistic modes of fine shot switching.

ACKNOWLEDGMENT

The authors would like to thank S. Zhao, who helped improve the presentation of this paper. Finally, we would like to express our great appreciation again to the editors and reviewers for their valuable comments and suggestions on our manuscript.

REFERENCES

[1] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli, "Video storytelling: Textual summaries for events," IEEE Trans. Multimedia, vol. 22, no. 2, pp. 554-565, 2020.
[2] H. Li, J. Zhu, C. Ma, J. Zhang, and C. Zong, "Read, watch, listen, and summarize: Multi-modal summarization for asynchronous text, image, audio and video," IEEE Trans. Knowl. Data Eng., vol. 31, no. 5, pp. 996-1009, May 2019.
[3] Y. Saquil, D. Chen, Y. He, C. Li, and Y.-L. Yang, "Multiple pairwise ranking networks for personalized video summarization," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 1698-1707.
[4] X. Li, H. Li, and Y. Dong, "Meta learning for task-driven video summarization," IEEE Trans. Ind. Electron., vol. 67, no. 7, pp. 5778-5786, Jul. 2020.
[5] R. Panda and A. K. Roy-Chowdhury, "Multi-view surveillance video summarization via joint embedding and sparse optimization," IEEE Trans. Multimedia, vol. 19, no. 9, pp. 2010-2021, Sep. 2017.
[6] A. Truong, F. Berthouzoz, W. Li, and M. Agrawala, "QuickCut: An interactive tool for editing narrated video," in Proc. ACM Symp. User Interface Softw. Technol., 2016, pp. 497-507.
[7] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, "Summarizing videos with attention," in Proc. Asian Conf. Comput. Vis., 2018, pp. 39-54.
[8] B. Zhao, X. Li, and X. Lu, "HSA-RNN: Hierarchical structure-adaptive RNN for video summarization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7405-7414.
[9] M. Basavarajaiah and P. Sharma, "Survey of compressed domain video summarization techniques," ACM Comput. Surv., vol. 52, no. 6, pp. 1-29, 2019.
[10] S. A. Ahmed et al., "Query-based video synopsis for intelligent traffic monitoring applications," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 8, pp. 3457-3468, Aug. 2020.
[11] J.-H. Choi and J.-S. Lee, "Automated video editing for aesthetic quality improvement," in Proc. ACM Multimedia, 2015, pp. 1003-1006.
[12] P. Varini, G. Serra, and R. Cucchiara, "Personalized egocentric video summarization of cultural tour on user preferences input," IEEE Trans. Multimedia, vol. 19, no. 12, pp. 2832-2845, Dec. 2017.
[13] W. Zhu, J. Lu, J. Li, and J. Zhou, "DSNet: A flexible detect-to-summarize network for video summarization," IEEE Trans. Image Process., vol. 30, pp. 948-962, 2021.
[14] M. Sun, A. Farhadi, B. Taskar, and S. Seitz, "Summarizing unconstrained videos using salient montages," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2256-2269, Nov. 2017.
[15] S.-P. Lu, S.-H. Zhang, J. Wei, S.-M. Hu, and R. R. Martin, "Timeline editing of objects in video," IEEE Trans. Vis. Comput. Graph., vol. 19, no. 7, pp. 1218-1227, Jul. 2012.
[16] B. Zhao, X. Li, and X. Lu, "Property-constrained dual learning for video summarization," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 10, pp. 3989-4000, Oct. 2020.
[17] M. Otani, Y. Nakashima, E. Rahtu, and J. Heikkila, "Rethinking the evaluation of video summaries," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7596-7604.
[18] K. Zhou, Y. Qiao, and T. Xiang, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 7582-7589.
[19] L. Yuan, F. E. H. Tay, P. Li, and J. Feng, "Unsupervised video summarization with cycle-consistent adversarial LSTM networks," IEEE Trans. Multimedia, vol. 22, no. 10, pp. 2711-2722, Oct. 2020.
[20] K. Zhang, K. Grauman, and F. Sha, "Retrospective encoders for video summarization," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 383-399.
[21] Z. Li, J. Tang, X. Wang, J. Liu, and H. Lu, "Multimedia news summarization in search," ACM Trans. Intell. Syst. Technol., vol. 7, no. 3, pp. 1-20, 2016.
[22] T. Makino, T. Iwakura, H. Takamura, and M. Okumura, "Global optimization under length constraint for neural text summarization," in Proc. Assoc. Comput. Linguist., 2019, pp. 1039-1048.
[23] S. Xu et al., "Self-attention guided copy mechanism for abstractive summarization," in Proc. Assoc. Comput. Linguist., 2020, pp. 1355-1362.
[24] H. Jin, T. Wang, and X. Wan, "Multi-granularity interaction network for extractive and abstractive multi-document summarization," in Proc. Assoc. Comput. Linguist., 2020, pp. 6244-6254.
[25] J. Xu, Z. Gan, Y. Cheng, and J. Liu, "Discourse-aware neural extractive text summarization," in Proc. Assoc. Comput. Linguist., 2020, pp. 5021-5031.
[26] A. Pavel, C. Reed, B. Hartmann, and M. Agrawala, "Video digests: A browsable, skimmable format for informational lecture videos," in Proc. ACM Symp. User Interface Softw. Technol., 2014, pp. 1-10.
[27] S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, "COOT: Cooperative hierarchical transformer for video-text representation learning," in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1-14.
[28] X. Qu et al., "Fine-grained iterative attention network for temporal language localization in videos," in Proc. ACM Multimedia, 2020, pp. 4280-4288.
[29] D. Cao et al., "Adversarial video moment retrieval by jointly modeling ranking and localization," in Proc. ACM Multimedia, 2020, pp. 898-906.
[30] S. Chen and Y.-G. Jiang, "Semantic proposal for activity localization in videos via sentence query," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8199-8206.
[31] H. Xu et al., "Multilevel language and vision integration for text-to-clip retrieval," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 9062-9069.
[32] Z. Li, J. Tang, and T. Mei, "Deep collaborative embedding for social image understanding," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2070-2083, Sep. 2019.
[33] H. Zhang, A. Sun, W. Jing, and J. T. Zhou, "Span-based localizing network for natural language video localization," in Proc. Assoc. Comput. Linguist., 2020, pp. 6543-6554.
[34] M. Merler et al., "The excitement of sports: Automatic highlights using audio/visual cues," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2520-2523.
[35] Z. Guo et al., "TaoHighlight: Commodity-aware multi-modal video highlight detection in e-commerce," IEEE Trans. Multimedia, vol. 24, pp. 2606-2616, 2022.
[36] M. Merler et al., "Automatic curation of sports highlights using multimodal excitement features," IEEE Trans. Multimedia, vol. 21, no. 5, pp. 1147-1160, May 2019.
[37] T. Decroos, V. Dzyuba, J. Van Haaren, and J. Davis, "Predicting soccer highlights from spatio-temporal match event streams," in Proc. AAAI Conf. Artif. Intell., 2017, pp. 1302-1308.
[38] L. Hu et al., "Detecting highlighted video clips through emotion-enhanced audio-visual cues," in Proc. IEEE Int. Conf. Multimedia Expo, 2021, pp. 1-6.
[39] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Trans. Multimedia, vol. 17, no. 10, pp. 1733-1746, Oct. 2015.
[40] Y. Zhang, L. Zhang, and R. Zimmermann, "Aesthetics-guided summarization from multiple user generated videos," ACM Trans. Multimed. Comput. Commun. Appl., vol. 11, no. 2, pp. 1-23, 2015.
[41] Y. Niu and F. Liu, "What makes a professional video? A computational aesthetics approach," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 7, pp. 1037-1049, Jul. 2012.
[42] B. Huber, H. V. Shin, B. Russell, O. Wang, and G. J. Mysore, "B-Script: Transcript-based B-roll video editing with recommendations," in Proc. SIGCHI Conf. Hum. Factors Comput. Syst., 2019, pp. 1-11.
[43] M. Wang, G.-W. Yang, S.-M. Hu, S.-T. Yau, and A. Shamir, "Write-a-video: Computational video montage from themed text," ACM Trans. Graph., vol. 38, no. 6, pp. 1-13, 2019.
[44] T. Hu, Z. Li, W. Su, X. Mu, and J. Tang, "Unsupervised video summaries using multiple features and image quality," in Proc. IEEE Third Int. Conf. Multimedia Big Data (BigMM), 2017, pp. 117-120.
[45] B. Shi et al., "Learning semantic concepts and temporal alignment for narrated video procedural captioning," in Proc. ACM Multimedia, 2020, pp. 4355-4363.
[46] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to Information Retrieval, vol. 39. Cambridge, U.K.: Cambridge Univ. Press, 2008.
[47] M. Zhong, P. Liu, D. Wang, X. Qiu, and X. Huang, "Searching for effective neural extractive summarization: What works and what's next," in Proc. Assoc. Comput. Linguist., 2019, pp. 1049-1058.
[48] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6000-6010.
[49] Y.-C. Chen and M. Bansal, "Fast abstractive summarization with reinforce-selected sentence rewriting," in Proc. Assoc. Comput. Linguist., 2018, pp. 675-686.
[50] J. Mun, M. Cho, and B. Han, "Local-global video-text interactions for temporal grounding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10810-10819.
[51] D. A. Hudson and C. D. Manning, "Compositional attention networks for machine reasoning," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1-20.
[52] J.-H. Kim et al., "Hadamard product for low-rank bilinear pooling," in Proc. Int. Conf. Learn. Representations, 2017, pp. 1-17.
[53] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794-7803.
[54] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 404-417.
[55] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, "Spleeter: A fast and efficient music source separation tool with pre-trained models," J. Open Source Softw., vol. 5, no. 50, pp. 1-4, 2020.
[56] T. Hayashi et al., "ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7654-7658.
[57] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5179-5187.
[58] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3202-3212.
[59] M. Rochan, L. Ye, and Y. Wang, "Video summarization using fully convolutional sequence networks," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 347-363.
[60] B. Zhao, M. Gong, and X. Li, "Hierarchical multimodal transformer to summarize videos," Neurocomputing, vol. 468, pp. 360-369, 2022.
[61] M. Rochan and Y. Wang, "Video summarization by learning from unpaired data," in Proc. CVPR, 2019, pp. 7902-7911.
[62] R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proc. SIGNLL, 2016, pp. 280-290.
[63] J. Gao, C. Sun, Z. Yang, and R. Nevatia, "TALL: Temporal activity localization via language query," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5267-5275.
[64] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proc. CVPR, 2017, pp. 6299-6308.
[65] "Subjective video quality assessment methods for multimedia applications," document ITU-T P.910, 2008.

Jiehang Xie received the master's degree in computer software and theory from Shaanxi Normal University, Shaanxi, China, in 2020. He is currently working toward the Ph.D. degree with the College of Computer Science, Nankai University, Tianjin, China. His research interests include multimodal and multimedia analysis, and affective computing.

Xuanbai Chen received the B.E. degree in computer science and technology in 2021 from Nankai University, Tianjin, China. He is currently working toward the M.S. degree in computer vision with the Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. His research interests include domain adaptation and video summarization in computer vision.

Tianyi Zhang (Graduate Student Member, IEEE) is currently working toward the Ph.D. degree with the Faculty of Electrical Engineering, Mathematics & Computer Science, Delft University of Technology, Delft, The Netherlands. He is associated with the Distributed & Interactive Systems Group, Centrum Wiskunde & Informatica, the national research institute for mathematics and computer science in The Netherlands. His research interests include human-computer interaction and machine learning based affective computing.

Yixuan Zhang is currently working toward the undergraduate degree with Nankai University, Tianjin, China. His research interests include movie summarization and sentiment analysis.

Shao-Ping Lu (Member, IEEE) received the Ph.D. degree in computer science from Tsinghua University, Beijing, China. He was a Postdoc and Senior Researcher with Vrije Universiteit Brussel, Brussels, Belgium. He has been an Associate Professor with Nankai University, Tianjin, China. His research interests include visual computing, with particular focus on computational photography, 3D image and video representation, visual scene analysis, and machine learning.

Pablo Cesar (Senior Member, IEEE) currently leads the Distributed & Interactive Systems Group at Centrum Wiskunde & Informatica (CWI) and is a Professor with TU Delft, Delft, The Netherlands. His research interests include HCI and multimedia systems, and focus on modelling and controlling complex collections of media objects distributed in time and space. He is an ACM Distinguished Member and part of the Editorial Board of IEEE Multimedia, ACM Transactions on Multimedia, and IEEE Transactions on Multimedia. He was the recipient of the prestigious Netherlands Prize for ICT Research in 2020 for his work on human-centered multimedia systems. He is the principal investigator from CWI in a number of national and European projects, and acted as an invited expert at the European Commission's Future Media Internet Architecture Think Tank.

Yulu Yang received the B.E. degree from Beijing Agriculture Engineering University, Beijing, China, in 1984, and the M.E. and Ph.D. degrees from Keio University, Tokyo, Japan, in 1993 and 1996, respectively. He is currently a Full Professor with the Department of Computer Science, Nankai University, Tianjin, China. His research interests include parallel processing and intelligence computing.
