Multimodal-Based and Aesthetic-Guided Narrative Video Summarization

IEEE Transactions on Multimedia, Vol. 25, 2023
Abstract—Narrative videos usually convey their main content through multiple narrative channels, such as audio, video frames, and subtitles. Existing video summarization approaches rarely consider these multi-dimensional narrative inputs, or they ignore the impact of artistic shot assembly when applied directly to narrative videos. This paper introduces a multimodal-based and aesthetic-guided narrative video summarization method. Our method leverages multimodal information, including visual content, subtitles, and audio, through dedicated key shots selection, subtitle summarization, and highlight extraction components. Furthermore, guided by cinematographic aesthetics, we design a novel shots assembly module that ensures shot content completeness and assembles the selected shots into the desired summary. In addition, our method provides flexible shot specification: it automatically selects semantically related shots according to user-designed text. Through extensive quantitative evaluations and user studies, we demonstrate that our method effectively preserves the important narrative information of the original video and can rapidly produce high-quality, aesthetic-guided narrative video summaries.

Index Terms—Narrative video summarization, multimodal information, aesthetic guidance.
Manuscript received 22 September 2021; revised 5 April 2022; accepted 31 May 2022. Date of publication 15 June 2022; date of current version 30 October 2023. This work was supported by NSFC under Grants 62132012 and 61972216. The Associate Editor coordinating the review of this manuscript and approving it for publication was Prof. Ngai-Man Cheung. (Corresponding author: Shao-Ping Lu.)
Jiehang Xie, Xuanbai Chen, Yixuan Zhang, Shao-Ping Lu, and Yulu Yang are with TKLNDST, CS, Nankai University, Tianjin 300071, China (e-mail: jiehangxie@mail.nankai.edu.cn; 1711314@mail.nankai.edu.cn; 2011432@mail.nankai.edu.cn; slu@nankai.edu.cn; yangyl@nankai.edu.cn).
Tianyi Zhang and Pablo Cesar are with Centrum Wiskunde & Informatica, 1098 XG Amsterdam, The Netherlands (e-mail: tianyi.zhang@cwi.nl; p.s.cesar@cwi.nl).
This article has supplementary downloadable material available at https://doi.org/10.1109/TMM.2022.3183394, provided by the authors.
Digital Object Identifier 10.1109/TMM.2022.3183394

I. INTRODUCTION

NARRATIVE videos, such as documentaries, movies, and scientific explainers, combine immersive visual information with narrated story-telling subtitles, voiceover, and background music (BGM) [1], [2]. As a huge number of narrative videos are uploaded to various online social platforms, there is an urgent need for narrative video summaries that help viewers browse and understand the content quickly, and that can be presented on knowledge popularization platforms and in many other applications [3]–[5].

Creating a high-quality narrative video summary is currently a very challenging task. A common manual workflow begins with the outline writing stage [6], in which the user (video producer) writes a story outline by repeatedly watching the given video. The story outline records the narrative threads and the sequence of important events. The next stage is usually shots selection, where the user chooses the shots that match the outline content from the given video and decides the cut points and timing (beginning and ending time points, and shot length) for each selected shot. Experienced users also aim to capture highlight and preference/personalized shots that are not mentioned in the outline, to ensure that the generated summary is sufficiently diverse. The last stage is shots assembly, where the user composes the selected shots into a summary in a reasonable order. In this stage, experienced users try to balance multiple criteria based on video editing conventions and aesthetic guidelines (e.g., avoiding overly short shots, including complete shot content, and ensuring color continuity between adjacent shots). Thus, manually creating a narrative video summary is labor-intensive, even for experienced users.

In this context, various automatic video summarization approaches have been introduced in the research community [7]–[9]. However, automatically producing short yet coherent summaries for long narrative videos is extremely difficult [10]. Conventional methods that rely solely on visual features struggle to capture narrative threads and highlights, let alone present personalized visual content [11], [12]. Moreover, many works rely on the strong encoding ability of neural networks to skip the outline writing stage, and the shots selection stage is usually completed by constructing a classifier. However, these works all assume that the selected shots conform to aesthetic guidelines, so their shots assembly is merely a process of stitching the shots together in chronological order. In practice, a model often cannot obtain a good or complete shot in the shots selection stage; for instance, the selected shots are either too short to cover a complete voiceover or too long to keep viewers interested, resulting in summaries of limited quality. Therefore, in this work we propose a multimodal-based and aesthetic-guided narrative video summarization framework, named MANVS, to generate high-quality video summaries. This framework selects meaningful and personalized shots based on multimodal information, and considers the completeness of shot content and the aesthetics of the selected shots in the shots assembly stage, to assemble them into a video with smooth visual transitions while preserving an overall pleasing aesthetic.
We assume that the user wants to generate a condensed version of the given narrative video that allows viewers to obtain a comprehensive overview of it quickly. Moreover, the time-aligned subtitles of the given video are assumed to be available, gathered from online or personal resources. Several technical challenges need to be addressed in our approach. First, the method must determine, based on the multimodal information, which shots contain important content and need to be selected. Second, if a user wants to select personalized visual content from the input video and preserve it in the summary, how does the model locate this personalized content? Finally, professional videos usually share a group of features that follow the conventions and aesthetic guidelines of video editing, which distinguish them from amateur ones and make them more visually attractive. Thus, when assembling the selected shots into a video summary, both cinematographic aesthetics and the completeness of shot content need to be considered jointly. Note that even if the completeness of a shot's content is well preserved, the shot may not be visually attractive enough; similarly, an engaging shot does not necessarily contain a complete voiceover.

To address the challenges above, our proposed MANVS consists of two main modules. The first is the multimodal-based shots selection module, which integrates visual, audio, and time-aligned subtitle information into the shots selection process to capture meaningful narrative content from consecutive sequences of shots. Furthermore, this module provides a flexible way to acquire the shots that users are interested in, allowing users to choose a shot by inputting text. Next, the aesthetic-guided shots assembly module first filters repetitive and low-quality shots, and then automatically checks whether the selected shots are content-complete. If a shot fails these checks, we follow cinematographic aesthetic guidelines and apply a series of shot completion strategies to achieve a good trade-off between aesthetics and shot completeness, and finally assemble the shots into a video with smooth visual transitions.

In summary, the contributions of this paper are as follows.
- We design an aesthetic-guided shots assembly module, which establishes a series of strategies to preserve shot content completeness and aesthetics. To the best of our knowledge, we are the first to consider aesthetic guidelines in the shots assembly stage.
- We present a multimodal-based shots selection module that comprehensively analyzes subtitle, image, and audio information to capture narrative, representative, and highlight shots. Besides, we provide a flexible way for shots selection so that users can choose the shots they desire to watch.
- We conduct extensive quantitative evaluations and user studies to evaluate the effectiveness of MANVS. The results demonstrate that our method can generate video summaries of quality comparable to those produced by experts, while being much less time-consuming.

II. RELATED WORK

In this section, we briefly introduce techniques for video summarization, text summarization, language video localization, highlight detection, and computational cinematography, which are most relevant to our work.

Video summarization: The main objective of general video summarization is to produce a shortened video containing the most representative visual information of a given one [14], [15]. A typical video summarization solution usually begins with selecting key frames or video segments [16], [17], and methods can mainly be divided into supervised and unsupervised styles. The former assumes human annotations of key frames in the original videos [7]. Since constructing datasets with sufficient labels is very difficult in practice, unsupervised learning based approaches have also appeared [18], [19]. However, although the aforementioned methods can obtain important visual information from original videos, they share some common disadvantages. For example, some image information is considered only by searching for shot boundaries [20] in the shots selection process, where switched shots are regarded as important content and the multimodal information of the original video is ignored. Consequently, the generated video summary loses a lot of information, which makes it look like a truncated version of the original video without coherent narrative information. Current research shows that human cognition is a process of cross-media information interaction [21]. Therefore, multimodal information such as video frames, audio, and subtitles should be leveraged to select crucial shots and provide viewers with vivid and comprehensive content. Which shots should be chosen, and how to assemble them into a video with smooth visual transitions while preserving overall good aesthetics, deserve our attention.

Text summarization: Approaches to text summarization can be briefly classified into two categories: abstractive and extractive [22]. The former usually generates new sentences to express the crucial information [23]. However, state-of-the-art methods of this class are likely to generate abstracts that are not fluent or that introduce grammatical errors [24]. In contrast, extractive methods focus on selecting subsets of sentences containing important contextual information from the source text and assembling them into a text summary [25]. The main advantage of this kind of method is that grammatical errors are avoided in the sentences of the generated text summary. Pavel et al. [26] use crowdsourcing to create a text summary of the original video and find the corresponding crucial video frames according to the content of the text summary. However, crowdsourcing is not only time-consuming and laborious, but also cannot ensure the overall consistency of summaries generated by different workers. Inspired by the above methods, we further exploit the fact that subtitles and video shots are semantically relevant [27], and our work concentrates on automatically segmenting the subtitle document to extract the text summary, which effectively helps generate topically coherent video summaries.

Language video localization: The goal of language video localization is to locate a matching shot from the video that semantically corresponds to the language query [28], [29], which has attracted the attention of a large number of researchers [30], [31].
Fig. 1. The proposed MANVS framework for narrative video summarization. The upper left part is a simplified flow chart, and the upper right part shows the detailed processes corresponding to the annotated components on the left. The detect-summarize network is established based on [13]. Best viewed in color.

Xu et al. [31] propose a multilevel model that injects text features to modulate the processing of query sentences at the word level in a recurrent neural network. Li et al. [32] propose a deep collaborative embedding method for multiple image understanding tasks; it is the first approach attempting to solve this issue under the framework of deep factor analysis, and it achieves fruitful results. Chen et al. [30] introduce a semantic activity proposal that utilizes the semantic information of sentence queries to obtain discriminative activity proposals. To fill the gap between question answering and language localization, a query-guided highlighting approach was proposed in [33]. In our work, we design a semantic similarity component to reduce redundant and irrelevant shots, and then leverage the language video localization method to select personalized shots for every user in the shots selection stage.

Highlight detection: The purpose of highlight detection is to select video clips that attract viewers' attention [34], [35]. In past research, audio information has been widely used for this task, as audio-based modeling is computationally easier than video-based modeling when representing remarkable contextual semantics [36]. Highlight shots in sports videos, e.g., goals in soccer games and home runs in baseball games, trigger excited voices from sports commentators and cheers from the audience [37]. In narrative videos such as movies and documentaries, highlight shots also cause changes in the audio signal [38]: the louder and denser the video sound over a period of time, the more likely that period is a highlight [39]. Based on this observation, we take the change of video sound into consideration for our shots selection.

Computational cinematography: A desirable narrative video needs not only fluent narration, but also shots of high quality and artistic expression [40]. Niu et al. [41] discuss the aesthetic differences between professional and amateur videos by modeling image attributes, including color and noise, and video attributes, such as camera motion and shot duration. Huber et al. [42] make several observations by analyzing a large number of Vlogs on YouTube: shots for story-telling are usually A-roll, it is reasonable to insert B-roll when the speaker stops for a long time, and B-roll is usually distributed uniformly over the whole video. Wang et al. [43] utilize text information to connect multiple shots, and the well preserved attributes of videos ensure color continuity while avoiding shot movement. Hu et al. [44] propose to generate an informative and interesting summary using a set of aesthetic features, such as saturation and brightness, to select shots; without training or human annotation, their approach is convenient for processing videos. Our work mainly focuses on introducing computational cinematography to automatically improve the quality of the selected shots in the shots assembly stage.

III. OVERVIEW

Fig. 1 shows the overall architecture of our MANVS. The input of MANVS includes an original video, the corresponding time-aligned subtitles, and the user-designed procedural text. In particular, the time-aligned subtitles include the subtitle text sequence and the corresponding time maps with the video shots. We use the term procedural text [45] to emphasize that it should focus on describing a specific visual content that users are interested in. In the shots selection stage, our method employs the multimodal-based shots selection module, which consists of subtitle summarization, visual semantic matching, highlight extraction, and key shots selection components to seek the narrative, personalized, highlight, and key shots.

In the shots assembly stage, our method passes the shots obtained by the shots selection module to the aesthetic-guided shots assembly module to obtain a good summary. Note that shots selection is an ill-posed problem, and incomplete or repeated shots could be introduced in this process. If incomplete or repeated shots appear, the overall quality would significantly degrade even if the visual aesthetic of the summary itself is
where $T_{k_1}(\cdot)$ denotes selecting the top $k_1$ subtitles that are most similar to the procedural text $x$, and $n$ is the total number of subtitles in the whole narrative video. $W(\cdot,\cdot)$ denotes the word co-occurrence measure. $k_1$ and $k_2$ are set to 12 and 6, respectively.
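For illustration, the following is a minimal Python sketch of the $T_{k_1}(\cdot)$ selection: subtitles are ranked by a simple word co-occurrence count with the procedural text and the top $k_1$ are kept. The tokenization and the plain shared-word count are assumptions; the exact form of $W(\cdot,\cdot)$ is not given in this excerpt.

```python
import re
from typing import List

def word_cooccurrence(a: str, b: str) -> int:
    """W(a, b): number of word types shared by the two texts (an assumed, simple form)."""
    tokens_a = set(re.findall(r"[a-z']+", a.lower()))
    tokens_b = set(re.findall(r"[a-z']+", b.lower()))
    return len(tokens_a & tokens_b)

def top_k_subtitles(procedural_text: str, subtitles: List[str], k1: int = 12) -> List[int]:
    """T_{k1}(.): indices of the k1 subtitles most similar to the procedural text."""
    ranked = sorted(range(len(subtitles)),
                    key=lambda i: word_cooccurrence(procedural_text, subtitles[i]),
                    reverse=True)
    return ranked[:k1]
```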
Since the procedural text is a combination of semantic phrases $F(\cdot)$, its different levels of semantic information can be used to match the visual features of the created sub-video, so that the visual content can be located by the text. Therefore, the objective of our localization task is defined as maximizing an expected log-likelihood:

$$\theta^{*} = \arg\max_{\theta}\; \mathbb{E}\left\{ \log p_{\theta}\left[ S_V(C) \mid F(P) \right] \right\}, \qquad (6)$$
Shot stability: $F(f_i)$ in (8) denotes the local shake value of corner $i$ of frame $f$, computed from the corner positions and the homography between consecutive frames. The shot stability is then defined as

$$F_{SS}(s) = -\frac{1}{4N_{sf}} \sum_{f \in s} \sum_{i=1}^{4} F(f_i), \qquad (9)$$

where $f$, $f'$, and $f''$ represent three consecutive frames in shot $s$, $f_i$ is corner $i$ of $f$, and $p_f(i)$ is the position of $f_i$ in pixels, with $i = 1, \ldots, 4$. We use $H(f, f')$ to denote the homography transformation matrix between $f$ and $f'$. The shot stability $F_{SS}$ is computed as the negative average of the local shake values over time; $N_{sf}$ is the number of sampled frames in shot $s$, and the sampling step is 8.
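As an illustration, the following is a minimal Python sketch of a shot-stability proxy using OpenCV: a homography is estimated between sampled frame pairs and the displacement of the four warped corners is averaged and negated. The exact local shake value of (8) is not reproduced in this excerpt, so the corner-displacement measure used here is an assumption.

```python
import cv2
import numpy as np

def estimate_homography(frame_a, frame_b):
    """Estimate the homography H(a, b) from ORB feature matches between two frames."""
    orb = cv2.ORB_create(1000)
    kp_a, des_a = orb.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, des_b = orb.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    if des_a is None or des_b is None:
        return None
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    if len(matches) < 4:
        return None
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

def shot_stability(frames, step=8):
    """Negative average corner displacement over sampled frame pairs (proxy for F_SS)."""
    h, w = frames[0].shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    shakes = []
    sampled = frames[::step]
    for f, f_next in zip(sampled, sampled[1:]):
        H = estimate_homography(f, f_next)
        if H is None:
            continue
        warped = cv2.perspectiveTransform(corners, H)
        # local shake of each corner: how far the warped corner moves from its position
        shakes.extend(np.linalg.norm(warped - corners, axis=2).ravel())
    return -float(np.mean(shakes)) if shakes else 0.0
```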
Opposite movements: Adjacent shots with opposite camera movements may result in an unpleasant viewing experience for the audience [43]. We avoid this situation by comparing the two-dimensional motion of the shots:

$$F_{om}(f_l, f_f) = \sum_{i=1}^{4} \frac{\rho(f_l, i) \cdot \rho(f_f, i)}{\left|\rho(f_l, i)\right| \, \left|\rho(f_f, i)\right|}, \qquad (10)$$

$$\rho(f_l, i) = p_{f_l}(i) - H(f_l, f_f)\, p_{f_f}(i), \qquad (11)$$

where $f_l$ and $f_f$ respectively represent the last and the first frame of two adjacent shots $s$ and $s'$. $F_{om}$ denotes the cosine distance of the frame-corner movement vectors of $f_l$ and $f_f$, and $\rho(f_l, i)$ estimates the two-dimensional movement of the $i$-th corner $p_{f_l}(i)$ of $f_l$. $H(f_l, f_f)$ is the homography transformation matrix between $f_l$ and $f_f$.
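A minimal Python sketch of this measure is shown below. The movement vector for $f_f$ is computed symmetrically through the inverse homography, which is an assumption, since the excerpt only defines $\rho$ for $f_l$.

```python
import numpy as np

def project(H, points):
    """Apply a 3x3 homography to an (N, 2) array of pixel positions."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    mapped = (H @ homog.T).T
    return mapped[:, :2] / mapped[:, 2:3]

def opposite_movement(p_last, p_first, H_lf):
    """F_om of Eq. (10): summed cosine of the corner movement vectors of f_l and f_f.
    p_last, p_first: (4, 2) corner positions of f_l and f_f; H_lf: homography H(f_l, f_f)."""
    rho_l = p_last - project(H_lf, p_first)                 # Eq. (11)
    rho_f = p_first - project(np.linalg.inv(H_lf), p_last)  # symmetric counterpart (assumed)
    score = 0.0
    for i in range(4):
        denom = np.linalg.norm(rho_l[i]) * np.linalg.norm(rho_f[i]) + 1e-8
        score += float(np.dot(rho_l[i], rho_f[i])) / denom
    return score  # values close to -4 indicate strongly opposite camera movements
```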
Shots length: Long shots without interesting content can easily make viewers lose attention, while overly short shots may affect visual smoothness. To avoid these extremes, we set the duration of individual shots to 3 to 8 seconds, inspired by [43].
B-roll selection: In professional narrative videos, A-roll and B-roll shots are reasonably mixed [42]. For telling stories, most shots are usually A-roll, and B-roll is usually placed at the speaker's natural pauses to support the A-roll visually. In this work, we take the shots with and without subtitles as A-roll and B-roll, respectively. Notably, frequent insertion of B-roll easily distracts the audience, while too little B-roll may make the story much less interesting. Therefore, we set the interval between two B-rolls to 9 seconds, motivated by [42].
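A minimal sketch of this subtitle-based A-roll/B-roll labeling is given below; the overlap tolerance `min_overlap` is a hypothetical parameter not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def label_rolls(shots: List[Interval], subtitles: List[Interval],
                min_overlap: float = 0.2) -> List[str]:
    """Label each shot 'A-roll' if it overlaps spoken subtitles, otherwise 'B-roll'."""
    labels = []
    for s_start, s_end in shots:
        overlap = 0.0
        for t_start, t_end in subtitles:
            overlap += max(0.0, min(s_end, t_end) - max(s_start, t_start))
        labels.append("A-roll" if overlap >= min_overlap else "B-roll")
    return labels

# Example: the first shot is covered by a subtitle line, the second one is silent.
print(label_rolls([(0.0, 4.0), (4.0, 9.0)], [(0.5, 3.5)]))  # ['A-roll', 'B-roll']
```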
Color continuity: Color continuity is often the most representative feature for identifying professional videos [41]. Therefore, preserving the continuity of saturation and brightness in a shot is crucial to improving the viewing experience. In our work, the color continuity between two adjacent frames is measured by the histogram difference of the saturation and brightness:

$$F_{cc}(e, b) = \frac{1}{2}\left(\Psi\big(\eta_S(e), \eta_S(b)\big) + \Psi\big(\eta_L(e), \eta_L(b)\big)\right), \qquad (12)$$

where $F_{cc}(\cdot)$ denotes the tonal difference between two shots, $e$ and $b$ respectively represent the last and the first frame of adjacent shots, $\eta_S(\cdot)$ and $\eta_L(\cdot)$ represent the S-channel and L-channel histograms of the frame, both quantized to 256 bins, and $\Psi$ is the chi-square measure.
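A minimal OpenCV sketch of $F_{cc}$ is shown below. It assumes an HLS color conversion to obtain the S and L channels; the exact color space used in the full method is not stated in this excerpt.

```python
import cv2

def channel_hist(frame_bgr, channel):
    """256-bin normalized histogram of one HLS channel (0 = H, 1 = L, 2 = S)."""
    hls = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HLS)
    hist = cv2.calcHist([hls], [channel], None, [256], [0, 256])
    return cv2.normalize(hist, hist).flatten()

def color_continuity(last_frame, first_frame):
    """F_cc: mean chi-square distance between S- and L-channel histograms (Eq. 12)."""
    chi_s = cv2.compareHist(channel_hist(last_frame, 2),
                            channel_hist(first_frame, 2), cv2.HISTCMP_CHISQR)
    chi_l = cv2.compareHist(channel_hist(last_frame, 1),
                            channel_hist(first_frame, 1), cv2.HISTCMP_CHISQR)
    return 0.5 * (chi_s + chi_l)  # smaller values indicate a smoother tonal transition
```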
Fig. 5. Eight possible shot situations and the corresponding shot complement strategies. The long bar outlined in black is the timeline of the video.

B. Shots Assembly and Post-Processing

Based on the aforementioned classical cinematographic aesthetic guidelines, we can appropriately place the beginning and ending points of the selected shots, remove repeated shots, and extend incomplete ones. Specifically, our method automatically checks whether the selected shots meet the stability and opposite-movement constraints, and then preserves the shots that pass these checks. Furthermore, our method checks whether each preserved shot is long enough to contain the duration of a complete voiceover, based on the time-aligned subtitles. We summarize three types, adding up to eight possible situations, with corresponding shot complement strategies, as shown in Fig. 5, to make the preserved shots contain complete narrative content while satisfying aesthetic constraints such as shots length and color continuity. We now provide more details of these strategies; a sketch of strategy (II) follows the list.

1) Incomplete and non-overlapping shots: In this type, the selected shot lies within either an A-roll or a B-roll, and the timelines of the selected shots do not overlap with each other. There are therefore two situations. (I) Incomplete A-roll (ξA): it should be lengthened so that its timeline equals that of a complete A-roll (δA), or we directly remove it once the timeline ratio of ξA to δA is smaller than a constant threshold (e.g., 0.5). (II) Incomplete B-roll (ξB): we extend this kind of shot according to color continuity. In each iteration, if the color-continuity difference between the beginning frame and its previous frame is not large, the beginning time point is updated to the previous frame; similarly, the end time point is processed by comparing the end frame and its next frame. The extreme case is extending ξB to a complete B-roll (δB).

2) Across-boundary and non-overlapping shots: The beginning and ending time points of each such shot are located within two complete shots, and the shots do not overlap with each other. This type can be simply divided into three cases: (III) ξA ∪ ξB, (IV) δA ∪ ξB, and (V) ξA ∪ δB. In all these cases, the selected shots can be decomposed into separate sub-shots that exactly follow (I) or (II).

3) Overlapping shots: This type contains the following cases: (VI) the overlapping parts are in a complete shot, (VII) at least one overlapping shot crosses the boundary of complete shots, and (VIII) at least one overlapping shot contains a complete shot. For these shots, we first remove the repeated parts and then merge the consecutive shots into a single one; the merged shot then conforms to the processing of type 1) or 2) above.

Similarly, a beginning-and-ending time points pair that is located within more than two complete shots can first be separated by the complete-shot boundaries and then be complemented by the corresponding strategies. Finally, we select the qualified shots according to the aforementioned aesthetic constraints, and the resulting shots are assembled to obtain the video summary. In particular, if the color continuity between two shots differs greatly, we employ fade-in and fade-out effects to ensure visual smoothness.
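The following is a minimal sketch of strategy (II), which grows an incomplete B-roll while adjacent frames remain tonally continuous and stops at the complete B-roll boundary. The distance function can be the $F_{cc}$ sketch above; the threshold `max_gap` is a hypothetical value, not from the paper.

```python
def extend_b_roll(frames, start, end, b_start, b_end, frame_distance, max_gap=0.3):
    """Grow an incomplete B-roll [start, end) by color continuity (strategy II).
    b_start/b_end bound the complete B-roll delta_B; max_gap is a hypothetical threshold."""
    # extend the beginning while the previous frame is tonally close
    while start > b_start and frame_distance(frames[start], frames[start - 1]) < max_gap:
        start -= 1
    # extend the ending while the next frame is tonally close
    while end < b_end and frame_distance(frames[end - 1], frames[end]) < max_gap:
        end += 1
    return start, end
```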
Post-processing: We further apply a series of post-processing effects; note that operations 2) and 3) are optional. 1) We leverage Spleeter [55] to extract the original BGM and keep only the human voice. 2) We automatically select a coherent audio clip from the extracted BGM that matches the style of the generated summary; we also allow users to select external BGM for the summary. 3) We allow users to manually design a voiceover text and select the start and end times at which they desire to insert it; we leverage a text-to-speech method, i.e., ESPnet [56], to implement this.
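A minimal sketch of post-processing step 1), assuming the standard Spleeter Python API with its pretrained 2-stem (vocals/accompaniment) model; the file paths are placeholders.

```python
from spleeter.separator import Separator

def split_voice_and_bgm(audio_path: str, out_dir: str) -> None:
    """Separate the narration (vocals) from the background music (accompaniment)."""
    separator = Separator("spleeter:2stems")  # pretrained 2-stem model
    # writes <out_dir>/<track_name>/vocals.wav and accompaniment.wav
    separator.separate_to_file(audio_path, out_dir)

split_voice_and_bgm("narrative_video_audio.wav", "separated/")
```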
VI. EXPERIMENTS

In this section, we describe the datasets, evaluation metrics, baselines, and experimental designs, and evaluate the effectiveness of our approach by answering the following research questions (RQs):
RQ1: Compared to traditional video summarization methods, can our method provide a higher-quality summary?
RQ2: Does every shots selection component and aesthetic constraint have a positive influence on the final result?
RQ3: In contrast to professional manual summarization, what are the advantages and disadvantages of MANVS?

A. Datasets

We employ several existing and collected datasets to conduct extensive experiments, as described below.
1) TVsum [57]: A widely used, manually annotated video summarization dataset containing 50 multimodal videos from public websites. The video topics include how to change tires on off-road vehicles, paper wasp removal, etc., with durations varying from 2 to 10 minutes. In our experiments, the Aliyun interface (https://www.aliyun.com/) is used to automatically generate subtitle documents for these videos.
2) MPII movie description (MPII) [58]: A popular video description dataset in the language video localization field, containing 94 movies and 68,375 manually written descriptions of movie plots.
3) Documentary description: We collected 72 documentaries with different themes, such as earth, travel, adventure, food, history, and science, from public websites, and annotated 21,643 temporal descriptions of the plots, following [58].
4) Movie-documentary (M-D): A video repository consisting of the movies in MPII and the documentaries in the documentary description dataset. We further collect their corresponding subtitles from public websites. Unlike TVsum, M-D does not provide frame-level importance scores.

B. Baselines

We compare the performance of our model with several advanced video summarization methods. 1) Random: we label importance scores for each frame randomly to generate summaries that are independent of the video content. 2) DR [18]: an encoder-decoder framework based on reinforcement learning that predicts probabilities for every video frame; a diversity and a representativeness reward function are designed for training to generate summaries. 3) DR-Sup [13]: an ablated version of DR with the same model backbone, which only utilizes a representativeness reward function in the training process. 4) VAS [7]: a self-attention based network that performs the entire sequence-to-sequence transformation in a single feed-forward pass and a single backward pass during training. 5) DSN-AB [13]: it first samples a series of temporal interest proposals with different interval scales; long-range temporal features are then extracted both for predicting important frames and for selecting the relatively representative ones. 6) DSN-AF [13]: a variant of DSN-AB whose feature extraction and key shots selection steps are the same as those of DSN-AB, but whose frame-level importance scores are converted into shot-level scores. 7) HSA [8]: a two-layer framework in which the first layer locates the shot boundaries in the video and generates their visual features, and the second layer predicts which shots are most representative of the video content. 8) FCN [59]: it adapts semantic segmentation models based on fully convolutional sequence networks for video summarization. 9) HMT [60]: a hierarchical transformer model based on audio and visual information, which can capture long-range dependency information among frames and shots. 10) VSN [61]: a deep learning framework that learns video summarization from unpaired data.

C. Experimental Designs

To make the comparison as fair as possible, we evaluate our method both with and without manual operation, denoted by MANVS and MANVS-auto, respectively. Specifically, the manual operations include the visual semantic matching component, the user-designed voiceover, and the BGM in the post-processing component. The following experiments are designed to answer the aforementioned RQs:
Comparison to traditional video summarization methods (RQ1): We compare MANVS-auto with the baselines in the following two experimental designs. R1a: we conduct comparison experiments on the TVsum dataset by quantitatively evaluating the performance of the proposed method and the comparison methods. R1b: we conduct a user study on the comparison experiments for the TVsum and M-D datasets, and further explore the statistical significance of the user data.
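The significance values reported later (see Table III) come from a Wilcoxon test over paired user ratings; a minimal SciPy sketch with hypothetical rating arrays is shown below.

```python
from scipy.stats import wilcoxon

# Hypothetical per-video VA ratings (1-5 scale) from the same participants.
manvs_auto_scores = [4.5, 4.0, 4.5, 5.0, 4.0, 4.5, 4.0, 5.0, 4.5, 4.0]
baseline_scores = [3.0, 3.5, 3.0, 4.0, 3.5, 3.0, 3.5, 4.0, 3.0, 3.5]

# Paired, non-parametric test of whether MANVS-auto is rated higher than a baseline.
stat, p_value = wilcoxon(manvs_auto_scores, baseline_scores)
print(f"Wilcoxon statistic={stat:.2f}, p-value={p_value:.4f}")
```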
Ablation study (RQ2): We conduct two types of ablation studies. R2a: we conduct a quantitative experiment on the MPII and documentary description datasets to objectively evaluate the performance of our visual semantic matching component. In this experiment, we compare the employed method LG4 and an alternative method VSL [33] for visual-semantic localizing. Besides, we compare them against variants without the text semantic similarity (TSS) component, without the visual semantic localizing component, or without both. R2b: we evaluate the effect of every single component and aesthetic constraint on the quality of our generated video summary on the TVsum and M-D datasets. For the former dataset, the experiments include quantitative evaluations and a user study, while for the latter, due to the lack of annotations, only a user study is conducted. Specifically, for TVsum we apply MANVS without a specific individual component or aesthetic constraint: without subtitle summarization (w/o SS), without key shots selection (w/o KSS), without highlight extraction (w/o HE), without shots length (w/o SL), without B-roll selection (w/o BS), without color continuity (w/o CC), without shot stability (w/o SSA), without opposite movements (w/o OM), and without post processing (w/o PP). Since the videos in TVsum are not labeled with temporal descriptions, we do not use the VSM component in this experiment on TVsum. Similarly, we apply MANVS with the same setting on the M-D dataset.

Comparison to professional manual editing (RQ3): R3: we compare the editing time and the quality of the generated summaries among MANVS, MANVS-auto, and an experienced video producer who creates a video summary manually. Here the manual editing tool is the commonly used Adobe Premiere©, and the producer is asked to make a video summary for the test videos. The summary should also contain a coherent BGM, and some simple splicing and fading effects may be used to make the summary visually appealing. To make a fair comparison, we only count the human active time during production.
Implementation details: In the user studies of R1b, R2b, and R3, we randomly select 10 movies, 10 documentaries, and 10 videos from the M-D and TVsum datasets to generate summaries. To keep the evaluations consistent, we refer to [43] and randomly pick a movie (Big Fish), a documentary (Planet Earth Season I, Episode 2), and a video from TVsum (Poor Man's Meals: Spicy Sausage Sandwich) for demonstrating results. In the quantitative experiments, the TVsum, MPII, and documentary description datasets are randomly divided into training and test sets with a ratio of 8:2. For the extractive text summarization method [47] leveraged in the text summarization component, we adapt the model pretrained on CNN/DailyMail [62]; the official code and pretrained model are available online (https://github.com/maszhongming/Effective_Extractive_Summarization). This model employs a pretrained uncased base model of BERT as the transformer encoder and a pointer network with an attention mechanism as the decoder. The numbers of transformer blocks and self-attention heads are both 12, the hidden layer size is 768, the maximum input sequence length is 512, the batch size is 32, and the vocabulary size is 30,000. For the temporal language grounding method [50] leveraged in the visual semantic localizing component, the official code is available online (https://github.com/JonghwanMun/LGI4temporalgrounding). We follow the official training setup and train the model on the Charades dataset [63], which is composed of 12,408 and 3,720 time-interval and text-query pairs in the training and test sets, respectively. This model utilizes I3D [64] to extract segment features for the training data, while fixing the I3D parameters during training; the feature dimension is set to 512. The method uniformly samples 128 segments from each video and uses the Adam optimizer to learn models with a mini-batch of 100 video-query pairs and a fixed learning rate of 0.0004. Then, we fine-tune the pretrained model on the MPII movie and Documentary Description datasets, with the parameters of the pretrained model fixed during training. For the detect-summarize network in the key shots selection component, we use the pretrained anchor-based model provided by [13]. This model includes a multi-head self-attention layer with 8 heads, a layer normalization, and a fully-connected layer with a dropout layer and a tanh activation function, followed by two output fully-connected layers. In our setting, all the hyperparameters of the implemented methods are kept the same as the official ones. We evaluate the implemented methods with the same evaluation criteria as the official ones, and their performance is consistent with the official results. The supplementary material provides some of the generated summaries, and we encourage readers to watch these videos.

D. Evaluation Metrics

To comprehensively evaluate the performance of our framework, we adopt different evaluation metrics for different experimental designs. A detailed version can be found in the supplementary material.
1) In the quantitative evaluations of R1a and R2b for TVsum, the commonly used [9] Precision, Recall, and F-score are utilized as the evaluation metrics to evaluate the quality of the generated summaries.
2) In experiment R2a, we adopt R@n, IoU = μ and mIoU as the evaluation metrics, which are commonly used in the field of language video localization [33], [50]. R@n, IoU = μ represents the percentage of testing samples that have at least one of the top n results with an IoU larger than μ. IoU is the intersection over union of time between the visual semantic matching result and the ground truth, and mIoU is the average IoU over all testing samples. In this paper, following [28], [29], when reporting R@n, IoU = μ,
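A minimal Python sketch of the temporal metrics in item 2) is shown below; predictions are assumed to be ranked lists of (start, end) intervals per testing sample, which is an assumption about the data layout rather than part of the paper.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(predictions: List[List[Interval]], ground_truths: List[Interval],
                n: int = 1, mu: float = 0.5) -> float:
    """R@n, IoU = mu: fraction of samples whose top-n results contain one with IoU > mu."""
    hits = sum(any(temporal_iou(p, gt) > mu for p in preds[:n])
               for preds, gt in zip(predictions, ground_truths))
    return hits / len(ground_truths)

def mean_iou(predictions: List[List[Interval]], ground_truths: List[Interval]) -> float:
    """mIoU: average IoU of the top-1 result over all testing samples."""
    return sum(temporal_iou(preds[0], gt)
               for preds, gt in zip(predictions, ground_truths)) / len(ground_truths)
```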
TABLE I. Statistics of summary attributes for the summaries generated by the baselines and MANVS-auto. IC, CON, TNS, and TAL respectively denote information coverage, consistency, the number of shots in the summary, and the average length of shots.
TABLE II. User study of video summaries in terms of visual attraction (VA) and narrative completeness (NC), generated by state-of-the-art methods and MANVS-auto.

TABLE III. Statistical significance (p-value) of the improvement of MANVS-auto over other methods in terms of VA and NC (Wilcoxon test). Numbers 1 to 10 correspond to the Random through VSN methods in Table II, respectively.

TABLE VI. Ablation study of different components in MANVS on the TVsum dataset.

TABLE VII. Comparison between professional manual editing and ours.
Fig. 7. Summarization examples (on TVsum). The five bars below each video represent the results generated by DR-Sup, VAS, DSN-AB, MANVS-auto, and the ground truth, respectively. The long gray bar is the time stream of the video, and the short colored bars are the selected key frames.

According to the quantitative evaluation on TVsum in Table IV, our method clearly outperforms the baselines. Comparing the F-scores of HSA and DR, we find that our method improves the performance significantly; compared with DR-Sup, the F-score of MANVS-auto gains 5%. Besides, Fig. 7 illustrates some summarization examples generated on TVsum, where we compare the durations of the key frames selected by MANVS-auto with those of DR-Sup, VAS, DSN-AB, and the ground truth. The results show that, compared with the baselines, the shots selected by our method are closer to the ground truth.

It is worth noting that the quantitative results in Table IV are inconsistent with the user study results in Table II. For example, the F-score of HSA in Table IV is lower than those of DSN-AF and VAS, but its visual attraction and narrative completeness in Table II are higher than both. The reason may be that the annotations only record the importance of each frame, and the objective evaluation metrics in Table IV compute scores through the selected frames, so they cannot evaluate the overall quality of a video summary [17]. Combining the results of Table I, we notice that the average shot length of the former six methods is too short, which results in frequent scene changes and makes the video summaries look like a selection of a few frames from consecutive frames. Although these methods select most of the representative entities or plots, they appear intermittently or only for a few seconds in the video summaries. Therefore, such a summary cannot tell a fluent story and has a low consistency ratio, which means that for many sentences appearing in the summary, only a few words are read by the voiceover; this dramatically reduces the narrative completeness and visual attraction of the generated summary, especially for strongly narrative videos such as documentaries and movies. In contrast, the performance of HSA in Table II is relatively better than that of the former six baselines, possibly because of its longer shot length and higher consistency ratio. However, compared to our method, it suffers from a lower ratio of complete sentences, less information coverage, and an overlong average shot length, which is tedious to watch. These results prove that our method can achieve the best results in both the subjective and the objective evaluation.

RQ2: Table V shows the experimental results of ablation study R2a. We see that the performance significantly increases when leveraging methods with VL and TSS. In particular, our method achieves 48.56% and 41.55% mIoU on the Documentary Description and MPII datasets, respectively, performing the best among all approaches. Besides, we find that the performance of LG4 without TSS is slightly lower than that of the alternative VSL, whereas LG4 + TSS outperforms the alternative VSL + TSS by a large margin. The reason may be that, after utilizing TSS, the generated sub-video is much shorter than the original one; VSL performs better than LG4 in dealing with long videos, while LG4 is better at localizing language in short videos. Fig. 8 shows some shot localization results of our component and the alternatives, which shows that our VSM component can acquire the shots that users are interested in. The above results demonstrate that the performance of our VSM component is better than that of the alternatives and that each cascaded component plays a vital role.

Fig. 6 and Table VI present the subjective and objective results of ablation study R2b, respectively. For the shots selection components, we find that w/o SS has a serious impact on the narrative completeness of the video summaries compared with the original configuration. For instance, w/o SS scores only 3.5 on the documentary, which is lower than our result by 1.8 in the user study. Besides, the F-score of w/o SS on TVsum also suffers from low accuracy, with a value of only 30.60%. These results show that narrative information is an important factor affecting the quality of the generated summary, and that the SS component is able to capture narrative details. In addition, w/o VSM obtains performance close to our result in Fig. 6; this is in line with our expectation, since the only difference in this group is the shots matching the procedural text. Besides, the subjective evaluation results indicate that KSS and HE each play a positive role in their corresponding aspects. For the aesthetic constraints and the post-processing component, there are some score discrepancies between w/o SL, w/o BS
REFERENCES

[49] Y.-C. Chen and M. Bansal, "Fast abstractive summarization with reinforce-selected sentence rewriting," in Proc. Assoc. Comput. Linguist., 2018, pp. 675–686.
[50] J. Mun, M. Cho, and B. Han, "Local-global video-text interactions for temporal grounding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10810–10819.
[51] D. A. Hudson and C. D. Manning, "Compositional attention networks for machine reasoning," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–20.
[52] J.-H. Kim et al., "Hadamard product for low-rank bilinear pooling," in Proc. Int. Conf. Learn. Representations, 2017, pp. 1–17.
[53] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794–7803.
[54] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 404–417.
[55] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, "Spleeter: A fast and efficient music source separation tool with pre-trained models," J. Open Source Softw., vol. 5, no. 50, pp. 1–4, 2020.
[56] T. Hayashi et al., "ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7654–7658.
[57] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5179–5187.
[58] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3202–3212.
[59] M. Rochan, L. Ye, and Y. Wang, "Video summarization using fully convolutional sequence networks," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 347–363.
[60] B. Zhao, M. Gong, and X. Li, "Hierarchical multimodal transformer to summarize videos," Neurocomputing, vol. 468, pp. 360–369, 2022.
[61] M. Rochan and Y. Wang, "Video summarization by learning from unpaired data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7902–7911.
[62] R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proc. SIGNLL Conf. Comput. Natural Lang. Learn., 2016, pp. 280–290.
[63] J. Gao, C. Sun, Z. Yang, and R. Nevatia, "TALL: Temporal activity localization via language query," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5267–5275.
[64] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6299–6308.
[65] "Subjective video quality assessment methods for multimedia applications," document ITU-T P.910, 2008.

Jiehang Xie received the master's degree in computer software and theory from Shaanxi Normal University, Shaanxi, China, in 2020. He is currently working toward the Ph.D. degree with the College of Computer Science, Nankai University, Tianjin, China. His research interests include multimodal and multimedia analysis, and affective computing.

Xuanbai Chen received the B.E. degree in computer science and technology in 2021 from Nankai University, Tianjin, China. He is currently working toward the M.S. degree in computer vision with the Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. His research interests include domain adaptation and video summarization in computer vision.

Tianyi Zhang (Graduate Student Member, IEEE) is currently working toward the Ph.D. degree with the Faculty of Electrical Engineering, Mathematics & Computer Science, Delft University of Technology, Delft, The Netherlands. He is associated with the Distributed & Interactive Systems Group, Centrum Wiskunde & Informatica, the national research institute for mathematics and computer science in The Netherlands. His research interests include human-computer interaction and machine learning based affective computing.

Yixuan Zhang is currently working toward the undergraduate degree with Nankai University, Tianjin, China. His research interests include movie summarization and sentiment analysis.

Shao-Ping Lu (Member, IEEE) received the Ph.D. degree in computer science from Tsinghua University, Beijing, China. He was a Postdoc and Senior Researcher with Vrije Universiteit Brussel, Brussels, Belgium. He has been an Associate Professor with Nankai University, Tianjin, China. His research interests lie at the intersection of visual computing, with particular focus on computational photography, 3D image and video representation, visual scene analysis, and machine learning.

Pablo Cesar (Senior Member, IEEE) currently leads the Distributed & Interactive Systems Group, Centrum Wiskunde & Informatica (CWI), and is a Professor with TU Delft, Delft, The Netherlands. His research interests include HCI and multimedia systems, focusing on modelling and controlling complex collections of media objects distributed in time and space. He is an ACM Distinguished Member and part of the Editorial Board of IEEE MultiMedia, ACM Transactions on Multimedia, and IEEE TRANSACTIONS ON MULTIMEDIA. He was the recipient of the prestigious Netherlands Prize for ICT Research in 2020 for his work on human-centered multimedia systems. He is the principal investigator from CWI in a number of national and European projects, and has acted as an invited expert at the European Commission's Future Media Internet Architecture Think Tank.

Yulu Yang received the B.E. degree from Beijing Agriculture Engineering University, Beijing, China, in 1984, and the M.E. and Ph.D. degrees from Keio University, Tokyo, Japan, in 1993 and 1996, respectively. He is currently a Full Professor with the Department of Computer Science, Nankai University, Tianjin, China. His research interests include parallel processing and intelligence computing.