Video Captioning Approaches
Abstract—In recent years, automatically generating natural language descriptions for videos has attracted considerable attention in computer vision and natural language processing research. Video understanding has several applications, such as video retrieval and indexing, but video captioning remains a challenging problem because of the complex and diverse nature of video content. Bridging video content and natural language sentences is still an open problem, and several methodologies have been proposed to better understand videos and generate sentences automatically. Deep learning methods have received great attention for video processing because of their better performance and high-speed computing capability. This survey discusses various methods that use an end-to-end encoder-decoder framework based on deep learning to generate natural language descriptions for video sequences. The paper also addresses the different datasets used for video and image captioning, as well as the evaluation metrics used to measure the performance of different video captioning models.

Index Terms—Video Captioning, Image Captioning, CNN, RNN, LSTM, aLSTM, GRU

I. INTRODUCTION

Video captioning poses several challenges: understanding the fine motion details of video content and the interactions of different objects, learning better representations that bridge the video domain and the language domain, and ranking the activities identified in the video [3]. Different video captioning approaches have been proposed to overcome these challenges. As shown in Fig. 1, video captioning methodologies [4] can be categorized into two classes: template-based methods and deep learning-based methods.

Fig. 1. Taxonomy of visual captioning: image captioning (template-based and retrieval-based methods) and video captioning (template-based methods and deep learning with an end-to-end framework).
TABLE I
ANALYSIS OF VARIOUS METHODOLOGIES INVOLVED IN IMAGE AND VIDEO CAPTIONING
(Columns: Sl. No. | Paper Keyword | Input | Methodology | Issues | Future Direction)
N. Xu [4] proposed a dual-stream RNN model for video captioning, which explores and integrates the hidden states of the semantic and visual streams. The model enhances local feature learning along with global semantic features by exploiting the hidden states of the vector representation and the semantic concepts separately, using two modality-specific RNNs called the Attentive Multi-Grained Encoder (AMGE); this makes the video representation more effective for caption generation. A dual-stream RNN decoder then fuses both streams from the AMGE for textual description generation.
J. Song [8] proposed a generative approach, the Multi-modal Stochastic Recurrent Neural Network (MS-RNN), to generate multiple sentences for the same event by using both prior and posterior distributions. The model also addresses uncertainty, which cannot be captured by deterministic models; a Stochastic LSTM (S-LSTM) is proposed to propagate uncertainty through a hidden latent variable.
Most existing systems employed max or mean pooling over the frames of a video to produce a vector representation, but such pooling fails to capture temporal structure. In [7], a 3D CNN explores temporal information: the most relevant temporal fragments are chosen automatically and forwarded to the natural language description generator. A joint visual modeling approach combines a forward LSTM, a backward LSTM and a CNN to encode the video into a representation that is injected into a language model to create a description. The proposed model is the first to use a bidirectional recurrent neural network for this task, and it constructs two sequential processing modules: adaptive video representation learning and textual description generation. It uses two different LSTM units, one for frame encoding and one for textual decoding.
S. Venugopalan [13] proposed a sequence-to-sequence framework trained in an end-to-end manner to generate descriptions for short videos. The proposed LSTM model is trained on video-sentence pairs and automatically learns to associate the frames with a sequence of words to generate a natural language sentence. This model also captures the temporal structure of both the video and the generated sentence.
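As a concrete illustration of this encoder-decoder idea, the sketch below shows a minimal sequence-to-sequence video captioning model in PyTorch, assuming pre-extracted per-frame CNN features. The layer sizes, variable names and single-LSTM design are illustrative simplifications, not the exact architecture of [13].

```python
# Minimal encoder-decoder sketch for video captioning (illustrative, not the exact S2VT model).
# Assumes frame features have already been extracted with a CNN (e.g., a 2048-d vector per frame).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)   # reads frame features
        self.embed = nn.Embedding(vocab_size, embed_dim)                 # word embeddings
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # generates the sentence
        self.out = nn.Linear(hidden_dim, vocab_size)                     # word logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, seq_len) word indices
        _, (h, c) = self.encoder(frame_feats)      # summarize the video into the final state
        dec_in = self.embed(captions)              # teacher forcing with ground-truth words
        dec_out, _ = self.decoder(dec_in, (h, c))  # condition decoding on the video summary
        return self.out(dec_out)                   # (batch, seq_len, vocab_size)

# Usage: logits = VideoCaptioner()(torch.randn(2, 30, 2048), torch.randint(0, 10000, (2, 12)))
# Training would minimize cross-entropy between logits[:, :-1] and the shifted caption tokens.
```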
L. Gao [14] proposed a hierarchical LSTM with adaptive attention (hLSTMat) framework to generate sentences for images and videos. In this model, a spatial or temporal attention mechanism selects the relevant portions of the frames when predicting the related words, while an adaptive attention mechanism decides whether to rely on the visual context or the language context at each step. The model therefore exploits both low-level visual context and high-level language context to produce the textual description of a video.
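The temporal attention component can be sketched as follows. This is a generic soft attention over frame features conditioned on the decoder state, with assumed names and dimensions, rather than the exact hLSTMat formulation (which additionally uses an adaptive gate to fall back on the language context).

```python
# Generic temporal (soft) attention over frame features, conditioned on the decoder hidden state.
# A simplified building block in the spirit of attention-based captioning decoders.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, dec_hidden):
        # frame_feats: (batch, num_frames, feat_dim); dec_hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.w_feat(frame_feats) +
                                  self.w_hidden(dec_hidden).unsqueeze(1)))  # (batch, frames, 1)
        alpha = torch.softmax(e, dim=1)             # attention weights over frames
        context = (alpha * frame_feats).sum(dim=1)  # weighted video context vector
        return context, alpha.squeeze(-1)
```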
Y. Pan [15] proposed an LSTM with transferred semantic attributes (LSTM-TSA) framework that extracts semantic attributes from videos using a combined CNN and RNN framework. The semantic attributes guide the sequence learning that generates the textual description; attributes mined from images reflect the static objects and scenes but fail to reflect the temporal structure of the video, so caption generation is improved by merging image and video sources together. Results show better performance on several datasets.
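A simplified way to inject such attributes into the decoder is shown below: a multi-label attribute probability vector is concatenated with the word embedding at every decoding step. This is only a hedged approximation; LSTM-TSA itself uses a dedicated transfer unit rather than plain concatenation, and the names and dimensions here are assumptions.

```python
# Sketch: conditioning an LSTM decoder on a semantic-attribute vector (simplified stand-in for
# attribute transfer; the attribute probabilities would come from a multi-label classifier).
import torch
import torch.nn as nn

class AttributeConditionedDecoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, attr_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + attr_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, captions, attributes):
        # captions: (batch, seq_len) word indices; attributes: (batch, attr_dim) probabilities
        words = self.embed(captions)                                   # (batch, seq_len, embed_dim)
        attrs = attributes.unsqueeze(1).expand(-1, words.size(1), -1)  # repeat per time step
        hidden, _ = self.lstm(torch.cat([words, attrs], dim=-1))
        return self.out(hidden)                                        # word logits per step
```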
Y. Xu [16] developed a novel sequential VLAD layer, named SeqVLAD, which produces a better video representation by combining VLAD aggregation with a Recurrent Convolutional Network (RCN) framework. The model captures the fine motion details present in the video by learning both spatial and temporal structure. An improved Gated Recurrent Unit for the RCN, named Shared GRU-RCN (SGRU-RCN), is proposed to learn the spatial and temporal assignments. Because the SGRU-RCN has fewer parameters, it is less prone to overfitting and achieves better results.
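For orientation, the sketch below shows plain soft-assignment VLAD aggregation of frame descriptors in NumPy: each descriptor's residual to every cluster center is accumulated with a softmax assignment weight. SeqVLAD itself learns the assignments recurrently with the SGRU-RCN, so treat this as background on the VLAD part only; the cluster centers and assignment parameters are assumed to be given.

```python
# Soft-assignment VLAD aggregation of per-frame descriptors (background sketch for the VLAD
# part of SeqVLAD; the recurrent/shared assignment of SGRU-RCN is not modeled here).
import numpy as np

def soft_vlad(descriptors, centers, assign_w, assign_b):
    """descriptors: (T, D) frame features; centers: (K, D) cluster centers;
    assign_w: (D, K), assign_b: (K,) parameters of the soft-assignment layer."""
    logits = descriptors @ assign_w + assign_b                      # (T, K) assignment scores
    logits -= logits.max(axis=1, keepdims=True)
    a = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over clusters
    residuals = descriptors[:, None, :] - centers[None, :, :]       # (T, K, D) residuals
    vlad = (a[:, :, None] * residuals).sum(axis=0)                  # (K, D) aggregated residuals
    vlad /= np.linalg.norm(vlad) + 1e-12                            # global L2 normalization
    return vlad.ravel()                                             # (K*D,) video descriptor

# Example: soft_vlad(np.random.randn(30, 512), np.random.randn(8, 512),
#                    np.random.randn(512, 8), np.zeros(8))
```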
Describing a long video with a single sentence misses most of the details and produces uninformative, generic results. Yu et al. [17] generate multiple textual descriptions for long videos composed of many different events using a hierarchical RNN (hRNN). This framework exploits the sequential dependencies between the sentences of a paragraph, where the next sentence is generated using the semantic context of the previous one. The approach uses two generators: a sentence generator that exploits the spatial and temporal information in the video to produce a single description, and a paragraph generator that models the dependencies between the multiple descriptions.
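A minimal sketch of such a two-level generator is given below, assuming fixed names and dimensions: a paragraph-level GRU consumes a summary of each finished sentence and emits a context vector that initializes the next sentence-level decoder. The attention and pooling details of [17] are omitted.

```python
# Two-level (hierarchical) caption generator sketch: a paragraph RNN carries context across
# sentences, and each sentence is produced by a sentence-level decoder seeded by that context.
import torch
import torch.nn as nn

class HierarchicalCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.paragraph_rnn = nn.GRUCell(hidden_dim, hidden_dim)  # carries inter-sentence context
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.sentence_rnn = nn.GRU(embed_dim + feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feat, sentences):
        # video_feat: (batch, feat_dim) pooled video feature; sentences: list of (batch, len) tensors
        batch = video_feat.size(0)
        para_state = video_feat.new_zeros(batch, self.paragraph_rnn.hidden_size)
        all_logits = []
        for sent in sentences:
            words = self.embed(sent)                                    # (batch, len, embed_dim)
            vid = video_feat.unsqueeze(1).expand(-1, words.size(1), -1)
            h0 = para_state.unsqueeze(0)                                # seed the sentence decoder
            hidden, h_last = self.sentence_rnn(torch.cat([words, vid], dim=-1), h0)
            all_logits.append(self.out(hidden))
            # summarize the finished sentence and update the paragraph-level state
            para_state = self.paragraph_rnn(h_last.squeeze(0), para_state)
        return all_logits
```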
Ning Xu [18] proposed a model that recognizes multiple events in a video and generates natural language descriptions using an Attention-in-Attention (AiA) model. It consists of two attention modules: an encoder attention module that selects the most salient visual and semantic features and averages them into a single attentive feature to highlight space-specific features, and a fusion attention module that activates the multi-space features and adjusts and fuses them into a better video representation. An LSTM decodes this representation to generate multiple action annotations or video descriptions.
Y. Yang [19] introduced an adversarial learning approach that extends the LSTM with a Generative Adversarial Network (GAN) for video captioning. The GAN is composed of two interplaying modules, a generator and a discriminator: the generator produces sentences for the given video content using an existing video captioning methodology, while a novel discriminator acts as an adversary to the generator to improve accuracy. An embedding layer inside the discriminator transforms the discrete output of the generator into a continuous representation. A novel discriminative framework is also proposed to address the inefficient classification of sequence-to-sentence outputs.
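The discriminator side of such a setup can be sketched as below: it encodes a candidate sentence with an LSTM, pairs it with the video feature, and predicts whether the pair is a human caption or a generated one. This is an assumed minimal design with illustrative names and sizes; the actual model in [19] additionally handles the generator's discrete outputs through its embedding layer so that training signals can flow back to the generator.

```python
# Minimal caption discriminator sketch for adversarial captioning: scores how "real" a
# (video feature, sentence) pair looks. Illustrative design, not the exact model of [19].
import torch
import torch.nn as nn

class CaptionDiscriminator(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)     # embeds caption word indices
        self.sent_enc = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim + feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))                        # real/fake logit

    def forward(self, video_feat, caption_tokens):
        # video_feat: (batch, feat_dim); caption_tokens: (batch, seq_len) word indices
        _, (h, _) = self.sent_enc(self.embed(caption_tokens))
        pair = torch.cat([h.squeeze(0), video_feat], dim=-1)
        return self.classifier(pair)                         # higher = judged more human-like

# Training alternates: the discriminator learns to separate ground-truth captions from generated
# ones, while the captioning generator is updated to fool it (e.g., via policy gradients).
```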
This literature survey presents a clear picture of the different methods for generating textual descriptions of videos, as summarized in Table I. The methodologies reviewed here significantly improve the performance of the video captioning task, yet the field still needs improvement, and many unaddressed challenges remain that invite researchers to focus on video captioning and generate descriptions comparable to those written by humans.

III. DATASET

This section presents the various datasets used for image and video captioning. These training datasets usually consist of images or videos together with their ground-truth sentences.

A. Image Dataset
1) Microsoft COCO [20]: The largest corpus for image captioning. In this dataset, 82,783 images are allocated for training, 40,504 for validation and 40,775 for testing, and each image is annotated with five captions by humans.
2) Flickr30K dataset [21]: Consists of 31,783 images taken from Flickr. In this dataset, 29,000 images are allocated for training, 1,000 for validation and 1,000 for testing, and each image is associated with five descriptions. It mostly covers human activities.

B. Video Dataset
1) Montreal Video Annotation Dataset (M-VAD) [22]: Consists of about 49,000 video clips extracted from 92 DVD movies. In this dataset, 39,000 video clips are allocated for training, 4,900 for validation and 5,000 for testing. Each video clip is described by a single sentence, and it is very challenging to describe movie snippets with one sentence of ground truth.
2) MPII Movie Description Corpus (MPII-MD) [23]: Consists of 37,000 video clips collected from 55 movies with audio descriptions and 31,000 video clips from 49 Hollywood movies. Each video clip is equipped with a single sentence from the descriptive video service and movie scripts.
3) Microsoft Research Video Description Corpus (MSVD) [24]: Consists of 1,970 YouTube video clips collected via Amazon Mechanical Turk (AMT). Each video clip is about 10 seconds long. In this dataset, 1,200 video clips are allocated for training, 100 for validation and 670 for testing. The clips are annotated with single-sentence descriptions in several languages, and each clip has around 40 different sentences in English. Fig. 3 shows a sample output sentence generated for a video.

Fig. 3. An example from the MSVD dataset with the associated ground truth.

B. ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [29] was initially developed for automatic document summarization. ROUGE is similar to the BLEU metric, but the difference is that ROUGE measures n-gram occurrences over the set of human-annotated reference sentences (recall-oriented), whereas BLEU is computed from the occurrences over the generated sentences (precision-oriented). It has four variants, namely ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S(U), of which ROUGE-N and ROUGE-L are popularly used for video captioning.
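As an illustration, a bare-bones ROUGE-N recall can be computed as below; it counts clipped n-gram overlaps against the reference sentences and divides by the total reference n-gram count, ignoring the multiple-reference weighting and F-measure refinements of the full toolkit [29].

```python
# Bare-bones ROUGE-N (recall of reference n-grams that also appear in the candidate).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    """candidate: list of tokens; references: list of token lists."""
    cand = ngrams(candidate, n)
    overlap, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        overlap += sum(min(cnt, cand[g]) for g, cnt in ref_counts.items())  # clipped matches
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0

# Example:
# rouge_n("a man is playing a guitar".split(),
#         ["a man plays a guitar".split(), "someone is playing guitar".split()], n=1)
```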
C. METEOR
METEOR (Metric for Evaluation of Translation with Explicit ORdering) [30] is based on a recall-weighted harmonic mean of unigram precision and recall. The major difference between METEOR and BLEU is that METEOR combines both recall and precision. BLEU and ROUGE also have the limitation of strict matching, which METEOR relaxes by matching unigrams not only on exact words but also on stems and synonyms.
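The core score, before METEOR's stemming, synonym matching and fragmentation penalty, reduces to a recall-weighted harmonic mean of unigram precision and recall, as in this simplified sketch with exact matches only:

```python
# Simplified METEOR-style score: recall-weighted harmonic mean of unigram precision and recall,
# using exact matches only (no stemming, synonyms or fragmentation penalty).
from collections import Counter

def simple_meteor(candidate, reference, alpha=0.9):
    """candidate, reference: lists of tokens; alpha > 0.5 weights recall over precision."""
    cand, ref = Counter(candidate), Counter(reference)
    matches = sum((cand & ref).values())          # clipped unigram matches
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

# Example: simple_meteor("a man is playing a guitar".split(), "a man plays the guitar".split())
```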
- Sentences generated by current systems are still less acceptable than human-annotated sentences.
- Image and video captioning methods mainly learn to map low-level visual features to sentences without focusing on high-level semantic video concepts (i.e., objects, actions, etc.).
- Existing methods focus on predefined templates instead of generating more natural and diverse sentences.
- Most of the existing video captioning techniques do not exploit the temporal nature of video, which is important for describing long-duration videos.
VI. CONCLUSION

This survey provides information about various methods that use an end-to-end encoder-decoder framework based on deep learning to generate natural language descriptions for video sequences. It also addresses the different datasets used for video and image captioning, as well as the evaluation metrics used to measure the performance of different video captioning models. The survey gives readers a clear picture of what has been achieved in the video captioning field so far and where the gaps lie, so that future research can be better focused.
REFERENCES
[1] X. He and L. Deng, "Deep learning for image-to-text generation: A technical overview," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 109-116, Nov. 2017.
[2] N. Aafaq, S. Z. Gilani, W. Liu, and A. Mian, "Video description: A survey of methods, datasets and evaluation metrics," arXiv preprint arXiv:1806.00186, Jun. 2018.
[3] S. Li, Z. Tao, K. Li, and Y. Fu, "Visual to text: Survey of image and video captioning," IEEE Transactions on Emerging Topics in Computational Intelligence, 2019.
[4] N. Xu, A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli, "Dual-stream recurrent neural network for video captioning," IEEE Transactions on Circuits and Systems for Video Technology, Mar. 2018.
[5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., "From captions to visual concepts and back," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1473-1482.
[6] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, "Video captioning with attention-based LSTM and semantic consistency," IEEE Transactions on Multimedia, vol. 19, no. 9, Sep. 2017.
[7] Y. Bin, Y. Yang, F. Shen, N. Xie, H. T. Shen, and X. Li, "Describing video with attention-based bidirectional LSTM," IEEE Transactions on Cybernetics, 2019.
[8] J. Song, Y. Guo, L. Gao, X. Li, and H. T. Shen, "From deterministic to generative: Multimodal stochastic RNNs for video captioning," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[9] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., "From captions to visual concepts and back," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1473-1482.
[10] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in IEEE CVPR, 2015, pp. 3156-3164.
[11] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, "Semantic compositional networks for visual captioning," arXiv:1611.08002v2, Mar. 2017.
[12] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in IEEE CVPR, 2016.
[13] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence - video to text," in Proc. IEEE International Conference on Computer Vision, 2015, pp. 4534-4542.
[14] L. Gao, X. Li, J. Song, and H. T. Shen, "Hierarchical LSTMs with adaptive attention for visual captioning," IEEE Journal of LaTeX Class Files, vol. 14, no. 8, Aug. 2015.
[15] Y. Pan, T. Yao, H. Li, and T. Mei, "Video captioning with transferred semantic attributes," in IEEE CVPR, 2017, pp. 984-992.
[16] Y. Xu, Y. Han, R. Hong, and Q. Tian, "Sequential video VLAD: Training the aggregation locally and temporally," IEEE Transactions on Image Processing, vol. 27, no. 10, Oct. 2018.
[17] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, "Video paragraph captioning using hierarchical recurrent neural networks," in IEEE CVPR, 2016, pp. 4584-4593.
[18] N. Xu, A. Liu, W. Nie, and Y. Su, "Attention-in-attention networks for surveillance video understanding in IoT," IEEE Internet of Things Journal, 2018.
[19] Y. Yang, J. Zhou, J. Ai, Y. Bin, A. Hanjalic, H. T. Shen, and Y. Ji, "Video captioning by adversarial LSTM," IEEE Transactions on Image Processing, 2018.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, Springer, 2014, pp. 740-755.
[21] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," Transactions of the Association for Computational Linguistics, vol. 2, pp. 67-78, 2014.
[22] A. Torabi, C. Pal, H. Larochelle, and A. Courville, "Using descriptive video services to create a large data source for video annotation research," arXiv preprint arXiv:1503.01070, 2015.
[23] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3202-3212.
[24] D. L. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation," in Proc. ACL: Human Language Technologies, vol. 1, Association for Computational Linguistics, 2011, pp. 190-200.
[25] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 5288-5296.
[26] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," in Proc. IEEE International Conference on Computer Vision, 2013, pp. 2712-2719.
[27] J. Park, C. Song, and J.-H. Han, "A study of evaluation metrics and datasets for video captioning," in ICIIBMS, 2017.
[28] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), Stroudsburg, PA, USA, 2002, pp. 311-318.
[29] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 2004.
[30] D. Elliott and F. Keller, "Image description using visual dependency representations," in Proc. Empirical Methods in Natural Language Processing, vol. 13, 2013, pp. 1292-1302.
[31] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 4566-4575.