
International Conference on Communication and Signal Processing, April 4-6, 2019, India

Video Captioning using Deep Learning: An Overview of Methods, Datasets and Metrics

M. Amaresh and S. Chitrakala


Abstract—In recent years, automatically generating natural language descriptions for videos has attracted a lot of attention in computer vision and natural language processing research. Video understanding has several applications such as video retrieval and indexing, but video captioning remains a particularly challenging topic because of the complex and diverse nature of video content. However, the relationship between video content and natural language sentences remains an open problem, motivating several methodologies that better understand the video and generate the sentence automatically. Deep learning methodologies have received great attention for video processing because of their better performance and high-speed computing capability. This survey discusses various methods that use the end-to-end encoder-decoder framework based on deep learning approaches to generate natural language descriptions for video sequences. The paper also addresses the different datasets used for video and image captioning, as well as the various evaluation metrics used for measuring the performance of different video captioning models.

Index Terms—Video Captioning, Image Captioning, CNN, RNN, LSTM, aLSTM, GRU

I. INTRODUCTION

Understanding video is a key research aspect of multimedia analysis, and generating a natural language sentence for a given video, called video captioning, has been receiving great attention in computer vision [1]. Automatic video description generation involves understanding many background concepts and detecting their occurrences in the video, such as objects, actions, scenes, person-person relations, person-object relations and the temporal order of events. Moreover, it requires translating the extracted visual information into a comprehensible and grammatically correct natural language description. Video captioning has many applications such as video indexing, human-robot interaction, assisting the visually disabled, automatic video subtitling, procedure generation for instructional videos, video surveillance and understanding sign language [2]. The major challenges in understanding video and generating a natural language sentence are: understanding the fine motion details of the video content and the interactions of different objects, learning better joint representations across the video and language domains, and ranking the activities identified in the video [3]. Different video captioning approaches have been proposed to overcome these challenges. As shown in Fig. 1, video captioning methodologies [4] can be categorized into two groups: template-based methods and deep learning-based methods.

Fig. 1. Categories of Visual Captioning: image captioning (template-based and retrieval-based methods) and video captioning (template-based and deep learning-based methods, the latter divided into end-to-end and compositional frameworks).

A. Template-based Methods
Template-based methods use a set of predefined grammar rules: a sentence is divided into terms such as subject, object and verb, each term is associated with detected video content, and a natural language description is then generated. Although a template-based method [5] can generate grammatically correct captions for a video, the approach depends heavily on the detected video content, and the variety of its output suffers because of this built-in limitation.
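As a small illustration of the template idea, the sketch below fills a single hand-written subject-verb-object template with detections that are hard-coded for the example; the function name and the template itself are assumptions for illustration, not part of any surveyed system, and a real pipeline would obtain the terms from visual classifiers.

```python
# A minimal sketch of template-based captioning: detected subject/verb/object
# terms are slotted into one fixed grammar template. Detections are hard-coded
# here purely for illustration.
def template_caption(subject: str, verb: str, obj: str) -> str:
    # Template: "A/An <subject> is <verb> a/an <object>." (verb given in -ing form)
    article = lambda w: "an" if w[0].lower() in "aeiou" else "a"
    return f"{article(subject).capitalize()} {subject} is {verb} {article(obj)} {obj}."

# Example detections for a cooking clip (illustrative only).
print(template_caption("woman", "slicing", "onion"))
# -> "A woman is slicing an onion."
```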
B. Deep Learning based Methods
1) End-to-end framework: The encoder-decoder architecture was initially used for machine translation, to generate a sentence in one language from a sentence in another.
M. Amaresh, Research Scholar at Anna University, Chennai, India (amareshgood@gmail.com).
Dr. S. Chitrakala, Professor at Anna University, Chennai, India (chitrakala.au@gmail.com).

The use of the encoder-decoder framework [6][7][8] for video captioning improves the results significantly. Fig. 2 shows the typical encoder-decoder framework for video captioning. In the end-to-end model, the key frames of the video are first encoded into a sequence of feature vectors that represent the semantic information of the video using a Convolutional Neural Network (CNN), which is composed of multiple convolutional layers, max pooling layers and fully connected layers. The extracted global visual feature vector is then decoded by a Recurrent Neural Network (RNN) based decoder to generate the textual description. A Long Short-Term Memory network (LSTM) or Gated Recurrent Unit (GRU) is used as a variant of the RNN and has been shown to be more efficient and effective for sentence generation.

An attention-based mechanism is used to learn to focus on particular regions of a frame while generating the description. In an attention-based model, along with the global vector, the CNN provides a group of visual vectors for important regions in the frame. During sentence generation, the RNN then refers to these region vectors and determines the likelihood that each sub-region is relevant to the current state in order to generate the next word. In the end-to-end framework, the CNN, the RNN and the attention model are trained jointly.
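The sketch below shows, under the assumptions stated in the comments, how such an encoder-decoder pipeline with soft attention can be wired together in PyTorch: pre-extracted CNN frame features are attended over at every LSTM decoding step. The class name, dimensions and toy inputs are illustrative, not the architecture of any specific surveyed paper.

```python
# A minimal sketch of attention-based encoder-decoder video captioning.
import torch
import torch.nn as nn


class AttentiveCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive (soft) attention over the per-frame CNN features.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) from a pretrained CNN
        # captions:    (batch, seq_len) token ids, used with teacher forcing
        batch, n_frames, _ = frame_feats.shape
        h = frame_feats.new_zeros(batch, self.lstm.hidden_size)
        c = frame_feats.new_zeros(batch, self.lstm.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            # Soft attention: score every frame against the current hidden state.
            scores = self.att_score(torch.tanh(
                self.att_feat(frame_feats) + self.att_hidden(h).unsqueeze(1)))
            weights = torch.softmax(scores, dim=1)          # (batch, n_frames, 1)
            context = (weights * frame_feats).sum(dim=1)    # weighted frame summary
            word = self.embed(captions[:, t])
            h, c = self.lstm(torch.cat([word, context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, seq_len, vocab_size)


# Toy usage: 4 clips, 20 frames each, 2048-d CNN features, 8-token captions.
model = AttentiveCaptionDecoder(vocab_size=1000)
feats = torch.randn(4, 20, 2048)
caps = torch.randint(0, 1000, (4, 8))
print(model(feats, caps).shape)  # torch.Size([4, 8, 1000])
```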
Fig. 2. End-to-End framework of Video captioning.

2) Compositional Framework: In this framework, a different category of video-to-sentence methodology [9] uses explicit semantic concept detection to generate natural language sentences for video. A group of semantic labels and actions is first generated from the video. These labels may relate to different parts of a sentence, including nouns, verbs and adjectives, and language models then use these tags to generate the natural language sentence for the video.
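A minimal sketch of this compositional idea follows, assuming the semantic tags have already been produced by separate visual classifiers: a multi-hot tag vector conditions every step of an LSTM language model. All names and dimensions are illustrative assumptions.

```python
# A minimal sketch of tag-conditioned caption generation.
import torch
import torch.nn as nn


class TagConditionedDecoder(nn.Module):
    def __init__(self, vocab_size, n_tags, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The tag vector is appended to the word embedding at every time step.
        self.lstm = nn.LSTM(embed_dim + n_tags, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tag_probs, captions):
        # tag_probs: (batch, n_tags) detector outputs in [0, 1]
        # captions:  (batch, seq_len) token ids
        tags = (tag_probs > 0.5).float()                       # multi-hot tag vector
        tags = tags.unsqueeze(1).expand(-1, captions.size(1), -1)
        x = torch.cat([self.embed(captions), tags], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                                     # (batch, seq_len, vocab)


decoder = TagConditionedDecoder(vocab_size=1000, n_tags=300)
logits = decoder(torch.rand(2, 300), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1000])
```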
The remainder of this paper is organized as follows: Section II presents a literature survey of the various methodologies proposed for image and video captioning, Section III describes the different datasets used for image and video captioning, Section IV presents the evaluation metrics, and Sections V and VI discuss the inferences drawn from this survey and the conclusion.

II. LITERATURE SURVEY

A. Image Captioning
Image captioning has attracted growing interest recently. Early methods for the image captioning task can be classified into two types: template-matching approaches and retrieval-based approaches. In template-matching approaches, the objects and actions present in an image are detected and matched against a template to generate an appropriate sentence. In retrieval-based approaches, images visually similar to the given test image are gathered from a large database, and the descriptions of the retrieved images are fitted to the test image.

O. Vinyals [10] proposed an end-to-end framework that generates descriptions for images by replacing the RNN encoder with a CNN encoder, which produces a better image representation for generating textual descriptions. The proposed single joint model, named the Neural Image Caption model, trains the CNN on an image classification task; the last hidden layer of the CNN is then provided as input to the RNN decoder, which generates the textual description for the image. The likelihood of the target sentence is maximized by training on images with stochastic gradient descent.

Z. Gan [11] constructed a semantic compositional network that uses semantic concepts to generate a textual description for the query image. The likelihoods of all tags are used to compose a semantic concept vector that modulates the LSTM weight matrices in an ensemble. This provides the advantage of learning collaborative, semantic-concept-dependent weight matrices to produce the description for the image.

Q. You [12] proposed a new approach that combines top-down and bottom-up approaches using a semantic attention model. Following the top-down approach, a convolutional neural network extracts the visual features and also detects visual concepts (objects, regions, attributes, etc.). The semantic attention model combines the image attributes and the visual concepts to produce the description for the image using an RNN. Following the bottom-up approach, the attention weights over several candidate concepts are updated across the RNN iterations.

B. Video Captioning
Describing images using natural language has received considerable attention, and research is now shifting to video descriptions. The straightforward idea for describing video content is to utilize deep learning approaches. A natural language description for video content is generated using deep learning in two stages: the first step uses Convolutional Neural Networks (CNNs) to encode a vector representation of the video, and the second step uses Recurrent Neural Networks (RNNs) to decode that vector into a textual description. These deep learning networks achieve significantly good results in many applications such as video indexing, language modeling, machine translation and more.

TABLE I
ANALYSIS OF VARIOUS METHODOLOGIES INVOLVED IN IMAGE AND VIDEO CAPTIONING
(Columns: Sl. No. | Paper Keyword | Input | Methodology | Issues | Future Direction)

1 | Show & Tell [10] | Image | Neural Image Caption model | Selecting salient features from images | Using unsupervised image and text data to improve sentence generation approaches.
2 | Captioning with Semantic Attention [12] | Image | Combined top-down and bottom-up model | Extracting better context from an image | Experimenting with sentence-based visual attributes and distributed representations.
3 | Hierarchical LSTMs [14] | Image/Video | Hierarchical LSTMs with adaptive attention | Attention on non-visual words; representation of visual data | -
4 | Dual-Stream RNN (DS-RNN) [4] | Image/Video | Attentive Multi-Grained Encoder (AMGE) with a Dual-Stream RNN | Representation and fusion of heterogeneous information | Generating multiple descriptions or paragraphs for videos.
5 | SeqVLAD [16] | Video | Sequential VLAD encoding using a shared GRU-RCN architecture | Extracting fine motion details in consecutive frames; overfitting | -
6 | Attention-Based LSTM [6] | Video | aLSTM with multi-modal correlation | Compressed static representation of video; attention to unnecessary portions of frames | Generating sentences for domain-specific videos.
7 | Bidirectional LSTM (BiLSTM) [7] | Video | BiLSTM with soft attention | Unidirectional sentence generation | Exploiting object-based spatial and temporal dependencies; spatial and temporal reasoning.
8 | From Deterministic to Generative [8] | Video | Multimodal stochastic recurrent neural networks (MS-RNN) | Decoders propagate only deterministic hidden states | Integrating a state-of-the-art attention scheme with the proposed model.
9 | Attention-In-Attention Networks [18] | Video | AIA mechanism using encoder attention modules (EAMs) and a fusion attention module (FAM) | Better video representation | Exploring multi-stacked attention to fuse features into multiple spaces.
10 | Video Captioning by Adversarial LSTM [19] | Video | Adversarial learning with an LSTM in a Generative Adversarial Network | Accuracy of generated sentences | LSTM-GAN with reinforcement learning.
N. Xu [4] proposed a dual-stream RNN model for video captioning which explores and integrates the hidden states of the semantic and visual streams. The proposed model enhances local feature learning along with global semantic features by exploiting the hidden states of the visual representation and the semantic concepts separately through two modality-specific RNNs, called the Attentive Multi-Grained Encoder (AMGE), which makes the video representation efficient for caption generation. A dual-stream RNN decoder then fuses both streams from the AMGE for textual description generation.

J. Song [8] proposed a generative approach, the Multi-modal Stochastic Recurrent Neural Network (MS-RNN), to generate multiple sentences for the same event by using both prior and posterior distributions. The model also addresses the problem of uncertainty, which cannot be modeled by deterministic models: a stochastic LSTM (S-LSTM) is proposed to propagate uncertainty through a hidden latent variable.

Most existing systems employ max or mean pooling over the frames of a video to produce a vector representation, but this fails to capture temporal structure. In [7], a 3D CNN explores the temporal information, i.e. the most relevant temporal fragments are chosen automatically and forwarded to natural language description generation. A joint visual modelling approach combines a forward LSTM, a backward LSTM and a CNN to encode the video into a representation that is injected into a language model to create the description. The proposed model is the first to use a bidirectional recurrent neural network for this task, and it constructs two sequential processing modules, adaptive video representation learning and textual description generation, using two different LSTM units for frame encoding and textual decoding.

S. Venugopalan [13] proposed a new sequence-to-sequence framework trained in an end-to-end manner to provide descriptions for short videos. The proposed LSTM model is trained on video-sentence pairs and automatically learns to associate the frames with a sequence of words to generate a natural language sentence.
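As a rough illustration of the bidirectional encoding idea in [7], the sketch below fuses the final states of forward and backward LSTM passes over the frame features into a single video vector; the fusion scheme and dimensions are assumptions for illustration, not the exact published architecture.

```python
# A minimal sketch of a bidirectional LSTM video encoder over CNN frame features.
import torch
import torch.nn as nn


class BiLSTMVideoEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, n_frames, feat_dim) CNN features in temporal order
        outputs, _ = self.bilstm(frame_feats)
        # Fuse the last forward-direction state with the first backward-direction
        # state (i.e. the final state of the backward pass).
        fwd_last = outputs[:, -1, :self.bilstm.hidden_size]
        bwd_first = outputs[:, 0, self.bilstm.hidden_size:]
        return torch.tanh(self.fuse(torch.cat([fwd_last, bwd_first], dim=1)))


encoder = BiLSTMVideoEncoder()
video_vector = encoder(torch.randn(4, 30, 2048))
print(video_vector.shape)  # torch.Size([4, 512])
```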

This model also learns the temporal structure of both the video and the generated sentence.

L. Gao [14] proposed a hierarchical LSTM with adaptive attention (hLSTMat) framework to generate sentences for images and videos. In the proposed model, a spatial or temporal attention mechanism chooses the particular portion of the frame relevant to the words being generated, while an adaptive attention mechanism decides whether to rely on the visual context or the language context. The model considers both low-level visual context and high-level language context to produce the textual description for a video.

Y. Pan [15] proposed an LSTM with transferred semantic attributes (LSTM-TSA) framework that extracts semantic features from videos using a combined CNN and RNN architecture. The semantic attributes guide sequence learning when generating the textual description; such features reflect the stationary objects and scenes in a frame but fail to reflect the temporal structure of the video. Generating natural descriptions for video is improved by merging image and video sources together, and the results show better performance on different datasets.

Y. Xu [16] developed a novel sequential VLAD layer, named SeqVLAD, which generates a better representation of the video by combining VLAD encoding with the RCN framework. The model explores the fine motion details present in the video by learning its spatial and temporal structure. An improved Gated Recurrent Unit of the Recurrent Convolutional Network (RCN), named Shared GRU-RCN (SGRU-RCN), was proposed to learn the spatial and temporal assignments. Because the SGRU-RCN has fewer parameters, the overfitting problem is reduced and better results are achieved.

Describing a long video with a single sentence misses most of the details and produces uninformative and unexciting results. Yu et al. [17] generated multiple textual descriptions for long videos composed of many different events using a hierarchical RNN (hRNN) method. The framework exploits the sequential dependencies between the multiple descriptions in a paragraph, where the next sentence is generated using the semantic context of the previous sentence. Two types of generators are used: a sentence generator takes the spatial and temporal information in the long video to generate a single description, whereas a paragraph generator models the dependencies between the multiple descriptions.

Ning Xu [18] proposed a model that recognizes multiple events in a video and generates natural language descriptions using an Attention-in-Attention (AIA) model. It consists of two attention modules: the encoder attention module selects the most salient visual and semantic features and averages them into a single attentive feature to highlight space-specific features, while the fusion attention module activates the multi-space features and adjusts and fuses them into a better video representation. An LSTM decodes the video representation and generates multiple action annotations or video descriptions.

Y. Yang [19] introduced a novel adversarial learning concept by extending the LSTM with a Generative Adversarial Network (GAN) for the video captioning problem. The GAN is composed of two interplaying modules, a generator and a discriminator: the generator produces sentences for the video content using an existing video captioning methodology, while a novel discriminator module acts as an adversary to the generator to improve accuracy. An embedding layer is introduced into the discriminator to transform the discrete output of the generator into a continuous representation, and a novel discriminative framework is proposed to resolve the inefficient classification of sequences into sentences.
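To make the adversarial setup concrete, the sketch below shows only a discriminator update on (video feature, caption) pairs, assuming captions sampled from some generator; propagating gradients back through the discrete words, which [19] handles with an embedding layer, is deliberately left out. All modules and shapes are illustrative assumptions, not the published model.

```python
# A minimal sketch of the discriminator side of an LSTM-GAN for captioning:
# the discriminator learns to tell human captions from machine captions
# given the video feature, pushing the generator toward human-like sentences.
import torch
import torch.nn as nn


class CaptionDiscriminator(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim + feat_dim, 1)  # real/fake logit

    def forward(self, video_feat, captions):
        _, (h, _) = self.lstm(self.embed(captions))
        return self.score(torch.cat([h[-1], video_feat], dim=1)).squeeze(1)


disc = CaptionDiscriminator(vocab_size=1000)
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

video_feat = torch.randn(4, 2048)            # pooled CNN video feature
real_caps = torch.randint(0, 1000, (4, 8))   # ground-truth captions (token ids)
fake_caps = torch.randint(0, 1000, (4, 8))   # captions sampled from a generator

# One discriminator step: real pairs -> 1, generated pairs -> 0.
loss = bce(disc(video_feat, real_caps), torch.ones(4)) + \
       bce(disc(video_feat, fake_caps), torch.zeros(4))
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```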
This literature survey gives a clear picture of the different methods used to generate textual descriptions for video, summarized in Table I. The methodologies covered here significantly improve the performance of the video captioning task, yet the field still needs improvement, and many unattended challenges remain that allow researchers to focus on video captioning and generate descriptions closer to human quality.

III. DATASET

This section presents various datasets used for image and video captioning. These training datasets usually consist of an image or video and its ground truth sentences.

A. Image Dataset
1) Microsoft COCO [20]: the biggest corpus for image captioning. In this dataset, 82,783 images are allocated for training, 40,504 for validation and 40,775 for testing, and each image is annotated with five captions by humans.
2) Flickr30K [21]: consists of 31,783 images taken from Flickr, of which 29,000 are allocated for training, 1,000 for validation and 1,000 for testing; each image is associated with 5 descriptions. It mostly covers human activities.

B. Video Dataset
1) Montreal Video Annotation Dataset (M-VAD) [22]: consists of about 49,000 video clips extracted from 92 DVD movies, with roughly 39,000 clips allocated for training, 4,900 for validation and 5,000 for testing. Each clip is described by a single sentence, and describing a movie snippet with one sentence of ground truth is a very challenging task.
2) MPII Movie Description Corpus (MPII-MD) [23]: consists of 37,000 video clips collected from 55 movies with audio descriptions and 31,000 video clips from 49 Hollywood movies. Each clip is paired with a single sentence from the descriptive video service or the movie script.

3) Microsoft Research Video Description Corpus (MSVD) [24]: consists of 1,970 YouTube video clips collected via Amazon Mechanical Turk (AMT); each clip is about 10 seconds long. In this dataset, 1,200 clips are allocated for training, 100 for validation and 670 for testing. The clips are annotated with single sentences in several languages, with about 40 sentences per clip in English. Fig. 3 shows a sample sentence generated for a video.

Fig. 3. An example from the MSVD dataset with the associated ground truth.
4) MSR Video to Text (MSR-VTT) [25]: the most recent large-scale dataset for video captioning. It consists of 10,000 web video clips totalling 41.2 hours across 20 categories such as sports, music, gaming and TV shows, with 20 human-annotated descriptions per clip. In this dataset, 6,513 clips are allocated for training, 497 for validation and 2,990 for testing. However, the short length of the clips is a limitation.
5) YouTube2Text video corpus [26]: consists of 1,970 video snippets with 80,839 sentences in total, about 41 human-annotated descriptions per video, each roughly 8 words long. It is an open-domain dataset covering several topics such as sports and music. In this dataset, 1,200 videos are allocated for training, 100 for validation and 670 for testing.
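A small sketch follows of how such a corpus is usually organized in practice, using the split sizes quoted in this section; the record layout and identifiers are illustrative assumptions rather than any dataset's actual file format.

```python
# Illustrative organization of a video captioning corpus: each item pairs a video
# id with several human-written reference sentences, and clips are partitioned
# into fixed train/validation/test splits (counts taken from Section III).
DATASET_SPLITS = {              # (train, validation, test) clip counts
    "MSVD":    (1200, 100, 670),
    "M-VAD":   (39000, 4900, 5000),
    "MSR-VTT": (6513, 497, 2990),
}

sample_record = {
    "video_id": "vid0001",      # hypothetical identifier
    "split": "train",
    "captions": [
        "a man is playing a guitar",
        "someone strums an acoustic guitar",
    ],
}

for name, (train, val, test) in DATASET_SPLITS.items():
    print(f"{name}: {train} train / {val} val / {test} test clips")
```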
IV. EVALUATION METRICS

Several evaluation metrics have been proposed to evaluate the results of image and video captioning [27]. The accuracy of a generated sentence is measured against the ground truth by comparing the n-grams of the human-annotated sentence and the machine-generated sentence. Table II shows the evaluation results of various video captioning methods. The following evaluation metrics are commonly used for video captioning.

A. BLEU
BLEU (BiLingual Evaluation Understudy) [28] is the simplest and most widely used metric for video description generation. It measures the numerical translation closeness between the ground truth sentence and the machine output. Small changes or grammatical differences in word order are not considered by this metric, and it is best suited to shorter sentences.

B. ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [29] was initially developed for automatic document summarization. ROUGE is similar to BLEU, but the difference is that ROUGE measures n-gram occurrences against the total in the human-annotated sentences (recall), while BLEU counts occurrences against the total in the generated sentences (precision). It has four variants, namely ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S(U), of which ROUGE-N and ROUGE-L are popularly used for video captioning.

C. METEOR
METEOR (Metric for Evaluation of Translation with Explicit Ordering) [30] is a mean of unigram-based precision and recall scores. The major difference from BLEU is that METEOR combines both recall and precision. The strict-matching limitation of BLEU and ROUGE is relaxed in METEOR by matching word unigrams together with their synonyms.

D. CIDEr
CIDEr (Consensus-based Image Description Evaluation) [31] is a metric originally proposed for image captioning. It measures the consensus between the sentence generated for an image and the sentences annotated by humans, and it is an extension of the TF-IDF method: the two sequences are compared using cosine similarity, which can lead to insignificant or ineffective caption evaluation.
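The simplified sketch below makes the BLEU and ROUGE-L computations discussed above concrete: modified n-gram precision with a brevity penalty, and an LCS-based F-measure. Official evaluation scripts additionally handle multiple references, clipping across references and smoothing, so this is only an illustration.

```python
# Simplified single-reference BLEU and ROUGE-L for illustration.
import math
from collections import Counter


def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))  # penalize short outputs
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)


def rouge_l(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    # Dynamic-programming longest common subsequence.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, cw in enumerate(cand, 1):
        for j, rw in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * precision * recall / (precision + recall)


hypothesis = "a man is cutting an onion"
reference = "a man is cutting an onion in the kitchen"
print(round(bleu(hypothesis, reference), 3), round(rouge_l(hypothesis, reference), 3))
# -> 0.607 0.8
```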
TABLE II
COMPARISON OF PERFORMANCE MEASURES OF VARIOUS METHODS

Model | B@1 | B@2 | B@3 | B@4 | METEOR | CIDEr
DS-RNN [4] | - | - | - | 53.0 | 34.7 | 79.4
aLSTM [6] | 81.8 | 70.8 | 61.1 | 50.8 | 33.3 | 74.8
BiLSTM [7] | 79.4 | 60.5 | 48.6 | 37.1 | 29.8 | 79.0
MS-RNN [8] | 82.9 | 72.6 | 63.5 | 53.3 | 33.8 | 74.8
hLSTMat [14] | 79.4 | 63.5 | 48.7 | 36.8 | 28.2 | 120.5
LSTM-TSA [15] | 82.8 | 72.0 | 62.8 | 52.8 | 33.5 | 74.0
SeqVLAD [16] | - | - | - | 50.4 | 33.17 | 77.13
h-RNN [17] | 81.5 | 70.4 | 60.4 | 49.9 | 32.6 | 65.8
AIA [18] | - | - | - | 49.5 | 32.7 | 67.0
LSTM-GAN [19] | - | - | - | 42.9 | 30.4 | -

V. INFERENCE MADE

Despite extensive research on video captioning techniques, this survey shows that the following inferences can be made from studying the existing works:

- Sentences generated by existing systems are still less acceptable than human-annotated sentences.
- Image and video captioning models mainly learn to map low-level visual features to sentences without focusing on high-level semantic video concepts (i.e. objects, actions, etc.).
- Existing methods focus on predefined templates instead of generating more natural and diverse sentences.
- Most existing video captioning techniques do not exploit the temporal nature of video, which is important for describing long-duration videos.

VI. CONCLUSION

This survey has provided information about various methods that use the end-to-end encoder-decoder framework based on deep learning to generate natural language descriptions for video sequences. The paper has also addressed the different datasets used for video and image captioning, as well as the various evaluation metrics used for measuring the performance of different video captioning models. The survey gives readers a clear picture of what has been achieved in the video captioning field so far and of where the gaps exist, so that future research can be better focused.
REFERENCES
[1] X. He and L. Deng, "Deep Learning for Image-to-Text Generation: A Technical Overview," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 109-116, Nov. 2017.
[2] N. Aafaq, S. Z. Gilani, W. Liu, and A. Mian, "Video Description: A Survey of Methods, Datasets and Evaluation Metrics," arXiv preprint arXiv:1806.00186, Jun. 2018.
[3] S. Li, Z. Tao, K. Li, and Y. Fu, "Visual to Text: Survey of Image and Video Captioning," IEEE Transactions on Emerging Topics in Computational Intelligence, 2019.
[4] N. Xu, A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli, "Dual-Stream Recurrent Neural Network for Video Captioning," IEEE Transactions on Circuits and Systems for Video Technology, Mar. 2018.
[5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., "From captions to visual concepts and back," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1473-1482.
[6] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, "Video Captioning with Attention-Based LSTM and Semantic Consistency," IEEE Transactions on Multimedia, vol. 19, no. 9, Sep. 2017.
[7] Y. Bin, Y. Yang, F. Shen, N. Xie, H. T. Shen, and X. Li, "Describing Video with Attention-Based Bidirectional LSTM," IEEE Transactions on Cybernetics, 2019.
[8] J. Song, Y. Guo, L. Gao, X. Li, and H. T. Shen, "From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[9] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., "From captions to visual concepts and back," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1473-1482.
[10] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption Generator," in IEEE CVPR, 2015, pp. 3156-3164.
[11] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, "Semantic Compositional Networks for Visual Captioning," arXiv:1611.08002v2, Mar. 2017.
[12] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image Captioning with Semantic Attention," in IEEE CVPR, 2016.
[13] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to Sequence - Video to Text," in Proc. IEEE International Conference on Computer Vision, 2015, pp. 4534-4542.
[14] L. Gao, X. Li, J. Song, and H. T. Shen, "Hierarchical LSTMs with Adaptive Attention for Visual Captioning," accepted in IEEE Journal of LaTeX Class Files, vol. 14, no. 8, Aug. 2015.
[15] Y. Pan, T. Yao, H. Li, and T. Mei, "Video Captioning with Transferred Semantic Attributes," in IEEE CVPR, 2017, pp. 984-992.
[16] Y. Xu, Y. Han, R. Hong, and Q. Tian, "Sequential Video VLAD: Training the Aggregation Locally and Temporally," IEEE Transactions on Image Processing, vol. 27, no. 10, Oct. 2018.
[17] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, "Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks," in IEEE CVPR, 2016, pp. 4584-4593.
[18] N. Xu, A. Liu, W. Nie, and Y. Su, "Attention-In-Attention Networks for Surveillance Video Understanding in IoT," IEEE Internet of Things Journal, 2018.
[19] Y. Yang, J. Zhou, J. Ai, Y. Bin, A. Hanjalic, H. T. Shen, and Y. Ji, "Video Captioning by Adversarial LSTM," IEEE Transactions on Image Processing, 2018.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in European Conference on Computer Vision, Springer, 2014, pp. 740-755.
[21] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," ACL, vol. 2, pp. 67-78, 2014.
[22] A. Torabi, C. Pal, H. Larochelle, and A. Courville, "Using descriptive video services to create a large data source for video annotation research," arXiv preprint arXiv:1503.01070, 2015.
[23] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3202-3212.
[24] D. L. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation," in ACL: Human Language Technologies, vol. 1, Association for Computational Linguistics, 2011, pp. 190-200.
[25] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, 2016, pp. 5288-5296.
[26] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," in Proc. IEEE International Conference on Computer Vision, 2013, pp. 2712-2719.
[27] J. Park, C. Song, and J.-H. Han, "A Study of Evaluation Metrics and Datasets for Video Captioning," in ICIIBMS, 2017.
[28] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), Stroudsburg, PA, USA, 2002, pp. 311-318.
[29] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 2004.
[30] D. Elliott and F. Keller, "Image description using visual dependency representations," in Proc. Empirical Methods in Natural Language Processing, 2013, vol. 13, pp. 1292-1302.
[31] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4566-4575.

