Reconstruction Network for Video Captioning

This paper addresses the problem of describing the visual content of a video sequence with natural language. Unlike previous video captioning work, which mainly exploits cues from the video content to generate a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture that leverages both the forward flow (video to sentence) and the backward flow (sentence to video). Specifically, the encoder-decoder exploits the forward flow to produce a sentence description from the encoded video semantic features. Two types of reconstructors are designed to exploit the backward flow and reproduce the video features from the hidden-state sequence generated by the decoder. The generation loss from the encoder-decoder and the reconstruction loss from the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor boosts encoder-decoder models and yields significant gains in video captioning accuracy.
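
The abstract describes a joint training objective: a generation loss from the forward (video-to-sentence) flow combined with a reconstruction loss from the backward (sentence-to-video) flow. The minimal PyTorch sketch below illustrates how such a joint objective could be wired up; the module name RecNetLoss, the cross-entropy and MSE loss choices, and the rec_weight trade-off parameter are illustrative assumptions rather than the paper's exact formulation.

```python
# A minimal sketch of a joint encoder-decoder-reconstructor objective,
# assuming cross-entropy for caption generation and MSE for feature
# reconstruction. Names and the trade-off weight are hypothetical.
import torch
import torch.nn as nn

class RecNetLoss(nn.Module):
    def __init__(self, rec_weight: float = 0.2):
        super().__init__()
        self.rec_weight = rec_weight              # balances forward and backward flows
        self.gen_loss = nn.CrossEntropyLoss()     # forward flow: video -> sentence
        self.rec_loss = nn.MSELoss()              # backward flow: sentence -> video

    def forward(self, word_logits, word_targets, rec_features, video_features):
        # word_logits:    (batch * seq_len, vocab) decoder output scores
        # word_targets:   (batch * seq_len,) ground-truth word indices
        # rec_features:   video features reproduced from decoder hidden states
        # video_features: original encoded video features
        l_gen = self.gen_loss(word_logits, word_targets)
        l_rec = self.rec_loss(rec_features, video_features)
        return l_gen + self.rec_weight * l_rec    # joint end-to-end objective
```

Under this kind of formulation, the reconstruction term acts as a regularizer: the decoder's hidden states are pushed to retain enough information about the input video to reproduce its features, which is the intuition behind the gains the abstract reports.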
