Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning

Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been driven largely by improved visual perception, while syntax representation learning remains under-explored. We propose a novel video captioning approach that accounts for both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntactic structure of captions, and accordingly perform syntax representation learning by directly inferring POS tags from videos. Visual perception is implemented by a mixture model that translates visual cues into lexical words conditioned on the learned syntactic structure of the sentence. The video captioning task thus consists of two sub-tasks, video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our method substantially outperforms state-of-the-art methods, validating the benefit of jointly modeling syntax representation learning and visual perception for video captioning.
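The joint modeling described above can be pictured as two coupled modules trained under a shared objective: one infers a POS-tag template from the video, the other produces per-step word distributions conditioned on that template. Below is a minimal PyTorch sketch of this structure; the module names, dimensions, fixed-length template decoding, and the softmax-gated mixture are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Minimal sketch of the two sub-tasks (video POS tagging + visual cue translation).
# All names and hyperparameters here are hypothetical placeholders.
import torch
import torch.nn as nn


class VideoPOSTagger(nn.Module):
    """Infers a sequence of POS-tag logits (the syntactic template) from video features."""

    def __init__(self, feat_dim=1024, hidden_dim=512, num_pos_tags=20):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.pos_head = nn.Linear(hidden_dim, num_pos_tags)

    def forward(self, video_feats, max_len=15):
        # video_feats: (batch, num_frames, feat_dim)
        _, h = self.encoder(video_feats)                  # (1, batch, hidden)
        # Unroll a fixed-length template; a real model would decode step by step.
        dec_in = h.transpose(0, 1).repeat(1, max_len, 1)  # (batch, max_len, hidden)
        dec_out, _ = self.decoder(dec_in, h)
        pos_logits = self.pos_head(dec_out)               # (batch, max_len, num_pos_tags)
        return pos_logits, dec_out


class VisualCueTranslator(nn.Module):
    """Mixture model: per-step word distributions conditioned on the predicted syntax."""

    def __init__(self, hidden_dim=512, vocab_size=10000, num_mixtures=4, num_pos_tags=20):
        super().__init__()
        self.mixture_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(num_mixtures)]
        )
        # Mixture weights depend on the (soft) POS prediction at each step.
        self.gate = nn.Linear(num_pos_tags, num_mixtures)

    def forward(self, dec_out, pos_logits):
        gate = torch.softmax(self.gate(torch.softmax(pos_logits, dim=-1)), dim=-1)
        word_logits = torch.stack([head(dec_out) for head in self.mixture_heads], dim=-2)
        # Weighted sum over mixture components -> (batch, max_len, vocab_size)
        word_probs = (gate.unsqueeze(-1) * torch.softmax(word_logits, dim=-1)).sum(dim=-2)
        return word_probs


if __name__ == "__main__":
    feats = torch.randn(2, 30, 1024)                      # toy batch of video features
    tagger, translator = VideoPOSTagger(), VisualCueTranslator()
    pos_logits, dec_out = tagger(feats)
    word_probs = translator(dec_out, pos_logits)
    print(pos_logits.shape, word_probs.shape)
```

In an end-to-end setup of this kind, a POS-tagging loss on `pos_logits` and a captioning loss on `word_probs` would be combined and back-propagated jointly, so the syntactic and lexical predictions shape each other during training.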
