Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning

Video captioning is a fundamental task in visual understanding. Previous works employ end-to-end networks that learn from low-level visual features to generate descriptive captions, but such models struggle to recognize fine-grained objects and lack an understanding of crucial semantic concepts. As observed by DPC [6], these concepts are generally present in the narrative transcripts of instructional videos, and incorporating the transcript alongside the video can improve captioning performance. However, DPC simply concatenates the transcript embedding with the video features, which fuses the language and vision features ineffectively and leads to temporal misalignment between transcript and video. This motivates us to 1) learn semantic concepts explicitly and 2) design a temporal alignment mechanism that better aligns the video and transcript for the captioning task. In this paper, we start from a Transformer-based encoder-decoder backbone. First, we design a semantic concept prediction module as an auxiliary task that trains the encoder in a supervised manner. Then, we develop an attention-based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism that enables the decoder (generation) module to copy important concepts directly from the source transcript. Extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on the YouCookII dataset.
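To make the three components concrete, the following is a minimal PyTorch-style sketch of how a concept prediction head, attention-based cross-modal alignment, and a copy mechanism could be wired into a Transformer encoder-decoder. This is not the authors' released code: all module names, dimensions, pooling choices, and the fusion scheme are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout) of: (1) a semantic-concept prediction
# head used as an auxiliary supervised task on the encoder, (2) attention-based
# temporal alignment between video frames and transcript sentences, and
# (3) a copy gate that mixes generation with copying from the transcript.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProceduralCaptioner(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=2,
                 vocab_size=10000, n_concepts=1000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.text_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # (1) semantic concept head: multi-label logits over a concept vocabulary
        self.concept_head = nn.Linear(d_model, n_concepts)
        # (2) cross-modal alignment: video frames attend to transcript positions
        self.align_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.gen_head = nn.Linear(d_model, vocab_size)
        # (3) copy gate: per-step probability of copying a transcript token
        self.copy_gate = nn.Linear(d_model, 1)

    def forward(self, video_feats, transcript_embs, transcript_ids, caption_embs):
        # video_feats:     (B, T_v, d) frame-level features
        # transcript_embs: (B, T_t, d) transcript token embeddings
        # transcript_ids:  (B, T_t)    transcript token ids (LongTensor)
        # caption_embs:    (B, T_c, d) shifted caption embeddings
        v = self.video_encoder(video_feats)
        t = self.text_encoder(transcript_embs)
        # auxiliary multi-label concept prediction from pooled encoder states
        concept_logits = self.concept_head(torch.cat([v, t], dim=1).mean(dim=1))
        # align each video frame with transcript positions via cross-attention
        aligned, _ = self.align_attn(query=v, key=t, value=t)
        memory = torch.cat([v + aligned, t], dim=1)   # fused cross-modal memory
        h = self.decoder(caption_embs, memory)        # causal mask omitted for brevity
        gen_probs = F.softmax(self.gen_head(h), dim=-1)
        # copy distribution: decoder attention over transcript positions,
        # scattered back onto the word vocabulary
        copy_attn = F.softmax(torch.bmm(h, t.transpose(1, 2)), dim=-1)  # (B, T_c, T_t)
        index = transcript_ids.unsqueeze(1).expand(-1, h.size(1), -1)
        copy_probs = torch.zeros_like(gen_probs).scatter_add(2, index, copy_attn)
        p_copy = torch.sigmoid(self.copy_gate(h))     # (B, T_c, 1)
        word_probs = (1 - p_copy) * gen_probs + p_copy * copy_probs
        return word_probs, concept_logits
```

Training would then combine the captioning loss with the auxiliary concept loss, e.g. a negative log-likelihood over word_probs plus a weighted multi-label binary cross-entropy over concept_logits, with concept labels mined from the ground-truth captions; the loss weighting and label construction are again assumptions of this sketch rather than details given in the abstract.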

[1] Christopher D. Manning et al. Effective Approaches to Attention-based Neural Machine Translation. EMNLP, 2015.

[2] Yejin Choi et al. Baby talk: Understanding and generating simple image descriptions. CVPR, 2011.

[3] Wang Ling et al. Reference-Aware Language Models. EMNLP, 2016.

[4] Xiaodong Liu et al. Unified Language Model Pre-training for Natural Language Understanding and Generation. NeurIPS, 2019.

[5] Salim Roukos et al. Bleu: a Method for Automatic Evaluation of Machine Translation. ACL, 2002.

[6] Ming Zhou et al. Dense Procedure Captioning in Narrated Instructional Videos. ACL, 2019.

[7] Zhe Gan et al. Semantic Compositional Networks for Visual Captioning. CVPR, 2017.

[8] Radu Soricut et al. A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions. CoNLL, 2019.

[9] Jiebo Luo et al. Image Captioning with Semantic Attention. CVPR, 2016.

[10] Cordelia Schmid et al. VideoBERT: A Joint Model for Video and Language Representation Learning. ICCV, 2019.

[11] Jiebo Luo et al. Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments. HLT-NAACL, 2015.

[12] Fei-Fei Li et al. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.

[13] Alon Lavie et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. IEEvaluation@ACL, 2005.

[14] Luowei Zhou et al. End-to-End Dense Video Captioning with Masked Transformer. CVPR, 2018.

[15] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[16] Jean Carletta et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. ACL, 2005.

[17] Feiping Nie et al. Dense Multimodal Fusion for Hierarchically Joint Representation. ICASSP, 2019.

[18] Lei Zhang et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR, 2018.

[19] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[20] Juan Carlos Niebles et al. Dense-Captioning Events in Videos. ICCV, 2017.

[21] Chenliang Xu et al. Towards Automatic Learning of Procedures From Web Instructional Videos. AAAI, 2017.

[22] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[23] Cordelia Schmid et al. Contrastive Bidirectional Transformer for Temporal Representation Learning. arXiv preprint, 2019.

[24] Tamara L. Berg et al. Baby Talk: Understanding and Generating Image Descriptions. 2011.

[25] Christopher Joseph Pal et al. Describing Videos by Exploiting Temporal Structure. ICCV, 2015.

[26] Bingbing Ni et al. Fine-Grained Video Captioning for Sports Narrative. CVPR, 2018.

[27] Kevin Murphy et al. What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision. NAACL, 2015.

[28] Subhashini Venugopalan et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. NAACL, 2014.

[29] C. Lawrence Zitnick et al. CIDEr: Consensus-based image description evaluation. CVPR, 2015.

[30] Lukasz Kaiser et al. Attention is All you Need. NIPS, 2017.

[31] Chin-Yew Lin et al. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. ACL, 2004.

[32] Ivan Laptev et al. Unsupervised Learning from Narrated Instruction Videos. CVPR, 2016.

[33] Trevor Darrell et al. Sequence to Sequence -- Video to Text. ICCV, 2015.