End-to-end Generative Pretraining for Multimodal Video Captioning

Recent video-and-language pretraining frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework that learns from unlabelled videos and can be used effectively for generative tasks such as multimodal video captioning. Unlike recent video-language pretraining frameworks, ours trains a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective: we generate future utterances given the present multimodal context, and the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval, and action classification.
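To make the bidirectional objective concrete, below is a minimal sketch of how the two generation losses could be combined, written in PyTorch with a toy Transformer. The architecture, dimensions, and all names (TinyBiGenModel, generation_loss, the toy vocabulary and frame-feature size) are illustrative assumptions for exposition, not the MV-GPT implementation.

```python
# A minimal, self-contained sketch of the bidirectional generation objective
# described in the abstract: generate the future utterance from the present
# multimodal context, and the present utterance from the future one.
# Everything below (model, sizes, names) is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyBiGenModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, frame_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d_model)   # frame features -> model dim
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def generation_loss(self, frames, context_utt, target_utt):
        # Multimodal context: projected frame features concatenated with the
        # embedded context-utterance tokens, fed to the encoder.
        src = torch.cat([self.frame_proj(frames), self.tok_embed(context_utt)], dim=1)
        # Teacher forcing: decoder input is the target shifted right by one.
        tgt_in, tgt_out = target_utt[:, :-1], target_utt[:, 1:]
        causal = self.transformer.generate_square_subsequent_mask(tgt_in.size(1))
        hidden = self.transformer(src, self.tok_embed(tgt_in), tgt_mask=causal)
        logits = self.lm_head(hidden)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))

    def forward(self, frames, present_utt, future_utt):
        # Forward direction: present frames + present utterance -> future utterance.
        loss_fwd = self.generation_loss(frames, present_utt, future_utt)
        # Backward direction: present frames + future utterance -> present utterance.
        loss_bwd = self.generation_loss(frames, future_utt, present_utt)
        return loss_fwd + loss_bwd


# Toy usage with random data.
model = TinyBiGenModel()
frames = torch.randn(2, 8, 512)             # 2 clips, 8 frame features each
present = torch.randint(0, 1000, (2, 12))   # present (transcribed speech) token ids
future = torch.randint(0, 1000, (2, 12))    # future utterance token ids
loss = model(frames, present, future)
loss.backward()
```

Both directions share the same encoder and decoder and are summed into a single loss, matching the abstract's description of joint, end-to-end training from raw pixels and transcribed speech.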
