VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

We present a simplified, task-agnostic multimodal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training methods are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all of the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.
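
To make the two masking schemes concrete, below is a minimal PyTorch sketch of a single shared encoder that accepts video features, text tokens, or both, and applies the kind of masking the abstract describes. All names (SharedEncoder, the scheme flag, the dimensions) are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import random
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One transformer encoder shared by video features and text tokens."""
    def __init__(self, vocab_size=30522, video_dim=512, hidden=768, layers=6):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)    # text tokens -> hidden states
        self.vid_proj = nn.Linear(video_dim, hidden)       # video features -> hidden states
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.mask_emb = nn.Parameter(torch.zeros(hidden))  # learned [MASK] embedding

    def forward(self, video_feats, text_ids, scheme="cross"):
        v = self.vid_proj(video_feats)             # (B, Tv, H)
        t = self.tok_emb(text_ids).clone()         # (B, Tt, H)
        if scheme == "cross":
            # Mask one text position; the model must recover it from the
            # surrounding video tokens, which forces cross-modal fusion.
            t[:, random.randrange(t.size(1))] = self.mask_emb
        elif scheme == "text_only":
            # Unimodal step: drop the video stream entirely so the text
            # representation stays useful on its own (separability).
            v = v[:, :0]
        x = torch.cat([v, t], dim=1)               # one joint sequence
        return self.encoder(x)

# Usage: alternate masking schemes across training steps.
model = SharedEncoder()
video = torch.randn(2, 8, 512)                     # 2 clips, 8 frame features each
text = torch.randint(0, 30522, (2, 12))            # 2 captions, 12 tokens each
out = model(video, text, scheme="cross")
print(out.shape)                                   # torch.Size([2, 20, 768])
```

The design choice sketched here is that both modalities pass through the same encoder, so switching between cross-modal and unimodal prediction is just a matter of which inputs are masked or dropped at each step.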
