VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely studied vision-language pretraining models, VALOR jointly models relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain VALOR: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, building vision-language, audio-language, and audiovisual-language alignment simultaneously. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, which contains one million audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR learns strong multimodal correlations and generalizes to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available at the project page: https://casia-iva-group.github.io/projects/VALOR.
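
To make the two pretext tasks concrete, the sketch below shows one plausible PyTorch formulation of MGA and MGC. It is not the released implementation: the three encoders are abstracted as precomputed features, and the additive audiovisual fusion, the InfoNCE-style symmetric contrastive objective, and the small causal captioning decoder are all assumptions made for illustration (the paper's actual MGC formulation may differ, e.g. in its masking scheme).

```python
# Minimal sketch of the MGA and MGC pretext tasks, assuming pooled
# (batch, dim) features for alignment and (batch, seq, dim) feature
# sequences for captioning. All module choices are illustrative.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValorSketch(nn.Module):
    def __init__(self, dim=512, vocab=30522):
        super().__init__()
        self.v_proj = nn.Linear(dim, dim)   # vision -> common space
        self.a_proj = nn.Linear(dim, dim)   # audio  -> common space
        self.t_proj = nn.Linear(dim, dim)   # text   -> common space
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07)
        self.tok_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def mga_loss(self, v, a, t):
        # Multimodal Grouping Alignment: project every modality into one
        # common space and contrast text with each group {V, A, AV}.
        v = F.normalize(self.v_proj(v), dim=-1)
        a = F.normalize(self.a_proj(a), dim=-1)
        t = F.normalize(self.t_proj(t), dim=-1)
        av = F.normalize(v + a, dim=-1)        # fusion rule is an assumption
        labels = torch.arange(t.size(0), device=t.device)
        scale = self.logit_scale.exp()
        loss = 0.0
        for grp in (v, a, av):
            logits = scale * t @ grp.T         # (batch, batch) similarities
            # symmetric InfoNCE: text->group and group->text directions
            loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                                 + F.cross_entropy(logits.t(), labels))
        return loss / 3

    def mgc_loss(self, v_seq, a_seq, tokens):
        # Multimodal Grouping Captioning: decode caption tokens conditioned
        # on a randomly sampled modality group (vision, audio, or both).
        group = random.choice(["v", "a", "av"])
        memory = {"v": v_seq, "a": a_seq,
                  "av": torch.cat([v_seq, a_seq], dim=1)}[group]
        x = self.tok_emb(tokens[:, :-1])       # teacher forcing
        n = x.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf"),
                                       device=x.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)
        return F.cross_entropy(self.lm_head(h).flatten(0, 1),
                               tokens[:, 1:].flatten())
```

Given pooled features `v`, `a`, `t` of shape `(batch, dim)`, feature sequences `v_seq`, `a_seq`, and caption token ids `tokens`, a joint pretraining step would optimize `mga_loss(v, a, t) + mgc_loss(v_seq, a_seq, tokens)`; sampling the conditioning group per batch is what lets a single decoder handle vision-only, audio-only, and audiovisual captioning at inference time.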
