VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets for a single task. In reality, a truly useful VidL system is expected to generalize easily to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce the Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for future study on advanced VidL models. VALUE is available at https://value-benchmark.github.io/.
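
To make the retrieval setting concrete, below is a minimal sketch of how Recall@K is commonly computed for text-to-video retrieval, assuming a text-video similarity matrix in which query i's ground-truth video sits at index i. The cutoffs and the averaging into a single score are illustrative assumptions, not the official VALUE evaluation code.

    import numpy as np

    def recall_at_k(sim, k):
        """Fraction of text queries whose ground-truth video
        (assumed to share the query's row index) ranks in the top k.

        sim: (num_queries, num_videos) text-to-video similarity scores.
        """
        # Video indices sorted by descending similarity, per query.
        ranks = np.argsort(-sim, axis=1)
        # A hit if the ground-truth index appears among the top-k candidates.
        hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
        return hits.mean()

    # Toy example: 4 queries scored against 4 candidate videos.
    rng = np.random.default_rng(0)
    sim = rng.standard_normal((4, 4))
    # Average recall over several cutoffs (illustrative choice of K values).
    aver = np.mean([recall_at_k(sim, k) for k in (1, 5, 10)])
    print(f"Average recall over R@1/5/10: {aver:.3f}")

The video QA and captioning tasks use their own standard metrics (accuracy and n-gram-based scores such as BLEU, METEOR, ROUGE-L, and CIDEr, respectively); see the benchmark site for the exact evaluation protocol.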
