MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time – through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations about videos through all constituent modalities. When finetuned, it sets a new state-of-the-art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining – even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
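
The masked-snippet objective described above is contrastive in spirit: at each MASK position, the model must pick the representation of the true text or audio snippet from a pool of candidates. The sketch below is a minimal, illustrative rendering of that idea in Python/NumPy, not the paper's implementation; the function name, the in-batch-negative setup, and the temperature value are all assumptions for illustration.

```python
import numpy as np

def contrastive_mask_loss(mask_reps, snippet_reps, temperature=0.05):
    """Minimal sketch of a contrastive masked-snippet objective.

    mask_reps:    (N, D) model outputs at the N MASK positions.
    snippet_reps: (N, D) encodings of the text/audio snippets that were
                  masked out; row i is the target for mask i, while the
                  other rows serve as in-batch negatives.
    Returns the mean cross-entropy of picking each mask's true snippet.
    """
    # L2-normalize so the dot product is a cosine similarity.
    mask_reps = mask_reps / np.linalg.norm(mask_reps, axis=-1, keepdims=True)
    snippet_reps = snippet_reps / np.linalg.norm(snippet_reps, axis=-1, keepdims=True)

    # (N, N) similarity matrix: entry (i, j) scores mask i against snippet j.
    logits = mask_reps @ snippet_reps.T / temperature

    # Softmax cross-entropy with the diagonal (true snippet) as the label.
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: 4 masked positions, 16-dimensional representations.
rng = np.random.default_rng(0)
loss = contrastive_mask_loss(rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
print(f"loss: {loss:.3f}")
```

Framing the objective as selection over candidate snippets, rather than token-by-token generation, is what lets a single formulation cover both text and audio targets: each modality only needs an encoder that maps a snippet to a vector.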
