Paper information - MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound


Abstract: As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time – through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that Reserve learns strong multimodal representations. When finetuned, it sets a new state of the art on Visual Commonsense Reasoning (VCR), TVQA, and Kinetics-600, outperforming prior work by 5%, 7%, and 1.5% respectively. Ablations show that these tasks benefit from audio pretraining – even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why audio enables better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.
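
The abstract describes the objective only at a high level: a snippet is replaced by MASK, and the model learns by choosing the correct masked-out snippet. The following is a minimal sketch of what such a "choose the right snippet" step could look like, not the authors' implementation. It assumes the model produces one vector at the MASK position and one vector per candidate snippet, scores them by cosine similarity, and trains with a softmax cross-entropy loss. The function names `choose_snippet` and `contrastive_loss` and the `temperature` value are hypothetical, introduced only for illustration.

```python
import math

def choose_snippet(mask_state, candidates, temperature=0.05):
    """Score each candidate snippet vector against one MASK-position vector.

    Returns (index of the highest-scoring candidate, softmax probabilities).
    The temperature is an illustrative assumption, not a value from the paper.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = normalize(mask_state)
    # Cosine similarity between the MASK vector and every candidate,
    # divided by the temperature to sharpen the distribution.
    logits = [sum(a * b for a, b in zip(q, normalize(c))) / temperature
              for c in candidates]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

def contrastive_loss(mask_state, candidates, target, temperature=0.05):
    """Negative log-probability assigned to the correct (masked-out) snippet."""
    _, probs = choose_snippet(mask_state, candidates, temperature)
    return -math.log(probs[target])

if __name__ == "__main__":
    # Toy example: three candidate snippet vectors; the MASK vector is
    # closest to candidate 1, so the loss is lowest for target=1.
    candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    mask_state = [0.1, 0.9, 0.0]
    index, probs = choose_snippet(mask_state, candidates)
    print("chosen snippet:", index)
    print("loss if snippet 1 is correct:", contrastive_loss(mask_state, candidates, 1))
```

In this sketch the other candidates act as negatives, so lowering the loss pulls the MASK vector toward the correct snippet and away from the rest. Selecting among candidates in this way, rather than generating text or audio, would also be consistent with the out-of-the-box prediction the abstract mentions, although the abstract does not spell out that mechanism.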
