MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
Yejin Choi | Ali Farhadi | Aditya Kusupati | Rowan Zellers | Jiasen Lu | Jack Hessel | Youngjae Yu | Yanpeng Zhao | Ximing Lu | Mohammadreza Salehi
[1] C. Schmid,et al. Multiview Transformers for Video Recognition , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Jian Ma,et al. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 , 2021, Int. J. Comput. Vis..
[3] Adams Wei Yu,et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision , 2021, ICLR.
[4] Federico Raue,et al. Audioclip: Extending Clip to Image, Text and Audio , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[5] Lu Yuan,et al. Florence: A New Foundation Model for Computer Vision , 2021, ArXiv.
[6] Vinay Uday Prabhu,et al. Multimodal datasets: misogyny, pornography, and malignant stereotypes , 2021, ArXiv.
[7] Dmytro Okhonko,et al. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding , 2021, EMNLP.
[8] Roy Schwartz,et al. Data Efficient Masked Language Modeling for Vision and Language , 2021, EMNLP.
[9] Anaelia Ovalle,et al. Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies , 2021, EMNLP.
[10] Michael S. Bernstein,et al. On the Opportunities and Risks of Foundation Models , 2021, ArXiv.
[11] Eduard H. Hovy,et al. Five sources of bias in natural language processing , 2021, Lang. Linguistics Compass.
[12] Hanqing Lu,et al. OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation , 2021, ArXiv.
[13] Jon E. Froehlich,et al. Toward User-Driven Sound Recognizer Personalization with People Who Are d/Deaf or Hard of Hearing , 2021, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol..
[14] Alexander Kolesnikov,et al. Scaling Vision Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Ali Farhadi,et al. MERLOT: Multimodal Neural Script Knowledge Models , 2021, NeurIPS.
[16] Rohit Girdhar,et al. Anticipative Video Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[17] Yejin Choi,et al. VinVL: Revisiting Visual Representations in Vision-Language Models , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Aäron van den Oord,et al. Multimodal Self-Supervised Learning of General Audio Representations , 2021, ArXiv.
[19] Shih-Fu Chang,et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text , 2021, NeurIPS.
[20] David R. So,et al. Carbon Emissions and Large Neural Network Training , 2021, ArXiv.
[21] Jianlin Su,et al. RoFormer: Enhanced Transformer with Rotary Position Embedding , 2021, Neurocomputing.
[22] James R. Glass,et al. AST: Audio Spectrogram Transformer , 2021, Interspeech.
[23] Abhishek,et al. Cross-Modal learning for Audio-Visual Video Parsing , 2021, Interspeech.
[24] M. Blell,et al. Truth from the machine: artificial intelligence and the materialization of identity , 2021, Interdisciplinary Science Reviews.
[25] Emily M. Bender,et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 , 2021, FAccT.
[26] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[27] Zhe Gan,et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[29] Heng Wang,et al. Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.
[30] Jaemin Cho,et al. Unifying Vision-and-Language Tasks via Text Generation , 2021, ICML.
[31] Shih-Fu Chang,et al. VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Joachim Bingel,et al. Disembodied Machine Learning: On the Illusion of Objectivity in NLP , 2021, ArXiv.
[33] Charles Foster,et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling , 2020, ArXiv.
[34] Feng Wang,et al. Understanding the Behaviour of Contrastive Loss , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Colin Raffel,et al. Extracting Training Data from Large Language Models , 2020, USENIX Security Symposium.
[36] C. Schmid,et al. Just Ask: Learning to Answer Questions from Millions of Narrated Videos , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[37] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[38] Nojun Kwak,et al. Self-supervised pre-training and contrastive representation learning for multiple-choice video QA , 2020, AAAI.
[39] Hao Tian,et al. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph , 2020, AAAI.
[40] James R. Glass,et al. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos , 2020, Interspeech.
[41] Justin Johnson,et al. VirTex: Learning Visual Representations from Textual Annotations , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Giovanni Maria Farinella,et al. Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[43] Timnit Gebru,et al. Datasheets for datasets , 2018, Commun. ACM.
[44] Bo Wu,et al. STAR: A Benchmark for Situated Reasoning in Real-World Videos , 2021 .
[45] Maarten Sap,et al. Documenting the English Colossal Clean Crawled Corpus , 2021, ArXiv.
[46] Jack Hessel,et al. Does My Multimodal Model Learn Cross-modal Interactions? It’s Harder to Tell than You Might Think! , 2020, EMNLP.
[47] Christopher D. Manning,et al. Contrastive Learning of Medical Visual Representations from Paired Images and Text , 2020, MLHC.
[48] Yejin Choi,et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models , 2020, FINDINGS.
[49] Tarleton Gillespie,et al. Content moderation, AI, and the question of scale , 2020, Big Data Soc..
[50] Emily M. Bender,et al. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data , 2020, ACL.
[51] Andrew Zisserman,et al. Self-Supervised MultiModal Versatile Networks , 2020, NeurIPS.
[52] Yu Cheng,et al. Large-Scale Adversarial Training for Vision-and-Language Representation Learning , 2020, NeurIPS.
[53] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[54] D. Fox,et al. Watching the World Go By: Representation Learning from Unlabeled Videos , 2020, ArXiv.
[55] Alec Radford,et al. Scaling Laws for Neural Language Models , 2020, ArXiv.
[56] S. Gelly,et al. Big Transfer (BiT): General Visual Representation Learning , 2019, ECCV.
[57] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[58] Kevin Gimpel,et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.
[59] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.
[60] Kyunghyun Cho,et al. Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models , 2019, ICLR.
[61] Virgílio A. F. Almeida,et al. Auditing radicalization pathways on YouTube , 2019, FAT*.
[62] Omer Levy,et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans , 2019, TACL.
[63] Victo José da Silva Neto. Platform capitalism , 2019, Revista Brasileira de Inovação.
[64] Emily Ahn,et al. Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts , 2019, EMNLP.
[65] Gabriel Ilharco,et al. Large-Scale Representation Learning from Visually Grounded Untranscribed Speech , 2019, CoNLL.
[66] Luke Zettlemoyer,et al. Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases , 2019, EMNLP.
[67] Mohit Bansal,et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.
[68] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.
[69] Stefan Lee,et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.
[70] David Reitter,et al. Fusion of Detected Objects in Text for Visual Question Answering , 2019, EMNLP.
[71] Matthijs Douze,et al. Fixing the train-test resolution discrepancy , 2019, NeurIPS.
[72] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[73] Andrew McCallum,et al. Energy and Policy Considerations for Deep Learning in NLP , 2019, ACL.
[74] R Devon Hjelm,et al. Learning Representations by Maximizing Mutual Information Across Views , 2019, NeurIPS.
[75] Ali Farhadi,et al. Defending Against Neural Fake News , 2019, NeurIPS.
[76] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[77] Ali Farhadi,et al. From Recognition to Cognition: Visual Commonsense Reasoning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[78] Frank Hutter,et al. Decoupled Weight Decay Regularization , 2017, ICLR.
[79] Louis-Philippe Morency,et al. Multimodal Machine Learning: A Survey and Taxonomy , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[80] Radu Soricut,et al. A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions , 2019, CoNLL.
[81] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[82] Licheng Yu,et al. TVQA: Localized, Compositional Video Question Answering , 2018, EMNLP.
[83] Andrew Zisserman,et al. A Short Note about Kinetics-600 , 2018, ArXiv.
[84] Quoc V. Le,et al. AutoAugment: Learning Augmentation Policies from Data , 2018, ArXiv.
[85] Morgan Klaus Scheuerman,et al. Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems , 2018, CHI.
[86] Matthew Crain,et al. The limits of transparency: Data brokers and commodification , 2018, New Media Soc..
[87] Jitendra Malik,et al. From Lifestyle Vlogs to Everyday Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[88] Yejin Choi,et al. Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[89] Omkar M. Parkhi,et al. VGGFace2: A Dataset for Recognising Faces across Pose and Age , 2017, 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).
[90] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[91] Yueting Zhuang,et al. Video Question Answering via Gradually Refined Attention over Appearance and Motion , 2017, ACM Multimedia.
[92] Jieyu Zhao,et al. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints , 2017, EMNLP.
[93] Joon Son Chung,et al. VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.
[94] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[95] Mariana L. Neves,et al. Neural Domain Adaptation for Biomedical Question Answering , 2017, CoNLL.
[96] Rachael Tatman,et al. Gender and Dialect Bias in YouTube’s Automatic Captions , 2017, EthNLP@EACL.
[97] Samy Bengio,et al. Tacotron: Towards End-to-End Speech Synthesis , 2017, INTERSPEECH.
[98] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[99] Yash Goyal,et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2016, International Journal of Computer Vision.
[100] Tegan Maharaj,et al. A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[101] Takeo Kanade,et al. Computer Vision for Assistive Technologies , 2022, Computer Vision and Image Understanding.
[102] Apostol Natsev,et al. YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.
[103] Philipp Koehn,et al. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 2016 .
[104] Martial Hebert,et al. An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders , 2016, ECCV.
[105] Christopher Joseph Pal,et al. Movie Description , 2016, International Journal of Computer Vision.
[106] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[107] Karol J. Piczak. ESC: Dataset for Environmental Sound Classification , 2015, ACM Multimedia.
[108] Laura A. Dabbish,et al. "My Data Just Goes Everywhere:" User Mental Models of the Internet and Implications for Privacy and Security , 2015, SOUPS.
[109] Alexei A. Efros,et al. Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[110] Shoshana Zuboff,et al. Big other: surveillance capitalism and the prospects of an information civilization , 2015, J. Inf. Technol..
[111] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[112] Daniel Brissaud,et al. Drawing a chip environmental profile: environmental indicators for the semiconductor industry , 2015 .
[113] Justin Salamon,et al. A Dataset and Taxonomy for Urban Sound Research , 2014, ACM Multimedia.
[114] Danah Boyd,et al. Networked privacy: How teenagers negotiate context in social media , 2014, New Media Soc..
[115] John R Clark,et al. When good isn't good enough. , 2014, Air medical journal.
[116] Benjamin Van Durme,et al. Reporting bias and knowledge acquisition , 2013, AKBC '13.
[117] Christian Fuchs,et al. An Alternative View of Privacy on Facebook , 2011, Inf..
[118] Yael Pritch,et al. Clustered Synopsis of Surveillance Video , 2009, 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance.
[119] Travis L. Dixon. Crime News and Racialized Beliefs: Understanding the Relationship Between Local News Viewing and Perceptions of African Americans and Crime , 2008 .
[120] Janice Singer,et al. Exploring the Gender Divide on YouTube: An Analysis of the Creation and Reception of Vlogs , 2008 .
[121] Michael Gasser,et al. The Development of Embodied Cognition: Six Lessons from Babies , 2005, Artificial Life.
[122] Felix Gutierrez,et al. White News: Why Local News Programs Don't Cover People of Color , 2000 .
[123] Travis L. Dixon,et al. Overrepresentation and Underrepresentation of African Americans and Latinos as Lawbreakers on Television News , 2000 .
[124] R S Chapman,et al. Children's language learning: an interactionist perspective. , 2000, Journal of child psychology and psychiatry, and allied disciplines.
[125] G. Edelman. Neural Darwinism: Selection and reentrant signaling in higher brain function , 1993, Neuron.
[126] Donna Haraway. Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective , 2022, Philosophical Literary Journal Logos.
[127] Jae S. Lim,et al. Signal estimation from modified short-time Fourier transform , 1983, ICASSP.
[128] P. L. Adams. The Origins of Intelligence in Children , 1976 .
[129] R. Schank,et al. Scripts, plans, and knowledge , 1975, IJCAI 1975.