When and why vision-language models behave like bags-of-words, and what to do about it?
[1] Aylin Caliskan, et al. Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias, 2022, FAccT.
[2] James Y. Zou, et al. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale, 2022, arXiv.
[3] Kyle Mahowald, et al. Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality, 2022, EMNLP.
[4] Kai-Wei Chang, et al. How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?, 2022, EMNLP.
[5] Yu-Gang Jiang, et al. OmniVL: One Foundation Model for Image-Language and Video-Language Tasks, 2022, NeurIPS.
[6] Li Dong, et al. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, 2022, arXiv.
[7] T. Ullman, et al. Testing Relational Understanding in Text-Guided Image Generation, 2022, arXiv.
[8] Aylin Caliskan, et al. American == White in Multimodal Language-and-Image AI, 2022, AIES.
[9] Thomas Serre, et al. A Benchmark for Compositional Visual Reasoning, 2022, NeurIPS.
[10] David J. Fleet, et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, 2022, NeurIPS.
[11] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[12] Prafulla Dhariwal, et al. Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022, arXiv.
[13] Tristan Thrush, et al. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality, 2022, CVPR.
[14] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.
[15] Anette Frank, et al. VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena, 2021, ACL.
[16] Marcus Rohrbach, et al. FLAVA: A Foundational Language And Vision Alignment Model, 2021, CVPR.
[17] Hang Li, et al. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, 2021, ICML.
[18] Daniel Keysers, et al. LiT: Zero-Shot Transfer with Locked-image text Tuning, 2021, CVPR.
[19] Xuezhi Wang, et al. Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, 2021, NeurIPS.
[20] Christopher D. Manning, et al. Contrastive Learning of Medical Visual Representations from Paired Images and Text, 2020, MLHC.
[21] Mohit Bansal, et al. DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, 2022, arXiv.
[22] Madian Khabsa, et al. A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision, 2021, arXiv.
[23] Jonathan Berant, et al. COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images, 2021, EMNLP.
[24] Desmond Elliott, et al. Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, 2021, EMNLP.
[25] Arvind Narayanan, et al. Mitigating dataset harms requires stewardship: Lessons from 1000 papers, 2021, NeurIPS Datasets and Benchmarks.
[26] Junnan Li, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021, NeurIPS.
[27] Jacob Andreas, et al. What Context Features Can Transformer Language Models Use?, 2021, ACL.
[28] Douwe Kiela, et al. Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little, 2021, EMNLP.
[29] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[30] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.
[31] Inioluwa Deborah Raji, et al. About Face: A Survey of Facial Recognition Evaluation, 2021, arXiv.
[32] Long Mai, et al. Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks?, 2020, Findings of ACL.
[33] Albert Gatt, et al. Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks, 2020, MMSR.
[34] S. Sra, et al. Contrastive Learning with Hard Negative Samples, 2020, ICLR.
[35] Vinay Uday Prabhu, et al. Large image datasets: A pyrrhic win for computer vision?, 2020, WACV.
[36] Alexandra Schofield, et al. How effective is BERT without word ordering? Implications for language understanding and data privacy, 2021, ACL.
[37] Yannis Kalantidis, et al. Hard Negative Mixing for Contrastive Learning, 2020, NeurIPS.
[38] M. Bethge, et al. Shortcut learning in deep neural networks, 2020, Nature Machine Intelligence.
[39] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.
[40] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, JMLR.
[41] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, arXiv.
[42] Allyson Ettinger, et al. What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models, 2019, TACL.
[43] Matthias Bethge, et al. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet, 2019, ICLR.
[44] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, CVPR.
[45] Yoav Artzi, et al. A Corpus for Reasoning about Natural Language Grounded in Photographs, 2018, ACL.
[46] Matthias Bethge, et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, 2018, ICLR.
[47] Weilin Huang, et al. Deep Metric Learning with Hierarchical Triplet Loss, 2018, ECCV.
[48] Meredith Broussard. Artificial Unintelligence: How Computers Misunderstand the World, 2018.
[49] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[50] Yoav Artzi, et al. A Corpus of Natural Language for Visual Reasoning, 2017, ACL.
[51] Alexander J. Smola, et al. Sampling Matters in Deep Embedding Learning, 2017, ICCV.
[52] Gustavo Carneiro, et al. Smart Mining for Deep Metric Learning, 2017, ICCV.
[53] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, CVPR.
[54] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, IJCV.
[55] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[56] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.
[57] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[58] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.