When and why vision-language models behave like bags-of-words, and what to do about it?

Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order and Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show that state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and exhibit a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet training on these datasets has not been enough to instill compositional understanding, and evaluating on them has failed to surface this deficiency. To understand why these limitations emerge and go undetected by standard tests, we examine the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this explains why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves performance on tasks that require an understanding of order and compositionality.
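To make the proposed fix concrete, the sketch below illustrates one way composition-aware hard negatives can be folded into a CLIP-style contrastive objective: each caption is perturbed (here, by shuffling its word order) and the perturbed caption is appended as an extra negative candidate for its own image. This is a minimal illustration under assumed interfaces, not the paper's exact recipe; the batch format, encoder outputs, `shuffle_words` helper, and temperature value are all placeholders chosen for clarity.

```python
# Minimal sketch of composition-aware hard negative mining for
# CLIP-style contrastive training. Encoders, batch format, and the
# word-shuffling perturbation are illustrative assumptions.
import random
import torch
import torch.nn.functional as F


def shuffle_words(caption: str) -> str:
    """Create an order-perturbed hard negative by shuffling the caption's words."""
    words = caption.split()
    random.shuffle(words)
    return " ".join(words)


def contrastive_loss_with_hard_negatives(image_emb: torch.Tensor,
                                         text_emb: torch.Tensor,
                                         neg_text_emb: torch.Tensor,
                                         temperature: float = 0.07) -> torch.Tensor:
    """
    image_emb:    (B, D) image embeddings
    text_emb:     (B, D) embeddings of the true captions
    neg_text_emb: (B, D) embeddings of the perturbed (hard negative) captions

    Standard InfoNCE over the batch, with each image's perturbed caption
    appended as one extra negative column.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    neg_text_emb = F.normalize(neg_text_emb, dim=-1)

    # (B, B) image-to-text similarities across the batch
    logits = image_emb @ text_emb.t() / temperature
    # (B, 1) similarity of each image to its own perturbed caption
    hard_neg_logits = (image_emb * neg_text_emb).sum(dim=-1, keepdim=True) / temperature

    # Append the hard negative as an extra candidate per image;
    # the correct caption for image i stays at column i.
    logits = torch.cat([logits, hard_neg_logits], dim=1)
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return F.cross_entropy(logits, targets)
```

Because the perturbed caption contains the same words as the true one, the model can no longer separate the pair with bag-of-words features alone and is pushed to use order and composition; richer perturbations (e.g., swapping attributes or relation arguments) follow the same pattern.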
