Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Vision and Language (VL) models offer an effective method for aligning the representation spaces of images and text, enabling numerous applications such as cross-modal retrieval, visual question answering, and captioning. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called `object bias': their representations behave as `bags of nouns', mostly ignoring or down-weighting the attributes, relations, and states of objects described in texts or appearing in images. Although several promising attempts at fixing these `compositional reasoning' issues have been proposed in the recent literature, the problem is still far from solved. In this paper, we uncover two factors limiting the compositional reasoning performance of VL models. Both are properties of the paired VL dataset used for pre-training and fine-tuning the VL model: (i) the caption quality, or in other words the `image-alignment', of the texts; and (ii) the `density' of the captions, in the sense of mentioning all the details appearing in the image. We propose a fine-tuning approach that automatically addresses these factors using a standard VL dataset (CC3M). Applied to CLIP, our approach yields a significant increase in compositional reasoning performance: up to $\sim27\%$ over the base model, up to $\sim20\%$ over the strongest baseline, and by $6.7\%$ on average.
