Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment

Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., ‘mug in grass’) with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from ‘mug’ to ‘grass’ (capturing the semantic relation ‘in’) to match the directed visual attention from the mug to the grass. Tokens and their corresponding objects are softly identified using the cross-modal attention. We prove that this notion of soft relation alignment is equivalent to enforcing congruence between vision and language attention matrices under a ‘change of basis’ provided by the cross-modal attention matrix. Intuitively, our approach projects visual attention into the language attention space to calculate its divergence from the actual language attention, and vice versa. We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
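To make the congruence idea concrete, below is a minimal PyTorch sketch of how such a regularizer could be computed for a single attention head. The function name cacr_loss, the use of plain matrix products for the cross-modal 'change of basis', and the choice of a row-wise KL divergence are illustrative assumptions for this sketch; the paper's exact normalization, divergence, and choice of heads and layers may differ.

    import torch

    def cacr_loss(attn_lang, attn_vis, attn_cross, eps=1e-8):
        """Sketch of a cross-modal attention congruence regularizer.

        attn_lang:  (T, T) language self-attention (rows sum to 1)
        attn_vis:   (R, R) visual self-attention   (rows sum to 1)
        attn_cross: (T, R) cross-modal attention from text tokens to image regions

        The cross-modal matrix acts as a soft 'change of basis': visual attention
        is projected into the language token space (and vice versa), and its
        divergence from the actual self-attention is penalized.
        """
        # Project visual attention into language space: (T,R) @ (R,R) @ (R,T) -> (T,T)
        vis_in_lang = attn_cross @ attn_vis @ attn_cross.transpose(-1, -2)
        # Project language attention into visual space: (R,T) @ (T,T) @ (T,R) -> (R,R)
        lang_in_vis = attn_cross.transpose(-1, -2) @ attn_lang @ attn_cross

        def row_kl(p, q):
            # Row-normalize both matrices and average KL(p || q) over rows.
            p = p / (p.sum(-1, keepdim=True) + eps)
            q = q / (q.sum(-1, keepdim=True) + eps)
            return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(-1).mean()

        return row_kl(attn_lang, vis_in_lang) + row_kl(attn_vis, lang_in_vis)

    # Usage with random attention maps (T=12 text tokens, R=36 image regions).
    T, R = 12, 36
    A_L = torch.softmax(torch.randn(T, T), dim=-1)
    A_V = torch.softmax(torch.randn(R, R), dim=-1)
    C = torch.softmax(torch.randn(T, R), dim=-1)
    print(cacr_loss(A_L, A_V, C))

In practice such a term would presumably be averaged over attention heads and layers and added to the model's pre-training or fine-tuning objective alongside the standard image-text losses.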
