Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
[1] Paul Pu Liang, et al. Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions, 2022, arXiv.
[2] Byoung-Tak Zhang, et al. Cross-Modal Alignment Learning of Vision-Language Conceptual Systems, 2022, arXiv.
[3] J. Leskovec, et al. VQA-GNN: Reasoning with Multimodal Semantic Graph for Visual Question Answering, 2022, arXiv.
[4] Tristan Thrush, et al. Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality, 2022, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Nan Duan, et al. VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, 2022, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Y. Fu, et al. Single-Stream Multi-Level Alignment for Vision-Language Pretraining, 2022, ECCV.
[7] Miryam de Lhoneux, et al. Finding Structural Knowledge in Multimodal-BERT, 2022, ACL.
[8] Shih-Fu Chang, et al. SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, 2021, AAAI.
[9] T. Tuytelaars, et al. Unsupervised Vision-Language Grammar Induction with Shared Structure Modeling, 2022, ICLR.
[10] Zhongyu Wei, et al. MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, 2022, arXiv.
[11] Feifei Zhang, et al. Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning, 2022, IEEE Transactions on Multimedia.
[12] Jianfei Cai, et al. Auto-Parsing Network for Image Captioning and Visual Question Answering, 2021, IEEE/CVF International Conference on Computer Vision (ICCV).
[13] Zhou Yu, et al. ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration, 2021, ACM Multimedia.
[14] Jianlong Fu, et al. Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training, 2021, NeurIPS.
[15] Song-Chun Zhu, et al. VLGrammar: Grounded Grammar Induction of Vision and Language, 2021, IEEE/CVF International Conference on Computer Vision (ICCV).
[16] Yunhai Tong, et al. Syntax-BERT: Improving Pre-trained Transformers with Syntax Trees, 2021, EACL.
[17] Jianfei Cai, et al. Causal Attention for Vision-Language Tasks, 2021, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Qingyu Zhou, et al. Improving BERT with Syntax-aware Local Attention, 2020, Findings of ACL.
[19] Ryan Cotterell, et al. Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs, 2020, Transactions of the Association for Computational Linguistics.
[20] Hao Tian, et al. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, 2020, AAAI.
[21] Hanqing Lu, et al. Aligning Linguistic Words and Visual Semantic Units for Image Captioning, 2019, ACM Multimedia.
[22] Rudolf Rosa, et al. From Balustrades to Pierre Vinken: Looking for Syntax in Transformer Self-Attentions, 2019, BlackboxNLP@ACL.
[23] Wei-Ying Ma, et al. Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations, 2019, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Yuxin Peng, et al. Hierarchical Vision-Language Alignment for Video Captioning, 2018, MMM.
[25] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[26] Dan Klein, et al. Neural Module Networks, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.