Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning

Cross-lingual image captioning is a multimedia analysis task that faces both cross-lingual and cross-modal challenges. The crucial issue is to model the global and local matching between an image and different languages. Existing Transformer-based cross-modal embedding methods overlook the local matching between image regions and monolingual words, let alone across multiple, markedly different languages. Given the heterogeneous nature of this cross-modal, cross-lingual setting, we employ a heterogeneous network to establish cross-domain relationships and local correspondences between the image and each language. In this paper, we propose an Embedded Heterogeneous Attention Transformer (EHAT) that builds cross-domain reasoning paths for cross-lingual image captioning and integrates them into the Transformer architecture. EHAT consists of Masked Heterogeneous Cross-attention (MHCA), a Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA). HARN, the core module, models and infers cross-domain relationships anchored on visual bounding-box features, connecting the word features of the two languages and learning heterogeneous attention maps. MHCA and HCA perform cross-domain integration in the encoder through specialized heterogeneous attention, enabling a single model to generate captions in both languages. We evaluate on the MSCOCO dataset, generating English and Chinese captions: two widely used languages from clearly distinct language families. Our experiments show that our method even outperforms advanced monolingual methods.
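To make the vision-anchored heterogeneous attention concrete, the following is a minimal PyTorch sketch of the mechanism the abstract describes: region features act as the shared anchor, attend to each language's word features, and route the fused cross-domain context back to both language streams. This is an illustrative assumption based only on the abstract, not the authors' released implementation; the class name HARNBlock, the module layout, and all dimensions are hypothetical.

import torch
import torch.nn as nn

class HARNBlock(nn.Module):
    """Sketch of vision-anchored heterogeneous attention (assumed design):
    bounding-box region features anchor two language streams."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Cross-attention from the visual anchor to each language (MHCA-like role).
        self.vis_to_en = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_to_zh = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Co-attention returning the fused anchor context to each language (HCA-like role).
        self.en_from_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.zh_from_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions, en_words, zh_words):
        # regions:  (B, R,  d) bounding-box region features (the anchor)
        # en_words: (B, Le, d) English word features
        # zh_words: (B, Lz, d) Chinese word features
        anchor_en, _ = self.vis_to_en(regions, en_words, en_words)
        anchor_zh, _ = self.vis_to_zh(regions, zh_words, zh_words)
        # Fuse both language views through the shared visual anchor.
        anchor = self.norm(regions + anchor_en + anchor_zh)
        # Route the fused cross-domain context back to each language stream.
        en_out, _ = self.en_from_vis(en_words, anchor, anchor)
        zh_out, _ = self.zh_from_vis(zh_words, anchor, anchor)
        return anchor, self.norm(en_words + en_out), self.norm(zh_words + zh_out)

# Illustrative usage with assumed shapes (36 regions, 20/24 words):
# block = HARNBlock()
# anchor, en, zh = block(torch.randn(2, 36, 512),
#                        torch.randn(2, 20, 512),
#                        torch.randn(2, 24, 512))

Because both language streams read from the same fused anchor, a single model can decode captions in either language, which matches the single-model, two-language generation the abstract claims.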
