AnANet: Modeling Association and Alignment for Cross-modal Correlation Classification

The explosive growth of multimodal data creates great demand for cross-modal applications, most of which rest on a strict prior assumption of relevance between modalities. Researchers have therefore studied how to define cross-modal correlation categories and have constructed various classification systems and predictive models. However, those systems focus on the fine-grained relevant types of cross-modal correlation while overlooking a large amount of implicitly relevant data, which is often lumped into the irrelevant types. Worse, none of the previous predictive models reflects the essence of cross-modal correlation, as defined by those systems, at the modeling stage. In this paper, we present a comprehensive analysis of image-text correlation and define a new classification system based on implicit association and explicit alignment. To predict the type of image-text correlation, we propose the Association and Alignment Network (AnANet), which follows our proposed definition: it implicitly represents the global discrepancy and commonality between image and text, and explicitly captures the cross-modal local relevance. Experimental results on our newly constructed image-text correlation dataset demonstrate the effectiveness of our model.
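The two-branch design described in the abstract (implicit global association plus explicit local alignment) can be made concrete with a short sketch. The PyTorch code below is a minimal, hypothetical illustration only: it assumes pre-extracted image-region features and text-token embeddings, and the module names, feature dimensions, and fusion choices are our assumptions, not the authors' implementation.

```python
# A minimal sketch of an AnANet-style correlation classifier, inferred from
# the abstract: an "association" branch models global discrepancy/commonality,
# and an "alignment" branch scores local image-region/word relevance via
# cross-attention. All sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn


class AnANetSketch(nn.Module):
    def __init__(self, dim=512, num_classes=4):
        super().__init__()
        # Project pre-extracted features (e.g. detector region features for
        # the image, token embeddings for the text) into a shared space.
        self.img_proj = nn.Linear(2048, dim)  # hypothetical region-feature size
        self.txt_proj = nn.Linear(768, dim)   # hypothetical token-embedding size
        self.classifier = nn.Linear(4 * dim, num_classes)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, R, 2048) region features; txt_feats: (B, T, 768)
        v = self.img_proj(img_feats)           # (B, R, dim)
        t = self.txt_proj(txt_feats)           # (B, T, dim)
        v_g, t_g = v.mean(1), t.mean(1)        # pooled global representations

        # Association branch: global commonality (elementwise product) and
        # global discrepancy (absolute difference) between the modalities.
        commonality = v_g * t_g
        discrepancy = torch.abs(v_g - t_g)

        # Alignment branch: words attend over image regions; the attended
        # features are pooled into a local-relevance summary.
        attn = torch.softmax(t @ v.transpose(1, 2) / dim ** 0.5, dim=-1)
        aligned = (attn @ v).mean(1)           # (B, dim)

        fused = torch.cat([commonality, discrepancy, aligned, t_g], dim=-1)
        return self.classifier(fused)          # correlation-type logits
```

The number of correlation classes is left as a parameter, since the abstract does not state how many categories the proposed classification system contains.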
