Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might evoke an ironic contrast with the image, so neither caption nor image is a mere transcript of the other. Instead they combine -- via what has been called meaning multiplication -- to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 9.6% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. The gain with multimodality is greatest when the image and caption diverge semiotically. Our dataset offers a new resource for the study of the rich meanings that result from pairing text and image.
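To make the baseline concrete, the sketch below shows one common way to build such a "deep multimodal classifier": encode the image and the caption separately, then fuse the two feature vectors by concatenation before classifying intent. This is a minimal PyTorch sketch under stated assumptions, not the paper's exact model: a frozen ResNet-18 image encoder, pre-computed caption embeddings (e.g., mean-pooled pretrained word vectors), and a placeholder count of eight intent classes are all illustrative choices.

```python
# Minimal late-fusion sketch for image+caption intent classification.
# Assumptions (not the paper's exact model): frozen ResNet-18 image
# encoder, pre-computed caption embeddings, 8 intent classes.
import torch
import torch.nn as nn
import torchvision.models as models

class MultimodalIntentClassifier(nn.Module):
    def __init__(self, text_dim=300, hidden_dim=512, num_intents=8):
        super().__init__()
        # Frozen ResNet-18 backbone; keep everything up to (and including)
        # the final average pool, which yields a 512-d image feature.
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        # Late fusion: concatenate image and caption features, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(512 + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_intents),
        )

    def forward(self, image, caption_emb):
        # image: (B, 3, 224, 224); caption_emb: (B, text_dim), e.g. the
        # mean of pretrained word vectors over the caption tokens.
        img_feat = self.image_encoder(image).flatten(1)    # (B, 512)
        fused = torch.cat([img_feat, caption_emb], dim=1)  # (B, 512 + text_dim)
        return self.classifier(fused)                      # intent logits
```

An image-only ablation of this sketch would simply drop caption_emb from the concatenation; comparing the two configurations on held-out posts is the kind of experiment behind the multimodal gain reported above.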
