Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might evoke an ironic contrast with the image, so neither caption nor image is a mere transcript of the other. Instead they combine -- via what has been called meaning multiplication -- to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 9.6% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. The gain with multimodality is greatest when the image and caption diverge semiotically. Our dataset offers a new resource for the study of the rich meanings that result from pairing text and image.
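To make the baseline concrete, the sketch below shows one common way to build such a "deep multimodal classifier": encode the image and the caption separately, then fuse the two feature vectors by concatenation before classifying intent. This is a minimal PyTorch sketch under stated assumptions, not the paper's exact model: a frozen ResNet-18 image encoder, pre-computed caption embeddings (e.g., mean-pooled pretrained word vectors), and a placeholder count of eight intent classes are all illustrative choices.

```python
# Minimal late-fusion sketch for image+caption intent classification.
# Assumptions (not the paper's exact model): frozen ResNet-18 image
# encoder, pre-computed caption embeddings, 8 intent classes.
import torch
import torch.nn as nn
import torchvision.models as models

class MultimodalIntentClassifier(nn.Module):
    def __init__(self, text_dim=300, hidden_dim=512, num_intents=8):
        super().__init__()
        # Frozen ResNet-18 backbone; keep everything up to (and including)
        # the final average pool, which yields a 512-d image feature.
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        # Late fusion: concatenate image and caption features, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(512 + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_intents),
        )

    def forward(self, image, caption_emb):
        # image: (B, 3, 224, 224); caption_emb: (B, text_dim), e.g. the
        # mean of pretrained word vectors over the caption tokens.
        img_feat = self.image_encoder(image).flatten(1)    # (B, 512)
        fused = torch.cat([img_feat, caption_emb], dim=1)  # (B, 512 + text_dim)
        return self.classifier(fused)                      # intent logits
```

An image-only ablation of this sketch would simply drop caption_emb from the concatenation; comparing the two configurations on held-out posts is the kind of experiment behind the multimodal gain reported above.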
