Intentonomy: a Dataset and Study towards Human Intent Understanding

An image is worth a thousand words, conveying information that goes beyond the mere visual content therein. In this paper, we study the intent behind social media images with the aim of analyzing how visual information can facilitate recognition of human intent. Towards this goal, we introduce an intent dataset, Intentonomy, comprising 14K images covering a wide range of everyday scenes. These images are manually annotated with 28 intent categories derived from a social psychology taxonomy. We then systematically study whether, and to what extent, commonly used visual information, i.e., object and context, contributes to human motive understanding. Based on our findings, we conduct a further study to quantify the effect of attending to object and context classes, as well as textual information in the form of hashtags, when training an intent classifier. Our results quantitatively and qualitatively shed light on how visual and textual information can produce observable effects when predicting intent.
