Using Object Detection, NLP, and Knowledge Bases to Understand the Message of Images

With the increasing amount of multimodal content from social media posts and news articles, there has been an intensified effort towards conceptual labeling and multimodal (topic) modeling of images and their affiliated texts. Nonetheless, the problem of identifying and automatically naming the core abstract message (gist) behind images has received less attention. This problem is especially relevant for the semantic indexing and subsequent retrieval of images. In this paper, we propose a solution that makes use of external knowledge bases such as Wikipedia and DBpedia. Its aim is to leverage complex semantic associations between the image objects and the textual caption in order to uncover the intended gist. The results of our evaluation demonstrate the ability of our approach to detect gist, with a best mean average precision (MAP) score of 0.74 when assessed against human annotations. Furthermore, we compare signals obtained from an automatic image tagging and caption generation API with manually set image and caption signals. We show and discuss the difficulty of finding the correct gist, especially for abstract, non-depictable gists, as well as the impact of different types of signals on gist detection quality.
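
For readers unfamiliar with the MAP figure quoted above, the following is a minimal sketch of how mean average precision is commonly computed for ranked candidate lists. It is not the authors' evaluation code: the function names, the example concept labels, and the assumption that each ranking contains no duplicate candidates are all illustrative.

```python
from typing import Hashable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average precision of one ranked list against a set of relevant items.

    Assumes `ranked` contains no duplicates (an assumption of this sketch).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item occurs.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Sequence[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Mean of the average precision over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical data: two image-caption pairs with ranked gist candidates
# and the gists that human annotators marked as relevant.
runs = [
    (["Climate change", "Polar bear", "Arctic", "Global warming"],
     {"Climate change", "Global warming"}),
    (["Sea ice", "Habitat loss"], {"Habitat loss"}),
]
map_score = mean_average_precision(runs)
```

In this toy example the first ranking has an average precision of (1/1 + 2/4) / 2 = 0.75 and the second 1/2 = 0.5, giving a MAP of 0.625.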
