DeepIU: An Architecture for Image Understanding

Image understanding is fundamental to systems that need to extract content and infer concepts from images. In this paper, we develop an architecture for understanding images, through which a system can recognize the content and the underlying concepts of an image, and reason and answer questions about both, using a visual module, a reasoning module, and a commonsense knowledge base. In this architecture, visual data is combined with background knowledge and passed iteratively through the visual and reasoning modules to answer questions about an image or to generate a textual description of it. We first motivate such a Deep Image Understanding architecture and then describe the components it should include. We also introduce our own preliminary implementation of this architecture and empirically show how this more generic implementation compares with a recent end-to-end neural approach on specific applications. We address the knowledge-representation challenge in such an architecture by representing an image as a directed labeled graph, called a Scene Description Graph. Our implementation uses generic visual recognition techniques and commonsense reasoning to extract such graphs from images. Our experiments show that the extracted graphs capture the syntactic and semantic content of an image with reasonable accuracy.
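To make the idea of a directed labeled graph concrete, the following is a minimal sketch of how a scene might be stored and queried. It is an illustration only, not the paper's implementation: the class name `SceneDescriptionGraph`, the edge labels (`agent`, `recipient`, `location`), and the example scene are all assumptions introduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SceneDescriptionGraph:
    """Directed labeled graph: nodes are strings, every edge carries a label."""

    def __init__(self) -> None:
        # Maps a source node to its outgoing (label, target) pairs.
        self._out: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_edge(self, source: str, label: str, target: str) -> None:
        """Add a directed edge source --label--> target."""
        self._out[source].append((label, target))

    def targets(self, source: str, label: str) -> List[str]:
        """Return all nodes reached from `source` via edges labeled `label`."""
        return [t for (l, t) in self._out.get(source, []) if l == label]


def build_example() -> SceneDescriptionGraph:
    """Hypothetical scene: a man riding a horse in a field."""
    graph = SceneDescriptionGraph()
    graph.add_edge("ride", "agent", "man")        # assumed label names
    graph.add_edge("ride", "recipient", "horse")
    graph.add_edge("ride", "location", "field")
    return graph


def answer_who(graph: SceneDescriptionGraph, event: str) -> List[str]:
    """Answer a 'who is doing <event>?' question by following 'agent' edges."""
    return graph.targets(event, "agent")
```

With this layout, a question about an image reduces to following labeled edges from an event node, and a textual description could in principle be generated by walking the same edges. How the actual system populates and reasons over its graphs is described in the paper, not here.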
