Adopting Abstract Images for Semantic Scene Understanding

Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of an image depends on the presence of objects, their attributes, and their relations to other objects. Precisely characterizing this dependence, however, requires extracting complex visual information from an image, which is in general a difficult and as yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images offer several advantages over real images. They allow the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute, and relation detectors, as well as the tedious hand-labeling of real images. Importantly, abstract images also make it possible to generate sets of semantically similar scenes; finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract images with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features, and methods for measuring semantic similarity. Finally, we study the relation between the saliency and memorability of objects and their semantic importance.