Applying artificial vision models to human scene understanding

How do we understand the complex patterns of neural responses that underlie scene understanding? Studies of the network of brain regions held to be scene-selective, namely the parahippocampal/lingual region (PPA), the retrosplenial complex (RSC), and the occipital place area (also termed the transverse occipital sulcus, TOS), have typically focused on single visual dimensions (e.g., size) rather than the high-dimensional feature space in which scenes are likely to be neurally represented. Here we use well-specified artificial vision systems to develop a richer account of how scenes are encoded in this functional network. We correlated similarity matrices computed within three different scene spaces, derived from: (1) BOLD activity in scene-selective brain regions; (2) behaviorally measured judgments of visually perceived scene similarity; and (3) several different computer vision models. These correlations revealed that: (1) models that relied on mid- and high-level scene attributes showed the highest correlations with the patterns of neural activity within the scene-selective network; (2) NEIL and SUN, the models that best accounted for the patterns obtained from PPA and TOS, were different from the GIST model that best accounted for the pattern obtained from RSC; and (3) the best-performing models outperformed behaviorally measured judgments of scene similarity in accounting for neural data. One computer vision method, NEIL (“Never-Ending-Image-Learner”), which incorporates visual features learned as statistical regularities across web-scale numbers of scenes, showed significant correlations with neural activity in all three scene-selective regions and was one of the two models best able to account for variance in the PPA and TOS. We suggest that these results are a promising first step toward more fine-grained models of neural scene understanding, including a clearer picture of the division of labor among the components of the functional scene-selective brain network.
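
The comparison of similarity matrices described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state which similarity or correlation measure was used, so Pearson correlation is assumed at both stages, the data are synthetic stand-ins for voxel patterns and model features, and all function and variable names are hypothetical.

```python
import numpy as np


def similarity_matrix(responses):
    """Pairwise Pearson correlation between rows (one row per scene)."""
    return np.corrcoef(responses)


def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle entries as a flat vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def compare_spaces(space_a, space_b):
    """Correlate the unique entries of two scene-by-scene similarity matrices."""
    a = upper_triangle(similarity_matrix(space_a))
    b = upper_triangle(similarity_matrix(space_b))
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic data (assumption): 20 scenes share a 5-dimensional latent structure
# that is projected into a "voxel" space and a "model feature" space.
rng = np.random.default_rng(0)
n_scenes = 20
latent = rng.normal(size=(n_scenes, 5))
voxel_patterns = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(n_scenes, 50))
model_features = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(n_scenes, 30))
unrelated_features = rng.normal(size=(n_scenes, 30))  # no shared structure

related_score = compare_spaces(voxel_patterns, model_features)
unrelated_score = compare_spaces(voxel_patterns, unrelated_features)
print(f"related model:   {related_score:.3f}")
print(f"unrelated model: {unrelated_score:.3f}")
```

Only the upper triangle of each matrix enters the comparison because similarity matrices are symmetric with a constant diagonal, so the remaining entries carry no additional information. A rank-based correlation could be substituted in the final step if the relationship between the two spaces is not expected to be linear.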
