Pixels to Voxels: Modeling Visual Representation in the Human Brain

The human brain is adept at solving difficult high-level visual processing problems such as image interpretation and object recognition in natural scenes. Over the past few years, neuroscientists have made remarkable progress in understanding how the human brain represents categories of objects and actions in natural scenes. However, all current models of high-level human vision operate on hand-annotated images in which the objects and actions have been assigned semantic tags by a human operator. No current model can account for high-level visual function directly in terms of low-level visual input (i.e., pixels). To overcome this fundamental limitation, we sought to develop a new class of models that can predict human brain activity directly from pixels. We explored two classes of models drawn from computer vision and machine learning: the first was based on Fisher Vectors (FV) and the second on Convolutional Neural Networks (ConvNets). We find that both classes of models accurately predict brain activity in high-level visual areas, directly from pixels and without the need for any semantic tags or hand annotation of images. This is the first time that such a mapping has been obtained. The fitted models provide a new platform for exploring the functional principles of human vision, and they show that modern methods of computer vision and machine learning provide important tools for characterizing brain function.
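As an illustration of what predicting voxel activity from image-derived features can look like, the following is a minimal sketch and not the authors' method. The abstract does not say how features are mapped to brain activity, so the sketch assumes a regularized linear (ridge) regression fitted separately for each voxel. Random vectors stand in for the FV or ConvNet features, the voxel responses are simulated, and all function and variable names are hypothetical.

```python
import numpy as np


def fit_ridge(features, responses, alpha=1.0):
    """Closed-form ridge regression; returns one weight column per voxel.

    Assumption: a linear map from image features to voxel responses.
    """
    n_features = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ responses)


def prediction_accuracy(features, responses, weights):
    """Pearson correlation between predicted and measured responses, per voxel."""
    predicted = features @ weights
    pred_centered = predicted - predicted.mean(axis=0)
    resp_centered = responses - responses.mean(axis=0)
    numerator = (pred_centered * resp_centered).sum(axis=0)
    denominator = (np.linalg.norm(pred_centered, axis=0)
                   * np.linalg.norm(resp_centered, axis=0))
    return numerator / denominator


# Synthetic stand-ins (assumption): in practice each row of `*_features`
# would be a feature vector computed from one image's pixels, and each
# column of `*_responses` would be one voxel's measured response.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 50, 20, 5
true_weights = rng.normal(size=(n_features, n_voxels))

train_features = rng.normal(size=(n_train, n_features))
test_features = rng.normal(size=(n_test, n_features))
train_responses = (train_features @ true_weights
                   + 0.5 * rng.normal(size=(n_train, n_voxels)))
test_responses = (test_features @ true_weights
                  + 0.5 * rng.normal(size=(n_test, n_voxels)))

# Fit on training images, then score predictions on held-out images.
weights = fit_ridge(train_features, train_responses, alpha=1.0)
accuracy = prediction_accuracy(test_features, test_responses, weights)
```

Scoring on held-out images, rather than on the images used for fitting, is what lets a model of this kind be described as predicting brain activity instead of merely fitting it.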
