Shared spatiotemporal category representations in biological and artificial deep neural networks

Visual scene category representations emerge very rapidly, yet the computational transformations that enable such invariant categorization remain elusive. Deep convolutional neural networks (CNNs) perform visual categorization at near human-level accuracy using a feedforward architecture, providing neuroscientists with the opportunity to assess one successful series of representational transformations that enable categorization in silico. The goal of the current study is to assess the extent to which the sequential scene category representations built by a CNN map onto those built in the human brain, as assessed by high-density, time-resolved event-related potentials (ERPs). We found correspondence both over time and across the scalp: early (0–200 ms) ERP activity was best explained by early CNN layers at all electrodes. Although later activity at most electrode sites continued to correspond to early CNN layers, activity at right occipito-temporal electrodes was best explained by the later, fully connected layers of the CNN around 225 ms post-stimulus, with a similar pattern at frontal electrodes. Taken together, these results suggest that scene category representations develop through a dynamic interplay between early activity over occipital electrodes and later activity over temporal and frontal electrodes.
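To make the layer-by-time comparison concrete, the sketch below shows one plausible way to relate CNN layer representations to time-resolved ERP patterns using a representational similarity analysis (RSA)-style approach: build a representational dissimilarity matrix (RDM) for each CNN layer and for the ERP pattern at each time point, then correlate them. This is a minimal illustration under assumed inputs; the function and variable names (`layer_time_correspondence`, `cnn_activations`, `erp_data`) are hypothetical and do not reproduce the authors' actual pipeline.

```python
# Hedged sketch: layer-wise RSA between CNN activations and time-resolved ERPs.
# All names and data shapes here are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(patterns):
    """Condensed representational dissimilarity matrix from an
    (n_conditions, n_features) array, using correlation distance."""
    return pdist(patterns, metric="correlation")

def layer_time_correspondence(cnn_activations, erp_data):
    """Correlate each CNN layer's RDM with the ERP RDM at each time point.

    cnn_activations: list of (n_conditions, n_units) arrays, one per layer.
    erp_data: (n_conditions, n_electrodes, n_times) array of ERP amplitudes.
    Returns an (n_layers, n_times) array of Spearman correlations.
    """
    layer_rdms = [rdm(act) for act in cnn_activations]
    n_times = erp_data.shape[2]
    result = np.zeros((len(layer_rdms), n_times))
    for t in range(n_times):
        erp_rdm = rdm(erp_data[:, :, t])  # condition-by-electrode pattern at time t
        for i, layer_rdm in enumerate(layer_rdms):
            result[i, t] = spearmanr(layer_rdm, erp_rdm).correlation
    return result

# Toy usage: 6 scene categories, a two-layer "CNN", 64 electrodes, 100 time points.
rng = np.random.default_rng(0)
acts = [rng.normal(size=(6, 4096)), rng.normal(size=(6, 1000))]
erps = rng.normal(size=(6, 64, 100))
corr = layer_time_correspondence(acts, erps)
best_layer_per_time = corr.argmax(axis=0)  # which layer best explains each time point
```

With real data, inspecting which layer yields the highest correlation at each time point and electrode cluster is one way to ask whether early ERP activity aligns with early convolutional layers and later occipito-temporal and frontal activity aligns with the fully connected layers, as the results above describe.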
