Characterizing the temporal dynamics of object recognition by deep neural networks: role of depth

Convolutional neural networks (CNNs) have recently emerged as promising models of human vision based on their ability to predict hemodynamic brain responses to visual stimuli measured with functional magnetic resonance imaging (fMRI). However, the degree to which CNNs can predict the temporal dynamics of visual object recognition, as reflected in neural measures with millisecond precision, is less well understood. Additionally, while deeper CNNs with more layers perform better on automated object recognition, it is unclear whether this also results in better correlation with brain responses. Here, we examined 1) to what extent CNN layers predict visual evoked responses in the human brain over time and 2) whether deeper CNNs better model brain responses. Specifically, we tested how well CNN architectures with 7 (CNN-7) and 15 (CNN-15) layers predicted electroencephalography (EEG) responses to several thousand natural images. Our results show that both CNN architectures correspond to EEG responses in a hierarchical spatio-temporal manner, with lower layers explaining responses early in time at electrodes overlying early visual cortex, and higher layers explaining responses later in time at electrodes overlying lateral-occipital cortex. While the variance in neural responses explained by individual layers did not differ between CNN-7 and CNN-15, combining the representations across layers resulted in improved performance of CNN-15 compared to CNN-7, but only from 150 ms after stimulus onset onward. This suggests that CNN representations reflect both early (feed-forward) and late (feedback) stages of visual processing. Overall, our results show that the depth of CNNs indeed plays a role in explaining time-resolved EEG responses.
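The kind of analysis described above, in which layer representations are scored by how much EEG variance they explain at each time point, can be illustrated with a minimal sketch. The abstract does not state the fitting procedure, so everything here is an assumption made for illustration: ordinary least squares as the model, in-sample R² as the measure, the function name `explained_variance`, and the synthetic "low layer" and "high layer" features that stand in for reduced CNN activations. The EEG data are simulated so that low-layer features drive early time points and high-layer features drive late ones.

```python
import numpy as np


def explained_variance(features, eeg):
    """In-sample R^2 of an ordinary least-squares fit, one value per time point.

    features: (n_images, n_features) array, e.g. reduced activations of a layer
              (hypothetical stand-in, not the paper's actual features)
    eeg:      (n_images, n_times) array, single-electrode amplitude per image
    """
    # Add an intercept column so the fit is not forced through the origin.
    X = np.column_stack([np.ones(len(features)), features])
    # Solve all time points at once: each column of eeg is a separate target.
    beta, *_ = np.linalg.lstsq(X, eeg, rcond=None)
    residual = eeg - X @ beta
    ss_res = (residual ** 2).sum(axis=0)
    ss_tot = ((eeg - eeg.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumed sizes, chosen only to keep the example small).
rng = np.random.default_rng(0)
n_images, n_times, n_feat = 200, 50, 3
half = n_times // 2

low = rng.standard_normal((n_images, n_feat))   # stand-in for a lower layer
high = rng.standard_normal((n_images, n_feat))  # stand-in for a higher layer

# Simulated EEG: noise everywhere, low-layer signal early, high-layer signal late.
eeg = rng.standard_normal((n_images, n_times))
eeg[:, :half] += low @ rng.standard_normal((n_feat, half))
eeg[:, half:] += high @ rng.standard_normal((n_feat, n_times - half))

r2_low = explained_variance(low, eeg)
r2_high = explained_variance(high, eeg)
r2_combined = explained_variance(np.hstack([low, high]), eeg)

print("low layer,  early vs late:", r2_low[:half].mean(), r2_low[half:].mean())
print("high layer, early vs late:", r2_high[:half].mean(), r2_high[half:].mean())
print("combined,   early vs late:", r2_combined[:half].mean(), r2_combined[half:].mean())
```

Note that in-sample R² can only increase when predictors are added, so a fair comparison of combined-layer models of different sizes would need cross-validation or a model selection criterion; this sketch omits that step.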
