Efficient inverse graphics in biological face processing

Vision must not only recognize and localize objects but also perform richer inferences about the underlying causes in the world that give rise to sensory data. How the brain performs these inferences remains unknown: theoretical proposals based on inverting generative models (or “analysis-by-synthesis”) have a long history, but their mechanistic implementations have typically been too slow to support online perception, and their mapping to neural circuits is unclear. Here we present a neurally plausible model for efficiently inverting generative models of images and test it as an account of one high-level visual capacity, the perception of faces. The model is based on a deep neural network that learns to invert a three-dimensional (3D) face graphics program in a single fast feedforward pass. It explains both human behavioral data and multiple levels of neural processing in non-human primates, as well as a classic illusion, the “hollow face” effect. The model fits these data qualitatively better than state-of-the-art computer vision models and suggests an interpretable reverse-engineering account of how images are transformed into percepts in the ventral stream.
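
The idea of learning a fast inverse of a generative model can be illustrated with a toy sketch. It is not the model described above: the "graphics program" here is a hypothetical fixed nonlinear map (`render`), and a linear least-squares readout stands in for the deep network. All names and sizes are assumptions chosen for illustration. The recognition model is fit on (image, latent) pairs sampled from the generative model itself, so inference on a new image is a single matrix product rather than an iterative search.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LATENT, N_PIXELS = 4, 64  # assumed toy sizes, not taken from the paper

# Hypothetical stand-in for a graphics program: a fixed nonlinear map
# from scene latents (e.g. shape, pose) to a flattened "image".
W_render = rng.normal(size=(N_LATENT, N_PIXELS))

def render(latents):
    return np.tanh(0.3 * latents @ W_render)

def sample_training_pairs(n):
    """Draw latents from the prior and render them with a little pixel noise."""
    latents = rng.normal(size=(n, N_LATENT))
    images = render(latents) + 0.01 * rng.normal(size=(n, N_PIXELS))
    return images, latents

def add_bias(images):
    return np.hstack([images, np.ones((len(images), 1))])

def fit_recognition_model(images, latents):
    """Fit a linear image-to-latent map by least squares (stand-in for a deep net)."""
    weights, *_ = np.linalg.lstsq(add_bias(images), latents, rcond=None)
    return weights

def infer(weights, images):
    """Single feedforward pass: one matrix product per image batch."""
    return add_bias(images) @ weights

train_images, train_latents = sample_training_pairs(5000)
weights = fit_recognition_model(train_images, train_latents)

test_images, test_latents = sample_training_pairs(1000)
predicted = infer(weights, test_images)
mse = float(np.mean((predicted - test_latents) ** 2))
baseline = float(np.mean(test_latents ** 2))  # error of always guessing the prior mean
print(f"recognition MSE {mse:.3f} vs. prior-mean baseline {baseline:.3f}")
```

The training signal comes entirely from the generative model's own samples, which is the property that lets such a recognition model be trained without labeled real images.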
