An Explicit Local and Global Representation Disentanglement Framework with Applications in Deep Clustering and Unsupervised Object Detection

Visual data can be understood at different levels of granularity, where global features correspond to semantic-level information and local features correspond to texture patterns. In this work, we propose a framework, called SPLIT, which disentangles local and global information into two separate sets of latent variables within the variational autoencoder (VAE) framework. Our framework adds a generative assumption to the VAE by requiring a subset of the latent variables to generate an auxiliary set of observable data. This additional generative assumption primes those latent variables toward local information and encourages the remaining latent variables to represent global information. We examine three flavours of VAE with different generative assumptions. We show that the framework effectively disentangles local and global information within these models, leading to improved representations and better performance on clustering and unsupervised object detection benchmarks. Finally, we establish connections between SPLIT and recent research in cognitive neuroscience regarding disentanglement in human visual perception. The code for our experiments is available at this https URL.
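To make the generative assumption concrete, below is a minimal PyTorch sketch of the idea as described in the abstract, not the authors' released implementation (see the code link above for that). The latent code is partitioned into a global part z_g and a local part z_l; the full image is decoded from both, while an auxiliary observation, here assumed to be a cropped image patch, must be decoded from z_l alone. The class name SplitVAE, the MLP architectures, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitVAE(nn.Module):
    """Illustrative SPLIT-style VAE (not the paper's code): the latent code is
    split into a global part z_g and a local part z_l. The full image is
    decoded from the concatenation (z_g, z_l), while the auxiliary observation
    x_aux (assumed here to be a flattened image patch) is decoded from z_l only."""
    def __init__(self, x_dim=784, aux_dim=196, z_g_dim=16, z_l_dim=16, h=256):
        super().__init__()
        self.z_g_dim, self.z_l_dim = z_g_dim, z_l_dim
        self.encoder = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(),
                                     nn.Linear(h, 2 * (z_g_dim + z_l_dim)))  # -> (mu, logvar)
        self.decoder_full = nn.Sequential(nn.Linear(z_g_dim + z_l_dim, h), nn.ReLU(),
                                          nn.Linear(h, x_dim))    # p(x | z_g, z_l)
        self.decoder_aux = nn.Sequential(nn.Linear(z_l_dim, h), nn.ReLU(),
                                         nn.Linear(h, aux_dim))   # p(x_aux | z_l)

    def loss(self, x, x_aux):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        z_g, z_l = z.split([self.z_g_dim, self.z_l_dim], dim=-1)  # z is the concatenation of both parts
        recon_full = F.binary_cross_entropy_with_logits(self.decoder_full(z), x, reduction='sum')
        recon_aux = F.binary_cross_entropy_with_logits(self.decoder_aux(z_l), x_aux, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        # The auxiliary reconstruction term ties z_l to patch-level (local) statistics,
        # leaving z_g free to capture the remaining global, semantic-level content.
        return recon_full + recon_aux + kl

In a training loop one would pass a batch of flattened images together with flattened patches cropped from them, e.g. loss = model.loss(x.view(-1, 784), patch.view(-1, 196)), and minimize the returned negative ELBO. The same split-and-auxiliary-decoder construction can in principle be applied to other VAE variants, which is how we read the abstract's mention of three flavours of VAE.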
