SCAN: Learning Abstract Hierarchical Compositional Visual Concepts

The natural world is infinitely diverse, yet this diversity arises from a relatively small set of coherent properties and rules, such as the laws of physics or chemistry. We conjecture that biological intelligent systems are able to survive within their diverse environments by discovering the regularities that arise from these rules primarily through unsupervised experiences, and representing this knowledge as abstract concepts. Such representations possess useful properties of compositionality and hierarchical organisation, which allow intelligent agents to recombine a finite set of conceptual building blocks into an exponentially large set of useful new concepts. This paper describes SCAN (Symbol-Concept Association Network), a new framework for learning such concepts in the visual domain. We first use the previously published beta-VAE (Higgins et al., 2017a) architecture to learn a disentangled representation of the latent structure of the visual world, before training SCAN to extract abstract concepts grounded in such disentangled visual primitives through fast symbol association. Our approach requires very few pairings between symbols and images and makes no assumptions about the choice of symbol representations. Once trained, SCAN is capable of multimodal bi-directional inference, generating a diverse set of image samples from symbolic descriptions and vice versa. It also allows for traversal and manipulation of the implicit hierarchy of compositional visual concepts through symbolic instructions and learnt logical recombination operations. Such manipulations enable SCAN to invent and learn novel visual concepts through recombination of the few learnt concepts.

[1]  S. S. Culbert,et al.  Cognition and Categorization , 1979 .

[2]  Elizabeth S. Spelke,et al.  Principles of Object Perception , 1990, Cogn. Sci..

[3]  Joshua B. Tenenbaum,et al.  Bayesian Modeling of Human Concept Learning , 1998, NIPS.

[4]  R. Baillargeon Infants' Physical World , 2004 .

[5]  J. Deloache,et al.  Young Infants ' Reasoning about the Physical and Spatial Properties of a Hidden Object , 2004 .

[6]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[7]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[8]  Kevin A. Smith,et al.  Sources of uncertainty in intuitive physics , 2012, CogSci.

[9]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[11]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[12]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[13]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[14]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[15]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Joshua B. Tenenbaum,et al.  Human-level concept learning through probabilistic program induction , 2015, Science.

[17]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[18]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[19]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[20]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[22]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Honglak Lee,et al.  Attribute2Image: Conditional Image Generation from Visual Attributes , 2015, ECCV.

[24]  Murray Shanahan,et al.  Towards Deep Symbolic Reinforcement Learning , 2016, ArXiv.

[25]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[26]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[27]  Joshua B. Tenenbaum,et al.  Building machines that learn and think like people , 2016, Behavioral and Brain Sciences.

[28]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[30]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[31]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[32]  Zhe Gan,et al.  Variational Autoencoder for Deep Learning of Images, Labels and Captions , 2016, NIPS.

[33]  Honglak Lee,et al.  Deep Variational Canonical Correlation Analysis , 2016, ArXiv.

[34]  Masahiro Suzuki,et al.  Joint Multimodal Learning with Deep Generative Models , 2016, ICLR.

[35]  Christopher Burgess,et al.  DARLA: Improving Zero-Shot Transfer in Reinforcement Learning , 2017, ICML.

[36]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[37]  Tom Schaul,et al.  Reinforcement Learning with Unsupervised Auxiliary Tasks , 2016, ICLR.

[38]  Ambedkar Dukkipati,et al.  Variational methods for conditional multimodal deep learning , 2016, 2017 International Joint Conference on Neural Networks (IJCNN).

[39]  Kevin Murphy,et al.  Generative Models of Visually Grounded Imagination , 2017, ICLR.