Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions

Common-sense physical reasoning is an essential ingredient for any intelligent agent operating in the real-world. For example, it can be used to simulate the environment, or to infer the state of parts of the world that are currently unobserved. In order to match real-world conditions this causal knowledge must be learned without access to supervised data. To address this problem we present a novel method that learns to discover objects and model their physical interactions from raw visual images in a purely \emph{unsupervised} fashion. It incorporates prior knowledge about the compositional nature of human perception to factor interactions between object-pairs and learn efficiently. On videos of bouncing balls we show the superior modelling capabilities of our method compared to other unsupervised neural approaches that do not incorporate such prior knowledge. We demonstrate its ability to handle occlusion and show that it can extrapolate learned knowledge to scenes with different numbers of objects.

[1]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[2]  W. Epstein,et al.  The status of the minimum principle in the theoretical analysis of visual perception. , 1985, Psychological bulletin.

[3]  L. Hubert,et al.  Comparing partitions , 1985 .

[4]  PAUL J. WERBOS,et al.  Generalization of backpropagation with application to a recurrent gas market model , 1988, Neural Networks.

[5]  Elizabeth S. Spelke,et al.  Principles of Object Perception , 1990, Cogn. Sci..

[6]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[7]  James L. McClelland,et al.  Rethinking infant knowledge: toward an adaptive process account of successes and failures in object permanence tasks. , 1997, Psychological review.

[8]  Geoffrey E. Hinton,et al.  NeuroAnimator: fast neural network emulation and control of physics-based models , 1998, SIGGRAPH.

[9]  Jürgen Schmidhuber,et al.  Learning to forget: continual prediction with LSTM , 1999 .

[10]  Aniket Kittur,et al.  A Solution to the Binding Problem for Compositional Connectionism , 2004, AAAI Technical Report.

[11]  Geoffrey E. Hinton,et al.  The Recurrent Temporal Restricted Boltzmann Machine , 2008, NIPS.

[12]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[13]  Jessica B. Hamrick,et al.  Simulation as an engine of physical scene understanding , 2013, Proceedings of the National Academy of Sciences.

[14]  E. Spelke Where Perceiving Ends and Thinking Begins: The Apprehension of Objects in Infancy , 2013 .

[15]  Roland Memisevic,et al.  Modeling Deep Temporal Dependencies with Recurrent "Grammar Cells" , 2014, NIPS.

[16]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[17]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[18]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[19]  Jiajun Wu,et al.  Galileo: Perceiving Physical Object Properties by Integrating a Physics Engine with Deep Learning , 2015, NIPS.

[20]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[21]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[22]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[23]  Jürgen Schmidhuber,et al.  On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models , 2015, ArXiv.

[24]  Marc G. Bellemare,et al.  The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract) , 2012, IJCAI.

[25]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[26]  Dan Klein,et al.  Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Geoffrey E. Hinton,et al.  Attend, Infer, Repeat: Fast Scene Understanding with Generative Models , 2016, NIPS.

[28]  Jitendra Malik,et al.  Learning Visual Predictive Models of Physics for Playing Billiards , 2015, ICLR.

[29]  Harri Valpola,et al.  Tagger: Deep Unsupervised Perceptual Grouping , 2016, NIPS.

[30]  Joshua B. Tenenbaum,et al.  Building machines that learn and think like people , 2016, Behavioral and Brain Sciences.

[31]  Jitendra Malik,et al.  Learning to Poke by Poking: Experiential Learning of Intuitive Physics , 2016, NIPS.

[32]  Razvan Pascanu,et al.  Interaction Networks for Learning about Objects, Relations and Physics , 2016, NIPS.

[33]  Nando de Freitas,et al.  Neural Programmer-Interpreters , 2015, ICLR.

[34]  Rob Fergus,et al.  Learning Physical Intuition of Block Towers by Example , 2016, ICML.

[35]  Vincent Dumoulin,et al.  Deconvolution and Checkerboard Artifacts , 2016 .

[36]  Razvan Pascanu,et al.  A simple neural network module for relational reasoning , 2017, NIPS.

[37]  J. Tenenbaum,et al.  Mind Games: Game Engines as an Architecture for Intuitive Physics , 2017, Trends in Cognitive Sciences.

[38]  Marc Brockschmidt,et al.  Neural Program Lattices , 2016, ICLR.

[39]  Li Fei-Fei,et al.  Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Jürgen Schmidhuber,et al.  Neural Expectation Maximization , 2017, NIPS.

[41]  Razvan Pascanu,et al.  Visual Interaction Networks: Learning a Physics Simulator from Video , 2017, NIPS.

[42]  Marcin Andrychowicz,et al.  One-Shot Imitation Learning , 2017, NIPS.

[43]  Tim Rocktäschel,et al.  Programming with a Differentiable Forth Interpreter , 2016, ICML.

[44]  Dawn Xiaodong Song,et al.  Making Neural Programming Architectures Generalize via Recursion , 2017, ICLR.

[45]  Alexander Ilin,et al.  Recurrent Ladder Networks , 2017, NIPS.

[46]  Dileep George,et al.  Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics , 2017, ICML.

[47]  Joshua B. Tenenbaum,et al.  A Compositional Object-Based Approach to Learning Physical Dynamics , 2016, ICLR.

[48]  Trevor Darrell,et al.  Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Alexander J. Smola,et al.  Deep Sets , 2017, 1703.06114.

[50]  Samuel S. Schoenholz,et al.  Neural Message Passing for Quantum Chemistry , 2017, ICML.

[51]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[52]  Armando Solar-Lezama,et al.  Learning to Infer Graphics Programs from Hand-Drawn Images , 2017, NeurIPS.