Occlusion resistant learning of intuitive physics from videos

To reach human performance on complex tasks, artificial systems need to understand physical interactions between objects and predict the future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention, and several methods have been proposed to learn these physical rules from video sequences. Yet most of these methods are restricted to cases where no, or only limited, occlusions occur. In this work, we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions. In our formulation, object positions are modeled as latent variables, enabling the reconstruction of the scene. We then propose a series of approximations that make this problem tractable. Object proposals are linked across frames using a combination of a recurrent interaction network, which models the physics in object space, and a compositional renderer, which models how objects project onto pixel space. We demonstrate significant improvements over the state of the art on the IntPhys intuitive physics benchmark. We also apply our method to a second dataset with increasing levels of occlusion, showing that it realistically predicts segmentation masks up to 30 frames into the future. Finally, we show results on predicting the motion of objects in real videos.
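
To make the idea of a compositional renderer concrete, the toy sketch below composes per-object masks into a single segmentation map in which nearer objects occlude farther ones. It is an illustration only, not the learned renderer described in the abstract: the `Disk` state, the disk-shaped masks, and the functions `render_masks` and `visible_fraction` are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical object state: image-plane centre, radius, and depth."""
    obj_id: int    # non-zero label; 0 is reserved for background
    x: float
    y: float
    radius: float
    depth: float   # distance from the camera; smaller means closer


def covers(obj, col, row):
    """True if pixel (col, row) lies inside the object's disk footprint."""
    return (col - obj.x) ** 2 + (row - obj.y) ** 2 <= obj.radius ** 2


def render_masks(objects, width, height):
    """Compose per-object disks into one segmentation map.

    Each pixel receives the id of the closest object covering it,
    or 0 for background, so nearer objects occlude farther ones.
    """
    seg = [[0] * width for _ in range(height)]
    # Paint far-to-near so that closer objects overwrite farther ones.
    for obj in sorted(objects, key=lambda o: o.depth, reverse=True):
        for row in range(height):
            for col in range(width):
                if covers(obj, col, row):
                    seg[row][col] = obj.obj_id
    return seg


def visible_fraction(seg, obj):
    """Fraction of the object's in-image footprint that is not occluded."""
    height, width = len(seg), len(seg[0])
    full = sum(
        1 for row in range(height) for col in range(width)
        if covers(obj, col, row)
    )
    seen = sum(row.count(obj.obj_id) for row in seg)
    return seen / full if full else 0.0


# Two overlapping objects: the nearer one hides part of the farther one.
near = Disk(obj_id=1, x=4, y=4, radius=2, depth=1.0)
far = Disk(obj_id=2, x=5, y=4, radius=2, depth=5.0)
seg = render_masks([near, far], width=10, height=8)
```

In this sketch the farther object's visible fraction drops below one while its underlying state is unchanged, which is the situation that motivates treating object positions as latent variables instead of reading them directly from the visible pixels.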
