PIP: Physical Interaction Prediction via Mental Simulation with Span Selection

Accurate prediction of physical interaction outcomes is a crucial component of human intelligence and is important for the safe and efficient deployment of robots in the real world. Existing vision-based intuitive physics models learn to predict physical interaction outcomes, but they mostly focus on generating short sequences of future frames based on physical properties (e.g., mass, friction and velocity) extracted from visual inputs or a latent space. Few intuitive physics models have been tested on long physical interaction sequences with multiple interactions among different objects. We hypothesize that selective temporal attention during approximate mental simulations helps humans predict physical interaction outcomes. With these motivations, we propose a novel scheme: Physical Interaction Prediction via Mental Simulation with Span Selection (PIP). PIP first uses a deep generative model to approximate mental simulation by generating future frames of physical interactions. It then applies selective temporal attention, in the form of span selection, to those frames to predict the physical interaction outcome. To evaluate our model, we further propose the large-scale SPACE+ dataset of synthetic videos with long sequences of three prime physical interactions in a 3D environment. Our experiments show that PIP outperforms humans, baseline models, and related intuitive physics models that utilize mental simulation. Furthermore, PIP’s span selection module effectively identifies the frames indicating key physical interactions among objects, allowing for added interpretability.
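
As a rough illustration of what selecting a span of frames can look like, the sketch below picks the contiguous window of frames with the highest total relevance score and averages the frame features inside it. This is not PIP's span selection module: the hard fixed-length window, the per-frame scores, and the names `select_span` and `pool_span` are all assumptions made for illustration, and the abstract does not specify how the module is implemented.

```python
from typing import List, Tuple


def select_span(scores: List[float], span_len: int) -> Tuple[int, int]:
    """Return (start, end) of the best contiguous window; end is exclusive.

    Hypothetical helper: `scores` holds one relevance score per generated
    frame, and the window length is fixed (an illustrative simplification).
    """
    if not 1 <= span_len <= len(scores):
        raise ValueError("span_len must be between 1 and len(scores)")
    # Sliding-window sum: start with the first window, then shift by one frame.
    window = sum(scores[:span_len])
    best_sum, best_start = window, 0
    for start in range(1, len(scores) - span_len + 1):
        window += scores[start + span_len - 1] - scores[start - 1]
        if window > best_sum:
            best_sum, best_start = window, start
    return best_start, best_start + span_len


def pool_span(features: List[List[float]], span: Tuple[int, int]) -> List[float]:
    """Average the per-frame feature vectors inside the selected span."""
    start, end = span
    frames = features[start:end]
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


if __name__ == "__main__":
    # Toy data: six frames, with frames 2 and 3 scored as most relevant.
    frame_scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.0]
    frame_features = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0],
                      [3.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
    span = select_span(frame_scores, span_len=2)
    print(span, pool_span(frame_features, span))
```

In a setup like this, the pooled vector would be passed to an outcome classifier, and the selected indices show which frames drove the prediction, which is the kind of interpretability the abstract describes.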
