PIP: Physical Interaction Prediction via Mental Imagery with Span Selection

To align advanced artificial intelligence (AI) with human values and promote safe AI, it is important for AI to predict the outcome of physical interactions. Although how humans predict the outcomes of physical interactions among objects in the real world remains under debate, several works have attempted to tackle this task with cognitive-inspired AI approaches. However, AI approaches that mimic the mental imagery humans use to predict physical interactions in the real world are still lacking. In this work, we propose a novel scheme, PIP: Physical Interaction Prediction via Mental Imagery with Span Selection. PIP uses a deep generative model to output future frames of physical interactions among objects, then extracts the information crucial for predicting those interactions by using span selection to focus on salient frames. To evaluate our model, we propose SPACE+, a large-scale dataset of synthetic video frames covering three physical interaction events in a 3D environment. Our experiments show that PIP outperforms baselines and human performance in physical interaction prediction for both seen and unseen objects. Furthermore, PIP’s span selection scheme can effectively identify the frames where physical interactions among objects occur within the generated frames, allowing for added interpretability.
