Physion: Evaluating Physical Prediction from Vision in Humans and Machines

While machine learning algorithms excel at many challenging visual tasks, it is unclear whether they can make predictions about commonplace real-world physical events. Here, we present a visual and physical prediction benchmark that precisely measures this capability. By realistically simulating a wide variety of physical phenomena – rigid and soft-body collisions, stable multi-object configurations, rolling and sliding, projectile motion – our dataset presents a more comprehensive challenge than existing benchmarks. Moreover, we have collected human responses for our stimuli so that model predictions can be directly compared to human judgments. We compare an array of algorithms – varying in their architecture, learning objective, input-output structure, and training data – on their ability to make diverse physical predictions. We find that graph neural networks with access to the physical state best capture human behavior, whereas among models that receive only visual input, those with object-centric representations or pretraining do best but fall far short of human accuracy. This suggests that extracting physically meaningful representations of scenes is the main bottleneck to achieving human-like visual prediction. We thus demonstrate how our benchmark can identify areas for improvement and measure progress on this key aspect of physical understanding.
