Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches on learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be super accurate but also limits them only to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage the ideas from success stories in visual recognition tasks to build object representations that can capture inter-object and object-environment interactions over a long-range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin both in terms of prediction quality and their ability to plan for downstream tasks, and also generalize well to novel environments. Code, pre-trained models, and more visualization results are available at this https URL.

[1]  Stefano Ermon,et al.  Label-Free Supervision of Neural Networks with Physics and Domain Knowledge , 2016, AAAI.

[2]  Sergey Levine,et al.  Deep visual foresight for planning robot motion , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[3]  Ah Chung Tsoi,et al.  The Graph Neural Network Model , 2009, IEEE Transactions on Neural Networks.

[4]  Mario Fritz,et al.  Visual Stability Prediction and Its Application to Manipulation , 2016, AAAI Spring Symposia.

[5]  Jitendra Malik,et al.  Mesh R-CNN , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  M. McCloskey,et al.  Intuitive physics: the straight-down belief and its origin. , 1983, Journal of experimental psychology. Learning, memory, and cognition.

[7]  Sergey Levine,et al.  Stochastic Adversarial Video Prediction , 2018, ArXiv.

[8]  Razvan Pascanu,et al.  Relational inductive biases, deep learning, and graph networks , 2018, ArXiv.

[9]  Michael I. Jordan,et al.  Forward Models: Supervised Learning with a Distal Teacher , 1992, Cogn. Sci..

[10]  Mario Fritz,et al.  To Fall Or Not To Fall: A Visual Approach to Physical Stability Prediction , 2016, ArXiv.

[11]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[12]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[13]  Frank Hutter,et al.  SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[14]  Joshua B. Tenenbaum,et al.  A Compositional Object-Based Approach to Learning Physical Dynamics , 2016, ICLR.

[15]  James R. Kubricht,et al.  Intuitive Physics: Current Research and Controversies , 2017, Trends in Cognitive Sciences.

[16]  Alexei A. Efros,et al.  Time-Agnostic Prediction: Predicting Predictable Video Frames , 2018, ICLR.

[17]  Jure Leskovec,et al.  Learning to Simulate Complex Physics with Graph Networks , 2020, ICML.

[18]  Sergey Levine,et al.  Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control , 2018, ArXiv.

[19]  W. H. F. Barnes The Nature of Explanation , 1944, Nature.

[20]  Ali Farhadi,et al.  "What Happens If..." Learning to Predict the Effect of Forces in Images , 2016, ECCV.

[21]  Geoffrey E. Hinton,et al.  NeuroAnimator: fast neural network emulation and control of physics-based models , 1998, SIGGRAPH.

[22]  Jiajun Wu,et al.  Learning to See Physics via Visual De-animation , 2017, NIPS.

[23]  Kaiming He,et al.  Group Normalization , 2018, ECCV.

[24]  David J. Fleet,et al.  Estimating contact dynamics , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[25]  Rohit Girdhar,et al.  Forward Prediction for Physical Reasoning , 2020, ArXiv.

[26]  Niloy J. Mitra,et al.  Learning A Physical Long-term Predictor , 2017, ArXiv.

[27]  Jiajun Wu,et al.  Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks , 2016, NIPS.

[28]  Jiajun Wu,et al.  Propagation Networks for Model-Based Control Under Partial Observation , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[29]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[30]  Carl E. Rasmussen,et al.  PILCO: A Model-Based and Data-Efficient Approach to Policy Search , 2011, ICML.

[31]  Jitendra Malik,et al.  Zero-Shot Visual Imitation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[32]  Andrea Vedaldi,et al.  ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking , 2018, ECCV.

[33]  Jan Kautz,et al.  MoCoGAN: Decomposing Motion and Content for Video Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[35]  Marko Bacic,et al.  Model predictive control , 2003 .

[36]  Seunghoon Hong,et al.  Decomposing Motion and Content for Natural Video Sequence Prediction , 2017, ICLR.

[37]  Bernt Schiele,et al.  Long-Term Image Boundary Extrapolation , 2016, ArXiv.

[38]  Ruben Villegas,et al.  Learning to Generate Long-term Future via Hierarchical Prediction , 2017, ICML.

[39]  Martial Hebert,et al.  The Pose Knows: Video Forecasting by Generating Pose Futures , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Steven M. Seitz,et al.  Computing the Physical Parameters of Rigid-Body Motion from Video , 2002, ECCV.

[41]  Abhinav Gupta,et al.  Interpretable Intuitive Physics Model , 2018, ECCV.

[42]  Internal physics models guide probabilistic judgments about object dynamics , 2011 .

[43]  Xiaoou Tang,et al.  Video Frame Synthesis Using Deep Voxel Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44]  Geoffrey E. Hinton,et al.  The Recurrent Temporal Restricted Boltzmann Machine , 2008, NIPS.

[45]  Jitendra Malik,et al.  Learning to Poke by Poking: Experiential Learning of Intuitive Physics , 2016, NIPS.

[46]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[47]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[48]  Razvan Pascanu,et al.  Visual Interaction Networks: Learning a Physics Simulator from Video , 2017, NIPS.

[49]  Chuang Gan,et al.  CLEVRER: CoLlision Events for Video REpresentation and Reasoning , 2020, ICLR.

[50]  Tsuhan Chen,et al.  3D Reasoning from Blocks to Stability , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Sergey Levine,et al.  Reasoning About Physical Interactions with Object-Oriented Prediction and Planning , 2018, ICLR.

[52]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[53]  Frank Allgöwer,et al.  Nonlinear Model Predictive Control , 2007 .

[54]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[55]  Abhinav Gupta,et al.  Compositional Video Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[56]  Jiajun Wu,et al.  Entity Abstraction in Visual Model-Based Reinforcement Learning , 2019, CoRL.

[57]  Sergey Levine,et al.  Robustness via Retrying: Closed-Loop Robotic Manipulation with Self-Supervised Learning , 2018, CoRL.

[58]  Honglak Lee,et al.  Action-Conditional Video Prediction using Deep Networks in Atari Games , 2015, NIPS.

[59]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Rob Fergus,et al.  Learning Physical Intuition of Block Towers by Example , 2016, ICML.

[61]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[62]  Kristian Kersting,et al.  Structured Object-Aware Physics Prediction for Video Modeling and Planning , 2020, ICLR.

[63]  Alexei A. Efros,et al.  Curiosity-Driven Exploration by Self-Supervised Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[64]  Thomas B. Schön,et al.  From Pixels to Torques: Policy Learning with Deep Dynamical Models , 2015, ICML 2015.

[65]  Wenmin Wang,et al.  Video Imagination from a Single Image with Transformation Generation , 2017, ACM Multimedia.

[66]  Sergey Levine,et al.  Unsupervised Learning for Physical Interaction through Video Prediction , 2016, NIPS.

[67]  G. Martin,et al.  Nonlinear model predictive control , 1999, Proceedings of the 1999 American Control Conference (Cat. No. 99CH36251).

[68]  Ross B. Girshick,et al.  PHYRE: A New Benchmark for Physical Reasoning , 2019, NeurIPS.

[69]  Elise van der Pol,et al.  Contrastive Learning of Structured World Models , 2020, ICLR.

[70]  Luc Van Gool,et al.  Dynamic Filter Networks , 2016, NIPS.

[71]  Jitendra Malik,et al.  Learning Visual Predictive Models of Physics for Playing Billiards , 2015, ICLR.

[72]  Jiajun Wu,et al.  Physics 101: Learning Physical Object Properties from Unlabeled Videos , 2016, BMVC.

[73]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[74]  Martial Hebert,et al.  An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders , 2016, ECCV.

[75]  Jiajun Wu,et al.  Galileo: Perceiving Physical Object Properties by Integrating a Physics Engine with Deep Learning , 2015, NIPS.

[76]  Ali Farhadi,et al.  Newtonian Image Understanding: Unfolding the Dynamics of Objects in Static Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[77]  Alexei A. Efros,et al.  Large-Scale Study of Curiosity-Driven Learning , 2018, ICLR.

[78]  Rob Fergus,et al.  Stochastic Video Generation with a Learned Prior , 2018, ICML.

[79]  Sergey Levine,et al.  Stochastic Variational Video Prediction , 2017, ICLR.

[80]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[81]  Razvan Pascanu,et al.  Interaction Networks for Learning about Objects, Relations and Physics , 2016, NIPS.