Unsupervised Discovery of 3D Physical Objects from Video

We study the problem of unsupervised physical object discovery. While existing frameworks aim to decompose scenes into 2D segments based off each object's appearance, we explore how physics, especially object interactions, facilitates disentangling of 3D geometry and position of objects from video, in an unsupervised manner. Drawing inspiration from developmental psychology, our Physical Object Discovery Network (POD-Net) uses both multi-scale pixel cues and physical motion cues to accurately segment observable and partially occluded objects of varying sizes, and infer properties of those objects. Our model reliably segments objects on both synthetic and real scenes. The discovered object properties can also be used to reason about physical events.

[1]  Erik G. Learned-Miller,et al.  The Best of Both Worlds: Combining CNNs and Geometric Constraints for Hierarchical Motion Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[3]  Matthew Botvinick,et al.  MONet: Unsupervised Scene Decomposition and Representation , 2019, ArXiv.

[4]  Deva Ramanan,et al.  Towards Segmenting Anything That Moves , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[5]  Koray Kavukcuoglu,et al.  Neural scene representation and rendering , 2018, Science.

[6]  Jana Kosecka,et al.  Multiview RGB-D Dataset for Object Instance Detection , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[7]  Klaus Greff,et al.  Multi-Object Representation Learning with Iterative Variational Inference , 2019, ICML.

[8]  Dieter Fox,et al.  Toward object discovery and modeling via 3-D scene comparison , 2011, 2011 IEEE International Conference on Robotics and Automation.

[9]  Edward H. Adelson,et al.  Crisp Boundary Detection Using Pointwise Mutual Information , 2014, ECCV.

[10]  Sergey Levine,et al.  Reasoning About Physical Interactions with Object-Oriented Prediction and Planning , 2018, ICLR.

[11]  Dileep George,et al.  Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics , 2017, ICML.

[12]  Yee Whye Teh,et al.  Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects , 2018, NeurIPS.

[13]  Rob Fergus,et al.  Learning Physical Intuition of Block Towers by Example , 2016, ICML.

[14]  Ling Shao,et al.  See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Lu Ma,et al.  Unsupervised Dense Object Discovery, Detection, Tracking and Reconstruction , 2014, ECCV.

[16]  Daniel L. K. Yamins,et al.  Flexible Neural Representation for Physics Prediction , 2018, NeurIPS.

[17]  Larry S. Davis,et al.  DOPS: Learning to Detect 3D Objects and Predict Their 3D Shapes , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[19]  Emmanuel Dupoux,et al.  IntPhys: A Framework and Benchmark for Visual Intuitive Physics Reasoning , 2018, ArXiv.

[20]  Gordon Wetzstein,et al.  Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations , 2019, NeurIPS.

[21]  Shunyu Yao,et al.  Modeling Expectation Violation in Intuitive Physics with Coarse Probabilistic Object Representations , 2019, NeurIPS.

[22]  Jürgen Schmidhuber,et al.  R-SQAIR: Relational Sequential Attend, Infer, Repeat , 2019, ArXiv.

[23]  Jürgen Schmidhuber,et al.  Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions , 2018, ICLR.

[24]  Jiajun Wu,et al.  Entity Abstraction in Visual Model-Based Reinforcement Learning , 2019, CoRL.

[25]  Ling Shao,et al.  Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Jiajun Wu,et al.  Learning to See Physics via Visual De-animation , 2017, NIPS.

[27]  Zaïd Harchaoui,et al.  Object Discovery in Videos as Foreground Motion Clustering , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Stefano Soatto,et al.  Unsupervised Moving Object Detection via Contextual Information Separation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Luca Bertinetto,et al.  Anchor Diffusion for Unsupervised Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Jitendra Malik,et al.  Learning to segment moving objects in videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[32]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[33]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[35]  Max Jaderberg,et al.  Unsupervised Learning of 3D Structure from Images , 2016, NIPS.

[36]  Jure Leskovec,et al.  Learning to Simulate Complex Physics with Graph Networks , 2020, ICML.

[37]  Geoffrey E. Hinton,et al.  Attend, Infer, Repeat: Fast Scene Understanding with Generative Models , 2016, NIPS.

[38]  Fei-Fei Li,et al.  Object discovery in 3D scenes via shape analysis , 2013, 2013 IEEE International Conference on Robotics and Automation.

[39]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[40]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.