3DP3: 3D Scene Perception via Probabilistic Programming

We present 3DP3, a framework for inverse graphics that performs inference in a structured generative model of objects, scenes, and images. 3DP3 uses (i) voxel models to represent the 3D shape of objects, (ii) hierarchical scene graphs to decompose scenes into objects and the contacts between them, and (iii) depth image likelihoods based on real-time graphics. Given an observed RGB-D image, 3DP3’s inference algorithm recovers the underlying latent 3D scene, including the object poses and a parsimonious joint parametrization of these poses. It does so using fast bottom-up pose proposals, novel involutive MCMC updates of the scene graph structure, and, optionally, neural object detectors and pose estimators. We show that 3DP3 enables scene understanding that is aware of 3D shape, occlusion, and contact structure. Our results demonstrate that 3DP3 is more accurate at 6DoF object pose estimation from real images than deep learning baselines and generalizes better to challenging scenes with novel viewpoints, contact, and partial observability.
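To make the idea of a joint pose parametrization via a scene graph concrete, the following is a minimal sketch and not 3DP3's implementation. It assumes each object stores a pose relative to its parent as a 4x4 homogeneous transform, whereas 3DP3 itself parametrizes contacts between objects. The names `pose`, `compose`, `world_pose`, `position`, and the example objects are hypothetical.

```python
import math

def pose(x, y, z, yaw=0.0):
    """4x4 homogeneous transform: rotation by `yaw` about z, plus a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def compose(a, b):
    """Matrix product a @ b for 4x4 transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def world_pose(name, graph):
    """Absolute pose of `name`, obtained by chaining relative poses up to the root."""
    parent, rel = graph[name]
    if parent is None:
        return rel
    return compose(world_pose(parent, graph), rel)

def position(transform):
    """Translation component of a 4x4 transform."""
    return [transform[0][3], transform[1][3], transform[2][3]]

# Hypothetical scene graph: node -> (parent, pose relative to parent).
scene = {
    "table": (None, pose(1.0, 0.0, 0.0)),
    "box": ("table", pose(0.0, 0.5, 0.75)),
    "mug": ("box", pose(0.1, 0.0, 0.2, yaw=math.pi / 2)),
}

mug_before = position(world_pose("mug", scene))

# Moving the table changes a single relative pose, yet the box and mug follow.
moved = dict(scene)
moved["table"] = (None, pose(2.0, 0.0, 0.0))
mug_after = position(world_pose("mug", moved))
```

In this sketch, editing one node's relative pose updates the absolute poses of everything resting on it, which is the sense in which a scene graph gives a more economical description than independent 6DoF poses for every object.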
