Reconstructing Interactive 3D Scenes by Panoptic Mapping and CAD Model Alignments

In this paper, we rethink the problem of scene reconstruction from an embodied agent’s perspective: while the classic view focuses on reconstruction accuracy, our new perspective emphasizes the underlying functions and constraints, so that the reconstructed scenes provide actionable information for simulating interactions with agents. We address this challenging problem by reconstructing an interactive scene from an RGB-D data stream. The reconstruction captures (i) the semantics and geometry of objects and layouts through a 3D volumetric panoptic mapping module, and (ii) object affordance and contextual relations by reasoning over physical common sense among objects, organized by a graph-based scene representation. Crucially, the reconstructed scene replaces the object meshes in the dense panoptic map with part-based articulated CAD models for finer-grained robot interactions. In the experiments, we demonstrate that (i) our panoptic mapping module outperforms previous state-of-the-art methods, (ii) a high-performance physical reasoning procedure matches, aligns, and replaces objects’ meshes with best-fitted CAD models, and (iii) the reconstructed scenes are physically plausible and naturally afford actionable interactions; without any manual labeling, they are seamlessly imported into ROS-based simulators and virtual environments for complex robot task executions.
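To make the idea of a graph-based scene representation with CAD replacement more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each scene node stores a semantic label, bounding-box extents, and a supporting parent, and that a CAD model is chosen by comparing sorted box extents. All names (`SceneNode`, `CadModel`, `replace_with_cad`, `support_chain`) and the size-based matching rule are hypothetical simplifications introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Size = Tuple[float, float, float]  # bounding-box extents in metres (assumed)


@dataclass
class SceneNode:
    """One object or layout element in a hypothetical scene graph."""
    name: str
    label: str                       # semantic class from the panoptic map
    size: Size
    parent: Optional[str] = None     # name of the supporting node, if any
    cad_model: Optional[str] = None  # filled in once a CAD model is matched


@dataclass
class CadModel:
    """One entry of a hypothetical CAD library."""
    model_id: str
    label: str
    size: Size
    articulated_parts: List[str] = field(default_factory=list)


def size_mismatch(a: Size, b: Size) -> float:
    # Compare sorted extents so the score ignores axis ordering.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


def replace_with_cad(nodes: Dict[str, SceneNode], library: List[CadModel]) -> None:
    """Assign each node the same-label CAD model with the closest extents."""
    for node in nodes.values():
        candidates = [m for m in library if m.label == node.label]
        if candidates:
            best = min(candidates, key=lambda m: size_mismatch(node.size, m.size))
            node.cad_model = best.model_id


def support_chain(nodes: Dict[str, SceneNode], name: str) -> List[str]:
    """Follow supporting relations from a node up to the root."""
    chain = []
    current = nodes[name].parent
    while current is not None:
        chain.append(current)
        current = nodes[current].parent
    return chain


nodes = {
    "floor": SceneNode("floor", "floor", (4.0, 5.0, 0.0)),
    "table": SceneNode("table", "table", (1.2, 0.8, 0.75), parent="floor"),
    "microwave": SceneNode("microwave", "microwave", (0.5, 0.4, 0.3), parent="table"),
}
library = [
    CadModel("microwave_small", "microwave", (0.45, 0.35, 0.28), ["door"]),
    CadModel("microwave_large", "microwave", (0.7, 0.5, 0.4), ["door"]),
    CadModel("table_a", "table", (1.0, 1.0, 0.7)),
]
replace_with_cad(nodes, library)
```

In this toy scene the microwave node is matched to the smaller CAD entry, and `support_chain(nodes, "microwave")` walks from the table to the floor. A real pipeline would also need pose alignment and physical-plausibility checks, which this sketch leaves out.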
