StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects

Robots operating in human environments must be able to rearrange objects into semantically meaningful configurations, even when those objects are previously unseen. In this work, we focus on the problem of building physically valid structures without step-by-step instructions. We propose StructDiffusion, which combines a diffusion model and an object-centric transformer to construct structures from partial-view point clouds and high-level language goals, such as "set the table". Our method performs multiple challenging language-conditioned, multi-step 3D planning tasks with a single model. StructDiffusion improves the success rate of assembling physically valid structures out of unseen objects by 16% on average over an existing multi-modal transformer model trained on specific structures. We evaluate on held-out objects in both simulation and real-world rearrangement tasks. Importantly, we show how integrating both a diffusion model and a collision-discriminator model improves generalization over other methods when rearranging previously unseen objects. For videos and additional results, see our website: https://structdiffusion.github.io/.
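
To make the general recipe in the abstract concrete, the listing below is a minimal, self-contained sketch (NumPy only) of sampling candidate object poses with a DDPM-style reverse diffusion process and re-ranking the samples with a collision score. It is illustrative, not the authors' implementation: the pose parameterization (3D translation plus a continuous 6D rotation, 9 numbers per object), the noise schedule, and all function names (make_ddpm_schedule, denoise_step, sample_structure, collision_score) are assumptions, and conditioning of the denoiser on segmented partial-view point clouds and the language goal is left as a stub where the object-centric transformer would go.

```python
# Illustrative sketch only -- not the released StructDiffusion code.
# Assumes a DDPM-style reverse process over stacked per-object pose
# parameters and a separate collision score used to re-rank samples.
import numpy as np

def make_ddpm_schedule(T=200, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule and its cumulative products (hypothetical values).
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def denoise_step(poses_t, t, predict_noise, betas, alphas, alpha_bars, rng):
    """One reverse-diffusion step x_t -> x_{t-1} for the stacked object poses."""
    eps_hat = predict_noise(poses_t, t)              # transformer denoiser would go here
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (poses_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        mean += np.sqrt(betas[t]) * rng.standard_normal(poses_t.shape)
    return mean

def sample_structure(n_objects, predict_noise, collision_score,
                     n_samples=8, T=200, seed=0):
    """Draw several candidate structures and keep the least-colliding one."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_ddpm_schedule(T)
    best, best_score = None, np.inf
    for _ in range(n_samples):
        poses = rng.standard_normal((n_objects, 9))  # start from pure noise
        for t in reversed(range(T)):
            poses = denoise_step(poses, t, predict_noise,
                                 betas, alphas, alpha_bars, rng)
        score = collision_score(poses)               # discriminator re-ranks samples
        if score < best_score:
            best, best_score = poses, score
    return best

# Stand-ins so the sketch runs end to end; a real system would condition both
# networks on the segmented point clouds and the language goal.
dummy_denoiser = lambda x, t: 0.1 * x
dummy_collision = lambda poses: float(np.sum(poses[:, :3] ** 2))
print(sample_structure(n_objects=4, predict_noise=dummy_denoiser,
                       collision_score=dummy_collision).shape)
```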
