ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose PROCTHOR, a framework for procedural generation of Embodied AI environments. PROCTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks. We demonstrate the power and potential of PROCTHOR via a sample of 10,000 generated houses and a simple neural model. Models trained on PROCTHOR using only RGB images, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 Embodied AI benchmarks for navigation, rearrangement, and arm manipulation, including the presently running Habitat 2022, AI2-THOR Rearrangement 2022, and RoboTHOR challenges. We also demonstrate strong zero-shot results on these benchmarks by pre-training on PROCTHOR with no fine-tuning on the downstream benchmark, often beating previous state-of-the-art systems that have access to the downstream training data.
