LUMINOUS: Indoor Scene Generation for Embodied AI Challenges

Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts. This paper presents LUMINOUS, the first research framework that employs stateof-the-art indoor scene synthesis algorithms to generate large-scale simulated scenes for Embodied AI challenges. Further, we automatically and quantitatively evaluate the quality of generated indoor scenes via their ability to support complex household tasks. LUMINOUS incorporates a novel scene generation algorithm (Constrained Stochastic Scene Generation (CSSG)), which achieves competitive performance with human-designed scenes. Within LUMINOUS, the EAI task executor, task instruction generation module, and video rendering toolkit can collectively generate a massive multimodal dataset of new scenes for the training and evaluation of Embodied AI agents. Extensive experimental results demonstrate the effectiveness of the data generated by LUMINOUS, enabling the comprehensive assessment of embodied agents on generalization and robustness. The full codebase and documentation of LUMINOUS is available at: https: //github.com/amazon-research/indoor-scene-generation-eai/.

[1]  Roozbeh Mottaghi,et al.  AllenAct: A Framework for Embodied AI Research , 2020, ArXiv.

[2]  Dieter Fox,et al.  A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution , 2021, CoRL.

[3]  Pat Hanrahan,et al.  Example-based synthesis of 3D object arrangements , 2012, ACM Trans. Graph..

[4]  Yichi Zhang,et al.  Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring , 2021, FINDINGS.

[5]  Kai Wang,et al.  Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Cordelia Schmid,et al.  Episodic Transformer for Vision-and-Language Navigation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Chuang Gan,et al.  ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation , 2020, ArXiv.

[8]  Ali Farhadi,et al.  RoboTHOR: An Open Simulation-to-Real Embodied AI Platform , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Chenfanfu Jiang,et al.  Human-Centric Indoor Scene Synthesis Using Stochastic Grammar , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[11]  Chi-Keung Tang,et al.  Make it home: automatic optimization of furniture arrangement , 2011, ACM Trans. Graph..

[12]  Thomas A. Funkhouser,et al.  Semantic Scene Completion from a Single Depth Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[14]  James Stewart,et al.  Constraint-Based Automatic Placement for Scene Composition , 2002, Graphics Interface.

[15]  Silvio Savarese,et al.  iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks , 2021, CoRL.

[16]  Chongyang Ma,et al.  Deep Generative Modeling for Scene Synthesis via Hybrid Representations , 2018, ACM Trans. Graph..

[17]  Leonidas J. Guibas,et al.  Language-driven synthesis of 3D scenes from scene databases , 2018, ACM Trans. Graph..

[18]  Jitendra Malik,et al.  Habitat: A Platform for Embodied AI Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Andrew Bennett,et al.  CHALET: Cornell House Agent Learning Environment , 2018, ArXiv.

[20]  Evangelos Kalogerakis,et al.  SceneGraphNet: Neural Message Passing for 3D Indoor Scene Augmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Alessandro Suglia,et al.  Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion , 2021, ArXiv.

[22]  Jonghyun Choi,et al.  MOCA: A Modular Object-Centric Approach for Interactive Instruction Following , 2020, ArXiv.

[23]  Peng Liu,et al.  3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Silvio Savarese,et al.  iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes , 2020, 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[25]  Roozbeh Mottaghi,et al.  Rearrangement: A Challenge for Embodied AI , 2020, ArXiv.

[26]  Angel X. Chang,et al.  Deep convolutional priors for indoor scene synthesis , 2018, ACM Trans. Graph..

[27]  Andrew J. Davison,et al.  RLBench: The Robot Learning Benchmark & Learning Environment , 2019, IEEE Robotics and Automation Letters.

[28]  Leonidas J. Guibas,et al.  SAPIEN: A SimulAted Part-Based Interactive ENvironment , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Anne Verroust-Blondet,et al.  3D Semantic Scene Completion: A Survey , 2021, International Journal of Computer Vision.

[30]  Roozbeh Mottaghi,et al.  ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Roozbeh Mottaghi,et al.  Visual Room Rearrangement , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ali Farhadi,et al.  AI2-THOR: An Interactive 3D Environment for Visual AI , 2017, ArXiv.

[33]  Jiajun Wu,et al.  End-to-End Optimization of Scene Layout , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Angel X. Chang,et al.  Semantic Parsing for Text to 3D Scene Generation , 2014, ACL 2014.

[35]  Matthias Nießner,et al.  Activity-centric scene synthesis for functional 3D scene modeling , 2015, ACM Trans. Graph..

[36]  Sanja Fidler,et al.  VirtualHome: Simulating Household Activities Via Programs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Craig A. Knoblock,et al.  PDDL-the planning domain definition language , 1998 .

[38]  Chandan Yeshwanth,et al.  SceneFormer: Indoor Scene Generation with Transformers , 2020, 2021 International Conference on 3D Vision (3DV).

[39]  Jitendra Malik,et al.  Habitat 2.0: Training Home Assistants to Rearrange their Habitat , 2021, ArXiv.

[40]  Patrik Haslum,et al.  An Introduction to the Planning Domain Definition Language , 2019, An Introduction to the Planning Domain Definition Language.