Gibson Env V2: Embodied Simulation Environments for Interactive Navigation

Autonomous navigation is one of the most crucial tasks for mobile agents: the goal is to have the agent reach any location in the environment in a safe and robust manner. Traditionally, robot navigation (including obstacle avoidance) has been addressed with analytical, model-based solutions using signals from Lidars or depth sensors [5, 6]. Recently, learning-based visual navigation methods have gained popularity because 1) they can navigate without accurate localization or metric maps [18, 1], 2) they do not require expensive Lidars or depth sensors [11, 14], and 3) they can generalize robustly to previously unseen environments [24, 15]. Despite these benefits, learning-based approaches usually require a large amount of data, and collecting this data through interactions with the real world can be dangerous, costly, and time-consuming.

A solution to this challenge is to learn in simulated environments. In simulation, the agent can collect experience safely and efficiently, usually one or two orders of magnitude faster than real time. However, despite recent advances in simulation for robotics [19, 24, 20, 21, 12, 10, 7, 3], it is still far from straightforward to directly transfer what is learned in simulation to the real world. The reason is the so-called sim2real gap: the (more or less subtle) differences between the simulated and real environments, for example due to the different spatial arrangement of objects or the disparity between real and simulated sensor signals. Many learning-based approaches rely on simulation for fast policy learning, and different simulation-to-real transfer strategies have been proposed, including photorealistic rendering [16, 4], domain randomization [17], and domain adaptation [2, 23].

In particular, Gibson Env [23] is a simulation environment that performs photorealistic rendering and additionally provides a pixel-level domain adaptation mechanism to address the commonly raised concern of sim-to-real transfer. In Gibson Env, the spatial arrangement of objects is realistic because the models of the environments are obtained from the real world; additionally, the gap between simulated and real visual sensors is bridged through a novel neural network, the Goggles. Thanks to these properties, Gibson Env has demonstrated strong sim-to-real transfer performance [9, 13]. However, Gibson Env presents two main limitations that hinder its use for learning-based navigation: 1) the rendering is relatively slow (40-100 fps) for large-scale training, which partially defeats the purpose of training in simulation, and 2) the interaction between the agent and the environment is limited to planar motion on the floor, while navigation in many real-world scenarios involves more intricate forms of interaction, such as opening doors and pushing away objects that are in the way.
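To make the experience-collection argument concrete, the following is a minimal sketch of the kind of Gym-style rollout loop that learning-based navigation in simulation typically builds on. It deliberately does not use the actual Gibson Env API: MockNavEnv, collect_episode, and the proportional policy below are hypothetical placeholders standing in for a photorealistic simulator and a learned policy.

```python
# Minimal sketch (not the Gibson Env API): collecting navigation experience
# from a simulated environment with a Gym-style reset/step interface.
# All class and function names here are hypothetical placeholders.
import numpy as np


class MockNavEnv:
    """Stand-in for a simulated point-goal navigation environment."""

    def __init__(self, goal=(5.0, 5.0)):
        self.goal = np.asarray(goal, dtype=np.float32)
        self.pos = np.zeros(2, dtype=np.float32)

    def reset(self):
        self.pos = np.zeros(2, dtype=np.float32)
        return self._obs()

    def step(self, action):
        # action: 2D velocity command, clipped to the robot's limits.
        self.pos += np.clip(action, -0.1, 0.1)
        dist = float(np.linalg.norm(self.goal - self.pos))
        reward = -dist          # dense reward: negative distance to the goal
        done = dist < 0.2       # episode ends when the goal is reached
        return self._obs(), reward, done, {}

    def _obs(self):
        # In Gibson Env the observation would be a rendered RGB(-D) frame;
        # here we return the goal vector in the agent frame for brevity.
        return self.goal - self.pos


def collect_episode(env, policy, max_steps=200):
    """Roll out one episode and return (obs, action, reward) transitions."""
    trajectory, obs = [], env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory


if __name__ == "__main__":
    env = MockNavEnv()
    # Trivial proportional controller standing in for a learned policy.
    traj = collect_episode(env, policy=lambda o: 0.1 * o)
    print(f"collected {len(traj)} transitions, final reward {traj[-1][2]:.3f}")
```

The point of the sketch is the throughput argument made above: because the environment is simulated, this loop can be run in parallel and much faster than real time, which is exactly why rendering speed and richer agent-environment interaction matter for a simulator intended for large-scale training.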

[1] Silvio Savarese et al. Deep Visual MPC-Policy Learning for Navigation. IEEE Robotics and Automation Letters, 2019.

[2] Germán Ros et al. CARLA: An Open Urban Driving Simulator. CoRL, 2017.

[3] Thomas A. Funkhouser et al. Semantic Scene Completion from a Single Depth Image. CVPR, 2017.

[4] Jitendra Malik et al. Combining Optimal Control and Learning for Visual Navigation in Novel Environments. CoRL, 2019.

[5] Siddhartha S. Srinivasa et al. Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation. CVPR, 2019.

[6] Sergey Levine et al. Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping. ICRA, 2018.

[7] George Drettakis et al. Scalable Inside-Out Image-Based Rendering. ACM Transactions on Graphics, 2016.

[8] Jitendra Malik et al. Habitat: A Platform for Embodied AI Research. ICCV, 2019.

[9] Jitendra Malik et al. Gibson Env: Real-World Perception for Embodied Agents. CVPR, 2018.

[10] Oussama Khatib et al. A Depth Space Approach to Human-Robot Collision Avoidance. ICRA, 2012.

[11] Thomas A. Funkhouser et al. MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments. arXiv, 2017.

[12] Raia Hadsell et al. Learning to Navigate in Cities Without a Map. NeurIPS, 2018.

[13] Rahul Sukthankar et al. Cognitive Mapping and Planning for Visual Navigation. International Journal of Computer Vision, 2017.

[14] Vladlen Koltun et al. Benchmarking Classic and Learned Navigation in Complex 3D Environments. arXiv, 2019.

[15] Wolfram Burgard et al. The Dynamic Window Approach to Collision Avoidance. IEEE Robotics & Automation Magazine, 1997.

[16] Joonho Lee et al. Learning Agile and Dynamic Motor Skills for Legged Robots. Science Robotics, 2019.

[17] Ali Farhadi et al. Target-Driven Visual Navigation in Indoor Scenes Using Deep Reinforcement Learning. ICRA, 2017.

[18] Howie Choset et al. Learning to Sequence Robot Behaviors for Visual Navigation. arXiv, 2018.

[19] Vladlen Koltun et al. Playing for Data: Ground Truth from Computer Games. ECCV, 2016.

[20] Andrew Howard et al. Design and Use Paradigms for Gazebo, an Open-Source Multi-Robot Simulator. IROS, 2004.

[21] Dieter Fox et al. Neural Autonomous Navigation with Riemannian Motion Policy. ICRA, 2019.

[22] Simon Brodeur et al. HoME: a Household Multimodal Environment. ICLR, 2017.

[23] Michael Goesele et al. Let There Be Color! Large-Scale Texturing of 3D Reconstructions. ECCV, 2014.