Object Goal Navigation with End-to-End Self-Supervision

A household robot should be able to navigate to target locations without requiring users to first annotate everything in their home. Current approaches to this object navigation challenge are not tested on real robots and rely on expensive, semantically labeled 3D meshes. In this work, we aim for an agent that builds self-supervised models of the world through exploration, much as a child might. We propose an end-to-end self-supervised embodied agent that leverages exploration to train a semantic segmentation model of 3D objects, and uses those representations to learn an object navigation policy purely from self-labeled 3D meshes. The key insight is that embodied agents can use location consistency as a supervision signal, collecting images of the same location from different views and angles and applying contrastive learning to fine-tune a semantic segmentation model. In our experiments, our framework performs better than other self-supervised baselines and competitively with supervised baselines, both in simulation and when deployed in real houses.
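To make the location-consistency idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each observed feature vector comes with an estimated 3D position, that features observed within a small radius of the anchor's position count as positives, and that an InfoNCE-style loss is used; the abstract only says "contrastive learning", so the specific loss, the `radius` threshold, and the function names `split_by_location` and `info_nce` are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_by_location(anchor_xyz, candidates, radius=0.25):
    """Split (xyz, feature) candidates into positives and negatives.

    Assumption: a candidate seen within `radius` metres of the anchor's
    3D position shows the same object, so it is treated as a positive.
    """
    positives, negatives = [], []
    for xyz, feature in candidates:
        if math.dist(anchor_xyz, xyz) <= radius:
            positives.append(feature)
        else:
            negatives.append(feature)
    return positives, negatives


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, and several negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive pair.
    return log_sum - logits[0]


if __name__ == "__main__":
    anchor_xyz, anchor_feat = (0.0, 0.0, 0.0), [1.0, 0.0]
    views = [
        ((0.1, 0.0, 0.0), [0.9, 0.1]),  # same location, different view
        ((2.0, 0.0, 0.0), [0.0, 1.0]),  # a different location
    ]
    pos, neg = split_by_location(anchor_xyz, views)
    print(info_nce(anchor_feat, pos[0], neg))
```

The loss is small when features observed at the same location agree across views and large when they resemble features from other locations, which is the signal a segmentation backbone could be fine-tuned on under these assumptions.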
