Navigating to Objects Specified by Images

Images are a convenient way to specify which particular object instance an embodied agent should navigate to. Solving this task requires semantic visual reasoning and exploration of unknown environments. We present a system that can perform this task in both simulation and the real world. Our modular method solves the sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation. We re-identify the goal instance in egocentric vision using feature matching and localize it by projecting the matched features onto a map. Each sub-task is solved using off-the-shelf components that require no fine-tuning. On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy by 7x and a state-of-the-art ImageNav model by 2.3x (56% vs. 25% success). We deploy this system on a mobile robot platform and demonstrate effective real-world performance, achieving an 88% success rate across a home and an office environment.
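
To make the goal localization step concrete, the following is a minimal sketch of how matched features could be projected onto a top-down map. It is an illustration under stated assumptions, not the system's actual implementation: the function names, the pinhole camera model with a level camera, the pose convention, the vote-counting grid, and the cell size are all hypothetical choices made for the example. The feature matcher itself is treated as given; only the pixel coordinates of its matches are used.

```python
import math
from collections import Counter


def project_matches_to_map(keypoints, depth, fx, cx, pose, cell_size=0.05):
    """Project matched keypoints onto a 2D grid map and count votes per cell.

    All conventions here are assumptions made for this sketch:
      keypoints: list of (u, v) pixel coordinates of matched features
      depth:     2D list, depth[v][u] in metres (values <= 0 are invalid)
      fx, cx:    horizontal focal length and principal point, in pixels
      pose:      (x, y, yaw) of the camera in the map frame; yaw in radians,
                 measured counter-clockwise, with yaw = 0 facing +x
      cell_size: edge length of a map cell in metres
    """
    x0, y0, yaw = pose
    votes = Counter()
    for u, v in keypoints:
        z = depth[v][u]  # forward distance along the optical axis
        if z <= 0:
            continue  # skip pixels without a valid depth reading
        # Pinhole back-projection: lateral offset to the camera's right.
        x_right = (u - cx) * z / fx
        # Rotate (forward, right) into the map frame and translate by the pose.
        mx = x0 + z * math.cos(yaw) + x_right * math.sin(yaw)
        my = y0 + z * math.sin(yaw) - x_right * math.cos(yaw)
        cell = (math.floor(mx / cell_size), math.floor(my / cell_size))
        votes[cell] += 1
    return votes


def localize_goal(votes, cell_size=0.05):
    """Return the centre (in metres) of the most-voted cell, or None."""
    if not votes:
        return None
    (col, row), _ = votes.most_common(1)[0]
    return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)
```

In this sketch the goal location is taken to be the centre of the cell that receives the most projected matches; a real system would also need to decide when there are enough matches to trust the re-identification at all, which is omitted here.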
