Instance-Level Semantic Maps for Vision Language Navigation

Humans have a natural ability to form semantic associations with the objects around them. These associations let them build a mental map of the environment and navigate on demand when given linguistic instructions. A natural goal in Vision Language Navigation (VLN) research is to impart similar capabilities to autonomous agents. Recent works take a step towards this goal by creating a semantic spatial map of the environment without any labeled data. However, these representations have limited practical applicability because they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into the spatial map using a community detection algorithm, and by exploiting the word ontology learned by large language models (LLMs) to perform open-set semantic associations in the map. The resulting representation improves navigation performance two-fold (233%) over the baseline on realistic language commands with instance-specific descriptions. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.
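To make the two ideas concrete, the following is a minimal sketch (not the authors' implementation) of (i) splitting same-class map cells into object instances with off-the-shelf community detection, and (ii) open-set label association via embedding similarity. The proximity-graph construction, the `radius` threshold, and the text embeddings in `query_vec`/`label_vecs` (which would come from an LLM or CLIP-style text encoder) are all illustrative assumptions.

```python
# Hedged sketch: instance grouping via community detection, plus
# open-set label matching via embedding similarity. Graph construction
# and thresholds are assumptions, not the paper's actual pipeline.
import numpy as np
import networkx as nx
from networkx.algorithms import community

def group_instances(cells, radius=0.5):
    """Cluster same-class map cells into object instances.

    cells: list of (position as np.array, class label) tuples.
    Returns a list of instances, each a sorted list of cell indices.
    """
    g = nx.Graph()
    g.add_nodes_from(range(len(cells)))
    for i, (p_i, c_i) in enumerate(cells):
        for j, (p_j, c_j) in enumerate(cells[:i]):
            # Connect cells of the same class that are spatially close.
            if c_i == c_j and np.linalg.norm(p_i - p_j) < radius:
                g.add_edge(i, j)
    # Community detection splits each class into distinct object instances
    # without fixing the number of instances in advance.
    return [sorted(c) for c in community.greedy_modularity_communities(g)]

def best_label(query_vec, label_vecs):
    """Open-set association: return the map label whose text embedding
    is closest (by cosine similarity) to the query embedding."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(label_vecs, key=lambda name: cos(query_vec, label_vecs[name]))
```

Note that, unlike k-means-style clustering, community detection does not require the number of instances per class to be known beforehand, which matters when the agent discovers objects incrementally during mapping.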
