SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments

Semantic reasoning and dynamic planning capabilities are crucial for an autonomous agent performing complex navigation tasks in unknown environments. Succeeding at these tasks requires a large amount of the common-sense knowledge that humans possess. We present SayNav, a new approach that leverages human knowledge from Large Language Models (LLMs) to generalize efficiently to complex navigation tasks in unknown large-scale environments. SayNav uses a novel grounding mechanism that incrementally builds a 3D scene graph of the explored environment and provides it as input to LLMs, which generate feasible and contextually appropriate high-level navigation plans. The LLM-generated plan is then executed by a pre-trained low-level planner that treats each planned step as a short-distance point-goal navigation sub-task. SayNav dynamically generates step-by-step instructions during navigation and continuously refines future steps based on newly perceived information. We evaluate SayNav on a new multi-object navigation task that requires the agent to use a massive amount of human knowledge to efficiently search for multiple different objects in an unknown environment. Under ideal settings on this task, SayNav outperforms an oracle-based point-goal navigation baseline, achieving a success rate of 95.35% (vs. 56.06% for the baseline), highlighting its ability to generate dynamic plans for successfully locating objects in large-scale new environments. In addition, SayNav enables efficient generalization from simulation to real environments.
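The perceive-plan-act loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: all names here (`SceneGraph`, `query_llm_planner`, `point_goal_nav`) are hypothetical stand-ins, and the LLM planner and low-level point-goal policy are stubbed out.

```python
# Hedged sketch of SayNav's plan-act-refine loop. SceneGraph,
# query_llm_planner, and point_goal_nav are illustrative stubs,
# not the paper's actual API.

class SceneGraph:
    """Incrementally built 3D scene graph of the explored environment."""

    def __init__(self):
        self.nodes = []  # e.g. rooms and objects with 3D positions

    def update(self, observation):
        # Fold newly perceived rooms/objects into the graph.
        self.nodes.extend(observation.get("new_nodes", []))

    def to_prompt(self):
        # Serialize the graph into text the LLM can ground its plan in.
        return "; ".join(f"{n['label']}@{n['pos']}" for n in self.nodes)


def query_llm_planner(graph_text, goal):
    # Stub for the LLM call: a real system would prompt an LLM with the
    # serialized scene graph and the goal, and parse its plan. Here we
    # just emit a point-goal step for each graph node matching the goal.
    return [{"goto": entry.split("@")[1]}
            for entry in graph_text.split("; ") if goal in entry]


def point_goal_nav(step):
    # Stub for the pre-trained low-level point-goal navigation policy.
    return {"reached": step["goto"], "new_nodes": []}


def saynav(observations, goal):
    graph = SceneGraph()
    trajectory = []
    for obs in observations:
        graph.update(obs)                                   # perceive
        plan = query_llm_planner(graph.to_prompt(), goal)   # (re)plan
        for step in plan:
            result = point_goal_nav(step)                   # act on sub-goal
            graph.update(result)                            # refine with new info
            trajectory.append(result["reached"])
    return trajectory
```

The key design point the sketch captures is that planning is interleaved with perception: the scene graph grows as the agent explores, and the high-level plan is re-queried against the updated graph rather than fixed up front.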
