One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones

We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons. For long-horizon tasks with extended action sequences, however, an agent can easily overlook parts of the instructions or get stuck partway through a long instruction sequence and ultimately fail the task. To address this challenge, we propose a model-agnostic milestone-based task tracker (M-TRACK) to guide the agent and monitor its progress. Specifically, we propose a milestone builder that tags the instructions with navigation and interaction milestones that the agent must complete step by step, and a milestone checker that systematically checks the agent's progress on its current milestone and determines when to proceed to the next. On the challenging ALFRED dataset, our M-TRACK leads to notable 33% and 52% relative improvements in unseen success rate over two competitive base models.
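As a concrete illustration, the Python sketch below shows how a milestone-based tracker of this kind could be wired around an agent: the builder tags the instruction steps with milestones, and the checker gates the agent's progress so it only advances once the active milestone is confirmed complete. This is a minimal reconstruction from the abstract alone, not the paper's implementation; every name here (Milestone, build_milestones, milestone_done, run_episode, the agent/env interfaces, and the keyword heuristic) is a hypothetical stand-in.

from dataclasses import dataclass
from typing import List

@dataclass
class Milestone:
    kind: str    # "navigation" or "interaction"
    target: str  # the instruction step (or object/location) the milestone refers to

def build_milestones(instruction_steps: List[str]) -> List[Milestone]:
    """Milestone builder: tag each step-by-step instruction with the
    navigation and interaction milestones the agent must complete in order.
    (A real builder would use a learned tagger; keyword matching is a stand-in.)"""
    milestones = []
    for step in instruction_steps:
        is_interaction = any(v in step.lower() for v in ("pick", "put", "open", "slice", "turn on"))
        milestones.append(Milestone(kind="interaction" if is_interaction else "navigation",
                                    target=step))
    return milestones

def milestone_done(milestone: Milestone, observation: dict) -> bool:
    """Milestone checker: decide from the current observation whether the
    active milestone has been achieved (placeholder check for illustration)."""
    return milestone.target in observation.get("completed", set())

def run_episode(agent, env, instruction_steps: List[str], max_actions: int = 1000) -> bool:
    """Tracker loop: feed the agent one milestone at a time and only advance
    when the checker confirms the current milestone is complete."""
    milestones = build_milestones(instruction_steps)
    idx, obs = 0, env.reset()
    for _ in range(max_actions):
        if idx == len(milestones):                 # all milestones done: task complete
            return True
        action = agent.act(obs, milestones[idx])   # condition the policy on the active milestone
        obs = env.step(action)
        if milestone_done(milestones[idx], obs):
            idx += 1                               # proceed to the next milestone
    return False                                   # ran out of actions: task failed

Because the tracker is a wrapper around agent.act rather than part of the policy itself, the same loop could in principle be attached to different base models, which is what "model-agnostic" suggests.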
