Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment as it moves. Recent Transformer-based VLN methods have made great progress by directly connecting visual observations to the language instruction via a multimodal cross-attention mechanism. However, these methods usually represent temporal context as a fixed-length vector, either with an LSTM decoder or with manually designed hidden states that turn the Transformer into a recurrent model. Since a single fixed-length vector is often insufficient to capture long-term temporal context, in this paper we introduce the Multimodal Transformer with Variable-length Memory (MTVM) for visually grounded natural language navigation, which models temporal context explicitly. Specifically, MTVM enables the agent to keep track of its navigation trajectory by directly storing previous activations in a memory bank. To further boost performance, we propose a memory-aware consistency loss that helps learn a better joint representation of temporal context using randomly masked instructions. We evaluate MTVM on the popular R2R and CVDN datasets: our model improves Success Rate on the R2R unseen validation and test sets by 2% each and improves Goal Progress by 1.6 m on the CVDN test set.
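To make the variable-length memory idea concrete, the following is a minimal sketch, not the authors' implementation: at each step the agent appends its previous step activation to a growing memory bank and lets a Transformer attend jointly over the instruction tokens, the stored memory, and the current visual observation; a consistency term encourages agreement between predictions made with the full and with a randomly masked instruction. All module names, tensor sizes, and the exact form of the consistency loss here are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the MTVM reference code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariableLengthMemoryAgent(nn.Module):
    def __init__(self, d_model=768, n_heads=8, n_layers=2, n_actions=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, instr_tokens, memory_bank, view_feats):
        # instr_tokens: (B, L, d) language features
        # memory_bank:  (B, T, d) activations stored from previous steps (T grows over time)
        # view_feats:   (B, V, d) visual features of the current panorama
        fused = self.encoder(torch.cat([instr_tokens, memory_bank, view_feats], dim=1))
        state = fused[:, 0]                    # first token as the step summary
        return self.action_head(state), state  # action logits + activation to store


def memory_aware_consistency_loss(logits_full, logits_masked):
    # Illustrative consistency term: action predictions made with the full instruction
    # and with a randomly masked instruction should agree (symmetric KL).
    p = F.log_softmax(logits_full, dim=-1)
    q = F.log_softmax(logits_masked, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))


if __name__ == "__main__":
    B, L, V, d = 2, 20, 36, 768
    agent = VariableLengthMemoryAgent(d_model=d)
    instr = torch.randn(B, L, d)
    memory = torch.zeros(B, 1, d)              # empty memory at episode start
    for step in range(3):
        view = torch.randn(B, V, d)
        logits, state = agent(instr, memory, view)
        memory = torch.cat([memory, state.unsqueeze(1)], dim=1)  # memory grows each step
```

The key design point the sketch tries to capture is that the memory grows with the trajectory length instead of being compressed into a single recurrent state, so earlier observations remain directly attendable at later steps.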
