Vision Language Navigation with Knowledge-driven Environmental Dreamer

Vision-language navigation (VLN) requires an agent to perceive visual observations of a house scene and navigate step by step following a natural language instruction. Because data annotation and collection are costly, current VLN datasets provide only limited instruction-trajectory samples. Learning vision-language alignment for VLN from such limited data is challenging, since both visual observations and language instructions are complex and diverse. Previous works generate augmented data only from the original scenes and fail to produce samples from unseen scenes, which limits the generalization ability of the navigation agent. In this paper, we introduce the Knowledge-driven Environmental Dreamer (KED), a method that leverages knowledge of the embodied environment to generate unseen scenes for the navigation agent to learn from. Generating unseen environments with texture consistency and structure consistency is challenging; to address this, we incorporate three knowledge-driven regularization objectives into KED and adopt a reweighting mechanism for self-adaptive optimization. KED generates unseen embodied environments without extra annotations. Using KED, we generate 270 houses and 500K instruction-trajectory pairs. A navigation agent trained with KED outperforms state-of-the-art methods on various VLN benchmarks, such as R2R, R4R, and RxR. Both qualitative and quantitative experiments show that the proposed KED method produces high-quality augmentation data with texture consistency and structure consistency.
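The abstract does not specify the form of the self-adaptive reweighting mechanism. A minimal sketch of one common approach, learnable uncertainty-based weighting over multiple losses, is given below; the loss names (`loss_texture`, `loss_structure`, `loss_knowledge`) and the weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: self-adaptive reweighting of three regularization objectives.
# The uncertainty-based scheme and loss names are assumptions for illustration.
import torch
import torch.nn as nn


class AdaptiveReweighting(nn.Module):
    """Combine several losses with learnable, self-adaptive weights."""

    def __init__(self, num_losses: int = 3):
        super().__init__()
        # One learnable log-variance per objective; a lower learned
        # uncertainty yields a larger effective weight for that loss.
        self.log_vars = nn.Parameter(torch.zeros(num_losses))

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            # Precision-weighted loss plus a regularizer that keeps the
            # learned weights from collapsing to zero.
            total = total + precision * loss + self.log_vars[i]
        return total


# Hypothetical usage inside a KED-style training step:
# reweighter = AdaptiveReweighting(num_losses=3)
# total_loss = reweighter([loss_texture, loss_structure, loss_knowledge])
# total_loss.backward()
```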
