Curriculum Learning for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) is a task in which an agent navigates an embodied indoor environment by following human instructions. Previous works ignore the distribution of sample difficulty, and we argue that this potentially degrades their agents' performance. To tackle this issue, we propose a novel curriculum-based training paradigm for VLN tasks that balances human prior knowledge about sample difficulty with the agent's learning progress. We develop principles of curriculum design and re-arrange the benchmark Room-to-Room (R2R) dataset to make it suitable for curriculum training. Experiments show that our method is model-agnostic and can significantly improve the performance, generalizability, and training efficiency of current state-of-the-art navigation agents without increasing model complexity.
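
To make the general idea concrete, below is a minimal sketch of difficulty-ordered training in the spirit of such a curriculum. It is not the paper's actual method: the difficulty heuristic (ground-truth path length plus instruction length), the competence-based pacing schedule (following Platanios et al., 2019), and names such as `train_samples` and `difficulty_score` are illustrative assumptions.

```python
import numpy as np

def difficulty_score(sample):
    """Hypothetical heuristic: longer ground-truth paths and longer
    instructions are assumed to be harder for the agent to follow."""
    return sample["path_length"] + 0.1 * len(sample["instruction"].split())

def competence(t, total_steps, c0=0.1, p=2):
    """Competence-based pacing (Platanios et al., 2019): the fraction of
    the difficulty-sorted dataset exposed at step t, growing from c0 to 1."""
    return min(1.0, (t * (1.0 - c0 ** p) / total_steps + c0 ** p) ** (1.0 / p))

def sample_batch(sorted_data, t, total_steps, batch_size, rng):
    """Draw a batch uniformly from the easiest competence(t)-fraction."""
    cutoff = max(batch_size, int(competence(t, total_steps) * len(sorted_data)))
    idx = rng.choice(cutoff, size=batch_size, replace=False)
    return [sorted_data[i] for i in idx]

# Usage sketch: sort R2R-style samples by difficulty, then train on a
# gradually expanding pool. The samples and the update step are placeholders
# for any VLN agent, reflecting the model-agnostic claim above.
rng = np.random.default_rng(0)
train_samples = [
    {"instruction": "walk past the sofa and stop at the door",
     "path_length": 4},
    {"instruction": "go upstairs, turn left, enter the second bedroom "
                    "and wait next to the bed",
     "path_length": 7},
]
data = sorted(train_samples, key=difficulty_score)
total_steps, batch_size = 1000, 1
for t in range(total_steps):
    batch = sample_batch(data, t, total_steps, batch_size, rng)
    # ... one gradient update of the navigation agent on `batch` ...
```

The design choice illustrated here is the one the abstract argues for: easy samples dominate early training (human prior knowledge about difficulty), while the pacing function widens the pool as training progresses (a proxy for the agent's learning progress), so no model changes are required.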
