Vision-and-Dialog Navigation

Robots navigating in human environments should use language to ask for assistance and be able to understand human responses. To study this challenge, we introduce Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments. The Navigator asks questions of their partner, the Oracle, who has privileged access to the best next steps the Navigator should take according to a shortest-path planner. To train agents that search an environment for a goal location, we define the Navigation from Dialog History task: an agent, given a target object and a dialog history between humans cooperating to find that object, must infer navigation actions towards the goal in unexplored environments. We establish an initial multi-modal sequence-to-sequence model and demonstrate that looking farther back in the dialog history improves performance. Source code and a live interface demo can be found at https://cvdn.dev/.
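To make the task setup concrete, here is a minimal sketch, not the authors' released implementation, of a multi-modal sequence-to-sequence agent for Navigation from Dialog History: an LSTM encodes the tokenized dialog history (the target object hint plus question-answer exchanges), and its final state initializes an LSTM decoder that predicts one navigation action per step from the current visual feature and the previous action. All layer sizes, dimensions, and names (`DialogHistorySeq2Seq`, `img_feat=2048` for ResNet-style features, `n_actions=6`) are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class DialogHistorySeq2Seq(nn.Module):
    """Hypothetical seq2seq agent: dialog-history encoder, action decoder."""

    def __init__(self, vocab_size, hidden=512, img_feat=2048, n_actions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        # Decoder input at each step: previous-action embedding concatenated
        # with a projection of the current visual feature.
        self.act_embed = nn.Embedding(n_actions, hidden)
        self.img_proj = nn.Linear(img_feat, hidden)
        self.decoder = nn.LSTMCell(2 * hidden, hidden)
        self.policy = nn.Linear(hidden, n_actions)

    def forward(self, dialog_tokens, img_feats, prev_actions):
        # dialog_tokens: (B, T_text) token ids for the concatenated dialog
        #   history fed to the agent (more or fewer rounds, per experiment).
        # img_feats:    (B, T_nav, img_feat) one visual feature per nav step.
        # prev_actions: (B, T_nav) action taken before each step.
        _, (h, c) = self.encoder(self.embed(dialog_tokens))
        h, c = h.squeeze(0), c.squeeze(0)  # init decoder with dialog context
        logits = []
        for t in range(img_feats.size(1)):
            x = torch.cat([self.act_embed(prev_actions[:, t]),
                           self.img_proj(img_feats[:, t])], dim=-1)
            h, c = self.decoder(x, (h, c))
            logits.append(self.policy(h))
        return torch.stack(logits, dim=1)  # (B, T_nav, n_actions)
```

Under this formulation, the paper's central experimental knob reduces to a single choice: how many question-answer rounds of the dialog are concatenated into `dialog_tokens` before encoding, with longer histories reported to improve navigation performance.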
