Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory

The role of robots in society keeps expanding, bringing with it the necessity of interacting and communicating with humans. To keep such interaction intuitive, we provide automatic wayfinding based on verbal navigational instructions. Our first contribution is the creation of a large-scale dataset with verbal navigation instructions. To this end, we have developed an interactive visual navigation environment based on Google Street View; we further design an annotation method that highlights mined anchor landmarks and the local directions between them, in order to help annotators formulate typical, human references to those. The annotation task was crowdsourced on the AMT platform to construct a new Talk2Nav dataset with 10,714 routes. Our second contribution is a new learning method. Inspired by spatial cognition research on the mental conceptualization of navigational instructions, we introduce a soft dual attention mechanism defined over the segmented language instructions to jointly extract two partial instructions: one for matching the next upcoming visual landmark and the other for matching the local directions to the next landmark. Along similar lines, we also introduce a spatial memory scheme to encode the local directional transitions. Our work takes advantage of advances in two lines of research: mental formalization of verbal navigational instructions and training neural network agents for automatic wayfinding. Extensive experiments show that our method significantly outperforms previous navigation methods. For the demo video, dataset, and code, please refer to our project page.
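To make the idea of soft dual attention over segmented instructions concrete, the following is a minimal sketch and not the authors' architecture. It assumes each instruction segment is already embedded as a small vector and that two hypothetical query vectors (`landmark_query` and `direction_query`) stand in for the agent's learned state; the dot-product scoring, the toy embeddings, and all function names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_attend(segments, query):
    """Return (weights, weighted sum) of segment embeddings for one query."""
    weights = softmax([dot(seg, query) for seg in segments])
    dim = len(segments[0])
    summary = [
        sum(w * seg[i] for w, seg in zip(weights, segments))
        for i in range(dim)
    ]
    return weights, summary


def dual_attention(segments, landmark_query, direction_query):
    """Attend twice over the same segments, once per partial instruction."""
    landmark = soft_attend(segments, landmark_query)
    direction = soft_attend(segments, direction_query)
    return landmark, direction


# Toy 2-D embeddings (assumed): axis 0 ~ "direction-like", axis 1 ~ "landmark-like".
segments = [
    [1.0, 0.0],  # e.g. "go straight for two blocks"
    [0.0, 1.0],  # e.g. "until the red brick church"
    [1.0, 0.0],  # e.g. "then turn left"
]
(lm_w, lm_vec), (dir_w, dir_vec) = dual_attention(
    segments, landmark_query=[0.0, 4.0], direction_query=[4.0, 0.0]
)
print("landmark weights:", [round(w, 3) for w in lm_w])
print("direction weights:", [round(w, 3) for w in dir_w])
```

In this toy setup the landmark query concentrates its weight on the second segment, while the direction query splits its weight between the first and third. Both attention distributions are soft, so every segment contributes to each partial instruction to some degree.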
