Guided Feature Transformation (GFT): A Neural Language Grounding Module for Embodied Agents

Recently there has been rising interest in training agents, embodied in virtual environments, to perform language-directed tasks with deep reinforcement learning. In this paper, we propose a simple but effective neural language grounding module for embodied agents that can be trained end to end from scratch, taking raw pixels, unstructured linguistic commands, and sparse rewards as input. We model language grounding as a language-guided transformation of visual features, in which latent sentence embeddings serve as the transformation matrices. On several language-directed navigation tasks that feature challenging partial observability and require simple reasoning, our module significantly outperforms the state of the art. We also release XWorld3D, an easy-to-customize 3D environment that can potentially be modified to evaluate a variety of embodied agents.
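To make the core idea concrete, here is a minimal PyTorch sketch of a language-guided feature transformation. All names (`GuidedFeatureTransform`, `to_matrix`) and all dimensions are illustrative assumptions rather than the paper's exact architecture: a recurrent encoder produces a latent sentence embedding, which is projected into a matrix that linearly transforms the visual feature vector at every spatial location.

```python
import torch
import torch.nn as nn

class GuidedFeatureTransform(nn.Module):
    """Sketch of a language-guided visual feature transformation.

    A GRU encodes the command into a latent sentence embedding, which
    is projected into a (C x C) matrix used to transform each spatial
    visual feature vector. Sizes are hypothetical placeholders.
    """

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, channels=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Project the latent sentence embedding into a C x C matrix.
        self.to_matrix = nn.Linear(hidden_dim, channels * channels)

    def forward(self, visual_feats, token_ids):
        # visual_feats: (B, C, H, W) CNN features from raw pixels
        # token_ids:    (B, T) word indices of the linguistic command
        B, C, H, W = visual_feats.shape
        _, h_n = self.gru(self.embed(token_ids))          # h_n: (1, B, hidden)
        M = self.to_matrix(h_n.squeeze(0)).view(B, C, C)  # (B, C, C)
        feats = visual_feats.flatten(2)                   # (B, C, H*W)
        guided = torch.bmm(M, feats)                      # transform every location
        return guided.view(B, C, H, W)
```

In a full agent, a transformation of this kind could be applied over several guided steps and trained jointly with the policy on sparse rewards; this sketch shows only the single-step grounding operation.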
