Grounded Language Learning in a Simulated 3D World

We are increasingly surrounded by artificially intelligent technology that takes decisions and executes actions on our behalf. This creates a pressing need for general means to communicate with, instruct and guide artificial agents, with human language the most compelling medium for such communication. To achieve this in a scalable fashion, agents must be able to relate language to the world and to actions; that is, their understanding of language must be grounded and embodied. However, learning grounded language is a notoriously challenging problem in artificial intelligence research. Here we present an agent that learns to interpret language in a simulated 3D environment where it is rewarded for the successful execution of written instructions. Trained via a combination of reinforcement and unsupervised learning, and beginning with minimal prior knowledge, the agent learns to relate linguistic symbols to emergent perceptual representations of its physical surroundings and to pertinent sequences of actions. The agent's comprehension of language extends beyond its prior experience, enabling it to apply familiar language to unfamiliar situations and to interpret entirely novel instructions. Moreover, the speed with which this agent learns new words increases as its semantic knowledge grows. This facility for generalising and bootstrapping semantic knowledge indicates the potential of the present approach for reconciling ambiguous natural language with the complexity of the physical world.
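The abstract describes an agent that fuses a visual observation with a written instruction to select actions. The following is a minimal sketch of that idea, not the paper's actual architecture (which uses convolutional vision, an LSTM language encoder and an actor-critic trained with auxiliary unsupervised objectives); all layer sizes, the additive fusion, and the random weights here are hypothetical, intended only to show how the two modalities can feed a shared policy head.

```python
import numpy as np

rng = np.random.default_rng(0)

class GroundedAgent:
    """Toy multimodal agent: a visual encoder and a language encoder
    feed one shared policy head (all dimensions are illustrative)."""

    def __init__(self, vocab_size=20, embed_dim=8, vis_dim=16,
                 hidden=32, n_actions=4):
        self.embed = rng.normal(0, 0.1, (vocab_size, embed_dim))  # word embeddings
        self.w_vis = rng.normal(0, 0.1, (vis_dim, hidden))        # visual encoder
        self.w_lang = rng.normal(0, 0.1, (embed_dim, hidden))     # language encoder
        self.w_pi = rng.normal(0, 0.1, (hidden, n_actions))       # policy head

    def act(self, visual_obs, instruction_tokens):
        # Encode each modality separately, then fuse additively.
        v = np.tanh(visual_obs @ self.w_vis)
        l = np.tanh(self.embed[instruction_tokens].mean(axis=0) @ self.w_lang)
        h = v + l
        # Softmax over action logits gives a stochastic policy,
        # which reinforcement learning would then shape via reward.
        logits = h @ self.w_pi
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

agent = GroundedAgent()
# A flattened visual frame plus a tokenised instruction, e.g. "pick red object".
probs = agent.act(rng.normal(size=16), np.array([3, 7, 1]))
```

In the paper's framing, reward for correctly executing the instruction is the only supervised signal tying the word embeddings to the visual representations; this sketch omits training entirely and shows only the forward pass.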
