Understanding Early Word Learning in Situated Artificial Agents

Neural network-based systems can now learn to locate the referents of words and phrases in images, answer questions about visual scenes, and execute symbolic instructions as first-person actors in partially-observable worlds. To achieve this so-called grounded language learning, models must overcome challenges that infants face when learning their first words. While it is notable that models with no meaningful prior knowledge overcome these obstacles, researchers currently lack a clear understanding of how they do so, a problem that we attempt to address in this paper. For maximum control and generality, we focus on a simple neural network-based language learning agent, trained via policy-gradient methods, which can interpret single-word instructions in a simulated 3D world. Whilst the goal is not to explicitly model infant word learning, we take inspiration from experimental paradigms in developmental psychology and apply some of these to the artificial agent, exploring the conditions under which established human biases and learning effects emerge. We further propose a novel method for visualising semantic representations in the agent.
