Understanding Grounded Language Learning Agents

Neural network-based systems can now learn to locate the referents of words and phrases in images, answer questions about visual scenes, and even execute symbolic instructions as first-person actors in partially observable worlds. To achieve this so-called grounded language learning, models must overcome certain well-studied learning challenges that are also fundamental to infants learning their first words. While it is notable that models with no meaningful prior knowledge overcome these learning obstacles, AI researchers and practitioners currently lack a clear understanding of exactly how they do so. Here we address this question as a way of achieving a clearer general understanding of grounded language learning, both to inform future research and to improve confidence in model predictions. For maximum control and generality, we focus on a simple neural network-based language learning agent trained via policy-gradient methods to interpret synthetic linguistic instructions in a simulated 3D world. We apply experimental paradigms from developmental psychology to this agent, exploring the conditions under which established human biases and learning effects emerge. We further propose a novel way to visualise and analyse semantic representation in grounded language learning agents that yields a plausible computational account of the observed effects.
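
The abstract describes the agent only at a high level, but the setup it names can be made concrete. The sketch below is a minimal, assumption-laden illustration in PyTorch, not the paper's actual architecture: a small CNN encodes each first-person frame, an embedding layer encodes the instruction tokens (a mean over embeddings stands in for whatever instruction encoder the paper uses), and an LSTM fuses the two into a recurrent state from which actor-critic heads are computed, as required by policy-gradient training such as A3C. The layer sizes, the 84x84 frame resolution, and the 8-action space are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedAgent(nn.Module):
    """Hypothetical sketch: fuses a first-person visual frame with an
    instruction and outputs an action policy plus a value estimate."""

    def __init__(self, vocab_size=100, n_actions=8, hidden=128):
        super().__init__()
        # Vision: a small CNN over 3x84x84 RGB frames (sizes assumed).
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.vis_fc = nn.Linear(32 * 9 * 9, hidden)
        # Language: embed instruction tokens; a mean over embeddings
        # stands in for the paper's (unspecified here) encoder.
        self.embed = nn.Embedding(vocab_size, hidden)
        # Memory: an LSTM over the fused visual-linguistic input.
        self.lstm = nn.LSTMCell(2 * hidden, hidden)
        # Actor-critic heads, as used by policy-gradient methods like A3C.
        self.policy = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)

    def forward(self, frame, tokens, state):
        v = F.relu(self.vis_fc(self.conv(frame).flatten(1)))
        l = self.embed(tokens).mean(dim=1)            # (batch, hidden)
        h, c = self.lstm(torch.cat([v, l], dim=1), state)
        return F.softmax(self.policy(h), dim=-1), self.value(h), (h, c)
```

A single policy-gradient (REINFORCE-style) update for one step, with a placeholder return standing in for the environment's reward signal, might then look like this:

```python
agent = GroundedAgent()
frame = torch.randn(1, 3, 84, 84)                # one visual observation
tokens = torch.randint(0, 100, (1, 4))           # e.g. "pick the red ball"
state = (torch.zeros(1, 128), torch.zeros(1, 128))
probs, value, state = agent(frame, tokens, state)
action = torch.multinomial(probs, 1).item()      # sample an action
ret = 1.0                                        # placeholder return
advantage = ret - value.item()                   # baseline detached from critic
loss = -torch.log(probs[0, action]) * advantage + (ret - value.squeeze()) ** 2
loss.backward()                                  # gradients for an optimiser step
```

The abstract likewise does not specify the proposed method for visualising semantic representations. As a generic, hypothetical stand-in (not the paper's technique), hidden states recorded while the agent follows different instructions can be projected to two dimensions with t-SNE to look for semantic clustering:

```python
import numpy as np
from sklearn.manifold import TSNE

states = np.random.randn(200, 128)  # stand-in for logged LSTM hidden states
coords = TSNE(n_components=2, perplexity=30).fit_transform(states)
# If the agent has learned grounded meanings, states recorded under
# semantically related words (e.g. colour terms) should cluster together.
```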
