Grounded Language Learning Fast and Slow

Recent work has shown that large text-based neural language models, trained with conventional supervised learning objectives, acquire a surprising propensity for few- and one-shot learning. Here, we show that an embodied agent situated in a simulated 3D world, and endowed with a novel dual-coding external memory, can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms. After a single introduction to a novel object via continuous visual perception and a language prompt ("This is a dax"), the agent can re-identify the object and manipulate it as instructed ("Put the dax on the bed"). In doing so, it seamlessly integrates short-term, within-episode knowledge of the appropriate referent for the word "dax" with long-term lexical and motor knowledge acquired across episodes (i.e. "bed" and "putting"). We find that, under certain training conditions and with a particular memory writing mechanism, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category, and is effective in settings with unfamiliar numbers of objects. We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for later executing instructions. Together, the results demonstrate that deep neural networks can exploit meta-learning, episodic memory and an explicitly multi-modal environment to account for 'fast-mapping', a fundamental pillar of human cognitive development and a potentially transformative capacity for agents that interact with human users.
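To make the dual-coding idea concrete, below is a minimal sketch (not the authors' implementation) of an external memory that stores a paired visual embedding and language embedding per slot and retrieves across modalities, so that a query for the word "dax" can recover the visual code bound to it earlier in the episode. The class and method names (`DualCodingMemory`, `write`, `read`) and the softmax dot-product readout are illustrative assumptions, standing in for whatever writing and attention mechanism the agent actually uses.

```python
# Hedged sketch of a dual-coding episodic memory: each slot holds a
# (visual, language) embedding pair written at the same timestep, and a
# read keyed by one modality returns a soft blend of the other modality.
import numpy as np


class DualCodingMemory:
    def __init__(self, num_slots: int, dim: int):
        self.visual = np.zeros((num_slots, dim))    # image-derived codes
        self.language = np.zeros((num_slots, dim))  # word-derived codes
        self.num_slots = num_slots
        self.ptr = 0                                # next write position

    def write(self, visual_emb: np.ndarray, language_emb: np.ndarray) -> None:
        """Bind a visual and a language embedding in the same slot (ring buffer)."""
        slot = self.ptr % self.num_slots
        self.visual[slot] = visual_emb
        self.language[slot] = language_emb
        self.ptr += 1

    def read(self, query: np.ndarray, key_modality: str = "language") -> np.ndarray:
        """Attend over one modality's stored codes and return the paired
        codes from the other modality, weighted by softmax similarity."""
        keys, values = (
            (self.language, self.visual)
            if key_modality == "language"
            else (self.visual, self.language)
        )
        scores = keys @ query                        # dot-product similarity per slot
        weights = np.exp(scores - scores.max())      # numerically stable softmax
        weights /= weights.sum() + 1e-8
        return weights @ values                      # cross-modal readout


# Usage: after one exposure ("This is a dax"), query with the word embedding
# for "dax" to retrieve the visual code needed to re-identify the object.
memory = DualCodingMemory(num_slots=64, dim=16)
dax_visual, dax_word = np.random.randn(16), np.random.randn(16)
memory.write(dax_visual, dax_word)
retrieved_visual = memory.read(dax_word, key_modality="language")
```

In this toy setting a single write suffices for later retrieval, which is the sense in which such a memory supports one-shot word-object binding; slower, cross-episode knowledge (e.g. what "bed" and "putting" mean) would live in the network weights rather than in these slots.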
