Intra-agent speech permits zero-shot task acquisition

Human language learners are exposed to a trickle of informative, context-sensitive language, but a flood of raw sensory data. Through both social language use and internal processes of rehearsal and practice, learners build high-level, semantic representations that explain their perceptions. Here, we take inspiration from such processes of “inner speech” in humans (Vygotsky, 1934) to better understand the role of intra-agent speech in embodied behaviour. First, we formally pose intra-agent speech as a semi-supervised problem and develop two algorithms that enable visually grounded captioning with little labeled language data. We then experimentally compute scaling curves over different amounts of labeled data and compare the data efficiency against a supervised learning baseline. Finally, we incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world, and show that with as few as 150 additional image captions, intra-agent speech endows the agent with the ability to manipulate and answer questions about a new object without any related task-directed experience (zero-shot). Taken together, our experiments suggest that modelling intra-agent speech is effective in enabling embodied agents to learn new tasks efficiently and without direct interaction experience.
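To make the semi-supervised framing concrete, the sketch below shows one generic way such a problem can be set up: a captioner trained with supervised cross-entropy on a small labeled set, plus a self-training term on unlabeled images using its own confident pseudo-captions. This is an illustrative toy with a linear stand-in for the captioner, not the paper's two actual algorithms; all names, the confidence threshold, and the mixing weight `alpha` are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5  # toy vocabulary size

def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy of integer targets under softmax(logits)."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids] + 1e-12))

def caption_logits(image_feat, params):
    """Stand-in captioner: a linear map from image features to per-token logits."""
    return image_feat @ params  # shape (seq_len, VOCAB)

def semi_supervised_loss(params, labeled, unlabeled, alpha=0.5, threshold=0.5):
    """Supervised CE on the few labeled image-caption pairs, plus CE against
    confident pseudo-captions (self-training) on the many unlabeled images."""
    sup = np.mean([cross_entropy(caption_logits(x, params), y) for x, y in labeled])
    unsup_terms = []
    for x in unlabeled:
        logits = caption_logits(x, params)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        if probs.max(-1).mean() > threshold:      # keep only confident captions
            unsup_terms.append(cross_entropy(logits, probs.argmax(-1)))
    unsup = np.mean(unsup_terms) if unsup_terms else 0.0
    return sup + alpha * unsup

# Toy data: 4-token "captions" over 8-d image features,
# 3 labeled pairs versus 10 unlabeled images.
params = rng.normal(size=(8, VOCAB))
labeled = [(rng.normal(size=(4, 8)), rng.integers(0, VOCAB, size=4)) for _ in range(3)]
unlabeled = [rng.normal(size=(4, 8)) for _ in range(10)]

loss = semi_supervised_loss(params, labeled, unlabeled)
print(float(loss))
```

In a full system the linear captioner would be replaced by a vision-language model, and the pseudo-labeling step is only one of several possible unsupervised terms (contrastive or generative objectives are common alternatives); the scaling-curve experiments described above vary the size of `labeled` while holding `unlabeled` fixed.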
