Language Model-Based Paired Variational Autoencoders for Robotic Language Learning

Human infants learn language while interacting with their environment, in which their caregivers may describe the objects and actions they perform. Artificial agents can likewise learn language through interaction with their environment. In this work, we first present a neural model that bidirectionally binds robot actions and their language descriptions in a simple object manipulation scenario. Building on our previous Paired Variational Autoencoders (PVAE) model, we demonstrate the superiority of the variational autoencoder over the standard autoencoder by experimenting with cubes of different colours and by enabling the production of alternative vocabularies. Additional experiments show that the model's channel-separated visual feature extraction module can cope with objects of different shapes. We then introduce PVAE-BERT, which equips the model with a pretrained large-scale language model, namely Bidirectional Encoder Representations from Transformers (BERT). This extension enables the model to comprehend more than the predefined descriptions on which the network was trained: the recognition of action descriptions generalises to unconstrained natural language, as the model becomes capable of understanding numerous variations of the same description. Our experiments suggest that using a pretrained language model as the language encoder allows our approach to scale up to real-world scenarios with instructions from human users.
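To make the architecture concrete, below is a minimal PyTorch sketch of the core idea: an action variational autoencoder and a frozen pretrained BERT language encoder whose latent codes are aligned by a binding loss, so that a description and its corresponding action map to nearby points in latent space. All module names (ActionVAE, LanguageEncoder), dimensions, and loss weights are illustrative assumptions, not the authors' exact PVAE-BERT implementation.

```python
# Sketch of a paired-VAE-style model with a pretrained BERT language encoder.
# Assumes torch and transformers are installed; hyperparameters are invented.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

LATENT = 32  # assumed latent dimensionality

class ActionVAE(nn.Module):
    """Variational autoencoder over (flattened) joint-angle trajectories."""
    def __init__(self, action_dim: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(action_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, LATENT)
        self.logvar = nn.Linear(128, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick (Kingma & Welling, 2013)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

class LanguageEncoder(nn.Module):
    """Frozen pretrained BERT plus a trainable projection into the shared latent space."""
    def __init__(self):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False  # keep the pretrained LM fixed
        self.proj = nn.Linear(self.bert.config.hidden_size, LATENT)

    def forward(self, sentences):
        toks = self.tokenizer(sentences, return_tensors="pt", padding=True)
        out = self.bert(**toks).last_hidden_state[:, 0]  # [CLS] embedding
        return self.proj(out)

def pvae_style_loss(action_vae, lang_enc, actions, sentences,
                    beta=1e-3, gamma=1.0):
    """Reconstruction + KL divergence + a binding term aligning the two modalities."""
    recon, mu, logvar = action_vae(actions)
    rec_loss = F.mse_loss(recon, actions)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    binding = F.mse_loss(lang_enc(sentences), mu)  # pull latents together
    return rec_loss + beta * kld + gamma * binding
```

Keeping BERT frozen, as sketched here, is what lets unconstrained phrasings of the same instruction ("push the red cube", "shove the red block") land near the same latent code without retraining the language model on robot data.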
