Embodied Language Understanding with a Multiple Timescale Recurrent Neural Network

How the human brain understands natural language, and what we can learn from it for intelligent systems, remains an open research question. Recently, researchers have claimed that language is embodied in most, if not all, sensory and sensorimotor modalities, and that the brain's architecture favours the emergence of language. In this paper we investigate the characteristics of such an architecture and propose a model based on the Multiple Timescale Recurrent Neural Network (MTRNN), extended with embodied visual perception. We show that this architecture can learn the meaning of utterances with respect to visual perception, and that it can produce verbal utterances that correctly describe previously unseen scenes.
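The defining feature of an MTRNN is that its hidden units are partitioned into groups with different time constants: fast units track rapid input changes while slow units integrate over longer horizons, which is what lets a hierarchy of timescales emerge. The following is a minimal forward-pass sketch of this leaky-integration dynamic, not the paper's actual model; all layer sizes, time constants, and weight scales are illustrative assumptions.

```python
import numpy as np

class MTRNN:
    """Minimal multiple-timescale RNN sketch (forward pass only).

    Hidden units are split into a fast group and a slow group, each
    with its own time constant tau. A larger tau means the unit's
    internal state changes less per step (slower dynamics).
    Sizes and weight scales below are illustrative assumptions.
    """

    def __init__(self, n_in=10, n_fast=30, n_slow=10,
                 tau_fast=2.0, tau_slow=70.0, seed=0):
        rng = np.random.default_rng(seed)
        self.n = n_fast + n_slow
        # Per-unit time constants: fast units first, then slow units.
        self.tau = np.concatenate([np.full(n_fast, tau_fast),
                                   np.full(n_slow, tau_slow)])
        self.W_in = rng.normal(0.0, 0.1, (self.n, n_in))
        self.W_rec = rng.normal(0.0, 0.1, (self.n, self.n))
        self.u = np.zeros(self.n)  # internal (membrane) state

    def step(self, x):
        # Leaky integration: each unit moves toward its new input-driven
        # state at a rate 1/tau, so slow units lag behind fast ones.
        pre = self.W_rec @ np.tanh(self.u) + self.W_in @ x
        self.u = (1.0 - 1.0 / self.tau) * self.u + (1.0 / self.tau) * pre
        return np.tanh(self.u)

net = MTRNN()
ys = np.stack([net.step(np.ones(10)) for _ in range(50)])
# Fast units (first 30 columns) change much more per step than
# slow units (last 10 columns), illustrating the timescale split.
fast_delta = np.abs(np.diff(ys[:, :30], axis=0)).max()
slow_delta = np.abs(np.diff(ys[:, 30:], axis=0)).max()
```

In the full architecture, such a network would be trained (e.g. by backpropagation through time) so that the slow context units come to encode sentence- or scene-level meaning while the fast units handle word- or feature-level detail.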
