A probabilistic approach to learning a visually grounded language model through human-robot interaction

Language is among the most fascinating and complex cognitive abilities, and it develops rapidly from the earliest months of an infant's life. The aim of the present work is to provide a humanoid robot with the cognitive, perceptual, and motor skills fundamental to the acquisition of a rudimentary form of language. We present a novel probabilistic model, inspired by findings in the cognitive sciences, that associates spoken words with their perceptually grounded meanings. The main focus is on acquiring the meaning of various perceptual categories (e.g., red, blue, circle, above), rather than specific world entities (e.g., an apple, a toy). Our probabilistic model is based on a variant of the multiple-instance learning technique and enables a robotic platform to learn grounded meanings of adjective/noun terms. The system can be used to understand and generate appropriate natural language descriptions of real objects in a scene, and it has been successfully tested on the NAO humanoid robotic platform.
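
Since the abstract does not spell out the model, the Python sketch below illustrates one plausible multiple-instance style of word grounding under stated assumptions: each spoken description is paired with a "bag" of per-object feature vectors from the scene, and each word maintains a running Gaussian over the percepts it co-occurs with. The class names, feature layout, and credit-assignment heuristic are all illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (NOT the paper's implementation) of multiple-instance
# word grounding: a description like {"red", "circle"} arrives with a bag
# of feature vectors, one per object in view, and each word is credited
# with the instance its current model finds most likely.
import numpy as np

class WordModel:
    """Incremental diagonal-Gaussian model of a word's perceptual meaning."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.ones(dim)  # sum of squared deviations, seeded at 1 as a weak smoothing prior

    def update(self, x):
        # Welford's online update of mean and (smoothed) variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def log_likelihood(self, x):
        var = self.m2 / max(self.n, 1)
        return -0.5 * np.sum((x - self.mean) ** 2 / var + np.log(2 * np.pi * var))

def train(examples, dim=4):
    """examples: list of (words, bag) pairs; bag is a list of feature vectors."""
    models = {}
    for words, bag in examples:
        for w in words:
            model = models.setdefault(w, WordModel(dim))
            if model.n == 0:
                # No model yet: spread credit uniformly over the bag.
                for x in bag:
                    model.update(np.asarray(x, float))
            else:
                # Multiple-instance heuristic: credit only the instance
                # the current word model considers most likely.
                best = max(bag, key=lambda x: model.log_likelihood(np.asarray(x, float)))
                model.update(np.asarray(best, float))
    return models

def describe(models, x, k=2):
    """Return the k words whose models best explain feature vector x."""
    ranked = sorted(models, key=lambda w: models[w].log_likelihood(np.asarray(x, float)),
                    reverse=True)
    return ranked[:k]

# Toy feature layout (an assumption): [hue, saturation, roundness, elongation]
examples = [
    ({"red", "circle"},  [[0.95, 0.9, 0.9, 0.1], [0.1, 0.8, 0.2, 0.9]]),
    ({"red", "square"},  [[0.93, 0.85, 0.1, 0.1], [0.5, 0.2, 0.9, 0.1]]),
    ({"blue", "circle"}, [[0.1, 0.9, 0.95, 0.05]]),
]
models = train(examples)
print(describe(models, [0.94, 0.88, 0.92, 0.08]))  # 'red' and 'circle' should rank highest
```

In generation mode, the same per-word likelihoods can be ranked to pick descriptive terms for an object, as in `describe` above; in understanding mode, they can score which object in the scene best matches a heard phrase.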
