Teaching a humanoid robot: Headset-free speech interaction for audio-visual association learning

Inspired by infant development, we present a system that learns associations between acoustic labels and visual representations in interaction with a tutor. The system is integrated with a humanoid robot. Except for a few trigger phrases that start learning, all acoustic representations are learned online and in interaction. Similarly, in the visual domain the clusters are not predefined but are fully learned online. In contrast to other interactive systems, the interaction with the acoustic environment relies solely on the two microphones mounted on the robot's head. In this paper we give an overview of all key elements of the system and focus on the challenges arising from headset-free learning of speech labels. In particular, we present a mechanism for auditory attention that integrates bottom-up and top-down information to segment the acoustic stream. The performance of the system is evaluated through offline tests of individual components and an analysis of the online behavior.
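To make the idea of combining bottom-up and top-down cues for segmenting an acoustic stream concrete, the following is a minimal, hypothetical Python sketch. It is not the mechanism described in the paper: the function names (`frame_energy`, `segment_stream`) and the use of a simple short-time energy cue and a per-frame top-down gain (e.g. raised after a trigger phrase, when tutor speech is expected) are illustrative assumptions only.

```python
import numpy as np

def frame_energy(signal, frame_len=400, hop=160):
    """Short-time log energy per frame (a simple bottom-up salience cue)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energy[i] = np.log(np.sum(frame ** 2) + 1e-10)
    return energy

def segment_stream(signal, top_down_gain, frame_len=400, hop=160,
                   threshold=2.0, min_frames=10):
    """Weight the bottom-up energy cue by a per-frame top-down gain
    (hypothetical: higher when tutor speech is expected) and return
    (start_frame, end_frame) tuples of attended segments."""
    energy = frame_energy(signal, frame_len, hop)
    noise_floor = np.median(energy)                 # crude noise-floor estimate
    n = min(len(energy), len(top_down_gain))
    salience = (energy[:n] - noise_floor) * top_down_gain[:n]
    active = salience > threshold

    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                               # segment onset
        elif not on and start is not None:
            if i - start >= min_frames:             # drop very short bursts
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_frames:
        segments.append((start, len(active)))
    return segments
```

In this toy setup, `top_down_gain` could be an all-ones array that is boosted for a window after a recognized trigger phrase, so that only utterances produced in the expected interaction context pass the salience threshold; the actual system would combine richer spectro-temporal features and noise estimation than this sketch.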
