Towards speech acquisition in natural interaction on ASIMO (Special Issue: "Robot Audition")

The standard approach to teaching robots to communicate via speech is to provide the structure, statistics, and semantics of speech through a supervised, offline learning process. This process imposes constraints such as a high degree of specialization to certain predefined tasks. The resulting system is very rigid and lacks the ability to acquire new skills (e.g. new words and their semantics). In contrast, children acquire language by observing adults' speech and, more importantly, by interacting with them. As a result, their speech capabilities are very flexible and can adapt to new situations. Our research target is therefore to build a system that learns to acquire speech in interaction with humans. The interaction aspect requires a hardware platform that can engage in natural communication with humans in real-world environments. For this purpose we employ our humanoid robot ASIMO (see Fig. 1). To provide the robot with human-like speech communication abilities, we are working on several aspects of sound processing, scene representation, and learning, which are outlined in more detail in the following sections.
