Design and Implementation of Robot Audition System 'HARK' — Open Source Software for Listening to Three Simultaneous Speakers

This paper presents the design and implementation of HARK, a robot audition software system that works on any robot with any microphone configuration. HARK consists of modules for sound source localization, sound source separation and automatic speech recognition of the separated speech signals. Since a robot with ears may be deployed in various auditory environments, a robot audition system should provide an easy way to adapt to them. HARK provides a set of modules for coping with various auditory environments, built on the open-source middleware FlowDesigner, and reduces the overhead of data transfer between modules. HARK has been open source since April 2008. The resulting implementation of HARK, with MUSIC-based sound source localization, GSS-based sound source separation and Missing Feature Theory-based automatic speech recognition, runs on Honda ASIMO, SIG2 and Robovie R2 and recognizes three simultaneous utterances from three speakers with a delay of 1.9 s at a word correct rate of 80–90%.
