Utilizing visual cues in robot audition for sound source discrimination in speech-based human-robot communication

Humans can easily tell by listening whether an observed acoustic signal is direct speech, reflected speech, or noise; acoustic cues alone suffice for this discrimination. For machines, however, the task is far from straightforward. A robot equipped with a conventional robot audition system will in most cases fail to distinguish direct speech from other sound sources, because acoustic information alone is insufficient for reliable discrimination. Robot audition is an important component of speech-based human-robot communication: it enables the robot to associate an incoming speech signal with the correct user. In challenging environments, this task becomes difficult because of reflections of the direct speech signal and background noise sources. To counter this problem, a robot needs a minimum amount of prior information to discriminate the valid speech signal (direct speech) from contaminants (i.e., speech reflections and background noise sources). Failure to do so leads to false speech-to-speaker association in robot audition and gravely degrades the human-robot communication experience. In this paper, we propose using visual cues to augment traditional robot audition, which relies solely on acoustic information. The proposed method significantly improves the accuracy of speech-to-speaker association and machine-understanding performance in real environments. Experimental results show that the expanded system robustly discriminates direct speech from speech reflections and background noise sources.
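To make the fusion idea concrete, the sketch below illustrates one plausible decision rule of the kind the abstract describes: an acoustic source whose direction of arrival agrees with a visually detected face is labeled direct speech, a speech-like source with no matching face is treated as a likely reflection, and the rest is noise. This is a minimal illustration, not the paper's actual algorithm; the names (`AcousticSource`, `FaceDetection`, `classify_source`) and the angular threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AcousticSource:
    azimuth_deg: float      # direction of arrival from sound source localization
    is_speech_like: bool    # e.g., output of a voice-activity or speech/non-speech classifier

@dataclass
class FaceDetection:
    azimuth_deg: float      # bearing of a detected face in the robot's frame

def classify_source(source: AcousticSource,
                    faces: List[FaceDetection],
                    match_threshold_deg: float = 10.0) -> str:
    """Label an acoustic source as 'direct speech', 'reflection', or 'noise'.

    Illustrative rule only: a speech-like source whose bearing matches a
    detected face is taken as direct speech; a speech-like source with no
    matching face is treated as a reflection; everything else is noise.
    """
    matches_face = any(
        abs(source.azimuth_deg - face.azimuth_deg) <= match_threshold_deg
        for face in faces
    )
    if source.is_speech_like and matches_face:
        return "direct speech"
    if source.is_speech_like:
        return "reflection"
    return "noise"

if __name__ == "__main__":
    faces = [FaceDetection(azimuth_deg=30.0)]
    sources = [
        AcousticSource(azimuth_deg=28.5, is_speech_like=True),    # user speaking
        AcousticSource(azimuth_deg=-60.0, is_speech_like=True),   # wall reflection
        AcousticSource(azimuth_deg=120.0, is_speech_like=False),  # background noise
    ]
    for s in sources:
        print(f"{s.azimuth_deg:7.1f} deg -> {classify_source(s, faces)}")
```

In a real system, the speech-likeness flag and the face bearings would come from the robot's sound source localization and visual detection modules, and the matching threshold would be tuned to the array's localization accuracy.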
