Compensating changes in speaker position for improved voice-based human-robot communication

Acoustic perturbation due to reverberation and changes in speaker position is detrimental to seamless speech-based human-robot communication. These perturbations cause a mismatch between the speech features observed at runtime and the acoustic model trained under clean conditions, degrading both Automatic Speech Recognition (ASR) and Spoken Language Understanding (SLU) performance. As a consequence, the robot fails to understand spoken commands, which negatively impacts the interaction experience. In this paper, we propose a framework for improving speech-based human-robot communication in various reverberant environments. The framework is based on robust robot audition that addresses the mismatch problem while striking a balance between algorithmic sophistication and the limitations of a real robot setting. Our method improves both ASR and SLU performance. Moreover, the proposed framework can continue to minimize mismatch without human supervision. We evaluate it on data collected in real environmental conditions.
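The train/runtime mismatch described above can be made concrete with a minimal sketch: convolving a dry signal with a synthetic room impulse response and comparing simple frame-level features on both versions. This is an illustration only, not the paper's method; the signal, the exponential-decay impulse response, and the log-energy features are all hypothetical stand-ins for real speech, a measured room response, and an ASR front end.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000  # sample rate in Hz (assumed)

# Dry "speech-like" signal: amplitude-modulated noise as a stand-in for speech.
t = np.arange(fs) / fs
dry = rng.standard_normal(fs) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))

# Synthetic room impulse response: a direct-path spike followed by an
# exponentially decaying noise tail, roughly mimicking reverberation.
tail_len = int(0.3 * fs)
rir = rng.standard_normal(tail_len) * np.exp(-np.arange(tail_len) / (0.05 * fs))
rir[0] = 1.0
wet = np.convolve(dry, rir)[: len(dry)]  # reverberant version of the signal

def frame_log_energy(x, frame=400, hop=160):
    """Per-frame log energy: a crude proxy for ASR front-end features."""
    n = (len(x) - frame) // hop + 1
    return np.array(
        [np.log(np.sum(x[i * hop : i * hop + frame] ** 2) + 1e-10) for i in range(n)]
    )

f_dry = frame_log_energy(dry)
f_wet = frame_log_energy(wet)

# Mean absolute feature difference quantifies the train/runtime mismatch
# that degrades an acoustic model trained on dry speech.
mismatch = float(np.mean(np.abs(f_dry - f_wet)))
print(f"mean |delta log-energy| = {mismatch:.3f}")
```

A framework like the one proposed here aims to drive this kind of feature-level discrepancy down (e.g., by dereverberation or model adaptation) so that runtime features again match the training condition.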
