Speech-based localization of multiple persons for an interface robot

Spoken commands are a convenient way for a human operator to control a robot, since voice is a natural communication medium for humans. To carry out a command successfully, a robot needs to know which of the possibly many people present gave the command and where that person is located. In this paper, we present a particle-filter-based algorithm for localizing multiple speakers in an environment where only one person speaks at a time. The algorithm uses person-specific voice features (vowel formant frequencies) to distinguish between the speakers. These voice features are complemented by azimuth angle measurements obtained from a pair of microphones. We test our approach using the microphone system of the Philips iCat interface robot.
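
To make the azimuth-plus-particle-filter idea concrete, the following is a minimal sketch for a single speaker and is not the algorithm of the paper. It assumes a far-field source, so that the azimuth follows from the time delay between the two microphones as asin(c * delay / d), a random-walk motion model, and a Gaussian measurement likelihood. The function names, the noise parameters, and the speed of sound value are all illustrative assumptions, and the formant-based speaker identification is left out entirely.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def tdoa_to_azimuth(delay_s, mic_distance_m):
    """Far-field azimuth (radians) from the delay between two microphones."""
    ratio = SPEED_OF_SOUND * delay_s / mic_distance_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp: noisy delays can exceed d / c
    return math.asin(ratio)


def gaussian_likelihood(x, mean, sigma):
    """Unnormalized Gaussian likelihood of x given mean and sigma."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def particle_filter_step(particles, measured_azimuth, rng,
                         motion_sigma=0.05, meas_sigma=0.1):
    """One predict-weight-resample cycle over azimuth hypotheses.

    motion_sigma and meas_sigma are hypothetical tuning values in radians.
    """
    # Predict: diffuse each hypothesis with an assumed random-walk motion model.
    predicted = [p + rng.gauss(0.0, motion_sigma) for p in particles]
    # Weight: score each hypothesis against the measured azimuth.
    weights = [gaussian_likelihood(measured_azimuth, p, meas_sigma)
               for p in predicted]
    if sum(weights) == 0.0:
        return predicted  # measurement uninformative; keep the prediction
    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Point estimate: mean of the particle set."""
    return sum(particles) / len(particles)
```

A typical use would initialize the particles uniformly over the frontal half-plane, convert each measured delay with `tdoa_to_azimuth`, and call `particle_filter_step` once per measurement. Extending the sketch to several speakers would require one particle set per person plus a rule for assigning each measurement to a speaker, which is the role the abstract gives to the voice features.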
