Speaker localization using excitation source information in speech

This paper presents the results of simulation and real-room studies on localizing a moving speaker using information about the excitation source of speech production. The first step in localization is the estimation of the time-delay from speech collected by a pair of microphones. Methods for time-delay estimation generally use spectral features, which correspond mostly to the shape of the vocal tract during speech production. Spectral features are affected by degradations due to noise and reverberation. This paper proposes a method for localizing a speaker using features that arise from the excitation source during speech production. Experiments were conducted by simulating different noise and reverberation conditions to compare the time-delay estimation and source localization performance of the proposed method with that of the spectrum-based generalized cross-correlation (GCC) methods. The results show that the proposed method yields fewer discrepancies in the estimated time-delays. The bias, variance, and root mean square error (RMSE) of the proposed method are consistently equal to or lower than those of the GCC methods. The locations of a moving speaker estimated using the time-delays obtained by the proposed method are closer to the actual values than those obtained by the GCC methods.
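
To make the time-delay estimation step concrete, the following is a minimal sketch of the spectrum-based GCC baseline that the abstract compares against, not of the proposed excitation-source method. It assumes the phase transform (PHAT) weighting, one common GCC variant; the function names, the sampling rate argument `fs`, the regularization constant, and the far-field conversion from delay to bearing (with an assumed sound speed of 343 m/s) are illustrative choices and are not taken from the paper.

```python
import numpy as np


def gcc_phat_delay(x1, x2, fs, max_delay=None):
    """Estimate the delay (in seconds) of x2 relative to x1 with GCC-PHAT.

    A positive result means x2 lags x1. All names here are illustrative.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = len(x1) + len(x2)  # zero-pad so the correlation is not circular

    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)

    # Cross-power spectrum, whitened so that only phase information remains.
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12  # small constant avoids division by zero
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if a bound is given.
    max_shift = n // 2
    if max_delay is not None:
        max_shift = max(1, min(int(round(max_delay * fs)), max_shift))

    # Reorder so index i corresponds to lag (i - max_shift) samples.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs


def bearing_from_delay(delay, mic_spacing, c=343.0):
    """Far-field bearing (radians from broadside) for a two-microphone pair.

    Assumes a plane wave and a sound speed c in m/s; both are assumptions.
    """
    return float(np.arcsin(np.clip(c * delay / mic_spacing, -1.0, 1.0)))
```

For a pair of microphone signals `x1` and `x2` sampled at `fs` Hz, `gcc_phat_delay(x1, x2, fs)` returns the estimated delay, and `bearing_from_delay` converts it to a direction under the far-field assumption. Locating a speaker in a room would require delays from several microphone pairs, which this sketch does not cover.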
