Robust speaker's location detection in a vehicle environment using GMM models

Human-computer interaction (HCI) using speech communication is becoming increasingly important, especially in driving, where safety is the primary concern. Knowing the speaker's location (i.e., speaker localization) not only improves the enhancement of a corrupted signal, but also assists speaker identification. Because conventional speech localization algorithms suffer from the uncertainties of environmental complexity and noise, as well as from the microphone mismatch problem, they are frequently not robust in practice. Without high reliability, speech-based HCI will not gain acceptance. This work presents a novel speaker's location detection method and demonstrates high accuracy within a vehicle cabin using a single linear microphone array. The proposed approach utilizes Gaussian mixture models (GMMs) to model the distributions of the phase differences among the microphones caused by the complex characteristics of room acoustics and microphone mismatch. The model can be applied in both near-field and far-field situations in a noisy environment. The individual Gaussian components of a GMM represent general location-dependent, but content- and speaker-independent, phase difference distributions. Moreover, the scheme performs well not only in non-line-of-sight cases, but also when the speakers are aligned toward the microphone array but at different distances from it. This strong performance is achieved by exploiting the fact that the phase difference distributions at different locations are distinguishable in the environment of a car. The experimental results also show that the proposed method outperforms the conventional multiple signal classification (MUSIC) method at various SNRs.
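
The core idea, one GMM of inter-microphone phase differences per candidate location and a maximum-likelihood decision among them, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes diagonal covariances, plain EM training, and synthetic phase differences in place of measured ones. The location names, cluster centres, and helper functions (`fit_gmm`, `detect_location`, `synth_phase_diffs`) are hypothetical. It also treats phase differences as ordinary real values, which ignores wrapping at ±π.

```python
import numpy as np

def component_log_prob(x, weights, means, variances):
    """Log of weight * Gaussian density for every sample/component pair: (n, k)."""
    diff = x[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None]).sum(axis=2)
    return np.log(weights)[None, :] + log_gauss

def fit_gmm(x, k, n_iter=50, seed=0):
    """Fit a k-component diagonal-covariance GMM to x (n_samples, n_dims) with EM."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    means = x[rng.choice(n, size=k, replace=False)]      # initialise from random samples
    variances = np.tile(x.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample.
        log_p = component_log_prob(x, weights, means, variances)
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        variances = np.maximum((resp.T @ x**2) / nk[:, None] - means**2, 1e-6)
    return weights, means, variances

def log_likelihood(x, gmm):
    """Total log-likelihood of all frames in x under one location's GMM."""
    return np.logaddexp.reduce(component_log_prob(x, *gmm), axis=1).sum()

def detect_location(x, gmms):
    """Return the location whose GMM gives the observed frames the highest likelihood."""
    scores = {loc: log_likelihood(x, g) for loc, g in gmms.items()}
    return max(scores, key=scores.get)

def synth_phase_diffs(center, n, rng):
    """Hypothetical stand-in for measured phase differences (radians), one column per mic pair."""
    # Two clusters around the centre give each location a multimodal distribution.
    offsets = rng.choice([-0.2, 0.2], size=(n, 1))
    return np.asarray(center) + offsets + 0.1 * rng.standard_normal((n, len(center)))

rng = np.random.default_rng(1)
centers = {"driver": [0.8, 1.5, 2.1], "passenger": [-0.7, -1.4, -2.0]}  # assumed values
gmms = {loc: fit_gmm(synth_phase_diffs(c, 400, rng), k=2) for loc, c in centers.items()}

test_frames = synth_phase_diffs(centers["driver"], 50, rng)
print(detect_location(test_frames, gmms))
```

Summing the log-likelihood over all frames of an utterance, rather than deciding frame by frame, lets many weak per-frame cues accumulate into a single decision.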
