An integrated framework for multi-channel multi-source localization and voice activity detection

Two major challenges in microphone-array-based adaptive beamforming, speech enhancement, and distant speech recognition are robust, accurate source localization and voice activity detection. This paper introduces a spatial gradient steered response power method using the phase transform (SRP-PHAT) that can localize competing speakers whose speech overlaps. We further investigate the behavior of the SRP function and theoretically characterize a fixed point in its search space for the diffuse noise field, which we call the null position. Building on this result, we propose a multichannel voice activity detection (MVAD) technique based on detecting a maximum power corresponding to the null position. The gradient SRP-PHAT, in tandem with the MVAD, forms an integrated framework for multi-source localization and voice activity detection. Experiments on real data recordings show that this framework is effective in practical hands-free communication applications.
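For orientation, the conventional SRP-PHAT search that the proposed method builds on can be sketched as follows. This is a minimal illustration only, not the gradient variant or the MVAD described above. It assumes integer sample delays, circularly shifted noise-free signals, and a naive DFT to stay dependency-free; the names `dft`, `gcc_phat`, `srp_phat`, and `circular_delay` are hypothetical helpers introduced here.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def gcc_phat(x_ref, x_other):
    """Circular PHAT-weighted cross-correlation.

    The result peaks at the lag (in samples, modulo n) by which
    x_other is delayed relative to x_ref.
    """
    n = len(x_ref)
    spec_ref, spec_other = dft(x_ref), dft(x_other)
    cross = [b * a.conjugate() for a, b in zip(spec_ref, spec_other)]
    # Phase transform: keep only the phase of the cross-spectrum.
    phat = [c / (abs(c) + 1e-12) for c in cross]
    return [sum(phat[k] * cmath.exp(2j * math.pi * k * lag / n)
                for k in range(n)).real / n
            for lag in range(n)]


def srp_phat(signals, candidate_delays):
    """Steered response power for each candidate.

    candidate_delays[c][m] is the assumed integer sample delay at
    microphone m if the source were at candidate position c.
    """
    n = len(signals[0])
    num_mics = len(signals)
    pairs = [(i, j) for i in range(num_mics) for j in range(i + 1, num_mics)]
    correlations = {(i, j): gcc_phat(signals[i], signals[j]) for i, j in pairs}
    return [sum(correlations[(i, j)][(delays[j] - delays[i]) % n]
                for i, j in pairs)
            for delays in candidate_delays]


def circular_delay(x, d):
    """Delay x by d samples with wrap-around (toy propagation model)."""
    n = len(x)
    return [x[(t - d) % n] for t in range(n)]


# Toy usage: one white-noise source observed by three microphones.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(64)]
mic_signals = [circular_delay(source, d) for d in (0, 2, 5)]
candidates = [(0, 0, 0), (0, 1, 2), (0, 2, 5), (0, 3, 6)]
powers = srp_phat(mic_signals, candidates)
best = max(range(len(candidates)), key=powers.__getitem__)
print("estimated candidate:", candidates[best])
```

In a practical system the candidate delays would be derived from the array geometry and a grid of source positions, fractional delays would be interpolated, and an FFT would replace the naive transform.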
