Voice activity classification using beamformer-output-ratio

In a conversation between multiple speakers, each person speaks at different times, so the set of active speakers in any given speech segment is unknown. However, adaptive beamforming techniques such as minimum variance distortionless response (MVDR) beamforming and adaptive blocking (AB) beamforming require the voice activity (VA) of the speakers of interest to be identified. Considering two speakers, this paper addresses a voice activity classification (VAC) problem: identifying the active speaker(s) in each speech segment. The proposed method is based on a new concept, the beamformer-output-ratio (BOR), which is calculated from the outputs of two different beamformers steered toward the two speakers. The first part of the paper introduces the definition of the BOR, the VAC method based on it, and simulation results. The simulations are based on real recordings and show high classification accuracy. The second part presents theoretical results for the BOR of delay-and-sum (DS) beamforming, including BOR formulas derived for different environments and the behaviour of the BOR in the presence of parameter errors.
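The abstract does not give the exact definition of the BOR, so the following is only a minimal sketch under stated assumptions: the BOR is taken to be the ratio of the mean output powers of two delay-and-sum beamformers, one steered toward each speaker, for a narrowband far-field signal on a uniform linear array. The function names, the array geometry, the synthetic test signals, and the symmetric threshold rule are all hypothetical and are not taken from the paper. Silent segments are not handled.

```python
import cmath
import math


def steering_vector(angle_deg, n_mics, spacing_m, freq_hz, c=343.0):
    """Narrowband far-field steering vector of a uniform linear array (assumed geometry)."""
    delay_step = spacing_m * math.sin(math.radians(angle_deg)) / c
    return [cmath.exp(-2j * math.pi * freq_hz * m * delay_step) for m in range(n_mics)]


def ds_output_power(snapshots, steer):
    """Mean output power of a delay-and-sum beamformer over one segment."""
    n = len(steer)
    total = 0.0
    for x in snapshots:
        # Phase-align each microphone toward the steering direction, then average.
        y = sum(w.conjugate() * xm for w, xm in zip(steer, x)) / n
        total += abs(y) ** 2
    return total / len(snapshots)


def beamformer_output_ratio(snapshots, steer1, steer2, eps=1e-12):
    """Assumed BOR: power of the beam toward speaker 1 over power of the beam toward speaker 2."""
    return ds_output_power(snapshots, steer1) / (ds_output_power(snapshots, steer2) + eps)


def classify_segment(bor, threshold=4.0):
    """Hypothetical decision rule: a large BOR means speaker 1, a small BOR means speaker 2."""
    if bor > threshold:
        return "speaker 1"
    if bor < 1.0 / threshold:
        return "speaker 2"
    return "both"


def simulate_segment(steer1, steer2, gain1, gain2, n_frames=200):
    """Synthetic snapshots: two complex tones arriving from the two speaker directions."""
    snapshots = []
    for t in range(n_frames):
        s1 = gain1 * cmath.exp(0.3j * t)
        s2 = gain2 * cmath.exp(1.1j * t)
        snapshots.append([s1 * a + s2 * b for a, b in zip(steer1, steer2)])
    return snapshots


if __name__ == "__main__":
    a1 = steering_vector(-30.0, 8, 0.05, 1000.0)  # assumed direction of speaker 1
    a2 = steering_vector(40.0, 8, 0.05, 1000.0)   # assumed direction of speaker 2
    for gains in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
        bor = beamformer_output_ratio(simulate_segment(a1, a2, *gains), a1, a2)
        print(gains, classify_segment(bor))
```

In this toy setting the ratio is well above 1 when only the first source is active, well below 1 when only the second is active, and close to 1 when both are active with similar power, which is the intuition behind using a ratio of two steered outputs for classification. The threshold value would need to be tuned for real recordings.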
