Automatic adaptive speech separation using beamformer-output-ratio for voice activity classification

This paper focuses on the practical challenge of adaptation control for speech separation systems. Adaptive beamforming methods, such as the minimum variance distortionless response (MVDR) beamformer, can effectively extract the desired speech signal from interference and noise. However, to avoid the signal cancellation problem, beamformer adaptation must be halted while the desired speaker is active. Automating this adaptation control requires classifying the speakers' voice activity status, which remains a challenge in multi-speaker environments. In this paper, we propose a novel approach to identifying voice activity for two speakers based on a new metric, called the beamformer-output-ratio (BOR). Statistical properties of the BOR are studied and used to develop a hypothesis-based method for voice activity classification. The method is further refined using an algorithm that detects incorrect beamformer adaptation by analysing changes in the output power of a blindly adapting MVDR beamformer. Based on the new methods, we construct an automatic adaptive beamforming system that simultaneously separates the speech of two speakers. The speech separation module of the system uses MVDR beamformers whose adaptation is guided by the voice activity classification. In some cases, our methods reduce the voice activity classification error by 20% and improve the output SINR by 8 dB. The results are verified on both synthesised signals and realistic recordings.

Highlights
- We design an automated adaptive beamforming system to extract the speech of two speakers.
- The quantity BOR and its role in active speaker identification are introduced.
- The BOR-VAC method is developed, in both generic form and practical realisation.
- We model the beamformer output power behaviour to detect incorrect adaptation.
- The proposed systems are tested on both real and synthesised recordings.
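As a rough illustration of the adaptation control described in the abstract, the following Python sketch computes textbook MVDR weights, w = R^{-1} d / (d^H R^{-1} d), and freezes the covariance update while the desired speaker is flagged as active. It is not the authors' implementation. The uniform linear array, the microphone spacing, the diagonal loading factor, the forgetting factor, and all function names are assumptions made for this example. The voice activity flag is taken as given, and the BOR-based classifier itself is not reproduced.

```python
import numpy as np


def steering_vector(n_mics, spacing_m, angle_rad, freq_hz, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    delays = np.arange(n_mics) * spacing_m * np.sin(angle_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)


def mvdr_weights(R, d, loading=1e-3):
    """Textbook MVDR weights w = R^{-1} d / (d^H R^{-1} d).

    A small diagonal loading term (an assumption of this sketch) keeps the
    covariance matrix well conditioned.
    """
    n = R.shape[0]
    R_loaded = R + loading * (np.trace(R).real / n) * np.eye(n)
    Rinv_d = np.linalg.solve(R_loaded, d)
    return Rinv_d / (d.conj() @ Rinv_d)


def update_covariance(R, x, target_active, forgetting=0.95):
    """Recursive covariance estimate that is frozen while the target speaks.

    Halting the update when target_active is True is what prevents the
    beamformer from adapting to, and cancelling, the desired speech.
    """
    if target_active:
        return R
    return forgetting * R + (1.0 - forgetting) * np.outer(x, x.conj())


# Hypothetical scenario: 4 microphones, 5 cm apart, one 1 kHz frequency bin.
n_mics, spacing_m, freq_hz = 4, 0.05, 1000.0
d_target = steering_vector(n_mics, spacing_m, 0.0, freq_hz)
d_interf = steering_vector(n_mics, spacing_m, np.pi / 3, freq_hz)

# Interference-plus-noise covariance, as estimated while the target is silent.
R = 10.0 * np.outer(d_interf, d_interf.conj()) + 0.01 * np.eye(n_mics)
w = mvdr_weights(R, d_target)

gain_target = abs(np.vdot(w, d_target))  # distortionless constraint, |w^H d| = 1
gain_interf = abs(np.vdot(w, d_interf))  # strongly attenuated interferer
```

In this sketch the target direction keeps unit gain while the interferer is attenuated. If the covariance were instead updated during target activity, the minimisation would also act on the desired speech, which is the signal cancellation problem the abstract refers to.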
