Voice activity classification for automatic bi-speaker adaptive beamforming in speech separation

This paper proposes a simple, low-computational-complexity system for bi-speaker speech separation. The system consists of a voice activity classification (VAC) module and an adaptive bi-beamformer module that performs speech separation using a microphone array. The first module identifies the active speaker(s), allowing the system to control the adaptation of the second module automatically. The VAC is based on a novel two-step classification method. The first step applies a robust VAC method built on our previous work on the beamformer-output-ratio of a bi-beamforming system. The second step refines the VAC results using a novel method derived from an analytical result on the output power of an adaptive beamformer. The system is tested in reverberant environments with both synthesized and real recordings. The synthesized recordings contain two speakers, background speech, and noise; the real recording contains two speakers speaking spontaneously. The VAC results satisfy a conservative classification scheme that avoids the signal cancellation problem. The final separation outputs are compared with the ideal outputs of genie-aided adaptive beamformers that have perfect VAC knowledge. The results show that the proposed automatic system achieves performance close to that of the ideal system.
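The abstract does not give the classification rule itself, but the first step it describes (comparing the output powers of two beamformers, one steered at each speaker) can be sketched as follows. This is a minimal illustration under assumed details: the frame length, hop, threshold values, and the three-way labeling (including an "uncertain" class for the conservative scheme that freezes adaptation) are all hypothetical choices, not taken from the paper.

```python
import numpy as np

def frame_power(x, frame_len=256, hop=128):
    """Average power of a signal in overlapping frames (assumed framing)."""
    n = (len(x) - frame_len) // hop + 1
    return np.array([np.mean(x[i * hop : i * hop + frame_len] ** 2)
                     for i in range(n)])

def bor_vac(y1, y2, hi=4.0, lo=0.25, eps=1e-12):
    """Per-frame voice activity classification from a beamformer-output-ratio.

    y1, y2: outputs of two beamformers, each steered at one speaker.
    Returns per-frame labels:
      1 -> only speaker 1 judged active,
      2 -> only speaker 2 judged active,
      0 -> uncertain / both (conservatively, adaptation would be frozen here
           to avoid signal cancellation).
    Thresholds hi/lo are illustrative, not the paper's values.
    """
    p1 = frame_power(y1)
    p2 = frame_power(y2)
    ratio = (p1 + eps) / (p2 + eps)   # beamformer-output-ratio per frame
    labels = np.zeros(len(ratio), dtype=int)
    labels[ratio > hi] = 1            # speaker 1 dominates
    labels[ratio < lo] = 2            # speaker 2 dominates
    return labels
```

A second refinement pass, as the abstract indicates, would then revisit the uncertain frames using the output power of the adaptive beamformer; that analysis is not reproduced here.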
