Speech Activity Detection for Noisy Data Using Adaptation Techniques

Automatic detection of speech in audio streams has become an important preprocessing step for speech recognition, speaker recognition, and audio data mining. In many applications, speech activity detection has to be performed on highly degraded audio streams. We address the challenge of speech activity detection under highly degraded channel conditions. We present two two-pass modified cumulative sum (CUSUM) approaches, one based on maximum a posteriori (MAP) adaptation and the other on regularized feature-based maximum likelihood linear regression (RFMLLR) adaptation. We compare both approaches to a single-pass modified CUSUM baseline system that uses Gaussian mixture models (GMMs) of the speech and non-speech classes. The systems are evaluated on two test sets, each consisting of data from eight highly degraded channels. Our two-pass MAP adaptation system reduces the total error by a relative 27%-54% compared to the single-pass baseline system. We also present experiments showing additional relative gains of 3%-25% from using channel-specific GMMs for speech and non-speech instead of a single channel-independent GMM for each class.
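As a rough illustration of the CUSUM idea mentioned in the abstract, the sketch below segments a sequence of per-frame log-likelihood ratios (speech versus non-speech) using a textbook two-state CUSUM test. It is not the modified CUSUM procedure of the paper, whose details are not given here. The function name `cusum_segment`, the `threshold` value, and the assumption that the ratios come from speech and non-speech GMMs are all hypothetical choices made for illustration.

```python
def cusum_segment(llr, threshold=5.0):
    """Label frames as speech (1) or non-speech (0).

    llr: per-frame log-likelihood ratios, log p(x|speech) - log p(x|non-speech)
         (assumed to be computed elsewhere, e.g. from two GMMs).
    threshold: hypothetical decision threshold on the cumulative sum.
    """
    labels = [0] * len(llr)
    state = 0      # assumption: the stream starts in non-speech
    g = 0.0        # cumulative evidence for leaving the current state
    start = 0      # candidate change point (first frame after last reset)
    for t, s in enumerate(llr):
        # Evidence against the current state: positive LLR argues for
        # speech, negative LLR argues for non-speech.
        inc = s if state == 0 else -s
        g = max(0.0, g + inc)
        if g == 0.0:
            # No accumulated evidence: move the candidate change point.
            start = t + 1
        if g > threshold:
            # Change detected: switch state and relabel frames back to
            # the candidate change point.
            state = 1 - state
            for i in range(start, t + 1):
                labels[i] = state
            g = 0.0
            start = t + 1
        else:
            labels[t] = state
    return labels
```

In a two-pass system of the kind described, the labels from a first pass like this could plausibly be used to adapt the speech and non-speech models before a second pass; that wiring is left out of the sketch.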
