Correlation coefficient-based voice activity detector algorithm

A voice activity detector (VAD) is an algorithm that distinguishes the speech regions of an input signal from the background noise, and it is an important step in many speech processing applications. The varying nature and large variety of both speech and background noise make this problem difficult, especially at low signal-to-noise ratio (SNR), which is the case in many practical applications. In this paper we propose a new VAD algorithm designed to improve word boundary detection under a variable background noise level in a real-time application. The input signal is windowed in the time domain, and the energy and the spectrum of the current frame are then computed. The first few frames are assumed to contain no speech and are used to obtain an initial estimate of the noise parameters. These parameters are updated during silence periods using a first-order autoregressive filter. To obtain robust parameters that do not depend on the amplitude of the spectrum, the correlation coefficient between the instantaneous spectrum and an average of the background noise spectrum is calculated. The speech regions are then detected with a statistical approach that uses a simple binary Markov model for the speech activity process. To evaluate the performance of the proposed method, clean speech from the TIMIT database was corrupted with different types of noise from the NOISEX database at different SNR levels.
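The processing chain described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `correlation_vad`, the frame length, hop size, number of initial noise frames, smoothing factor `alpha`, and the fixed decision `threshold` are all assumptions chosen for illustration. The frame energy feature is omitted, and the binary Markov model is replaced by a plain threshold on the correlation coefficient.

```python
import numpy as np

def correlation_vad(signal, frame_len=256, hop=128, n_init=10,
                    alpha=0.95, threshold=0.7):
    """Illustrative VAD sketch; all parameter values are assumptions.

    Returns (decisions, rho): a boolean speech flag per frame and the
    correlation coefficient between each frame's magnitude spectrum and
    the running background-noise spectrum estimate.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop

    # Windowed magnitude spectrum of every frame.
    specs = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        specs[i] = np.abs(np.fft.rfft(frame))

    # The first n_init frames are assumed to contain only noise.
    noise_spec = specs[:n_init].mean(axis=0)

    decisions = np.zeros(n_frames, dtype=bool)
    rho = np.full(n_frames, np.nan)  # undefined for the initial frames
    for i in range(n_init, n_frames):
        # Pearson correlation is invariant to the scale of the spectrum.
        rho[i] = np.corrcoef(specs[i], noise_spec)[0, 1]
        # Assumption: a fixed threshold stands in for the Markov model.
        decisions[i] = rho[i] < threshold
        if not decisions[i]:
            # First-order recursive update of the noise estimate,
            # applied only during frames classified as silence.
            noise_spec = alpha * noise_spec + (1.0 - alpha) * specs[i]
    return decisions, rho
```

Calling `decisions, rho = correlation_vad(x)` on a one-dimensional array `x` of samples gives one flag per frame. Because the correlation coefficient removes the mean and scale of both spectra, a change in noise level alone leaves `rho` close to 1 as long as the noise keeps its spectral shape; the threshold would need tuning for any real noise condition.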
