Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation

The introduction of interview speech in recent NIST Speaker Recognition Evaluations (SREs) has necessitated the development of robust voice activity detectors (VADs) that can operate at very low signal-to-noise ratios (SNRs). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties of detecting speech and non-speech segments in these files. To alleviate these difficulties, the paper proposes a VAD that uses noise reduction as a pre-processing step. It also proposes a strategy for avoiding the undesirable effects of impulsive signals and sinusoidal background signals on the VAD. The proposed VAD is compared with the VAD in the ETSI-AMR speech coder on the task of removing silence regions from interview speech files. The results show that the proposed VAD is more robust in detecting speech segments at very low SNR, leading to a significant performance gain in Common Conditions 1–4 of the NIST 2008 SRE.
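
The idea of applying noise reduction before the speech/non-speech decision can be illustrated with a minimal sketch. It is not the detector proposed in the paper: it assumes a crude power-domain subtraction of a noise floor estimated from the first few frames, followed by a fixed energy threshold. The function names, the frame length, the number of noise-only frames, and the 6 dB threshold are all hypothetical choices made for illustration.

```python
import math


def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def energy_vad(signal, frame_len=80, noise_frames=5, threshold_db=6.0):
    """Return one boolean per frame: True if the frame is judged to be speech.

    ASSUMPTION: the first `noise_frames` frames contain background noise only.
    ASSUMPTION: noise reduction is a simple subtraction of the estimated noise
    power; this is an illustrative stand-in, not the method used in the paper.
    """
    energies = frame_energies(signal, frame_len)
    if len(energies) < noise_frames:
        raise ValueError("signal too short to estimate the noise floor")

    noise_floor = sum(energies[:noise_frames]) / noise_frames

    decisions = []
    for energy in energies:
        # Pre-processing step: remove the estimated noise power, floor at zero.
        cleaned = max(energy - noise_floor, 0.0)
        if noise_floor == 0.0:
            # No measurable noise: any residual energy counts as speech.
            decisions.append(cleaned > 0.0)
        elif cleaned == 0.0:
            decisions.append(False)
        else:
            # Decide on the residual energy relative to the noise floor, in dB.
            ratio_db = 10.0 * math.log10(cleaned / noise_floor)
            decisions.append(ratio_db > threshold_db)
    return decisions
```

A fixed threshold of this kind would be expected to struggle with the impulsive and sinusoidal background signals the abstract mentions, which is why a real detector needs the additional safeguards the paper describes.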
