A study of voice activity detection techniques for NIST speaker recognition evaluations

Since 2008, interview-style speech has become an important part of the NIST speaker recognition evaluations (SREs). Compared with telephone speech, interview speech has a lower signal-to-noise ratio (SNR), which necessitates robust voice activity detectors (VADs). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation in these files. To overcome these difficulties, this paper proposes using speech enhancement techniques as a pre-processing step to improve the reliability of energy-based and statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. The proposed VAD is compared with the ASR transcripts provided by NIST, the VAD in the ETSI-AMR Option 2 coder, a statistical-model (SM) based VAD, and a Gaussian mixture model (GMM) based VAD. Experimental results based on the NIST 2010 SRE dataset suggest that the proposed VAD outperforms these conventional ones whenever interview-style speech is involved. This study also demonstrates that (1) noise reduction is vital for energy-based VAD under low SNR; (2) the ASR transcripts and the ETSI-AMR speech coder do not produce accurate speech and non-speech segmentations; and (3) spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD. The segmentation files produced by the proposed VAD can be found at http://bioinfo.eie.polyu.edu.hk/ssvad.
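
The general idea of applying noise reduction before an energy-based decision can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a textbook power spectral subtraction, and it omits the decision strategy for impulsive and sinusoidal background signals. The function names (`frame_signal`, `spectral_subtraction_vad`) and all parameter values (frame length, over-subtraction factor `alpha`, spectral `floor`, `margin_db`) are assumptions chosen for illustration, as is the assumption that the first few frames of the file contain background noise only.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def spectral_subtraction_vad(x, frame_len=256, hop=128, noise_frames=10,
                             alpha=2.0, floor=0.01, margin_db=6.0):
    """Return one boolean speech/non-speech decision per frame.

    Assumption: the first `noise_frames` frames contain background only.
    """
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Background spectrum estimated from the leading (assumed noise-only) frames.
    noise_power = power[:noise_frames].mean(axis=0)

    # Power spectral subtraction with over-subtraction and a spectral floor.
    clean_power = np.maximum(power - alpha * noise_power, floor * power)

    # Energy-based decision on the enhanced frames.
    energy_db = 10.0 * np.log10(clean_power.sum(axis=1) + 1e-12)
    threshold_db = energy_db[:noise_frames].mean() + margin_db
    return energy_db > threshold_db
```

Because the threshold is set relative to the residual energy of the noise-only frames after subtraction, the decision depends on how well the background spectrum is estimated; a file that starts with speech would violate the assumption made here.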
