A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data

A voice activity detector (VAD) plays a vital role in robust speaker verification, where energy VAD is most commonly used. Energy VAD works well in noise-free conditions but deteriorates in noisy conditions. One way to tackle this is to introduce speech enhancement preprocessing. We study an alternative, a likelihood-ratio-based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs). The training labels are obtained from an enhanced energy VAD. Because the speech and nonspeech models are retrained for each utterance, minimal assumptions about the background noise are made. According to both VAD error analysis and speaker verification results using a state-of-the-art i-vector system, the proposed method outperforms energy VAD variants by a wide margin. We provide an open-source implementation of the method.
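The per-utterance idea described above can be illustrated with a minimal sketch. It is not the authors' released implementation: it assumes one diagonal-covariance Gaussian per class as a stand-in for the speech and nonspeech models, takes the lowest- and highest-energy frames as training labels without any enhancement step, and expects the caller to supply a feature matrix (for example MFCCs). The function names and the `frac` and `threshold` parameters are hypothetical.

```python
import numpy as np


def frame_log_energy(signal, frame_len=240, hop=80):
    """Log energy (dB) of overlapping frames of a 1-D signal."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx]
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)


def fit_diag_gaussian(x):
    """Mean and (floored) variance of each feature dimension."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6


def log_likelihood(x, mean, var):
    """Per-frame log-likelihood under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=1)


def self_adaptive_vad(features, energy, frac=0.1, threshold=0.0):
    """Train speech/nonspeech models on this utterance only, then score it.

    features: (n_frames, dim) array, e.g. MFCCs (assumed precomputed)
    energy:   (n_frames,) frame energies used only to pick training frames
    frac:     fraction of frames used per class (illustrative assumption)
    """
    n_frames = len(energy)
    k = max(1, int(frac * n_frames))
    order = np.argsort(energy)
    # Lowest-energy frames are labeled nonspeech, highest-energy frames speech.
    nonspeech_model = fit_diag_gaussian(features[order[:k]])
    speech_model = fit_diag_gaussian(features[order[-k:]])
    # Frame-level log-likelihood ratio between the two models.
    llr = (log_likelihood(features, *speech_model)
           - log_likelihood(features, *nonspeech_model))
    return llr > threshold, llr
```

Calling `self_adaptive_vad(features, energy)` returns a boolean speech mask and the frame-level log-likelihood ratios. Because both models are re-estimated from the utterance itself, nothing about the noise type has to be fixed in advance; only the threshold and the fraction of frames used for training remain to be tuned.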
