Comparison of Voice Activity Detectors for Interview Speech in NIST Speaker Recognition Evaluation

Interview speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has a substantially lower signal-to-noise ratio, which necessitates robust voice activity detection (VAD). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties of performing speech/nonspeech segmentation on these files. To overcome these difficulties, the paper proposes using speech enhancement techniques as a pre-processing step to improve the reliability of energy-based and statistical-model-based VADs. It was found that spectral subtraction can make better use of the background spectrum than the likelihood-ratio tests in statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. Results on NIST 2010 SRE show that the proposed VAD outperforms the statistical-model-based VAD, the ETSI-AMR speech coder, and the ASR transcripts provided by the NIST SRE Workshop.
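To make the idea of spectral subtraction as a pre-processing step for an energy-based VAD more concrete, the following is a minimal sketch and not the authors' implementation. The abstract gives no algorithmic details, so everything specific here is an assumption made for illustration: the function names, the non-overlapping 64-sample frames, the background estimate taken from the first few frames, the over-subtraction factor, the spectral floor, and the relative energy threshold. The decision strategy for impulsive and sinusoidal background signals mentioned above is not modeled.

```python
import cmath
import math
import random


def magnitude_spectrum(frame):
    """Magnitude of the one-sided DFT of a real frame (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_subtraction_vad(signal, frame_len=64, noise_frames=5,
                             over_subtraction=2.0, floor=0.01,
                             threshold_ratio=0.1):
    """Return one speech/nonspeech boolean per non-overlapping frame.

    All parameter values are illustrative assumptions, not values
    taken from the paper.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    spectra = [magnitude_spectrum(f) for f in frames]
    bins = len(spectra[0])

    # Assumption: the first `noise_frames` frames contain background only,
    # so their average magnitude spectrum estimates the background.
    noise = [sum(s[k] for s in spectra[:noise_frames]) / noise_frames
             for k in range(bins)]

    energies = []
    for s in spectra:
        # Subtract the scaled background magnitude in each bin and keep
        # a small spectral floor so no bin goes negative.
        cleaned = [max(s[k] - over_subtraction * noise[k], floor * s[k])
                   for k in range(bins)]
        energies.append(sum(m * m for m in cleaned))

    # Energy-based decision on the enhanced spectrum: threshold relative
    # to the largest frame energy in the file.
    threshold = threshold_ratio * max(energies)
    return [e > threshold for e in energies]


def demo_signal(seed=0, frame_len=64):
    """Synthetic test signal: 10 noise frames, 10 tone+noise frames, 10 noise frames."""
    rng = random.Random(seed)
    signal = []
    for frame_index in range(30):
        active = 10 <= frame_index < 20
        for t in range(frame_len):
            tone = math.sin(2 * math.pi * 4 * t / frame_len) if active else 0.0
            signal.append(tone + rng.gauss(0.0, 0.1))
    return signal


if __name__ == "__main__":
    decisions = spectral_subtraction_vad(demo_signal())
    print("".join("S" if d else "." for d in decisions))
```

Because the decision is made on frame energy after the background spectrum has been subtracted, frames that contain only stationary background fall close to the spectral floor, while frames carrying additional signal energy stand out. A real interview recording would need overlapping windowed frames, an adaptive background estimate, and a more careful threshold than the synthetic signal used here.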
