The QUT-NOISE-SRE protocol for the evaluation of noisy speaker recognition

The QUT-NOISE-SRE protocol is designed to mix the large QUT-NOISE database, consisting of over 10 hours of back- ground noise, collected across 10 unique locations covering 5 common noise scenarios, with commonly used speaker recognition datasets such as Switchboard, Mixer and the speaker recognition evaluation (SRE) datasets provided by NIST. By allowing common, clean, speech corpora to be mixed with a wide variety of noise conditions, environmental reverberant responses, and signal-to-noise ratios, this protocol provides a solid basis for the development, evaluation and benchmarking of robust speaker recognition algorithms, and is freely available to download alongside the QUT-NOISE database. In this work, we use the QUT-NOISE-SRE protocol to evaluate a state-of-the-art PLDA i-vector speaker recognition system, demonstrating the importance of designing voice-activity-detection front-ends specifically for speaker recognition, rather than aiming for perfect coherence with the true speech/non-speech boundaries.

[1]  David Miller,et al.  The Mixer Corpus of Multilingual, Multichannel Speaker Recognition Data , 2004, LREC.

[2]  John H. L. Hansen,et al.  Uncertainty propagation in front end factor analysis for noise robust speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  David A. van Leeuwen,et al.  The effect of noise on modern automatic speaker recognition systems , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Sridha Sridharan,et al.  PLDA based speaker recognition on short utterances , 2012, Odyssey.

[5]  L. Burget,et al.  Promoting robustness for speaker modeling in the community: the PRISM evaluation set , 2011 .

[6]  Daniel Garcia-Romero,et al.  Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Yun Lei,et al.  Towards noise-robust speaker recognition using probabilistic linear discriminant analysis , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  John J. Godfrey,et al.  SWITCHBOARD: telephone speech corpus for research and development , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9]  Sridha Sridharan,et al.  Complete-linkage clustering for voice activity detection in audio and visual speech , 2015, INTERSPEECH.

[10]  Sridha Sridharan,et al.  The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms , 2010, INTERSPEECH.

[11]  Yun Lei,et al.  A noise-robust system for NIST 2012 speaker recognition evaluation , 2013, INTERSPEECH.

[12]  Herman J. M. Steeneken,et al.  Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems , 1993, Speech Commun..

[13]  Angelo Farina,et al.  Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique , 2000 .

[14]  David Pearce,et al.  The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions , 2000, INTERSPEECH.