Analysis of mutual duration and noise effects in speaker recognition: benefits of condition-matched cohort selection in score normalization

The biometric and forensic performance of automatic speaker recognition systems degrades when probe utterances are noisy or short. Score normalization is an effective tool for taking into account the mismatch between reference and probe utterances. In an adaptive symmetric score normalization scheme for state-of-the-art i-vector recognition systems, a set of cohort speakers is employed to calculate the mean and variance of impostor scores obtained by comparing the cohort to the reference and probe i-vectors. For real-life conditions in which the quality of audio recordings in the test phase does not match the enrolment utterance(s) of speakers, we demonstrate the effectiveness of utilizing a condition-matched cohort set for score normalization. The cohort audio material is shortened and degraded by noise at different reasonable and controlled signal-to-noise ratios according to the expected test conditions, yielding multiple cohort sets. Further, we propose automatic cohort pre-selection based on modeling each degradation category. Each i-vector is assigned a quality vector consisting of the posterior probabilities of the degradation classes. The cohort set is then formed from i-vectors whose quality vectors have a small KL-divergence to those of the reference and probe. Further gains are observed when this quality vector is also included in the score calibration.
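
The cohort pre-selection and symmetric normalization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It rests on several assumptions that the abstract does not confirm: quality vectors are already available as posterior distributions over degradation classes, comparison scores are plain cosine similarities (the actual system may use a different comparator), a fixed number `top_n` of cohort i-vectors is kept per side, and each side's cohort is chosen by that side's own quality vector. All function and parameter names are hypothetical.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete distributions (clipped for stability)."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def select_cohort(quality, cohort_quality, top_n):
    """Indices of the top_n cohort entries with the smallest KL-divergence."""
    divs = np.array([kl_divergence(quality, cq) for cq in cohort_quality])
    return np.argsort(divs)[:top_n]


def cosine_score(a, b):
    """Cosine similarity, used here as a stand-in comparison score (assumption)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def s_norm(ref, probe, ref_q, probe_q, cohort, cohort_q, top_n=2, eps=1e-10):
    """Symmetric score normalization with quality-matched cohort subsets."""
    raw = cosine_score(ref, probe)

    # Assumption: each side selects its cohort by its own quality vector.
    ref_idx = select_cohort(ref_q, cohort_q, top_n)
    probe_idx = select_cohort(probe_q, cohort_q, top_n)

    # Impostor score statistics of reference and probe against their cohorts.
    ref_scores = np.array([cosine_score(ref, cohort[i]) for i in ref_idx])
    probe_scores = np.array([cosine_score(probe, cohort[i]) for i in probe_idx])

    z = (raw - ref_scores.mean()) / (ref_scores.std() + eps)
    t = (raw - probe_scores.mean()) / (probe_scores.std() + eps)
    return 0.5 * (z + t)


# Toy usage with random i-vectors and two hypothetical degradation classes.
rng = np.random.default_rng(0)
cohort = rng.standard_normal((4, 8))
cohort_q = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
ref, probe = rng.standard_normal(8), rng.standard_normal(8)
score = s_norm(ref, probe, [0.85, 0.15], [0.15, 0.85], cohort, cohort_q)
```

In this toy setup the reference is normalized against the two cohort entries dominated by the first degradation class and the probe against the two dominated by the second, which is the condition-matching effect the abstract describes.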
