R-norm: improving inter-speaker variability modelling at the score level via regression score normalisation

This paper presents a new method of score post-processing which utilises previously hidden relationships among client models and test probes that are found within the scores produced by an automatic speaker recognition system. We suggest the name r-Norm (for Regression Normalisation) for the method, which can be viewed as both a score normalisation process and as a novel and improved modelling technique of inter-speaker variability. The key component of the method lies in learning a regression model between development data scores and an ‘ideal’ score matrix, which can either be derived from clean data or created synthetically. To generate scores for experimental validation of the proposed idea we perform a classic GMM-UBM experiment employing mel-cepstral features on the 1sp-female task of the NIST 2003 SRE corpus. Comparisons of the r-Norm results are made with standard score postprocessing/normalisation methods t-Norm and z-Norm. The rNorm method is shown to perform very strongly, improving the EER from 18.5% to 7.01%, significantly outperforming both z-Norm and t-Norm in this case. The baseline system performance was deemed acceptable for the aims of this experiment, which were focused on evaluating and comparing the performance of the proposed r-Norm idea.

[1]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[2]  Roland Auckenthaler,et al.  Score Normalization for Text-Independent Speaker Verification Systems , 2000, Digit. Signal Process..

[3]  Patrick Kenny,et al.  An i-vector Extractor Suitable for Speaker Recognition with both Microphone and Telephone Speech , 2010, Odyssey.

[4]  Jean-Luc Gauvain,et al.  Feature and score normalization for speaker verification of cellular data , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[5]  Patrick Kenny,et al.  Development of the primary CRIM system for the NIST 2008 speaker recognition evaluation , 2008, INTERSPEECH.

[6]  David A. van Leeuwen,et al.  Fusion of Heterogeneous Speaker Recognition Systems in the STBU Submission for the NIST Speaker Recognition Evaluation 2006 , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[7]  Patrick Kenny,et al.  Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms , 2006 .

[8]  J. E. Porter,et al.  Normalizations and selection of speech segments for speaker recognition scoring , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[9]  Julian Fiérrez,et al.  U-NORM Likelihood Normalization in PIN-Based Speaker Verification Systems , 2003, AVBPA.

[10]  Jan Vaněk,et al.  UWB system description for NIST SRE 2010 , 2010 .

[11]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[12]  Hui Jiang,et al.  Normalization and Transformation Techniques for Robust Speaker Recognition , 2008 .

[13]  Cristian Sminchisescu,et al.  Twin Gaussian Processes for Structured Prediction , 2010, International Journal of Computer Vision.

[14]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[15]  Niko Brümmer,et al.  Measuring, refining and calibrating speaker and language information extracted from speech , 2010 .

[16]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[17]  Douglas A. Reynolds,et al.  Comparison of background normalization methods for text-independent speaker verification , 1997, EUROSPEECH.