Text-independent speaker identification using robust statistics estimation

Abstract It is well known that the performance of Gaussian mixture model (GMM)-based text-independent speaker identification systems deteriorates significantly in the presence of noise and spectral distortion in the training and testing utterances. In this paper, we propose a novel GMM-based speaker identification system built on two robust-statistics estimation methods: the minimum volume ellipsoid (MVE) method and the minimum covariance determinant (MCD) method. Compared to the traditional maximum likelihood (ML) estimation method, the proposed methods are less sensitive to outliers in the feature-vector space caused by additive noise and spectral distortion. In the testing phase, we also propose a simple distance metric for comparing the unknown testing utterance against the speakers' models. Furthermore, we derive a more robust version of the i-vector extractor, named the robust i-vector, which uses our proposed robust estimation methods to estimate the parameters of the base universal background model (UBM). The proposed classification system has been applied to the NIST 2000 speaker recognition evaluation and the COSINE database, and has been compared against state-of-the-art techniques such as the GMM/UBM method, the super-vectors method, and the i-vector methods. Experimental results show that the proposed classification system provides up to 16% relative improvement in identification performance over the i-vector methods for short utterances in the NIST 2000 database, and up to 8% when the utterances of the NIST 2000 database are contaminated by different types of artificial noise at signal-to-noise ratios ranging from 0 to 20 dB. For the COSINE database, the robust i-vector estimation provides an absolute improvement of up to 8%. Finally, the real-time factor of the proposed distance metric for testing is 55% higher than the real-time factor of regular ML scoring.
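
To make the contrast between ML and robust estimation concrete, the following minimal sketch compares the two on a one-dimensional toy feature contaminated by outliers. It is an illustration only, not the system proposed in the paper: it uses the exact univariate form of the MCD estimator (the minimum-variance subset of h points is contiguous in sorted order), and the function names, the toy data `frames`, and the default choice of `h` are assumptions made for this example.

```python
from statistics import fmean, pvariance


def ml_estimate(xs):
    """Classical ML estimate of a Gaussian: sample mean and biased variance."""
    return fmean(xs), pvariance(xs)


def mcd_estimate_1d(xs, h=None):
    """Exact univariate MCD estimate (hypothetical helper for illustration).

    Among all subsets of h points, the one with the smallest variance is a
    contiguous window of the sorted data, so scanning the windows is enough.
    The returned variance is the raw one, without a consistency correction.
    """
    data = sorted(xs)
    n = len(data)
    if h is None:
        h = (n + 2) // 2  # floor((n + p + 1) / 2) with dimension p = 1
    windows = (data[i:i + h] for i in range(n - h + 1))
    best = min(windows, key=pvariance)
    return fmean(best), pvariance(best)


# Toy "feature" values: eight clean frames near 5.0 plus two noisy outliers.
frames = [4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.1, 4.9, 25.0, 30.0]

ml_mean, ml_var = ml_estimate(frames)
mcd_mean, mcd_var = mcd_estimate_1d(frames)
print("ML  mean/variance:", ml_mean, ml_var)
print("MCD mean/variance:", mcd_mean, mcd_var)
```

In this toy case the ML mean is pulled toward the two outliers, while the MCD location stays with the bulk of the clean frames. The multivariate estimators used for GMM components require dedicated algorithms and are not reproduced here.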
