Combining deep speaker specific representations with GMM-SVM for speaker verification

This study combines a Gaussian mixture model support vector machine (GMM-SVM) system with a nonlinear feature transformation that is discriminatively trained to extract speaker-specific features from mel-frequency cepstral coefficients (MFCCs). A regularized siamese deep network (RSDN) separates the speaker information in the speech signal from the non-speaker information. The RSDN learns a hidden representation that characterizes speaker information by training a subset of its hidden units on pairs of speech segments. MFCC features are fed to a trained RSDN, and a subset of the hidden-layer outputs is used as the new input features for a GMM-SVM system. We demonstrate the potential of this approach for text-independent speaker verification by applying it to a subset of the NIST SRE 2006 1conv4w-1conv4w task. The hybrid RSDN GMM-SVM system achieves a relative improvement of about 5% over the baseline GMM-SVM system.
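The feature-extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the logistic activation, the choice of the last layer as the code layer, the number of speaker units, and the random (untrained) weights are all assumptions made for illustration, and the helper names are hypothetical. It shows only how MFCC frames could be propagated through a network and how a subset of hidden-unit outputs could be kept as new features, plus how a relative improvement figure is computed.

```python
import numpy as np

def forward_to_code_layer(frames, weights, biases):
    """Propagate frames of shape (n_frames, n_in) through every layer in turn."""
    h = frames
    for W, b in zip(weights, biases):
        # Logistic hidden units (an assumption; the abstract does not name the activation).
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return h

def speaker_features(frames, weights, biases, n_speaker_units):
    """Keep only the first n_speaker_units outputs of the final (code) layer."""
    code = forward_to_code_layer(frames, weights, biases)
    return code[:, :n_speaker_units]

def relative_improvement(baseline_error, new_error):
    """Fractional reduction in error relative to the baseline."""
    return (baseline_error - new_error) / baseline_error

# Hypothetical dimensions: 39-dimensional MFCC frames, 60-unit code layer.
rng = np.random.default_rng(0)
layer_sizes = [39, 100, 100, 60]
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

mfcc = rng.normal(size=(200, 39))  # stand-in for real MFCC frames
feats = speaker_features(mfcc, weights, biases, n_speaker_units=40)
```

In a full system the resulting `feats` array would replace the MFCC frames as the input to GMM-SVM training and scoring; that stage is omitted here. As a worked example of the metric, `relative_improvement(10.0, 9.5)` gives 0.05, which is a 5% relative improvement (the error values are illustrative, not results from the study).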
