Tandem deep features for text-dependent speaker verification

Although deep learning has been used successfully for acoustic modeling in speech recognition, it has not been thoroughly investigated or widely adopted for speaker verification. This paper investigates the use of several types of deep features in a Tandem fashion for text-dependent speaker verification. Three types of networks are used to extract deep features: a restricted Boltzmann machine (RBM), a phone-discriminant deep neural network (DNN), and a speaker-discriminant DNN. Hidden-layer outputs from these networks are concatenated with the original acoustic features and used in a GMM-UBM classifier. The systems with Tandem deep features were evaluated on RSR2015, a short-term text-dependent speaker verification task. Experiments showed that the best Tandem deep feature achieved more than 50% relative EER reduction over the traditional feature in a GMM-UBM framework.
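The Tandem construction described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature dimensions (39 acoustic, 64 hidden), the single sigmoid layer with random weights standing in for a trained RBM or DNN layer, and the helper names `hidden_layer_outputs`, `tandem_features`, and `relative_eer_reduction` are all hypothetical. The GMM-UBM back end that would consume the Tandem features is not shown.

```python
import numpy as np


def hidden_layer_outputs(frames, weights, biases):
    """Sigmoid activations of one hidden layer.

    Stand-in for the hidden layer of a trained RBM or DNN; here the
    weights are random, which is an assumption made for illustration.
    """
    return 1.0 / (1.0 + np.exp(-(frames @ weights + biases)))


def tandem_features(acoustic, deep):
    """Concatenate acoustic and deep features frame by frame."""
    if acoustic.shape[0] != deep.shape[0]:
        raise ValueError("acoustic and deep features must have the same number of frames")
    return np.concatenate([acoustic, deep], axis=1)


def relative_eer_reduction(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.5 means the EER was halved."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical data: 200 frames of 39-dimensional acoustic features.
rng = np.random.default_rng(0)
acoustic = rng.standard_normal((200, 39))

# Hypothetical hidden layer with 64 units (untrained, random weights).
weights = rng.standard_normal((39, 64)) * 0.1
biases = np.zeros(64)

deep = hidden_layer_outputs(acoustic, weights, biases)
tandem = tandem_features(acoustic, deep)  # 39 + 64 = 103 dimensions per frame
```

Under this reading, a "more than 50% relative EER reduction" means `relative_eer_reduction(baseline_eer, new_eer)` exceeds 0.5; for instance, a drop from a hypothetical 4.0% EER to 2.0% gives exactly 0.5.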
