Deep feature for text-dependent speaker verification

Abstract Recently deep learning has been successfully used in speech recognition, however it has not been carefully explored and widely accepted for speaker verification. To incorporate deep learning into speaker verification, this paper proposes novel approaches of extracting and using features from deep learning models for text-dependent speaker verification. In contrast to the traditional short-term spectral feature, such as MFCC or PLP, in this paper, outputs from hidden layer of various deep models are employed as deep features for text-dependent speaker verification. Fours types of deep models are investigated: deep Restricted Boltzmann Machines, speech-discriminant Deep Neural Network (DNN), speaker-discriminant DNN, and multi-task joint-learned DNN. Once deep features are extracted, they may be used within either the GMM-UBM framework or the identity vector (i-vector) framework. Joint linear discriminant analysis and probabilistic linear discriminant analysis are proposed as effective back-end classifiers for identity vector based deep features. These approaches were evaluated on the RSR2015 data corpus. Experiments showed that deep feature based methods can obtain significant performance improvements compared to the traditional baselines, no matter if they are directly applied in the GMM-UBM system or utilized as identity vectors. The EER of the best system using the proposed identity vector is 0.10%, only one fifteenth of that in the GMM-UBM baseline.

[1]  David K. Burton,et al.  Text-dependent speaker verification using vector quantization source coding , 1985, IEEE Trans. Acoust. Speech Signal Process..

[2]  Hynek Hermansky,et al.  Adaptation transforms of auto-associative neural networks as features for speaker verification , 2012, Odyssey.

[3]  Eduardo Lleida,et al.  Factor analysis with sampling methods for text dependent speaker recognition , 2014, INTERSPEECH.

[4]  Bin Ma,et al.  Text-dependent speaker verification: Classifiers, databases and RSR2015 , 2014, Speech Commun..

[5]  Ruhi Sarikaya,et al.  Bottleneck features for speaker recognition , 2012, Odyssey.

[6]  Douglas E. Sturim,et al.  Speaker verification using text-constrained Gaussian Mixture Models , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Lukás Burget,et al.  Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Yun Lei,et al.  Content matching for short duration speaker recognition , 2014, INTERSPEECH.

[9]  J. Oglesby,et al.  Speaker recognition using hidden Markov models, dynamic time warping and vector quantisation , 1995 .

[10]  B. Scholkopf,et al.  Fisher discriminant analysis with kernels , 1999, Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468).

[11]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.

[13]  David A. van Leeuwen,et al.  Source-Normalized LDA for Robust Speaker Recognition Using i-Vectors From Multiple Speech Sources , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Nicolas Le Roux,et al.  Representational Power of Restricted Boltzmann Machines and Deep Belief Networks , 2008, Neural Computation.

[15]  Pietro Laface,et al.  Speaker recognition by means of Deep Belief Networks , 2013 .

[16]  Sadaoki Furui,et al.  Speaker recognition using HMM composition in noisy environments , 1996, Comput. Speech Lang..

[17]  Themos Stafylakis,et al.  In-domain versus out-of-domain training for text-dependent JFA , 2014, INTERSPEECH.

[18]  Lukás Burget,et al.  Discriminatively trained Probabilistic Linear Discriminant Analysis for speaker verification , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Patrick Kenny,et al.  First attempt of boltzmann machines for speaker verification , 2012, Odyssey.

[20]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[21]  Bart Kosko,et al.  Neural networks for signal processing , 1992 .

[22]  S. Furui,et al.  Cepstral analysis technique for automatic speaker verification , 1981 .

[23]  B. Juang,et al.  Context-dependent Phonetic Hidden Markov Models for Speaker-independent Continuous Speech Recognition , 2008 .

[24]  Patrick Kenny,et al.  Joint Factor Analysis Versus Eigenchannels in Speaker Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[25]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Kai-Fu Lee,et al.  Context-independent phonetic hidden Markov models for speaker-independent continuous speech recognition , 1990 .

[27]  Najim Dehak,et al.  Discriminative and generative approaches for long- and short-term speaker characteristics modeling: application to speaker verification , 2009 .

[28]  Mitch Weintraub,et al.  NONLINEAR DISCRIMINANT FEATURE EXTRACTION FOR ROBUST TEXT-INDEPENDENT SPEAKER RECOGNITION , 1997 .

[29]  Haizhou Li,et al.  I-vectors in the context of phonetically-constrained short utterances for speaker verification , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Themos Stafylakis,et al.  PLDA for speaker verification with utterances of arbitrary duration , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[31]  David A. van Leeuwen,et al.  Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[33]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[34]  Patrick Kenny,et al.  Speaker and Session Variability in GMM-Based Speaker Verification , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[35]  Rita H. Wouhaybi,et al.  Comparison of neural networks for speaker recognition , 1999, ICECS'99. Proceedings of ICECS '99. 6th IEEE International Conference on Electronics, Circuits and Systems (Cat. No.99EX357).

[36]  Qin Jin,et al.  Application of LDA to speaker recognition , 2000, INTERSPEECH.

[37]  Ahmad Salman,et al.  Learning Speaker-Specific Characteristics With a Deep Neural Architecture , 2011, IEEE Transactions on Neural Networks.

[38]  Yuan Liu,et al.  Tandem deep features for text-dependent speaker verification , 2014, INTERSPEECH.

[39]  Bin Ma,et al.  PLDA Modeling in I-Vector and Supervector Space for Speaker Verification , 2012, INTERSPEECH.

[40]  Bin Ma,et al.  The RSR2015: Database for Text-Dependent Speaker Verification using Multiple Pass-Phrases , 2012, Interspeech 2012.

[41]  Douglas A. Reynolds,et al.  Speaker identification and verification using Gaussian mixture speaker models , 1995, Speech Commun..

[42]  E. Turajlic,et al.  Neural network based speaker verification for security systems , 2012, 2012 20th Telecommunications Forum (TELFOR).

[43]  Yuan Liu,et al.  Speaker verification with deep features , 2014, 2014 International Joint Conference on Neural Networks (IJCNN).

[44]  Richard J. Mammone,et al.  Speaker recognition using neural networks and conventional classifiers , 1994, IEEE Trans. Speech Audio Process..

[45]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[46]  Xuening Sun,et al.  Automatic Speaker Recognition Using Neural Networks , 2004 .