Speaker verification with deep features

Given the success of deep learning in speech recognition, there has been growing interest in applying it to speaker verification. Previous investigations have usually focused on using deep neural networks as new classifiers or as extractors of speaker-dependent features. These approaches are either incompatible with existing speaker verification approaches or unable to achieve significant performance gains in large-scale tasks. Moreover, none of them addresses how to make use of extra unsupervised data. This paper proposes a novel feature engineering approach within the deep learning framework for speaker verification. Hidden-layer outputs of a deep neural network or deep belief network trained on a large amount of speech recognition data are extracted as deep features. These features are then used in a Tandem fashion or concatenated with the original acoustic features for GMM-UBM speaker verification. The proposed approach can make use of large amounts of existing speech recognition data without speaker labels and is easy to combine with other mature classification approaches. Experiments on the core condition of NIST 2006 SRE showed that, in a text-independent task, the proposed approach achieves a 12.8% relative EER improvement over the standard GMM-UBM system. Text-dependent speaker verification experiments were also performed and yielded similarly significant gains.
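The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The network here has random weights standing in for a DNN or DBN trained on speech recognition data, and the layer sizes (39-dimensional acoustic frames, a 64-unit hidden layer, a 20-unit extraction layer), the helper names, and the choice of sigmoid units are all assumptions made for illustration. The sketch shows the two uses of deep features: alone, as a rough stand-in for the Tandem case, or concatenated with the original acoustic frame.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_in, n_out, rng):
    """Random weights standing in for a trained layer (assumption)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def deep_features(frame, layers, extract_at):
    """Propagate one frame and return the activations of layer `extract_at`."""
    h = frame
    for idx, (weights, biases) in enumerate(layers):
        h = [sigmoid(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weights, biases)]
        if idx == extract_at:
            return h
    raise ValueError("extract_at is beyond the last layer")


def concatenated_features(frames, layers, extract_at):
    """Append the deep features to each original acoustic frame."""
    return [frame + deep_features(frame, layers, extract_at) for frame in frames]


# Hypothetical dimensions, chosen only for this example.
ACOUSTIC_DIM, HIDDEN_DIM, DEEP_DIM = 39, 64, 20

rng = random.Random(0)
layers = [make_layer(ACOUSTIC_DIM, HIDDEN_DIM, rng),
          make_layer(HIDDEN_DIM, DEEP_DIM, rng)]

# Five synthetic frames in place of real acoustic features.
frames = [[rng.gauss(0.0, 1.0) for _ in range(ACOUSTIC_DIM)] for _ in range(5)]

# Deep features alone (any decorrelation step is omitted here).
tandem = [deep_features(f, layers, extract_at=1) for f in frames]

# Deep features concatenated with the original acoustic features.
augmented = concatenated_features(frames, layers, extract_at=1)

print(len(tandem[0]), len(augmented[0]))  # prints "20 59"
```

In either case the resulting frame-level vectors would be passed to a GMM-UBM back end in place of the plain acoustic features. Training the network needs only speech recognition labels, which is why no speaker labels are required for that data.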
