Paper Information - Joint Training of Expanded End-to-End DNN for Text-Dependent Speaker Verification

We propose an expanded end-to-end DNN architecture for speaker verification based on b-vectors as well as d-vectors. We embedded the components of a speaker verification system such as modeling frame-level features, extracting utterance-level features, dimensionality reduction of utterance-level features, and trial-level scoring in an expanded end-to-end DNN architecture. The main contribution of this paper is that, instead of using DNNs as parts of the system trained independently, we train the whole system jointly with a fine-tune cost after pre-training each part. The experimental results show that the proposed system outperforms the baseline d-vector system and i-vector PLDA system.
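The four stages named in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the layer sizes, the tanh activations, the random initialisation, and the helper names (`dense`, `init_layer`, `d_vector`, `b_vector`, `score_trial`) are all assumptions made for illustration. The d-vector is taken here as the average of frame-level hidden activations, and the b-vector as the concatenation of element-wise sum, absolute difference, and product of a pair of utterance vectors. Training is omitted; in a jointly trained system a single cost on the final score would be backpropagated through every stage after each has been pre-trained.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer followed by tanh; x is a list of floats."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialisation is an assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def d_vector(frames, frame_layer):
    """Utterance-level feature: mean of frame-level hidden activations."""
    hidden = [dense(f, *frame_layer) for f in frames]
    return [sum(col) / len(hidden) for col in zip(*hidden)]

def b_vector(u, v):
    """Trial-level feature built from element-wise binary operations."""
    return ([a + b for a, b in zip(u, v)]
            + [abs(a - b) for a, b in zip(u, v)]
            + [a * b for a, b in zip(u, v)])

def score_trial(enroll_frames, test_frames, params):
    """Frames -> d-vectors -> reduced vectors -> b-vector -> score in [0, 1]."""
    frame_layer, reduce_layer, score_layer = params
    e = dense(d_vector(enroll_frames, frame_layer), *reduce_layer)
    t = dense(d_vector(test_frames, frame_layer), *reduce_layer)
    out = dense(b_vector(e, t), *score_layer)[0]  # tanh output
    return 0.5 * (out + 1.0)

# Hypothetical sizes: 8-dim frame features, 6 hidden units, 4-dim reduced vector.
FEAT, HID, RED = 8, 6, 4
rng = random.Random(0)
params = (init_layer(FEAT, HID, rng),
          init_layer(HID, RED, rng),
          init_layer(3 * RED, 1, rng))
enroll = [[rng.gauss(0, 1) for _ in range(FEAT)] for _ in range(5)]
test_utt = [[rng.gauss(0, 1) for _ in range(FEAT)] for _ in range(7)]
print(score_trial(enroll, test_utt, params))
```

Because every operation in `b_vector` is symmetric in its two arguments, swapping the enrollment and test utterances yields the same score in this sketch.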
