Joint Training of Expanded End-to-End DNN for Text-Dependent Speaker Verification

We propose an expanded end-to-end DNN architecture for speaker verification based on b-vectors as well as d-vectors. We embedded the components of a speaker verification system such as modeling frame-level features, extracting utterance-level features, dimensionality reduction of utterancelevel features, and trial-level scoring in an expanded end-toend DNN architecture. The main contribution of this paper is that, instead of using DNNs as parts of the system trained independently, we train the whole system jointly with a finetune cost after pre-training each part. The experimental results show that the proposed system outperforms the baseline dvector system and i-vector PLDA system.

[1]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[2]  Douglas A. Reynolds,et al.  A unified deep neural network for speaker and language recognition , 2015, INTERSPEECH.

[3]  Sergey Ioffe,et al.  Probabilistic Linear Discriminant Analysis , 2006, ECCV.

[4]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[5]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.

[6]  Themos Stafylakis,et al.  Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition , 2014, Odyssey.

[7]  Hiroshi Ishikawa,et al.  Let there be color! , 2016, ACM Trans. Graph..

[8]  Hsin-Min Wang,et al.  Speaker verification using kernel-based binary classifiers with binary operation derived features , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Yun Lei,et al.  A novel scheme for speaker recognition using a phonetically-aware deep neural network , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Razvan Pascanu,et al.  Theano: new features and speed improvements , 2012, ArXiv.

[11]  Tara N. Sainath,et al.  Locally-connected and convolutional neural networks for small footprint speaker recognition , 2015, INTERSPEECH.

[12]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[13]  Bin Ma,et al.  The RSR2015: Database for Text-Dependent Speaker Verification using Multiple Pass-Phrases , 2012, Interspeech 2012.

[14]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Rich Caruana,et al.  Multitask Learning: A Knowledge-Based Source of Inductive Bias , 1993, ICML.

[16]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Dimitri Palaz,et al.  Towards End-to-End Speech Recognition , 2016 .

[18]  Georg Heigold,et al.  End-to-end text-dependent speaker verification , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.