Stacked bottleneck features for speaker verification

i-Vector modeling has shown to be effective for text independent speaker verification. It represents each utterance as a low-dimensional vector using factor analysis with a GMM supervector. In order to capture more complex speaker statistics, this paper proposes a new feature representation other than i-vectors for speaker verification using neural networks. In this work, stacked bottleneck features are extracted from cascade neural networks based on GMM supervectors. Dropout is integrated into the model to improve generalization error. We compare the proposed method with i-vector approach on NIST SRE2008 female short2-short3 telephone-telephone task. Experimental results demonstrate the efficacy of the proposed method.

[1]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[2]  Rong Zheng,et al.  An iVector extractor using pre-trained neural networks for speaker verification , 2014, The 9th International Symposium on Chinese Spoken Language Processing.

[3]  Patrick Kenny,et al.  Bayesian Speaker Verification with Heavy-Tailed Priors , 2010, Odyssey.

[4]  Yuan Liu,et al.  Tandem deep features for text-dependent speaker verification , 2014, INTERSPEECH.

[5]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[6]  Yoshua Bengio,et al.  Maxout Networks , 2013, ICML.

[7]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[8]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Javier Hernando,et al.  Deep belief networks for i-vector based speaker recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Javier Hernando,et al.  i-Vector Modeling with Deep Belief Networks for Multi-Session Speaker Recognition , 2014, Odyssey.

[12]  Dong Yu,et al.  Improved Bottleneck Features Using Pretrained Deep Neural Networks , 2011, INTERSPEECH.

[13]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[15]  Daniel Garcia-Romero,et al.  Analysis of i-vector Length Normalization in Speaker Recognition Systems , 2011, INTERSPEECH.

[16]  Patrick Kenny,et al.  Joint Factor Analysis Versus Eigenchannels in Speaker Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.