Investigation of bottleneck features and multilingual deep neural networks for speaker verification

Recently, integrating deep neural networks (DNNs) with i-vector systems has proved effective for speaker verification. This method uses a DNN with senone outputs to produce the frame alignments used for sufficient statistics extraction. However, two types of data mismatch may degrade the performance of DNN-based speaker verification systems. First, the DNN requires transcribed training data, while the data sets used for i-vector training and extraction are mostly untranscribed. Second, the language of the DNN training data is constrained by the pronunciation lexicon, making the model unsuitable for multilingual tasks. In this paper, we propose using bottleneck features and multilingual DNNs to narrow the gap caused by this data mismatch. In our method, a DNN is first trained with senone labels to extract bottleneck features. A Gaussian mixture model (GMM) is then trained on the bottleneck features to produce frame alignments. Additionally, bottleneck features based on multilingual DNNs are explored for multilingual speaker verification. Experiments on the NIST SRE 2008 female short2-short3 telephone task (multilingual) and the NIST SRE 2010 female core-extended telephone task (English) demonstrate the effectiveness of the proposed method.
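The alignment step described above can be illustrated with a minimal sketch. It assumes the bottleneck features have already been extracted and that a diagonal-covariance GMM has already been trained on them; the function names `frame_posteriors` and `sufficient_statistics`, the array shapes, and the random toy data are illustrative assumptions, not the configuration used in the paper. The sketch computes per-frame component posteriors (the frame alignments) and accumulates the standard zeroth- and first-order statistics that an i-vector extractor consumes.

```python
import numpy as np


def frame_posteriors(feats, weights, means, variances):
    """Posterior of each diagonal-covariance Gaussian for every frame.

    feats: (T, D) features used for alignment (here: bottleneck features).
    weights: (C,) mixture weights; means, variances: (C, D).
    Returns a (T, C) array whose rows sum to 1.
    """
    diff = feats[:, None, :] - means[None, :, :]  # (T, C, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    log_joint = np.log(weights) + log_gauss  # (T, C)
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def sufficient_statistics(post, feats):
    """Zeroth-order (C,) and first-order (C, D) statistics for one utterance."""
    zeroth = post.sum(axis=0)
    first = post.T @ feats
    return zeroth, first


# Toy data standing in for real bottleneck features and a trained GMM.
rng = np.random.default_rng(0)
bottleneck = rng.normal(size=(200, 8))  # 200 frames, 8-dim features (assumed)
weights = np.full(4, 0.25)              # 4 components (assumed)
means = rng.normal(size=(4, 8))
variances = np.ones((4, 8))

post = frame_posteriors(bottleneck, weights, means, variances)
zeroth, first = sufficient_statistics(post, bottleneck)
```

In this sketch the same features are used for alignment and for the first-order statistics; `sufficient_statistics` accepts any frame-synchronous feature matrix, so a different acoustic feature could be passed instead. Which features the statistics are accumulated over is a design choice that the abstract does not specify.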
