Spoken Language Recognition using X-vectors

In this paper, we apply x-vectors to the task of spoken language recognition. This framework consists of a deep neural network that maps sequences of speech features to fixed-dimensional embeddings, called x-vectors. Long-term language characteristics are captured in the network by a temporal pooling layer that aggregates information across time. Once extracted, x-vectors can reuse the same classification back-ends developed for i-vectors. In the 2017 NIST language recognition evaluation, x-vectors achieved excellent results and outperformed our state-of-the-art i-vector systems. In the post-evaluation analysis presented here, we experiment with several variations of the x-vector framework, and find that the best-performing system uses multilingual bottleneck features, data augmentation, and a discriminative Gaussian classifier.
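The key property described above is that the pooling layer turns a variable-length sequence of frame-level features into a single fixed-dimensional vector, so utterances of any duration can be compared in the same embedding space. The following is a minimal NumPy sketch of statistics pooling (concatenated per-dimension mean and standard deviation), a common choice for this layer; it is an illustration only, not the paper's actual TDNN implementation, and the frame dimension of 512 is an arbitrary assumption.

```python
import numpy as np

def stats_pooling(frame_feats):
    """Aggregate frame-level features (num_frames x feat_dim) into a
    fixed-dimensional utterance-level vector by concatenating the
    per-dimension mean and standard deviation across time."""
    mean = frame_feats.mean(axis=0)
    std = frame_feats.std(axis=0)
    return np.concatenate([mean, std])

# Utterances of different lengths map to vectors of the same size,
# which is what lets a single classifier operate on all of them.
short_utt = np.random.randn(50, 512)   # 50 frames, 512-dim features
long_utt = np.random.randn(400, 512)   # 400 frames
assert stats_pooling(short_utt).shape == (1024,)
assert stats_pooling(long_utt).shape == (1024,)
```

In the full x-vector network, layers before the pooling operate on frames, layers after it operate on the pooled statistics, and the embedding is read from one of the post-pooling layers.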
