Augmented Data Training of Joint Acoustic/Phonotactic DNN i-vectors for NIST LRE15

This paper presents the JHU HLTCOE submission to the NIST 2015 Language Recognition Evaluation, including critical and novel algorithmic components, use of limited and augmented training data, and additional post-evaluation analysis and improvements. All of our systems used i-vectors based on Deep Neural Networks (DNNs) with discriminatively-trained Gaussian classifiers, and linear fusion was performed with duration-dependent scaling. A key innovation was the use of three different kinds of i-vectors: acoustic, phonotactic, and joint. In addition, data augmentation was used to overcome the limited training data of this evaluation. Post-evaluation analysis shows the benefits of these design decisions as well as further potential improvements.
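The linear fusion with duration-dependent scaling mentioned above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact recipe: the fusion weights, the offset, and the square-root saturation curve used here are all assumptions for the example.

```python
import numpy as np

def fuse_scores(system_scores, weights, duration, offset=0.0):
    """Linearly fuse per-language score vectors from several subsystems,
    then apply a duration-dependent scale (illustrative sketch only)."""
    # Weighted sum of the calibrated score vectors from each subsystem.
    fused = sum(w * s for w, s in zip(weights, system_scores))
    # Hypothetical duration-dependent scaling: confidence grows with the
    # square root of utterance duration and saturates at 30 seconds.
    scale = np.sqrt(min(duration, 30.0) / 30.0)
    return scale * fused + offset

# Example: fuse two subsystems' scores for a 2-language task.
acoustic = np.array([1.0, -1.0])
phonotactic = np.array([0.5, 0.2])
fused = fuse_scores([acoustic, phonotactic], [0.6, 0.4], duration=30.0)
```

The intuition is that scores from short cuts are less reliable, so shrinking them toward zero before a hard decision reduces the cost of confident errors on brief segments.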
