End-to-End versus Embedding Neural Networks for Language Recognition in Mismatched Conditions

Neural network architectures that map variable-length speech utterances into fixed-dimensional embeddings have started to outperform state-of-the-art i-vector systems in speaker and language recognition tasks. However, neural networks are prone to overfitting the training domain and may be difficult to adapt to new domains with limited development data. A successful solution, used in the recent NIST 2017 Language Recognition Evaluation, consists of training the embedding extractor on out-of-domain data and applying a back-end classifier adapted to the target domain. In this paper, we compare the embedding+back-end approach with end-to-end evaluation of the neural network to obtain language log-likelihoods. With careful adaptation of the networks, we show that the end-to-end approach yields a 6% relative improvement in detection cost over the best embedding system. We compare two embedding architectures. First, we evaluate embeddings that use a temporal mean+stddev pooling layer to capture long-term sequence information (a.k.a. x-vectors). Second, we present a novel probabilistic embedding framework in which the embedding is a hidden variable. The network predicts a Gaussian posterior distribution for the embedding given each feature frame. The frame-level posteriors are then combined in a principled way to obtain sequence-level posteriors, which provides an uncertainty measure for the embedding value. Language scores are obtained by integrating over the embedding posterior distribution. In our experiments, x-vectors outperformed probabilistic embeddings for embedding+back-end systems, but both attained comparable results for end-to-end systems.
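As an illustration of the two pooling ideas mentioned above, the following is a minimal sketch in plain Python, not the authors' implementation. The function names `stats_pooling` and `combine_gaussian_posteriors` are hypothetical. The abstract does not state the rule used to combine frame-level posteriors; the sketch assumes diagonal covariances and a simple product of Gaussians (precisions add, and the mean is precision-weighted), which is one common choice and may differ from the rule used in the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def stats_pooling(frames: Sequence[Sequence[float]]) -> Vector:
    """Mean+stddev pooling: map T frames of size D to one vector of size 2*D."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population standard deviation over the time axis, per dimension.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def combine_gaussian_posteriors(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> Tuple[Vector, Vector]:
    """Combine per-frame diagonal Gaussians into one sequence-level Gaussian.

    ASSUMPTION: product-of-Gaussians rule, chosen for illustration only.
    Precisions (1/variance) add across frames; the combined mean is the
    precision-weighted average of the frame means.
    """
    dim = len(means[0])
    seq_mean: Vector = []
    seq_var: Vector = []
    for d in range(dim):
        precision = sum(1.0 / v[d] for v in variances)
        weighted = sum(m[d] / v[d] for m, v in zip(means, variances))
        seq_var.append(1.0 / precision)
        seq_mean.append(weighted / precision)
    return seq_mean, seq_var


if __name__ == "__main__":
    # Two frames, two feature dimensions.
    print(stats_pooling([[1.0, 2.0], [3.0, 2.0]]))
    # Two frame-level posteriors over a one-dimensional embedding.
    print(combine_gaussian_posteriors([[0.0], [2.0]], [[1.0], [1.0]]))
```

Under this assumed rule, the combined variance shrinks as more frames are observed, which is one way a sequence-level posterior can express greater certainty about the embedding for longer utterances.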
