Phone-aware neural language identification

Purely acoustic neural models, particularly the LSTM-RNN model, have shown great potential in language identification (LID). However, phonetic information has been largely overlooked by most existing neural LID models, although it has been used with great success in conventional phonetic LID systems. We present a phone-aware neural LID architecture: a deep LSTM-RNN LID system that also accepts the output of an RNN-based ASR system. Utilizing this phonetic knowledge significantly improves LID performance. Interestingly, the phonetic knowledge still contributes substantially even when the test language is not included in the ASR training data. Experiments on four languages from the Babel corpus demonstrate that the phone-aware approach is highly effective.
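The data flow described above can be illustrated with a minimal sketch. The abstract does not say how the ASR output enters the LID network, so the choices here are assumptions for illustration only: the phonetic RNN's per-frame hidden output is concatenated with the acoustic frame at the input of the LID LSTM, and frame-level posteriors are averaged into an utterance-level score. All names (`LSTMLayer`, `phone_aware_lid`, the dimensions) are hypothetical, the weights are random and untrained, and the layers are single-layer and bias-free rather than deep.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_matrix(rows, cols, rng):
    # Small random weights; a real system would learn these.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(v):
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


class LSTMLayer:
    """Single bias-free LSTM layer (hypothetical helper, not the authors' code)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to [input ; previous hidden state].
        self.w = {g: make_matrix(hidden_dim, input_dim + hidden_dim, rng)
                  for g in "ifoc"}

    def run(self, frames):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in frames:
            z = list(x) + h
            i = [sigmoid(a) for a in matvec(self.w["i"], z)]    # input gate
            f = [sigmoid(a) for a in matvec(self.w["f"], z)]    # forget gate
            o = [sigmoid(a) for a in matvec(self.w["o"], z)]    # output gate
            g = [math.tanh(a) for a in matvec(self.w["c"], z)]  # candidate cell
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append(h)
        return outputs


def phone_aware_lid(frames, phonetic_rnn, lid_rnn, out_w):
    # Step 1: the phonetic (ASR-style) RNN produces one feature vector per frame.
    phonetic = phonetic_rnn.run(frames)
    # Step 2 (assumption): concatenate acoustic and phonetic features per frame.
    joint = [list(x) + p for x, p in zip(frames, phonetic)]
    # Step 3: the LID LSTM consumes the joint features.
    hidden = lid_rnn.run(joint)
    # Step 4 (assumption): average frame-level posteriors over the utterance.
    per_frame = [softmax(matvec(out_w, h)) for h in hidden]
    n = len(per_frame)
    return [sum(p[k] for p in per_frame) / n for k in range(len(out_w))]


rng = random.Random(0)
ACOUSTIC_DIM, PHONETIC_DIM, LID_DIM, NUM_LANGS = 4, 3, 5, 4  # toy sizes
phonetic_rnn = LSTMLayer(ACOUSTIC_DIM, PHONETIC_DIM, rng)
lid_rnn = LSTMLayer(ACOUSTIC_DIM + PHONETIC_DIM, LID_DIM, rng)
out_w = make_matrix(NUM_LANGS, LID_DIM, rng)

# Ten random "acoustic frames" stand in for real speech features.
frames = [[rng.gauss(0.0, 1.0) for _ in range(ACOUSTIC_DIM)] for _ in range(10)]
posteriors = phone_aware_lid(frames, phonetic_rnn, lid_rnn, out_w)
```

Here `posteriors` holds one score per candidate language for the whole utterance. In this sketch the phonetic RNN acts as a fixed feature extractor, which mirrors the observation that phonetic knowledge may help even for languages the ASR system was not trained on, though the sketch itself demonstrates only the wiring, not that result.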
