An Improved LSTM For Language Identification

In this paper, we propose a novel framework that combines the phonetic temporal neural model (PTN) with an improved LSTM (IM-LSTM). The improvement is an up-down connection from time step t to t+1 in the LSTM structure, which aims to capture latent information from the previous time step. The updated structure is better able to discriminate the frame-level phonetic information produced by the PTN. On the AP16-OLR language identification dataset, our final model achieves relative improvements over the standard PTN of 5.04%, 2.19%, and 2.73% in EER and 6.55%, 5.81%, and 2.23% in Cavg under the 1 s, 3 s, and full-length utterance conditions, respectively. The proposed framework outperforms the standard PTN and other proposed models, particularly in the 1 s condition. This shows the efficacy and flexibility of the proposed method.
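The abstract does not spell out how the relative rates are computed. A minimal sketch follows, assuming the usual convention that a relative improvement on an error metric such as EER or Cavg is the reduction divided by the baseline value. The function name `relative_reduction` and the numbers below are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction of an error metric, in percent.

    Assumes lower is better (as for EER and Cavg) and that the
    baseline value is strictly positive.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical values for illustration only (not results from the paper).
baseline_eer = 10.0   # baseline system EER, in percent
proposed_eer = 9.5    # proposed system EER, in percent
print(f"{relative_reduction(baseline_eer, proposed_eer):.2f}%")
```

Under this convention, a positive value means the proposed system has a lower error than the baseline.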
