Phonetic Temporal Neural Model for Language Identification

Deep neural models, particularly the long short-term memory recurrent neural network (LSTM-RNN) model, have shown great potential for language identification (LID). However, most existing neural LID methods overlook phonetic information, even though it has been used very successfully in conventional phonetic LID systems. We present a phonetic temporal neural model for LID: an LSTM-RNN LID system that takes as input the phonetic features produced by a phone-discriminative DNN, rather than raw acoustic features. The new model is similar to traditional phonetic LID methods, but its phonetic knowledge is much richer: it is available at the frame level and compactly encodes information about all phones. Our experiments on the Babel database and the AP16-OLR database demonstrate that the phonetic temporal neural approach is very effective and significantly outperforms existing acoustic neural models. It also outperforms the conventional i-vector approach on short utterances and in noisy conditions.
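
The data flow described above (acoustic frames, then frame-level phonetic features from a phone-discriminative DNN, then an LSTM-RNN that scores languages) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. All class names, layer sizes, and the random untrained weights are hypothetical, the phonetic DNN is reduced to a single tanh layer, biases are omitted, and averaging frame posteriors into an utterance-level score is an assumption made only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class PhoneticDNN:
    """Hypothetical stand-in for the phone-discriminative DNN: one hidden
    layer whose activations serve as frame-level phonetic features."""

    def __init__(self, rng, n_acoustic, n_phonetic):
        self.W = rand_matrix(rng, n_phonetic, n_acoustic)

    def features(self, frame):
        return [math.tanh(x) for x in matvec(self.W, frame)]


class LSTMLid:
    """Hypothetical single-layer LSTM that reads phonetic features and
    emits language posteriors (untrained, no biases)."""

    def __init__(self, rng, n_in, n_hidden, n_langs):
        self.n_hidden = n_hidden
        # One weight matrix per gate, acting on [input; previous hidden state].
        self.gates = {g: rand_matrix(rng, n_hidden, n_in + n_hidden) for g in "ifog"}
        self.W_out = rand_matrix(rng, n_langs, n_hidden)

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        i = sigmoid(matvec(self.gates["i"], z))  # input gate
        f = sigmoid(matvec(self.gates["f"], z))  # forget gate
        o = sigmoid(matvec(self.gates["o"], z))  # output gate
        g = [math.tanh(v) for v in matvec(self.gates["g"], z)]  # candidate cell
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h, c

    def posteriors(self, feature_seq):
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        frame_posts = []
        for x in feature_seq:
            h, c = self.step(x, h, c)
            frame_posts.append(softmax(matvec(self.W_out, h)))
        # Assumption: utterance score = mean of the frame-level posteriors.
        n = len(frame_posts)
        return [sum(p[k] for p in frame_posts) / n for k in range(len(frame_posts[0]))]


rng = random.Random(0)
dnn = PhoneticDNN(rng, n_acoustic=40, n_phonetic=16)
lid = LSTMLid(rng, n_in=16, n_hidden=8, n_langs=3)

# A fake 20-frame utterance of 40-dimensional acoustic features.
utterance = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(20)]
phonetic = [dnn.features(frame) for frame in utterance]  # frame-level phonetic features
scores = lid.posteriors(phonetic)
predicted = max(range(len(scores)), key=scores.__getitem__)
print(scores, predicted)
```

The point of the sketch is only the separation of roles: the LSTM never sees the raw acoustic frames, only the phonetic features derived from them. In a real system both networks would be trained, the first on phone targets and the second on language labels.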
