Automatic language identification using long short-term memory recurrent neural networks

This work explores the use of Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) for automatic language identification (LID). The use of RNNs is motivated by their better ability in modeling sequences with respect to feed forward networks used in previous works. We show that LSTM RNNs can effectively exploit temporal dependencies in acoustic data, learning relevant features for language discrimination purposes. The proposed approach is compared to baseline i-vector and feed forward Deep Neural Network (DNN) systems in the NIST Language Recognition Evaluation 2009 dataset. We show LSTM RNNs achieve better performance than our best DNN system with an order of magnitude fewer parameters. Further, the combination of the different systems leads to significant performance improvements (up to 28%).

[1]  R.A. Cole,et al.  Language identification with neural networks: a feasibility study , 1989, Conference Proceeding IEEE Pacific Rim Conference on Communications, Computers and Signal Processing.

[2]  Jing Peng,et al.  An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories , 1990, Neural Computation.

[3]  Y.K. Muthusamy,et al.  Reviewing automatic language identification , 1994, IEEE Signal Processing Magazine.

[4]  Marc A. Zissman,et al.  Comparison of : Four Approaches to Automatic Language Identification of Telephone Speech , 2004 .

[5]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[6]  Jürgen Schmidhuber,et al.  Learning to forget: continual prediction with LSTM , 1999 .

[7]  Jürgen Schmidhuber,et al.  Learning Precise Timing with LSTM Recurrent Networks , 2003, J. Mach. Learn. Res..

[8]  B. Yegnanarayana,et al.  Neural network classifiers for language identification using phonotactic and prosodic features , 2005, Proceedings of 2005 International Conference on Intelligent Sensing and Information Processing, 2005..

[9]  N. Brummer,et al.  On calibration of language recognition scores , 2006, 2006 IEEE Odyssey - The Speaker and Language Recognition Workshop.

[10]  G. Montavon Deep learning for spoken language identification , 2009 .

[11]  Elizabeth Shriberg,et al.  A comparison of approaches for modeling prosodic features in speaker recognition , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Niko Brümmer,et al.  Measuring, refining and calibrating speaker and language information extracted from speech , 2010 .

[13]  Doroteo Torre Toledano,et al.  Multilevel and Session Variability Compensated Language Recognition: ATVS-UAM Systems at NIST LRE 2009 , 2010, IEEE Journal of Selected Topics in Signal Processing.

[14]  Douglas E. Sturim,et al.  The MITLL NIST LRE 2009 language recognition system , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[15]  Haizhou Li,et al.  Language Identification: A Tutorial , 2011, IEEE Circuits and Systems Magazine.

[16]  Douglas A. Reynolds,et al.  Language Recognition via i-vectors and Dimensionality Reduction , 2011, INTERSPEECH.

[17]  Lukás Burget,et al.  Language Recognition in iVectors Space , 2011, INTERSPEECH.

[18]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[19]  Navdeep Jaitly,et al.  Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition , 2012, INTERSPEECH.

[20]  Navdeep Jaitly,et al.  Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[21]  Eduardo Lleida,et al.  Prosodic features and formant modeling for an ivector-based language recognition system , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[23]  Joaquín González-Rodríguez,et al.  Automatic language identification using deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Andrew W. Senior,et al.  Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition , 2014, ArXiv.